# YouTube Trending Video Statistics

## Table of Contents
* [1. Introduction](#1.-Introduction)
   * [1.1. Problem Statement](#1.1.-Problem-Statement)
   * [1.2. Objectives](#1.2.-Objectives)
   * [1.3. Data](#1.3.-Data)
* [2. Data Cleaning](#2.-Data-Cleaning)
* [3. Data Exploration](#3.-Data-Exploration)
* [4. Data Transformation](#4.-Data-Transformation)
* [5. Select and Training Models](#5.-Select-and-Training-Models)
* [6. Model 1](#6.-Model-1)
* [7. Model 2](#7.-Model-2)
* [8. Model 3](#8.-Model-3)
* [9. Results Summary](#9.-Results-Summary)
* [10. Conclusion](#10.-Conclusion)

## 1. Introduction

### 1.1. Problem Statement

(From coursework brief)
1. Define the objective in business terms.
2. How will your analysis/solution be used?
3. What are the current solutions/workarounds (if any)?
4. How should you frame this problem (supervised/unsupervised, online/offline, etc.)?
5. How should performance be measured?
6. Is the performance measure aligned with the business objective?
7. What would be the minimum performance needed to reach the business objective?
8. What are comparable problems? Can you reuse experience or tools?
9. Is human expertise available?
10. How would you solve the problem manually?
11. List the assumptions you (or others) have made so far.
12. Verify assumptions if possible.

### 1.2. Objectives

In [1]:
# Importing modules
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

import numpy as np
import pandas as pd
from pandas import Series, DataFrame
import os, re, requests, csv, time

import matplotlib.pyplot as plt
#import missingno as msno
from six.moves import urllib
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.ensemble import RandomForestClassifier

In [2]:
# Visualization imports
from IPython.display import display, Markdown, Latex
figNo = 1  # running counter for figure numbers
from pylab import *
%matplotlib inline
import matplotlib.pyplot as plt
plt.rcParams['axes.labelsize'] = 14
plt.rcParams['xtick.labelsize'] = 12
plt.rcParams['ytick.labelsize'] = 12
import seaborn as sns
sns.set(style="whitegrid", color_codes=True)

# Math library
import math
from math import log

### 1.3. Data

In [4]:
# Loading the CSV files into the notebook
DATA_PATH = "./"
US_trending_2020 = "US_trending_2020.csv"
World_Top_500 = "World_top_500.csv"
US_Top_500 = "US_top_500.csv"
# Random_videos = "Random_Videos.csv"

def load_data(data_path, file_name):
    csv_path = os.path.join(data_path, file_name)
    return pd.read_csv(csv_path)


US_Trending_2020 = load_data(DATA_PATH, US_trending_2020)
WW_Top_500_Channels = load_data(DATA_PATH, World_Top_500)
US_Top_500_Channels = load_data(DATA_PATH, US_Top_500)
# Random_videos = load_data(DATA_PATH, Random_videos)


# Viewing the first few rows of each dataset
US_Trending_2020.head()
WW_Top_500_Channels.head()
US_Top_500_Channels.head()
# Random_videos.head()

Unnamed: 0,video_id,title,publishedAt,channelId,channelTitle,categoryId,trending_date,tags,view_count,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,description
0,3C66w5Z0ixs,I ASKED HER TO BE MY GIRLFRIEND...,2020-08-11T19:20:14Z,UCvtRTOMP2TqYqu51xNrqAzg,Brawadis,22,2020-08-12T00:00:00Z,brawadis|prank|basketball|skits|ghost|funny vi...,1514614,156908,5855,35313,https://i.ytimg.com/vi/3C66w5Z0ixs/default.jpg,False,False,SUBSCRIBE to BRAWADIS ▶ http://bit.ly/Subscrib...
1,M9Pmf9AB4Mo,Apex Legends | Stories from the Outlands – “Th...,2020-08-11T17:00:10Z,UC0ZV6M2THA81QT9hrVWJG3A,Apex Legends,20,2020-08-12T00:00:00Z,Apex Legends|Apex Legends characters|new Apex ...,2381688,146739,2794,16549,https://i.ytimg.com/vi/M9Pmf9AB4Mo/default.jpg,False,False,"While running her own modding shop, Ramya Pare..."
2,J78aPJ3VyNs,I left youtube for a month and THIS is what ha...,2020-08-11T16:34:06Z,UCYzPXprvl5Y-Sf0g4vX-m6g,jacksepticeye,24,2020-08-12T00:00:00Z,jacksepticeye|funny|funny meme|memes|jacksepti...,2038853,353787,2628,40221,https://i.ytimg.com/vi/J78aPJ3VyNs/default.jpg,False,False,I left youtube for a month and this is what ha...
3,kXLn3HkpjaA,XXL 2020 Freshman Class Revealed - Official An...,2020-08-11T16:38:55Z,UCbg_UMjlHJg_19SZckaKajg,XXL,10,2020-08-12T00:00:00Z,xxl freshman|xxl freshmen|2020 xxl freshman|20...,496771,23251,1856,7647,https://i.ytimg.com/vi/kXLn3HkpjaA/default.jpg,False,False,Subscribe to XXL → http://bit.ly/subscribe-xxl...
4,VIUo6yapDbc,Ultimate DIY Home Movie Theater for The LaBran...,2020-08-11T15:10:05Z,UCDVPcEbVLQgLZX0Rt6jo34A,Mr. Kate,26,2020-08-12T00:00:00Z,The LaBrant Family|DIY|Interior Design|Makeove...,1123889,45802,964,2196,https://i.ytimg.com/vi/VIUo6yapDbc/default.jpg,False,False,Transforming The LaBrant Family's empty white ...


Unnamed: 0,Rank,Grade,Ch_name,Uploads,Subscriptions,Views
0,1st,A++,T-Series,14297,135M,104724369854
1,2nd,A++,Cocomelon - Nursery Rhymes,517,78.2M,57054290512
2,3rd,A++,✿ Kids Diana Show,691,50.9M,24157678368
3,4th,A++,Like Nastya,400,52.2M,30591257306
4,5th,A++,SET India,37017,69.3M,52149505781


Unnamed: 0,Rank,Grade,Ch_name,Uploads,Subscriptions,Views
0,1st,A++,Cocomelon - Nursery Rhymes,518,78.2M,57088982878
1,2nd,A++,✿ Kids Diana Show,692,50.9M,24179259602
2,3rd,A++,Like Nastya,401,52.3M,30609490114
3,4th,A++,Movieclips,35216,36.3M,35071065049
4,5th,A++,Vlad and Nikita,219,37.7M,18086626003


In [6]:
# Viewing the information of each dataset
US_Trending_2020.info()
WW_Top_500_Channels.info()
US_Top_500_Channels.info()
# Random_videos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19598 entries, 0 to 19597
Data columns (total 16 columns):
video_id             19598 non-null object
title                19598 non-null object
publishedAt          19598 non-null object
channelId            19598 non-null object
channelTitle         19598 non-null object
categoryId           19598 non-null int64
trending_date        19598 non-null object
tags                 19598 non-null object
view_count           19598 non-null int64
likes                19598 non-null int64
dislikes             19598 non-null int64
comment_count        19598 non-null int64
thumbnail_link       19598 non-null object
comments_disabled    19598 non-null bool
ratings_disabled     19598 non-null bool
description          19473 non-null object
dtypes: bool(2), int64(5), object(9)
memory usage: 2.1+ MB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 6 columns):
Rank             500 non-null object
Grade       

## 2. Data Cleaning

- Added a country column to each dataset, then concatenated them into one.
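
The step noted above could look roughly like the sketch below. The `country` column name, the `"WW"`/`"US"` labels, and the helper `add_country_and_concat` are assumptions made for illustration, not names from the notebook. The two tiny stand-in frames reuse the channel tables' column names; in the notebook the real inputs would presumably be `WW_Top_500_Channels` and `US_Top_500_Channels`, since those two share a schema.

```python
import pandas as pd

def add_country_and_concat(frames_by_country):
    """Tag each frame with a `country` column (hypothetical name) and stack them."""
    tagged = [
        frame.assign(country=country)  # assign returns a copy; the inputs are untouched
        for country, frame in frames_by_country.items()
    ]
    # ignore_index=True rebuilds a clean 0..n-1 index for the combined frame
    return pd.concat(tagged, ignore_index=True)

# Tiny stand-in frames using the channel tables' column names
ww_demo = pd.DataFrame({"Rank": ["1st", "2nd"],
                        "Ch_name": ["T-Series", "Cocomelon - Nursery Rhymes"]})
us_demo = pd.DataFrame({"Rank": ["1st"],
                        "Ch_name": ["Cocomelon - Nursery Rhymes"]})

combined = add_country_and_concat({"WW": ww_demo, "US": us_demo})
print(combined)
```

Keeping the country as an ordinary column, rather than as an index level, makes later `groupby("country")` calls straightforward.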

In [37]:
# Save the cleaned data to a separate folder so it is easy to tell apart from the raw files
path = 'after_data_cleaning'
os.makedirs(path, exist_ok=True)

In [38]:
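files = [US_Trending_2020, WW_Top_500_Channels, US_Top_500_Channels]
file_names = ["US_trending_2020", "World_Top_500", "US_Top_500"]
for data, file_name in zip(files, file_names):
    # Drop any row that contains a NaN
    data = data.dropna(axis=0, how='any')
    # Drop any column that contains a NaN (a no-op here, because the row drop already removed every NaN)
    data = data.dropna(axis=1, how='any')
    # The index is written too, so it reappears as "Unnamed: 0" when the file is read back
    data.to_csv(os.path.join(path, "{}.csv".format(file_name)), encoding='utf-8-sig')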

In [30]:
# Inspect the cleaned data
for file in file_names:
    data = pd.read_csv(os.path.join(path, "{}.csv".format(file)))
    data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19473 entries, 0 to 19472
Data columns (total 17 columns):
Unnamed: 0           19473 non-null int64
video_id             19473 non-null object
title                19473 non-null object
publishedAt          19473 non-null object
channelId            19473 non-null object
channelTitle         19473 non-null object
categoryId           19473 non-null int64
trending_date        19473 non-null object
tags                 19473 non-null object
view_count           19473 non-null int64
likes                19473 non-null int64
dislikes             19473 non-null int64
comment_count        19473 non-null int64
thumbnail_link       19473 non-null object
comments_disabled    19473 non-null bool
ratings_disabled     19473 non-null bool
description          19473 non-null object
dtypes: bool(2), int64(6), object(9)
memory usage: 2.3+ MB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 495 entries, 0 to 494
Data columns (total 7 columns):
Unnamed

## 3. Data Exploration

- Histograms of every variable?
- Histogram of channels (or top channels) by number of days trending, total and consecutive?
- Histogram of the most popular trending categories (by country)
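
The counts behind the histograms listed above could be computed along these lines. This is a minimal sketch: the helper names `trending_counts` and `days_trending` and the channel names `"A"` and `"B"` are made up for illustration, while the column names (`channelTitle`, `categoryId`, `trending_date`) come from the trending dataset.

```python
import pandas as pd

def trending_counts(df, by):
    """Count trending rows per value of `by`, most frequent first."""
    return df[by].value_counts()

def days_trending(df):
    """Number of distinct trending dates per channel, largest first."""
    return (
        df.groupby("channelTitle")["trending_date"]
        .nunique()
        .sort_values(ascending=False)
    )

# Stand-in rows; in the notebook this would be US_Trending_2020
demo = pd.DataFrame({
    "channelTitle": ["A", "A", "B", "A"],
    "categoryId": [22, 22, 20, 10],
    "trending_date": ["2020-08-12", "2020-08-13", "2020-08-12", "2020-08-13"],
})

per_category = trending_counts(demo, "categoryId")
per_channel = days_trending(demo)
print(per_category)
print(per_channel)
```

Calling `.plot(kind="bar")` on either result would draw the corresponding bar chart. Counting consecutive trending days would need the dates parsed and sorted first, which this sketch does not attempt.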

## 4. Data Transformation

Insert Notes Here

## 5. Select and Training Models

- How does a video make it onto the trending list?
- Decision tree to predict whether a video will trend
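
A decision tree for the second idea might be set up as sketched below. Everything in the data frame is hypothetical: the trending dataset only contains videos that did trend, so a real model would need a negative class from somewhere else (possibly the commented-out `Random_Videos.csv`), and the `trended` label and all the numbers here are invented purely to make the example run. The feature choice (`view_count`, `likes`) is also only an assumption.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical labelled data: 1 = trended, 0 = did not trend (values are made up)
videos = pd.DataFrame({
    "view_count": [1_500_000, 2_400_000, 2_000_000, 500_000, 1_100_000, 900_000,
                   1_200, 800, 3_000, 150, 2_500, 400],
    "likes": [156_000, 146_000, 353_000, 23_000, 45_000, 60_000,
              30, 12, 90, 2, 75, 8],
    "trended": [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
})

X = videos[["view_count", "likes"]]
y = videos["trended"]

# Stratify so both classes appear in the training and the held-out split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# A shallow tree keeps the learned rules easy to read
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X_train, y_train)

accuracy = tree.score(X_test, y_test)
print(f"held-out accuracy: {accuracy:.2f}")
```

Note that view and like counts are recorded after a video has trended, so using them as features leaks the outcome; features available at upload time (title, tags, category, channel) would be a fairer basis for prediction.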

## 6. Model 1

Insert Notes Here

## 7. Model 2

Insert Notes Here

## 8. Model 3

Insert Notes Here

## Limitations

## 9. Results Summary

Insert Notes Here

## 10. Conclusion

Insert Notes Here