## YouTube Trending Project
Analyzing data from the YouTube trending page in English-speaking countries
over the span of a few days (October 23-27, 2020)

Goals:
* To understand common characteristics of trending videos in different countries
* To predict engagement (likes or comments) on a video in English-speaking countries

## Table of Contents:
* 1. Data Overview
    * 1.1 Data Analysis 
* 2. Cleaning
* 3. Modeling

In [49]:
import pandas as pd
import numpy as np
import datetime as dt

# Encoding and Data Split
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
# import category_encoders as ce

# Plotting Modules
import matplotlib.pyplot as plt

# Reading the stitched data
trend_data = pd.read_csv("../YouTube-Trending/Data/INT_10.23-27.20.csv")
# Set seed for reproducibility
np.random.seed(0)

df = trend_data.copy()
df.head(10)

Unnamed: 0,video_id,title,publishedAt,channelId,channelTitle,categoryId,trending_date,tags,view_count,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,description,duration,country
0,bPiofmZGb8o,Second 2020 Presidential Debate between Donald...,2020-10-23T02:49:33Z,UCb--64Gl51jIEVE-GLDAVTg,C-SPAN,25,20.23.10,C-SPAN|CSPAN|2020|Donald Trump|Republican|Whit...,6641600,94601,6209,59293,https://i.ytimg.com/vi/bPiofmZGb8o/default.jpg,False,False,President Donald Trump and former Vice Preside...,1H59M15S,US
1,tcYodQoapMg,Ariana Grande - positions (official video),2020-10-23T04:00:10Z,UC0VOyT2OCBKdQhF3BAbZ-1g,ArianaGrandeVevo,10,20.23.10,ariana grande positions|positions ariana grand...,7516529,1485130,10810,140549,https://i.ytimg.com/vi/tcYodQoapMg/default.jpg,False,False,The official “positions” music video by Ariana...,2M58S,US
2,np9Ub1LilKU,Jack Harlow - Tyler Herro [Official Video],2020-10-22T19:00:14Z,UC6vZl7Qj7JglLDmN_7Or-ZQ,Jack Harlow,10,20.23.10,jack harlow|jack rapper|harlow rapper|private ...,1499338,153028,2006,11013,https://i.ytimg.com/vi/np9Ub1LilKU/default.jpg,False,False,Jack Harlow - Tyler HerroListen now: https://J...,3M,US
3,5S4bm3bAt9Y,SURPRISING BEST FRIEND WITH BORAT!!,2020-10-21T19:56:24Z,UCef29bYGgUSoJjVkqhcAPkw,David Dobrik Too,22,20.23.10,[none],5320147,596894,7044,33648,https://i.ytimg.com/vi/5S4bm3bAt9Y/default.jpg,False,False,Thank you Borat for coming over!! I like youWa...,5M55S,US
4,GuEkHIgR46k,Bryson Tiller - Always Forever (Official Video),2020-10-22T16:00:08Z,UCwhe-6skwaZxLomc-U6Wy1w,BrysonTillerVEVO,10,20.23.10,Bryson Tiller 2020|Bryson Tiller Serenity|Brys...,862087,82059,657,4459,https://i.ytimg.com/vi/GuEkHIgR46k/default.jpg,False,False,A N N I V E R S A R Y OUT NOW!Stream/Download:...,2M59S,US
5,0tn6nWYNK3Q,Machine Gun Kelly ft. Halsey - forget me too [...,2020-10-22T16:00:10Z,UC2a9zmrdjvLsrRgi4G4blWw,MGKVEVO,10,20.23.10,Machine|Gun|Kelly|Halsey|forget|too|Bad|Boy/In...,2143537,186923,2948,10481,https://i.ytimg.com/vi/0tn6nWYNK3Q/default.jpg,False,False,Machine Gun Kelly - Tickets To My Downfall is ...,2M58S,US
6,2dLE4Nn6-ug,WE WENT TO THERAPY...,2020-10-22T01:23:18Z,UCWwWOFsW68TqXE-HZLC3WIA,The ACE Family,22,20.23.10,we went to therapy|the ace family we went to t...,3052198,186673,8850,12714,https://i.ytimg.com/vi/2dLE4Nn6-ug/default.jpg,False,False,WE WENT TO THERAPY...LAST VIDEO: https://www.y...,21M28S,US
7,sz1ovZUA4nQ,Matthew McConaughey Grunts it Out While Eating...,2020-10-22T15:00:11Z,UCPD_bxCRGpmmeQcbe2kpPaA,First We Feast,24,20.23.10,First we feast|fwf|firstwefeast|food|food porn...,1246718,53162,665,4582,https://i.ytimg.com/vi/sz1ovZUA4nQ/default.jpg,False,False,Matthew McConaughey is an Academy Award-winnin...,27M59S,US
8,B2J3kLJ8PQk,nba youngboy - the story of O.J. (Top Version),2020-10-21T07:56:28Z,UClW4jraMKz6Qj69lJf-tODA,YoungBoy Never Broke Again,10,20.23.10,YoungBoy Never Broke Again|NBA YoungBoy|YoungB...,3789523,216158,8785,24388,https://i.ytimg.com/vi/B2J3kLJ8PQk/default.jpg,False,False,YoungBoy Never Broke Again – ‘TOP’ OUT NOW: h...,4M20S,US
9,hi1zMqY2goU,The Dixie D'Amelio Show with Trippie Redd,2020-10-21T16:30:11Z,UCLOEGprmycLLbyzBj2jozLg,Dixie D'Amelio,22,20.23.10,dixie damelio|dixie d'amelio|dixie|damelio|be ...,3201605,195100,5529,13568,https://i.ytimg.com/vi/hi1zMqY2goU/default.jpg,False,False,"Hey guys! This week on the Early Late Show, Tr...",11M48S,US


### 2. Cleaning


In [50]:
# Adding net like ratio column: (likes - dislikes) / (likes + dislikes), ranging from -1 to 1
df['likeRatio'] = (df['likes']-df['dislikes'])/(df['likes']+df['dislikes'])

# Changing the 'publishedAt' and 'trending_date' type from string to datetime type
df['publishedAt'] = pd.to_datetime(df['publishedAt'])
df['trending_date'] = pd.to_datetime(df['trending_date'], format="%y.%d.%m")

# Breaking down 'duration' into Hours, Minutes, and Seconds
# Each unit is matched on its own, so durations without minutes (e.g. '45S') still keep their seconds
df['durationHr'] = df['duration'].str.extract(r'(\d+)H', expand=False).fillna(0).astype(int)
df['durationMin'] = df['duration'].str.extract(r'(\d+)M', expand=False).fillna(0).astype(int)
df['durationSec'] = df['duration'].str.extract(r'(\d+)S', expand=False).fillna(0).astype(int)

# Adding 'titleLength' Column
df['titleLength'] = df['title'].apply(lambda x: len(str(x)))

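# Adding 'tagCount' Column ('[none]' marks a video without tags)
# A single tag contains no '|', so tags are counted as separators + 1 whenever tags exist
has_tags = df['tags'] != '[none]'
df['tagCount'] = 0
df.loc[has_tags, 'tagCount'] = df.loc[has_tags, 'tags'].str.count(r"\|") + 1
df['tagCount'] = df['tagCount'].astype(int)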
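
The duration breakdown can be checked on a few strings before it is applied to the full data. The snippet below is a minimal sketch, not part of the original notebook: the sample durations, the `extract_unit` helper, and the `total_sec` column are hypothetical and only illustrate how each unit can be pulled out independently, including durations that lack a minutes or hours part.

```python
import pandas as pd

# Hypothetical sample durations in the same style as the 'duration' column
sample = pd.Series(["1H59M15S", "2M58S", "3M", "45S", "1H15S"])

def extract_unit(durations, unit):
    """Return the integer in front of `unit` ('H', 'M' or 'S'), or 0 if the unit is absent."""
    digits = durations.str.extract(rf"(\d+){unit}", expand=False)
    return pd.to_numeric(digits).fillna(0).astype(int)

parts = pd.DataFrame({
    "hr": extract_unit(sample, "H"),
    "min": extract_unit(sample, "M"),
    "sec": extract_unit(sample, "S"),
})

# Optional single feature (an assumption, not in the original): total length in seconds
parts["total_sec"] = parts["hr"] * 3600 + parts["min"] * 60 + parts["sec"]
print(parts)
```

A single `total_sec` column may be easier for a model to use than three separate columns, but whether it helps here is untested.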

In [51]:
# Dropping unneeded columns
df.drop(columns=['channelTitle', 'channelId', 'video_id', 'title', 'description',
                 'tags', 'thumbnail_link', 'comments_disabled', 'duration',
                 'ratings_disabled'], inplace=True)

In [52]:

df.head()

Unnamed: 0,publishedAt,categoryId,trending_date,view_count,likes,dislikes,comment_count,country,likeRatio,durationHr,durationMin,durationSec,titleLength,tagCount
0,2020-10-23 02:49:33+00:00,25,2020-10-23,6641600,94601,6209,59293,US,0.876818,1,59,15,66,12
1,2020-10-23 04:00:10+00:00,10,2020-10-23,7516529,1485130,10810,140549,US,0.985548,0,2,58,42,22
2,2020-10-22 19:00:14+00:00,10,2020-10-23,1499338,153028,2006,11013,US,0.974122,0,3,0,42,26
3,2020-10-21 19:56:24+00:00,22,2020-10-23,5320147,596894,7044,33648,US,0.976673,0,5,55,35,0
4,2020-10-22 16:00:08+00:00,10,2020-10-23,862087,82059,657,4459,US,0.984114,0,2,59,47,22
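
As a sanity check on `likeRatio`, the sketch below recomputes the value for the first row shown above and adds a hypothetical video with no likes and no dislikes. The second row and the `ratings` frame are assumptions for illustration; the point is that such a row yields `NaN`, which many scikit-learn estimators do not accept, so it would need to be handled before modeling.

```python
import numpy as np
import pandas as pd

# First row of the output above, plus a hypothetical video with no ratings at all
ratings = pd.DataFrame({"likes": [94601, 0], "dislikes": [6209, 0]})

total = ratings["likes"] + ratings["dislikes"]
# Turn a zero denominator into NaN so the missing ratio is explicit
ratings["likeRatio"] = (ratings["likes"] - ratings["dislikes"]) / total.replace(0, np.nan)

print(ratings)
print("Rows without a defined ratio:", int(ratings["likeRatio"].isna().sum()))
```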


In [53]:
df.dtypes

publishedAt      datetime64[ns, UTC]
categoryId                     int64
trending_date         datetime64[ns]
view_count                     int64
likes                          int64
dislikes                       int64
comment_count                  int64
country                       object
likeRatio                    float64
durationHr                     int64
durationMin                    int64
durationSec                    int64
titleLength                    int64
tagCount                       int64
dtype: object

### One Hot Encoding Country
Mapping each country to a vector consisting of 0s and 1s
that denote the absence or presence of that country

### Dictionary:
* 0 - Canada
* 1 - Great Britain
* 2 - United States

In [54]:
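encoder = OneHotEncoder()

# One column per country, in the encoder's sorted category order; columns are named 0, 1, 2
enc_df = pd.DataFrame(encoder.fit_transform(df[['country']]).toarray().astype(int))

# Joining on the index, which assumes df still has its default 0..n-1 index
df = df.join(enc_df)

df.drop(columns=['country'], inplace=True)
df.head()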

Unnamed: 0,publishedAt,categoryId,trending_date,view_count,likes,dislikes,comment_count,likeRatio,durationHr,durationMin,durationSec,titleLength,tagCount,0,1,2
0,2020-10-23 02:49:33+00:00,25,2020-10-23,6641600,94601,6209,59293,0.876818,1,59,15,66,12,0,0,1
1,2020-10-23 04:00:10+00:00,10,2020-10-23,7516529,1485130,10810,140549,0.985548,0,2,58,42,22,0,0,1
2,2020-10-22 19:00:14+00:00,10,2020-10-23,1499338,153028,2006,11013,0.974122,0,3,0,42,26,0,0,1
3,2020-10-21 19:56:24+00:00,22,2020-10-23,5320147,596894,7044,33648,0.976673,0,5,55,35,0,0,0,1
4,2020-10-22 16:00:08+00:00,10,2020-10-23,862087,82059,657,4459,0.984114,0,2,59,47,22,0,0,1
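
The integer column names 0, 1 and 2 follow the order of the encoder's `categories_` attribute, which scikit-learn sorts. The sketch below shows one way to carry the labels over as column names instead of keeping a separate dictionary. It is an illustration under assumptions: the miniature `countries` frame is hypothetical, and the codes `CA` and `GB` are assumed (only `US` appears in the output above).

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical miniature of the 'country' column; 'CA' and 'GB' are assumed codes
countries = pd.DataFrame({"country": ["US", "CA", "GB", "US"]})

encoder = OneHotEncoder()
encoded = encoder.fit_transform(countries[["country"]]).toarray().astype(int)

# categories_[0] holds the sorted labels: column i of `encoded` belongs to labels[i]
labels = list(encoder.categories_[0])
enc_df = pd.DataFrame(encoded, columns=labels, index=countries.index)
print(enc_df)
```

Passing `index=countries.index` keeps the encoded rows aligned with the source frame even if its index is not the default range.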
