## YouTube Trending Project
* ### Data Cleaning and Curation

### Table of Contents:
* 1. Exploratory Data Analysis
* 2. Data Cleaning
    * 2.1 Preprocessing Data
        * 2.1.1 Drop Duplicate Rows
        * 2.1.2 Drop Columns
        * 2.1.3 Handling Missing Data
    * 2.2 Post-Processed Data
        * 2.2.1 Column Information
        * 2.2.2 Exporting Curated Data
* 3. Modeling

### 2. Data Cleaning
##### Loading Data and Libraries

In [None]:
import helpers
import pandas as pd
import numpy as np
import datetime as dt

# Encoding and Data Split
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split

# Modeling
from sklearn.linear_model import LinearRegression
from sklearn import metrics
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

# Plotting Modules
import matplotlib.pyplot as plt

# Reading the stitched data
trend_data = helpers.load_df("../YouTube-Trending/Data/US_Data.csv")

df = trend_data.copy()
df.head()

Unnamed: 0,video_id,title,publishedAt,channelId,channelTitle,categoryId,trending_date,tags,view_count,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,description,duration,country
0,bPiofmZGb8o,Second 2020 Presidential Debate between Donald...,2020-10-23T02:49:33Z,UCb--64Gl51jIEVE-GLDAVTg,C-SPAN,25,20.23.10,C-SPAN|CSPAN|2020|Donald Trump|Republican|Whit...,6641600,94601,6209,59293,https://i.ytimg.com/vi/bPiofmZGb8o/default.jpg,False,False,President Donald Trump and former Vice Preside...,1H59M15S,US
1,tcYodQoapMg,Ariana Grande - positions (official video),2020-10-23T04:00:10Z,UC0VOyT2OCBKdQhF3BAbZ-1g,ArianaGrandeVevo,10,20.23.10,ariana grande positions|positions ariana grand...,7516529,1485130,10810,140549,https://i.ytimg.com/vi/tcYodQoapMg/default.jpg,False,False,The official “positions” music video by Ariana...,2M58S,US
2,np9Ub1LilKU,Jack Harlow - Tyler Herro [Official Video],2020-10-22T19:00:14Z,UC6vZl7Qj7JglLDmN_7Or-ZQ,Jack Harlow,10,20.23.10,jack harlow|jack rapper|harlow rapper|private ...,1499338,153028,2006,11013,https://i.ytimg.com/vi/np9Ub1LilKU/default.jpg,False,False,Jack Harlow - Tyler HerroListen now: https://J...,3M,US
3,5S4bm3bAt9Y,SURPRISING BEST FRIEND WITH BORAT!!,2020-10-21T19:56:24Z,UCef29bYGgUSoJjVkqhcAPkw,David Dobrik Too,22,20.23.10,[none],5320147,596894,7044,33648,https://i.ytimg.com/vi/5S4bm3bAt9Y/default.jpg,False,False,Thank you Borat for coming over!! I like youWa...,5M55S,US
4,GuEkHIgR46k,Bryson Tiller - Always Forever (Official Video),2020-10-22T16:00:08Z,UCwhe-6skwaZxLomc-U6Wy1w,BrysonTillerVEVO,10,20.23.10,Bryson Tiller 2020|Bryson Tiller Serenity|Brys...,862087,82059,657,4459,https://i.ytimg.com/vi/GuEkHIgR46k/default.jpg,False,False,A N N I V E R S A R Y OUT NOW!Stream/Download:...,2M59S,US


##### Checking the shape of the DataFrame and applying feature engineering

In [None]:
df.shape

(2200, 18)

In [None]:
helpers.featureEng(df)

Unnamed: 0,video_id,title,publishedAt,channelId,channelTitle,categoryId,trending_date,tags,view_count,likes,...,dislikes_log,comment_log,new_date_published,new_date_trending,days_lapse,durationHr,durationMin,durationSec,titleLength,tagCount
0,bPiofmZGb8o,Second 2020 Presidential Debate between Donald...,2020-10-23 02:49:33,UCb--64Gl51jIEVE-GLDAVTg,C-SPAN,25,2020-10-23,C-SPAN|CSPAN|2020|Donald Trump|Republican|Whit...,6641600,94601,...,8.733755,10.990247,2020-10-23,2020-10-23,0.0,1,59,15,66,12
1,tcYodQoapMg,Ariana Grande - positions (official video),2020-10-23 04:00:10,UC0VOyT2OCBKdQhF3BAbZ-1g,ArianaGrandeVevo,10,2020-10-23,ariana grande positions|positions ariana grand...,7516529,1485130,...,9.288227,11.853311,2020-10-23,2020-10-23,0.0,0,2,58,42,22
2,np9Ub1LilKU,Jack Harlow - Tyler Herro [Official Video],2020-10-22 19:00:14,UC6vZl7Qj7JglLDmN_7Or-ZQ,Jack Harlow,10,2020-10-23,jack harlow|jack rapper|harlow rapper|private ...,1499338,153028,...,7.603898,9.306832,2020-10-22,2020-10-23,1440.0,0,3,0,42,26
3,5S4bm3bAt9Y,SURPRISING BEST FRIEND WITH BORAT!!,2020-10-21 19:56:24,UCef29bYGgUSoJjVkqhcAPkw,David Dobrik Too,22,2020-10-23,[none],5320147,596894,...,8.859931,10.423709,2020-10-21,2020-10-23,2880.0,0,5,55,35,0
4,GuEkHIgR46k,Bryson Tiller - Always Forever (Official Video),2020-10-22 16:00:08,UCwhe-6skwaZxLomc-U6Wy1w,BrysonTillerVEVO,10,2020-10-23,Bryson Tiller 2020|Bryson Tiller Serenity|Brys...,862087,82059,...,6.487684,8.402680,2020-10-22,2020-10-23,1440.0,0,2,59,47,22
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2195,2Up7Jbtg37A,FINDING OUT WE'RE PREGNANT WITH OUR RAINBOW BA...,2020-12-19 23:01:13,UCkoZyNqTZyBZ8nu8E_OSkWw,Lauren & Arie,24,2020-12-27,THE BACHELOR|ARIE AND LAUREN|LAUREN AND ARIE|P...,549407,10621,...,5.402677,7.449498,2020-12-19,2020-12-27,11520.0,0,9,31,74,7
2196,Gnoc-XuCT9Y,Making Raising Cane's Chicken Finger Combo At ...,2020-12-20 15:00:04,UChBEbMKI1eCcejTtmI32UEw,Joshua Weissman,26,2020-12-27,raising cane's|raising canes sauce|canes sauce...,960486,51994,...,7.426549,8.313362,2020-12-20,2020-12-27,10080.0,0,10,9,63,23
2197,3niVHCdB1wA,Singapur (Remix) - El Alfa El Jefe x Farruko x...,2020-12-18 16:10:59,UCEU5ZK7DwN9ppqPFJiGah3A,ElAlfaElJefeTV,10,2020-12-27,Music|Musica|Latino|Latin|Urban|Urbano|Trap|Re...,14043659,349090,...,9.303102,10.105938,2020-12-18,2020-12-27,12960.0,0,4,51,93,19
2198,eqLuPPTZquo,CANELO ALVAREZ EXCLUSIVE! SPEAKS ENGLISH SAYS ...,2020-12-20 06:06:06,UCms2Ifa9owy2CK3ToKGkaQw,Behind The Gloves,17,2020-12-27,BOXING|BOXEO|NEWS|EXCLUSIVE|HIGHLIGHTS|SPORTS|...,336800,5199,...,4.905275,6.877296,2020-12-20,2020-12-27,10080.0,0,0,0,88,23
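
`helpers.featureEng` itself is not shown in this notebook, so the following is only a minimal sketch of how the derived columns could be computed. The function name `feature_eng_sketch` and the sample rows are hypothetical. The formulas are inferred from the displayed values: the `*_log` columns match natural logs of the raw counts, `days_lapse` matches the gap between the two dates expressed in minutes (1440.0 for one day), and `likeRatio` (visible after preprocessing) matches `(likes - dislikes) / (likes + dislikes)`. How the real helper treats zero counts or unparsable durations is an assumption here.

```python
import numpy as np
import pandas as pd


def feature_eng_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for helpers.featureEng; returns a new frame."""
    out = df.copy()

    # Natural logs of the raw counts. Zero counts give -inf, which is turned
    # into NaN here (assumption: the original helper's handling is not shown).
    log_columns = [("likes", "likes_log"), ("view_count", "views_log"),
                   ("dislikes", "dislikes_log"), ("comment_count", "comment_log")]
    with np.errstate(divide="ignore"):
        for src, dst in log_columns:
            out[dst] = np.log(out[src].astype(float)).replace(-np.inf, np.nan)

    # Net share of positive ratings; NaN when a video has no ratings at all.
    total = out["likes"] + out["dislikes"]
    out["likeRatio"] = (out["likes"] - out["dislikes"]) / total.where(total > 0)

    # publishedAt is an ISO 8601 timestamp; trending_date is stored as yy.dd.mm.
    published = pd.to_datetime(out["publishedAt"], utc=True).dt.tz_localize(None)
    out["new_date_published"] = published.dt.normalize()
    out["new_date_trending"] = pd.to_datetime(out["trending_date"], format="%y.%d.%m")

    # Despite its name, the column holds minutes (1440.0 per calendar day).
    gap = out["new_date_trending"] - out["new_date_published"]
    out["days_lapse"] = gap.dt.total_seconds() / 60

    # Durations look like "1H59M15S" or "3M"; missing parts count as 0.
    parts = out["duration"].str.extract(r"^(?:PT)?(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?$")
    parts = parts.fillna("0").astype(int)
    out["durationHr"] = parts[0]
    out["durationMin"] = parts[1]
    out["durationSec"] = parts[2]

    # Title length in characters; tags are pipe-separated, "[none]" means no tags.
    out["titleLength"] = out["title"].str.len()
    tags = out["tags"].fillna("[none]")
    out["tagCount"] = np.where(tags == "[none]", 0, tags.str.count(r"\|") + 1)
    return out


# Two made-up rows that reuse counts and dates from the output above.
sample = pd.DataFrame({
    "title": ["Ariana Grande - positions (official video)",
              "Jack Harlow - Tyler Herro [Official Video]"],
    "publishedAt": ["2020-10-23T04:00:10Z", "2020-10-22T19:00:14Z"],
    "trending_date": ["20.23.10", "20.23.10"],
    "tags": ["tag one|tag two|tag three", "[none]"],
    "view_count": [7516529, 1499338],
    "likes": [1485130, 153028],
    "dislikes": [10810, 2006],
    "comment_count": [140549, 11013],
    "duration": ["2M58S", "3M"],
})
features = feature_eng_sketch(sample)
```

Unlike `helpers.featureEng(df)`, which appears to modify `df` in place, the sketch returns a new frame so the raw data stays untouched.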


In [None]:
# The same video is recorded once per day it trends, so these repeated observations must be dropped
df[df.video_id == 'tcYodQoapMg']

Unnamed: 0,video_id,title,publishedAt,channelId,channelTitle,categoryId,trending_date,tags,view_count,likes,...,dislikes_log,comment_log,new_date_published,new_date_trending,days_lapse,durationHr,durationMin,durationSec,titleLength,tagCount
1,tcYodQoapMg,Ariana Grande - positions (official video),2020-10-23 04:00:10,UC0VOyT2OCBKdQhF3BAbZ-1g,ArianaGrandeVevo,10,2020-10-23,ariana grande positions|positions ariana grand...,7516529,1485130,...,9.288227,11.853311,2020-10-23,2020-10-23,0.0,0,2,58,42,22
201,tcYodQoapMg,Ariana Grande - positions (official video),2020-10-23 04:00:10,UC0VOyT2OCBKdQhF3BAbZ-1g,ArianaGrandeVevo,10,2020-10-24,ariana grande positions|positions ariana grand...,25585327,2512982,...,10.447061,12.198529,2020-10-23,2020-10-24,1440.0,0,2,58,42,22
409,tcYodQoapMg,Ariana Grande - positions (official video),2020-10-23 04:00:10,UC0VOyT2OCBKdQhF3BAbZ-1g,ArianaGrandeVevo,10,2020-10-25,ariana grande positions|positions ariana grand...,30951274,2669799,...,10.623934,12.244918,2020-10-23,2020-10-25,2880.0,0,2,58,42,22
635,tcYodQoapMg,Ariana Grande - positions (official video),2020-10-23 04:00:10,UC0VOyT2OCBKdQhF3BAbZ-1g,ArianaGrandeVevo,10,2020-10-26,ariana grande positions|positions ariana grand...,35639675,2805619,...,10.751328,12.284408,2020-10-23,2020-10-26,4320.0,0,2,58,42,22
865,tcYodQoapMg,Ariana Grande - positions (official video),2020-10-23 04:00:10,UC0VOyT2OCBKdQhF3BAbZ-1g,ArianaGrandeVevo,10,2020-10-27,ariana grande positions|positions ariana grand...,39680390,2934736,...,10.840796,12.309869,2020-10-23,2020-10-27,5760.0,0,2,58,42,22
1095,tcYodQoapMg,Ariana Grande - positions (official video),2020-10-23 04:00:10,UC0VOyT2OCBKdQhF3BAbZ-1g,ArianaGrandeVevo,10,2020-10-28,ariana grande positions|positions ariana grand...,43246851,3033311,...,10.902703,12.33129,2020-10-23,2020-10-28,7200.0,0,2,58,42,22
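
To gauge how widespread these repeats are, the observations can be counted per `video_id` before deciding which one to keep. The snippet below is a small sketch on a hypothetical four-row frame (`obs`) that reuses values from the output above. Keeping the most recent snapshot is an assumption made for illustration; which key and which row `helpers.preprocess` actually keeps is not shown in this notebook.

```python
import pandas as pd

# Hypothetical miniature of the trending data: one row per video per trending day.
obs = pd.DataFrame({
    "video_id": ["tcYodQoapMg", "np9Ub1LilKU", "tcYodQoapMg", "tcYodQoapMg"],
    "trending_date": pd.to_datetime(
        ["2020-10-23", "2020-10-23", "2020-10-24", "2020-10-25"]),
    "view_count": [7516529, 1499338, 25585327, 30951274],
})

# Number of trending-day observations recorded for each video.
obs_per_video = obs["video_id"].value_counts()

# Rows that belong to a video seen more than once.
repeated = obs[obs["video_id"].duplicated(keep=False)]

# One row per video: the latest snapshot (assumption), i.e. the largest view count.
latest = (
    obs.sort_values("trending_date")
       .drop_duplicates(subset="video_id", keep="last")
       .reset_index(drop=True)
)
```

Choosing `keep="first"` instead would retain the counts from the first day a video trended, which changes targets such as `views_log` considerably for long-running videos.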


### 2.1 Preprocessing Data
* ##### 2.1.1 Drop Duplicate Rows
* ##### 2.1.2 Drop Columns - Unnecessary or weakly correlated columns
* ##### 2.1.3 Handling Missing Data - Drop rows with missing values
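
The three steps above can be sketched as follows. This is not the code of `helpers.preprocess`, which is not shown here; `preprocess_sketch`, its `dedupe_on` parameter, and the demo frame are hypothetical. The list of retained columns is taken from the post-processed output in section 2.2, and treating infinite values as missing is an added assumption.

```python
import numpy as np
import pandas as pd

# Columns that remain after preprocessing, as listed by df.info() in section 2.2.
KEEP_COLUMNS = [
    "categoryId", "likeRatio", "likes_log", "views_log", "dislikes_log",
    "comment_log", "days_lapse", "durationHr", "durationMin", "durationSec",
    "titleLength", "tagCount",
]


def preprocess_sketch(df: pd.DataFrame, dedupe_on=None) -> pd.DataFrame:
    """Hypothetical version of the three preprocessing steps."""
    # 2.1.1 Drop duplicate rows (all columns by default, or the given key columns).
    out = df.drop_duplicates(subset=dedupe_on)
    # 2.1.2 Keep only the modeling columns, dropping everything else.
    out = out[KEEP_COLUMNS]
    # 2.1.3 Drop rows with missing values (infinities are treated as missing).
    out = out.replace([np.inf, -np.inf], np.nan).dropna()
    return out


# Demo frame: rows 0 and 1 are identical, row 2 has a missing value.
demo = pd.DataFrame({col: [1.0, 1.0, 2.0, 3.0] for col in KEEP_COLUMNS})
demo["video_id"] = ["a", "a", "b", "c"]
demo.loc[2, "likes_log"] = np.nan

curated = preprocess_sketch(demo)
```

The sketch returns a new frame and keeps the original index labels, which is why gaps can appear in the index after rows are removed.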

In [None]:
helpers.preprocess(df)

Unnamed: 0,categoryId,likeRatio,likes_log,views_log,dislikes_log,comment_log,days_lapse,durationHr,durationMin,durationSec,titleLength,tagCount
0,25,0.876818,11.457423,15.708863,8.733755,10.990247,0.0,1,59,15,66,12
1,10,0.985548,14.211013,15.832615,9.288227,11.853311,0.0,0,2,58,42,22
2,10,0.974122,11.938376,14.220534,7.603898,9.306832,1440.0,0,3,0,42,26
3,22,0.976673,13.299495,15.487011,8.859931,10.423709,2880.0,0,5,55,35,0
4,10,0.984114,11.315194,13.667111,6.487684,8.402680,1440.0,0,2,59,47,22
...,...,...,...,...,...,...,...,...,...,...,...,...
2195,24,0.959052,9.270588,13.216595,5.402677,7.449498,11520.0,0,9,31,74,7
2196,26,0.937400,10.858884,13.775195,7.426549,8.313362,10080.0,0,10,9,63,23
2197,10,0.939055,12.763085,16.457682,9.303102,10.105938,12960.0,0,4,51,93,19
2198,17,0.949381,8.556222,12.727245,4.905275,6.877296,10080.0,0,0,0,88,23


### 2.2 Post-Processed Data
* ##### 2.2.1 Column Information
* ##### 2.2.2 Exporting Curated Data

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2156 entries, 0 to 2199
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   categoryId    2156 non-null   int64  
 1   likeRatio     2156 non-null   float64
 2   likes_log     2156 non-null   float64
 3   views_log     2156 non-null   float64
 4   dislikes_log  2156 non-null   float64
 5   comment_log   2156 non-null   float64
 6   days_lapse    2156 non-null   float64
 7   durationHr    2156 non-null   int64  
 8   durationMin   2156 non-null   int64  
 9   durationSec   2156 non-null   int64  
 10  titleLength   2156 non-null   int64  
 11  tagCount      2156 non-null   int64  
dtypes: float64(6), int64(6)
memory usage: 299.0 KB


In [None]:
df.to_csv("../YouTube-Trending/Data/Curated_US_Data.csv", index=False)
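
Because the export uses `index=False`, the original row labels (0 to 2199, with gaps left by dropped rows) are not written to the file, and reloading yields a fresh 0-based index. The sketch below illustrates this with a hypothetical two-row frame and an in-memory buffer instead of the real file path.

```python
import io

import pandas as pd

# Hypothetical slice of the curated data with non-contiguous index labels.
curated_demo = pd.DataFrame(
    {"categoryId": [25, 17], "views_log": [15.708863, 12.727245]},
    index=[0, 2198],
)

# Write without the index, then read the same text back.
buffer = io.StringIO()
curated_demo.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)
```

A quick comparison of `reloaded.shape` and `reloaded.columns` with the frame that was written is a cheap check that the export round-trips as expected.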