## YouTube Trending Project
* ### Machine Learning Models

### Table of Contents:
* 1.Exploratory Data Analysis
* 2.Data Cleaning
* 3.Modeling
    * 3.1 Predicting Likes
        * 3.1.1 Train-Test Split (80:20)
        * 3.1.2 Linear Regression
        * 3.1.3 Decision Trees
        * 3.1.4 Random Forest
    * 3.2 Predicting Views
        * 3.2.1 Train-Test Split (80:20)
        * 3.2.2 Linear Regression
        * 3.2.3 Decision Trees
        * 3.2.4 Random Forest
    * 3.3 Predicting Comment Count
        * 3.3.1 Train-Test Split (80:20)
        * 3.3.2 Linear Regression
        * 3.3.3 Decision Trees
        * 3.3.4 Random Forest

### 3. Machine Learning Models
##### Loading Data and Libraries

In [1]:
import helpers
import pandas as pd
import numpy as np


# Encoding and Data Split
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split

# Modeling
from sklearn.linear_model import LinearRegression
from sklearn import metrics
import scipy.stats as stats
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

# Reading the stitched data
df = helpers.load_df("../YouTube-Trending/Data/Curated_US_Data.csv")


### 3.1 Predicting Likes
#### 3.1.1 Train-Test Split (80:20)
The data is split into training and test sets in an 80:20 ratio, with `likes_log` as the target and all remaining columns as features.

In [2]:
X = df.drop(columns=['likes_log'])
y = df['likes_log']

In [3]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)
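
The split above draws a different random partition on every run, so the metrics reported later will vary slightly between executions. A minimal sketch of a reproducible split follows. It uses a small synthetic frame as a stand-in for `df` (the column values are made up for illustration), and the seed `42` is an arbitrary choice, not something taken from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the curated data (hypothetical values, illustration only)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "views_log": rng.normal(12, 2, size=100),
    "comment_log": rng.normal(7, 2, size=100),
    "likes_log": rng.normal(10, 2, size=100),
})

X_demo = demo.drop(columns=["likes_log"])
y_demo = demo["likes_log"]

# random_state fixes the shuffle, so repeated calls return identical splits
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

X_tr, X_te, y_tr, y_te = split_a
print(X_tr.shape, X_te.shape)
print(X_tr.index.equals(split_b[0].index))
```

Passing the same `random_state` in sections 3.2 and 3.3 would also make the three targets comparable on identical rows, if that is desired.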

#### 3.1.2 Linear Regression

In [4]:
linreg = LinearRegression()
linreg.fit(X_train, y_train)

LinearRegression()

In [5]:
y_pred = linreg.predict(X_test)

compare = pd.DataFrame({'Actual': y_test.values.flatten(), 'Predicted': y_pred.flatten()})

compare.head(10)

| | Actual | Predicted |
|---|---|---|
| 0 | 13.257265 | 13.361596 |
| 1 | 11.235392 | 11.594916 |
| 2 | 10.135075 | 10.018972 |
| 3 | 12.647182 | 12.627763 |
| 4 | 9.260463 | 10.309262 |
| 5 | 10.427654 | 10.091805 |
| 6 | 10.308386 | 9.975471 |
| 7 | 10.614278 | 10.564399 |
| 8 | 10.011849 | 9.94192 |
| 9 | 9.093582 | 9.315039 |


In [6]:
# Reuse the predictions from the previous cell instead of calling predict() per metric
mae = metrics.mean_absolute_error(y_test, y_pred)
mse = metrics.mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = metrics.r2_score(y_test, y_pred)

print("mae: ", mae)
print("mse: ", mse)
print("rmse: ", rmse)
print("r2: ", r2)

mae:  0.35918157973707876
mse:  0.22415761079807903
rmse:  0.47345286016464094
r2:  0.8890759189582456
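
The same four metrics are computed for every model in this notebook. One way to avoid repeating the cell is a small helper, sketched below. `regression_report` is a hypothetical name introduced here for illustration; it is not part of the notebook's `helpers` module or of scikit-learn. The example values are tiny and hand-checkable rather than taken from the YouTube data.

```python
import numpy as np
from sklearn import metrics


def regression_report(y_true, y_pred):
    """Return MAE, MSE, RMSE and R^2 for one set of predictions."""
    mse = metrics.mean_squared_error(y_true, y_pred)
    return {
        "mae": metrics.mean_absolute_error(y_true, y_pred),
        "mse": mse,
        "rmse": float(np.sqrt(mse)),  # RMSE is the square root of the MSE
        "r2": metrics.r2_score(y_true, y_pred),
    }


# Hand-checkable example: two predictions are off by 0.5, two are exact
y_true_demo = np.array([1.0, 2.0, 3.0, 4.0])
y_pred_demo = np.array([1.5, 2.0, 2.5, 4.0])

report = regression_report(y_true_demo, y_pred_demo)
for name, value in report.items():
    print(f"{name}: {value:.4f}")
```

Because the target here is `likes_log`, the MAE and RMSE are in log units, so they describe relative rather than absolute error in the like count.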


#### 3.1.3 Decision Trees

In [7]:
decTreeReg = DecisionTreeRegressor()
decTreeReg.fit(X_train,y_train)

DecisionTreeRegressor()

In [8]:
# Predict once and reuse the result for every metric
y_pred_tree = decTreeReg.predict(X_test)
mae = metrics.mean_absolute_error(y_test, y_pred_tree)
mse = metrics.mean_squared_error(y_test, y_pred_tree)
rmse = np.sqrt(mse)
r2 = metrics.r2_score(y_test, y_pred_tree)

print("mae: ", mae)
print("mse: ", mse)
print("rmse: ", rmse)
print("r2: ", r2)

mae:  0.17432874864886252
mse:  0.09042438614343767
rmse:  0.3007064783862125
r2:  0.955253618643532
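
`DecisionTreeRegressor()` with default settings grows the tree until the leaves are pure, which usually means a near-perfect fit on the training set. Comparing train and test scores is a quick way to see how much of the fit generalizes. The sketch below does this on synthetic data (the features and target are invented for illustration), and `max_depth=5` is an arbitrary value chosen only to show the contrast, not a tuned setting for this project.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem with noise (illustration only)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(500, 3))
y_demo = 2.0 * X_demo[:, 0] + np.sin(X_demo[:, 1]) + rng.normal(0, 0.5, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

scores = {}
for depth in (None, 5):  # None = unrestricted depth (the scikit-learn default)
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0)
    tree.fit(X_tr, y_tr)
    # score() returns R^2 for regressors
    scores[depth] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))
    print(f"max_depth={depth}: train R2={scores[depth][0]:.3f}, "
          f"test R2={scores[depth][1]:.3f}")
```

Running the same comparison with `decTreeReg.score(X_train, y_train)` and `decTreeReg.score(X_test, y_test)` would show the size of the gap on the actual data.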


#### 3.1.4 Random Forest

In [9]:
randomForestReg = RandomForestRegressor()
randomForestReg.fit(X_train,y_train)

RandomForestRegressor()

In [10]:
# Predict once and reuse the result for every metric
y_pred_forest = randomForestReg.predict(X_test)
mae = metrics.mean_absolute_error(y_test, y_pred_forest)
mse = metrics.mean_squared_error(y_test, y_pred_forest)
rmse = np.sqrt(mse)
r2 = metrics.r2_score(y_test, y_pred_forest)

print("mae: ", mae)
print("mse: ", mse)
print("rmse: ", rmse)
print("r2: ", r2)

mae:  0.1356406425675881
mse:  0.044291835553718135
rmse:  0.21045625567732154
r2:  0.9780822469558067


#### 3.1.4.1 Feature Importance

In [11]:
pd.DataFrame({'Feature':X_test.columns, 
              'Importance':randomForestReg.feature_importances_}).sort_values(by='Importance',ascending=False)

| | Feature | Importance |
|---|---|---|
| 4 | comment_log | 0.661596 |
| 2 | views_log | 0.130392 |
| 1 | likeRatio | 0.118153 |
| 3 | dislikes_log | 0.062585 |
| 9 | titleLength | 0.007015 |
| 7 | durationMin | 0.005868 |
| 10 | tagCount | 0.005677 |
| 8 | durationSec | 0.004184 |
| 0 | categoryId | 0.00263 |
| 5 | days_lapse | 0.001617 |
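
The values above come from `feature_importances_`, which scikit-learn computes from impurity reduction on the training data and normalizes to sum to 1. A possible cross-check, not used in the original notebook, is permutation importance on the held-out set via `sklearn.inspection.permutation_importance`. The sketch below runs both on synthetic data; the column names `strong`, `weak`, and `noise` are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: one dominant feature, one weak feature, one pure-noise feature
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "strong": rng.normal(size=400),
    "weak": rng.normal(size=400),
    "noise": rng.normal(size=400),
})
y_demo = 3.0 * X_demo["strong"] + 0.5 * X_demo["weak"] + rng.normal(0, 0.1, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Shuffle each test column in turn and measure the drop in R^2
perm = permutation_importance(forest, X_te, y_te, n_repeats=10, random_state=0)

importance = pd.DataFrame({
    "Feature": X_te.columns,
    "Impurity": forest.feature_importances_,
    "Permutation": perm.importances_mean,
}).sort_values(by="Permutation", ascending=False)
print(importance)
```

On the real data, the equivalent call would be `permutation_importance(randomForestReg, X_test, y_test, ...)`; whether it changes the ranking shown above is not something this sketch can establish.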
