## YouTube Trending Project
* ### Machine Learning Models

### Table of Contents:
* 1. Exploratory Data Analysis
* 2. Data Cleaning
* 3. Modeling
    * 3.1 Predicting Likes
        * 3.1.1 Train-Test Split (80:20)
        * 3.1.2 Linear Regression
        * 3.1.3 Decision Trees
        * 3.1.4 Random Forest
    * 3.2 Predicting Views
        * 3.2.1 Train-Test Split (80:20)
        * 3.2.2 Linear Regression
        * 3.2.3 Decision Trees
        * 3.2.4 Random Forest
    * 3.3 Predicting Comment Count
        * 3.3.1 Train-Test Split (80:20)
        * 3.3.2 Linear Regression
        * 3.3.3 Decision Trees
        * 3.3.4 Random Forest

### 3. Machine Learning Models
##### Loading Data and Libraries

In [1]:
import helpers
import pandas as pd
import numpy as np


# Encoding and Data Split
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split

# Modeling
from sklearn.linear_model import LinearRegression
from sklearn import metrics
import scipy.stats as stats
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

# Reading the stitched data
df = helpers.load_df("../YouTube-Trending/Data/Curated_US_Data.csv")


### 3.1 Predicting Likes
#### 3.1.1 Train-Test Split (80:20)
Split the data into train and test sets in an 80:20 ratio. The target is `likes_log`; every other column is used as a feature.

In [2]:
X = df.drop(columns=['likes_log'])
y = df['likes_log']

In [3]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)
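
Without a fixed seed, `train_test_split` draws a different random partition on each run, so the metrics reported in this section can shift slightly between runs. The following is a minimal sketch of a repeatable split, assuming a fixed seed is acceptable for this analysis. The toy frame and the seed value `42` are illustrative assumptions, not taken from the project data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the curated data (hypothetical values, not project data)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "views_log": rng.normal(12, 1, 100),
    "comment_log": rng.normal(7, 1, 100),
    "likes_log": rng.normal(10, 1, 100),
})

X_toy = toy.drop(columns=["likes_log"])
y_toy = toy["likes_log"]

# Fixing random_state makes the 80:20 partition identical across runs
split_a = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)
split_b = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

X_train_a, X_test_a = split_a[0], split_a[1]
X_train_b, X_test_b = split_b[0], split_b[1]

print(len(X_train_a), len(X_test_a))           # prints: 80 20
print(X_test_a.index.equals(X_test_b.index))   # prints: True
```

Passing the same `random_state` to the split for each target would also make the results for the three targets easier to compare.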

#### 3.1.2 Linear Regression

In [4]:
linreg = LinearRegression()
linreg.fit(X_train, y_train)

LinearRegression()

In [5]:
y_pred = linreg.predict(X_test)

compare = pd.DataFrame({'Actual': y_test.values.flatten(), 'Predicted': y_pred.flatten()})

compare.head(10)

| | Actual | Predicted |
|---|---|---|
| 0 | 11.820785 | 12.132594 |
| 1 | 9.087721 | 9.946228 |
| 2 | 10.354053 | 10.080516 |
| 3 | 10.271389 | 10.855188 |
| 4 | 11.637185 | 11.456772 |
| 5 | 12.09092 | 12.337641 |
| 6 | 13.084304 | 12.779471 |
| 7 | 10.272047 | 10.884827 |
| 8 | 11.998384 | 11.735399 |
| 9 | 9.833119 | 9.537033 |


In [6]:
# Reuse y_pred from the previous cell instead of predicting again for each metric
mae = metrics.mean_absolute_error(y_test, y_pred)
mse = metrics.mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)  # RMSE is the square root of the MSE
r2 = metrics.r2_score(y_test, y_pred)

print("mae: ", mae)
print("mse: ", mse)
print("rmse: ", rmse)
print("r2: ", r2)

mae:  0.3872050852893101
mse:  0.24979684714847952
rmse:  0.49979680586062125
r2:  0.8859949193095084
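
The same four metrics are computed for every model in this section. The following is a minimal sketch of a helper that predicts once and returns all four. The name `regression_report` is hypothetical (it is not part of `helpers` or scikit-learn), and the synthetic data is for illustration only.

```python
import numpy as np
from sklearn import metrics
from sklearn.linear_model import LinearRegression


def regression_report(model, X, y):
    """Return MAE, MSE, RMSE and R^2 for a fitted model (hypothetical helper)."""
    pred = model.predict(X)  # predict once and reuse for every metric
    mse = metrics.mean_squared_error(y, pred)
    return {
        "mae": metrics.mean_absolute_error(y, pred),
        "mse": mse,
        "rmse": float(np.sqrt(mse)),
        "r2": metrics.r2_score(y, pred),
    }


# Synthetic data with an exact linear relationship, for illustration only
X_demo = np.arange(20, dtype=float).reshape(-1, 1)
y_demo = 3.0 * X_demo.ravel() + 1.0

demo_model = LinearRegression().fit(X_demo, y_demo)
report = regression_report(demo_model, X_demo, y_demo)
for name, value in report.items():
    print(f"{name}: {value:.4f}")
```

Because the target here is `likes_log`, the MAE, MSE and RMSE above are in log units rather than raw like counts, assuming `likes_log` is a log transform of the like count.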


#### 3.1.3 Decision Trees

In [7]:
decTreeReg = DecisionTreeRegressor()
decTreeReg.fit(X_train,y_train)

DecisionTreeRegressor()

In [8]:
# Predict once on the test set and reuse the result for every metric
y_pred = decTreeReg.predict(X_test)
mae = metrics.mean_absolute_error(y_test, y_pred)
mse = metrics.mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = metrics.r2_score(y_test, y_pred)

print("mae: ", mae)
print("mse: ", mse)
print("rmse: ", rmse)
print("r2: ", r2)

mae:  0.1422110764462419
mse:  0.062362250059451715
rmse:  0.24972434815101974
r2:  0.9715384183938781


#### 3.1.4 Random Forest

In [9]:
randomForestReg = RandomForestRegressor()
randomForestReg.fit(X_train,y_train)

RandomForestRegressor()

In [10]:
# Predict once on the test set and reuse the result for every metric
y_pred = randomForestReg.predict(X_test)
mae = metrics.mean_absolute_error(y_test, y_pred)
mse = metrics.mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = metrics.r2_score(y_test, y_pred)

print("mae: ", mae)
print("mse: ", mse)
print("rmse: ", rmse)
print("r2: ", r2)

mae:  0.0948148574625263
mse:  0.02410332330958596
rmse:  0.15525245025308285
r2:  0.988999455556839


#### 3.1.4.1 Feature Importance

In [11]:
pd.DataFrame({'Feature':X_test.columns, 
              'Importance':randomForestReg.feature_importances_}).sort_values(by='Importance',ascending=False)

| | Feature | Importance |
|---|---|---|
| 4 | comment_log | 0.646573 |
| 1 | likeRatio | 0.124834 |
| 2 | views_log | 0.118658 |
| 3 | dislikes_log | 0.089749 |
| 9 | titleLength | 0.005408 |
| 7 | durationMin | 0.004546 |
| 8 | durationSec | 0.003198 |
| 10 | tagCount | 0.003132 |
| 0 | categoryId | 0.002067 |
| 5 | days_lapse | 0.001105 |
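
The `feature_importances_` values above are impurity-based and are computed from the training data. One possible cross-check, not part of the original analysis, is permutation importance measured on the held-out set. The following is a minimal sketch using scikit-learn's `permutation_importance`. The synthetic `signal` and `noise` columns, the seed and the `n_repeats` value are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: the target depends on "signal" only; "noise" is irrelevant
rng = np.random.default_rng(0)
n = 300
demo = pd.DataFrame({
    "signal": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
target = 2.0 * demo["signal"] + rng.normal(scale=0.1, size=n)

X_tr, X_te, y_tr, y_te = train_test_split(
    demo, target, test_size=0.2, random_state=0
)
forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Shuffle each column of the test set in turn and record the drop in R^2
result = permutation_importance(forest, X_te, y_te, n_repeats=5, random_state=0)

ranking = pd.DataFrame({
    "Feature": X_te.columns,
    "Importance": result.importances_mean,
}).sort_values(by="Importance", ascending=False)
print(ranking)
```

In the notebook, the same call could take `randomForestReg`, `X_test` and `y_test` in place of the synthetic objects.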
