## YouTube Trending Project
* ### Machine Learning Models

### Table of Contents:
* 1.Exploratory Data Analysis
* 2.Data Cleaning
* 3.Modeling
    * 3.1 Predicting Likes
        * 3.1.1 Pre-processing Data
            * 3.1.1.1 Train-Test Split (80:20)
            * 3.1.1.2 Initializing Pre-processing Pipeline
        * 3.1.2 Hyperparameter Tuning (Optuna)
        * 3.1.3 Regressors
            * 3.1.3.1 Linear Regression
            * 3.1.3.2 Random Forest
            * 3.1.3.3 XGBoost
        * 3.1.4 Random Forest
            * 3.1.4.1 Feature Importance
        * 3.1.5 Likes Evaluation
    * 3.2 Predicting Views
        * 3.2.1 Pre-processing Data
            * 3.2.1.1 Train-Test Split (80:20)
            * 3.2.1.2 Initializing Pre-processing Pipeline
        * 3.2.2 Hyperparameter Tuning (Optuna)
        * 3.2.3 Regressors
            * 3.2.3.1 Linear Regression
            * 3.2.3.2 Random Forest
            * 3.2.3.3 XGBoost
        * 3.2.4 Random Forest
            * 3.2.4.1 Feature Importance
        * 3.2.5 Views Evaluation
    * 3.3 Predicting Comment Count
        * 3.3.1 Pre-processing Data
            * 3.3.1.1 Train-Test Split (80:20)
            * 3.3.1.2 Initializing Pre-processing Pipeline
        * 3.3.2 Hyperparameter Tuning (Gridsearch)
        * 3.3.3 Regressors
            * 3.3.3.1 Linear Regression
            * 3.3.3.2 Decision Trees
            * 3.3.3.3 Random Forest
        * 3.3.4 Random Forest
            * 3.3.4.1 Feature Importance

### 3. Machine Learning Models
##### Loading Data and Libraries

In [4]:
import helpers
import pandas as pd
import numpy as np
import seaborn as sns


# Encoding and Data Split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split

# Modeling
from sklearn import metrics
import xgboost as xgb
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

# Tuning
import optuna
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_validate
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score

# Reading the stitched data
df = helpers.load_df("Data/Curated_US_Data.csv")

df.head()

Unnamed: 0,categoryId,likeRatio,likes_log,views_log,dislikes_log,comment_log,days_lapse,durationHr,durationMin,durationSec,titleLength,tagCount
0,25,0.876818,11.457423,15.708863,8.733755,10.990247,0.0,1,59,15,66,12
1,10,0.985548,14.211013,15.832615,9.288227,11.853311,0.0,0,2,58,42,22
2,10,0.974122,11.938376,14.220534,7.603898,9.306832,1440.0,0,3,0,42,26
3,22,0.976673,13.299495,15.487011,8.859931,10.423709,2880.0,0,5,55,35,0
4,10,0.984114,11.315194,13.667111,6.487684,8.40268,1440.0,0,2,59,47,22


### 3.1 Predicting Likes
#### 3.1.1 Preprocessing Data
##### 3.1.1.1 Train-Test Split (80:20)
Splitting the data into train and test sets in an 80:20 ratio.

In [None]:
X = df.drop(columns=['likes_log'])
y = df['likes_log']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)
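The split above does not fix a seed, so every run produces a different partition and slightly different scores. A minimal sketch of pinning `random_state` for a repeatable 80:20 split; the frame here is synthetic and the seed value 42 is an arbitrary assumption, not something used in this notebook:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the curated data (values are illustrative only).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "views_log": rng.normal(14, 1, size=100),
    "comment_log": rng.normal(9, 1, size=100),
    "likes_log": rng.normal(11, 1, size=100),
})
X_demo = demo.drop(columns=["likes_log"])
y_demo = demo["likes_log"]

# Fixing random_state makes the 80:20 partition identical on every run.
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

X_tr, X_te, y_tr, y_te = split_a
print(X_tr.shape, X_te.shape)
print(X_te.index.equals(split_b[1].index))
```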

##### 3.1.1.2 Initializing Preprocessing Pipeline
Scaling numerical data and encoding categorical data.

In [None]:
numeric_features = X.select_dtypes(include=['int64', 'float64']).drop(['durationHr','durationMin','durationSec', 'categoryId'],axis=1).columns
categorical_features = list(X.select_dtypes(include=['object']).columns) + ['durationHr','durationMin','durationSec', 'categoryId']

preprocessor = ColumnTransformer(
    transformers=[
        ('numerical', StandardScaler(), numeric_features),
        ('categorical', OneHotEncoder(handle_unknown = "ignore"), categorical_features)])

y

0       11.457423
1       14.211013
2       11.938376
3       13.299495
4       11.315194
          ...    
2542    10.416820
2543     8.392990
2544    11.840941
2545    10.822415
2546     9.692643
Name: likes_log, Length: 2547, dtype: float64

In [None]:
print('Numeric Features:', numeric_features)
print('Categorical Features:', categorical_features)

Numeric Features: Index(['categoryId', 'likeRatio', 'views_log', 'dislikes_log', 'comment_log',
       'days_lapse', 'titleLength', 'tagCount'],
      dtype='object')
Categorical Features: ['durationHr', 'durationMin', 'durationSec']
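To make the effect of the preprocessor concrete, here is a small sketch on a toy frame (made-up rows, two columns only). It shows that each one-hot encoded column expands into one indicator column per distinct value, and that `handle_unknown="ignore"` encodes an unseen category as all zeros. `sparse_threshold=0` is added here only to force a dense array for printing; it is not part of the notebook's pipeline.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame with one numeric and one categorical-style column (illustrative values).
toy = pd.DataFrame({
    "titleLength": [66, 42, 42, 35],
    "durationHr": [1, 0, 0, 0],
})

toy_pre = ColumnTransformer(
    transformers=[
        ("numerical", StandardScaler(), ["titleLength"]),
        ("categorical", OneHotEncoder(handle_unknown="ignore"), ["durationHr"]),
    ],
    sparse_threshold=0,  # always return a dense array (for display only)
)

encoded = toy_pre.fit_transform(toy)
# One scaled column plus one indicator column per distinct durationHr value.
print(encoded.shape)

# A durationHr value never seen during fit is encoded as all zeros, not an error.
unseen = pd.DataFrame({"titleLength": [50], "durationHr": [7]})
unseen_encoded = toy_pre.transform(unseen)
print(unseen_encoded)
```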


#### 3.1.2 Hyperparameter Tuning (Optuna)
Using Bayesian hyperparameter optimization to find optimal parameters.
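The objectives in this section return the mean of `cross_val_score`. For a regressor with no `scoring` argument, that value is the cross-validated R², not a classification accuracy. A minimal sketch on synthetic data (the data and the `LinearRegression` stand-in are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data (illustrative only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.3, size=200)

model = LinearRegression()

# With no `scoring` argument, a regressor is scored with its own .score(), i.e. R^2.
default_scores = cross_val_score(model, X_demo, y_demo)
r2_scores = cross_val_score(model, X_demo, y_demo, scoring="r2")

print(len(default_scores))  # number of folds used by default
print(np.allclose(default_scores, r2_scores))
```

Note that the objectives score on the full `X` and `y`, so the rows later used as the test set also inform the tuning. Passing `X_train` and `y_train` instead would keep the test set fully held out.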

In [None]:
def rfObjective(trial):
    rfParams={
        'n_estimators' : trial.suggest_int('n_estimators', 100, 500),
        'max_depth' : trial.suggest_int('max_depth', 1, 50),
        'min_samples_leaf' : trial.suggest_int('min_samples_leaf', 1,15),
        'min_samples_split' : trial.suggest_int('min_samples_split', 2,15)
    }
    
    
    rfPipe = Pipeline(steps=[('preprocessor', preprocessor),
                        ('regressor', RandomForestRegressor(
        **rfParams
    ))])

    return cross_val_score(rfPipe, X, y, n_jobs = -1).mean()

rfStudy = optuna.create_study(direction='maximize')
rfStudy.optimize(rfObjective, n_trials=100)

rfTrial = rfStudy.best_trial

print('Mean CV R^2: {}'.format(rfTrial.value))
print("Best hyperparameters: {}".format(rfTrial.params))

[I 2021-01-20 16:24:44,140] A new study created in memory with name: no-name-8f0dc63a-16b3-484a-8b5f-f1027cf47f5c
[I 2021-01-20 16:25:09,658] Trial 0 finished with value: 0.9637888646136876 and parameters: {'n_estimators': 454, 'max_depth': 37, 'min_samples_leaf': 10, 'min_samples_split': 13}. Best is trial 0 with value: 0.9637888646136876.
[I 2021-01-20 16:25:24,988] Trial 1 finished with value: 0.9735153043512039 and parameters: {'n_estimators': 232, 'max_depth': 19, 'min_samples_leaf': 5, 'min_samples_split': 14}. Best is trial 1 with value: 0.9735153043512039.
[I 2021-01-20 16:25:31,616] Trial 2 finished with value: 0.9487165421480107 and parameters: {'n_estimators': 172, 'max_depth': 6, 'min_samples_leaf': 8, 'min_samples_split': 12}. Best is trial 1 with value: 0.9735153043512039.
[I 2021-01-20 16:25:50,764] Trial 3 finished with value: 0.9729347580150588 and parameters: {'n_estimators': 291, 'max_depth': 41, 'min_sampl

In [None]:
def xgbObjective(trial):
    xgbParams = {
        'n_estimators' : trial.suggest_int('n_estimators', 100,500),
        'max_depth' : trial.suggest_int('max_depth', 1, 20),
        'eta' : trial.suggest_float('eta', 0.01, 1), # learning_rate
        'subsample': trial.suggest_float('subsample', 0.1, 1),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.1, 1),
        'gamma': trial.suggest_int('gamma', 0, 10), # min_split_loss
        'min_child_weight' : trial.suggest_float('min_child_weight', 0.1, 1.0)
    }

    xgbPipe = Pipeline(steps=[('preprocessor', preprocessor),
                              ('regressor', xgb.XGBRegressor(
        **xgbParams
    ))])
    
    return cross_val_score(xgbPipe, X, y, n_jobs = -1).mean()

xgbStudy = optuna.create_study(direction='maximize')
xgbStudy.optimize(xgbObjective, n_trials=100)

xgbTrial = xgbStudy.best_trial

print('Mean CV R^2: {}'.format(xgbTrial.value))
print("Best hyperparameters: {}".format(xgbTrial.params))

[I 2021-01-20 17:08:35,813] A new study created in memory with name: no-name-e0d40076-5bfa-491e-b431-dfb4bf68b971
[I 2021-01-20 17:09:39,284] Trial 0 finished with value: 0.9155393329911383 and parameters: {'n_estimators': 129, 'max_depth': 11, 'eta': 0.21074438775116847, 'subsample': 0.15082033078308843, 'colsample_bytree': 0.9336611458554654, 'gamma': 7, 'min_child_weight': 0.6033670326503459}. Best is trial 0 with value: 0.9155393329911383.
[I 2021-01-20 17:17:10,005] Trial 1 finished with value: 0.9516743454657002 and parameters: {'n_estimators': 375, 'max_depth': 12, 'eta': 0.1711429760262995, 'subsample': 0.9647689475781496, 'colsample_bytree': 0.6430206736007188, 'gamma': 7, 'min_child_weight': 0.63920205323985}. Best is trial 1 with value: 0.9516743454657002.
[I 2021-01-20 17:18:40,553] Trial 2 finished with value: 0.90179830481093 and parameters: {'n_estimators': 161, 'max_depth': 6, 'eta': 0.5242098548309986, 'subsample': 0.2127

[I 2021-01-20 20:21:56,880] Trial 38 finished with value: 0.9575750708032489 and parameters: {'n_estimators': 477, 'max_depth': 15, 'eta': 0.25160424382386776, 'subsample': 0.5131572831129348, 'colsample_bytree': 0.598016050018849, 'gamma': 4, 'min_child_weight': 0.4975463866755369}. Best is trial 33 with value: 0.9831162861853345.
[I 2021-01-20 20:25:08,762] Trial 39 finished with value: 0.9607448695206887 and parameters: {'n_estimators': 308, 'max_depth': 9, 'eta': 0.5684224634523176, 'subsample': 0.6066726209145208, 'colsample_bytree': 0.46672938120219915, 'gamma': 2, 'min_child_weight': 0.9943459709547866}. Best is trial 33 with value: 0.9831162861853345.
[I 2021-01-20 20:29:03,353] Trial 40 finished with value: 0.9546218424728565 and parameters: {'n_estimators': 418, 'max_depth': 11, 'eta': 0.1953988283142195, 'subsample': 0.6662724989985026, 'colsample_bytree': 0.3895509665193412, 'gamma': 6, 'min_child_weight': 0.6795324426598636}. Best is tria

[I 2021-01-20 20:38:18,288] Trial 43 finished with value: 0.9809469147390871 and parameters: {'n_estimators': 359, 'max_depth': 5, 'eta': 0.4820960538992174, 'subsample': 0.6553357852262193, 'colsample_bytree': 0.9524083941745269, 'gamma': 0, 'min_child_weight': 0.5598424974772418}. Best is trial 33 with value: 0.9831162861853345.
[I 2021-01-20 20:40:30,768] Trial 44 finished with value: 0.982119366436374 and parameters: {'n_estimators': 315, 'max_depth': 5, 'eta': 0.45445457726984556, 'subsample': 0.5864229939070947, 'colsample_bytree': 0.9155358249442423, 'gamma': 0, 'min_child_weight': 0.5445151450488493}. Best is trial 33 with value: 0.9831162861853345.
[I 2021-01-20 20:42:22,261] Trial 45 finished with value: 0.9652381558783253 and parameters: {'n_estimators': 317, 'max_depth': 5, 'eta': 0.6615806190484514, 'subsample': 0.40859766313102996, 'colsample_bytree': 0.9049195915875266, 'gamma': 1, 'min_child_weight': 0.5987238310936274}. Best is trial 

#### 3.1.3 Regressors
* ##### 3.1.3.1 Linear Regression
* ##### 3.1.3.2 Random Forest
* ##### 3.1.3.3 XGBoost


In [None]:
regressors = [
        LinearRegression(),
        RandomForestRegressor(**rfTrial.params),
        xgb.XGBRegressor(**xgbTrial.params),
    ]

for regressor in regressors:
    pipe = Pipeline(steps=[('preprocessor', preprocessor),
                           ('regressor', regressor)])
    pipe.fit(X_train, y_train)   
    print(regressor)
    
    # y_pred = pipe.predict(X_test)

    # d1 = {'True Labels': y_test, 'Predicted Labels': y_pred}
    # SK = pd.DataFrame(data = d1)
    # print(SK)

    print("Model Score: %.3f" % pipe.score(X_test, y_test))

    # Predict once and reuse the result for every metric
    y_pred = pipe.predict(X_test)
    mae = metrics.mean_absolute_error(y_test, y_pred)
    mse = metrics.mean_squared_error(y_test, y_pred)
    rmse = np.sqrt(mse)
    print("mae: ", mae)
    print("mse: ", mse)
    print("rmse: ", rmse, "\n")

    
    # lm1 = sns.lmplot(x="True Labels", y="Predicted Labels", data = SK, size = 10)
    # fig1 = lm1.fig 
    # fig1.suptitle("Sklearn ", fontsize=18)
    # sns.set(font_scale = 1.5)


LinearRegression()
Model Score: 0.882
mae:  0.36979917847899707
mse:  0.2297293236136488
rmse:  0.47930086961495155 

RandomForestRegressor(max_depth=19, n_estimators=362)
Model Score: 0.987
mae:  0.09466032492980873
mse:  0.025650006697925148
rmse:  0.16015619469107384 

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=0.8875531287549862,
             eta=0.2652603675315999, gamma=0, gpu_id=-1, importance_type='gain',
             interaction_constraints='', learning_rate=0.265260369,
             max_delta_step=0, max_depth=2, min_child_weight=0.4879239150544281,
             missing=nan, monotone_constraints='()', n_estimators=312, n_jobs=4,
             num_parallel_tree=1, random_state=0, reg_alpha=0, reg_lambda=1,
             scale_pos_weight=1, subsample=0.29606192287744726,
             tree_method='exact', validate_parameters=1, verbosity=None)
Model Score: 0.992
mae:  0.09416308965091333
mse:  0.01546723549
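For reference, the four numbers reported for each regressor can be computed directly from their definitions. The sketch below uses NumPy only and made-up log-scale values; the helper name `regression_report` is hypothetical and not part of this notebook:

```python
import numpy as np

def regression_report(y_true, y_pred):
    """Return MAE, MSE, RMSE and R^2 computed directly from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mae = np.mean(np.abs(residuals))
    mse = np.mean(residuals ** 2)
    rmse = np.sqrt(mse)
    # R^2 = 1 - (residual sum of squares / total sum of squares)
    r2 = 1.0 - np.sum(residuals ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    return {"mae": mae, "mse": mse, "rmse": rmse, "r2": r2}

# Toy log-scale targets and predictions (made-up numbers).
report = regression_report([11.0, 12.0, 13.0, 14.0], [11.5, 12.0, 12.5, 14.0])
print(report)
```

`pipe.score(X_test, y_test)` in the loop above returns the same R² quantity.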

#### 3.1.4 Random Forest Regressor

In [None]:
reg = RandomForestRegressor(**rfTrial.params, oob_score=True)

pipe = Pipeline(steps=[('preprocessor', preprocessor),
              ('regressor', reg)])
pipe.fit(X_train, y_train)   
print(reg)

print("Model Train Score: %.3f" % pipe.score(X_train, y_train))
print("Model OOB Score: %.3f" % reg.oob_score_)
print("Model Test Score: %.3f" % pipe.score(X_test, y_test))

RandomForestRegressor(max_depth=19, n_estimators=362, oob_score=True)
Model Train Score: 0.998
Model OOB Score: 0.988
Model Test Score: 0.987
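The three scores above measure different things: the train score is computed on rows the trees were fitted on, the OOB score evaluates each row using only trees whose bootstrap sample excluded it, and the test score uses the held-out 20%. A small sketch on synthetic data (the dataset and hyperparameters are assumptions for illustration) showing the same pattern of an optimistic train score next to a more realistic OOB score:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic data; the notebook itself uses the curated YouTube frame.
X_demo, y_demo = make_regression(n_samples=400, n_features=6, noise=20.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

forest = RandomForestRegressor(n_estimators=200, oob_score=True, random_state=0)
forest.fit(X_tr, y_tr)

train_r2 = forest.score(X_tr, y_tr)  # scored on rows the trees were fitted on
oob_r2 = forest.oob_score_           # each row scored only by trees that never saw it
test_r2 = forest.score(X_te, y_te)   # scored on the held-out 20%

print(f"train={train_r2:.3f}  oob={oob_r2:.3f}  test={test_r2:.3f}")
```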


##### 3.1.4.1 Feature Importance

In [None]:
pd.DataFrame(zip(X.columns,reg.feature_importances_),columns=['feature','importance']).sort_values(by='importance',ascending=False)


Unnamed: 0,feature,importance
4,comment_log,0.59161
2,views_log,0.148967
1,likeRatio,0.14621
3,dislikes_log,0.098877
6,durationHr,0.003406
7,durationMin,0.002466
0,categoryId,0.002083
5,days_lapse,0.00114
8,durationSec,0.000156
9,titleLength,0.000113


#### 3.1.5 Likes Evaluation

In [None]:
# This XGBoost pipeline is constructed but not fitted; the metrics below come
# from `pipe`, the Random Forest pipeline defined in 3.1.4.
xgbEval = Pipeline(steps=[('preprocessor', preprocessor),
                          ('regressor', xgb.XGBRegressor(**xgbTrial.params))])
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)

mae = metrics.mean_absolute_error(y_test, y_pred)
mse = metrics.mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = metrics.r2_score(y_test, y_pred)

print("mae: ", mae)
print("mse: ", mse)
print("rmse: ", rmse)
print("r2: ", r2)

mae:  0.0953155452433225
mse:  0.025795705576908474
rmse:  0.16061041553058902
r2:  0.9867018591500877
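These errors are measured on the natural-log scale, so they are easier to read as multiplicative factors. As a rough interpretation (an approximation, not an exact error bound), an error of one RMSE in log space corresponds to being off by a factor of `exp(rmse)` in raw likes:

```python
import math

# RMSE reported above, measured on the natural-log scale.
rmse_log = 0.16061041553058902

# exp(rmse) is the multiplicative factor corresponding to one RMSE in log space.
factor = math.exp(rmse_log)
print(f"typical multiplicative error: x{factor:.3f} (about {100 * (factor - 1):.1f}%)")
```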


In [None]:
results = pd.DataFrame({'actual': list(y_test), 'predicted': list(y_pred)})

# Undo the natural-log transform to recover raw like counts
results = np.exp(results)

results

Unnamed: 0,actual,predicted
0,50904.0,49880.275700
1,110968.0,132031.793138
2,379066.0,379071.562850
3,31259.0,32698.710047
4,79593.0,81551.776849
...,...,...
505,32344.0,35015.929269
506,85314.0,77748.820992
507,39482.0,32787.582714
508,52914.0,56161.145865
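On the original scale, a per-row percentage error makes the table easier to judge. The sketch below uses only the first five rows displayed above, so the resulting mean describes those rows and not the whole test set:

```python
import numpy as np

# The five leading rows shown in the table above.
actual = np.array([50904.0, 110968.0, 379066.0, 31259.0, 79593.0])
predicted = np.array([49880.275700, 132031.793138, 379071.562850,
                      32698.710047, 81551.776849])

# Absolute percentage error per row, then the mean over these rows only.
ape = np.abs(predicted - actual) / actual * 100
for a, p, e in zip(actual, predicted, ape):
    print(f"actual={a:>10.0f}  predicted={p:>12.1f}  error={e:5.2f}%")
print(f"mean over shown rows: {ape.mean():.2f}%")
```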


### 3.2 Predicting Views
#### 3.2.1 Preprocessing Data
##### 3.2.1.1 Train-Test Split (80:20)
Splitting the data into train and test sets in an 80:20 ratio.

In [5]:
df = helpers.load_df("Data/Curated_US_Data.csv")

X = df.drop(columns=['views_log'])
y = df['views_log']

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

##### 3.2.1.2 Initializing Preprocessing Pipeline
Scaling numerical data and encoding categorical data.

In [7]:
numeric_features = X.select_dtypes(include=['int64', 'float64']).drop(['durationHr','durationMin','durationSec'],axis=1).columns
categorical_features = list(X.select_dtypes(include=['object']).columns) + ['durationHr','durationMin','durationSec']

preprocessor = ColumnTransformer(
    transformers=[
        ('numerical', StandardScaler(), numeric_features),
        ('categorical', OneHotEncoder(handle_unknown = "ignore"), categorical_features)])

y

0       15.708863
1       15.832615
2       14.220534
3       15.487011
4       13.667111
          ...    
2542    13.256200
2543    12.765811
2544    15.190235
2545    13.919811
2546    14.270649
Name: views_log, Length: 2547, dtype: float64

In [8]:
print('Numeric Features:', numeric_features)
print('Categorical Features:', categorical_features)

Numeric Features: Index(['categoryId', 'likeRatio', 'likes_log', 'dislikes_log', 'comment_log',
       'days_lapse', 'titleLength', 'tagCount'],
      dtype='object')
Categorical Features: ['durationHr', 'durationMin', 'durationSec']


#### 3.2.2 Hyperparameter Tuning (Optuna)
Using Bayesian hyperparameter optimization to find optimal parameters.

In [None]:
def rfObjective(trial):
    rfParams={
        'n_estimators' : trial.suggest_int('n_estimators', 100, 500),
        'max_depth' : trial.suggest_int('max_depth', 1, 50),
        'min_samples_leaf' : trial.suggest_int('min_samples_leaf', 1,15),
        'min_samples_split' : trial.suggest_int('min_samples_split', 2,15)
    }
    
    
    rfPipe = Pipeline(steps=[('preprocessor', preprocessor),
                        ('regressor', RandomForestRegressor(
        **rfParams
    ))])

    return cross_val_score(rfPipe, X, y, n_jobs = -1).mean()

rfStudy = optuna.create_study(direction='maximize')
rfStudy.optimize(rfObjective, n_trials=100)

rfTrial = rfStudy.best_trial

print('Mean CV R^2: {}'.format(rfTrial.value))
print("Best hyperparameters: {}".format(rfTrial.params))

[I 2021-01-20 22:02:47,278] A new study created in memory with name: no-name-1bfbb4b2-d972-43c4-b8d4-2f98b6fca43b
[I 2021-01-20 22:03:06,025] Trial 0 finished with value: 0.8846957206653006 and parameters: {'n_estimators': 215, 'max_depth': 35, 'min_samples_leaf': 3, 'min_samples_split': 11}. Best is trial 0 with value: 0.8846957206653006.
[I 2021-01-20 22:03:35,568] Trial 1 finished with value: 0.8766528799850091 and parameters: {'n_estimators': 421, 'max_depth': 46, 'min_samples_leaf': 4, 'min_samples_split': 15}. Best is trial 0 with value: 0.8846957206653006.
[I 2021-01-20 22:03:45,203] Trial 2 finished with value: 0.8665812146678494 and parameters: {'n_estimators': 166, 'max_depth': 46, 'min_samples_leaf': 9, 'min_samples_split': 14}. Best is trial 0 with value: 0.8846957206653006.
[I 2021-01-20 22:04:13,891] Trial 3 finished with value: 0.8684414263740358 and parameters: {'n_estimators': 464, 'max_depth': 38, 'min_sampl

[I 2021-01-20 22:21:00,526] Trial 46 finished with value: 0.8587832102126622 and parameters: {'n_estimators': 148, 'max_depth': 29, 'min_samples_leaf': 13, 'min_samples_split': 4}. Best is trial 42 with value: 0.9041019878901355.
[I 2021-01-20 22:21:19,206] Trial 47 finished with value: 0.8897047051672404 and parameters: {'n_estimators': 202, 'max_depth': 26, 'min_samples_leaf': 3, 'min_samples_split': 2}. Best is trial 42 with value: 0.9041019878901355.
[I 2021-01-20 22:21:31,245] Trial 48 finished with value: 0.9011891088062278 and parameters: {'n_estimators': 100, 'max_depth': 36, 'min_samples_leaf': 1, 'min_samples_split': 4}. Best is trial 42 with value: 0.9041019878901355.
[I 2021-01-20 22:21:43,287] Trial 49 finished with value: 0.8980750545200582 and parameters: {'n_estimators': 110, 'max_depth': 37, 'min_samples_leaf': 2, 'min_samples_split': 3}. Best is trial 42 with value: 0.9041019878901355.

[I 2021-01-20 22:22:03,143] Trial 50 finished with value: 0.9022219944113576 and parameters: {'n_estimators': 164, 'max_depth': 22, 'min_samples_leaf': 1, 'min_samples_split': 4}. Best is trial 42 with value: 0.9041019878901355.
[I 2021-01-20 22:22:22,517] Trial 51 finished with value: 0.9018403737384375 and parameters: {'n_estimators': 163, 'max_depth': 22, 'min_samples_leaf': 1, 'min_samples_split': 4}. Best is trial 42 with value: 0.9041019878901355.
[I 2021-01-20 22:22:35,142] Trial 52 finished with value: 0.8861664208766573 and parameters: {'n_estimators': 162, 'max_depth': 21, 'min_samples_leaf': 2, 'min_samples_split': 15}. Best is trial 42 with value: 0.9041019878901355.
[I 2021-01-20 22:22:50,655] Trial 53 finished with value: 0.8999803709216565 and parameters: {'n_estimators': 139, 'max_depth': 23, 'min_samples_leaf': 1, 'min_samples_split': 5}. Best is trial 42 with value: 0.9041019878901355.
[I 2021-01-20 22:23:11,554

In [None]:
def xgbObjective(trial):
    xgbParams = {
        'n_estimators' : trial.suggest_int('n_estimators', 100,500),
        'max_depth' : trial.suggest_int('max_depth', 1, 20),
        'eta' : trial.suggest_float('eta', 0.01, 1), # learning_rate
        'subsample': trial.suggest_float('subsample', 0.1, 1),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.1, 1),
        'gamma': trial.suggest_int('gamma', 0, 10), # min_split_loss
        'min_child_weight' : trial.suggest_float('min_child_weight', 0.1, 1.0)
    }

    xgbPipe = Pipeline(steps=[('preprocessor', preprocessor),
                              ('regressor', xgb.XGBRegressor(
        **xgbParams
    ))])
    
    return cross_val_score(xgbPipe, X, y, n_jobs = -1).mean()

xgbStudy = optuna.create_study(direction='maximize')
xgbStudy.optimize(xgbObjective, n_trials=100)

xgbTrial = xgbStudy.best_trial

print('Mean CV R^2: {}'.format(xgbTrial.value))
print("Best hyperparameters: {}".format(xgbTrial.params))

[I 2021-01-20 22:41:32,605] A new study created in memory with name: no-name-549e37ad-0740-42fe-8859-942be90cd518
[I 2021-01-20 22:44:58,869] Trial 0 finished with value: 0.815211718184156 and parameters: {'n_estimators': 348, 'max_depth': 6, 'eta': 0.9640296874102909, 'subsample': 0.5930878446912282, 'colsample_bytree': 0.7742343213712475, 'gamma': 0, 'min_child_weight': 0.9094992488689098}. Best is trial 0 with value: 0.815211718184156.
[I 2021-01-20 22:47:59,931] Trial 1 finished with value: 0.8329284753270452 and parameters: {'n_estimators': 456, 'max_depth': 3, 'eta': 0.035028643890250295, 'subsample': 0.9432725810040122, 'colsample_bytree': 0.6149395992387502, 'gamma': 8, 'min_child_weight': 0.7756050792585426}. Best is trial 1 with value: 0.8329284753270452.
[I 2021-01-20 22:49:19,461] Trial 2 finished with value: 0.8039258279683894 and parameters: {'n_estimators': 112, 'max_depth': 7, 'eta': 0.9210550067361367, 'subsample': 0.5971

#### 3.2.3 Regressors
* ##### 3.2.3.1 Linear Regression
* ##### 3.2.3.2 Random Forest
* ##### 3.2.3.3 XGBoost


In [10]:
regressors = [
        LinearRegression(),
        RandomForestRegressor(**rfTrial.params),
        xgb.XGBRegressor(**xgbTrial.params),
    ]

for regressor in regressors:
    pipe = Pipeline(steps=[('preprocessor', preprocessor),
                           ('regressor', regressor)])
    pipe.fit(X_train, y_train)   
    print(regressor)
    
    # y_pred = pipe.predict(X_test)

    # d1 = {'True Labels': y_test, 'Predicted Labels': y_pred}
    # SK = pd.DataFrame(data = d1)
    # print(SK)

    print("Model Score: %.3f" % pipe.score(X_test, y_test))

    # Predict once and reuse the result for every metric
    y_pred = pipe.predict(X_test)
    mae = metrics.mean_absolute_error(y_test, y_pred)
    mse = metrics.mean_squared_error(y_test, y_pred)
    rmse = np.sqrt(mse)
    print("mae: ", mae)
    print("mse: ", mse)
    print("rmse: ", rmse, "\n")

    
    # lm1 = sns.lmplot(x="True Labels", y="Predicted Labels", data = SK, size = 10)
    # fig1 = lm1.fig 
    # fig1.suptitle("Sklearn ", fontsize=18)
    # sns.set(font_scale = 1.5)


LinearRegression()
Model Score: 0.828
mae:  0.34291823372238145
mse:  0.20370368365536418
rmse:  0.45133544471419945 

RandomForestRegressor(max_depth=33, n_estimators=243)
Model Score: 0.934
mae:  0.19826799233969616
mse:  0.07840980768856694
rmse:  0.2800175131818846 

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=0.5571521622677783,
             eta=0.07075191327247428, gamma=0, gpu_id=-1,
             importance_type='gain', interaction_constraints='',
             learning_rate=0.0707519129, max_delta_step=0, max_depth=17,
             min_child_weight=0.9022274864869919, missing=nan,
             monotone_constraints='()', n_estimators=199, n_jobs=4,
             num_parallel_tree=1, random_state=0, reg_alpha=0, reg_lambda=1,
             scale_pos_weight=1, subsample=0.8706412367118741,
             tree_method='exact', validate_parameters=1, verbosity=None)
Model Score: 0.946
mae:  0.16346300872006603
mse: 

#### 3.2.4 Random Forest Regressor

In [12]:
reg = RandomForestRegressor(**rfTrial.params, oob_score=True)

pipe = Pipeline(steps=[('preprocessor', preprocessor),
              ('regressor', reg)])
pipe.fit(X_train, y_train)   
print(reg)

print("Model Train Score: %.3f" % pipe.score(X_train, y_train))
print("Model OOB Score: %.3f" % reg.oob_score_)
print("Model Test Score: %.3f" % pipe.score(X_test, y_test))

RandomForestRegressor(max_depth=33, n_estimators=243, oob_score=True)
Model Train Score: 0.990
Model OOB Score: 0.926
Model Test Score: 0.933


##### 3.2.4.1 Feature Importance

In [13]:
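# Caution: reg.feature_importances_ follows the preprocessor's output columns
# (scaled numeric columns, then one-hot indicators), not the order of X.columns.
pd.DataFrame(zip(X.columns,reg.feature_importances_),columns=['feature','importance']).sort_values(by='importance',ascending=False)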


Unnamed: 0,feature,importance
3,dislikes_log,0.643037
2,likes_log,0.217882
6,durationHr,0.019182
1,likeRatio,0.016195
4,comment_log,0.016142
7,durationMin,0.015831
0,categoryId,0.015492
5,days_lapse,0.009847
8,durationSec,0.000472
9,titleLength,0.000272
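Impurity-based importances such as the ones above are computed from the training data. Permutation importance is a complementary check: it shuffles one input column at a time on held-out rows and records the drop in R². A minimal sketch on synthetic data (the columns `signal` and `noise` are hypothetical), using `sklearn.inspection.permutation_importance`:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: the target depends on "signal" only; "noise" is irrelevant.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "signal": rng.normal(size=300),
    "noise": rng.normal(size=300),
})
target = 2.0 * demo["signal"] + rng.normal(scale=0.1, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(demo, target, test_size=0.2, random_state=0)
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Shuffle one column at a time on held-out rows and record the drop in R^2.
result = permutation_importance(forest, X_te, y_te, n_repeats=10, random_state=0)

ranking = pd.Series(result.importances_mean, index=X_te.columns).sort_values(ascending=False)
print(ranking)
```

Passing a fitted pipeline in place of `forest` would report importances per original input column, since the shuffling happens before preprocessing.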


#### 3.2.5 Views Evaluation

In [15]:
# This XGBoost pipeline is constructed but not fitted; the metrics below come
# from `pipe`, the Random Forest pipeline defined in 3.2.4.
xgbEval = Pipeline(steps=[('preprocessor', preprocessor),
                          ('regressor', xgb.XGBRegressor(**xgbTrial.params))])
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)

mae = metrics.mean_absolute_error(y_test, y_pred)
mse = metrics.mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = metrics.r2_score(y_test, y_pred)

print("mae: ", mae)
print("mse: ", mse)
print("rmse: ", rmse)
print("r2: ", r2)

mae:  0.20048203007302157
mse:  0.07988965523261009
rmse:  0.2826475813316118
r2:  0.932460527892858


In [16]:
results = pd.DataFrame({'actual': list(y_test), 'predicted': list(y_pred)})

# Undo the natural-log transform to recover raw view counts
results = np.exp(results)

results