In this notebook, we will create and compare several models. Our target feature for this notebook is LOG(VIEWS/SUBSCRIBERS).

We use the log to mitigate the effect of viral videos, which are extreme outliers.
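
To illustrate why the log helps, here is a minimal sketch on made-up ratios (the numbers are hypothetical and not taken from our data set). It measures how far a single "viral" value sits from the median, in units of the median absolute deviation, before and after taking the log.

```python
import numpy as np

# Hypothetical views-per-subscriber ratios; the last one plays the role of a viral video.
ratios = np.array([0.5, 0.8, 1.0, 1.2, 2.0, 500.0])
log_ratios = np.log(ratios)

def outlier_distance(values):
    """Distance of the last value from the median, in median absolute deviations (MADs)."""
    med = np.median(values)
    mad = np.median(np.abs(values - med))
    return abs(values[-1] - med) / mad

raw_dist = outlier_distance(ratios)
log_dist = outlier_distance(log_ratios)
print(f"raw scale: {raw_dist:.1f} MADs, log scale: {log_dist:.1f} MADs")
```

On the log scale the viral value is still unusual, but it no longer dwarfs everything else.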

We consider the following models:
1) A baseline model that always outputs the average value of the target variable
2) A basic linear regression model
3) A linear regression model fit with lasso regression
4) A model that includes all pairwise interaction terms
5) A model that includes all pairwise interaction terms fit with lasso regression.
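
As a quick check on the size of models 4 and 5, the sketch below counts the columns produced by pairwise interaction terms for 13 input features (the length of the `features` list used in this notebook). The placeholder array `X_demo` is hypothetical; only its shape matters.

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

n_features = 13  # number of entries in the `features` list used in this notebook
X_demo = np.zeros((1, n_features))  # placeholder data; only the shape matters here

# degree=2 with interaction_only=True keeps the original columns plus all pairwise products
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
n_columns = poly.fit(X_demo).n_output_features_

# 13 original features + C(13, 2) = 78 pairwise products
print(n_columns, n_features + comb(n_features, 2))
```

Both numbers agree with the 91 coefficients reported for the final model.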

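Almost as expected, the model with the most features (i.e., the model that includes all interaction terms) performed the best. This could simply be because adding features to a linear model never increases its RMSE on the data it was fit to, although here the improvement also appears on the held-out folds. Interestingly, with interaction terms included, our lasso model slightly outperformed our non-lasso model (mean cross-validated RMSE of 1.53319 versus 1.53335); without interaction terms the two were essentially tied.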
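
The point about extra features can be checked on synthetic data. In the sketch below (all data and names are made up for illustration), only the first column carries signal. The training RMSE of ordinary least squares cannot go up as columns are added, whereas the hold-out RMSE carries no such guarantee.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 10))
y = X[:, 0] + rng.normal(size=n)  # only the first column carries signal

X_train, X_hold = X[:150], X[150:]
y_train, y_hold = y[:150], y[150:]

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

train_rmse, hold_rmse = [], []
for k in range(1, 11):
    # Fit on the first k columns only, so each model is nested in the next one
    model = LinearRegression().fit(X_train[:, :k], y_train)
    train_rmse.append(rmse(y_train, model.predict(X_train[:, :k])))
    hold_rmse.append(rmse(y_hold, model.predict(X_hold[:, :k])))

for k, (tr, ho) in enumerate(zip(train_rmse, hold_rmse), start=1):
    print(f"{k:2d} features: train RMSE {tr:.4f}, hold-out RMSE {ho:.4f}")
```

This is why we compare models on held-out data rather than on the data they were fit to.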
   
Please note that we are not really creating a "model" of our data. That is, we do not believe our targets are fully explained by our features or by any features that can be extracted from our data set. This makes sense: the views and engagement of a YouTube Short are primarily explained by the video content, which is not part of our data set, though things like the title and description have some effect. This explains the poor fit. However, we hope to understand something about the data from looking at these models. For example, we found that our best model had an $R^2$ score of .0990 and an explained variance score of .0992 on the hold-out split. These numbers tell us to what extent our targets are explainable with our features.
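
The two scores are close but not identical. As a sketch on hypothetical predictions (not our model's), the code below computes both by hand: $R^2$ penalizes a constant offset in the residuals, while explained variance ignores it, so explained variance is never the smaller of the two.

```python
import numpy as np
from sklearn.metrics import explained_variance_score, r2_score

rng = np.random.default_rng(1)
y_true = rng.normal(size=500)
y_pred = 0.3 * y_true + 0.1  # hypothetical predictions with a constant offset

resid = y_true - y_pred

# R^2 compares squared residuals with squared deviations from the mean of y_true
r2_manual = 1 - np.sum(resid ** 2) / np.sum((y_true - y_true.mean()) ** 2)

# Explained variance compares the variance of the residuals with the variance of y_true
ev_manual = 1 - np.var(resid) / np.var(y_true)

print(f"R^2: {r2_manual:.6f} (sklearn: {r2_score(y_true, y_pred):.6f})")
print(f"Explained variance: {ev_manual:.6f} (sklearn: {explained_variance_score(y_true, y_pred):.6f})")
```

The gap between the two equals the squared mean residual divided by the variance of the targets.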

In [4]:
import pandas as pd
import numpy as np
import math 

df = pd.read_csv('../data/new/no_early_dates_all_features_train.csv')
df.columns

#PLEASE NOTE THAT THE HASHTAGS COLUMN CURRENTLY HAS THE NUMBER OF HASHTAGS USED, AND IS NOT A CATEGORICAL VARIABLE. 
#WHEN CONSIDERING INTERACTION TERMS PLEASE ONLY INCLUDE PAIRWISE INTERACTION TERMS. More interaction terms than this would create extremely small and nonexistent categories which we do not want.

Index(['Unnamed: 0', 'commentsCount', 'isChannelVerified', 'likes',
       'numberOfSubscribers', 'text', 'title', 'viewCount',
       'views_per_subscriber', 'duration_in_seconds', 'date',
       'hashtag_indicator', 'has_any_affiliate', 'hasAdinTitle', 'hasAdinText',
       'Engagement_per_Subscriber', 'Engagement_per_View', 'popular_brand',
       'prime_hour', 'product', 'skills/teach', 'speed', 'comparing_products',
       'self_ref', 'budget', 'korean'],
      dtype='object')

In [5]:
#Don't run this cell more than once

features = ["popular_brand", "has_any_affiliate", "product", "budget", "self_ref", "korean", "speed", 
            "skills/teach", "comparing_products", "prime_hour", "hashtag_indicator", "hasAdinTitle", "hasAdinText"]

#Create the target column $y$ here:
df["y"] = df["views_per_subscriber"].apply( math.log )

#We don't need a lot of the noise columns
df = df[ features + ["y"] ] 

In [6]:
#Import everything

from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline 
from sklearn.linear_model import LinearRegression
from sklearn.metrics import root_mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


In [7]:
#Do an 80-20 Train Test Split Here. Never ever touch the testing set please!
cat_features = [f for f in features if f != "hashtag_indicator"]
#The above is just everything in `features` except "hashtag_indicator"

df_train, df_test = train_test_split(df, shuffle = True, test_size = .2, random_state = 42) #We can't stratify because we have too many categorical features. I hope this is ok
#DO NOT TOUCH THE ABOVE df_test VARIABLE FOR ANY REASON

#We want a very basic idea of the RMSE for each model, before we do proper cross-validation. We use a secondary split for this.
df_tt, df_ho = train_test_split(df_train, shuffle = True, test_size = .2, random_state = 42)


In [8]:
#Create the baseline model here

class BaseMeanModel():
    def __init__(self):
        self.mean_value = None
    
    def fit(self, values : pd.Series):
        self.mean_value = values.mean()

    def predict(self, inputs=None):
        if inputs is None:
            return self.mean_value
        return len(inputs) * [self.mean_value]
    
model = BaseMeanModel()
model.fit(df_tt["y"])

# R2 is negative because training set and the hold out set have different average values
y_pred = model.predict(df_ho[features])
rmse = root_mean_squared_error(df_ho["y"], y_pred)
r2 = r2_score(df_ho["y"], y_pred)
print(f"Root Mean Squared Error: {rmse:.6f}")
print(f"R-squared: {r2:.4f}")

Root Mean Squared Error: 1.606709
R-squared: -0.0002
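
The slightly negative $R^2$ can be made precise. When every prediction is the mean of the fitting set, the $R^2$ on the hold-out set equals minus the squared gap between the two means, divided by the hold-out variance. A sketch on synthetic stand-ins for `df_tt["y"]` and `df_ho["y"]` (the arrays below are made up):

```python
import numpy as np
from sklearn.metrics import r2_score

rng = np.random.default_rng(2)
y_fit = rng.normal(loc=0.0, size=400)   # stands in for df_tt["y"]
y_hold = rng.normal(loc=0.0, size=100)  # stands in for df_ho["y"]

# Baseline: predict the mean of the fitting set for every hold-out row
y_pred = np.full_like(y_hold, y_fit.mean())

r2 = r2_score(y_hold, y_pred)
r2_formula = -((y_hold.mean() - y_fit.mean()) ** 2) / np.var(y_hold)
print(f"r2_score: {r2:.6f}, closed form: {r2_formula:.6f}")
```

So a baseline $R^2$ very close to zero simply says the two splits have nearly the same mean.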


In [9]:
#Create the basic linear regression model here 
model = LinearRegression()
model.fit(df_tt[features], df_tt["y"])
# Evaluate the model
y_pred = model.predict(df_ho[features])
rmse = root_mean_squared_error(df_ho["y"], y_pred)
r2 = r2_score(df_ho["y"], y_pred)
print(f"Root Mean Squared Error (Log Views): {rmse:.6f}")
print(f"R-squared: {r2:.4f}")

Root Mean Squared Error (Log Views): 1.530634
R-squared: 0.0922


In [10]:
#Create the basic linear regression model here with lasso regression. 

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LassoCV, Lasso
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error, explained_variance_score
from sklearn.model_selection import cross_val_score


scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(df_tt[features] )
X_test_scaled = scaler.transform( df_ho[features])


alpha = 0.0001

lasso = Lasso(alpha=alpha, random_state=42, max_iter=10000)


lasso.fit(X_train_scaled, df_tt["y"])

y_pred = lasso.predict(X_test_scaled)

# Calculate metrics
mse = mean_squared_error(df_ho["y"], y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(df_ho["y"] ,y_pred)
r2 = r2_score(df_ho["y"], y_pred)
exp_var = explained_variance_score(df_ho["y"], y_pred)

print(f"\nAlpha: {alpha}")
print(f"Test MSE: {mse:.6f}, RMSE: {rmse:.6f}, MAE: {mae:.6f}")
print(f"R² Score: {r2:.6f}, Explained Variance: {exp_var:.6f}")


Alpha: 0.0001
Test MSE: 2.342785, RMSE: 1.530616, MAE: 1.187705
R² Score: 0.092251, Explained Variance: 0.092339


In [11]:
#Create a model whose features include all interaction terms 
pipe = Pipeline([  ("interaction terms", PolynomialFeatures(degree = 2, interaction_only = True, include_bias = False) ),
                   ("linear model", LinearRegression())
]) 
#setting degree = 2 creates all pairwise interaction terms. 

pipe.fit( df_tt[features], df_tt["y"]) 
pred = pipe.predict( df_ho[features] )

mse = mean_squared_error(df_ho["y"], pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(df_ho["y"], pred)
r2 = r2_score(df_ho["y"], pred)
exp_var = explained_variance_score(df_ho["y"], pred)


print(f"Test MSE: {mse:.6f}, RMSE: {rmse:.6f}, MAE: {mae:.6f}")
print(f"R² Score: {r2:.6f}, Explained Variance: {exp_var:.6f}")

Test MSE: 2.326234, RMSE: 1.525200, MAE: 1.179644
R² Score: 0.098663, Explained Variance: 0.098919


In [12]:
#Create a model with all interaction terms and lasso regression 

from sklearn.linear_model import Lasso, LassoCV
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaled_df_tt = scaler.fit_transform( df_tt[features] )
scaled_df_ho = scaler.transform( df_ho[features] )

# using lasso cv to find the best alpha
# lasso = LassoCV(cv=5, random_state=42, max_iter=10000, alphas=np.logspace(-4, 1, 30))
# lasso.fit(df_tt[features], df_tt["y"])
# pred = lasso.predict(df_ho[features])
# print("Lasso CV RMSE:", root_mean_squared_error(df_ho["y"], pred))
# print("Optimal alpha:", lasso.alpha_)

pipe = Pipeline([
    ("interaction terms", PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)),
    ("lasso", Lasso(alpha=0.0001, max_iter=10000))
])
pipe.fit(scaled_df_tt, df_tt["y"])
pred = pipe.predict(scaled_df_ho)

# Get lasso coefficients
lasso_coeffs = pd.Series(pipe.named_steps['lasso'].coef_, index=pipe.named_steps['interaction terms'].get_feature_names_out(features))
lasso_coeffs = lasso_coeffs[lasso_coeffs != 0]
# print(lasso_coeffs)

mse = mean_squared_error(df_ho["y"], pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(df_ho["y"], pred)
r2 = r2_score(df_ho["y"], pred)
exp_var = explained_variance_score(df_ho["y"], pred)

print(f"Test MSE: {mse:.6f}, RMSE: {rmse:.6f}, MAE: {mae:.6f}")
print(f"R² Score: {r2:.6f}, Explained Variance: {exp_var:.6f}")

Test MSE: 2.325441, RMSE: 1.524940, MAE: 1.179512
R² Score: 0.098971, Explained Variance: 0.099227
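
The commented-out lines in the cell above search for alpha with `LassoCV`. As a minimal sketch of how that search could be placed inside the same kind of pipeline, the code below uses synthetic 0/1 indicators (`X_demo`, `y_demo`, and `cv_pipe` are hypothetical names, not part of our analysis) and the same alpha grid as the commented-out code.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(3)
X_demo = rng.integers(0, 2, size=(300, 4)).astype(float)  # synthetic 0/1 indicators
y_demo = (X_demo[:, 0]
          - 0.5 * X_demo[:, 1] * X_demo[:, 2]
          + rng.normal(scale=0.5, size=300))

alphas = np.logspace(-4, 1, 30)  # same grid as the commented-out LassoCV call
cv_pipe = Pipeline([
    ("scale", StandardScaler()),
    ("interaction terms", PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)),
    ("lasso", LassoCV(cv=5, alphas=alphas, max_iter=10000, random_state=42)),
])
cv_pipe.fit(X_demo, y_demo)

best_alpha = cv_pipe.named_steps["lasso"].alpha_
print("Selected alpha:", best_alpha)
```

Whether a cross-validated alpha would change our results is not something this sketch answers; it only shows the wiring.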


In [13]:
#Do cross-validation to compare all models 

#Model 0: Baseline Average
#Model 1: Basic Linear Regression Model
#Model 2: Linear Regression with Lasso
#Model 3: Linear Regression with interaction terms
#Model 4: Linear Regression with interactions and lasso

from sklearn.model_selection import KFold
num_splits = 5
num_models = 5

kfold = KFold(num_splits,
              random_state = 42,
              shuffle=True)

rmses = np.zeros((num_models, num_splits))

for i, (train_index, test_index) in enumerate(kfold.split(df_train)): 

    df_tt = df_train.iloc[train_index]
    df_ho = df_train.iloc[test_index] 

    #Model 0: Baseline Average
    model = BaseMeanModel()
    model.fit(df_tt["y"])
    y_pred = model.predict(df_ho[features])
    rmses[0,i] = root_mean_squared_error(df_ho["y"], y_pred)

    #Model 1: Basic Linear Regression Model
    model = LinearRegression()
    model.fit(df_tt[features], df_tt["y"])
    y_pred = model.predict(df_ho[features])
    rmses[1,i] = root_mean_squared_error(df_ho["y"], y_pred)

    #Model 2: Linear Regression with Lasso
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(df_tt[features] )
    X_test_scaled = scaler.transform( df_ho[features])
    alpha = 0.0001
    lasso = Lasso(alpha=alpha, random_state=42, max_iter=10000)
    lasso.fit(X_train_scaled, df_tt["y"])
    y_pred = lasso.predict(X_test_scaled)
    rmses[2,i] = root_mean_squared_error(df_ho["y"], y_pred)

    #Model 3: Linear Regression with interaction terms
    pipe = Pipeline([  ("interaction terms", PolynomialFeatures(degree = 2, interaction_only = True, include_bias = False) ),
                   ("linear model", LinearRegression())
                    ]) 
    pipe.fit( df_tt[features], df_tt["y"]) 
    y_pred = pipe.predict( df_ho[features] )
    rmses[3,i] = root_mean_squared_error(df_ho["y"], y_pred)

    #Model 4: Linear Regression with interactions and lasso
    scaler = StandardScaler()
    scaled_df_tt = scaler.fit_transform( df_tt[features] )
    scaled_df_ho = scaler.transform( df_ho[features] )
    pipe = Pipeline([
    ("interaction terms", PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)),
    ("lasso", Lasso(alpha=0.0001, max_iter=10000))
                    ])
    pipe.fit(scaled_df_tt, df_tt["y"])
    y_pred = pipe.predict(scaled_df_ho)
    rmses[4,i] = root_mean_squared_error(df_ho["y"], y_pred)

print(rmses)

[[1.6067091  1.62643628 1.64651932 1.67372172 1.59652834]
 [1.53063439 1.55575122 1.56093855 1.59344457 1.50422675]
 [1.53061577 1.55576445 1.56094054 1.59343502 1.50424112]
 [1.5251996  1.52278023 1.54089224 1.58851722 1.4893488 ]
 [1.52493956 1.52266765 1.54073961 1.58831359 1.48927094]]
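
The loop above refits a scaler by hand inside each fold. As a sketch of an alternative, placing the scaler inside a `Pipeline` and calling `cross_val_score` does the same per-fold refitting automatically. The data and the names `X_demo`, `y_demo`, and `candidates` are synthetic stand-ins.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(4)
X_demo = rng.integers(0, 2, size=(400, 5)).astype(float)  # synthetic 0/1 indicators
y_demo = X_demo[:, 0] + 0.5 * X_demo[:, 1] * X_demo[:, 2] + rng.normal(size=400)

candidates = {
    "linear": LinearRegression(),
    "lasso + interactions": Pipeline([
        ("scale", StandardScaler()),
        ("interaction terms", PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)),
        ("lasso", Lasso(alpha=0.0001, max_iter=10000)),
    ]),
}

kfold = KFold(5, shuffle=True, random_state=42)
cv_rmse = {}
for name, estimator in candidates.items():
    scores = cross_val_score(estimator, X_demo, y_demo, cv=kfold,
                             scoring="neg_root_mean_squared_error")
    cv_rmse[name] = -scores  # scikit-learn negates RMSE so that larger is better

for name, values in cv_rmse.items():
    print(f"{name}: mean RMSE {values.mean():.4f}")
```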


In [14]:
rmses.mean(axis = 1)

array([1.62998295, 1.5489991 , 1.54899938, 1.53334762, 1.53318627])
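
Means alone hide how consistent the differences are. The sketch below copies the fold-wise RMSEs from the printed output above and counts, fold by fold, how often the lasso variant beat its non-lasso counterpart.

```python
import numpy as np

# Fold-wise RMSEs copied from the printed output above (rows: models 0-4, columns: folds).
rmses = np.array([
    [1.6067091, 1.62643628, 1.64651932, 1.67372172, 1.59652834],
    [1.53063439, 1.55575122, 1.56093855, 1.59344457, 1.50422675],
    [1.53061577, 1.55576445, 1.56094054, 1.59343502, 1.50424112],
    [1.5251996, 1.52278023, 1.54089224, 1.58851722, 1.4893488],
    [1.52493956, 1.52266765, 1.54073961, 1.58831359, 1.48927094],
])

# Positive entries mean the lasso variant had the lower RMSE on that fold.
gain_plain = rmses[1] - rmses[2]          # without interaction terms
gain_interactions = rmses[3] - rmses[4]   # with interaction terms

wins_plain = int((gain_plain > 0).sum())
wins_interactions = int((gain_interactions > 0).sum())
print("folds won by lasso (no interactions):  ", wins_plain)
print("folds won by lasso (with interactions):", wins_interactions)
print("mean gain with interactions:", gain_interactions.mean())
print("fold-to-fold std of model 4 RMSE:", rmses[4].std())
```

With interaction terms, lasso wins on all five folds, but by about 0.0002 on average, which is small next to the fold-to-fold spread of roughly 0.03. Without interaction terms it wins on only two of the five folds.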

#Final interpretation. We'll look at coefficients for our best model and compare. 

---Final Interpretation---

Almost as expected, the model with the most features (i.e., the model that includes all interaction terms) performed the best. This could simply be because adding features to a linear model never increases its RMSE on the data it was fit to, although here the improvement also appears on the held-out folds.

Interestingly, with interaction terms our lasso model slightly outperformed our non-lasso model (mean RMSE of 1.53319 versus 1.53335). Without interaction terms the two were essentially tied.

We want to take the coefficients of our highest-performing model and compare them. This is where our analysis was really headed.
Because the input features are standardized before fitting, the terms whose coefficients are largest in absolute value are those that contribute the most to the trend line.
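
One caveat when reading coefficients fit on standardized inputs: each one measures the effect of a one-standard-deviation change, so a rare indicator gets a smaller coefficient than a common one with the same raw effect. The sketch below shows this on synthetic indicators (all names and data are hypothetical). It uses ordinary least squares, where the relation is exact; with a lasso penalty it holds only approximately.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(5)
# Three 0/1 indicators that are "on" about 50%, 20%, and 5% of the time
X_demo = (rng.random((500, 3)) < [0.5, 0.2, 0.05]).astype(float)
# Every indicator has the same raw effect of 1.0 on the target
y_demo = X_demo @ np.array([1.0, 1.0, 1.0]) + rng.normal(size=500)

scaler = StandardScaler().fit(X_demo)
raw_coef = LinearRegression().fit(X_demo, y_demo).coef_
std_coef = LinearRegression().fit(scaler.transform(X_demo), y_demo).coef_

print("raw coefficients:         ", raw_coef)
print("standardized coefficients:", std_coef)
print("standardized / std dev:   ", std_coef / scaler.scale_)
```

Dividing a standardized coefficient by the feature's standard deviation recovers the effect of switching the indicator from 0 to 1.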

In [16]:
#A final fitting to the whole training data set 
scaler = StandardScaler()
scaled_df = scaler.fit_transform( df[features] )
pipe = Pipeline([
    ("interaction terms", PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)),
    ("lasso", Lasso(alpha=0.0001, max_iter=10000))
                    ])
pipe.fit(scaled_df, df["y"])

lasso_coeffs = pd.Series(pipe.named_steps['lasso'].coef_, index=pipe.named_steps['interaction terms'].get_feature_names_out(features))
sig_lasso_coeffs = lasso_coeffs[lasso_coeffs != 0]
print(sig_lasso_coeffs)

popular_brand                     0.057680
has_any_affiliate                -0.149690
product                           0.128541
budget                            0.038506
self_ref                         -0.078659
                                    ...   
prime_hour hasAdinTitle          -0.026373
prime_hour hasAdinText           -0.038124
hashtag_indicator hasAdinTitle   -0.046976
hashtag_indicator hasAdinText    -0.111520
hasAdinTitle hasAdinText         -0.017357
Length: 91, dtype: float64


In [17]:
sig_lasso_coeffs.sort_values(key=abs)

budget korean                      0.000166
comparing_products hasAdinTitle    0.000383
popular_brand hasAdinTitle         0.001020
comparing_products prime_hour      0.001604
popular_brand product              0.002075
                                     ...   
has_any_affiliate                 -0.149690
hasAdinText                       -0.164558
has_any_affiliate speed            0.173918
korean                             0.253364
hashtag_indicator                  0.368265
Length: 91, dtype: float64

In [18]:
new = sig_lasso_coeffs.sort_values(key=abs)
new.tail(10)

comparing_products               0.091376
has_any_affiliate budget         0.102200
hashtag_indicator hasAdinText   -0.111520
skills/teach hasAdinText         0.116552
product                          0.128541
has_any_affiliate               -0.149690
hasAdinText                     -0.164558
has_any_affiliate speed          0.173918
korean                           0.253364
hashtag_indicator                0.368265
dtype: float64