In this notebook, we compare our linear model to a random forest regressor, on both the real data and data simulated from the linear model, to check whether our modelling approach was appropriate.

Our best-performing linear model in cross-validation used lasso regression with all pairwise (two-way) interaction terms. (Note that including interaction terms makes the model non-linear in the original features, although it remains linear in its coefficients, so "linear model" is used loosely here. The main goal is to check whether our best-performing model is appropriate for the data.)
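
To make the size of the interaction expansion concrete, here is a minimal sketch that builds the pairwise products by hand with NumPy. The helper `add_pairwise_interactions` and the random 0/1 matrix `X_toy` are illustrative assumptions, not part of the notebook; the notebook itself uses `PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)` for this step.

```python
from itertools import combinations

import numpy as np


def add_pairwise_interactions(X):
    """Append the product x_i * x_j for every pair i < j to the columns of X."""
    X = np.asarray(X, dtype=float)
    pairs = list(combinations(range(X.shape[1]), 2))
    products = [X[:, i] * X[:, j] for i, j in pairs]
    return np.column_stack([X] + products)


# Hypothetical stand-in for the 13 feature columns (random 0/1 values).
rng = np.random.default_rng(0)
X_toy = rng.integers(0, 2, size=(5, 13))

X_expanded = add_pairwise_interactions(X_toy)
# 13 original columns + C(13, 2) = 78 pairwise products = 91 columns
print(X_expanded.shape)
```

With 13 features the lasso therefore chooses among 91 candidate terms, and the fitted model is still a weighted sum of those columns.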

In [8]:
import pandas as pd

df = pd.read_csv('../data/new/no_early_dates_all_features_train.csv')  # Already the exploration half (first 50%) of the data.

df_explore = df

In [10]:
import numpy as np

features = [
    "popular_brand", "has_any_affiliate", "product", "budget", "self_ref",
    "korean", "speed", "skills/teach", "comparing_products", "prime_hour",
    "hasAdinTitle", "hasAdinText", "hashtag_indicator",
]

# Create the target column y: (likes + comments) per view; the +1 avoids dividing by zero when viewCount is 0.
# df_explore refers to the same DataFrame as df, so df_explore["y"] is defined here too.
df["y"] = (df["likes"] + df["commentsCount"]) / (df["viewCount"] + 1)

# Keep only the model features and the target
df = df[features + ["y"]]

from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

scaler = StandardScaler()
X_explore_scaled = scaler.fit_transform(df_explore[features])

pipe_linear = Pipeline([
    ("interaction_terms", PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)),
    ("lasso", Lasso(alpha=0.0001, max_iter=10000))
])
pipe_linear.fit(X_explore_scaled, df_explore["y"])

linear_pred = pipe_linear.predict(X_explore_scaled)
linear_mse = mean_squared_error(df_explore["y"], linear_pred)


# Random forest fitted on the same scaled features
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_explore_scaled, df_explore["y"])
rf_pred = rf.predict(X_explore_scaled)
rf_mse = mean_squared_error(df_explore["y"], rf_pred)

# Calculate performance ratio on real data
real_ratio = linear_mse / rf_mse

print("Performance on real data:")
print(f"Linear model MSE: {linear_mse:.6f}")
print(f"Random Forest MSE: {rf_mse:.6f}")
print(f"Improvement ratio: {real_ratio:.2f}x (how much better RF is than linear)")


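# Generate simulated data under the null hypothesis:
# the fitted lasso predictions are the true signal, plus Gaussian noise with the residual standard deviation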
residuals = df_explore["y"] - linear_pred
std_residuals = np.std(residuals)
np.random.seed(42)
y_simulated = linear_pred + np.random.normal(0, std_residuals, len(linear_pred))

pipe_linear_sim = Pipeline([
    ("interaction_terms", PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)),
    ("lasso", Lasso(alpha=0.0001, max_iter=10000))
])
pipe_linear_sim.fit(X_explore_scaled, y_simulated)
linear_sim_pred = pipe_linear_sim.predict(X_explore_scaled)
linear_sim_mse = mean_squared_error(y_simulated, linear_sim_pred)

rf_sim = RandomForestRegressor(n_estimators=100, random_state=42)
rf_sim.fit(X_explore_scaled, y_simulated)
rf_sim_pred = rf_sim.predict(X_explore_scaled)
rf_sim_mse = mean_squared_error(y_simulated, rf_sim_pred)

sim_ratio = linear_sim_mse / rf_sim_mse

print("\nPerformance on simulated linear data:")
print(f"Linear model MSE: {linear_sim_mse:.6f}")
print(f"Random Forest MSE: {rf_sim_mse:.6f}")
print(f"Improvement ratio: {sim_ratio:.2f}x (how much better RF is than linear)")

Performance on real data:
Linear model MSE: 0.001018
Random Forest MSE: 0.000847
Improvement ratio: 1.20x (how much better RF is than linear)

Performance on simulated linear data:
Linear model MSE: 0.001019
Random Forest MSE: 0.000881
Improvement ratio: 1.16x (how much better RF is than linear)
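
The MSEs printed above are computed on the same rows the models were fitted on. A held-out version of the same comparison could be sketched with `cross_val_score` as follows. This is only a sketch: `X_demo` and `y_demo` are synthetic stand-ins (the notebook would pass `df_explore[features]` and `df_explore["y"]`), and the choice of 5 shuffled folds is an assumption.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in data: 6 binary features, one main effect, one interaction.
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 2, size=(400, 6)).astype(float)
y_demo = (0.05 + 0.02 * X_demo[:, 0]
          + 0.01 * X_demo[:, 1] * X_demo[:, 2]
          + rng.normal(0, 0.03, 400))

# Scaling sits inside the pipeline so it is refitted on each training fold.
lasso_pipe = Pipeline([
    ("scale", StandardScaler()),
    ("interaction_terms", PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)),
    ("lasso", Lasso(alpha=0.0001, max_iter=10000)),
])
rf_model = RandomForestRegressor(n_estimators=50, random_state=42)

cv = KFold(n_splits=5, shuffle=True, random_state=42)


def cv_mse(model):
    """Mean held-out MSE across the folds (sklearn returns negated MSE)."""
    scores = cross_val_score(model, X_demo, y_demo, cv=cv,
                             scoring="neg_mean_squared_error")
    return -scores.mean()


linear_cv_mse = cv_mse(lasso_pipe)
rf_cv_mse = cv_mse(rf_model)
print(f"Linear model CV MSE: {linear_cv_mse:.6f}")
print(f"Random Forest CV MSE: {rf_cv_mse:.6f}")
print(f"Ratio (linear / RF): {linear_cv_mse / rf_cv_mse:.2f}")
```

A held-out ratio removes the advantage a random forest gets from fitting its own training rows closely, so it complements the in-sample ratios rather than replacing the simulation check.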


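As you can see, on our real data the linear model's MSE is 1.20 times the random forest's, almost the same as the 1.16 ratio on data simulated from the linear model itself. Both ratios are in-sample, so the random forest gains about as much on the real data as it does when the linear model is true by construction.
Thus we conclude that modelling our real data with interaction terms and lasso regression was appropriate.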
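
The comparison above rests on a single simulated data set (seed 42). A minimal sketch of how the simulation could be repeated to see how much the ratio varies under the linear model is given below. The helper names (`make_lasso_pipe`, `mse_ratio`, `null_ratios`), the synthetic `X_demo`/`y_demo`, and the small settings (30 trees, 20 simulations) are assumptions chosen so the example runs quickly, not the notebook's actual configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures


def make_lasso_pipe():
    """Same interaction + lasso pipeline as in the notebook."""
    return Pipeline([
        ("interaction_terms", PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)),
        ("lasso", Lasso(alpha=0.0001, max_iter=10000)),
    ])


def mse_ratio(X, y, rf_seed=42):
    """In-sample linear MSE divided by in-sample random forest MSE."""
    linear = make_lasso_pipe().fit(X, y)
    rf = RandomForestRegressor(n_estimators=30, random_state=rf_seed).fit(X, y)
    return (mean_squared_error(y, linear.predict(X))
            / mean_squared_error(y, rf.predict(X)))


def null_ratios(X, y, n_sims=20, seed=0):
    """Ratios on data simulated as fitted values plus Gaussian residual noise."""
    fitted = make_lasso_pipe().fit(X, y).predict(X)
    sigma = np.std(y - fitted)
    rng = np.random.default_rng(seed)
    return np.array([
        mse_ratio(X, fitted + rng.normal(0, sigma, len(y)))
        for _ in range(n_sims)
    ])


# Synthetic stand-in data; the notebook would use X_explore_scaled and df_explore["y"].
rng = np.random.default_rng(1)
X_demo = rng.integers(0, 2, size=(300, 5)).astype(float)
y_demo = (0.05 + 0.02 * X_demo[:, 0]
          - 0.01 * X_demo[:, 1] * X_demo[:, 2]
          + rng.normal(0, 0.03, 300))

observed = mse_ratio(X_demo, y_demo)
simulated = null_ratios(X_demo, y_demo)
# Share of simulated ratios at least as large as the observed one
share = np.mean(simulated >= observed)

print(f"Observed ratio: {observed:.2f}")
print(f"Simulated ratios: min {simulated.min():.2f}, max {simulated.max():.2f}")
print(f"Share of simulations with ratio >= observed: {share:.2f}")
```

If the observed ratio falls well inside the range of simulated ratios, the real data give no sign of structure that the linear model misses; an observed ratio far above that range would suggest the opposite.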