# We're Ready for the Runway

As a control, I'm just gonna linearly regress right quick.

In [3]:
import pandas as pd
import numpy as np

import pickle

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.linear_model import Lasso
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split, cross_val_score

from bayes_opt import BayesianOptimization

import warnings
warnings.filterwarnings('ignore')

In [4]:
with open('../data/processed/2 - Games DF - PreProcessed Features', 'rb') as file :
    X = pickle.load(file)

with open('../data/processed/2 - Games DF - PreProcessed Targets', 'rb') as file :
    y_universe = pickle.load(file)

%store -r relevant_langs

We're All Agnostics Now!
---

I'll choose the "agnostic" case for our initial analysis, and calculate basic vanilla lasso scores for all languages under that condition.

In [5]:
# First, let me get a persistent set of indexes for my train/test split

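# random_state pins the split so the indexes survive a kernel restart (42 is an arbitrary seed)
X_train, X_test, y_train, y_test = train_test_split(X, y_universe, test_size=0.2, random_state=42)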

training_indexes = y_train.index
testing_indexes = y_test.index

# Lasso

Yee-haw

In [6]:
naive_scores = {}

for lang in relevant_langs :

    # Create the target variable for this lang
    y = {}
    for index, row in y_universe.iterrows() :
        y[index] = row['comment_diff_agnostic'][lang]
    y = pd.Series(y)

    # Generate our target splits based on the persistent train/test index split
    y_train = y[training_indexes]
    y_test = y[testing_indexes]

    

    # Run the model
    model = Lasso()
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    # Score it and log it
    score = r2_score(y_test, y_pred)
    naive_scores[lang] = score
    print(f"{lang}: {round(score, 5)}")

german: 0.02015
french: 0.00197
spanish: 0.00465
brazilian: 0.00382
russian: 0.0435
italian: 0.02665
schinese: 0.04953
japanese: 0.02072
koreana: 0.01001
polish: 0.00562
english: 0.0148


The results are weak, but every single language scored above zero, which means each model did at least slightly better than just predicting the mean. That might mean our idea is not complete bunk.
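
For reference, here's a tiny sketch of why zero is the bar to clear for an R² score. The toy arrays are made up purely for illustration and have nothing to do with the games data.

```python
import numpy as np
from sklearn.metrics import r2_score

# Toy targets, invented purely for illustration
y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Baseline: always predict the mean of the targets
mean_pred = np.full_like(y_true, y_true.mean())
baseline_r2 = r2_score(y_true, mean_pred)

# A prediction that does worse than the mean baseline scores below zero
bad_pred = y_true[::-1].copy()
bad_r2 = r2_score(y_true, bad_pred)

print("mean baseline:", baseline_r2)
print("reversed targets:", bad_r2)
```

So a score of 0.02 is tiny, but it is on the right side of the "just guess the mean" line.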

Let's see if hyperparameter tuning helps.

In [7]:
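# First, define the function we'll use to tune our alpha.
# Note: it reads the notebook-level X and y, so y must already hold the
# current language's target whenever the optimizer calls it.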
def Lasso_eval(alpha) :
    params = {"alpha":alpha}
    model = Lasso(max_iter=100000, **params)
    score = cross_val_score(model, X, y, cv=5).mean()
    return score
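
One thing worth flagging: `Lasso_eval` cross-validates on the full `X`, which includes the rows that later serve as the test set. Below is a minimal sketch of tuning on the training rows only. Everything in it is an assumption for illustration: the synthetic data, the hypothetical helper `lasso_cv_score`, and a plain grid of alphas swapped in for the Bayesian optimizer to keep it short.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score, train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in data (NOT the games dataset)
X_demo = rng.normal(size=(300, 10))
y_demo = 0.5 * X_demo[:, 0] + rng.normal(size=300)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

def lasso_cv_score(alpha, X_fit, y_fit):
    """Hypothetical helper: mean 5-fold R^2 for one alpha, using only the rows passed in."""
    model = Lasso(alpha=alpha, max_iter=100000)
    return cross_val_score(model, X_fit, y_fit, cv=5).mean()

# Tune on the training rows only, so the test rows never influence alpha
alphas = [0.001, 0.01, 0.1, 1.0]
cv_means = {a: lasso_cv_score(a, X_tr, y_tr) for a in alphas}
best_alpha = max(cv_means, key=cv_means.get)

# Refit on the training rows and score once on the held-out rows
final_model = Lasso(alpha=best_alpha, max_iter=100000).fit(X_tr, y_tr)
test_r2 = final_model.score(X_te, y_te)
print(best_alpha, test_r2)
```

Passing the data in as arguments also avoids relying on whatever `y` happens to be in the notebook's global scope.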


In [10]:
# Now, run the model again, optimizing alpha for each lang

for lang in relevant_langs :

    # Create the target variable for this lang
    y = {}
    for index, row in y_universe.iterrows() :
        y[index] = row['comment_diff_agnostic'][lang]
    y = pd.Series(y)

    # Run some optimization
    param_bounds = {"alpha":(0, 100)}
    optimal = BayesianOptimization(Lasso_eval, param_bounds, verbose=False, allow_duplicate_points=True)
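    # Positional args are (init_points, n_iter): 60 random probes, then 1 guided step
    optimal.maximize(init_points=60, n_iter=1)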
    bayes_alpha = optimal.max['params']['alpha']

    # Generate our target splits based on the persistent train/test index split
    y_train = y[training_indexes]
    y_test = y[testing_indexes]

    # Run the model
    model = Lasso(max_iter=100000, alpha=bayes_alpha)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    # Score it and log it
    score = r2_score(y_test, y_pred)
    rounded_score = round(score, 5)
    # Positive means the tuned model beat the naive one
    improvement = round(score - naive_scores[lang], 6)
    rounded_alpha = round(optimal.max['params']['alpha'], 2)
    print(f"{lang}: {rounded_score} (improvement: {improvement}, alpha: {rounded_alpha})")

KeyboardInterrupt: 

Nope.

I can only assume this is because the variability in the data makes the cv split scores so random that the tuned parameters end up essentially random as well.
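
A quick way to sanity-check that hunch is to look at the per-fold scores instead of only their mean. This is a rough sketch on synthetic data with a deliberately weak signal; the arrays and the alpha value are assumptions for illustration, not the real features or a tuned setting.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Synthetic stand-in: binary features with a very weak signal (NOT the games data)
X_demo = rng.integers(0, 2, size=(500, 20)).astype(float)
y_demo = 0.1 * X_demo[:, 0] + rng.normal(size=500)

# Keep the individual fold scores rather than collapsing them to a mean
fold_scores = cross_val_score(Lasso(alpha=0.01), X_demo, y_demo, cv=5)

print("per-fold R^2:", np.round(fold_scores, 4))
print("mean:", fold_scores.mean(), "std:", fold_scores.std())
```

If the spread across folds is as large as the gap between one alpha and the next, the optimizer is mostly chasing noise.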

So that's one strike AGAINST our idea, though I'd argue that the uniformly weak but positive scores from the naive model indicate that we are onto *something*, no matter how slight.

Perhaps the relationships between the tags can be better captured by other models. Let's try gradient boosting next, and then neural nets.

Huzzah! Onward!

# Let's Get Boosted

Honestly, I feel like our data might be too sparse for this to be super useful, but let's try.