# We're Ready for the Runway

As a control, I'm just gonna linearly regress right quick.

In [31]:
import pandas as pd
import numpy as np

import pickle

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.linear_model import Lasso
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split, cross_val_score

from bayes_opt import BayesianOptimization

import warnings
warnings.filterwarnings('ignore')

In [32]:
with open('../data/processed/2 - Games DF - PreProcessed Features', 'rb') as file :
    X = pickle.load(file)

with open('../data/processed/2 - Games DF - PreProcessed Targets', 'rb') as file :
    y_universe = pickle.load(file)

%store -r relevant_langs

In [33]:
X = X.drop(columns=['price'])

We're All Agnostics Now!
---

I'll choose the "agnostic" case for our initial analysis, and calculate basic vanilla lasso scores for all languages under that condition.

In [34]:
# First, let me get a persistent set of indexes for my train/test split

X_train, X_test, y_train, y_test = train_test_split(X, y_universe, test_size=0.2)

training_indexes = y_train.index
testing_indexes = y_test.index
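Note that the split above is only persistent within one kernel session, since no `random_state` is passed. A minimal sketch of pinning the split and reusing its indexes across several targets, using made-up data (the toy frames, the helper `split_indexes`, and the seed `42` are assumptions for illustration, not values from this notebook):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-ins for X and y_universe (hypothetical data, not the real frames)
rng = np.random.default_rng(0)
idx = pd.Index(rng.choice(10_000, size=50, replace=False), name="appid")
X_demo = pd.DataFrame(rng.integers(0, 2, size=(50, 3)), index=idx,
                      columns=["2D", "3D", "Action"])
y_demo = pd.DataFrame({"german": rng.normal(size=50),
                       "french": rng.normal(size=50)}, index=idx)

def split_indexes(X, test_size=0.2, seed=42):
    """Return (train_index, test_index); the same seed always gives the same split."""
    train_idx, test_idx = train_test_split(
        X.index.to_numpy(), test_size=test_size, random_state=seed
    )
    return train_idx, test_idx

train_idx, test_idx = split_indexes(X_demo)

# Reuse the same rows for every target column
for lang in y_demo.columns:
    y_train_demo = y_demo.loc[train_idx, lang]
    y_test_demo = y_demo.loc[test_idx, lang]
    print(lang, len(y_train_demo), len(y_test_demo))
```

With a fixed seed, rerunning the notebook compares every language on the same held-out games.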

In [37]:
X.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 572 entries, 553850 to 1962660
Data columns (total 54 columns):
 #   Column                   Non-Null Count  Dtype
---  ------                   --------------  -----
 0   2D                       572 non-null    int64
 1   3D                       572 non-null    int64
 2   Action                   572 non-null    int64
 3   Action-Adventure         572 non-null    int64
 4   Adventure                572 non-null    int64
 5   Anime                    572 non-null    int64
 6   Atmospheric              572 non-null    int64
 7   Building                 572 non-null    int64
 8   Casual                   572 non-null    int64
 9   Character Customization  572 non-null    int64
 10  Choices Matter           572 non-null    int64
 11  Co-op                    572 non-null    int64
 12  Colorful                 572 non-null    int64
 13  Comedy                   572 non-null    int64
 14  Controller               572 non-null    int64
 ...

# Lasso

Yee-haw

In [36]:
naive_scores = {}

for lang in relevant_langs :

    # Create the target variable for this lang
    y = {}
    for index, row in y_universe.iterrows() :
        y[index] = row['comment_diff_agnostic'][lang]
    y = pd.Series(y)

    # Generate our target splits based on the persistent train/test index split
    y_train = y[training_indexes]
    y_test = y[testing_indexes]

    # Run the model
    model = Lasso()
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    # Score it and log it
    score = r2_score(y_test, y_pred)
    naive_scores[lang] = score
    print(f"{lang}: {round(score, 5)}")

german: -0.01494
french: -0.00037
spanish: -0.05411
brazilian: -0.02398
russian: -0.00679
italian: -0.00807
schinese: -0.00217
japanese: -0.00047
koreana: -0.02825
polish: -0.00027
english: -0.00945


The results are weak: every R² is slightly negative, so none of these models beat simply predicting the mean. The scores are all very close to zero, though, so I wouldn't call the idea complete bunk just yet.
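For reading these numbers, it helps to recall that an R² of 0 is what you get from always predicting the mean of the evaluation targets, so a negative score falls below that baseline. A small sketch on synthetic noise (the data, the seed, and `alpha=100` are illustrative assumptions, not this notebook's settings) shows how a Lasso whose penalty zeroes out every coefficient collapses to predicting the training mean and lands at or just below zero on held-out data:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
X_noise = rng.integers(0, 2, size=(200, 10)).astype(float)  # binary, tag-like features
y_noise = rng.normal(size=200)                              # target unrelated to the features

X_tr, X_te = X_noise[:160], X_noise[160:]
y_tr, y_te = y_noise[:160], y_noise[160:]

# A heavy penalty drives every coefficient to exactly zero
model = Lasso(alpha=100.0).fit(X_tr, y_tr)
y_pred = model.predict(X_te)

n_nonzero = int(np.count_nonzero(model.coef_))
score = r2_score(y_te, y_pred)
print(f"non-zero coefficients: {n_nonzero}, test R^2: {score:.5f}")
```

Checking `np.count_nonzero(model.coef_)` on the real models would show whether the tiny negative scores come from an all-zero Lasso; that is a guess worth verifying, not something the output above confirms.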

Let's see if hyperparameter tuning helps.

In [20]:
# First, define the function we'll use to tune our alpha
def Lasso_eval(alpha) :
    params = {"alpha":alpha}
    model = Lasso(max_iter=100000, **params)
    score = cross_val_score(model, X, y, cv=5).mean()
    return score
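`Lasso_eval` reads `X` and `y` from the notebook's global scope, so it scores whatever `y` was assigned most recently. One alternative, sketched here with a hypothetical helper `make_lasso_eval` (not part of the original notebook), is to bind the data explicitly and hand the resulting one-argument function to the optimizer:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

def make_lasso_eval(X, y, cv=5):
    """Build an objective alpha -> mean cross-validated R^2 for the given data."""
    def lasso_eval(alpha):
        model = Lasso(alpha=alpha, max_iter=100_000)
        return cross_val_score(model, X, y, cv=cv).mean()
    return lasso_eval

# Toy data with a real linear signal (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 5))
true_coefs = np.array([2.0, 0.0, -1.0, 0.0, 0.5])
y_demo = X_demo @ true_coefs + rng.normal(scale=0.1, size=100)

objective = make_lasso_eval(X_demo, y_demo)
low_penalty = objective(0.01)   # keeps the signal
high_penalty = objective(10.0)  # shrinks every coefficient to zero
print(low_penalty, high_penalty)
```

Building the objective from the training rows only (for example `X_train` and `y[training_indexes]`) would also keep the test games out of the tuning loop.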


In [21]:
# Now, run the model again, optimizing alpha for each lang

for lang in relevant_langs :

    # Create the target variable for this lang
    y = {}
    for index, row in y_universe.iterrows() :
        y[index] = row['comment_diff_agnostic'][lang]
    y = pd.Series(y)

    # Run some optimization
    param_bounds = {"alpha":(0.01, 100)}
    optimal = BayesianOptimization(Lasso_eval, param_bounds, verbose=False, allow_duplicate_points=True)
    optimal.maximize(init_points=20, n_iter=20)
    bayes_alpha = optimal.max['params']['alpha']

    # Generate our target splits based on the persistent train/test index split
    y_train = y[training_indexes]
    y_test = y[testing_indexes]

    # Run the model
    model = Lasso(max_iter=100000, alpha=bayes_alpha)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    # Score it and log it
    score = r2_score(y_test, y_pred)
    rounded_score = round(score, 5)
    improvement = round(score - naive_scores[lang], 6)
    rounded_alpha = round(optimal.max['params']['alpha'], 2)
    print(f"{lang}: {rounded_score} (improvement: {improvement}, alpha: {rounded_alpha})")

german: -0.0023 (improvement: 0.0, alpha: 86.45)
french: -0.00027 (improvement: 0.0, alpha: 17.04)
spanish: -0.00481 (improvement: 0.0, alpha: 84.49)
brazilian: -0.00135 (improvement: 0.0, alpha: 67.34)
russian: -0.00141 (improvement: 0.001047, alpha: 0.01)
italian: -0.00065 (improvement: 0.0, alpha: 41.68)
schinese: 0.05513 (improvement: 0.055235, alpha: 0.01)
japanese: -0.0 (improvement: 0.0, alpha: 37.53)
koreana: -0.01662 (improvement: 0.0, alpha: 82.02)
polish: -0.00023 (improvement: 0.0, alpha: 29.98)
english: -0.00132 (improvement: 0.0, alpha: 4.28)


Nope.

I can only assume this is because the variability in the data makes the cv split scores so random that the tuned parameters end up essentially random as well.
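One way to probe that hunch is to score a single fixed `alpha` under several differently shuffled 5-fold splits and look at the spread. If the spread is as large as the differences between alphas, the optimizer is mostly chasing noise. A rough sketch on synthetic data (the helper name `cv_spread`, the seeds, and the toy data are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold, cross_val_score

def cv_spread(X, y, alpha, seeds=range(10)):
    """Mean CV R^2 for one alpha under differently shuffled 5-fold splits."""
    means = []
    for seed in seeds:
        cv = KFold(n_splits=5, shuffle=True, random_state=seed)
        model = Lasso(alpha=alpha, max_iter=100_000)
        means.append(cross_val_score(model, X, y, cv=cv).mean())
    return np.array(means)

rng = np.random.default_rng(0)
X_demo = rng.integers(0, 2, size=(120, 8)).astype(float)  # binary, tag-like features
y_demo = rng.normal(size=120)                             # pure-noise target

scores = cv_spread(X_demo, y_demo, alpha=0.01)
print(f"mean={scores.mean():.4f}, std={scores.std():.4f}")
```

Note that `cross_val_score(..., cv=5)` on its own uses unshuffled folds, so the tuning loop above sees the same split for every alpha; the shuffling here is only a diagnostic.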

So that's one strike AGAINST our idea. The only clearly positive score is schinese at about 0.055; everything else sits at or just below zero, so if we are onto *something*, it is slight.

Perhaps the relationships between the tags can be better captured by other models. Let's try gradient boosting next, and then neural nets.

Huzzah! Onward!

# Let's Get Boosted

Honestly I feel like we might be too sparse for this to be super useful, but let's try.