Test implementation of grid search using scikit-learn algorithms before spending time testing it with cuML.

Although scikit-learn has [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) for performing a grid search, it doesn't offer the control I need over trial data output: I want the results of each trial written out as a denormalized row.

XGBoost note: https://stackoverflow.com/questions/63776921/you-are-running-32-bit-python-on-a-64-bit-osmac-and-xgboost-library-could-not


In [4]:
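import sklearn
import sklearn.metrics  # submodules used below as sklearn.metrics / sklearn.model_selection
import sklearn.model_selection
import xgboost
import polars as pl
import os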

In [6]:
data_dir = "/Users/maxwoolf/Downloads"

df = pl.read_parquet(os.path.join(data_dir, "movie_embeds.parquet"))
df

| tconst | startYear | averageRating | json | embeds |
|---|---|---|---|---|
| str | i64 | f64 | str | array[f32, 768] |
| "tt0171517" | 1975 | 6.2 | "{'title': 'Mechtat i zhit', 'g…" | [-0.00642, -0.008363, … -0.095905] |
| "tt0034421" | 1942 | 7.1 | "{'title': 'A 2000 pengös férfi…" | [-0.027937, 0.038252, … -0.067883] |
| "tt0989642" | 2006 | 5.6 | "{'title': 'Midnight Running', …" | [-0.001267, 0.034308, … -0.027295] |
| "tt0273692" | 1980 | 7.4 | "{'title': 'J.S. Brown, o Últim…" | [-0.015184, 0.027515, … -0.077242] |
| "tt4470172" | 2014 | 5.6 | "{'title': 'Guruldayan Kalpler'…" | [-0.000246, 0.017736, … -0.053503] |
| … | … | … | … | … |
| "tt1326956" | 2009 | 8.1 | "{'title': 'Our Summer in Tehra…" | [-0.023254, 0.013627, … -0.036384] |
| "tt0175092" | 1982 | 6.7 | "{'title': "Puss 'n Boots", 'ge…" | [0.017054, 0.044542, … -0.076013] |
| "tt2739566" | 1998 | 7.1 | "{'title': 'Jungle Love Story',…" | [-0.049089, 0.011179, … -0.030443] |
| "tt15485264" | 2021 | 7.0 | "{'title': 'Ring Wandering', 'g…" | [-0.012844, -0.001891, … -0.021231] |


Extract the components as numpy arrays for compatibility.


In [7]:
embeds = df["embeds"].to_numpy()
print(embeds.shape)

release_years = df["startYear"].to_numpy()
ratings = df["averageRating"].to_numpy()

(1600, 768)


Do a train-test split. Normally we would stratify (here, by release year), but sklearn throws an error if any group has fewer than two members.


In [10]:
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    embeds,
    ratings,
    test_size=0.1,
    random_state=42,
    # stratify=release_years
)

X_train.shape

(1440, 768)
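
The stratification problem can be reproduced on toy data. The sketch below is not part of the original run: the arrays are made up, and binning years into decades is only one assumed workaround, not something this notebook does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for the embeddings, ratings, and release years (made-up values).
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.linspace(5.0, 8.0, 10)
years = np.array([1999] * 5 + [2000] * 4 + [2001])  # 2001 appears only once

try:
    train_test_split(X, y, test_size=0.2, random_state=42, stratify=years)
    stratify_error = None
except ValueError as err:
    # A group with a single member cannot be split across train and test.
    stratify_error = str(err)

# Assumed workaround: stratify on a coarser label so every group has >= 2 members.
decades = (years // 10) * 10
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=decades
)
```

Coarser bins trade some balance across years for a split that actually runs, so whether this is worth doing depends on how much the year distribution matters for the evaluation.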

Test a simple OLS.


In [14]:
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
mae = sklearn.metrics.mean_absolute_error(y_test, y_pred)
mae

np.float64(1.664403998851776)

## Start Grid Search

- Create a parameter grid
- Run sequentially since this will eventually be on a single GPU.
- Write the results ([time elapsed](https://stackoverflow.com/a/63232040), train MAE, test MAE) to a dict along with the metadata.
- Load all the data into polars and save as parquet.
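
For the timing step in the list above, here is a minimal sketch of a helper that wraps any call and reports elapsed milliseconds. `timed_call` is a hypothetical name. It uses `time.perf_counter()`, a monotonic high-resolution clock from the standard library, which is an assumption of this sketch rather than what the notebook's own cells use.

```python
import time


def timed_call(fn, *args, **kwargs):
    """Run fn(*args, **kwargs) and return (result, elapsed milliseconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed_ms = (time.perf_counter() - start) * 1000
    return result, elapsed_ms


# Usage with a stand-in workload; a real trial would pass model.fit instead.
total, fit_time_ms = timed_call(sum, range(1_000_000))
```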

Test with a Support Vector Machine since that has tweakable parameters.


In [81]:
from tqdm.auto import tqdm
from sklearn.svm import SVR
import time
import itertools
import numpy as np

In [20]:
np.arange(start=0.8, stop=1.2, step=0.1)

array([0.8, 0.9, 1. , 1.1])

Build the grid and [use numpy shenanigans](https://stackoverflow.com/a/35608701) to get all combinations of parameters.


In [71]:
param_grid = {
    "C": np.arange(start=0.8, stop=1.2, step=0.1),
    "epsilon": np.arange(start=0.8, stop=1.2, step=0.1),
}

In [72]:
list(param_grid.values())

[array([0.8, 0.9, 1. , 1.1]), array([0.8, 0.9, 1. , 1.1])]

In [73]:
param_values = list(param_grid.values())

combos = np.array(np.meshgrid(*param_values)).T.reshape(-1, len(param_values))
combos

array([[0.8, 0.8],
       [0.8, 0.9],
       [0.8, 1. ],
       [0.8, 1.1],
       [0.9, 0.8],
       [0.9, 0.9],
       [0.9, 1. ],
       [0.9, 1.1],
       [1. , 0.8],
       [1. , 0.9],
       [1. , 1. ],
       [1. , 1.1],
       [1.1, 0.8],
       [1.1, 0.9],
       [1.1, 1. ],
       [1.1, 1.1]])

Use [zip shenanigans](https://stackoverflow.com/a/33737067) to map the combinations back to a list of dicts.


In [74]:
param_dicts = [
    dict(zip(list(param_grid.keys()), combos[i].tolist()))
    for i in range(combos.shape[0])
]
param_dicts

[{'C': 0.8, 'epsilon': 0.8},
 {'C': 0.8, 'epsilon': 0.9},
 {'C': 0.8, 'epsilon': 1.0},
 {'C': 0.8, 'epsilon': 1.1},
 {'C': 0.9, 'epsilon': 0.8},
 {'C': 0.9, 'epsilon': 0.9},
 {'C': 0.9, 'epsilon': 1.0},
 {'C': 0.9, 'epsilon': 1.1},
 {'C': 1.0, 'epsilon': 0.8},
 {'C': 1.0, 'epsilon': 0.9},
 {'C': 1.0, 'epsilon': 1.0},
 {'C': 1.0, 'epsilon': 1.1},
 {'C': 1.1, 'epsilon': 0.8},
 {'C': 1.1, 'epsilon': 0.9},
 {'C': 1.1, 'epsilon': 1.0},
 {'C': 1.1, 'epsilon': 1.1}]
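
The same list of dicts can also be built with `itertools.product` from the standard library. This is an alternative sketch rather than the approach taken here, and `product_grid` is a hypothetical name. One side effect is that each value keeps its own type, whereas stacking everything into one NumPy array gives all parameters a common dtype.

```python
import itertools


def product_grid(param_grid):
    """Expand {name: iterable of values} into one dict per combination."""
    keys = list(param_grid)
    value_lists = [list(param_grid[k]) for k in keys]
    # product varies the last key fastest, matching the ordering shown above.
    return [dict(zip(keys, combo)) for combo in itertools.product(*value_lists)]


grid = product_grid({"C": [0.8, 0.9, 1.0, 1.1], "epsilon": [0.8, 0.9, 1.0, 1.1]})
mixed = product_grid({"epsilon": [1.5], "max_iter": [1000]})  # int stays an int
```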

Functionalize.


In [78]:
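def build_grid_dict(param_grid):
    """Expand {param name: array of values} into a list of dicts, one per combination."""
    param_names = list(param_grid.keys())
    param_values = list(param_grid.values())

    # Cartesian product via meshgrid; every value shares one dtype in `combos`
    combos = np.array(np.meshgrid(*param_values)).T.reshape(-1, len(param_values))
    param_dicts = [dict(zip(param_names, row.tolist())) for row in combos]
    return param_dicts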


param_grid = {
    "C": np.arange(start=0.8, stop=1.2, step=0.1),
    "epsilon": np.arange(start=0.8, stop=1.2, step=0.1),
}

build_grid_dict(param_grid)[:5]

[{'C': 0.8, 'epsilon': 0.8},
 {'C': 0.8, 'epsilon': 0.9},
 {'C': 0.8, 'epsilon': 1.0},
 {'C': 0.8, 'epsilon': 1.1},
 {'C': 0.9, 'epsilon': 0.8}]
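
scikit-learn also ships `sklearn.model_selection.ParameterGrid`, which expands a grid into dicts without running any search. Below is a short sketch with a smaller, made-up grid. Note that it iterates parameter names in sorted order, which may differ from the dict's insertion order.

```python
from sklearn.model_selection import ParameterGrid

# Smaller illustrative grid (values chosen only for this example).
small_grid = {"C": [0.8, 0.9], "epsilon": [1.0, 1.1]}
expanded = list(ParameterGrid(small_grid))

for params in expanded:
    print(params)
```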

Do the grid search!


In [86]:
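def grid_search(model_class, param_grid, model_name=None):
    """Fit one model per parameter combination and return a polars DataFrame of results.

    Uses the notebook-level X_train, X_test, y_train, y_test.
    """
    param_dicts = build_grid_dict(param_grid)
    result_dicts = []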

    for params in tqdm(param_dicts):
        model = model_class(**params)  # fresh model instantiation each run

        fit_start = time.time()
        model.fit(X_train, y_train)
        fit_end = time.time()

        y_pred = model.predict(X_train)
        train_mae = sklearn.metrics.mean_absolute_error(y_train, y_pred)

        y_pred = model.predict(X_test)
        test_mae = sklearn.metrics.mean_absolute_error(y_test, y_pred)

        result_dicts.append(
            {
                "model": model_name if model_name else model_class.__name__,
                "params": str(params),
                "fit_time_ms": (fit_end - fit_start) * 1000,
                "train_mae": train_mae,
                "test_mae": test_mae,
            }
        )

    return pl.from_dicts(result_dicts).sort("test_mae")

In [87]:
grid_search(SVR, param_grid, "Support Vector Regression")

| model | params | fit_time_ms | train_mae | test_mae |
|---|---|---|---|---|
| str | str | f64 | f64 | f64 |
| "Support Vector Regression" | "{'C': 0.8, 'epsilon': 1.0}" | 96.53616 | 0.893894 | 1.020952 |
| "Support Vector Regression" | "{'C': 0.9, 'epsilon': 1.0}" | 96.125364 | 0.884697 | 1.021613 |
| "Support Vector Regression" | "{'C': 1.0, 'epsilon': 1.0}" | 96.932173 | 0.875689 | 1.021996 |
| "Support Vector Regression" | "{'C': 1.1, 'epsilon': 1.0}" | 96.145153 | 0.867216 | 1.022721 |
| "Support Vector Regression" | "{'C': 0.8, 'epsilon': 0.9}" | 106.952906 | 0.882155 | 1.023533 |
| … | … | … | … | … |
| "Support Vector Regression" | "{'C': 0.8, 'epsilon': 1.1}" | 89.452982 | 0.905437 | 1.024911 |
| "Support Vector Regression" | "{'C': 0.8, 'epsilon': 0.8}" | 144.553185 | 0.869848 | 1.025 |
| "Support Vector Regression" | "{'C': 0.9, 'epsilon': 0.8}" | 117.764235 | 0.859347 | 1.025492 |
| "Support Vector Regression" | "{'C': 1.0, 'epsilon': 0.8}" | 118.632317 | 0.849815 | 1.025896 |

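Since the goal is one denormalized row per trial, the hyperparameters could also be flattened into their own columns instead of being stored as `str(params)`. Here is a minimal sketch with a hypothetical `flatten_trial` helper. The `param_` prefix is an assumption made here to avoid collisions with metric names, and the metric values are copied from the first SVR row above.

```python
import json


def flatten_trial(model_name, params, metrics):
    """Merge model name, hyperparameters, and metrics into one flat dict."""
    row = {"model": model_name}
    # Prefix hyperparameters so they cannot collide with metric columns.
    row.update({f"param_{k}": v for k, v in params.items()})
    row.update(metrics)
    # Keep a JSON copy too, which json.loads can turn back into a dict.
    row["params_json"] = json.dumps(params, sort_keys=True)
    return row


row = flatten_trial(
    "Support Vector Regression",
    {"C": 0.8, "epsilon": 1.0},
    {"fit_time_ms": 96.53616, "train_mae": 0.893894, "test_mae": 1.020952},
)
```

Rows from models with different hyperparameters would have different `param_*` keys, so a combined table would need to tolerate missing values in those columns.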

Try a few more linear models.


In [89]:
from sklearn.linear_model import Ridge

param_grid = {
    "alpha": np.arange(start=0.5, stop=2.0, step=0.1),
}

grid_search(Ridge, param_grid, "Ridge Regression")

| model | params | fit_time_ms | train_mae | test_mae |
|---|---|---|---|---|
| str | str | f64 | f64 | f64 |
| "Ridge Regression" | "{'alpha': 1.4}" | 4.689932 | 0.907464 | 1.007352 |
| "Ridge Regression" | "{'alpha': 1.4999999999999998}" | 4.497051 | 0.910158 | 1.007356 |
| "Ridge Regression" | "{'alpha': 1.5999999999999996}" | 4.810095 | 0.912735 | 1.007394 |
| "Ridge Regression" | "{'alpha': 1.2999999999999998}" | 4.54402 | 0.904625 | 1.007445 |
| "Ridge Regression" | "{'alpha': 1.6999999999999997}" | 4.598856 | 0.915173 | 1.00759 |
| … | … | … | … | … |
| "Ridge Regression" | "{'alpha': 0.8999999999999999}" | 4.359961 | 0.89137 | 1.009741 |
| "Ridge Regression" | "{'alpha': 0.7999999999999999}" | 4.914999 | 0.887434 | 1.0115 |
| "Ridge Regression" | "{'alpha': 0.7}" | 6.845951 | 0.883294 | 1.014101 |
| "Ridge Regression" | "{'alpha': 0.6}" | 11.749744 | 0.87891 | 1.017183 |

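Values such as `1.4999999999999998` in the `params` column come from floating-point rounding in `np.arange` with a non-integer step. One possible cleanup, not applied in the runs above, is to round the grid to the step's precision before building the combinations.

```python
import numpy as np

raw = np.arange(start=0.5, stop=2.0, step=0.1)
alphas = np.round(raw, 1)  # one decimal place matches the 0.1 step

# Compare a raw value with its rounded counterpart.
print(repr(raw[10]), repr(alphas[10]))
```

Rounding changes the fitted models only negligibly, but it keeps the stringified parameters readable and easier to group on later.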

In [90]:
from sklearn.linear_model import Lasso

param_grid = {
    "alpha": np.arange(start=0.5, stop=2.0, step=0.1),
}

grid_search(Lasso, param_grid, "Lasso Regression")

| model | params | fit_time_ms | train_mae | test_mae |
|---|---|---|---|---|
| str | str | f64 | f64 | f64 |
| "Lasso Regression" | "{'alpha': 0.5}" | 4.85301 | 1.11784 | 1.15688 |
| "Lasso Regression" | "{'alpha': 0.6}" | 4.200935 | 1.11784 | 1.15688 |
| "Lasso Regression" | "{'alpha': 0.7}" | 4.968882 | 1.11784 | 1.15688 |
| "Lasso Regression" | "{'alpha': 0.7999999999999999}" | 3.42989 | 1.11784 | 1.15688 |
| "Lasso Regression" | "{'alpha': 0.8999999999999999}" | 5.286217 | 1.11784 | 1.15688 |
| … | … | … | … | … |
| "Lasso Regression" | "{'alpha': 1.4999999999999998}" | 2.651215 | 1.11784 | 1.15688 |
| "Lasso Regression" | "{'alpha': 1.5999999999999996}" | 2.414942 | 1.11784 | 1.15688 |
| "Lasso Regression" | "{'alpha': 1.6999999999999997}" | 2.472162 | 1.11784 | 1.15688 |
| "Lasso Regression" | "{'alpha': 1.7999999999999998}" | 2.363682 | 1.11784 | 1.15688 |

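Every Lasso run above reports the same train and test MAE, which is what one would see if the L1 penalty had shrunk all coefficients to zero so that the model predicts only its intercept. That reading is an inference from the table, not something the notebook checks. The sketch below shows the effect on synthetic data (made-up features on roughly the scale of the embeddings) and how `coef_` could be inspected.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
# Synthetic stand-ins: small-magnitude features and a weakly related target.
X = rng.normal(scale=0.03, size=(200, 20))
y = 6.0 + 5.0 * X[:, 0] + rng.normal(scale=1.0, size=200)

model = Lasso(alpha=0.5).fit(X, y)
n_nonzero = np.count_nonzero(model.coef_)  # how many features survived the penalty
preds = model.predict(X)

print(n_nonzero, np.ptp(preds))  # ptp = max - min of the predictions
```

If the same check on the real embeddings showed zero surviving coefficients, a grid with much smaller `alpha` values would probably be the more informative one to try.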

In [None]:
from sklearn.linear_model import HuberRegressor

param_grid = {
    "epsilon": np.arange(start=1.0, stop=2.0, step=0.1),
    "alpha": np.array([1e-4, 1e-3, 1e-2, 1e-1]),
    "max_iter": np.array([1000]),  # single value; build_grid_dict returns it as the float 1000.0
}

grid_search(HuberRegressor, param_grid)
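
This cell has no execution count, and one thing to watch for is that the meshgrid-based expansion stacks all parameters into a single float array, so an integer such as `max_iter` comes back as `1000.0`. scikit-learn releases that validate constructor parameter types may reject that. A minimal sketch of the effect and one possible fix follows. `INT_PARAMS` and `cast_int_params` are hypothetical names, and the small grid is made up for illustration.

```python
import numpy as np

param_grid = {
    "epsilon": np.array([1.0, 1.5]),
    "max_iter": np.array([1000]),
}

# Same expansion as used in this notebook: ints are upcast to float64.
values = list(param_grid.values())
combos = np.array(np.meshgrid(*values)).T.reshape(-1, len(values))
row = dict(zip(param_grid.keys(), combos[0].tolist()))

# Hypothetical fix: cast known integer parameters back before instantiation.
INT_PARAMS = {"max_iter"}


def cast_int_params(params, int_params=INT_PARAMS):
    """Return a copy of params with the named entries converted to int."""
    return {k: int(v) if k in int_params else v for k, v in params.items()}


fixed = cast_int_params(row)
print(row, fixed)
```

The cast could be applied inside the search loop just before `model_class(**params)`, at the cost of maintaining the list of integer-valued parameter names by hand.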