Test implementation of grid search using scikit-learn algorithms before spending time testing it with cuML.

Although scikit-learn has [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) for performing a grid search across various algorithms, it doesn't offer the control I need over trial output: I want the results of each trial written out as a single denormalized row.

XGBoost note: https://stackoverflow.com/questions/63776921/you-are-running-32-bit-python-on-a-64-bit-osmac-and-xgboost-library-could-not


In [1]:
import os

# import xgboost
import polars as pl
import sklearn
from sklearn.model_selection import train_test_split

In [2]:
data_dir = "/Users/maxwoolf/Downloads"

df = pl.read_parquet(os.path.join(data_dir, "movie_embeds.parquet"))
df = df.with_columns(decade=pl.col("startYear").cut(list(range(1900, 2030, 10))))
df

tconst,startYear,averageRating,json,embeds,decade
str,i64,f64,str,"array[f32, 768]",cat
"""tt21937348""",2022,6.8,"""{""title"":""Le business du bonhe…","[-0.012958, 0.03246, … -0.037463]","""(2020, inf]"""
"""tt0425976""",2003,5.0,"""{""title"":""Excuses!"",""genres"":[…","[-0.009819, -0.025217, … -0.069853]","""(2000, 2010]"""
"""tt1581629""",2009,6.8,"""{""title"":""The Secret"",""genres""…","[-0.013838, 0.008304, … -0.051282]","""(2000, 2010]"""
"""tt1707240""",2010,6.2,"""{""title"":""Lys"",""genres"":[""Dram…","[0.022719, 0.042718, … -0.061914]","""(2000, 2010]"""
"""tt32378615""",2024,7.7,"""{""title"":""We Should Make Movie…","[0.041895, 0.016391, … -0.022129]","""(2020, inf]"""
…,…,…,…,…,…
"""tt0463960""",2013,3.3,"""{""title"":""The Devil You Know"",…","[0.01918, 0.006489, … -0.03445]","""(2010, 2020]"""
"""tt5865148""",2016,6.1,"""{""title"":""Brett Gelman's Dinne…","[0.015193, 0.030122, … -0.061412]","""(2010, 2020]"""
"""tt0185883""",1949,6.7,"""{""title"":""Aoi sanmyaku"",""genre…","[-0.020236, 0.034688, … -0.058068]","""(1940, 1950]"""
"""tt27436518""",2022,4.8,"""{""title"":""The Legacy"",""genres""…","[-0.019275, 0.007522, … -0.029518]","""(2020, inf]"""


Extract the components as numpy arrays for compatibility.


In [3]:
embeds = df["embeds"].to_numpy()
print(embeds.shape)

release_years = df["startYear"].to_numpy()
ratings = df["averageRating"].to_numpy()
decade = df["decade"].to_numpy()

(1600, 768)


Do a train-test split, stratified by decade. Note that sklearn will throw an error if any stratification group is too small.


In [4]:
X_train, X_test, y_train, y_test = train_test_split(
    embeds, ratings, test_size=0.1, random_state=42, stratify=decade
)

X_train.shape

(1440, 768)
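
The stratification error mentioned above can be reproduced on toy data. This is a minimal sketch, not the notebook's data: the arrays and the "fold rare groups into `other`" workaround are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 20 samples, two of them in singleton groups "c" and "d".
X = np.arange(40).reshape(20, 2)
y = np.arange(20, dtype=float)
groups = np.array(["a"] * 10 + ["b"] * 8 + ["c", "d"])

try:
    train_test_split(X, y, test_size=0.2, random_state=42, stratify=groups)
    stratify_failed = False
except ValueError:
    # sklearn refuses to stratify when a group has fewer than 2 members
    stratify_failed = True

# Hypothetical workaround: fold rare groups into one shared "other" label.
labels, counts = np.unique(groups, return_counts=True)
rare = labels[counts < 2]
merged = np.where(np.isin(groups, rare), "other", groups)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=merged
)
print(stratify_failed, X_tr.shape, X_te.shape)
```

Merging only helps if the merged group itself ends up with at least two members, so it is worth checking the group counts before splitting.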

Test a simple OLS.


In [5]:
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
mae = sklearn.metrics.mean_absolute_error(y_test, y_pred)
mae

np.float64(1.3033771419525144)

## Start Grid Search

- Create a parameter grid
- Run sequentially since this will eventually be on a single GPU.
- Write the results ([time elapsed](https://stackoverflow.com/a/63232040), train MAE, test MAE) to a dict along with the metadata.
- Load all the data into polars and save as parquet.

Test with a Support Vector Machine since that has tweakable parameters.


In [6]:
import itertools
import time

import numpy as np
from sklearn.svm import SVR
from tqdm.auto import tqdm



Build the grid to get all combinations of parameters. Normally numpy shenanigans would be used here for speed, but that causes data type issues: several parameters must be `int`s, and numpy will coerce them to `float`s when they share an array with float parameters.
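
A small sketch of the data type issue, using made-up parameter ranges. The `np.meshgrid` stacking is one common numpy idiom for building combinations; it is an assumption about which shenanigans are meant here.

```python
import itertools

import numpy as np

n_neighbors = np.arange(start=2, stop=4)  # integer parameter
alpha = np.linspace(start=0.5, stop=1.0, num=2)  # float parameter

# Stacking both parameters into one array promotes everything to float64.
stacked = np.array(np.meshgrid(n_neighbors, alpha)).T.reshape(-1, 2)
print(stacked.dtype, stacked[0])

# Converting each parameter to a Python list first keeps ints as ints.
combos = list(itertools.product(n_neighbors.tolist(), alpha.tolist()))
print(combos[0], type(combos[0][0]).__name__)
```

This is why the cells below call `.tolist()` per parameter and combine with `itertools.product` instead of building one big array.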


In [7]:
param_grid = {
    "C": np.linspace(start=0.8, stop=1.2, num=5),
    "epsilon": np.linspace(start=0.8, stop=1.2, num=5),
}

In [8]:
[x.tolist() for x in param_grid.values()]

[[0.8, 0.9, 1.0, 1.1, 1.2], [0.8, 0.9, 1.0, 1.1, 1.2]]

In [9]:
param_values = [x.tolist() for x in param_grid.values()]

combos = list(itertools.product(*param_values))
combos

[(0.8, 0.8),
 (0.8, 0.9),
 (0.8, 1.0),
 (0.8, 1.1),
 (0.8, 1.2),
 (0.9, 0.8),
 (0.9, 0.9),
 (0.9, 1.0),
 (0.9, 1.1),
 (0.9, 1.2),
 (1.0, 0.8),
 (1.0, 0.9),
 (1.0, 1.0),
 (1.0, 1.1),
 (1.0, 1.2),
 (1.1, 0.8),
 (1.1, 0.9),
 (1.1, 1.0),
 (1.1, 1.1),
 (1.1, 1.2),
 (1.2, 0.8),
 (1.2, 0.9),
 (1.2, 1.0),
 (1.2, 1.1),
 (1.2, 1.2)]

[zip shenanigans](https://stackoverflow.com/a/33737067) to map back to a list of dicts.


In [10]:
param_dicts = [dict(zip(param_grid.keys(), combo)) for combo in combos]
param_dicts

[{'C': 0.8, 'epsilon': 0.8},
 {'C': 0.8, 'epsilon': 0.9},
 {'C': 0.8, 'epsilon': 1.0},
 {'C': 0.8, 'epsilon': 1.1},
 {'C': 0.8, 'epsilon': 1.2},
 {'C': 0.9, 'epsilon': 0.8},
 {'C': 0.9, 'epsilon': 0.9},
 {'C': 0.9, 'epsilon': 1.0},
 {'C': 0.9, 'epsilon': 1.1},
 {'C': 0.9, 'epsilon': 1.2},
 {'C': 1.0, 'epsilon': 0.8},
 {'C': 1.0, 'epsilon': 0.9},
 {'C': 1.0, 'epsilon': 1.0},
 {'C': 1.0, 'epsilon': 1.1},
 {'C': 1.0, 'epsilon': 1.2},
 {'C': 1.1, 'epsilon': 0.8},
 {'C': 1.1, 'epsilon': 0.9},
 {'C': 1.1, 'epsilon': 1.0},
 {'C': 1.1, 'epsilon': 1.1},
 {'C': 1.1, 'epsilon': 1.2},
 {'C': 1.2, 'epsilon': 0.8},
 {'C': 1.2, 'epsilon': 0.9},
 {'C': 1.2, 'epsilon': 1.0},
 {'C': 1.2, 'epsilon': 1.1},
 {'C': 1.2, 'epsilon': 1.2}]

Functionalize.


In [11]:
def build_grid_dict(param_grid):
    # .tolist() converts numpy scalars to native Python ints/floats/strs
    param_values = [x.tolist() for x in param_grid.values()]

    # one dict per combination, keyed by parameter name
    return [
        dict(zip(param_grid.keys(), combo))
        for combo in itertools.product(*param_values)
    ]


param_grid = {
    "C": np.linspace(start=0.8, stop=1.2, num=5),
    "epsilon": np.linspace(start=0.8, stop=1.2, num=5),
}

build_grid_dict(param_grid)[:5]

[{'C': 0.8, 'epsilon': 0.8},
 {'C': 0.8, 'epsilon': 0.9},
 {'C': 0.8, 'epsilon': 1.0},
 {'C': 0.8, 'epsilon': 1.1},
 {'C': 0.8, 'epsilon': 1.2}]
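
For comparison, scikit-learn ships `ParameterGrid`, which expands a grid into a list of dicts in much the same way. A minimal sketch, assuming the values are passed as plain lists; the smaller 3x3 grid is just to keep the output short.

```python
import itertools

import numpy as np
from sklearn.model_selection import ParameterGrid

param_grid = {
    "C": np.linspace(start=0.8, stop=1.2, num=3),
    "epsilon": np.linspace(start=0.8, stop=1.2, num=3),
}

# Same approach as build_grid_dict, written inline.
value_lists = [v.tolist() for v in param_grid.values()]
manual = [dict(zip(param_grid, combo)) for combo in itertools.product(*value_lists)]

# ParameterGrid iterates parameter names in sorted order.
sk_grid = list(ParameterGrid({k: v.tolist() for k, v in param_grid.items()}))

print(len(manual), manual == sk_grid)
```

The two lists line up here because the keys are already in alphabetical order; with unsorted keys the combinations would be the same but could come out in a different order.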

Do the grid search!


In [12]:
# to guard against floating point precision issues when printing params
def round_param_floats(params):
    return {k: round(v, 4) if isinstance(v, float) else v for k, v in params.items()}


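def grid_search(model_class, param_grid, model_name=None):
    # Relies on the notebook-level X_train, X_test, y_train, y_test variables.
    param_dicts = build_grid_dict(param_grid)
    result_dicts = []

    for params in tqdm(param_dicts):
        model = model_class(**params)  # fresh model instantiation each run

        # perf_counter is a monotonic clock, better suited to timing than time.time
        fit_start = time.perf_counter()
        model.fit(X_train, y_train)
        fit_end = time.perf_counter()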

        y_pred = model.predict(X_train)
        train_mae = sklearn.metrics.mean_absolute_error(y_train, y_pred)

        y_pred = model.predict(X_test)
        test_mae = sklearn.metrics.mean_absolute_error(y_test, y_pred)

        result_dicts.append(
            {
                "model": model_name if model_name else model_class.__name__,
                "params": str(round_param_floats(params)),
                "fit_time_ms": (fit_end - fit_start) * 1000,
                "train_mae": train_mae,
                "test_mae": test_mae,
            }
        )

    return pl.from_dicts(result_dicts).sort("test_mae")

In [13]:
param_grid = {
    "C": np.linspace(start=0.8, stop=1.2, num=5),
    "epsilon": np.linspace(start=0.8, stop=1.2, num=5),
}

grid_search(SVR, param_grid)

100%|██████████| 25/25 [00:07<00:00,  3.44it/s]


model,params,fit_time_ms,train_mae,test_mae
str,str,f64,f64,f64
"""SVR""","""{'C': 0.8, 'epsilon': 0.9}""",102.361202,0.83669,1.009752
"""SVR""","""{'C': 0.9, 'epsilon': 0.9}""",102.497101,0.829269,1.012797
"""SVR""","""{'C': 0.8, 'epsilon': 1.0}""",88.521004,0.848227,1.013904
"""SVR""","""{'C': 0.8, 'epsilon': 0.8}""",146.610022,0.824501,1.014047
"""SVR""","""{'C': 0.9, 'epsilon': 0.8}""",114.247084,0.816272,1.015031
…,…,…,…,…
"""SVR""","""{'C': 1.0, 'epsilon': 1.2}""",74.265957,0.85727,1.026985
"""SVR""","""{'C': 1.2, 'epsilon': 1.0}""",90.191841,0.824332,1.027501
"""SVR""","""{'C': 1.2, 'epsilon': 1.1}""",79.972982,0.837239,1.029315
"""SVR""","""{'C': 1.1, 'epsilon': 1.2}""",69.023848,0.852295,1.02977


Try a few more models, starting with linear ones.


In [14]:
from sklearn.linear_model import Ridge

param_grid = {
    "alpha": np.linspace(start=0.5, stop=2.0, num=20),
}

grid_search(Ridge, param_grid)

100%|██████████| 20/20 [00:00<00:00, 135.19it/s]


model,params,fit_time_ms,train_mae,test_mae
str,str,f64,f64,f64
"""Ridge""","""{'alpha': 2.0}""",4.74906,0.865861,1.00979
"""Ridge""","""{'alpha': 1.9211}""",5.037069,0.864532,1.010208
"""Ridge""","""{'alpha': 1.8421}""",5.296946,0.86314,1.010651
"""Ridge""","""{'alpha': 1.7632}""",4.601717,0.861682,1.011122
"""Ridge""","""{'alpha': 1.6842}""",4.96006,0.860174,1.011621
…,…,…,…,…
"""Ridge""","""{'alpha': 0.8158}""",6.44803,0.838698,1.020519
"""Ridge""","""{'alpha': 0.7368}""",8.121014,0.835872,1.021753
"""Ridge""","""{'alpha': 0.6579}""",8.197069,0.832759,1.023096
"""Ridge""","""{'alpha': 0.5789}""",8.401155,0.829309,1.024607


In [15]:
from sklearn.linear_model import BayesianRidge

param_grid = {
    "alpha_1": np.array([1e-6, 1e-5, 1e-4]),
    "alpha_2": np.array([1e-6, 1e-5, 1e-4]),
    "lambda_1": np.array([1e-6, 1e-5, 1e-4]),
    "lambda_2": np.array([1e-6, 1e-5, 1e-4]),
}

grid_search(BayesianRidge, param_grid)

100%|██████████| 81/81 [00:04<00:00, 17.15it/s]


model,params,fit_time_ms,train_mae,test_mae
str,str,f64,f64,f64
"""BayesianRidge""","""{'alpha_1': 0.0001, 'alpha_2':…",56.353092,0.864185,1.010318
"""BayesianRidge""","""{'alpha_1': 0.0001, 'alpha_2':…",55.272341,0.864185,1.010318
"""BayesianRidge""","""{'alpha_1': 0.0, 'alpha_2': 0.…",57.148218,0.864185,1.010318
"""BayesianRidge""","""{'alpha_1': 0.0, 'alpha_2': 0.…",55.795908,0.864185,1.010318
"""BayesianRidge""","""{'alpha_1': 0.0001, 'alpha_2':…",57.08909,0.864185,1.010318
…,…,…,…,…
"""BayesianRidge""","""{'alpha_1': 0.0, 'alpha_2': 0.…",55.951118,0.864185,1.010318
"""BayesianRidge""","""{'alpha_1': 0.0, 'alpha_2': 0.…",58.246851,0.864185,1.010318
"""BayesianRidge""","""{'alpha_1': 0.0, 'alpha_2': 0.…",56.670189,0.864185,1.010318
"""BayesianRidge""","""{'alpha_1': 0.0001, 'alpha_2':…",55.45783,0.864185,1.010318

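One side effect is visible in the table above: `round_param_floats` rounds to 4 decimal places, so `1e-6` and `1e-5` both print as `0.0` and those rows can't be told apart by their `params` string. A minimal sketch of an alternative that rounds to significant digits instead; `round_param_sig` is a hypothetical helper, not part of the notebook.

```python
def round_param_floats(params):
    # Same helper as in the notebook: fixed 4 decimal places.
    return {k: round(v, 4) if isinstance(v, float) else v for k, v in params.items()}


def round_param_sig(params, sig=4):
    # Hypothetical alternative: keep `sig` significant digits instead.
    return {
        k: float(f"{v:.{sig}g}") if isinstance(v, float) else v
        for k, v in params.items()
    }


params = {"alpha_1": 1e-06, "alpha": 1.9210526315789473, "degree": 3}
print(round_param_floats(params))
print(round_param_sig(params))
```

Significant-digit rounding still hides floating point noise but keeps very small values distinguishable.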

In [16]:
from sklearn.kernel_ridge import KernelRidge

param_grid = {
    "alpha": np.linspace(start=0.5, stop=2.0, num=20),
    "degree": np.arange(start=2, stop=7),
}

grid_search(KernelRidge, param_grid)

100%|██████████| 100/100 [00:02<00:00, 38.67it/s]


model,params,fit_time_ms,train_mae,test_mae
str,str,f64,f64,f64
"""KernelRidge""","""{'alpha': 1.8421, 'degree': 2}""",17.78698,0.885085,0.995937
"""KernelRidge""","""{'alpha': 1.8421, 'degree': 3}""",18.435955,0.885085,0.995937
"""KernelRidge""","""{'alpha': 1.8421, 'degree': 4}""",18.251657,0.885085,0.995937
"""KernelRidge""","""{'alpha': 1.8421, 'degree': 5}""",18.011332,0.885085,0.995937
"""KernelRidge""","""{'alpha': 1.8421, 'degree': 6}""",18.29195,0.885085,0.995937
…,…,…,…,…
"""KernelRidge""","""{'alpha': 0.5, 'degree': 2}""",95.431805,0.843073,1.012571
"""KernelRidge""","""{'alpha': 0.5, 'degree': 3}""",23.62895,0.843073,1.012571
"""KernelRidge""","""{'alpha': 0.5, 'degree': 4}""",21.869898,0.843073,1.012571
"""KernelRidge""","""{'alpha': 0.5, 'degree': 5}""",19.290209,0.843073,1.012571


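The rows above are identical across `degree` values. In scikit-learn, `KernelRidge` defaults to `kernel="linear"`, and `degree` only applies to the polynomial kernel, so it is ignored here. A small sketch on synthetic data (not the movie embeddings) that shows the difference:

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = X[:, 0] ** 2 + rng.normal(scale=0.1, size=50)


def preds(**kwargs):
    # Fit on the toy data and return in-sample predictions.
    return KernelRidge(alpha=1.0, **kwargs).fit(X, y).predict(X)


# Default linear kernel: degree has no effect.
same_linear = np.allclose(preds(degree=2), preds(degree=5))

# Polynomial kernel: degree changes the fit.
same_poly = np.allclose(
    preds(kernel="poly", degree=2), preds(kernel="poly", degree=5)
)
print(same_linear, same_poly)
```

To actually search over `degree`, the grid would need `kernel="poly"` fixed for every trial; `grid_search` as written has no slot for fixed parameters, so that would take a small extension.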
In [17]:
from sklearn.linear_model import ElasticNet

param_grid = {
    "alpha": np.linspace(start=0.5, stop=1.5, num=10),
    "l1_ratio": np.linspace(start=0.1, stop=0.9, num=10),
}
grid_search(ElasticNet, param_grid)

100%|██████████| 100/100 [00:00<00:00, 319.58it/s]


model,params,fit_time_ms,train_mae,test_mae
str,str,f64,f64,f64
"""ElasticNet""","""{'alpha': 0.5, 'l1_ratio': 0.1…",7.076263,1.028561,1.106313
"""ElasticNet""","""{'alpha': 0.5, 'l1_ratio': 0.1…",5.210876,1.028561,1.106313
"""ElasticNet""","""{'alpha': 0.5, 'l1_ratio': 0.2…",5.013943,1.028561,1.106313
"""ElasticNet""","""{'alpha': 0.5, 'l1_ratio': 0.3…",4.571915,1.028561,1.106313
"""ElasticNet""","""{'alpha': 0.5, 'l1_ratio': 0.4…",6.360054,1.028561,1.106313
…,…,…,…,…
"""ElasticNet""","""{'alpha': 1.5, 'l1_ratio': 0.5…",2.110243,1.028561,1.106313
"""ElasticNet""","""{'alpha': 1.5, 'l1_ratio': 0.6…",2.196074,1.028561,1.106313
"""ElasticNet""","""{'alpha': 1.5, 'l1_ratio': 0.7…",2.12121,1.028561,1.106313
"""ElasticNet""","""{'alpha': 1.5, 'l1_ratio': 0.8…",2.851248,1.028561,1.106313


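Every ElasticNet row above has the same MAE. One plausible explanation, which is an assumption rather than something checked in this notebook, is that the penalty is strong enough to shrink every coefficient to zero, leaving a model that predicts the training mean. The sketch below reproduces that behavior on synthetic features with a similar small magnitude to the embedding values.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)

# Synthetic stand-in: small-magnitude features, ratings centered around 6.
X = rng.normal(scale=0.03, size=(200, 16))
y = 6.0 + 10.0 * X[:, 0] + rng.normal(scale=1.0, size=200)

strong = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X, y)
weak = ElasticNet(alpha=1e-4, l1_ratio=0.5).fit(X, y)

# Count surviving coefficients under each penalty strength.
print(np.count_nonzero(strong.coef_), np.count_nonzero(weak.coef_))

# With all coefficients at zero, predictions collapse to the training mean.
print(np.allclose(strong.predict(X), y.mean()))
```

If that is what happened here, a grid over much smaller `alpha` values would be the next thing to try, and inspecting `model.coef_` after a fit would confirm it.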
In [18]:
from sklearn.linear_model import SGDRegressor

param_grid = {
    "alpha": np.array([1e-4, 1e-3, 1e-2]),
    "l1_ratio": np.linspace(start=0.1, stop=0.9, num=10),
    "penalty": np.array(["l1", "l2", "elasticnet"]),
}
grid_search(SGDRegressor, param_grid)

100%|██████████| 90/90 [00:13<00:00,  6.90it/s]


model,params,fit_time_ms,train_mae,test_mae
str,str,f64,f64,f64
"""SGDRegressor""","""{'alpha': 0.001, 'l1_ratio': 0…",140.700102,0.952048,1.031051
"""SGDRegressor""","""{'alpha': 0.0001, 'l1_ratio': …",206.234217,0.95442,1.031736
"""SGDRegressor""","""{'alpha': 0.0001, 'l1_ratio': …",127.006054,0.953221,1.031831
"""SGDRegressor""","""{'alpha': 0.0001, 'l1_ratio': …",130.512238,0.951951,1.031842
"""SGDRegressor""","""{'alpha': 0.001, 'l1_ratio': 0…",156.486988,0.954418,1.032225
…,…,…,…,…
"""SGDRegressor""","""{'alpha': 0.01, 'l1_ratio': 0.…",104.090929,1.023733,1.092064
"""SGDRegressor""","""{'alpha': 0.01, 'l1_ratio': 0.…",68.866014,1.02502,1.09287
"""SGDRegressor""","""{'alpha': 0.01, 'l1_ratio': 0.…",113.832951,1.025992,1.09311
"""SGDRegressor""","""{'alpha': 0.01, 'l1_ratio': 0.…",115.669966,1.027868,1.09359


In [67]:
from sklearn.neighbors import KNeighborsRegressor

param_grid = {
    "n_neighbors": np.arange(start=2, stop=9),
    "leaf_size": np.arange(start=5, stop=50, step=5),
    "weights": np.array(["uniform", "distance"]),
}
grid_search(KNeighborsRegressor, param_grid)

  0%|          | 0/126 [00:00<?, ?it/s]

model,params,fit_time_ms,train_mae,test_mae
str,str,f64,f64,f64
"""KNeighborsRegressor""","""{'n_neighbors': 6, 'leaf_size'…",0.232935,0.87419,0.98375
"""KNeighborsRegressor""","""{'n_neighbors': 6, 'leaf_size'…",0.244141,0.87419,0.98375
"""KNeighborsRegressor""","""{'n_neighbors': 6, 'leaf_size'…",0.247002,0.87419,0.98375
"""KNeighborsRegressor""","""{'n_neighbors': 6, 'leaf_size'…",0.250101,0.87419,0.98375
"""KNeighborsRegressor""","""{'n_neighbors': 6, 'leaf_size'…",0.625849,0.87419,0.98375
…,…,…,…,…
"""KNeighborsRegressor""","""{'n_neighbors': 2, 'leaf_size'…",0.582218,9.5465e-8,1.22805
"""KNeighborsRegressor""","""{'n_neighbors': 2, 'leaf_size'…",0.265837,9.5465e-8,1.22805
"""KNeighborsRegressor""","""{'n_neighbors': 2, 'leaf_size'…",0.597,9.5465e-8,1.22805
"""KNeighborsRegressor""","""{'n_neighbors': 2, 'leaf_size'…",0.324965,9.5465e-8,1.22805


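Two patterns in the table above are worth a note. `leaf_size` only affects how the neighbor search tree is built and queried, not which neighbors are found, so it changes speed but not MAE. And `weights="distance"` gives a near-zero train MAE because each training point is its own nearest neighbor at distance zero. A sketch on synthetic data, with assumed toy sizes:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
y = rng.normal(size=100)


def train_mae(**kwargs):
    # Fit and score on the same data, like train_mae in grid_search.
    model = KNeighborsRegressor(n_neighbors=2, **kwargs).fit(X, y)
    return mean_absolute_error(y, model.predict(X))


mae_uniform = train_mae(weights="uniform")
mae_distance = train_mae(weights="distance")
mae_leaf_5 = train_mae(weights="uniform", leaf_size=5)
mae_leaf_45 = train_mae(weights="uniform", leaf_size=45)

print(mae_uniform, mae_distance, mae_leaf_5 == mae_leaf_45)
```

So `leaf_size` could be dropped from the grid without losing information, and the train MAE for distance weighting says little about fit quality.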
In [77]:
from sklearn.ensemble import RandomForestRegressor

param_grid = {
    "min_samples_leaf": np.arange(start=2, stop=6),
}
grid_search(RandomForestRegressor, param_grid)

  0%|          | 0/4 [00:00<?, ?it/s]

model,params,fit_time_ms,train_mae,test_mae
str,str,f64,f64,f64
"""RandomForestRegressor""","""{'min_samples_leaf': 5}""",27979.981899,0.458471,1.055396
"""RandomForestRegressor""","""{'min_samples_leaf': 2}""",32291.435003,0.3927,1.05599
"""RandomForestRegressor""","""{'min_samples_leaf': 3}""",30894.958019,0.409741,1.058537
"""RandomForestRegressor""","""{'min_samples_leaf': 4}""",29754.374027,0.435607,1.066952


In [72]:
from sklearn.ensemble import HistGradientBoostingRegressor

param_grid = {
    "learning_rate": np.array([1e-1, 5e-2, 1e-2]),
    "max_leaf_nodes": np.arange(start=20, stop=50, step=10),
    # "l2_regularization": np.linspace(start=0.0, stop=1.0, num=3),
    # "max_features": np.linspace(start=0.1, stop=1.0, num=3),
}
grid_search(HistGradientBoostingRegressor, param_grid)

  0%|          | 0/81 [00:00<?, ?it/s]

model,params,fit_time_ms,train_mae,test_mae
str,str,f64,f64,f64
"""HistGradientBoostingRegressor""","""{'learning_rate': 0.05, 'max_l…",2074.5399,0.263926,1.026484
"""HistGradientBoostingRegressor""","""{'learning_rate': 0.05, 'max_l…",1612.729788,0.33925,1.034051
"""HistGradientBoostingRegressor""","""{'learning_rate': 0.05, 'max_l…",1921.996832,0.285968,1.034399
"""HistGradientBoostingRegressor""","""{'learning_rate': 0.1, 'max_le…",1709.148884,0.192629,1.036705
"""HistGradientBoostingRegressor""","""{'learning_rate': 0.1, 'max_le…",2383.854866,0.042516,1.038475
…,…,…,…,…
"""HistGradientBoostingRegressor""","""{'learning_rate': 0.1, 'max_le…",1579.900742,0.248323,1.09971
"""HistGradientBoostingRegressor""","""{'learning_rate': 0.1, 'max_le…",1688.149214,0.172808,1.104118
"""HistGradientBoostingRegressor""","""{'learning_rate': 0.1, 'max_le…",1602.896214,0.131334,1.105918
"""HistGradientBoostingRegressor""","""{'learning_rate': 0.1, 'max_le…",2495.135069,0.040932,1.105953


## PCA

See if some approaches behave better on PCA-reduced data (down to 128D). Note that PCA should only be trained on the training set to avoid data leakage.


In [19]:
from sklearn.decomposition import PCA

pca = PCA(n_components=128)
pca.fit(X_train)

X_train = pca.transform(X_train)
X_test = pca.transform(X_test)
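
An alternative way to keep PCA fitted on the training set only is to wrap it in a scikit-learn pipeline together with the model. This is a sketch on synthetic data with assumed sizes, not a drop-in replacement for the cell above.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 32))
y = X[:, 0] + rng.normal(scale=0.1, size=200)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.1, random_state=42)

pipe = make_pipeline(PCA(n_components=8), Ridge(alpha=1.0))
pipe.fit(X_tr, y_tr)  # PCA is fitted on the training rows only
y_pred = pipe.predict(X_te)  # test rows are transformed, never fitted on

print(pipe.named_steps["pca"].n_components_, y_pred.shape)
```

The pipeline also leaves the original arrays untouched, whereas the cell above overwrites `X_train` and `X_test`, so rerunning it would apply PCA a second time.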

In [20]:
print(pca.explained_variance_ratio_[0:5])

[0.04350267 0.02861863 0.02662594 0.02461618 0.02177215]


Explained variance is poor: the first five components together account for only about 14.5% of the variance, so PCA is unlikely to help here.
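
A cumulative view makes this kind of check easier to read than the per-component ratios. A minimal sketch on synthetic data with assumed shapes; the 90% threshold is an arbitrary example.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 64))

# Running total of variance explained by the first k components.
pca = PCA(n_components=16).fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)
print(cumulative[[0, 4, -1]])

# Number of components needed to reach 90% of the variance.
full = PCA().fit(X)
full_cumulative = np.cumsum(full.explained_variance_ratio_)
n_for_90 = int(np.searchsorted(full_cumulative, 0.90) + 1)
print(n_for_90)
```

On the real embeddings, `np.cumsum(pca.explained_variance_ratio_)[-1]` would give the total variance retained by all 128 components.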


In [21]:
from sklearn.neighbors import KNeighborsRegressor

param_grid = {
    "n_neighbors": np.arange(start=2, stop=9),
    "leaf_size": np.arange(start=5, stop=50, step=5),
    "weights": np.array(["uniform", "distance"]),
}
grid_search(KNeighborsRegressor, param_grid)

100%|██████████| 126/126 [00:01<00:00, 80.79it/s]


model,params,fit_time_ms,train_mae,test_mae
str,str,f64,f64,f64
"""KNeighborsRegressor""","""{'n_neighbors': 3, 'leaf_size'…",0.108957,0.733565,1.054583
"""KNeighborsRegressor""","""{'n_neighbors': 3, 'leaf_size'…",0.121832,0.733565,1.054583
"""KNeighborsRegressor""","""{'n_neighbors': 3, 'leaf_size'…",0.118971,0.733565,1.054583
"""KNeighborsRegressor""","""{'n_neighbors': 3, 'leaf_size'…",0.109196,0.733565,1.054583
"""KNeighborsRegressor""","""{'n_neighbors': 3, 'leaf_size'…",0.106812,0.733565,1.054583
…,…,…,…,…
"""KNeighborsRegressor""","""{'n_neighbors': 2, 'leaf_size'…",0.117064,0.634757,1.156875
"""KNeighborsRegressor""","""{'n_neighbors': 2, 'leaf_size'…",0.136852,0.634757,1.156875
"""KNeighborsRegressor""","""{'n_neighbors': 2, 'leaf_size'…",0.401258,0.634757,1.156875
"""KNeighborsRegressor""","""{'n_neighbors': 2, 'leaf_size'…",0.363111,0.634757,1.156875
