Test implementation of grid search using scikit-learn algorithms before spending time testing it with cuML.

Although scikit-learn has [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) for performing a grid search across various algorithms, it doesn't offer the control I need over trial data output: I want the results of each trial written out as a denormalized row.

XGBoost note: https://stackoverflow.com/questions/63776921/you-are-running-32-bit-python-on-a-64-bit-osmac-and-xgboost-library-could-not


In [1]:
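import sklearn
import sklearn.metrics  # a bare `import sklearn` is not guaranteed to load submodules
import sklearn.model_selection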
import xgboost
import polars as pl
import os

In [2]:
data_dir = "/Users/maxwoolf/Downloads"

df = pl.read_parquet(os.path.join(data_dir, "movie_embeds.parquet"))
df

tconst,startYear,averageRating,json,embeds
str,i64,f64,str,"array[f32, 768]"
"""tt0171517""",1975,6.2,"""{'title': 'Mechtat i zhit', 'g…","[-0.00642, -0.008363, … -0.095905]"
"""tt0034421""",1942,7.1,"""{'title': 'A 2000 pengös férfi…","[-0.027937, 0.038252, … -0.067883]"
"""tt0989642""",2006,5.6,"""{'title': 'Midnight Running', …","[-0.001267, 0.034308, … -0.027295]"
"""tt0273692""",1980,7.4,"""{'title': 'J.S. Brown, o Últim…","[-0.015184, 0.027515, … -0.077242]"
"""tt4470172""",2014,5.6,"""{'title': 'Guruldayan Kalpler'…","[-0.000246, 0.017736, … -0.053503]"
…,…,…,…,…
"""tt1326956""",2009,8.1,"""{'title': 'Our Summer in Tehra…","[-0.023254, 0.013627, … -0.036384]"
"""tt0175092""",1982,6.7,"""{'title': ""Puss 'n Boots"", 'ge…","[0.017054, 0.044542, … -0.076013]"
"""tt2739566""",1998,7.1,"""{'title': 'Jungle Love Story',…","[-0.049089, 0.011179, … -0.030443]"
"""tt15485264""",2021,7.0,"""{'title': 'Ring Wandering', 'g…","[-0.012844, -0.001891, … -0.021231]"


Extract the components as numpy arrays for compatibility.


In [3]:
embeds = df["embeds"].to_numpy()
print(embeds.shape)

release_years = df["startYear"].to_numpy()
ratings = df["averageRating"].to_numpy()

(1600, 768)


Do a train-test split. Normally we would stratify (here, by release year), but sklearn will throw an error if a group is too small.
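One possible workaround, sketched here under assumptions rather than taken from this notebook: stratify on coarser bins instead of raw years, and fold any bin with too few members into a catch-all label. The helper name `coarse_strata`, the decade-wide bins, and the `"other"` label are all hypothetical choices.

```python
from collections import Counter


def coarse_strata(years, bin_width=10, min_count=2):
    """Bin years (decades by default) and merge rare bins into "other".

    Hypothetical helper, not part of the original notebook.
    """
    bins = [str(year // bin_width * bin_width) for year in years]
    counts = Counter(bins)
    return [b if counts[b] >= min_count else "other" for b in bins]


years = [1942, 1975, 1980, 1982, 1998, 2006, 2009, 2014, 2021]
strata = coarse_strata(years)
print(strata)
```

The resulting list could then be passed as `stratify=` in place of the raw years. The `"other"` bucket can itself end up with a single member, so it is worth checking the counts before relying on this.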


In [4]:
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    embeds,
    ratings,
    test_size=0.1,
    random_state=42,
    # stratify=release_years
)

X_train.shape

(1440, 768)

Test a simple OLS.


In [5]:
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
mae = sklearn.metrics.mean_absolute_error(y_test, y_pred)
mae

np.float64(1.664403998851776)

## Start Grid Search

- Create a parameter grid
- Run sequentially since this will eventually be on a single GPU.
- Write the results ([time elapsed](https://stackoverflow.com/a/63232040), train MAE, test MAE) to a dict along with the metadata.
- Load all the data into polars and save as parquet.

Test with a Support Vector Machine since that has tweakable parameters.


In [6]:
from tqdm.auto import tqdm
from sklearn.svm import SVR
import time
import itertools
import numpy as np

Build the grid to get all combinations of parameters. Normally numpy shenanigans would be used here for speed, but that causes data type issues: several parameters must be `int`s, and numpy shenanigans will force them to `float`s.
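
A quick illustration of one way the data type issue can show up, as a sketch: packing all parameter values into a single numpy array coerces them to one dtype, whereas keeping one array per parameter and calling `.tolist()` hands back native Python types.

```python
import numpy as np

# Mixing ints and floats in one array upcasts everything to float64.
mixed = np.array([[2, 3, 4], [0.5, 1.0, 1.5]])
print(mixed.dtype)

# One array per parameter keeps each dtype; .tolist() yields Python scalars.
n_neighbors = np.arange(start=2, stop=5).tolist()
alphas = np.linspace(start=0.5, stop=1.5, num=3).tolist()
print(n_neighbors, alphas)
```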


In [7]:
param_grid = {
    "C": np.linspace(start=0.8, stop=1.2, num=5),
    "epsilon": np.linspace(start=0.8, stop=1.2, num=5),
}

In [8]:
[x.tolist() for x in param_grid.values()]

[[0.8, 0.9, 1.0, 1.1, 1.2], [0.8, 0.9, 1.0, 1.1, 1.2]]

In [9]:
param_values = [x.tolist() for x in param_grid.values()]

combos = list(itertools.product(*param_values))
combos

[(0.8, 0.8),
 (0.8, 0.9),
 (0.8, 1.0),
 (0.8, 1.1),
 (0.8, 1.2),
 (0.9, 0.8),
 (0.9, 0.9),
 (0.9, 1.0),
 (0.9, 1.1),
 (0.9, 1.2),
 (1.0, 0.8),
 (1.0, 0.9),
 (1.0, 1.0),
 (1.0, 1.1),
 (1.0, 1.2),
 (1.1, 0.8),
 (1.1, 0.9),
 (1.1, 1.0),
 (1.1, 1.1),
 (1.1, 1.2),
 (1.2, 0.8),
 (1.2, 0.9),
 (1.2, 1.0),
 (1.2, 1.1),
 (1.2, 1.2)]

[zip shenanigans](https://stackoverflow.com/a/33737067) to map back to a list of dicts.


In [10]:
param_dicts = [dict(zip(param_grid.keys(), combo)) for combo in combos]
param_dicts

[{'C': 0.8, 'epsilon': 0.8},
 {'C': 0.8, 'epsilon': 0.9},
 {'C': 0.8, 'epsilon': 1.0},
 {'C': 0.8, 'epsilon': 1.1},
 {'C': 0.8, 'epsilon': 1.2},
 {'C': 0.9, 'epsilon': 0.8},
 {'C': 0.9, 'epsilon': 0.9},
 {'C': 0.9, 'epsilon': 1.0},
 {'C': 0.9, 'epsilon': 1.1},
 {'C': 0.9, 'epsilon': 1.2},
 {'C': 1.0, 'epsilon': 0.8},
 {'C': 1.0, 'epsilon': 0.9},
 {'C': 1.0, 'epsilon': 1.0},
 {'C': 1.0, 'epsilon': 1.1},
 {'C': 1.0, 'epsilon': 1.2},
 {'C': 1.1, 'epsilon': 0.8},
 {'C': 1.1, 'epsilon': 0.9},
 {'C': 1.1, 'epsilon': 1.0},
 {'C': 1.1, 'epsilon': 1.1},
 {'C': 1.1, 'epsilon': 1.2},
 {'C': 1.2, 'epsilon': 0.8},
 {'C': 1.2, 'epsilon': 0.9},
 {'C': 1.2, 'epsilon': 1.0},
 {'C': 1.2, 'epsilon': 1.1},
 {'C': 1.2, 'epsilon': 1.2}]

Functionalize.


In [11]:
def build_grid_dict(param_grid):
    # .tolist() converts numpy scalars to native Python types (int, float, str)
    param_values = [x.tolist() for x in param_grid.values()]

    # one dict per combination in the Cartesian product of all parameter values
    combos = itertools.product(*param_values)
    return [dict(zip(param_grid.keys(), combo)) for combo in combos]


param_grid = {
    "C": np.linspace(start=0.8, stop=1.2, num=5),
    "epsilon": np.linspace(start=0.8, stop=1.2, num=5),
}

build_grid_dict(param_grid)[:5]

[{'C': 0.8, 'epsilon': 0.8},
 {'C': 0.8, 'epsilon': 0.9},
 {'C': 0.8, 'epsilon': 1.0},
 {'C': 0.8, 'epsilon': 1.1},
 {'C': 0.8, 'epsilon': 1.2}]

Do the grid search!


In [12]:
# to guard against floating point precision issues when printing params
def round_param_floats(params):
    return {k: round(v, 4) if isinstance(v, float) else v for k, v in params.items()}


def grid_search(model_class, param_grid, model_name=None):
    param_dicts = build_grid_dict(param_grid)
    result_dicts = []

    for params in tqdm(param_dicts):
        model = model_class(**params)  # fresh model instantiation each run

        fit_start = time.perf_counter()  # monotonic clock, better suited to timing intervals
        model.fit(X_train, y_train)
        fit_end = time.perf_counter()

        y_pred = model.predict(X_train)
        train_mae = sklearn.metrics.mean_absolute_error(y_train, y_pred)

        y_pred = model.predict(X_test)
        test_mae = sklearn.metrics.mean_absolute_error(y_test, y_pred)

        result_dicts.append(
            {
                "model": model_name if model_name else model_class.__name__,
                "params": str(round_param_floats(params)),
                "fit_time_ms": (fit_end - fit_start) * 1000,
                "train_mae": train_mae,
                "test_mae": test_mae,
            }
        )

    return pl.from_dicts(result_dicts).sort("test_mae")

In [13]:
param_grid = {
    "C": np.linspace(start=0.8, stop=1.2, num=5),
    "epsilon": np.linspace(start=0.8, stop=1.2, num=5),
}

grid_search(SVR, param_grid)

  0%|          | 0/25 [00:00<?, ?it/s]

model,params,fit_time_ms,train_mae,test_mae
str,str,f64,f64,f64
"""SVR""","""{'C': 0.8, 'epsilon': 1.0}""",95.1581,0.893894,1.020952
"""SVR""","""{'C': 0.9, 'epsilon': 1.0}""",97.18895,0.884697,1.021613
"""SVR""","""{'C': 1.0, 'epsilon': 1.0}""",98.151922,0.875689,1.021996
"""SVR""","""{'C': 1.1, 'epsilon': 1.0}""",97.400904,0.867216,1.022721
"""SVR""","""{'C': 0.8, 'epsilon': 0.9}""",104.934931,0.882155,1.023533
…,…,…,…,…
"""SVR""","""{'C': 1.0, 'epsilon': 1.2}""",82.366943,0.902671,1.02717
"""SVR""","""{'C': 0.9, 'epsilon': 1.2}""",81.73418,0.910174,1.027264
"""SVR""","""{'C': 1.2, 'epsilon': 0.8}""",118.883133,0.831849,1.027318
"""SVR""","""{'C': 1.1, 'epsilon': 1.2}""",81.638813,0.895719,1.027482


Try a few more models, starting with linear ones.


In [14]:
from sklearn.linear_model import Ridge

param_grid = {
    "alpha": np.linspace(start=0.5, stop=2.0, num=20),
}

grid_search(Ridge, param_grid)

  0%|          | 0/20 [00:00<?, ?it/s]

model,params,fit_time_ms,train_mae,test_mae
str,str,f64,f64,f64
"""Ridge""","""{'alpha': 1.4474}""",4.287004,0.908754,1.007351
"""Ridge""","""{'alpha': 1.5263}""",4.657984,0.91085,1.00736
"""Ridge""","""{'alpha': 1.3684}""",4.197121,0.90658,1.007363
"""Ridge""","""{'alpha': 1.6053}""",4.576921,0.912866,1.007402
"""Ridge""","""{'alpha': 1.2895}""",4.451752,0.904323,1.007459
…,…,…,…,…
"""Ridge""","""{'alpha': 0.8158}""",5.245924,0.888077,1.011126
"""Ridge""","""{'alpha': 0.7368}""",5.831003,0.88484,1.013093
"""Ridge""","""{'alpha': 0.6579}""",6.709099,0.881487,1.015332
"""Ridge""","""{'alpha': 0.5789}""",7.67684,0.877938,1.017907


In [59]:
from sklearn.linear_model import BayesianRidge

param_grid = {
    "alpha_1": np.array([1e-6, 1e-5, 1e-4]),
    "alpha_2": np.array([1e-6, 1e-5, 1e-4]),
    "lambda_1": np.array([1e-6, 1e-5, 1e-4]),
    "lambda_2": np.array([1e-6, 1e-5, 1e-4]),
}

grid_search(BayesianRidge, param_grid, model_name="Bayesian Ridge Regression")

  0%|          | 0/81 [00:00<?, ?it/s]

model,params,fit_time_ms,train_mae,test_mae
str,str,f64,f64,f64
"""Bayesian Ridge Regression""","""{'alpha_1': 0.0001, 'alpha_2':…",54.588318,0.915301,1.007601
"""Bayesian Ridge Regression""","""{'alpha_1': 0.0001, 'alpha_2':…",55.998087,0.915301,1.007601
"""Bayesian Ridge Regression""","""{'alpha_1': 0.0, 'alpha_2': 0.…",55.694818,0.915302,1.007601
"""Bayesian Ridge Regression""","""{'alpha_1': 0.0, 'alpha_2': 0.…",58.756828,0.915302,1.007601
"""Bayesian Ridge Regression""","""{'alpha_1': 0.0, 'alpha_2': 0.…",57.154179,0.915302,1.007601
…,…,…,…,…
"""Bayesian Ridge Regression""","""{'alpha_1': 0.0, 'alpha_2': 0.…",89.975119,0.915301,1.007601
"""Bayesian Ridge Regression""","""{'alpha_1': 0.0, 'alpha_2': 0.…",55.245161,0.915301,1.007601
"""Bayesian Ridge Regression""","""{'alpha_1': 0.0, 'alpha_2': 0.…",56.51927,0.915301,1.007601
"""Bayesian Ridge Regression""","""{'alpha_1': 0.0, 'alpha_2': 0.…",56.350946,0.915301,1.007601

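The `'alpha_1': 0.0` entries in the output above look like a side effect of `round(v, 4)`, which flattens values such as `1e-6` to zero in the label only (the model itself still receives the unrounded value). A possible alternative, sketched under the assumption that four significant figures are enough for labeling, is to round to significant figures instead of decimal places. The name `round_param_floats_sig` is hypothetical.

```python
def round_param_floats_sig(params, sig=4):
    """Round float values to `sig` significant figures; leave other types alone."""
    return {
        k: float(f"{v:.{sig}g}") if isinstance(v, float) else v
        for k, v in params.items()
    }


params = {"alpha_1": 1e-6, "alpha": 1.4473684210526316, "degree": 3}
print(round_param_floats_sig(params))
```

This keeps small regularization values distinguishable in the `params` column while still trimming floating point noise.
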

In [60]:
from sklearn.kernel_ridge import KernelRidge

param_grid = {
    "alpha": np.linspace(start=0.5, stop=2.0, num=20),
    "degree": np.arange(start=2, stop=7),
}

grid_search(KernelRidge, param_grid)

  0%|          | 0/100 [00:00<?, ?it/s]

model,params,fit_time_ms,train_mae,test_mae
str,str,f64,f64,f64
"""KernelRidge""","""{'alpha': 2.0, 'degree': 2}""",18.443108,0.939202,1.024392
"""KernelRidge""","""{'alpha': 2.0, 'degree': 3}""",18.319845,0.939202,1.024392
"""KernelRidge""","""{'alpha': 2.0, 'degree': 4}""",18.078089,0.939202,1.024392
"""KernelRidge""","""{'alpha': 2.0, 'degree': 5}""",19.418955,0.939202,1.024392
"""KernelRidge""","""{'alpha': 2.0, 'degree': 6}""",18.694878,0.939202,1.024392
…,…,…,…,…
"""KernelRidge""","""{'alpha': 0.5, 'degree': 2}""",48.840046,0.890756,1.042783
"""KernelRidge""","""{'alpha': 0.5, 'degree': 3}""",18.843174,0.890756,1.042783
"""KernelRidge""","""{'alpha': 0.5, 'degree': 4}""",18.883944,0.890756,1.042783
"""KernelRidge""","""{'alpha': 0.5, 'degree': 5}""",19.228697,0.890756,1.042783

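The identical MAE across `degree` values above is consistent with how `KernelRidge` is documented: `degree` only applies to the polynomial kernel, and the default kernel is `"linear"`. A minimal check on synthetic data (random arrays, not the movie embeddings) might look like this:

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))  # synthetic features
y = rng.normal(size=50)  # synthetic targets


def predictions(kernel, degree):
    model = KernelRidge(alpha=1.0, kernel=kernel, degree=degree)
    return model.fit(X, y).predict(X)


# With the default linear kernel, degree has no effect on the fit.
same_linear = np.allclose(predictions("linear", 2), predictions("linear", 6))

# With the polynomial kernel, degree changes the model.
same_poly = np.allclose(predictions("polynomial", 2), predictions("polynomial", 6))

print(same_linear, same_poly)
```

To make the `degree` sweep meaningful in the grid above, one option would be to add a fixed kernel as a one-element array, e.g. `"kernel": np.array(["polynomial"])`, since the grid helper calls `.tolist()` on every value.
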

In [63]:
from sklearn.linear_model import ElasticNet

param_grid = {
    "alpha": np.linspace(start=0.5, stop=1.5, num=10),
    "l1_ratio": np.linspace(start=0.1, stop=0.9, num=10),
}
grid_search(ElasticNet, param_grid)

  0%|          | 0/100 [00:00<?, ?it/s]

model,params,fit_time_ms,train_mae,test_mae
str,str,f64,f64,f64
"""ElasticNet""","""{'alpha': 0.5, 'l1_ratio': 0.1…",4.637957,1.11784,1.15688
"""ElasticNet""","""{'alpha': 0.5, 'l1_ratio': 0.1…",3.396034,1.11784,1.15688
"""ElasticNet""","""{'alpha': 0.5, 'l1_ratio': 0.2…",4.389048,1.11784,1.15688
"""ElasticNet""","""{'alpha': 0.5, 'l1_ratio': 0.3…",3.980875,1.11784,1.15688
"""ElasticNet""","""{'alpha': 0.5, 'l1_ratio': 0.4…",5.500793,1.11784,1.15688
…,…,…,…,…
"""ElasticNet""","""{'alpha': 1.5, 'l1_ratio': 0.5…",2.591372,1.11784,1.15688
"""ElasticNet""","""{'alpha': 1.5, 'l1_ratio': 0.6…",2.604008,1.11784,1.15688
"""ElasticNet""","""{'alpha': 1.5, 'l1_ratio': 0.7…",2.370119,1.11784,1.15688
"""ElasticNet""","""{'alpha': 1.5, 'l1_ratio': 0.8…",2.567053,1.11784,1.15688


In [64]:
from sklearn.linear_model import SGDRegressor

param_grid = {
    "alpha": np.array([1e-4, 1e-3, 1e-2]),
    "l1_ratio": np.linspace(start=0.1, stop=0.9, num=10),
    "penalty": np.array(["l1", "l2", "elasticnet"]),
}
grid_search(SGDRegressor, param_grid)

  0%|          | 0/90 [00:00<?, ?it/s]

model,params,fit_time_ms,train_mae,test_mae
str,str,f64,f64,f64
"""SGDRegressor""","""{'alpha': 0.0001, 'l1_ratio': …",164.452076,1.009713,1.060977
"""SGDRegressor""","""{'alpha': 0.0001, 'l1_ratio': …",262.612343,1.0121,1.06296
"""SGDRegressor""","""{'alpha': 0.0001, 'l1_ratio': …",160.377026,1.012377,1.062963
"""SGDRegressor""","""{'alpha': 0.001, 'l1_ratio': 0…",158.321857,1.013135,1.064459
"""SGDRegressor""","""{'alpha': 0.001, 'l1_ratio': 0…",164.090872,1.012916,1.064549
…,…,…,…,…
"""SGDRegressor""","""{'alpha': 0.01, 'l1_ratio': 0.…",67.302942,1.115121,1.158357
"""SGDRegressor""","""{'alpha': 0.01, 'l1_ratio': 0.…",77.97575,1.115376,1.159054
"""SGDRegressor""","""{'alpha': 0.01, 'l1_ratio': 0.…",69.360971,1.115226,1.159077
"""SGDRegressor""","""{'alpha': 0.01, 'l1_ratio': 0.…",67.888975,1.115567,1.15919


In [67]:
from sklearn.neighbors import KNeighborsRegressor

param_grid = {
    "n_neighbors": np.arange(start=2, stop=9),
    "leaf_size": np.arange(start=5, stop=50, step=5),
    "weights": np.array(["uniform", "distance"]),
}
grid_search(KNeighborsRegressor, param_grid)

  0%|          | 0/126 [00:00<?, ?it/s]

model,params,fit_time_ms,train_mae,test_mae
str,str,f64,f64,f64
"""KNeighborsRegressor""","""{'n_neighbors': 6, 'leaf_size'…",0.232935,0.87419,0.98375
"""KNeighborsRegressor""","""{'n_neighbors': 6, 'leaf_size'…",0.244141,0.87419,0.98375
"""KNeighborsRegressor""","""{'n_neighbors': 6, 'leaf_size'…",0.247002,0.87419,0.98375
"""KNeighborsRegressor""","""{'n_neighbors': 6, 'leaf_size'…",0.250101,0.87419,0.98375
"""KNeighborsRegressor""","""{'n_neighbors': 6, 'leaf_size'…",0.625849,0.87419,0.98375
…,…,…,…,…
"""KNeighborsRegressor""","""{'n_neighbors': 2, 'leaf_size'…",0.582218,9.5465e-8,1.22805
"""KNeighborsRegressor""","""{'n_neighbors': 2, 'leaf_size'…",0.265837,9.5465e-8,1.22805
"""KNeighborsRegressor""","""{'n_neighbors': 2, 'leaf_size'…",0.597,9.5465e-8,1.22805
"""KNeighborsRegressor""","""{'n_neighbors': 2, 'leaf_size'…",0.324965,9.5465e-8,1.22805


In [77]:
from sklearn.ensemble import RandomForestRegressor

param_grid = {
    "min_samples_leaf": np.arange(start=2, stop=6),
}
grid_search(RandomForestRegressor, param_grid)

  0%|          | 0/4 [00:00<?, ?it/s]

model,params,fit_time_ms,train_mae,test_mae
str,str,f64,f64,f64
"""RandomForestRegressor""","""{'min_samples_leaf': 5}""",27979.981899,0.458471,1.055396
"""RandomForestRegressor""","""{'min_samples_leaf': 2}""",32291.435003,0.3927,1.05599
"""RandomForestRegressor""","""{'min_samples_leaf': 3}""",30894.958019,0.409741,1.058537
"""RandomForestRegressor""","""{'min_samples_leaf': 4}""",29754.374027,0.435607,1.066952


In [72]:
from sklearn.ensemble import HistGradientBoostingRegressor

param_grid = {
    "learning_rate": np.array([1e-1, 5e-2, 1e-2]),
    "max_leaf_nodes": np.arange(start=20, stop=50, step=10),
    # "l2_regularization": np.linspace(start=0.0, stop=1.0, num=3),
    # "max_features": np.linspace(start=0.1, stop=1.0, num=3),
}
grid_search(HistGradientBoostingRegressor, param_grid)

  0%|          | 0/81 [00:00<?, ?it/s]

model,params,fit_time_ms,train_mae,test_mae
str,str,f64,f64,f64
"""HistGradientBoostingRegressor""","""{'learning_rate': 0.05, 'max_l…",2074.5399,0.263926,1.026484
"""HistGradientBoostingRegressor""","""{'learning_rate': 0.05, 'max_l…",1612.729788,0.33925,1.034051
"""HistGradientBoostingRegressor""","""{'learning_rate': 0.05, 'max_l…",1921.996832,0.285968,1.034399
"""HistGradientBoostingRegressor""","""{'learning_rate': 0.1, 'max_le…",1709.148884,0.192629,1.036705
"""HistGradientBoostingRegressor""","""{'learning_rate': 0.1, 'max_le…",2383.854866,0.042516,1.038475
…,…,…,…,…
"""HistGradientBoostingRegressor""","""{'learning_rate': 0.1, 'max_le…",1579.900742,0.248323,1.09971
"""HistGradientBoostingRegressor""","""{'learning_rate': 0.1, 'max_le…",1688.149214,0.172808,1.104118
"""HistGradientBoostingRegressor""","""{'learning_rate': 0.1, 'max_le…",1602.896214,0.131334,1.105918
"""HistGradientBoostingRegressor""","""{'learning_rate': 0.1, 'max_le…",2495.135069,0.040932,1.105953


## PCA

See if some approaches behave better on PCA-reduced data (down to 128D). Note that PCA should be fit only on the training set; the fitted transform is then applied to both the training and test sets.


In [19]:
from sklearn.decomposition import PCA

pca = PCA(n_components=128)
pca.fit(X_train)

X_train = pca.transform(X_train)
X_test = pca.transform(X_test)

In [20]:
from sklearn.neighbors import KNeighborsRegressor

param_grid = {
    "n_neighbors": np.arange(start=2, stop=9),
    "leaf_size": np.arange(start=5, stop=50, step=5),
    "weights": np.array(["uniform", "distance"]),
}
grid_search(KNeighborsRegressor, param_grid)

  0%|          | 0/126 [00:00<?, ?it/s]

model,params,fit_time_ms,train_mae,test_mae
str,str,f64,f64,f64
"""KNeighborsRegressor""","""{'n_neighbors': 6, 'leaf_size'…",0.105858,0.868924,0.9975
"""KNeighborsRegressor""","""{'n_neighbors': 6, 'leaf_size'…",0.219107,0.868924,0.9975
"""KNeighborsRegressor""","""{'n_neighbors': 6, 'leaf_size'…",0.10705,0.868924,0.9975
"""KNeighborsRegressor""","""{'n_neighbors': 6, 'leaf_size'…",0.10705,0.868924,0.9975
"""KNeighborsRegressor""","""{'n_neighbors': 6, 'leaf_size'…",0.211954,0.868924,0.9975
…,…,…,…,…
"""KNeighborsRegressor""","""{'n_neighbors': 2, 'leaf_size'…",0.24581,5.0441e-8,1.142689
"""KNeighborsRegressor""","""{'n_neighbors': 2, 'leaf_size'…",0.128031,5.0441e-8,1.142689
"""KNeighborsRegressor""","""{'n_neighbors': 2, 'leaf_size'…",0.261068,5.0441e-8,1.142689
"""KNeighborsRegressor""","""{'n_neighbors': 2, 'leaf_size'…",0.114918,5.0441e-8,1.142689
