Test implementation of grid search using scikit-learn algorithms before spending time testing it with cuML.

Although scikit-learn has [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) for performing a grid search across various algorithms, it doesn't offer the control I need over trial data output: I want the results of each trial written out as a single denormalized row.

XGBoost note: https://stackoverflow.com/questions/63776921/you-are-running-32-bit-python-on-a-64-bit-osmac-and-xgboost-library-could-not


In [1]:
import os

# import xgboost
import polars as pl
import cuml
import cudf

In [2]:
df = (
    pl.scan_parquet(
        "movie_data_plus_embeds_all.parquet"
    )
    # .select(["tconst", "averageRating", "embedding"])
    .with_columns(averageRating=pl.col("averageRating").cast(pl.Float32))
    .collect()
    .sample(fraction=1.0, shuffle=True, seed=42)
)

df

tconst,startYear,numVotes,averageRating,json,embedding
str,i64,i64,f32,str,"array[f32, 768]"
"""tt0173052""",1999,354,4.1,"""{  ""title"": ""The Prince and t…","[0.046187, 0.006053, … 0.011911]"
"""tt0266288""",1996,1054,7.4,"""{  ""title"": ""Azhakiya Ravanan…","[-0.004875, -0.046969, … 0.017516]"
"""tt6263490""",2020,2713,4.3,"""{  ""title"": ""Getaway"",  ""gen…","[0.005363, -0.018672, … 0.015112]"
"""tt10049110""",2019,106,7.8,"""{  ""title"": ""Die Wiese"",  ""g…","[-0.009997, -0.029303, … 0.037793]"
"""tt5761612""",2018,133,3.8,"""{  ""title"": ""Woman on the Edg…","[0.020259, -0.031869, … -0.01841]"
…,…,…,…,…,…
"""tt0079376""",1979,168,6.2,"""{  ""title"": ""The Proud Twins""…","[0.062672, -0.009446, … 0.019441]"
"""tt1161064""",2008,1194,3.2,"""{  ""title"": ""Super Capers: Th…","[0.022779, 0.053063, … -0.009691]"
"""tt0179526""",1997,340,5.7,"""{  ""title"": ""Who's the Caboos…","[0.001937, 0.003111, … -0.002453]"
"""tt0188233""",1979,32,5.7,"""{  ""title"": ""That's Erotic"", …","[0.03125, 0.013802, … 0.009849]"


To avoid conversion overhead, store all data as `cuDF` [DataFrames](https://docs.rapids.ai/api/cudf/stable/user_guide/api_docs/api/cudf.dataframe/#cudf.DataFrame).

In [3]:
n_test = 20000

X_train = cudf.DataFrame(df[:-n_test]["embedding"].to_numpy().copy())
X_test = cudf.DataFrame(df[-n_test:]["embedding"].to_numpy().copy())

y_train = cudf.Series(df[:-n_test]["averageRating"].to_numpy().copy())
y_test = cudf.Series(df[-n_test:]["averageRating"].to_numpy().copy())

y_train

0         4.1
1         7.4
2         4.3
3         7.8
4         3.8
         ... 
222547    5.0
222548    6.7
222549    6.4
222550    6.0
222551    6.5
Length: 222552, dtype: float32

Baseline test: predict the train set mean for every test row.

In [4]:
y_train_mean = y_train.mean()
y_train_mean

6.055189800023445

In [5]:
y_train_mean_series = cudf.Series([y_train_mean] * n_test)

nir = cuml.metrics.mean_squared_error(y_test, y_train_mean_series)
nir

1.6378202126246257
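
For reference, the same baseline can be reproduced without a GPU. The following is a minimal pure-Python sketch on made-up ratings (not the IMDb data); `mean_baseline_mse` is a hypothetical helper name, not part of cuML.

```python
from statistics import fmean


def mean_baseline_mse(y_train, y_test):
    """MSE obtained by predicting the training mean for every test row."""
    baseline = fmean(y_train)
    return fmean((y - baseline) ** 2 for y in y_test)


# Toy ratings for illustration only (not taken from the dataset).
toy_train = [4.0, 6.0, 8.0]
toy_test = [5.0, 7.0]

baseline_mse = mean_baseline_mse(toy_train, toy_test)
print(baseline_mse)  # 1.0
```

Any model worth keeping should land below this value on the test set.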

Test a simple OLS.


In [6]:
model = cuml.LinearRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
mse = cuml.metrics.mean_squared_error(y_test, y_pred)
mse

1.1870156526565552

## Start Grid Search

- Create a parameter grid
- Run sequentially since this will eventually be on a single GPU.
- Write the results ([time elapsed](https://stackoverflow.com/a/63232040), train MSE, test MSE) to a dict along with the metadata.
- Load all the data into polars and save as parquet.
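
The timing and record-keeping steps in the list above can be sketched in isolation. This is a minimal sketch, not the notebook's implementation: `timed_call` is a hypothetical helper, the workload is a stand-in for a model fit, and `time.perf_counter` is used here as one reasonable clock for measuring intervals.

```python
import time


def timed_call(fn, *args, **kwargs):
    """Run fn and return (result, elapsed milliseconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed_ms = (time.perf_counter() - start) * 1000
    return result, elapsed_ms


# Stand-in workload; in the notebook this would be model.fit(X_train, y_train).
total, fit_time_ms = timed_call(sum, range(1_000_000))

# One flat record per trial: metadata and metrics side by side.
row = {"model": "Example", "params": "{}", "fit_time_ms": fit_time_ms}
print(row)
```

A list of such flat dicts converts directly into a DataFrame with one row per trial.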

Test with a Support Vector Machine since that has tweakable parameters.


In [7]:
import itertools
import time

import numpy as np
from sklearn.svm import SVR
from tqdm.auto import tqdm

Build the grid to get all combinations of parameters. Normally numpy shenanigans would be used here for speed, but that causes data type issues: several parameters must be `int`s, and stacking all the values into a single numpy array forces them to `float`s.
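
The data type issue can be seen in a small sketch. The parameter names and values below are hypothetical, chosen only to mix an integer parameter with a float one; `np.meshgrid` is one possible pure-NumPy way to form the combinations, assumed here for illustration.

```python
import itertools

import numpy as np

# Hypothetical grid mixing an integer parameter with a float parameter.
n_neighbors = [2, 3]
alpha = [0.5, 1.5]

# Pure-NumPy cartesian product: everything is stacked into one array,
# so the integer column is upcast to float.
mesh = np.stack(
    np.meshgrid(n_neighbors, alpha, indexing="ij"), axis=-1
).reshape(-1, 2)
print(mesh.dtype)  # float64

# itertools.product keeps each value's original Python type.
combos = list(itertools.product(n_neighbors, alpha))
print(combos[0])  # (2, 0.5)
```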


In [8]:
param_grid = {
    "C": np.linspace(start=0.8, stop=1.2, num=5),
    "epsilon": np.linspace(start=0.8, stop=1.2, num=5),
}

In [9]:
[x.tolist() for x in param_grid.values()]

[[0.8, 0.9, 1.0, 1.1, 1.2], [0.8, 0.9, 1.0, 1.1, 1.2]]

In [10]:
param_values = [x.tolist() for x in param_grid.values()]

combos = list(itertools.product(*param_values))
combos

[(0.8, 0.8),
 (0.8, 0.9),
 (0.8, 1.0),
 (0.8, 1.1),
 (0.8, 1.2),
 (0.9, 0.8),
 (0.9, 0.9),
 (0.9, 1.0),
 (0.9, 1.1),
 (0.9, 1.2),
 (1.0, 0.8),
 (1.0, 0.9),
 (1.0, 1.0),
 (1.0, 1.1),
 (1.0, 1.2),
 (1.1, 0.8),
 (1.1, 0.9),
 (1.1, 1.0),
 (1.1, 1.1),
 (1.1, 1.2),
 (1.2, 0.8),
 (1.2, 0.9),
 (1.2, 1.0),
 (1.2, 1.1),
 (1.2, 1.2)]

[zip shenanigans](https://stackoverflow.com/a/33737067) to map back to a list of dicts.


In [11]:
param_dicts = [dict(zip(param_grid.keys(), combo)) for combo in combos]
param_dicts

[{'C': 0.8, 'epsilon': 0.8},
 {'C': 0.8, 'epsilon': 0.9},
 {'C': 0.8, 'epsilon': 1.0},
 {'C': 0.8, 'epsilon': 1.1},
 {'C': 0.8, 'epsilon': 1.2},
 {'C': 0.9, 'epsilon': 0.8},
 {'C': 0.9, 'epsilon': 0.9},
 {'C': 0.9, 'epsilon': 1.0},
 {'C': 0.9, 'epsilon': 1.1},
 {'C': 0.9, 'epsilon': 1.2},
 {'C': 1.0, 'epsilon': 0.8},
 {'C': 1.0, 'epsilon': 0.9},
 {'C': 1.0, 'epsilon': 1.0},
 {'C': 1.0, 'epsilon': 1.1},
 {'C': 1.0, 'epsilon': 1.2},
 {'C': 1.1, 'epsilon': 0.8},
 {'C': 1.1, 'epsilon': 0.9},
 {'C': 1.1, 'epsilon': 1.0},
 {'C': 1.1, 'epsilon': 1.1},
 {'C': 1.1, 'epsilon': 1.2},
 {'C': 1.2, 'epsilon': 0.8},
 {'C': 1.2, 'epsilon': 0.9},
 {'C': 1.2, 'epsilon': 1.0},
 {'C': 1.2, 'epsilon': 1.1},
 {'C': 1.2, 'epsilon': 1.2}]

Functionalize.


In [12]:
def build_grid_dict(param_grid):
    param_values = [x.tolist() for x in param_grid.values()]

    combos = list(itertools.product(*param_values))
    param_dicts = [dict(zip(param_grid.keys(), combo)) for combo in combos]
    return param_dicts


param_grid = {
    "C": np.linspace(start=0.8, stop=1.2, num=5),
    "epsilon": np.linspace(start=0.8, stop=1.2, num=5),
}

build_grid_dict(param_grid)[:5]

[{'C': 0.8, 'epsilon': 0.8},
 {'C': 0.8, 'epsilon': 0.9},
 {'C': 0.8, 'epsilon': 1.0},
 {'C': 0.8, 'epsilon': 1.1},
 {'C': 0.8, 'epsilon': 1.2}]
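
`build_grid_dict` calls `.tolist()` on every value, so it expects NumPy arrays. A possible variation that also accepts plain Python lists is sketched below. `build_grid_dicts_any` is a hypothetical name, the `hasattr` check is just one way to unify both kinds of input, and the parameter names in the example are placeholders.

```python
import itertools


def build_grid_dicts_any(param_grid):
    """Like build_grid_dict, but values may be lists or NumPy arrays."""
    keys = list(param_grid)
    values = [
        v.tolist() if hasattr(v, "tolist") else list(v)
        for v in param_grid.values()
    ]
    return [dict(zip(keys, combo)) for combo in itertools.product(*values)]


# Plain lists keep ints as ints and strings as strings.
grid = build_grid_dicts_any({"n_neighbors": [2, 3], "weights": ["uniform"]})
print(grid)
```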

Do the grid search!


In [13]:
save_folder = "grid_search_results"

os.makedirs(save_folder, exist_ok=True)


# to guard against floating point precision issues when printing params
def round_param_floats(params):
    return {k: round(v, 4) if isinstance(v, float) else v for k, v in params.items()}


def grid_search(model_class, param_grid, model_str):
    param_dicts = build_grid_dict(param_grid)
    result_dicts = []

    for params in tqdm(param_dicts):
        model = model_class(**params)  # fresh model instantiation each run

        fit_start = time.time()
        model.fit(X_train, y_train)
        fit_end = time.time()

        y_pred = model.predict(X_train)
        train_mse = cuml.metrics.mean_squared_error(y_train, y_pred)

        y_pred = model.predict(X_test)
        test_mse = cuml.metrics.mean_squared_error(y_test, y_pred)

        result_dicts.append(
            {
                "model": model_str,
                "params": str(round_param_floats(params)),
                "fit_time_ms": (fit_end - fit_start) * 1000,
                "train_mse": train_mse,
                "test_mse": test_mse,
            }
        )
        
    result_df = pl.from_dicts(result_dicts).sort("test_mse")
    result_df.write_parquet(os.path.join(save_folder, f"{model_str}.parquet"))

    return result_df
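
The loop only relies on each model class exposing `fit` and `predict`, so its logic can be checked without a GPU. The sketch below assumes a hypothetical stand-in model (`ConstantRegressor`) and toy data. It mirrors the shape of the result rows but is not the cuML code path, and it omits timing and the parquet write.

```python
import itertools
from statistics import fmean


class ConstantRegressor:
    """Hypothetical stand-in model: always predicts mean(y) + offset."""

    def __init__(self, offset=0.0):
        self.offset = offset

    def fit(self, X, y):
        self.value_ = fmean(y) + self.offset
        return self

    def predict(self, X):
        return [self.value_ for _ in X]


def mse(y_true, y_pred):
    return fmean((t - p) ** 2 for t, p in zip(y_true, y_pred))


# Toy data for illustration only.
X_tr, y_tr = [[0], [1], [2]], [4.0, 6.0, 8.0]
X_te, y_te = [[3], [4]], [5.0, 7.0]

rows = []
for (offset,) in itertools.product([0.0, 1.0]):
    model = ConstantRegressor(offset=offset).fit(X_tr, y_tr)
    rows.append(
        {
            "model": "ConstantRegressor",
            "params": str({"offset": offset}),
            "train_mse": mse(y_tr, model.predict(X_tr)),
            "test_mse": mse(y_te, model.predict(X_te)),
        }
    )

# Best trial first, as in grid_search.
rows.sort(key=lambda r: r["test_mse"])
print(rows[0]["params"])  # {'offset': 0.0}
```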

In [14]:
param_grid = {
    "C": np.linspace(start=0.8, stop=1.2, num=5),
    "epsilon": np.linspace(start=0.8, stop=1.2, num=5),
}

grid_search(cuml.svm.SVR, param_grid, "SupportVectorRegression")

  0%|          | 0/25 [00:00<?, ?it/s]

model,params,fit_time_ms,train_mse,test_mse
str,str,f64,f64,f64
"""SupportVectorRegression""","""{'C': 1.2, 'epsilon': 0.8}""",15052.175283,1.048633,1.0874
"""SupportVectorRegression""","""{'C': 1.1, 'epsilon': 0.8}""",14231.781721,1.055605,1.089279
"""SupportVectorRegression""","""{'C': 1.2, 'epsilon': 0.9}""",13225.937366,1.055518,1.091265
"""SupportVectorRegression""","""{'C': 1.0, 'epsilon': 0.8}""",13459.87463,1.06304,1.091467
"""SupportVectorRegression""","""{'C': 1.1, 'epsilon': 0.9}""",12538.203239,1.062224,1.093113
…,…,…,…,…
"""SupportVectorRegression""","""{'C': 1.1, 'epsilon': 1.2}""",8125.117064,1.088614,1.106921
"""SupportVectorRegression""","""{'C': 0.8, 'epsilon': 1.1}""",7920.904636,1.100272,1.108328
"""SupportVectorRegression""","""{'C': 1.0, 'epsilon': 1.2}""",7566.693306,1.094958,1.10893
"""SupportVectorRegression""","""{'C': 0.9, 'epsilon': 1.2}""",7308.598518,1.101801,1.111324


Try a few more models: two linear models, then k-nearest neighbors.


In [15]:
param_grid = {
    "alpha": np.linspace(start=0.5, stop=2.0, num=20),
}

grid_search(cuml.Ridge, param_grid, "RidgeRegression")

  0%|          | 0/20 [00:00<?, ?it/s]

model,params,fit_time_ms,train_mse,test_mse
str,str,f64,f64,f64
"""RidgeRegression""","""{'alpha': 0.5}""",1431.311607,1.22904,1.18907
"""RidgeRegression""","""{'alpha': 0.5789}""",974.885464,1.229467,1.189458
"""RidgeRegression""","""{'alpha': 0.6579}""",969.211817,1.229883,1.189837
"""RidgeRegression""","""{'alpha': 0.7368}""",983.141661,1.230287,1.190205
"""RidgeRegression""","""{'alpha': 0.8158}""",984.215975,1.23068,1.190562
…,…,…,…,…
"""RidgeRegression""","""{'alpha': 1.6842}""",968.740702,1.234272,1.193826
"""RidgeRegression""","""{'alpha': 1.7632}""",975.694656,1.234545,1.194074
"""RidgeRegression""","""{'alpha': 1.8421}""",966.972351,1.234812,1.194315
"""RidgeRegression""","""{'alpha': 1.9211}""",975.708485,1.235072,1.194551


In [16]:
param_grid = {
    "alpha": np.array([1e-4, 1e-3, 1e-2]),
    "l1_ratio": np.linspace(start=0.1, stop=0.9, num=3),
    "penalty": np.array(["none", "l1", "l2"]),
}
grid_search(cuml.SGD, param_grid, "SGDRegressor")

  0%|          | 0/27 [00:00<?, ?it/s]

model,params,fit_time_ms,train_mse,test_mse
str,str,f64,f64,f64
"""SGDRegressor""","""{'alpha': 0.0001, 'l1_ratio': …",19494.043827,1.304869,1.260315
"""SGDRegressor""","""{'alpha': 0.0001, 'l1_ratio': …",19413.653135,1.304869,1.260315
"""SGDRegressor""","""{'alpha': 0.0001, 'l1_ratio': …",19401.616096,1.304869,1.260315
"""SGDRegressor""","""{'alpha': 0.001, 'l1_ratio': 0…",19094.285488,1.304869,1.260315
"""SGDRegressor""","""{'alpha': 0.001, 'l1_ratio': 0…",18917.403221,1.304869,1.260315
…,…,…,…,…
"""SGDRegressor""","""{'alpha': 0.01, 'l1_ratio': 0.…",8115.404844,1.485299,1.445979
"""SGDRegressor""","""{'alpha': 0.01, 'l1_ratio': 0.…",8071.959972,1.485299,1.445979
"""SGDRegressor""","""{'alpha': 0.01, 'l1_ratio': 0.…",3664.698362,1.673841,1.637807
"""SGDRegressor""","""{'alpha': 0.01, 'l1_ratio': 0.…",3766.567469,1.673841,1.637807


In [17]:
param_grid = {
    "n_neighbors": np.arange(start=2, stop=20),
}

grid_search(cuml.neighbors.KNeighborsRegressor, param_grid, "KNNRegressor")

  0%|          | 0/18 [00:00<?, ?it/s]

model,params,fit_time_ms,train_mse,test_mse
str,str,f64,f64,f64
"""KNNRegressor""","""{'n_neighbors': 19}""",95.722914,1.082461,1.173136
"""KNNRegressor""","""{'n_neighbors': 17}""",96.079111,1.070649,1.175417
"""KNNRegressor""","""{'n_neighbors': 18}""",94.62738,1.07699,1.175951
"""KNNRegressor""","""{'n_neighbors': 16}""",93.647003,1.064122,1.176751
"""KNNRegressor""","""{'n_neighbors': 15}""",97.162247,1.056909,1.179318
…,…,…,…,…
"""KNNRegressor""","""{'n_neighbors': 6}""",96.260786,0.913629,1.270564
"""KNNRegressor""","""{'n_neighbors': 5}""",97.173214,0.87238,1.30111
"""KNNRegressor""","""{'n_neighbors': 4}""",94.455957,0.812516,1.34876
"""KNNRegressor""","""{'n_neighbors': 3}""",93.913555,0.717228,1.433389


Reload all the saved parquets and combine them.

In [19]:
# match only the parquet files written by grid_search
df_comb = pl.read_parquet(os.path.join(save_folder, "*.parquet")).sort("test_mse")
df_comb

model,params,fit_time_ms,train_mse,test_mse
str,str,f64,f64,f64
"""SupportVectorRegression""","""{'C': 1.2, 'epsilon': 0.8}""",15052.175283,1.048633,1.0874
"""SupportVectorRegression""","""{'C': 1.1, 'epsilon': 0.8}""",14231.781721,1.055605,1.089279
"""SupportVectorRegression""","""{'C': 1.2, 'epsilon': 0.9}""",13225.937366,1.055518,1.091265
"""SupportVectorRegression""","""{'C': 1.0, 'epsilon': 0.8}""",13459.87463,1.06304,1.091467
"""SupportVectorRegression""","""{'C': 1.1, 'epsilon': 0.9}""",12538.203239,1.062224,1.093113
…,…,…,…,…
"""SGDRegressor""","""{'alpha': 0.01, 'l1_ratio': 0.…",8071.959972,1.485299,1.445979
"""KNNRegressor""","""{'n_neighbors': 2}""",104.799986,0.531296,1.607363
"""SGDRegressor""","""{'alpha': 0.01, 'l1_ratio': 0.…",3664.698362,1.673841,1.637807
"""SGDRegressor""","""{'alpha': 0.01, 'l1_ratio': 0.…",3766.567469,1.673841,1.637807


In [20]:
df_comb.write_csv("imdb_grid_search.csv")