# Model exploration

In this notebook, we explore several ML models and the parameters available in each of them, in order to find the best-performing estimator.

In [None]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

import pandas as pd

dataset = pd.read_csv("/content/drive/Shareddrives/ML 2024/tables/dataset.csv")
dataset.designation = 'dataset'

dataset.head()

Mounted at /content/drive


Unnamed: 0,year,tmID,franchID,confID,rank,firstRound,semis,finals,o_fga,o_fta,...,rating_second_max_C,rating_second_max_F,rating_second_max_G,rating_third_max_C,rating_third_max_F,rating_third_max_G,mean_C,mean_F,mean_G,no_new_players
0,1,LAS,LAS,0,1.0,1,0,0,1956.0,693.0,...,0.591842,0.615393,0.560788,0.47286,0.536651,0.560788,0.532351,0.576022,0.535423,2.0
1,1,NYL,NYL,1,1.0,1,1,0,1815.0,567.0,...,0.0,0.579372,0.593629,0.0,0.543367,0.593629,0.0,0.561369,0.512405,6.0
2,1,CLE,CLE,1,2.0,1,0,0,1828.0,570.0,...,0.532204,0.570828,0.572901,0.532204,0.570828,0.572901,0.522991,0.513338,0.50983,2.0
3,1,HOU,HOU,0,2.0,1,1,1,1894.0,634.0,...,0.516522,0.555364,0.613733,0.516522,0.555364,0.593196,0.516522,0.555364,0.603464,4.0
4,1,ORL,CON,1,3.0,0,0,0,1911.0,546.0,...,0.551317,0.0,0.528959,0.551317,0.0,0.451005,0.551317,0.0,0.489982,4.0


Let's define a function that prepares the table for training and testing by removing the non-numeric columns, and another that splits a prepared table into features and labels.

In [None]:
def prepare_table_for_model(table):
  ret_table = table.copy()

  del ret_table['franchID']
  del ret_table['tmID']

  return ret_table

def getXY(data):
  X = data.drop(columns=['label', 'year']).values
  y = data['label'].values

  return (X, y)
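
As a side note, the two identifier columns could also be dropped by data type rather than by name. The sketch below is only an illustration on a toy table: `keep_numeric_columns` is a hypothetical helper, not part of this notebook, and the toy values are made up.

```python
import pandas as pd

# Toy stand-in for the real dataset; only a few of the real column names are mirrored.
toy = pd.DataFrame({
    "year": [1, 1],
    "tmID": ["LAS", "NYL"],
    "franchID": ["LAS", "NYL"],
    "rank": [1.0, 1.0],
    "label": [1, 0],
})

def keep_numeric_columns(table):
    """Hypothetical alternative to prepare_table_for_model: keep only numeric columns."""
    return table.select_dtypes(include="number").copy()

numeric_toy = keep_numeric_columns(toy)
print(list(numeric_toy.columns))
```

Dropping by name, as done above, is more explicit; selecting by type would silently adapt if further text columns were added to the dataset.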

Let's define train and test sets, and also X (data) and y (label) sets.

In [None]:
train_rows = dataset[dataset['year'] <= 8].copy()
train_seasons = 8

test_rows = dataset[dataset['year'].isin([9])].copy()
test_seasons = 1



train_data = prepare_table_for_model(train_rows)
test_data = prepare_table_for_model(test_rows)

X_train, y_train = getXY(train_data)
X_test, y_test = getXY(test_data)


train_data_no_nans = prepare_table_for_model(train_rows.dropna())
test_data_no_nans = prepare_table_for_model(test_rows.dropna())

X_train_no_nans, y_train_no_nans = getXY(train_data_no_nans)
X_test_no_nans, y_test_no_nans = getXY(test_data_no_nans)

Let's now define a function that will evaluate prediction results.

We define the error as ```err = sum(|prediction - label|)```, where the prediction is the predicted probability of qualifying: the lower the error, the better the results.

Since we will be using GridSearchCV in order to find the best parameter combinations for each model/estimator, let's prepare our evaluation function to be used by GridSearchCV to evaluate the estimators.

It is worth noting that the scorer passed to GridSearchCV is the negation of the computed error, since GridSearchCV treats a greater score as a better estimator. Because our evaluation is an error, where lower is better, we negate it so that it behaves as a score.

In [None]:
def get_predictions(estimator, X):
  return [i[1] for i in estimator.predict_proba(X)]   # i[1] since each prediction comes as an array of [prob_not_qualifying, prob_qualifying]

def get_error(estimator, X, labels):
  predictions = get_predictions(estimator, X)

  err = 0

  for i in range(len(predictions)):
    err += abs(predictions[i] - labels[i])

  return err

def scorer(estimator, X, y):
    return -get_error(estimator, X, y)     # negated so that a larger error yields a lower score
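
To make the metric concrete, here is a small worked example with hand-picked probabilities. It is a sketch only: `FixedProbabilities` is a hypothetical stand-in for a fitted classifier, and `total_absolute_error` is a vectorized equivalent of `get_error` written for this illustration.

```python
import numpy as np

class FixedProbabilities:
    """Hypothetical stand-in for a fitted classifier: returns preset probabilities."""

    def __init__(self, positive_probabilities):
        self.p = np.asarray(positive_probabilities, dtype=float)

    def predict_proba(self, X):
        # Same layout as scikit-learn: one row per sample, [prob_not_qualifying, prob_qualifying]
        return np.column_stack([1.0 - self.p, self.p])

def total_absolute_error(estimator, X, labels):
    # Sum of |predicted probability of qualifying - label| over all samples
    positive = estimator.predict_proba(X)[:, 1]
    return float(np.abs(positive - np.asarray(labels)).sum())

def negated_error_scorer(estimator, X, y):
    # Higher is better, as GridSearchCV expects
    return -total_absolute_error(estimator, X, y)

toy_estimator = FixedProbabilities([0.9, 0.2, 0.6, 0.5])
toy_X = np.zeros((4, 1))      # the features are ignored by the stand-in
toy_labels = [1, 0, 0, 1]

toy_error = total_absolute_error(toy_estimator, toy_X, toy_labels)
toy_score = negated_error_scorer(toy_estimator, toy_X, toy_labels)
print(toy_error, toy_score)   # 0.1 + 0.2 + 0.6 + 0.5 = 1.4, so the score is -1.4 (up to floating-point rounding)
```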

The purpose of this notebook is to find the best possible estimator. So, we will define a data structure that stores the best estimator found so far, so that it can be used later on the actual final query set.

We will also save the associated feature selector, so that the query set can be transformed in the same way as the data that led to the best estimator in this notebook.

In [None]:
best_estimator_feature_selector = {
    "error": 1000,
    "estimator": None,
    "feature_selector": None,
    "supports_na": None
}

## Decision Tree Classifier (no NaNs)

### Feature Selection

Let's use Sequential Feature Selection to find a better feature subset for this classifier. We will then transform the train and test sets accordingly.

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_selection import SequentialFeatureSelector

classifier = DecisionTreeClassifier()
sfs = SequentialFeatureSelector(
    estimator = classifier,
    scoring = scorer,
    n_jobs = -1
)

sfs.fit(X_train_no_nans, y_train_no_nans)

new_X_train_no_nans = sfs.transform(X_train_no_nans)
new_X_test_no_nans = sfs.transform(X_test_no_nans)

Checking the effects:

In [None]:
print(f'Original features: {X_train_no_nans.shape[1]}')
print(f'Selected features: {new_X_train_no_nans.shape[1]}')

Original features: 69
Selected features: 34
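
The drop from 69 to 34 features is consistent with the selector's default behaviour: when `n_features_to_select` is left unset (and no `tol` is given), `SequentialFeatureSelector` keeps half of the features. The sketch below illustrates this on synthetic data; the dataset and the `cv=3` setting are assumptions made only to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 10 features, standing in for the real 69-feature table
X_toy, y_toy = make_classification(
    n_samples=60, n_features=10, n_informative=4, random_state=0
)

toy_selector = SequentialFeatureSelector(
    estimator=DecisionTreeClassifier(random_state=0),
    cv=3,  # fewer folds than the default, just to run quickly
)
toy_selector.fit(X_toy, y_toy)

X_toy_selected = toy_selector.transform(X_toy)
selected_mask = toy_selector.get_support()   # boolean mask over the original columns

print(X_toy.shape[1], "->", X_toy_selected.shape[1])
```

If a different subset size is wanted, `n_features_to_select` accepts an integer or a fraction of the features.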


### Parameter specification

Let's use GridSearchCV to find the best parameters for the Decision Tree model, printing the cross-validation results on the train set.

In [None]:
from sklearn.model_selection import GridSearchCV

parameter_grid = {
    "max_depth": [None] + list(range(1, 21)),
    "criterion": ["gini", "entropy", "log_loss"],
    "splitter": ["best", "random"],
}

grid_search = GridSearchCV(
    classifier,
    parameter_grid,
    scoring=scorer,
    n_jobs=-1,  # to maximize parallelism
    verbose=2
  )

grid_search.fit(new_X_train_no_nans, y_train_no_nans)

estimator = grid_search.best_estimator_
params = grid_search.best_params_
error = -grid_search.best_score_    # negating since the yielded "score" is actually an error, as explained earlier

print("Best parameters:", params)
print(f"Best minimal total error: {error}")
print("Best average error per season:", round(-grid_search.best_score_ / train_seasons, 2))

Fitting 5 folds for each of 126 candidates, totalling 630 fits
Best parameters: {'criterion': 'gini', 'max_depth': 9, 'splitter': 'best'}
Best minimal total error: 7.25
Best average error per season: 0.91
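
The "126 candidates" in the log above is simply the size of the Cartesian product of the grid: 21 depths × 3 criteria × 2 splitters. As a quick check, `ParameterGrid` can count the combinations without fitting anything; the variable names below are only for this illustration.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the cell above
decision_tree_grid = {
    "max_depth": [None] + list(range(1, 21)),
    "criterion": ["gini", "entropy", "log_loss"],
    "splitter": ["best", "random"],
}

n_candidates = len(ParameterGrid(decision_tree_grid))
n_folds = 5   # GridSearchCV's default number of folds
print(n_candidates, n_candidates * n_folds)
```

Counting the candidates first is a cheap way to estimate how long a search will take before launching it.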


Let's now run our model with the test data, printing the average error per season:

In [None]:
test_error = round(get_error(grid_search.best_estimator_, new_X_test_no_nans, y_test_no_nans)/test_seasons, 2)
print(f"Average error per season on test set: {test_error}")

Average error per season on test set: 3.0


Okay, now we want to save this exact estimator and corresponding feature selector, if they correspond to the best score so far, as explained earlier.

In [None]:
if test_error < best_estimator_feature_selector["error"]:
  best_estimator_feature_selector["error"] = test_error
  best_estimator_feature_selector["estimator"] = estimator
  best_estimator_feature_selector["feature_selector"] = sfs
  best_estimator_feature_selector["supports_na"] = False

In [None]:
print(best_estimator_feature_selector)

{'error': 3.0, 'estimator': DecisionTreeClassifier(max_depth=9), 'feature_selector': SequentialFeatureSelector(estimator=DecisionTreeClassifier(), n_jobs=-1,
                          scoring=<function scorer at 0x7f219fdb39a0>), 'supports_na': False}


Likewise, the process is repeated for some other classifiers.
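
Since the same bookkeeping is repeated after each classifier, it could be wrapped in a small helper. This is only a sketch: `update_best` is a hypothetical function, not used elsewhere in this notebook, and the strings stand in for real estimator and selector objects.

```python
def update_best(best, test_error, estimator, feature_selector, supports_na):
    """Hypothetical helper: overwrite `best` in place if `test_error` is lower.

    Returns True when the record was updated, False otherwise.
    """
    if test_error >= best["error"]:
        return False
    best.update(
        error=test_error,
        estimator=estimator,
        feature_selector=feature_selector,
        supports_na=supports_na,
    )
    return True

toy_best = {"error": 1000, "estimator": None, "feature_selector": None, "supports_na": None}

# Placeholder strings instead of fitted objects, with test errors taken from this notebook
first = update_best(toy_best, 3.0, "decision tree", "selector A", False)
second = update_best(toy_best, 6.09, "random forest", "selector B", True)
print(first, second, toy_best["estimator"])
```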

## Random Forest Classifier

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SequentialFeatureSelector

classifier = RandomForestClassifier()
sfs = SequentialFeatureSelector(
    estimator = classifier,
    scoring = scorer,
    n_jobs = -1
)

sfs.fit(X_train, y_train)

new_X_train = sfs.transform(X_train)
new_X_test = sfs.transform(X_test)

In [None]:
print(f'Original features: {X_train.shape[1]}')
print(f'Selected features: {new_X_train.shape[1]}')

Original features: 69
Selected features: 34


In [None]:
from sklearn.model_selection import GridSearchCV

parameter_grid = {
    "n_estimators": [50, 100],
    "max_depth": [None, 10, 20],
    "max_features": ["sqrt", "log2"],
    "criterion": ["gini", "entropy"],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 2],
    "bootstrap": [True]
}

grid_search = GridSearchCV(
    classifier,
    parameter_grid,
    scoring=scorer,
    n_jobs=-1,  # to maximize parallelism
    verbose=2
  )


grid_search.fit(new_X_train, y_train)

estimator = grid_search.best_estimator_
params = grid_search.best_params_
error = -grid_search.best_score_    # negating since the yielded "score" is actually an error, as explained earlier

print("Best parameters:", params)
print(f"Best minimal total error: {error}")
print("Best average error per season:", round(-grid_search.best_score_ / train_seasons, 2))

Fitting 5 folds for each of 96 candidates, totalling 480 fits
Best parameters: {'bootstrap': True, 'criterion': 'gini', 'max_depth': None, 'max_features': 'sqrt', 'min_samples_leaf': 2, 'min_samples_split': 5, 'n_estimators': 50}
Best minimal total error: 9.142615873015874
Best average error per season: 1.14
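
When reading the "Best minimal total error" values, it helps to remember that `best_score_` is the mean of the scorer over the validation folds, not a sum over the whole train set. With the default 5 folds, each validation fold holds roughly a fifth of the training rows. The sketch below checks this on synthetic data; the dataset, the tiny grid and the name `negated_total_error` are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

def negated_total_error(estimator, X, y):
    # Same idea as the notebook's scorer: minus the summed absolute error
    positive = estimator.predict_proba(X)[:, 1]
    return -float(np.abs(positive - y).sum())

X_toy, y_toy = make_classification(n_samples=100, n_features=6, random_state=0)

toy_search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    {"max_depth": [1, 2, 3]},
    scoring=negated_total_error,
)
toy_search.fit(X_toy, y_toy)

# Collect the per-fold scores of the winning candidate
best_index = toy_search.best_index_
fold_scores = [
    toy_search.cv_results_[f"split{i}_test_score"][best_index]
    for i in range(toy_search.n_splits_)
]
print(toy_search.best_score_, np.mean(fold_scores))
```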


In [None]:
test_error = round(get_error(grid_search.best_estimator_, new_X_test, y_test)/test_seasons, 2)   # this section's test set, transformed by the selector fitted above
print(f"Average error per season on test set: {test_error}")

Average error per season on test set: 6.09


In [None]:
if test_error < best_estimator_feature_selector["error"]:
  best_estimator_feature_selector["error"] = test_error
  best_estimator_feature_selector["estimator"] = estimator
  best_estimator_feature_selector["feature_selector"] = sfs
  best_estimator_feature_selector["supports_na"] = True

## Histogram-based Gradient Boosting Classifier

In [None]:
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.feature_selection import SequentialFeatureSelector

classifier = HistGradientBoostingClassifier()
sfs = SequentialFeatureSelector(
    estimator = classifier,
    scoring = scorer,
    n_jobs = -1
)

sfs.fit(X_train, y_train)

new_X_train = sfs.transform(X_train)
new_X_test = sfs.transform(X_test)

In [None]:
from sklearn.model_selection import GridSearchCV

parameter_grid = {
    "learning_rate": [0.01, 0.05, 0.1],
    "max_iter": [100, 200, 300],
    "max_depth": [None, 10, 20],
    "min_samples_leaf": [5, 10, 20],
    "max_leaf_nodes": [None, 10, 20],
    "l2_regularization": [0.0, 0.1, 0.5],
    "early_stopping": [True, False]
}

grid_search = GridSearchCV(
    classifier,
    parameter_grid,
    scoring=scorer,
    n_jobs=-1,  # to maximize parallelism
    verbose=2
)


grid_search.fit(new_X_train, y_train)

estimator = grid_search.best_estimator_
params = grid_search.best_params_
error = -grid_search.best_score_    # negating since the yielded score is actually an error, as explained earlier

print("Best parameters:", params)
print(f"Best minimal total error: {error}")
print("Best average error per season:", round(-grid_search.best_score_ / train_seasons, 2))

Fitting 5 folds for each of 1458 candidates, totalling 7290 fits
Best parameters: {'early_stopping': False, 'l2_regularization': 0.0, 'learning_rate': 0.1, 'max_depth': None, 'max_iter': 300, 'max_leaf_nodes': None, 'min_samples_leaf': 20}
Best minimal total error: 6.468370905113528
Best average error per season: 0.81


In [None]:
test_error = round(get_error(grid_search.best_estimator_, new_X_test, y_test)/test_seasons, 2)
print(f"Average error per season on test set: {test_error}")

Average error per season on test set: 7.15


In [None]:
if test_error < best_estimator_feature_selector["error"]:
  best_estimator_feature_selector["error"] = test_error
  best_estimator_feature_selector["estimator"] = estimator
  best_estimator_feature_selector["feature_selector"] = sfs
  best_estimator_feature_selector["supports_na"] = True

## K Nearest Neighbors Classifier (no NaNs)

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.feature_selection import SequentialFeatureSelector

classifier = KNeighborsClassifier()
sfs = SequentialFeatureSelector(
    estimator = classifier,
    scoring = scorer,
    n_jobs = -1
)

sfs.fit(X_train_no_nans, y_train_no_nans)

new_X_train_no_nans = sfs.transform(X_train_no_nans)
new_X_test_no_nans = sfs.transform(X_test_no_nans)

In [None]:
from sklearn.model_selection import GridSearchCV

parameter_grid = {
    "n_neighbors": [3, 5, 7, 10],
    "weights": ["uniform", "distance"],
    "algorithm": ["auto", "ball_tree", "kd_tree", "brute"],
    "leaf_size": [10, 30, 50],
    "p": [1, 2]
}

grid_search = GridSearchCV(
    classifier,
    parameter_grid,
    scoring=scorer,
    n_jobs=-1,  # to maximize parallelism
    verbose=2
)


grid_search.fit(new_X_train_no_nans, y_train_no_nans)

estimator = grid_search.best_estimator_
params = grid_search.best_params_
error = -grid_search.best_score_    # negating since the yielded score is actually an error, as explained earlier

print("Best parameters:", params)
print(f"Best minimal total error: {error}")
print("Best average error per season:", round(-grid_search.best_score_ / train_seasons, 2))

Fitting 5 folds for each of 192 candidates, totalling 960 fits
Best parameters: {'algorithm': 'auto', 'leaf_size': 10, 'n_neighbors': 5, 'p': 2, 'weights': 'uniform'}
Best minimal total error: 8.360000000000001
Best average error per season: 1.05



In [None]:
test_error = round(get_error(grid_search.best_estimator_, new_X_test_no_nans, y_test_no_nans)/test_seasons, 2)
print(f"Average error per season on test set: {test_error}")

Average error per season on test set: 5.6


In [None]:
if test_error < best_estimator_feature_selector["error"]:
  best_estimator_feature_selector["error"] = test_error
  best_estimator_feature_selector["estimator"] = estimator
  best_estimator_feature_selector["feature_selector"] = sfs
  best_estimator_feature_selector["supports_na"] = False

## Neural Network Classifier (no NaNs)

In [None]:
from sklearn.neural_network import MLPClassifier
from sklearn.feature_selection import SequentialFeatureSelector

classifier = MLPClassifier()
sfs = SequentialFeatureSelector(
    estimator = classifier,
    scoring = scorer,
    n_jobs = -1
)

sfs.fit(X_train_no_nans, y_train_no_nans)

new_X_train_no_nans = sfs.transform(X_train_no_nans)
new_X_test_no_nans = sfs.transform(X_test_no_nans)

In [None]:
from sklearn.model_selection import GridSearchCV

parameter_grid = {
    "hidden_layer_sizes": [(50,), (100,), (50, 50)],
    "activation": ["relu", "tanh"],
    "solver": ["adam", "sgd"],
    "alpha": [1e-4, 1e-3],
    "learning_rate": ["constant", "adaptive"]
}

grid_search = GridSearchCV(
    classifier,
    parameter_grid,
    scoring=scorer,
    n_jobs=-1,  # to maximize parallelism
    verbose=2
)


grid_search.fit(new_X_train_no_nans, y_train_no_nans)

estimator = grid_search.best_estimator_
params = grid_search.best_params_
error = -grid_search.best_score_    # negating since the yielded score is actually an error, as explained earlier

print("Best parameters:", params)
print(f"Best minimal total error: {error}")
print("Best average error per season:", round(-grid_search.best_score_ / train_seasons, 2))

Fitting 5 folds for each of 48 candidates, totalling 240 fits
Best parameters: {'activation': 'relu', 'alpha': 0.0001, 'hidden_layer_sizes': (100,), 'learning_rate': 'constant', 'solver': 'adam'}
Best minimal total error: 7.316029448237414
Best average error per season: 0.91




In [None]:
test_error = round(get_error(grid_search.best_estimator_, new_X_test_no_nans, y_test_no_nans)/test_seasons, 2)
print(f"Average error per season on test set: {test_error}")

Average error per season on test set: 5.99


In [None]:
if test_error < best_estimator_feature_selector["error"]:
  best_estimator_feature_selector["error"] = test_error
  best_estimator_feature_selector["estimator"] = estimator
  best_estimator_feature_selector["feature_selector"] = sfs
  best_estimator_feature_selector["supports_na"] = False

## Exporting Best estimator

Let's now, as explained before, store the best estimator (+ feature selection) to use it in "production".

But first, we will define a new training set comprising all of the years for which we know the real labels. The training and testing sets defined previously were meant to give insight into which model to use, so, in order to actually evaluate the models, we needed a test set for which we knew the real labels. Now, since the next step is to query the model about genuinely unknown labels (the final purpose of the project), we can use all the data with known labels to retrain the selected model one last time before "going into production".

In [None]:
new_train_rows = dataset[dataset['year'] <= 9].copy()

if best_estimator_feature_selector["supports_na"] == False:
  new_train_rows = new_train_rows.dropna()

new_train_data = prepare_table_for_model(new_train_rows)
new_X_train, new_y_train = getXY(new_train_data)

selector = best_estimator_feature_selector["feature_selector"]
new_X_train = selector.transform(new_X_train)

estimator = best_estimator_feature_selector["estimator"]
estimator.fit(new_X_train, new_y_train)

best_estimator_feature_selector["estimator"] = estimator
print(estimator)

DecisionTreeClassifier(max_depth=9)
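
Calling `fit` again on the stored estimator retrains it in place. If we wanted to keep the object that was evaluated on the test set untouched, one option would be to retrain a copy made with `sklearn.base.clone`, which carries over the hyperparameters but not the fitted state. The sketch below uses synthetic data, and the names `tuned` and `fresh` are only for this illustration.

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X_toy, y_toy = make_classification(n_samples=80, n_features=6, random_state=0)

# Stand-in for the estimator selected earlier, fitted on part of the data
tuned = DecisionTreeClassifier(max_depth=9).fit(X_toy[:60], y_toy[:60])

fresh = clone(tuned)                        # same hyperparameters, no fitted state
was_unfitted = not hasattr(fresh, "tree_")  # tree_ only exists after fitting

fresh.fit(X_toy, y_toy)                     # retrain the copy on all labelled rows
print(fresh)
```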


In [None]:
import pickle

with open("/content/drive/Shareddrives/ML 2024/best_estimator.pkl", "wb") as f:
  pickle.dump(best_estimator_feature_selector, f, pickle.HIGHEST_PROTOCOL)
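
For reference, here is how the stored dictionary might be used on the query side: load it, apply the saved feature selector, then ask the estimator for probabilities. This is a sketch on synthetic data with an in-memory round trip instead of the Drive file; all `toy_` names and `X_query` are assumptions for illustration.

```python
import pickle

from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.tree import DecisionTreeClassifier

X_toy, y_toy = make_classification(n_samples=60, n_features=8, random_state=0)

# Fit a selector and an estimator, mirroring the structure saved above
toy_selector = SequentialFeatureSelector(
    DecisionTreeClassifier(random_state=0), cv=3
).fit(X_toy, y_toy)
toy_estimator = DecisionTreeClassifier(max_depth=9, random_state=0).fit(
    toy_selector.transform(X_toy), y_toy
)

bundle = {
    "error": 3.0,
    "estimator": toy_estimator,
    "feature_selector": toy_selector,
    "supports_na": False,
}

# In-memory round trip; with a file, pickle.dump / pickle.load work the same way
payload = pickle.dumps(bundle, pickle.HIGHEST_PROTOCOL)
restored = pickle.loads(payload)

X_query = X_toy[:5]   # stand-in for the real query set
X_query_selected = restored["feature_selector"].transform(X_query)
probabilities = restored["estimator"].predict_proba(X_query_selected)[:, 1]
print(probabilities)
```

One caveat: pickle stores plain Python functions by reference, so a selector created with `scoring=scorer` will need a function with that name to be defined in the environment where the file is loaded.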