# Predictions

This notebook serves the final purpose of the project: determining the probability of each team qualifying for the playoffs in the final season.

## Previous notebook: Model Exploration

Recalling the previous notebook, we tested several candidate classifiers. For each of them, we determined the best feature selector and combination of hyper-parameters. At each iteration, we saved the best estimator found so far (according to its performance on a held-out test set) along with its associated feature selector, re-trained the estimator on all the dataset rows with known labels, and reshaped the final query set according to the selector.

## Importing the estimator

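First, we need to redefine the functions that the imported object refers to. The pickled feature selector was built in the other notebook with `scorer` as its scoring function, and pickle stores functions by reference rather than by value, so `scorer` and the helpers it calls must exist in this notebook before we load the file.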

In [None]:
def get_predictions(estimator, X):
  return [i[1] for i in estimator.predict_proba(X)]   # i[1] because each prediction is an array [prob_not_qualifying, prob_qualifying]

def get_error(estimator, X, labels):
  predictions = get_predictions(estimator, X)

  # sum of absolute differences between predicted probabilities and true labels
  return sum(abs(p - label) for p, label in zip(predictions, labels))

def scorer(estimator, X, y):
  return -get_error(estimator, X, y)   # negated so that a larger error yields a lower score
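
To make the scoring convention concrete, here is a minimal sketch that runs the same logic on a toy case. `StubEstimator` and the toy values are hypothetical and exist only for illustration; the three functions are compact restatements of the ones above, so the block can run on its own.

```python
class StubEstimator:
    """Hypothetical stand-in that exposes only predict_proba."""

    def __init__(self, probabilities):
        self.probabilities = probabilities

    def predict_proba(self, X):
        # Each row is [prob_not_qualifying, prob_qualifying]; X is ignored.
        return [[1 - p, p] for p in self.probabilities]


def get_predictions(estimator, X):
    return [row[1] for row in estimator.predict_proba(X)]


def get_error(estimator, X, labels):
    predictions = get_predictions(estimator, X)
    return sum(abs(p - label) for p, label in zip(predictions, labels))


def scorer(estimator, X, y):
    return -get_error(estimator, X, y)


stub = StubEstimator([0.75, 0.25, 1.0])
X_toy = [[0], [0], [0]]   # feature values are irrelevant for the stub
y_toy = [1, 0, 0]         # toy labels: 1 = qualified, 0 = did not qualify

toy_error = get_error(stub, X_toy, y_toy)   # 0.25 + 0.25 + 1.0
toy_score = scorer(stub, X_toy, y_toy)
print(toy_error, toy_score)
```

A confident wrong prediction (the third row) contributes the maximum error of 1 on its own, which is why the score penalises overconfident estimators.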

Let's now import the estimator and feature selector.

In [None]:
from google.colab import drive

drive.mount('/content/drive', force_remount=True)

import pickle

with open("/content/drive/Shareddrives/ML 2024/best_estimator.pkl", "rb") as f:
  estimator_selector = pickle.load(f)

estimator = estimator_selector["estimator"]
selector = estimator_selector["feature_selector"]
supports_na = estimator_selector["supports_na"]

print(estimator_selector)

Mounted at /content/drive
{'error': 3.0, 'estimator': DecisionTreeClassifier(max_depth=9), 'feature_selector': SequentialFeatureSelector(estimator=DecisionTreeClassifier(), n_jobs=-1,
                          scoring=<function scorer at 0x7c112925e680>), 'supports_na': False}
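
The printed dictionary shows the keys this notebook relies on. As a hedged sketch, the loading step could check for those keys before using them. The helper `load_bundle`, the constant `REQUIRED_KEYS`, the temporary file, and the placeholder values are all hypothetical; the real file holds fitted scikit-learn objects rather than strings.

```python
import os
import pickle
import tempfile

# Keys this notebook reads from the pickled dictionary.
REQUIRED_KEYS = {"estimator", "feature_selector", "supports_na"}


def load_bundle(path):
    """Load a pickled dictionary and fail early if an expected key is absent."""
    with open(path, "rb") as f:
        bundle = pickle.load(f)
    missing = REQUIRED_KEYS - set(bundle)
    if missing:
        raise KeyError(f"bundle is missing keys: {sorted(missing)}")
    return bundle


# Placeholder bundle mirroring the structure printed above.
toy_bundle = {
    "error": 3.0,
    "estimator": "placeholder estimator",
    "feature_selector": "placeholder selector",
    "supports_na": False,
}

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy_bundle.pkl")
    with open(toy_path, "wb") as f:
        pickle.dump(toy_bundle, f)
    loaded = load_bundle(toy_path)

print(sorted(loaded))
```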


## Loading the query data

We need to load the rows that we will query the estimator with, i.e. the data we want to predict on.

In [None]:
import pandas as pd

dataset = pd.read_csv("/content/drive/Shareddrives/ML 2024/tables/dataset.csv")
dataset.designation = 'dataset'

Let's define a function that prepares a table to serve as a query set by removing the non-numeric columns (`franchID` and `tmID`). Their values are returned alongside the cleaned table so that they can be used as keys later.

In [None]:
def prepare_table_for_model(table):
  franchIDs = table['franchID'].values
  tmIDs = table['tmID'].values

  ret_table = table.copy()

  del ret_table['franchID']
  del ret_table['tmID']

  return (franchIDs, tmIDs, ret_table)

def getXY(data):
  X = data.drop(columns=['label', 'year']).values
  y = data['label'].values

  return (X, y)

The final rows, over which the predictions will be made, have no labels, because the real outcomes are not yet known. We therefore keep only the X set for the query year.

In [None]:
query_rows = dataset[dataset['year'] == 10].copy()   # year 10 is the query year
if not supports_na:
  query_rows.fillna(0, inplace=True)   # the estimator cannot handle missing values

query_franchIDs, query_tmIDs, query_data = prepare_table_for_model(query_rows)

X_query, _ = getXY(query_data)

## Feature selection

Let's apply the imported feature selector to our query set.

In [None]:
X_query = selector.transform(X_query)
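
Conceptually, a fitted scikit-learn selector holds a boolean mask over the input columns (available through `get_support()`), and `transform` keeps only the columns marked `True`. The plain-Python sketch below illustrates that idea on toy data; `apply_support_mask` and the toy values are hypothetical and are not part of the project code.

```python
def apply_support_mask(rows, support):
    """Keep only the columns whose mask entry is True."""
    return [
        [value for value, keep in zip(row, support) if keep]
        for row in rows
    ]


toy_rows = [[10, 0.5, 3], [12, 0.7, 1]]   # two rows, three features
toy_support = [True, False, True]         # the second feature was not selected

reduced = apply_support_mask(toy_rows, toy_support)
print(reduced)
```

Because the mask is positional, the query columns must be in the same order as the columns the selector was fitted on.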

## Querying the model

The moment has arrived. Let's get our results.

We first need to define a function that presents them in the intended format: one row per team, with its probability of qualifying.

In [None]:
def getResults(estimator, X, franchIDs, tmIDs):
  y_prob = get_predictions(estimator, X)

  df = pd.DataFrame(columns=['franchID', 'tmID', 'prob'])
  df['franchID'] = franchIDs
  df['tmID'] = tmIDs
  df['prob'] = y_prob

  # relabel tmID DET as TUL so that the output reflects reality
  df.loc[df['tmID'] == 'DET', 'tmID'] = 'TUL'

  # rename prob to Playoff
  df.rename(columns={'prob': 'Playoff'}, inplace=True)

  return df.sort_values(by=['tmID'])

In [None]:
results = getResults(estimator, X_query, query_franchIDs, query_tmIDs)
print(results)

   franchID tmID  Playoff
2       ATL  ATL      0.0
8       CHI  CHI      0.0
10      CON  CON      1.0
0       IND  IND      1.0
5       LAS  LAS      1.0
9       MIN  MIN      0.0
11      NYL  NYL      1.0
1       PHO  PHO      1.0
6       SAS  SAS      1.0
3       SEA  SEA      1.0
4       DET  TUL      1.0
7       WAS  WAS      0.0


Let's now export our results, formatting the probabilities with two decimal places.

In [None]:
results[['tmID', 'Playoff']].to_csv("/content/drive/Shareddrives/ML 2024/results.csv", float_format='%.2f', index=False)
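
For reference, `'%.2f'` is a printf-style format, so each probability is written with exactly two decimal places. The sketch below reproduces that effect with the standard `csv` module instead of pandas; the in-memory buffer and the `XYZ` row are hypothetical and serve only to show rounding on a non-trivial value.

```python
import csv
import io

# (tmID, probability) pairs; "XYZ" is a made-up team used to show rounding.
toy_results = [("ATL", 0.0), ("CON", 1.0), ("XYZ", 2 / 3)]

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["tmID", "Playoff"])
for tm_id, prob in toy_results:
    # Apply the same format string that float_format receives above.
    writer.writerow([tm_id, "%.2f" % prob])

csv_text = buffer.getvalue()
print(csv_text)
```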