**This is a competition notebook with the goal of obtaining the highest possible rank on the public leaderboard.**

Marko Marfat

In [None]:
import pandas as pd
import numpy as np

from sklearn.preprocessing import StandardScaler

from xgboost import XGBClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

from sklearn.model_selection import TimeSeriesSplit
from sklearn.model_selection import cross_val_score

from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import mutual_info_classif
from sklearn.feature_selection import f_classif
from sklearn.feature_selection import chi2

import matplotlib.pyplot as plt
from sklearn.model_selection import RandomizedSearchCV

**Loading the data**

In [None]:
train_data = pd.read_csv('../input/dapprojekt22/train.csv')
test_data = pd.read_csv('../input/dapprojekt22/test.csv')

train_data.head()

**Removing data with NA values**

Fortunately, the only NA values in this dataset occur in columns that consist entirely of NA values. This makes the decision simple: those columns are dropped from both the training and the test set.

In [None]:
train_data = train_data.dropna(axis=1)
test_data = test_data.dropna(axis=1)
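
The claim that NA values only occur in fully empty columns can be checked before dropping anything. The sketch below is a minimal illustration on a toy dataframe; the helper `partially_missing_columns` and the column `EMPTY_COL` are hypothetical names, not part of the competition data.

```python
import numpy as np
import pandas as pd

def partially_missing_columns(df):
    """Return columns that contain some, but not only, NA values (hypothetical helper)."""
    na_share = df.isna().mean()  # fraction of NA values per column
    return na_share[(na_share > 0) & (na_share < 1)].index.tolist()

# Toy data: one complete column and one column that is entirely NA
toy = pd.DataFrame({
    "PTS_HOME": [101, 95, 110],
    "EMPTY_COL": [np.nan, np.nan, np.nan],
})

print(partially_missing_columns(toy))       # an empty list means dropna(axis=1) only removes all-NA columns
print(toy.dropna(axis=1).columns.tolist())
```

If the helper returned any column names on the real data, `dropna(axis=1)` would also discard columns that still hold usable values.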

**Removing data with a uniform distribution**

All columns with a uniform (constant) distribution will be removed, since a column that holds a single value carries no information. Only identifiers will be kept, as they can later be used for joining dataframes and similar operations.

In [None]:
remove_uniform = ['PCT_AST_HOME', 'PCT_AST_AWAY', 'PCT_BLKA_HOME', 'PCT_BLKA_AWAY', 'PCT_BLK_HOME', 'PCT_BLK_AWAY', 'PCT_DREB_HOME', 'PCT_DREB_AWAY', 'PCT_FG3A_HOME', 
                  'PCT_FG3A_AWAY', 'PCT_FG3M_HOME', 'PCT_FG3M_AWAY', 'PCT_FGA_HOME', 'PCT_FGA_AWAY', 'PCT_FGM_HOME', 'PCT_FGM_AWAY', 'PCT_FTA_HOME', 'PCT_FTA_AWAY', 
                  'PCT_FTM_HOME', 'PCT_FTM_AWAY', 'PCT_OREB_HOME', 'PCT_OREB_AWAY', 'PCT_PFD_HOME', 'PCT_PFD_AWAY', 'PCT_PF_HOME', 'PCT_PF_AWAY', 'PCT_PTS_HOME', 
                  'PCT_PTS_AWAY', 'PCT_REB_HOME', 'PCT_REB_AWAY', 'PCT_STL_HOME', 'PCT_STL_AWAY', 'PCT_TOV_HOME', 'PCT_TOV_AWAY']

train_data = train_data.loc[:, ~train_data.columns.isin(remove_uniform)]
test_data = test_data.loc[:, ~test_data.columns.isin(remove_uniform)]
train_data.head(n=5)
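
The list above is written out by hand. As a rough sketch, constant columns could also be found programmatically by counting distinct values per column; the helper `constant_columns` and the toy dataframe are assumptions made for illustration.

```python
import pandas as pd

def constant_columns(df, keep=()):
    """Return columns with at most one distinct value, skipping the ones listed in `keep`."""
    n_unique = df.nunique(dropna=False)
    return [c for c in df.columns if n_unique[c] <= 1 and c not in keep]

# Toy data: an identifier, a constant column and a regular stat
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "PCT_PTS_HOME": [1.0, 1.0, 1.0],
    "PTS_HOME": [101, 95, 110],
})

print(constant_columns(toy, keep=("id",)))
```

The `keep` argument is there so that identifier columns survive even if they happen to be constant.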

**Adding a new column to signify which team won the current match**

This is done so that the correlation between the statistics and the outcome of the match can be determined.

In [None]:
# Encoding: 0 = the home team won the current match, 1 = the away team won
train_data.loc[train_data['PTS_HOME'] > train_data['PTS_AWAY'], 'CURRENT_WINNER'] = 0
train_data.loc[train_data['PTS_HOME'] < train_data['PTS_AWAY'], 'CURRENT_WINNER'] = 1

train_data.head(5)

**Determining the correlation**

The 50 variables with the highest absolute correlation with `CURRENT_WINNER` are identified. This gives an idea of which stats are most predictive of a match outcome.

In [None]:
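# Only numeric columns can be correlated; recent pandas versions raise an error on string columns
numeric_features = train_data.iloc[:, :-1].select_dtypes(include='number')
top50_corr_var = np.abs(numeric_features.corrwith(train_data['CURRENT_WINNER'])).sort_values(ascending=False).head(50)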
top50_corr_var
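
To make the ranking step concrete, here is a small sketch on invented numbers. The column `GAME_NO` and all values are assumptions for illustration only; the pattern (correlate every feature with the outcome, take the absolute value, sort) is the same as in the cell above.

```python
import pandas as pd

# Toy data: 0 = home win, 1 = away win (same encoding as CURRENT_WINNER)
toy = pd.DataFrame({
    "PTS_HOME": [110, 95, 120, 90, 105, 88],
    "PTS_AWAY": [100, 101, 99, 104, 98, 107],
    "GAME_NO": [1, 2, 3, 4, 5, 6],
    "CURRENT_WINNER": [0, 1, 0, 1, 0, 1],
})

features = toy.drop(columns="CURRENT_WINNER")

# Absolute Pearson correlation of every feature with the outcome, strongest first
ranking = features.corrwith(toy["CURRENT_WINNER"]).abs().sort_values(ascending=False)
top2 = ranking.head(2).index.tolist()
print(ranking)
```

Taking the absolute value matters because a strongly negative correlation (more home points, fewer away wins) is just as informative as a strongly positive one.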

**Reducing the dataset to only these 50 variables**

**Note:** The variables `TEAM_ABBREVIATION_HOME`, `TEAM_ABBREVIATION_AWAY`, `NEXT_HOME`, `NEXT_AWAY` and `NEXT_WINNER` will also be included, as they are needed for training the model.

In [None]:
new_vars = list(top50_corr_var.index)
new_vars.extend(['TEAM_ABBREVIATION_HOME', 'TEAM_ABBREVIATION_AWAY', 'NEXT_HOME', 'NEXT_AWAY', 'NEXT_WINNER'])

train_data = train_data[new_vars]

new_vars.remove('NEXT_WINNER')
new_vars.append('id')

test_data = test_data[new_vars]

train_data.head()

**Transforming the data**

The next step is to transform the data so that a meaningful classifier can be trained. Each team playing the next match will be described by its average stats over its last 5 games. These stats are split into home and away categories, since a team's performance depends on whether it plays in a familiar environment (at home) or an unfamiliar one (away).

The first step is to determine how many matches each team has played prior to its next match.

In [None]:
def match_counter(row, test):
    next_home = row['NEXT_HOME']
    next_away = row['NEXT_AWAY']
    
    past_matches = train_data.loc[0:row.name]
    
    if test:
        past_matches = test_data.loc[0:row.name]
        
    # HAH -> HOME AT HOME -> How many home matches has the 'NEXT_HOME' team played so far?
    # HAA -> HOME AT AWAY -> How many away matches has the 'NEXT_HOME' team played so far?
    # AAH -> AWAY AT HOME -> How many home matches has the 'NEXT_AWAY' team played so far?
    # AAA -> AWAY AT AWAY -> How many away matches has the 'NEXT_AWAY' team played so far?
        
    row['COUNT_HAH'] = len(past_matches.loc[past_matches.TEAM_ABBREVIATION_HOME == next_home])
    row['COUNT_HAA'] = len(past_matches.loc[past_matches.TEAM_ABBREVIATION_AWAY == next_home])
    row['COUNT_AAH'] = len(past_matches.loc[past_matches.TEAM_ABBREVIATION_HOME == next_away])
    row['COUNT_AAA'] = len(past_matches.loc[past_matches.TEAM_ABBREVIATION_AWAY == next_away])
    
    return row
    
train_data = train_data.apply(match_counter, test=False, axis=1)
test_data = test_data.apply(match_counter, test=True, axis=1)

Columns with total games played for NEXT_HOME and NEXT_AWAY will also be added because they may be used.

In [None]:
train_data['COUNT_HT'] = train_data['COUNT_HAH'] + train_data['COUNT_HAA']
train_data['COUNT_AT'] = train_data['COUNT_AAH'] + train_data['COUNT_AAA']

test_data['COUNT_HT'] = test_data['COUNT_HAH'] + test_data['COUNT_HAA']
test_data['COUNT_AT'] = test_data['COUNT_AAH'] + test_data['COUNT_AAA']

In [None]:
train_data.loc[150:160]

Now it's time to transform the actual stats into the explained format.

First, the mean of every feature (stat) is calculated, since these means serve as a fallback when there isn't enough history for a team (e.g. not enough games played).

In [None]:
train_feature_means = dict()
test_feature_means = dict()

def calculate_feature_means(column, test):
    if not test:
        train_feature_means[f"{column.name}"] = np.mean(column)
    else:
        test_feature_means[f"{column.name}"] = np.mean(column)
    
train_data.iloc[:, :-11].apply(calculate_feature_means, test=False, axis=0);
test_data.iloc[:, :-11].apply(calculate_feature_means, test=True, axis=0);

The next step is calculating the stats of the teams playing the next game. The requirement is to look at each team's last 5 games, with home and away games treated separately. If a team has fewer than 5 such games, the means from the previous step are used instead.

In [None]:
stats_vars = new_vars[:-5]

def calculate_stats(row, test):
    next_home = row['NEXT_HOME']
    next_away = row['NEXT_AWAY']
    
    matches_HAH = row['COUNT_HAH']
    matches_HAA = row['COUNT_HAA']
    matches_AAH = row['COUNT_AAH']
    matches_AAA = row['COUNT_AAA']
    
    past_matches = train_data.loc[0:row.name]
    mean_values = train_feature_means
    
    
    if test:
        past_matches = test_data.loc[0:row.name]
        mean_values = test_feature_means
    
    
    if matches_HAH >= 5:
        last_5_games = past_matches[past_matches.TEAM_ABBREVIATION_HOME == next_home].iloc[-5:]
        
        for stat in stats_vars:
            if stat.endswith('HOME'):
                row[stat + '_HAH'] = last_5_games[stat].mean()
            
    else:
        for stat in stats_vars:
            if stat.endswith('HOME'):
                row[stat + '_HAH'] = mean_values[stat]
        
    if matches_HAA >= 5:
        last_5_games = past_matches[past_matches.TEAM_ABBREVIATION_AWAY == next_home].iloc[-5:]
        
        for stat in stats_vars:
            if stat.endswith('AWAY'):
                row[stat + '_HAA'] = last_5_games[stat].mean()
            
    else:
        for stat in stats_vars:
            if stat.endswith('AWAY'):
                row[stat + '_HAA'] = mean_values[stat]
            
    if matches_AAH >= 5:
        last_5_games = past_matches[past_matches.TEAM_ABBREVIATION_HOME == next_away].iloc[-5:]
        
        for stat in stats_vars:
            if stat.endswith('HOME'):
                row[stat + '_AAH'] = last_5_games[stat].mean()
        
    else:
        for stat in stats_vars:
            if stat.endswith('HOME'):
                row[stat + '_AAH'] = mean_values[stat]
        
    if matches_AAA >= 5:
        last_5_games = past_matches[past_matches.TEAM_ABBREVIATION_AWAY == next_away].iloc[-5:]
        
        for stat in stats_vars:
            if stat.endswith('AWAY'):
                row[stat + '_AAA'] = last_5_games[stat].mean()
    else:
        for stat in stats_vars:
            if stat.endswith('AWAY'):
                row[stat + '_AAA'] = mean_values[stat]
        
    return row
    

train_data = train_data.apply(calculate_stats, test=False, axis=1)
test_data = test_data.apply(calculate_stats, test=True, axis=1)
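
The four branches in `calculate_stats` all follow the same rule: average the last 5 matching games, or fall back to the overall mean. The sketch below isolates that rule on toy data; the helper `last_n_mean` and the team codes `AAA` and `BBB` are hypothetical and not taken from the notebook.

```python
import pandas as pd

def last_n_mean(past, team_col, team, stat, fallback, n=5):
    """Mean of `stat` over the team's last `n` games in `past`, or `fallback` if it has fewer."""
    games = past.loc[past[team_col] == team, stat]
    if len(games) < n:
        return fallback
    return games.iloc[-n:].mean()

# Toy history in chronological order: AAA has 5 home games, BBB only 2
toy = pd.DataFrame({
    "TEAM_ABBREVIATION_HOME": ["AAA", "BBB", "AAA", "AAA", "BBB", "AAA", "AAA"],
    "PTS_HOME": [100, 90, 110, 120, 95, 130, 140],
})
fallback = toy["PTS_HOME"].mean()  # overall mean used when history is too short

pts_aaa = last_n_mean(toy, "TEAM_ABBREVIATION_HOME", "AAA", "PTS_HOME", fallback)
pts_bbb = last_n_mean(toy, "TEAM_ABBREVIATION_HOME", "BBB", "PTS_HOME", fallback)
print(pts_aaa, pts_bbb)
```

A helper like this could replace the repeated `if`/`else` blocks, at the cost of one function call per stat and category.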

In [None]:
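# Dropping the first 54 columns (50 raw stats + 4 team identifiers);
# NEXT_WINNER / id, the COUNT columns and the newly calculated stats remain
train_data = train_data.iloc[:, 54:]
test_data = test_data.iloc[:, 54:]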

Now we have the dataset with all the necessary transformations to start training the models.

In [None]:
train_data.loc[100:110]

**Normalising the data**

The next step before training the models is data normalisation. Standard scaling (zero mean, unit variance per feature) will be used.

In [None]:
# Removing COUNT columns as they aren't needed anymore
count_cols = ['COUNT_HAH', 'COUNT_HAA', 'COUNT_AAH', 'COUNT_AAA', 'COUNT_HT', 'COUNT_AT']
train_data = train_data.loc[:, ~train_data.columns.isin(count_cols)]
test_data = test_data.loc[:, ~test_data.columns.isin(count_cols)]

In [None]:
train_data.iloc[:, 1:] = StandardScaler().fit_transform(train_data.iloc[:, 1:])
test_data.iloc[:, 1:] = StandardScaler().fit_transform(test_data.iloc[:, 1:])
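
The cell above fits a separate scaler on the test set. A common alternative is to fit the scaler on the training data only and reuse its mean and standard deviation for the test data, so both sets share one scale. The sketch below shows that variant on invented numbers; it is an illustration, not the approach used for the submission.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrices (values are made up for illustration)
train_X = np.array([[100.0, 40.0], [110.0, 44.0], [90.0, 36.0]])
test_X = np.array([[105.0, 42.0]])

# Fit on the training data only, then apply the same transformation to both sets
scaler = StandardScaler().fit(train_X)
train_scaled = scaler.transform(train_X)
test_scaled = scaler.transform(test_X)

print(train_scaled.mean(axis=0))  # approximately zero for every feature
print(test_scaled)
```

Whether this changes the leaderboard score would have to be checked empirically.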

**Cross-validation function**

Let's test some models using cross-validation.

In [None]:
# Moving NEXT_WINNER and id to the end of the dataset
cols_train = train_data.columns.tolist()
cols_test = test_data.columns.tolist()

cols_train = cols_train[1:] + cols_train[:1]
cols_test = cols_test[1:] + cols_test[:1]

train_data = train_data[cols_train]
test_data = test_data[cols_test]

In [None]:
def cv_compare(data):
    models = [GaussianNB(), LogisticRegression(), RandomForestClassifier(), ExtraTreesClassifier(), XGBClassifier()]

    tss = TimeSeriesSplit(n_splits=5)
    
    X = data.iloc[:, :-1]
    y = data.iloc[:, -1:].squeeze(axis=1).ravel()
    
    rows = []
    
    for model in models:
        acc = cross_val_score(model, X, y, scoring='accuracy', cv = tss, n_jobs = -1)
        row = {'Algorithm': type(model).__name__, 'Fold 1': acc[0], 'Fold 2': acc[1], 'Fold 3': acc[2], 'Fold 4': acc[3], 'Fold 5': acc[4], 'Average': np.mean(acc)}
        rows.append(row)
    
    # DataFrame.append was removed in pandas 2.0, so the table is built from a list of rows
    table = pd.DataFrame(rows, columns = ["Algorithm", "Fold 1", "Fold 2", "Fold 3", "Fold 4", "Fold 5", "Average"])
    table.set_index('Algorithm', inplace=True)
    display(table)

In [None]:
cv_compare(train_data)
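
`TimeSeriesSplit` is used instead of a shuffled K-fold because the matches are ordered in time: every fold trains on earlier matches and validates on later ones. The short sketch below prints the folds for 12 placeholder samples to show this; the array `X` is just a stand-in for chronologically ordered matches.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 12 placeholder samples standing in for matches in chronological order
X = np.arange(12).reshape(-1, 1)

folds = list(TimeSeriesSplit(n_splits=5).split(X))

for train_idx, test_idx in folds:
    # The training window always ends before the validation window starts
    print("train:", train_idx, "validate:", test_idx)
```

The training window grows with every fold, which is why the accuracies of the early folds in the table are based on far fewer matches than the later ones.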

In [None]:
tss = TimeSeriesSplit(n_splits=5)
accuracies = []

for i in range(int(100 / 5)):
    selector = SelectKBest(score_func=f_classif, k=int((i+1)*5)).fit(train_data.iloc[:, :-1], train_data.iloc[:, -1:].squeeze(axis=1).ravel())
    cols = selector.get_support(indices=True)
    
    new_featureset = train_data.iloc[:,cols]
    
    X, y = new_featureset, train_data.iloc[:, -1:].squeeze(axis=1).ravel()
    accuracies.append(cross_val_score(ExtraTreesClassifier(), X, y, scoring='accuracy', cv = tss, n_jobs = -1))
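
In the loop above, `SelectKBest` is fitted on the full training set before cross-validation, so the validation folds influence which features are chosen. One possible way to avoid this is to wrap the selector and the model in a scikit-learn `Pipeline`, so the selection is refitted inside every fold. The sketch below uses synthetic data from `make_classification` as a stand-in for the real features; the parameter values are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the engineered features (an assumption for illustration)
X, y = make_classification(n_samples=300, n_features=20, n_informative=5, random_state=0)

# Feature selection becomes a pipeline step, so it only ever sees the training part of a fold
pipeline = Pipeline([
    ("select", SelectKBest(score_func=f_classif, k=10)),
    ("model", ExtraTreesClassifier(n_estimators=50, random_state=0)),
])

scores = cross_val_score(pipeline, X, y, scoring="accuracy", cv=TimeSeriesSplit(n_splits=5))
print(scores.mean())
```

With this setup the value of `k` can be varied in a loop exactly as before, and the resulting accuracies are not biased by the selection step.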
    

In [None]:
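all_accs = [np.mean(x) for x in accuracies]
ks = [(i + 1) * 5 for i in range(len(all_accs))]  # number of selected features per run

plt.plot(ks, all_accs)
plt.xlabel('Number of selected features (k)')
plt.ylabel('Mean CV accuracy')
plt.show()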

It seems that using ~40 variables works best for this model.

Let's see how the models perform when only these features are kept.

In [None]:
selector = SelectKBest(score_func=f_classif, k=int(40)).fit(train_data.iloc[:, :-1], train_data.iloc[:, -1:].squeeze(axis=1).ravel())
cols = selector.get_support(indices=True)
                       
# .copy() avoids pandas' SettingWithCopyWarning when the target column is added
new_featureset = train_data.iloc[:, cols].copy()
train_data_40 = new_featureset
train_data_40['NEXT_WINNER'] = train_data.iloc[:, -1]

In [None]:
# .copy() avoids pandas' SettingWithCopyWarning when the id column is added
test_data_40 = test_data.iloc[:, cols].copy()
test_data_40['id'] = test_data.iloc[:, -1]

In [None]:
cv_compare(train_data_40)

Let's try using RandomForest with hyperparameter tuning.

In [None]:
#n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
#max_features = ['auto', 'sqrt']
#max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
#max_depth.append(None)
#min_samples_split = [2, 5, 10]
#min_samples_leaf = [1, 2, 4]
#bootstrap = [True, False]
#random_grid = {'n_estimators': n_estimators,
#                'max_features': max_features,
#                'max_depth': max_depth,
#                'min_samples_split': min_samples_split,
#                'min_samples_leaf': min_samples_leaf,
#                'bootstrap': bootstrap}
#
#rf_classifier = RandomForestClassifier()
#rf_random = RandomizedSearchCV(estimator = rf_classifier, param_distributions = random_grid, n_iter = 100, cv = 3, verbose=2, random_state=42, n_jobs = -1)
#rf_random.fit(train_data_40.loc[:, train_data_40.columns != 'NEXT_WINNER'], train_data_40['NEXT_WINNER'])

In [None]:
#rf_random.best_params_

# best params {'n_estimators': 1000,
# 'min_samples_split': 5,
# 'min_samples_leaf': 2,
# 'max_features': 'sqrt',
# 'max_depth': 10,
# 'bootstrap': True}

In [None]:
model = RandomForestClassifier(n_estimators=1000, min_samples_split=5, min_samples_leaf=2, max_features='sqrt', max_depth=10, bootstrap=True)
model.fit(train_data_40.loc[:, train_data_40.columns != 'NEXT_WINNER'], train_data_40['NEXT_WINNER'])

predictions = model.predict(test_data_40.loc[:, test_data_40.columns != 'id'])

In [None]:
# Copy the id column so that adding the predictions does not modify a view of test_data
submission = test_data[['id']].copy()
submission['NEXT_WINNER'] = predictions

submission.to_csv("submission.csv", index=False)
submission.head()