# NCAA 2020 

This unfinished script was going to be my submission to the NCAA March Madness 2020 competition before it was cancelled due to the coronavirus.  This competition has been run for the last several years.  Code taken from public notebooks is acknowledged throughout.

## Objective

The objective of the competition is to predict the probability a particular basketball team will beat another.  The predictions were to be evaluated on the results of the March Madness competition set to take place in 2020, which was cancelled due to the coronavirus.  

Predictions were to be made for every possible pairing of teams in the tournament, before any of the games took place.  Only the predictions for games actually played were to be evaluated.  The metric used in this competition is log loss, which heavily penalises predictions that are very confident and wrong.
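
To illustrate the penalty, here is a minimal sketch of the log loss contribution of a single prediction.  The helper `log_loss_single` and the probabilities are made up for illustration; they are not part of the competition code.

```python
import math

def log_loss_single(y_true, p):
    """Log loss contribution of one predicted probability p for a binary outcome."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

# In all three cases the favoured team lost (y_true = 0).
confident_wrong = log_loss_single(0, 0.99)  # about 4.61
hedged_wrong = log_loss_single(0, 0.70)     # about 1.20
coin_flip = log_loss_single(0, 0.50)        # about 0.69
print(confident_wrong, hedged_wrong, coin_flip)
```

A single confident miss costs roughly as much as several coin-flip predictions, which is why upsets matter so much under this metric.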

## Model 

The model initially used was LightGBM, a gradient boosting library.  It was chosen for its predictive accuracy, training speed and ease of use.  In particular, following the Data Science Bowl competition, I wanted to test the permutation importance method of feature selection on this dataset.  Information about permutation importance is available here:

https://academic.oup.com/bioinformatics/article/26/10/1340/193348
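
As a rough sketch of the idea, the snippet below shuffles one feature column at a time and measures how much the log loss gets worse.  This is the common column-shuffling form; the linked paper and the approach I had in mind for this script may differ in detail.  The names `shuffle_importance` and `toy_predict` are hypothetical, and `toy_predict` stands in for a fitted model.

```python
import numpy as np

def binary_log_loss(y, p, eps=1e-15):
    p = np.clip(p, eps, 1 - eps)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

def shuffle_importance(predict, X, y, n_repeats=10, seed=0):
    """Mean increase in log loss when each column of X is shuffled in turn."""
    rng = np.random.default_rng(seed)
    baseline = binary_log_loss(y, predict(X))
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        for _ in range(n_repeats):
            X_perm = X.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            importances[j] += binary_log_loss(y, predict(X_perm)) - baseline
    return importances / n_repeats

# Toy data: the outcome depends on column 0 only; column 1 is pure noise.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 2))
y = (rng.random(500) < 1 / (1 + np.exp(-3 * X[:, 0]))).astype(float)

# Hypothetical stand-in for a trained model: uses column 0, ignores column 1.
def toy_predict(X):
    return 1 / (1 + np.exp(-3 * X[:, 0]))

scores = shuffle_importance(toy_predict, X, y)
print(scores)
```

Features whose score stays near zero are candidates for removal.  With few samples the scores become noisy, so more repeats or a stricter threshold may be needed.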

For the initial stages of the competition I planned to use holdout sets, each created by taking an entire year's results out of the training set for validation.  Previous years' competition results seemed to indicate two types of outcome.  In the first, every team that was heavily favoured to win actually won, and the best models were trained using standard techniques.  In the second, teams heavily favoured to lose actually won, and models trained to be resilient to outlying predictions came out on top.  My plan was to create one model of each type, since two final submissions are allowed.
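
A season-level holdout can be sketched as below.  The toy frame and its values are made up; only the `Season` column name and the choice of 2019 as the held-out year mirror the code later in this notebook.

```python
import pandas as pd

# Toy frame standing in for the training set; the values are invented.
games = pd.DataFrame({
    'Season': [2017, 2017, 2018, 2018, 2019, 2019],
    'Seed_diff': [-15, 3, -7, 1, -12, 5],
    'result': [1, 0, 1, 1, 1, 0],
})

holdout_season = 2019
holdout = games[games['Season'] == holdout_season]
train = games[games['Season'] != holdout_season]

print(len(train), len(holdout))
```

Holding out a whole season, rather than random rows, keeps games from the same tournament out of both sets at once.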

## Features

This script creates hundreds of features, and then attempts to use permutation importance to select the important ones.  There don't seem to be enough samples for permutation importance to work with this number of features.  This problem was never rectified because the competition was cancelled.  

## Post-processing

This was an important tool in prior years.  I experimented with a few different methods.  The results are summarised here:

https://www.kaggle.com/jarnel/clipping-spline-experiment-on-test-predictions
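
As a small illustration of the clipping approach, the sketch below compares the log loss of raw and clipped predictions.  The labels, predictions and clip bounds are made up; in practice the bounds would be tuned on a holdout set.

```python
import numpy as np

def binary_log_loss(y, p):
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

# Invented predictions: four confident and correct, one confident upset.
labels = np.array([1, 1, 0, 0, 1])
preds = np.array([0.98, 0.97, 0.03, 0.02, 0.01])

raw = binary_log_loss(labels, preds)
clipped = binary_log_loss(labels, np.clip(preds, 0.05, 0.95))
print(raw, clipped)
```

Clipping gives up a little on every correct confident prediction in exchange for a smaller penalty on the upset, so it only helps in years where upsets actually occur.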

## Retrospective

Following from the retrospective on my Data Science Bowl competition submission, my version control was again lacking; however, I did include notes on what I intended to work on next.  Comparing my holdout set results with the private leaderboards of previous years suggests this submission may have placed within the top 5%.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import tqdm as tqdm
import matplotlib.gridspec as gridspec
import seaborn as sns

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory
from sklearn.utils import shuffle
from sklearn.model_selection import KFold, GroupKFold
from sklearn.metrics import log_loss, mean_absolute_error, mean_squared_error
import lightgbm as lgb
import matplotlib.pyplot as plt
import os
import glob
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
# Any results you write to the current directory are saved as output.

## Feature Extraction

In [None]:
# There are many files for this competition.  Most of this block is taken from a public kernel - it creates a dictionary containing all the data files.

data_dict = {}
for i in glob.glob('/kaggle/input/google-cloud-ncaa-march-madness-2020-division-1-mens-tournament/MDataFiles_Stage1/*'):
    name = i.split('/')[-1].split('.')[0]
    if name != 'MTeamSpellings':
        data_dict[name] = pd.read_csv(i)
    else:
        data_dict[name] = pd.read_csv(i, encoding='cp1252')
data_dict.keys()

In [None]:
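# Create some score difference features.
# Note: the mean, median and variance below are taken over all games, so each is a single constant repeated on every row.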

season_result = data_dict['MRegularSeasonDetailedResults']
rankings = data_dict['MMasseyOrdinals']

season_result['Score_difference_last'] = season_result['WScore'] - season_result['LScore']
season_result['score_diff_mean'] = season_result['Score_difference_last'].mean()
season_result['score_diff_med'] = season_result['Score_difference_last'].median()
season_result['score_diff_var'] = season_result['Score_difference_last'].var()

In [None]:
# The training set size can be doubled by switching the teams.  Predictions are the probability that team 1 beats team 2, so swapping team 1 and team 2 creates an extra sample with the opposite label.

# .copy() so the assignments and in-place renames below act on independent frames rather than slices of season_result.
season_win_result = season_result[['Season', 'DayNum', 'LTeamID', 'WTeamID', 'WScore', 'WFGM', 'WFGA', 'WFGM3', 'WFGA3', 'WFTM', 'WFTA', 'WOR', 'WDR',
                                  'WAst', 'WTO', 'WStl', 'WBlk', 'WPF', 'Score_difference_last', 'score_diff_mean', 'score_diff_med', 'score_diff_var']].copy()
season_lose_result = season_result[['Season', 'DayNum', 'LTeamID', 'WTeamID', 'LScore', 'LFGM', 'LFGA', 'LFGM3', 'LFGA3', 'LFTM', 'LFTA', 'LOR', 'LDR',
                                   'LAst', 'LTO', 'LStl', 'LBlk', 'LPF', 'Score_difference_last', 'score_diff_mean', 'score_diff_med', 'score_diff_var']].copy()
season_lose_result['result'] = 0
season_win_result['result'] = 1
season_win_result.rename(columns={'WTeamID':'TeamID1', 'LTeamID':'TeamID2', 'WScore':'Score', 'WFGM':'FGM', 'WFGA':'FGA', 'WFGM3':'FGM3', 'WFGA3':'FGA3',
                                  'WFTM':'FTM', 'WFTA':'FTA', 'WOR':'OR', 'WDR':'DR', 'WAst':'Ast', 'WTO':'TO', 'WStl':'Stl',
                                  'WBlk':'Blk', 'WPF':'PF'}, inplace=True)
season_lose_result.rename(columns={'LTeamID':'TeamID1', 'WTeamID':'TeamID2', 'LScore':'Score', 'LFGM':'FGM', 'LFGA':'FGA', 'LFGM3':'FGM3', 'LFGA3':'FGA3',
                                  'LFTM':'FTM', 'LFTA':'FTA', 'LOR':'OR', 'LDR':'DR', 'LAst':'Ast', 'LTO':'TO', 'LStl':'Stl',
                                  'LBlk':'Blk', 'LPF':'PF'}, inplace=True)
season_lose_result['Score_difference_last'] = -season_lose_result['Score_difference_last']
season_lose_result['score_diff_mean'] = -season_lose_result['score_diff_mean']
season_lose_result['score_diff_med'] = -season_lose_result['score_diff_med']
season_lose_result['score_diff_var'] = -season_lose_result['score_diff_var']
season_result = pd.concat((season_win_result, season_lose_result)).reset_index(drop=True)
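
The doubling step in the cell above can be seen on a toy example.  The two games below are made up, and `Score_diff` is an illustrative column name rather than one used in this notebook.

```python
import pandas as pd

# Two invented games in the winner/loser layout of the results files.
games = pd.DataFrame({
    'WTeamID': [1101, 1203], 'LTeamID': [1102, 1101],
    'WScore': [70, 65], 'LScore': [60, 64],
})

# Winner's perspective: team 1 is the winner, so result = 1.
win_view = games.rename(columns={'WTeamID': 'TeamID1', 'LTeamID': 'TeamID2'})
win_view['Score_diff'] = win_view['WScore'] - win_view['LScore']
win_view['result'] = 1

# Loser's perspective: swap the teams, negate the margin, result = 0.
lose_view = games.rename(columns={'LTeamID': 'TeamID1', 'WTeamID': 'TeamID2'})
lose_view['Score_diff'] = lose_view['LScore'] - lose_view['WScore']
lose_view['result'] = 0

cols = ['TeamID1', 'TeamID2', 'Score_diff', 'result']
doubled = pd.concat([win_view[cols], lose_view[cols]]).reset_index(drop=True)
print(doubled)
```

Every game now appears once from each side, so the labels are balanced and the score differences sum to zero.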

In [None]:
# Creates a seed difference feature - the most important feature in every model I trained.  The seed is the ranking of the team going into the competition - 
# a large magnitude negative seed difference suggests a high probability of victory. 

tourney_result = data_dict['MNCAATourneyDetailedResults']
tourney_seed = data_dict['MNCAATourneySeeds']
tourney_result['Score_difference'] = tourney_result['WScore'] - tourney_result['LScore']

tourney_result = tourney_result[['Season', 'WTeamID', 'LTeamID', 'Score_difference', 'DayNum']]

tourney_result = pd.merge(tourney_result, tourney_seed, left_on=['Season', 'WTeamID'], right_on=['Season', 'TeamID'], how='left')
tourney_result.rename(columns={'Seed':'WSeed'}, inplace=True)
tourney_result = tourney_result.drop('TeamID', axis=1)
tourney_result = pd.merge(tourney_result, tourney_seed, left_on=['Season', 'LTeamID'], right_on=['Season', 'TeamID'], how='left')
tourney_result.rename(columns={'Seed':'LSeed'}, inplace=True)
tourney_result = tourney_result.drop('TeamID', axis=1)
tourney_result['WSeed'] = tourney_result['WSeed'].apply(lambda x: int(x[1:3]))
tourney_result['LSeed'] = tourney_result['LSeed'].apply(lambda x: int(x[1:3]))

test_df = pd.read_csv('/kaggle/input/google-cloud-ncaa-march-madness-2020-division-1-mens-tournament/MSampleSubmissionStage1_2020.csv')
test_df['Season'] = test_df['ID'].map(lambda x: int(x[:4]))
test_df['WTeamID'] = test_df['ID'].map(lambda x: int(x[5:9]))
test_df['LTeamID'] = test_df['ID'].map(lambda x: int(x[10:14]))

test_df = pd.merge(test_df, tourney_seed, left_on=['Season', 'WTeamID'], right_on=['Season', 'TeamID'], how='left')
test_df.rename(columns={'Seed':'Seed1'}, inplace=True)
test_df = test_df.drop('TeamID', axis=1)
test_df = pd.merge(test_df, tourney_seed, left_on=['Season', 'LTeamID'], right_on=['Season', 'TeamID'], how='left')
test_df.rename(columns={'Seed':'Seed2'}, inplace=True)
test_df = test_df.drop('TeamID', axis=1)
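
The string slicing in the cell above relies on fixed-width fields.  The sketch below shows the same parsing as two small helpers; `parse_seed` and `parse_matchup_id` are hypothetical names, and the example strings assume seeds of the form region letter, two digits and an optional suffix, and IDs of the form `Season_Team1_Team2`.

```python
def parse_seed(seed):
    """'W01' -> 1, 'X16a' -> 16 (region letter, two-digit seed, optional suffix)."""
    return int(seed[1:3])

def parse_matchup_id(matchup_id):
    """'2015_1107_1112' -> (2015, 1107, 1112)."""
    season, team1, team2 = matchup_id.split('_')
    return int(season), int(team1), int(team2)

print(parse_seed('W01'), parse_seed('X16a'))
print(parse_matchup_id('2015_1107_1112'))
```

Splitting on the underscore avoids depending on exact character positions, which may be safer if the ID widths ever change.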

In [None]:
# Summary statistics for the last 14 days of regular season games.

last_14_days = season_result.loc[season_result['DayNum'] >= 118].reset_index(drop=True)
for col in [x for x in last_14_days.columns if x not in ['Score_difference_last', 'TeamID1', 'TeamID2', 'Season', 'DayNum']]:
    season_result_map_mean = last_14_days.groupby(['Season', 'TeamID1'])[col].mean().reset_index()
    season_result_map_var = last_14_days.groupby(['Season', 'TeamID1'])[col].var().reset_index()
    season_result_map_last = last_14_days.groupby(['Season', 'TeamID1'])[col].last().reset_index()
    season_result_map_min = last_14_days.groupby(['Season', 'TeamID1'])[col].min().reset_index()
    season_result_map_max = last_14_days.groupby(['Season', 'TeamID1'])[col].max().reset_index()
    season_result_map_med = last_14_days.groupby(['Season', 'TeamID1'])[col].median().reset_index()

    tourney_result = pd.merge(tourney_result, season_result_map_mean, left_on=['Season', 'WTeamID'], right_on=['Season', 'TeamID1'], how='left')
    tourney_result.rename(columns={f'{col}':f'W{col}14dMeanT'}, inplace=True)
    tourney_result = tourney_result.drop('TeamID1', axis=1)
    tourney_result = pd.merge(tourney_result, season_result_map_mean, left_on=['Season', 'LTeamID'], right_on=['Season', 'TeamID1'], how='left')
    tourney_result.rename(columns={f'{col}':f'L{col}14dMeanT'}, inplace=True)
    tourney_result = tourney_result.drop('TeamID1', axis=1)
    
    tourney_result = pd.merge(tourney_result, season_result_map_var, left_on=['Season', 'WTeamID'], right_on=['Season', 'TeamID1'], how='left')
    tourney_result.rename(columns={f'{col}':f'W{col}14dVarT'}, inplace=True)
    tourney_result = tourney_result.drop('TeamID1', axis=1)
    tourney_result = pd.merge(tourney_result, season_result_map_var, left_on=['Season', 'LTeamID'], right_on=['Season', 'TeamID1'], how='left')
    tourney_result.rename(columns={f'{col}':f'L{col}14dVarT'}, inplace=True)
    tourney_result = tourney_result.drop('TeamID1', axis=1)
        
    tourney_result = pd.merge(tourney_result, season_result_map_last, left_on=['Season', 'WTeamID'], right_on=['Season', 'TeamID1'], how='left')
    tourney_result.rename(columns={f'{col}':f'W{col}14dLastT'}, inplace=True)
    tourney_result = tourney_result.drop('TeamID1', axis=1)
    tourney_result = pd.merge(tourney_result, season_result_map_last, left_on=['Season', 'LTeamID'], right_on=['Season', 'TeamID1'], how='left')
    tourney_result.rename(columns={f'{col}':f'L{col}14dLastT'}, inplace=True)
    tourney_result = tourney_result.drop('TeamID1', axis=1)
    
    tourney_result = pd.merge(tourney_result, season_result_map_min, left_on=['Season', 'WTeamID'], right_on=['Season', 'TeamID1'], how='left')
    tourney_result.rename(columns={f'{col}':f'W{col}14dMinT'}, inplace=True)
    tourney_result = tourney_result.drop('TeamID1', axis=1)
    tourney_result = pd.merge(tourney_result, season_result_map_min, left_on=['Season', 'LTeamID'], right_on=['Season', 'TeamID1'], how='left')
    tourney_result.rename(columns={f'{col}':f'L{col}14dMinT'}, inplace=True)
    tourney_result = tourney_result.drop('TeamID1', axis=1)
    
    tourney_result = pd.merge(tourney_result, season_result_map_max, left_on=['Season', 'WTeamID'], right_on=['Season', 'TeamID1'], how='left')
    tourney_result.rename(columns={f'{col}':f'W{col}14dMaxT'}, inplace=True)
    tourney_result = tourney_result.drop('TeamID1', axis=1)
    tourney_result = pd.merge(tourney_result, season_result_map_max, left_on=['Season', 'LTeamID'], right_on=['Season', 'TeamID1'], how='left')
    tourney_result.rename(columns={f'{col}':f'L{col}14dMaxT'}, inplace=True)
    tourney_result = tourney_result.drop('TeamID1', axis=1)
    
    tourney_result = pd.merge(tourney_result, season_result_map_med, left_on=['Season', 'WTeamID'], right_on=['Season', 'TeamID1'], how='left')
    tourney_result.rename(columns={f'{col}':f'W{col}14dMedT'}, inplace=True)
    tourney_result = tourney_result.drop('TeamID1', axis=1)
    tourney_result = pd.merge(tourney_result, season_result_map_med, left_on=['Season', 'LTeamID'], right_on=['Season', 'TeamID1'], how='left')
    tourney_result.rename(columns={f'{col}':f'L{col}14dMedT'}, inplace=True)
    tourney_result = tourney_result.drop('TeamID1', axis=1)
    
    test_df = pd.merge(test_df, season_result_map_mean, left_on=['Season', 'WTeamID'], right_on=['Season', 'TeamID1'], how='left')
    test_df.rename(columns={f'{col}':f'W{col}14dMeanT'}, inplace=True)
    test_df = test_df.drop('TeamID1', axis=1)
    test_df = pd.merge(test_df, season_result_map_mean, left_on=['Season', 'LTeamID'], right_on=['Season', 'TeamID1'], how='left')
    test_df.rename(columns={f'{col}':f'L{col}14dMeanT'}, inplace=True)
    test_df = test_df.drop('TeamID1', axis=1)

    test_df = pd.merge(test_df, season_result_map_var, left_on=['Season', 'WTeamID'], right_on=['Season', 'TeamID1'], how='left')
    test_df.rename(columns={f'{col}':f'W{col}14dVarT'}, inplace=True)
    test_df = test_df.drop('TeamID1', axis=1)
    test_df = pd.merge(test_df, season_result_map_var, left_on=['Season', 'LTeamID'], right_on=['Season', 'TeamID1'], how='left')
    test_df.rename(columns={f'{col}':f'L{col}14dVarT'}, inplace=True)
    test_df = test_df.drop('TeamID1', axis=1)
    
    test_df = pd.merge(test_df, season_result_map_last, left_on=['Season', 'WTeamID'], right_on=['Season', 'TeamID1'], how='left')
    test_df.rename(columns={f'{col}':f'W{col}14dLastT'}, inplace=True)
    test_df = test_df.drop('TeamID1', axis=1)
    test_df = pd.merge(test_df, season_result_map_last, left_on=['Season', 'LTeamID'], right_on=['Season', 'TeamID1'], how='left')
    test_df.rename(columns={f'{col}':f'L{col}14dLastT'}, inplace=True)
    test_df = test_df.drop('TeamID1', axis=1)
    
    test_df = pd.merge(test_df, season_result_map_min, left_on=['Season', 'WTeamID'], right_on=['Season', 'TeamID1'], how='left')
    test_df.rename(columns={f'{col}':f'W{col}14dMinT'}, inplace=True)
    test_df = test_df.drop('TeamID1', axis=1)
    test_df = pd.merge(test_df, season_result_map_min, left_on=['Season', 'LTeamID'], right_on=['Season', 'TeamID1'], how='left')
    test_df.rename(columns={f'{col}':f'L{col}14dMinT'}, inplace=True)
    test_df = test_df.drop('TeamID1', axis=1)
    
    test_df = pd.merge(test_df, season_result_map_max, left_on=['Season', 'WTeamID'], right_on=['Season', 'TeamID1'], how='left')
    test_df.rename(columns={f'{col}':f'W{col}14dMaxT'}, inplace=True)
    test_df = test_df.drop('TeamID1', axis=1)
    test_df = pd.merge(test_df, season_result_map_max, left_on=['Season', 'LTeamID'], right_on=['Season', 'TeamID1'], how='left')
    test_df.rename(columns={f'{col}':f'L{col}14dMaxT'}, inplace=True)
    test_df = test_df.drop('TeamID1', axis=1)
    
    test_df = pd.merge(test_df, season_result_map_med, left_on=['Season', 'WTeamID'], right_on=['Season', 'TeamID1'], how='left')
    test_df.rename(columns={f'{col}':f'W{col}14dMedT'}, inplace=True)
    test_df = test_df.drop('TeamID1', axis=1)
    test_df = pd.merge(test_df, season_result_map_med, left_on=['Season', 'LTeamID'], right_on=['Season', 'TeamID1'], how='left')
    test_df.rename(columns={f'{col}':f'L{col}14dMedT'}, inplace=True)
    test_df = test_df.drop('TeamID1', axis=1)

In [None]:
# Summary statistics over each team's full regular season, per season.

for col in [x for x in season_result.columns if x not in ['Score_difference_last', 'TeamID1', 'TeamID2', 'Season', 'DayNum']]:
    season_result_map_mean = season_result.groupby(['Season', 'TeamID1'])[col].mean().reset_index()
    season_result_map_var = season_result.groupby(['Season', 'TeamID1'])[col].var().reset_index()
    season_result_map_last = season_result.groupby(['Season', 'TeamID1'])[col].last().reset_index()
    season_result_map_min = season_result.groupby(['Season', 'TeamID1'])[col].min().reset_index()
    season_result_map_max = season_result.groupby(['Season', 'TeamID1'])[col].max().reset_index()
    
    tourney_result = pd.merge(tourney_result, season_result_map_mean, left_on=['Season', 'WTeamID'], right_on=['Season', 'TeamID1'], how='left')
    tourney_result.rename(columns={f'{col}':f'W{col}MeanT'}, inplace=True)
    tourney_result = tourney_result.drop('TeamID1', axis=1)
    tourney_result = pd.merge(tourney_result, season_result_map_mean, left_on=['Season', 'LTeamID'], right_on=['Season', 'TeamID1'], how='left')
    tourney_result.rename(columns={f'{col}':f'L{col}MeanT'}, inplace=True)
    tourney_result = tourney_result.drop('TeamID1', axis=1)
    
    tourney_result = pd.merge(tourney_result, season_result_map_var, left_on=['Season', 'WTeamID'], right_on=['Season', 'TeamID1'], how='left')
    tourney_result.rename(columns={f'{col}':f'W{col}VarT'}, inplace=True)
    tourney_result = tourney_result.drop('TeamID1', axis=1)
    tourney_result = pd.merge(tourney_result, season_result_map_var, left_on=['Season', 'LTeamID'], right_on=['Season', 'TeamID1'], how='left')
    tourney_result.rename(columns={f'{col}':f'L{col}VarT'}, inplace=True)
    tourney_result = tourney_result.drop('TeamID1', axis=1)
        
    tourney_result = pd.merge(tourney_result, season_result_map_last, left_on=['Season', 'WTeamID'], right_on=['Season', 'TeamID1'], how='left')
    tourney_result.rename(columns={f'{col}':f'W{col}LastT'}, inplace=True)
    tourney_result = tourney_result.drop('TeamID1', axis=1)
    tourney_result = pd.merge(tourney_result, season_result_map_last, left_on=['Season', 'LTeamID'], right_on=['Season', 'TeamID1'], how='left')
    tourney_result.rename(columns={f'{col}':f'L{col}LastT'}, inplace=True)
    tourney_result = tourney_result.drop('TeamID1', axis=1)
    
    tourney_result = pd.merge(tourney_result, season_result_map_min, left_on=['Season', 'WTeamID'], right_on=['Season', 'TeamID1'], how='left')
    tourney_result.rename(columns={f'{col}':f'W{col}MinT'}, inplace=True)
    tourney_result = tourney_result.drop('TeamID1', axis=1)
    tourney_result = pd.merge(tourney_result, season_result_map_min, left_on=['Season', 'LTeamID'], right_on=['Season', 'TeamID1'], how='left')
    tourney_result.rename(columns={f'{col}':f'L{col}MinT'}, inplace=True)
    tourney_result = tourney_result.drop('TeamID1', axis=1)
    
    tourney_result = pd.merge(tourney_result, season_result_map_max, left_on=['Season', 'WTeamID'], right_on=['Season', 'TeamID1'], how='left')
    tourney_result.rename(columns={f'{col}':f'W{col}MaxT'}, inplace=True)
    tourney_result = tourney_result.drop('TeamID1', axis=1)
    tourney_result = pd.merge(tourney_result, season_result_map_max, left_on=['Season', 'LTeamID'], right_on=['Season', 'TeamID1'], how='left')
    tourney_result.rename(columns={f'{col}':f'L{col}MaxT'}, inplace=True)
    tourney_result = tourney_result.drop('TeamID1', axis=1)
    
    test_df = pd.merge(test_df, season_result_map_mean, left_on=['Season', 'WTeamID'], right_on=['Season', 'TeamID1'], how='left')
    test_df.rename(columns={f'{col}':f'W{col}MeanT'}, inplace=True)
    test_df = test_df.drop('TeamID1', axis=1)
    test_df = pd.merge(test_df, season_result_map_mean, left_on=['Season', 'LTeamID'], right_on=['Season', 'TeamID1'], how='left')
    test_df.rename(columns={f'{col}':f'L{col}MeanT'}, inplace=True)
    test_df = test_df.drop('TeamID1', axis=1)

    test_df = pd.merge(test_df, season_result_map_var, left_on=['Season', 'WTeamID'], right_on=['Season', 'TeamID1'], how='left')
    test_df.rename(columns={f'{col}':f'W{col}VarT'}, inplace=True)
    test_df = test_df.drop('TeamID1', axis=1)
    test_df = pd.merge(test_df, season_result_map_var, left_on=['Season', 'LTeamID'], right_on=['Season', 'TeamID1'], how='left')
    test_df.rename(columns={f'{col}':f'L{col}VarT'}, inplace=True)
    test_df = test_df.drop('TeamID1', axis=1)
    
    test_df = pd.merge(test_df, season_result_map_last, left_on=['Season', 'WTeamID'], right_on=['Season', 'TeamID1'], how='left')
    test_df.rename(columns={f'{col}':f'W{col}LastT'}, inplace=True)
    test_df = test_df.drop('TeamID1', axis=1)
    test_df = pd.merge(test_df, season_result_map_last, left_on=['Season', 'LTeamID'], right_on=['Season', 'TeamID1'], how='left')
    test_df.rename(columns={f'{col}':f'L{col}LastT'}, inplace=True)
    test_df = test_df.drop('TeamID1', axis=1)
    
    test_df = pd.merge(test_df, season_result_map_min, left_on=['Season', 'WTeamID'], right_on=['Season', 'TeamID1'], how='left')
    test_df.rename(columns={f'{col}':f'W{col}MinT'}, inplace=True)
    test_df = test_df.drop('TeamID1', axis=1)
    test_df = pd.merge(test_df, season_result_map_min, left_on=['Season', 'LTeamID'], right_on=['Season', 'TeamID1'], how='left')
    test_df.rename(columns={f'{col}':f'L{col}MinT'}, inplace=True)
    test_df = test_df.drop('TeamID1', axis=1)
    
    test_df = pd.merge(test_df, season_result_map_max, left_on=['Season', 'WTeamID'], right_on=['Season', 'TeamID1'], how='left')
    test_df.rename(columns={f'{col}':f'W{col}MaxT'}, inplace=True)
    test_df = test_df.drop('TeamID1', axis=1)
    test_df = pd.merge(test_df, season_result_map_max, left_on=['Season', 'LTeamID'], right_on=['Season', 'TeamID1'], how='left')
    test_df.rename(columns={f'{col}':f'L{col}MaxT'}, inplace=True)
    test_df = test_df.drop('TeamID1', axis=1)
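
The merge, rename and drop pattern repeated in the two cells above could be folded into a helper.  The sketch below is one possible shape, not the code used in this notebook: `add_team_stat` is a hypothetical function and the data is made up, though the output column names follow the `W{col}MeanT` / `L{col}MeanT` convention used above.

```python
import pandas as pd

def add_team_stat(games, stats, col, agg, suffix):
    """Attach agg(col) per (Season, TeamID1) to both teams of each game.

    Hypothetical helper: adds W{col}{suffix} and L{col}{suffix} columns.
    """
    table = stats.groupby(['Season', 'TeamID1'])[col].agg(agg).reset_index()
    for side in ['W', 'L']:
        renamed = table.rename(columns={'TeamID1': f'{side}TeamID',
                                        col: f'{side}{col}{suffix}'})
        games = games.merge(renamed, on=['Season', f'{side}TeamID'], how='left')
    return games

# Invented regular season rows and one tournament game.
stats = pd.DataFrame({
    'Season': [2019, 2019, 2019, 2019],
    'TeamID1': [1101, 1101, 1102, 1102],
    'Score': [70, 80, 60, 62],
})
games = pd.DataFrame({'Season': [2019], 'WTeamID': [1101], 'LTeamID': [1102]})

games = add_team_stat(games, stats, 'Score', 'mean', 'MeanT')
print(games)
```

Calling such a helper in a loop over columns and aggregations, once for the training frame and once for the test frame, would replace most of the repeated blocks.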

In [None]:
# Find the ranking systems (Massey Ordinals) that published rankings in every tournament season, so features built from them are available for all seasons.

init_list = set(rankings['SystemName'].unique())
for season in tourney_result['Season'].unique():
    systems = set(rankings.loc[rankings['Season'] == season]['SystemName'].unique())
    init_list = init_list.intersection(systems)
len(init_list)
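
The filter above keeps only systems present in every season.  The toy example below shows the running intersection alongside an equivalent `groupby` check; the system names `AAA` and `BBB` are made up.

```python
import pandas as pd

rankings = pd.DataFrame({
    'Season': [2018, 2018, 2019],
    'SystemName': ['AAA', 'BBB', 'AAA'],
})
seasons = [2018, 2019]

# Running intersection, mirroring the loop in the cell above.
common = set(rankings['SystemName'].unique())
for season in seasons:
    common &= set(rankings.loc[rankings['Season'] == season, 'SystemName'])

# Equivalent check: a system qualifies if it appears in every season.
in_scope = rankings[rankings['Season'].isin(seasons)]
counts = in_scope.groupby('SystemName')['Season'].nunique()
common_alt = set(counts[counts == len(seasons)].index)

print(common, common_alt)
```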

In [None]:
# Mean ordinal rank per team and season for each ranking system in init_list.

for systemname in init_list:
    season_rankings_mean = rankings.loc[rankings['SystemName'] == systemname]
    season_rankings_mean = season_rankings_mean.groupby(['Season', 'TeamID'])['OrdinalRank'].mean().reset_index()

    tourney_result = pd.merge(tourney_result, season_rankings_mean, left_on=['Season', 'WTeamID'], right_on=['Season', 'TeamID'], how='left')
    tourney_result.rename(columns={'OrdinalRank':f'W{systemname}OrdinalRankT'}, inplace=True)
    tourney_result = tourney_result.drop('TeamID', axis=1)
    tourney_result = pd.merge(tourney_result, season_rankings_mean, left_on=['Season', 'LTeamID'], right_on=['Season', 'TeamID'], how='left')
    tourney_result.rename(columns={'OrdinalRank':f'L{systemname}OrdinalRankT'}, inplace=True)
    tourney_result = tourney_result.drop('TeamID', axis=1)

    test_df = pd.merge(test_df, season_rankings_mean, left_on=['Season', 'WTeamID'], right_on=['Season', 'TeamID'], how='left')
    test_df.rename(columns={'OrdinalRank':f'W{systemname}OrdinalRankT'}, inplace=True)
    test_df = test_df.drop('TeamID', axis=1)
    test_df = pd.merge(test_df, season_rankings_mean, left_on=['Season', 'LTeamID'], right_on=['Season', 'TeamID'], how='left')
    test_df.rename(columns={'OrdinalRank':f'L{systemname}OrdinalRankT'}, inplace=True)
    test_df = test_df.drop('TeamID', axis=1)

In [None]:
# Rename W/L-prefixed columns to team 1/team 2 suffixes, then build a mirrored copy with the two teams swapped.

tourney_win_result = tourney_result.copy()
for col in [x for x in tourney_win_result.columns if x not in ['Season', 'DayNum', 'Score_difference']]:
    if col[0] == 'W':
        tourney_win_result.rename(columns={f'{col}':f'{col[1:]+"1"}'}, inplace=True)
    elif col[0] == 'L':
        tourney_win_result.rename(columns={f'{col}':f'{col[1:]+"2"}'}, inplace=True)
tourney_lose_result = tourney_win_result.copy()
for col in tourney_lose_result.columns:
    if col[-1] == '1':
        col2 = col[:-1] + '2'
        tourney_lose_result[col] = tourney_win_result[col2]
        tourney_lose_result[col2] = tourney_win_result[col]
tourney_lose_result.columns

In [None]:
# Create difference features between teams.

tourney_win_result['Seed_diff'] = tourney_win_result['Seed1'] - tourney_win_result['Seed2']
tourney_win_result['ScoreMeanT_diff'] = tourney_win_result['ScoreMeanT1'] - tourney_win_result['ScoreMeanT2']
tourney_win_result['ScoreVarT_diff'] = tourney_win_result['ScoreVarT1'] - tourney_win_result['ScoreVarT2']

tourney_lose_result['Seed_diff'] = tourney_lose_result['Seed1'] - tourney_lose_result['Seed2']
tourney_lose_result['ScoreMeanT_diff'] = tourney_lose_result['ScoreMeanT1'] - tourney_lose_result['ScoreMeanT2']
tourney_lose_result['ScoreVarT_diff'] = tourney_lose_result['ScoreVarT1'] - tourney_lose_result['ScoreVarT2']

tourney_lose_result['Score_difference'] = -tourney_lose_result['Score_difference']
tourney_win_result['result'] = 1
tourney_lose_result['result'] = 0

tourney_result = pd.concat((tourney_win_result, tourney_lose_result)).reset_index(drop=True)

for col in test_df.columns[2:]:
    if col[0] == 'W':
        test_df.rename(columns={f'{col}':f'{col[1:]+"1"}'}, inplace=True)
    elif col[0] == 'L':
        test_df.rename(columns={f'{col}':f'{col[1:]+"2"}'}, inplace=True)

test_df['Seed1'] = test_df['Seed1'].apply(lambda x: int(x[1:3]))
test_df['Seed2'] = test_df['Seed2'].apply(lambda x: int(x[1:3]))
test_df['Seed_diff'] = test_df['Seed1'] - test_df['Seed2']
test_df['ScoreMeanT_diff'] = test_df['ScoreMeanT1'] - test_df['ScoreMeanT2']
test_df['ScoreVarT_diff'] = test_df['ScoreVarT1'] - test_df['ScoreVarT2']
test_df = test_df.drop(['ID', 'Pred'], axis=1)

In [None]:
# Mean rank across systems for each team, plus per-system rank differences between the two teams.

t1cols = [x for x in tourney_result.columns if 'OrdinalRankT1' in x]
t2cols = [x for x in tourney_result.columns if 'OrdinalRankT2' in x]
tourney_result['mean_rankt1'] = np.mean(tourney_result[t1cols], axis=1)
tourney_result['mean_rankt2'] = np.mean(tourney_result[t2cols], axis=1)
test_df['mean_rankt1'] = np.mean(test_df[t1cols], axis=1)
test_df['mean_rankt2'] = np.mean(test_df[t2cols], axis=1)
for col in t1cols:
    tourney_result[f'{col}_diff'] = tourney_result[col] - tourney_result[f'{col[:-1]+"2"}']
    test_df[f'{col}_diff'] = test_df[col] - test_df[f'{col[:-1]+"2"}']

In [None]:
# Creates a team 'quality' feature, which was useful in previous competitions.  Code taken from a public kernel.

season_result[['TeamID1', 'TeamID2']] = season_result[['TeamID1', 'TeamID2']].astype(str)

import statsmodels.api as sm
def team_quality(season):
    formula = 'result~-1+TeamID1+TeamID2'
    glm = sm.GLM.from_formula(formula=formula, 
                              data=season_result.loc[season_result['Season'] == season, ['TeamID1', 'TeamID2', 'result']].reset_index(drop=True), 
                              family=sm.families.Binomial()).fit()
    
    quality = pd.DataFrame(glm.params).reset_index()
    quality.columns = ['TeamID','quality']
    quality['Season'] = season
    quality['quality'] = np.exp(quality['quality'])
    quality = quality.loc[quality.TeamID.str.contains('ID1')].reset_index(drop=True)

    quality['TeamID'] = quality['TeamID'].apply(lambda x: x[-5:-1]).astype(int)
    return quality

glm_quality = pd.concat([team_quality(x) for x in tourney_result['Season'].unique()]).reset_index(drop=True)
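
As a reading aid for the cell above: assuming the formula interface dummy-codes the string-valued `TeamID1` and `TeamID2` columns, which is the usual treatment of categorical terms, the model fitted for each season can be written as below.  Here $i$ is the team in `TeamID1`, $j$ is the team in `TeamID2`, and the symbols $\alpha$ and $\beta$ are my own notation for the two sets of coefficients.

```latex
\operatorname{logit} P(\text{result} = 1 \mid i, j) = \alpha_i + \beta_j,
\qquad
\text{quality}_i = \exp(\alpha_i)
```

Only the coefficients whose names contain `ID1` are kept, so the quality score is the exponentiated $\alpha_i$ for each team.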

In [None]:
# Merge team quality into the train and test frames, drop the team IDs and add a win-rate difference.

tourney_result = pd.merge(tourney_result, glm_quality, left_on=['Season', 'TeamID1'], right_on=['Season', 'TeamID'], how='left')
tourney_result.rename(columns={'quality':'quality_T1'}, inplace=True)
tourney_result = tourney_result.drop(columns='TeamID')
tourney_result = pd.merge(tourney_result, glm_quality, left_on=['Season', 'TeamID2'], right_on=['Season', 'TeamID'], how='left')
tourney_result.rename(columns={'quality':'quality_T2'}, inplace=True)
tourney_result = tourney_result.drop(columns='TeamID')

test_df = pd.merge(test_df, glm_quality, left_on=['Season', 'TeamID1'], right_on=['Season', 'TeamID'], how='left')
test_df.rename(columns={'quality':'quality_T1'}, inplace=True)
test_df = test_df.drop(columns='TeamID')
test_df = pd.merge(test_df, glm_quality, left_on=['Season', 'TeamID2'], right_on=['Season', 'TeamID'], how='left')
test_df.rename(columns={'quality':'quality_T2'}, inplace=True)
test_df = test_df.drop(columns='TeamID')

test_df = test_df.drop(columns=['TeamID1', 'TeamID2'])
tourney_result = tourney_result.drop(columns=['TeamID1', 'TeamID2'])

tourney_result['winrate_diff'] = tourney_result['resultMeanT1'] - tourney_result['resultMeanT2']
test_df['winrate_diff'] = test_df['resultMeanT1'] - test_df['resultMeanT2']

In [None]:
# Create team 1 minus team 2 difference features for every column ending in 'T1', then build the feature list.

for col in tourney_result.columns:
    if col[-2:] == 'T1':
        tourney_result[f'{col}_diff'] = tourney_result[col] - tourney_result[f'{col[:-1] + "2"}']
        test_df[f'{col}_diff'] = test_df[col] - test_df[f'{col[:-1] + "2"}']
        
features = [x for x in tourney_result.columns if x not in ['result', 'Score_difference', 'Season', 'DayNum']]

## Training

In [None]:
# The class NCAA_model creates a model, accepting several self-explanatory parameters.  Methods are available to train and predict using different post-processing methods.

from scipy.optimize import minimize
from scipy.interpolate import UnivariateSpline
from functools import partial
from bayes_opt import BayesianOptimization

def minimize_clipper(labels, preds, clips):
    clipped = np.clip(preds, clips[0], clips[1])
    return log_loss(labels, clipped)

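def spline_model(labels, preds):
    # Fit a smoothing spline that maps raw predictions to observed outcomes.
    comb = pd.DataFrame({'labels':labels, 'preds':preds})
    comb = comb.sort_values(by='preds').reset_index(drop=True)
    # Named 'spline' so the local variable does not shadow this function.
    spline = UnivariateSpline(comb['preds'].values, comb['labels'].values)
    # The spline output can fall outside [0, 1], so clip before scoring.
    adjusted = np.clip(spline(preds), 1e-6, 1 - 1e-6)
    return spline, log_loss(labels, adjusted)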

"""
Ideas:
-custom loss function - large score differences attract the same penalty as small score differences
-add team1 vs team2 history stats
-for submission add p(x given x beat prev team) with dependence on bracket
-player stats
-adjust tournament winrates for late bracket games
"""

class NCAA_model():
    
    def __init__(self, params, train_df, test_df, use_holdback=True, regression=False, verbose=True):
        self.params = params
        self.verbose = verbose
        self.test_df = test_df
        self.has_trained_models = False
        self.models = []
        if use_holdback == True:
            self.use_holdback=2019
        else:
            self.use_holdback = use_holdback
            
        if regression:
            self.params['objective'] = 'regression'
            self.params['metric'] = 'mse'
            self.target = 'Score_difference'
            self.eval_func = mean_squared_error
        else:
            self.params['objective'] = 'binary'
            self.params['metric'] = 'binary'
            self.target = 'result'
            self.eval_func = log_loss
            
        if not self.verbose:
            self.params['verbosity'] = -1 
            
        if self.use_holdback:
            self.holdback_df = train_df.query(f'Season == {self.use_holdback}')
            self.holdback_target = self.holdback_df[self.target]
            self.train_df = train_df.query(f'Season != {self.use_holdback}')
        else:
            self.train_df = train_df
            
        self.target = self.train_df[self.target]
        
    def train(self, features, n_splits, n_boost_round=5000, stopping_rounds=None, verbose_eval=1000):
        self.feature_importances = pd.DataFrame(columns=features)
        self.preds = np.zeros(shape=(self.test_df.shape[0]))
        self.train_preds = np.zeros(shape=self.train_df.shape[0])
        self.oof = np.zeros(shape=(self.train_df.shape[0]))
        if self.use_holdback:
            self.holdback_preds = np.zeros(shape=(self.holdback_df.shape[0]))
        
        cv = GroupKFold(n_splits=n_splits)        
        for fold, (tr_idx, v_idx) in enumerate(cv.split(self.train_df, self.target, self.train_df['Season'])):
            if self.verbose:
                print(f'Fold: {fold}')
                
            x_train, y_train = self.train_df.iloc[tr_idx][features], self.target.iloc[tr_idx]
            x_valid, y_valid = self.train_df.iloc[v_idx][features], self.target.iloc[v_idx]
            X_t = lgb.Dataset(x_train, y_train)
            # Bin the validation data with the same bin boundaries as the training data.
            X_v = lgb.Dataset(x_valid, y_valid, reference=X_t)
            
            if self.has_trained_models:
                self.models[fold] = lgb.train(self.params, X_t, num_boost_round = n_boost_round, early_stopping_rounds=stopping_rounds,
                                                  valid_sets = [X_t, X_v], verbose_eval=(verbose_eval if self.verbose else None),
                                                                                        init_model=self.models[fold])                
            else:
                model = lgb.train(self.params, X_t, num_boost_round = n_boost_round, early_stopping_rounds=stopping_rounds,
                                                  valid_sets = [X_t, X_v], verbose_eval=(verbose_eval if self.verbose else None))
                self.models.append(model)
                
            self.oof[v_idx] = self.models[fold].predict(x_valid)
            self.train_preds[tr_idx] += self.models[fold].predict(x_train) / (n_splits-1)
            self.preds += self.models[fold].predict(self.test_df[features]) / n_splits
            self.feature_importances[f'fold_{fold}'] = self.models[fold].feature_importance()
            if self.use_holdback:
                self.holdback_preds += self.models[fold].predict(self.holdback_df[features]) / n_splits
            
            
        tr_score = self.eval_func(self.target, self.train_preds)
        oof_score = self.eval_func(self.target, self.oof)
        self.has_trained_models = True
        if self.verbose:
            print(f'Training {self.params["metric"]}: {tr_score}')
            print(f'OOF {self.params["metric"]}: {oof_score}')
        if self.use_holdback:
            hb_score = self.eval_func(self.holdback_target, self.holdback_preds)
            if self.verbose:
                print(f'Holdback set {self.params["metric"]}: {hb_score}')
            return tr_score, oof_score, hb_score
        return tr_score, oof_score
        
    def fit_clipper(self, verbose=True):
        preds = self.holdback_preds if self.use_holdback else self.oof
        conv_target = np.where(self.holdback_target>0,1,0) if self.use_holdback else np.where(self.target>0,1,0)

        partial_func = partial(minimize_clipper, conv_target, preds)
        opt = minimize(partial_func, x0=[0.08, 0.92], method='nelder-mead')
        if verbose:
            print(f'Clip score: {opt.fun}')
        clips = opt.x
        score = opt.fun
        return clips, score
    
    def fit_spline_model(self, verbose=True):
        preds = self.holdback_preds if self.use_holdback else self.oof
        conv_target = np.where(self.holdback_target>0,1,0) if self.use_holdback else np.where(self.target>0,1,0)
        spline, score = spline_model(conv_target, preds)
        if verbose:
            print(f'Spline score: {score}')

        return spline, score
    
    def postprocess_preds(self, opt_tool, method='clip', use_data='test', return_preds=False):
        pred_dict = {'test':self.preds, 'train':self.train_preds, 'oof':self.oof}
        label_dict = {'test':None, 'train':self.target, 'oof':self.target}       
        if self.use_holdback:
            pred_dict['hb'] = self.holdback_preds
            label_dict['hb'] = self.holdback_target
        if method == 'spline':
            adjusted_preds = opt_tool(pred_dict[use_data])
        elif method == 'clip':
            adjusted_preds = np.clip(pred_dict[use_data], opt_tool[0], opt_tool[1])
        else:
            raise ValueError(f"Unknown method {method!r}; expected 'clip' or 'spline'")
            
        if use_data == 'test':
            return adjusted_preds
        if return_preds:
            return adjusted_preds, self.eval_func(label_dict[use_data], adjusted_preds)
        return self.eval_func(label_dict[use_data], adjusted_preds)
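
The clipping step in `fit_clipper` and `postprocess_preds` matters because log loss punishes confident misses so heavily. The following is only a minimal sketch of that effect on made-up data: the helper names `log_loss_binary` and `clip_preds` and the four toy predictions are illustrative assumptions, not part of the script above. The clip bounds 0.08 and 0.92 are the starting values passed to the optimiser in `fit_clipper`.

```python
import math

def log_loss_binary(labels, preds, eps=1e-15):
    """Mean binary log loss; eps guards against log(0)."""
    total = 0.0
    for y, p in zip(labels, preds):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)

def clip_preds(preds, low, high):
    """Pull every prediction into the interval [low, high]."""
    return [min(max(p, low), high) for p in preds]

# Hypothetical toy data: three confident correct calls and one confident miss.
labels = [1, 0, 1, 0]
preds = [0.99, 0.01, 0.99, 0.99]

raw = log_loss_binary(labels, preds)
clipped = log_loss_binary(labels, clip_preds(preds, 0.08, 0.92))
print(f"raw: {raw:.3f}, clipped: {clipped:.3f}")
```

Clipping costs a little on every correct prediction but caps the penalty for the single miss, which is why the clipped score comes out lower here. In a season with few upsets the trade-off can go the other way.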

## Permutation Importance

Large chunks of the code in this section are taken from public notebooks.  This is an early stage of the work, where I was still investigating the different scoring methods before the competition was cancelled.
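
As a rough illustration of the idea (not the code used in this notebook), the sketch below builds a null distribution by measuring importance against shuffled targets and then checks how often the actual importance beats it. It assumes synthetic data and uses absolute correlation as a stand-in importance measure, whereas the notebook uses LightGBM split and gain importances. The function name `importance` and all numbers are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: feature 0 drives the target, feature 1 is pure noise.
n = 500
X = rng.normal(size=(n, 2))
y = (X[:, 0] + 0.5 * rng.normal(size=n) > 0).astype(float)

def importance(X, y):
    # Stand-in importance: absolute correlation of each feature with the target.
    return np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])

# Importance measured against the real target.
actual = importance(X, y)

# Null distribution: importance measured against shuffled targets.
null = np.array([importance(X, rng.permutation(y)) for _ in range(200)])

# Fraction of null runs that the actual importance beats, per feature.
score = (null < actual).mean(axis=0)
print(score)
```

A feature that carries real signal should beat essentially every null run, while a noise feature lands somewhere inside its own null distribution.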

In [None]:
# Null importance code.  The skeleton taken from a public notebook.  Modified for use in this competition.

from sklearn.metrics import roc_auc_score

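def get_feature_importances(data, shuffle, seed=None):
    # Shuffle the target if required.  Labels are matched to rows by position below
    # (via .iloc), so the shuffled order is what breaks the feature/target link.
    y = data['result'].copy()
    if shuffle:
        # Here you could as well use a binomial distribution
        y = data['result'].copy().sample(frac=1.0, random_state=seed)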
    params = {'num_leaves': 400,
          'bagging_fraction': 0.6,
          'feature_fraction':0.4,
          'max_depth': -1,
          'learning_rate': 0.005,
          'bagging_freq':1,
          "boosting_type": "rf",
          "bagging_seed": 11,
          'metric':'mse',
          "verbosity": 0,
          'random_state': 47,
         }
    models = []
    cv = KFold(n_splits=10, random_state=0, shuffle=True)        
    for fold, (tr_idx, v_idx) in enumerate(cv.split(data)):
        x_train, y_train = data.iloc[tr_idx][features], y.iloc[tr_idx]
        x_valid, y_valid = data.iloc[v_idx][features], y.iloc[v_idx]
        X_t = lgb.Dataset(x_train, y_train)
        X_v = lgb.Dataset(x_valid, y_valid)
        model = lgb.train(params, X_t, num_boost_round = 15000, early_stopping_rounds=50,
                                          valid_sets = [X_t, X_v], verbose_eval=None)
        models.append(model)
    # Get feature importances
    imp_df = pd.DataFrame()
    imp_df["feature"] = list(features)
    imp_df["importance_gain"] = np.mean([x.feature_importance(importance_type='gain') for x in models], axis=0)
    imp_df["importance_split"] = np.mean([x.feature_importance(importance_type='split') for x in models], axis=0)
    # Average the fold models' predictions (one model per fold) before scoring.
    imp_df['trn_score'] = log_loss(y, np.mean([x.predict(data[features]) for x in models], axis=0))
    
    return imp_df

# Seed the unexpected randomness of this world
np.random.seed(123)
# Get the actual importance, i.e. without shuffling
actual_imp_df = get_feature_importances(data=tourney_result, shuffle=False)
print(actual_imp_df.head())

null_imp_df = pd.DataFrame()
nb_runs = 80
for i in range(nb_runs):
    # Get current run importances
    imp_df = get_feature_importances(data=tourney_result, shuffle=True)
    imp_df['run'] = i + 1 
    # Concat the latest importances with the old ones
    null_imp_df = pd.concat([null_imp_df, imp_df], axis=0)
    
print(null_imp_df.head())

null_imp_df.to_csv('null_importances_distribution_rf.csv')
actual_imp_df.to_csv('actual_importances_ditribution_rf.csv')

In [None]:
# Measure the difference between the null distribution and the measured importance with labels that weren't permuted.
# Note: each feature has a single actual importance but nb_runs null importances, so
# entropy() is called on arrays of different lengths.  Treat the result as a heuristic
# ranking score rather than a KL divergence between two matched distributions.

from scipy.stats import entropy
entropies_gain = []
entropies_spl = []
for feature in actual_imp_df['feature'].unique():
    act = actual_imp_df.loc[actual_imp_df['feature'] == feature]
    null = null_imp_df.loc[null_imp_df['feature'] == feature]
    entropies_gain.append(entropy(act['importance_gain'].values, null['importance_gain'].values))
    entropies_spl.append(entropy(act['importance_split'].values, null['importance_split'].values))
entropy_df = pd.DataFrame()
entropy_df['feature'] = actual_imp_df['feature'].unique()
entropy_df['gain'] = entropies_gain
entropy_df['split'] = entropies_spl
entropy_df.describe()

In [None]:
# Display function.  Taken from public notebook.

def display_distributions(actual_imp_df_, null_imp_df_, feature_):
    plt.figure(figsize=(13, 6))
    gs = gridspec.GridSpec(1, 2)
    # Plot Split importances
    ax = plt.subplot(gs[0, 0])
    a = ax.hist(null_imp_df_.loc[null_imp_df_['feature'] == feature_, 'importance_split'].values, label='Null importances')
    ax.vlines(x=actual_imp_df_.loc[actual_imp_df_['feature'] == feature_, 'importance_split'].mean(), 
               ymin=0, ymax=np.max(a[0]), color='r',linewidth=10, label='Real Target')
    ax.legend()
    ax.set_title('Split Importance of %s' % feature_.upper(), fontweight='bold')
    plt.xlabel('Null Importance (split) Distribution for %s ' % feature_.upper())
    # Plot Gain importances
    ax = plt.subplot(gs[0, 1])
    a = ax.hist(null_imp_df_.loc[null_imp_df_['feature'] == feature_, 'importance_gain'].values, label='Null importances')
    ax.vlines(x=actual_imp_df_.loc[actual_imp_df_['feature'] == feature_, 'importance_gain'].mean(), 
               ymin=0, ymax=np.max(a[0]), color='r',linewidth=10, label='Real Target')
    ax.legend()
    ax.set_title('Gain Importance of %s' % feature_.upper(), fontweight='bold')
    plt.xlabel('Null Importance (gain) Distribution for %s ' % feature_.upper())

In [None]:
# Example of what the permutation importance does.

display_distributions(actual_imp_df_=actual_imp_df, null_imp_df_=null_imp_df, feature_='Seed1')

In [None]:
# Different scoring method, taken from public notebook.

feature_scores = []
for _f in actual_imp_df['feature'].unique():
    f_null_imps_gain = null_imp_df.loc[null_imp_df['feature'] == _f, 'importance_gain'].values
    f_act_imps_gain = actual_imp_df.loc[actual_imp_df['feature'] == _f, 'importance_gain'].mean()
    gain_score = np.log(1e-10 + f_act_imps_gain / (1 + np.percentile(f_null_imps_gain, 75)))  # 1 + ... avoids divide by zero; 1e-10 avoids log(0)
    f_null_imps_split = null_imp_df.loc[null_imp_df['feature'] == _f, 'importance_split'].values
    f_act_imps_split = actual_imp_df.loc[actual_imp_df['feature'] == _f, 'importance_split'].mean()
    split_score = np.log(1e-10 + f_act_imps_split / (1 + np.percentile(f_null_imps_split, 75)))  # 1 + ... avoids divide by zero; 1e-10 avoids log(0)
    feature_scores.append((_f, split_score, gain_score))

scores_df = pd.DataFrame(feature_scores, columns=['feature', 'split_score', 'gain_score'])

plt.figure(figsize=(16, 16))
gs = gridspec.GridSpec(1, 2)
# Plot Split importances
ax = plt.subplot(gs[0, 0])
sns.barplot(x='split_score', y='feature', data=scores_df.sort_values('split_score', ascending=False).iloc[0:100], ax=ax)
ax.set_title('Feature scores wrt split importances', fontweight='bold', fontsize=14)
# Plot Gain importances
ax = plt.subplot(gs[0, 1])
sns.barplot(x='gain_score', y='feature', data=scores_df.sort_values('gain_score', ascending=False).iloc[0:100], ax=ax)
ax.set_title('Feature scores wrt gain importances', fontweight='bold', fontsize=14)
plt.tight_layout()

In [None]:
# Different scoring method, taken from public notebook.

correlation_scores = []
for _f in actual_imp_df['feature'].unique():
    f_null_imps = null_imp_df.loc[null_imp_df['feature'] == _f, 'importance_gain'].values
    f_act_imps = actual_imp_df.loc[actual_imp_df['feature'] == _f, 'importance_gain'].values
    gain_score = 100 * (f_null_imps < np.percentile(f_act_imps, 5)).sum() / f_null_imps.size
    f_null_imps = null_imp_df.loc[null_imp_df['feature'] == _f, 'importance_split'].values
    f_act_imps = actual_imp_df.loc[actual_imp_df['feature'] == _f, 'importance_split'].values
    split_score = 100 * (f_null_imps < np.percentile(f_act_imps, 5)).sum() / f_null_imps.size
    correlation_scores.append((_f, split_score, gain_score))

corr_scores_df = pd.DataFrame(correlation_scores, columns=['feature', 'split_score', 'gain_score'])

fig = plt.figure(figsize=(16, 16))
gs = gridspec.GridSpec(1, 2)
# Plot Split importances
ax = plt.subplot(gs[0, 0])
sns.barplot(x='split_score', y='feature', data=corr_scores_df.sort_values('split_score', ascending=False).iloc[0:100], ax=ax)
ax.set_title('Feature scores wrt split importances', fontweight='bold', fontsize=14)
# Plot Gain importances
ax = plt.subplot(gs[0, 1])
sns.barplot(x='gain_score', y='feature', data=corr_scores_df.sort_values('gain_score', ascending=False).iloc[0:100], ax=ax)
ax.set_title('Feature scores wrt gain importances', fontweight='bold', fontsize=14)
plt.tight_layout()
plt.suptitle("Features' split and gain scores", fontweight='bold', fontsize=16)
fig.subplots_adjust(top=0.93)

Historical scores for comparison with my local CV.

Historical targets:

- 2019: 0.43
- 2018: 0.53
- 2017: 0.44
- 2016: 0.48
- 2015: 0.44

Possibly remove 2018, whose score is noticeably worse than the other years.

In [None]:
# Parameter search.  

def get_opt_params(init_points, n_iter, use_multiple_years=None):
    
    def run_parameter_opt(num_leaves, feature_fraction, bagging_fraction, min_data_in_leaf, max_depth, lambda_l1, lambda_l2,
                          min_split_gain, min_child_weight, learning_rate):
        params = {'num_leaves': int(num_leaves),
              'min_child_weight': min_child_weight,
              'min_split_gain' : min_split_gain,
              'feature_fraction': feature_fraction,
              'bagging_fraction': bagging_fraction,
              'min_data_in_leaf': int(min_data_in_leaf),
              'max_depth': int(max_depth),
              'learning_rate': learning_rate,
              'lambda_l1': lambda_l1,
              'lambda_l2': lambda_l2,
              'random_state': 7,
              'boosting_type': 'gbdt',
              'bagging_seed': 0,
        }
        if not use_multiple_years:
            model = NCAA_model(params, tourney_result, test_df, use_holdback=[2019], regression=False, verbose=False)   
            tr_score, oof_score, hb_score = model.train(features, n_splits=10, n_boost_round=10000, stopping_rounds=100)
            spline, spline_s = model.fit_spline_model(verbose=False)
           # return -spline_s
            return -hb_score
        spline_scores = []
        hb_scores = []
        for year in use_multiple_years:
            model = NCAA_model(params, tourney_result, test_df, use_holdback=[year], regression=True, verbose=False)   
            tr_score, oof_score, hb_score = model.train(features, n_splits=10, n_boost_round=10000, stopping_rounds=100)
            spline, spline_s = model.fit_spline_model(verbose=False)
            spline_scores.append(spline_s)
            hb_scores.append(hb_score)
        # Optimise the mean spline-adjusted score.  To optimise the raw holdback
        # score instead, return -np.mean(hb_scores).
        return -np.mean(spline_scores)
    lgbBO = BayesianOptimization(run_parameter_opt, {'num_leaves': (15, 1000),
                                        'feature_fraction': (0.7, 1),#.4515
                                        'bagging_fraction': (0.5, 1),
                                        'min_data_in_leaf': (20, 300),
                                        'max_depth': (-1, 35),
                                        'lambda_l1': (0, 3),
                                        'lambda_l2': (0, 3),
                                        'min_split_gain': (0, 0.3),
                                        'min_child_weight': (0, 0.5),
                                        'learning_rate': (0.001, 0.75)})
    lgbBO.maximize(init_points=init_points, n_iter=n_iter)
    
    return lgbBO.max['params']

params = get_opt_params(init_points=5, n_iter=5, use_multiple_years=[2015, 2017, 2019])#4791
params['random_state'] = 7
params['boosting_type'] = 'gbdt'
params['bagging_seed'] =  0
params['min_data_in_leaf'] =  int(params['min_data_in_leaf'])
params['max_depth'] =  int(params['max_depth'])
params['num_leaves'] =  int(params['num_leaves'])

In [None]:
# Find the best number of features to keep.  Features are ranked by their importance after accounting for the null importance distribution, and the top-ranked subset is grown step by step.

from scipy.stats import norm

def score_feature_selection(df=None, train_features=None, cat_feats=None, target=None, use_multiple_years=None):
    # Fit LightGBM 
    if not use_multiple_years:
        model = NCAA_model(params, tourney_result, test_df, use_holdback=[2019], regression=False, verbose=False)   
        tr_score, oof_score, hb_score = model.train(train_features, n_splits=10, n_boost_round=1500, stopping_rounds=100)
        spline, spline_s = model.fit_spline_model(verbose=False)
        return spline_s
    spline_scores = []
    hb_scores = []
    for year in use_multiple_years:
        model = NCAA_model(params, tourney_result, test_df, use_holdback=[year], regression=False, verbose=False)   
        tr_score, oof_score, hb_score = model.train(train_features, n_splits=10, n_boost_round=1500, stopping_rounds=100)
        spline, spline_s = model.fit_spline_model(verbose=False)
        spline_scores.append(spline_s)
        hb_scores.append(hb_score)
    return np.mean(hb_scores)

base_score = 1

gain_feats = list(entropy_df.sort_values(by='gain', ascending=False)['feature'].values)
split_feats = list(entropy_df.sort_values(by='split', ascending=False)['feature'].values)
shape = len(features)
steps = 50
for step in range(1,steps):
    reduced_gain_feats = gain_feats[:int(step*(shape/steps))]
    reduced_split_feats = split_feats[:int(step*(shape/steps))]        
            
    inters = sorted(list(set(reduced_split_feats).intersection(set(reduced_gain_feats))))
    union_set = sorted(list(set(reduced_split_feats).union(set(reduced_gain_feats))))
    print(len(reduced_split_feats), len(reduced_gain_feats), len(inters), len(union_set))
    print(f'Results for threshold {step}')
    split_results = score_feature_selection(df=tourney_result, train_features=reduced_split_feats, cat_feats=None, target=tourney_result['result'], use_multiple_years=[2015, 2017, 2019])
    print(f'split set: {split_results}')
    gain_results = score_feature_selection(df=tourney_result, train_features=reduced_gain_feats, cat_feats=None, target=tourney_result['result'], use_multiple_years=[2015, 2017, 2019])
    print(f'gain set: {gain_results}')
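    try:
        intersection_results = score_feature_selection(df=tourney_result, train_features=inters, cat_feats=None, target=tourney_result['result'], use_multiple_years=[2015, 2017, 2019])
        print(f'intersection set: {intersection_results}')
    except ValueError:
        # The intersection can be empty at small thresholds; mark it as unusable so a
        # missing or stale value is never compared against base_score below.
        intersection_results = np.inf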
    union_results = score_feature_selection(df=tourney_result, train_features=union_set, cat_feats=None, target=tourney_result['result'], use_multiple_years=[2015, 2017, 2019])
    print(f'union set: {union_results}')
    
    if split_results < base_score:
        reduced_feats = reduced_split_feats
        base_score = split_results
    if gain_results < base_score:
        reduced_feats = reduced_gain_feats
        base_score = gain_results
    if intersection_results < base_score:
        reduced_feats = inters
        base_score = intersection_results
    if union_results < base_score:
        reduced_feats = union_set
        base_score = union_results
    print(base_score)

In [None]:
len(features)

In [None]:
# Same as above cell, with different scoring method.

def score_feature_selection(df=None, train_features=None, cat_feats=None, target=None, use_multiple_years=None):
    # Fit LightGBM 
    if not use_multiple_years:
        model = NCAA_model(params, tourney_result, test_df, use_holdback=[2019], regression=False, verbose=False)   
        tr_score, oof_score, hb_score = model.train(train_features, n_splits=10, n_boost_round=1500, stopping_rounds=100)
        spline, spline_s = model.fit_spline_model(verbose=False)
        return spline_s
    spline_scores = []
    hb_scores = []
    for year in use_multiple_years:
        model = NCAA_model(params, tourney_result, test_df, use_holdback=[year], regression=False, verbose=False)   
        tr_score, oof_score, hb_score = model.train(train_features, n_splits=10, n_boost_round=1500, stopping_rounds=100)
        spline, spline_s = model.fit_spline_model(verbose=False)
        spline_scores.append(spline_s)
        hb_scores.append(hb_score)
    return np.mean(hb_scores)

base_score = 1

scores = []

for threshold in range(1, 341):
    reduced_split_feats = scores_df.sort_values(by='split_score', ascending=False)['feature'].values[:threshold]
    reduced_gain_feats = scores_df.sort_values(by='gain_score', ascending=False)['feature'].values[:threshold]
    
    inters = sorted(list(set(reduced_split_feats).intersection(set(reduced_gain_feats))))
    union_set = sorted(list(set(reduced_split_feats).union(set(reduced_gain_feats))))
    print(len(reduced_split_feats), len(reduced_gain_feats), len(inters), len(union_set))
    print(f'Results for threshold {threshold}')
    split_results = score_feature_selection(df=tourney_result, train_features=reduced_split_feats, cat_feats=None, target=tourney_result['result'], use_multiple_years=[2015, 2017, 2019])
    print(f'split set: {split_results}')
    gain_results = score_feature_selection(df=tourney_result, train_features=reduced_gain_feats, cat_feats=None, target=tourney_result['result'], use_multiple_years=[2015, 2017, 2019])
    print(f'gain set: {gain_results}')
    union_results = score_feature_selection(df=tourney_result, train_features=union_set, cat_feats=None, target=tourney_result['result'], use_multiple_years=[2015, 2017, 2019])
    print(f'union set: {union_results}')

    try:
        intersection_results = score_feature_selection(df=tourney_result, train_features=inters, cat_feats=None, target=tourney_result['result'], use_multiple_years=[2015, 2017, 2019])
        print(f'intersection set: {intersection_results}')
    except ValueError:
        # The intersection can be empty at small thresholds; skip it for this threshold.
        intersection_results = np.nan
    # Record the results for every threshold, not only when the intersection improves.
    scores.append([split_results, gain_results, intersection_results, union_results])
    if intersection_results < base_score:
        reduced_feats = inters
        base_score = intersection_results
    
    if split_results < base_score:
        reduced_feats = reduced_split_feats
        base_score = split_results
    if gain_results < base_score:
        reduced_feats = reduced_gain_feats
        base_score = gain_results
    if union_results < base_score:
        reduced_feats = union_set
        base_score = union_results
    print(base_score)

In [None]:
# Same as above cell, with different scoring method.

from scipy.stats import norm

def score_feature_selection(df=None, train_features=None, cat_feats=None, target=None, use_multiple_years=None):
    # Fit LightGBM 
    if not use_multiple_years:
        model = NCAA_model(params, tourney_result, test_df, use_holdback=[2019], regression=False, verbose=False)   
        tr_score, oof_score, hb_score = model.train(train_features, n_splits=10, n_boost_round=1500, stopping_rounds=100)
        spline, spline_s = model.fit_spline_model(verbose=False)
        return spline_s
    spline_scores = []
    for year in use_multiple_years:
        model = NCAA_model(params, tourney_result, test_df, use_holdback=[year], regression=False, verbose=False)   
        tr_score, oof_score, hb_score = model.train(train_features, n_splits=10, n_boost_round=1500, stopping_rounds=100)
        spline, spline_s = model.fit_spline_model(verbose=False)
        spline_scores.append(spline_s)
    return np.mean(spline_scores)

base_score = 1

null_f_std = null_imp_df.groupby('feature')['importance_gain'].std().reset_index()
null_f_mean = null_imp_df.groupby('feature')['importance_gain'].mean().reset_index()
act_f_mean = actual_imp_df.groupby('feature')['importance_gain'].mean().reset_index()
act_f_mean['p values gain'] = 1 - norm.cdf(act_f_mean['importance_gain'], null_f_mean['importance_gain'], null_f_std['importance_gain'])
null_f_std = null_imp_df.groupby('feature')['importance_split'].std().reset_index()
null_f_mean = null_imp_df.groupby('feature')['importance_split'].mean().reset_index()
act_f_mean_spl = actual_imp_df.groupby('feature')['importance_split'].mean().reset_index()
act_f_mean['p values split'] = 1 - norm.cdf(act_f_mean_spl['importance_split'], null_f_mean['importance_split'], null_f_std['importance_split'])
bottom = np.min(act_f_mean[['p values split', 'p values gain']])
print(len(features))

gain_feats = list(act_f_mean.sort_values(by='p values gain', ascending=True)['feature'].values)
split_feats = list(act_f_mean.sort_values(by='p values split', ascending=True)['feature'].values)
shape = len(features)
steps = 96
for step in range(1,steps):
    reduced_gain_feats = gain_feats[:int(step*(shape/steps))]
    reduced_split_feats = split_feats[:int(step*(shape/steps))]


    inters = sorted(list(set(reduced_split_feats).intersection(set(reduced_gain_feats))))
    union_set = sorted(list(set(reduced_split_feats).union(set(reduced_gain_feats))))
    print(len(reduced_split_feats), len(reduced_gain_feats), len(inters), len(union_set))
    print(f'Results for threshold {step}')
    split_results = score_feature_selection(df=tourney_result, train_features=reduced_split_feats, cat_feats=None, target=tourney_result['result'], use_multiple_years=[2015, 2017, 2019])
    print(f'split set: {split_results}')
    gain_results = score_feature_selection(df=tourney_result, train_features=reduced_gain_feats, cat_feats=None, target=tourney_result['result'], use_multiple_years=[2015, 2017, 2019])
    print(f'gain set: {gain_results}')
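    try:
        intersection_results = score_feature_selection(df=tourney_result, train_features=inters, cat_feats=None, target=tourney_result['result'], use_multiple_years=[2015, 2017, 2019])
        print(f'intersection set: {intersection_results}')
    except ValueError:
        # The intersection can be empty at small thresholds; mark it as unusable so a
        # missing or stale value is never compared against base_score below.
        intersection_results = np.inf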
    union_results = score_feature_selection(df=tourney_result, train_features=union_set, cat_feats=None, target=tourney_result['result'], use_multiple_years=[2015, 2017, 2019])
    print(f'union set: {union_results}')

    if split_results < base_score:
        reduced_feats = reduced_split_feats
        base_score = split_results
    if gain_results < base_score:
        reduced_feats = reduced_gain_feats
        base_score = gain_results
    if intersection_results < base_score:
        reduced_feats = inters
        base_score = intersection_results
    if union_results < base_score:
        reduced_feats = union_set
        base_score = union_results
print(base_score)

In [None]:
null_f_std = null_imp_df.groupby('feature')['importance_gain'].std().reset_index()
null_f_mean = null_imp_df.groupby('feature')['importance_gain'].mean().reset_index()
act_f_mean = actual_imp_df.groupby('feature')['importance_gain'].mean().reset_index()
act_f_mean['p values gain'] = 1 - norm.cdf(act_f_mean['importance_gain'], null_f_mean['importance_gain'], null_f_std['importance_gain'])
null_f_std = null_imp_df.groupby('feature')['importance_split'].std().reset_index()
null_f_mean = null_imp_df.groupby('feature')['importance_split'].mean().reset_index()
act_f_mean_spl = actual_imp_df.groupby('feature')['importance_split'].mean().reset_index()
act_f_mean['p values split'] = 1 - norm.cdf(act_f_mean_spl['importance_split'], null_f_mean['importance_split'], null_f_std['importance_split'])

reduced_feats = act_f_mean.loc[act_f_mean['p values gain'] < 0.05, 'feature'].values

In [None]:
reduced_feats

In [None]:
len(reduced_feats), len(features)
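
The `p values gain` filter above treats the null importances of each feature as roughly normal and keeps a feature when its actual importance sits far enough into the upper tail. A minimal standard-library sketch of that calculation follows. The function name `null_p_value` and the numbers are made up for illustration, and the normality assumption is the same approximation the cell above makes with `norm.cdf`.

```python
from statistics import NormalDist, mean, stdev

def null_p_value(actual_importance, null_importances):
    """One-sided p-value of the actual importance under a normal fit to the null runs."""
    # stdev() is the sample standard deviation, matching the pandas .std() default.
    null_dist = NormalDist(mean(null_importances), stdev(null_importances))
    return 1 - null_dist.cdf(actual_importance)

# Hypothetical null importances for one feature over eight shuffled runs.
null_runs = [10.0, 12.0, 9.0, 11.0, 13.0, 10.0, 12.0, 11.0]

p_strong = null_p_value(25.0, null_runs)  # actual importance far above the null runs
p_weak = null_p_value(11.5, null_runs)    # actual importance inside the null runs

keep_strong = p_strong < 0.05
keep_weak = p_weak < 0.05
print(keep_strong, keep_weak)
```

With only a modest number of null runs the normal fit is a coarse approximation, so the 0.05 cut-off is better read as a ranking aid than as a strict significance test.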

In [None]:
# Another parameter search, with the optimal features.

def get_opt_params(init_points, n_iter, use_multiple_years=None):
    
    def run_parameter_opt(num_leaves, feature_fraction, bagging_fraction, min_data_in_leaf, max_depth, lambda_l1, lambda_l2,
                          min_split_gain, min_child_weight, learning_rate):
        params = {'num_leaves': int(num_leaves),
              'min_child_weight': min_child_weight,
              'min_split_gain' : min_split_gain,
              'feature_fraction': feature_fraction,
              'bagging_fraction': bagging_fraction,
              'min_data_in_leaf': int(min_data_in_leaf),
              'max_depth': int(max_depth),
              'learning_rate': learning_rate,
              'lambda_l1': lambda_l1,
              'lambda_l2': lambda_l2,
              'random_state': 7,
              'boosting_type': 'gbdt',
              'bagging_seed': 0,
        }
        if not use_multiple_years:
            model = NCAA_model(params, tourney_result, test_df, use_holdback=[2019], regression=False, verbose=False)   
            tr_score, oof_score, hb_score = model.train(reduced_feats, n_splits=10, n_boost_round=10000, stopping_rounds=100)
            spline, spline_s = model.fit_spline_model(verbose=False)
            return -spline_s
        
        spline_scores = []
        hb_scores = []
        for year in use_multiple_years:
            model = NCAA_model(params, tourney_result, test_df, use_holdback=[year], regression=False, verbose=False)   
            tr_score, oof_score, hb_score = model.train(reduced_feats, n_splits=10, n_boost_round=10000, stopping_rounds=100)
            spline, spline_s = model.fit_spline_model(verbose=False)
            spline_scores.append(spline_s)
            hb_scores.append(hb_score)            
        return -np.mean(hb_scores)

    lgbBO = BayesianOptimization(run_parameter_opt, {'num_leaves': (15, 1200),
                                        'feature_fraction': (0.5, 1),#.4515
                                        'bagging_fraction': (0.4, 1),
                                        'min_data_in_leaf': (20, 300),
                                        'max_depth': (-1, 35),
                                        'lambda_l1': (0, 3.5),
                                        'lambda_l2': (0, 3.5),
                                        'min_split_gain': (0, 0.3),
                                        'min_child_weight': (0, 0.5),
                                        'learning_rate': (0.001, 0.75)})
    lgbBO.maximize(init_points=init_points, n_iter=n_iter)
    
    return lgbBO.max['params']

params = get_opt_params(init_points=12, n_iter=48, use_multiple_years=[2015, 2017, 2019])
params['random_state'] = 7
params['boosting_type'] = 'gbdt'
params['bagging_seed'] =  0
params['min_data_in_leaf'] =  int(params['min_data_in_leaf'])
params['max_depth'] =  int(params['max_depth'])
params['num_leaves'] =  int(params['num_leaves'])

#{'bagging_fraction':0.6316, 'feature_fraction':0.2083, 'lambda_l1':0.4754, 'lambda_l2':0.5603, 'learning_rate':0.4492, 'max_depth':1,
#'min_child_in_leaf':0.1958, 'min_data_in_leaf':33, 'min_split_gain':0.1215, 'num_leaves':310}

In [None]:
# Fit the final model.  Instead of relying on early stopping against the raw model metric alone, I trained in small increments and checked the validation score after clipping and after applying a smoothing spline.

step_size = 5
steps = 50
boosting_rounds = [step_size*(x+1) for x in range(steps)]
def run_boost_round_test(boosting_rounds, step_size):
    training_scores, oof_scores, holdback_scores = [], [], []
    model = NCAA_model(params, tourney_result, test_df, use_holdback=[2017], regression=False, verbose=False)   
    print(f'Training for {step_size*steps} rounds.')
    for rounds in range(step_size,boosting_rounds[-1]+1,step_size):
        print(f'{"*"*50}')
        print(f'Rounds: {rounds}')
        if model.use_holdback:
            tr_score, oof_score, hb_score = model.train(reduced_feats, n_splits=10, n_boost_round=step_size, stopping_rounds=None)
        else:
            tr_score, oof_score = model.train(reduced_feats, n_splits=10, n_boost_round=step_size, stopping_rounds=None)
        clips, clip_s = model.fit_clipper(verbose=True)
        spline, spline_s = model.fit_spline_model(verbose=True)
        
        training_scores.append([tr_score, model.postprocess_preds(clips, use_data = 'train'), 
                               model.postprocess_preds(spline, use_data = 'train', method='spline')])
        oof_scores.append([oof_score, model.postprocess_preds(clips, use_data = 'oof'),
                          model.postprocess_preds(spline, use_data = 'oof', method='spline')])
        holdback_scores.append([hb_score, model.postprocess_preds(clips, use_data = 'hb'),
                               model.postprocess_preds(spline, use_data = 'hb', method='spline')])

    # Plot the raw, clipped and spline-adjusted scores for each data split.
    fig,ax = plt.subplots(nrows=1,ncols=3, figsize=(20,5), sharey=True, sharex=True)
    plot_df = pd.DataFrame(data=training_scores, columns=['Classifier', 'Clipped', 'Spline'], index=boosting_rounds)
    plot_df.plot(ax=ax[0], title='Training')
    plot_df = pd.DataFrame(data=oof_scores, columns=['Classifier', 'Clipped', 'Spline'], index=boosting_rounds)
    plot_df.plot(ax=ax[1], title='Out of Fold')
    plot_df = pd.DataFrame(data=holdback_scores, columns=['Classifier', 'Clipped', 'Spline'], index=boosting_rounds)
    plot_df.plot(ax=ax[2], title='Holdback')
        
run_boost_round_test(boosting_rounds, step_size)

## Predictions
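
The final predictions are passed through a smoothing spline fitted on held-back seasons. As a hedged sketch of what that post-processing does, the example below fits `scipy.interpolate.UnivariateSpline` to synthetic predictions from a deliberately overconfident model. The data, the variable names and the degree of overconfidence are all assumptions for illustration. The clip bounds 0.0001 and 0.9999 are the ones used in the submission cell.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

rng = np.random.default_rng(0)

# Hypothetical overconfident model: the true win probability is closer to 0.5
# than the raw prediction suggests.
raw_preds = np.sort(rng.uniform(0.01, 0.99, size=400))  # the spline needs increasing x
true_prob = 0.5 + 0.6 * (raw_preds - 0.5)
labels = (rng.uniform(size=400) < true_prob).astype(float)

# Fit a smoothing spline mapping raw predictions to observed outcomes.
spline = UnivariateSpline(raw_preds, labels)

# Apply the mapping, then clip so the result is a valid probability for log loss.
calibrated = np.clip(spline(raw_preds), 0.0001, 0.9999)
print(calibrated.min(), calibrated.max())
```

The spline is fitted on seasons the model did not train on, so the mapping reflects how the model's confidence behaves on unseen games rather than on its own training data.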

In [None]:
# Generate predictions with the final model.  First fit a smoothing spline on a model trained on a reduced training set (2015, 2017 and 2019 held back), then retrain on all seasons.

model = NCAA_model(params, tourney_result, test_df, use_holdback=[2015, 2017, 2019], regression=False, verbose=False)   
tr_score, oof_score, _ = model.train(reduced_feats, n_splits=10, n_boost_round=1500, stopping_rounds=100)
spline, spline_s = model.fit_spline_model(verbose=True)

model = NCAA_model(params, tourney_result, test_df, use_holdback=False, regression=False, verbose=False)   
tr_score, oof_score = model.train(reduced_feats, n_splits=10, n_boost_round=1500, stopping_rounds=100)

## Submission

In [None]:
# Submit predictions

y_preds = model.postprocess_preds(spline, method='spline')
submission_df = pd.read_csv('../input/google-cloud-ncaa-march-madness-2020-division-1-mens-tournament/MSampleSubmissionStage1_2020.csv')
submission_df['Pred'] = np.clip(y_preds, 0.0001, 0.9999)
submission_df.to_csv('submission.csv', index=False)
submission_df.describe()