# Overview ##

This notebook creates a basic logistic regression model based on the seed differences and season average metric differences (e.g., FG%, PPG, Opp. PPG) between teams. 

Note that the model is trained entirely on data from 2003-2017 and their known outcomes. The resulting classifier is then used on 2018 data to generate predictions for this year's tournament on March 11.

In [512]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn.linear_model import LogisticRegression
import matplotlib.pyplot as plt
from sklearn.utils import shuffle
from sklearn.model_selection import GridSearchCV


## Load the training data ##
We're keeping it relatively simple and using only a handful of files for this model: the tourney seeds, the tourney results, and a detailed regular-season results dataset from which we calculate our other features.

In [None]:
data_dir = '../input/'
df_seeds = pd.read_csv(data_dir + 'Stage2UpdatedDataFiles/NCAATourneySeeds.csv')
df_tour = pd.read_csv(data_dir + 'DataFiles/NCAATourneyCompactResults.csv')

# We load detailed season data to calculate season average statistics for each team
df_reg_season_detailed = pd.read_csv(data_dir + 
                'Stage2UpdatedDataFiles/RegularSeasonDetailedResults.csv')
df_reg_season_detailed.drop(labels=['WFGM3', 'WFGA3', 'WFTM', 'WFTA', 'WDR', 'WAst', 
                'WStl', 'WBlk', 'WPF', 'LFGM3', 'LFGA3', 'LFTM', 'LFTA', 'LDR', 
                'LAst', 'LStl', 'LBlk', 'LPF', 'WLoc', 'NumOT'], inplace=True, axis=1)
df_reg_season_detailed.head()

Unnamed: 0,Season,DayNum,WTeamID,WScore,LTeamID,LScore,WFGM,WFGA,WOR,WTO,LFGM,LFGA,LOR,LTO
0,2003,10,1104,68,1328,62,27,58,14,23,22,53,10,18
1,2003,10,1272,70,1393,63,26,62,15,13,24,67,20,12
2,2003,11,1266,73,1437,61,24,58,17,10,22,73,31,12
3,2003,11,1296,56,1457,50,18,38,6,12,18,49,17,19
4,2003,11,1400,77,1208,71,30,61,17,14,24,62,21,10


## Create a new data frame with season average metrics ##
We are creating a new data frame with season average statistics for each team for use as features in our machine learning algorithm.

In [None]:
# TODO: Other statistics I should investigate:
#   -wins! (seems like an obvious one)
#   -offensive rebounds
#   -get to the foul line frequently

#   -Also look into player/team efficiency
#   -Avg win shares/min per team
#   -Should use MasseyOrdinals dataset (rating and ranking system data)
# Strength of Schedule!

# TODO: Neaten up code and break into other functions
# TODO: Investigate performance -- e.g., should avoid creating new data frames

yearList = range(2003,2019) #2003 is the first year we have detailed data for
teams_pd = pd.read_csv(data_dir + 'DataFiles/Teams.csv')
teamIDs = teams_pd['TeamID'].tolist()

rows = list()

for year in yearList:
    for team in teamIDs:
        df_curr_season = df_reg_season_detailed[df_reg_season_detailed.Season == year]       

        df_curr_team_wins = df_curr_season[df_curr_season.WTeamID == team]
        df_curr_team_losses = df_curr_season[df_curr_season.LTeamID == team]
        
        # no games played by them this year.. skip (current team didn't win or lose any games)
        if df_curr_team_wins.shape[0] == 0 and df_curr_team_losses.shape[0] == 0:
            continue
        
        df_winteam = df_curr_team_wins.rename(columns={'WTeamID':'TeamID', 'WFGM':'FGM', 
                    'WFGA':'FGA', 'WTO':'TO', 'WScore':'Score', 'LScore':'OppScore'})
        
        # drop all columns except the ones we are using
        df_winteam = df_winteam[['TeamID', 'FGM', 'FGA', 'TO', 'Score', 'OppScore']]

        df_loseteam = df_curr_team_losses.rename(columns={'LTeamID':'TeamID', 'LFGM':'FGM',
                    'LFGA':'FGA', 'LTO':'TO', 'LScore':'Score', 'WScore':'OppScore'})
        # drop all columns except the ones we are using
        df_loseteam = df_loseteam[['TeamID', 'FGM', 'FGA', 'TO', 'Score', 'OppScore']] 

        # dataframe w/ all relevant stats from current year for current team
        df_curr_team = pd.concat((df_winteam, df_loseteam)) 

        FGPercent = df_curr_team['FGM'].sum() / df_curr_team['FGA'].sum()
        TurnoverAvg = df_curr_team['TO'].sum() / len(df_curr_team['TO'].values)
        PPG = df_curr_team['Score'].sum() / len(df_curr_team['Score'].values)
        OppPPG = df_curr_team['OppScore'].sum() / len(df_curr_team['OppScore'].values)

        # collect all data in rows list first for efficiency
        rows.append([year, team, FGPercent, TurnoverAvg, PPG, OppPPG])

df_training_data = pd.DataFrame(rows, columns=['Season', 'TeamID', 'FGPercent', 
                                               'TOAvg', 'PPG', 'OppPPG'])
df_training_data.head()
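
The nested loop above filters the full results table once per (year, team) pair. As a possible answer to the performance TODO, here is a minimal sketch of the same four averages computed with a single `groupby`. The function name `season_averages` and the two-game `toy_games` table are hypothetical and exist only for illustration; the column names are the ones used in this notebook.

```python
import pandas as pd

def season_averages(df_games):
    """Per-team season averages from a winner/loser game table (illustrative sketch)."""
    keep = ['Season', 'TeamID', 'FGM', 'FGA', 'TO', 'Score', 'OppScore']
    # View every game once from the winner's side and once from the loser's side.
    winners = df_games.rename(columns={'WTeamID': 'TeamID', 'WFGM': 'FGM', 'WFGA': 'FGA',
                                       'WTO': 'TO', 'WScore': 'Score',
                                       'LScore': 'OppScore'})[keep]
    losers = df_games.rename(columns={'LTeamID': 'TeamID', 'LFGM': 'FGM', 'LFGA': 'FGA',
                                      'LTO': 'TO', 'LScore': 'Score',
                                      'WScore': 'OppScore'})[keep]
    per_game = pd.concat([winners, losers], ignore_index=True)
    grouped = per_game.groupby(['Season', 'TeamID'])
    averages = pd.DataFrame({
        # FG% is total makes over total attempts, matching the loop version.
        'FGPercent': grouped['FGM'].sum() / grouped['FGA'].sum(),
        'TOAvg': grouped['TO'].mean(),
        'PPG': grouped['Score'].mean(),
        'OppPPG': grouped['OppScore'].mean(),
    })
    return averages.reset_index()

# Made-up toy data: team 1 beats team 2, then team 2 beats team 1.
toy_games = pd.DataFrame({
    'Season': [2003, 2003], 'WTeamID': [1, 2], 'LTeamID': [2, 1],
    'WScore': [70, 80], 'LScore': [60, 70],
    'WFGM': [25, 30], 'WFGA': [50, 60], 'WTO': [10, 12],
    'LFGM': [20, 25], 'LFGA': [50, 50], 'LTO': [14, 16],
})
averages = season_averages(toy_games)
print(averages)
```

Teams with no games in a season simply never appear in the grouped result, so the explicit skip in the loop version is not needed here.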

Next we preview the tournament seed and tournament result data frames. They supply the seed feature and the known game outcomes that the model is trained on.

In [None]:
df_seeds.head()

In [None]:
df_tour.head()

First, we'll simplify the datasets to remove the columns we won't be using and convert the seedings to the needed format (stripping the regional abbreviation in front of the seed).

In [None]:
def seed_to_int(seed):
    #Get just the digits from the seeding. Return as int
    s_int = int(seed[1:3])
    return s_int
df_seeds['seed_int'] = df_seeds.Seed.apply(seed_to_int)
df_seeds.drop(labels=['Seed'], inplace=True, axis=1) # This is the string label
df_seeds.head()
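
To make the seed conversion concrete, here is a small sketch of the slice on a few seed strings. The example strings are assumed to follow the usual format of this dataset (a region letter, two digits, and an optional play-in suffix); they are not taken from the files loaded above.

```python
def seed_to_int(seed):
    """Keep only the two seed digits, e.g. 'X16a' -> 16."""
    return int(seed[1:3])

# Hypothetical seed strings for illustration.
examples = {s: seed_to_int(s) for s in ['W01', 'X16a', 'Z11b']}
print(examples)  # {'W01': 1, 'X16a': 16, 'Z11b': 11}
```

Because the slice stops at index 3, a trailing play-in letter such as `a` or `b` is ignored.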

In [None]:
df_tour.drop(labels=['DayNum', 'WScore', 'LScore', 'WLoc', 'NumOT'], inplace=True, axis=1)
df_tour.head()

## Merge seed for each team ##
Merge the Seeds with their corresponding TeamIDs in the compact results dataframe.

In [None]:
df_winseeds = df_seeds.rename(columns={'TeamID':'WTeamID', 'seed_int':'WSeed'})
df_lossseeds = df_seeds.rename(columns={'TeamID':'LTeamID', 'seed_int':'LSeed'})
df_dummy = pd.merge(left=df_tour, right=df_winseeds, how='left', on=['Season', 'WTeamID'])
df_concat = pd.merge(left=df_dummy, right=df_lossseeds, on=['Season', 'LTeamID'])
df_concat['SeedDiff'] = df_concat.WSeed - df_concat.LSeed
df_concat.head()
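
The sign convention of `SeedDiff` is easy to get backwards, so here is a minimal sketch of the same two merges on a made-up one-game table. The `toy_seeds` and `toy_games` frames are hypothetical.

```python
import pandas as pd

# Made-up data: a 1 seed (team 1) beats a 16 seed (team 2).
toy_seeds = pd.DataFrame({'Season': [2017, 2017], 'TeamID': [1, 2], 'seed_int': [1, 16]})
toy_games = pd.DataFrame({'Season': [2017], 'WTeamID': [1], 'LTeamID': [2]})

win_seeds = toy_seeds.rename(columns={'TeamID': 'WTeamID', 'seed_int': 'WSeed'})
loss_seeds = toy_seeds.rename(columns={'TeamID': 'LTeamID', 'seed_int': 'LSeed'})

merged = (toy_games
          .merge(win_seeds, on=['Season', 'WTeamID'])
          .merge(loss_seeds, on=['Season', 'LTeamID']))
merged['SeedDiff'] = merged.WSeed - merged.LSeed
print(merged)  # SeedDiff is -15: the winner had the numerically lower (better) seed
```

A negative `SeedDiff` therefore means the favorite won, and a positive one marks an upset.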

Now we'll combine our advanced season statistics and merge them into the df_concat data frame.

In [None]:
df_winstats = df_training_data.rename(columns={'TeamID':'WTeamID', 'FGPercent':'WFGPercent', 
                            'TOAvg':'WTOAvg', 'PPG':'WPPG', 'OppPPG':'WOppPPG'})
df_lossstats = df_training_data.rename(columns={'TeamID':'LTeamID', 'FGPercent':'LFGPercent',
                            'TOAvg':'LTOAvg', 'PPG':'LPPG', 'OppPPG':'LOppPPG'})
df_dummy = pd.merge(left=df_concat, right=df_winstats, on=['Season', 'WTeamID'])
df_concat = pd.merge(left=df_dummy, right=df_lossstats, on=['Season', 'LTeamID'])
df_concat['FGPercentDiff'] = df_concat.WFGPercent - df_concat.LFGPercent
df_concat['TOAvgDiff'] = df_concat.WTOAvg - df_concat.LTOAvg
df_concat['PPGDiff'] = df_concat.WPPG - df_concat.LPPG
df_concat['OppPPGDiff'] = df_concat.WOppPPG - df_concat.LOppPPG
df_concat['WWinMargin'] = df_concat.WPPG - df_concat.WOppPPG
df_concat['LWinMargin'] = df_concat.LPPG - df_concat.LOppPPG
df_concat['WinMarginDiff'] = df_concat.WWinMargin - df_concat.LWinMargin
# drop all columns except the ones we are using
df_concat = df_concat[['Season', 'WTeamID', 'LTeamID', 'SeedDiff', 'FGPercentDiff', 
                       'TOAvgDiff', 'PPGDiff', 'OppPPGDiff', 'WinMarginDiff']]

# Note: We can have SeedDiff == 0 due to the First Four (68 teams)! Also Final Four onwards!
# Note: Pandas merges tossed out data from before 2003!
df_concat.head()

Now we'll create a dataframe that summarizes wins & losses along with their corresponding feature differences: seed, FG%, turnovers, PPG, opponent PPG, and win margin. This is the meat of what we'll be creating our model on.

In [None]:
# We create positive and negative versions of the data so the 
# supervised learning algorithm has sample data of each class to classify

df_wins = pd.DataFrame()
df_wins['SeedDiff'] = df_concat['SeedDiff']
df_wins['FGPercentDiff'] = df_concat['FGPercentDiff']
df_wins['TOAvgDiff'] = df_concat['TOAvgDiff']
df_wins['PPGDiff'] = df_concat['PPGDiff']
df_wins['OppPPGDiff'] = df_concat['OppPPGDiff']
df_wins['WinMarginDiff'] = df_concat['WinMarginDiff']
df_wins['Result'] = 1

df_losses = pd.DataFrame()
df_losses['SeedDiff'] = -df_concat['SeedDiff']
df_losses['FGPercentDiff'] = -df_concat['FGPercentDiff']
df_losses['TOAvgDiff'] = -df_concat['TOAvgDiff']
df_losses['PPGDiff'] = -df_concat['PPGDiff']
df_losses['OppPPGDiff'] = -df_concat['OppPPGDiff']
df_losses['WinMarginDiff'] = -df_concat['WinMarginDiff']
df_losses['Result'] = 0

df_predictions = pd.concat((df_wins, df_losses))
df_predictions.head()
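
The mirroring step can also be written without listing each column twice. This is a compact sketch on a made-up two-row frame; `toy` and `feature_cols` are hypothetical names, and only two of the six features are shown to keep it short.

```python
import pandas as pd

feature_cols = ['SeedDiff', 'FGPercentDiff']
# Made-up winner-minus-loser differences for two games.
toy = pd.DataFrame({'SeedDiff': [-15, 3], 'FGPercentDiff': [0.04, -0.01]})

wins = toy[feature_cols].assign(Result=1)        # winner's point of view
losses = (-toy[feature_cols]).assign(Result=0)   # same games seen from the loser's side
both = pd.concat([wins, losses], ignore_index=True)
print(both)
```

Every game contributes one row per class, so the two classes are exactly balanced and each feature column sums to zero.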

In [None]:
X_train = [list(a) for a in zip(df_predictions.SeedDiff.values, df_predictions.FGPercentDiff.values, 
                                df_predictions.TOAvgDiff.values, df_predictions.PPGDiff.values,
                                df_predictions.OppPPGDiff.values, df_predictions.WinMarginDiff.values)]
X_train = np.array(X_train)
y_train = df_predictions.Result.values
X_train, y_train = shuffle(X_train, y_train)

## Train the model ##
Use a basic logistic regression to train the model. You can set different C values to see how performance changes.

In [None]:
logreg = LogisticRegression()
params = {'C': np.logspace(start=-15, stop=15, num=31)} # C grid: 10^-15, 10^-14, ..., 10^15
clf = GridSearchCV(logreg, params, scoring='neg_log_loss', refit=True) #sklearn model selection
clf.fit(X_train, y_train)
print('Best log_loss: {:.4}, with best C: {}'.format(clf.best_score_, clf.best_params_['C']))

# Initial Model Provided: 
# Just based on seeds - Logistic Regression
# Best log_loss: -0.5531, with best C: 0.01
# Best log_loss: -0.5529, with best C: 0.1

# First iteration of Model
# Features: Seeds, avg FG%, avg turnovers, avg PPG, avg Opp PPG, avg Win Margin
# Logistic Regression
# Best log_loss: -0.5187, with best C: 1000.0


# Keep in mind, the provided values are an average representation of our classifier's
# success! Depending on how the data is shuffled, each run of the program may yield
# a slightly different classifier (and thus different predictions/success rate)
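
Two details of the cell above are worth checking by hand: what the `C` grid contains, and how to read the negative scores. The sketch below uses only NumPy; `log_loss_manual` is a hypothetical helper written for illustration, not the scikit-learn implementation.

```python
import numpy as np

# The grid searched above: 31 powers of ten from 1e-15 to 1e15.
C_grid = np.logspace(start=-15, stop=15, num=31)

def log_loss_manual(y_true, p):
    """Mean binary log loss for labels y_true and predicted probabilities p."""
    y_true = np.asarray(y_true, dtype=float)
    p = np.asarray(p, dtype=float)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

# Always predicting 0.5 gives a log loss of ln(2), about 0.693.
coin_flip = log_loss_manual([1, 0, 1, 0], [0.5, 0.5, 0.5, 0.5])
print(C_grid[0], C_grid[15], C_grid[-1])
print(coin_flip)
```

With `scoring='neg_log_loss'` the grid search reports the negated loss so that higher is better. A printed score of -0.5187 therefore corresponds to a log loss of 0.5187, which is better than the 0.693 coin-flip baseline.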

In [None]:
X1 = np.arange(-10, 10)
X2 = np.zeros(20, dtype=int)  # the np.int alias was removed in newer NumPy releases
X = [list(a) for a in zip(X1, X2, X2, X2, X2, X2)]
X = np.array(X)

preds = clf.predict_proba(X)[:,1]

plt.plot(X1, preds)
plt.xlabel('Team1 seed - Team2 seed')
plt.ylabel('P(Team1 will win)')

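Plotting validates our intuition: with all other feature differences held at zero, the probability that Team1 wins decreases as the seed differential (Team1 seed minus Team2 seed) increases, i.e. as Team1 becomes the worse-seeded team.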

In [None]:
df_sample_sub = pd.read_csv(data_dir + 'SampleSubmissionStage2.csv')
n_test_games = len(df_sample_sub)

def get_year_t1_t2(ID):
    """Return a tuple with ints `year`, `team1` and `team2`."""
    # Build a real tuple (not a generator) so the result matches the docstring
    return tuple(int(x) for x in ID.split('_'))
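
If the IDs need to be parsed for the whole submission file at once, a vectorized split is one option. This is a sketch on a made-up two-row frame; `toy_sub` and the column names `Season`, `Team1`, `Team2` are hypothetical, and the IDs are assumed to have the `year_team1_team2` form that the function above expects.

```python
import pandas as pd

# Made-up rows in the shape of the sample submission.
toy_sub = pd.DataFrame({'ID': ['2018_1104_1328', '2018_1272_1393'], 'Pred': [0.5, 0.5]})

# Split every ID into three integer columns in one pass.
parts = toy_sub['ID'].str.split('_', expand=True).astype(int)
parts.columns = ['Season', 'Team1', 'Team2']
print(parts)
```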

In [None]:
X_test = np.zeros(shape=(n_test_games, 6))

for ii, row in df_sample_sub.iterrows():
    year, t1, t2 = get_year_t1_t2(row.ID)
    t1_seed = df_seeds[(df_seeds.TeamID == t1) & (df_seeds.Season == year)].seed_int.values[0]
    t2_seed = df_seeds[(df_seeds.TeamID == t2) & (df_seeds.Season == year)].seed_int.values[0]
    diff_seed = t1_seed - t2_seed
    X_test[ii, 0] = diff_seed
    
    t1_FGPercent = df_training_data[(df_training_data.TeamID == t1) & 
                                    (df_training_data.Season == year)].FGPercent.values[0]
    t2_FGPercent = df_training_data[(df_training_data.TeamID == t2) & 
                                    (df_training_data.Season == year)].FGPercent.values[0]
    diff_FGPercent = t1_FGPercent - t2_FGPercent
    X_test[ii, 1] = diff_FGPercent
    
    t1_TOAvg = df_training_data[(df_training_data.TeamID == t1) & 
                                (df_training_data.Season == year)].TOAvg.values[0]
    t2_TOAvg = df_training_data[(df_training_data.TeamID == t2) & 
                                (df_training_data.Season == year)].TOAvg.values[0]
    diff_TOAvg = t1_TOAvg - t2_TOAvg
    X_test[ii, 2] = diff_TOAvg
    
    t1_PPG = df_training_data[(df_training_data.TeamID == t1) & 
                              (df_training_data.Season == year)].PPG.values[0]
    t2_PPG = df_training_data[(df_training_data.TeamID == t2) & 
                              (df_training_data.Season == year)].PPG.values[0]
    diff_PPG = t1_PPG - t2_PPG
    X_test[ii, 3] = diff_PPG
    
    t1_OppPPG = df_training_data[(df_training_data.TeamID == t1) & 
                                 (df_training_data.Season == year)].OppPPG.values[0]
    t2_OppPPG = df_training_data[(df_training_data.TeamID == t2) & 
                                 (df_training_data.Season == year)].OppPPG.values[0]
    diff_OppPPG = t1_OppPPG - t2_OppPPG
    X_test[ii, 4] = diff_OppPPG
    
    X_test[ii, 5] = diff_PPG - diff_OppPPG # Win Margin
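
The loop above performs several boolean-mask lookups per submission row. A merge-based alternative is sketched below on made-up data. The frames `toy_stats` and `toy_matchups` and the helper `attach` are hypothetical, and only two of the statistics are carried through to keep the example short.

```python
import pandas as pd

# Made-up season averages for two teams and one matchup between them.
toy_stats = pd.DataFrame({'Season': [2018, 2018], 'TeamID': [1, 2],
                          'PPG': [80.0, 70.0], 'OppPPG': [65.0, 68.0]})
toy_matchups = pd.DataFrame({'Season': [2018], 'Team1': [1], 'Team2': [2]})

def attach(df, side):
    """Join one team's season stats onto the matchups, prefixing columns with `side`."""
    renamed = toy_stats.rename(columns={'TeamID': side,
                                        'PPG': side + '_PPG',
                                        'OppPPG': side + '_OppPPG'})
    return df.merge(renamed, on=['Season', side], how='left')

wide = attach(attach(toy_matchups, 'Team1'), 'Team2')
wide['PPGDiff'] = wide.Team1_PPG - wide.Team2_PPG
wide['OppPPGDiff'] = wide.Team1_OppPPG - wide.Team2_OppPPG
# Same identity as the loop: margin difference = PPG difference - opponent PPG difference.
wide['WinMarginDiff'] = wide.PPGDiff - wide.OppPPGDiff
print(wide[['PPGDiff', 'OppPPGDiff', 'WinMarginDiff']])
```

Here team 1 has a season margin of 15 and team 2 a margin of 2, so `WinMarginDiff` comes out as 13, the same value the per-row formula produces.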

## Make Predictions ##
Create predictions using the logistic regression model we trained.

In [None]:
preds = clf.predict_proba(X_test)[:,1]

clipped_preds = np.clip(preds, 0.05, 0.95)
df_sample_sub['Pred'] = clipped_preds  # bracket assignment always writes the column
df_sample_sub.head()
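
The notebook does not say why predictions are clipped to [0.05, 0.95]. One common motivation, assumed here, is that log loss punishes confident wrong answers very heavily, and clipping caps that penalty. The numbers below are made up for illustration.

```python
import numpy as np

raw_preds = np.array([0.001, 0.40, 0.999])    # made-up model outputs
clipped = np.clip(raw_preds, 0.05, 0.95)      # same bounds as the cell above

# Per-game penalty when the predicted side loses, i.e. -ln(probability given to the true outcome).
worst_case_clipped = -np.log(0.05)   # about 3.00
worst_case_raw = -np.log(0.001)      # about 6.91
print(clipped, worst_case_clipped, worst_case_raw)
```

The trade-off is that correct, confident predictions also earn slightly less credit once they are clipped.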

Lastly, create your submission file!

In [None]:
df_sample_sub.to_csv('predictions.csv', index=False)