## Google Cloud & NCAA® ML Competition 2018-Women's

### https://www.kaggle.com/c/womens-machine-learning-competition-2018/


### Strategy:
- Exploit the simplicity of my men's approach and similarity of the "Compact" datasets to perform rapid modelling of the women's data
- There are far fewer years of women's data, but it has the same format

#### From my Men's Notebook: 
- Short on time, rely on data analysis performed and published in others' Kaggle kernels
    - Missed the first part of the competition so I am behind on model development
    - Wanted to explore some of the fastai NN stuff, but won't have time
    - Going to focus on GBM algorithms
- Employ 538's take on ELO calcs to determine team rating, implement the Kaggle kernel by Liam Kirwin (lpkirwin)
    - https://en.wikipedia.org/wiki/Elo_rating_system
    - https://www.kaggle.com/lpkirwin
    - https://github.com/fivethirtyeight/nfl-elo-game/blob/master/forecast.py
    - https://fivethirtyeight.com/features/how-we-calculate-nba-elo-ratings/
    - Use the ELO as key features
    - Modify as needed for my own needs
- Focus only on core data
    - Tons of data are provided, external data is allowed
    - Kaggle Kernels and other reading indicate super knowledgeable use of data (including external) may give an edge
    - Otherwise, people are getting competitive results with relatively simple features
    - Will use only the most compact dataset and see how it goes!
    
### Special Notes:
- Calculated ELO for every regular and tournament game chronologically to the very beginning of NCAA Tournament 2018
- Randomized which team was Team1 and Team2, setting label "Team1_Wins" 1 or 0 appropriately
- Gave every non-seeded team a seed of 17
- Calculated the difference in seed between team 1 and team 2 for every regular and tournament game

In [84]:
import numpy as np
import pandas as pd
from sklearn.metrics import log_loss

In [85]:
# Parameters for the ELO calc

# How fast ELO changes
K = 20.

# Bonus for playing at home
HOME_ADVANTAGE = 100.

### Import data
- The "Compact" results are very simple and have
    - season (year)
    - day of that season game takes place
    - ID of winner and loser (Names are in a different table)
    - WLoc is home/away/neutral for winner
    - NumOT is the number of overtime periods for that game
    - WScore and LScore are final scores for the game

In [87]:
rs = pd.read_csv("WRegularSeasonCompactResults.csv")
ts = pd.read_csv("WNCAATourneyCompactResults.csv")
display(rs.head(3))
display(ts.head(3))

Unnamed: 0,Season,DayNum,WTeamID,WScore,LTeamID,LScore,WLoc,NumOT
0,1998,18,3104,91,3202,41,H,0
1,1998,18,3163,87,3221,76,H,0
2,1998,18,3222,66,3261,59,H,0


Unnamed: 0,Season,DayNum,WTeamID,WScore,LTeamID,LScore,WLoc,NumOT
0,1998,137,3104,94,3422,46,H,0
1,1998,137,3112,75,3365,63,H,0
2,1998,137,3163,93,3193,52,H,0


In [88]:
# Aggregate TeamIDs and count them

team_ids = set(rs.WTeamID).union(set(rs.LTeamID)).union(set(ts.WTeamID)).union(set(ts.LTeamID))
len(team_ids)

354

In [89]:
# This dictionary will be used as a lookup for current
# scores while the algorithm is iterating through each game

elo_dict = dict(zip(list(team_ids), [1500] * len(team_ids)))

In [90]:
# Elo updates will be scaled based on the margin of victory

rs['margin'] = rs.WScore - rs.LScore
rs['reg_season'] = 1

ts['margin'] = ts.WScore - ts.LScore
ts['reg_season'] = 0

In [91]:
# LiamKerwin's implementation of ELO functions

def elo_pred(elo1, elo2):
    return(1. / (10. ** (-(elo1 - elo2) / 400.) + 1.))

def expected_margin(elo_diff):
    return((7.5 + 0.006 * elo_diff))

def elo_update(w_elo, l_elo, margin):
    elo_diff = w_elo - l_elo
    pred = elo_pred(w_elo, l_elo)
    mult = ((margin + 3.) ** 0.8) / expected_margin(elo_diff)
    update = K * mult * (1 - pred)
    return(pred, update)
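
As a sanity check on the functions above, here is a minimal worked example for the very first game in the data (team 3104 beat team 3202 at home, 91 to 41). The parameters and functions are restated so the block runs on its own; both teams are assumed to start at 1500, as set up in `elo_dict`.

```python
# Parameters and functions restated from the cells above
K = 20.
HOME_ADVANTAGE = 100.

def elo_pred(elo1, elo2):
    return 1. / (10. ** (-(elo1 - elo2) / 400.) + 1.)

def expected_margin(elo_diff):
    return 7.5 + 0.006 * elo_diff

def elo_update(w_elo, l_elo, margin):
    elo_diff = w_elo - l_elo
    pred = elo_pred(w_elo, l_elo)
    mult = ((margin + 3.) ** 0.8) / expected_margin(elo_diff)
    update = K * mult * (1 - pred)
    return pred, update

# Both teams start at 1500; the winner played at home; margin is 91 - 41 = 50
pred, update = elo_update(1500. + HOME_ADVANTAGE, 1500., 50)

# The home bonus only affects the prediction, not the stored ratings
new_w_elo = 1500. + update
new_l_elo = 1500. - update
print(round(pred, 3), round(new_w_elo, 2), round(new_l_elo, 2))
```

The winner's new rating should agree with the 1521.290691 that appears for team 3104 in the first row of the final tables further down. Because the same `update` is added to one team and subtracted from the other, the two ratings always sum to the same total.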

- We have data as far back as **1998**, and it is already pretty clean
- The ELO algorithm starts every team at 1500 points in 1998
- Ratings then move up and down with each win and loss
- Each team's ELO carries over into the next season (there is no reset between seasons)

In [92]:
rs.Season.unique()

array([1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008,
       2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018], dtype=int64)

In [93]:
df = pd.concat([rs,ts],axis=0)

In [94]:
df.tail()

Unnamed: 0,Season,DayNum,WTeamID,WScore,LTeamID,LScore,WLoc,NumOT,margin,reg_season
1255,2017,147,3163,90,3332,52,H,0,38,0
1256,2017,147,3376,71,3199,64,N,0,7,0
1257,2017,151,3280,66,3163,64,N,1,2,0
1258,2017,151,3376,62,3390,53,N,0,9,0
1259,2017,153,3376,67,3280,55,N,0,12,0


In [95]:
# The way we combined rs and ts means that seasons and days aren't sorted
# Sort by year then day

df.sort_values(['Season','DayNum'],axis=0,inplace=True)
display(df.head(3))
display(df.tail(3))

Unnamed: 0,Season,DayNum,WTeamID,WScore,LTeamID,LScore,WLoc,NumOT,margin,reg_season
0,1998,18,3104,91,3202,41,H,0,50,1
1,1998,18,3163,87,3221,76,H,0,11,1
2,1998,18,3222,66,3261,59,H,0,7,1


Unnamed: 0,Season,DayNum,WTeamID,WScore,LTeamID,LScore,WLoc,NumOT,margin,reg_season
101890,2018,132,3311,69,3372,65,N,0,4,1
101891,2018,132,3343,63,3335,34,N,0,29,1
101892,2018,132,3384,66,3352,56,H,0,10,1


In [96]:
#Check to ensure everything is in the right order. 

display(df[df.Season == 1998].head(3))
display(df[df.Season == 1998].tail(3))
display(df[df.Season == 2018].head(3))
display(df[df.Season == 2018].tail(3))

Unnamed: 0,Season,DayNum,WTeamID,WScore,LTeamID,LScore,WLoc,NumOT,margin,reg_season
0,1998,18,3104,91,3202,41,H,0,50,1
1,1998,18,3163,87,3221,76,H,0,11,1
2,1998,18,3222,66,3261,59,H,0,7,1


Unnamed: 0,Season,DayNum,WTeamID,WScore,LTeamID,LScore,WLoc,NumOT,margin,reg_season
60,1998,151,3256,84,3301,65,H,0,19,0
61,1998,151,3397,86,3116,58,H,0,28,0
62,1998,153,3397,93,3256,75,H,0,18,0


Unnamed: 0,Season,DayNum,WTeamID,WScore,LTeamID,LScore,WLoc,NumOT,margin,reg_season
96684,2018,11,3104,90,3105,32,H,0,58,1
96685,2018,11,3108,84,3368,77,A,1,7,1
96686,2018,11,3110,72,3409,67,A,0,5,1


Unnamed: 0,Season,DayNum,WTeamID,WScore,LTeamID,LScore,WLoc,NumOT,margin,reg_season
101890,2018,132,3311,69,3372,65,N,0,4,1
101891,2018,132,3343,63,3335,34,N,0,29,1
101892,2018,132,3384,66,3352,56,H,0,10,1


In [97]:
# I'm going to iterate over the games dataframe using 
# index numbers, so want to check that nothing is out
# of order before I do that.

assert np.all(df.index.values == np.array(range(df.shape[0]))), "Index is out of order."

AssertionError: Index is out of order.

In [98]:
# This fixes it.

df.reset_index(inplace=True)
assert np.all(df.index.values == np.array(range(df.shape[0]))), "Index is out of order."
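
Why the first assert fails can be seen on a toy example. This is only a sketch with made-up rows (`rs_toy` and `ts_toy` are hypothetical stand-ins for `rs` and `ts`): `pd.concat` keeps each frame's original row labels, so after sorting the labels are neither unique nor in order until `reset_index` renumbers them.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the regular-season and tournament frames (hypothetical values)
rs_toy = pd.DataFrame({"Season": [1998, 1999], "DayNum": [18, 20]})
ts_toy = pd.DataFrame({"Season": [1998], "DayNum": [137]})

toy = pd.concat([rs_toy, ts_toy], axis=0).sort_values(["Season", "DayNum"])
labels_before = toy.index.tolist()   # labels carried over from the source frames

# reset_index moves the old labels into an 'index' column and renumbers rows 0..n-1
toy = toy.reset_index()
labels_after = toy.index.tolist()
is_positional = np.array_equal(toy.index.values, np.arange(len(toy)))
print(labels_before, labels_after, is_positional)
```

Keeping the old labels in the `index` column is what later allows `set_index('index')` to restore them.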

In [99]:
# Implemented by LiamKerwin
# Calculates ELO

preds = []
w_elo = []
l_elo = []

# Loop over all rows of the games dataframe
for row in df.itertuples():
    
    # Get key data from current row
    w = row.WTeamID
    l = row.LTeamID
    margin = row.margin
    wloc = row.WLoc
    
    # Does either team get a home-court advantage?
    w_ad, l_ad, = 0., 0.
    if wloc == "H":
        w_ad += HOME_ADVANTAGE
    elif wloc == "A":
        l_ad += HOME_ADVANTAGE
    
    # Get elo updates as a result of the game
    pred, update = elo_update(elo_dict[w] + w_ad,
                              elo_dict[l] + l_ad, 
                              margin)
    elo_dict[w] += update
    elo_dict[l] -= update
    
    # Save prediction and new Elos for each round
    preds.append(pred)
    w_elo.append(elo_dict[w])
    l_elo.append(elo_dict[l])
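
The loop collects `preds` but the notebook never scores them. One possible check, sketched here on made-up numbers rather than the real list, is the log loss of those pre-game predictions. The array `preds_toy` is a hypothetical stand-in for `preds`.

```python
import numpy as np

# Hypothetical pre-game win probabilities for the eventual winners,
# standing in for the `preds` list built in the loop above
preds_toy = np.array([0.64, 0.50, 0.80, 0.30])

# Every row is written from the winner's side, so the true label is always 1
# and the log loss reduces to the mean negative log of the predicted probability
eps = 1e-15
elo_log_loss = -np.mean(np.log(np.clip(preds_toy, eps, 1 - eps)))

# Reference point: always predicting 0.5 gives log(2), about 0.693
baseline = np.log(2.0)
print(round(elo_log_loss, 4), round(baseline, 4))
```

A value below the coin-flip baseline would suggest the ratings carry some signal before any model is trained on them.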

In [100]:
df['w_elo'] = w_elo
df['l_elo'] = l_elo

In [101]:
df.tail(3)

Unnamed: 0,index,Season,DayNum,WTeamID,WScore,LTeamID,LScore,WLoc,NumOT,margin,reg_season,w_elo,l_elo
103150,101890,2018,132,3311,69,3372,65,N,0,4,1,1334.329917,1567.263771
103151,101891,2018,132,3343,63,3335,34,N,0,29,1,1812.939933,1736.006131
103152,101892,2018,132,3384,66,3352,56,H,0,10,1,1450.728901,1475.536369


In [102]:
df.set_index('index',inplace=True)
df.tail(3)

index,Season,DayNum,WTeamID,WScore,LTeamID,LScore,WLoc,NumOT,margin,reg_season,w_elo,l_elo
101890,2018,132,3311,69,3372,65,N,0,4,1,1334.329917,1567.263771
101891,2018,132,3343,63,3335,34,N,0,29,1,1812.939933,1736.006131
101892,2018,132,3384,66,3352,56,H,0,10,1,1450.728901,1475.536369


#### Save Point 1A:

In [103]:
df.to_csv('WNCAA_Women_Data_1A.csv',index=True)

In [104]:
df = pd.read_csv('WNCAA_Women_Data_1A.csv',index_col=0)

In [105]:
df.tail(3)

index,Season,DayNum,WTeamID,WScore,LTeamID,LScore,WLoc,NumOT,margin,reg_season,w_elo,l_elo
101890,2018,132,3311,69,3372,65,N,0,4,1,1334.329917,1567.263771
101891,2018,132,3343,63,3335,34,N,0,29,1,1812.939933,1736.006131
101892,2018,132,3384,66,3352,56,H,0,10,1,1450.728901,1475.536369


In [106]:
df[['w_elo','l_elo']].describe()

Unnamed: 0,w_elo,l_elo
count,103153.0,103153.0
mean,1590.738266,1439.64982
std,245.123762,219.450689
min,886.85156,856.662756
25%,1421.042838,1291.042325
50%,1558.943031,1422.334975
75%,1743.978934,1566.069482
max,2620.188415,2596.52349


In [107]:
# Implemented by LiamKerwin

def final_elo_per_season(df, team_id):
    d = df.copy()
    d = d.loc[(d.WTeamID == team_id) | (d.LTeamID == team_id), :]
    d.sort_values(['Season', 'DayNum'], inplace=True)
    d.drop_duplicates(['Season'], keep='last', inplace=True)
    w_mask = d.WTeamID == team_id
    l_mask = d.LTeamID == team_id
    d['season_elo'] = None
    d.loc[w_mask, 'season_elo'] = d.loc[w_mask, 'w_elo']
    d.loc[l_mask, 'season_elo'] = d.loc[l_mask, 'l_elo']
    out = pd.DataFrame({
        'team_id': team_id,
        'season': d.Season,
        'season_elo': d.season_elo
    })
    return(out)
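
The function above is called once per team. As an alternative sketch (not the kernel's method), the same "ELO after the last game of each season" can be computed for all teams at once by stacking the winner and loser views and grouping. The toy `games` frame below uses hypothetical values with the same column names as `df`.

```python
import pandas as pd

# Toy game log with post-game Elos (hypothetical values), same column names as df
games = pd.DataFrame({
    "Season":  [2017, 2017, 2018],
    "DayNum":  [10, 20, 15],
    "WTeamID": [3101, 3102, 3101],
    "LTeamID": [3102, 3101, 3102],
    "w_elo":   [1510., 1505., 1512.],
    "l_elo":   [1490., 1495., 1488.],
})

# Stack winner and loser views into one long table: one row per team per game
winners = games[["Season", "DayNum", "WTeamID", "w_elo"]].rename(
    columns={"WTeamID": "team_id", "w_elo": "season_elo"})
losers = games[["Season", "DayNum", "LTeamID", "l_elo"]].rename(
    columns={"LTeamID": "team_id", "l_elo": "season_elo"})
long_games = pd.concat([winners, losers], ignore_index=True)

# Keep each team's Elo after its last game of each season
final = (long_games.sort_values(["Season", "DayNum"])
                   .groupby(["team_id", "Season"], as_index=False)
                   .last()[["team_id", "Season", "season_elo"]])
print(final)
```

This avoids copying the full frame once per team, which may matter with a few hundred teams and roughly 100,000 games.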

In [108]:
df_list = [final_elo_per_season(df[df.reg_season == 1], team_id) for team_id in team_ids]
season_elos = pd.concat(df_list)

In [109]:
season_elos.sample(10)

index,season,season_elo,team_id
55011,2009,1528.69,3315
86197,2015,1359.2,3318
3849,1998,1551.72,3414
70443,2012,1870.76,3304
60275,2010,1285.54,3206
60327,2010,1486.76,3142
96537,2017,1609.04,3460
7872,1999,1335.58,3151
39885,2006,1561.92,3160
85990,2015,1970.77,3304


In [110]:
df_list_t = [final_elo_per_season(df[df.reg_season == 0], team_id) for team_id in team_ids]
tourn_elos = pd.concat(df_list_t)

In [111]:
tourn_elos.sample(10)

index,season,season_elo,team_id
779,2010,1778.08,3355
404,2004,1802.68,3283
708,2009,1848.32,3265
319,2003,1749.48,3408
311,2002,1986.32,3435
991,2013,1894.81,3234
707,2009,1625.56,3441
1194,2016,2061.06,3333
1045,2014,1961.72,3345
1191,2016,2093.53,3390


In [112]:
season_elos.to_csv('season_elos.csv',index=True)
tourn_elos.to_csv('tournament_elos.csv',index=True)

### Seeds
- Need the seed information to add to the table
- Tournament seeds are in a different table
- No seed information exists for the regular season

Strategy:
- Will split out (and encode) the Division
- Split out the Seed #

In [113]:
seeds = pd.read_csv('WNCAATourneySeeds.csv')

In [114]:
seeds.head(2)

Unnamed: 0,Season,Seed,TeamID
0,1998,W01,3330
1,1998,W02,3163


In [115]:
import re
q1 = re.compile('[a-zA-Z]')
q2 = re.compile('[0-9][0-9]*')

In [116]:
seeds['Div'] = seeds.Seed.apply(lambda x: q1.search(x).group())
seeds['Num'] = seeds.Seed.apply(lambda x: q2.search(x).group())

In [117]:
adict = {'W': 1, 'X': 2, 'Y': 3, 'Z': 4}
seeds.Div = seeds.Div.map(adict)
seeds.head(2)

Unnamed: 0,Season,Seed,TeamID,Div,Num
0,1998,W01,3330,1,1
1,1998,W02,3163,1,2
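
The two regex passes above could also be done in one step. The following is a sketch using `Series.str.extract` on a few made-up seed strings in the same `W01` style; `seeds_toy` is a hypothetical stand-in for `seeds`.

```python
import pandas as pd

seeds_toy = pd.DataFrame({"Seed": ["W01", "X16", "Z08"]})

# One pass: a leading region letter, then the numeric seed
parts = seeds_toy["Seed"].str.extract(r"^(?P<Div>[A-Za-z])(?P<Num>\d+)")
seeds_toy["Div"] = parts["Div"].map({"W": 1, "X": 2, "Y": 3, "Z": 4})
seeds_toy["Num"] = parts["Num"].astype(int)
print(seeds_toy)
```

One difference from the cells above: `Num` ends up as an integer right away instead of a string, so a later `pd.to_numeric` call would not be needed for it.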


In [118]:
df.columns

Index(['Season', 'DayNum', 'WTeamID', 'WScore', 'LTeamID', 'LScore', 'WLoc',
       'NumOT', 'margin', 'reg_season', 'w_elo', 'l_elo'],
      dtype='object')

In [119]:
df2 = df.copy()

In [120]:
df = df2.copy()

### Important Notes:
- Competition requires calc of probability of "1st team" winning
    - 1st team is the team with lowest ID, given two competitors
    - Data is not structured that way
    
  
- In order to train a model...
    - Need to erase the Wteam/Lteam distinction
    - Instead just use Team1 and Team2, with a target label "Team1_Wins"


- If WTeam is simply renamed to Team1, then Team1_Wins is always 1
    - The model would never see a losing example, which would wreck training
    - Alternative: randomize which team is Team1
    - Make sure the label matches

The requirement for Team1_ID being the lower of the two is not important in this stage. IDs are just bookkeeping, not features, so the model will be blind to them. As long as the model is trained with labels set up correctly so that we are predicting the probability of Team1 victory, this will work.
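
The relabeling described above can be sketched on a toy frame. This is a compact variant using `np.where` and a seeded generator, not the exact cells that follow; the ELO values are rounded for illustration and the seed value is arbitrary.

```python
import numpy as np
import pandas as pd

games = pd.DataFrame({
    "WTeamID": [3104, 3163, 3222, 3349],
    "LTeamID": [3202, 3221, 3261, 3411],
    "w_elo":   [1521.3, 1507.3, 1505.6, 1530.5],
    "l_elo":   [1478.7, 1492.7, 1494.4, 1469.5],
})

# Seeded generator so the shuffle is reproducible (the seed value is arbitrary)
rng = np.random.default_rng(0)
swap = rng.integers(0, 2, size=len(games)).astype(bool)

# swap == False: winner becomes Team1; swap == True: loser becomes Team1
games["Team1_ID"] = np.where(swap, games["LTeamID"], games["WTeamID"])
games["Team2_ID"] = np.where(swap, games["WTeamID"], games["LTeamID"])
games["Team1_Elo"] = np.where(swap, games["l_elo"], games["w_elo"])
games["Team2_Elo"] = np.where(swap, games["w_elo"], games["l_elo"])
games["Team1_Wins"] = (~swap).astype(int)
print(games[["Team1_ID", "Team2_ID", "Team1_Wins"]])
```

Fixing the seed means the Team1/Team2 assignment, and therefore the saved training file, is the same on every run.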

In [121]:
# The following are a bit cumbersome
# but helped me keep my wits as I made the changes.

seeds['Team1_ID'] = seeds.TeamID
seeds['Team2_ID'] = seeds.TeamID

seeds['Team1_Seed'] = seeds.Num
seeds['Team2_Seed'] = seeds.Num

seeds['Team1_Div'] = seeds.Div
seeds['Team2_Div'] = seeds.Div


In [122]:
# Extract team1 (t1) and team2 (t2) components

seedst1 = seeds[['Season', 'Team1_ID', 'Team1_Seed']]
seedst2 = seeds[['Season', 'Team2_ID', 'Team2_Seed']]

In [123]:
# Creation of empty columns

df['Team1_ID'] = 0
df['Team2_ID'] = 0

df['Team1_Score'] = 0
df['Team2_Score'] = 0

df['Team1_Elo'] = 0
df['Team2_Elo'] = 0

df['Team1_Wins'] = 0

In [124]:
# The randomization happens here

shakeup = np.random.randint(0,2, size=(len(df)))

In [125]:
#Where shakeup has 0, we will use the Winning Team as Team 1
#That means the Losing Team is team2

df.loc[shakeup == 0,'Team1_ID'] = df.loc[shakeup == 0,'WTeamID']
df.loc[shakeup == 0,'Team2_ID'] = df.loc[shakeup == 0,'LTeamID']

df.loc[shakeup == 0,'Team1_Elo'] = df.loc[shakeup == 0,'w_elo']
df.loc[shakeup == 0,'Team2_Elo'] = df.loc[shakeup == 0,'l_elo']

df.loc[shakeup == 0,'Team1_Score'] = df.loc[shakeup == 0,'WScore']
df.loc[shakeup == 0,'Team2_Score'] = df.loc[shakeup == 0,'LScore']

df.loc[shakeup == 0,'Team1_Wins'] = 1

In [126]:
# Where shakeup has 1, we will use the Losing Team as Team 1
# That means the Winning Team is team2

df.loc[shakeup == 1,'Team1_ID'] = df.loc[shakeup == 1,'LTeamID']
df.loc[shakeup == 1,'Team2_ID'] = df.loc[shakeup == 1,'WTeamID']

df.loc[shakeup == 1,'Team1_Elo'] = df.loc[shakeup == 1,'l_elo']
df.loc[shakeup == 1,'Team2_Elo'] = df.loc[shakeup == 1,'w_elo']

df.loc[shakeup == 1,'Team1_Score'] = df.loc[shakeup == 1,'LScore']
df.loc[shakeup == 1,'Team2_Score'] = df.loc[shakeup == 1,'WScore']

# Technically, this is already 0
df.loc[shakeup == 1,'Team1_Wins'] = 0

# Need a conditional to prevent me from accidentally adding multiple joins
if 'Team1_Seed' not in df.columns:
    # Pass the join keys as a list; a tuple can be read as a single column label
    df = df.merge(seedst1, how = 'left', on = ['Season', 'Team1_ID'])
    df = df.merge(seedst2, how = 'left', on = ['Season', 'Team2_ID'])

# The features come with a mix of text and numbers, and are object type
# The text are actually numbers anyway, so force them to behave that way
df['Team1_Seed'] = pd.to_numeric(df.Team1_Seed,errors='coerce')
df['Team2_Seed'] = pd.to_numeric(df.Team2_Seed,errors='coerce')

#has to come before the seed = 17 below
df.fillna(0,inplace=True)    

# Executive decision: Any team that doesn't have a seed gets 17
df.loc[df.Team1_Seed == 0, 'Team1_Seed'] = 17
df.loc[df.Team2_Seed == 0, 'Team2_Seed'] = 17

# Calculate the difference between seeds
df['Seed_Gap'] = df['Team1_Seed'] - df['Team2_Seed']
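
The seed handling in this cell can be illustrated on two toy games. This is a sketch that fills only the seed columns, rather than calling `fillna(0)` on the whole frame and then swapping 0 for 17. The frames `games` and `seeds_toy` are hypothetical; the 1998 seeds for teams 3112 and 3365 are the ones shown in the output below.

```python
import pandas as pd

games = pd.DataFrame({"Season": [1998, 1998],
                      "Team1_ID": [3112, 3411],
                      "Team2_ID": [3365, 3349]})
seeds_toy = pd.DataFrame({"Season": [1998, 1998],
                          "TeamID": [3112, 3365],
                          "Num": [3, 14]})

UNSEEDED = 17  # same placeholder seed the notebook assigns

for side in ("Team1", "Team2"):
    lookup = seeds_toy.rename(columns={"TeamID": f"{side}_ID", "Num": f"{side}_Seed"})
    games = games.merge(lookup, how="left", on=["Season", f"{side}_ID"])
    # A left merge leaves NaN for teams that were not seeded that season
    games[f"{side}_Seed"] = games[f"{side}_Seed"].fillna(UNSEEDED).astype(int)

games["Seed_Gap"] = games["Team1_Seed"] - games["Team2_Seed"]
print(games)
```

Two unseeded teams get a `Seed_Gap` of 0, the same value as two equally seeded tournament teams, which is worth keeping in mind when reading feature importances.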

In [127]:
df.loc[((shakeup == 0) & (df.reg_season == 0))].head(3)

Unnamed: 0,Season,DayNum,WTeamID,WScore,LTeamID,LScore,WLoc,NumOT,margin,reg_season,...,Team1_ID,Team2_ID,Team1_Score,Team2_Score,Team1_Elo,Team2_Elo,Team1_Wins,Team1_Seed,Team2_Seed,Seed_Gap
3935,1998,137,3112,75,3365,63,H,0,12,0,...,3112,3365,75,63,1667.742457,1665.617899,1,3.0,14.0,-11.0
3945,1998,137,3330,92,3384,39,H,0,53,0,...,3330,3384,92,39,1765.234964,1573.298022,1,1.0,16.0,-15.0
3948,1998,137,3438,77,3374,68,H,0,9,0,...,3438,3374,77,68,1589.012624,1630.2461,1,6.0,11.0,-5.0


In [128]:
df.loc[shakeup == 1].head(3)

Unnamed: 0,Season,DayNum,WTeamID,WScore,LTeamID,LScore,WLoc,NumOT,margin,reg_season,...,Team1_ID,Team2_ID,Team1_Score,Team2_Score,Team1_Elo,Team2_Elo,Team1_Wins,Team1_Seed,Team2_Seed,Seed_Gap
1,1998,18,3163,87,3221,76,H,0,11,1,...,3221,3163,76,87,1492.660415,1507.339585,0,14.0,2.0,12.0
4,1998,18,3349,115,3411,35,H,0,80,1,...,3411,3349,35,115,1469.518837,1530.481163,0,17.0,17.0,0.0
10,1998,19,3141,73,3387,55,H,0,18,1,...,3387,3141,55,73,1489.848166,1510.151834,0,17.0,17.0,0.0


In [129]:
df3 = df.copy()

In [130]:
df = df3.copy()

In [131]:
# Check to see if I left anything null
df.isnull().sum()

Season         0
DayNum         0
WTeamID        0
WScore         0
LTeamID        0
LScore         0
WLoc           0
NumOT          0
margin         0
reg_season     0
w_elo          0
l_elo          0
Team1_ID       0
Team2_ID       0
Team1_Score    0
Team2_Score    0
Team1_Elo      0
Team2_Elo      0
Team1_Wins     0
Team1_Seed     0
Team2_Seed     0
Seed_Gap       0
dtype: int64

In [132]:
df.columns

Index(['Season', 'DayNum', 'WTeamID', 'WScore', 'LTeamID', 'LScore', 'WLoc',
       'NumOT', 'margin', 'reg_season', 'w_elo', 'l_elo', 'Team1_ID',
       'Team2_ID', 'Team1_Score', 'Team2_Score', 'Team1_Elo', 'Team2_Elo',
       'Team1_Wins', 'Team1_Seed', 'Team2_Seed', 'Seed_Gap'],
      dtype='object')

### More Notes:
- Keep in mind that the only inputs we can have into the model are:
    - TeamID
    - Seed
    - Anything we can calculate using pre-tournament knowledge (ELO, Seed_Gap)
    
- Start dropping defunct features and things we cannot carry forward

In [133]:
columns_to_drop = ['WTeamID', 'WScore', 'LTeamID', 'LScore', 'WLoc',
       'NumOT', 'margin', 'w_elo', 'l_elo', 
        'Team1_Score', 'Team2_Score']

df.drop(columns_to_drop,axis=1,inplace=True)

In [134]:
df.columns

Index(['Season', 'DayNum', 'reg_season', 'Team1_ID', 'Team2_ID', 'Team1_Elo',
       'Team2_Elo', 'Team1_Wins', 'Team1_Seed', 'Team2_Seed', 'Seed_Gap'],
      dtype='object')

In [135]:
# I will change most of these to categorical later
# but as that is model specific, I did not do it here

df.dtypes

Season          int64
DayNum          int64
reg_season      int64
Team1_ID        int64
Team2_ID        int64
Team1_Elo     float64
Team2_Elo     float64
Team1_Wins      int64
Team1_Seed    float64
Team2_Seed    float64
Seed_Gap      float64
dtype: object

In [136]:
df.head()

Unnamed: 0,Season,DayNum,reg_season,Team1_ID,Team2_ID,Team1_Elo,Team2_Elo,Team1_Wins,Team1_Seed,Team2_Seed,Seed_Gap
0,1998,18,1,3104,3202,1521.290691,1478.709309,1,2.0,17.0,-15.0
1,1998,18,1,3221,3163,1492.660415,1507.339585,0,14.0,2.0,12.0
2,1998,18,1,3222,3261,1505.607497,1494.392503,1,17.0,17.0,0.0
3,1998,18,1,3307,3365,1505.607497,1494.392503,1,8.0,14.0,-6.0
4,1998,18,1,3411,3349,1469.518837,1530.481163,0,17.0,17.0,0.0


### Save Point 1B
Use this data for the modelling phase

In [137]:
df.to_csv('WNCAA_Women_Data_1B.csv',index=True)

In [138]:
df.head(3)

Unnamed: 0,Season,DayNum,reg_season,Team1_ID,Team2_ID,Team1_Elo,Team2_Elo,Team1_Wins,Team1_Seed,Team2_Seed,Seed_Gap
0,1998,18,1,3104,3202,1521.290691,1478.709309,1,2.0,17.0,-15.0
1,1998,18,1,3221,3163,1492.660415,1507.339585,0,14.0,2.0,12.0
2,1998,18,1,3222,3261,1505.607497,1494.392503,1,17.0,17.0,0.0


In [139]:
seeds2018 = seeds.loc[seeds.Season == 2018, ['TeamID', 'Num']].copy()

seeds2018.reset_index(drop=True, inplace=True)
seeds2018.sort_values('TeamID',inplace=True)

# Rolls out pairings of (low_id, high_id) all the way to the end
# Gives n * (n-1) / 2 rows of pairings w/ no duplicates
# The tournament has 64 slots, but can have pre-games so the data may have n > 64
combinations = [[i,j] for i in seeds2018.TeamID for j in seeds2018.TeamID if i < j]

len(combinations)

2016
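
The same pairing can be sketched with `itertools.combinations`, which yields each unordered pair once. With sorted input the lower ID always comes first. The four team IDs below are an arbitrary subset for illustration.

```python
from itertools import combinations as pair_up

team_ids_toy = [3124, 3110, 3113, 3114]   # arbitrary subset for illustration

# Sorting first guarantees (low_id, high_id) ordering within every pair
pairs = list(pair_up(sorted(team_ids_toy), 2))

# For the full field of n = 64 teams: n * (n - 1) / 2 pairings
n = 64
n_pairs_full_field = n * (n - 1) // 2
print(len(pairs), pairs[0], n_pairs_full_field)
```

The count for 64 teams matches the 2016 rows produced by the list comprehension above.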

### Prepare the Test data from the tournament entrants
- Take the test data from Kaggle (namely, the teams and their seeds)
- Put the final ELO from 2018 reg.season as the ELO for the tournament
- Use the median tournament day (day 139 in this data) as the imputed 'DayNum'

In [140]:
test_df = []

test_df = pd.DataFrame(combinations, columns=['Team1_ID','Team2_ID'])

test_df = test_df.merge(seeds2018, how = 'left', left_on = 'Team1_ID', right_on = 'TeamID')
test_df = test_df.merge(seeds2018, how = 'left', left_on = 'Team2_ID', right_on = 'TeamID')

test_df.drop(['TeamID_x'], axis=1, inplace=True)
test_df.drop(['TeamID_y'], axis=1, inplace=True)

test_df.rename(columns = {'Num_x': 'Team1_Seed'}, inplace=True)
test_df.rename(columns = {'Num_y': 'Team2_Seed'}, inplace=True)

In [141]:
test_df['Team1_Seed'] = pd.to_numeric(test_df.Team1_Seed,errors='coerce')
test_df['Team2_Seed'] = pd.to_numeric(test_df.Team2_Seed,errors='coerce')

test_df['Seed_Gap'] = test_df['Team1_Seed'] - test_df['Team2_Seed']

In [142]:
test_df.dtypes

Team1_ID      int64
Team2_ID      int64
Team1_Seed    int64
Team2_Seed    int64
Seed_Gap      int64
dtype: object

In [143]:
test_df.head()

Unnamed: 0,Team1_ID,Team2_ID,Team1_Seed,Team2_Seed,Seed_Gap
0,3110,3113,14,7,7
1,3110,3114,14,14,0
2,3110,3124,14,2,12
3,3110,3125,14,12,2
4,3110,3129,14,16,-2


In [144]:
final_elo = season_elos[season_elos.season == 2018].copy()
test_df = test_df.merge(final_elo[['season_elo','team_id']], how = 'left', left_on = 'Team1_ID', right_on = 'team_id')
test_df = test_df.merge(final_elo[['season_elo','team_id']], how = 'left', left_on = 'Team2_ID', right_on = 'team_id')
test_df.head(3)

Unnamed: 0,Team1_ID,Team2_ID,Team1_Seed,Team2_Seed,Seed_Gap,season_elo_x,team_id_x,season_elo_y,team_id_y
0,3110,3113,14,7,7,1578.23,3110,1948.27,3113
1,3110,3114,14,14,0,1578.23,3110,1691.2,3114
2,3110,3124,14,2,12,1578.23,3110,2373.18,3124


In [145]:
test_df.rename(columns = {'season_elo_x': 'Team1_Elo', 'season_elo_y': 'Team2_Elo'},inplace=True)
test_df.drop(['team_id_x','team_id_y'], axis=1, inplace=True)

test_df['reg_season'] = 0
df.loc[df.reg_season == 0, 'DayNum'].median()

139.0

In [146]:
test_df['DayNum'] = df.loc[df.reg_season == 0, 'DayNum'].median()
test_df['Season'] = 2018

In [148]:
test_df.head(3)

Unnamed: 0,Team1_ID,Team2_ID,Team1_Seed,Team2_Seed,Seed_Gap,Team1_Elo,Team2_Elo,reg_season,DayNum,Season
0,3110,3113,14,7,7,1578.23,1948.27,0,139.0,2018
1,3110,3114,14,14,0,1578.23,1691.2,0,139.0,2018
2,3110,3124,14,2,12,1578.23,2373.18,0,139.0,2018


### Save for use in the ML notebook

In [149]:
test_df.to_csv('Test_Data_for_Modelling.csv')
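
For orientation, here is a sketch of how rows like these might later be turned into a submission file. It assumes the submission uses an `ID` column of the form `Season_LowerTeamID_HigherTeamID` and a `Pred` column; that layout is an assumption and should be checked against the competition's sample submission. The probability here is a placeholder computed from the ELO gap alone, not the output of the model built in the ML notebook.

```python
import pandas as pd

# Two rows copied from the test table above
test_toy = pd.DataFrame({
    "Season": [2018, 2018],
    "Team1_ID": [3110, 3110],
    "Team2_ID": [3113, 3114],
    "Team1_Elo": [1578.23, 1578.23],
    "Team2_Elo": [1948.27, 1691.20],
})

# Assumed ID layout: Season_LowerTeamID_HigherTeamID
ids = (test_toy["Season"].astype(str) + "_"
       + test_toy["Team1_ID"].astype(str) + "_"
       + test_toy["Team2_ID"].astype(str))

# Placeholder probability from the Elo gap alone; a trained model would replace this
pred = 1.0 / (10.0 ** (-(test_toy["Team1_Elo"] - test_toy["Team2_Elo"]) / 400.0) + 1.0)

submission = pd.DataFrame({"ID": ids, "Pred": pred})
print(submission)
```

Because `Team1_ID` is already the lower ID in every test row, `Pred` reads directly as the probability that the lower-ID team wins.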