## Google Cloud & NCAA® ML Competition 2018-Women's

### https://www.kaggle.com/c/womens-machine-learning-competition-2018/


### Strategy:
- Exploit the simplicity of my men's approach and similarity of the "Compact" datasets to perform rapid modelling of the women's data
- There are far fewer years of women's data, but it has the same format

#### From my Men's Notebook: 
- Short on time, rely on data analysis performed and published in others' Kaggle kernels
    - Missed the first part of the competition so I am behind on model development
    - Wanted to explore some of the fastai NN stuff, but won't have time
    - Going to focus on GBM algorithms
- Employ 538's take on ELO calcs to determine team rating, implement the Kaggle kernel by LiamKerwin 
    - https://en.wikipedia.org/wiki/Elo_rating_system
    - https://www.kaggle.com/lpkirwin
    - https://github.com/fivethirtyeight/nfl-elo-game/blob/master/forecast.py
    - https://fivethirtyeight.com/features/how-we-calculate-nba-elo-ratings/
    - Use the ELO as key features
    - Modify as needed for my own needs
- Focus only on core data
    - Tons of data are provided, external data is allowed
    - Kaggle Kernels and other reading indicate super knowledgeable use of data (including external) may give an edge
    - Otherwise, people are getting competitive results with relatively simple features
    - Will use only the most compact dataset and see how it goes!

In [2]:
import numpy as np
import pandas as pd
from sklearn.metrics import log_loss

In [3]:
# Parameters for the ELO calc

# How fast ELO changes
K = 20.

# Bonus for playing at home
HOME_ADVANTAGE = 100.

### Import data
- The "Compact" results files are very simple. Each row has:
    - season (year)
    - day of that season game takes place
    - ID of winner and loser (Names are in a different table)
    - WLoc is home/away/neutral for winner
    - NumOT is overtime periods for that game
    - WScore and LScore are final scores for the game

In [4]:
rs = pd.read_csv("WRegularSeasonCompactResults_PrelimData2018.csv")
ts = pd.read_csv("WNCAATourneyCompactResults_PrelimData2018.csv")
display(rs.head(3))
display(ts.head(3))

Unnamed: 0,Season,DayNum,WTeamID,WScore,LTeamID,LScore,WLoc,NumOT
0,1998,18,3104,91,3202,41,H,0
1,1998,18,3163,87,3221,76,H,0
2,1998,18,3222,66,3261,59,H,0


Unnamed: 0,Season,DayNum,WTeamID,WScore,LTeamID,LScore,WLoc,NumOT
0,1998,137,3104,94,3422,46,H,0
1,1998,137,3112,75,3365,63,H,0
2,1998,137,3163,93,3193,52,H,0


In [5]:
# Aggregate TeamIDs and count them

team_ids = set(rs.WTeamID).union(set(rs.LTeamID)).union(set(ts.WTeamID)).union(set(ts.LTeamID))
len(team_ids)

354

In [6]:
# This dictionary will be used as a lookup for current
# scores while the algorithm is iterating through each game

elo_dict = dict(zip(list(team_ids), [1500] * len(team_ids)))

In [7]:
# Elo updates will be scaled based on the margin of victory

rs['margin'] = rs.WScore - rs.LScore
rs['reg_season'] = 1

ts['margin'] = ts.WScore - ts.LScore
ts['reg_season'] = 0

In [8]:
# LiamKerwin's implementation of ELO functions

def elo_pred(elo1, elo2):
    return(1. / (10. ** (-(elo1 - elo2) / 400.) + 1.))

def expected_margin(elo_diff):
    return((7.5 + 0.006 * elo_diff))

def elo_update(w_elo, l_elo, margin):
    elo_diff = w_elo - l_elo
    pred = elo_pred(w_elo, l_elo)
    mult = ((margin + 3.) ** 0.8) / expected_margin(elo_diff)
    update = K * mult * (1 - pred)
    return(pred, update)
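
As a sanity check on these functions, the first game in the data (team 3104 beating 3202 by 50 points at home, both teams starting at 1500) can be worked through by hand. This is only a sketch: it repeats the definitions above so that it runs on its own, and the names `w_start`, `l_start`, `first_pred`, and `first_update` are illustrative, not part of the original kernel.

```python
K = 20.
HOME_ADVANTAGE = 100.

def elo_pred(elo1, elo2):
    # Logistic win probability for the team rated elo1
    return 1. / (10. ** (-(elo1 - elo2) / 400.) + 1.)

def expected_margin(elo_diff):
    return 7.5 + 0.006 * elo_diff

def elo_update(w_elo, l_elo, margin):
    elo_diff = w_elo - l_elo
    pred = elo_pred(w_elo, l_elo)
    # Bigger wins move ratings more; wins that were expected move them less
    mult = ((margin + 3.) ** 0.8) / expected_margin(elo_diff)
    update = K * mult * (1 - pred)
    return pred, update

# Both teams start at 1500; the winner played at home, so it gets the bonus
w_start, l_start = 1500., 1500.
first_pred, first_update = elo_update(w_start + HOME_ADVANTAGE, l_start, 50)

print(round(first_pred, 3), round(first_update, 2))
print(w_start + first_update, l_start - first_update)
```

The home bonus only affects the prediction; the update itself is applied to the unadjusted ratings. The result should agree with the ratings of roughly 1521.29 and 1478.71 that appear for this game in the final table of this notebook.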

- We have women's data as far back as 1998, and it is already pretty clean
- The ELO algorithm will start with teams in 1998 having 1500 pts apiece
- Ratings will then adjust up and down based on wins and losses
- Each team's ELO carries over to the next season

In [9]:
rs.Season.unique()

array([1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008,
       2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018], dtype=int64)

In [10]:
df = pd.concat([rs,ts],axis=0)

In [11]:
df.tail()

Unnamed: 0,Season,DayNum,WTeamID,WScore,LTeamID,LScore,WLoc,NumOT,margin,reg_season
1255,2017,147,3163,90,3332,52,H,0,38,0
1256,2017,147,3376,71,3199,64,N,0,7,0
1257,2017,151,3280,66,3163,64,N,1,2,0
1258,2017,151,3376,62,3390,53,N,0,9,0
1259,2017,153,3376,67,3280,55,N,0,12,0


In [12]:
# The way we combined rs and ts means that seasons and days aren't sorted
# Sort by year then day

df.sort_values(['Season','DayNum'],axis=0,inplace=True)
display(df.head(3))
display(df.tail(3))

Unnamed: 0,Season,DayNum,WTeamID,WScore,LTeamID,LScore,WLoc,NumOT,margin,reg_season
0,1998,18,3104,91,3202,41,H,0,50,1
1,1998,18,3163,87,3221,76,H,0,11,1
2,1998,18,3222,66,3261,59,H,0,7,1


Unnamed: 0,Season,DayNum,WTeamID,WScore,LTeamID,LScore,WLoc,NumOT,margin,reg_season
101447,2018,119,3411,76,3106,48,H,0,28,1
101448,2018,119,3416,75,3187,54,H,0,21,1
101449,2018,119,3455,70,3409,59,A,0,11,1


In [14]:
#Check to ensure everything is in the right order. 

display(df[df.Season == 1998].head(3))
display(df[df.Season == 1998].tail(3))
display(df[df.Season == 2018].head(3))
display(df[df.Season == 2018].tail(3))

Unnamed: 0,Season,DayNum,WTeamID,WScore,LTeamID,LScore,WLoc,NumOT,margin,reg_season
0,1998,18,3104,91,3202,41,H,0,50,1
1,1998,18,3163,87,3221,76,H,0,11,1
2,1998,18,3222,66,3261,59,H,0,7,1


Unnamed: 0,Season,DayNum,WTeamID,WScore,LTeamID,LScore,WLoc,NumOT,margin,reg_season
60,1998,151,3256,84,3301,65,H,0,19,0
61,1998,151,3397,86,3116,58,H,0,28,0
62,1998,153,3397,93,3256,75,H,0,18,0


Unnamed: 0,Season,DayNum,WTeamID,WScore,LTeamID,LScore,WLoc,NumOT,margin,reg_season
96684,2018,11,3104,90,3105,32,H,0,58,1
96685,2018,11,3108,84,3368,77,A,1,7,1
96686,2018,11,3110,72,3409,67,A,0,5,1


Unnamed: 0,Season,DayNum,WTeamID,WScore,LTeamID,LScore,WLoc,NumOT,margin,reg_season
101447,2018,119,3411,76,3106,48,H,0,28,1
101448,2018,119,3416,75,3187,54,H,0,21,1
101449,2018,119,3455,70,3409,59,A,0,11,1


In [15]:
# I'm going to iterate over the games dataframe using 
# index numbers, so want to check that nothing is out
# of order before I do that.

assert np.all(df.index.values == np.array(range(df.shape[0]))), "Index is out of order."

AssertionError: Index is out of order.

In [16]:
# This fixes it. reset_index() keeps the old row labels
# in a new 'index' column, which is restored further down.

df.reset_index(inplace=True)
assert np.all(df.index.values == np.array(range(df.shape[0]))), "Index is out of order."
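
Why the first assertion fails can be seen on a toy example. The sketch below uses two tiny made-up frames (`rs_toy` and `ts_toy` are hypothetical stand-ins for `rs` and `ts`): concatenating keeps each frame's own row labels, so after sorting the labels are neither unique nor in order.

```python
import pandas as pd

# Hypothetical miniature versions of the regular-season and tournament tables
rs_toy = pd.DataFrame({"Season": [1998, 1999], "DayNum": [18, 20]})
ts_toy = pd.DataFrame({"Season": [1998], "DayNum": [137]})

# concat keeps each frame's own labels (0, 1 and 0), sorting then shuffles them
games = pd.concat([rs_toy, ts_toy], axis=0).sort_values(["Season", "DayNum"])
print(games.index.tolist())

# reset_index() gives a clean 0..n-1 index and keeps the old labels as a column
games_reset = games.reset_index()
print(games_reset)
```

Note that the old labels are duplicated across the two source tables, so restoring them later with `set_index('index')` gives an index that is not unique.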

In [17]:
# Implemented by LiamKerwin
# Calculates ELO

preds = []
w_elo = []
l_elo = []

# Loop over all rows of the games dataframe
for row in df.itertuples():
    
    # Get key data from current row
    w = row.WTeamID
    l = row.LTeamID
    margin = row.margin
    wloc = row.WLoc
    
    # Does either team get a home-court advantage?
    w_ad, l_ad = 0., 0.
    if wloc == "H":
        w_ad += HOME_ADVANTAGE
    elif wloc == "A":
        l_ad += HOME_ADVANTAGE
    
    # Get elo updates as a result of the game
    pred, update = elo_update(elo_dict[w] + w_ad,
                              elo_dict[l] + l_ad, 
                              margin)
    elo_dict[w] += update
    elo_dict[l] -= update
    
    # Save prediction and new Elos for each round
    preds.append(pred)
    w_elo.append(elo_dict[w])
    l_elo.append(elo_dict[l])

In [18]:
df['w_elo'] = w_elo
df['l_elo'] = l_elo

In [19]:
df.tail(3)

Unnamed: 0,index,Season,DayNum,WTeamID,WScore,LTeamID,LScore,WLoc,NumOT,margin,reg_season,w_elo,l_elo
102707,101447,2018,119,3411,76,3106,48,H,0,28,1,1407.231151,1176.558527
102708,101448,2018,119,3416,75,3187,54,H,0,21,1,1706.959784,1493.383193
102709,101449,2018,119,3455,70,3409,59,A,0,11,1,1558.78312,1448.146788


In [20]:
df.set_index('index',inplace=True)
df.tail(3)

index,Season,DayNum,WTeamID,WScore,LTeamID,LScore,WLoc,NumOT,margin,reg_season,w_elo,l_elo
101447,2018,119,3411,76,3106,48,H,0,28,1,1407.231151,1176.558527
101448,2018,119,3416,75,3187,54,H,0,21,1,1706.959784,1493.383193
101449,2018,119,3455,70,3409,59,A,0,11,1,1558.78312,1448.146788


#### Save Point 1A:

In [21]:
df.to_csv('WNCAA_Women_Data_1A.csv',index=True)

In [22]:
df = pd.read_csv('WNCAA_Women_Data_1A.csv',index_col=0)

In [23]:
df.tail(3)

index,Season,DayNum,WTeamID,WScore,LTeamID,LScore,WLoc,NumOT,margin,reg_season,w_elo,l_elo
101447,2018,119,3411,76,3106,48,H,0,28,1,1407.231151,1176.558527
101448,2018,119,3416,75,3187,54,H,0,21,1,1706.959784,1493.383193
101449,2018,119,3455,70,3409,59,A,0,11,1,1558.78312,1448.146788


In [24]:
df.loc[:,('w_elo','l_elo')].describe()

Unnamed: 0,w_elo,l_elo
count,102710.0,102710.0
mean,1590.713488,1439.648071
std,244.95808,219.285651
min,886.85156,856.662756
25%,1421.05827,1291.176482
50%,1558.9078,1422.352478
75%,1743.816879,1565.942036
max,2619.523978,2596.52349


In [25]:
# Implemented by LiamKerwin

def final_elo_per_season(df, team_id):
    d = df.copy()
    d = d.loc[(d.WTeamID == team_id) | (d.LTeamID == team_id), :]
    d.sort_values(['Season', 'DayNum'], inplace=True)
    d.drop_duplicates(['Season'], keep='last', inplace=True)
    w_mask = d.WTeamID == team_id
    l_mask = d.LTeamID == team_id
    d['season_elo'] = None
    d.loc[w_mask, 'season_elo'] = d.loc[w_mask, 'w_elo']
    d.loc[l_mask, 'season_elo'] = d.loc[l_mask, 'l_elo']
    out = pd.DataFrame({
        'team_id': team_id,
        'season': d.Season,
        'season_elo': d.season_elo
    })
    return(out)

In [26]:
# Loop variable renamed from `id` so it does not shadow the Python builtin
df_list = [final_elo_per_season(df[df.reg_season == 1], team_id) for team_id in team_ids]
season_elos = pd.concat(df_list)
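
The per-team loop above works, but it copies and filters the full frame once per team. A possible alternative, shown here only as a sketch on made-up data, is to stack winners and losers into one long table and take the last rating per team and season with a single `groupby`. The function name `final_elos_long` and the `toy` frame are hypothetical and not part of the original kernel.

```python
import pandas as pd

# Made-up games: two teams (1 and 2) over two seasons
toy = pd.DataFrame({
    "Season":  [2017, 2017, 2018],
    "DayNum":  [10, 20, 15],
    "WTeamID": [1, 2, 1],
    "LTeamID": [2, 1, 2],
    "w_elo":   [1510., 1505., 1512.],
    "l_elo":   [1490., 1495., 1488.],
})

def final_elos_long(games):
    # One row per (game, team): winners with w_elo, losers with l_elo
    winners = games[["Season", "DayNum", "WTeamID", "w_elo"]].rename(
        columns={"WTeamID": "team_id", "w_elo": "season_elo"})
    losers = games[["Season", "DayNum", "LTeamID", "l_elo"]].rename(
        columns={"LTeamID": "team_id", "l_elo": "season_elo"})
    long = pd.concat([winners, losers], ignore_index=True)
    # Sort chronologically, then keep each team's last rating in each season
    long = long.sort_values(["Season", "DayNum"])
    last = long.groupby(["team_id", "Season"], as_index=False).last()
    return last[["team_id", "Season", "season_elo"]]

final = final_elos_long(toy)
print(final)
```

This assumes a team plays at most one game per day, so sorting by season and day gives an unambiguous last game.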

In [27]:
season_elos.sample(10)

index,season,season_elo,team_id
35216,2005,1170.11,3102
65493,2011,1702.89,3428
25814,2003,1789.41,3120
80919,2014,1130.37,3313
25768,2003,1637.67,3222
54926,2009,1817.46,3280
25916,2003,1408.08,3254
96572,2017,1147.38,3288
39792,2006,1776.06,3246
55213,2009,1809.15,3132


In [28]:
# Same as above, but for tournament games only
df_list_t = [final_elo_per_season(df[df.reg_season == 0], team_id) for team_id in team_ids]
tourn_elos = pd.concat(df_list_t)

In [29]:
tourn_elos.sample(10)

index,season,season_elo,team_id
385,2004,1712.55,3285
322,2003,1660.73,3122
653,2008,1816.47,3462
961,2013,1415.29,3341
538,2006,1772.02,3449
35,1998,1569.42,3234
836,2011,1649.68,3216
1187,2016,1948.47,3280
463,2005,1793.09,3279
148,2000,1678.2,3453


In [30]:
season_elos.to_csv('season_elos.csv',index=True)
tourn_elos.to_csv('tournament_elos.csv',index=True)

### Seeds
- Need the seed information to add to the table
- Tournament seeds are in a different table
- No seed information for standard Season

Strategy:
- Will split out (and encode) the Division
- Split out the Seed #

In [31]:
seeds = pd.read_csv('WNCAATourneySeeds.csv')

In [32]:
seeds.head(2)

Unnamed: 0,Season,Seed,TeamID
0,1998,W01,3330
1,1998,W02,3163


In [33]:
import re
q1 = re.compile('[a-zA-Z]')
q2 = re.compile('[0-9][0-9]*')

In [34]:
seeds['Div'] = seeds.Seed.apply(lambda x: q1.search(x).group())
seeds['Num'] = seeds.Seed.apply(lambda x: q2.search(x).group())

In [35]:
adict = {'W': 1, 'X': 2, 'Y': 3, 'Z': 4}
seeds.Div = seeds.Div.map(adict)
seeds.head(2)

Unnamed: 0,Season,Seed,TeamID,Div,Num
0,1998,W01,3330,1,1
1,1998,W02,3163,1,2
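
The same split can also be done without `apply`, using pandas string methods. This is a sketch under assumptions: `seeds_toy` is a hypothetical frame holding the two rows shown above plus one made-up row, and the seed number is converted to an integer here rather than left as text.

```python
import pandas as pd

# First two rows come from the table above; the third row is made up
seeds_toy = pd.DataFrame({
    "Season": [1998, 1998, 1998],
    "Seed":   ["W01", "W02", "X16"],
    "TeamID": [3330, 3163, 9999],
})

# One regex with two named groups: the region letter and the seed number
parts = seeds_toy["Seed"].str.extract(r"^(?P<Div>[A-Za-z])(?P<Num>\d+)")

seeds_toy["Div"] = parts["Div"].map({"W": 1, "X": 2, "Y": 3, "Z": 4})
seeds_toy["Num"] = parts["Num"].astype(int)
print(seeds_toy)
```

Converting `Num` to an integer at this point would make a later `pd.to_numeric` step unnecessary.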


In [36]:
df.columns

Index(['Season', 'DayNum', 'WTeamID', 'WScore', 'LTeamID', 'LScore', 'WLoc',
       'NumOT', 'margin', 'reg_season', 'w_elo', 'l_elo'],
      dtype='object')

In [37]:
df2 = df.copy()

In [60]:
df = df2.copy()

### Important Notes:
- Competition requires calc of probability of "1st team" winning
    - 1st team is the team with lowest ID, given two competitors
    - Data is not structured that way
    
  
- In order to train a model...
    - Need to erase the Wteam/Lteam distinction
    - Instead just use Team1 and Team2, with a target label "Team1_Wins"


- If WTeam is simply renamed to Team1, then Team1_Wins is always 1
    - A single-class target will screw up model training!
    - Alternative: Randomize which team is Team1
    - Make sure the label matches the assignment

The requirement for Team1_ID being the lower of the two is not important in this stage. IDs are just bookkeeping, not features, so the model will be blind to them. As long as the model is trained with labels set up correctly so that we are predicting the probability of Team1 victory, this will work.
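
The randomization described above can be sketched compactly with `np.where`. This is an illustration on four games taken from the head of the data, not the notebook's own implementation; the `swap` mask and the fixed seed are assumptions added so the example is repeatable.

```python
import numpy as np
import pandas as pd

# Four games from the head of the data, with their post-game Elo ratings
games = pd.DataFrame({
    "WTeamID": [3104, 3163, 3222, 3307],
    "LTeamID": [3202, 3221, 3261, 3365],
    "w_elo":   [1521.290691, 1507.339585, 1505.607497, 1505.607497],
    "l_elo":   [1478.709309, 1492.660415, 1494.392503, 1494.392503],
})

# Seeded generator so the split is the same on every run (an assumption)
rng = np.random.default_rng(0)
swap = rng.integers(0, 2, size=len(games)).astype(bool)

# Where swap is True the loser becomes Team1, otherwise the winner does
games["Team1_ID"] = np.where(swap, games["LTeamID"], games["WTeamID"])
games["Team2_ID"] = np.where(swap, games["WTeamID"], games["LTeamID"])
games["Team1_Elo"] = np.where(swap, games["l_elo"], games["w_elo"])
games["Team2_Elo"] = np.where(swap, games["w_elo"], games["l_elo"])

# The label follows directly from the mask: Team1 won only if not swapped
games["Team1_Wins"] = (~swap).astype(int)
print(games[["Team1_ID", "Team2_ID", "Team1_Elo", "Team2_Elo", "Team1_Wins"]])
```

Deriving the label from the same mask that assigns the teams keeps the two from drifting apart.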

In [61]:
# The following are a bit cumbersome
# but helped me keep my wits as I made the changes.

seeds['Team1_ID'] = seeds.TeamID
seeds['Team2_ID'] = seeds.TeamID

seeds['Team1_Seed'] = seeds.Num
seeds['Team2_Seed'] = seeds.Num

seeds['Team1_Div'] = seeds.Div
seeds['Team2_Div'] = seeds.Div


In [62]:
# Extract team1 (t1) and team2 (t2) components

seedst1 = seeds[['Season', 'Team1_ID', 'Team1_Seed']]
seedst2 = seeds[['Season', 'Team2_ID', 'Team2_Seed']]

In [63]:
# Creation of empty columns

df['Team1_ID'] = 0
df['Team2_ID'] = 0

df['Team1_Score'] = 0
df['Team2_Score'] = 0

df['Team1_Elo'] = 0
df['Team2_Elo'] = 0

df['Team1_Wins'] = 0

In [65]:
# The randomization happens here
# (no random seed is set, so the split differs from run to run)

shakeup = np.random.randint(0,2, size=(len(df)))

In [66]:
#Where shakeup has 0, we will use the Winning Team as Team 1
#That means the Losing Team is team2

df.loc[shakeup == 0,'Team1_ID'] = df.loc[shakeup == 0,'WTeamID']
df.loc[shakeup == 0,'Team2_ID'] = df.loc[shakeup == 0,'LTeamID']

df.loc[shakeup == 0,'Team1_Elo'] = df.loc[shakeup == 0,'w_elo']
df.loc[shakeup == 0,'Team2_Elo'] = df.loc[shakeup == 0,'l_elo']

df.loc[shakeup == 0,'Team1_Score'] = df.loc[shakeup == 0,'WScore']
df.loc[shakeup == 0,'Team2_Score'] = df.loc[shakeup == 0,'LScore']

df.loc[shakeup == 0,'Team1_Wins'] = 1

In [67]:
# Where shakeup has 1, we will use the Losing Team as Team 1
# That means the Winning Team is team2

df.loc[shakeup == 1,'Team1_ID'] = df.loc[shakeup == 1,'LTeamID']
df.loc[shakeup == 1,'Team2_ID'] = df.loc[shakeup == 1,'WTeamID']

df.loc[shakeup == 1,'Team1_Elo'] = df.loc[shakeup == 1,'l_elo']
df.loc[shakeup == 1,'Team2_Elo'] = df.loc[shakeup == 1,'w_elo']

df.loc[shakeup == 1,'Team1_Score'] = df.loc[shakeup == 1,'LScore']
df.loc[shakeup == 1,'Team2_Score'] = df.loc[shakeup == 1,'WScore']

# Technically, this is already 0
df.loc[shakeup == 1,'Team1_Wins'] = 0

# Guard so that re-running this cell does not merge the seed columns twice
if 'Team1_Seed' not in df.columns:
    df = df.merge(seedst1, how='left', on=['Season', 'Team1_ID'])
    df = df.merge(seedst2, how='left', on=['Season', 'Team2_ID'])

# The features come with a mix of text and numbers, and are object type
# The text are actually numbers anyway, so force them to behave that way
df['Team1_Seed'] = pd.to_numeric(df.Team1_Seed,errors='coerce')
df['Team2_Seed'] = pd.to_numeric(df.Team2_Seed,errors='coerce')

# Unseeded teams are NaN after the left merge; set them to 0 first.
# This has to come before the seed = 17 step below.
df.fillna(0,inplace=True)    

# Executive decision: Any team that doesn't have a seed gets 17
df.loc[df.Team1_Seed == 0, 'Team1_Seed'] = 17
df.loc[df.Team2_Seed == 0, 'Team2_Seed'] = 17

# Calculate the difference between seeds
df['Seed_Gap'] = df['Team1_Seed'] - df['Team2_Seed']
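
The seed join and the "unseeded gets 17" rule can be checked on a small example. The sketch below is a variant rather than a copy of the cell above: it fills only the seed columns instead of calling `fillna` on the whole frame, and the names `seeds_toy` and `UNSEEDED` are hypothetical. The seeds used (2 for team 3163, 14 for team 3221) are the ones shown for 1998 in this notebook.

```python
import pandas as pd

# Two 1998 games: one between seeded teams, one between unseeded teams
games = pd.DataFrame({
    "Season":   [1998, 1998],
    "Team1_ID": [3163, 3411],
    "Team2_ID": [3221, 3349],
})

# Seed numbers arrive as text because they were cut out of strings like 'W02'
seeds_toy = pd.DataFrame({
    "Season": [1998, 1998],
    "TeamID": [3163, 3221],
    "Num":    ["02", "14"],
})
seeds_toy["Num"] = pd.to_numeric(seeds_toy["Num"])

UNSEEDED = 17  # placeholder seed for teams that missed the tournament

for side in ("Team1", "Team2"):
    lookup = seeds_toy.rename(
        columns={"TeamID": f"{side}_ID", "Num": f"{side}_Seed"})
    games = games.merge(lookup, how="left", on=["Season", f"{side}_ID"])
    # Fill only the seed column, so other columns are left untouched
    games[f"{side}_Seed"] = games[f"{side}_Seed"].fillna(UNSEEDED).astype(int)

games["Seed_Gap"] = games["Team1_Seed"] - games["Team2_Seed"]
print(games)
```

Limiting the fill to the seed columns avoids accidentally turning a genuine missing value elsewhere in the frame into 0.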

In [68]:
df.loc[((shakeup == 0) & (df.reg_season == 0))].head(3)

Unnamed: 0,Season,DayNum,WTeamID,WScore,LTeamID,LScore,WLoc,NumOT,margin,reg_season,...,Team1_ID,Team2_ID,Team1_Score,Team2_Score,Team1_Elo,Team2_Elo,Team1_Wins,Team1_Seed,Team2_Seed,Seed_Gap
3936,1998,137,3163,93,3193,52,H,0,41,0,...,3163,3193,93,52,1791.989206,1579.569208,1,2,15,-13
3938,1998,137,3203,74,3208,72,A,0,2,0,...,3203,3208,74,72,1624.052907,1587.70301,1,10,7,3
3939,1998,137,3234,77,3269,59,H,0,18,0,...,3234,3269,77,59,1577.007172,1578.778114,1,4,13,-9


In [69]:
df.loc[shakeup == 1].head(3)

Unnamed: 0,Season,DayNum,WTeamID,WScore,LTeamID,LScore,WLoc,NumOT,margin,reg_season,...,Team1_ID,Team2_ID,Team1_Score,Team2_Score,Team1_Elo,Team2_Elo,Team1_Wins,Team1_Seed,Team2_Seed,Seed_Gap
1,1998,18,3163,87,3221,76,H,0,11,1,...,3221,3163,76,87,1492.660415,1507.339585,0,14,2,12
3,1998,18,3307,69,3365,62,H,0,7,1,...,3365,3307,62,69,1494.392503,1505.607497,0,14,8,6
4,1998,18,3349,115,3411,35,H,0,80,1,...,3411,3349,35,115,1469.518837,1530.481163,0,17,17,0


In [70]:
df3 = df.copy()

In [71]:
df = df3.copy()

In [72]:
# Check to see if I left anything Null
df.isnull().sum()

Season         0
DayNum         0
WTeamID        0
WScore         0
LTeamID        0
LScore         0
WLoc           0
NumOT          0
margin         0
reg_season     0
w_elo          0
l_elo          0
Team1_ID       0
Team2_ID       0
Team1_Score    0
Team2_Score    0
Team1_Elo      0
Team2_Elo      0
Team1_Wins     0
Team1_Seed     0
Team2_Seed     0
Seed_Gap       0
dtype: int64

In [73]:
df.columns

Index(['Season', 'DayNum', 'WTeamID', 'WScore', 'LTeamID', 'LScore', 'WLoc',
       'NumOT', 'margin', 'reg_season', 'w_elo', 'l_elo', 'Team1_ID',
       'Team2_ID', 'Team1_Score', 'Team2_Score', 'Team1_Elo', 'Team2_Elo',
       'Team1_Wins', 'Team1_Seed', 'Team2_Seed', 'Seed_Gap'],
      dtype='object')

### More Notes:
- Keep in mind that the only inputs we can have into the model are:
    - TeamID
    - Seed
    - Anything we can calculate using pre-tournament knowledge (ELO, Seed_Gap)
    
- Start dropping defunct features and things we cannot carry forward
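
To make the constraint concrete, here is a rough sketch of how a feature row for a future matchup might be assembled from pre-tournament information only. Everything in it is an assumption for illustration: the function `matchup_features`, the dictionary lookups, and the rating and seed values are made up. Only the column names and the "lower ID is Team1" rule come from the notes above.

```python
def matchup_features(team_a, team_b, elo, seed, unseeded=17):
    """Build one feature row, with the lower TeamID as Team1."""
    team1, team2 = sorted((team_a, team_b))
    seed1 = seed.get(team1, unseeded)
    seed2 = seed.get(team2, unseeded)
    return {
        "Team1_ID": team1,
        "Team2_ID": team2,
        "Team1_Elo": elo[team1],
        "Team2_Elo": elo[team2],
        "Team1_Seed": seed1,
        "Team2_Seed": seed2,
        "Seed_Gap": seed1 - seed2,
    }

# Made-up end-of-season ratings and seeds, for illustration only
elo_lookup = {3163: 2100.0, 3376: 2050.0}
seed_lookup = {3163: 1, 3376: 2}

row = matchup_features(3376, 3163, elo_lookup, seed_lookup)
print(row)
```

The argument order does not matter: the teams are sorted first, so the row always describes the matchup from the lower ID's point of view.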

In [74]:
columns_to_drop = ['WTeamID', 'WScore', 'LTeamID', 'LScore', 'WLoc',
       'NumOT', 'margin', 'w_elo', 'l_elo', 
        'Team1_Score', 'Team2_Score']

df.drop(columns_to_drop,axis=1,inplace=True)

In [75]:
df.columns

Index(['Season', 'DayNum', 'reg_season', 'Team1_ID', 'Team2_ID', 'Team1_Elo',
       'Team2_Elo', 'Team1_Wins', 'Team1_Seed', 'Team2_Seed', 'Seed_Gap'],
      dtype='object')

In [76]:
# I will change most of these to categorical later
# but as that is model specific, I did not do it here

df.dtypes

Season          int64
DayNum          int64
reg_season      int64
Team1_ID        int64
Team2_ID        int64
Team1_Elo     float64
Team2_Elo     float64
Team1_Wins      int64
Team1_Seed      int64
Team2_Seed      int64
Seed_Gap        int64
dtype: object

In [77]:
df.head()

Unnamed: 0,Season,DayNum,reg_season,Team1_ID,Team2_ID,Team1_Elo,Team2_Elo,Team1_Wins,Team1_Seed,Team2_Seed,Seed_Gap
0,1998,18,1,3104,3202,1521.290691,1478.709309,1,2,17,-15
1,1998,18,1,3221,3163,1492.660415,1507.339585,0,14,2,12
2,1998,18,1,3222,3261,1505.607497,1494.392503,1,17,17,0
3,1998,18,1,3365,3307,1494.392503,1505.607497,0,14,8,6
4,1998,18,1,3411,3349,1469.518837,1530.481163,0,17,17,0


### Save Point 1B
Use this data for the modelling phase

In [78]:
df.to_csv('WNCAA_Women_Data_1B.csv',index=True)

In [79]:
df.head(3)

Unnamed: 0,Season,DayNum,reg_season,Team1_ID,Team2_ID,Team1_Elo,Team2_Elo,Team1_Wins,Team1_Seed,Team2_Seed,Seed_Gap
0,1998,18,1,3104,3202,1521.290691,1478.709309,1,2,17,-15
1,1998,18,1,3221,3163,1492.660415,1507.339585,0,14,2,12
2,1998,18,1,3222,3261,1505.607497,1494.392503,1,17,17,0
