## Google Cloud & NCAA® ML Competition 2018-Men's

### https://www.kaggle.com/c/mens-machine-learning-competition-2018/

### Strategy:
- Short on time, rely on data analysis performed and published in others' Kaggle kernels
    - Missed the first part of the competition so I am behind on model development
    - Wanted to explore some of the fastai NN stuff, but won't have time
    - Going to focus on GBM algorithms
- Employ 538's take on ELO calcs to determine team ratings, implemented via the Kaggle kernel by Liam Kirwin (lpkirwin)
    - https://en.wikipedia.org/wiki/Elo_rating_system
    - https://www.kaggle.com/lpkirwin
    - https://fivethirtyeight.com/features/how-we-calculate-nba-elo-ratings/
    - https://github.com/fivethirtyeight/nfl-elo-game/blob/master/forecast.py
    - Use the ELO as key features
    - Modify as needed for my own needs
- Focus only on core data
    - Tons of data are provided, external data is allowed
    - Kaggle Kernels and other reading indicate super knowledgeable use of data (including external) may give an edge
    - Otherwise, people are getting competitive results with relatively simple features
    - Will use only the most compact dataset and see how it goes!
    
### Special Notes:
- Calculated ELO chronologically over every regular-season and tournament game, up to the start of the 2018 NCAA Tournament
- Randomized which team was Team1 and Team2, setting label "Team1_Wins" 1 or 0 appropriately
- Gave every non-seeded team a seed of 17
- Calculated the difference in seed between team 1 and team 2 for every regular and tournament game

In [76]:
import numpy as np
import pandas as pd
from sklearn.metrics import log_loss

In [77]:
# Parameters for the ELO calc

# How fast ELO changes
K = 20.

# Bonus for playing at home
HOME_ADVANTAGE = 100.

### Import data
- The "Compact" results are very simple and contain:
    - Season (year)
    - DayNum: day of that season on which the game takes place
    - WTeamID and LTeamID: IDs of the winner and loser (names are in a different table)
    - WLoc: home/away/neutral for the winner
    - NumOT: number of overtime periods for that game
    - WScore and LScore: final scores for the game

In [78]:
rs = pd.read_csv("RegularSeasonCompactResults.csv")
ts = pd.read_csv("NCAATourneyCompactResults.csv")
display(rs.head(3))
display(ts.head(3))

Unnamed: 0,Season,DayNum,WTeamID,WScore,LTeamID,LScore,WLoc,NumOT
0,1985,20,1228,81,1328,64,N,0
1,1985,25,1106,77,1354,70,H,0
2,1985,25,1112,63,1223,56,H,0


Unnamed: 0,Season,DayNum,WTeamID,WScore,LTeamID,LScore,WLoc,NumOT
0,1985,136,1116,63,1234,54,N,0
1,1985,136,1120,59,1345,58,N,0
2,1985,136,1207,68,1250,43,N,0


In [79]:
# Aggregate TeamIDs and count them

team_ids = set(rs.WTeamID).union(set(rs.LTeamID)).union(set(ts.WTeamID)).union(set(ts.LTeamID))
len(team_ids)

364

In [80]:
# This dictionary will be used as a lookup for each team's current ELO
# while the algorithm is iterating through each game

elo_dict = dict(zip(list(team_ids), [1500] * len(team_ids)))

In [81]:
# Elo updates will be scaled based on the margin of victory

rs['margin'] = rs.WScore - rs.LScore
rs['reg_season'] = 1

ts['margin'] = ts.WScore - ts.LScore
ts['reg_season'] = 0

In [82]:
# Liam Kirwin's implementation of ELO functions

def elo_pred(elo1, elo2):
    return(1. / (10. ** (-(elo1 - elo2) / 400.) + 1.))

def expected_margin(elo_diff):
    return((7.5 + 0.006 * elo_diff))

def elo_update(w_elo, l_elo, margin):
    elo_diff = w_elo - l_elo
    pred = elo_pred(w_elo, l_elo)
    mult = ((margin + 3.) ** 0.8) / expected_margin(elo_diff)
    update = K * mult * (1 - pred)
    return(pred, update)
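
As a sanity check on the functions above, the update for a single game can be worked out by hand. The sketch below repeats the three functions so that it runs on its own, and reuses the `K` and `HOME_ADVANTAGE` values set earlier. The two inputs mirror the first two 1985 games in the data (a 17-point win at a neutral site and a 7-point home win), assuming both teams still sit at the starting rating of 1500.

```python
K = 20.
HOME_ADVANTAGE = 100.

def elo_pred(elo1, elo2):
    # Probability that the team rated elo1 beats the team rated elo2
    return 1. / (10. ** (-(elo1 - elo2) / 400.) + 1.)

def expected_margin(elo_diff):
    return 7.5 + 0.006 * elo_diff

def elo_update(w_elo, l_elo, margin):
    elo_diff = w_elo - l_elo
    pred = elo_pred(w_elo, l_elo)
    mult = ((margin + 3.) ** 0.8) / expected_margin(elo_diff)
    update = K * mult * (1 - pred)
    return pred, update

# Game 1: two 1500-rated teams, neutral site, winner by 17
pred_neutral, update_neutral = elo_update(1500., 1500., 17)

# Game 2: two 1500-rated teams, winner playing at home, winner by 7
pred_home, update_home = elo_update(1500. + HOME_ADVANTAGE, 1500., 7)

print(round(pred_neutral, 3), round(update_neutral, 3))
print(round(pred_home, 3), round(update_home, 3))
```

The two updates come out near 14.65 and 5.61, which agrees with the ratings of 1514.647474 and 1505.607497 shown for the winners of those two games later in the notebook. The home win moves the rating less because the 100-point bonus already made the home team the favourite (about 64%).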

- We have data as far back as 1985, and it is already pretty clean
- The ELO algorithm starts every team at 1500 points
- Ratings then move up and down with each win and loss
- Each team's ELO carries over to the next season

In [83]:
rs.Season.unique()

array([1985, 1986, 1987, 1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995,
       1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006,
       2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017,
       2018], dtype=int64)

In [84]:
df = pd.concat([rs,ts],axis=0)

In [85]:
df.tail()

Unnamed: 0,Season,DayNum,WTeamID,WScore,LTeamID,LScore,WLoc,NumOT,margin,reg_season
2112,2017,146,1314,75,1246,73,N,0,2,0
2113,2017,146,1376,77,1196,70,N,0,7,0
2114,2017,152,1211,77,1376,73,N,0,4,0
2115,2017,152,1314,77,1332,76,N,0,1,0
2116,2017,154,1314,71,1211,65,N,0,6,0


In [86]:
# The way we combined rs and ts means that seasons and days aren't sorted
# Sort by year then day

df.sort_values(['Season','DayNum'],axis=0,inplace=True)
display(df.head(3))
display(df.tail(3))

Unnamed: 0,Season,DayNum,WTeamID,WScore,LTeamID,LScore,WLoc,NumOT,margin,reg_season
0,1985,20,1228,81,1328,64,N,0,17,1
1,1985,25,1106,77,1354,70,H,0,7,1
2,1985,25,1112,63,1223,56,H,0,7,1


Unnamed: 0,Season,DayNum,WTeamID,WScore,LTeamID,LScore,WLoc,NumOT,margin,reg_season
156086,2018,132,1209,74,1426,61,N,0,13,1
156087,2018,132,1246,77,1397,72,N,0,5,1
156088,2018,132,1335,68,1217,65,N,0,3,1


In [87]:
# Check to ensure everything is in the right order. 

display(df[df.Season == 1985].head(3))
display(df[df.Season == 1985].tail(3))
display(df[df.Season == 2018].head(3))
display(df[df.Season == 2018].tail(3))

Unnamed: 0,Season,DayNum,WTeamID,WScore,LTeamID,LScore,WLoc,NumOT,margin,reg_season
0,1985,20,1228,81,1328,64,N,0,17,1
1,1985,25,1106,77,1354,70,H,0,7,1
2,1985,25,1112,63,1223,56,H,0,7,1


Unnamed: 0,Season,DayNum,WTeamID,WScore,LTeamID,LScore,WLoc,NumOT,margin,reg_season
60,1985,152,1207,77,1385,59,N,0,18,0
61,1985,152,1437,52,1272,45,N,0,7,0
62,1985,154,1437,66,1207,64,N,0,2,0


Unnamed: 0,Season,DayNum,WTeamID,WScore,LTeamID,LScore,WLoc,NumOT,margin,reg_season
150684,2018,11,1104,82,1272,70,N,0,12,1
150685,2018,11,1107,69,1233,67,H,0,2,1
150686,2018,11,1112,101,1319,67,H,0,34,1


Unnamed: 0,Season,DayNum,WTeamID,WScore,LTeamID,LScore,WLoc,NumOT,margin,reg_season
156086,2018,132,1209,74,1426,61,N,0,13,1
156087,2018,132,1246,77,1397,72,N,0,5,1
156088,2018,132,1335,68,1217,65,N,0,3,1


In [88]:
# I'm going to iterate over the games dataframe using 
# index numbers, so want to check that nothing is out
# of order before I do that.

assert np.all(df.index.values == np.array(range(df.shape[0]))), "Index is out of order."

AssertionError: Index is out of order.

In [89]:
# This fixes it.

df.reset_index(inplace=True)
assert np.all(df.index.values == np.array(range(df.shape[0]))), "Index is out of order."
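
The assertion fails the first time because `pd.concat` keeps the row labels of both input frames, and the sort then reorders them. A minimal sketch on toy data (the frames `a` and `b` are made-up stand-ins for `rs` and `ts`) shows what the index looks like before and after `reset_index`:

```python
import pandas as pd

# Hypothetical stand-ins for the regular-season and tournament tables
a = pd.DataFrame({"Season": [1985, 1986], "DayNum": [20, 20]})
b = pd.DataFrame({"Season": [1985], "DayNum": [136]})

combined = pd.concat([a, b], axis=0)
combined = combined.sort_values(["Season", "DayNum"])

# Labels are inherited from a and b, so they repeat and are out of order
labels_before = combined.index.tolist()

# reset_index() builds a fresh 0..n-1 index and keeps the old labels
# in a new column named "index"
fixed = combined.reset_index()
labels_after = fixed.index.tolist()

print(labels_before, labels_after)
```

If the old labels are not needed, `reset_index(drop=True)` or `pd.concat(..., ignore_index=True)` would avoid the extra column. This notebook keeps the column so that the original labels can be restored later.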

In [90]:
# Implemented by Liam Kirwin
# Calculates ELO

preds = []
w_elo = []
l_elo = []

# Loop over all rows of the games dataframe
for row in df.itertuples():
    
    # Get key data from current row
    w = row.WTeamID
    l = row.LTeamID
    margin = row.margin
    wloc = row.WLoc
    
    # Does either team get a home-court advantage?
    w_ad, l_ad = 0., 0.
    if wloc == "H":
        w_ad += HOME_ADVANTAGE
    elif wloc == "A":
        l_ad += HOME_ADVANTAGE
    
    # Get elo updates as a result of the game
    pred, update = elo_update(elo_dict[w] + w_ad,
                              elo_dict[l] + l_ad, 
                              margin)
    elo_dict[w] += update
    elo_dict[l] -= update
    
    # Save prediction and new Elos for each round
    preds.append(pred)
    w_elo.append(elo_dict[w])
    l_elo.append(elo_dict[l])
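
Note that the loop above appends each rating after the update has been applied, so `w_elo` and `l_elo` already reflect the result of that row's game. If the pre-game rating is wanted instead, one option is to record the values before updating. The sketch below is only an illustration on toy data: `games`, `ratings`, and `STEP` are made-up names, and a fixed step stands in for the margin-scaled update from `elo_update`.

```python
# Toy games as (winner_id, loser_id); the IDs are hypothetical
games = [(1, 2), (1, 3), (2, 3)]
ratings = {1: 1500., 2: 1500., 3: 1500.}
STEP = 10.  # simplified stand-in for the margin-scaled update

pre_w, pre_l, post_w, post_l = [], [], [], []
for w, l in games:
    # Ratings as they stood before the result was known
    pre_w.append(ratings[w])
    pre_l.append(ratings[l])

    ratings[w] += STEP
    ratings[l] -= STEP

    # Ratings after the update (what the loop above stores)
    post_w.append(ratings[w])
    post_l.append(ratings[l])

print(pre_w, post_w)
```

Which of the two is appropriate depends on how the column is used later. The post-game value is what gets carried forward as a team's latest rating, while the pre-game value describes what was known going into the game.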

In [91]:
df['w_elo'] = w_elo
df['l_elo'] = l_elo

In [92]:
df.tail(3)

Unnamed: 0,index,Season,DayNum,WTeamID,WScore,LTeamID,LScore,WLoc,NumOT,margin,reg_season,w_elo,l_elo
158203,156086,2018,132,1209,74,1426,61,N,0,13,1,1595.494417,1568.435003
158204,156087,2018,132,1246,77,1397,72,N,0,5,1,2003.936924,1865.591008
158205,156088,2018,132,1335,68,1217,65,N,0,3,1,1584.0344,1588.330915


In [93]:
df.set_index('index',inplace=True)
df.tail(3)

index,Season,DayNum,WTeamID,WScore,LTeamID,LScore,WLoc,NumOT,margin,reg_season,w_elo,l_elo
156086,2018,132,1209,74,1426,61,N,0,13,1,1595.494417,1568.435003
156087,2018,132,1246,77,1397,72,N,0,5,1,2003.936924,1865.591008
156088,2018,132,1335,68,1217,65,N,0,3,1,1584.0344,1588.330915


#### Save Point 1A:

In [94]:
df.to_csv('NCAA_Men_Data_1A.csv',index=True)

In [95]:
df = pd.read_csv('NCAA_Men_Data_1A.csv',index_col=0)

In [96]:
df.tail(3)

index,Season,DayNum,WTeamID,WScore,LTeamID,LScore,WLoc,NumOT,margin,reg_season,w_elo,l_elo
156086,2018,132,1209,74,1426,61,N,0,13,1,1595.494417,1568.435003
156087,2018,132,1246,77,1397,72,N,0,5,1,2003.936924,1865.591008
156088,2018,132,1335,68,1217,65,N,0,3,1,1584.0344,1588.330915


In [97]:
df.loc[:,('w_elo','l_elo')].describe()

Unnamed: 0,w_elo,l_elo
count,158206.0,158206.0
mean,1579.104408,1459.187022
std,207.351895,200.308417
min,839.982697,828.832534
25%,1431.720974,1315.760276
50%,1567.685623,1446.402223
75%,1724.113065,1588.76333
max,2216.492272,2200.042587


In [98]:
# Implemented by Liam Kirwin

def final_elo_per_season(df, team_id):
    d = df.copy()
    d = d.loc[(d.WTeamID == team_id) | (d.LTeamID == team_id), :]
    d.sort_values(['Season', 'DayNum'], inplace=True)
    d.drop_duplicates(['Season'], keep='last', inplace=True)
    w_mask = d.WTeamID == team_id
    l_mask = d.LTeamID == team_id
    d['season_elo'] = None
    d.loc[w_mask, 'season_elo'] = d.loc[w_mask, 'w_elo']
    d.loc[l_mask, 'season_elo'] = d.loc[l_mask, 'l_elo']
    out = pd.DataFrame({
        'team_id': team_id,
        'season': d.Season,
        'season_elo': d.season_elo
    })
    return(out)

In [99]:
# Also Liam Kirwin's
# (loop variable renamed so it does not shadow the built-in id)

df_list = [final_elo_per_season(df[df.reg_season == 1], team_id) for team_id in team_ids]
season_elos = pd.concat(df_list)
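
To make the logic of `final_elo_per_season` easier to follow, here is a simplified restatement on toy data. It is a sketch rather than the kernel's code: `toy` and `last_elo_per_season` are made-up names, and `np.where` replaces the two masked assignments.

```python
import numpy as np
import pandas as pd

# Made-up games: team 1 plays twice in 2017 and once in 2018
toy = pd.DataFrame({
    "Season":  [2017, 2017, 2018],
    "DayNum":  [10, 50, 12],
    "WTeamID": [1, 2, 1],
    "LTeamID": [2, 1, 3],
    "w_elo":   [1510., 1505., 1512.],
    "l_elo":   [1490., 1495., 1488.],
})

def last_elo_per_season(games, team_id):
    # Keep only this team's games, in chronological order
    d = games[(games.WTeamID == team_id) | (games.LTeamID == team_id)]
    d = d.sort_values(["Season", "DayNum"])
    # The last row per season holds the end-of-season rating
    d = d.drop_duplicates("Season", keep="last")
    # Read the rating from the winner or loser column as appropriate
    elo = np.where(d.WTeamID == team_id, d.w_elo, d.l_elo)
    return pd.DataFrame({
        "team_id": team_id,
        "season": d.Season.values,
        "season_elo": elo,
    })

result = last_elo_per_season(toy, 1)
print(result)
```

Team 1 lost its last 2017 game, so its 2017 rating is read from `l_elo`; it won its only 2018 game, so that rating comes from `w_elo`.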

In [100]:
season_elos.sample(10)

index,season,season_elo,team_id
3613,1985,1485.24,1319
87737,2005,1468.3,1233
55776,1998,1202.3,1287
43804,1995,1482.44,1226
102766,2008,1753.87,1448
87885,2005,1737.81,1326
69285,2001,1343.92,1225
31654,1992,1585.44,1266
92240,2006,1242.7,1421
69230,2001,1137.92,1170


In [101]:

df_list_t = [final_elo_per_season(df[df.reg_season == 0], team_id) for team_id in team_ids]
tourn_elos = pd.concat(df_list_t)

In [102]:
tourn_elos.sample(10)

index,season,season_elo,team_id
1600,2010,1884.6,1266
1559,2009,1866.74,1449
1550,2009,1744.93,1130
438,1991,2067.03,1424
1570,2009,2055.07,1272
1774,2012,1720.48,1325
420,1991,1641.13,1336
1039,2001,1636.61,1218
233,1988,1746.55,1210
1646,2010,1981.31,1452


In [103]:
season_elos.to_csv('season_elos.csv',index=True)
tourn_elos.to_csv('tournament_elos.csv',index=True)

### Seeds
- Need the seed information to add to the table
- Tournament seeds are in a different table
- There is no seed information for the regular season

Strategy:
- Split out (and encode) the division letter
- Split out the seed number

In [104]:
seeds = pd.read_csv('NCAATourneySeeds.csv')

In [105]:
seeds.head(2)

Unnamed: 0,Season,Seed,TeamID
0,1985,W01,1207
1,1985,W02,1210


In [106]:
import re
q1 = re.compile('[a-zA-Z]')
q2 = re.compile('[0-9][0-9]*')

In [107]:
seeds['Div'] = seeds.Seed.apply(lambda x: q1.search(x).group())
seeds['Num'] = seeds.Seed.apply(lambda x: q2.search(x).group())

In [108]:
adict = {'W': 1, 'X': 2, 'Y': 3, 'Z': 4}
seeds.Div = seeds.Div.map(adict)
seeds.head(2)

Unnamed: 0,Season,Seed,TeamID,Div,Num
0,1985,W01,1207,1,1
1,1985,W02,1210,1,2
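
The two regular expressions can be tried on a few seed strings in isolation. This is a small sketch: `split_seed` is a made-up helper, the sample strings are illustrative, and the trailing letter in the last one is assumed to be a play-in suffix. Unlike the notebook, which keeps `Num` as text and converts it later, the helper casts to `int` straight away.

```python
import re

q1 = re.compile('[a-zA-Z]')
q2 = re.compile('[0-9][0-9]*')

def split_seed(seed):
    # The first letter is the division; the first run of digits is the seed number
    return q1.search(seed).group(), int(q2.search(seed).group())

parsed = [split_seed(s) for s in ["W01", "X16", "Y11a"]]
print(parsed)
```

Because `search` stops at the first match, any trailing letter after the digits is ignored by both patterns.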


In [109]:
df.columns

Index(['Season', 'DayNum', 'WTeamID', 'WScore', 'LTeamID', 'LScore', 'WLoc',
       'NumOT', 'margin', 'reg_season', 'w_elo', 'l_elo'],
      dtype='object')

In [110]:
df2 = df.copy()

In [111]:
df = df2.copy()

### Important Notes:
- Competition requires calc of the probability of the "1st team" winning
    - The 1st team is the team with the lower ID, given two competitors
    - Data is not structured that way

- In order to train a model...
    - Need to erase the WTeam/LTeam distinction
    - Instead just use Team1 and Team2, with a target label "Team1_Wins"

- If the columns were simply renamed, every WTeam would become Team1 and Team1_Wins would always be 1
    - Will screw up model training!
    - Alternative: randomize which team is Team1
    - Make sure the label matches

The requirement that Team1_ID be the lower of the two is not important at this stage. IDs are just bookkeeping, not features, so the model will be blind to them. As long as the model is trained with labels set up correctly so that we are predicting the probability of Team1 victory, this will work.
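
The randomization idea can be sketched on a handful of rows before applying it to the full table. This is an illustration only: `toy` and `swap` are made-up names, `np.where` is used in place of masked `.loc` assignments, and the fixed seed is there purely so the sketch is repeatable.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "WTeamID": [1104, 1112, 1228, 1246],
    "LTeamID": [1272, 1223, 1328, 1397],
})

rng = np.random.RandomState(0)           # fixed seed, for repeatability only
swap = rng.randint(0, 2, size=len(toy))  # 1 means the loser becomes Team1

toy["Team1_ID"] = np.where(swap == 0, toy.WTeamID, toy.LTeamID)
toy["Team2_ID"] = np.where(swap == 0, toy.LTeamID, toy.WTeamID)

# The label follows the swap: Team1 wins exactly when no swap happened
toy["Team1_Wins"] = (swap == 0).astype(int)

print(toy)
```

Whatever values `swap` takes, each row still contains the same two teams, and `Team1_Wins` is 1 exactly where `Team1_ID` is the original winner.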



In [112]:
# The following are a bit cumbersome
# but helped me keep my wits as I made the changes.

seeds['Team1_ID'] = seeds.TeamID
seeds['Team2_ID'] = seeds.TeamID

seeds['Team1_Seed'] = seeds.Num
seeds['Team2_Seed'] = seeds.Num

seeds['Team1_Div'] = seeds.Div
seeds['Team2_Div'] = seeds.Div


In [113]:
# Extract team1 (t1) and team2 (t2) components

seedst1 = seeds[['Season', 'Team1_ID', 'Team1_Seed']]
seedst2 = seeds[['Season', 'Team2_ID', 'Team2_Seed']]

In [114]:
# Creation of empty columns

df['Team1_ID'] = 0
df['Team2_ID'] = 0

df['Team1_Score'] = 0
df['Team2_Score'] = 0

# ELO values are floats, so create these columns as float
df['Team1_Elo'] = 0.
df['Team2_Elo'] = 0.

df['Team1_Wins'] = 0

In [116]:
# The randomization happens here

shakeup = np.random.randint(0,2, size=(len(df)))

In [117]:
# Where shakeup has 0, we will use the Winning Team as Team 1
# That means the Losing Team is team2

df.loc[shakeup == 0,'Team1_ID'] = df.loc[shakeup == 0,'WTeamID']
df.loc[shakeup == 0,'Team2_ID'] = df.loc[shakeup == 0,'LTeamID']

df.loc[shakeup == 0,'Team1_Elo'] = df.loc[shakeup == 0,'w_elo']
df.loc[shakeup == 0,'Team2_Elo'] = df.loc[shakeup == 0,'l_elo']

df.loc[shakeup == 0,'Team1_Score'] = df.loc[shakeup == 0,'WScore']
df.loc[shakeup == 0,'Team2_Score'] = df.loc[shakeup == 0,'LScore']

df.loc[shakeup == 0,'Team1_Wins'] = 1

In [122]:
# Where shakeup has 1, we will use the Losing Team as Team 1
# That means the Winning Team is team2

df.loc[shakeup == 1,'Team1_ID'] = df.loc[shakeup == 1,'LTeamID']
df.loc[shakeup == 1,'Team2_ID'] = df.loc[shakeup == 1,'WTeamID']

df.loc[shakeup == 1,'Team1_Elo'] = df.loc[shakeup == 1,'l_elo']
df.loc[shakeup == 1,'Team2_Elo'] = df.loc[shakeup == 1,'w_elo']

df.loc[shakeup == 1,'Team1_Score'] = df.loc[shakeup == 1,'LScore']
df.loc[shakeup == 1,'Team2_Score'] = df.loc[shakeup == 1,'WScore']

# Technically, this is already 0
df.loc[shakeup == 1,'Team1_Wins'] = 0

# Need a conditional to prevent me from accidentally adding multiple joins
if 'Team1_Seed' not in df.columns:
    df = df.merge(seedst1, how='left', on=['Season', 'Team1_ID'])
    df = df.merge(seedst2, how='left', on=['Season', 'Team2_ID'])

    
# The seed columns come through as object type (text taken from the seed strings)
# The text values are actually numbers anyway, so force them to behave that way
df['Team1_Seed'] = pd.to_numeric(df.Team1_Seed,errors='coerce')
df['Team2_Seed'] = pd.to_numeric(df.Team2_Seed,errors='coerce')

# has to come before the seed = 17 below
df.fillna(0,inplace=True)

# Executive decision: Any team that doesn't have a seed gets 17
df.loc[df.Team1_Seed == 0, 'Team1_Seed'] = 17
df.loc[df.Team2_Seed == 0, 'Team2_Seed'] = 17

# Calculate the difference between seeds
df['Seed_Gap'] = df['Team1_Seed'] - df['Team2_Seed']
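
The join-then-default pattern above can be checked on a tiny example. This is a sketch under stated assumptions: `games`, `seed_table`, and `UNSEEDED` are made-up names, and the missing seeds are filled per column rather than with a frame-wide `fillna(0)` followed by a replacement.

```python
import pandas as pd

games = pd.DataFrame({
    "Season":   [2017, 2017],
    "Team1_ID": [1, 3],
    "Team2_ID": [2, 1],
})
# Team 3 has no tournament seed in this toy season
seed_table = pd.DataFrame({"Season": [2017, 2017], "TeamID": [1, 2], "Num": [1, 16]})

UNSEEDED = 17  # same placeholder value the notebook uses

for side in ("Team1", "Team2"):
    # Rename the lookup so its columns match the side being joined
    lookup = seed_table.rename(columns={"TeamID": f"{side}_ID", "Num": f"{side}_Seed"})
    games = games.merge(lookup, how="left", on=["Season", f"{side}_ID"])
    games[f"{side}_Seed"] = games[f"{side}_Seed"].fillna(UNSEEDED)

games["Seed_Gap"] = games["Team1_Seed"] - games["Team2_Seed"]
print(games)
```

A negative `Seed_Gap` means Team1 holds the better (lower) seed. An unseeded team facing a 1 seed yields the largest possible gap of 16.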

In [123]:
df.loc[((shakeup == 0) & (df.reg_season == 0))].head(3)

Unnamed: 0,Season,DayNum,WTeamID,WScore,LTeamID,LScore,WLoc,NumOT,margin,reg_season,...,Team1_ID,Team2_ID,Team1_Score,Team2_Score,Team1_Elo,Team2_Elo,Team1_Wins,Team1_Seed,Team2_Seed,Seed_Gap
3738,1985,136,1120,59,1345,58,N,0,1,0,...,1120,1345,59,58,1569.764771,1571.343539,1,11.0,6.0,5.0
3739,1985,136,1207,68,1250,43,N,0,25,0,...,1207,1250,68,43,1735.690585,1429.401839,1,1.0,16.0,-15.0
3740,1985,136,1229,58,1425,55,N,0,3,0,...,1229,1425,58,55,1583.001327,1567.594442,1,9.0,8.0,1.0


In [124]:
df.loc[shakeup == 1].head(3)

Unnamed: 0,Season,DayNum,WTeamID,WScore,LTeamID,LScore,WLoc,NumOT,margin,reg_season,...,Team1_ID,Team2_ID,Team1_Score,Team2_Score,Team1_Elo,Team2_Elo,Team1_Wins,Team1_Seed,Team2_Seed,Seed_Gap
2,1985,25,1112,63,1223,56,H,0,7,1,...,1223,1112,56,63,1494.392503,1505.607497,0,17.0,10.0,7.0
6,1985,25,1228,64,1226,44,N,0,20,1,...,1226,1228,44,64,1484.491623,1530.155851,0,17.0,3.0,14.0
7,1985,25,1242,58,1268,56,N,0,2,1,...,1268,1242,56,58,1495.168136,1504.831864,0,5.0,3.0,2.0


In [125]:
df3 = df.copy()

In [141]:
df = df3.copy()

In [142]:
# Check to see if I left anything null
df.isnull().sum()

Season         0
DayNum         0
WTeamID        0
WScore         0
LTeamID        0
LScore         0
WLoc           0
NumOT          0
margin         0
reg_season     0
w_elo          0
l_elo          0
Team1_ID       0
Team2_ID       0
Team1_Score    0
Team2_Score    0
Team1_Elo      0
Team2_Elo      0
Team1_Wins     0
Team1_Seed     0
Team2_Seed     0
Seed_Gap       0
dtype: int64

In [143]:
df.columns

Index(['Season', 'DayNum', 'WTeamID', 'WScore', 'LTeamID', 'LScore', 'WLoc',
       'NumOT', 'margin', 'reg_season', 'w_elo', 'l_elo', 'Team1_ID',
       'Team2_ID', 'Team1_Score', 'Team2_Score', 'Team1_Elo', 'Team2_Elo',
       'Team1_Wins', 'Team1_Seed', 'Team2_Seed', 'Seed_Gap'],
      dtype='object')

### More Notes:
- Keep in mind that the only inputs we can have into the model are:
    - TeamID
    - Seed
    - Anything we can calculate using pre-tournament knowledge (ELO, Seed_Gap)
    
- Start dropping defunct features and things we cannot carry forward
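
As a rough picture of what "pre-tournament knowledge" means at prediction time, the sketch below builds one feature row for a hypothetical matchup. Everything in it is illustrative: `elo_lookup` and `seed_lookup` are made-up dictionaries with invented values, and `matchup_features` is not part of the notebook.

```python
# Hypothetical lookups for one season; the values are made up
elo_lookup = {1104: 1650.0, 1112: 1820.0}
seed_lookup = {1112: 4}   # 1104 is treated as unseeded in this toy example
UNSEEDED = 17

def matchup_features(season, team_a, team_b):
    # Team1 is the lower ID, following the competition convention noted earlier
    t1, t2 = sorted((team_a, team_b))
    s1 = seed_lookup.get(t1, UNSEEDED)
    s2 = seed_lookup.get(t2, UNSEEDED)
    return {
        "Season": season,
        "Team1_ID": t1,
        "Team2_ID": t2,
        "Team1_Elo": elo_lookup[t1],
        "Team2_Elo": elo_lookup[t2],
        "Team1_Seed": s1,
        "Team2_Seed": s2,
        "Seed_Gap": s1 - s2,
    }

row = matchup_features(2018, 1112, 1104)
print(row)
```

The order in which the two teams are passed does not matter, since the function sorts the IDs before looking anything up.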

In [144]:

columns_to_drop = ['WTeamID', 'WScore', 'LTeamID', 'LScore', 'WLoc',
       'NumOT', 'margin', 'w_elo', 'l_elo', 
        'Team1_Score', 'Team2_Score']

df.drop(columns_to_drop,axis=1,inplace=True)

In [145]:
df.columns

Index(['Season', 'DayNum', 'reg_season', 'Team1_ID', 'Team2_ID', 'Team1_Elo',
       'Team2_Elo', 'Team1_Wins', 'Team1_Seed', 'Team2_Seed', 'Seed_Gap'],
      dtype='object')

In [146]:
# I will change most of these to categorical later
# but as that is model specific, I did not do it here

df.dtypes

Season          int64
DayNum          int64
reg_season      int64
Team1_ID        int64
Team2_ID        int64
Team1_Elo     float64
Team2_Elo     float64
Team1_Wins      int64
Team1_Seed    float64
Team2_Seed    float64
Seed_Gap      float64
dtype: object
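
For reference, a later conversion to categorical types could look something like the following. This is only a sketch: `toy` is made-up data, and the choice of columns in `categorical_cols` is a guess, since which columns should be categorical depends on the model.

```python
import pandas as pd

toy = pd.DataFrame({
    "Season":    [2017, 2018],
    "Team1_ID":  [1104, 1112],
    "Team2_ID":  [1272, 1223],
    "Team1_Elo": [1510.0, 1495.0],
})

# Assumed column choice, for illustration only
categorical_cols = ["Season", "Team1_ID", "Team2_ID"]

# astype with a dict converts only the listed columns
converted = toy.astype({col: "category" for col in categorical_cols})
print(converted.dtypes)
```

Columns not named in the dictionary, such as `Team1_Elo`, keep their original dtype.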

In [147]:
df.head()

Unnamed: 0,Season,DayNum,reg_season,Team1_ID,Team2_ID,Team1_Elo,Team2_Elo,Team1_Wins,Team1_Seed,Team2_Seed,Seed_Gap
0,1985,20,1,1228,1328,1514.647474,1485.352526,1,3.0,1.0,2.0
1,1985,25,1,1106,1354,1505.607497,1494.392503,1,17.0,17.0,0.0
2,1985,25,1,1223,1112,1494.392503,1505.607497,0,17.0,10.0,7.0
3,1985,25,1,1165,1432,1509.370698,1490.629302,1,17.0,17.0,0.0
4,1985,25,1,1192,1447,1507.756076,1492.243924,1,16.0,17.0,-1.0


### Save Point 1B
Use this data for the modelling phase

In [148]:
df.to_csv('NCAA_Men_Data_1B.csv',index=True)