# Team embeddings

The goal of this notebook is to build an embedding model, similar to [word embeddings](https://www.tensorflow.org/tutorials/text/word_embeddings), that creates a vector representation of each team's performance. The idea is then to take these representations and use [cosine similarity](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html) to find teams that perform similarly. A KNN model can then be trained using the similarity to weight predictions (or the embeddings could be used in other models in other ways).
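
As a rough sketch of the similarity step, the snippet below computes pairwise cosine similarity with plain NumPy and looks up the nearest teams. The matrix `team_vectors` is a random stand-in for the learned embeddings, and the helper names `cosine_similarity_matrix` and `most_similar` are hypothetical, not part of this notebook.

```python
import numpy as np

# Hypothetical stand-in for the learned embedding matrix: one row per encoded team.
# Random values are used here purely for illustration.
rng = np.random.default_rng(0)
n_teams, embedding_dim = 30, 30
team_vectors = rng.normal(size=(n_teams, embedding_dim))


def cosine_similarity_matrix(vectors):
    """Return the (n, n) matrix of pairwise cosine similarities between rows."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / norms
    return unit @ unit.T


def most_similar(sim, team_id, k=3):
    """Return the k team ids most similar to team_id, excluding the team itself."""
    order = np.argsort(-sim[team_id])
    return [int(i) for i in order if i != team_id][:k]


sim = cosine_similarity_matrix(team_vectors)
neighbours = most_similar(sim, team_id=6, k=3)
```

The resulting neighbour ids could be mapped back to team codes with the label encoder's `inverse_transform`, assuming the embedding rows are indexed by encoded team.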

In [246]:
import pandas as pd
import numpy as np

import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Embedding, Dropout

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

## Data loading

In [247]:
retro_df = pd.read_csv('../data/retrosheet.csv')

In [248]:
retro_df.head()

Unnamed: 0,date,away_team,home_team,away_score,home_score,away_off_22,away_off_23,away_off_24,away_off_25,away_off_26,...,home_pitch_69,home_pitch_70,home_pitch_71,home_def_72,home_def_73,home_def_74,home_def_75,home_def_76,home_def_77,home_win
0,2000-03-29,CHN,NYN,5,3,33,12,1,0,2,...,5,1,0,27,12,0,0,4,0,False
1,2000-03-30,NYN,CHN,5,1,37,6,2,0,1,...,5,0,0,33,14,0,0,0,0,False
2,2000-04-03,COL,ATL,0,2,31,6,2,0,0,...,0,0,0,27,12,0,0,1,0,True
3,2000-04-03,MIL,CIN,3,3,22,7,1,0,0,...,2,0,0,16,8,2,0,0,0,False
4,2000-04-03,SFN,MIA,4,6,35,10,2,2,1,...,4,0,0,27,15,0,0,2,0,True


## Data prep

The idea is to predict the (encoded) team from the opposing team, the stats for both teams, and whether the team won. Each game is used twice, once with "the team" being the home team and once with it being the away team. This means the model could technically memorize certain stats and learn which teams were involved, but that doesn't seem very likely.

In [249]:
le = LabelEncoder()
retro_df['away_team'] = le.fit_transform(retro_df['away_team'])
retro_df['home_team'] = le.transform(retro_df['home_team'])

In [250]:
retro_df['home_win'] = retro_df['home_win'].astype(int)

In [251]:
retro_df = retro_df.drop('date', axis='columns')

In [252]:
retro_df.head()

Unnamed: 0,away_team,home_team,away_score,home_score,away_off_22,away_off_23,away_off_24,away_off_25,away_off_26,away_off_27,...,home_pitch_69,home_pitch_70,home_pitch_71,home_def_72,home_def_73,home_def_74,home_def_75,home_def_76,home_def_77,home_win
0,6,18,5,3,33,12,1,0,2,5,...,5,1,0,27,12,0,0,4,0,0
1,18,6,5,1,37,6,2,0,1,5,...,5,0,0,33,14,0,0,0,0,0
2,9,2,0,2,31,6,2,0,0,0,...,0,0,0,27,12,0,0,1,0,1
3,15,7,3,3,22,7,1,0,0,2,...,2,0,0,16,8,2,0,0,0,0
4,24,14,4,6,35,10,2,2,1,4,...,4,0,0,27,15,0,0,2,0,1


Separate the home and away columns. This will be used for reorganizing (described below). Since we want to keep the team and the scores in the same spot in both dataframes, we'll pull those out.

In [253]:
away_cols = [c for c in retro_df.columns if c.startswith('away_')]
away_cols.remove('away_score')
away_cols.remove('away_team')

home_cols = [c for c in retro_df.columns if c.startswith('home_')]
home_cols.remove('home_score')
home_cols.remove('home_team')
home_cols.remove('home_win')

In [254]:
y1 = retro_df['away_team']
y2 = retro_df['home_team']

X1 = retro_df.loc[:, retro_df.columns != 'away_team']
X2 = retro_df.loc[:, retro_df.columns != 'home_team']

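Rearrange the columns so that the first is the opposing team, followed by the team's score, then the opposing score, then the win flag. After these come the opposing team's stats, then the team's own. Note that `X1` is the away team's view and `X2` the home team's, so `home_win` already means "the team won" in `X2`. The cell below negates it there anyway, which leaves the flag consistent across both frames but encoding "the team lost" rather than "the team won".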

In [255]:
X1 = X1[['home_team', 'away_score', 'home_score', 'home_win'] + home_cols + away_cols]
X2 = X2[['away_team', 'home_score', 'away_score', 'home_win'] + away_cols + home_cols]

In [256]:
X2['home_win'] = ~X2['home_win'].astype(bool)
X2['home_win'] = X2['home_win'].astype(int)
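
As a compact illustration of the two-perspective layout described above, the sketch below builds both views on a made-up two-game table. The column names `team`, `opp_team`, `team_score`, `opp_score` and `team_win` are hypothetical, and here the flag is flipped for the away view so that it reads as "the team won". This is one possible arrangement, not necessarily what the cells above produce.

```python
import pandas as pd

# Toy games table; column names mirror the notebook, values are made up.
games = pd.DataFrame({
    'away_team': [6, 18],
    'home_team': [18, 6],
    'away_score': [5, 5],
    'home_score': [3, 1],
    'home_win': [0, 0],
})

# Away team's view: the team is the away side, so "team won" is the opposite of home_win.
away_view = pd.DataFrame({
    'team': games['away_team'],
    'opp_team': games['home_team'],
    'team_score': games['away_score'],
    'opp_score': games['home_score'],
    'team_win': 1 - games['home_win'],
})

# Home team's view: home_win already means "team won".
home_view = pd.DataFrame({
    'team': games['home_team'],
    'opp_team': games['away_team'],
    'team_score': games['home_score'],
    'opp_score': games['away_score'],
    'team_win': games['home_win'],
})

# Two rows per game, one from each side.
stacked = pd.concat([away_view, home_view], ignore_index=True)
```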

In [257]:
X1.head()

Unnamed: 0,home_team,away_score,home_score,home_win,home_off_50,home_off_51,home_off_52,home_off_53,home_off_54,home_off_55,...,away_pitch_40,away_pitch_41,away_pitch_42,away_pitch_43,away_def_44,away_def_45,away_def_46,away_def_47,away_def_48,away_def_49
0,18,5,3,0,33,7,1,0,1,3,...,3,3,0,0,27,10,2,0,1,0
1,6,5,1,0,36,5,0,0,0,0,...,0,0,0,0,33,14,2,0,2,0
2,2,0,2,1,30,7,0,0,2,2,...,2,2,1,0,24,10,0,0,1,0
3,7,3,3,0,19,5,1,0,1,3,...,3,3,0,0,15,5,0,0,0,0
4,14,4,6,1,36,12,3,0,0,5,...,4,4,0,0,24,7,2,0,1,0


In [258]:
X2.head()

Unnamed: 0,away_team,home_score,away_score,home_win,away_off_22,away_off_23,away_off_24,away_off_25,away_off_26,away_off_27,...,home_pitch_68,home_pitch_69,home_pitch_70,home_pitch_71,home_def_72,home_def_73,home_def_74,home_def_75,home_def_76,home_def_77
0,6,3,5,1,33,12,1,0,2,5,...,5,5,1,0,27,12,0,0,4,0
1,18,1,5,1,37,6,2,0,1,5,...,5,5,0,0,33,14,0,0,0,0
2,9,2,0,0,31,6,2,0,0,0,...,0,0,0,0,27,12,0,0,1,0
3,15,3,3,1,22,7,1,0,0,2,...,2,2,0,0,16,8,2,0,0,0
4,24,6,4,0,35,10,2,2,1,4,...,4,4,0,0,27,15,0,0,2,0


Renumber the columns to just be numerically named for easy stacking. Also, we're embedding this into a lower dimension, so the column names are useless.

In [259]:
X1.columns = ['team'] + list(range(len(X1.columns)-1))
X2.columns = ['team'] + list(range(len(X2.columns)-1))

In [260]:
y = pd.concat([y1, y2])
X = pd.concat([X1, X2])
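
To see why the renaming matters for stacking, here is a small sketch on made-up frames: `pd.concat` aligns columns by name, so frames whose columns hold the same quantities under different names do not line up until they share names. The frames `a` and `b` are hypothetical stand-ins for `X1` and `X2`.

```python
import pandas as pd

# Made-up frames whose columns hold the same quantities under different names.
a = pd.DataFrame({'home_team': [18], 'away_score': [5], 'home_score': [3]})
b = pd.DataFrame({'away_team': [6], 'home_score': [3], 'away_score': [5]})

# Without renaming, concat aligns by name: the team columns do not line up
# and the score columns are matched by name rather than by position.
misaligned = pd.concat([a, b], ignore_index=True)

# With shared positional names, rows stack column by column.
a.columns = ['team'] + list(range(len(a.columns) - 1))
b.columns = ['team'] + list(range(len(b.columns) - 1))
stacked = pd.concat([a, b], ignore_index=True)
```

In the unrenamed case the result has four columns with missing values, while the renamed frames stack into three fully populated columns.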

Note that it's perfectly valid to use "future" games here, since all we're doing is looking for patterns in how certain "types" of teams play one another. While it's almost certainly the case that how teams perform changes over time, we'll try this setup initially and see whether it looks promising before refining it.

In [261]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

## Model

This will be a super simple embedding model using the ideas in [this TensorFlow tutorial](https://www.tensorflow.org/tutorials/text/word_embeddings#create_a_classification_model).

In [262]:
input_len = X_train.shape[1]
embedding_dim = 30

In [263]:
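n_teams = len(le.classes_)  # one output logit per encoded team

# Embedding(100, ...) assumes every input value is an integer in [0, 100).
model = Sequential([Embedding(100, embedding_dim, input_length=input_len, name='embedding'),
                    tf.keras.layers.Flatten(),  # collapse (input_len, embedding_dim) to one vector per row
                    Dense(128, activation='relu'),
                    Dense(128, activation='relu'),
                    Dropout(0.1),
                    Dense(n_teams)])  # logits over teams, as the sparse categorical loss expects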

In [264]:
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])

In [265]:
train_ds = tf.data.Dataset.from_tensor_slices((X_train.values, y_train.values))
train_ds = train_ds.shuffle(buffer_size=10**5).batch(32)

test_ds = tf.data.Dataset.from_tensor_slices((X_test.values, y_test.values))
test_ds = test_ds.shuffle(buffer_size=10**5).batch(32)

In [266]:
model.fit(train_ds, epochs=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x14ad4c4f0>

## Covariates data

This isn't giving much. What about working with the covariate data instead?

In [280]:
mlb_games_df = pd.read_csv('../data/mlb_games_df.csv')

In [281]:
mlb_games_df.head()

Unnamed: 0,date,Y,M,D,home_team,away_team,home_win,home_pitcher,away_pitcher,home_elo,...,avg_diff,obp_diff,slg_diff,avg_pct_diff,obp_pct_diff,slg_pct_diff,home_rest,away_rest,away_team_season_game_num,home_team_season_game_num
0,2001-04-01,2001,4.0,1.0,TOR,TEX,1.0,loaizes01,helliri01,1499.563,...,-0.00806,-0.010103,0.023271,-2.947374,-2.977845,4.989568,5.0,5.0,0,0
1,2001-04-02,2001,4.0,2.0,SEA,OAK,1.0,garcifr03,hudsoti01,1519.464,...,-0.000864,0.00119,-0.016229,-0.323318,0.331871,-3.70521,5.0,5.0,0,0
2,2001-04-02,2001,4.0,2.0,NYA,KCA,1.0,clemero02,suppaje01,1529.511,...,-0.010188,0.006929,0.024787,-3.703559,1.970596,5.554343,5.0,5.0,0,0
3,2001-04-02,2001,4.0,2.0,CIN,ATL,0.0,harnipe01,burkejo03,1527.274,...,0.003972,-0.001729,0.020216,1.459194,-0.50696,4.555242,5.0,5.0,0,0
4,2001-04-02,2001,4.0,2.0,CHN,WAS,0.0,liebejo01,vazquja01,1462.51,...,-0.010158,0.009335,-0.018992,-3.99634,2.80356,-4.646432,5.0,5.0,0,0


In [282]:
cols_to_drop = ['date', 'Y', 'M', 'D', 'home_pitcher', 'away_pitcher', 'home_rest', 
                'away_rest', 'away_team_season_game_num', 'home_team_season_game_num']
mlb_games_df = mlb_games_df.drop(cols_to_drop, axis='columns')

In [283]:
mlb_games_df['home_team'] = le.transform(mlb_games_df['home_team'])
mlb_games_df['away_team'] = le.transform(mlb_games_df['away_team'])

In [284]:
mlb_games_df.head()

Unnamed: 0,home_team,away_team,home_win,home_elo,away_elo,home_avg,away_avg,home_obp,away_obp,home_slg,...,home_iso,away_iso,elo_diff,elo_pct_diff,avg_diff,obp_diff,slg_diff,avg_pct_diff,obp_pct_diff,slg_pct_diff
0,28,27,1.0,1499.563,1479.163,0.273459,0.281519,0.339283,0.349386,0.466387,...,0.192927,0.161597,20.4,1.360396,-0.00806,-0.010103,0.023271,-2.947374,-2.977845,4.989568
1,23,19,1.0,1519.464,1534.696,0.26728,0.268144,0.358599,0.357409,0.438008,...,0.170727,0.186092,-15.232,-1.002459,-0.000864,0.00119,-0.016229,-0.323318,0.331871,-3.70521
2,17,12,1.0,1529.511,1493.152,0.27508,0.285268,0.351633,0.344703,0.446269,...,0.171189,0.136214,36.359,2.377165,-0.010188,0.006929,0.024787,-3.703559,1.970596,5.554343
3,7,2,0.0,1527.274,1523.864,0.272199,0.268227,0.341041,0.34277,0.443798,...,0.1716,0.155356,3.41,0.223274,0.003972,-0.001729,0.020216,1.459194,-0.50696,4.555242
4,6,29,0.0,1462.51,1461.765,0.254189,0.264347,0.332965,0.32363,0.408734,...,0.154545,0.163379,0.745,0.05094,-0.010158,0.009335,-0.018992,-3.99634,2.80356,-4.646432


In [285]:
home_cols = [c for c in mlb_games_df.columns if c.startswith('home_')]
away_cols = [c for c in mlb_games_df.columns if c.startswith('away_')]
remaining_cols = list(set(mlb_games_df.columns) - set(home_cols).union(set(away_cols)))

home_cols.remove('home_team')
home_cols.remove('home_win')
away_cols.remove('away_team')

home_df = mlb_games_df[['home_team', 'away_team', 'home_win'] + home_cols + away_cols + remaining_cols]
away_df = mlb_games_df[['away_team', 'home_team', 'home_win'] + away_cols + home_cols + remaining_cols]

In [286]:
home_df.columns = ['team', 'opp_team', 'team_win'] + list(range(len(home_df.columns)-3))
away_df.columns = ['team', 'opp_team', 'team_win'] + list(range(len(away_df.columns)-3))

In [287]:
away_df['team_win'] = ~away_df['team_win'].astype(bool)
away_df['team_win'] = away_df['team_win'].astype(int)

In [288]:
home_df.head()

Unnamed: 0,team,opp_team,team_win,0,1,2,3,4,5,6,...,8,9,10,11,12,13,14,15,16,17
0,28,27,1.0,1499.563,0.273459,0.339283,0.466387,0.192927,1479.163,0.281519,...,0.443116,0.161597,1.360396,0.023271,-2.977845,20.4,-0.00806,-2.947374,4.989568,-0.010103
1,23,19,1.0,1519.464,0.26728,0.358599,0.438008,0.170727,1534.696,0.268144,...,0.454237,0.186092,-1.002459,-0.016229,0.331871,-15.232,-0.000864,-0.323318,-3.70521,0.00119
2,17,12,1.0,1529.511,0.27508,0.351633,0.446269,0.171189,1493.152,0.285268,...,0.421482,0.136214,2.377165,0.024787,1.970596,36.359,-0.010188,-3.703559,5.554343,0.006929
3,7,2,0.0,1527.274,0.272199,0.341041,0.443798,0.1716,1523.864,0.268227,...,0.423582,0.155356,0.223274,0.020216,-0.50696,3.41,0.003972,1.459194,4.555242,-0.001729
4,6,29,0.0,1462.51,0.254189,0.332965,0.408734,0.154545,1461.765,0.264347,...,0.427726,0.163379,0.05094,-0.018992,2.80356,0.745,-0.010158,-3.99634,-4.646432,0.009335


In [289]:
away_df.head()

Unnamed: 0,team,opp_team,team_win,0,1,2,3,4,5,6,...,8,9,10,11,12,13,14,15,16,17
0,27,28,0,1479.163,0.281519,0.349386,0.443116,0.161597,1499.563,0.273459,...,0.466387,0.192927,1.360396,0.023271,-2.977845,20.4,-0.00806,-2.947374,4.989568,-0.010103
1,19,23,0,1534.696,0.268144,0.357409,0.454237,0.186092,1519.464,0.26728,...,0.438008,0.170727,-1.002459,-0.016229,0.331871,-15.232,-0.000864,-0.323318,-3.70521,0.00119
2,12,17,0,1493.152,0.285268,0.344703,0.421482,0.136214,1529.511,0.27508,...,0.446269,0.171189,2.377165,0.024787,1.970596,36.359,-0.010188,-3.703559,5.554343,0.006929
3,2,7,1,1523.864,0.268227,0.34277,0.423582,0.155356,1527.274,0.272199,...,0.443798,0.1716,0.223274,0.020216,-0.50696,3.41,0.003972,1.459194,4.555242,-0.001729
4,29,6,1,1461.765,0.264347,0.32363,0.427726,0.163379,1462.51,0.254189,...,0.408734,0.154545,0.05094,-0.018992,2.80356,0.745,-0.010158,-3.99634,-4.646432,0.009335


In [290]:
mlb_data = pd.concat([home_df, away_df])

In [291]:
mlb_data.isna().sum().sum() == 0

True

In [292]:
y = mlb_data.pop('team')
X = mlb_data

In [294]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)