# Part 3: How best to represent our data?
<i style="font-size: 0.94em">This notebook is part of a series detailing the creation of models for predicting NFL game outcomes. For the table of contents listing all notebooks in the series, <a href="0.Introduction_NFL_Prediction.ipynb">click here.</a></i>

In the <a href="2.NFL_Game_Data_Extraction.ipynb">last section, we extracted statistics from NFL play-by-play data</a> representing each NFL game from 2009 to 2018. The goal is to feed these extracted statistics into various models to predict NFL games. A sports prediction can only be based on information that was available BEFORE the given match occurred. This raises a question that is simple to ask but harder to answer: what exactly do observers know about an NFL game before it starts?

On the surface, the answer is simple: we know the result and the statistics for every NFL game played before the current one. We also know which team is at home and which is visiting, a fact many observers consider valuable. However, we can't include every stat from every game ever played. The resulting model would have far too many dimensions and, given how small the NFL game data set is, would almost certainly overfit. Hence, we must reduce this vast amount of input to a simpler representation.


# Extracting Season-average and average-up-to-now stats
We'll start by turning the per-game statistics into summary statistics per season per team. We'll also create a vector of statistics representing the average of all games a team played in the current season BEFORE a given game. For example, if a game was played on December 17th, 2017 between the Dallas Cowboys and the Oakland (now Las Vegas) Raiders, we want one line of statistics representing the average of all of the Cowboys' 2017 games before that date and another for the Raiders. This is a condensed representation of everything someone would know about the current season before the game, excluding any information from the game itself (since a real prediction would have no access to such information).

The code below defines a few functions to convert the per-game statistics extracted in the last Jupyter notebook into per-season summary stats as well as the "season-up-to-now" statistics just described.

In [1]:
import pandas as pd

directory = "./Data/"
input_file_name = "per_game_stats.pkl"
all_game_stats = pd.read_pickle(directory+input_file_name)


def build_per_team_per_seasons_stats(df: pd.DataFrame):
    # Drop identifier columns so that only the statistics get averaged
    df = df.drop(["game_id", "game_date", "is_home_team"], axis=1)
    # Group by season and team, average each stat, then turn the group keys back into columns
    result = df.groupby(["season", "team"]).mean().reset_index()
    # Add season averages from 2008, which aren't included in the big CSV and were
    # taken/adapted from pro-football-reference.com
    season_data_2008 = pd.read_csv(directory+"2008_stats_right_format.csv")
    result = pd.concat([season_data_2008, result])
    return result

per_season_stats = build_per_team_per_seasons_stats(all_game_stats)
per_season_stats.tail(10)





Unnamed: 0,season,team,point_dif,off_drives,off_total_start_pos,completed_passes,net_passing_yards,passes_attempted,air_yards,yards_after_catch,...,opponent_p_attempts,pass_td_allowed,r_yards_allowed,opponent_r_attempts,rush_td_allowed,pick_6s,hit_their_qb,sacked_their_qb,sack_yards,tackles_for_loss
310,2018,NYG,-3.0,11.357143,852.214286,25.142857,258.857143,42.142857,291.928571,138.214286,...,37.642857,1.357143,138.428571,30.214286,1.142857,0.142857,4.857143,1.785714,-11.571429,3.142857
311,2018,NYJ,-4.928571,12.5,940.357143,19.714286,201.642857,37.357143,289.857143,106.5,...,39.642857,1.785714,142.214286,30.0,1.071429,0.214286,6.928571,2.5,-19.0,2.857143
312,2018,OAK,-11.142857,11.214286,864.214286,25.928571,259.428571,41.214286,257.642857,141.857143,...,32.571429,2.428571,156.214286,31.142857,1.0,0.071429,3.428571,0.928571,-5.642857,2.428571
313,2018,PHI,-0.428571,11.071429,839.428571,27.857143,281.0,42.357143,306.571429,135.428571,...,45.285714,1.5,117.5,23.071429,1.0,0.0,8.357143,2.857143,-20.142857,2.857143
314,2018,PIT,4.785714,11.5,901.571429,29.857143,335.071429,45.857143,336.214286,187.428571,...,41.642857,2.0,104.357143,24.642857,0.785714,0.214286,7.357143,3.642857,-25.428571,2.357143
315,2018,SEA,5.071429,11.071429,854.714286,19.071429,205.214286,31.142857,244.714286,94.642857,...,39.0,1.714286,123.571429,24.214286,0.785714,0.071429,6.428571,2.714286,-16.285714,2.285714
316,2018,SF,-5.071429,11.5,905.428571,22.357143,258.785714,38.642857,250.071429,155.214286,...,39.142857,2.357143,119.714286,27.142857,0.785714,0.071429,5.857143,2.571429,-16.571429,2.5
317,2018,TB,-4.071429,11.642857,913.285714,27.214286,347.428571,44.285714,438.071429,123.857143,...,37.642857,2.214286,137.928571,27.214286,1.214286,0.0,6.214286,2.785714,-16.285714,3.285714
318,2018,TEN,1.357143,10.5,811.142857,19.642857,203.785714,32.357143,224.571429,109.642857,...,39.428571,1.214286,113.785714,26.357143,0.642857,0.0,5.714286,2.785714,-17.285714,2.071429
319,2018,WAS,-3.071429,11.0,813.857143,21.714286,214.0,37.285714,275.428571,106.071429,...,40.142857,1.642857,125.142857,25.785714,0.785714,0.0,6.142857,3.0,-19.571429,1.428571


In [2]:
def game_up_to_now_stats(df: pd.DataFrame):
    result = df[["point_dif", "game_id", "season", "game_date", "team", "is_home_team"]].copy()
    result = result.rename(columns={"point_dif": "gold_label"})

    def reduce_games_up_to_now(row):
        tmp_df = df[(df["season"] == row["season"]) &
                  (df["game_date"] < row["game_date"]) &
                  (df["team"] == row["team"])]
        season_games_to_now = len(tmp_df)
        tmp_df = tmp_df.drop(["game_id", "season", "game_date", "team", "is_home_team"], axis=1)\
                       .mean(numeric_only=True)
        tmp_df["season_games_to_now"] = season_games_to_now
        return tmp_df

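    # Compute the up-to-now averages for every game row, then attach them to the label columns
    up_to_now_averages = df.apply(reduce_games_up_to_now, axis=1)
    result = result.join(up_to_now_averages)
    # result.to_pickle("../Outputs/game_up_to_now_stats.pkl")
    return result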

game_to_now_stats = game_up_to_now_stats(all_game_stats)
game_to_now_stats.head(3)




Unnamed: 0,gold_label,game_id,season,game_date,team,is_home_team,point_dif,off_drives,off_total_start_pos,completed_passes,...,pass_td_allowed,r_yards_allowed,opponent_r_attempts,rush_td_allowed,pick_6s,hit_their_qb,sacked_their_qb,sack_yards,tackles_for_loss,season_games_to_now
0,3,2009091000,2009,20090910,PIT,1.0,,,,,...,,,,,,,,,,0.0
1,-3,2009091000,2009,20090910,TEN,0.0,,,,,...,,,,,,,,,,0.0
2,-15,2009091304,2009,20090913,CLE,1.0,,,,,...,,,,,,,,,,0.0


In [3]:
game_to_now_stats.tail(3)

Unnamed: 0,gold_label,game_id,season,game_date,team,is_home_team,point_dif,off_drives,off_total_start_pos,completed_passes,...,pass_td_allowed,r_yards_allowed,opponent_r_attempts,rush_td_allowed,pick_6s,hit_their_qb,sacked_their_qb,sack_yards,tackles_for_loss,season_games_to_now
5049,7,2018121611,2018,20181216,PHI,0.0,-1.0,11.0,847.538462,28.153846,...,1.615385,120.230769,23.461538,0.923077,0.0,8.461538,2.923077,-20.615385,2.769231,13.0
5050,-5,2018121700,2018,20181217,CAR,1.0,-0.615385,10.846154,827.384615,25.307692,...,2.538462,109.461538,25.0,0.923077,0.0,4.769231,2.384615,-15.0,4.0,13.0
5051,5,2018121700,2018,20181217,NO,0.0,12.615385,10.538462,757.307692,26.076923,...,1.923077,86.538462,21.923077,0.846154,0.076923,6.461538,3.461538,-22.461538,2.538462,13.0
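
The row-wise `apply` in `game_up_to_now_stats` filters the whole frame once per game. As a rough alternative, the same "season up to now" idea can be sketched with a grouped shift and expanding mean. This is only an illustration on an invented toy frame, not the notebook's method: the numbers are made up, `point_dif_to_now` is a hypothetical column name, and the sketch assumes each team plays at most one game per date.

```python
import pandas as pd

# Invented toy data: two teams, three games each in one season
toy = pd.DataFrame({
    "season":    [2017, 2017, 2017, 2017, 2017, 2017],
    "game_date": [20170910, 20170910, 20170917, 20170917, 20170924, 20170924],
    "team":      ["DAL", "OAK", "DAL", "OAK", "DAL", "OAK"],
    "point_dif": [16.0, 10.0, -25.0, 25.0, 11.0, -17.0],
})

# Sort so that each team's games appear in chronological order within a season
toy = toy.sort_values(["season", "team", "game_date"]).reset_index(drop=True)

# cumcount() numbers each team's games 0, 1, 2, ... within a season,
# which equals the number of games already played before that row
toy["season_games_to_now"] = toy.groupby(["season", "team"]).cumcount()

# shift(1) hides the current game; expanding().mean() averages everything before it
toy["point_dif_to_now"] = (
    toy.groupby(["season", "team"])["point_dif"]
       .transform(lambda s: s.shift(1).expanding().mean())
)
```

Each team's first row comes out as NaN here as well, because nothing precedes the opener within the season.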


# Looks like we have a small issue...
Right above this are the head and tail of our DataFrame of statistical averages calculated relative to a given game. The tail looks fine, but what happened to the head? Why so many NaNs?

Looking at a bigger piece of the DataFrame, it becomes clear that the first game for each team in each season is blank. This makes sense: we average only the games played earlier in the current season, and before a team's opener there are no such games. What can we do about these empty rows? One option would be to simply drop them from the dataset, accepting the fact that the start of an NFL season is an unpredictable time for any expert or any model. We simply don't know enough to make good predictions.
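
To see where the NaN values come from, here is a minimal sketch on invented data. The helper `average_before` is hypothetical and only mirrors the filtering idea used in `reduce_games_up_to_now`: select a team's earlier games in the same season, then average them.

```python
import pandas as pd

# Invented toy data for a single team
toy_games = pd.DataFrame({
    "season": [2017, 2017, 2017],
    "game_date": [20170910, 20170917, 20170924],
    "team": ["DAL", "DAL", "DAL"],
    "point_dif": [16, -25, 11],
})

def average_before(row, games):
    # Keep only this team's games from the same season that were played earlier
    prior = games[(games["season"] == row["season"])
                  & (games["game_date"] < row["game_date"])
                  & (games["team"] == row["team"])]
    return len(prior), prior["point_dif"].mean()

# Season opener: nothing to average, so the mean of the empty selection is NaN
opener_count, opener_average = average_before(toy_games.iloc[0], toy_games)

# Third game: the two earlier games are averaged
third_count, third_average = average_before(toy_games.iloc[2], toy_games)
```

The opener has zero prior games and a NaN average, while the third game averages the first two results.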

A different option is to simply use the information we have at hand before the first game of the season is played. In other words, use the season averages from the prior season to fill in the "Up-to-now" calculated statistics for the first game of each season for each team. While the predictions for these season-opening games may not be as accurate, since players and coaches change teams often between seasons, it should still work better than random chance. 

We already calculated the per-season averages for each team above. Hence, it's easy to fill in the first game of each season from the prior season as follows.

In [4]:
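def fill_season_openers(games_up_to_now: pd.DataFrame, season_averages: pd.DataFrame):
    # Fills season-opener rows in place with each team's prior-season averages

    def fill_first_game_stats(row):
        # Look up this team's averages from the season before the opener
        relevant_season = season_averages[(season_averages["season"] == row["season"] - 1)
            & (season_averages["team"] == row["team"])]
        # Label-based slice: overwrite every stat column from point_dif through tackles_for_loss
        row["point_dif":"tackles_for_loss"] = relevant_season.loc[relevant_season.last_valid_index(), "point_dif":"tackles_for_loss"]
        return row

    # Rows with season_games_to_now == 0 are the openers that came out as NaN
    is_opener = games_up_to_now["season_games_to_now"] == 0
    games_up_to_now[is_opener] = games_up_to_now[is_opener].apply(fill_first_game_stats, axis=1)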
 
fill_season_openers(game_to_now_stats, per_season_stats)
game_to_now_stats = game_to_now_stats.drop(["season", "game_date", "team"], axis = 1)
game_to_now_stats.head(5)

Unnamed: 0,gold_label,game_id,is_home_team,point_dif,off_drives,off_total_start_pos,completed_passes,net_passing_yards,passes_attempted,air_yards,...,pass_td_allowed,r_yards_allowed,opponent_r_attempts,rush_td_allowed,pick_6s,hit_their_qb,sacked_their_qb,sack_yards,tackles_for_loss,season_games_to_now
0,3,2009091000,1.0,7.8,11.875,818.1875,18.9375,206.3125,31.625,205.5625,...,0.75,80.25,24.375,0.4375,0.092866,5.375,3.1875,21.875,5.1875,0.0
1,-3,2009091000,0.0,8.8,11.9375,806.975,16.5625,176.1875,28.3125,172.70625,...,0.75,93.875,25.1875,0.75,0.092866,8.0,2.75,16.375,6.6875,0.0
2,-15,2009091304,1.0,-7.4,10.75,754.65,14.875,148.75,30.5,115.9,...,1.1875,151.9375,33.8125,1.0,0.092866,3.5625,1.0625,5.625,3.5,0.0
3,15,2009091304,0.0,2.9,11.875,839.5625,16.6875,184.75,28.25,180.8,...,0.9375,76.875,23.1875,0.625,0.092866,6.125,2.8125,19.0,6.1875,0.0
4,19,2009091307,1.0,4.4,11.125,779.8625,25.8125,311.0625,39.75,310.05,...,1.3125,117.8125,27.8125,0.875,0.092866,4.0,1.75,10.0625,4.375,0.0
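
The same prior-season fill can also be sketched as a merge rather than a row-wise `apply`. The frames below are invented toy data with a single stat column, and the `_prior` suffix is just a naming choice for this illustration. One difference worth noting: in this sketch, a team with no prior-season row would simply stay NaN.

```python
import pandas as pd

# Invented prior-season averages
season_avgs = pd.DataFrame({
    "season": [2016, 2016],
    "team": ["DAL", "OAK"],
    "point_dif": [7.2, 1.9],
})

# Invented up-to-now rows: the openers have no stats yet
to_now = pd.DataFrame({
    "season": [2017, 2017, 2017],
    "team": ["DAL", "DAL", "OAK"],
    "point_dif": [float("nan"), 16.0, float("nan")],
    "season_games_to_now": [0, 1, 0],
})

# Relabel each season-average row with the season it should fill (the following one)
prior = season_avgs.assign(season=season_avgs["season"] + 1)

# A left merge lines up every game with its team's prior-season average
merged = to_now.merge(prior, on=["season", "team"], how="left", suffixes=("", "_prior"))

# Only season openers take the prior-season value; later games keep their own averages
is_opener = merged["season_games_to_now"] == 0
merged.loc[is_opener, "point_dif"] = merged.loc[is_opener, "point_dif_prior"]
filled = merged.drop(columns="point_dif_prior")
```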


# Only a couple more steps before we can start training models!
First, note that the cell above ended by dropping the columns that won't play any part in training our model: season, game_date, and team. These values help identify a given game and compare it to real events, but they are not used when training the model, since they say nothing about each team's in-game performance. Remember, we've summarized each team's actual game performance as a vector of statistics averaged over the course of a season up to, but not including, the game to be predicted. Hence, we have all the information we need at hand.

Now, football games require two teams, so to make predictions for a given game, we need to put the data for both teams on one row. We do this by splitting the dataframe into two sections for home and away, adding suffixes to the column labels for both, and then joining on game_id. We can also employ a trick here to effectively double the size of our data set: we'll make one stat line for a game with the home team's stats first and the away team second, and another with the away team first and the home team second. As long as we record if the first team is home or away in one of the columns, this effectively turns the statistics from one game into two observations to use for training, one from each team's point of view. With such a small data set, we need to squeeze every bit of data out that we can, so this doubling trick is quite useful.

After dropping the game IDs, the data can be used in a scikit-learn model by simply identifying which column contains the labels. For input into a PyTorch neural network, we need another few lines to convert it to a tensor.


In [5]:
# 'our' and 'their' refer to the point of view of the team who appears first in each line of data
our_stats = (game_to_now_stats[game_to_now_stats["is_home_team"]==1.0] 
                 .add_suffix("_us") )
their_stats = (game_to_now_stats[game_to_now_stats["is_home_team"]==0.0] 
                 .add_suffix("_them") )
home_team_stats = pd.merge(our_stats, their_stats, left_on="game_id_us", right_on="game_id_them")
our_stats = (game_to_now_stats[game_to_now_stats["is_home_team"]==0.0] 
                 .add_suffix("_us") )
their_stats = (game_to_now_stats[game_to_now_stats["is_home_team"]==1.0] 
                 .add_suffix("_them") )
away_team_stats = pd.merge(our_stats, their_stats, left_on="game_id_us", right_on="game_id_them")
labeled_data_df = (pd.concat([home_team_stats, away_team_stats], ignore_index=True)
              .drop(["game_id_us", "gold_label_them", "is_home_team_them", "game_id_them"], axis=1)
              .rename(columns={"gold_label_us":"gold_label", "is_home_team_us":"is_home_team"})
              )
labeled_data_df.head()


Unnamed: 0,gold_label,is_home_team,point_dif_us,off_drives_us,off_total_start_pos_us,completed_passes_us,net_passing_yards_us,passes_attempted_us,air_yards_us,yards_after_catch_us,...,pass_td_allowed_them,r_yards_allowed_them,opponent_r_attempts_them,rush_td_allowed_them,pick_6s_them,hit_their_qb_them,sacked_their_qb_them,sack_yards_them,tackles_for_loss_them,season_games_to_now_them
0,3,1.0,7.8,11.875,818.1875,18.9375,206.3125,31.625,205.5625,111.454935,...,0.75,93.875,25.1875,0.75,0.092866,8.0,2.75,16.375,6.6875,0.0
1,-15,1.0,-7.4,10.75,754.65,14.875,148.75,30.5,115.9,111.454935,...,0.9375,76.875,23.1875,0.625,0.092866,6.125,2.8125,19.0,6.1875,0.0
2,19,1.0,4.4,11.125,779.8625,25.8125,311.0625,39.75,310.05,111.454935,...,1.5625,172.125,33.5,1.9375,0.092866,3.5,1.875,11.9375,4.6875,0.0
3,-13,1.0,2.4,11.8125,798.525,22.1875,226.1875,35.125,221.2875,111.454935,...,1.1875,106.625,25.125,0.6875,0.092866,7.0,3.6875,23.375,5.75,0.0
4,-18,1.0,-1.8,10.8125,759.0375,22.9375,266.6875,34.6875,249.75,111.454935,...,1.4375,94.875,25.4375,0.625,0.092866,4.125,2.5625,15.8125,4.3125,0.0
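
To make the doubling trick concrete, here is a small sketch on an invented four-row frame (two games, one stat). The helper `pair_rows` is hypothetical and simply wraps the filter, suffix, and merge pattern from the cell above so it can be called once per point of view.

```python
import pandas as pd

# Invented toy data: two games, one row per team
toy = pd.DataFrame({
    "gold_label":   [3, -3, -15, 15],
    "game_id":      [1, 1, 2, 2],
    "is_home_team": [1.0, 0.0, 1.0, 0.0],
    "point_dif":    [7.8, 8.8, -7.4, 2.9],
})

def pair_rows(df, first_is_home):
    # "us" is the team listed first; "them" is its opponent in the same game
    us = df[df["is_home_team"] == first_is_home].add_suffix("_us")
    them = df[df["is_home_team"] != first_is_home].add_suffix("_them")
    return pd.merge(us, them, left_on="game_id_us", right_on="game_id_them")

# One observation per game from the home side, one from the away side
both_views = pd.concat([pair_rows(toy, 1.0), pair_rows(toy, 0.0)], ignore_index=True)
```

Two games become four observations, and each row's label is the negative of its opponent's. Because the two rows for a game are mirror images of each other, it is probably wise to keep them on the same side of any train/test split.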


# Looks great! Now let's export it to use in the next section, where the real fun starts...
Saving it as a pickle preserves all the data types, which makes it easier to read back in later.

<a href="4.SciKItLearn_NFL_models.ipynb">Click here for the next section, building NFL models using SciKitLearn!</a>

In [6]:
output_file_name = "labeled_data.pkl"
labeled_data_df.to_pickle(directory+output_file_name)
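
Once the pickle exists, a later notebook might reload it and separate the labels from the features. The following is a minimal sketch under stated assumptions: it uses an invented toy frame and a temporary directory instead of the real `labeled_data.pkl`, and the float32 conversion is just one common choice for neural network input.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Invented stand-in for the labeled data
toy_labeled = pd.DataFrame({
    "gold_label": [3, -15, 19],
    "is_home_team": [1.0, 1.0, 0.0],
    "point_dif_us": [7.8, -7.4, 4.4],
    "point_dif_them": [8.8, 2.9, -1.2],
})

# Round-trip through a pickle in a temporary directory
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_labeled_data.pkl")
    toy_labeled.to_pickle(path)
    reloaded = pd.read_pickle(path)

# The column dtypes survive the round trip
dtypes_preserved = reloaded.dtypes.equals(toy_labeled.dtypes)

# Labels in one array, every remaining column as features
y = reloaded["gold_label"].to_numpy(dtype=np.float32)
X = reloaded.drop(columns="gold_label").to_numpy(dtype=np.float32)
# With PyTorch installed, torch.from_numpy(X) would turn this array into a tensor.
```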