# PREDICTING YARDAGE FOR THE NFL


## 1. BIG PICTURE

The goal in American football is for the offence to run (rush) or throw (pass) the ball to gain yards, move towards and finally across the opposing team's side of the field to score. The defenses' goal is to prevent the offensive team from scoring. 

Her eI use machine learning to find context to what contributes to a successfull run play, and predict the yardage (how many yards the team will gain on a rushing play). 

The data is from the Superbowl season 2017-2018. 

## 2. GET THE DATA

The data was downloaded from https://www.kaggle.com/c/nfl-big-data-bowl-2020/data.

In [1]:
import pandas            as pd
import numpy             as np
import matplotlib.pyplot as plt
from   datetime          import date

train_df = pd.read_csv('./data/train.csv', low_memory=False);

To find out what needed to be done to improve the data I examined whether the features were numerical or not, and how many `NaN` values there were.

I then examined in much more detail the features that were Strings and the ones with `NaN` values. 

In [2]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 509762 entries, 0 to 509761
Data columns (total 49 columns):
GameId                    509762 non-null int64
PlayId                    509762 non-null int64
Team                      509762 non-null object
X                         509762 non-null float64
Y                         509762 non-null float64
S                         509762 non-null float64
A                         509762 non-null float64
Dis                       509762 non-null float64
Orientation               509744 non-null float64
Dir                       509748 non-null float64
NflId                     509762 non-null int64
DisplayName               509762 non-null object
JerseyNumber              509762 non-null int64
Season                    509762 non-null int64
YardLine                  509762 non-null int64
Quarter                   509762 non-null int64
GameClock                 509762 non-null object
PossessionTeam            509762 non-null object
Down   

## 3. PROBLEMS IN THE DATA AND PROCESSING THEM:

After examining the data I found several things that needed to be fixed. I will not go into all the examining here because it was very extensive and quite messy.

However I have here a summary of the different features I needed to change and a short description of what needed to be done:

#### TIMESTAMPS
The feature `GameClock` was given on the format `hh:mm:ss` as a string and I wanted to use numerical information for this. The solution was to transform the timestamp into three new features: `GameClockHour`, `GameClockMinute` and `GameClockSecond`. For this I used the function `timeStampSplit`.

Then I delete the original feature `GameClock`.

In [2]:
def timeStampSplit(df, col):
    """
    Function that takes a timestamp on the format hh:mm:ss as and splits it into new columns
    for the hour, minute and seconds.
    
    Input:
        df  = The original dataframe
        col = The column name for the feature we want to split
        
    Output:
        An updated dataframe.
    """
    # Define the new column names:
    colNames = [col + 'Hour', col + 'Minute', col + 'Second'];
    # Create a new dataframe, the one we want to add:
    new_df = pd.DataFrame(df[col].str.split(':', 2).tolist(), columns=colNames);
    
    # Make the new features numerical:
    for c in colNames:
        new_df[c] = pd.to_numeric(new_df[c]);
    
    # Concat the two dataframes and return it:
    return pd.concat([df, new_df], axis=1, sort=False);

#### HEIGHT

The player heights are given in the feature `PlayerHeight` on the format `foot-inches`. To make this into a numerical value I created the function `heightToCm` which also transforms the height into centimeters.

In [3]:
def heightToCm(str):
    """
    Function used to convert the player heights given on the format foot-inches into cm.
    
    Input: 
        str = the height given as foot-inches, e.g. 5-10
        
    Output:
        The height now in centimeteres, as a float
        
    Usage:
        Here it is used on the PlayerHeight feature.
    """
    if pd.isnull(str):
        return np.nan;
    # Separate the foot and inches:
    arr = str.split('-');
    
    height = int(arr[0])*30.48 + int(arr[1])*2.54;
    return height;

#### Dates

For some features we have dates as information. For these we want to split the strings into year, month and day features. This is relevant for the feature `PlayerBirthDate`. But also for `TimeHandoff` and `TimeSnap` which are given as UTC time stamps. For these we use the function `utcSplit` which uses both `timeStampSplit` and `dateSplit`. 

Then we remove the features `PlayerBirthDate`, `TimeHandoff` and `TimeSnap`.

In [4]:
def dateSplit(df, col, char):
    """
    Functions which take the date-stamp and split it into three new features for the year,
    month and day. The dates can be given on two formats:
        mm/dd/yyyy or yyyy-mm-dd
    
    Input:
        df   = the original dataframe
        col  = the column name of the dataframe we want to split
        char = the character used to split between the year, month and day
        
    Output:
        An updated dataframe
    """
    # The column names depends on the character used for the split:
    if char == '-':
        colNames = [col + 'Year', col + 'Month', col + 'Day'];
    else:
        colNames = [col + 'Month', col + 'Day', col + 'Year'];
        
    # Create a new dataframe, the one we want to add:
    new_df = pd.DataFrame(df[col].str.split(char, 2).tolist(), columns=colNames);
    
    # Make the new features numerical:
    for c in colNames:
        new_df[c] = pd.to_numeric(new_df[c]);
    
    # Concat the two dataframes and return it:
    return pd.concat([df, new_df], axis=1, sort=False);

def utcSplit(df, col):
    """
    Function to split the UTC timestamp into new features for the year, month, day, hour,
    minute and seconds.
    
    Input:
        df  = the original dataframe
        col = the column we want to apply this on
    
    Output:
        An updated dataframe
    """
    # Create a temporary dataframe copy to separate the date and time:
    df1      = df[col].copy();
    # Remove the timestamp:
    df1[col] = df[col].apply(lambda s: s[:s.find('T')]);
    # Split into the new features:
    df1      = dateSplit(df1, col, '-');
    # Remove the original column from this one:
    df1      = df1.drop(col, axis=1);
    
    # Create another temporary dataframe copy to separate the date and time:
    df2      = df[col].copy();
    # Remove the date stamp:
    df2[col] = df[col].apply(lambda s: s[s.find('T')+1:-5]);
    # Split into the new features:
    df2      = timeStampSplit(df2, col)
    # Remove the original column:
    df2      = df2.drop(col, axis=1);
    
    # Concat all the new features:
    df3 = pd.concat([df1, df2], axis=1, sort=False);
    
    # Concat with the original and return:
    return pd.concat([df, df3], axis=1, sort=False);

#### LOCATION

The `Location` feature in the data tells us the location of the game. This feature is riddled with writing errors and information in different formats. To handle this I examined each of the locations and found that the best format is simply the city name in all caps. 

This feature is handled using the `fixLocation` function.

In [5]:
edited_locations = [];
def fixLocation(loc):
    """
    Function to fix the locations so they just describe the city of the game.
    
    Input:
        loc = the current location
        
    Output:
        Correct format for the current location
    """
    old_loc = loc;
    if pd.isnull(loc):
        return np.nan;
    
    # First we remove the ',' from the location:
    if loc.find(',') != -1:
        loc = loc[:loc.find(',')];
    
    # Then if the , has been confused with a .:
    if loc.find('.') != -1 and len(loc[loc.find('.'):]) < 5:
        loc = loc[:loc.find('.')];
        
    # The last option is to see if we find the correct version in the already
    # edited locations:
    if loc.find(' ') != -1:
        words = loc.split(' ');
        word  = words[0]
        if len(word) < 3: 
            word = words[1];
        if word in str(edited_locations):
            for l in edited_locations:
                if word in l:
                    loc = l;
                    break;
        elif len(words) > 2 and len(words[2]) < 3:
            loc = words[0] + words[1];
                    
    if loc not in edited_locations:
        edited_locations.append(loc);
    
    return loc;

#### STADIUM NAMES

Because the `StadiumType` depens on the Stadium name we need to fix this first. Here the main problem is wrong or outdated names on the stadiums. The correct ones were found through googling.

Here names are corrected so they are all in the same format, and updated if needed.

In [6]:
def fixStadiums(stadium):
    """
    Function to change the stadion names to be the correct format, and to update the names if they have changed.
    The source for changing the names was wikipedia.
    
    Input:
        stadium = The stadium name
        
    Output:
        The correct name and format for the stadium
    """
    if pd.isnull(stadium):
        return np.nan;
    
    s = "STADIUM";
    # Handle the misspellings:
    if stadium.endswith("STDIUM"):
        stadium = stadium.replace("STDIUM", s);
    elif stadium.endswith("COLIESUM"):
        stadium = stadium.replace("COLIESUM", "COLISEUM");
    
    # Fix everything for firstenergy:
    if stadium.find("FIRST") != -1:
        stadium = "FIRSTENERGY " + s;
    # Fix every M&T:
    if stadium.find("&") != -1 and stadium[stadium.find("&")] != 'T':
        stadium = "M&T BANK " + s;
    # Fix all the Mercedes-Benz:
    if stadium.find("MERCEDES") != -1:
        stadium = "MERCEDES-BENZ SUPERDOME";
    # Fix all the CenturyLink:
    if stadium.find("CENTURY") != -1:
        stadium = "CENTURYLINK FIELD";
    # Fix all the MetLife:
    if stadium.find("METLIFE") != -1:
        stadium = "METLIFE " + s;  
    # Fix all the Twickenham:
    if stadium.find("TWICKENHAM") != -1:
        stadium = "TWICKENHAM " + s;
    
    # Update the wrong ones:
    if stadium.find("UNIVERSITY") != -1:
        stadium = "STATE FARM " + s;
    elif stadium.find("MILE") != -1:
        stadium = "EMPOWER FIELD";
    elif stadium.find("EVERBANK") != -1:
        stadium = stadium.replace("EVERBANK", "TIAA BANK"); 
    elif stadium.find("STUB") != -1:
        stadium = "DIGNITY HEALTH SPORTS PARK";    
    elif stadium.find("OAKLAND") != -1:
        stadium = "RINGCENTRAL COLISEUM";
        
    return stadium;

#### STADIUM TYPE

For the `StadiumType` feature there were a lot of writing mistakes and different definitions which all could simply be sorted into Indoors or Outdoors. To fix this each element was sorted as either Indoors, and then assigned the value `1` or outdoors with the assigned value `0`. 

The features `StadiumType` and `Stadium` are obviously related, and we also check if the `Stadium` already has a defined `StadiumType` value for the `NaN` instances.

In [7]:
# Before we write the function we also need to define where the StadiumType is nan, 
# And the corresponding stadiums:
nanStadiums     = train_df.Stadium[pd.isnull(train_df.StadiumType)].unique();
nanStadiumTypes = [];
for nsta in nanStadiums:
    if len(train_df.StadiumType[train_df.Stadium == nsta].value_counts()) == 0:
        nanStadiumTypes.append(train_df.StadiumType.iloc[train_df.StadiumType[train_df.Stadium == nanStadiums[0]].index[0]]);
    else:
        nanStadiumTypes.append(train_df.StadiumType[train_df.Stadium == nsta].value_counts().idxmax())

        
def fixStadiumType(stad):
    """
    Function to sort the stadium type in either outdoors = 0, or indoors = 1. Also
    replaces the nan values as much as possible based on the stadium name.
    
    Input:
        stad = the stadium type
        
    Output:
        0 or 1, depending on the sorting.
    """
    if pd.isnull(stad):
        return np.nan;
    
    sType = stad[0];
    sName = stad[1];
    
    
    if pd.isnull(sType):
        ind   = np.where(nanStadiums == sName)[0][0];
        sType = nanStadiumTypes[ind];
    
    if not pd.isnull(sType):
        if sType.find('Ou') != -1 or sType.find('Open') != -1 or sType.find('open') != -1 or \
        sType.find('Field') != -1 or sType.find('Cloudy') != -1 or sType.find('Bowl') != -1:
            return 0;
        else:
            return 1;
        
    return sType;

#### TURF

The turf is the type of grass the field has and can be either Natural, Artificial or a Hybrid. For the `Turf` feature I wrote a function to sort the different types into these three alternatives, after doing some research on the different types of course.

In [8]:
def fixTurf(t):
    """
    Function for sorting the different types of turf into Artificial, Natural and Hybrid.
    
    Input:
        t = the turf definition
        
    Output: 
        Artificial/Hybrid/Natural
        
    Usage:
        For use on the Turf feature.
    """
    if pd.isnull(t):
        return np.nan;
    
    
    if t.find('Turf') != -1 or t.find('turf') != -1 or t.find('Artif') != -1 or t.find('UBU') != -1:
        t = "Artificial";
    elif t.find('DD') != -1 or t.find("SIS") != -1:
        t = "Hybrid";
    else:
        t = "Natural";
    
    return t;

#### WEATHER

The `GameWeather` feature has many different descriptions, but can for the most part be sorted into 6 categories: indoors, sunny, rain, snow, fog and cloudy. For the missing values we estimate the category based on the `Humidity` and `Temperature` features. 

NB: have to estimate median and mean values for these two features before running this function!

In [9]:
# First we fix so the weather indoors is indoors:
train_df.GameWeather[train_df.StadiumType == 1] = "INDOORS";

def fixWeather(w):
    """
    Function sorting the different weather conditions into the following categories:
        INDOORS
        SUNNY
        RAIN
        SNOW
        FOG
        CLOUDY
        
    Quite roughly sorted, the definitions were very different considering details.
    
    Input:
        w = The weather definition
        
    Output: 
        The sorting
        
    Usage:
        For the GameWeather feature.
    """
    if isinstance(w, float):
        return '';
    
    if w.find('INDO') != -1 or w.find('CONTROLLED') != -1 or w.find('T: 51') != -1:
        w = 'INDOORS';
    elif w.find('SUN') != -1 or w.find('CLEAR') != -1 or w.find('FAIR') != -1:
        w = 'SUNNY';
    elif w.find('RAIN') != -1 or w.find('SHOWE') != -1:
        w = "RAIN";
    elif w.find('SNOW') != -1:
        w = "SNOW";
    elif w.find('FOG') != -1 or w.find('HAZY') != -1:
        w = "FOG";
    else:
        w = "CLOUDY";
    
    return w;

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  


Before moving on also adjust for the `NaN` values based on the `Temperature` and `Humidity` features.

In [10]:

allNanvalues = [train_df["Humidity"][pd.isnull(train_df["GameWeather"])].unique(), \
            train_df["Temperature"][pd.isnull(train_df["GameWeather"])].unique()]

for nanvalues in allNanvalues:
    for val in nanvalues:
        if len(train_df["GameWeather"][train_df["Humidity"] == val].value_counts()) > 0:
            gwHumid = train_df["GameWeather"][train_df["Humidity"] == val].value_counts().idxmax();
            temp    = train_df["Temperature"][train_df["Humidity"] == val].value_counts().idxmax();
            gwTemp  = train_df["GameWeather"][train_df["Temperature"] == temp].value_counts().idxmax();
            if gwHumid == gwTemp:
                inds = train_df["GameWeather"][(pd.isnull(train_df["GameWeather"])) & (train_df["Humidity"] == val)];
                train_df["GameWeather"][inds.index] = gwHumid;

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  del sys.path[0]


#### TEAM

The `Team` feature can be set to either home or away. To make it numerical I created a function to set the the value to `0` if the team is `away` and `1` if the team is `home`.

In [11]:
def fixTeam(t):
    """
    Function to change away/home into numbers. Chose to use 0 = away and 1 = home.
    
    Input:
        t = away or home
        
    Output:
        0 or 1
        
    Usage:
        For the Team feature.
    """
    if t == 'away':
        return 0;
    else:
        return 1;

#### WIND

I created two functions for the two features `WindSpeed` and `WindDirection` because for some of the cases the values were switched between the two. 

In [12]:
dirInSpeed = ['SSW', 'E', 'CALM', 'SE'];
speedInDir = ['13', '8', 'CALM', '1'];
# Function to check if the value can be converted to an int:
def isInt(value):
    try:
        float(value);
        return True;
    except ValueError:
        return False;

def fixWindSpeed(ws):
    """
    Fixes the windspeed, takes the values and turns them into floats. Also removes all instances of units [mph],
    converts from gusts and estimates a mean if there are two values given.
    
    Also switches the instances that have been mixes with the wind direction.
    
    Input: 
        ws = the original windspeed value on the different formats
        
    Output:
        The numerical value for the windspeed
        
    Usage:
        On the feature WindSpeed.
    """
    if isinstance(ws, float):
        return ws;
    if isInt(ws):
        ws = float(ws);
    else:
        if (ws in dirInSpeed):
            replacement = train_df.WindDirection[train_df.WindSpeed == ws].iloc[0];
            if isInt(replacement):
                ws = float(replacement);
            else:
                ws = replacement;
        elif ws.find("MPH") != -1: # For e.g. 13MPH
            ws = float(ws.replace("MPH", ""));
        elif ws.find("-") != -1: # For e.g. 12-20
            val1 = float(ws[0:ws.find("-")]);
            val2 = float(ws[ws.find("-")+1:]);
            ws   = (val1 + val2) / 2;
        else:
            sum   = 0;
            count = 0;
            for val in ws.split(" "):
                if isInt(val):
                    sum = sum + float(val);
                    count = count + 1;
            ws = sum/count;
    return ws;

def fixWindDir(wd):
    """
    Fixes the wind direction feature to have a regular maximum three letter format, and changes the places where
    it was mixed with the wind speeds.
    
    Input:
        wd = the original format for the wind direction
        
    Output:
        The updated correct format for the wind direction
        
    Usage: 
        On the feature WindDirection.
    """
    if isinstance(wd, float):
        return wd;
    
    if (wd in speedInDir):
        wd = dirInSpeed[speedInDir.index(wd)];
    
    wd = wd.replace("NORTH", "N");
    wd = wd.replace("SOUTH", "S");
    wd = wd.replace("EAST", "E");
    wd = wd.replace("WEST", "W");
    wd = wd.replace("FROM", "");
    wd = wd.replace("-", "");
    wd = wd.replace(" ", "");
    
    return wd.strip();

#### PLAYER DIRECTION

The `PlayerDirection` feature is either left or right, so I created a function that sets the feature to `0` if `left` and `1` if `right`. 

In [13]:
def fixPlayDir(dir):
    """
    Encodes the player direction so left = 1 and right = 0.
    
    Input:
        dir = left/right
        
    Output:
        1 / 0
        
    Usage:
        On the feature PlayDirection.
    """
    if dir == 'left':
        return 0;
    else:
        return 1; #right

#### OFFENSEPERSONNEL & DEFENSEPERSONNEL

These two features are quite similar so I created a function two turn them into arrays with the placements for the different players.

In [14]:
def fixOffensePersonnel(str):
    """
    Sorting the offense personell into an array according to the format:
    [QB, OL, RB, TE, WR, DL, LB, DB] where:
        QB = Quarterback
        OL = Offensive Lineman
        RB = Running-back
        TE = Tight End
        WR = Wide Receiver
        DL = Defensive Lineman
        LB = Linebacker
        DB = Defensive back
        
    Input:
        str = string with the number of people in the different positions.
        
    Output:
        An array describing the number of people in the different positions.
        
    Usage: 
        Used on the feature OffensePersonnel
    """
    if pd.isnull(str):
        return np.nan;
    
    res = [0, 0, 0, 0, 0, 0, 0, 0];
    arr = [s.strip() for s in str.split(",")];
    for e in arr:
        if e.find('QB') != -1:
            res[0] = int(e[:e.find(' ')])
        elif e.find('OL') != -1:
            res[1] = int(e[:e.find(' ')])
        elif e.find('RB') != -1:
            res[2] = int(e[:e.find(' ')])
        elif e.find('TE') != -1:
            res[3] = int(e[:e.find(' ')])
        elif e.find('WR') != -1:
            res[4] = int(e[:e.find(' ')])
        elif e.find('DL') != -1:
            res[5] = int(e[:e.find(' ')])
        elif e.find('LB') != -1:
            res[6] = int(e[:e.find(' ')])
        elif e.find('DB') != -1:
            res[7] = int(e[:e.find(' ')])
        else:
            print("HAVE NOT ACCOUNTED FOR: ", e);
    
    
    return res;

# METHOD 14: Describing the defense as an array
def fixDefensePersonnel(str):
    """
    Converting the defense into an array on the format [DL, LB, DB, OL] where:
        DL = Defensive Lineman
        LB = Linebacker
        DB = Defensive back
        OL = Offensive Lineman
        
    Input:
        str = String describing the defense
        
    Output:
        Array describing the defense
        
    Usage:
        On the feature DefensePersonnel
    """
    if pd.isnull(str):
        return np.nan;
    
    res = [0, 0, 0, 0];
    arr = [s.strip() for s in str.split(",")];
    for e in arr:
        if e.find('DL') != -1:
            res[0] = int(e[:e.find(' ')])
        elif e.find('LB') != -1:
            res[1] = int(e[:e.find(' ')])
        elif e.find('DB') != -1:
            res[2] = int(e[:e.find(' ')])
        elif e.find('OL') != -1:
            res[3] = int(e[:e.find(' ')])
        else:
            print("HAVE NOT ACCOUNTED FOR: ", e);
                
    return res;

#### POSSESSIONTEAM & FIELDPOSITION

These features contain the current team name that has the field or the possession, so to simplify I created a function to sort them into `1` if the team is the `home` team, and `0` otherwise.

In [15]:
def homeOrAwayTeam(team):
    """
    Checks if the team is the away team or the home team and returns numerical values 
    accordingly, away = 0, home = 1.
    
    Input: 
        team = the abbreviation for the team
        
    Ouput:
        1 / 0 depending on if the team is home or away
        
    Usage:
        On the features PossessionTeam and FieldPosition.
    """
    # team[0] --> the input team
    # team[1] --> the home team
    diffInInput = ['ARZ', 'BLT', 'CLV', 'HST'];
    diffInHome  = ['ARI', 'BAL', 'CLE', 'HOU'];
    
    if team[0] == team[1]: 
        return 1;
    elif team[0] in diffInInput:
        ind = diffInInput.index(team[0]);
        if (team[1] == diffInHome[ind]):
            return 1;
        else:
            return 0;
    else:
        return 0;

#### OFFENSEFORMATION

The `OffenseFormation` has two names for one of the formation, so I created a small funciton to fix this. 

In [16]:
def fixOffForm(of):
    """
    Collect the two offense formations with different names.
    
    Input:
        of = the offensive formation
        
    Output:
        The updated offensive formation
        
    Usage:
        On the feature OffenseFormation
    """
    if of == "ACE":
        return "SINGLEBACK";
    else:
        return of;

### EXECUTING THE FUNCTIONS

Before executing the functions I have to make some of the feature-values all in uppercase:

In [17]:
upperFeatures = ["Location", "GameWeather", "Stadium", "WindSpeed", "WindDirection"];
for uf in upperFeatures:
    train_df[uf] = train_df[uf].str.upper();

The functions I wrote have two different formats, they are either ran on the feature column or with the dataframe as input. Start with the latter:

In [18]:
train_df = timeStampSplit(train_df, 'GameClock');
train_df = dateSplit(train_df, 'PlayerBirthDate', '/');
train_df = utcSplit(train_df, 'TimeSnap');
# First we need to remove all the places where TimeHandoff has nan value:
train_df = train_df[pd.notnull(train_df.TimeHandoff)];
train_df = utcSplit(train_df, 'TimeHandoff');

  result = result.union(other)
  return self._int64index.union(other)


Before I run the other functions I remove the features that I have replaced:

In [None]:
featuresToDrop = ['GameClock', 'TimeSnap', 'TimeHandoff', 'PlayerBirthDate'];
for feat in featuresToDrop:
    train_df = train_df.drop(feat, axis=1);

Now I run through the rest of the functions:

In [None]:
functions = [heightToCm, fixLocation, fixStadiumType, fixTurf, fixWeather, fixTeam, \
            fixStadiums, fixWindSpeed, fixWindDir, fixPlayDir, fixOffensePersonnel, \
            fixDefensePersonnel, fixOffForm, homeOrAwayTeam];
features  = [["PlayerHeight"], ["Location"], ["StadiumType"], ["Turf"], ["GameWeather"], \
             ["Team"], ["Stadium"], ["WindSpeed"], ["WindDirection"], ["PlayDirection"], \
             ["OffensePersonnel"], ["DefensePersonnel"], ["OffenseFormation"], \
             ["PossessionTeam", "FieldPosition"]];

for i in range(0, len(functions)):
    print(functions[i]);
    if i == len(functions)-1:
        for feat in features[i]:
            train_df[feat] = train_df[[feat, "HomeTeamAbbr"]].apply(functions[i], axis=1);
        print("done!");
    else:
        for feat in features[i]:
            train_df[feat] = train_df[feat].apply(functions[i]);
        print("done!");
    
print("TOTALLY DONE!!");

<function heightToCm at 0x114943ae8>
done!
<function fixLocation at 0x1149436a8>
done!
<function fixStadiumType at 0x114975ea0>
done!
<function fixTurf at 0x114975f28>
done!
<function fixWeather at 0x114981048>
done!
<function fixTeam at 0x114981950>
done!
<function fixStadiums at 0x114975378>
done!
<function fixWindSpeed at 0x114981ae8>
done!
<function fixWindDir at 0x114981ea0>
done!
<function fixPlayDir at 0x114d09598>
done!
<function fixOffensePersonnel at 0x114d09510>
done!
<function fixDefensePersonnel at 0x114d09158>
done!
<function fixOffForm at 0x114d09f28>
done!
<function homeOrAwayTeam at 0x114d09ae8>
done!
TOTALLY DONE!!


We also need to fix the wind direction for inside to be calm:

In [None]:
train_df["WindDirection"][(train_df["StadiumType"] == 1) & (pd.isnull(train_df["WindDirection"]))] = "CALM";

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.


Now we can drop all the rest of the nan values:

In [None]:
train_df = train_df.dropna();

### Some more changes

The features `OffensePersonnel` and `DefensePersonnel` are now as arrays, but we want simple numerical values for them all, so we create new features and include the numbers there instead:

In [None]:
# OFFENSEPERSONNEL
newNames1 = ['OP_QB', 'OP_OL', 'OP_RB', 'OP_TE', 'OP_WR', 'OP_DL', 'OP_LB', 'OP_DB'];
for n1 in newNames1:
        train_df[n1] = 0;
        
# DEFENSEPERSONNEL
newNames2 = ['DP_DL', 'DP_LB', 'DP_DB', 'DP_OL'];
for n2 in newNames2:
    train_df[n2] = 0;
    
# SOME RELEVANT FUNCTIONS:
def ofStr(str):
    return str.replace('[','').replace(']','').split(',');

def opDivide(arr, i):
    return arr[i];

#train_df['OffensePersonnel'] = train_df['OffensePersonnel'].apply(ofStr);
#train_df['DefensePersonnel'] = train_df['DefensePersonnel'].apply(ofStr);

# Create the new features from OffensePersonnel
count1 = 0;
for n1 in newNames1:
    train_df[n1] = train_df['OffensePersonnel'].apply(opDivide, i=count1);
    count1  = count1 + 1;

# Create the new features from DefensePersonnel
count2 = 0;
for n2 in newNames2:
    train_df[n2] = train_df['DefensePersonnel'].apply(opDivide, i=count2);
    count2  = count2 + 1;
    
# Remove OffensePersonnel and DefensePersonnel:
train_df = train_df.drop('OffensePersonnel', axis=1);
train_df = train_df.drop('DefensePersonnel', axis=1);

The last thing we do processing-wise is to make sure that the coordinates are within the field, and encode the season to 0 and 1 to get more reasonable values. 

In [None]:
train_df = train_df.drop(train_df.X[(train_df.X < 0) | (train_df.X > 120)].index);
train_df = train_df.drop(train_df.Y[(train_df.Y < 0) | (train_df.Y > 53.33)].index);
train_df.Season[train_df.Season == 2017] = 0;
train_df.Season[train_df.Season == 2018] = 1;

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  This is separate from the ipykernel package so we can avoid doing imports until
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  after removing the cwd from sys.path.


In [None]:
train_df = pd.read_csv('./data1/processed.csv', low_memory=False);

Need to perform one more drop of the nan values because not all the NaN values were dropped before:

In [None]:
train_df = train_df.dropna();

Now we can finally save this datastructure that is fully processed:

In [None]:
train_df.to_csv('./data1/processed.csv', index=False);

## 5. SELECTING A MODEL

Now that we start with the model choosing and encoding we just upload the `processed.csv` file that contains the correctly processed dataframe. So from this point we import everything anew so we don't have to run through the entire notebook again.

Because we import everything again we need to reset the variables in the case where we want to run the entire notebook:

In [None]:
%reset





In [6]:
import pandas                  as     pd
import numpy                   as     np
import matplotlib.pyplot       as     plt
from   datetime                import date
from   sklearn.base            import BaseEstimator, TransformerMixin
from   sklearn.preprocessing   import OrdinalEncoder
from   sklearn.pipeline        import Pipeline
from   sklearn.model_selection import StratifiedShuffleSplit
from   sklearn.impute          import SimpleImputer
from   sklearn.preprocessing   import StandardScaler
from   sklearn.pipeline        import FeatureUnion
from   sklearn.ensemble        import RandomForestRegressor
import time
from   sklearn.metrics         import mean_squared_error
import joblib

train_df = pd.read_csv('./data1/processed.csv', low_memory=False);

Before we can encode and scale the data, and apply the regressor we need to split the data into test and training data. To do this I used the StratifiedShuffleSplit function because I wanted the different features to be as truthfully represented, in the training and test data, as possible.

In [7]:
split = StratifiedShuffleSplit(n_splits = 1, test_size = 0.2, random_state = 42);

The split is then performed based on the most important feature. To find this I used the correlation matrix. Here I look at the highest and the lowest values. For high absolute value the features are highly correlated linearly.

I want to find the feature that correlates the most with `Yards` because this is the target feature. 

In [8]:
corr_matrix = train_df.corr();
print("largest: ", corr_matrix["Yards"].nlargest(n=2));
print("smallest:", corr_matrix["Yards"].nsmallest(n=1));

largest:  Yards    1.00000
DP_DB    0.08315
Name: Yards, dtype: float64
smallest: DefendersInTheBox   -0.10709
Name: Yards, dtype: float64


From the correlation matrix results above we see that the feature that correlates the most with `Yards` is `DefendersInTheBox`, and so we choose to perform the split using this feature:

In [9]:
for train_index, test_index in split.split(train_df, train_df["DefendersInTheBox"]):
    data_train = train_df.loc[train_index];
    data_test  = train_df.loc[test_index];

Now we save these training data into separate files so we can easily extract them for testing the encoding methods:

In [10]:
data_train.to_csv('./data1/train1.csv', index=False);
data_test.to_csv('./data1/test1.csv', index=False);

Now we have the training data in one file and the test data in another.

When training the regressor we don't want to use the target feature, so we need to divide the data even further into `data`, which we will train on, and `labels`, the target values.

In [11]:
labels = train_df["Yards"].copy();
data   = train_df.drop("Yards", axis=1);

Before we apply the RandomForest regressor we have to encode the string features and a standard scaler on the numeric features because of their wide range. The encoding and the scaling are both performed in a pipeline. For the pipelines we also need to create a function that chooses either the numerical or the string features.

For this we write the function `DataFrameSelector`.

In [12]:
class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attributeNames):
        self.attributeNames = attributeNames;
    def fit(self, X, y = None):
        return self;
    def transform(self, X):
        return X[self.attributeNames].values;

To be able to use the `DataFrameSelector` we need to separate between the numerical and the string features.

In [13]:
num_attributes = np.array(data.dtypes[(data.dtypes == 'float64') | (data.dtypes == 'int64')].index);
str_attributes = np.array(data.dtypes[data.dtypes == 'object'].index);

For the encoding I chose to use the `OrdinalEncoder`. The features where the values are strings are encoded with integers, and then those integers are converted into binary code. The digits from the binary string are split into separate columns. 

I then created the pipeline for the string features.

In [14]:
str_pipeline = Pipeline([
    ('selector', DataFrameSelector(str_attributes)),
    ('encoder',  OrdinalEncoder()),
]);

For the numeric features I used a pipeline with an imputer to fill in the `NaN` values. Here I used the `mean` strategy because of the lack of outliers in the data. Then I used the `StandardScaler` to scale the data due to its wide range.

In [15]:
# For the numeric values:
num_pipeline = Pipeline([
    ('selector',   DataFrameSelector(num_attributes)),
    ('imputer',    SimpleImputer(strategy='mean')),
    ('std_scaler', StandardScaler()),
]);

The pipelines were combined in a full pipeline, which was then applied to the data.

In [16]:
# Combine the two pipelines in a full pipeline:
full_pipeline = FeatureUnion(transformer_list = [
    ('num_pipeline', num_pipeline),
    ('str_pipeline', str_pipeline),
]);

# Then we fit the data so it is ready for regression:
data_prep = full_pipeline.fit_transform(data);

And now the data is ready for training!

I settled on using a `RandomForestRegressor` because it yielded the best results. I also want to time the training to make sure it doesn't take too much time.

In [17]:
start_time = time.time();
reg        = RandomForestRegressor(n_estimators=10, random_state=42);
reg.fit(data_prep, labels);
stop_time  = time.time();
print("Training time: ", (stop_time - start_time));

Training time:  118.36870288848877


Next I wanted to look at the RMSE for the regression:

In [18]:
rmse_score = np.sqrt(mean_squared_error(labels, reg.predict(data_prep)));
print("RMSE = ", rmse_score);

RMSE =  0.08571512026827019


Now that we have our regressor, and the results are quite good, we move on to tuning the model!

## 6. TUNING THE MODEL

To tune the model we use a validation set. To get this we split the training data into two:

In [19]:
split = StratifiedShuffleSplit(n_splits = 1, test_size = 0.2, random_state = 42);
for train_ind, test_ind in split.split(data_train, data_train["DefendersInTheBox"]):
    val_train = data_train.iloc[train_ind];
    val_test  = data_train.iloc[test_ind];

Now we need to train the model again, meaning we need to the training part of the validation set into data and labels, and also run it through the pipeline.

In [20]:
val_labels = val_train.Yards.copy();
val_data   = val_train.drop("Yards", axis=1);
val_prep   = full_pipeline.fit_transform(val_data);

Now we train the model again:

In [21]:
start_time = time.time();
reg        = RandomForestRegressor(n_estimators=10, random_state=42);
reg.fit(val_prep, val_labels);
stop_time  = time.time();
print("Training time: ", (stop_time - start_time));

Training time:  117.56320977210999


Which gives the RMSE:

In [22]:
rmse_score = np.sqrt(mean_squared_error(val_labels, reg.predict(val_prep)));
print("RMSE = ", rmse_score);

RMSE =  0.17211936070041373


To hypertune the parameters in the regression I used `RandomizedSearchCV`:

In [None]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats             import randint

I defined the number of iterations `N`, and the parameter range for `n_estimators`, `max_features` and `max_depth`. 

In [None]:
N = 5;
param_dist = {
    "n_estimators": randint(10, 100), #50, 70
    "max_features": randint(round(val_prep.shape[1]/3), val_prep.shape[1]),
    "max_depth"   : randint(18,40),
    "random_state": [42],
};

In [None]:
start_time   = time.time();
randomsearch = RandomizedSearchCV(reg, param_distributions=param_dist, n_iter=N, cv=5, iid=False, n_jobs=-1);
randomsearch.fit(val_prep, val_labels);
stop_time    = time.time();
print("Search took: ", (stop_time-start_time)/60, "minutes");

The best features found through `RandomSearchCV` had an RMSE of:

In [None]:
print("RMSE = ", np.sqrt(mean_squared_error(val_labels, randomsearch.best_estimator_.predict(val_prep))));

And the best parameters were:

In [None]:
# max_depth = 36, max_features = 49, n_estimators = 77, random_state = 42;
print(randomsearch.best_params_);

Now I extract the best estimator from the `RandomSearchCV`:

In [None]:
reg_best = randomsearch.best_estimator_;

Now that I have the best model I can move on to presenting the solution from the test data!

## 7. PRESENTING THE SOLUTION

For this part I will load the test data from `/data1/test1.csv` and perform a regression on these data, and examine the RMSE.

In [39]:
test_data = pd.read_csv('./data1/test1.csv', low_memory=False);

When fitting the test data with the pipeline I ran into an issue with the `DisplayName` category because there are some instances where the player is only in one play. Because these were not sorted into the training set the pipeline doesn't know how to handle them. 

I would like to figure out a way to fix this, but I just can't figure it out (even after extensive googling), so I just ended up removing these names:

In [45]:
troublemakers = test_data.DisplayName.value_counts()[test_data.DisplayName.value_counts() < 3].index.tolist();
inds = [];
for name in test_data.DisplayName:
    if name in troublemakers:
        inds.append(test_data.DisplayName[test_data.DisplayName == name].index[0]);
test_data = test_data.drop(inds, axis=0);
print(len(troublemakers));

0


Now I can split the data and perform the regressor:

In [46]:
label_test = test_data["Yards"].copy();
data_test  = test_data.drop("Yards", axis=1);

Then I transform the data using the pipeline that has been fitted to the training data:

In [47]:
test_prep = full_pipeline.transform(data_test);

Load the regressor:

In [48]:
reg_best = joblib.load('./data1/reg_best.pkl');

Get the final RMSE:

In [49]:
final_pred = reg_best.predict(test_prep);
print("Final RMSE: ", np.sqrt(mean_squared_error(label_test, final_pred)))

Final RMSE:  0.17406330463247444


Save the final results to a csv:

In [50]:
final_df = pd.DataFrame({'Id': data_test.index.values, 'Yards': final_pred})
final_df.to_csv("./data1/final_results.csv", index=False);