# Feature Generation

Here, I will generate the features that I think may have predictive power.

## Features to Create
* Win/losing streak
* Number of goals in recent (1,2,5,10?) games for the entire team or each line (goals by line probably shouldn't include powerplay or penalty kill)
* Number of goals allowed in recent (1,2,5,10?) games for the entire team or each line 
* Team record up to the current game in the season
* Number of games played recently (e.g. played last night, number in last 3,5,7 days?)
* Others?
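
The schedule-density idea in the list above (games played recently) can be prototyped on its own. The following is only a minimal sketch: it assumes a dataframe with `Date`, `Away Team`, and `Home Team` columns, and the helper name `days_since_last_game`, the output column names, and the toy schedule are all hypothetical.

```python
import pandas as pd

def days_since_last_game(games):
    """Rest days before each game for the away and home team (NaN for a team's first game)."""
    games = games.copy()
    games["Date"] = pd.to_datetime(games["Date"])
    # One row per (game, team) so that each team's schedule can be walked in order
    long = pd.concat(
        [
            games[["Date", side + " Team"]]
            .rename(columns={side + " Team": "Team"})
            .assign(Side=side, Game=games.index)
            for side in ("Away", "Home")
        ],
        ignore_index=True,
    ).sort_values(["Team", "Date"])
    # Difference between consecutive game dates of the same team
    long["Rest Days"] = long.groupby("Team")["Date"].diff().dt.days
    # Back to one row per game, one column per side
    wide = long.pivot(index="Game", columns="Side", values="Rest Days")
    return wide.rename(columns={"Away": "Away Rest Days", "Home": "Home Rest Days"})

# Hypothetical toy schedule, not real results
toy = pd.DataFrame({
    "Date": ["2017-04-06", "2017-04-08", "2017-04-09"],
    "Away Team": ["CAR", "PHI", "CAR"],
    "Home Team": ["PHI", "CBJ", "PHI"],
})
rest = days_since_last_game(toy)
print(rest)
```

A value of 1 would correspond to "played last night"; counts over the last 3, 5, or 7 days could be derived from the same long-format table.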

## Notebook Output
This notebook generates a new dataframe that can be easily explored and directly input into a machine learning model.
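
One caveat when feeding the output to a model: the feature dataframe keeps the final `Away Score` and `Home Score` columns, which determine the game result and so should not be used as inputs. A minimal sketch of a split, assuming a home-win target; the target definition and the helper name `split_features_target` are assumptions, not something this notebook fixes.

```python
import pandas as pd

def split_features_target(df_f):
    """Split a feature dataframe into model inputs X and a home-win target y."""
    # Hypothetical target: 1 if the home team won, 0 otherwise
    y = (df_f["Home Score"] > df_f["Away Score"]).astype(int)
    # The final scores define the target, so they must not be used as inputs
    leak_cols = ["Away Score", "Home Score"]
    # Identifier columns are dropped here for simplicity (an assumption)
    id_cols = ["Date", "Season", "Away Team", "Home Team"]
    X = df_f.drop(columns=leak_cols + id_cols)
    return X, y

# Two rows with a subset of the feature columns
toy = pd.DataFrame({
    "Date": ["2017-04-09", "2017-04-09"],
    "Season": ["16_17", "16_17"],
    "Away Team": ["CAR", "FLA"],
    "Home Team": ["PHI", "WSH"],
    "Away Score": [4, 2],
    "Home Score": [3, 0],
    "Away W/L Streak": [-5, 1],
    "Home W/L Streak": [1, 4],
})
X, y = split_features_target(toy)
print(list(X.columns))
print(y.tolist())
```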

## Messy Data
I will start with the "Messy" data. It takes the lineups exactly as listed on rotogrinders.com, but carries an error flag for every line that lists a player who did not actually play (checked against nhl.com/stats). I will not use flagged lines, and will always skip that line/game in the analysis.

Let's load in the data:

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_pickle("Scrape/data/FullData_Messy.pkl")

In [3]:
df.tail()

Unnamed: 0,Date,Season,Away Team,Home Team,Away Score,Home Score,OT Win,SO Win,AL1-0,AL1-1,...,+/- HL4-1,PM HL4-1,Goals HL4-2,PP Goals HL4-2,Assists HL4-2,PP Assists HL4-2,Shots HL4-2,+/- HL4-2,PM HL4-2,Lineup Error HL4
1225,2017-04-09,16_17,CAR,PHI,4,3,False,True,Teuvo Teravainen,Jordan Staal,...,-1,0,0,0,0,0,0,-1,0,False
1226,2017-04-09,16_17,COL,STL,2,3,False,False,Rene Bourque,Nathan MacKinnon,...,1,0,1,0,0,0,2,1,2,False
1227,2017-04-09,16_17,BUF,TBL,2,4,False,False,Evander Kane,Ryan O'Reilly,...,0,0,0,0,0,0,0,0,2,False
1228,2017-04-09,16_17,CBJ,TOR,3,2,False,False,Nick Foligno,Brandon Saad,...,0,0,0,0,0,0,1,0,0,False
1229,2017-04-09,16_17,FLA,WSH,2,0,False,False,Jaromir Jagr,Jonathan Huberdeau,...,0,0,0,0,0,0,1,0,0,False
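
The error flags mentioned above appear as columns such as `Lineup Error HL4`. A minimal sketch of how flagged games could be masked out, assuming every flag column starts with `Lineup Error` and ends with a side/line code like `AL1` or `HL4` (only `Lineup Error HL4` is visible in the output above, so the other names are an assumption); `lineup_error_mask` is a hypothetical helper.

```python
import pandas as pd

def lineup_error_mask(df, loc=None):
    """Boolean Series: True where any flagged line (optionally for one side) has an error."""
    cols = [c for c in df.columns if c.startswith("Lineup Error")]
    if loc is not None:
        # Assumed naming: "Lineup Error AL1", ..., "Lineup Error HL4"
        cols = [c for c in cols if c.split()[-1].startswith(loc)]
    return df[cols].any(axis=1)

# Hypothetical toy flags
toy = pd.DataFrame({
    "Lineup Error AL1": [False, True, False],
    "Lineup Error HL1": [False, False, True],
})
any_error = lineup_error_mask(toy)
home_error = lineup_error_mask(toy, "H")
print(any_error.tolist())
print(home_error.tolist())
```

Rows could then be skipped with something like `df[~lineup_error_mask(df)]`, or checked per side inside the loop.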


Now, let's build a new dataframe to hold the features.

In [4]:
def init():
    """
    Just a quick function to create the dataframe (so I can re-run a cell down below without re-running this
        every time)
    """
    Ngames = len(df)
    zero_arr = np.zeros(Ngames,dtype=int)

    # Taken directly from data dataframe
    df_f = pd.DataFrame(df['Date'],columns=["Date"])
    #print(len(df_f),Ngames)
    df_f['Season'] = df['Season']
    df_f['Away Team'] = df['Away Team'] 
    df_f['Home Team'] = df['Home Team']
    df_f['Away Score'] = df['Away Score']
    df_f['Home Score'] = df['Home Score']

    # Add some zeroed-out columns
    df_f['Away W/L Streak'] = zero_arr
    df_f['Home W/L Streak'] = zero_arr
    df_f['Away Wins'] = zero_arr
    df_f['Home Wins'] = zero_arr
    df_f['Away Losses'] = zero_arr
    df_f['Home Losses'] = zero_arr
    # And most importantly, did Line 1 change?
    false_arr = np.zeros(Ngames,dtype=bool)
    df_f['Away L1 Shuffled'] = false_arr
    df_f['Home L1 Shuffled'] = false_arr
    # Another flag to mark if an injury may have occurred on Line 1 in previous game
    # or between the last game and this one (i.e. player left lineup)
    df_f['Away L1 Injury'] = false_arr
    df_f['Home L1 Injury'] = false_arr

    # Let's just consider the past 5 games for now
    pgames = [5]
    for num in pgames:
        df_f['Away Scored Goals p{0}'.format(num)] = 0
        df_f['Home Scored Goals p{0}'.format(num)] = 0
        df_f['Away Allowed Goals p{0}'.format(num)] = 0
        df_f['Home Allowed Goals p{0}'.format(num)] = 0

    return df_f, pgames
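
The `pgames` list controls which goal-window columns exist. A small sketch of the naming scheme, in case more windows (e.g. 1, 2, 5, 10) are added later; `goal_window_columns` is a hypothetical helper that only mirrors the format strings used in `init()`.

```python
def goal_window_columns(pgames):
    """Column names in the same order init() creates them for each window size."""
    names = []
    for num in pgames:
        for kind in ("Scored", "Allowed"):
            for floc in ("Away", "Home"):
                names.append("{0} {1} Goals p{2}".format(floc, kind, num))
    return names

current = goal_window_columns([5])
extended = goal_window_columns([1, 2, 5, 10])
print(current)
print(len(extended))
```

If more windows are added, they should probably be listed in ascending order, since the running totals in `build_features_gen` accumulate from one window to the next.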

In [5]:
def loc2floc(loc):
    """
    Simple function to return "Away" or "Home" given "A" or "H."
    Only made this to remove repeated code.
    INPUT:
        loc: "A" or "H"
    OUTPUT:
        "Away" or "Home"
    """
    if loc == "A":
        floc = "Away"
    elif loc == "H":
        floc = "Home"
    else:
        raise ValueError("Whaaa? How did you mess this up. loc2floc can only take 'A' or 'H' as input.")
    return floc


def otherloc(l):
    """
    Simple function to return the opposite location (regardless of whether it's the full name or
    just the letter) "Away"-> "Home", "A" -> "H", etc.
    INPUT:
        l: "A", "H", "Away", or "Home"
    OUTPUT:
        "A", "H", "Away", or "Home"
    """
    if l == "A":
        ol = "H"
    elif l == "H":
        ol = "A"
    elif l == "Away":
        ol = "Home"
    elif l == "Home":
        ol = "Away"
    else:
        raise ValueError("Whaaa? How did you mess this up. otherloc can "
                         "only take 'A','H','Away', or 'Home' as input.")
    return ol
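
The same two mappings can also be written as dictionary lookups. This is just an equivalent sketch, not a replacement for the functions above; the names `FULL_NAME`, `OPPOSITE`, `loc2floc_lookup`, and `otherloc_lookup` are hypothetical.

```python
FULL_NAME = {"A": "Away", "H": "Home"}
OPPOSITE = {"A": "H", "H": "A", "Away": "Home", "Home": "Away"}

def loc2floc_lookup(loc):
    """Return "Away" or "Home" given "A" or "H"."""
    try:
        return FULL_NAME[loc]
    except KeyError:
        raise ValueError("loc must be 'A' or 'H', got {0!r}".format(loc))

def otherloc_lookup(l):
    """Return the opposite location in the same form (letter or full name)."""
    try:
        return OPPOSITE[l]
    except KeyError:
        raise ValueError("l must be 'A', 'H', 'Away', or 'Home', got {0!r}".format(l))

print(loc2floc_lookup("A"), otherloc_lookup("Home"))
```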

In [6]:
def build_features_pg(df,df_f,i,j,loc,ploc):
    """
    Build the features for the specific team and game that can be built
    by just using the data and features from the previous game.
    INPUT:
        - df: the original dataframe
        - df_f: the feature dataframe
        - i: index of current game
        - j: index of previous game
        - loc: location ("H" or "A") of current game
        - ploc: location ("H" or "A") of previous game
    OUTPUT:
        None
    """
    floc = loc2floc(loc)
    pfloc = loc2floc(ploc)
    poloc = otherloc(ploc)
    pfoloc = otherloc(pfloc)
    # Find result from last game
    # Win?
    lwin = df[pfloc+" Score"].iat[j] > df[pfoloc+" Score"].iat[j]
    if lwin:
        df_f[floc+' Wins'].iat[i] = df_f[pfloc+' Wins'].iat[j] + 1
        df_f[floc+' Losses'].iat[i] = df_f[pfloc+' Losses'].iat[j]
        pWLstreak = df_f[pfloc+' W/L Streak'].iat[j] 
        if pWLstreak >= 0: # 0 == first game of season
            df_f[floc+' W/L Streak'].iat[i] = pWLstreak + 1
        else:
            df_f[floc+' W/L Streak'].iat[i] = 1
    else:
        df_f[floc+' Wins'].iat[i] = df_f[pfloc+' Wins'].iat[j]
        df_f[floc+' Losses'].iat[i] = df_f[pfloc+' Losses'].iat[j] + 1
        pWLstreak = df_f[pfloc+' W/L Streak'].iat[j] 
        if pWLstreak <= 0: # 0 == first game of season
            df_f[floc+' W/L Streak'].iat[i] = pWLstreak - 1
        else:
            df_f[floc+' W/L Streak'].iat[i] = -1
    ## Lineups changed?? ##
    # Did L1 change (it's a loop in case I want to consider more lines in the future)
    lL1shake = False
    for line in range(1,2):
        L = []
        pL = []
        for pos in range(3):
            L.append(df["{0}L{1}-{2}".format(loc,line,pos)].iat[i])
            pL.append(df["{0}L{1}-{2}".format(ploc,line,pos)].iat[j])
        lL1shake = (not (sorted(L) == sorted(pL))) or lL1shake
    df_f[floc+' L1 Shuffled'].iat[i] = lL1shake
    
    # Potential Injury?
    lL1inj = False
    # obviously not one if lines are the same
    if lL1shake:
        # If all players in L1 in prev. game are in curr. game then no injury
        # Find all players in L1 in prev. game
        p_pls = []
        for line in range(1,2):
            for pos in range(3):
                p_pls.append(df["{0}L{1}-{2}".format(ploc,line,pos)].iat[j])
        # Find all players in curr. game
        pls = []
        for line in range(1,5):
            for pos in range(3):
                pls.append(df["{0}L{1}-{2}".format(loc,line,pos)].iat[i])

        for p_pl in p_pls:
            if p_pl not in pls:
                lL1inj = True
                break
    df_f[floc+' L1 Injury'].iat[i] = lL1inj   
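
To make the update rules in `build_features_pg` easier to check by hand, here they are isolated as pure functions on plain Python values. This is a sketch of the same logic, not the code the notebook runs; `next_streak`, `line1_flags`, and the placeholder player names are hypothetical.

```python
def next_streak(prev_streak, won):
    """Signed streak: positive = consecutive wins, negative = consecutive losses, 0 = no games yet."""
    if won:
        # Extend a win streak, or start a new one after losses
        return prev_streak + 1 if prev_streak >= 0 else 1
    # Extend a losing streak, or start a new one after wins
    return prev_streak - 1 if prev_streak <= 0 else -1

def line1_flags(prev_line1, curr_line1, curr_all_lines):
    """Return (shuffled, possible_injury) for line 1."""
    # Shuffled: line 1 is not the same set of players (order within the line is ignored)
    shuffled = sorted(prev_line1) != sorted(curr_line1)
    # Possible injury: a previous line-1 player is missing from every current line
    roster = set(curr_all_lines)
    injury = shuffled and any(p not in roster for p in prev_line1)
    return shuffled, injury

# Replay a W, W, L, L, L, W sequence starting from a fresh season
streak = 0
history = []
for won in (True, True, False, False, False, True):
    streak = next_streak(streak, won)
    history.append(streak)
print(history)  # [1, 2, -1, -2, -3, 1]

prev_l1 = ["A", "B", "C"]  # placeholder player names
same = line1_flags(prev_l1, ["C", "A", "B"], ["C", "A", "B", "D", "E", "F"])
moved = line1_flags(prev_l1, ["A", "B", "D"], ["A", "B", "D", "C", "E", "F"])
gone = line1_flags(prev_l1, ["A", "B", "D"], ["A", "B", "D", "E", "F", "G"])
print(same, moved, gone)
```

Note that the streak stored for a game is the streak going into that game, so it only reflects results up to the previous game.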

In [7]:
def build_features_gen(df,df_f,i,loc,pgames):
    """
    Build the features for the specific team and game that cannot be built
    by just using the data and features from the previous game.
    INPUT:
        - df: the original dataframe
        - df_f: the feature dataframe
        - i: index of current game
        - loc: location ("H" or "A") of current game
        - pgames: list of the numbers of games to look into the past (in ascending order)
    OUTPUT:
        None
    """ 
    floc = loc2floc(loc)
    
    ##### Goals scored in previous N games: ####
    # Running totals
    Sgoals_run = 0 
    Agoals_run = 0
    pg_ind = []
    pg_loc = []
    jstart = i-1 # Index to search through previous games
    num_comp = 0 # Number of games already done
    for num in pgames:
        #print("Searching for p{0} game".format(num))
        for ind in range(num-num_comp): # find num-num_comp more games
            lfound = False
            #print("Starting index = {0}".format(jstart))
            #print("ind = ",ind)
            for j in range(jstart,-1,-1): # scan backwards from jstart
                if df[floc+" Team"].iat[i] == df["Away Team"].iat[j]:
                    # Found the next previous game (team was away)
                    lfound = True
                    pfloc = "Away"
                    #print("Found p{0} game. Index = {1}. On {2}: {3} vs {4}".format(num,j,
                        #df["Date"].iat[j],df["Away Team"].iat[j],df["Home Team"].iat[j]))
                    
                    jstart = j - 1
                    break
                if df[floc+" Team"].iat[i] == df["Home Team"].iat[j]:
                    # Found the next previous game (team was home)
                    lfound = True
                    pfloc = "Home"
                    #print("Found p{0} game. Index = {1}. On {2}: {3} vs {4}".format(num,j,
                        #df["Date"].iat[j],df["Away Team"].iat[j],df["Home Team"].iat[j]))
                    jstart = j - 1
                    break  
            # Save data from that game in running totals
            if lfound:
                #print("Sgoals: {0} = {1} + {2}".format(Sgoals_run+df[pfloc+' Score'].iat[j],Sgoals_run,df[pfloc+' Score'].iat[j]))
                Sgoals_run += df[pfloc+' Score'].iat[j]
                Agoals_run += df[otherloc(pfloc)+' Score'].iat[j]
        # Now running total should be for the previous num games
        #print("Saving Sgoals = {0} Agoals = {1}".format(Sgoals_run,Agoals_run))
        df_f['{0} Scored Goals p{1}'.format(floc,num)].iat[i] = Sgoals_run
        df_f['{0} Allowed Goals p{1}'.format(floc,num)].iat[i] = Agoals_run
        # Save number of previous games searched for so far
        num_comp = num
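
For comparison, the same "goals in the previous N games" totals can probably be computed without the explicit backward scan, using a long-format table and a shifted rolling sum. This is a sketch of an alternative technique, not what the function above does internally; it assumes the rows are already in chronological order, and `rolling_goals` and the toy scores are hypothetical.

```python
import pandas as pd

def rolling_goals(games, num=5):
    """Goals scored/allowed by each team over its previous `num` games (fewer if unavailable)."""
    sides = []
    for side, other in (("Away", "Home"), ("Home", "Away")):
        sides.append(pd.DataFrame({
            "Game": games.index,
            "Side": side,
            "Team": games[side + " Team"].to_numpy(),
            "Scored": games[side + " Score"].to_numpy(),
            "Allowed": games[other + " Score"].to_numpy(),
        }))
    # One row per (game, team), kept in game order
    long = pd.concat(sides, ignore_index=True).sort_values("Game", kind="stable")
    out = pd.DataFrame(index=games.index)
    for kind in ("Scored", "Allowed"):
        # shift(1) excludes the current game; min_periods=1 allows short histories
        prev = long.groupby("Team")[kind].transform(
            lambda s: s.shift(1).rolling(num, min_periods=1).sum()
        ).fillna(0).astype(int)
        for side in ("Away", "Home"):
            is_side = long["Side"] == side
            out["{0} {1} Goals p{2}".format(side, kind, num)] = pd.Series(
                prev[is_side].to_numpy(), index=long.loc[is_side, "Game"].to_numpy()
            )
    return out

# Hypothetical toy scores for two teams X and Y
toy = pd.DataFrame({
    "Away Team": ["X", "Y", "X"],
    "Home Team": ["Y", "X", "Y"],
    "Away Score": [1, 3, 4],
    "Home Score": [2, 0, 1],
})
goals = rolling_goals(toy, num=2)
print(goals)
```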

In [8]:
df_f, pgames = init()
for i in range(len(df)):
    date = df["Date"].iat[i]
    #print("Date: {0}".format(date))
    df_f['Date'].iat[i] = date
    for loc in ("A","H"):
        floc = loc2floc(loc)
        # Find previous game that the away/home team played
        # (Multiple features are just generated by stats from the previous game)
        lfound = False
        team = df[floc+" Team"].iat[i]
        #print("Team: {0} (vs {1})".format(team,df[otherloc(floc)+" Team"].iat[i]))
        df_f[floc+" Team"].iat[i] = team
        for j in range(i-1,-1,-1):
            if df[floc+" Team"].iat[i] == df["Away Team"].iat[j]:
                # Found previous game (team was away)
                lfound = True
                ploc = "A"
                #print("Found previous game on {0}: {1} vs {2}".format(
                    #df["Date"].iat[j],df["Away Team"].iat[j],df["Home Team"].iat[j]))
                #print("{0} was {1}".format(team,ploc))
                break
            if df[floc+" Team"].iat[i] == df["Home Team"].iat[j]:
                # Found previous game (team was home)
                lfound = True
                ploc = "H"
                #print("Found previous game on {0}: {1} vs {2}".format(
                    #df["Date"].iat[j],df["Away Team"].iat[j],df["Home Team"].iat[j]))
                break

        if lfound:
            # Build features just using the previous game
            build_features_pg(df,df_f,i,j,loc,ploc)
            
            # Build additional features
            build_features_gen(df,df_f,i,loc,pgames)
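
The backward scan for each team's previous game works, but it re-reads earlier rows for every game. A minimal sketch of an alternative that remembers the last game seen per team in a dictionary; `previous_game_index` is a hypothetical helper and the toy schedule is made up.

```python
def previous_game_index(away_teams, home_teams):
    """For each game, return ((j, ploc) or None) for the away team and for the home team."""
    last_seen = {}  # team -> (index of its latest game, "A" or "H")
    prev = []
    for i, (away, home) in enumerate(zip(away_teams, home_teams)):
        # Look up the previous game before recording the current one
        prev.append((last_seen.get(away), last_seen.get(home)))
        last_seen[away] = (i, "A")
        last_seen[home] = (i, "H")
    return prev

# Hypothetical toy schedule
prev = previous_game_index(["CAR", "PHI", "CAR"], ["PHI", "CBJ", "PHI"])
print(prev)
```

Each `(j, ploc)` pair is what `build_features_pg` needs, and `None` plays the role of `lfound = False`. Separately, as far as I know, the `df_f[col].iat[i] = value` pattern used throughout is chained assignment, which does not write through to `df_f` when pandas Copy-on-Write is enabled; `df_f.iat[i, df_f.columns.get_loc(col)] = value` is the single-step form.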

In [9]:
df_f.tail()

Unnamed: 0,Date,Season,Away Team,Home Team,Away Score,Home Score,Away W/L Streak,Home W/L Streak,Away Wins,Home Wins,Away Losses,Home Losses,Away L1 Shuffled,Home L1 Shuffled,Away L1 Injury,Home L1 Injury,Away Scored Goals p5,Home Scored Goals p5,Away Allowed Goals p5,Home Allowed Goals p5
1225,2017-04-09,16_17,CAR,PHI,4,3,-5,1,35,39,46,42,False,True,False,False,9,16,19,10
1226,2017-04-09,16_17,COL,STL,2,3,-2,2,22,45,59,36,False,False,False,False,14,18,17,15
1227,2017-04-09,16_17,BUF,TBL,2,4,-1,2,33,41,48,40,False,False,False,False,7,15,15,12
1228,2017-04-09,16_17,CBJ,TOR,3,2,-6,1,49,40,32,41,True,True,False,False,10,16,19,17
1229,2017-04-09,16_17,FLA,WSH,2,0,1,4,34,55,47,26,False,True,False,True,11,15,21,10
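
A quick sanity check on this output: wins plus losses going into a game should equal the number of games the team has already played (for example, CAR enters its last game at 35 + 46 = 81). A minimal sketch of that check, assuming the column names above; `record_is_consistent` and the toy rows are hypothetical.

```python
import pandas as pd

def record_is_consistent(df_f):
    """True if Wins + Losses equals the number of earlier games for every team and game."""
    games_played = {}
    for i in range(len(df_f)):
        teams = []
        for floc in ("Away", "Home"):
            team = df_f[floc + " Team"].iat[i]
            record = df_f[floc + " Wins"].iat[i] + df_f[floc + " Losses"].iat[i]
            if record != games_played.get(team, 0):
                return False
            teams.append(team)
        # Count the current game only after both sides have been checked
        for team in teams:
            games_played[team] = games_played.get(team, 0) + 1
    return True

# Hypothetical toy records: CAR beats PHI, then they meet again
good = pd.DataFrame({
    "Away Team": ["CAR", "PHI"],
    "Home Team": ["PHI", "CAR"],
    "Away Wins": [0, 0],
    "Away Losses": [0, 1],
    "Home Wins": [0, 1],
    "Home Losses": [0, 0],
})
bad = good.copy()
bad.loc[1, "Home Wins"] = 2
print(record_is_consistent(good), record_is_consistent(bad))
```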


## Save Features!!
At this point, I have a number of features. I want to start looking at them and try some preliminary models. Check out the Data Exploration notebook for the pretty plots!

In [10]:
df_f.to_pickle("data/InitialFeatures.pkl")