# Feature Generation

Here, I will generate the features that I think may have predictive power.

## Features to Create
* Win/losing streak
* Number of goals in recent (1,2,5,10?) games for the entire team or each line (goals by line probably shouldn't include powerplay or penalty kill)
* Number of goals allowed in recent (1,2,5,10?) games for the entire team or each line 
* Team record up to the current game in the season
* Number of games played recently (e.g. played last night, number in last 3,5,7 days?)
* Others?
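
The "games played recently" idea in the list is not built in this notebook, so here is a minimal sketch of one way it could be counted. The function name `games_in_last_days` and the example schedule are illustrative assumptions, not part of the dataset code.

```python
from datetime import date, timedelta

def games_in_last_days(game_dates, current, window_days):
    """Count games played in the `window_days` days before `current`.

    game_dates: iterable of datetime.date objects for a single team.
    The current game itself is not counted.
    """
    start = current - timedelta(days=window_days)
    return sum(1 for d in game_dates if start <= d < current)

# Illustrative schedule for one team (assumed, for demonstration only)
schedule = [date(2016, 10, 15), date(2016, 10, 16),
            date(2016, 10, 18), date(2016, 10, 20)]
today = date(2016, 10, 20)

print(games_in_last_days(schedule, today, 1))  # played last night? → 0
print(games_in_last_days(schedule, today, 3))  # → 1
print(games_in_last_days(schedule, today, 7))  # → 3
```

A window of 1 answers the "played last night" question directly, and the larger windows give a rough fatigue measure.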

## Notebook Output
This notebook generates a new dataframe that can be easily explored and directly input into a machine learning model.

## Messy Data
I will start with the "Messy" data. It uses the lineups exactly as listed on rotogrinders.com, but carries an error flag for every line that listed a player who did not actually play (checked against nhl.com/stats). I will not use flagged lines, and will always skip that line/game in the analysis.

Let's load in the data:

In [76]:
import pandas as pd
import numpy as np

In [77]:
df = pd.read_pickle("pickles/FullData_Messy.pkl")

In [78]:
df.head()

| | Date | Away Team | Home Team | Away Score | Home Score | OT Win | SO Win | AL1-0 | AL1-1 | AL1-2 | ... | +/- HL4-2 | PM HL4-2 | Lineup Error AL1 | Lineup Error AL2 | Lineup Error AL3 | Lineup Error AL4 | Lineup Error HL1 | Lineup Error HL2 | Lineup Error HL3 | Lineup Error HL4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2016-10-12 | STL | CHI | 5 | 2 | False | False | Paul Stastny | Alexander Steen | Robby Fabbri | ... | 0 | 0 | False | False | False | False | False | False | False | False |
| 1 | 2016-10-12 | CGY | EDM | 4 | 7 | False | False | Kris Versteeg | Johnny Gaudreau | Sean Monahan | ... | 1 | 2 | False | False | False | False | False | False | False | False |
| 2 | 2016-10-12 | TOR | OTT | 4 | 5 | True | False | Milan Michalek | Leo Komarov | Nazem Kadri | ... | 0 | 0 | False | False | False | False | False | False | False | False |
| 3 | 2016-10-12 | LAK | SJS | 1 | 2 | False | False | Anze Kopitar | Dustin Brown | Devin Setoguchi | ... | 0 | 0 | False | False | False | False | False | False | False | False |
| 4 | 2016-10-13 | MTL | BUF | 4 | 1 | False | False | Alex Galchenyuk | Brendan Gallagher | Max Pacioretty | ... | -1 | 0 | False | False | False | False | False | False | False | False |
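
The `Lineup Error` columns are what make it possible to skip bad lines. As a minimal sketch (the toy frame and the helper names `clean_line_mask` and `clean_game_mask` are assumptions for illustration), boolean masks along these lines could be used to drop a flagged line or a whole flagged game:

```python
import pandas as pd

# Tiny stand-in for the real dataframe; only the flag columns matter here.
toy = pd.DataFrame({
    "Away Team": ["STL", "CGY", "TOR"],
    "Lineup Error AL1": [False, True, False],
    "Lineup Error AL2": [False, False, False],
    "Lineup Error HL1": [False, False, True],
    "Lineup Error HL2": [False, False, False],
})

def clean_line_mask(frame, loc, line):
    """Boolean Series: True where the given line has no lineup error."""
    return ~frame["Lineup Error {0}L{1}".format(loc, line)]

def clean_game_mask(frame):
    """Boolean Series: True where no 'Lineup Error' column is set."""
    err_cols = [c for c in frame.columns if c.startswith("Lineup Error")]
    return ~frame[err_cols].any(axis=1)

print(clean_line_mask(toy, "A", 1).tolist())  # → [True, False, True]
print(clean_game_mask(toy).tolist())          # → [True, False, False]
```

The per-line mask is the less wasteful option, since one bad line does not have to invalidate the other lines in the same game.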


Now, let's build a new dataframe to hold the features.

In [128]:
def init():
    Ngames = len(df)
    zero_arr = np.zeros(Ngames,dtype=int)
    df_f = pd.DataFrame(zero_arr,columns=['Away W/L Streak'])
    df_f['Home W/L Streak'] = zero_arr
    df_f['Away Wins'] = zero_arr
    df_f['Home Wins'] = zero_arr
    df_f['Away Losses'] = zero_arr
    df_f['Home Losses'] = zero_arr
    # And most importantly, did Line 1 or 2 change?
    false_arr = np.zeros(Ngames,dtype='?')
    df_f['Away L12 Shuffled'] = false_arr
    df_f['Home L12 Shuffled'] = false_arr
    # Another flag to mark if an injury may have occurred on Line 1 or 2 in previous game
    # or between the last game and this one (i.e. player left lineup)
    df_f['Away L12 Injury'] = false_arr
    df_f['Home L12 Injury'] = false_arr

    pgames = (1,2,5,10)
    for num in pgames:
        df_f['Away Scored Goals p{0}'.format(num)] = 0
        df_f['Home Scored Goals p{0}'.format(num)] = 0
        df_f['Away Allowed Goals p{0}'.format(num)] = 0
        df_f['Home Allowed Goals p{0}'.format(num)] = 0

    #    for num in (1,3,5,7):
    #        df_f['Away Games Played p{0} days'.format(num)] = 0
    #        df_f['Home Games Played p{0} days'.format(num)] = 0

    # And for readability:
    df_f['Date'] = np.datetime64('2009-01-01')
    df_f['Away Team'] = "???"
    df_f['Home Team'] = "???"

    cols = df_f.columns.tolist()
    ncols = cols[-3:] + cols[:-3]
    df_f = df_f[ncols]
    return df_f, pgames
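
The `W/L Streak` columns are meant to hold a signed streak: positive for consecutive wins, negative for consecutive losses, and 0 before a team has played. As a small sketch of that convention in isolation (the helper names `next_streak` and `streaks_entering_each_game` are made up for this example):

```python
def next_streak(prev_streak, won):
    """Signed streak after one more result.

    prev_streak > 0 is a winning streak, < 0 a losing streak,
    and 0 means no games have been played yet.
    """
    if won:
        return prev_streak + 1 if prev_streak >= 0 else 1
    return prev_streak - 1 if prev_streak <= 0 else -1

def streaks_entering_each_game(results):
    """Streak a team carries *into* each game, given chronological results."""
    streak, out = 0, []
    for won in results:
        out.append(streak)          # value known before the game is played
        streak = next_streak(streak, won)
    return out

print(streaks_entering_each_game([True, True, False, False, False, True]))
# → [0, 1, 2, -1, -2, -3]
```

Recording the streak entering each game, rather than after it, keeps the feature free of information from the game being predicted.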

In [129]:
def loc2floc(loc):
    """
    Simple function to return "Away" or "Home" given "A" or "H."
    Only made this to remove repeated code.
    INPUT:
        loc: "A" or "H"
    OUTPUT:
        "Away" or "Home"
    """
    if loc == "A":
        floc = "Away"
    elif loc == "H":
        floc = "Home"
    else:
        raise ValueError("Whaaa? How did you mess this up. loc2floc can only take 'A' or 'H' as input.")
    return floc
def otherloc(l):
    """
    Simple function to return the opposite location (regardless of whether it's the full
    name or just the letter): "Away" -> "Home", "A" -> "H", etc.
    INPUT:
        loc: "A", "H", "Away", or "Home"
    OUTPUT:
        "A", "H", "Away", or "Home"
    """
    if l == "A":
        ol = "H"
    elif l == "H":
        ol = "A"
    elif l == "Away":
        ol = "Home"
    elif l == "Home":
        ol = "Away"
    else:
        raise ValueError("Whaaa? How did you mess this up. otherloc can "
                         "only take 'A','H','Away', or 'Home' as input.")
    return ol
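
For comparison, the same two mappings can be written as dictionary lookups. This is only an alternative sketch; the names `FULL_NAME`, `OPPOSITE`, `loc2floc_lookup` and `otherloc_lookup` are hypothetical and are not used elsewhere in the notebook.

```python
FULL_NAME = {"A": "Away", "H": "Home"}
OPPOSITE = {"A": "H", "H": "A", "Away": "Home", "Home": "Away"}

def loc2floc_lookup(loc):
    """Return "Away" or "Home" given "A" or "H"."""
    try:
        return FULL_NAME[loc]
    except KeyError:
        raise ValueError("loc must be 'A' or 'H', got {0!r}".format(loc))

def otherloc_lookup(loc):
    """Return the opposite location, keeping the short or full form."""
    try:
        return OPPOSITE[loc]
    except KeyError:
        raise ValueError("loc must be 'A', 'H', 'Away' or 'Home', got {0!r}".format(loc))

print(loc2floc_lookup("A"), otherloc_lookup("Home"))  # → Away Away
```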

In [130]:
def build_features_pg(df,df_f,i,j,loc,ploc):
    """
    Build the features for the specific team and game that can be built
    by just using the data and features from the previous game.
    INPUT:
        - df: the original dataframe
        - df_f: the feature dataframe
        - i: index of current game
        - j: index of previous game
        - loc: location ("H" or "A") of current game
        - ploc: location ("H" or "A") of previous game
    OUTPUT:
        None
    """
    floc = loc2floc(loc)
    pfloc = loc2floc(ploc)
    poloc = otherloc(ploc)
    pfoloc = otherloc(pfloc)
    # Find result from last game
    # Win?
    lwin = df[pfloc+" Score"].iat[j] > df[pfoloc+" Score"].iat[j]
    # Carry the season record forward so both counters stay cumulative
    df_f[floc+' Wins'].iat[i] = df_f[pfloc+' Wins'].iat[j] + int(lwin)
    df_f[floc+' Losses'].iat[i] = df_f[pfloc+' Losses'].iat[j] + int(not lwin)
    pWLstreak = df_f[pfloc+' W/L Streak'].iat[j]
    if lwin:
        if pWLstreak >= 0: # 0 == first game of season
            df_f[floc+' W/L Streak'].iat[i] = pWLstreak + 1
        else:
            df_f[floc+' W/L Streak'].iat[i] = 1
    else:
        if pWLstreak <= 0: # 0 == first game of season
            df_f[floc+' W/L Streak'].iat[i] = pWLstreak - 1
        else:
            df_f[floc+' W/L Streak'].iat[i] = -1
    ## Lineups changed?? ##
    # Did L1 or L2 change
    lL12shake = False
    for line in range(1,3):
        L = []
        pL = []
        for pos in range(3):
            L.append(df["{0}L{1}-{2}".format(loc,line,pos)].iat[i])
            pL.append(df["{0}L{1}-{2}".format(ploc,line,pos)].iat[j])
        lL12shake = (not (sorted(L) == sorted(pL))) or lL12shake
    df_f[floc+' L12 Shuffled'].iat[i] = lL12shake
    
    # Potential Injury?
    lL12inj = False
    # obviously not one if lines are the same
    if lL12shake:
        # If all players in L1 or L2 in prev. game are in curr. game then no injury
        # Find all players in L1 or L2 in prev. game
        p_pls = []
        for line in range(1,3):
            for pos in range(3):
                p_pls.append(df["{0}L{1}-{2}".format(ploc,line,pos)].iat[j])
        # Find all players (L1-L4) in curr. game
        pls = []
        for line in range(1,5):
            for pos in range(3):
                pls.append(df["{0}L{1}-{2}".format(loc,line,pos)].iat[i])
        for p_pl in p_pls:
            if p_pl not in pls:
                lL12inj = True
                break
    df_f[floc+' L12 Injury'].iat[i] = lL12inj   
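
The two lineup flags boil down to set comparisons: a top line is "shuffled" if its trio changed, and a possible injury is flagged if someone from the previous top two lines is missing from all four current lines. Here is a minimal sketch of that logic on plain dictionaries. The helper names and the single-letter player names are assumptions for illustration, and sets ignore duplicate names, unlike the sorted-list comparison above.

```python
def trio(row, loc, line):
    """The three players on one line as a frozenset (order ignored)."""
    return frozenset(row["{0}L{1}-{2}".format(loc, line, pos)] for pos in range(3))

def top_lines_shuffled(curr, loc, prev, ploc):
    """True if line 1 or line 2 has a different trio than in the previous game."""
    return any(trio(curr, loc, n) != trio(prev, ploc, n) for n in (1, 2))

def top_line_player_missing(curr, loc, prev, ploc, n_lines=4):
    """True if a previous L1/L2 player appears on none of the current lines."""
    prev_top = trio(prev, ploc, 1) | trio(prev, ploc, 2)
    curr_all = frozenset().union(*(trio(curr, loc, n) for n in range(1, n_lines + 1)))
    return not prev_top <= curr_all

def make_row(loc, lines):
    """Build a dict with the same column naming as the dataframe (toy helper)."""
    return {"{0}L{1}-{2}".format(loc, n, pos): name
            for n, players in enumerate(lines, start=1)
            for pos, name in enumerate(players)}

prev = make_row("H", [["a", "b", "c"], ["d", "e", "f"], ["g", "h", "i"], ["j", "k", "l"]])
reordered = make_row("A", [["b", "a", "c"], ["f", "e", "d"], ["g", "h", "i"], ["j", "k", "l"]])
demoted = make_row("A", [["a", "b", "g"], ["d", "e", "f"], ["c", "h", "i"], ["j", "k", "l"]])
missing = make_row("A", [["a", "b", "g"], ["d", "e", "f"], ["x", "h", "i"], ["j", "k", "l"]])

for curr in (reordered, demoted, missing):
    print(top_lines_shuffled(curr, "A", prev, "H"),
          top_line_player_missing(curr, "A", prev, "H"))
# → False False
# → True False
# → True True
```

The middle case is the reason for checking all four lines: a player who drops from line 1 to line 3 changes the top lines without suggesting an injury.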

In [167]:
def build_features_gen(df,df_f,i,loc,pgames):
    """
    Build the features for the specific team and game that cannot be built
    by just using the data and features from the previous game.
    INPUT:
        - df: the original dataframe
        - df_f: the feature dataframe
        - i: index of current game
        - loc: location ("H" or "A") of current game
        - pgames: tuple containing the numbers of games to look into the past
    OUTPUT:
        None
    """ 
    floc = loc2floc(loc)
    
    ##### Goals scored in previous N games: ####
    # Running totals
    Sgoals_run = 0 
    Agoals_run = 0
    pg_ind = []
    pg_loc = []
    jstart = i-1 # Index to search through previous games
    num_comp = 0 # Number of games already done
    for num in pgames:
        print("Searching for p{0} game".format(num))
        for ind in range(num-num_comp): # find num-num_comp more games
            lfound = False
            print("Starting index = {0}".format(jstart))
            print("ind = ",ind)
            for j in range(jstart,-1,-1): # walk backwards from jstart to the first game
                if df[floc+" Team"].iat[i] == df["Away Team"].iat[j]:
                    # Found the next previous game (team was away)
                    lfound = True
                    pfloc = "Away"
                    print("Found p{0} game. Index = {1}. On {2}: {3} vs {4}".format(num,j,
                        df["Date"].iat[j],df["Away Team"].iat[j],df["Home Team"].iat[j]))
                    
                    jstart = j - 1
                    break
                if df[floc+" Team"].iat[i] == df["Home Team"].iat[j]:
                    # Found the next previous game (team was home)
                    lfound = True
                    pfloc = "Home"
                    print("Found p{0} game. Index = {1}. On {2}: {3} vs {4}".format(num,j,
                        df["Date"].iat[j],df["Away Team"].iat[j],df["Home Team"].iat[j]))
                    jstart = j - 1
                    break  
            # Save data from that game in running totals
            if lfound:
                print("Sgoals: {0} = {1} + {2}".format(Sgoals_run+df[pfloc+' Score'].iat[j],Sgoals_run,df[pfloc+' Score'].iat[j]))
                Sgoals_run += df[pfloc+' Score'].iat[j]
                Agoals_run += df[otherloc(pfloc)+' Score'].iat[j]
        # Now running total should be for the previous num games
        print("Saving Sgoals = {0} Agoals = {1}".format(Sgoals_run,Agoals_run))
        df_f['{0} Scored Goals p{1}'.format(floc,num)].iat[i] = Sgoals_run
        df_f['{0} Allowed Goals p{1}'.format(floc,num)].iat[i] = Agoals_run
        # Save number of previous games found
        num_comp = num
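
Once one team's games are in chronological order, the same "previous N games" totals can also be expressed with a shifted rolling sum. This is a sketch of an alternative, not the method used above; `previous_n_totals` and the goal values are hypothetical. As in the loop above, early-season games simply sum whatever history exists.

```python
import pandas as pd

def previous_n_totals(goals, windows=(1, 2, 5, 10)):
    """Sum of `goals` over the previous n games, for each n in `windows`.

    goals: chronological per-game values for one team.
    shift(1) excludes the current game; min_periods=1 lets short
    histories contribute a partial total instead of NaN.
    """
    s = pd.Series(goals, dtype=float).shift(1)
    out = pd.DataFrame({"p{0}".format(n): s.rolling(n, min_periods=1).sum()
                        for n in windows})
    # The first game has no history at all, so its totals become 0.
    return out.fillna(0).round().astype(int)

totals = previous_n_totals([2, 0, 3, 1], windows=(1, 2, 5))
print(totals)
```

For the four games above, `p1` comes out as 0, 2, 0, 3 and `p2` as 0, 2, 2, 3. Running it once on goals scored and once on goals allowed would give both sets of columns.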

In [170]:
df_f, pgames = init()
# Only the first 120 games are processed here; use range(len(df)) for the full dataset
for i in range(120):
    date = df["Date"].iat[i]
    print("Date: {0}".format(date))
    df_f['Date'].iat[i] = date
    for loc in ("A","H"):
        floc = loc2floc(loc)
        # Find previous game that the away/home team played
        # (Multiple features are just generated by stats from the previous game)
        lfound = False
        team = df[floc+" Team"].iat[i]
        print("Team: {0} vs {1}".format(team,df[otherloc(floc)+" Team"].iat[i]))
        df_f[floc+" Team"].iat[i] = team
        for j in range(i-1,-1,-1):
            if df[floc+" Team"].iat[i] == df["Away Team"].iat[j]:
                # Found previous game (team was away)
                lfound = True
                ploc = "A"
                #print("Found previous game on {0}: {1} vs {2}".format(
                    #df["Date"].iat[j],df["Away Team"].iat[j],df["Home Team"].iat[j]))
                #print("{0} was {1}".format(team,ploc))
                break
            if df[floc+" Team"].iat[i] == df["Home Team"].iat[j]:
                # Found previous game (team was home)
                lfound = True
                ploc = "H"
                #print("Found previous game on {0}: {1} vs {2}".format(
                    #df["Date"].iat[j],df["Away Team"].iat[j],df["Home Team"].iat[j]))
                break

        if lfound:
            # Build features just using the previous game
            build_features_pg(df,df_f,i,j,loc,ploc)
            
            # Build additional features
            build_features_gen(df,df_f,i,loc,pgames)

Date: 2016-10-12
Team: STL vs CHI
Team: CHI vs STL
Date: 2016-10-12
Team: CGY vs EDM
Team: EDM vs CGY
Date: 2016-10-12
Team: TOR vs OTT
Team: OTT vs TOR
Date: 2016-10-12
Team: LAK vs SJS
Team: SJS vs LAK
Date: 2016-10-13
Team: MTL vs BUF
Team: BUF vs MTL
Date: 2016-10-13
Team: BOS vs CBJ
Team: CBJ vs BOS
Date: 2016-10-13
Team: ANA vs DAL
Team: DAL vs ANA
Date: 2016-10-13
Team: NJD vs FLA
Team: FLA vs NJD
Date: 2016-10-13
Team: NYI vs NYR
Team: NYR vs NYI
Date: 2016-10-13
Team: WSH vs PIT
Team: PIT vs WSH
Date: 2016-10-13
Team: MIN vs STL
Team: STL vs MIN
Searching for p1 game
Starting index = 9
ind =  0
Found p1 game. Index = 0. On 2016-10-12: STL vs CHI
Sgoals: 5 = 0 + 5
Saving Sgoals = 5 Agoals = 2
Searching for p2 game
Starting index = -1
ind =  0
Saving Sgoals = 5 Agoals = 2
Searching for p5 game
Starting index = -1
ind =  0
Starting index = -1
ind =  1
Starting index = -1
ind =  2
Saving Sgoals = 5 Agoals = 2
Searching for p10 game
Starting index = -1
ind =  0
Starting index = -1
...
ind =  1
Starting index = 1
ind =  2
Saving Sgoals = 10 Agoals = 12
Searching for p10 game
Starting index = 1
ind =  0
Starting index = 1
ind =  1
Starting index = 1
ind =  2
Starting index = 1
ind =  3
Starting index = 1
ind =  4
Saving Sgoals = 10 Agoals = 12
Date: 2016-10-18
Team: FLA vs TBL
Searching for p1 game
Starting index = 44
ind =  0
Found p1 game. Index = 20. On 2016-10-15: DET vs FLA
Sgoals: 4 = 0 + 4
Saving Sgoals = 4 Agoals = 1
Searching for p2 game
Starting index = 19
ind =  0
Found p2 game. Index = 7. On 2016-10-13: NJD vs FLA
Sgoals: 6 = 4 + 2
Saving Sgoals = 6 Agoals = 2
Searching for p5 game
Starting index = 6
ind =  0
Starting index = 6
ind =  1
Starting index = 6
ind =  2
Saving Sgoals = 6 Agoals = 2
Searching for p10 game
Starting index = 6
ind =  0
Starting index = 6
ind =  1
Starting index = 6
ind =  2
Starting index = 6
ind =  3
Starting index = 6
ind =  4
Saving Sgoals = 6 Agoals = 2
Team: TBL vs FLA
Searching for p1 game
Starting index = 44
ind =  0
...
Starting index = 6
ind =  2
Saving Sgoals = 6 Agoals = 8
Searching for p10 game
Starting index = 6
ind =  0
Starting index = 6
ind =  1
Starting index = 6
ind =  2
Starting index = 6
ind =  3
Starting index = 6
ind =  4
Saving Sgoals = 6 Agoals = 8
Date: 2016-10-22
Team: PIT vs NSH
Searching for p1 game
Starting index = 71
ind =  0
Found p1 game. Index = 58. On 2016-10-20: SJS vs PIT
Sgoals: 3 = 0 + 3
Saving Sgoals = 3 Agoals = 2
Searching for p2 game
Starting index = 57
ind =  0
Found p2 game. Index = 40. On 2016-10-18: PIT vs MTL
Sgoals: 3 = 3 + 0
Saving Sgoals = 3 Agoals = 6
Searching for p5 game
Starting index = 39
ind =  0
Found p5 game. Index = 34. On 2016-10-17: COL vs PIT
Sgoals: 6 = 3 + 3
Starting index = 33
ind =  1
Found p5 game. Index = 23. On 2016-10-15: ANA vs PIT
Sgoals: 9 = 6 + 3
Starting index = 22
ind =  2
Found p5 game. Index = 9. On 2016-10-13: WSH vs PIT
Sgoals: 12 = 9 + 3
Saving Sgoals = 12 Agoals = 14
Searching for p10 game
Starting index = 8
ind =  0
...
Found p2 game. Index = 54. On 2016-10-20: WSH vs FLA
Sgoals: 6 = 2 + 4
Saving Sgoals = 6 Agoals = 6
Searching for p5 game
Starting index = 53
ind =  0
Found p5 game. Index = 47. On 2016-10-18: COL vs WSH
Sgoals: 9 = 6 + 3
Starting index = 46
ind =  1
Found p5 game. Index = 28. On 2016-10-15: NYI vs WSH
Sgoals: 11 = 9 + 2
Starting index = 27
ind =  2
Found p5 game. Index = 9. On 2016-10-13: WSH vs PIT
Sgoals: 13 = 11 + 2
Saving Sgoals = 13 Agoals = 10
Searching for p10 game
Starting index = 8
ind =  0
Starting index = 8
ind =  1
Starting index = 8
ind =  2
Starting index = 8
ind =  3
Starting index = 8
ind =  4
Saving Sgoals = 13 Agoals = 10
Team: EDM vs WSH
Searching for p1 game
Starting index = 93
ind =  0
Found p1 game. Index = 79. On 2016-10-23: EDM vs WPG
Sgoals: 3 = 0 + 3
Saving Sgoals = 3 Agoals = 0
Searching for p2 game
Starting index = 78
ind =  0
Found p2 game. Index = 53. On 2016-10-20: STL vs EDM
Sgoals: 6 = 3 + 3
Saving Sgoals = 6 Agoals = 1
Searching for p5 game
...
Saving Sgoals = 15 Agoals = 15
Date: 2016-10-29
Team: PIT vs PHI
Searching for p1 game
Starting index = 117
ind =  0
Found p1 game. Index = 101. On 2016-10-27: NYI vs PIT
Sgoals: 4 = 0 + 4
Saving Sgoals = 4 Agoals = 2
Searching for p2 game
Starting index = 100
ind =  0
Found p2 game. Index = 88. On 2016-10-25: FLA vs PIT
Sgoals: 7 = 4 + 3
Saving Sgoals = 7 Agoals = 4
Searching for p5 game
Starting index = 87
ind =  0
Found p5 game. Index = 72. On 2016-10-22: PIT vs NSH
Sgoals: 8 = 7 + 1
Starting index = 71
ind =  1
Found p5 game. Index = 58. On 2016-10-20: SJS vs PIT
Sgoals: 11 = 8 + 3
Starting index = 57
ind =  2
Found p5 game. Index = 40. On 2016-10-18: PIT vs MTL
Sgoals: 11 = 11 + 0
Saving Sgoals = 11 Agoals = 15
Searching for p10 game
Starting index = 39
ind =  0
Found p10 game. Index = 34. On 2016-10-17: COL vs PIT
Sgoals: 14 = 11 + 3
Starting index = 33
ind =  1
Found p10 game. Index = 23. On 2016-10-15: ANA vs PIT
Sgoals: 17 = 14 + 3
Starting index = 22
ind =  2
...
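
The main loop above finds each team's previous game by scanning backwards through the dataframe. A possible alternative, sketched here under the assumption that games are already in chronological order, is to remember the last index seen for each team in a dictionary. The function name `previous_games` and the three-game toy schedule are illustrative only.

```python
def previous_games(away_teams, home_teams):
    """For each game, return (prev_for_away, prev_for_home).

    Each entry is (index, "A" or "H") for the team's most recent earlier
    game, or None if the team has not played yet.
    """
    last_seen = {}
    out = []
    for i, (away, home) in enumerate(zip(away_teams, home_teams)):
        out.append((last_seen.get(away), last_seen.get(home)))
        # Record this game only after looking up, so a game never
        # counts as its own predecessor.
        last_seen[away] = (i, "A")
        last_seen[home] = (i, "H")
    return out

# Toy schedule: STL plays away in game 0 and at home in game 2
prev = previous_games(["STL", "CGY", "MIN"], ["CHI", "EDM", "STL"])
print(prev[2])  # → (None, (0, 'A'))
```

This makes one pass over the games instead of one backwards scan per team per game, which matters more as the number of rows grows.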

In [172]:
tcheck = "VAN"
#df_f.loc[(df_f['Away Team'] == tcheck) | (df_f['Home Team'] == tcheck)]
cols = ["Date","Away Team","Away Scored Goals p2","Away Scored Goals p5","Away Allowed Goals p2","Away Allowed Goals p5"]
print(df_f.loc[(df_f['Away Team'] == tcheck)][cols])
cols = ["Date","Home Team","Home Scored Goals p2","Home Scored Goals p5","Home Allowed Goals p2","Home Allowed Goals p5"]
print(df_f.loc[(df_f['Home Team'] == tcheck)][cols])

         Date Away Team  Away Scored Goals p2  Away Scored Goals p5  \
70 2016-10-22       VAN                     4                    10   
76 2016-10-23       VAN                     5                    13   

    Away Allowed Goals p2  Away Allowed Goals p5  
70                      2                      6  
76                      5                     10  
          Date Home Team  Home Scored Goals p2  Home Scored Goals p5  \
27  2016-10-15       VAN                     0                     0   
31  2016-10-16       VAN                     2                     2   
46  2016-10-18       VAN                     6                     6   
60  2016-10-20       VAN                     6                     8   
92  2016-10-25       VAN                     5                    13   
111 2016-10-28       VAN                     2                     9   

     Home Allowed Goals p2  Home Allowed Goals p5  
27                       0                      0  
...

In [173]:
plys = ["{0}L{1}-{2}".format("A",line,pos) for line in range(1,5) for pos in range(3) ]
df.loc[(df['Away Team'] == tcheck)][plys].head()
#(df['Home Team'] == "STL")]

| | AL1-0 | AL1-1 | AL1-2 | AL2-0 | AL2-1 | AL2-2 | AL3-0 | AL3-1 | AL3-2 | AL4-0 | AL4-1 | AL4-2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 70 | Henrik Sedin | Daniel Sedin | Sven Baertschi | Jannik Hansen | Brandon Sutter | Markus Granlund | Jake Virtanen | Loui Eriksson | Bo Horvat | Derek Dorsett | Alexandre Burrows | Brendan Gaunce |
| 76 | Daniel Sedin | Henrik Sedin | Sven Baertschi | Jannik Hansen | Brandon Sutter | Markus Granlund | Jake Virtanen | Loui Eriksson | Bo Horvat | Jack Skille | Brendan Gaunce | None None |
| 143 | Jannik Hansen | Daniel Sedin | Henrik Sedin | Brandon Sutter | Loui Eriksson | Markus Granlund | Jake Virtanen | Sven Baertschi | Bo Horvat | Alexandre Burrows | Brendan Gaunce | Derek Dorsett |
| 153 | Jannik Hansen | Daniel Sedin | Henrik Sedin | Brandon Sutter | Loui Eriksson | Markus Granlund | Jake Virtanen | Sven Baertschi | Bo Horvat | Derek Dorsett | Alexandre Burrows | Brendan Gaunce |
| 171 | Jannik Hansen | Henrik Sedin | Daniel Sedin | Brandon Sutter | Markus Granlund | Loui Eriksson | Jake Virtanen | Bo Horvat | Sven Baertschi | Alexandre Burrows | Brendan Gaunce | Jack Skille |


In [174]:
plys = ["{0}L{1}-{2}".format("H",line,pos) for line in range(1,5) for pos in range(3) ]
df.loc[(df['Home Team'] == tcheck)][plys].head()

| | HL1-0 | HL1-1 | HL1-2 | HL2-0 | HL2-1 | HL2-2 | HL3-0 | HL3-1 | HL3-2 | HL4-0 | HL4-1 | HL4-2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 27 | Daniel Sedin | Loui Eriksson | Henrik Sedin | Jannik Hansen | Brendan Gaunce | Brandon Sutter | Markus Granlund | Sven Baertschi | Jake Virtanen | Alexandre Burrows | Derek Dorsett | Bo Horvat |
| 31 | Daniel Sedin | Henrik Sedin | Loui Eriksson | Jannik Hansen | Markus Granlund | Brandon Sutter | Sven Baertschi | Jake Virtanen | Bo Horvat | Derek Dorsett | Alexandre Burrows | Brendan Gaunce |
| 46 | Henrik Sedin | Loui Eriksson | Daniel Sedin | Brendan Gaunce | Jannik Hansen | Brandon Sutter | Markus Granlund | Sven Baertschi | Jake Virtanen | Derek Dorsett | Alexandre Burrows | Bo Horvat |
| 60 | Henrik Sedin | Daniel Sedin | Loui Eriksson | Markus Granlund | Jannik Hansen | Brandon Sutter | Jack Skille | Sven Baertschi | Bo Horvat | Brendan Gaunce | Derek Dorsett | Alexandre Burrows |
| 92 | Henrik Sedin | Daniel Sedin | Loui Eriksson | Jannik Hansen | Brandon Sutter | Markus Granlund | Jack Skille | Bo Horvat | Sven Baertschi | Jayson Megna | Brendan Gaunce | Jake Virtanen |


In [175]:
cols = ["Date","Away Team", "Home Team", "Away Score", "Home Score"]
df.loc[(df['Away Team'] == tcheck) | (df['Home Team'] == tcheck)][cols]

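| | Date | Away Team | Home Team | Away Score | Home Score |
|---|---|---|---|---|---|
| 27 | 2016-10-15 | CGY | VAN | 1 | 2 |
| 31 | 2016-10-16 | CAR | VAN | 3 | 4 |
| 46 | 2016-10-18 | STL | VAN | 1 | 2 |
| 60 | 2016-10-20 | BUF | VAN | 1 | 2 |
| 70 | 2016-10-22 | VAN | LAK | 3 | 4 |
| 76 | 2016-10-23 | VAN | ANA | 2 | 4 |
| 92 | 2016-10-25 | OTT | VAN | 3 | 0 |
| 111 | 2016-10-28 | EDM | VAN | 2 | 0 |
| 121 | 2016-10-29 | WSH | VAN | 5 | 2 |
| 143 | 2016-11-02 | VAN | MTL | 0 | 3 |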
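
The scores above are enough to spot-check the VAN goal features printed earlier by hand. The sketch below re-adds them in plain Python; the `games` dictionary is copied from the table, while the helper name `totals` is just for this check.

```python
# index: (away team, home team, away score, home score), copied from the table
games = {
    27: ("CGY", "VAN", 1, 2), 31: ("CAR", "VAN", 3, 4),
    46: ("STL", "VAN", 1, 2), 60: ("BUF", "VAN", 1, 2),
    70: ("VAN", "LAK", 3, 4), 76: ("VAN", "ANA", 2, 4),
    92: ("OTT", "VAN", 3, 0), 111: ("EDM", "VAN", 2, 0),
    121: ("WSH", "VAN", 5, 2), 143: ("VAN", "MTL", 0, 3),
}

def totals(team, idx, n):
    """(scored, allowed) for `team` over its last n games before game `idx`."""
    prior = [k for k in sorted(games) if k < idx][-n:]
    scored = allowed = 0
    for k in prior:
        away, home, away_score, home_score = games[k]
        if team == home:
            scored += home_score
            allowed += away_score
        else:
            scored += away_score
            allowed += home_score
    return scored, allowed

print(totals("VAN", 70, 2), totals("VAN", 70, 5))  # → (4, 2) (10, 6)
print(totals("VAN", 76, 2), totals("VAN", 76, 5))  # → (5, 5) (13, 10)
```

These agree with the `p2` and `p5` values shown for rows 70 and 76. Row 70's `p5` covers only four games because VAN had played just four before it.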


## Save Features!!
At this point, I have a number of features. I want to start looking at them and try some preliminary models. Check out the Data Exploration notebook for the pretty plots!

In [176]:
df_f.to_pickle("data/InitialFeatures.pkl")