## Feature Engineering 

This notebook details the process and code used to transform the project data from its raw format into the format that will be fed into the learning algorithms. It covers feature construction, feature scaling, and the construction of training, validation and testing data sets.

In [1]:
# importing relevant libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import sklearn as sk

In [2]:
# import dataset
raw = pd.read_csv('capstone data raw.csv')
raw.head()

,Year,Round,Team,Player,Score,Opposition,Venue
0,2014,1,ESS,Jobe Watson,153,NM,Etihad
1,2014,1,ESS,Dyson Heppell,150,NM,Etihad
2,2014,1,MEL,Nathan Jones,147,STK,Etihad
3,2014,1,GCS,Gary Ablett,141,RIC,Metricon
4,2014,1,STK,Clinton Jones,141,MEL,Etihad


In [3]:
# Create a multiindex to help with filtering of data set
arrays = [list(raw['Year']), list(raw['Round']), list(raw['Player'])]

tuples = list(zip(*arrays))

index = pd.MultiIndex.from_tuples(tuples, names=['year', 'round', 'player'])

raw.set_index(index, inplace=True)

raw.head()

year,round,player,Year,Round,Team,Player,Score,Opposition,Venue
2014,1,Jobe Watson,2014,1,ESS,Jobe Watson,153,NM,Etihad
2014,1,Dyson Heppell,2014,1,ESS,Dyson Heppell,150,NM,Etihad
2014,1,Nathan Jones,2014,1,MEL,Nathan Jones,147,STK,Etihad
2014,1,Gary Ablett,2014,1,GCS,Gary Ablett,141,RIC,Metricon
2014,1,Clinton Jones,2014,1,STK,Clinton Jones,141,MEL,Etihad
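
The same kind of index can also be built without the intermediate list of tuples. The following is a minimal sketch, assuming a small frame (`toy`, a name introduced here) that holds three of the rows shown above; it is an illustration and not the code used to build the project data.

```python
import pandas as pd

# Toy frame holding three rows from the output above (not the full data set).
toy = pd.DataFrame({
    'Year': [2014, 2014, 2014],
    'Round': [1, 1, 1],
    'Team': ['ESS', 'ESS', 'MEL'],
    'Player': ['Jobe Watson', 'Dyson Heppell', 'Nathan Jones'],
    'Score': [153, 150, 147],
})

# from_arrays builds the index straight from the columns, skipping the zip step.
toy.index = pd.MultiIndex.from_arrays(
    [toy['Year'], toy['Round'], toy['Player']],
    names=['year', 'round', 'player'])

# A full key returns a single row as a Series.
one_row = toy.loc[(2014, 1, 'Jobe Watson')]

# A partial key (year and round only) returns every player in that round.
round_one = toy.loc[(2014, 1), :]

print(one_row['Score'], len(round_one))
```

The columns are kept alongside the index, as in the cell above, so later `groupby('Player')` calls still have a column to group on.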


In [5]:
# test multiindex
raw.loc[(2017, 20, 'Angus Monfries')]

Year                    2017
Round                     20
Team                      PA
Player        Angus Monfries
Score                     22
Opposition               ADE
Venue               Adelaide
Name: (2017, 20, Angus Monfries), dtype: object

In [6]:
raw.loc[(2017)]

round,player,Year,Round,Team,Player,Score,Opposition,Venue
1,Marc Murphy,2017,1,CAR,Marc Murphy,139,RIC,MCG
1,Kade Simpson,2017,1,CAR,Kade Simpson,126,RIC,MCG
1,Matthew Kreuzer,2017,1,CAR,Matthew Kreuzer,120,RIC,MCG
1,Bryce Gibbs,2017,1,CAR,Bryce Gibbs,105,RIC,MCG
1,Sam Docherty,2017,1,CAR,Sam Docherty,90,RIC,MCG
1,Matthew Wright,2017,1,CAR,Matthew Wright,87,RIC,MCG
1,Lachie Plowman,2017,1,CAR,Lachie Plowman,87,RIC,MCG
1,Caleb Marchbank,2017,1,CAR,Caleb Marchbank,78,RIC,MCG
1,Ed Curnow,2017,1,CAR,Ed Curnow,77,RIC,MCG
1,Patrick Cripps,2017,1,CAR,Patrick Cripps,77,RIC,MCG


### Function Definitions and Testing

In the following code I construct functions that take a row from the raw data set and generate a feature to be used in the final data set. The goal is to use these functions while looping over each observation in the raw dataset to construct the dataset to be used in the project. 

Multiple tests were constructed for each function and the results cross-referenced with data on the AFL Fantasy website (https://fantasy.afl.com.au). Particular emphasis was placed on testing edge cases likely to trigger errors or incorrect calculations. In most cases, the last test conducted is shown after the function definition.

In [14]:
# The purpose of this function is not to directly compute a feature, but to rather be called by other functions 
# such as prev_player_score to provide a proxy value when a more suitable value is not available

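def last_season_average(data, yr, player):
    """Returns player's average for the previous year if available,
        otherwise returns None"""
    # 2014 is treated as the first season in the data, so it has no previous year
    if yr == 2014:
        return None
    yr -= 1
    try:
        # select 'Score' before averaging so text columns are never passed to mean()
        yr_data = data.loc[(yr)].groupby('Player').get_group(player)['Score'].mean()
    except KeyError:
        # raised when the year or the player is missing from the data
        yr_data = None
    return yr_data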

In [15]:
last_season_average(raw, 2016, 'Adam Treloar')

104.0952380952381
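
The edge cases mentioned above can also be written as assertions so they are re-checked each time the notebook runs. This is a minimal sketch on invented data: the players `A` and `B`, their scores, and the added `first_year` parameter are all assumptions made for the example and are not part of the project data or the original helper.

```python
import pandas as pd

def last_season_average(data, yr, player, first_year=2014):
    """Previous-season average, or None when it cannot be computed.
    first_year is a hypothetical parameter added for this sketch."""
    if yr == first_year:
        return None
    try:
        return data.loc[yr - 1].groupby('Player').get_group(player)['Score'].mean()
    except KeyError:
        return None

# Invented scores: A plays in both seasons, B only in 2015.
rows = [
    (2014, 1, 'A', 100),
    (2014, 2, 'A', 80),
    (2015, 1, 'A', 90),
    (2015, 1, 'B', 70),
]
toy = pd.DataFrame(rows, columns=['Year', 'Round', 'Player', 'Score'])
toy.index = pd.MultiIndex.from_arrays(
    [toy['Year'], toy['Round'], toy['Player']],
    names=['year', 'round', 'player'])

# Normal case: the mean of A's two 2014 scores.
assert last_season_average(toy, 2015, 'A') == 90.0
# Edge case: the first season has nothing before it.
assert last_season_average(toy, 2014, 'A') is None
# Edge case: B did not play in the previous season.
assert last_season_average(toy, 2015, 'B') is None
# Edge case: the previous season is missing from the data altogether.
assert last_season_average(toy, 2017, 'A') is None
```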

In [16]:
# This function is used to directly compute the prev_rd_score feature of the final data set

def prev_player_score(data, yr, rd, player):
    """Returns the most recent score available for the player within the given year. If no previous scores 
        available, prev year average is used as a proxy. If prev year average is not available, None is recorded"""
    while rd > 1:
        rd -= 1
        try:
            return data.loc[(yr, rd, player)]['Score']
        except KeyError:
            continue
    return last_season_average(data, yr, player)

In [17]:
print(prev_player_score(raw, 2017, 1, 'Patrick Dangerfield'))

117.954545455


In [18]:
# This function is used to directly calculate the three_rd_av and five_rd_av features of the final data set

def prev_rounds_average(data, yr, rd, player, rds_to_av):
    """Returns the average score of the player over the most recent rds_to_av number of rounds within the given
        year. If rds_to_av number of previous rounds do not exist, the player's previous year average is used 
        as many times as necessary to make up rds_to_av observations"""
    values_to_average = []
    while len(values_to_average) < rds_to_av:
        if rd > 1:
            rd -= 1
            try:
                values_to_average.append(data.loc[(yr, rd, player)]['Score'])
            except KeyError:
                continue
        else:
            values_to_average.append(last_season_average(data, yr, player))
            continue
    if None in values_to_average:
        return None
    else:
        return np.array(values_to_average).mean()

In [19]:
print(prev_rounds_average(raw, 2017, 21, 'Nic Naitanui', 3))

84.5333333333
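
The padding rule in `prev_rounds_average` can be shown in isolation. This is a minimal sketch in plain Python, assuming the current-year scores have already been collected into a list, oldest first; the function name `padded_average` and the numbers are invented for the example.

```python
def padded_average(current_year_scores, fallback, n):
    """Average the n most recent scores, padding with fallback when fewer
    than n exist. Returns None if padding is needed but fallback is None."""
    recent = list(current_year_scores[-n:])
    missing = n - len(recent)
    if missing > 0:
        if fallback is None:
            return None
        # Repeat the previous-season average once per missing observation.
        recent += [fallback] * missing
    return sum(recent) / n

# Two scores this year and a previous-season average of 100:
# the average is taken over 90, 110 and one padded 100.
padded = padded_average([90, 110], 100, 3)

# Enough scores this year, so the fallback is never used.
unpadded = padded_average([80, 90, 100, 110], 50, 3)

# No scores and no previous season: the feature cannot be computed.
unavailable = padded_average([], None, 3)

print(padded, unpadded, unavailable)
```

One consequence of this rule is that early-season averages are pulled towards the previous season's level until enough current-year rounds exist.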


In [20]:
# This function is used to directly calculate the season_av feature in the final data set.

def current_season_average(data, yr, rd, player):
    """Calculates the player's rolling season average prior to the given round. If no data is available, the 
        previous season average is used"""
    try:
        # select 'Score' before averaging so text columns are never passed to mean()
        yr_ave = data.loc[(yr):(yr, rd-1)].groupby('Player').get_group(player)['Score'].mean()
    except KeyError:
        yr_ave = last_season_average(data, yr, player)
    return yr_ave

In [21]:
current_season_average(raw, 2017, 1, 'Lachie Neale')

111.13636363636364
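
A season-to-date average can also be computed for every row at once rather than one call per observation. This is a minimal sketch on invented scores and is not the method used in this notebook; unlike `current_season_average`, it leaves a player's first game of the season as NaN instead of falling back to the previous season's average.

```python
import pandas as pd

# Invented scores for two hypothetical players, A and B.
toy = pd.DataFrame({
    'Year':   [2017, 2017, 2017, 2017, 2017, 2017],
    'Round':  [1, 2, 3, 4, 1, 2],
    'Player': ['A', 'A', 'A', 'A', 'B', 'B'],
    'Score':  [100, 80, 120, 90, 60, 70],
})

# Sort chronologically so that shift() always looks backwards in time.
toy = toy.sort_values(['Year', 'Round'])

# shift() excludes the current round; expanding().mean() averages all earlier ones.
toy['season_av'] = (
    toy.groupby(['Year', 'Player'])['Score']
       .transform(lambda s: s.shift().expanding().mean())
)

print(toy[toy['Player'] == 'A'][['Round', 'Score', 'season_av']])
```

Excluding the current round matters here: without the `shift()`, the target score would be included in its own feature.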

In [22]:
# This function is used to directly calculate the prev_against_opp feature in the final data set

def prev_player_score_against_opposition(data, yr, rd, player, opp):
    """Returns player's last known score against opposition, otherwise returns None. The calculation of this value
        is not limited to the current year"""
    # keep the player's games against this opposition, cut off before the current round
    filtered_data = data[(data['Player'] == player) & (data['Opposition'] == opp)].loc[:(yr, rd-1)]
    if len(filtered_data) == 0:
        return None
    else:
        # take the last score as a scalar rather than calling int() on a one-element Series
        return int(filtered_data['Score'].iloc[-1])

In [23]:
prev_player_score_against_opposition(raw, 2017, 20, 'Nat Fyfe', 'GCS')

121

In [24]:
# This function is used to directly calculate the prev_at_venue feature in the final data set

def prev_player_score_at_venue(data, yr, rd, player, ven):
    """Returns player's last known score at venue, otherwise returns None. The calculation of this value is not
        limited to the current year"""
    # keep the player's games at this venue, cut off before the current round
    filtered_data = data[(data['Player'] == player) & (data['Venue'] == ven)].loc[:(yr, rd-1)]
    if len(filtered_data) == 0:
        return None
    else:
        # take the last score as a scalar rather than calling int() on a one-element Series
        return int(filtered_data['Score'].iloc[-1])

In [25]:
prev_player_score_at_venue(raw, 2017, 1, 'Stephen Hill', 'Domain')

129
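
The "last known score" features (against an opposition or at a venue) can also be expressed as a grouped shift. This is a minimal sketch on invented scores for a hypothetical player `A`, shown for the venue case; it is an alternative formulation and not the code used to build the data set.

```python
import pandas as pd

# Invented games for one hypothetical player across two seasons.
toy = pd.DataFrame({
    'Year':   [2016, 2016, 2017, 2017],
    'Round':  [3, 9, 1, 5],
    'Player': ['A', 'A', 'A', 'A'],
    'Venue':  ['MCG', 'Etihad', 'MCG', 'MCG'],
    'Score':  [95, 110, 101, 88],
}).sort_values(['Year', 'Round'])

# Within each (player, venue) pair, shift() moves every score down one game,
# so each row receives the score from the player's previous visit to that venue.
toy['prev_at_venue'] = toy.groupby(['Player', 'Venue'])['Score'].shift()

print(toy[['Year', 'Round', 'Venue', 'Score', 'prev_at_venue']])
```

Because the grouping ignores the year, the lookup crosses season boundaries, which matches the docstrings above. First visits come out as NaN where the functions above return None.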

In [26]:
# The purpose of this function is not to directly compute a feature, but to rather be called by other functions 
# such as prev_team_for_average to provide a proxy value when a more suitable value is not available

def prev_year_av_team_for(data, yr, team):
    """Returns team's average 'points for' for the previous year if available, otherwise returns none"""
    if yr == 2014:
        return None
    yr -= 1
    yr_data = data.loc[(yr)].groupby('Team').get_group(team)
    team_av = yr_data['Score'].sum() / len(yr_data['Round'].unique())
    
    return team_av

In [27]:
prev_year_av_team_for(raw, 2017, 'FRE')

1557.090909090909

In [28]:
raw.loc[(2016)][raw.loc[(2016)]['Team'] == 'FRE']['Score'].sum() / 22

1557.090909090909

In [29]:
# The purpose of this function is not to directly compute a feature, but to rather be called by other functions 
# such as prev_opposition_against_av to provide a proxy value when a more suitable value is not available

def prev_year_av_opposition_against(data, yr, opp):
    """Returns the average 'points against' the opposition team for the previous year if available, 
        otherwise returns none"""
    if yr == 2014:
        return None
    yr -= 1
    yr_data = data.loc[(yr)].groupby('Opposition').get_group(opp)
    team_av = yr_data['Score'].sum() / len(yr_data['Round'].unique())
    
    return team_av

In [30]:
prev_year_av_opposition_against(raw, 2017, 'WCE')

1602.0

In [31]:
raw.loc[(2016)][raw.loc[(2016)]['Opposition'] == 'WCE']['Score'].sum() / 22

1602.0
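
The two yearly averages above both divide a season's total by the number of distinct rounds played. The same quantity can be reached by first summing each round and then averaging the round totals, as in this minimal sketch. The scores are invented and only the team code `FRE` is taken from the data.

```python
import pandas as pd

# Invented player scores for one team over two rounds.
toy = pd.DataFrame({
    'Year':  [2016, 2016, 2016, 2016],
    'Round': [1, 1, 2, 2],
    'Team':  ['FRE', 'FRE', 'FRE', 'FRE'],
    'Score': [100, 90, 80, 70],
})

# Step 1: total 'points for' per team per round.
round_totals = toy.groupby(['Year', 'Team', 'Round'])['Score'].sum()

# Step 2: average those round totals within each season.
season_av = round_totals.groupby(level=['Year', 'Team']).mean()

# Same value as the sum divided by the number of unique rounds.
direct = toy['Score'].sum() / toy['Round'].nunique()

print(season_av.loc[(2016, 'FRE')], direct)
```

Swapping `'Team'` for `'Opposition'` in the grouping would give the corresponding "points against" average.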

In [32]:
# This function is used to directly calculate the three_rd_av_team_for feature in the final data set.

def prev_team_for_average(data, yr, rd, team, rds_to_av):
    """Returns the average team 'points for' over rds_to_av number of rounds"""
    values_to_average = []
    while len(values_to_average) < rds_to_av:
        if rd > 1:
            rd -= 1
            try:
                values_to_average.append(data.loc[(yr, rd)].groupby('Team').get_group(team)['Score'].sum())
            except KeyError:
                continue
        else:
            values_to_average.append(prev_year_av_team_for(data, yr, team))
    if None in values_to_average:
        return None
    else:
        return np.array(values_to_average).mean()

In [33]:
prev_team_for_average(raw, 2014, 10, 'NM', 3)

1643.6666666666667

In [34]:
raw.loc[(2017, 16):(2017, 20)].groupby('Team').get_group('FRE').sum()['Score'] / 5

1594.4

In [35]:
# This function is used to directly calculate the three_rd_av_opp_against feature in the final data set.

def prev_opp_against_average(data, yr, rd, opp, rds_to_av):
    """Returns the average team 'points for' for the teams to most recently play the current round opposition team
        over rds_to_av number of rounds"""
    values_to_average = []
    while len(values_to_average) < rds_to_av:
        if rd > 1:
            rd -= 1
            try:
                values_to_average.append(data.loc[(yr, rd)].groupby('Opposition').get_group(opp)['Score'].sum())
            except KeyError:
                continue
        else:
            values_to_average.append(prev_year_av_opposition_against(data, yr, opp))
    if None in values_to_average:
        return None
    else:
        return np.array(values_to_average).mean()

In [36]:
prev_opp_against_average(raw, 2017, 21, 'FRE', 3)

1652.6666666666667

In [37]:
raw.loc[(2017, 18):(2017, 20)].groupby('Opposition').get_group('FRE').sum()['Score'] / 3

1652.6666666666667

In [38]:
# This function is used to directly calculate the last_team_opp feature from the final data set

def prev_team_score_against_opposition(data, yr, rd, team, opp):
    """Returns team's last known score against opposition team. If not available, returns none. The calculation of 
        this value is not limited to the current year"""
    filtered_data = data[np.logical_and(data['Team'] == team, data['Opposition'] == opp)].loc[:(yr, rd-1)]
    if len(filtered_data) == 0:
        return None
    else:
        # sum only 'Score' per (Year, Round) and take the most recent total as a scalar
        return int(filtered_data.groupby(['Year','Round'])['Score'].sum().iloc[-1])

In [39]:
prev_team_score_against_opposition(raw, 2017, 20, 'FRE', 'GCS')

1526

In [40]:
raw.loc[(2016, 18)].groupby('Team').get_group('FRE').sum()['Score']

1526

In [41]:
# This function is used to directly calculate the last_team_venue feature from the final data set

def prev_team_score_at_venue(data, yr, rd, team, venue):
    """Returns team's last known score at venue. If not available, returns none. The calculation of this value is
        not limited to the current year"""
    filtered_data = data[np.logical_and(data['Team'] == team, data['Venue'] == venue)].loc[:(yr, rd-1)]
    if len(filtered_data) == 0:
        return None
    else:
        # sum only 'Score' per (Year, Round) and take the most recent total as a scalar
        return int(filtered_data.groupby(['Year','Round'])['Score'].sum().iloc[-1])

In [42]:
prev_team_score_at_venue(raw, 2017, 20, 'FRE', 'Domain')

1480

In [43]:
raw.loc[(2017, 18)].groupby('Team').get_group('FRE').sum()['Score']


1480

In [44]:
# Here I conducted separate tests for each function on the raw data set to check for errors 

results = []

for index, row in raw.iterrows():
    results.append(prev_team_score_at_venue(raw, row['Year'], row['Round'],row['Team'], row['Venue']))
    
print('No of rows equal: ', len(results)==len(raw))

No of rows equal:  True


### Construction of Final Capstone Data Set

Here I use the functions created above to create the data set to be used for the 'Captain Choice Problem' project. The data set is constructed by looping over each observation in the raw data set and applying the functions to each observation. The functions in turn construct the features to be used.

In [45]:
outer = []

for index, row in raw.iterrows():
    inner = []
    inner.append(prev_player_score(raw, row['Year'], row['Round'], row['Player']))
    inner.append(prev_rounds_average(raw, row['Year'], row['Round'], row['Player'], 3))
    inner.append(prev_rounds_average(raw, row['Year'], row['Round'], row['Player'], 5))
    inner.append(current_season_average(raw, row['Year'], row['Round'], row['Player']))
    inner.append(prev_player_score_against_opposition(raw, row['Year'], row['Round'], row['Player'], row['Opposition']))
    inner.append(prev_player_score_at_venue(raw, row['Year'], row['Round'], row['Player'], row['Venue']))
    inner.append(prev_team_for_average(raw, row['Year'], row['Round'], row['Team'], 3))
    inner.append(prev_opp_against_average(raw, row['Year'], row['Round'], row['Opposition'], 3))
    inner.append(prev_team_score_against_opposition(raw, row['Year'], row['Round'], row['Team'], row['Opposition']))
    inner.append(prev_team_score_at_venue(raw, row['Year'], row['Round'], row['Team'], row['Venue']))
    outer.append(inner)
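
The loop above relies on the order of the `append` calls matching the order of the column names assigned afterwards. One way to make that pairing explicit is to key each feature function by its column name. This is a minimal sketch, assuming an invented two-row frame and a hypothetical stand-in function (`toy_prev_score`) in place of the functions defined earlier.

```python
import pandas as pd

# Invented data for one hypothetical player.
toy = pd.DataFrame({
    'Year': [2014, 2014],
    'Round': [1, 2],
    'Player': ['A', 'A'],
    'Score': [100, 80],
})

def toy_prev_score(data, yr, rd, player):
    """Hypothetical stand-in for the notebook's feature functions."""
    earlier = data[(data['Year'] == yr) & (data['Round'] < rd) & (data['Player'] == player)]
    return None if earlier.empty else int(earlier['Score'].iloc[-1])

# Each output column is paired with the function that fills it.
feature_funcs = {
    'prev rd score': lambda r: toy_prev_score(toy, r['Year'], r['Round'], r['Player']),
}

# One dict per observation, so values can never land in the wrong column.
records = [
    {name: func(row) for name, func in feature_funcs.items()}
    for _, row in toy.iterrows()
]

features = pd.DataFrame(records, index=toy.index)
features['score'] = toy['Score']
print(features)
```

Adding a feature then means adding one entry to `feature_funcs`, with no separate column list to keep in sync.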

In [65]:
# Adding appropriate row and column index to data
outer = pd.DataFrame(outer, index=raw.index, columns=['prev rd score', 'three rd av', 'five rd av', 'season av', 
                                                      'prev against opp', 'prev at venue', 'three rd av team for', 
                                                      'three rd av opp against', 'last team opp', 'last team venue'])

# Adding target data to data set
outer['score'] = raw['Score']

In [66]:
# saving final data output
outer.to_csv('capstone data final.csv')
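
Because the frame is saved with its three-level index, the first three columns of the CSV hold the index levels. This is a minimal sketch of a round trip, using an in-memory buffer and invented rows in place of the real file, to show how the index could be restored when the data is loaded again.

```python
import io
import pandas as pd

# Invented two-row frame with the same index structure as the final data set.
index = pd.MultiIndex.from_tuples(
    [(2014, 1, 'A'), (2014, 2, 'A')], names=['year', 'round', 'player'])
toy = pd.DataFrame(
    {'prev rd score': [float('nan'), 100.0], 'score': [100, 80]}, index=index)

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)

# index_col=[0, 1, 2] rebuilds the (year, round, player) MultiIndex on load.
restored = pd.read_csv(buffer, index_col=[0, 1, 2])

print(restored.index.names)
print(restored.loc[(2014, 2, 'A'), 'score'])
```

Features recorded as None are written as empty cells and come back as NaN, so any later step has to decide how to treat those rows.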