# Feature Engineering

In this notebook we engineer additional features to include in the model. For this study we use the player names to infer more information about the team demographics. We currently attempt to infer:
- Player Gender
- Potentially in future: Player Age

In addition, we merge the player data into the episodes dataframe. This involves computing summary statistics separately for the players who were successful and those who failed in the head-to-head.

In [1]:
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sns

import datetime
TIME_NOW = datetime.datetime.now()

import gender_guesser.detector as gender
GENDER_DETECTOR = gender.Detector()

from agefromname import AgeFromName
AGE_FROM_NAME = AgeFromName()

In [2]:
df_episodes = pd.read_csv("../data/cleaned_episodes.csv", parse_dates=['Date'])
df_players = pd.read_csv('../data/cleaned_players.csv', parse_dates=['Date'])

df_players.head()

Unnamed: 0,Date,PlayerNo.,Name,CashBuilder,LowerOffer,HigherOffer,ChosenOffer,Series,LowerOffer_Selected,MiddleOffer_Selected,HigherOffer_Selected,isHome,LadderDifference
0,2009-06-29,P1,Lisa,5000.0,2000.0,10000.0,2000.0,1,True,False,False,False,-1
1,2009-06-29,P2,Ian,7000.0,2000.0,20000.0,20000.0,1,False,False,True,True,1
2,2009-06-29,P3,Claire,8000.0,2000.0,20000.0,8000.0,1,False,True,False,False,-3
3,2009-06-29,P4,Driss,9000.0,200.0,20000.0,200.0,1,True,False,False,True,5
4,2009-06-30,P1,Bradley,8000.0,4000.0,16000.0,8000.0,1,False,True,False,True,1


## Gender Evaluation

For this we use the "gender_guesser" Python library (https://pypi.org/project/gender-guesser/), as it allows us to request country-specific name statistics. The function below gives preference to UK naming conventions first, then Irish, US and Indian. If every country returns "unknown" or "andy" (androgynous), the name is tried again without a specified country.

If a guess still cannot be made, we fall back on the gender probabilities from the "AgeFromName" library (https://github.com/JasonKessler/agefromname), accepting a label only when its probability exceeds 0.75.

Failing that, we assign the label 'unknown'.

In [3]:
def custom_gender_guess(input_name):
    
    # Only the first name is used for this guess (middle names and surnames are ignored).
    # Celebrities with stage names have their real names in brackets, so the bracketed name is preferred.
    if "(" in input_name:
        split_name = input_name.split("(")[-1].replace(")","").split()[0]
    else:
        split_name = input_name.split()[0]
    
    # first guess with UK as preference
    name_guess = GENDER_DETECTOR.get_gender(split_name,'great_britain')
    if name_guess != "unknown" and name_guess != 'andy':
        return name_guess

    # next guess with Ireland as preference
    name_guess = GENDER_DETECTOR.get_gender(split_name,'ireland')
    if name_guess != "unknown" and name_guess != 'andy':
        return name_guess

    # next guess with USA as preference
    name_guess = GENDER_DETECTOR.get_gender(split_name,'usa')
    if name_guess != "unknown" and name_guess != 'andy':
        return name_guess
    
    # next guess with india as preference
    name_guess = GENDER_DETECTOR.get_gender(split_name,'india')
    if name_guess != "unknown" and name_guess != 'andy':
        return name_guess
        
    # next guess without preference ('andy' is accepted at this stage)
    name_guess = GENDER_DETECTOR.get_gender(split_name)
    if name_guess != "unknown":
        return name_guess
    
    # last resort: AgeFromName probabilities, accepted only above 0.75
    if AGE_FROM_NAME.prob_male(split_name) > 0.75:
        return "male"
    if AGE_FROM_NAME.prob_female(split_name) > 0.75:
        return "female"
    
    return "unknown"
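
The country cascade in the function above repeats the same check four times. One way to express the same idea more compactly is a loop over the preferred countries. The sketch below is illustrative only: `first_name`, `cascade_guess`, `stub_lookup` and `TOY_TABLE` are hypothetical names, and the stub stands in for `GENDER_DETECTOR.get_gender` so that the example runs without the library installed. The AgeFromName fallback is left out.

```python
# Hypothetical sketch of the country-preference cascade, driven by a stub lookup.
COUNTRY_ORDER = ["great_britain", "ireland", "usa", "india"]
AMBIGUOUS = {"unknown", "andy"}


def first_name(input_name):
    """Return the first given name, preferring a bracketed real name if present."""
    if "(" in input_name:
        input_name = input_name.split("(")[-1].replace(")", "")
    return input_name.split()[0]


def cascade_guess(input_name, lookup):
    """Try each preferred country in turn, then fall back to no country."""
    name = first_name(input_name)
    for country in COUNTRY_ORDER:
        guess = lookup(name, country)
        if guess not in AMBIGUOUS:
            return guess
    # Final attempt without a country; 'andy' is accepted here, as in the cell above.
    return lookup(name, None)


# Toy lookup table standing in for the real detector (assumed values, not library output).
TOY_TABLE = {("Lisa", "great_britain"): "female", ("Ciaran", "ireland"): "male"}


def stub_lookup(name, country):
    return TOY_TABLE.get((name, country), "unknown")


print(cascade_guess("Lisa Smith", stub_lookup))
print(cascade_guess("Stage Name (Ciaran Jones)", stub_lookup))
print(cascade_guess("Zzz", stub_lookup))
```

Passing the lookup as an argument keeps the parsing and ordering logic testable separately from the name database.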

In [4]:
df_players["GenderGuess"] = df_players["Name"].apply(custom_gender_guess)
df_players["GenderGuess"].value_counts()

male             3800
female           3424
mostly_male       656
mostly_female     495
unknown           212
andy               57
Name: GenderGuess, dtype: int64

We see that most names are confidently classified as male or female, with a smaller number labelled mostly_male, mostly_female or androgynous ('andy').

The fractions of names labelled androgynous or left unclassified are:

In [5]:
print("Percent of Andy: {:.2f}%".format(100*len(df_players[df_players["GenderGuess"] == "andy"]) / len(df_players)))
print("Percent of Unclassified: {:.2f}%".format(100*len(df_players[df_players["GenderGuess"] == "unknown"]) / len(df_players)))

Percent of Andy: 0.66%
Percent of Unclassified: 2.45%


For encoding this information we will:
- Combine male and mostly_male as: Male
- Combine female and mostly_female as: Female
- Combine andy and unknown as: Other

We will then store three binary (one-hot encoded) columns, one for each class.

In [6]:
df_players.loc[df_players["GenderGuess"] == "mostly_male", "GenderGuess"] = "male"
df_players.loc[df_players["GenderGuess"] == "mostly_female", "GenderGuess"] = "female"

df_players.loc[df_players["GenderGuess"] == "unknown", "GenderGuess"] = "other"
df_players.loc[df_players["GenderGuess"] == "andy", "GenderGuess"] = "other"

# Perform one-hot encoding using get_dummies() function and add a prefix
one_hot_encoded = pd.get_dummies(df_players['GenderGuess'], prefix='gender')
# Drop the original column
df_players = df_players.drop('GenderGuess', axis=1)
# Concatenate the one-hot encoded columns with the DataFrame
df_players = pd.concat([df_players, one_hot_encoded], axis=1)
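
To make the encoding concrete, the following library-free sketch applies the same collapse-then-one-hot logic to individual labels. `COLLAPSE`, `CLASSES` and `one_hot` are hypothetical names introduced for illustration; the notebook itself relies on `pd.get_dummies` as shown above.

```python
# Collapse the six raw labels into three classes, then one-hot encode them.
COLLAPSE = {
    "male": "male",
    "mostly_male": "male",
    "female": "female",
    "mostly_female": "female",
    "andy": "other",
    "unknown": "other",
}
# Alphabetical order, matching the gender_* columns in the output table.
CLASSES = ["female", "male", "other"]


def one_hot(raw_label):
    """Return a dict with exactly one gender_* key set to 1."""
    collapsed = COLLAPSE[raw_label]
    return {"gender_{}".format(c): int(c == collapsed) for c in CLASSES}


for label in ["mostly_female", "male", "andy"]:
    print(label, one_hot(label))
```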

In [7]:
df_players.head(20)

Unnamed: 0,Date,PlayerNo.,Name,CashBuilder,LowerOffer,HigherOffer,ChosenOffer,Series,LowerOffer_Selected,MiddleOffer_Selected,HigherOffer_Selected,isHome,LadderDifference,gender_female,gender_male,gender_other
0,2009-06-29,P1,Lisa,5000.0,2000.0,10000.0,2000.0,1,True,False,False,False,-1,1,0,0
1,2009-06-29,P2,Ian,7000.0,2000.0,20000.0,20000.0,1,False,False,True,True,1,0,1,0
2,2009-06-29,P3,Claire,8000.0,2000.0,20000.0,8000.0,1,False,True,False,False,-3,1,0,0
3,2009-06-29,P4,Driss,9000.0,200.0,20000.0,200.0,1,True,False,False,True,5,0,1,0
4,2009-06-30,P1,Bradley,8000.0,4000.0,16000.0,8000.0,1,False,True,False,True,1,0,1,0
5,2009-06-30,P2,Christine,3000.0,1000.0,13000.0,3000.0,1,False,True,False,True,2,1,0,0
6,2009-06-30,P3,Foiz,6000.0,1000.0,25000.0,25000.0,1,False,False,True,True,1,0,0,1
7,2009-06-30,P4,Jill,6000.0,1000.0,14000.0,6000.0,1,False,True,False,False,-2,1,0,0
8,2009-01-07,P1,Lynn,7000.0,4000.0,14000.0,7000.0,1,False,True,False,True,1,1,0,0
9,2009-01-07,P2,Paul,9000.0,3000.0,18000.0,3000.0,1,True,False,False,True,1,0,1,0


## Team Summary (Joining Episodes to Players)

To create a dataset for modelling, we now need to create summary statistics for each episode using the player data. We build three kinds of columns:
- Player Columns: for a given feature, we have one column for each player number (P1, P2, P3, P4).
- Aggregated Columns: for a given feature, we take the minimum, maximum and mean, separately for successful and failed players.
- Count Columns: for a given (boolean) feature, we count the occurrences of the feature (and the corresponding fraction), separately for successful and failed players.
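
As a worked illustration of the three column types, the sketch below summarises the four players of the first episode (2009-06-29), using the values from the table above. It uses plain Python rather than pandas, and `summarise` is a hypothetical helper, not part of the notebook.

```python
import math
from statistics import mean

# The four players of the 2009-06-29 episode (values from the players table).
players = [
    {"PlayerNo.": "P1", "CashBuilder": 5000.0, "isHome": False, "gender_female": 1},
    {"PlayerNo.": "P2", "CashBuilder": 7000.0, "isHome": True, "gender_female": 0},
    {"PlayerNo.": "P3", "CashBuilder": 8000.0, "isHome": False, "gender_female": 1},
    {"PlayerNo.": "P4", "CashBuilder": 9000.0, "isHome": True, "gender_female": 0},
]


def summarise(group, prefix):
    """Aggregate and count columns for one group; NaN when the group is empty."""
    values = [p["CashBuilder"] for p in group]
    n_female = sum(p["gender_female"] for p in group)
    return {
        prefix + "_min_CashBuilder": min(values) if values else math.nan,
        prefix + "_max_CashBuilder": max(values) if values else math.nan,
        prefix + "_mean_CashBuilder": mean(values) if values else math.nan,
        prefix + "_sum_gender_female": n_female,
        prefix + "_fraction_gender_female": n_female / len(group) if group else math.nan,
    }


team_data = {}
# Player columns: one entry per player number.
for p in players:
    team_data[p["PlayerNo."] + "_isHome"] = p["isHome"]
# Aggregated and count columns, split by head-to-head outcome.
team_data.update(summarise([p for p in players if p["isHome"]], "SuccessTeam"))
team_data.update(summarise([p for p in players if not p["isHome"]], "FailedTeam"))

for key, value in team_data.items():
    print(key, value)
```

The empty-group guard matters for episodes where all four players succeed or all four fail, since one side then has no values to aggregate.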

In [8]:
# tags for each player number as found in df_players
PLAYER_TAGS = ['P1', 'P2', 'P3', 'P4']

# will produce a max, min, mean over all players for these columns
AGGREGATE_COLUMNS = ['CashBuilder', 'LowerOffer', 'HigherOffer', 'ChosenOffer', 'LadderDifference']

# will count occurrences (sum boolean) over all players for these columns
COUNT_COLUMNS = ['LowerOffer_Selected', 'MiddleOffer_Selected', 'HigherOffer_Selected', 'gender_female', 'gender_male', 'gender_other']

def get_team_summary(df_episodes, df_players, row_index):
    # Step 1: Extract names from the df_episodes Team string
    team_names = df_episodes.loc[row_index, 'Team'].split(', ')
    
    # Step 2: Filter df_players by date (team_names is extracted above but not yet used in the filter)
    date_filter = df_players['Date'] == df_episodes.loc[row_index, 'Date']
    df_filtered = df_players[date_filter]
    
    # skip the episode unless exactly 4 players are found for this date
    if len(df_filtered) != 4:
        return None
    
    df_player_success = df_filtered[df_filtered['isHome']]
    df_player_failed = df_filtered[~df_filtered['isHome']]
    
    team_data = {}
    
    # Player number stats
    for p in PLAYER_TAGS:
        player_data = df_filtered.loc[df_filtered['PlayerNo.'] == p].iloc[0]
        team_data['{}_isHome'.format(p)] = player_data['isHome']
        team_data['{}_ChosenOffer'.format(p)] = player_data['ChosenOffer']
    
    # Summary statistics
    for c in AGGREGATE_COLUMNS:
        team_data['SuccessTeam_min_{}'.format(c)] = df_player_success[c].min()
        team_data['SuccessTeam_max_{}'.format(c)] = df_player_success[c].max()
        team_data['SuccessTeam_mean_{}'.format(c)] = df_player_success[c].mean()

        team_data['FailedTeam_min_{}'.format(c)] = df_player_failed[c].min()
        team_data['FailedTeam_max_{}'.format(c)] = df_player_failed[c].max()
        team_data['FailedTeam_mean_{}'.format(c)] = df_player_failed[c].mean()
        
    for c in COUNT_COLUMNS:
        team_data['SuccessTeam_sum_{}'.format(c)] = df_player_success[c].sum()
        team_data['FailedTeam_sum_{}'.format(c)] = df_player_failed[c].sum()
        
        if len(df_player_success) > 0:
            team_data['SuccessTeam_fraction_{}'.format(c)] = team_data['SuccessTeam_sum_{}'.format(c)] / len(df_player_success)
        else:
            team_data['SuccessTeam_fraction_{}'.format(c)] = np.nan
            
            
        if len(df_player_failed) > 0:
            team_data['FailedTeam_fraction_{}'.format(c)] = team_data['FailedTeam_sum_{}'.format(c)] / len(df_player_failed)
        else:
            team_data['FailedTeam_fraction_{}'.format(c)] = np.nan  

    # Write each entry of team_data into df_episodes as a separate column (modifies df_episodes in place)
    for key, value in team_data.items():
        df_episodes.loc[row_index, key] = value
    
    return team_data

In [9]:
# A little slow to do this in a loop, but not a sizeable bottleneck for small datasets.
# DataFrame.append was removed in pandas 2.0, so rows are collected in a list instead.
merged_rows = []
# Iterate over each row in df_episodes
for row_index in df_episodes.index:
    team_data = get_team_summary(df_episodes, df_players, row_index)
    if team_data is not None:
        merged_rows.append(df_episodes.loc[row_index])
df_merged = pd.DataFrame(merged_rows)
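
If the loop ever becomes a bottleneck, a grouped aggregation is one possible alternative. The sketch below is not the notebook's method: it assumes one episode per date (as the loop does), covers a single aggregate column, and does not reproduce the four-player check. The toy `players` frame holds the values of the first two episodes from the players table.

```python
import pandas as pd

# Toy frame: the eight players of the first two episodes.
players = pd.DataFrame({
    "Date": pd.to_datetime(["2009-06-29"] * 4 + ["2009-06-30"] * 4),
    "isHome": [False, True, False, True, True, True, True, False],
    "CashBuilder": [5000.0, 7000.0, 8000.0, 9000.0, 8000.0, 3000.0, 6000.0, 6000.0],
})

# One row per date, with min/max/mean split by head-to-head outcome.
summary = (
    players.groupby(["Date", "isHome"])["CashBuilder"]
    .agg(["min", "max", "mean"])
    .unstack("isHome")
)

# Flatten the (stat, isHome) column pairs into the naming scheme used above.
summary.columns = [
    "{}_{}_CashBuilder".format("SuccessTeam" if is_home else "FailedTeam", stat)
    for stat, is_home in summary.columns
]

print(summary)
```

The resulting frame is indexed by date, so it could then be joined onto the episodes dataframe by its `Date` column.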

In [10]:
df_merged.head()

Unnamed: 0,Date,Episode,Team,Chaser,NPlayersFinal,PrizeFund,Target,Series,isCelebrity,ChaserWin,...,SuccessTeam_fraction_gender_female,FailedTeam_fraction_gender_female,SuccessTeam_sum_gender_male,FailedTeam_sum_gender_male,SuccessTeam_fraction_gender_male,FailedTeam_fraction_gender_male,SuccessTeam_sum_gender_other,FailedTeam_sum_gender_other,SuccessTeam_fraction_gender_other,FailedTeam_fraction_gender_other
0,2009-06-29,1.0,"Lisa, Ian, Claire, Driss",Mark Labbett,2.0,20200.0,18.0,1.0,0.0,1.0,...,0.0,1.0,2.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
1,2009-06-30,2.0,"Bradley, Christine, Foiz, Jill",Shaun Wallace,3.0,36000.0,20.0,1.0,False,True,...,0.333333,1.0,1.0,0.0,0.333333,0.0,1.0,0.0,0.333333,0.0
2,2009-01-07,3.0,"Lynn, Paul, Liz, Ray",Mark Labbett,4.0,12400.0,22.0,1.0,False,True,...,0.5,,2.0,0.0,0.5,,0.0,0.0,0.0,
3,2009-02-07,4.0,"Mike, Sally, Ciaran, Bette",Shaun Wallace,3.0,16500.0,20.0,1.0,False,False,...,0.666667,0.0,1.0,1.0,0.333333,1.0,0.0,0.0,0.0,0.0
4,2009-03-07,5.0,"Allan, Lisa, Phil, Emma",Shaun Wallace,2.0,16000.0,14.0,1.0,False,True,...,0.5,0.5,1.0,1.0,0.5,0.5,0.0,0.0,0.0,0.0


In [11]:
# drop columns no longer needed
df_merged = df_merged.drop(columns=['Date', 'Episode', 'Team', 'Series'])
df_merged.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2160 entries, 0 to 2159
Data columns (total 70 columns):
 #   Column                                     Non-Null Count  Dtype  
---  ------                                     --------------  -----  
 0   Chaser                                     2160 non-null   object 
 1   NPlayersFinal                              2160 non-null   float64
 2   PrizeFund                                  2160 non-null   float64
 3   Target                                     2160 non-null   float64
 4   isCelebrity                                2160 non-null   object 
 5   ChaserWin                                  2160 non-null   object 
 6   ChaserWinBy                                2160 non-null   float64
 7   ChaserLoseBy                               2160 non-null   float64
 8   P1_isHome                                  2160 non-null   object 
 9   P1_ChosenOffer                             2160 non-null   float64
 10  P2_isHome               

In [12]:
df_merged.to_csv('../data/merged_data.csv', index=False)