# Feature engineering

In this notebook we create new features and preprocess them into a form that classifiers can take as input. We store the resulting data in the ./data folder so that it is easily accessible later for analysis or training.



In [1]:
import numpy as np
import pandas as pd

import os

from utils.feature_engineering_utils import CreateNewFeatures, Preprocessor

Let's start by loading the raw training data.

In [2]:
_DATADIR = './data'
_TRAIN_DATA = os.path.join(_DATADIR, 'train.csv')

In [3]:
train_data = pd.read_csv(_TRAIN_DATA, index_col='PassengerId')
print(train_data.shape)


(891, 11)


### Feature creation

We will now create new features based on the analysis in the data_analysis.ipynb notebook. After that analysis we also added some new features inspired by Kaggle community discussions:
- FamilyRate: proportion of the passenger's family members (excluding the passenger) who survived. If the passenger has no family, the average survival rate for the passenger's gender is used.
- FamilyRate_grouped: FamilyRate grouped into two categories: 0 if FamilyRate < 0.5, else 1
- GroupRate: proportion of the passenger's ticket group (excluding the passenger) who survived. If the passenger has no group, the average survival rate for the passenger's gender is used.
- GroupRate_grouped: GroupRate grouped into two categories: 0 if GroupRate < 0.5, else 1
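
To make the rate definitions above concrete, here is a minimal sketch of a leave-one-out survival rate with a gender-average fallback. It is not the code from feature_engineering_utils.py: the function name `leave_one_out_rate`, the toy data, and the way the fallback average is computed (a plain per-gender mean over all rows) are assumptions, so the values will not necessarily match the ones produced by CreateNewFeatures.

```python
import pandas as pd


def leave_one_out_rate(df, group_col, target_col="Survived", sex_col="Sex"):
    """Share of the *other* members of each group who survived.

    Hypothetical helper: rows without any other group member fall back
    to the mean survival rate of the passenger's gender.
    """
    grouped = df.groupby(group_col)[target_col]
    # Survivors and member counts in the group, excluding the current row.
    others_survived = grouped.transform("sum") - df[target_col]
    others_count = grouped.transform("count") - 1
    # Groups of size one have no "others": mask the zero count to get NaN.
    rate = others_survived / others_count.where(others_count > 0)
    fallback = df.groupby(sex_col)[target_col].transform("mean")
    return rate.fillna(fallback)


# Toy data (made up for illustration, not taken from train.csv).
toy = pd.DataFrame({
    "Ticket": ["A", "A", "B", "C", "C", "C"],
    "Sex": ["male", "female", "male", "female", "female", "male"],
    "Survived": [0, 1, 1, 1, 1, 0],
})
toy["GroupRate"] = leave_one_out_rate(toy, "Ticket")
# Two categories, as described above: 0 if the rate is below 0.5, else 1.
toy["GroupRate_grouped"] = (toy["GroupRate"] >= 0.5).astype(int)
```

The same helper could be applied with a family identifier (for example a surname-based key) in place of `Ticket` to obtain a FamilyRate-style feature.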

We have created a sklearn pipeline component, CreateNewFeatures, that creates the new features from the raw data. For more details, see the feature_engineering_utils.py file.
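
For readers unfamiliar with custom pipeline components, the sketch below shows the general shape such a transformer usually takes in scikit-learn. The class `SimpleFeatureCreator` is a hypothetical, heavily simplified stand-in and not the actual CreateNewFeatures. The FamilySize formula (SibSp + Parch + 1) is consistent with the rows shown in the output below, while the HasCabin rule (1 if Cabin is present) is an assumption.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class SimpleFeatureCreator(BaseEstimator, TransformerMixin):
    """Hypothetical, simplified stand-in for CreateNewFeatures."""

    def fit(self, X, y=None):
        # Nothing is learned in this sketch; a real component might store
        # survival statistics from the training data here.
        return self

    def transform(self, X):
        X = X.copy()  # avoid mutating the caller's DataFrame
        X["FamilySize"] = X["SibSp"] + X["Parch"] + 1
        # Assumption: HasCabin flags whether a cabin value is recorded.
        X["HasCabin"] = X["Cabin"].notna().astype(int)
        return X


# Toy rows (made up for illustration).
toy_raw = pd.DataFrame({
    "SibSp": [1, 0, 1],
    "Parch": [0, 0, 2],
    "Cabin": [None, "C85", None],
})
toy_features = SimpleFeatureCreator().fit_transform(toy_raw)
```

Implementing `fit` and `transform` and inheriting from `TransformerMixin` is what lets such a class be used as a step in a scikit-learn `Pipeline`.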

In [4]:
feature_creator = CreateNewFeatures()
raw_train_data_all_features = feature_creator.fit_transform(train_data, train_data['Survived'])
raw_train_data_all_features

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,...,Title_grouped,FamilySize,FamilySize_grouped,GroupSize,GroupSize_grouped,GroupRate,GroupRate_grouped,FamilyRate,FamilyRate_grouped,Fare_adjusted
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,...,Mr,2,1,2,1,0.155718,0,0.155718,0,3.62500
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,...,Mrs,2,1,2,1,0.785714,1,0.785714,1,35.64165
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,...,Miss,1,0,1,0,0.785714,1,0.785714,1,7.92500
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,...,Mrs,2,1,2,1,0.500000,1,0.500000,1,26.55000
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,...,Mr,1,0,1,0,0.155718,0,0.155718,0,8.05000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,...,Religious,1,0,1,0,0.155718,0,0.155718,0,13.00000
888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,...,Miss,1,0,1,0,0.785714,1,0.785714,1,30.00000
889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,...,Miss,4,1,4,1,0.000000,0,0.000000,0,5.86250
890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,...,Mr,1,0,1,0,0.155718,0,0.155718,0,30.00000


Let's see the column names we have now:

In [5]:
raw_train_data_all_features.columns

Index(['Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket',
       'Fare', 'Cabin', 'Embarked', 'HasCabin', 'CabinType', 'Surname',
       'Title', 'Title_grouped', 'FamilySize', 'FamilySize_grouped',
       'GroupSize', 'GroupSize_grouped', 'GroupRate', 'GroupRate_grouped',
       'FamilyRate', 'FamilyRate_grouped', 'Fare_adjusted'],
      dtype='object')

In [6]:
data_path = os.path.join(_DATADIR, 'raw_data_all_features.csv')
raw_train_data_all_features.to_csv(data_path)

### Imputation

See the imputation.ipynb notebook for how missing values are imputed. After imputation, we can continue with data preprocessing.

### Preprocessing the data

Now all features are created and missing values are imputed. We continue by loading the imputed data with all features (saved in imputation.ipynb):

In [7]:
imputed_data_all_features = pd.read_csv(os.path.join(_DATADIR, 'imputed_data_all_features.csv'))

In [8]:
imputed_data_all_features.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked', 'HasCabin', 'CabinType',
       'Surname', 'Title', 'Title_grouped', 'FamilySize', 'FamilySize_grouped',
       'GroupSize', 'GroupSize_grouped', 'GroupRate', 'GroupRate_grouped',
       'FamilyRate', 'FamilyRate_grouped', 'Fare_adjusted'],
      dtype='object')

Preprocessing is done with a pipeline component called Preprocessor (see feature_engineering_utils.py for the implementation). Preprocessor takes the desired numerical, categorical and ordinal columns as arguments and preprocesses them accordingly:
- numerical columns are standardized
- categorical columns are one-hot encoded
- ordinal columns are encoded using OrdinalEncoder.

Let's specify which columns we want to treat as numerical, categorical and ordinal:

In [9]:
numerical = ['Age', 'Fare', 'Fare_adjusted', 'GroupRate', 'FamilyRate',
             'GroupSize', 'FamilySize', 'SibSp', 'Parch']
ordinal = []
# 'Survived' is the target; it is listed here so that it ends up in the
# preprocessed output (as cat__Survived_1) alongside the features.
categorical = ['Pclass', 'Sex', 'Embarked', 'HasCabin', 'CabinType', 'Title_grouped',
               'FamilySize_grouped', 'GroupSize_grouped', 'GroupRate_grouped', 'FamilyRate_grouped',
               'Survived']

preprocessing_params = {'categorical_cols': categorical, 'numerical_cols': numerical, 'ordinal_cols': ordinal}

In [10]:
preprocessor = Preprocessor(**preprocessing_params)
preprocessed_data = preprocessor.fit_transform(imputed_data_all_features)

In [11]:
preprocessed_data

,num__Age,num__Fare,num__Fare_adjusted,num__GroupRate,num__FamilyRate,num__GroupSize,num__FamilySize,num__SibSp,num__Parch,cat__Pclass_1,...,cat__Title_grouped_Religious,cat__FamilySize_grouped_0,cat__FamilySize_grouped_1,cat__FamilySize_grouped_2,cat__GroupSize_grouped_0,cat__GroupSize_grouped_1,cat__GroupSize_grouped_2,cat__GroupRate_grouped_1,cat__FamilyRate_grouped_1,cat__Survived_1
0,-0.549039,-0.502445,-0.651280,-0.644984,-1.223299,-0.061173,0.059160,0.432793,-0.473674,0.0,...,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
1,0.617564,0.786845,1.110049,1.044617,1.179442,-0.061173,0.059160,0.432793,-0.473674,1.0,...,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0
2,-0.257388,-0.488854,-0.414724,1.044617,1.179442,-0.653621,-0.560975,-0.474545,-0.473674,0.0,...,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,1.0
3,0.398826,0.420730,0.609891,0.305283,0.395715,-0.061173,0.059160,0.432793,-0.473674,1.0,...,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0
4,0.398826,-0.486337,-0.407848,-0.644984,0.395715,-0.653621,-0.560975,-0.474545,-0.473674,0.0,...,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,-0.184475,-0.386671,-0.135534,-0.644984,-0.611609,-0.653621,-0.560975,-0.474545,-0.473674,0.0,...,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
887,-0.767777,-0.044381,0.799685,1.044617,0.935386,-0.653621,-0.560975,-0.474545,-0.473674,1.0,...,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,1.0
888,-2.007300,-0.176263,-0.528188,-1.222026,-1.223299,1.123723,1.299429,0.432793,2.008933,0.0,...,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
889,-0.257388,-0.044381,0.799685,-0.644984,-0.611609,-0.653621,-0.560975,-0.474545,-0.473674,1.0,...,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0


We can see that the new column names now have prefixes indicating the feature type. At this point we already have three pipeline components ready: CreateNewFeatures, Imputer and Preprocessor.
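
The prefixes make it easy to select columns by type later on, and to separate the encoded target from the features before training. The sketch below illustrates this on a small made-up frame that only imitates the column naming above. The variable names are hypothetical, and the real preprocessed_data has many more columns.

```python
import pandas as pd

# Toy frame imitating the naming scheme of the preprocessed output.
toy_preprocessed = pd.DataFrame({
    "num__Age": [-0.55, 0.62, -0.26],
    "cat__Pclass_1": [0.0, 1.0, 0.0],
    "cat__Survived_1": [0.0, 1.0, 1.0],
})

# The target was one-hot encoded with the other categorical columns,
# so it has to be split off before fitting a classifier.
target_col = "cat__Survived_1"
y = toy_preprocessed[target_col].astype(int)
X = toy_preprocessed.drop(columns=[target_col])

# Select feature columns by their type prefix.
numeric_cols = [c for c in X.columns if c.startswith("num__")]
categorical_cols = [c for c in X.columns if c.startswith("cat__")]
```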

We can store the preprocessed data in case we need it later:

In [12]:
preprocessed_data.to_csv(os.path.join(_DATADIR, 'preprocessed_train_data.csv'))