# Feature Engineering

In this notebook, I'll modify existing features, create new ones, and fill in missing data.

In [1]:
import pandas as pd
import re

In [2]:
train = pd.read_csv('../input/train.csv')
train.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

In [3]:
train.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

I will fill the missing data in the Age, Cabin and Embarked columns.

In [4]:
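def dummify(df, col, drop=True):
    # One-hot encode `col`: one indicator column per distinct value, named <col>_<value>
    df_dummies = pd.get_dummies(df[col], prefix=col)
    df = pd.concat([df, df_dummies], axis=1)
    if drop:
        # Remove the original column once it has been encoded
        df = df.drop(col, axis=1)
    return df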

## Pclass
There are 3 passenger classes, and class is important in determining whether a passenger survived. I'll use dummy variables here.

In [5]:
train = dummify(train,'Pclass')
train.columns

Index(['PassengerId', 'Survived', 'Name', 'Sex', 'Age', 'SibSp', 'Parch',
       'Ticket', 'Fare', 'Cabin', 'Embarked', 'Pclass_1', 'Pclass_2',
       'Pclass_3'],
      dtype='object')
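
To make the encoding concrete, here is a minimal sketch of what `pd.get_dummies` produces for a class column. The `toy` frame and its values are made up for illustration and are not taken from the Titanic data.

```python
import pandas as pd

# Toy frame standing in for the real data (values are invented).
toy = pd.DataFrame({'Pclass': [3, 1, 2, 3]})

# One indicator column per distinct class, named Pclass_<value>.
dummies = pd.get_dummies(toy['Pclass'], prefix='Pclass')
print(dummies.columns.tolist())

# Each row has exactly one indicator set.
row_totals = dummies.astype(int).sum(axis=1).tolist()
print(row_totals)
```

Since each row sets exactly one indicator, the columns are linearly dependent. For models that are sensitive to this, `pd.get_dummies` also accepts `drop_first=True`, which omits the first level.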

## Name
From the name, we can extract a title that may carry information about the passenger's social status.

In [6]:
def get_titles(df):
    # The title sits between ", " and the next "." in Name
    title_re = re.compile(r", (.*?)\.")
    df['Title'] = df['Name'].map(lambda x: title_re.findall(x)[0])
    # Assign through .loc to avoid chained indexing (SettingWithCopyWarning)
    df.loc[df.Title.isin(['Mlle', 'Ms']), 'Title'] = 'Miss'
    df.loc[df.Title == 'Mme', 'Title'] = 'Mrs'
    df.loc[df.Title.isin(['Jonkheer', 'Don', 'Dr', 'Capt', 'Col', 'Major']), 'Title'] = 'Sir'
    df.loc[df.Title.isin(['the Countess', 'Dona']), 'Title'] = 'Lady'
    return df

train = get_titles(train)
train = dummify(train,'Title')
train.drop('Name', inplace=True, axis=1)
train.columns

Index(['PassengerId', 'Survived', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket',
       'Fare', 'Cabin', 'Embarked', 'Pclass_1', 'Pclass_2', 'Pclass_3',
       'Title_Lady', 'Title_Master', 'Title_Miss', 'Title_Mr', 'Title_Mrs',
       'Title_Rev', 'Title_Sir'],
      dtype='object')
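
As a quick check of the title pattern, the sketch below applies the same regular expression to a few strings written in the `Name` column's "Surname, Title. Given names" layout. The `sample_names` list is only an illustration.

```python
import re

# Same pattern as in get_titles: lazily capture everything between ", " and the next "."
title_pattern = re.compile(r', (.*?)\.')

# Illustrative names in the "Surname, Title. Given names" layout.
sample_names = [
    'Braund, Mr. Owen Harris',
    'Heikkinen, Miss. Laina',
    'Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)',
]

titles = [title_pattern.findall(name)[0] for name in sample_names]
print(titles)
```

The lazy quantifier `.*?` stops at the first period, so multi-word titles such as "the Countess" are captured whole while the rest of the name is ignored.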

## Sex
Sex also plays an important role in determining whether a passenger survived. There are 2 values: male and female. I'll use dummy variables here as well.

In [7]:
train = dummify(train,'Sex')
# I drop one of the dummy variables as it is a binary variable
train.drop('Sex_male', inplace=True, axis=1)
train.columns

Index(['PassengerId', 'Survived', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare',
       'Cabin', 'Embarked', 'Pclass_1', 'Pclass_2', 'Pclass_3', 'Title_Lady',
       'Title_Master', 'Title_Miss', 'Title_Mr', 'Title_Mrs', 'Title_Rev',
       'Title_Sir', 'Sex_female'],
      dtype='object')

## Family
I'll keep SibSp and Parch as they are. They are both numerical variables.  
I'll add a family size variable. We'll see later which one is most influential.

In [8]:
train['FamilySize'] = train['SibSp'] + train['Parch'] + 1
train.columns

Index(['PassengerId', 'Survived', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare',
       'Cabin', 'Embarked', 'Pclass_1', 'Pclass_2', 'Pclass_3', 'Title_Lady',
       'Title_Master', 'Title_Miss', 'Title_Mr', 'Title_Mrs', 'Title_Rev',
       'Title_Sir', 'Sex_female', 'FamilySize'],
      dtype='object')

## Ticket
I'll drop this column for now

In [9]:
train.drop('Ticket', inplace=True, axis=1)
train.columns

Index(['PassengerId', 'Survived', 'Age', 'SibSp', 'Parch', 'Fare', 'Cabin',
       'Embarked', 'Pclass_1', 'Pclass_2', 'Pclass_3', 'Title_Lady',
       'Title_Master', 'Title_Miss', 'Title_Mr', 'Title_Mrs', 'Title_Rev',
       'Title_Sir', 'Sex_female', 'FamilySize'],
      dtype='object')

## Fare
I'll create bins of fares that I'll factorize

In [10]:
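def fare_bins(df, drop=True):
    bins = [0, 10, 20, 30, 40, 50, 100, 550]
    # pd.cut builds right-closed intervals, so a fare of exactly 0 falls outside
    # (0, 10] and ends up encoded as -1 by pd.factorize
    df['FareBin'] = pd.cut(df['Fare'], bins)
    # factorize numbers the bins in order of first appearance, not by fare level
    df['FareBin'] = pd.factorize(df['FareBin'])[0]
    if drop:
        df.drop('Fare', inplace=True, axis=1)
    return df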

train = fare_bins(train)
train.columns

Index(['PassengerId', 'Survived', 'Age', 'SibSp', 'Parch', 'Cabin', 'Embarked',
       'Pclass_1', 'Pclass_2', 'Pclass_3', 'Title_Lady', 'Title_Master',
       'Title_Miss', 'Title_Mr', 'Title_Mrs', 'Title_Rev', 'Title_Sir',
       'Sex_female', 'FamilySize', 'FareBin'],
      dtype='object')
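
The interaction between `pd.cut` and `pd.factorize` is easy to overlook, so here is a minimal sketch on made-up fares. The `labels=False` and `include_lowest=True` variant is an alternative shown for comparison, not what this notebook uses.

```python
import pandas as pd

# Invented fares, including a zero fare.
fares = pd.Series([7.25, 71.28, 0.0, 26.0, 8.05])
bins = [0, 10, 20, 30, 40, 50, 100, 550]

# As in fare_bins: intervals are right-closed, so 0.0 matches no bin (NaN),
# and factorize encodes that missing bin as -1.
binned = pd.cut(fares, bins)
by_appearance = pd.factorize(binned)[0]
print(list(by_appearance))

# Alternative: integer codes that follow the bin order, with 0 kept in the first bin.
ordered = pd.cut(fares, bins, labels=False, include_lowest=True)
print(ordered.tolist())
```

With `labels=False`, the code is the position of the bin, so a larger code always means a higher fare range. Tree-based models can work with either encoding, but the ordered one is easier to interpret.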

At this point, I still have to work on Age, Cabin and Embarked. All of those features have missing values.

## Embarked
I'll simply fill the missing values with the port where most passengers boarded the ship (Southampton).

In [11]:
# Assign the result back instead of calling fillna(inplace=True) on a selected column
train['Embarked'] = train['Embarked'].fillna('S')
train = dummify(train,'Embarked')
train.isnull().sum()

PassengerId       0
Survived          0
Age             177
SibSp             0
Parch             0
Cabin           687
Pclass_1          0
Pclass_2          0
Pclass_3          0
Title_Lady        0
Title_Master      0
Title_Miss        0
Title_Mr          0
Title_Mrs         0
Title_Rev         0
Title_Sir         0
Sex_female        0
FamilySize        0
FareBin           0
Embarked_C        0
Embarked_Q        0
Embarked_S        0
dtype: int64
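
Rather than hard-coding the port, the most frequent value can also be computed from the column. The sketch below uses an invented `embarked` series to show the idea.

```python
import pandas as pd

# Invented sample with two missing ports.
embarked = pd.Series(['S', 'C', None, 'S', 'Q', 'S', None])

# mode() ignores missing values by default; take the first (most frequent) entry.
most_common = embarked.mode()[0]
filled = embarked.fillna(most_common)
print(most_common, filled.tolist())
```

Computing the mode keeps the fill value tied to the data, which helps if the same code is later applied to a different file.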

## Cabin
There is a lot of missing data here. I'll extract the deck from the cabin number and fill the missing data with 'U0'

In [12]:
def get_deck(df, drop=True):
    # The deck is the first letter of the cabin number; missing cabins become 'U0'
    df['Deck'] = df['Cabin'].str[0].fillna('U0')
    if drop:
        df.drop('Cabin', inplace=True, axis=1)
    return df

train = get_deck(train)
train = dummify(train,'Deck')
train.isnull().sum()

PassengerId       0
Survived          0
Age             177
SibSp             0
Parch             0
Pclass_1          0
Pclass_2          0
Pclass_3          0
Title_Lady        0
Title_Master      0
Title_Miss        0
Title_Mr          0
Title_Mrs         0
Title_Rev         0
Title_Sir         0
Sex_female        0
FamilySize        0
FareBin           0
Embarked_C        0
Embarked_Q        0
Embarked_S        0
Deck_A            0
Deck_B            0
Deck_C            0
Deck_D            0
Deck_E            0
Deck_F            0
Deck_G            0
Deck_T            0
Deck_U0           0
dtype: int64
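
The sketch below shows the deck extraction on a handful of made-up cabin strings, and why a missing cabin shows up as the letter 'n' once the column is cast to strings.

```python
import numpy as np
import pandas as pd

# Invented cabin values; NaN marks a passenger with no recorded cabin.
cabins = pd.Series(['C85', np.nan, 'E46', 'B57 B59 B63 B66', np.nan])

# Casting to str turns NaN into the text 'nan', whose first character is 'n'.
via_str_cast = cabins.astype(str).str[0].tolist()
print(via_str_cast)

# Indexing through the .str accessor keeps NaN, which can then be filled directly.
decks = cabins.str[0].fillna('U0')
print(decks.tolist())
```

Both routes give the same result for this data, since no real deck letter is a lowercase 'n'. The second avoids relying on that coincidence.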

## Age
Now that all the other features are ready, I will use a machine learning algorithm to predict the missing ages.

In [13]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

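# Split the dataset into rows with and without a recorded age
knownAges = train.loc[train.Age.notnull()]
unknownAges = train.loc[train.Age.isnull()]

# Columns 0-2 are PassengerId, Survived and Age; everything from column 3 on is a feature
X = knownAges.values[:, 3:]
y = knownAges.values[:, 2]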

clf = RandomForestRegressor()

In [69]:
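params = {
    'n_estimators':[10,100,1000,2000],
    # 'auto' (all features, for a regressor) was removed in scikit-learn 1.3;
    # on recent versions use 1.0 or None instead
    'max_features':['auto','sqrt','log2'],
    'max_depth':[3,5,10],
    'min_samples_leaf': [1, 10, 100]
}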

grid_search = GridSearchCV(clf, params, scoring='r2', cv=5, n_jobs=4)
grid_search.fit(X,y)

print("Best score: {}".format(grid_search.best_score_))
print("Best params: {}".format(grid_search.best_params_))

Best score: 0.427532117334181
Best params: {'min_samples_leaf': 10, 'max_features': 'auto', 'n_estimators': 100, 'max_depth': 10}
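
As a side note, `GridSearchCV` refits the best configuration on the full data by default and exposes it as `best_estimator_`, so the winning model can be used directly. The sketch below illustrates this on synthetic data with a deliberately tiny grid. `X_toy`, `y_toy` and the grid values are assumptions for the example only.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression problem: the target depends mostly on the first feature.
rng = np.random.default_rng(0)
X_toy = rng.uniform(size=(80, 3))
y_toy = 10 * X_toy[:, 0] + rng.normal(scale=0.1, size=80)

# Small grid so the search runs quickly.
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    {'n_estimators': [10, 20], 'max_depth': [3, 5]},
    scoring='r2',
    cv=3,
)
search.fit(X_toy, y_toy)

# With the default refit=True, the best model is already trained on all of X_toy.
best_model = search.best_estimator_
preds = best_model.predict(X_toy[:5])
print(search.best_params_, preds.shape)
```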


In [14]:
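# Refit with the best parameters reported by the grid search
clf = RandomForestRegressor(min_samples_leaf=10,max_features='auto',n_estimators=100,max_depth=10)
clf.fit(X,y)
# Predict ages for the rows where Age is missing and write them back in place
predictedAges = clf.predict(unknownAges.values[:, 3:])
train.loc[(train.Age.isnull()), 'Age'] = predictedAges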
train.isnull().sum()

PassengerId     0
Survived        0
Age             0
SibSp           0
Parch           0
Pclass_1        0
Pclass_2        0
Pclass_3        0
Title_Lady      0
Title_Master    0
Title_Miss      0
Title_Mr        0
Title_Mrs       0
Title_Rev       0
Title_Sir       0
Sex_female      0
FamilySize      0
FareBin         0
Embarked_C      0
Embarked_Q      0
Embarked_S      0
Deck_A          0
Deck_B          0
Deck_C          0
Deck_D          0
Deck_E          0
Deck_F          0
Deck_G          0
Deck_T          0
Deck_U0         0
dtype: int64
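
The imputation pattern used here (fit on rows with a known value, predict the rest, write the predictions back) can be sketched on synthetic data as follows. The `toy` frame, its column values and the choice of selecting features by name rather than by position are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic frame with random values; every fifth Age is blanked out.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'SibSp': rng.integers(0, 4, size=60),
    'Parch': rng.integers(0, 3, size=60),
    'Age': rng.uniform(1, 70, size=60),
})
toy.loc[toy.index[::5], 'Age'] = np.nan

# Selecting features by name avoids depending on column order.
feature_cols = ['SibSp', 'Parch']
known = toy[toy['Age'].notnull()]
unknown = toy[toy['Age'].isnull()]

model = RandomForestRegressor(n_estimators=20, random_state=0)
model.fit(known[feature_cols], known['Age'])

# The boolean mask and `unknown` share the same row order, so predictions line up.
toy.loc[toy['Age'].isnull(), 'Age'] = model.predict(unknown[feature_cols])
print(toy['Age'].isnull().sum())
```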

Next, I'll create bins that I'll factorize. I'll create a first bin from 0 to 15 that will include all children.

In [17]:
bins = [0,15,25,35,45,55,65,75,85]

train['AgeBins'] = pd.cut(train['Age'], bins)
train['AgeBins'] = pd.factorize(train['AgeBins'])[0]
train.drop('Age', inplace=True, axis=1)
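
To see where the bin edges fall, here is a short sketch with a few invented ages. It uses `labels=False` only to make the bin positions visible as integers.

```python
import pandas as pd

# Invented ages around the bin edges.
ages = pd.Series([0.42, 15.0, 15.5, 80.0])
bins = [0, 15, 25, 35, 45, 55, 65, 75, 85]

# Intervals are right-closed: exactly 15 still belongs to the first (children) bin.
codes = pd.cut(ages, bins, labels=False)
print(codes.tolist())
```

So the first bin covers ages greater than 0 and up to and including 15, and the next one starts just above 15.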

In [19]:
train.head()

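| | PassengerId | Survived | SibSp | Parch | Pclass_1 | Pclass_2 | Pclass_3 | Title_Lady | Title_Master | Title_Miss | ... | Deck_A | Deck_B | Deck_C | Deck_D | Deck_E | Deck_F | Deck_G | Deck_T | Deck_U0 | AgeBins |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 1 | 0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 |
| 1 | 2 | 1 | 1 | 0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1 |
| 2 | 3 | 1 | 0 | 0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 2 |
| 3 | 4 | 1 | 1 | 0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2 |
| 4 | 5 | 0 | 0 | 0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 2 |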


In [22]:
train.to_pickle('../output/features.pickle')
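
For completeness, a minimal round-trip sketch: a pickled frame can be loaded back with `pd.read_pickle`. The small `features` frame and the temporary directory are stand-ins for the real data and output path.

```python
import os
import tempfile

import pandas as pd

# Small stand-in for the engineered feature table.
features = pd.DataFrame({'Survived': [0, 1], 'AgeBins': [0, 1]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'features.pickle')
    features.to_pickle(path)
    # Dtypes and index are preserved across the round trip.
    restored = pd.read_pickle(path)

print(restored.equals(features))
```

Note that `to_pickle` does not create missing directories, so the target folder has to exist beforehand (for example via `os.makedirs(..., exist_ok=True)`).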