## Following the idea of [Ahmed Besbes](http://ahmedbesbes.com/how-to-score-08134-in-titanic-kaggle-challenge.html) with my own implementation

I tried to implement some of his feature engineering steps in titanic_tutorial, but it became too messy. Here is a new start from scratch.

### Purpose
* Reproduce his result
* Find out which steps are key to a better result

### Summary
* Data exploration
  * Using a violin plot, one can check whether some features are more important than others.
  * For the Titanic data set, the most important ones are gender and age. Using gender alone can reach about 75% accuracy.
* Data engineering tricks
  * Missing values need to be filled in (imputed).
  * If possible, combine the training features and the test features to make the most of the available data.
  * Separate methods were written to process features. Note that each returns a new column, which is more flexible than operating on the original dataframe.
  * _Important_: use dummies to convert multiclass categorical values to binary numerical columns. As a result, more features are added.
  * _Important_: dimension reduction using SelectFromModel. This is a meta selector that chooses a subset of features based on the importances calculated from a base model.
  * Scale of features: for tree-based models it shouldn't matter, but it shouldn't hurt either to use normalized features. Ahmed normalized by dividing by the maximum. My experience is that switching to min-max scaling helps with the result.
* Fitting
  * Use RandomForest. It is critical to adjust the parameters. It also works without feature reduction using SelectFromModel, although slightly worse.
  * When using the full feature set, adjusting min_samples_leaf helps. But it doesn't help with the reduced feature space.
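
The difference between the two scalings mentioned in the summary can be seen on a toy column. This is a minimal sketch: the helper names `scale_by_max` and `scale_min_max` and the toy values are assumptions for illustration and are not part of the notebook.

```python
import pandas as pd

def scale_by_max(col):
    """Divide by the column maximum (the normalization described for Ahmed's post)."""
    return col / col.max()

def scale_min_max(col):
    """Map the column linearly onto the interval [0, 1]."""
    return (col - col.min()) / (col.max() - col.min())

# Toy values in the spirit of Pclass, which ranges from 1 to 3.
pclass = pd.Series([1, 2, 3, 3], name='Pclass')

by_max = scale_by_max(pclass)        # the smallest value maps to 1/3, not to 0
by_min_max = scale_min_max(pclass)   # the smallest value maps to 0

print(by_max.tolist())
print(by_min_max.tolist())
```

Both helpers return a new column and leave the input untouched, which matches the "returns a new column" style described above.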

In [1]:
import sys 
import os
sys.path.append(os.path.abspath("/home/yu/MachineLearning/"))
from machine_learning_utility import *

In [2]:
# get the training and test data and combine them for feature engineering.
# * Using the features from the test set makes the most of the available information.
# * The test features need to be processed in the same way as the training features anyway.
train_df = pd.read_csv('train.csv', header=0)
train_df['set'] = 'train' # category to indicate training data
train_df.drop('Survived', axis=1, inplace=True) # for feature engineering, we don't need the result
test_df = pd.read_csv('test.csv', header=0)
test_df['set'] = 'test' # category to indicate testing data
combined_df = pd.concat([train_df, test_df])
combined_df.reset_index(inplace=True) # after pd.concat the index is from train_df and test_df and needs to be reset
combined_df.head()

| | index | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | set |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S | train |
| 1 | 1 | 2 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | train |
| 2 | 2 | 3 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S | train |
| 3 | 3 | 4 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S | train |
| 4 | 4 | 5 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S | train |


In [3]:
combined_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 13 columns):
index          1309 non-null int64
PassengerId    1309 non-null int64
Pclass         1309 non-null int64
Name           1309 non-null object
Sex            1309 non-null object
Age            1046 non-null float64
SibSp          1309 non-null int64
Parch          1309 non-null int64
Ticket         1309 non-null object
Fare           1308 non-null float64
Cabin          295 non-null object
Embarked       1307 non-null object
set            1309 non-null object
dtypes: float64(2), int64(5), object(6)
memory usage: 102.3+ KB


### We need to fill the following
* Age
* Fare (a single missing value)
* Cabin
* Embarked
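
The list above was read off the `info()` output by eye. As a small sketch, the same columns can be listed programmatically with `isnull().sum()`; the toy dataframe below is an assumption standing in for `combined_df`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for combined_df (assumption: not the real Titanic data).
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0],
    'Fare': [7.25, 71.28, np.nan],
    'Cabin': [np.nan, 'C85', np.nan],
    'Sex': ['male', 'female', 'female'],
})

# Count missing values per column and keep only the columns that have any.
missing = toy.isnull().sum()
missing = missing[missing > 0]
print(missing)
```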

In [4]:
# parse the title from each name and
# map it to a smaller set of aggregated titles
import re
def get_title(title):
    Title_Dictionary = {
                        "Capt":       "Officer",
                        "Col":        "Officer",
                        "Major":      "Officer",
                        "Jonkheer":   "Royalty",
                        "Don":        "Royalty",
                        "Sir" :       "Royalty",
                        "Dr":         "Officer",
                        "Rev":        "Officer",
                        "the Countess":"Royalty",
                        "Dona":       "Royalty",
                        "Mme":        "Mrs",
                        "Mlle":       "Miss",
                        "Ms":         "Mrs",
                        "Mr" :        "Mr",
                        "Mrs" :       "Mrs",
                        "Miss" :      "Miss",
                        "Master" :    "Master",
                        "Lady" :      "Royalty"

                        }
    match = re.search(r',\s([a-zA-Z\s]+)\.', title)  # raw string avoids invalid-escape warnings
    if match:
        return Title_Dictionary[match.group(1)]
    else:
        return None

def add_title(df):
    return df['Name'].map(lambda x: get_title(x))
    

combined_df['Title'] = add_title(combined_df)
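
To see what the regular expression captures before the dictionary lookup, here is a small standalone sketch. The helper name `extract_raw_title` and the sample list are assumptions for illustration; the pattern is the one used in the cell above.

```python
import re

# Same pattern as in the cell above: the text between ", " and the next "."
TITLE_PATTERN = re.compile(r',\s([a-zA-Z\s]+)\.')

def extract_raw_title(name):
    """Return the raw title found in a passenger name, or None if there is no match."""
    match = TITLE_PATTERN.search(name)
    return match.group(1) if match else None

names = [
    'Braund, Mr. Owen Harris',
    'Heikkinen, Miss. Laina',
    'Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)',
    'No comma or period here',
]
titles = [extract_raw_title(n) for n in names]
print(titles)
```

A raw title that is missing from the dictionary would raise a KeyError in the lookup, so using the dictionary's `get` method with a default value may be a safer choice on other data sets.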

In [5]:
# more detailed imputation of age:
# fill unknown values with the median of the group defined by other features.

def fill_unknown(df, unknown, features):
    # median of the `unknown` column for every combination of `features`
    median_unknown = df.groupby(features)[unknown].median()
    for i, r in df.iterrows(): # loop through the rows
        if np.isnan(r[unknown]):
            m = median_unknown
            for f in features:  # narrow down to the median of this row's group, one feature at a time
                m = m[r[f]]
            df.loc[i, 'filled'] = m
        else:
            df.loc[i, 'filled'] = r[unknown]
    # note: 'filled' stays in df as a scratch column and is dropped before fitting
    return df['filled']

features = ['Pclass', 'Sex', 'Title']
combined_df['AgeFilled'] = fill_unknown(combined_df, 'Age', features)
features = ['Pclass']
combined_df['FareFilled'] = fill_unknown(combined_df, 'Fare', features)
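
The row-by-row loop above could likely be replaced by a vectorized version built on `groupby(...).transform('median')`. This is a sketch under the assumption that every group has at least one known value; the helper name `fill_with_group_median` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def fill_with_group_median(df, column, group_cols):
    """Return a new column where NaNs are replaced by the median of the row's group."""
    # transform('median') gives, for every row, the median of its own group
    group_median = df.groupby(group_cols)[column].transform('median')
    return df[column].fillna(group_median)

# Toy stand-in for combined_df (assumption: not the real Titanic data).
toy = pd.DataFrame({
    'Pclass': [1, 1, 1, 3, 3, 3],
    'Sex': ['male', 'male', 'male', 'female', 'female', 'female'],
    'Age': [40.0, 50.0, np.nan, 20.0, np.nan, 30.0],
})

age_filled = fill_with_group_median(toy, 'Age', ['Pclass', 'Sex'])
print(age_filled.tolist())
```

Unlike the loop, this version does not write a scratch column into the dataframe.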


In [6]:
# fill missing Embarked values with the most frequent port
combined_df['EmbardedFilled'] = combined_df['Embarked'].fillna(combined_df['Embarked'].dropna().mode()[0])



In [7]:
# process cabin: a known cabin is coded by its first letter; a missing cabin is NaN, so str(x)[0] yields 'n'
combined_df['CabinFilled'] = combined_df['Cabin'].map(lambda x: str(x)[0])

In [8]:
# convert Sex to numerical value
combined_df['Sex_n'] = combined_df['Sex'].map({'male': 1, 'female': 0})

In [9]:
# process tickets


def add_prefix(df, features):
    # extract the leading alphabetic prefix from each feature; if there is none, use 'XXX'
    for f in features:
        for (i, r) in df.iterrows():
            x = r[f]
            x = re.sub(r'\W+', '', x) # remove punctuation and whitespace (non-word characters); digits are kept
            matched = re.match(r'[a-zA-Z]+', x)
            if matched:
                df.loc[i, f+'_pre'] = matched.group(0)
            else:
                df.loc[i, f+'_pre'] = 'XXX'
    return df[[f+'_pre' for f in features]]

combined_df['Ticket_pre'] = add_prefix(combined_df, ['Ticket'])
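
The same ticket-prefix extraction can probably be written without an explicit loop by using the pandas string methods. This is a sketch; the helper name `ticket_prefix` is an assumption, and the sample tickets are taken from the `head()` output earlier in this notebook.

```python
import pandas as pd

def ticket_prefix(tickets, default='XXX'):
    """Return the leading alphabetic prefix of each ticket, or `default` if there is none."""
    # drop punctuation and whitespace, keep letters and digits
    cleaned = tickets.str.replace(r'\W+', '', regex=True)
    # capture the run of letters at the start of the cleaned string (NaN if it starts with a digit)
    prefix = cleaned.str.extract(r'^([a-zA-Z]+)', expand=False)
    return prefix.fillna(default)

tickets = pd.Series(['A/5 21171', 'PC 17599', 'STON/O2. 3101282', '113803'])
prefixes = ticket_prefix(tickets)
print(prefixes.tolist())
```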

In [10]:
# process family size
combined_df['FamilySize'] = combined_df['Parch'] + combined_df['SibSp'] + 1
combined_df.loc[combined_df.FamilySize == 1, 'FamilyType'] = 'S'
combined_df.loc[(combined_df.FamilySize >= 2) & (combined_df.FamilySize <= 4), 'FamilyType'] = 'M'
combined_df.loc[(combined_df.FamilySize > 4), 'FamilyType'] = 'L'
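
The three `.loc` assignments above bin the family size into S (1), M (2 to 4) and L (more than 4). A possible alternative is `pd.cut`, sketched here on toy family sizes that are assumptions for illustration.

```python
import pandas as pd

# Toy family sizes (assumption: not taken from the data set).
family_size = pd.Series([1, 2, 4, 5, 11])

# Right-inclusive bins: (0, 1] -> S, (1, 4] -> M, (4, inf] -> L
family_type = pd.cut(family_size,
                     bins=[0, 1, 4, float('inf')],
                     labels=['S', 'M', 'L'])
print(family_type.astype(str).tolist())
```

Note that `pd.cut` returns a categorical column, so it may need `astype(str)` before it is treated like the object column created above.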

In [11]:
combined_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 23 columns):
index             1309 non-null int64
PassengerId       1309 non-null int64
Pclass            1309 non-null int64
Name              1309 non-null object
Sex               1309 non-null object
Age               1046 non-null float64
SibSp             1309 non-null int64
Parch             1309 non-null int64
Ticket            1309 non-null object
Fare              1308 non-null float64
Cabin             295 non-null object
Embarked          1307 non-null object
set               1309 non-null object
Title             1309 non-null object
filled            1309 non-null float64
AgeFilled         1309 non-null float64
FareFilled        1309 non-null float64
EmbardedFilled    1309 non-null object
CabinFilled       1309 non-null object
Sex_n             1309 non-null int64
Ticket_pre        1309 non-null object
FamilySize        1309 non-null int64
FamilyType        1309 non-null object

In [12]:
# Convert categorical data to a collection of binary data
def add_dummies(df, categories):
    # return one binary column per category value, named <column>_<value>
    dummies = pd.DataFrame()
    for c in categories:
        dummies = pd.concat([dummies, pd.get_dummies(df[c], prefix=c)], axis=1)
    return dummies

combined_df = pd.concat([combined_df, add_dummies(combined_df, ['Title'])], axis=1)
combined_df = pd.concat([combined_df, add_dummies(combined_df, 
                                                  ['EmbardedFilled', 'CabinFilled', 'Pclass', 'Ticket_pre', 'FamilyType'])], axis=1)
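
For comparison, `pd.get_dummies` can also encode several columns in one call through its `columns` argument. The sketch below uses a toy dataframe (an assumption, not the Titanic data). Unlike the helper above, this call removes the original categorical columns from the result.

```python
import pandas as pd

# Toy stand-in for combined_df (assumption: not the real Titanic data).
toy = pd.DataFrame({
    'Title': ['Mr', 'Mrs', 'Mr'],
    'FamilyType': ['S', 'M', 'M'],
    'AgeFilled': [22.0, 38.0, 35.0],
})

# Each listed column is replaced by one binary column per category value.
encoded = pd.get_dummies(toy, columns=['Title', 'FamilyType'])
print(list(encoded.columns))
```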

In [13]:
# scale all numerical features by dividing by the column maximum
# PassengerId is set aside so that it is not scaled, then re-attached at the end
# (note that it therefore stays in the feature matrix used for fitting)
scaled_df = combined_df.drop(['index', 'PassengerId'], axis=1)
features = scaled_df.columns
for f in features:
    if scaled_df[f].dtypes != object:
        scaled_df[f] = scaled_df[f]/scaled_df[f].max()
scaled_df = pd.concat([scaled_df, combined_df['PassengerId']], axis=1)

In [14]:
scaled_df.head()

| | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | ... | Ticket_pre_STONO | Ticket_pre_STONOQ | Ticket_pre_SWPP | Ticket_pre_WC | Ticket_pre_WEP | Ticket_pre_XXX | FamilyType_L | FamilyType_M | FamilyType_S | PassengerId |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | Braund, Mr. Owen Harris | male | 0.275 | 0.125 | 0.0 | A/5 21171 | 0.014151 | NaN | S | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1 |
| 1 | 0.333333 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 0.475 | 0.125 | 0.0 | PC 17599 | 0.139136 | C85 | C | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 2 |
| 2 | 1.0 | Heikkinen, Miss. Laina | female | 0.325 | 0.0 | 0.0 | STON/O2. 3101282 | 0.015469 | NaN | S | ... | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 3 |
| 3 | 0.333333 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 0.4375 | 0.125 | 0.0 | 113803 | 0.103644 | C123 | S | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 4 |
| 4 | 1.0 | Allen, Mr. William Henry | male | 0.4375 | 0.0 | 0.0 | 373450 | 0.015713 | NaN | S | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 5 |


### The following is essentially copied from Ahmed's post


In [15]:
# note: sklearn.cross_validation and sklearn.grid_search were removed in scikit-learn 0.20;
# newer versions provide these classes and functions in sklearn.model_selection
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.cross_validation import StratifiedKFold
from sklearn.grid_search import GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.cross_validation import cross_val_score
def compute_score(clf, X, y,scoring='accuracy'):
    xval = cross_val_score(clf, X, y, cv = 5,scoring=scoring)
    return np.mean(xval)

def recover_train_test_target(df):
    train0 = pd.read_csv('train.csv')
    targets = train0.Survived
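    # .ix was removed from pandas; .loc slices by label and includes both end points,
    # so with the reset RangeIndex this selects the same rows
    train = df.loc[0:890]
    test = df.loc[891:]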
    
    return train,test,targets
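
On a recent scikit-learn the cross-validation helper would be imported from `sklearn.model_selection`. The sketch below shows one possible equivalent of `compute_score`; the name `compute_score_modern` and the synthetic data are assumptions for illustration and were not part of the original run.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

def compute_score_modern(clf, X, y, scoring='accuracy'):
    """Mean 5-fold stratified cross-validation score of clf on (X, y)."""
    cv = StratifiedKFold(n_splits=5)
    xval = cross_val_score(clf, X, y, cv=cv, scoring=scoring)
    return np.mean(xval)

# Synthetic stand-in for the Titanic features and targets (assumption).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

clf = RandomForestClassifier(n_estimators=50, random_state=0)
score = compute_score_modern(clf, X, y)
print('Mean accuracy: {:.3f}'.format(score))
```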



In [16]:
scaled_df.drop(['Name', 'Sex', 'Age', 'Ticket', 'Fare', 'Cabin', 'Embarked', 'set', 'Title', 'filled', 
        'EmbardedFilled', 'CabinFilled', 'Ticket_pre', 'Pclass', 'FamilyType'], axis=1, inplace=True)

In [17]:
train,test,targets = recover_train_test_target(scaled_df)
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel
clf = ExtraTreesClassifier(n_estimators=200)
clf = clf.fit(train, targets)

In [27]:
features = pd.DataFrame()
features['feature'] = train.columns
features['importance'] = clf.feature_importances_
features.sort_values(by=['importance'],ascending=False)

| | feature | importance |
|---|---|---|
| 63 | PassengerId | 0.128301 |
| 2 | AgeFilled | 0.119741 |
| 3 | FareFilled | 0.114454 |
| 4 | Sex_n | 0.107414 |
| 8 | Title_Mr | 0.103413 |
| 7 | Title_Miss | 0.042085 |
| 26 | Pclass_3 | 0.039438 |
| 9 | Title_Mrs | 0.038916 |
| 23 | CabinFilled_n | 0.028687 |
| 24 | Pclass_1 | 0.022119 |


In [19]:
model = SelectFromModel(clf, prefit=True)
train_new = model.transform(train)
train_new.shape
test_new = model.transform(test)
test_new.shape

(418, 14)

In [20]:
forest = RandomForestClassifier(max_features='sqrt')

parameter_grid = {
                 'max_depth' : [4,5,6,7,8],
                 'n_estimators': [200,210,240,250],
                 'criterion': ['gini','entropy']
                 }

cross_validation = StratifiedKFold(targets, n_folds=5)

grid_search = GridSearchCV(forest,
                           param_grid=parameter_grid,
                           cv=cross_validation)

grid_search.fit(train_new, targets)

print('Best score: {}'.format(grid_search.best_score_))
print('Best parameters: {}'.format(grid_search.best_params_))


Best score: 0.832772166105
Best parameters: {'n_estimators': 250, 'criterion': 'gini', 'max_depth': 4}
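
With `sklearn.model_selection`, the grid search above would be written slightly differently: `StratifiedKFold` takes `n_splits` and no longer receives the targets in its constructor. The sketch below runs on synthetic data with a deliberately small grid, both of which are assumptions chosen to keep the example fast; it does not reproduce the scores reported in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in for train_new / targets (assumption: not the Titanic data).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

forest = RandomForestClassifier(max_features='sqrt', random_state=0)

parameter_grid = {
    'max_depth': [4, 6],
    'n_estimators': [20],
    'criterion': ['gini', 'entropy'],
}

# The labels are passed to the splitter later, when GridSearchCV calls split().
cross_validation = StratifiedKFold(n_splits=5)

grid_search = GridSearchCV(forest,
                           param_grid=parameter_grid,
                           cv=cross_validation)
grid_search.fit(X, y)

print('Best score: {}'.format(grid_search.best_score_))
print('Best parameters: {}'.format(grid_search.best_params_))
```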


In [21]:
output = grid_search.predict(test_new).astype(int)
df_output = pd.DataFrame()
df_output['PassengerId'] = test['PassengerId']
df_output['Survived'] = output
df_output[['PassengerId','Survived']].to_csv('ahmed.csv',index=False)
# leader board: 0.79426. An improvement over my earlier attempt by about 0.02

### TODO
* Learn about SelectFromModel
* Try to write some utility methods
* Modify some obvious weak points (such as the scaling of Fare)
* Try to figure out which step is more important

#### SelectFromModel
Features are selected based on the importance attribute of an estimator (feature_importances_ for tree ensembles). In the script from Ahmed, SelectFromModel takes an already fitted ExtraTreesClassifier as input (prefit=True). It can also take an unfitted classifier and fit it when the fit method of SelectFromModel is called.
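
The two ways of using SelectFromModel can be sketched side by side. The synthetic data and the variable names below are assumptions for illustration. With the default threshold, features whose importance is at least the mean importance are kept.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic data with a few informative features (assumption: not the Titanic data).
X, y = make_classification(n_samples=300, n_features=20, n_informative=4,
                           random_state=0)

# Variant 1: fit the base model first, then wrap it with prefit=True.
fitted = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X, y)
selector_prefit = SelectFromModel(fitted, prefit=True)
X_prefit = selector_prefit.transform(X)

# Variant 2: pass an unfitted model and let SelectFromModel fit it.
selector = SelectFromModel(ExtraTreesClassifier(n_estimators=100, random_state=0))
X_fit = selector.fit(X, y).transform(X)

print(X.shape, X_prefit.shape, X_fit.shape)
print(selector.get_support())  # boolean mask of the selected features
```

The second variant is convenient inside a pipeline, because the selector is then refit on each training fold.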

In [25]:
# scale all numerical features by min_max (step 1/3)
# PassengerId is set aside so that it is not scaled, then re-attached at the end
minmaxscaled_df = combined_df.drop(['index', 'PassengerId'], axis=1)
features = minmaxscaled_df.columns
for f in features:
    if minmaxscaled_df[f].dtypes != object:
        minmaxscaled_df[f] = (minmaxscaled_df[f]-minmaxscaled_df[f].min())/(minmaxscaled_df[f].max()
                                                                            -minmaxscaled_df[f].min())
minmaxscaled_df = pd.concat([minmaxscaled_df, combined_df['PassengerId']], axis=1)
minmaxscaled_df.drop(['Name', 'Sex', 'Age', 'Ticket', 'Fare', 'Cabin', 'Embarked', 'set', 'Title', 'filled', 
        'EmbardedFilled', 'CabinFilled', 'Ticket_pre', 'Pclass', 'FamilyType'], axis=1, inplace=True)

In [26]:
# Extract features (step 2/3)
train,test,targets = recover_train_test_target(minmaxscaled_df)
clf = ExtraTreesClassifier(n_estimators=200)
clf = clf.fit(train, targets)
model = SelectFromModel(clf, prefit=True)
train_new = model.transform(train)
train_new.shape
test_new = model.transform(test)
test_new.shape

(418, 14)

In [36]:
# fit with random forest (step 3/3)
forest = RandomForestClassifier(max_features='sqrt')

parameter_grid = {
                 'max_depth' : [4,5,6,7,8],
                 'n_estimators': [300],
                 'criterion': ['gini','entropy']
                 }

cross_validation = StratifiedKFold(targets, n_folds=5)

grid_search = GridSearchCV(forest,
                           param_grid=parameter_grid,
                           cv=cross_validation)

grid_search.fit(train_new, targets)

print('Best score: {}'.format(grid_search.best_score_))
print('Best parameters: {}'.format(grid_search.best_params_))
output = grid_search.predict(test_new).astype(int)
df_output = pd.DataFrame()
df_output['PassengerId'] = test['PassengerId']
df_output['Survived'] = output
df_output[['PassengerId','Survived']].to_csv('ahmed.csv',index=False)
# leader board: 0.80861, better than the original method by Ahmed

Best score: 0.83164983165
Best parameters: {'n_estimators': 300, 'criterion': 'entropy', 'max_depth': 4}


In [37]:
# fit with random forest with min_samples_leaf(step 3/3)
forest = RandomForestClassifier(max_features='sqrt')

parameter_grid = {
                 'max_depth' : [4,5,6,7,8],
                 'n_estimators': [160,180,200,210],
                 'criterion': ['gini'],
                 'min_samples_leaf': [10, 20, 30, 40]
                 }

cross_validation = StratifiedKFold(targets, n_folds=5)

grid_search = GridSearchCV(forest,
                           param_grid=parameter_grid,
                           cv=cross_validation)

grid_search.fit(train_new, targets)

print('Best score: {}'.format(grid_search.best_score_))
print('Best parameters: {}'.format(grid_search.best_params_))
output = grid_search.predict(test_new).astype(int)
df_output = pd.DataFrame()
df_output['PassengerId'] = test['PassengerId']
df_output['Survived'] = output
df_output[['PassengerId','Survived']].to_csv('ahmed.csv',index=False)
# leader board: 0.78469, not as good as above. Why doesn't min_samples_leaf make it better?

Best score: 0.83164983165
Best parameters: {'n_estimators': 180, 'criterion': 'gini', 'max_depth': 5, 'min_samples_leaf': 10}


----

## How about using the min-max scaled features directly, without feature reduction?

In [31]:
# attempt 1: fit with random forest (step 1/1)
train,test,targets = recover_train_test_target(minmaxscaled_df)
forest = RandomForestClassifier(max_features='sqrt')

parameter_grid = {
                 'max_depth' : [4,5,6,7,8],
                 'n_estimators': [200,210,240,250],
                 'criterion': ['gini','entropy']
                 }

cross_validation = StratifiedKFold(targets, n_folds=5)

grid_search = GridSearchCV(forest,
                           param_grid=parameter_grid,
                           cv=cross_validation)

grid_search.fit(train, targets)

print('Best score: {}'.format(grid_search.best_score_))
print('Best parameters: {}'.format(grid_search.best_params_))
output = grid_search.predict(test).astype(int)
df_output = pd.DataFrame()
df_output['PassengerId'] = test['PassengerId']
df_output['Survived'] = output
df_output[['PassengerId','Survived']].to_csv('ahmed.csv',index=False)
# leader board: 0.78947, better than the score before using Ahmed's method

Best score: 0.832772166105
Best parameters: {'n_estimators': 210, 'criterion': 'gini', 'max_depth': 8}


In [32]:
# attempt 2: fit with random forest (step 1/1), adjust min_samples_leaf
train,test,targets = recover_train_test_target(minmaxscaled_df)
forest = RandomForestClassifier(max_features='sqrt')

parameter_grid = {
                 'max_depth' : [4,5,6,7,8],
                 'n_estimators': [200,210,240,250],
                 'criterion': ['gini'],
                 'min_samples_leaf': [10, 20, 30, 40]
                 }

cross_validation = StratifiedKFold(targets, n_folds=5)

grid_search = GridSearchCV(forest,
                           param_grid=parameter_grid,
                           cv=cross_validation)

grid_search.fit(train, targets)

print('Best score: {}'.format(grid_search.best_score_))
print('Best parameters: {}'.format(grid_search.best_params_))
output = grid_search.predict(test).astype(int)
df_output = pd.DataFrame()
df_output['PassengerId'] = test['PassengerId']
df_output['Survived'] = output
df_output[['PassengerId','Survived']].to_csv('ahmed.csv',index=False)
# leader board: 0.79426, better than the score before using Ahmed's method. This is also better than attempt 1.
# Does it mean that it could be used to improve Ahmed's method?

Best score: 0.828282828283
Best parameters: {'n_estimators': 200, 'criterion': 'gini', 'max_depth': 8, 'min_samples_leaf': 10}


In [33]:
# attempt 3: fit with random forest (step 1/1), fix min_samples_leaf
train,test,targets = recover_train_test_target(minmaxscaled_df)
forest = RandomForestClassifier(max_features='sqrt')

parameter_grid = {
                 'max_depth' : [4,5,6,7,8],
                 'n_estimators': [200,210,240,250],
                 'criterion': ['gini'],
                 'min_samples_leaf': [30]
                 }

cross_validation = StratifiedKFold(targets, n_folds=5)

grid_search = GridSearchCV(forest,
                           param_grid=parameter_grid,
                           cv=cross_validation)

grid_search.fit(train, targets)

print('Best score: {}'.format(grid_search.best_score_))
print('Best parameters: {}'.format(grid_search.best_params_))
output = grid_search.predict(test).astype(int)
df_output = pd.DataFrame()
df_output['PassengerId'] = test['PassengerId']
df_output['Survived'] = output
df_output[['PassengerId','Survived']].to_csv('ahmed.csv',index=False)
# leader board: 0.78947. Not as good as attempt 2.

Best score: 0.812570145903
Best parameters: {'n_estimators': 210, 'criterion': 'gini', 'max_depth': 6, 'min_samples_leaf': 30}


## Useful references
* How to tune random forest (https://www.analyticsvidhya.com/blog/2015/06/tuning-random-forest-model/)
  * n_estimators: the higher the better, but takes longer to train
  * min_samples_leaf: larger values help avoid overfitting
  * max_features: the number of features considered at each split ('sqrt' is used in this notebook)
* A detailed tutorial to achieve 0.81 on leader board (http://ahmedbesbes.com/how-to-score-08134-in-titanic-kaggle-challenge.html)
  * Use stratified k-fold
  * Use ExtraTreesClassifier + SelectFromModel to reduce dimensionality
  * Use dummies to convert a multiclass feature into several binary features
  * Proper normalization
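
One way to see why min_samples_leaf limits overfitting is to compare how large the trees grow. The sketch below runs on synthetic data (an assumption, not the Titanic features), and the helper name `mean_leaf_count` is hypothetical. A larger min_samples_leaf forces every leaf to cover more samples, so each tree ends up with fewer leaves.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (assumption: stands in for the Titanic training set).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

def mean_leaf_count(min_samples_leaf):
    """Average number of leaves per tree in a small forest."""
    forest = RandomForestClassifier(n_estimators=20,
                                    min_samples_leaf=min_samples_leaf,
                                    random_state=0).fit(X, y)
    leaves = [tree.get_n_leaves() for tree in forest.estimators_]
    return sum(leaves) / len(leaves)

leaves_unrestricted = mean_leaf_count(1)   # fully grown trees
leaves_restricted = mean_leaf_count(30)    # every leaf must hold at least 30 samples
print(leaves_unrestricted, leaves_restricted)
```

Smaller trees are less able to memorize the training set, which is consistent with the mixed leaderboard results above: the restriction can help or hurt depending on how many features the forest sees.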