### Setup

The following block copies much of the code from the previous notebooks: it loads the data, applies the cleaning steps, and rebuilds the engineered features. See the previous parts for explanations of each step.

It also drops the raw Fare and Parch columns once the features derived from them have been built.

In [1]:
#setup
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

train_df = pd.read_csv('train.csv')  
test_df = pd.read_csv('test.csv')
#combine = [train_df, test_df]
test_df_predictions = pd.read_csv('test.csv')

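# Drop columns that are not used as features.
# axis is passed by keyword: recent pandas versions reject it as a positional argument.
train_df.drop('Cabin', axis=1, inplace=True)
test_df.drop('Cabin', axis=1, inplace=True)
train_df.drop('Ticket', axis=1, inplace=True)
test_df.drop('Ticket', axis=1, inplace=True)
train_df.drop('PassengerId', axis=1, inplace=True)
test_df.drop('PassengerId', axis=1, inplace=True)
train_df.drop('Embarked', axis=1, inplace=True)
test_df.drop('Embarked', axis=1, inplace=True)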

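# Fill missing values with the column median.
# Assigning the result back avoids in-place modification of a column selection.
train_df['Age'] = train_df['Age'].fillna(train_df['Age'].median())
test_df['Age'] = test_df['Age'].fillna(test_df['Age'].median())

test_df['Fare'] = test_df['Fare'].fillna(test_df['Fare'].median())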

# One-hot encode Sex; dtype=int keeps the dummy columns as 0/1 integers.
df_sex = pd.get_dummies(train_df[['Sex']], dtype=int)
train_new = pd.concat([train_df,df_sex],axis=1)
train_new.drop('Sex',axis=1,inplace=True)
df_sex_test = pd.get_dummies(test_df[['Sex']], dtype=int)
test_new = pd.concat([test_df,df_sex_test],axis=1)
test_new.drop('Sex',axis=1,inplace=True)

def extract_title():
    train_new['Title'] = train_new['Name'].map(lambda name:name.split(',')[1].split('.')[0].strip())
    test_new['Title'] = test_new['Name'].map(lambda name:name.split(',')[1].split('.')[0].strip())

extract_title()
train_new.drop('Name',axis=1,inplace=True)
test_new.drop('Name',axis=1,inplace=True)

def extractMrs():
    married_female = []
    for i in train_new['Title']:
        if i == "Mrs":
            married_female.append(1)
        else:
            married_female.append(0)
    train_new['married_female'] = married_female
    
    married_female_test = []
    for i in test_new['Title']:
        if i == "Mrs":
            married_female_test.append(1)
        else:
            married_female_test.append(0)
    test_new['married_female'] = married_female_test  # same column name as in train_new
    
    
extractMrs()
train_new.drop('Title',axis=1,inplace=True)
test_new.drop('Title',axis=1,inplace=True)

train_new["FamilySize"] = train_new['Parch'] + train_new['SibSp'] + 1
test_new["FamilySize"] = test_new['Parch'] + test_new['SibSp'] + 1

def travelAlone():
    travelAlonePassenger = []
    for i in train_new["FamilySize"]:
        if i == 1:
            travelAlonePassenger.append(1)
        else:
            travelAlonePassenger.append(0)
    train_new["travelAlonePassenger"] = travelAlonePassenger
    
    travelAlonePassenger_test = []
    for i in test_new["FamilySize"]:
        if i == 1:
            travelAlonePassenger_test.append(1)
        else:
            travelAlonePassenger_test.append(0)
    test_new["travelAlonePassenger"] = travelAlonePassenger_test

def bigFamily():
    bigFamilyPassenger = []
    for i in train_new["FamilySize"]:
        if i > 3:
            bigFamilyPassenger.append(1)
        else:
            bigFamilyPassenger.append(0)
    train_new["bigFamilyPassenger"] = bigFamilyPassenger
    
    bigFamilyPassenger_test = []
    for i in test_new["FamilySize"]:
        if i > 3:
            bigFamilyPassenger_test.append(1)
        else:
            bigFamilyPassenger_test.append(0)
    test_new["bigFamilyPassenger"] = bigFamilyPassenger_test
    
travelAlone()
bigFamily()

train_new.drop('Parch', axis=1, inplace=True)
test_new.drop('Parch', axis=1, inplace=True)

def ticketFareThreshold():
    ticketFareThreshold = []
    for i in train_new["Fare"]:
        if i > 20:
            ticketFareThreshold.append(1)
        else:
            ticketFareThreshold.append(0)
    train_new["ticketFareThreshold"] = ticketFareThreshold
    
    ticketFareThreshold_test = []
    for i in test_new["Fare"]:
        if i > 20:
            ticketFareThreshold_test.append(1)
        else:
            ticketFareThreshold_test.append(0)
    test_new["ticketFareThreshold"] = ticketFareThreshold_test

ticketFareThreshold()
train_new.drop('Fare', axis=1, inplace=True)
test_new.drop('Fare', axis=1, inplace=True)

bins = [0,18,45,100]
group_names = ['young','middle','old']
train_new["ageCategories"] = pd.cut(train_new['Age'], bins, labels=group_names)
ageCategoryIntegerConversion = {'young':1, 'middle':2, 'old':3}
train_new['ageGroup'] = train_new['ageCategories'].map(ageCategoryIntegerConversion)
train_new.drop('Age', axis=1, inplace=True)
train_new.drop('ageCategories', axis=1, inplace=True)
test_new["ageCategories"] = pd.cut(test_new['Age'], bins, labels=group_names)
test_new['ageGroup'] = test_new['ageCategories'].map(ageCategoryIntegerConversion)
test_new.drop('Age', axis=1, inplace=True)
test_new.drop('ageCategories', axis=1, inplace=True)

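The loop-based helpers in the setup cell (extractMrs, travelAlone, bigFamily, ticketFareThreshold) all follow the same pattern: test a condition per row and store 1 or 0. As a minimal sketch, assuming a small hand-written frame `toy` with illustrative values rather than the real data, the same features can be built with vectorized pandas comparisons. This is one possible alternative, not the approach used in the previous notebooks.

```python
import pandas as pd

# Toy stand-in for a few passenger rows (illustrative values, not the real data).
toy = pd.DataFrame({
    "Name": ["Braund, Mr. Owen Harris",
             "Cumings, Mrs. John Bradley",
             "Heikkinen, Miss. Laina"],
    "SibSp": [1, 1, 0],
    "Parch": [0, 0, 0],
    "Fare": [7.25, 71.28, 7.93],
    "Age": [22.0, 38.0, 50.0],
})

# Title: the text between the comma and the first period, as in extract_title().
title = toy["Name"].str.split(",").str[1].str.split(".").str[0].str.strip()

# A boolean comparison cast to int replaces each append-in-a-loop helper.
toy["married_female"] = (title == "Mrs").astype(int)
toy["FamilySize"] = toy["Parch"] + toy["SibSp"] + 1
toy["travelAlonePassenger"] = (toy["FamilySize"] == 1).astype(int)
toy["bigFamilyPassenger"] = (toy["FamilySize"] > 3).astype(int)
toy["ticketFareThreshold"] = (toy["Fare"] > 20).astype(int)

# Same bins as the setup cell: (0, 18] -> 1, (18, 45] -> 2, (45, 100] -> 3.
toy["ageGroup"] = pd.cut(toy["Age"], bins=[0, 18, 45, 100], labels=[1, 2, 3]).astype(int)

print(toy.drop(columns=["Name"]))
```

Because the same expressions can be applied to any frame, wrapping them in a function that takes a DataFrame would let the train and test sets share one code path.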



### Set up training and test sets

In [2]:
X_train = train_new.drop("Survived", axis=1)
Y_train = train_new["Survived"]

X_test = test_new
X_train.shape, Y_train.shape, X_test.shape

((891, 9), (891,), (418, 9))
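Matching column counts do not guarantee that the two frames hold the same features under the same names and in the same order. The sketch below shows one way to check this, using hypothetical miniature frames (`X_train_demo`, `X_test_demo`) in which a `_test` suffix has slipped into one column name.

```python
import pandas as pd

# Hypothetical miniature feature frames; only the column names matter here.
X_train_demo = pd.DataFrame({"Pclass": [3, 1], "married_female": [0, 1]})
X_test_demo = pd.DataFrame({"Pclass": [2, 3], "married_female_test": [0, 0]})

# Columns present in one frame but not the other.
mismatched = set(X_train_demo.columns) ^ set(X_test_demo.columns)
print(sorted(mismatched))

# Renaming the stray column brings the two frames back in line.
X_test_demo = X_test_demo.rename(columns={"married_female_test": "married_female"})
assert list(X_train_demo.columns) == list(X_test_demo.columns)
```

Running the same comparison on X_train and X_test before fitting would catch this kind of naming slip early.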

In [3]:
X_train.head()

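|   | Pclass | SibSp | Sex_female | Sex_male | married_female | FamilySize | travelAlonePassenger | ticketFareThreshold | ageGroup |
|---|--------|-------|------------|----------|----------------|------------|----------------------|---------------------|----------|
| 0 | 3 | 1 | 0 | 1 | 0 | 2 | 0 | 0 | 2 |
| 1 | 1 | 1 | 1 | 0 | 1 | 2 | 0 | 1 | 2 |
| 2 | 3 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 2 |
| 3 | 1 | 1 | 1 | 0 | 1 | 2 | 0 | 1 | 2 |
| 4 | 3 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 2 |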


### Implementing a random forest model

We'll fit a random forest classifier and print its feature importances. Each importance tells us how much the corresponding feature contributed to the model's classification of passengers.

In [4]:
# Fit a 100-tree random forest and predict survival for the test passengers.
randomForestModel = RandomForestClassifier(n_estimators = 100)
randomForestModel.fit(X_train,Y_train)
Y_predictions = randomForestModel.predict(X_test)

# Pair each feature name with its importance.
features = X_train.columns
list(zip(features, randomForestModel.feature_importances_))

[('Pclass', 0.16712115885129136),
 ('SibSp', 0.065797359491162499),
 ('Sex_female', 0.22776380470321539),
 ('Sex_male', 0.2257526689994763),
 ('married_female', 0.038018979676196302),
 ('FamilySize', 0.12388216555856836),
 ('travelAlonePassenger', 0.015122497999734381),
 ('ticketFareThreshold', 0.057760043103956729),
 ('ageGroup', 0.078781321616398572)]
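A list of tuples is easier to read once it is ranked. The sketch below, which uses synthetic data from `make_classification` and hypothetical feature names `f0` to `f4` rather than the Titanic features, puts the importances into a pandas Series and sorts them. It also shows that scikit-learn's impurity-based importances are normalized to sum to 1, so each value can be read as a share of the total.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; the feature names f0..f4 are hypothetical.
X, y = make_classification(n_samples=200, n_features=5, n_informative=3,
                           random_state=0)
X = pd.DataFrame(X, columns=[f"f{i}" for i in range(5)])

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X, y)

# Pair each importance with its column name and rank from most to least important.
importances = pd.Series(model.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)

# Impurity-based importances are normalized, so they sum to 1.
print(round(importances.sum(), 6))
```

When reading the Titanic importances, keep in mind that Sex_female and Sex_male carry the same information (one is always the complement of the other), so the importance of passenger sex is split between the two columns.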

In [5]:
# Accuracy on the training data, as a percentage.
scoreRandomForestModel = round(randomForestModel.score(X_train,Y_train)*100,2)
scoreRandomForestModel

84.510000000000005
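This score is computed on the same rows the model was trained on, so it tends to overstate how well the model will do on unseen passengers. One common way to get a less optimistic estimate is k-fold cross-validation. The sketch below uses synthetic data and scikit-learn's `cross_val_score`; it illustrates the technique and is not part of the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data with 9 features (not the Titanic data).
X, y = make_classification(n_samples=300, n_features=9, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0)

# Accuracy on the data the model was fitted to.
train_accuracy = model.fit(X, y).score(X, y)

# 5-fold cross-validation: each fold is scored by a model that never saw it.
cv_scores = cross_val_score(model, X, y, cv=5)  # accuracy is the default for classifiers

print(f"training accuracy: {train_accuracy:.3f}")
print(f"5-fold CV accuracy: {cv_scores.mean():.3f}")
```

With the notebook's variables, the equivalent call would be `cross_val_score(randomForestModel, X_train, Y_train, cv=5)`.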

### Creating a Kaggle submission file

In [6]:
kaggleSubmissionFile = pd.DataFrame( { "PassengerId": test_df_predictions["PassengerId"],
                                   "Survived": Y_predictions})
kaggleSubmissionFile.to_csv('submission.csv',index=False)
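Before uploading, it can be worth confirming that the file has the two-column layout built above and one prediction per passenger. The sketch below does this round trip through an in-memory buffer; the IDs and predictions are hypothetical stand-ins for `test_df_predictions["PassengerId"]` and `Y_predictions`.

```python
import io
import pandas as pd

# Hypothetical IDs and predictions standing in for the real test set.
passenger_ids = pd.Series([892, 893, 894])
predictions = [0, 1, 0]

submission = pd.DataFrame({"PassengerId": passenger_ids, "Survived": predictions})

# Write to an in-memory buffer instead of disk, then read it back.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
buffer.seek(0)
round_trip = pd.read_csv(buffer)

# The file should have exactly these columns, binary labels, and one row per ID.
assert list(round_trip.columns) == ["PassengerId", "Survived"]
assert round_trip["Survived"].isin([0, 1]).all()
assert len(round_trip) == len(passenger_ids)
print(buffer.getvalue())
```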

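### Kaggle submission results

The submission above receives a score of 0.77990 on Kaggle. That is below the 84.51% measured in the previous section, which is expected: that figure is accuracy on the training data the model was fitted to, whereas Kaggle scores predictions for passengers the model has not seen.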