# Source of Inspiration
In this notebook, I'm going to try to take inspiration from someone else's model implementation. I've been curious about random forests for quite a while, so I looked for scripts that implemented them with this dataset specifically.

I'm going to be using this [Random Forest Script](https://www.kaggle.com/amoyakd/titanic/randomforest-method-v1-0) that I found when I was looking through the scripts section of the Kaggle competition. 


In this script, the author implements a random forest model in R, but I'm going to try to replicate this work in Python.

Additionally, I found that they did some really interesting things in terms of filling in missing data and creating new features, so I wanted to try that too. 

In [154]:
import pandas as pd
import re
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

First, they read in the data.

In [131]:
train = pd.read_csv('./Data/train.csv')
test = pd.read_csv('./Data/test.csv')

Next comes cleaning the data. Here the author does much the same as we did in the previous tutorial for the Sex, Embarked, and Fare columns: Sex and Embarked are recoded as integers, missing Embarked values are filled with "S", and the one missing fare is filled with the median.

In [132]:
# Recode Sex data
test.loc[test["Sex"] == "male", "Sex"] = 0
test.loc[test["Sex"] == "female", "Sex"] = 1
train.loc[train["Sex"] == "male", "Sex"] = 0
train.loc[train["Sex"] == "female", "Sex"] = 1

# Recode Embarked Data
test["Embarked"] = test["Embarked"].fillna("S")
test.loc[test["Embarked"] == "S", "Embarked"] = 0
test.loc[test["Embarked"] == "C", "Embarked"] = 1
test.loc[test["Embarked"] == "Q", "Embarked"] = 2

train["Embarked"] = train["Embarked"].fillna("S")
train.loc[train["Embarked"] == "S", "Embarked"] = 0
train.loc[train["Embarked"] == "C", "Embarked"] = 1
train.loc[train["Embarked"] == "Q", "Embarked"] = 2

# Replace the one missing fare in the test set with the training median
test["Fare"] = test["Fare"].fillna(train["Fare"].median())
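The recoding above repeats the same assignments for `train` and `test`. As a minimal sketch, assuming the same integer codes, the two columns could be recoded with `Series.map` inside one helper that is applied to both frames. The toy frame and the names `SEX_CODES`, `EMBARKED_CODES`, and `recode` are made up for illustration and are not part of the original script.

```python
import pandas as pd

# Toy frame standing in for the Titanic data (values are made up for illustration).
df = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", None, "Q"],
})

# Hypothetical lookup tables mirroring the integer codes used in the cell above.
SEX_CODES = {"male": 0, "female": 1}
EMBARKED_CODES = {"S": 0, "C": 1, "Q": 2}

def recode(frame):
    """Return a copy with Sex and Embarked mapped to integer codes."""
    out = frame.copy()
    out["Sex"] = out["Sex"].map(SEX_CODES)
    # Fill missing ports with "S" first, as the notebook does, then map to codes.
    out["Embarked"] = out["Embarked"].fillna("S").map(EMBARKED_CODES)
    return out

recoded = recode(df)
print(recoded)
```

Calling `recode(train)` and `recode(test)` would then replace the two near-identical blocks, at the cost of returning new frames instead of modifying them in place.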


The script also extracts each passenger's title from their name and groups a few rare titles under 'Lady' and 'Sir'.

In [133]:
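def extractTitle(name):
    # Names look like "Surname, Title. Given names"; grab the one or two words
    # between the comma and the first period.
    title = re.findall(r', \w+\s?\w*\.', name)[0][2:-1]
    # Group rare titles into two buckets, 'Lady' and 'Sir'
    if title in ['Dona', 'Lady', 'the Countess', 'Jonkheer']:
        return 'Lady'
    elif title in ['Capt', 'Don', 'Major', 'Sir']:
        return 'Sir'

    return title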
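To see what the regular expression in `extractTitle` actually captures, here is a small sketch that applies the same pattern to a few strings in the format of the Name column. The helper `raw_title` is hypothetical: unlike `extractTitle`, it skips the Lady/Sir grouping and returns `None` instead of raising an `IndexError` when a name has no title.

```python
import re

# Same pattern as extractTitle: a comma, a space, one or two words, then a period.
TITLE_PATTERN = re.compile(r', \w+\s?\w*\.')

def raw_title(name):
    """Return the title as it appears in the name, before any grouping."""
    match = TITLE_PATTERN.search(name)
    # Strip the leading ", " and the trailing "." from the matched text.
    return match.group(0)[2:-1] if match else None

sample_names = [
    "Braund, Mr. Owen Harris",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
    "No title here",
]
sample_titles = [raw_title(n) for n in sample_names]
print(sample_titles)
```

The optional second word in the pattern is what lets a two-word title such as "the Countess" come through whole.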

In [134]:
train["Title"] = train['Name'].apply(extractTitle)
test["Title"] = test['Name'].apply(extractTitle)

# Titles seen in the training data; a title that appears only in the test set gets no column
titles = train["Title"].unique()

# One 0/1 indicator column per title, named e.g. "TitleMr"
titleColumns = []
for title in titles:
    train["Title" + title] = (train["Title"] == title).astype(int)
    test["Title" + title] = (test["Title"] == title).astype(int)
    titleColumns.append("Title" + title)
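The loop above is a hand-rolled one-hot encoding. As a sketch of an alternative, assuming the same "Title" + name column naming, `pd.get_dummies` can build the indicator columns and `reindex` can align the test set to the training columns. The two small Series below are made-up stand-ins for the real Title columns.

```python
import pandas as pd

# Made-up stand-ins for train["Title"] and test["Title"].
train_titles = pd.Series(["Mr", "Mrs", "Miss", "Mr"], name="Title")
test_titles = pd.Series(["Mr", "Dona", "Miss"], name="Title")

# prefix="Title" with prefix_sep="" gives names like "TitleMr",
# matching the naming used in the loop above.
train_dummies = pd.get_dummies(train_titles, prefix="Title", prefix_sep="", dtype=int)

# Encode the test titles, then align to the training columns: titles unseen in
# training are dropped, and training titles missing from the test set become 0.
test_dummies = (
    pd.get_dummies(test_titles, prefix="Title", prefix_sep="", dtype=int)
    .reindex(columns=train_dummies.columns, fill_value=0)
)

print(train_dummies.columns.tolist())
print(test_dummies)
```

Aligning on the training columns matters because a model fitted on the training frame expects exactly those features, in that order, at prediction time.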

Now, using the title, Pclass, Sex, SibSp, Parch, and Fare columns, we fit a linear regression to predict the missing Age values.

In [135]:
#First, we want to select all of the non-null training values so that we can run our regression
ageFeatures = titleColumns + ['Pclass', 'Sex', 'SibSp', 'Parch', 'Fare']
ageFilledTrain = train.dropna(subset = ageFeatures + ['Age'])
ageXTrain = ageFilledTrain[ageFeatures]
ageYTrain = ageFilledTrain['Age']

#Now, we fit a linear regression
regr = LinearRegression()
regr.fit(ageXTrain, ageYTrain)

#Now, we want to predict the missing ages and write them straight back into the
#original dataframes with .loc. This avoids assigning to a copy of a slice and
#keeps the rows in their original order, so nothing has to be concatenated.
nullAgeTrain = train['Age'].isnull()
train.loc[nullAgeTrain, 'Age'] = regr.predict(train.loc[nullAgeTrain, ageFeatures])
nullAgeTest = test['Age'].isnull()
test.loc[nullAgeTest, 'Age'] = regr.predict(test.loc[nullAgeTest, ageFeatures])


Next, continuing with the feature engineering, we add a family size field and mother and child flags.

In [147]:
#Family size
train['FamilySize'] = train['SibSp'] + train['Parch'] + 1
test['FamilySize'] = test['SibSp'] + test['Parch'] + 1

#Adding a mother flag
train['Mother'] = ((train['Sex'] == 1) 
                   & (train['Parch'] > 0) 
                   & (train['Age'] > 18)
                   & (train['Title'] != 'Miss'))
test['Mother'] = ((test['Sex'] == 1) 
                   & (test['Parch'] > 0) 
                   & (test['Age'] > 18)
                   & (test['Title'] != 'Miss'))

#Adding a child flag
train['Child'] = ((train['Parch'] > 0) 
                   & (train['Age'] < 18))

test['Child'] = ((test['Parch'] > 0) 
                   & (test['Age'] < 18))

The original script also creates a family ID field, a deck derived from the cabin number, and a cabin position derived from the cabin number. I haven't added these here yet.

Now, we need to split our training data into training and testing sets. Here I hold out half of the rows for testing.

In [159]:
useColumns = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare'] + titleColumns
X_train, X_test, y_train, y_test = train_test_split(train[useColumns], train['Survived'], test_size = 0.5)
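The split above is drawn at random on every run, so the accuracy reported later will vary from run to run. As a minimal sketch, assuming reproducibility is wanted, `train_test_split` accepts a `random_state` seed, and `stratify` keeps the proportion of survivors the same in both halves. The data here is synthetic and the seed value is arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 3 features, binary target with 40% positives.
rng = np.random.RandomState(0)
X = rng.rand(100, 3)
y = np.array([1] * 40 + [0] * 60)

# random_state fixes the shuffle; stratify=y preserves the class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)
print(y_train.mean(), y_test.mean())
```

With the notebook's data this would be `train_test_split(train[useColumns], train['Survived'], test_size=0.5, random_state=42, stratify=train['Survived'])`.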

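Now, we want to create our random forest model. I use a RandomForestRegressor and threshold its output at 0.5 to get a survived/did-not-survive prediction.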

In [189]:
forestModel = RandomForestRegressor()
forestModel.fit(X_train, y_train)

y_predict = forestModel.predict(X_test)
y_predict[y_predict > .5] = 1
y_predict[y_predict <=.5] = 0

# Fraction of held-out passengers whose thresholded prediction matches the label
accuracy = (y_predict == y_test.values).mean()
print(accuracy)

0.822869955157
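Since Survived is a 0/1 label, an alternative to thresholding a regressor is to use `RandomForestClassifier`, which predicts the class directly, together with `accuracy_score`. The following is only a sketch on synthetic data; the features, target, and parameter values are assumptions for illustration, and its accuracy says nothing about the Titanic result above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data: the label depends on the first two of four features.
rng = np.random.RandomState(0)
X = rng.rand(200, 4)
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

# The classifier returns 0/1 labels directly, so no manual thresholding is needed.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print(accuracy)
```

If probabilities are needed, for example to try a cutoff other than 0.5, `clf.predict_proba(X_test)[:, 1]` gives the estimated probability of the positive class.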


In [184]:
y_predict

array([ 0.,  1.,  0.,  1.,  0.,  1.,  1.,  0.,  0.,  1.,  1.,  0.,  0.,
        0.,  1.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  1.,  1.,  0.,  0.,
        0.,  1.,  1.,  0.,  1.,  0.,  0.,  0.,  1.,  0.,  1.,  1.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  1.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  1.,  1.,  0.,  1.,  1.,  1.,  0.,  1.,
        0.,  0.,  0.,  0.,  1.,  0.,  1.,  1.,  0.,  0.,  0.,  1.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  1.,  1.,  0.,  0.,  0.,  1.,  0.,
        0.,  0.,  0.,  0.,  0.,  1.,  1.,  0.,  1.,  0.,  0.,  0.,  0.,
        0.,  1.,  1.,  1.,  0.,  0.,  0.,  0.,  0.,  1.,  1.,  0.,  0.,
        0.,  0.,  0.,  1.,  0.,  1.,  1.,  0.,  1.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  1.,  1.,  0.,  0.,  0.,  0.,  1.,  0.,  1.,  1.,
        1.,  1.,  1.,  0.,  1.,  0.,  1.,  0.,  0.,  1.,  0.,  0.,  0.,
        1.,  1.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  1.,  0.,  0.,
        0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  1.,  0.,  0.,  0

Hey, that's not so bad! I'm going to use this to generate another submission to Kaggle.

In [205]:
submissionModel = RandomForestRegressor()
submissionModel.fit(train[useColumns], train['Survived'])

predictions = submissionModel.predict(test[useColumns])

predictions[predictions > .5] = 1
predictions[predictions <=.5] = 0

# Create a new dataframe with only the columns Kaggle wants from the dataset.
submission = pd.DataFrame({
        "PassengerId": test["PassengerId"],
        "Survived": predictions.astype(int)
    })

submission.to_csv("kaggle_it2_0.csv", index=False)

With this model, I scored 0.75120, which exactly matches my previous best score.