# Model Iteration 2
Building on our previous work, we now want to improve our score and analyze the dataset more carefully. I really liked including family size as a feature in the previous iteration, so I am going to keep that idea. The main steps we plan on taking are outlined below.

I am going to start with the Dataquest tutorial on improving a Kaggle submission (https://www.dataquest.io/mission/75/improving-your-submission), then proceed from there.

In [13]:
# importing modules and data
import pandas
from sklearn.linear_model import LinearRegression
# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it
from sklearn.model_selection import KFold
import numpy as np
# aliased so the cross_validation.cross_val_score call further down keeps working
from sklearn import model_selection as cross_validation
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
import re
import operator
from sklearn.feature_selection import SelectKBest, f_classif

titanic = pandas.read_csv('train.csv')

In [14]:
# fixing age column and standardizing sex and embarked column
titanic["Age"] = titanic["Age"].fillna(titanic["Age"].median())
titanic.loc[titanic["Sex"] == "male", "Sex"] = 0
titanic.loc[titanic["Sex"] == "female", "Sex"] = 1
titanic["Embarked"] = titanic["Embarked"].fillna("S")
titanic.loc[titanic["Embarked"] == "S", "Embarked"] = 0
titanic.loc[titanic["Embarked"] == "C", "Embarked"] = 1
titanic.loc[titanic["Embarked"] == "Q", "Embarked"] = 2
titanic["Family"] = titanic["SibSp"] + titanic["Parch"]

In [15]:
# interest in name length (and the wealth it might hint at) and in titles (using dataquest tutorial)
titanic["NameLength"] = titanic["Name"].apply(len)

def get_title(name):
    # raw string so that \. is passed to the regex engine unchanged
    title_search = re.search(r' ([A-Za-z]+)\.', name)
    if title_search:
        return title_search.group(1)
    return ""

titles = titanic["Name"].apply(get_title)

title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Dr": 5, "Rev": 6, "Major": 7, "Col": 7, "Mlle": 8, "Mme": 8, "Don": 9, "Lady": 10, "Countess": 10, "Jonkheer": 10, "Sir": 9, "Capt": 7, "Ms": 2}
for k,v in title_mapping.items():
    titles[titles == k] = v

titanic["Title"] = titles
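
As a quick sanity check of the title extraction above, here is a minimal sketch that runs the same regular expression on a few sample names written in the dataset's `Surname, Title. Given names` format. The shortened mapping and the `Prof` name are made up for illustration. The sketch uses a plain dictionary lookup instead of the loop above, so that a title missing from the mapping shows up as `None` rather than silently staying a string.

```python
import re

def get_title(name):
    # Same pattern as in the cell above: a word preceded by a space and followed by a period
    match = re.search(r" ([A-Za-z]+)\.", name)
    return match.group(1) if match else ""

# Shortened mapping, for illustration only
title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Countess": 10}

sample_names = [
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Heikkinen, Miss. Laina",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
    "Doe, Prof. Jane",  # hypothetical name with a title that is not in the mapping
]

extracted = [get_title(name) for name in sample_names]
# dict.get returns None for titles the mapping does not know about
codes = [title_mapping.get(title) for title in extracted]

print(extracted)
print(codes)
```

Any `None` in `codes` points at a title that still needs an entry in the mapping, which is the situation `Dona` creates in the test set.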

In [16]:
# random forest step, scored with 3-fold cross-validation
predictors = ["Pclass", "Sex", "Age", "Fare", 
              "Embarked", "Family", "Title"]

alg = RandomForestClassifier(random_state=1, n_estimators=150, 
                             min_samples_split=8, min_samples_leaf=4)

scores = cross_validation.cross_val_score(alg, titanic[predictors], 
                                          titanic["Survived"], cv=3)

print(scores.mean())

0.838383838384


Holy machine learning, Batman! The cross-validated accuracy just jumped from the 0.78 of the previous work to a solid 0.83, thanks to switching algorithms and adding the title feature.
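
To make the `cv=3` score above less of a black box, here is a simplified sketch of what k-fold cross-validation does: split the rows into k folds, hold each fold out once, and average the accuracies. It is an illustration only. The helper names and the majority-class "model" are hypothetical, and the folds here are plain contiguous slices, whereas scikit-learn stratifies the folds by class when it scores a classifier.

```python
def kfold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds whose sizes differ by at most one."""
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_val_accuracy(fit_predict, X, y, k=3):
    """Return one accuracy per fold; fit_predict(X_train, y_train, X_test) returns predictions."""
    scores = []
    for test_idx in kfold_indices(len(X), k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        preds = fit_predict([X[i] for i in train_idx],
                            [y[i] for i in train_idx],
                            [X[i] for i in test_idx])
        correct = sum(1 for p, i in zip(preds, test_idx) if p == y[i])
        scores.append(correct / float(len(test_idx)))
    return scores

def majority_class(X_train, y_train, X_test):
    # Hypothetical stand-in for a real model: always predict the most common training label
    majority = max(set(y_train), key=y_train.count)
    return [majority] * len(X_test)

# Tiny made-up dataset, not taken from train.csv
X = list(range(9))
y = [0, 0, 1, 0, 1, 0, 0, 1, 0]

scores = cross_val_accuracy(majority_class, X, y, k=3)
mean_score = sum(scores) / len(scores)
print(scores)
print(mean_score)
```

With the 891 training rows and `k=3`, each fold would hold 297 passengers, so every model in the cross-validation is trained on 594 rows and scored on the remaining 297.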

In [17]:
# code to compute information for submission
titanic_test = pandas.read_csv("test.csv")

titanic_test["Age"] = titanic_test["Age"].fillna(titanic_test["Age"].median())
titanic_test["Fare"] = titanic_test["Fare"].fillna(titanic_test["Fare"].median())
titanic_test.loc[titanic_test["Sex"] == "male", "Sex"] = 0
titanic_test.loc[titanic_test["Sex"] == "female", "Sex"] = 1
titanic_test["Embarked"] = titanic_test["Embarked"].fillna("S")
titanic_test.loc[titanic_test["Embarked"] == "S", "Embarked"] = 0
titanic_test.loc[titanic_test["Embarked"] == "C", "Embarked"] = 1
titanic_test.loc[titanic_test["Embarked"] == "Q", "Embarked"] = 2
titanic_test["Family"] = titanic_test["SibSp"] + titanic_test["Parch"]

# getting info for titles (sort of involved)
titles_test = titanic_test["Name"].apply(get_title)
title_mapping_test = {"Mr": 1, "Miss": 2, "Mrs": 3, 
                      "Master": 4, "Dr": 5, "Rev": 6, 
                      "Major": 7, "Col": 7, "Mlle": 8, 
                      "Mme": 8, "Don": 9, "Lady": 10, 
                      "Countess": 10, "Jonkheer": 10, 
                      "Sir": 9, "Capt": 7, "Ms": 2,
                      "Dona": 10}
for k,v in title_mapping_test.items():
    titles_test[titles_test == k] = v
    
titanic_test["Title"] = titles_test

predictors = ["Pclass", "Sex", "Age", "Fare", 
              "Embarked", "Family", "Title"]

alg = RandomForestClassifier(random_state=1, n_estimators=150, 
                             min_samples_split=8, min_samples_leaf=4)

alg.fit(titanic[predictors], titanic["Survived"])

predictions = alg.predict(titanic_test[predictors])

submission = pandas.DataFrame({
        "PassengerId": titanic_test["PassengerId"],
        "Survived": predictions
    })
submission.to_csv("results_improved.csv", index=False)

I chose to ignore a couple of things from the optimization tutorial and to change some others. For one, using the family name linked to the number of siblings/spouses and such did not increase the score. Along with that, you might have noticed that our `predictors` list contains `["Pclass", "Sex", "Age", "Fare", "Embarked", "Family", "Title"]`. It does not include `SibSp` and `Parch` on their own, as those columns only brought the score down as well.

With all this in mind, when I submitted the predictions to Kaggle I received a score of `0.79426`. This is somewhat lower than the cross-validation estimate of about 0.838, but in the same range. Being more rigorous with the `RandomForestClassifier` (going to `n_estimators=1000` or even higher) may help with results. I also did not change how the data is scaled. For example, taking the log of age might even out its influence without skewing the other features. Going forward, it might be best to start exploring other algorithms that favor balancing data, possibly stepping away from the `RandomForestClassifier` toward `RandomForestRegressor` or other similar ensemble methods for classifying our data. On top of that, the relation between age, class, and the other attributes of the dataset deserves more attention, without pruning data or falling back on generalized values such as a single median.
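
As a rough idea of what the log scaling mentioned above would look like, the sketch below applies `log(1 + x)` to a right-skewed list of values. The numbers are made up rather than taken from `train.csv`, and `log_scale` is a hypothetical helper, not something used in the cells above.

```python
import math

def log_scale(values):
    """Apply log(1 + x) to each value; log1p keeps zero at zero."""
    return [math.log1p(v) for v in values]

# Made-up, right-skewed values (not taken from train.csv)
raw = [0.5, 4.0, 18.0, 29.0, 45.0, 80.0]
scaled = log_scale(raw)

print([round(s, 2) for s in scaled])
print("spread before:", max(raw) - min(raw))
print("spread after: ", round(max(scaled) - min(scaled), 2))
```

The transform keeps the ordering of the values and only compresses the large ones. Since tree-based models split on thresholds, this kind of rescaling would likely matter more for the other algorithms under consideration, such as logistic regression, than for the random forest itself.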