I read through both Dataquest tutorials, as well as several more tutorials on the Kaggle forums. All of them seemed to revolve around the same variables.

Instead of using Age directly, as most of the tutorials do, I've turned that continuous variable into a discrete 'Adult' variable, based on whether a passenger was 16 or older. I use a FamSize variable that is the sum of SibSp and Parch, and I've converted the Embarked and Sex categories from strings to integers.

In [4]:
import pandas, math
import numpy as np
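#sklearn.cross_validation was removed in scikit-learn 0.20; model_selection provides
#the same cross_val_score, and the alias keeps the calls in later cells working
from sklearn import model_selection as cross_validation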
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression

titanic = pandas.read_csv('train.csv')
titanic_test = pandas.read_csv('test.csv')

Clean up data, replace strings with integers, create new variables

In [5]:
#Create FamSize variable as sum of Parch and SibSp values
titanic['FamSize'] = titanic.SibSp + titanic.Parch
titanic_test['FamSize'] = titanic_test.SibSp + titanic_test.Parch

#Replace male/female strings with 0/1 (assign back rather than mutating the column in place)
titanic['Sex'] = titanic['Sex'].map({'male': 0, 'female': 1})
titanic_test['Sex'] = titanic_test['Sex'].map({'male': 0, 'female': 1})

#Fill missing ages with the median age of the training data (the same value is used for the test set)
titanic['Age'] = titanic['Age'].fillna(titanic['Age'].median())
titanic_test['Age'] = titanic_test['Age'].fillna(titanic['Age'].median())

#Create a variable which transforms age to log(age)
titanic['LogAge'] = np.log(titanic['Age'])
titanic_test['LogAge'] = np.log(titanic_test['Age'])

#Create a variable that is the length of a passenger's name
titanic['NameLen'] = titanic['Name'].str.len()
#Use the test set's own names here, not the training set's
titanic_test['NameLen'] = titanic_test['Name'].str.len()

#Fill in missing Fare values with median value
titanic_test['Fare'] = titanic_test['Fare'].fillna(titanic_test['Fare'].median())

#Create Adult variable: 0 if a passenger is younger than 16, otherwise 1
titanic['Adult'] = titanic['Age'].apply(lambda x: 0 if x < 16.0 else 1)
titanic_test['Adult'] = titanic_test['Age'].apply(lambda x: 0 if x < 16.0 else 1)

#Fill missing Embarked values with the most common port
titanic['Embarked'] = titanic['Embarked'].fillna('S')
titanic_test['Embarked'] = titanic_test['Embarked'].fillna('S')

#Replace string values of ports with ints
titanic['Embarked'] = titanic['Embarked'].map({'S': 0, 'C': 1, 'Q': 2})
titanic_test['Embarked'] = titanic_test['Embarked'].map({'S': 0, 'C': 1, 'Q': 2})
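
A quick way to sanity-check transformations like these is to run them on a tiny hand-made frame where the expected values are easy to work out by eye. This is only a minimal sketch: the `toy` frame and its four rows are made up for illustration and are not taken from train.csv.

```python
import pandas as pd

# Hypothetical toy rows, not taken from train.csv
toy = pd.DataFrame({
    "Name": ["Ann", "Robert", "Li", "Maria"],
    "Sex": ["female", "male", "male", "female"],
    "Age": [10.0, 40.0, None, 16.0],
    "SibSp": [1, 0, 2, 0],
    "Parch": [2, 0, 0, 1],
    "Embarked": ["S", None, "Q", "C"],
})

# Same steps as the cleanup cell, applied to the toy frame
toy["FamSize"] = toy["SibSp"] + toy["Parch"]
toy["Sex"] = toy["Sex"].map({"male": 0, "female": 1})
toy["Age"] = toy["Age"].fillna(toy["Age"].median())
toy["Adult"] = (toy["Age"] >= 16.0).astype(int)
toy["Embarked"] = toy["Embarked"].fillna("S").map({"S": 0, "C": 1, "Q": 2})
toy["NameLen"] = toy["Name"].str.len()

# No missing values should remain after the cleanup
assert not toy.isnull().any().any()
print(toy[["Sex", "Adult", "Embarked", "FamSize", "NameLen"]])
```

The same `isnull()` check could be run on `titanic[predictors]` and `titanic_test[predictors]` before fitting anything.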

Use a logistic regression algorithm.

In [8]:
predictors = ["Sex", "Adult", "Embarked", "FamSize"]
alg = LogisticRegression(random_state=1)
#Test Logistic regression model with given predictors and print score
scores = cross_validation.cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=3)
print(scores.mean())

0.79012345679


Use a Random Forest algorithm

In [9]:
alg = RandomForestClassifier(random_state=1, n_estimators=120, min_samples_split=3, min_samples_leaf=4)
#Test Random Forest model with given predictors and print score
scores = cross_validation.cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=3)
print(scores.mean())

0.824915824916


Besides trying different variables as predictors in the Random Forest, I also tried modifying the Random Forest's own parameters, to little effect. The most dramatic change came from modifying the number of estimators, and it almost always worsened performance.
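
A minimal sketch of that kind of experiment, assuming synthetic data from `make_classification` as a stand-in for `titanic[predictors]` and `titanic["Survived"]`. The `results` dict and the candidate values 10 and 60 are illustrative choices, not the values I actually tried; only 120 comes from the cell above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for titanic[predictors] / titanic["Survived"]
X, y = make_classification(n_samples=300, n_features=4, n_informative=3,
                           n_redundant=0, random_state=1)

# Mean 3-fold accuracy for each candidate number of estimators
results = {}
for n in [10, 60, 120]:
    forest = RandomForestClassifier(random_state=1, n_estimators=n,
                                    min_samples_split=3, min_samples_leaf=4)
    results[n] = cross_val_score(forest, X, y, cv=3).mean()
    print(n, round(results[n], 3))
```

On the real data, swapping `X` and `y` for the Titanic columns would reproduce the comparison across estimator counts.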

In [10]:
alg.fit(titanic[predictors], titanic["Survived"])
predictions = alg.predict(titanic_test[predictors])
#Create a submission using alg and predictors, export to csv
submission = pandas.DataFrame({
        "PassengerId": titanic_test["PassengerId"],
        "Survived": predictions
    })

submission.to_csv("kaggle.csv", index=False)

Best score = .78469 using a Random Forest with Sex, Adult, Embarked, FamSize, 120 estimators, min_samples_split of 3, min_samples_leaf of 4

In [380]:
#Ensembling code pulled straight from tutorial
algorithms = [
    [GradientBoostingClassifier(random_state=1, n_estimators=25, max_depth=3), predictors],
    [LogisticRegression(random_state=1), predictors]
]

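#Note: this scores whichever single model alg currently refers to
#(alg is rebound by the loop below), not the weighted ensemble
scores = cross_validation.cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=3)
print(scores.mean())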

full_predictions = []
for alg, predictors in algorithms:
    alg.fit(titanic[predictors], titanic["Survived"])
    predictions = alg.predict_proba(titanic_test[predictors].astype(float))[:,1]
    full_predictions.append(predictions)

# The gradient boosting classifier generates better predictions, so we weight it higher.
predictions = (full_predictions[0] * 3 + full_predictions[1]) / 4
#Threshold the blended probabilities at .5 to get 0/1 predictions
predictions = (predictions > .5).astype(int)
submission = pandas.DataFrame({'PassengerId': titanic_test['PassengerId'], 'Survived':predictions})
submission.to_csv("kaggle.csv", index=False)

0.79797979798


Best score = .78947 using an ensemble of Gradient Boosting and Logistic Regression. Predictors were Sex, Adult, Embarked, FamSize, and Pclass.
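
The cross-validation score printed by the ensembling cell appears to be for a single model rather than for the 3:1 blend itself. One way to score the blend, sketched here under assumptions, is scikit-learn's `VotingClassifier` with soft voting and `weights=[3, 1]`, which averages the two models' predicted probabilities with the same weights and picks the more probable class. The data is synthetic (`make_classification`) and the names `ensemble` and `ensemble_scores` are hypothetical; this is not the code that produced the submission.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for titanic[predictors] / titanic["Survived"]
X, y = make_classification(n_samples=300, n_features=5, n_informative=3,
                           n_redundant=0, random_state=1)

# Soft voting computes a 3:1 weighted average of predict_proba and then
# picks the more probable class, i.e. a .5 cutoff on the blended probability
ensemble = VotingClassifier(
    estimators=[
        ("gb", GradientBoostingClassifier(random_state=1, n_estimators=25, max_depth=3)),
        ("lr", LogisticRegression(random_state=1)),
    ],
    voting="soft",
    weights=[3, 1],
)

# Each fold refits both models, so the score reflects the blend itself
ensemble_scores = cross_val_score(ensemble, X, y, cv=3)
print(ensemble_scores.mean())
```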

It was very difficult to correlate an improvement in the training data score with an improvement in Kaggle submission scores. When I was using the Random Forest, reducing the number of predictors almost always gave better performance (up to a point). This makes sense, because I think complex models tend to overfit the training data. I also just read The Evolution of Cooperation by Robert Axelrod, and that book gives plenty of examples of why models should be as simple as possible.
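
One way to see the overfitting effect is to compare a model's accuracy on the data it was trained on with its cross-validated accuracy. The sketch below is an illustration under assumptions, not a rerun of my experiments: it uses noisy synthetic data and a single `DecisionTreeClassifier` (swapped in for the Random Forest to keep it small), and the `gaps` dict is a hypothetical name.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Noisy synthetic data: flip_y assigns a random label to 20% of the samples
X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           n_redundant=0, flip_y=0.2, random_state=1)

# Gap between training accuracy and 3-fold cross-validated accuracy
gaps = {}
for depth in [2, None]:  # None lets the tree grow until it fits the training data
    tree = DecisionTreeClassifier(max_depth=depth, random_state=1)
    train_acc = tree.fit(X, y).score(X, y)
    cv_acc = cross_val_score(tree, X, y, cv=3).mean()
    gaps[depth] = train_acc - cv_acc
    print(depth, round(train_acc, 3), round(cv_acc, 3))
```

A large gap for the unrestricted tree means its training score says little about how it will do on unseen passengers, which fits what I saw with the Kaggle submissions.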