In this notebook, I begin with a model derived from my experience with [DataQuest](https://www.dataquest.io/mission/74/getting-started-with-kaggle), then annotate the revisions I have tried since, reporting the accuracy of each.

In [294]:
#importing and setting up my dataframes
import pandas as pd
from pandas import Series,DataFrame
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

titanic_data = pd.read_csv('data/train.csv', dtype={'Age':np.float64})
test_data = pd.read_csv('data/test.csv', dtype={'Age':np.float64})

# titanic_data.head()
# test_data.head()

print 'Data from the Titanic CSV'
titanic_data.info()
print '----------------------'
print 'Data from the Test CSV'
test_data.info()

Data from the Titanic CSV
<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 90.5+ KB
----------------------
Data from the Test CSV
<class 'pandas.core.frame.DataFrame'>
Int64Index: 418 entries, 0 to 417
Data columns (total 11 columns):
PassengerId    418 non-null int64
Pclass         418 non-null int64
Name           418 non-null object
Sex            418 non-null object
Age            332 non-null float64
SibSp          418 non-null int64
Parch          418 non-null in

First, we want to clean up the titanic_data set by filling in its missing values. Numeric columns can take either the mean or the median; categorical columns take the mode. For this example, I'll use the median for the numeric columns, since it holds up better if some of these distributions are skewed or contain huge outliers.

In [295]:
for name in titanic_data.describe():
    titanic_data[name] = titanic_data[name].fillna(titanic_data[name].median()) #this works for numeric columns

#now for categories that we care about like embarked

titanic_data['Embarked'] = titanic_data['Embarked'].fillna('S') #this is the most frequent value
    
print titanic_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            891 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       891 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 90.5+ KB
None
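
The cell above hard-codes 'S' as the most frequent port. As a minimal sketch (on a made-up toy frame, not the real CSV), the same fill can be done by computing the mode; the names `toy` and `most_common_port` are my own for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Titanic training data (values are made up).
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 30.0, 40.0],
    'Embarked': ['S', 'C', None, 'S'],
})

# Numeric column: fill gaps with the median, which is less sensitive to outliers.
toy['Age'] = toy['Age'].fillna(toy['Age'].median())

# Categorical column: compute the most frequent value instead of typing it in.
most_common_port = toy['Embarked'].mode()[0]
toy['Embarked'] = toy['Embarked'].fillna(most_common_port)

print(toy)
```

Computing the mode keeps the fill value correct if the data ever changes, at the cost of one extra line.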


Next, we'll convert our categorical columns (Sex, Embarked) into numeric codes...

In [296]:
titanic_data.loc[titanic_data['Sex'] == 'male', 'Sex'] = 0
titanic_data.loc[titanic_data['Sex'] == 'female', 'Sex'] = 1

titanic_data.loc[titanic_data['Embarked'] == 'S', 'Embarked'] = 0
titanic_data.loc[titanic_data['Embarked'] == 'C', 'Embarked'] = 1
titanic_data.loc[titanic_data['Embarked'] == 'Q', 'Embarked'] = 2
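
The same recoding can be written more compactly with `Series.map` and a dictionary per column. This is only a sketch on a toy frame; `sex_codes` and `port_codes` are names I made up, and the codes match the ones used in the cell above.

```python
import pandas as pd

# Toy frame with the two categorical columns (values are made up).
toy = pd.DataFrame({'Sex': ['male', 'female', 'female'],
                    'Embarked': ['S', 'Q', 'C']})

# One lookup table per column, using the same codes as the .loc assignments.
sex_codes = {'male': 0, 'female': 1}
port_codes = {'S': 0, 'C': 1, 'Q': 2}

# map() replaces each value with its code; values missing from the
# dictionary would become NaN, which makes unexpected categories easy to spot.
toy['Sex'] = toy['Sex'].map(sex_codes)
toy['Embarked'] = toy['Embarked'].map(port_codes)

print(toy)
```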

Let's predict some things! We'll try out a linear regression first.

In [297]:
from sklearn.linear_model import LinearRegression
from sklearn.cross_validation import KFold

predictors = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked'] #the only things we care about for this one

#initialize algorithm type
algorith = LinearRegression()

#let's generate some cross validation folds 
kf = KFold(titanic_data.shape[0], n_folds=3, random_state=1)

#generate a list of predictions
pred = []
for train, test in kf:
    train_predictors = (titanic_data[predictors].iloc[train,:])
    train_target = titanic_data['Survived'].iloc[train]
    algorith.fit(train_predictors, train_target)
    test_predictions = algorith.predict(titanic_data[predictors].iloc[test,:])
    pred.append(test_predictions)
    
#assess our training
pred = np.concatenate(pred, axis=0)
pred[pred > 0.5] = 1
pred[pred <= 0.5] = 0
#accuracy is the fraction of predictions that match the actual outcome
accuracy = (pred == titanic_data['Survived'].values).mean()
print accuracy

0.783389450056
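
The accuracy here is the fraction of thresholded predictions that match the Survived column. A minimal sketch of that calculation on made-up numbers (the arrays below are invented for illustration, not taken from the data):

```python
import numpy as np

# Made-up regression outputs and true outcomes for five passengers.
predicted_scores = np.array([0.9, 0.2, 0.6, 0.4, 0.1])
actual = np.array([1, 0, 0, 1, 0])

# Threshold the continuous outputs at 0.5 to get 0/1 labels.
predicted_labels = (predicted_scores > 0.5).astype(int)

# Accuracy is the share of positions where prediction and truth agree.
accuracy = (predicted_labels == actual).mean()

print(accuracy)
```

In this toy case three of the five labels agree, so the accuracy is 0.6.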




In an attempt to be better than this, I will give a logistic regression a shot.

In [298]:
from sklearn.linear_model import LogisticRegression
from sklearn import cross_validation

log_alg = LogisticRegression(random_state=1)
scores = cross_validation.cross_val_score(log_alg, titanic_data[predictors], titanic_data['Survived'], cv=3)
print scores.mean()

0.787878787879
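
Both scores so far come from three-fold cross validation: the model is trained on two thirds of the rows and scored on the remaining third, three times over. The sketch below shows only the index bookkeeping, using NumPy. `kfold_indices` is a hypothetical helper of mine, not scikit-learn's implementation, and it makes plain consecutive folds without shuffling or stratification.

```python
import numpy as np

def kfold_indices(n_samples, n_folds):
    """Yield (train_idx, test_idx) pairs for consecutive, unshuffled folds."""
    folds = np.array_split(np.arange(n_samples), n_folds)
    for i, test_idx in enumerate(folds):
        # Everything outside the current fold is used for training.
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train_idx, test_idx

# 891 rows, as in the training CSV, split into 3 folds.
splits = list(kfold_indices(891, 3))
fold_sizes = [len(test_idx) for _, test_idx in splits]

print(fold_sizes)
```

Every row lands in exactly one test fold, so each passenger is predicted by a model that never saw them during training.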


At this point, I will run the model on the test data provided by Kaggle, then make my first submission.

In [299]:
#I've already imported this data earlier, so that's nice.
test_data['Age'] = test_data['Age'].fillna(titanic_data['Age'].median())
test_data['Fare'] = test_data['Fare'].fillna(test_data['Fare'].median())

test_data.loc[test_data['Sex'] == 'male', 'Sex'] = 0
test_data.loc[test_data['Sex'] == 'female', 'Sex'] = 1

test_data['Embarked'] = test_data['Embarked'].fillna('S')
test_data.loc[test_data['Embarked'] == 'S', 'Embarked'] = 0
test_data.loc[test_data['Embarked'] == 'C', 'Embarked'] = 1
test_data.loc[test_data['Embarked'] == 'Q', 'Embarked'] = 2

#apply an algorithm
algo = LogisticRegression(random_state=1)
algo.fit(titanic_data[predictors], titanic_data['Survived'])
predicts = algo.predict(test_data[predictors])

print predicts

#now make a submission dataframe for Kaggle
submission = pd.DataFrame({'PassengerId': test_data['PassengerId'], 'Survived':predicts})
submission.to_csv('kaggle.csv', index=False)

[0 0 0 0 1 0 1 0 1 0 0 0 1 0 1 1 0 0 1 1 0 0 1 1 1 0 1 0 0 0 0 0 0 1 0 0 1
 1 0 0 0 0 0 1 1 0 0 0 1 1 0 0 1 1 0 0 0 0 0 1 0 0 0 1 1 1 1 0 1 1 1 0 1 1
 1 1 0 1 0 1 0 0 0 0 0 0 1 1 1 0 1 0 1 0 1 0 1 0 1 0 1 0 0 0 1 0 0 0 0 0 0
 1 1 1 1 0 0 1 0 1 1 0 1 0 0 1 0 1 0 0 0 0 1 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0
 0 0 1 0 0 1 0 0 1 1 0 1 1 0 1 0 0 1 0 0 1 1 0 0 0 0 0 1 1 0 1 1 0 0 1 0 1
 0 1 0 0 0 0 0 0 0 0 0 1 1 0 1 1 0 0 1 0 1 1 0 1 0 0 0 0 1 0 0 1 0 1 0 1 0
 1 0 1 1 0 1 0 0 0 1 0 0 0 0 0 0 1 1 1 1 0 0 0 0 1 0 1 1 1 0 1 0 0 0 0 0 1
 0 0 0 1 1 0 0 0 0 1 0 0 0 1 1 0 1 0 0 0 0 1 0 1 1 1 0 0 0 0 0 0 1 0 0 0 0
 1 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 1 0 1 0 0 0 1 0 0
 1 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 1 1 0 0 0 1 0 1 0 0 1 0 1 1 0 1 0 0 1 1 0
 0 1 0 0 1 1 1 0 0 0 0 0 1 1 0 1 0 0 0 0 1 1 0 0 0 1 0 1 0 0 1 0 1 1 0 0 0
 0 1 1 1 1 1 0 1 0 0 0]


After submitting, I was told the accuracy of the above model was 0.75120 (along with my other classmates - what fun!)

Now, I would like to start making revisions to this model. From my earlier exploration, I have some assumptions about the relationships between several of the variables. I want to use these assumed relationships to make my model's predictions more robust. I start with a fresh copy of the training and test data:

In [300]:
from sklearn.linear_model import LinearRegression
from sklearn.cross_validation import KFold
import statsmodels.formula.api as smf

titanic = pd.read_csv('data/train.csv', dtype={'Age':np.float64})
test = pd.read_csv('data/test.csv', dtype={'Age':np.float64})

for name in titanic.describe():
    titanic[name] = titanic[name].fillna(titanic[name].median()) #this works for numeric columns
#now for categories that we care about like embarked
titanic['Embarked'] = titanic['Embarked'].fillna('S') #this is the most frequent value
titanic.loc[titanic['Sex'] == 'male', 'Sex'] = 0
titanic.loc[titanic['Sex'] == 'female', 'Sex'] = 1
titanic.loc[titanic['Embarked'] == 'S', 'Embarked'] = 0
titanic.loc[titanic['Embarked'] == 'C', 'Embarked'] = 1
titanic.loc[titanic['Embarked'] == 'Q', 'Embarked'] = 2 

#now, I'm going to create a few factors which weigh some of these elements more than others (like being a young rich woman)

predictors = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked'] #the only things we care about for this one

#initialize algorithm type
# formula = 'Survived ~ C(Sex) + Age + C(Pclass) + Fare' #80%
# formula = 'Survived ~ C(Sex) + SibSp + C(Pclass)' #80%
# formula = 'Survived ~ Embarked * Fare + Sex * Pclass + Age * Sex + Pclass * Sex + Sex * SibSp' #82% 
formula = 'Survived ~ Sex * Pclass + Sex * Age + SibSp * Sex + Parch * Age + Pclass * Age' #83%
model = smf.logit(formula, data=titanic)
results = model.fit()

#assess our training
pred = results.predict()
pred[pred > 0.5] = 1
pred[pred <= 0.5] = 0
#accuracy is the fraction of predictions that match the actual outcome
accuracy = (pred == titanic['Survived'].values).mean()
print accuracy

#well...let's try it on the test data and submit to kaggle!
test['Age'] = test['Age'].fillna(titanic['Age'].median())
test['Fare'] = test['Fare'].fillna(test['Fare'].median())

test.loc[test['Sex'] == 'male', 'Sex'] = 0
test.loc[test['Sex'] == 'female', 'Sex'] = 1

test['Embarked'] = test['Embarked'].fillna('S')
test.loc[test['Embarked'] == 'S', 'Embarked'] = 0
test.loc[test['Embarked'] == 'C', 'Embarked'] = 1
test.loc[test['Embarked'] == 'Q', 'Embarked'] = 2

#apply an algorithm
# test_formula = 'Sex * Pclass + Sex * Age + SibSp * Sex + Parch * Age + Pclass * Age' #83%

# test_model = smf.logit(test_formula, data=test)
new = test
predicts = results.predict(new)

predicts[predicts > 0.5] = int(1)
predicts[predicts <= 0.5] = int(0)

final_pred = []
for element in predicts:
    final_pred.append(int(element))

# print final_pred #had to do this for some random reason in which predicts was floating point...?

#now make a submission dataframe for Kaggle
submission = pd.DataFrame({'PassengerId': test['PassengerId'], 'Survived':final_pred})
# print submission
submission.to_csv('kaggle_rev1.csv', index=False)

Optimization terminated successfully.
         Current function value: 0.421738
         Iterations 7
0.824915824916
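
On the floating point question in the cell above: a logit model's predict step returns survival probabilities, so the values stay floats even after being overwritten with 0 and 1. A minimal sketch of a one-step alternative to the loop, on made-up probabilities (`probabilities` is a stand-in for `predicts`):

```python
import pandas as pd

# Made-up predicted probabilities standing in for results.predict(test).
probabilities = pd.Series([0.91, 0.08, 0.5, 0.63])

# Compare against the threshold, then cast the booleans to integers.
# A probability of exactly 0.5 is classed as 0, matching the cell above.
final_pred = (probabilities > 0.5).astype(int)

print(final_pred.tolist())
```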




This first revision scored me a 0.78469 on the scoreboard (it informs me this is an improvement of 0.03349 and launched me 1,357 positions up the leaderboard, leaving me sitting comfy at 2,003). For this next revision, I'll push a little harder on the sibling/spouse (SibSp) and Parch relationship, which I thought was powerful in my exploration but didn't make the most of here.

In [301]:
# predictors = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked'] #the only things we care about for this one

# formula = 'Survived ~ (Pclass + Age + SibSp) * Sex + (Parch + Pclass) * Age + (Parch + SibSp + Fare + Sex) * Embarked + (Pclass + Parch + Fare + Age) * SibSp' #84% but the convergence was no good (0.77033)
formula = 'Survived ~ (Pclass + SibSp + Parch + Embarked + Fare) * (Sex + Age + Pclass + SibSp + Fare)' #84% with score only of 0.77512

model = smf.logit(formula, data=titanic)
results = model.fit()

#assess our training
pred = results.predict()
pred[pred > 0.5] = 1
pred[pred <= 0.5] = 0
#accuracy is the fraction of predictions that match the actual outcome
accuracy = (pred == titanic['Survived'].values).mean()
print accuracy

#let's give it a shot
new = test
predicts = results.predict(new)

predicts[predicts >= 0.5] = int(1)
predicts[predicts < 0.5] = int(0)

final_pred = []
for element in predicts:
    final_pred.append(int(element))

#now make a submission dataframe for Kaggle
submission = pd.DataFrame({'PassengerId': test['PassengerId'], 'Survived':final_pred})
# print submission
submission.to_csv('kaggle_rev2.csv', index=False)

Optimization terminated successfully.
         Current function value: 0.395759
         Iterations 8
0.83950617284




This revision did not perform better than the first revision (0.77512 versus 0.78469, which is close); however, I believe this implies there is something interesting going on in the data. The test data is a little bit different from the training data, so the more specific the training formula, the less useful it is for the test data. I look forward to learning more about how to build intuition for data manipulation. I wonder if creating a recoded metric might be interesting.
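
One way to see this effect before submitting would be to hold back part of the training rows and score each formula on rows it was not fit on. The sketch below only builds the row indices; `holdout_split` is a hypothetical helper, and the 20% fraction and the seed are arbitrary assumptions of mine.

```python
import numpy as np

def holdout_split(n_samples, holdout_fraction=0.2, seed=1):
    """Return (train_idx, holdout_idx) from a seeded random shuffle of the rows."""
    rng = np.random.RandomState(seed)
    shuffled = rng.permutation(n_samples)
    n_holdout = int(round(n_samples * holdout_fraction))
    # The first n_holdout shuffled rows are held back; the rest are for fitting.
    return shuffled[n_holdout:], shuffled[:n_holdout]

# 891 rows, as in the training CSV.
train_idx, holdout_idx = holdout_split(891)

print(len(train_idx))
print(len(holdout_idx))
```

The model would then be fit on `titanic.iloc[train_idx]` and its accuracy measured on `titanic.iloc[holdout_idx]`. A formula that scores well on the first set but poorly on the second is probably too specific to the training rows.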

In [302]:
test['Age_Recode'] = np.log10(test.Age)
titanic['Age_Recode'] = np.log10(titanic.Age)

formula = 'Survived ~ Sex * Pclass + Sex * Age + Parch * Sex' #82%

model = smf.logit(formula, data=titanic)
results = model.fit()

#assess our training
pred = results.predict()
pred[pred > 0.5] = 1
pred[pred <= 0.5] = 0
#accuracy is the fraction of predictions that match the actual outcome
accuracy = (pred == titanic['Survived'].values).mean()
print accuracy

#let's give it a shot
new = test
predicts = results.predict(new)

predicts[predicts > 0.5] = int(1)
predicts[predicts <= 0.5] = int(0)

final_pred = []
for element in predicts:
    final_pred.append(int(element))

#now make a submission dataframe for Kaggle
submission = pd.DataFrame({'PassengerId': test['PassengerId'], 'Survived':final_pred})
# print submission
submission.to_csv('kaggle_rev3.csv', index=False)

Optimization terminated successfully.
         Current function value: 0.432147
         Iterations 7
0.813692480359




The recode didn't work, and neither did trying to raise the current function value (this script scored 0.76077), so clearly there is a balance between the function value and the accuracy measure.
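
As far as I understand it, the "Current function value" printed by the fit is the average negative log-likelihood of the training rows, so it rewards confident probabilities, while accuracy only looks at which side of 0.5 each prediction falls on. The sketch below uses made-up probabilities to show two sets of predictions with the same accuracy but different log-loss; `mean_log_loss` and `accuracy` are hypothetical helpers written for this illustration.

```python
import numpy as np

def mean_log_loss(y_true, p):
    """Average negative log-likelihood of 0/1 outcomes under probabilities p."""
    y_true = np.asarray(y_true, dtype=float)
    p = np.asarray(p, dtype=float)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

def accuracy(y_true, p, threshold=0.5):
    """Fraction of thresholded predictions that match the outcomes."""
    labels = (np.asarray(p) > threshold).astype(int)
    return np.mean(labels == np.asarray(y_true))

# Made-up outcomes and two sets of predicted probabilities.
y = [1, 0, 1, 0]
confident = [0.9, 0.1, 0.9, 0.1]
hesitant = [0.6, 0.4, 0.6, 0.4]

print(accuracy(y, confident))
print(accuracy(y, hesitant))
print(mean_log_loss(y, confident))
print(mean_log_loss(y, hesitant))
```

Both sets classify every toy row correctly, yet the hesitant set has the larger loss, which is why the two measures do not have to move together.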

In [303]:
test['Age_Recode'] = np.log10(test.Age)
titanic['Age_Recode'] = np.log10(titanic.Age)

formula = 'Survived ~ Sex * Pclass + Sex * Age_Recode + Parch * Sex + Age_Recode * Embarked + Age_Recode * SibSp' #82%

model = smf.logit(formula, data=titanic)
results = model.fit()

#assess our training
pred = results.predict()
pred[pred > 0.5] = 1
pred[pred <= 0.5] = 0
#accuracy is the fraction of predictions that match the actual outcome
accuracy = (pred == titanic['Survived'].values).mean()
print accuracy

#let's give it a shot
new = test
predicts = results.predict(new)

predicts[predicts > 0.5] = int(1)
predicts[predicts <= 0.5] = int(0)

final_pred = []
for element in predicts:
    final_pred.append(int(element))

#now make a submission dataframe for Kaggle
submission = pd.DataFrame({'PassengerId': test['PassengerId'], 'Survived':final_pred})
# print submission
submission.to_csv('kaggle_rev4.csv', index=False)

Optimization terminated successfully.
         Current function value: 0.407647
         Iterations 7
0.828282828283




This was the last of my first-round revision models, with a score of 0.77512: not bad, but not an improvement. I'm excited to learn about more data mapping techniques to improve my current model!