In this notebook, I begin with a model derived from my experience using [DataQuest](https://www.dataquest.io/mission/74/getting-started-with-kaggle), and then annotate the revisions I have tried since, reporting the accuracy of each.

In [51]:
#importing and setting up my dataframes
import pandas as pd
from pandas import Series,DataFrame
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

titanic_data = pd.read_csv('data/train.csv', dtype={'Age':np.float64})
test_data = pd.read_csv('data/test.csv', dtype={'Age':np.float64})

# titanic_data.head()
# test_data.head()

print 'Data from the Titanic CSV'
titanic_data.info()
print '----------------------'
print 'Data from the Test CSV'
test_data.info()

Data from the Titanic CSV
<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 90.5+ KB
----------------------
Data from the Test CSV
<class 'pandas.core.frame.DataFrame'>
Int64Index: 418 entries, 0 to 417
Data columns (total 11 columns):
PassengerId    418 non-null int64
Pclass         418 non-null int64
Name           418 non-null object
Sex            418 non-null object
Age            332 non-null float64
SibSp          418 non-null int64
Parch          418 non-null in

First, we start with the titanic_data set and clean it up by filling in the missing values. For numeric columns the usual choices are the mean or the median; for categorical columns, we'll fill in the mode. For this example, I'll fill numeric columns with the median, since it is less sensitive to skewed distributions and large outliers.

In [52]:
for name in titanic_data.describe():
    titanic_data[name] = titanic_data[name].fillna(titanic_data[name].median()) #this works for numeric columns

#now for categories that we care about like embarked

titanic_data['Embarked'] = titanic_data['Embarked'].fillna('S') #this is the most frequent value
    
print titanic_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            891 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       891 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 90.5+ KB
None
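
To see why the median is the safer fill value when outliers are present, here is a small sketch using only Python's standard `statistics` module (Python 3 syntax). The fare and port values are made up for illustration and are not taken from the Titanic data.

```python
import statistics

# Hypothetical fares (illustrative only): four typical values and one huge outlier.
fares = [7.25, 8.05, 13.0, 26.0, 512.33]

mean_fare = statistics.mean(fares)      # pulled upward by the outlier
median_fare = statistics.median(fares)  # middle value, unaffected by the outlier

# Fill a missing entry (None) with the median of the observed values.
with_gap = [7.25, None, 13.0, 26.0, 512.33]
observed = [v for v in with_gap if v is not None]
fill_value = statistics.median(observed)
filled = [fill_value if v is None else v for v in with_gap]

# For a categorical column, the most frequent value (the mode) plays the same role.
ports = ['S', 'C', 'S', 'Q', 'S']
most_common_port = statistics.mode(ports)

print(mean_fare, median_fare, filled, most_common_port)
```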


Further, we'll make all our categories into numeric entities (sex, embarked)...

In [53]:
titanic_data.loc[titanic_data['Sex'] == 'male', 'Sex'] = 0
titanic_data.loc[titanic_data['Sex'] == 'female', 'Sex'] = 1

titanic_data.loc[titanic_data['Embarked'] == 'S', 'Embarked'] = 0
titanic_data.loc[titanic_data['Embarked'] == 'C', 'Embarked'] = 1
titanic_data.loc[titanic_data['Embarked'] == 'Q', 'Embarked'] = 2
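
The same encoding can also be expressed as lookup tables. The sketch below uses plain Python dictionaries holding the codes chosen in the cell above; the helper name `encode` is hypothetical. With pandas, the same dictionaries could be passed to `Series.map`.

```python
# Codes copied from the cell above.
SEX_CODES = {'male': 0, 'female': 1}
EMBARKED_CODES = {'S': 0, 'C': 1, 'Q': 2}

def encode(values, codes):
    """Replace each category label with its integer code.

    Raises KeyError on a label that has no code, so unexpected
    categories are noticed instead of silently passing through.
    """
    return [codes[v] for v in values]

sexes = encode(['male', 'female', 'female'], SEX_CODES)
ports = encode(['S', 'Q', 'C'], EMBARKED_CODES)
print(sexes, ports)
```

One thing to keep in mind: the 0/1/2 codes for Embarked impose an ordering on the ports that has no real meaning, and a linear model will treat that ordering as if it did.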

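Let's predict some things! We'll try out a linear regression first. Since a linear regression outputs a continuous value rather than a class, predictions above 0.5 are treated as survived (1) and everything else as not survived (0).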

In [54]:
from sklearn.linear_model import LinearRegression
from sklearn.cross_validation import KFold

predictors = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked'] #the only things we care about for this one

#initialize algorithm type
algorith = LinearRegression()

#let's generate some cross validation folds 
kf = KFold(titanic_data.shape[0], n_folds=3, random_state=1)

#generate a list of predictions
pred = []
for train, test in kf:
    train_predictors = (titanic_data[predictors].iloc[train,:])
    train_target = titanic_data['Survived'].iloc[train]
    algorith.fit(train_predictors, train_target)
    test_predictions = algorith.predict(titanic_data[predictors].iloc[test,:])
    pred.append(test_predictions)
    
#assess our training
pred = np.concatenate(pred, axis=0)
pred[pred > 0.5] = 1
pred[pred <= 0.5] = 0
#accuracy = fraction of predictions that match the true labels
accuracy = np.mean(pred == titanic_data['Survived'].values)
print accuracy

0.783389450056
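
The scoring step can be sketched without any libraries: turn each continuous regression output into a class label at the 0.5 cutoff, then take the fraction of labels that match the truth. The numbers here are made up for illustration, and the function names `to_labels` and `accuracy` are hypothetical.

```python
def to_labels(scores, cutoff=0.5):
    """Map continuous scores to 0/1 labels; scores above the cutoff become 1."""
    return [1 if s > cutoff else 0 for s in scores]

def accuracy(predicted, actual):
    """Fraction of positions where the predicted label equals the true label."""
    matches = sum(1 for p, a in zip(predicted, actual) if p == a)
    return matches / float(len(actual))

scores = [0.12, 0.91, 0.50, 0.67, 0.30]   # hypothetical regression outputs
actual = [0, 1, 1, 0, 0]                  # hypothetical true labels

labels = to_labels(scores)
print(labels, accuracy(labels, actual))
```

Note that a score of exactly 0.5 falls on the "not survived" side, matching the `pred <= 0.5` rule in the cell above.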




To try to improve on this, I'll give logistic regression a shot.

In [55]:
from sklearn.linear_model import LogisticRegression
from sklearn import cross_validation

log_alg = LogisticRegression(random_state=1)
scores = cross_validation.cross_val_score(log_alg, titanic_data[predictors], titanic_data['Survived'], cv=3)
print scores.mean()

0.787878787879
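
For reference, the averaging behind `scores.mean()` can be sketched in plain Python: split the row indices into folds, score each held-out fold, and average the results. This is a simplified illustration with contiguous folds and a stand-in scoring function (`contiguous_folds` and `score_fold` are hypothetical names, and the labels are made up). scikit-learn stratifies the folds by class when given an integer `cv` for a classifier, which this sketch does not attempt.

```python
def contiguous_folds(n_rows, n_folds):
    """Split range(n_rows) into n_folds contiguous (train, test) index lists.

    The first n_rows % n_folds folds get one extra row.
    """
    base, extra = divmod(n_rows, n_folds)
    folds, start = [], 0
    for i in range(n_folds):
        stop = start + base + (1 if i < extra else 0)
        test = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_rows))
        folds.append((train, test))
        start = stop
    return folds

def score_fold(train_labels, test_labels):
    """Stand-in for fit + score: predict the majority training label everywhere."""
    majority = 1 if sum(train_labels) * 2 > len(train_labels) else 0
    hits = sum(1 for y in test_labels if y == majority)
    return hits / float(len(test_labels))

labels = [0, 1, 0, 0, 1, 0, 1, 0, 0, 1]   # hypothetical survival labels
fold_scores = []
for train, test in contiguous_folds(len(labels), 3):
    fold_scores.append(score_fold([labels[i] for i in train],
                                  [labels[i] for i in test]))

mean_score = sum(fold_scores) / len(fold_scores)
print(fold_scores, mean_score)
```

With the 891 training rows and 3 folds, this kind of split holds out 297 rows at a time.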


At this point, I will apply the same cleaning steps to the test data provided by Kaggle, then fit the model on the full training set and make my first submission.

In [56]:
#I've already imported this data earlier, so that's nice.
test_data['Age'] = test_data['Age'].fillna(titanic_data['Age'].median())
test_data['Fare'] = test_data['Fare'].fillna(test_data['Fare'].median())

test_data.loc[test_data['Sex'] == 'male', 'Sex'] = 0
test_data.loc[test_data['Sex'] == 'female', 'Sex'] = 1

test_data['Embarked'] = test_data['Embarked'].fillna('S')
test_data.loc[test_data['Embarked'] == 'S', 'Embarked'] = 0
test_data.loc[test_data['Embarked'] == 'C', 'Embarked'] = 1
test_data.loc[test_data['Embarked'] == 'Q', 'Embarked'] = 2

#apply an algorithm
algo = LogisticRegression(random_state=1)
algo.fit(titanic_data[predictors], titanic_data['Survived'])
predicts = algo.predict(test_data[predictors])

#now make a submission dataframe for Kaggle
submission = pd.DataFrame({'PassengerId': test_data['PassengerId'], 'Survived':predicts})
submission.to_csv('kaggle.csv', index=False)
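
The file written above has a header row followed by one `PassengerId,Survived` pair per test passenger. A minimal sketch of that layout using the standard `csv` module (Python 3), with made-up IDs and predictions and an in-memory buffer instead of a file on disk:

```python
import csv
import io

# Hypothetical passenger IDs and predicted labels (illustrative only).
passenger_ids = [1001, 1002, 1003]
predictions = [0, 1, 0]

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator='\n')
writer.writerow(['PassengerId', 'Survived'])   # same column names as the submission DataFrame
for pid, label in zip(passenger_ids, predictions):
    writer.writerow([pid, label])

csv_text = buffer.getvalue()
print(csv_text)
```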

After submitting, I was told that the accuracy of the above model was 0.75120 (along with my other classmates - what fun!)

Now, I would like to start making revisions to this model...