In [1]:
%matplotlib inline
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns

# Import Data

In [2]:
train=pd.read_csv("train.csv")
test=pd.read_csv("test.csv")
gs=pd.read_csv("gender_submission.csv")

I like to look at the _tail_, rather than the _head_, of a df, since it shows the number of instances as well as giving a peek at the data.

# Initial analysis

In [3]:
train.tail(2)


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.75,,Q


In [4]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB


The train data set has missing values. The Cabin column is particularly sparse, with only 204 of 891 values present.

In [5]:
train.isnull().any()

PassengerId    False
Survived       False
Pclass         False
Name           False
Sex            False
Age             True
SibSp          False
Parch          False
Ticket         False
Fare           False
Cabin           True
Embarked        True
dtype: bool

So the _Age_, _Cabin_, and _Embarked_ features have missing values. How many rows have missing data?

In [6]:
train.shape[0]-train.dropna().shape[0]

708

Let's look at the subset of features with missing values:

In [7]:
train.loc[:, train.isnull().any()]

Unnamed: 0,Age,Cabin,Embarked
0,22.0,,S
1,38.0,C85,C
2,26.0,,S
3,35.0,C123,S
4,35.0,,S
5,,,Q
6,54.0,E46,S
7,2.0,,S
8,27.0,,S
9,14.0,,C
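
`isnull().any()` only says whether a column has gaps, not how many. The sketch below counts missing values per column and per row. It uses a small made-up frame (`toy`) rather than the real `train` data, so the numbers are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Titanic columns (values are made up).
toy = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan, 35.0],
    "Cabin": [np.nan, "C85", np.nan, "C123"],
    "Embarked": ["S", "C", "Q", np.nan],
    "Fare": [7.25, 71.28, 8.46, 53.10],
})

missing_per_column = toy.isnull().sum()               # number of NaNs in each column
missing_share = toy.isnull().mean()                   # fraction of NaNs in each column
rows_with_gap = int(toy.isnull().any(axis=1).sum())   # rows with at least one NaN

print(missing_per_column)
print(missing_share)
print(rows_with_gap)
```

The same three expressions should work unchanged on `train`.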


We want to model survival or fatality by building a binary classifier. How many people in our train set died on the Titanic?

In [8]:
print("Died {} ({:.0f}%)\nSurvived: {} ({:.0f}%)".format(len(train[train["Survived"]==0]),
                                                (len(train[train["Survived"]==0])/len(train))*100,
                                                len(train[train["Survived"]==1]),
                                               (len(train[train["Survived"]==1])/len(train))*100))


Died 549 (62%)
Survived: 342 (38%)
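
A shorter way to get the same class balance is `value_counts`. This is a minimal sketch on a made-up label series, not the real `Survived` column.

```python
import pandas as pd

# Toy labels using the same 0/1 coding as Survived (values are made up).
survived = pd.Series([0, 0, 0, 1, 1], name="Survived")

counts = survived.value_counts()                 # absolute count per class
shares = survived.value_counts(normalize=True)   # fraction per class

summary = pd.DataFrame({"count": counts, "percent": (shares * 100).round(0)})
print(summary)
```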


Our train set is skewed roughly 3:2 towards deaths over survival (62% vs 38%). Let's do a pairwise plot using only complete rows, i.e. those with no NAs.

In [9]:
#sns.pairplot(train.dropna(),hue="Survived")

Remember that 1 means yes, they survived, and 0 means no. This plot is quite hard to read. Let's narrow it down:

In [10]:
#sns.pairplot(train.dropna(),hue="Survived",vars=["Age","Pclass","Fare","SibSp","Survived"])

In order to build a classifier, we are going to have to convert our categorical features (Sex, Ticket, Embarked) to numeric codes.

In [11]:
for i in train.select_dtypes(include=['object']).columns:
    print("{}: {} values, {} unique".format(i,len(train[i]),len(train[i].unique())))

Name: 891 values, 891 unique
Sex: 891 values, 2 unique
Ticket: 891 values, 681 unique
Cabin: 891 values, 148 unique
Embarked: 891 values, 4 unique
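
Note that `unique()` treats NaN as a value of its own, which is why a column with three ports can report four unique values. A minimal sketch of the difference, on a made-up series:

```python
import numpy as np
import pandas as pd

# Toy column with three real categories and one missing value.
embarked = pd.Series(["S", "C", "Q", np.nan, "S"])

with_nan = len(embarked.unique())   # NaN is counted as a distinct value
without_nan = embarked.nunique()    # NaN is excluded by default

print(with_nan, without_nan)
```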


The Name, Ticket and Cabin variables have many distinct values. For my first model, I will exclude these and include only "Sex" and "Embarked" of the categorical variables. For this I will need to convert them to numeric codes.

# Conversion of categorical data

In [12]:
train["Embarked"].unique()

array(['S', 'C', 'Q', nan], dtype=object)

In [13]:
train["Sex"].unique()

array(['male', 'female'], dtype=object)

In [14]:
# Map each category to an integer code; missing Embarked values stay NaN
train["Sex"] = train["Sex"].map({"male": 0, "female": 1})
train["Embarked"] = train["Embarked"].map({"S": 0, "C": 1, "Q": 2})
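
Integer codes for "Embarked" imply an ordering (S < C < Q) that the ports do not really have. One common alternative, which this notebook does not use, is one-hot encoding with `pd.get_dummies`. The sketch below shows it on a made-up frame; the frame and its values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with the two categorical columns (values are made up).
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", np.nan],
})

# A binary feature can stay as a single 0/1 column.
toy["Sex"] = toy["Sex"].map({"male": 0, "female": 1})

# One indicator column per port; a missing port becomes a row of zeros.
dummies = pd.get_dummies(toy["Embarked"], prefix="Embarked")
encoded = pd.concat([toy.drop(columns="Embarked"), dummies], axis=1)

print(encoded)
```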

I am going to test a few different algorithms, using a subset of the data. I expect these to underfit, since I have done little feature engineering up to this point. Initially, I will test a few classifiers using _scikit-learn_ to get a baseline accuracy. I will then do some feature engineering to improve the models.

# Baseline models

In [27]:
features=["Survived","Pclass","Sex","Age","SibSp","Parch","Fare","Embarked"]

In [16]:
#X=train.dropna()[features]
#y=train.dropna()["Survived"]
#print(len(X),len(y))
#y.mean()

In [35]:
X=train[features].dropna().drop("Survived",axis=1)
y=train[features].dropna()["Survived"]
print(len(X),len(y))
print(y.mean())

712 712
0.404494382022


So 40% of the input data for the first model has value 1 (i.e. they survived). This is a fair reflection of the original set, which has a 38% survival rate. For the current train set, if we predict 1 for every instance, we would be correct 40% of the time; predicting 0 for every instance would be correct about 60% of the time, so that is the accuracy a useful model has to beat.
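
As a quick check of the two trivial baselines, the following sketch works out their accuracy from the row count and mean printed above. Only the variable names are new.

```python
n_rows = 712                      # rows left after dropna(), as printed above
survival_rate = 0.404494382022    # y.mean(), as printed above

n_survived = round(n_rows * survival_rate)   # rows labelled 1
n_died = n_rows - n_survived                 # rows labelled 0

always_survived_acc = n_survived / n_rows    # accuracy of predicting 1 everywhere
always_died_acc = n_died / n_rows            # accuracy of predicting 0 everywhere

print(n_survived, n_died)
print(round(always_survived_acc, 4), round(always_died_acc, 4))
```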

I am going to further split our train data set into a train and a test set so that we can test our model properly. The _test_ set provided by Kaggle has no answers we can check, so we cannot validate our model against it.

In [36]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
print(len(y_train))
print(len(y_test))
print(y_train.mean())
print(y_test.mean())

534
178
0.397003745318
0.426966292135
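
The survival rate differs slightly between the two halves (about 40% vs 43%). If that matters, `train_test_split` accepts a `stratify` argument that keeps the class proportions the same in both parts. A minimal sketch on synthetic data (`X_toy` and `y_toy` are made up, not the Titanic features):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 rows, 40% positives.
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([1] * 40 + [0] * 60)

# stratify=y_toy keeps the share of each class equal in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=1
)

print(len(y_tr), len(y_te))
print(y_tr.mean(), y_te.mean())
```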


In [52]:
scores={}

### Gaussian Naive Bayes

The first method I will try is Gaussian Naive Bayes.

In [37]:
from sklearn.naive_bayes import GaussianNB

In [38]:
gnb = GaussianNB()
gnb = gnb.fit(X_train, y_train)
print(gnb.score(X_train, y_train))

0.784644194757


So GNB is accurate on 78.5% of the training data. What about the test set?

In [54]:
print("GNB accurately predicts {:.2f}% of the test set".format(gnb.score(X_test,y_test)*100))
scores["GNB"]=gnb.score(X_test,y_test)*100

GNB accurately predicts 78.09% of the test set


In [40]:
y_pred = gnb.fit(X_train, y_train).predict(X_test)
print("Number of mislabeled points out of a total {} points : {}".format(X_test.shape[0],(y_test != y_pred).sum()))

Number of mislabeled points out of a total 178 points : 39


In [41]:
from sklearn.metrics import confusion_matrix

In [42]:
tn, fp, fn, tp = confusion_matrix(y_test,y_pred).ravel()
(tn, fp, fn, tp)

(83, 19, 20, 56)

In [48]:
print("Correct deaths: {}\nIncorrect Survival: {}\nIncorrect Death: {}\nCorrect Survival: {}".format(tn,fp,fn,tp))

Correct deaths: 83
Incorrect Survival: 19
Incorrect Death: 20
Correct Survival: 56
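
The four counts also give precision and recall for the "survived" class, which say more than accuracy alone. This sketch plugs in the GNB confusion matrix printed above; only the variable names are new.

```python
# Confusion matrix values for GNB, copied from the output above.
tn, fp, fn, tp = 83, 19, 20, 56

total = tn + fp + fn + tp
accuracy = (tp + tn) / total   # share of all predictions that were right
precision = tp / (tp + fp)     # of those predicted to survive, how many did
recall = tp / (tp + fn)        # of those who survived, how many were found

print(round(accuracy, 4), round(precision, 4), round(recall, 4))
```

The accuracy agrees with the 78.09% reported by `gnb.score` on the test set.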


### Logistic Regression

In [55]:
from sklearn.linear_model import LogisticRegression 

In [56]:
lrmodel=LogisticRegression()
lrmodel=lrmodel.fit(X_train,y_train)
lrmodel.score(X_train,y_train)


0.797752808988764

In [58]:
lrmodel.score(X_test,y_test)
scores["LR"]=lrmodel.score(X_test,y_test)*100

In [59]:
y_pred = lrmodel.fit(X_train, y_train).predict(X_test)
tn, fp, fn, tp = confusion_matrix(y_test,y_pred).ravel()
(tn, fp, fn, tp)

(83, 19, 19, 57)

In [60]:
print("Correct deaths: {}\nIncorrect Survival: {}\nIncorrect Death: {}\nCorrect Survival: {}".format(tn,fp,fn,tp))

Correct deaths: 83
Incorrect Survival: 19
Incorrect Death: 19
Correct Survival: 57


### Decision Tree

In [61]:
from sklearn.tree import DecisionTreeClassifier

In [62]:
treemodel=DecisionTreeClassifier()
treemodel=treemodel.fit(X_train,y_train)
treemodel.score(X_train,y_train)


0.99250936329588013

In [69]:
treemodel.score(X_test,y_test)
scores["DT"]=treemodel.score(X_test,y_test)*100

In [66]:
y_pred = treemodel.fit(X_train, y_train).predict(X_test)
tn, fp, fn, tp = confusion_matrix(y_test,y_pred).ravel()
(tn, fp, fn, tp)

(84, 18, 22, 54)

In [67]:
print("Correct deaths: {}\nIncorrect Survival: {}\nIncorrect Death: {}\nCorrect Survival: {}".format(tn,fp,fn,tp))

Correct deaths: 84
Incorrect Survival: 18
Incorrect Death: 22
Correct Survival: 54
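
The tree scores 99% on the training data but much lower on the test data, which suggests it has memorised the training rows. One common remedy, not tried in this notebook, is to cap the depth with `max_depth`. The sketch below compares an unrestricted and a shallow tree on synthetic data from `make_classification`; the data and the depth of 3 are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the seven Titanic predictors.
X_toy, y_toy = make_classification(
    n_samples=400, n_features=7, n_informative=4, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=1)

deep = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)

# Compare tree depth, training accuracy and test accuracy.
for name, model in [("deep", deep), ("shallow", shallow)]:
    print(name, model.get_depth(), model.score(X_tr, y_tr), model.score(X_te, y_te))
```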


### Gradient Boosting

In [72]:
from sklearn.ensemble import GradientBoostingClassifier

In [86]:
gb=GradientBoostingClassifier()
gb=gb.fit(X_train,y_train)
gb.score(X_train,y_train)

0.9157303370786517

In [87]:
gb.score(X_test,y_test)
scores["gb"]=gb.score(X_test,y_test)*100

In [107]:
scores


{'DT': 77.528089887640448,
 'GNB': 78.089887640449433,
 'LR': 78.651685393258433,
 'gb': 82.022471910112358}
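
With 178 test rows, each row is worth about 0.56 percentage points, so the DT, GNB and LR scores differ by only one or two predictions. Cross-validation gives a steadier comparison than a single split. The following is a minimal sketch using `cross_val_score` on synthetic data; the data and the choice of five folds are assumptions, and the notebook itself does not do this.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the Titanic features and labels.
X_toy, y_toy = make_classification(n_samples=300, n_features=7, random_state=0)

models = {
    "GNB": GaussianNB(),
    "LR": LogisticRegression(max_iter=1000),
}

# Five accuracy values per model, one for each held-out fold.
cv_scores = {
    name: cross_val_score(model, X_toy, y_toy, cv=5)
    for name, model in models.items()
}

for name, fold_scores in cv_scores.items():
    print(name, fold_scores.mean(), fold_scores.std())
```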

# Engineering

In [187]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null int64
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null float64
dtypes: float64(3), int64(6), object(3)
memory usage: 83.6+ KB


In [186]:
print("{} rows in the set, {} rows in the training set, {} in the test set. {} rows excluded".format(train.shape[0],y_train.shape[0],y_test.shape[0],(train.shape[0]-(y_train.shape[0]+y_test.shape[0]))))

891 rows in the set, 534 rows in the training set, 178 in the test set. 179 rows excluded


In [189]:
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
PassengerId    418 non-null int64
Pclass         418 non-null int64
Name           418 non-null object
Sex            418 non-null object
Age            332 non-null float64
SibSp          418 non-null int64
Parch          418 non-null int64
Ticket         418 non-null object
Fare           417 non-null float64
Cabin          91 non-null object
Embarked       418 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB


In [195]:
lrmodel.predict(test[features].drop(["Survived"]))

KeyError: "['Survived'] not in index"
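
The call fails because `features` contains "Survived", which the Kaggle `test` frame does not have, so `test[features]` raises the KeyError before `drop` is reached. Beyond that, `test` still has string values in "Sex" and "Embarked" and gaps in "Age" and "Fare", so it needs the same encoding as `train` plus some handling of missing values. The sketch below shows one possible way to do this on made-up frames. The names `predictors`, `encode`, `toy_train` and `toy_test` are hypothetical, only a subset of the columns is used, and filling gaps with training medians is an assumption (the notebook drops incomplete training rows instead).

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Predictor columns only: the label is deliberately left out of this list.
predictors = ["Pclass", "Sex", "Age", "Fare", "Embarked"]

# Made-up rows shaped like the Titanic data.
toy_train = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Pclass": [3, 1, 3, 3, 1, 2],
    "Sex": ["male", "female", "female", "male", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, np.nan, 54.0],
    "Fare": [7.25, 71.28, 7.92, 8.05, 53.10, 13.00],
    "Embarked": ["S", "C", "S", "S", np.nan, "Q"],
})
toy_test = pd.DataFrame({
    "Pclass": [3, 1],
    "Sex": ["male", "female"],
    "Age": [np.nan, 47.0],
    "Fare": [7.83, np.nan],
    "Embarked": ["Q", "S"],
})

def encode(df):
    """Select the predictors and apply the same integer codes used for train."""
    out = df[predictors].copy()
    out["Sex"] = out["Sex"].map({"male": 0, "female": 1})
    out["Embarked"] = out["Embarked"].map({"S": 0, "C": 1, "Q": 2})
    return out

train_encoded = encode(toy_train)
fill_values = train_encoded.median()            # medians from training rows only
X_fit = train_encoded.fillna(fill_values)
X_new = encode(toy_test).fillna(fill_values)    # reuse the training medians

model = LogisticRegression(max_iter=1000).fit(X_fit, toy_train["Survived"])
predictions = model.predict(X_new)
print(predictions)
```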