# TITANIC: MACHINE LEARNING FROM DISASTER

This notebook walks through one approach to the Titanic survival-prediction problem.

The problem statement is given below:

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history.  On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.



Suggestions are always welcome; I am still working on improving this notebook.

## Loading the libraries

In [104]:
import numpy as np
import pandas as pd
import sys
# for visualizing the data 
import matplotlib.pyplot as plt

## Loading the data

In [105]:
train=pd.read_csv("train.csv")
test=pd.read_csv("test.csv")
# keep both dataframes together in a list (they are not concatenated)
train_test=[train,test]

## Understanding the data

In [106]:
train.head(10)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C


In [107]:
train.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


In [108]:
# missing values per column in the training data
train.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [109]:
test.isnull().sum()

PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64

In [110]:
train.describe(include=['O'])               # summary of the string (object) columns

Unnamed: 0,Name,Sex,Ticket,Cabin,Embarked
count,891,891,891,204,889
unique,891,2,681,147,3
top,"Mernagh, Mr. Robert",male,1601,C23 C25 C27,S
freq,1,577,7,4,644


## Finding the relation between survival and other features

## 1 Pclass vs Survival

In [111]:
train.groupby('Pclass').Survived.value_counts()

Pclass  Survived
1       1           136
        0            80
2       0            97
        1            87
3       0           372
        1           119
Name: Survived, dtype: int64

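Survival differs sharply by class: about 63% of first-class passengers survived (136 of 216), against about 47% in second class (87 of 184) and about 24% in third class (119 of 491). Pclass is therefore a useful feature.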
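
To compare classes on an equal footing, the counts can be turned into survival rates. The sketch below rebuilds a small frame from the counts printed above instead of reading `train.csv`; the names `counts` and `rates` are hypothetical.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
counts = pd.DataFrame(
    {"Pclass": [1, 2, 3], "survived": [136, 87, 119], "died": [80, 97, 372]}
).set_index("Pclass")

# Survival rate = survivors / all passengers in that class.
rates = counts["survived"] / (counts["survived"] + counts["died"])
print(rates.round(3))
```

On the full training frame, `train.groupby('Pclass')['Survived'].mean()` should give the same rates, because `Survived` is coded as 0/1.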

In [112]:
pclass_train=np.array(train['Pclass'])
pclass_test=np.array(test['Pclass'])
pclass_train.shape

(891,)

## 2 Age vs Survival

The Age column has missing values (177 in the training set and 86 in the test set), so we fill them first, using the median age of the training set.

In [113]:
train.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [114]:
median=train['Age'].median()
print(median)

28.0


In [115]:
train['Age']=train['Age'].fillna(median)
test['Age']=test['Age'].fillna(median)

In [116]:
train.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age              0
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [117]:
test.isnull().sum()

PassengerId      0
Pclass           0
Name             0
Sex              0
Age              0
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64

In [118]:
train.loc[ train['Age'] <= 16, 'Age'] = 0
train.loc[(train['Age'] > 16) & (train['Age'] <= 32), 'Age'] = 1
train.loc[(train['Age'] > 32) & (train['Age'] <= 48), 'Age'] = 2
train.loc[(train['Age'] > 48) & (train['Age'] <= 64), 'Age'] = 3
train.loc[ train['Age'] > 64, 'Age'] = 4  

test.loc[ test['Age'] <= 16, 'Age'] = 0
test.loc[(test['Age'] > 16) & (test['Age'] <= 32), 'Age'] = 1
test.loc[(test['Age'] > 32) & (test['Age'] <= 48), 'Age'] = 2
test.loc[(test['Age'] > 48) & (test['Age'] <= 64), 'Age'] = 3
test.loc[ test['Age'] > 64, 'Age'] = 4  
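
The same five age bands can probably be produced in one call with `pd.cut`. This is a sketch on made-up ages, assuming right-closed bins that match the `<=` comparisons above; `ages`, `bins` and `age_band` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Made-up ages that include the band edges 16, 32 and 64.
ages = pd.Series([2.0, 16.0, 22.0, 32.0, 38.0, 54.0, 64.0, 80.0])

# Right-closed bins: (-inf, 16], (16, 32], (32, 48], (48, 64], (64, inf)
bins = [-np.inf, 16, 32, 48, 64, np.inf]

# labels=False returns the integer bin index (0..4) instead of interval labels.
age_band = pd.cut(ages, bins=bins, labels=False)
print(age_band.tolist())
```

Applied to the real data this would be `pd.cut(train['Age'], bins=bins, labels=False)`, run after the missing ages have been filled.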

In [119]:
train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,1.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,2.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,1.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,2.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,2.0,0,0,373450,8.05,,S


In [120]:
test.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,2.0,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,2.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,3.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,1.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,1.0,1,1,3101298,12.2875,,S


In [121]:
age_train=np.array(train['Age'],dtype=int)
age_test=np.array(test['Age'],dtype=int)

## 3 Gender vs Survival

In [122]:
train.groupby('Sex').Survived.value_counts()

Sex     Survived
female  1           233
        0            81
male    0           468
        1           109
Name: Survived, dtype: int64

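The counts above show that women were far more likely to survive: about 74% of female passengers survived (233 of 314), against about 19% of male passengers (109 of 577).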

In [123]:
# convert the Sex column to numeric codes: male -> 0, female -> 1
# .loc assigns in place and avoids the chained-assignment SettingWithCopyWarning
train['sex_numeric']=-1
train.loc[train['Sex']=='male','sex_numeric']=0
train.loc[train['Sex']=='female','sex_numeric']=1
test['sex_numeric']=-1
test.loc[test['Sex']=='male','sex_numeric']=0
test.loc[test['Sex']=='female','sex_numeric']=1


In [124]:
train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,sex_numeric
0,1,0,3,"Braund, Mr. Owen Harris",male,1.0,1,0,A/5 21171,7.25,,S,0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,2.0,1,0,PC 17599,71.2833,C85,C,1
2,3,1,3,"Heikkinen, Miss. Laina",female,1.0,0,0,STON/O2. 3101282,7.925,,S,1
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,2.0,1,0,113803,53.1,C123,S,1
4,5,0,3,"Allen, Mr. William Henry",male,2.0,0,0,373450,8.05,,S,0


In [125]:
test.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,sex_numeric
0,892,3,"Kelly, Mr. James",male,2.0,0,0,330911,7.8292,,Q,0
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,2.0,1,0,363272,7.0,,S,1
2,894,2,"Myles, Mr. Thomas Francis",male,3.0,0,0,240276,9.6875,,Q,0
3,895,3,"Wirz, Mr. Albert",male,1.0,0,0,315154,8.6625,,S,0
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,1.0,1,1,3101298,12.2875,,S,1


In [126]:
sex_numeric_train=np.array(train['sex_numeric'])
sex_numeric_test=np.array(test['sex_numeric'])
print(sex_numeric_train.shape,sex_numeric_test.shape)

(891,) (418,)


## 4 Embarked vs Survival

In [127]:
train.groupby('Embarked').Survived.value_counts()

Embarked  Survived
C         1            93
          0            75
Q         0            47
          1            30
S         0           427
          1           217
Name: Survived, dtype: int64

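Passengers who embarked at C had the highest survival rate (93 of 168, about 55%), followed by Q (30 of 77, about 39%) and S (217 of 644, about 34%). S has the most survivors in absolute terms only because most passengers boarded there.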
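
Raw counts are hard to compare because the ports have very different passenger numbers, so it helps to normalise them per port. The sketch below expands the counts printed above into one row per passenger (missing ports are left out) and uses `pd.crosstab`; `rows` and `table` are hypothetical names.

```python
import pandas as pd

# One row per passenger, rebuilt from the counts in the output above.
rows = pd.DataFrame({
    "Embarked": ["C"] * 168 + ["Q"] * 77 + ["S"] * 644,
    "Survived": [1] * 93 + [0] * 75 + [1] * 30 + [0] * 47 + [1] * 217 + [0] * 427,
})

# normalize="index" makes each row (port) sum to 1, so column 1 is the survival rate.
table = pd.crosstab(rows["Embarked"], rows["Survived"], normalize="index")
print(table.round(3))
```

Column `1` of `table` is the survival rate per port: roughly 0.554 for C, 0.390 for Q and 0.337 for S.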


In [128]:
train['Embarked'].value_counts()

S    644
C    168
Q     77
Name: Embarked, dtype: int64

S is by far the most frequent port of embarkation.

In [129]:
test['Embarked'].value_counts()

S    270
C    102
Q     46
Name: Embarked, dtype: int64

In [130]:
train.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age              0
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
sex_numeric      0
dtype: int64

Only two values are missing in the Embarked column, so I will fill them with the most frequent value, S.
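
Rather than hard-coding the letter, the most frequent value can be looked up with `Series.mode()`. A minimal sketch on a made-up column; `embarked`, `most_common` and `filled` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Made-up column with two missing entries.
embarked = pd.Series(["S", "C", np.nan, "S", "Q", np.nan])

# mode() ignores NaN by default; [0] takes the first (most frequent) value.
most_common = embarked.mode()[0]
filled = embarked.fillna(most_common)
print(most_common, filled.tolist())
```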

In [131]:
train['Embarked']=train['Embarked'].fillna('S')

In [132]:
train['Embarked'].value_counts()

S    646
C    168
Q     77
Name: Embarked, dtype: int64

In [133]:
test.isnull().sum()

PassengerId      0
Pclass           0
Name             0
Sex              0
Age              0
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
sex_numeric      0
dtype: int64

The test data has no missing values for Embarked, so nothing needs to be filled there.

Embarked can therefore serve as one more feature. Since it is a string column, it has to be converted to numeric values: S becomes 0, C becomes 1 and Q becomes 2.
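
One compact way to express this encoding is a dictionary passed to `Series.map`. This is a sketch on a made-up column, not the cell used in this notebook; `port_codes` and `embar_numeric` here are illustrative names.

```python
import pandas as pd

# Mapping described above: S -> 0, C -> 1, Q -> 2.
port_codes = {"S": 0, "C": 1, "Q": 2}

embarked = pd.Series(["S", "C", "Q", "S"])

# map() looks each value up in the dictionary; unmapped values would become NaN.
embar_numeric = embarked.map(port_codes)
print(embar_numeric.tolist())
```

The same pattern would work for the Sex column with `{'male': 0, 'female': 1}`.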


In [134]:
train['embar_numeric']=0
# .loc assigns in place and avoids the chained-assignment SettingWithCopyWarning
train.loc[train['Embarked']=='S','embar_numeric']=0
train.loc[train['Embarked']=='C','embar_numeric']=1
train.loc[train['Embarked']=='Q','embar_numeric']=2


In [135]:
train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,sex_numeric,embar_numeric
0,1,0,3,"Braund, Mr. Owen Harris",male,1.0,1,0,A/5 21171,7.25,,S,0,0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,2.0,1,0,PC 17599,71.2833,C85,C,1,1
2,3,1,3,"Heikkinen, Miss. Laina",female,1.0,0,0,STON/O2. 3101282,7.925,,S,1,0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,2.0,1,0,113803,53.1,C123,S,1,0
4,5,0,3,"Allen, Mr. William Henry",male,2.0,0,0,373450,8.05,,S,0,0


In [136]:
test['embar_numeric']=0
# .loc assigns in place and avoids the chained-assignment SettingWithCopyWarning
test.loc[test['Embarked']=='S','embar_numeric']=0
test.loc[test['Embarked']=='C','embar_numeric']=1
test.loc[test['Embarked']=='Q','embar_numeric']=2


In [137]:
test.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,sex_numeric,embar_numeric
0,892,3,"Kelly, Mr. James",male,2.0,0,0,330911,7.8292,,Q,0,2
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,2.0,1,0,363272,7.0,,S,1,0
2,894,2,"Myles, Mr. Thomas Francis",male,3.0,0,0,240276,9.6875,,Q,0,2
3,895,3,"Wirz, Mr. Albert",male,1.0,0,0,315154,8.6625,,S,0,0
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,1.0,1,1,3101298,12.2875,,S,1,0


In [138]:
embarked_train=np.array(train['embar_numeric'])
embarked_test=np.array(test['embar_numeric'])
print(embarked_train.shape,embarked_test.shape)

(891,) (418,)


We now have four features: Pclass, age band, sex and port of embarkation. Next we train models on them.

In [139]:
# importing the libraries 
from sklearn.linear_model import LogisticRegression
# sklearn.cross_validation was removed in newer scikit-learn; model_selection replaces it
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

In [140]:
m=len(train['Age'])
print(m)
print(pclass_train.shape,age_train.shape,sex_numeric_train.shape,embarked_train.shape)
pclass_train.shape=(m,1)
age_train.shape=(m,1)
sex_numeric_train.shape=(m,1)
embarked_train.shape=(m,1)

X_train=np.hstack((pclass_train,age_train,sex_numeric_train,embarked_train))
print(X_train.shape)
y_train=np.array(train['Survived'])

891
(891,) (891,) (891,) (891,)
(891, 4)
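
The reshape-and-stack steps above can likely be shortened by selecting the feature columns from the frame directly. A sketch on a three-row stand-in for `train`; `toy` and `feature_cols` are hypothetical names.

```python
import pandas as pd

# Three-row stand-in for the prepared training frame.
toy = pd.DataFrame({
    "Pclass": [3, 1, 3],
    "Age": [1.0, 2.0, 1.0],
    "sex_numeric": [0, 1, 1],
    "embar_numeric": [0, 1, 0],
    "Survived": [0, 1, 1],
})

feature_cols = ["Pclass", "Age", "sex_numeric", "embar_numeric"]

# Selecting a list of columns keeps their order; to_numpy gives an (n_rows, 4) array.
X = toy[feature_cols].to_numpy(dtype=int)
y = toy["Survived"].to_numpy()
print(X.shape, y.shape)
```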


In [141]:
m_test=len(test['Age'])
print(m_test)
print(pclass_test.shape,age_test.shape,sex_numeric_test.shape,embarked_test.shape)
pclass_test.shape=(m_test,1)
age_test.shape=(m_test,1)
sex_numeric_test.shape=(m_test,1)
embarked_test.shape=(m_test,1)

X_test=np.hstack((pclass_test,age_test,sex_numeric_test,embarked_test))
# we have to predict the survival for the test data

418
(418,) (418,) (418,) (418,)


## Logistic Regression

In [142]:
lr=LogisticRegression()
lr.fit(X_train,y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [143]:
train_acc_lr=lr.score(X_train,y_train)
print(train_acc_lr)

0.79012345679


## Decision Tree

In [144]:
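# hold out half of the training data for validation (the cells below still fit on the full X_train)
X_training,X_cross,y_training,y_cross=train_test_split(X_train,y_train,test_size=0.5)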

In [147]:
dt=DecisionTreeClassifier()
dt.fit(X_train,y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')

In [149]:
train_acc_dt=dt.score(X_train,y_train)
print(train_acc_dt)
# cross_acc_dt=dt.score(X_cross,y_cross)
# print(cross_acc_dt)

0.827160493827
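
The accuracy above is measured on the same rows the tree was fitted on, so it tends to be optimistic. A hold-out split gives a fairer estimate. The sketch below uses synthetic data (not the Titanic features) with noisy labels; all variable names are hypothetical and the printed numbers say nothing about the real dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in: 200 rows, 4 small integer features like the encoded ones above.
X = rng.integers(0, 3, size=(200, 4))
# Label depends on one feature, with 20% of labels flipped at random (noise).
y = ((X[:, 2] > 0) ^ (rng.random(200) < 0.2)).astype(int)

# Fit on one half, evaluate on the other half.
X_fit, X_val, y_fit, y_val = train_test_split(X, y, test_size=0.5, random_state=0)
tree = DecisionTreeClassifier(random_state=0).fit(X_fit, y_fit)

fit_acc = tree.score(X_fit, y_fit)   # accuracy on rows the tree has seen
val_acc = tree.score(X_val, y_val)   # accuracy on held-out rows
print(fit_acc, val_acc)
```

In this notebook the equivalent would be to fit on `X_training, y_training` and score on `X_cross, y_cross`.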


## Random forest classifier

## Prediction

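Since the decision tree has the higher training accuracy (0.827 versus 0.790 for logistic regression), we will use it for the predictions. Both figures are measured on the training data, so they are likely to overstate how well the models do on unseen passengers.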

In [150]:
y_predict_dt=dt.predict(X_test)
print(y_predict_dt.shape)

(418,)


In [151]:
test.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,sex_numeric,embar_numeric
0,892,3,"Kelly, Mr. James",male,2.0,0,0,330911,7.8292,,Q,0,2
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,2.0,1,0,363272,7.0,,S,1,0
2,894,2,"Myles, Mr. Thomas Francis",male,3.0,0,0,240276,9.6875,,Q,0,2
3,895,3,"Wirz, Mr. Albert",male,1.0,0,0,315154,8.6625,,S,0,0
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,1.0,1,1,3101298,12.2875,,S,1,0


In [152]:
prediction=test['PassengerId'].to_frame()
prediction.head()

Unnamed: 0,PassengerId
0,892
1,893
2,894
3,895
4,896


In [159]:
prediction['Survived']=pd.Series(y_predict_dt)
prediction.head()

Unnamed: 0,PassengerId,Survived
0,892,0
1,893,0
2,894,0
3,895,0
4,896,0
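
Assigning `pd.Series(y_predict_dt)` to a column aligns on the index, which works here because both objects use the default 0..n-1 index. Building the frame from plain arrays avoids relying on that alignment. A sketch with made-up ids and predictions; `submission` and `buffer` are hypothetical names, and the in-memory buffer stands in for a file.

```python
import io

import numpy as np
import pandas as pd

passenger_ids = pd.Series([892, 893, 894])
predicted = np.array([0, 0, 1])  # stand-in for dt.predict(X_test)

# Both columns are plain arrays, so rows are paired by position, not by index label.
submission = pd.DataFrame({
    "PassengerId": passenger_ids.to_numpy(),
    "Survived": predicted,
})

# Write to an in-memory buffer; index=False keeps the row index out of the file.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
print(buffer.getvalue())
```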


In [158]:
# writing it in new file 
prediction.to_csv("prediction2.csv",index=False)
# got 74.04% accuracy with this submission, worse than the last one :(
# now got 77.990% after changing the age feature

In [155]:
y_predict_lr=lr.predict(X_test)
print(y_predict_lr.shape)

(418,)


In [156]:
prediction1=test['PassengerId'].to_frame()
prediction1['Survived']=pd.Series(y_predict_lr)
prediction1.head()

Unnamed: 0,PassengerId,Survived
0,892,0
1,893,0
2,894,0
3,895,0
4,896,1


In [157]:
prediction1.to_csv("prediction2_lr.csv",index=False)    
# I compared this with the last submission, which scored 76%: prediction2_lr.csv is identical to it,
# so the algorithm needs improving. The decision tree got 90% but does not work well on the test data.