# Titanic: Machine Learning from Disaster

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history.  On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people, such as women, children, and the upper class, were more likely to survive than others.

This notebook shows a simple example of exploring and visualizing the available data in order to predict which passengers were able to survive the tragedy.

### Import needed packages

In [1]:
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
import pandas as pd
import sklearn as sk
import plotly.plotly as py
py.sign_in('youssef.emad.293', '0gxvde9i5t')



### Read Data

In [2]:
data_frame = pd.read_csv("train.csv")
data_frame.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35,0,0,373450,8.05,,S


### Reforming Non-numeric Data

Most basic machine learning techniques do not work on strings, and in Python they almost always expect the data as a numeric array. sklearn estimators in particular operate on numeric arrays rather than on raw string columns.
To get there, we drop the attributes we do not need and encode the remaining non-numeric ones as numbers.

In [3]:
data_frame = data_frame.drop(['Embarked','Cabin','Name','Ticket'], axis=1)
data_frame['Sex'] = data_frame['Sex'].map({'female':0,'male':1})

In [4]:
data_frame.head()

Unnamed: 0,PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Fare
0,1,0,3,1,22,1,0,7.25
1,2,1,1,0,38,1,0,71.2833
2,3,1,3,0,26,0,0,7.925
3,4,1,1,0,35,1,0,53.1
4,5,0,3,1,35,0,0,8.05
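
One caveat with `Series.map` and a dictionary is that any value missing from the dictionary becomes `NaN` instead of raising an error. The sketch below uses a made-up toy frame (the `toy` data and the `'unknown'` label are illustrative assumptions, not values from the Titanic data) to show one way of checking for that after encoding:

```python
import pandas as pd

# Hypothetical toy data; 'unknown' stands in for any label missing from the mapping.
toy = pd.DataFrame({"Sex": ["male", "female", "female", "unknown"]})

sex_codes = {"female": 0, "male": 1}
encoded = toy["Sex"].map(sex_codes)

# Labels absent from the dictionary are mapped to NaN, so count them explicitly.
unmapped_count = int(encoded.isnull().sum())
print(encoded.tolist())
print("unmapped values:", unmapped_count)
```

In the notebook's data the `Sex` column stays fully populated after the mapping, as the `info()` output in the next section shows.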


### Handling Missing Data

In [5]:
data_frame.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 0 to 890
Data columns (total 8 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Sex            891 non-null int64
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Fare           891 non-null float64
dtypes: float64(2), int64(6)

The Age column contains missing data: only 714 of its 891 entries are non-null.
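
The `info()` output above implies 891 - 714 = 177 missing ages. As a minimal sketch, the same count can be read directly with `isnull().sum()`; the `toy` frame below is invented for illustration and only mimics the shape of the problem:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing ages.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Fare": [7.25, 71.28, 7.93, 53.10],
})

# isnull() marks missing cells; summing the booleans counts them per column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)
```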

### Handling Missing Age

We can handle the missing ages by filling them with the mean age, filling them with the median age, or removing the rows that lack an age.
It is a good idea to take the other attributes into account when filling in missing data, so here we use the passenger's gender and class when guessing a missing age.

Calculating the Median.

In [6]:
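# median_ages has 2 rows for gender (0 = female, 1 = male) and 3 columns for class (Pclass 1 to 3)
median_ages = np.zeros((2,3))
for i in range(0, 2):
    for j in range(0, 3):
        # Median age of the passengers with this gender and class, ignoring missing ages
        median_ages[i,j] = data_frame[(data_frame['Sex'] == i) &
                              (data_frame['Pclass'] == j+1)]['Age'].dropna().median()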
 
median_ages

array([[ 35. ,  28. ,  21.5],
       [ 40. ,  30. ,  25. ]])

Filling the missing data.

In [7]:
for i in range(0, 2):
    for j in range(0, 3):
        data_frame.loc[ (data_frame.Age.isnull()) & (data_frame.Sex == i) & (data_frame.Pclass == j+1),\
                'Age'] = median_ages[i,j]
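
The two loops above can also be written as a single grouped operation. This is an alternative sketch, not the notebook's method: it assumes a toy frame with the same column names and uses `groupby(...).transform('median')`, which returns the group median aligned to every row so it can be passed straight to `fillna`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two (Sex, Pclass) groups, each with one missing age.
toy = pd.DataFrame({
    "Sex":    [0, 0, 0, 1, 1, 1],
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Age":    [30.0, 40.0, np.nan, 20.0, 24.0, np.nan],
})

# Median age per (Sex, Pclass) group, broadcast back to the original row order.
group_median = toy.groupby(["Sex", "Pclass"])["Age"].transform("median")

# Fill only the missing ages; existing ages are left untouched.
toy["Age"] = toy["Age"].fillna(group_median)
print(toy)
```

The design trade-off is mostly readability: the loop makes the 2 x 3 table of medians explicit, while the grouped version avoids hard-coding the number of genders and classes.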

In [8]:
data_frame.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 0 to 890
Data columns (total 8 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Sex            891 non-null int64
Age            891 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Fare           891 non-null float64
dtypes: float64(2), int64(6)

## Exploring Data through Visualization

In [9]:
from plotly.graph_objs import Bar

survived_count = len(data_frame[data_frame['Survived'] == 1])
drowned_count = len(data_frame[data_frame['Survived'] == 0])

py.iplot([Bar(x=['Survived','Drowned'],y=[survived_count , drowned_count])],filename='Titanic_survivor')

In [10]:
from plotly.graph_objs import Layout,Figure

survived = data_frame[data_frame['Survived'] == 1]
drowned = data_frame[data_frame['Survived'] == 0]

survived_male = len(survived[survived['Sex'] == 1])
survived_female = len(survived[survived['Sex'] == 0])

drowned_male = len(drowned[drowned['Sex'] == 1])
drowned_female = len(drowned[drowned['Sex'] == 0])

trace_male = Bar(x=['Survived','Drowned'],y=[survived_male , drowned_male],name='Male')
trace_female = Bar(x=['Survived','Drowned'],y=[survived_female , drowned_female],name='Female')

layout = Layout(barmode='stack')
data = [trace_male,trace_female]

fig = Figure(data=data,layout=layout)
py.iplot(fig,filename='Titanic_survivor')

#### Survival by Age, Gender and Class

Histogram of the ages of the survivors

In [11]:
from plotly.graph_objs import Histogram

# Ages of the passengers who survived
survived_ages_histogram_data = np.array(data_frame[data_frame['Survived'] == 1]['Age'])
# nbinsx is a property of the Histogram trace, not an argument of py.iplot
py.iplot([Histogram(x=survived_ages_histogram_data, nbinsx=20)], filename='titanic_survived_ages_histogram')

The Passengers by class

In [12]:
data_class1_count = len(data_frame[data_frame['Pclass'] == 1])
data_class2_count = len(data_frame[data_frame['Pclass'] == 2])
data_class3_count = len(data_frame[data_frame['Pclass'] == 3])

py.iplot([Bar(x=['Class 1' , 'Class 2' , 'Class 3'],y=[data_class1_count , data_class2_count, data_class3_count])],filename='Titanic_passengers_class_count')





The Passengers by class and gender

In [13]:
class1_data = data_frame[data_frame['Pclass'] == 1]
class2_data = data_frame[data_frame['Pclass'] == 2]
class3_data = data_frame[data_frame['Pclass'] == 3]

class1_male = len(class1_data[class1_data['Sex'] == 1])
class1_female = len(class1_data[class1_data['Sex'] == 0])

class2_male = len(class2_data[class2_data['Sex'] == 1])
class2_female = len(class2_data[class2_data['Sex'] == 0])

class3_male = len(class3_data[class3_data['Sex'] == 1])
class3_female = len(class3_data[class3_data['Sex'] == 0])

# Pclass 1 is the upper class and Pclass 3 the lower class, so the labels follow that order
trace_male = Bar(x=['High Class','Middle Class', 'Low Class'],y=[class1_male , class2_male,class3_male],name='Male')
trace_female = Bar(x=['High Class','Middle Class', 'Low Class'],y=[class1_female , class2_female,class3_female],name='Female')

layout = Layout(barmode='stack')
data = [trace_male,trace_female]

fig = Figure(data=data,layout=layout)
py.iplot(fig,filename='Titanic_passenger_male_female')

The Survivors by class

In [None]:
data_survived = data_frame[data_frame['Survived'] == 1]
data_class1_count = len(data_survived[data_survived['Pclass'] == 1])
data_class2_count = len(data_survived[data_survived['Pclass'] == 2])
data_class3_count = len(data_survived[data_survived['Pclass'] == 3])

py.iplot([Bar(x=['Class 1' , 'Class 2' , 'Class 3'],y=[data_class1_count , data_class2_count, data_class3_count])],filename='Titanic_passengers_class_count')
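
Raw survivor counts per class are hard to compare because the classes differ in size. As a hedged complement to the bar chart, the sketch below computes a survival rate per class on an invented toy frame; because `Survived` is coded 0/1, its mean within a group is the fraction that survived:

```python
import pandas as pd

# Hypothetical toy data, not the real Titanic counts.
toy = pd.DataFrame({
    "Pclass":   [1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 0, 0, 1, 0],
})

# Mean of a 0/1 column per group = share of survivors in that class.
survival_rate = toy.groupby("Pclass")["Survived"].mean()
print(survival_rate)
```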


The survivors by class and gender

In [None]:
data_survived = data_frame[data_frame['Survived'] == 1]

class1_data = data_survived[data_survived['Pclass'] == 1]
class2_data = data_survived[data_survived['Pclass'] == 2]
class3_data = data_survived[data_survived['Pclass'] == 3]

class1_male = len(class1_data[class1_data['Sex'] == 1])
class1_female = len(class1_data[class1_data['Sex'] == 0])

class2_male = len(class2_data[class2_data['Sex'] == 1])
class2_female = len(class2_data[class2_data['Sex'] == 0])

class3_male = len(class3_data[class3_data['Sex'] == 1])
class3_female = len(class3_data[class3_data['Sex'] == 0])

# Pclass 1 is the upper class and Pclass 3 the lower class, so the labels follow that order
trace_male = Bar(x=['High Class','Middle Class', 'Low Class'],y=[class1_male , class2_male,class3_male],name='Male')
trace_female = Bar(x=['High Class','Middle Class', 'Low Class'],y=[class1_female , class2_female,class3_female],name='Female')

layout = Layout(barmode='stack')
data = [trace_male,trace_female]

fig = Figure(data=data,layout=layout)
py.iplot(fig,filename='Titanic_passenger_male_female')
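
The class and gender breakdown can also be produced as a table without counting each subgroup by hand. The sketch below is one possible approach on invented toy data: `pd.crosstab` with `normalize='index'` turns each (class, gender) row into proportions, so the column for `Survived == 1` reads as a survival rate:

```python
import pandas as pd

# Hypothetical toy data; Sex uses the notebook's coding (0 = female, 1 = male).
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 1, 3, 3, 3, 3],
    "Sex":      [0, 0, 1, 1, 0, 0, 1, 1],
    "Survived": [1, 1, 1, 0, 1, 0, 0, 0],
})

# Rows: (Pclass, Sex) pairs. Columns: Survived 0/1. Each row sums to 1.
rates = pd.crosstab([toy["Pclass"], toy["Sex"]], toy["Survived"], normalize="index")
print(rates)
```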

## Prediction using Sklearn

sklearn estimators operate on numeric arrays, so we convert the DataFrame into NumPy arrays of features and labels.

In [14]:
data_frame.head()

Unnamed: 0,PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Fare
0,1,0,3,1,22,1,0,7.25
1,2,1,1,0,38,1,0,71.2833
2,3,1,3,0,26,0,0,7.925
3,4,1,1,0,35,1,0,53.1
4,5,0,3,1,35,0,0,8.05


In [15]:
data = data_frame.drop(['PassengerId','Survived'],axis=1).values
labels = data_frame['Survived'].values
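
A quick shape check after the conversion helps confirm that the features and labels line up. The sketch below repeats the same two steps on an invented three-row frame (the `toy_` names are hypothetical and chosen to avoid clashing with the notebook's variables):

```python
import pandas as pd

# Hypothetical toy data with the identifier, the label, and two feature columns.
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Survived":    [0, 1, 1],
    "Pclass":      [3, 1, 3],
    "Age":         [22.0, 38.0, 26.0],
})

# Drop the identifier and the label to keep only the feature columns.
toy_features = toy.drop(["PassengerId", "Survived"], axis=1).values
toy_labels = toy["Survived"].values

# One row per passenger in both arrays; one column per remaining feature.
print(toy_features.shape, toy_labels.shape)
```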

Split the data into training and test sets

In [16]:
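# train_test_split lives in sklearn.model_selection (the old sklearn.cross_validation module was removed in scikit-learn 0.20)
from sklearn.model_selection import train_test_split
train_data, test_data, train_labels, test_labels = train_test_split(data, labels, test_size=0.2, random_state=0)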
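
With `test_size=0.2`, a 891-row data set ends up with 179 test rows and 712 training rows. The sketch below shows the same split on a ten-row toy array and adds `stratify`, an optional argument the notebook does not use, which keeps the class proportions equal in both parts. The toy arrays are assumptions for illustration only:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples, 2 features, balanced binary labels.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

# stratify=y keeps the 50/50 label balance in both the training and the test part.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)
print(len(X_train), len(X_test), sorted(y_test.tolist()))
```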

### Random Forest

In [17]:
from sklearn.ensemble import RandomForestClassifier 

forest = RandomForestClassifier(n_estimators = 100)
forest = forest.fit(train_data,train_labels)
output = forest.predict(test_data)

In [18]:
from sklearn.metrics import accuracy_score
accuracy_score(test_labels,output)

0.83240223463687146
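
Accuracy is simply the fraction of test rows predicted correctly; on a 179-row test set, the score above corresponds to 149 correct predictions (149 / 179). The sketch below checks that definition against `accuracy_score` on a few made-up labels:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical true and predicted labels; 3 of the 5 predictions match.
y_true = np.array([1, 0, 1, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1])

# Fraction of positions where prediction and truth agree.
manual_accuracy = float(np.mean(y_true == y_pred))
library_accuracy = float(accuracy_score(y_true, y_pred))
print(manual_accuracy, library_accuracy)
```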

### Logistic Regression

In [19]:
from sklearn.linear_model import LogisticRegression

log = LogisticRegression()
log = log.fit(train_data,train_labels)
output2 = log.predict(test_data)

In [20]:
from sklearn.metrics import accuracy_score
accuracy_score(test_labels,output2)

0.79329608938547491
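
The two scores come from a single train/test split, so the gap between them is only a handful of test rows. One common way to get a steadier comparison is k-fold cross-validation. The sketch below is an assumption-laden illustration: it runs on synthetic data from `make_classification` rather than the Titanic arrays, and the `max_iter` and `random_state` settings are choices made here, not taken from the notebook:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption); replace with the real features and labels.
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

models = {
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "logistic regression": LogisticRegression(max_iter=1000),
}

# Five accuracy scores per model, one for each held-out fold.
cv_scores = {name: cross_val_score(model, X, y, cv=5) for name, model in models.items()}
for name, scores in cv_scores.items():
    print("%s: mean=%.3f std=%.3f" % (name, scores.mean(), scores.std()))
```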

### Effect of Adding New Features on Accuracy

We can add new features using simple calculations.

In [21]:
data_frame.head()

Unnamed: 0,PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Fare
0,1,0,3,1,22,1,0,7.25
1,2,1,1,0,38,1,0,71.2833
2,3,1,3,0,26,0,0,7.925
3,4,1,1,0,35,1,0,53.1
4,5,0,3,1,35,0,0,8.05


In [22]:
data_frame['ClassAge'] = data_frame['Age'] * data_frame['Pclass']
data_frame.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 0 to 890
Data columns (total 9 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Sex            891 non-null int64
Age            891 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Fare           891 non-null float64
ClassAge       891 non-null float64
dtypes: float64(3), int64(6)

In [23]:
data = data_frame.drop(['PassengerId','Survived'],axis=1).values
labels = data_frame['Survived'].values
# train_test_split lives in sklearn.model_selection (the old sklearn.cross_validation module was removed in scikit-learn 0.20)
from sklearn.model_selection import train_test_split
train_data, test_data, train_labels, test_labels = train_test_split(data, labels, test_size=0.2, random_state=0)

#### Random Forest Prediction

In [24]:
forest_fe = RandomForestClassifier(n_estimators = 100)
# Fit and predict with the new forest_fe model rather than the earlier forest
forest_fe = forest_fe.fit(train_data,train_labels)
output_fe = forest_fe.predict(test_data)
accuracy_score(test_labels,output_fe)

0.82681564245810057
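
A small change in accuracy from one split does not say much about whether the new feature helps. One rough way to see how much a random forest relies on each column is its `feature_importances_` attribute. The sketch below is illustrative only: the data and the labelling rule are synthetic assumptions, and `toy_forest` is a hypothetical name:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption), with the same derived feature as above.
rng = np.random.RandomState(0)
toy = pd.DataFrame({
    "Pclass": rng.randint(1, 4, size=200),
    "Age": rng.uniform(1, 80, size=200),
})
toy["ClassAge"] = toy["Age"] * toy["Pclass"]

# Made-up labelling rule so the forest has a signal to learn.
toy_labels = (toy["ClassAge"] < toy["ClassAge"].median()).astype(int)

toy_forest = RandomForestClassifier(n_estimators=100, random_state=0)
toy_forest.fit(toy.values, toy_labels.values)

# Importances are normalized to sum to 1; larger means the trees split on it more usefully.
importances = pd.Series(toy_forest.feature_importances_, index=toy.columns)
print(importances.sort_values(ascending=False))
```

Importances from correlated columns (here `ClassAge` is built from `Age` and `Pclass`) are shared between them, so they are best read as a hint rather than a ranking.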