# TITANIC: MACHINE LEARNING FROM DISASTER

In this notebook we will try to solve the Titanic problem from Kaggle.

The problem statement is given below:

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history.  On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.


In [23]:
# importing the libraries
import numpy as np
import matplotlib.pyplot as plt 
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in newer scikit-learn releases
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

First we load the data and look at the first rows to understand which columns are available.

In [24]:
# loading data using pandas

data=pd.read_csv("train.csv")
data.head(10)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C


Next we count the missing values in each column.

In [25]:
print("Missing values per column:")
data.isnull().sum()

Missing values per column:


PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

As the counts show, the data contains NaN values (for example, the Age of PassengerId 6 at row index 5), so we first have to clean the data.
According to the problem statement, survival depends on Pclass, Sex and Age.

Since the problem statement says that women were more likely to survive than men, we encode female as 1 and male as 0 in a new array.

In [26]:
l=len(data['Sex'])
fem=np.zeros(l,dtype='int64')
fem[(data['Sex']=='female')]=1
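The same 0/1 encoding can also be written as one vectorized expression. The sketch below runs on a small made-up frame (`toy` is hypothetical, not the Kaggle data) and assumes only that the column holds the strings 'male' and 'female'.

```python
import pandas as pd

# Hypothetical toy frame standing in for the Titanic training data.
toy = pd.DataFrame({"Sex": ["male", "female", "female", "male"]})

# Comparing the column to 'female' gives a boolean Series;
# casting it to int64 yields 1 for female and 0 for male.
fem_flag = (toy["Sex"] == "female").astype("int64")
print(fem_flag.tolist())
```

The result is a Series rather than a NumPy array, so it could be stored as a new column directly if that is preferred.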

In [27]:
median=data['Age'].median()
print(median)
# fill only the Age column; a frame-wide fillna would also overwrite missing Cabin and Embarked values
data['Age']=data['Age'].fillna(median)
age=np.array(data['Age'])
print(age.shape)
data.head(10)

28.0
(891,)


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
5,6,0,3,"Moran, Mr. James",male,28.0,0,0,330877,8.4583,,Q
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C
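Median imputation can be restricted to a single column, so that other columns with gaps (Cabin and Embarked in this dataset) keep their NaN markers. The following is a minimal sketch on a made-up frame; `toy` and its values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with gaps in two different columns.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 54.0, 2.0],
    "Cabin": [np.nan, "C85", np.nan, "E46"],
})

# Series.median skips NaN by default, so it is computed from 22, 54 and 2.
age_median = toy["Age"].median()

# Fill only the Age column; Cabin is left untouched.
toy["Age"] = toy["Age"].fillna(age_median)

print(age_median)
print(toy.isnull().sum())
```

After this step Age has no missing entries, while the two missing Cabin entries are still reported by `isnull().sum()`.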


According to the problem statement, Pclass = 1 denotes the upper class.


In [28]:
p_class=np.array(data['Pclass'])

In [29]:
y=np.array(data['Survived'])
length=len(y)
# count how many passengers in the training data survived
number=np.count_nonzero(y==1)
print(number)

342


In [30]:
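# count the survivors in each passenger class
count1=0   # survivors in class 3
count2=0   # survivors in class 2
count3=0   # survivors in class 1
for i in range(length):
    if p_class[i]==3 and y[i]==1:
        count1=count1+1
    if p_class[i]==2 and y[i]==1:
        count2=count2+1
    if p_class[i]==1 and y[i]==1:
        count3=count3+1

# share of all survivors that travelled in class 3, class 2 and class 1
print(count1/number,count2/number,count3/number)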


0.347953216374269 0.2543859649122807 0.39766081871345027
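The three numbers printed above are the shares of all survivors who travelled in third, second and first class, in that order. A related but different quantity is the survival rate within each class. The sketch below computes both with pandas on a tiny made-up frame (`toy` is hypothetical, so its numbers say nothing about the real data).

```python
import pandas as pd

# Hypothetical toy data: 2 first-class, 2 second-class, 4 third-class passengers.
toy = pd.DataFrame({
    "Pclass":   [1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 1, 0, 0, 0],
})

# The mean of a 0/1 column is the fraction of ones,
# so this is the survival rate inside each class.
rate_by_class = toy.groupby("Pclass")["Survived"].mean()

# Share of all survivors that belongs to each class
# (the quantity the loop above computes).
survivors = toy.loc[toy["Survived"] == 1, "Pclass"]
share_of_survivors = survivors.value_counts(normalize=True).sort_index()

print(rate_by_class)
print(share_of_survivors)
```

The two views can differ a lot when class sizes are unequal, which is why the within-class rate is usually the more informative one for judging how much Pclass matters.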


In [31]:
m=len(fem)
fem=fem.reshape(m,1)
age=age.reshape(m,1)
p_class=p_class.reshape(m,1)

# column order is fem, age, pclass (the test data must use the same order)

X_train=np.hstack((fem,age,p_class))
print(X_train.shape,y.shape)

(891, 3) (891,)


In [32]:
model=LogisticRegression()
model.fit(X_train,y)
w=model.coef_
b=model.intercept_
print(w.shape,b.shape)

(1, 3) (1,)
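For a binary logistic regression, the weight row `w` and the intercept `b` turn a feature row `x` into a survival probability via the sigmoid of `w·x + b`. The sketch below illustrates this with made-up parameters; the values of `w` and `b` are assumptions chosen for the example, not the ones learned by the model above.

```python
import numpy as np

def sigmoid(z):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters with the same shapes as model.coef_ and model.intercept_.
w = np.array([[2.5, -0.03, -1.0]])   # column order: fem, age, pclass
b = np.array([1.0])

X = np.array([
    [1, 30.0, 1],   # 30-year-old woman in first class
    [0, 30.0, 3],   # 30-year-old man in third class
])

# Linear score for each row, squashed into a probability of Survived = 1.
proba = sigmoid(X @ w.T + b).ravel()

# Thresholding the probability at 0.5 gives the predicted class.
pred = (proba >= 0.5).astype(int)
print(proba, pred)
```

With these illustrative weights the first row scores 1.6 and the second -2.9 before the sigmoid, so the first is predicted as a survivor and the second is not.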


In [33]:
# accuracy on the training data
k=model.score(X_train,y)
print(k)

0.794612794613


In [56]:
# next we fit a decision tree; the score below is again accuracy on the training data

decision_tree=DecisionTreeClassifier()
decision_tree.fit(X_train,y)
acc=decision_tree.score(X_train,y)
print(acc)

0.877665544332
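Both scores above are measured on the same rows the models were fitted on, so they are likely optimistic, especially for an unpruned decision tree. One common way to get a fairer estimate is a hold-out split with `train_test_split` and `accuracy_score`. The sketch below uses synthetic data in place of the Titanic features; the split size and random seeds are arbitrary assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the (fem, age, pclass) features; not the Titanic data.
rng = np.random.default_rng(0)
n = 400
X = np.column_stack((
    rng.integers(0, 2, n),    # fem
    rng.uniform(1, 80, n),    # age
    rng.integers(1, 4, n),    # pclass
))
# Labels loosely tied to the first feature, with added noise.
y = ((X[:, 0] + rng.normal(0, 0.7, n)) > 0.5).astype(int)

# Keep 25% of the rows aside; the model never sees them during fitting.
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_tr, y_tr)

train_acc = accuracy_score(y_tr, tree.predict(X_tr))
val_acc = accuracy_score(y_val, tree.predict(X_val))
print(train_acc, val_acc)
```

A large gap between the two accuracies is a sign of overfitting; limiting the tree with a parameter such as `max_depth` is one option to try in that case.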


In [35]:
# we have to predict survival for the test data, so we load it from the file
test_data=pd.read_csv('test.csv')
test_data.head(10)

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S
5,897,3,"Svensson, Mr. Johan Cervin",male,14.0,0,0,7538,9.225,,S
6,898,3,"Connolly, Miss. Kate",female,30.0,0,0,330972,7.6292,,Q
7,899,2,"Caldwell, Mr. Albert Francis",male,26.0,1,1,248738,29.0,,S
8,900,3,"Abrahim, Mrs. Joseph (Sophie Halaut Easu)",female,18.0,0,0,2657,7.2292,,C
9,901,3,"Davies, Mr. John Samuel",male,21.0,2,0,A/4 48871,24.15,,S


In [36]:
# fill missing ages with the training median; other columns are left untouched
test_data['Age']=test_data['Age'].fillna(median)
test_age=np.array(test_data['Age'])

test_p_class=np.array(test_data['Pclass'])

length=len(test_data['Sex'])
test_fem=np.zeros(length,dtype='int64')
test_fem[(test_data['Sex']=='female')]=1

In [37]:
m_test=len(test_age)


test_p_class=test_p_class.reshape(m_test,1)
test_age=test_age.reshape(m_test,1)
test_fem=test_fem.reshape(m_test,1)

X_predict=np.hstack((test_fem,test_age))
X_predict=np.hstack((X_predict,test_p_class))
print(X_predict.shape)


(418, 3)
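Because the training and test matrices must have identical column order and preprocessing, the feature construction could be wrapped in one function and called for both files. The sketch below shows the idea on made-up frames; `build_features`, `train_toy` and `test_toy` are hypothetical names that do not appear in the notebook.

```python
import numpy as np
import pandas as pd

def build_features(df, age_median):
    """Return an (n, 3) array with columns fem, age, pclass (hypothetical helper)."""
    fem = (df["Sex"] == "female").astype("int64").to_numpy()
    age = df["Age"].fillna(age_median).to_numpy()
    pclass = df["Pclass"].to_numpy()
    return np.column_stack((fem, age, pclass))

# Made-up stand-ins for train.csv and test.csv.
train_toy = pd.DataFrame({"Sex": ["male", "female"], "Age": [22.0, np.nan], "Pclass": [3, 1]})
test_toy = pd.DataFrame({"Sex": ["female"], "Age": [np.nan], "Pclass": [2]})

# The median comes from the training rows only and is reused for the test rows.
age_median = train_toy["Age"].median()

X_a = build_features(train_toy, age_median)
X_b = build_features(test_toy, age_median)
print(X_a.shape, X_b.shape)
```

Using one function for both files removes the risk of the two code paths drifting apart, for example by stacking the columns in a different order.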


In [38]:
# prediction with the fitted logistic regression model
Y_predict=model.predict(X_predict)
print(Y_predict.shape)

(418,)


In [39]:
# we now have the predictions and need to write them to a new CSV file
# that contains only PassengerId and Survived
answer=test_data['PassengerId']                        # this gives a Series
answer=answer.to_frame()                               # convert the Series to a DataFrame

In [40]:
s=pd.Series(Y_predict)
answer['Survived']=s
answer

Unnamed: 0,PassengerId,Survived
0,892,0
1,893,0
2,894,0
3,895,0
4,896,1
5,897,0
6,898,1
7,899,0
8,900,1
9,901,0


In [41]:
# create a new file and store the predictions in it
answer.to_csv("predict_simple.csv",index=False)
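The two-column submission frame can also be built in a single step from the ID column and the prediction array. The sketch below uses made-up IDs and predictions and writes to an in-memory buffer instead of a file; `ids`, `preds` and `submission` are hypothetical names.

```python
import io
import numpy as np
import pandas as pd

# Made-up stand-ins for test_data['PassengerId'] and the model's predictions.
ids = pd.Series([892, 893, 894], name="PassengerId")
preds = np.array([0, 1, 0])

# A dict of columns creates the frame directly; the NumPy array is
# assigned by position, so no index alignment is involved.
submission = pd.DataFrame({"PassengerId": ids, "Survived": preds})

# Write to an in-memory buffer here; a file name works the same way.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Passing `index=False` matters for the submission format, since otherwise the row index would be written as an extra unnamed first column.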

In [46]:
# we will try to predict using decision tree this time.
y_decision=decision_tree.predict(X_predict)
print(y_decision.shape)

(418,)


In [47]:
decisiontree=test_data['PassengerId'].to_frame()
decisiontree.head(10)

Unnamed: 0,PassengerId
0,892
1,893
2,894
3,895
4,896
5,897
6,898
7,899
8,900
9,901


In [48]:
decisiontree['Survived']=pd.Series(y_decision)
decisiontree.head()

Unnamed: 0,PassengerId,Survived
0,892,0
1,893,0
2,894,1
3,895,1
4,896,1


In [50]:
decisiontree.to_csv("predict_decision_tree.csv",index=False)