# Titanic Survival Problem
### Titanic: Machine Learning from Disaster 

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history.  On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper class.

In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.

1) To start, we import the pandas and NumPy libraries.

In [4]:
import pandas as pd
import numpy as np

2) Next, we load the dataset from a CSV file into a DataFrame.

In [5]:
data=pd.read_csv('train.csv')

3) To get a glimpse of the data, we call the head() function on the DataFrame.

In [6]:
data.head()


|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


4) Next we declare two lists, target_data and train_data, which name the output column and the input columns. Here Survived is the only target attribute. The remaining columns are candidate inputs; we use Pclass, Sex, Age, and Fare.

In [7]:
target_data=['Survived']

train_data=['Pclass','Sex','Age','Fare']

5) Then we build the input and output data: X holds the input features and Y holds the output.

In [8]:
X=data[train_data]
Y=data[target_data]
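
Because `target_data` is a list, `data[target_data]` returns a one-column DataFrame rather than a Series, which is why the later `fit` call flattens it with `.values.ravel()`. A small sketch of the difference follows; the `toy` frame and its values are invented for illustration and are not rows from train.csv.

```python
import pandas as pd

# Toy frame invented for illustration; not rows from train.csv.
toy = pd.DataFrame({"Survived": [0, 1, 1], "Pclass": [3, 1, 3]})

as_frame = toy[["Survived"]]     # list of column names -> DataFrame, shape (3, 1)
as_series = toy["Survived"]      # single column name   -> Series, shape (3,)
flat = as_frame.values.ravel()   # 2-D column flattened to a 1-D NumPy array

print(as_frame.shape, as_series.shape, flat.shape)
```

Selecting with a plain string instead of a list would give a 1-D target directly and make the flattening step unnecessary.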

In [9]:
X

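|   | Pclass | Sex | Age | Fare |
|---|---|---|---|---|
| 0 | 3 | male | 22.0 | 7.2500 |
| 1 | 1 | female | 38.0 | 71.2833 |
| 2 | 3 | female | 26.0 | 7.9250 |
| 3 | 1 | female | 35.0 | 53.1000 |
| 4 | 3 | male | 35.0 | 8.0500 |
| 5 | 3 | male | NaN | 8.4583 |
| 6 | 1 | male | 54.0 | 51.8625 |
| 7 | 3 | male | 2.0 | 21.0750 |
| 8 | 3 | female | 27.0 | 11.1333 |
| 9 | 2 | female | 14.0 | 30.0708 |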


5) Now we need to check which columns contain NaN values. These must be replaced with the column's median, mode, or mean. We prefer the median because it is less sensitive to outliers than the mean.

In [10]:
X['Pclass'].isnull().sum()

0

In [11]:
X['Age'].isnull().sum()

177
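
Rather than checking one column at a time, the same count can be obtained for every column in a single call. The sketch below uses a small invented frame (the `toy` values are assumptions, not taken from train.csv) to show the idea.

```python
import numpy as np
import pandas as pd

# Toy frame invented for illustration; NaN marks a missing value.
toy = pd.DataFrame({
    "Pclass": [3, 1, 3, 2],
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Fare": [7.25, 71.2833, 7.925, 13.0],
})

# isnull() gives a boolean frame; sum() counts the True entries per column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)
```

Applied to the notebook's own `X`, `X.isnull().sum()` would report all four input columns at once.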

6) Now we replace the NaN values with the median of the Age column.

In [13]:

pd.options.mode.chained_assignment = None
X['Age']=X['Age'].fillna(X['Age'].median())
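
Assigning into `X` can trigger pandas' `SettingWithCopyWarning` because `X` was sliced from `data`, which is presumably why the warning is switched off first. One alternative, sketched here on invented values, is to take an explicit copy before filling, so the warning never needs to be silenced. The names `toy`, `features`, and `age_median` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame invented for illustration; not rows from train.csv.
toy = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan, 35.0],
    "Fare": [7.25, 71.28, 8.46, 53.1],
})

features = toy[["Age", "Fare"]].copy()   # explicit copy: safe to modify
age_median = features["Age"].median()    # median() skips NaN by default
features["Age"] = features["Age"].fillna(age_median)

print(age_median)
print(features["Age"].tolist())
```

The original `toy` frame keeps its missing value, since only the copy is modified.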

7) Verifying the values 

In [25]:
X['Age']

0      22.0
1      38.0
2      26.0
3      35.0
4      35.0
5      28.0
6      54.0
7       2.0
8      27.0
9      14.0
10      4.0
11     58.0
12     20.0
13     39.0
14     14.0
15     55.0
16      2.0
17     28.0
18     31.0
19     28.0
20     35.0
21     34.0
22     15.0
23     28.0
24      8.0
25     38.0
26     28.0
27     19.0
28     28.0
29     28.0
       ... 
861    21.0
862    48.0
863    28.0
864    24.0
865    42.0
866    27.0
867    31.0
868    28.0
869     4.0
870    26.0
871    47.0
872    33.0
873    47.0
874    28.0
875    15.0
876    20.0
877    19.0
878    28.0
879    56.0
880    25.0
881    33.0
882    22.0
883    28.0
884    25.0
885    39.0
886    27.0
887    19.0
888    28.0
889    26.0
890    32.0
Name: Age, Length: 891, dtype: float64

In [26]:
X['Age'].isnull().sum()

0

8) Now we declare a dictionary with 0 value for male and 1 value for female.

In [14]:
d_sex={'male':0,'female':1}

In [15]:
d_sex

{'male': 0, 'female': 1}

9) Now we need to convert the categorical data into numeric data. Here that means mapping the Sex attribute to the integer codes defined in the dictionary.

In [37]:
X['Sex']=X['Sex'].apply(lambda x:d_sex[x])
X['Sex'].head()

0    0
1    1
2    1
3    1
4    0
Name: Sex, dtype: int64
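
The lambda above raises a `KeyError` if the column ever contains a label that is not in the dictionary. A possible alternative is `Series.map`, which accepts the dictionary directly and yields NaN for unknown labels. The sketch below uses an invented series, and the `"unknown"` label is made up purely to show that edge case.

```python
import pandas as pd

d_sex = {"male": 0, "female": 1}

# Invented values; "unknown" is not a label in the Titanic data.
sex = pd.Series(["male", "female", "female", "unknown"])

encoded = sex.map(d_sex)   # labels missing from the dict become NaN
print(encoded.tolist())
```

Because NaN forces a float dtype, checking `encoded.isnull().sum()` afterwards is a quick way to spot unmapped labels.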

In [38]:
X.head()

|   | Pclass | Sex | Age | Fare |
|---|---|---|---|---|
| 0 | 3 | 0 | 22.0 | 7.25 |
| 1 | 1 | 1 | 38.0 | 71.2833 |
| 2 | 3 | 1 | 26.0 | 7.925 |
| 3 | 1 | 1 | 35.0 | 53.1 |
| 4 | 3 | 0 | 35.0 | 8.05 |


10) Now we split the data into training and test sets using the train_test_split() function. We set the test size to 33%.

In [39]:
from sklearn.model_selection import train_test_split

In [40]:
X_train,X_test,Y_train,Y_test=train_test_split(X,Y,test_size=0.33,random_state=42)
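
To see how many rows end up in each set, the arithmetic can be worked through by hand. The sketch assumes that a fractional `test_size` is rounded up to a whole number of rows, which matches my understanding of scikit-learn's behaviour but is stated here as an assumption. The row count of 891 comes from the Age output shown earlier.

```python
import math

n_samples = 891      # rows in the data, per the Age output (Length: 891)
test_size = 0.33

# Assumed rounding rule: the test count is rounded up, the rest is training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```

Fixing `random_state=42` makes the shuffle reproducible, so the same rows land in each set on every run.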

In [41]:
X_train

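|   | Pclass | Sex | Age | Fare |
|---|---|---|---|---|
| 6 | 1 | 0 | 54.0 | 51.8625 |
| 718 | 3 | 0 | 28.0 | 15.5000 |
| 685 | 2 | 0 | 25.0 | 41.5792 |
| 73 | 3 | 0 | 26.0 | 14.4542 |
| 882 | 3 | 1 | 22.0 | 10.5167 |
| 328 | 3 | 1 | 31.0 | 20.5250 |
| 453 | 1 | 0 | 49.0 | 89.1042 |
| 145 | 2 | 0 | 19.0 | 36.7500 |
| 234 | 2 | 0 | 24.0 | 10.5000 |
| 220 | 3 | 0 | 16.0 | 8.0500 |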


In [42]:
X_test

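|   | Pclass | Sex | Age | Fare |
|---|---|---|---|---|
| 709 | 3 | 0 | 28.0 | 15.2458 |
| 439 | 2 | 0 | 31.0 | 10.5000 |
| 840 | 3 | 0 | 20.0 | 7.9250 |
| 720 | 2 | 1 | 6.0 | 33.0000 |
| 39 | 3 | 1 | 14.0 | 11.2417 |
| 290 | 1 | 1 | 26.0 | 78.8500 |
| 300 | 3 | 1 | 28.0 | 7.7500 |
| 333 | 3 | 0 | 16.0 | 18.0000 |
| 208 | 3 | 1 | 16.0 | 7.7500 |
| 136 | 1 | 1 | 19.0 | 26.2833 |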


11) Now we use the Gaussian Naive Bayes classifier from the sklearn library.

In [45]:
from sklearn.naive_bayes import GaussianNB

12) Now we declare a classifier and initialize it with GaussianNB() 

In [47]:
clf=GaussianNB()

13) We now fit the model on the training data set.

In [51]:
clf.fit(X_train,Y_train.values.ravel())

GaussianNB(priors=None, var_smoothing=1e-09)
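
To give a feel for what fitting does here, the following is a minimal NumPy sketch of the idea behind Gaussian naive Bayes: estimate a mean and variance per class and per feature, then pick the class with the highest log-probability. It is an illustration under simplifying assumptions, not scikit-learn's actual implementation. The function names and the toy data are hypothetical.

```python
import numpy as np

def fit_gaussian_nb(X, y, var_smoothing=1e-9):
    """Estimate per-class feature means, variances, and class priors."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    # Small variance floor for numerical stability (assumed smoothing rule).
    eps = var_smoothing * X.var(axis=0).max()
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    variances = np.array([X[y == c].var(axis=0) for c in classes]) + eps
    priors = np.array([(y == c).mean() for c in classes])
    return classes, means, variances, priors

def predict_gaussian_nb(model, X):
    """Return the class with the highest log prior plus log likelihood."""
    classes, means, variances, priors = model
    X = np.asarray(X, dtype=float)
    # Sum of per-feature Gaussian log densities, shape (n_samples, n_classes).
    log_like = -0.5 * (
        np.log(2.0 * np.pi * variances)[None, :, :]
        + (X[:, None, :] - means[None, :, :]) ** 2 / variances[None, :, :]
    ).sum(axis=2)
    scores = np.log(priors)[None, :] + log_like
    return classes[np.argmax(scores, axis=1)]

# Toy data invented for illustration: two well-separated groups.
X_toy = [[1.0, 20.0], [1.2, 22.0], [3.0, 40.0], [3.2, 42.0]]
y_toy = [0, 0, 1, 1]

model = fit_gaussian_nb(X_toy, y_toy)
pred = predict_gaussian_nb(model, [[1.1, 21.0], [3.1, 41.0]])
print(pred)
```

The "naive" part is the sum over features: each feature is treated as independent given the class, so the per-feature log densities simply add up.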

In [53]:
print (clf)


GaussianNB(priors=None, var_smoothing=1e-09)


14) Now let's see what the model predicts for the first ten rows of the test set

In [55]:
print (clf.predict(X_test[0:10]))

[0 0 0 1 1 1 1 0 1 1]


15) Now we find the accuracy of the model using score() function. 

In [56]:
print (clf.score(X_test,Y_test))

0.7864406779661017
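
For a classifier, `score()` reports mean accuracy: the fraction of test rows whose predicted label equals the true label. The sketch below computes the same quantity by hand on invented labels (the `accuracy` helper and both toy lists are hypothetical), then converts the score printed above into a percentage.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction matches the true label."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float((y_true == y_pred).mean())

# Toy labels invented for illustration: 3 of 5 predictions are correct.
y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]
toy_accuracy = accuracy(y_true, y_pred)
print(toy_accuracy)

# The score reported above, expressed as a percentage with two decimals.
nb_percent = round(0.7864406779661017 * 100, 2)
print(nb_percent)
```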


16) Now we use a linear Support Vector Machine classifier (LinearSVC) from the sklearn library

In [58]:
from sklearn import svm

In [60]:
clf2=svm.LinearSVC()

In [62]:
print (clf2)

LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0)


In [63]:
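# Flatten the one-column DataFrame to a 1-D array, as in the GaussianNB fit,
# so that sklearn does not warn about a column-vector y.
clf2.fit(X_train,Y_train.values.ravel())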


LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0)

In [64]:
clf2.score(X_test,Y_test)

0.7389830508474576
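
When comparing classifiers, it helps to fit and score them in one loop on the same split so the numbers are directly comparable. The sketch below does this on synthetic stand-in data, which is invented here and unrelated to the Titanic file, so its scores say nothing about the results above. The names `X_toy`, `y_toy`, `models`, and `scores` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import LinearSVC

# Synthetic stand-in data (invented): 200 rows, 4 features,
# with the label driven mostly by the first feature.
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(200, 4))
y_toy = (X_toy[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.33, random_state=42
)

models = {
    "GaussianNB": GaussianNB(),
    "LinearSVC": LinearSVC(random_state=42),
}

scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = model.score(X_te, y_te)   # mean accuracy on held-out rows

print(scores)
```

Passing `random_state` to `LinearSVC` is a choice made for this sketch so that repeated runs give the same score; the notebook's `clf2` leaves it at `None`.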

## Thus, using Naive Bayes we get an accuracy of 78.64%, whereas Linear SVC reaches only about 73.90%.