# Introduction

In this notebook we will try to build a suitable machine learning model that predicts a passenger's chance of surviving the sinking of the Titanic. \
On 15 April 1912 the Titanic sank on its maiden voyage after colliding with an iceberg. The ship cost $7.5 M to build. The accident took the lives of approximately 1,500 of the roughly 2,200 people on board.
 

# Workflow

1. Import Data
2. Exploratory Data Analysis (EDA)
3. Data Modeling
4. Prediction Model
5. Submit Prediction

# 1. Import Data

First we need to import a few libraries: Pandas, NumPy, Matplotlib and Seaborn.\
These libraries will help us load the data and carry out the EDA.

In [2]:
# data analysis and wrangling
import pandas as pd
import numpy as np

# visualization
import seaborn as sns
import matplotlib.pyplot as plt

In [4]:
# importing training data
data_train = pd.read_csv('../input/titanic/train.csv')

# First 5 rows of training data
data_train.head()

In [5]:
# importing test data
data_test = pd.read_csv('../input/titanic/test.csv')

# First 5 rows of test data
data_test.head()

In [8]:
print("Rows and columns of training data : " ,data_train.shape)
print("Rows and columns of testing data : " ,data_test.shape)
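The shapes only tell us how many columns each file has. As a minimal sketch, assuming two small stand-in frames (`toy_train` and `toy_test` are hypothetical, not the real data), we can also ask pandas which column names appear in one frame but not the other:

```python
import pandas as pd

# Hypothetical stand-ins for the real training and test frames.
toy_train = pd.DataFrame({"PassengerId": [1, 2], "Pclass": [3, 1], "Survived": [0, 1]})
toy_test = pd.DataFrame({"PassengerId": [3, 4], "Pclass": [2, 3]})

# Column names present in the training frame but absent from the test frame.
missing_in_test = toy_train.columns.difference(toy_test.columns).tolist()
print(missing_in_test)
```

Running the same call on `data_train` and `data_test` should name the column that the test file lacks.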

From the shape of the data, we can see that the training data has 12 columns and the testing data has 11 columns.\
The "Survived" column, which is our **target variable**, is missing from the testing data.

In [9]:
data_train.info()
print('-'*40)
data_test.info()

Both the training and the testing data have null values.
1. Training Data: Age, Cabin and Embarked
2. Testing Data: Age, Cabin and Fare
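Reading null counts off `info()` works, but they can also be listed directly. The sketch below uses a small hypothetical frame (`toy`) rather than the real data, and keeps only the columns that actually contain nulls:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a few missing entries.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# Count nulls per column, then keep only columns with at least one null.
null_counts = toy.isnull().sum()
cols_with_nulls = null_counts[null_counts > 0]
print(cols_with_nulls)
```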



# 2. EDA

In [13]:
data_train.describe(include=['O'])

In [14]:
data_train.describe()

**Distribution of Features:**
\
**Numerical Features:**
* PassengerId is a unique key (primary key / index).
* Around 38% of the passengers in the sample survived.
* Survived is a categorical feature with values 0 or 1.
* Fares varied significantly, with a few passengers (<1%) paying as much as $512.
* Few elderly passengers (<1%) fall within the 65-80 age range.
\
**Categorical Features:**
* Names are unique across the dataset.
* Sex has two possible values, with 65% male.
* Cabin values have several duplicates across samples; in other words, several passengers shared a cabin.
* Embarked takes three possible values. Port S was used by most passengers.
* Ticket also has duplicate values.
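Several of the observations above can be checked with one-line pandas calls instead of being read off the `describe()` tables. A minimal sketch on a hypothetical five-row frame (`toy`), so the numbers below are not the Titanic figures:

```python
import pandas as pd

# Hypothetical five-row sample with the same column names as the dataset.
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5],
    "Survived": [0, 1, 1, 0, 0],
    "Sex": ["male", "female", "female", "male", "male"],
})

id_is_unique = toy["PassengerId"].is_unique          # True when no ID repeats
survival_rate = toy["Survived"].mean()               # mean of a 0/1 column is the survival share
sex_share = toy["Sex"].value_counts(normalize=True)  # share of each category

print(id_is_unique, survival_rate)
print(sex_share)
```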

## Survived

In [17]:
# Create two blank plots in a single row (1,2) with figure size (18,6)
f,ax=plt.subplots(1,2,figsize=(18,6))

# Use ax = ax[0] to draw on the 1st plot
data_train['Survived'].value_counts().plot.pie(explode=[0,0.1],autopct='%1.1f%%', labels = ["Did not survive" , "Survived"], ax = ax[0])
ax[0].set_title('Percentage of Passengers Survived')
ax[0].set_ylabel(' ')

# Use ax = ax[1] to draw on the 2nd plot
sns.countplot(x = 'Survived',data=data_train,ax=ax[1])
ax[1].set_title('Number of Passengers Survived')

plt.show()

# Count of survived
data_train['Survived'].value_counts().to_frame()

## Pclass

In [18]:
pd.crosstab(data_train.Pclass , data_train.Survived).style.background_gradient(cmap='summer_r')

In [20]:
f,ax=plt.subplots(1,2,figsize=(18,8))

data_train['Pclass'].value_counts().plot.pie(autopct='%1.1f%%', ax=ax[0], labels = [3,1,2])
ax[0].set_title('Percent Of Passengers in every Pclass')
ax[0].set_ylabel('Count')

sns.countplot(x = 'Pclass',hue='Survived',data=data_train,ax=ax[1])
ax[1].set_title('Pclass:Survived vs Dead')

plt.show()


**Pclass:**
1. Most passengers are from Pclass 3.
2. In Pclass 1, more passengers survived (136) than did not (80).
3. In Pclass 2, the number who survived (87) and the number who did not (97) are almost equal.
4. In Pclass 3, about three times as many passengers died (372) as survived (119).
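Raw counts are harder to compare when the classes differ in size. As a sketch on hypothetical data (`toy`), `pd.crosstab` with `normalize="index"` turns each row into proportions, so the survival rate per class can be read directly:

```python
import pandas as pd

# Hypothetical sample: class 1 all survive, class 2 half, class 3 a quarter.
toy = pd.DataFrame({
    "Pclass":   [1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 0, 0, 0, 1],
})

# normalize="index" divides each row by its row total.
rates = pd.crosstab(toy["Pclass"], toy["Survived"], normalize="index")
print(rates)
```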

## Sex

In [21]:
f,ax=plt.subplots(1,2,figsize=(20,5))

data_train['Sex'].value_counts().plot.pie(autopct='%1.1f%%', ax=ax[0])
ax[0].set_title('No of Male vs Female Passengers')
ax[0].set_ylabel('Sex')

data_train[['Sex','Survived']].groupby('Sex').mean().plot.bar(color = ['purple'] , ax = ax[1])
ax[1].set_title('Chance of survival of passengers')
ax[1].set_ylabel('Chance')

plt.show()

**Sex**
1. There are more male passengers (64.8%) than female passengers (35.2%).
2. Female passengers have a much higher chance of survival (0.74) than male passengers (0.18).

## Age

In [22]:
sns.boxplot(x = data_train['Age'])

In [23]:
f,ax=plt.subplots(1,2,figsize=(20,5))

data_train[data_train['Survived']==0].Age.plot.hist(ax=ax[0],bins=25,edgecolor='black',color='red')
ax[0].set_title('Survived= 0')
x1=list(range(0,85,5))
ax[0].set_xticks(x1)

data_train[data_train['Survived']==1].Age.plot.hist(ax=ax[1],color='green',bins=25,edgecolor='black')
ax[1].set_title('Survived= 1')
x2=list(range(0,85,5))
ax[1].set_xticks(x2)

plt.show()

In [24]:
data_train[['Age','Survived']].groupby('Age').mean().plot.area(color = 'blue').set_title("Age vs Chance of survival")

**Age**
1. Passengers in the lowest and highest age groups have a higher chance of survival.
2. Most passengers who did not survive are in the middle age groups.

## SibSp

In [27]:
data_train['SibSp'].value_counts().to_frame()

In [28]:
data_train[['SibSp' , 'Survived']].groupby('SibSp').mean().plot.bar(color = 'green') 

**SibSp**
1. Passengers travelling with 1 or 2 siblings/spouses have a higher chance of survival than the others.
2. Passengers travelling in larger groups (SibSp of 5 or 8) did not survive.
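A survival rate of 0 means little if only a handful of passengers fall in that group, so it helps to look at the group size next to the rate. A minimal sketch on hypothetical data (`toy`), using `agg` to get both in one table:

```python
import pandas as pd

# Hypothetical sample; the SibSp=5 group contains a single passenger.
toy = pd.DataFrame({
    "SibSp":    [0, 0, 0, 1, 1, 5],
    "Survived": [0, 1, 0, 1, 1, 0],
})

# "count" is the group size, "mean" is the survival rate of the group.
summary = toy.groupby("SibSp")["Survived"].agg(["count", "mean"])
print(summary)
```

The same pattern works for any of the grouping columns in this section, such as "Parch" or "Embarked".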

## Parch

In [29]:
Parch_Sur = pd.DataFrame()
Parch_Sur['Survived Passengers'] = data_train[['Parch' , 'Survived']].groupby('Parch').sum()
Parch_Sur['Total Passengers'] = data_train[['Parch' , 'Survived']].groupby('Parch').count()
Parch_Sur['Chance of survival'] = data_train[['Parch' , 'Survived']].groupby('Parch').mean()
Parch_Sur

In [30]:
plt.bar(Parch_Sur.index , Parch_Sur['Chance of survival'] , color = 'blue')
plt.ylabel("Chance of Survival")
plt.xlabel("Parch")
plt.title("Chance of Survival by Parch")

## Fare

In [31]:
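# distplot is deprecated in recent seaborn releases; histplot with kde=True draws a similar plot
sns.histplot(data_train['Fare'], kde=True)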

In [32]:
data_train[['Fare' , 'Survived']].corr()

## Embarked

In [34]:
data_train['Embarked'].value_counts()

In [37]:
sns.countplot(x = data_train['Embarked'] , hue = data_train['Survived'] )

In [39]:
data_train[['Embarked','Survived']].groupby('Embarked').mean()

**Embarked**
1. Most passengers embarked at "S".
2. The chance of survival is highest for passengers who embarked at C (0.55).
3. Although most passengers embarked at S, their chance of survival is the lowest (0.33).

# 3. Data Modeling

Now that we have some insight into our data, we will prepare it for modeling.\
\
**Null Values** 
1. We want to fill null values for Age column.
2. We will drop the whole cabin column as it has too many null values.
3. We will drop rows with Embarked as missing value as only 2 rows are missing.

**Continuous Features**
1. We will make the Age column categorical.
2. We will make the Fare column categorical.

**Numerical Feature**
1. We will make the Sex column numerical.
2. We will make the Embarked column numerical.

**Combined Feature**
1. We will make a new column "Alone" by using "SibSp" and "Parch"


**Drop Column**
1. We will drop Name column
2. We will drop Ticket Column

In [41]:
data = data_train.copy()
data.head()

## Age

First we will fill null values in Age column.\
We can fill it with mean of Age.

In [42]:
data['Age'].describe()

In [43]:
data[['Age' , 'Sex']].groupby('Sex').mean()

The mean age of male passengers is 30.7 and the mean age of female passengers is 27.9.\
So instead of filling null values with the overall mean of Age,\
filling them with the mean Age for each gender is more suitable.

But we have another column that we can use: Name.\
Observe that "Name" also contains a title such as Mr., Mrs. or Miss. (stored below in a column called "Initial").
If we find the mean age for each title, we can use it to fill the null Age values, which is more appropriate.
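To see what such an extraction does, here is a minimal sketch on four names written in the dataset's "Surname, Title. Given names" format. The regular expression captures the first run of letters that is immediately followed by a period:

```python
import pandas as pd

# A few sample names in the "Surname, Title. Given names" format.
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Heikkinen, Miss. Laina",
    "Palsson, Master. Gosta Leonard",
])

# expand=False returns a Series rather than a one-column DataFrame.
titles = names.str.extract(r"([A-Za-z]+)\.", expand=False)
print(titles.tolist())
```

The pattern relies on the title being the first word that ends in a period, which holds for this name format but is an assumption about the data.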

In [48]:
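# Extract the title: the first run of letters followed by a period.
# A single vectorized call is enough, so no loop is needed.
data['Initial'] = data.Name.str.extract(r'([A-Za-z]+)\.', expand=False)
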
data.info()

We now have an initial for every passenger. Next, let's find the mean age for each initial.

In [49]:
data[['Initial','Age']].groupby('Initial').mean()

In [53]:
pd.crosstab(data.Initial,data.Sex).T.style.background_gradient(cmap='summer_r')

Observe that there are some variant spellings such as Mlle or Mme, and that we do not need this many initials.\
We will keep Mr, Mrs, Master and Miss, and fold the rest into these four or into Other.

In [54]:
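# Assign the result back; inplace=True on a selected column may not update the DataFrame in recent pandas
data['Initial'] = data['Initial'].replace(['Mlle','Mme','Ms','Dr','Major','Lady','Countess','Jonkheer','Col','Rev','Capt','Sir','Don'],['Miss','Miss','Miss','Mr','Mr','Mrs','Mrs','Other','Other','Other','Mr','Mr','Mr'])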

In [55]:
data.groupby('Initial')['Age'].mean()

Now we can use this to fill null Values.

In [56]:
data.loc[(data.Age.isnull())&(data.Initial=='Mr'),'Age']=33
data.loc[(data.Age.isnull())&(data.Initial=='Mrs'),'Age']=36
data.loc[(data.Age.isnull())&(data.Initial=='Master'),'Age']=5
data.loc[(data.Age.isnull())&(data.Initial=='Miss'),'Age']=22
data.loc[(data.Age.isnull())&(data.Initial=='Other'),'Age']=46
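The constants above are typed in by hand from the group means. As an alternative sketch (not the method used in this notebook), `groupby(...).transform("mean")` computes the per-group mean for every row, and `fillna` then uses it only where Age is missing. The frame `toy` is hypothetical, and the filled values are the exact group means rather than rounded ones:

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing age per title.
toy = pd.DataFrame({
    "Initial": ["Mr", "Mr", "Mr", "Miss", "Miss", "Miss"],
    "Age": [30.0, 40.0, np.nan, 20.0, np.nan, 24.0],
})

# Mean age of each title, aligned row by row with the original frame.
group_mean = toy.groupby("Initial")["Age"].transform("mean")

# Only the missing entries are replaced; known ages stay untouched.
toy["Age"] = toy["Age"].fillna(group_mean)
print(toy)
```

This avoids retyping the numbers if the grouping changes, at the cost of filling with unrounded means.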

In [57]:
data.Age.isnull().any()

Now we have no null values in the Age column.\
Next we can make a new column for the age band.\
This is called binning: converting a continuous feature into a categorical one.

In [58]:
data['Age_band']=0
data.loc[data['Age']<=16,'Age_band']=0
data.loc[(data['Age']>16)&(data['Age']<=32),'Age_band']=1
data.loc[(data['Age']>32)&(data['Age']<=48),'Age_band']=2
data.loc[(data['Age']>48)&(data['Age']<=64),'Age_band']=3
data.loc[data['Age']>64,'Age_band']=4
data.head()
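The same banding can be sketched with `pd.cut`, which takes the bin edges once instead of one `loc` assignment per band. The edges below mirror the thresholds used above (16, 32, 48, 64); the `ages` series is a hypothetical sample chosen to hit the boundaries:

```python
import numpy as np
import pandas as pd

# Hypothetical ages, including values that sit exactly on a bin edge.
ages = pd.Series([0.42, 16, 16.5, 32, 33, 48, 64, 80])

# Intervals are closed on the right by default: (-inf, 16], (16, 32], ...
# labels=False returns the integer position of each bin instead of an interval.
age_band = pd.cut(ages, bins=[-np.inf, 16, 32, 48, 64, np.inf], labels=False)
print(age_band.tolist())
```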

## Alone

"SibSp" and "Parch" both describe how many relatives a passenger was travelling with, so we will combine them.\
We will make a new column "Family_Size" that is the sum of "SibSp" and "Parch",\
and another column "Alone" that shows whether the passenger was travelling alone.

In [60]:
data['Family_Size']=0
data['Family_Size']=data['Parch']+data['SibSp']
data['Alone']=0
data.loc[data.Family_Size==0,'Alone']=1

In [64]:
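# factorplot was renamed to catplot in seaborn; kind='point' keeps the old default plot type
sns.catplot(x = 'Alone', y = 'Survived', data=data, kind='point')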

In [65]:
sns.catplot(x='Family_Size', y='Survived', data=data, kind='point')

## Fare

For Fare we will use "qcut" from pandas to convert the fare from continuous to ordinal.

In [66]:
data['Fare_Range']=pd.qcut(data['Fare'],5)
data.groupby(['Fare_Range'])['Survived'].mean().to_frame().style.background_gradient(cmap='summer_r')

The survival rate still differs clearly between the fare ranges, so Fare keeps its predictive value after being split into ordinal bins. Hence we can go ahead with the binning.

In [67]:
data['Fare_cat']=0
data.loc[data['Fare']<=7.85,'Fare_cat']=0
data.loc[(data['Fare']>7.85)&(data['Fare']<=10.5),'Fare_cat']=1
data.loc[(data['Fare']>10.5)&(data['Fare']<=21.68),'Fare_cat']=2
data.loc[(data['Fare']>21.68)&(data['Fare']<=39.69),'Fare_cat']=3
data.loc[(data['Fare']>39.69)&(data['Fare']<=512.33),'Fare_cat']=4
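Typing the quantile edges by hand works, but `pd.qcut` can return both the integer codes and the edges it computed. The sketch below, on hypothetical fares, also shows how those edges could be reused on a second series such as the test data, so both sets are binned with the same thresholds:

```python
import pandas as pd

# Hypothetical fares; ten evenly spaced values give two per quintile.
train_fare = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0])
test_fare = pd.Series([1.5, 5.0, 9.5])

# labels=False returns bin positions; retbins=True also returns the edges.
train_cat, edges = pd.qcut(train_fare, 5, labels=False, retbins=True)

# Reuse the training edges; include_lowest keeps a value equal to the first edge.
test_cat = pd.cut(test_fare, bins=edges, labels=False, include_lowest=True)

print(train_cat.tolist())
print(test_cat.tolist())
```

One caveat of this design: a test value outside the training range falls in no bin and becomes NaN, so it would need separate handling.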

In [69]:
sns.catplot(x='Fare_cat', y='Survived', data=data, hue='Sex', kind='point')
plt.show()

As the fare category increases, the survival rate increases for both male and female passengers.

## String to Numeric

In [70]:
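# Assign the result back; inplace=True on a selected column may not update the DataFrame in recent pandas
data['Sex'] = data['Sex'].replace(['male','female'],[0,1])
data['Embarked'] = data['Embarked'].replace(['S','C','Q'],[0,1,2])
data['Initial'] = data['Initial'].replace(['Mr','Mrs','Miss','Master','Other'],[0,1,2,3,4])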
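Coding Embarked as 0, 1, 2 gives the ports an order and a distance that they do not really have, which models such as logistic regression and KNN will take at face value. A common alternative, shown here only as a sketch on a hypothetical frame and not used in the rest of this notebook, is one-hot encoding with `pd.get_dummies`:

```python
import pandas as pd

# Hypothetical sample of embarkation ports.
toy = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# One 0/1 column per port; dtype=int gives integers instead of booleans.
dummies = pd.get_dummies(toy["Embarked"], prefix="Embarked", dtype=int)
print(dummies)
```

Tree-based models are usually less sensitive to this choice, so the integer coding may be adequate for them.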

In [71]:
data.head()

In [74]:
data.info()

## Drop 

**First we will drop "Cabin"**

In [77]:
data = data.drop(['Cabin'], axis = 1)

**Now we will drop rows with Embarked as null value**

In [78]:
data = data.dropna(subset = ['Embarked'])
data.info()

## Data for modeling

Now we will keep only the columns that we will use for modeling.

In [81]:
data.head()

In [84]:
data.drop(['Name','Age','Ticket','Fare','Fare_Range','PassengerId'],axis=1,inplace=True)
data.head()

As "Family_Size" and "Alone" are derived from "Parch" and "SibSp", we can drop the latter two.\
"Initial" and "Age_band" carry overlapping information, so we can also drop one of them.

In [85]:
data.drop(['Parch','SibSp','Initial'],axis=1,inplace=True)
data.head()

# 4. Prediction Model

Import all models, metrics and other modules from scikit-learn.

In [86]:
#importing all the required ML packages
from sklearn.linear_model import LogisticRegression #logistic regression
from sklearn import svm #support vector Machine
from sklearn.ensemble import RandomForestClassifier #Random Forest
from sklearn.neighbors import KNeighborsClassifier #KNN
from sklearn.tree import DecisionTreeClassifier #Decision Tree
from sklearn.model_selection import train_test_split #training and testing data split
from sklearn import metrics #accuracy measure
from sklearn.metrics import confusion_matrix #for confusion matrix

In [87]:
X = data[['Pclass' , 'Sex' , 'Embarked' , 'Age_band' , 'Family_Size' , 'Alone' , 'Fare_cat']]

In [88]:
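# Use a 1-D Series as the target; a one-column DataFrame makes scikit-learn warn about a column vector
Y = data['Survived']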

In [89]:
X_train, X_test, Y_train, Y_test = train_test_split(X,Y,train_size = 0.7,random_state = 0)
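Because survivors are the minority class, a plain random split can leave the two parts with slightly different survival shares. As a sketch on synthetic arrays (`X_toy` and `y_toy` are hypothetical), passing `stratify` to `train_test_split` keeps the class proportions the same in both parts:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, 60 of class 0 and 40 of class 1.
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0] * 60 + [1] * 40)

# stratify=y_toy preserves the 60/40 class ratio in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, train_size=0.7, random_state=0, stratify=y_toy
)

print(len(y_tr), y_tr.mean())
print(len(y_te), y_te.mean())
```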

## Logistic Regression

In [97]:
model_LR = LogisticRegression()
model_LR.fit(X_train,Y_train)
Log_R_pred = model_LR.predict(X_test)
as_LR = metrics.accuracy_score(Log_R_pred ,Y_test)
print('The accuracy of the Logistic Regression is',as_LR)

## Decision Tree

In [102]:
model_DT = DecisionTreeClassifier()
model_DT.fit(X_train,Y_train)
DT_pred = model_DT.predict(X_test)
as_DT = metrics.accuracy_score(DT_pred ,Y_test)
print('The accuracy of the Decision Tree is',as_DT)

## K-Nearest Neighbour (KNN)

In [99]:
model_KNN = KNeighborsClassifier()
model_KNN.fit(X_train,Y_train)
KNN_pred = model_KNN.predict(X_test)
as_KNN = metrics.accuracy_score(KNN_pred ,Y_test)
print('The accuracy of the KNN is',as_KNN)

## Random Forest

In [100]:
model_RF = RandomForestClassifier(n_estimators=100)
model_RF.fit(X_train,Y_train)
RF_pred = model_RF.predict(X_test)
as_RF = metrics.accuracy_score(RF_pred ,Y_test)
print('The accuracy of the RF is', as_RF)

**We compare all these models below:**

In [104]:

model_score = { 'Logistic Regression' : as_LR , 'Random Forest' : as_RF , 'KNN' : as_KNN ,'Decision Tree' : as_DT }
model_score


In [108]:
models = pd.DataFrame(list(model_score.items()) , columns = ['Model' , 'Score'])
models.sort_values(by='Score', ascending=False)

# Result

Random Forest gives the highest accuracy score (0.78) on the held-out split. \
Hence we will use Random Forest as our prediction model.
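An accuracy taken from a single 70/30 split depends on which rows land in the test part, and the Random Forest above is not seeded, so the ranking can shift between runs. A hedged sketch of a steadier comparison uses 5-fold cross-validation. The data here is synthetic (`make_classification`), so the scores say nothing about the Titanic data, and the `max_iter` and `random_state` values are illustrative choices:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared feature matrix and target.
X_syn, y_syn = make_classification(n_samples=300, n_features=7, random_state=0)

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "KNN": KNeighborsClassifier(),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Mean accuracy over 5 folds for each candidate model.
cv_scores = {
    name: cross_val_score(model, X_syn, y_syn, cv=5, scoring="accuracy").mean()
    for name, model in candidates.items()
}

for name, score in sorted(cv_scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

Replacing `X_syn` and `y_syn` with the notebook's `X` and `Y` would give cross-validated scores for the real features.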