# Titanic Survival Prediction

<img src="flow_chart.png" height=200px width=800px></img>

## Framing the problem

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 of the 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper class.

In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.

## Obtain Data

#### Importing the basic required libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as ms
%matplotlib inline

### Reading the data from CSV file

In [None]:
data = pd.read_csv('titanic.csv')

## Analyze Data

#### Obtaining a glimpse of data

In [None]:
data.head(3)

In [None]:
data.tail(3)

In [None]:
type(data)

In [None]:
data.shape

In [None]:
data.info()

In [None]:
data.describe()

#### Visualization of the Data

In [None]:
ms.matrix(data)

In [None]:
data.info()

In [None]:
sns.jointplot(x='Fare',y='Age',data=data)

In [None]:
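# distplot is deprecated in recent seaborn releases; histplot with a KDE overlay is the replacement.
sns.histplot(data['Fare'], kde=True)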

In [None]:
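# Restrict the correlation to numeric columns; pandas 2.x raises an error on text columns otherwise.
data.corr(numeric_only=True)

In [None]:
sns.heatmap(data.corr(numeric_only=True),cmap='coolwarm',xticklabels=True)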
plt.title('data.corr()')

In [None]:
sns.swarmplot(x='Pclass',y='Age',data=data,palette='Set2')

In [None]:
sns.set_style('whitegrid')
sns.countplot(x='Survived',data=data,palette='RdBu_r')

In [None]:
sns.set_style('whitegrid')
sns.countplot(x='Survived',hue='Sex',data=data,palette='RdBu_r')

In [None]:
sns.set_style('whitegrid')
sns.countplot(x='Survived',hue='Pclass',data = data,palette='rainbow')

In [None]:
data['Age'].hist(bins = 40, color = 'darkred', alpha = 0.8)

In [None]:
sns.countplot(x = 'SibSp', data = data)

In [None]:
data['Fare'].hist(color = 'green', bins = 40, figsize = (8,3))

## Cleaning of data

#### Fill the missing values in the obtained data

In [None]:
plt.figure(figsize=(12, 7))
sns.boxplot(x='Pclass',y='Age',data=data,palette='winter')


From the box plot, the median age for each class is estimated as follows:
  
  * For **Class 1** - The median age is 37
  * For **Class 2** - The median age is 29
  * For **Class 3** - The median age is 24
  
Let's impute these values into the age column.



In [None]:
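def impute_age(cols):
    # Look the values up by column label; positional access such as cols[0]
    # is deprecated for label-indexed Series in recent pandas versions.
    Age = cols['Age']
    Pclass = cols['Pclass']

    if pd.isnull(Age):
        # Class-1
        if Pclass == 1:
            return 37
        # Class-2
        elif Pclass == 2:
            return 29
        # Class-3
        else:
            return 24

    else:
        return Age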



Applying the function.

In [None]:
data['Age'] = data[['Age','Pclass']].apply(impute_age,axis=1)
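
The class medians do not have to be hard-coded. As a minimal sketch, assuming a small made-up frame (`toy`, its values, and the name `class_median` are introduced here for illustration and are not part of the Titanic data), the same per-class imputation could be written with `groupby` and `fillna`:

```python
import pandas as pd

# Hypothetical toy data standing in for the Titanic frame (not the real values).
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 2, 2, 3, 3, 3],
    "Age":    [40.0, None, 34.0, 30.0, None, 22.0, 26.0, None],
})

# Median age per class; missing values are skipped by default.
class_median = toy.groupby("Pclass")["Age"].median()

# Fill each missing age with the median of that passenger's own class.
toy["Age"] = toy["Age"].fillna(toy["Pclass"].map(class_median))

print(class_median)
print(toy)
```

Computing the medians from the data keeps the imputation in step with the dataset if the rows change, at the cost of being less explicit than the fixed values used above.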

Now let's visualize the missing values.

In [1]:
ms.matrix(data)


The Age column is imputed successfully.

Let's drop the Cabin column, and then drop the rows where Embarked is NaN.

In [None]:
data.drop('Cabin', axis = 1,inplace=True)

In [None]:
data.head()

In [None]:
data.dropna(inplace = True)

In [None]:
ms.matrix(data)

In [None]:
data.info()

### Categorical value conversion

In [None]:
data.info()

In [None]:
data['Sex'].unique()

In [None]:
data['Sex'].value_counts()

In [None]:
sex_df = pd.get_dummies(data['Sex'],drop_first=True)
sex_df.head()

In [None]:
data['Embarked'].unique()

In [None]:
data['Embarked'].value_counts()

In [None]:
embark_df = pd.get_dummies(data['Embarked'],drop_first=True)
embark_df.head()

In [None]:
old_data = data.copy()
data.drop(['Sex','Embarked','Name','Ticket'],axis=1,inplace=True)
data.head()

In [None]:
old_data.info()

In [None]:
data = pd.concat([data,sex_df,embark_df],axis=1)

In [None]:
data.head()

In [None]:
data.info()

In [None]:
data.describe()

## Model Selection

### Train-Test Split

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(data.drop('Survived',axis=1), 
                                                    data['Survived'], test_size=0.30, 
                                                    random_state=101)

### Logistic Regression

#### Training the model

In [None]:
from sklearn.linear_model import LogisticRegression

# Build the Model.
logmodel = LogisticRegression()
logmodel.fit(X_train,y_train)

### Predicting on the test set

In [None]:
# The evaluation cells below refer to these predictions as `predict`.
predict = logmodel.predict(X_test)

## Evaluate the predictions

In [None]:
from sklearn.metrics import confusion_matrix, classification_report

### Confusion Matrix

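For 0/1 labels, `confusion_matrix` places the actual classes on the rows and the predicted classes on the columns:

|              | Predicted 0    | Predicted 1    |
|--------------|----------------|----------------|
| **Actual 0** | True negative  | False positive |
| **Actual 1** | False negative | True positive  |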

In [None]:
print(confusion_matrix(y_test, predict))
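
To illustrate how the four cells are counted, the sketch below tallies them by hand for a small made-up pair of label lists (`y_true` and `y_pred` are hypothetical, not the Titanic test set). The counts are arranged in the order scikit-learn uses for 0/1 labels, with actual classes on rows and predicted classes on columns: `[[TN, FP], [FN, TP]]`.

```python
# Made-up labels for illustration only: 1 = survived, 0 = did not survive.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

pairs = list(zip(y_true, y_pred))
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # correctly predicted 0
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # predicted 1, actually 0
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # predicted 0, actually 1
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # correctly predicted 1

# Rows are actual classes, columns are predicted classes.
matrix = [[tn, fp],
          [fn, tp]]
print(matrix)
```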

#### Precision Score

The precision is the ratio tp / (tp + fp) where tp is the number of true positives and fp the number of false positives. The precision is intuitively the ability of the classifier not to label as positive a sample that is negative.

The best value is 1 and the worst value is 0.



In [None]:
from sklearn.metrics import precision_score

print(precision_score(y_test,predict))
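
As a quick worked example of the ratio, here is a hand-rolled version using hypothetical counts (the helper name `precision` and the numbers are assumptions for illustration). Returning 0.0 when no sample was predicted positive is a choice made for this sketch.

```python
def precision(tp, fp):
    """Precision = tp / (tp + fp); 0.0 if nothing was predicted positive."""
    predicted_positive = tp + fp
    return tp / predicted_positive if predicted_positive else 0.0

# Hypothetical counts: 30 correct "survived" predictions and 10 incorrect ones.
print(precision(tp=30, fp=10))
```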

#### Recall score

The recall is the ratio tp / (tp + fn) where tp is the number of true positives and fn the number of false negatives. The recall is intuitively the ability of the classifier to find all the positive samples.

The best value is 1 and the worst value is 0.



In [None]:
from sklearn.metrics import recall_score

print(recall_score(y_test,predict))
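
The same ratio can be checked by hand on a few made-up labels (the helper `recall` and both lists are hypothetical). The sketch also hints at why recall should not be read on its own: a classifier that predicts 1 for everyone reaches a recall of 1.

```python
def recall(y_true, y_pred):
    """Recall = tp / (tp + fn), counted over the actual positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn) if (tp + fn) else 0.0

# Made-up labels: four actual positives, three of which are found.
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1]
print(recall(y_true, y_pred))

# Predicting 1 for every sample finds all positives.
print(recall(y_true, [1] * len(y_true)))
```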



#### f1_score

The F1 score is the harmonic mean of precision and recall; it reaches its best value at 1 and its worst at 0. Precision and recall contribute equally to the F1 score. The formula for the F1 score is:
                F1 = 2 \* (precision \* recall) / (precision + recall)

In [None]:
from sklearn.metrics import f1_score

print(f1_score(y_test,predict))
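
Because the formula is a harmonic mean, the F1 score is pulled toward the smaller of the two values. A minimal sketch with hypothetical precision and recall values (the helper name `f1` is an assumption for illustration):

```python
def f1(precision, recall):
    """F1 = 2 * (precision * recall) / (precision + recall)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

balanced = f1(0.8, 0.8)   # equal precision and recall
lopsided = f1(1.0, 0.5)   # the arithmetic mean of these two would be 0.75

print(round(balanced, 3), round(lopsided, 3))
```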

To get all of the above metrics in one go, use the following function:

In [None]:
from sklearn.metrics import classification_report

print(classification_report(y_test,predict))

## Predicting on Validation set

In [None]:
prod_data=pd.read_csv('production.csv')

In [None]:
prod_data.info()

In [None]:
prod_data.head()

In [None]:
ms.matrix(prod_data)

### Data Cleaning

In [None]:
prod_data['Age'] = prod_data[['Age','Pclass']].apply(impute_age,axis=1)

ms.matrix(prod_data)

prod_data.drop('Cabin', axis = 1, inplace= True)

ms.matrix(prod_data)

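# Fill only the missing fares; a frame-wide fillna would also write the mean fare into other columns.
prod_data['Fare'] = prod_data['Fare'].fillna(prod_data['Fare'].mean())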

prod_data.info()

ms.matrix(prod_data)

sex = pd.get_dummies(prod_data['Sex'], drop_first=True)
embark = pd.get_dummies(prod_data['Embarked'], drop_first=True)

prod_data.drop(['Sex','Embarked','Name','Ticket'],axis=1,inplace=True)

prod_data = pd.concat([prod_data,sex,embark],axis=1)

prod_data.head()

### Predicting on New Dataset

In [None]:
predict1=logmodel.predict(prod_data)

predict1

df1=pd.DataFrame(predict1,columns=['Survived'])

df2=pd.DataFrame(prod_data['PassengerId'],columns=['PassengerId'])

df2.head()

result = pd.concat([df2,df1],axis=1)
result.head()

### Writing to CSV File

In [None]:
result.to_csv('result.csv',index=False)
