# Titanic Survivors Prediction

This project predicts whether a passenger would survive the sinking of the Titanic.

This project will use an ensemble learning approach, which will include the following models:
1. Decision Trees
2. Support Vector Machines (SVM)
3. Naive Bayes
4. Logistic Regression

The final model will be the average of the four models listed above.
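As a rough illustration of the averaging idea, the sketch below fits the four model families and averages their predicted probabilities for the positive class. It is one possible reading of "average of the four models". The synthetic data from `make_classification`, the hyperparameters, and the 0.5 threshold are assumptions made for the example, not the project's actual setup.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data standing in for the Titanic features (assumption).
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=17)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=17)

models = [
    DecisionTreeClassifier(max_depth=3, random_state=17),
    SVC(probability=True, random_state=17),  # probability=True enables predict_proba
    GaussianNB(),
    LogisticRegression(max_iter=1000),
]

# Fit each model and collect its predicted probability of the positive class.
probas = [m.fit(X_tr, y_tr).predict_proba(X_te)[:, 1] for m in models]

# The ensemble score is the plain average of the four probabilities.
avg_proba = np.mean(probas, axis=0)
ensemble_pred = (avg_proba >= 0.5).astype(int)

print('Ensemble accuracy on the held-out split:', (ensemble_pred == y_te).mean())
```

Averaging probabilities rather than hard labels lets a confident model outweigh an uncertain one. Averaging hard votes would be an equally valid choice.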

## Loading Libraries

In [1]:
import tensorflow as tf
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

## Titanic Data

The dataset for this project is called "Titanic" and includes the following features:

| Variable | Definition                                 | Key                                            |
|----------|--------------------------------------------|------------------------------------------------|
| Id       | Passenger identifier                       |                                                |
| Name     | Passenger name                             |                                                |
| Age      | Age in years                               |                                                |
| SibSp    | # of siblings / spouses aboard the Titanic |                                                |
| ParCh    | # of parents / children aboard the Titanic |                                                |
| Ticket   | Ticket number                              |                                                |
| Fare     | Passenger fare                             |                                                |
| Cabin    | Cabin number                               |                                                |
| Embarked | Port of embarkation                        | C = Cherbourg, Q = Queenstown, S = Southampton |
| Class    | Ticket class                               | Lower, Middle, Upper                           |
| Sex      | Sex                                        |                                                |
| Survived | Survived                                   | N = No, Y = Yes                                |

The variable we want to predict is **Survived**, a binary feature that takes the values:
* N: Did not survive
* Y: Survived

In [2]:
#Loading dataset "Titanic"
titanic = pd.read_csv('data_titanic_proyecto.csv')
titanic.columns = ['Id', 'Name', 'Age', 'SibSp', 'ParCh', 'Ticket', 'Fare', 'Cabin', 'Embarked', 
                   'Class', 'Sex', 'Survived']

In [3]:
titanic.head()

Unnamed: 0,Id,Name,Age,SibSp,ParCh,Ticket,Fare,Cabin,Embarked,Class,Sex,Survived
0,1,"Braund, Mr. Owen Harris",22.0,1,0,A/5 21171,7.25,,S,Lower,M,N
1,2,"Cumings, Mrs. John Bradley (Florence Briggs Th...",38.0,1,0,PC 17599,71.2833,C85,C,Upper,F,Y
2,3,"Heikkinen, Miss. Laina",26.0,0,0,STON/O2. 3101282,7.925,,S,Lower,F,Y
3,4,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",35.0,1,0,113803,53.1,C123,S,Upper,F,Y
4,5,"Allen, Mr. William Henry",35.0,0,0,373450,8.05,,S,Lower,M,N


Now we're going to separate the data set into two:
* X: Contains all features that we're going to use in the model
* y: Contains the variable we want to predict

In [21]:
X = titanic.drop(['Name', 'Ticket', 'Cabin', 'Survived'], axis = 1)
y = pd.DataFrame(titanic['Survived'])
print('Number of rows for dataset X: ', X.shape[0], sep = '')
print('Number of rows for dataset y: ', y.shape[0], sep = '')

Number of rows for dataset X: 891
Number of rows for dataset y: 891


Check for NA values in each feature:

In [22]:
total = X.isnull().sum().sort_values(ascending=False)
percent = (X.isnull().sum()/X.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data

Unnamed: 0,Total,Percent
Age,177,0.198653
Embarked,2,0.002245
Sex,0,0.0
Class,0,0.0
Fare,0,0.0
ParCh,0,0.0
SibSp,0,0.0
Id,0,0.0


## Feature Engineering: Part 1

### Dealing with NAs

For the column **Embarked** we will use mode imputation: the 2 missing values are replaced with the most frequent value in the column, which in this case is **S: Southampton**.

In [23]:
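# Assign the result back instead of using inplace=True on a column selection,
# which newer pandas versions warn about and may silently ignore
X['Embarked'] = X['Embarked'].fillna(X['Embarked'].mode()[0])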
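To make the mode imputation concrete, here is a minimal sketch on a toy column. The values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy column with two missing ports (made-up values).
embarked = pd.Series(['S', 'C', None, 'S', 'Q', None, 'S'])

# mode() returns a Series because ties are possible; [0] takes the first mode.
most_common = embarked.mode()[0]
filled = embarked.fillna(most_common)

print(most_common)
print(filled.isnull().sum())
```

Missing entries are ignored when the mode is computed, so the imputed value always comes from the observed data.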

For the column **Age** we'll use another approach: a linear model. But first, we need to convert the categorical features to numerical values, because the model cannot work with ***string type*** inputs.

In [24]:
#For Embarked
embarked_label = np.unique(X['Embarked'].dropna())
embarked_index = np.arange(0,len(embarked_label))
embarked_dict = {'embarked_index': embarked_index, 'label': embarked_label}
embarked_dict = pd.DataFrame(embarked_dict, index=embarked_label)

#For Passenger Class
class_label = np.unique(X['Class'].dropna())
class_index = np.arange(0,len(class_label))
class_dict = {'class_index': class_index, 'label': class_label}
class_dict = pd.DataFrame(class_dict, index=class_label)

#For Passenger Sex
sex_label = np.unique(X['Sex'].dropna())
sex_index = np.arange(0,len(sex_label))
sex_dict = {'sex_index': sex_index, 'label': sex_label}
sex_dict = pd.DataFrame(sex_dict, index=sex_label)

#For Passenger Survival
survived_label = np.unique(y['Survived'].dropna())
survived_index = np.arange(0,len(survived_label))
survived_dict = {'survived_index': survived_index, 'label': survived_label}
survived_dict = pd.DataFrame(survived_dict, index=survived_label)

In [25]:
# Map each categorical label to its integer code using the lookup tables built above
X['embarked_code'] = X['Embarked'].map(embarked_dict['embarked_index'])
X['class_code'] = X['Class'].map(class_dict['class_index'])
X['sex_code'] = X['Sex'].map(sex_dict['sex_index'])
y['survived_code'] = y['Survived'].map(survived_dict['survived_index'])
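The encoding above assigns each label its position in the sorted list of unique labels. The sketch below reproduces that rule on a toy column (made-up values) and compares it with `pd.Categorical`, which is offered here only as an alternative and is not used in the notebook.

```python
import numpy as np
import pandas as pd

# Toy column (made-up values) mirroring the 'Class' feature.
ticket_class = pd.Series(['Lower', 'Upper', 'Lower', 'Middle'])

# np.unique returns the labels sorted, so each label's position is its code.
labels = np.unique(ticket_class.to_numpy())
code_of = {str(label): code for code, label in enumerate(labels)}
codes_from_map = ticket_class.map(code_of)

# pd.Categorical also assigns codes by sorted category order.
codes_from_categorical = pd.Categorical(ticket_class).codes

print(code_of)
print(codes_from_map.tolist())
```

Integer codes imply an ordering. For **Class** the alphabetical order happens to match Lower < Middle < Upper, but for **Embarked** the order is arbitrary, so one-hot encoding may be worth considering for linear models.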

In [26]:
X.head()

Unnamed: 0,Id,Age,SibSp,ParCh,Fare,Embarked,Class,Sex,embarked_code,class_code,sex_code
0,1,22.0,1,0,7.25,S,Lower,M,2,0,1
1,2,38.0,1,0,71.2833,C,Upper,F,0,2,0
2,3,26.0,0,0,7.925,S,Lower,F,2,0,0
3,4,35.0,1,0,53.1,S,Upper,F,2,2,0
4,5,35.0,0,0,8.05,S,Lower,M,2,0,1


In [27]:
y.head()

Unnamed: 0,Survived,survived_code
0,N,0
1,Y,1
2,Y,1
3,Y,1
4,N,0


Now we need to fit the linear model in order to estimate the missing values for age. This linear model takes **Age** as the variable we want to predict, using the following features:
* Embarked
* Class
* Sex
* SibSp
* ParCh
* Fare

In [119]:
def age_prediction(x, y):
    """Fit a linear model for Age; return (coefficients, intercept, half-MSE)."""
    modelB = LinearRegression()
    modelB.fit(X=x, y=y)
    y_hat = modelB.predict(x)
    # Half of the mean squared error on the training rows
    error = 1/2 * np.mean((y_hat - y)**2)
    # LinearRegression has no `features` attribute, so only the fitted parameters are returned
    return modelB.coef_, modelB.intercept_, error

In [120]:
# Fit on the rows with a known Age, using only the numeric features
age_features = X.dropna().drop(['Id', 'Embarked', 'Class', 'Sex', 'Age'], axis = 1)
age_lm = age_prediction(age_features, X['Age'].dropna())

Now that we have the linear model for the prediction of age in place, we'll compute the predicted values for **Age** in a new column called ***age_lm***.

In [116]:
# Coefficient order: SibSp, ParCh, Fare, embarked_code, class_code, sex_code
# Note: this sum leaves out the model's intercept, so it is not yet the full prediction
X['age_lm'] = (age_lm[0][0]*X['SibSp'] + age_lm[0][1]*X['ParCh'] + age_lm[0][2]*X['Fare']
               + age_lm[0][3]*X['embarked_code'] + age_lm[0][4]*X['class_code']
               + age_lm[0][5]*X['sex_code'])
X

Unnamed: 0,Id,Age,SibSp,ParCh,Fare,Embarked,Class,Sex,embarked_code,class_code,sex_code,age_lm
0,1,22.0,1,0,7.2500,S,Lower,M,2,0,1,1.082543
1,2,38.0,1,0,71.2833,C,Upper,F,0,2,0,9.308701
2,3,26.0,0,0,7.9250,S,Lower,F,2,0,0,1.770440
3,4,35.0,1,0,53.1000,S,Upper,F,2,2,0,11.462532
4,5,35.0,0,0,8.0500,S,Lower,M,2,0,1,4.908152
5,6,,0,0,8.4583,Q,Lower,M,1,0,1,3.958749
6,7,54.0,0,0,51.8625,S,Upper,M,2,2,1,18.457609
7,8,2.0,3,1,21.0750,S,Lower,M,2,0,1,-7.615631
8,9,27.0,0,2,11.1333,S,Lower,F,2,0,0,0.082436
9,10,14.0,1,0,30.0708,C,Middle,F,0,1,0,2.817478
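A full linear prediction is the weighted sum of the features plus the intercept. The sketch below, on a toy frame with made-up values, shows that `predict` matches that formula and then fills only the rows where **Age** is missing. The column name `Age_filled` and the two-feature setup are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy frame (made-up values); 'Age' has two gaps to fill.
demo = pd.DataFrame({
    'SibSp': [1, 0, 0, 3, 1, 0],
    'Fare':  [7.25, 71.28, 8.05, 21.08, 30.07, 8.46],
    'Age':   [22.0, 38.0, np.nan, 2.0, 14.0, np.nan],
})
features = ['SibSp', 'Fare']
known = demo['Age'].notnull()

# Fit on the rows where Age is known.
model = LinearRegression().fit(demo.loc[known, features], demo.loc[known, 'Age'])

# predict() applies both the coefficients and the intercept.
predicted = model.predict(demo[features])
manual = demo[features].to_numpy() @ model.coef_ + model.intercept_

# Keep observed ages and use predictions only where Age was missing.
demo['Age_filled'] = demo['Age'].where(known, predicted)
print(demo)
```

Keeping the observed ages untouched means the imputation adds information only where it is needed.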


Now we can check the missing values again:

In [None]:
total = X.isnull().sum().sort_values(ascending=False)
percent = (X.isnull().sum()/X.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data

## Train, Val, and Test Sets

We're going to split the dataset into three different sets:
* Train
* Validation (Val)
* Test

For these datasets, we'll use the following features:
1. Age
2. SibSp
3. ParCh
4. Fare
5. Embarked
6. Class
7. Sex

And the variable we want to predict:
1. Survived

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=17)

In [None]:
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.3, random_state=17)

In [None]:
print('Number of rows in Train: ', len(X_train), ' (', round(100*len(X_train)/len(titanic),1), '%)', sep = '')
print('Number of rows in Val: ', len(X_val), ' (', round(100*len(X_val)/len(titanic),1), '%)', sep = '')
print('Number of rows in Test: ', len(X_test), ' (', round(100*len(X_test)/len(titanic),1), '%)', sep = '')
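Because the second split is applied to what remains after the first, the overall proportions are roughly 56% train, 24% validation, and 20% test. The sketch below works through that arithmetic for the 891 rows reported earlier, assuming the held-out size is rounded up at each split (this rounding rule is an assumption about `train_test_split`, so the exact counts may differ by a row).

```python
import math

n_rows = 891  # number of rows reported for X earlier in the notebook

# First split: 20% held out for test.
n_test = math.ceil(0.2 * n_rows)
n_remaining = n_rows - n_test

# Second split: 30% of the remainder held out for validation.
n_val = math.ceil(0.3 * n_remaining)
n_train = n_remaining - n_val

for name, n in [('Train', n_train), ('Val', n_val), ('Test', n_test)]:
    print(name, n, round(100 * n / n_rows, 1))
```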

## Exploration

### Histograms for numeric features

In [None]:
# Note: sns.distplot is deprecated in recent seaborn releases; sns.histplot is its replacement
plt.figure(figsize=(12,12))
sns.set_style("whitegrid")
plt.subplot(2,2,1)
sns.distplot(X_train['Age'].dropna(), color="green", axlabel='Age of the Passengers')
plt.subplot(2,2,2)
sns.distplot(X_train['SibSp'], color='red', axlabel='Sibling/Spouse')
plt.subplot(2,2,3)
sns.distplot(X_train['ParCh'], color='blue', axlabel='Parent/Children')
plt.subplot(2,2,4)
sns.distplot(X_train['Fare'], color='orange', axlabel='Passengers Fare')

In [None]:
# Look up the integer code for the ticket class of one training row (selected by position)
temp = class_dict[class_dict.index==X_train['Class'].iloc[8]]
temp
#temp['class_index'].iloc[0]