# Titanic Dataset
# Goal

**The model should predict whether a passenger survived or not. This is a typical binary classification problem, and we will build a machine learning model for it using Decision Trees.**

# Data Dictionary

Let's look at what each column of the dataset contains.

![data dictionary](datasets/titanic/titanic_data_dictionary.png)

* First things first: for the machine learning algorithm to work, the dataset must be converted to numeric data.
* You have to encode all the categorical labels as column vectors with binary values.
* Missing values (NaNs) in the dataset need to be handled. You have to either drop the rows that contain them or fill them in with a mean or interpolated value.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
%matplotlib inline

In [2]:
# Reading data from csv file using pandas
df = pd.read_csv('datasets/titanic/dataset.csv',header=0, index_col="PassengerId")
df.head()

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


Let's take a look at the data format below.

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 1 to 891
Data columns (total 11 columns):
Survived    891 non-null int64
Pclass      891 non-null int64
Name        891 non-null object
Sex         891 non-null object
Age         714 non-null float64
SibSp       891 non-null int64
Parch       891 non-null int64
Ticket      891 non-null object
Fare        891 non-null float64
Cabin       204 non-null object
Embarked    889 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 83.5+ KB


If you look carefully at the summary above, there are 891 rows in total, but Age has only 714 non-null values (177 missing), Embarked has 889 (2 missing) and Cabin has only 204 (687 missing). Columns with the object data type are non-numeric, so we have to find a way to encode them as numerical values.
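The missing-value counts can also be computed directly instead of being read off `df.info()`. Here is a minimal sketch on a small made-up frame; the `toy` data is hypothetical and only stands in for the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the Titanic file
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Embarked": ["S", "C", None, "S"],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

# isna() marks the missing cells; sum() counts them per column
missing_counts = toy.isna().sum()
print(missing_counts)
```

Running the same two calls on `df` should give the per-column counts discussed above.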

Let's drop some of the columns that may not contribute much to our machine learning model, such as Name, Ticket and Cabin.

In [4]:
# Here inplace=True means the change is made to the DataFrame itself.
# Without it, drop() returns a modified copy and leaves the original DataFrame unchanged.
cols = ['Name','Ticket','Cabin']
df.drop(cols, axis=1, inplace=True)
df.head()

PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
1,0,3,male,22.0,1,0,7.25,S
2,1,1,female,38.0,1,0,71.2833,C
3,1,3,female,26.0,0,0,7.925,S
4,1,1,female,35.0,1,0,53.1,S
5,0,3,male,35.0,0,0,8.05,S


Next, if we want, we can drop all rows in the data that have missing values (NaN).
This can be done with the command **df.dropna(inplace=True)**.
But we would lose at least 177 rows (which also contain other data), a significant amount of information that the model could learn from.
So we fill in the missing values instead.

Since Age contains continuous values, we can replace the missing values with the mean or median of the column.
Since Embarked contains categorical values, we can replace the missing values with the mode (the most frequently occurring value) of the column.

In [5]:
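# Assign the result back instead of calling fillna(..., inplace=True) on a selected column,
# which recent pandas versions warn about and which may not update df.
df['Age'] = df['Age'].fillna(df['Age'].mean()) # Fill the NaN values in Age with the mean age
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0]) # Fill the NaN values in Embarked with the most frequent value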
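To see what these two fills do, here is a minimal sketch on a hypothetical toy frame (the values are made up). It assigns the filled column back rather than using `inplace=True` on a selected column, since recent pandas versions warn about that pattern.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value per column
toy = pd.DataFrame({
    "Age": [20.0, np.nan, 40.0],
    "Embarked": ["S", None, "S"],
})

# Continuous column: fill with the mean of the known values (here 30.0)
toy["Age"] = toy["Age"].fillna(toy["Age"].mean())

# Categorical column: fill with the most frequent value (here "S")
toy["Embarked"] = toy["Embarked"].fillna(toy["Embarked"].mode()[0])

print(toy)
```

Swapping `mean()` for `median()` gives the median fill mentioned above.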

## One hot encoding

One hot encoding extracts the unique values of a column into columns of their own, each holding a boolean (0/1) value that indicates whether that value applies to the row.

For example, one hot encoding **Pclass** will create the columns **1, 2, 3**. If **1** is true for a passenger, then that passenger belongs to the __1st class__.

Pandas has a built-in method for this purpose:

In [6]:
dummies_1 = pd.get_dummies(df[["Sex", "Embarked"]]) # one hot encoding for columns with object type
dummies_2 = pd.get_dummies(df['Pclass']) # one hot encoding for column with numeric type
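As a small illustration of what `pd.get_dummies` produces, here is a sketch on a hypothetical two-row frame. It passes `columns=` so that the numeric `Pclass` column is encoded too and gets prefixed names such as `Pclass_1` instead of bare `1, 2, 3`. That choice is an assumption of this sketch, not what the notebook does. It keeps every column name a string, which some newer scikit-learn versions appear to require.

```python
import pandas as pd

# Hypothetical toy data, not the Titanic file
toy = pd.DataFrame({"Sex": ["male", "female"], "Pclass": [3, 1]})

# columns= lists the columns to encode; dtype=int gives 0/1 instead of booleans
encoded = pd.get_dummies(toy, columns=["Sex", "Pclass"], dtype=int)

print(encoded)
```

Each original column is replaced by one indicator column per unique value, and exactly one indicator per original column is set in every row.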

Now that we have the one hot encoded values, we no longer need the original columns, so we drop them and add the new columns.

In pandas, two DataFrames can be concatenated with the **pd.concat()** method.

In [7]:
df.drop(["Pclass", "Sex", "Embarked"], axis=1, inplace=True)

In [8]:
df = pd.concat([df, dummies_1, dummies_2], axis=1)
df.head()

PassengerId,Survived,Age,SibSp,Parch,Fare,Sex_female,Sex_male,Embarked_C,Embarked_Q,Embarked_S,1,2,3
1,0,22.0,1,0,7.25,0,1,0,0,1,0,0,1
2,1,38.0,1,0,71.2833,1,0,1,0,0,1,0,0
3,1,26.0,0,0,7.925,1,0,0,0,1,0,0,1
4,1,35.0,1,0,53.1,1,0,0,0,1,1,0,0
5,0,35.0,0,0,8.05,0,1,0,0,1,0,0,1


Since we need to predict **Survived**, we treat it as the **label** and the other columns as **features**.

We will separate the features and the label into the **X** and __y__ variables respectively, as follows.

In [9]:
X = df.drop(['Survived'], axis=1) # Survived will be used as label
y = df["Survived"]

Now that we have our features and labels, we can train our model. But we don't yet have any data on which to test the model's accuracy, that is, the fraction of samples for which the model predicts the correct value. For this purpose we will use a function from sklearn that splits the data into two sets: a training set and a testing set.

In [11]:
trainX, testX, trainy, testy = train_test_split(X, y, test_size=0.1)
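Because the split is random, each run of the cell above selects different rows, so the accuracy reported later will vary between runs. The sketch below, on hypothetical toy arrays, shows two optional arguments that are not used in this notebook: `random_state` to make the split repeatable and `stratify` to keep the class ratio the same in both sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 samples, 2 features, balanced 0/1 labels
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0, 1] * 10)

# random_state fixes the shuffle; stratify preserves the label proportions
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(X_tr.shape, X_te.shape)
```

With `test_size=0.2`, 4 of the 20 toy samples go to the testing set and the remaining 16 to the training set.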

### Let's train our Decision Tree model on the training data

In [12]:
model = DecisionTreeClassifier()
model.fit(trainX, trainy)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

### We will use the model to predict for the testing set and store the predictions in a variable

In [13]:
predictions = model.predict(testX)

### Let's look at some of the predicted as well as actual values

In [17]:
pd.DataFrame(list(zip(predictions, testy)), columns=["Prediction", "Actual"]).head(10)

,Prediction,Actual
0,0,1
1,0,0
2,1,1
3,1,1
4,1,1
5,0,1
6,1,1
7,1,1
8,1,1
9,1,1


### Let's use accuracy_score function from sklearn to see how good our model is.

This function compares the predicted values with the actual values and returns the fraction of predictions that match (a value between 0 and 1, so 0.81 means about 81% of the predictions were correct).

In [14]:
accuracy_score(testy, predictions)

0.8111111111111111
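To make the calculation concrete, the sketch below recomputes the accuracy by hand for only the ten prediction/actual pairs shown in the table earlier and checks it against `accuracy_score`. The variable names are illustrative.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# The ten prediction/actual pairs displayed in the table earlier
predicted = np.array([0, 0, 1, 1, 1, 0, 1, 1, 1, 1])
actual = np.array([1, 0, 1, 1, 1, 1, 1, 1, 1, 1])

# Accuracy is the fraction of positions where prediction and actual agree
manual = (predicted == actual).mean()
from_sklearn = accuracy_score(actual, predicted)

print(manual, from_sklearn)
```

Eight of these ten rows match, giving 0.8 for this subset. The full-test-set score of 0.8111 is consistent with 73 correct predictions out of 90 test rows, since a `test_size` of 0.1 on 891 rows rounds up to 90.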