# The Titanic Challenge (Beginner)

#### Note: You do not need to understand the Python code or be able to write code to complete this tutorial and pass the Challenge.
#### Remember to hit Shift+Enter in all the code cells to execute the code.

<div class="alert alert-block alert-info">A cell like this indicates a question you need to answer for this Challenge on the U4I platform. Please answer the question <b>before</b> continuing through the notebook.</div>

## Table of contents

[1. Introduction](#1.-Introduction)

[2. Get familiar with the data](#2.-Get-familiar-with-the-data)

[3. Prepare the data](#3.-Prepare-the-data)
   - [3a. Remove some features](#3a.-Remove-some-features)
   - [3b. Replace strings](#3b.-Replace-strings)
   - [3c. Fill in missing data](#3c.-Fill-in-missing-data)
   - [3d. Combine features](#3d.-Combine-features)
   
[4. Visualize the data](#4.-Visualize-the-data)

[5. Create a machine learning model](#5.-Create-a-machine-learning-model)

## 1. Introduction

[[ go back to the top ]](#Table-of-contents)

The sinking of the Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the widely considered "unsinkable" RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren't enough lifeboats for everyone on board, resulting in the death of 1502 of the 2224 passengers and crew. While there was some element of luck involved in surviving, it seems some were more likely to survive than others.

In this Challenge, you will build a predictive model to answer the question "<b>Who was more likely to survive on the Titanic?</b>" using passenger data. 

To do this, we will use a subset of the passenger data (891 of the 2224 passengers and crew on board), which includes passenger survival. Once we have trained our predictive model, we will test it on a separate subset of the passenger data (418 of the passengers and crew on board), which does not include passenger survival, to determine the prediction accuracy of the model we developed. 

Source: https://www.kaggle.com/c/titanic

## 2. Get familiar with the data

[[ go back to the top ]](#Table-of-contents)

Before we start exploring, we need to import some libraries that will help us with our calculations, visualizations, and machine learning models. 

In [None]:
# Import data analysis libraries
import pandas as pd
import numpy as np
import random as rnd

# Import visualization libraries
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")

# Import machine learning libraries
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

Now, let's import the data set `Titanic_1` and take a look at it:

In [None]:
# Import data set, call it "data"
data = pd.read_csv('Titanic_1.csv')

# Show first 15 rows of data set
data.head(15)

We can see that each row in the table corresponds to one passenger. These data are a mixture of <b>categorical</b> and <b>numerical</b> features:

`PassengerId`: Unique ID of the passenger\
`Survived`: Survived (1) or died (0)\
`Pclass`: Passenger's class (1st, 2nd, or 3rd)\
`Name`: Passenger's name\
`Sex`: Passenger's sex (male or female)\
`Age`: Passenger's age\
`SibSp`: Number of siblings / spouses aboard the Titanic\
`Parch`: Number of parents / children aboard the Titanic\
`Ticket`: Ticket number\
`Fare`: Fare paid for ticket\
`Cabin`: Cabin number\
`Embarked`: Where the passenger got on the ship (C - Cherbourg, S - Southampton, Q - Queenstown)

From this we can already discern some information about passengers. For example, Braund Owen Harris was a 22-year-old man traveling in 3rd class who did not survive.  

Note: "NaN" is the abbreviation for "Not a Number". This is how the pandas library, which we use to handle the data in Python, represents missing data. 
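
As a small illustration, the sketch below shows how a missing value can be detected. It uses plain Python rather than the data set, and the list `ages` is a made-up example, not real passenger data.

```python
import math

# A hypothetical list of ages in which the second value is missing (NaN)
ages = [22.0, float("nan"), 38.0]

# math.isnan() reports whether a value is NaN
missing_flags = [math.isnan(age) for age in ages]
print(missing_flags)

# Count how many values are missing (True counts as 1)
n_missing = sum(missing_flags)
print(n_missing)
```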

<div class="alert alert-block alert-info">Pause! Answer <b>Q1 on the U4I platform</b>. 
    
Question 1: Why could missing data (NaNs) be problematic for machine learning models? </div>

Let's now get an overview of the numerical features of the data set.

In [None]:
# Show summary statistics (count, mean, min, max, ...) for the numerical features
data.describe()

In [None]:
# Show summary statistics (count, unique values, most frequent value) for the text features
data.describe(include=['O'])

## 3. Prepare the data

[[ go back to the top ]](#Table-of-contents)

Before we work with any machine learning models, we need to ensure the data is prepared.

#### 3a. Remove some features

The first step is to <b>remove some features</b> by answering the following questions:

##### Which features contain blank, null, or empty values?

In [None]:
# Show missing values in data set
column_names = data.columns
for column in column_names:
    print(column + ': ' + str(data[column].isnull().sum()))

We can see that `Cabin` has the most missing values (687), followed by `Age` (177), and then `Embarked` (2).

##### Which features are mixed data types?
`Ticket` is a mix of numeric and alphanumeric data types. `Cabin` is alphanumeric.

##### Which features may contain errors or typos?
`Name` might contain errors or typos, as there are several ways to write a name, including titles, round brackets, and quotes used for alternative or short names.

Therefore, we will remove the features `Ticket`, `Cabin`, and `Name`.

In [None]:
# Remove features Ticket, Cabin, Name from data set
data = data.drop(['Ticket', 'Cabin', 'Name'], axis=1)

<div class="alert alert-block alert-info">Pause! Answer <b>Q2 on the U4I platform</b>.
    
Question 2: Why did we keep PassengerId in the data set? Remember: We want to train a model to predict survival and test our model on a data set that does not include survival.</div>
   

<div class="alert alert-block alert-info">Pause! Answer <b>Q3 on the U4I platform</b>.
    
Question 3: If you could delete another variable (column) from the data set, which one would you delete and why?</div>

#### 3b. Replace strings 

Next, we need to <b>replace strings (text or letter sequences) with numbers</b>.

This is because the machine learning algorithms we will use cannot process words. We will replace female with 1, male with 0, S with 0, C with 1, and Q with 2.
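
The replacement rule can be pictured as a lookup table. The following is a minimal sketch in plain Python, assuming a few made-up example values (`sexes`, `ports`) instead of the real data set. It only illustrates the idea and is not the pandas code used in this notebook.

```python
# Hypothetical example values, not read from the data set
sexes = ["male", "female", "female"]
ports = ["S", "C", "Q"]

# Lookup tables that follow the replacement rule described above
sex_codes = {"female": 1, "male": 0}
port_codes = {"S": 0, "C": 1, "Q": 2}

# Look up the number for every string
sexes_as_numbers = [sex_codes[s] for s in sexes]
ports_as_numbers = [port_codes[p] for p in ports]

print(sexes_as_numbers)
print(ports_as_numbers)
```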

In [None]:
# Replace strings with numbers

# Assign the result back to the column so the change is kept in all pandas versions
data['Sex'] = data['Sex'].replace({'female': 1, 'male': 0})
data['Embarked'] = data['Embarked'].replace({'S': 0, 'C': 1, 'Q': 2})

#### 3c. Fill in missing data

Finally, we need to deal with <b>missing data</b>.

Data records are not always complete, and this is also true for our data set. Missing data can interfere with machine learning algorithms, so we need to <b>fill in the missing data</b>. One way to fill in missing values is to approximate them from the available values in the data set (e.g., the mean or average value, the median or middle value, or the mode or most common value).
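
To make the three options concrete, here is a minimal sketch using Python's built-in `statistics` module. The list `ages` is a made-up example in which `None` stands for a missing value. The notebook itself uses pandas for this step.

```python
import statistics

# Hypothetical ages; None marks a missing value
ages = [22, 38, None, 26, 35, None, 26]

# Keep only the values that are actually known
known_ages = [age for age in ages if age is not None]

mean_age = statistics.mean(known_ages)      # average value
median_age = statistics.median(known_ages)  # middle value
mode_age = statistics.mode(known_ages)      # most common value

# Fill every gap with the median of the known values
filled_ages = [median_age if age is None else age for age in ages]

print(mean_age, median_age, mode_age)
print(filled_ages)
```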

In [None]:
# Supplement missing data in Age with median and Embarked with mode (most common value)

# median() and mode() ignore missing values by default
data['Age'] = data['Age'].fillna(data['Age'].median())
freq_port = data['Embarked'].mode()[0]
data['Embarked'] = data['Embarked'].fillna(freq_port)

#### 3d. Combine features 

Sometimes it can be useful to <b>combine features into new features</b> for visualizations and calculations.

In [None]:
# Create new data categories for Age and Fare

# Create 5 age groups

data['AgeBand'] = pd.cut(data['Age'], 5)
data[['AgeBand', 'Survived']].groupby(['AgeBand'], as_index=False).mean().sort_values(by='AgeBand', ascending=True)
data.loc[ data['Age'] <= 16, 'Age'] = 0
data.loc[(data['Age'] > 16) & (data['Age'] <= 32), 'Age'] = 1
data.loc[(data['Age'] > 32) & (data['Age'] <= 48), 'Age'] = 2
data.loc[(data['Age'] > 48) & (data['Age'] <= 64), 'Age'] = 3
data.loc[ data['Age'] > 64, 'Age'] = 4
data = data.drop(['AgeBand'], axis=1)

# Create 4 fare groups

data['Fare'] = data['Fare'].fillna(data['Fare'].median())
data['FareBand'] = pd.qcut(data['Fare'], 4)
data[['FareBand', 'Survived']].groupby(['FareBand'], as_index=False).mean().sort_values(by='FareBand', ascending=True)
data.loc[ data['Fare'] <= 7.91, 'Fare'] = 0
data.loc[(data['Fare'] > 7.91) & (data['Fare'] <= 14.454), 'Fare'] = 1
data.loc[(data['Fare'] > 14.454) & (data['Fare'] <= 31), 'Fare'] = 2
data.loc[ data['Fare'] > 31, 'Fare'] = 3
data['Fare'] = data['Fare'].astype(int)
data = data.drop(['FareBand'], axis=1)
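
The age grouping can also be written as a small function. This is a sketch in plain Python, assuming the same group boundaries (16, 32, 48, 64) as in the cell above. The names `AGE_EDGES`, `age_group`, and `example_ages` are made up for this illustration.

```python
import bisect

# Upper edges of the first four age groups; ages above 64 fall into group 4
AGE_EDGES = [16, 32, 48, 64]

def age_group(age):
    """Return the group number (0 to 4) for an age."""
    # bisect_left counts how many edges are strictly below the age,
    # so an age equal to an edge stays in the lower group.
    return bisect.bisect_left(AGE_EDGES, age)

# Hypothetical ages, not read from the data set
example_ages = [4, 16, 22, 40, 64, 80]
groups = [age_group(age) for age in example_ages]
print(groups)
```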

<div class="alert alert-block alert-info">Pause! Answer <b>Q4 on the U4I platform</b>.
    
Question 4: Why did we sort Age and Fare into groups?</div>

<div class="alert alert-block alert-info">Pause! Answer <b>Q5 on the U4I platform (Bonus Question)</b>.
    
Question 5: The port designations S, C, and Q were replaced by 0, 1, and 2. Linear models, such as perceptrons, assume that the survival probability either steadily increases or steadily decreases as the value of a variable increases. What is the problem with our approach (replacing port designations with 0, 1, and 2)? How can we avoid this problem? Think about how to do it better.</div>

In [None]:
# Show first 10 rows of new data set
data.head(10)

## 4. Visualize the data

[[ go back to the top ]](#Table-of-contents)

Visualizing data is a great way to gain some insights and see some trends before applying any machine learning models.

Because we want to predict survival, it makes sense to visualize the relationship between some of the factors and survival.

### Class vs. Survival

In [None]:
# Plot the average survival rate for each passenger class
sns.barplot(x='Pclass', y='Survived', data=data)
plt.show()

### Sex vs. Survival

In [None]:
# Plot the average survival rate for each sex (0 = male, 1 = female)
sns.barplot(x='Sex', y='Survived', data=data)
plt.show()

### Class and Sex vs. Survival

In [None]:
tab = pd.crosstab(data['Pclass'], data['Sex'])
tab.div(tab.sum(1).astype(float), axis=0).plot(kind="bar", stacked=True)
plt.xlabel('Pclass')
plt.ylabel('Percentage')
plt.show()

### Embarked vs. Survival

In [None]:
# Plot the average survival rate for each port of embarkation (0 = S, 1 = C, 2 = Q)
sns.barplot(x='Embarked', y='Survived', data=data)
plt.show()

### Embarked, Class, and Sex vs. Survival

In [None]:
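# catplot with kind='point' replaces factorplot, which was removed from newer seaborn versions
sns.catplot(x='Pclass', y='Survived', hue='Sex', col='Embarked', kind='point', data=data)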
plt.show()

### Age vs. Survival

In [None]:
# Plot the average survival rate for each age group (0 = youngest, 4 = oldest)
sns.barplot(x='Age', y='Survived', data=data)
plt.show()

### Age by Embarked, Class, and Sex vs. Survival

In [None]:
fig = plt.figure(figsize=(15,5))
ax1 = fig.add_subplot(131)
ax2 = fig.add_subplot(132)
ax3 = fig.add_subplot(133)

sns.violinplot(x="Embarked", y="Age", hue="Survived", data=data, split=True, ax=ax1)
sns.violinplot(x="Pclass", y="Age", hue="Survived", data=data, split=True, ax=ax2)
sns.violinplot(x="Sex", y="Age", hue="Survived", data=data, split=True, ax=ax3)

plt.show()


## 5. Create a machine learning model

[[ go back to the top ]](#Table-of-contents)

The goal of a machine learning model is to make accurate predictions on new, previously unseen data. If we are building a model using the data set that contains what we want to predict, we need to divide the data set into two:

- A <b>training</b> subset to train a model, which contains the information we are trying to predict 
- A <b>test</b> subset to test the model, which does not contain the information we are trying to predict
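
The idea of splitting a data set at random can be sketched in a few lines of plain Python. The passenger IDs below are made-up stand-ins for rows of the data set, and the 70/30 proportion is an assumption chosen for this illustration. This shows the concept only, not how any particular library implements it.

```python
import random

# Hypothetical passenger IDs standing in for rows of the data set
passenger_ids = list(range(1, 11))

# Shuffle a copy so that the split is random
shuffled = passenger_ids[:]
random.shuffle(shuffled)

# Keep 30% of the rows for testing and the remaining 70% for training
n_test = round(len(shuffled) * 0.3)
test_ids = shuffled[:n_test]
train_ids = shuffled[n_test:]

print(len(train_ids), len(test_ids))
```

Every row ends up in exactly one of the two subsets, so the model is never tested on a row it was trained on.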

In [None]:
# Separate the data into training set and test set
train_df, test_df = train_test_split(data, test_size=0.3)

Next, we need to separate survival, the outcome, from the rest of the features in the data set.

In [None]:
# Divide each data set (training and test) into two parts: X & Y

X_train = train_df.drop("Survived", axis=1)
Y_train = train_df["Survived"]
X_test = test_df.drop("Survived", axis=1)
Y_test = test_df["Survived"]

Now we can train a model. 
There are numerous predictive modeling algorithms, but not all of them apply to our problem. Our problem is a classification problem: we want to predict whether a passenger survived (1) or died (0) based on other features (e.g., sex, age, class). We are also performing a category of machine learning called supervised learning, because we are training our model with a data set in which the outcome (survival) is already known. Given this, we will take a closer look at a few algorithms: 

### Logistic regression

Logistic regression is a useful model to run early in the workflow. Logistic regression measures the relationship between the categorical dependent variable (in our case, Survival) and one or more independent variables (features) by estimating probabilities using a logistic function, which is the cumulative logistic distribution. 

Source: https://en.wikipedia.org/wiki/Logistic_regression
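
To give a feeling for the logistic function mentioned above, here is a minimal sketch in plain Python. The function name `logistic` and the example inputs are chosen for this illustration. Whatever number goes in, the result lies between 0 and 1, which is why it can be read as a probability.

```python
import math

def logistic(z):
    """Map any real number z to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# Large negative inputs give values near 0, large positive inputs give values near 1
for z in (-4, 0, 4):
    print(z, round(logistic(z), 3))
```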

In [None]:
# Logistic regression
logreg = LogisticRegression()
logreg.fit(X_train, Y_train)
Y_pred = logreg.predict(X_test)
# Accuracy (in %) on the training data, i.e., the data the model has already seen
acc_log = round(logreg.score(X_train, Y_train) * 100, 2)
acc_log

We can use logistic regression to check our assumptions by looking at the coefficients of the features in the decision function.\
Positive coefficients increase the log-odds of the response (and thus increase the probability), and negative coefficients decrease the log-odds of the response (and thus decrease the probability).
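
The effect of a positive coefficient can be shown with a small calculation. The numbers below (`baseline_log_odds`, `coefficient`) are made up for illustration and are not fitted to the Titanic data.

```python
import math

def probability_from_log_odds(log_odds):
    """Convert log-odds into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-log_odds))

# Hypothetical numbers chosen for illustration, not fitted to the data
baseline_log_odds = -1.0   # log-odds when the feature value is 0
coefficient = 2.0          # a positive coefficient for the feature

# Probability when the feature value is 0 and when it is 1
p_before = probability_from_log_odds(baseline_log_odds)
p_after = probability_from_log_odds(baseline_log_odds + coefficient)

print(round(p_before, 3), round(p_after, 3))
```

Adding the positive coefficient raises the log-odds and therefore the probability. A negative coefficient would lower both.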

In [None]:
# Use the columns of X_train so that each name lines up with its coefficient
coeff_df = pd.DataFrame(X_train.columns)
coeff_df.columns = ['Feature']
coeff_df["Coefficient"] = pd.Series(logreg.coef_[0])

coeff_df.sort_values(by='Coefficient', ascending=False)

<b>Sex has the highest positive coefficient</b>, implying that as the Sex value increases (male = 0 to female = 1), the probability of Survived = 1 increases the most.\
<b>Pclass has the most negative coefficient</b>, implying that as class increases (1-3), the probability of Survived = 1 decreases the most.

### Decision tree classifier

The decision tree classifier maps features (tree branches) to conclusions about the target value (tree leaves, in our case, Survival). Tree models where the target variable can take a finite set of values are called classification trees; in these tree structures, leaves represent class labels and branches represent conjunctions of features that lead to those class labels. Decision trees where the target variable can take continuous values (typically real numbers) are called regression trees. 

Source: https://en.wikipedia.org/wiki/Decision_tree
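
A decision tree can be read as a chain of if/else questions. The sketch below is a hand-written, hypothetical tree whose rules were invented for illustration. A real decision tree learns its questions from the training data and will generally look different.

```python
def predict_survival(sex, pclass):
    """Hypothetical hand-written tree; the rules are invented, not learned."""
    if sex == 1:              # branch: the passenger is female
        if pclass <= 2:       # branch: 1st or 2nd class
            return 1          # leaf: predict "survived"
        return 0              # leaf: predict "died"
    return 0                  # leaf: predict "died"

# Follow the branches for two made-up passengers
print(predict_survival(sex=1, pclass=1))
print(predict_survival(sex=0, pclass=3))
```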

In [None]:
# Decision tree
decision_tree = DecisionTreeClassifier()
decision_tree.fit(X_train, Y_train)
Y_pred = decision_tree.predict(X_test)
acc_decision_tree = round(decision_tree.score(X_train, Y_train) * 100, 2)
acc_decision_tree

In [None]:
# Random forest (an ensemble of many decision trees)
random_forest = RandomForestClassifier(n_estimators=100)
random_forest.fit(X_train, Y_train)
Y_pred = random_forest.predict(X_test)
acc_random_forest = round(random_forest.score(X_train, Y_train) * 100, 2)
acc_random_forest

### K-nearest neighbors classifier (K-NN)

The K-NN classifier is a non-parametric method used for classification and regression. A sample is classified by a majority vote of its neighbors, with the sample being assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small). If k = 1, then the object is simply assigned to the class of that single nearest neighbor. 

Source: https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm
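
The majority vote can be sketched in plain Python. The training points and labels below are made up, and `knn_predict` is a hypothetical helper written for this illustration. It is not the implementation used by the library in this notebook.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, new_point, k=3):
    """Classify new_point by a majority vote of its k nearest training points."""
    # Pair each training label with its distance to the new point, nearest first
    distances = sorted(
        (math.dist(point, new_point), label)
        for point, label in zip(train_points, train_labels)
    )
    # Collect the labels of the k nearest neighbors
    nearest_labels = [label for _, label in distances[:k]]
    # Return the label that occurs most often among them
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical training data: two features per sample, label 0 or 1
train_points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8)]
train_labels = [0, 0, 0, 1, 1]

print(knn_predict(train_points, train_labels, (2, 2)))
print(knn_predict(train_points, train_labels, (8, 9)))
```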

In [None]:
# K-nearest neighbors with k = 3
knn = KNeighborsClassifier(n_neighbors = 3)
knn.fit(X_train, Y_train)
Y_pred = knn.predict(X_test)
acc_knn = round(knn.score(X_train, Y_train) * 100, 2)
acc_knn

Now, we will see how well the chosen model predicts our data.\
The function `score` takes the values of the test data set (`X_test`), uses the model to calculate the corresponding values for the survival status, and compares them with the correct values (`Y_test`). The output value `acc_test` is the percentage of passengers in the test data set whose survival status the model predicted correctly.

In [None]:
# Validate model and calculate accuracy

# Choose the model to validate; logreg is used here as an example
# (decision_tree, random_forest, or knn work the same way)
model = logreg
acc_test = round(model.score(X_test, Y_test) * 100, 2)

print("\n  \nThe accuracy of the model with respect to the test data is:")
print(acc_test)
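
What an accuracy score measures can be shown with a small hand calculation. The two lists below are made-up labels, not output from any of the models above.

```python
# Hypothetical true outcomes and model predictions (1 = survived, 0 = died)
true_labels = [1, 0, 0, 1, 1, 0, 1, 0]
predicted_labels = [1, 0, 1, 1, 0, 0, 1, 0]

# Count the passengers for whom the prediction matches the true outcome
n_correct = sum(t == p for t, p in zip(true_labels, predicted_labels))

# Accuracy is the share of correct predictions, here expressed in percent
accuracy_percent = round(n_correct / len(true_labels) * 100, 2)
print(accuracy_percent)
```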

<div class="alert alert-block alert-info">Pause! Answer <b>Q6 on the U4I platform (Bonus Question)</b>.
    
Question 6: Run any of the machine learning algorithms several times in a row (without changing the code). Why do you get a different accuracy each time?</div>

### Congratulations! You have completed the Titanic Challenge (Beginner)! Remember to submit the exercise on the U4I platform.

## Sources:

[[ go back to the top ]](#Table-of-contents)

https://www.kaggle.com/startupsci/titanic-data-science-solutions