# The Titanic Challenge (Beginner)

#### Note: You do not need to understand the Python code or be able to write code to complete this tutorial and pass the Challenge.
#### Remember to hit Shift+Enter in all the code cells to execute the code.

<div class="alert alert-block alert-info">A cell like this indicates a question you need to answer for this Challenge on the U4I platform. Please answer the question <b>before</b> continuing through the notebook.</div>

## Table of Contents
* [Introduction](#1)
* [Get familiar with the data](#2)
* [Prepare the data](#3)
* [Create a machine learning model](#4)

<a id=1></a>
## 1. Introduction

The sinking of the Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 of the 2224 passengers and crew. While there was some element of luck involved in surviving, it seems some were more likely to survive than others.

In this Challenge, you will use passenger data to build a predictive model that answers the question: Who was more likely to survive on the Titanic?

Source: https://www.kaggle.com/c/titanic

<a id=2></a>
## 2. Get familiar with the data

Before we start exploring the data set, we need to import it along with some libraries that will help us with our calculations and machine learning models.

In [None]:
# Data analysis
import pandas as pd
import numpy as np
import random as rnd

# Machine learning
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

In [None]:
# Import data set
data = pd.read_csv('train.csv')

Now, let's look at our data set.

In [None]:
# Show first 10 rows of data set
data.head(10)

We can see our data set has a mixture of categorical and numerical variables:\
`PassengerId`: Unique ID of the passenger\
`Survived`: Survived (1) or died (0)\
`Pclass`: Passenger's class (1st, 2nd, or 3rd)\
`Name`: Passenger's name\
`Sex`: Passenger's sex (male or female)\
`Age`: Passenger's age\
`SibSp`: Number of siblings/spouses aboard the Titanic\
`Parch`: Number of parents/children aboard the Titanic\
`Ticket`: Ticket number\
`Fare`: Fare paid for ticket\
`Cabin`: Cabin number\
`Embarked`: Where the passenger got on the ship (C - Cherbourg, S - Southampton, Q - Queenstown)

From this we can already discern some information about passengers. For example, Braund Owen Harris was a 22-year-old man in 3rd class who did not survive.  

Note: "NaN" is the abbreviation for "Not a Number". This is how pandas, the Python library we use to handle the data, represents missing values.
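
To see how much data is missing, you can count the NaNs in each column. The sketch below uses a small made-up table (`toy` is a hypothetical name and not part of the Challenge data). In this notebook, the equivalent call would be `data.isna().sum()`.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature table (not the real Titanic data) with a few missing values
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, "C123"],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

# isna() marks each missing cell as True; sum() counts the True values per column
missing_per_column = toy.isna().sum()
print(missing_per_column)
```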

<div class="alert alert-block alert-info">Pause! Answer <b>Q1 on the U4I platform</b>.
    
    Why could missing data (NaNs) be problematic for machine learning models?

<a id=3></a>
## 3. Prepare the data

Before we work with any machine learning models, we need to ensure the data set is prepared.

There are some variables that are irrelevant for our purposes, namely `Ticket`, `Cabin`, and `Name`. We need to remove these so they do not affect the accuracy of our model.

In [None]:
# Remove irrelevant data

data = data.drop(['Ticket', 'Cabin', 'Name'], axis=1)

The machine learning algorithms we will use cannot process words so we need to replace strings (text or letter sequences) with numbers. We will replace female with 1, male with 0, S with 0, C with 1, and Q with 2.

<div class="alert alert-block alert-info">Pause! Answer <b>Q2 on the U4I platform</b>.
    
    Why did we keep PassengerId in the data set? What must this column be used for in the model?

<div class="alert alert-block alert-info">Pause! Answer <b>Q3 on the U4I platform</b>.
    
    If you could delete another variable (column) from the data set, which one would you delete and why?

In [None]:
# Replace strings with numbers

# Assign the result back to the column; calling replace(..., inplace=True)
# on a single column may not update the DataFrame in newer pandas versions
data['Sex'] = data['Sex'].replace({'female': 1, 'male': 0})
data['Embarked'] = data['Embarked'].replace({'S': 0, 'C': 1, 'Q': 2})

Data records are not always complete, and this is also true for our data set. Missing data can interfere with machine learning algorithms, so we need to fill in the missing values. One way to do this is to approximate each missing value using the available values in the same column (e.g., their mean, median, or mode).

In [None]:
# Supplement missing data in Age with median and Embarked with mode (most common value)

# median() and mode() ignore missing values by default
data['Age'] = data['Age'].fillna(data['Age'].median())
freq_port = data['Embarked'].mode()[0]
data['Embarked'] = data['Embarked'].fillna(freq_port)
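
As a minimal sketch of the same idea, the example below fills gaps in a small made-up table. The table `toy` and its values are hypothetical and only serve to show what the median and mode do.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature table (not the real Titanic data)
toy = pd.DataFrame({
    "Age": [20.0, np.nan, 30.0, 40.0],
    "Embarked": ["S", "C", None, "S"],
})

# The median of the known ages (20, 30, 40) is 30
age_median = toy["Age"].median()
toy["Age"] = toy["Age"].fillna(age_median)

# mode() returns the most common value(s); [0] picks the first one
most_common_port = toy["Embarked"].mode()[0]
toy["Embarked"] = toy["Embarked"].fillna(most_common_port)

print(toy)
```

A numerical column such as `Age` gets the middle value, while a categorical column such as `Embarked` gets the most frequent category, since a median makes no sense for port names.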

Sometimes it can be useful to combine data categories into new data categories for visualizations and calculations.

In [None]:
# Create new data categories for Age and Fare

# Create 5 age groups

data['AgeBand'] = pd.cut(data['Age'], 5)
data[['AgeBand', 'Survived']].groupby(['AgeBand'], as_index=False).mean().sort_values(by='AgeBand', ascending=True)
data.loc[ data['Age'] <= 16, 'Age'] = 0
data.loc[(data['Age'] > 16) & (data['Age'] <= 32), 'Age'] = 1
data.loc[(data['Age'] > 32) & (data['Age'] <= 48), 'Age'] = 2
data.loc[(data['Age'] > 48) & (data['Age'] <= 64), 'Age'] = 3
data.loc[ data['Age'] > 64, 'Age'] = 4
data = data.drop(['AgeBand'], axis=1)

# Create 4 fare groups

data['Fare'] = data['Fare'].fillna(data['Fare'].median())
data['FareBand'] = pd.qcut(data['Fare'], 4)
data[['FareBand', 'Survived']].groupby(['FareBand'], as_index=False).mean().sort_values(by='FareBand', ascending=True)
data.loc[ data['Fare'] <= 7.91, 'Fare'] = 0
data.loc[(data['Fare'] > 7.91) & (data['Fare'] <= 14.454), 'Fare'] = 1
data.loc[(data['Fare'] > 14.454) & (data['Fare'] <= 31), 'Fare'] = 2
data.loc[ data['Fare'] > 31, 'Fare'] = 3
data['Fare'] = data['Fare'].astype(int)
data = data.drop(['FareBand'], axis=1)
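
The cell above uses two different pandas functions to find the group boundaries: `pd.cut` for `Age` and `pd.qcut` for `Fare`. The following sketch compares the two on a handful of made-up fares (the values are hypothetical, not taken from the data set).

```python
import pandas as pd

# Hypothetical fares (not the real Titanic data); one value is much larger than the rest
fares = pd.Series([5, 7, 8, 10, 12, 15, 30, 100])

# cut: 4 bins of equal width across the range from 5 to 100
equal_width = pd.cut(fares, 4)

# qcut: 4 bins chosen so that each holds roughly the same number of values
equal_count = pd.qcut(fares, 4)

print(equal_width.value_counts().sort_index())
print(equal_count.value_counts().sort_index())
```

With skewed values like these, equal-width bins put most values into the first group, whereas quantile-based bins spread them evenly.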

<div class="alert alert-block alert-info">Pause! Answer <b>Q4 on the U4I platform</b>.
    
    Describe how the variables Age and Fare were grouped.

<div class="alert alert-block alert-info">Pause! Answer <b>Q5 on the U4I platform (Bonus Question)</b>.
    
    The port designations S, C, and Q were replaced by 0, 1, and 2. Linear models, such as perceptrons, assume that the survival probability either keeps increasing or keeps decreasing as a variable's value gets higher. What is the problem with our approach (replacing port designations with 0, 1, and 2)? How can we avoid this problem? Think about how to do it better.

In [None]:
# Show first 15 rows of altered data set
data.head(15)

<a id=4></a>
## 4. Create a machine learning model

Before a machine learning model can make predictions, it needs to be trained: it learns the relationship between the predictors and the outcome from examples in which the outcome is already known. To do this, we will divide our data set into two parts: one to train the model and one to test it. A common choice is to use 70% of the data as the training set and 30% as the test set.

In [None]:
# Separate data into training set and test set
train_df, test_df = train_test_split(data, test_size=0.3)
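
To get a feel for what `train_test_split` does, here is a minimal sketch on a made-up table with 10 rows. The names `toy`, `toy_train`, and `toy_test` are hypothetical and not used elsewhere in this notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical table with 10 rows (not the real Titanic data)
toy = pd.DataFrame({"PassengerId": range(1, 11), "Survived": [0, 1] * 5})

# test_size=0.3 puts 30% of the rows in the test set; the rows are shuffled at random first
toy_train, toy_test = train_test_split(toy, test_size=0.3)

print(len(toy_train), "training rows,", len(toy_test), "test rows")
print("Test passengers:", sorted(toy_test["PassengerId"]))
```

Every row ends up in exactly one of the two sets. Because the rows are shuffled at random, the passengers in the test set can change each time the cell is run.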

<div class="alert alert-block alert-info">Pause! Answer <b>Q6 on the U4I platform</b>.
    
    Why is it important to not use all the data as training data?

Next, we need to separate the survival status (the outcome) from the rest of the variables in the data set (the predictors).

In [None]:
# Divide each data set (test and train) into two parts: X & Y

X_train = train_df.drop("Survived", axis=1)
Y_train = train_df["Survived"]
X_test = test_df.drop("Survived", axis=1)
Y_test = test_df["Survived"]

Now we can train a model. 
There are many predictive modeling algorithms, but not all of them apply to the problem we are trying to solve. Our problem is a classification problem: we want to predict whether a passenger survived or died based on other variables (e.g., sex, age, class). We are also performing a category of machine learning called supervised learning, because we train our model on data for which the outcome (survival) is already known. Given this, we have a few options for an algorithm:
* LogisticRegression 
* LinearSVC 
* RandomForestClassifier 
* KNeighborsClassifier 
* GaussianNB 
* Perceptron 
* SGDClassifier 
* DecisionTreeClassifier 

Instead of a specific algorithm, the code cell below contains the placeholder `PLACEHOLDER()`.\
Select one of the algorithms from the list above and replace `PLACEHOLDER` with its name.\
For example, if you decide to use the first algorithm, the code will look like this:

`# Apply machine learning algorithm LogisticRegression
model = LogisticRegression()
model.fit(X_train, Y_train)
Y_pred = model.predict(X_test)`

In [None]:
# Apply machine learning algorithm ________________

model = PLACEHOLDER()
model.fit(X_train, Y_train)
Y_pred = model.predict(X_test)

Now, we will see how well the chosen model predicts our data.\
The function `score` takes the predictors of the test data set (`X_test`), uses the model to predict the corresponding survival status, and compares the predictions with the correct values (`Y_test`). The output value `acc_log` is the percentage of passengers in the test set whose survival status the model predicted correctly.

In [None]:
# Validate model and calculate accuracy

acc_log = round(model.score(X_test, Y_test) * 100, 2)

print("\n  \nThe accuracy of the model with respect to the test data is:")
print(acc_log)

model_name=str(model)
print("\n \nYou have used the following model for machine learning:")
print(model_name)
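
The accuracy reported by `score` is simply the share of correct predictions. The sketch below checks this by hand on a few made-up, already-encoded passengers. All names starting with `toy` or ending in `_toy_...` are hypothetical, and the choice of `DecisionTreeClassifier` is only for illustration.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical, already-encoded passengers (not the real Titanic data)
X_toy_train = pd.DataFrame({"Sex": [1, 1, 0, 0, 1, 0], "Pclass": [1, 2, 3, 3, 3, 1]})
Y_toy_train = pd.Series([1, 1, 0, 0, 1, 0])
X_toy_test = pd.DataFrame({"Sex": [1, 0, 0, 1], "Pclass": [2, 3, 1, 3]})
Y_toy_test = pd.Series([1, 0, 1, 0])

toy_model = DecisionTreeClassifier(random_state=0)
toy_model.fit(X_toy_train, Y_toy_train)
toy_pred = toy_model.predict(X_toy_test)

# Accuracy by hand: number of correct predictions divided by number of test rows
manual_accuracy = (toy_pred == Y_toy_test.to_numpy()).mean()

# score() performs the same comparison internally
print(manual_accuracy, toy_model.score(X_toy_test, Y_toy_test))
```

Both printed numbers should match. Multiplying by 100, as the cell above does, turns the share into a percentage.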

<div class="alert alert-block alert-info">Pause! Answer <b>Q7 on the U4I platform</b>.
    
    Which machine learning model did you use? What is the accuracy of the model?

<div class="alert alert-block alert-info">Pause! Answer <b>Q8 on the U4I platform (Bonus Question)</b>.
    
    Run the machine learning algorithm several times in a row (without changing the code). Why do you get a different accuracy each time?

<div class="alert alert-block alert-info">Pause! Answer <b>Q9 on the U4I platform (Bonus Question)</b>.
    
    Which model provides the highest accuracy?

### Congratulations! You have completed the Titanic Challenge (Beginner)! Remember to submit the exercise on the U4I platform.