# Titanic: Machine Learning from Disaster

##### Based on the famous [Kaggle Titanic competition](https://www.kaggle.com/c/titanic).

#### Goal: Work through a simple machine learning example from start to finish along a typical data analysis pipeline. 

The sinking of the Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the widely considered "unsinkable" RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren't enough lifeboats for everyone on board, resulting in the death of 1514 of the 2224 passengers and crew. While there was some element of luck involved in surviving, it seems some were more likely to survive than others. We will use machine learning to create a model that predicts which passengers survived the Titanic shipwreck. 

<div align="center">
<img width="1200" title="Titanic at Southampton docks, prior to departure" src="images/titanic.jpg"/>
</div>


**Please note**: This exercise is based on a Jupyter notebook, an interactive environment for writing and running code, and is running in Python. To get familiar with working in Jupyter notebooks, see our short "JupyterLab Tutorial". For a basic introduction to programming in Python, see the "Introduction to Python" notebook.

<mark>We should remember to add the Jupyter Tutorial and Intro to Python notebook (as optional!) when assembling a skill </mark>

<div class="alert alert-block alert-info">A cell like this indicates a question you need to answer in the Answer.txt file. Please answer the question <b>before</b> continuing through the notebook. You can <b>double click on Answer.txt</b> in the Left Sidebar now to open it in a new tab. As you go through the notebook, navigate between the tabs to answer questions.
</div>

## Table of contents

1. [Introduction](#1.-Introduction)

2. [Get familiar with the data](#2.-Get-familiar-with-the-data)

3. [Further explore the data](#3.-Further-explore-the-data)

4. [Prepare the data](#4.-Prepare-the-data)

   1. [Remove less relevant features](#4A.-Remove-less-relevant-features)
   2. [Convert text-based features](#4B.-Convert-text-based-features)
   3. [Fill in missing data](#4C.-Fill-in-missing-data)
   4. [Derive new features](#4D.-Derive-new-features)
   
   
5. [Train and evaluate a model](#5.-Train-and-evaluate-a-model)

6. [Sources](#Sources)

## 1. Introduction

[[ go back to the top ]](#Table-of-contents)

In this Challenge, we will build a predictive model to answer the question **"Who was more likely to survive on the Titanic?"** using passenger data. These data contain information about each passenger (e.g., name, age, gender, and ticket class). For this exercise, we will use a subset of the full passenger data set (891 of the 2224 passengers and crew on board). Importantly, our data set `titanic.csv` also records whether each passenger survived, i.e. the *label* required for supervised machine learning methods.

To build our model, we will work through a number of steps. First, we will familiarize ourselves with the data set and understand what features we are working with. Next, we will explore our data set a little further with some visualizations to gain some first insights into what features might affect passenger survival. Then, we will prepare and tidy the data for machine learning. Finally, we will select a machine learning algorithm and train our model. To do this, we will split our data set into two: a *training set* and a *test set*. During training, the machine learning algorithm tries to find patterns in the training set that link the features of each passenger to their survival. Once we have trained the model, we will use it to classify passengers from the test set as survived or not based on their features. The test set will therefore allow us to assess the accuracy of our model.

## 2. Get familiar with the data

[[ go back to the top ]](#Table-of-contents)

Before we start exploring, we need to import some libraries that will help us with our calculations, visualizations, and machine learning models. 

*Remember to press ***Shift+Enter*** to run each code cell.*

In [None]:
# An overview of the libraries used in this notebook is provided in the "Sources" section

# Import data analysis libraries
import pandas as pd
import numpy as np
import random as rnd

# Import visualization libraries
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")

# Import machine learning libraries
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

from ipywidgets import widgets

# Note: this cell and some of the cells below produce no visible output;
# the successful execution of a cell is indicated by the number in the brackets [ ] on the left.

<mark>Check that only required modules are imported after ML section is finished </mark> 

Now, let's import our data set `titanic.csv` and take a look at it:

In [None]:
# Load data from file into a new object called "titanic_data"
titanic_data = pd.read_csv('titanic.csv')

# Show first 15 rows of data set
titanic_data.head(15)

We can see that each row in the table corresponds to one passenger. These data are a mixture of <b>categorical</b> and <b>numerical</b> features:

`PassengerId`: Unique ID of the passenger\
`Survived`: Whether the passenger survived (1=Yes, 0=No)\
`Pclass`: Passenger's class (1=1st, 2=2nd, 3=3rd)\
`Name`: Passenger's name\
`Sex`: Passenger's sex (male or female)\
`Age`: Passenger's age in years\
`SibSp`: Number of siblings or spouses aboard the Titanic\
`Parch`: Number of parents or children aboard the Titanic\
`Ticket`: Ticket number\
`Fare`: Fare paid for ticket\
`Cabin`: Cabin number\
`Embarked`: Port where the passenger embarked (C=Cherbourg, Q=Queenstown, S=Southampton)

From this we can already discern some information about passengers. For example, Mr. Owen Harris Braund was a 22-year-old man traveling in 3rd class who did not survive. Note that `NaN` is the abbreviation for "Not a Number". This is how pandas marks missing data.
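
As a minimal sketch of how missing values can be detected, the following uses a small made-up table (the values are hypothetical and not taken from `titanic.csv`):

```python
import numpy as np
import pandas as pd

# Toy table with hypothetical values; np.nan and None both mark missing entries
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0],
    "Cabin": ["C85", None, None],
})

# isna() returns True for every missing entry;
# summing the True values counts the missing entries per column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# NaN is not equal to anything, not even to itself,
# so a comparison with == cannot be used to find it
nan_equals_itself = (np.nan == np.nan)
print(nan_equals_itself)
```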

*Hint: You can also take a look at the data set directly by double-clicking on `titanic.csv` in the Left Sidebar.* 

<div class="alert alert-block alert-info">Pause! Answer <b>Q1 in the Answer.txt file</b>. 
    
Why could missing data ("NaN") be problematic for machine learning models? </div>

Let's now get an overview of some summary statistics about the numerical features in the data set:

In [None]:
# Show summary statistics of the numerical features in the data set
titanic_data.describe()

This summary includes the total `count` of values for each feature, their `mean`, their standard deviation `std`, the minimal and maximal values `min` and `max`, as well as the 25th, 50th, and 75th percentiles. The 50th percentile is the same as the median.

We can already get some useful pieces of information from this table: Out of the 891 passengers in our data set, 38% survived (the survival rate is simply the `mean` of the `Survived` column because this column contains 1 if passengers survived, and 0 if they did not); more than half of the passengers were traveling in the 3rd class, and the majority were traveling alone. We can also see that we only have age data for 714 of 891 passengers in our data set. 
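
To see why the mean of a 0/1 column equals the survival rate, here is a small sketch with five made-up labels (assumed for illustration only):

```python
import pandas as pd

# Hypothetical labels for five passengers: 1 = survived, 0 = died
survived_toy = pd.Series([1, 0, 0, 1, 0])

# Survival rate computed by counting the passengers with label 1
rate_by_counting = (survived_toy == 1).sum() / len(survived_toy)

# The mean of a 0/1 column adds up the ones and divides by the
# number of rows, which is exactly the same calculation
rate_by_mean = survived_toy.mean()

print(rate_by_counting, rate_by_mean)
```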

Let's also look at the non-numerical, or categorical features:

In [None]:
# Show summary statistics of the categorical features in the data set
# (the alias np.object was removed in NumPy 1.24; the built-in object works instead)
titanic_data.describe(include=[object])

In this table, `count` and `unique` are the total and unique number of values for a given feature, `top` is the most common value, and `freq` is the frequency of the most common value.

Here, we can see that a majority of the passengers were male and embarked in Southampton. We also notice that `Cabin` data is missing for most passengers, which means we might not be able to use `Cabin` as a feature in our model.  
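
The four summary values for categorical features can also be computed by hand. The sketch below uses six made-up port labels (hypothetical data) to show what each value means:

```python
import pandas as pd

# Hypothetical port labels for six passengers
ports = pd.Series(["S", "C", "S", "Q", "S", "C"])

# Number of occurrences of each label
counts = ports.value_counts()

n_values = ports.count()      # corresponds to `count`
n_unique = ports.nunique()    # corresponds to `unique`
top_value = counts.idxmax()   # corresponds to `top`
top_freq = counts.max()       # corresponds to `freq`

print(n_values, n_unique, top_value, top_freq)
```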

## 3. Further explore the data

[[ go back to the top ]](#Table-of-contents)

In addition to descriptive statistics, **data visualization** can be a powerful tool. Visualizations can help identify potential issues in the data, and, importantly, improve our understanding of the problem, guiding experimentation. Let's now further explore our data using some visualizations.

Remember that our aim is to predict passenger survival based on the information we have in our data set. For this, understanding the Titanic disaster and specifically what features might affect the outcome of survival is important. If you have seen the movie Titanic, you may remember that women and children were given priority for the lifeboats, and that the three passenger classes were not treated equally. This suggests that `Sex`, `Pclass`, and `Age` may be good predictors of survival. 

Let's first see how gender affects survival:

In [None]:
# Show bar plot of Sex vs survival rate 
sns.barplot(x='Sex', y='Survived', data=titanic_data, ci=None);

Over 70% of the female passengers survived, but only about 20% of the male passengers. Sex is therefore indeed a strong indicator of survival, and a trivial model deriving its predictions from just this one feature would likely already perform quite well! But we have a lot more data on each passenger than just gender, and by considering multiple features, we should be able to detect more complex patterns in the data that will, hopefully, allow us to improve the accuracy of our predictions. 
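
Such a trivial one-feature model could look like the following sketch. It runs on a made-up mini data set (not the real Titanic data), and the rule "women survive, men do not" is an assumption chosen for illustration:

```python
import pandas as pd

# Hypothetical mini data set with six passengers
toy = pd.DataFrame({
    "Sex":      ["female", "male", "male", "female", "male", "female"],
    "Survived": [1,        0,      0,      1,        1,      0],
})

# Baseline rule: predict 1 (survived) for women and 0 (died) for men
predicted = (toy["Sex"] == "female").astype(int)

# Accuracy: fraction of passengers for whom the rule is right
baseline_accuracy = (predicted == toy["Survived"]).mean()
print(baseline_accuracy)
```

Any model that uses more features should beat this kind of baseline to be worth the extra complexity.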

Let's look next at the relationship between `Pclass` and passenger survival:

In [None]:
# Show bar plot of Pclass vs survival rate 
sns.barplot(x='Pclass', y='Survived', data=titanic_data, ci=None);

The trend is clear: Passengers in the 1st class had the highest rate of survival, and passengers in the 3rd class the lowest. We can also look at both features `Sex` and `Pclass` simultaneously to verify these initial observations:  

In [None]:
# Draw a nested barplot to show survival rate for both Sex and Pclass
sns.catplot(x="Sex", y="Survived", hue="Pclass", data=titanic_data, kind="bar", ci=None);

<div class="alert alert-block alert-success">  <b>Bonus Exercise</b>: Choose another feature and create a similar bar graph in the code cell below. Save your graph and describe any trends you observe: Does the selected feature predict survival? </div>

<mark>Turn this into a regular exercise? </mark> 

In [None]:
# your code here (create another bar graph for a factor of your choice)

# Uncomment the next line to save your graph as a png
# plt.savefig('feature_vs_survival.png')

Let's now investigate how `Age` affects survival. As we have seen above, this feature is represented in a continuous numerical column, containing values from 0.42 to 80.0. We will therefore plot two histograms to compare visually the age distributions of those who survived with those who died: 

In [None]:
# Create two separate data sets for Survived and not Survived
survived = titanic_data[titanic_data["Survived"] == 1]
died = titanic_data[titanic_data["Survived"] == 0]

# Draw histograms
survived["Age"].plot.hist(alpha=0.6,color='red',bins=50)
died["Age"].plot.hist(alpha=0.4,color='blue',bins=50)
plt.legend(['Survived', 'Died']);

Here, the relationship is not obvious. The considerable fraction of missing `Age` values (only 714 of 891 values were given in our data) may further complicate the interpretation. From the given data, we can see that in some age ranges more passengers survived - where the red bars are higher than the blue bars - but drawing a clear conclusion is difficult. 

For a data set containing multiple features, visualizing several of them simultaneously quickly reaches its limits. One way to address this problem is to use a *correlation matrix* to show how each feature relates to the others. A correlation matrix contains values ranging from +1 (perfect correlation) to -1 (perfect anti-correlation) and is often displayed as a heat map, in which the strength of a relationship is shown as color:

In [None]:
# Drop PassengerId feature since no meaningful correlation is expected
titanic_data_noId = titanic_data.drop(['PassengerId'], axis=1)

# Compute correlation matrix (numeric columns only; text columns such as Name are skipped)
corrMatrix = titanic_data_noId.corr(numeric_only=True)

# Show heat map
f, ax = plt.subplots(figsize=(10, 8))
sns.heatmap(corrMatrix, annot=True, cmap="coolwarm");

In this heat map, darker colors represent a stronger positive (red: closer to 1) or negative (blue: closer to -1) correlation, and lighter colors represent a weaker correlation (closer to 0). The values on the diagonal are exactly 1, since the diagonal represents the correlation of features with themselves. 

Take a look at this heat map and check if you can confirm some of your expectations. We will come back to the correlation matrix after we have prepared and cleaned the data. 
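
For reference, each entry of such a matrix is a Pearson correlation coefficient (the default method of pandas' `corr`). The sketch below computes it by hand for two made-up columns `x` and `y` (hypothetical values) and compares the result with NumPy's built-in function:

```python
import numpy as np

# Two hypothetical numerical features for five samples
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([9.0, 7.0, 6.0, 2.0, 1.0])

# Pearson correlation: covariance of x and y divided by
# the product of their standard deviations
dx = x - x.mean()
dy = y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# NumPy's built-in function should give the same value
r_numpy = np.corrcoef(x, y)[0, 1]

print(round(r_manual, 3), round(r_numpy, 3))
```

Here `y` falls as `x` rises, so the coefficient is negative and close to -1.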

<div class="alert alert-block alert-info">Pause! Answer <b>QX in the Answer.txt file</b>.

Which are the top 3 feature pairs that are most strongly (positively or negatively) correlated? Try to explain one of these relationships.
    
*Hint: Pay attention to the sign of the correlation!*
</div>

## 4. Prepare the data

[[ go back to the top ]](#Table-of-contents)

Data preparation is an important step in every data analysis pipeline, and often one of the most time-consuming tasks. Decisions made during data preparation benefit from a good understanding of the problem and play a critical role in the performance of the model. 

Raw data typically cannot be used for machine learning without preparation for several reasons. For example, most machine learning algorithms cannot work with missing data, prefer to work with numbers instead of text labels, and do not perform well when the input numerical attributes have very different scales. Therefore, preparing the data can entail different tasks: 

- Identifying and correcting mistakes or errors in the data
- Dealing with missing data
- Identifying features that are most relevant to the task
- Removing features that are irrelevant to the task
- Converting text labels into numbers
- Changing the scale or distribution of features
- Deriving new features from existing features (e.g., by combining features)

However, not all data preparation tasks are always required for all data. Below, we will work through a few selected data preparation tasks that are relevant for our data set and the question we are trying to answer. Note that there are many ways to prepare and try to extract more information from this data set, as the [Kaggle Titanic competition](https://www.kaggle.com/c/titanic) demonstrates.
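
As an illustration of one task from the list, "changing the scale of features", here is a minimal sketch of min-max scaling on made-up values. This step is not applied in this notebook, and the column values are hypothetical:

```python
import pandas as pd

# Hypothetical features on very different scales
toy = pd.DataFrame({
    "Age":  [2.0, 30.0, 58.0],
    "Fare": [7.0, 70.0, 511.0],
})

# Min-max scaling: subtract the column minimum and divide by the
# column range, so every column ends up between 0 and 1
scaled = (toy - toy.min()) / (toy.max() - toy.min())
print(scaled)
```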

#### 4A. Remove less relevant features

We start by removing features which may not contribute much to our machine learning model or are problematic because they contain a lot of missing values or potential errors. Let's take another look at all features and see which contain missing values:

In [None]:
# Show number of missing values per feature 
column_names = titanic_data.columns
for column in column_names:
    print(column + ': ' + str(titanic_data[column].isnull().sum()))

We see that for the `Cabin` feature, the majority of values are missing (687 out of 891). We are also missing 177 `Age` values and 2 `Embarked` values. We decide to fix the `Age` and `Embarked` columns later, but drop the `Cabin` feature. 

We also decide to remove `PassengerId`, `Name`, and `Ticket` since we do not expect that they would contribute much to our model, either because they do not encode meaningful information relevant to the probability of survival of a passenger, or because that information is already reflected in some of the remaining features. 

In [None]:
# Remove features PassengerId, Name, Ticket, Cabin from data set
titanic_data = titanic_data.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1)

# List remaining features
list(titanic_data.columns)

<div class="alert alert-block alert-info">Pause! Answer <b>QX in the Answer.txt file</b>.
    
For each of the features `PassengerId`, `Name`, and `Ticket`, can you think of a reason why they may or may not contribute to our model? 
If you were to delete a further feature from the data set, which one would you delete and why? </div>

<div class="alert alert-block alert-success">  <b>Bonus Question</b>: 

In general, what could be good reasons to confine ourselves to the most relevant features in the data?</div>

#### 4B. Convert text-based features 

Most machine learning algorithms cannot use text labels. Currently, two of our features contain text labels: `Sex` ("male" and "female") and `Embarked` ("C", "Q", "S" encoding the three ports Cherbourg, Queenstown, and Southampton). We replace these labels with numbers as follows:

- `Sex`: male=0, female=1
- `Embarked`: S=0, C=1, Q=2

In [None]:
# Note: run this cell only once; map() turns values it does not know into NaN

# Replace 'Sex' labels with numbers
titanic_data['Sex'] = titanic_data['Sex'].map({'male': 0, 'female': 1})

# Replace 'Embarked' labels with numbers (missing values stay NaN for now)
titanic_data['Embarked'] = titanic_data['Embarked'].map({'S': 0, 'C': 1, 'Q': 2})

# List feature types after conversion
titanic_data.info()

All features are now numeric: integer `int64` or floating point numbers `float64`.

<div class="alert alert-block alert-success">  <b>Bonus Question</b>:    

One issue with replacing the port designations S, C, and Q by 0, 1, and 2 is that machine learning algorithms will assume that *two nearby values are more similar than two distant values*. This may be fine in some cases (e.g. for ordered categories such as "bad", "average", "good", and "excellent"), but it is obviously not the case for the `Embarked` column. Can you come up with an alternative replacement scheme to avoid this problem? </div>

#### 4C. Fill in missing data

We now address the missing data in the `Age` and `Embarked` columns. We have three options to deal with missing values: 

1. Remove all passenger rows which have missing values (NaN) from the data set 
2. Remove the whole feature
3. Fill in the empty value with some value (zero, the mean, the median, etc.) 

To preserve as much data as possible, we choose option 3. For the `Age` column, we replace all missing values with the median age of the passengers; for the `Embarked` column, we replace the missing values with the mode (the most common value):

In [None]:
# Supplement missing `Age` data with median (median() ignores missing values)
titanic_data['Age'] = titanic_data['Age'].fillna(titanic_data['Age'].median())

# Supplement missing `Embarked` data with mode (most common value)
freq_port = titanic_data['Embarked'].mode()[0]
titanic_data['Embarked'] = titanic_data['Embarked'].fillna(freq_port)

<div class="alert alert-block alert-info">Pause! Answer <b>QX in the Answer.txt file</b>.
    
What could be better ways to replace the many missing age values than just using one single value (the median) for all of them?</div>

#### 4D. Derive new features 

Sometimes it can be helpful to derive new features from the original ones. This can mean transforming the scaling or the distribution of a given feature to make it more useful for the machine learning algorithm, or combining two or more existing features to produce a more useful one, a process known as *feature engineering*.

Here, we will only modify the `Age` and `Fare` columns by dividing them into ranges, i.e. grouping their values into a few (numerical) categories. We also plot the histograms of the `Age` column to visualize the change:

In [None]:
# Plot original `Age` distribution
plt.figure(figsize=(10,6))
plt.subplot(2,2,1)
titanic_data['Age'].hist()
plt.title("Original Age distribution")
plt.xlabel('Age')
plt.ylabel('Number of passengers')

# Split `Age` into five age groups: 0-16, 17-32, 33-48, 49-64, >64

titanic_data['AgeBand'] = pd.cut(titanic_data['Age'], 5)
titanic_data[['AgeBand', 'Survived']].groupby(['AgeBand'], as_index=False).mean().sort_values(by='AgeBand', ascending=True)
titanic_data.loc[ titanic_data['Age'] <= 16, 'Age'] = 0
titanic_data.loc[(titanic_data['Age'] > 16) & (titanic_data['Age'] <= 32), 'Age'] = 1
titanic_data.loc[(titanic_data['Age'] > 32) & (titanic_data['Age'] <= 48), 'Age'] = 2
titanic_data.loc[(titanic_data['Age'] > 48) & (titanic_data['Age'] <= 64), 'Age'] = 3
titanic_data.loc[ titanic_data['Age'] > 64, 'Age'] = 4
titanic_data = titanic_data.drop(['AgeBand'], axis=1)

# Show modified Age distribution
plt.subplot(2,2,2)
titanic_data['Age'].hist()
plt.title("Modified Age distribution")
plt.xlabel('Age group')
plt.ylabel('Number of passengers');

We proceed similarly for the `Fare` column, using the 25th, 50th, and 75th percentiles to define the fare "bands":

In [None]:
# Split `Fare` into four fare groups using the 25%, 50%, and 75% percentiles

titanic_data['Fare'] = titanic_data['Fare'].fillna(titanic_data['Fare'].median())
titanic_data['FareBand'] = pd.qcut(titanic_data['Fare'], 4)
titanic_data[['FareBand', 'Survived']].groupby(['FareBand'], as_index=False).mean().sort_values(by='FareBand', ascending=True)
titanic_data.loc[ titanic_data['Fare'] <= 7.91, 'Fare'] = 0
titanic_data.loc[(titanic_data['Fare'] > 7.91) & (titanic_data['Fare'] <= 14.45), 'Fare'] = 1
titanic_data.loc[(titanic_data['Fare'] > 14.45) & (titanic_data['Fare'] <= 31), 'Fare'] = 2
titanic_data.loc[ titanic_data['Fare'] > 31, 'Fare'] = 3
titanic_data['Fare'] = titanic_data['Fare'].astype(int)
titanic_data = titanic_data.drop(['FareBand'], axis=1)
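
The two cells above rely on two different binning functions: `pd.cut` (equal-width bins, used for `Age`) and `pd.qcut` (quantile-based bins, used for `Fare`). The following sketch shows the difference on eight made-up fares (hypothetical values):

```python
import pandas as pd

# Hypothetical fares with one very large value
fares = pd.Series([5, 7, 8, 10, 12, 15, 30, 200])

# pd.cut: four bins of equal width between the minimum and the maximum;
# labels=False returns the bin number (0 to 3) instead of the interval
equal_width = pd.cut(fares, 4, labels=False)

# pd.qcut: four bins that each hold about the same number of values
equal_count = pd.qcut(fares, 4, labels=False)

print(equal_width.tolist())
print(equal_count.tolist())
```

With equal-width bins, the single large fare pushes almost all other values into the first bin, whereas quantile-based bins stay evenly filled.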

<div class="alert alert-block alert-info">Pause! Answer <b>Q4 in the Answer.txt file</b>.
    
Why do you think it is reasonable to split the `Age` and `Fare` features into groups?</div>

Now that we've cleaned the data, let's save it as a separate file `titanic-clean.csv` and work directly with that file from now on:

In [None]:
# Write cleaned data set to new csv file
titanic_data.to_csv('titanic-clean.csv', index=False)

# Load cleaned data into new DataFrame object
titanic_data_clean = pd.read_csv('titanic-clean.csv')

# Show first 15 rows of cleaned data set
titanic_data_clean.head(15)

Let's now take another brief look at the correlation matrix of the prepared data set:

In [None]:
# Compute correlation matrix
corrMatrix = titanic_data_clean.corr()

# Show heat map
f, ax = plt.subplots(figsize=(10, 8))
sns.heatmap(corrMatrix, annot=True, cmap="coolwarm");

We now have an overview of the features that we will use for modeling. We confirm the trends we noticed earlier: sex correlates with survival (women were more likely to have survived than men), and the passengers in first and second class were more likely to survive than those in third class. We can also see a negative correlation between `Fare` and `Pclass` (-0.63), and a positive correlation between `Fare` and `Survived` (0.3), which makes sense as first class tickets were the most expensive, and first class passengers were more likely to survive.

We are now finally ready to train our machine learning model. 

## <mark>WIP below</mark> 

## 5. Train and evaluate a model

[[ go back to the top ]](#Table-of-contents)


The goal of a machine learning model is to make accurate predictions on new, previously unseen data. Because our data set contains the very information we want to predict, we need to divide it into two subsets:

- A <b>training</b> subset to train the model, in which the model sees the information we are trying to predict 
- A <b>test</b> subset to test the model, in which that information is withheld from the model and only used afterwards to check its predictions

In [None]:
# Separate the cleaned data into training set (70%) and test set (30%)
train_df, test_df = train_test_split(titanic_data_clean, test_size=0.3)
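
To make the effect of `test_size` concrete, here is a small sketch that splits ten made-up samples (the numbers 0 to 9, assumed for illustration) in the same way:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten hypothetical samples, numbered 0 to 9
samples = np.arange(10)

# Hold out 30% of the samples for testing; the rest is used for training
train_part, test_part = train_test_split(samples, test_size=0.3)

print(len(train_part), len(test_part))
print(sorted(train_part), sorted(test_part))
```

Every sample ends up in exactly one of the two subsets, so the model is never tested on data it was trained on.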

Next, we need to separate survival, the outcome, from the rest of the features in the data set.

In [None]:
# Divide each data set (training and test) into two parts: X & Y

X_train = train_df.drop("Survived", axis=1)
Y_train = train_df["Survived"]
X_test = test_df.drop("Survived", axis=1)
Y_test = test_df["Survived"]

Now we can train a model. 
There are numerous predictive modeling algorithms, but not all apply to our problem. Ours is a classification problem: we want to predict a category (survived or not) from the other features (e.g., sex, age, class). It is also a supervised learning problem, because we train our model on data for which the correct outcome is known. Given this, we will take a closer look at the following algorithms:

### Logistic regression

Logistic regression is a useful model to run early in the workflow. Logistic regression measures the relationship between the categorical dependent variable (in our case, Survival) and one or more independent variables (features) by estimating probabilities using a logistic function, which is the cumulative logistic distribution. 

Source: https://en.wikipedia.org/wiki/Logistic_regression

In [None]:
# Logistic regression
logreg = LogisticRegression()
logreg.fit(X_train, Y_train)
Y_pred = logreg.predict(X_test)
acc_log = round(logreg.score(X_test, Y_test) * 100, 2)
acc_log

We can use logistic regression to check our assumptions by inspecting the coefficient of each feature in the decision function.\
Positive coefficients increase the log-odds of the response (and thus increase the probability), and negative coefficients decrease the log-odds of the response (and thus decrease the probability).
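
The link between log-odds and probability can be sketched as follows. The model with one feature, the intercept, and the coefficient are all made-up values chosen for illustration, not results from this notebook:

```python
import numpy as np

def logistic(z):
    """Map a log-odds value z to a probability between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical model with a single feature x:
# log-odds = intercept + coefficient * x
intercept = -1.0
coefficient = 2.0   # positive, so a larger x gives a higher probability

p_x0 = logistic(intercept + coefficient * 0)   # probability for x = 0
p_x1 = logistic(intercept + coefficient * 1)   # probability for x = 1

print(round(p_x0, 3), round(p_x1, 3))
```

A negative coefficient would have the opposite effect: the probability would fall as x increases.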

In [None]:
# Collect the fitted coefficient of each feature in a table
coeff_df = pd.DataFrame({'Feature': X_train.columns, 'Coefficient': logreg.coef_[0]})

# Sort from the most positive to the most negative coefficient
coeff_df.sort_values(by='Coefficient', ascending=False)

<b>Sex has the highest positive coefficient</b>, implying that as the Sex value increases (male = 0 to female = 1), the probability of Survived = 1 increases the most.\
<b>Pclass has the most negative coefficient</b>, implying that as the class number increases (1st to 3rd), the probability of Survived = 1 decreases the most.

### Decision tree classifier

The decision tree classifier maps features (tree branches) to conclusions about the target value (tree leaves, in our case, Survival). Tree models where the target variable can take a finite set of values are called classification trees; in these tree structures, leaves represent class labels and branches represent conjunctions of features that lead to those class labels. Decision trees where the target variable can take continuous values (typically real numbers) are called regression trees. 

Source: https://en.wikipedia.org/wiki/Decision_tree

In [None]:
# Decision tree
decision_tree = DecisionTreeClassifier()
decision_tree.fit(X_train, Y_train)
Y_pred = decision_tree.predict(X_test)
acc_decision_tree = round(decision_tree.score(X_test, Y_test) * 100, 2)
acc_decision_tree
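
To see what a fitted tree looks like, the following sketch trains a tree on six made-up passengers and prints its rules as text. The data are hypothetical and constructed so that survival depends on `Sex` only:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical passengers: [Sex (0 = male, 1 = female), Pclass]
X_toy = [[0, 3], [0, 1], [1, 3], [1, 1], [0, 2], [1, 2]]
y_toy = [0, 0, 1, 1, 0, 1]   # made-up labels: 1 = survived, 0 = died

toy_tree = DecisionTreeClassifier()
toy_tree.fit(X_toy, y_toy)

# Print the learned decision rules as indented text
print(export_text(toy_tree, feature_names=["Sex", "Pclass"]))

# Predictions for a 3rd class woman and a 1st class man
toy_predictions = toy_tree.predict([[1, 3], [0, 1]])
print(toy_predictions)
```

Because a single question about `Sex` separates the two classes perfectly in this toy data, the tree needs only one split.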

In [None]:
# Random forest (an ensemble of 100 decision trees)
random_forest = RandomForestClassifier(n_estimators=100)
random_forest.fit(X_train, Y_train)
Y_pred = random_forest.predict(X_test)
acc_random_forest = round(random_forest.score(X_test, Y_test) * 100, 2)
acc_random_forest

### K-nearest neighbors classifier (K-NN)

The K-NN classifier is a non-parametric method used for classification and regression. A sample is classified by a majority vote of its neighbors, with the sample being assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small). If k = 1, then the object is simply assigned to the class of that single nearest neighbor. 

Source: https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm
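
The majority vote can be sketched in a few lines of NumPy. The helper `knn_predict` and the toy points below are hypothetical and only illustrate the idea; they are not how scikit-learn implements the classifier internally:

```python
from collections import Counter

import numpy as np

def knn_predict(points, labels, query, k=3):
    """Classify query by majority vote among its k nearest points."""
    # Euclidean distance from the query to every known point
    distances = np.linalg.norm(points - query, axis=1)
    # Indices of the k points with the smallest distance
    nearest = np.argsort(distances)[:k]
    # Most common label among these neighbors
    votes = Counter(labels[nearest].tolist())
    return votes.most_common(1)[0][0]

# Hypothetical points in two groups with labels 0 and 1
points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [5.0, 6.0]])
labels = np.array([0, 0, 0, 1, 1])

label_a = knn_predict(points, labels, np.array([0.5, 0.5]))
label_b = knn_predict(points, labels, np.array([5.0, 5.5]))
print(label_a, label_b)
```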

In [None]:
# K-nearest neighbors with k = 3
knn = KNeighborsClassifier(n_neighbors = 3)
knn.fit(X_train, Y_train)
Y_pred = knn.predict(X_test)
acc_knn = round(knn.score(X_test, Y_test) * 100, 2)
acc_knn

Now, we will see how well the chosen model predicts our data.\
The function `score` takes the features of the test data set (`X_test`), uses the model to predict the corresponding survival status, and compares the predictions with the correct values (`Y_test`). The output value `acc_log` is the accuracy in percent, i.e. the percentage of test passengers whose survival status the model predicted correctly.

In [None]:
# Validate model and calculate accuracy (here: the logistic regression model)

acc_log = round(logreg.score(X_test, Y_test) * 100, 2)

print("The accuracy of the model with respect to the test data is:")
print(acc_log)
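
Accuracy itself is simple to compute by hand. The sketch below uses eight made-up labels and predictions (hypothetical values) and cross-checks the result with scikit-learn's `accuracy_score`:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical true labels and model predictions for eight passengers
y_true_toy = np.array([1, 0, 0, 1, 0, 1, 0, 0])
y_pred_toy = np.array([1, 0, 1, 1, 0, 0, 0, 0])

# Accuracy: fraction of predictions that match the true labels
accuracy_manual = (y_true_toy == y_pred_toy).mean()

# scikit-learn's helper should give the same value
accuracy_sklearn = accuracy_score(y_true_toy, y_pred_toy)

# Expressed in percent, as in the cells above
accuracy_percent = round(accuracy_manual * 100, 2)
print(accuracy_percent)
```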

<div class="alert alert-block alert-info">Pause! Answer <b>Q6 on the U4I platform (Bonus Question)</b>.
    
Run any of the machine learning algorithms several times in a row (without changing the code). Why do you get a different accuracy each time? More here?</div>

### Congratulations! You have completed the Titanic Challenge!

## Sources

[[ go back to the top ]](#Table-of-contents)

- https://www.kaggle.com/c/titanic
- https://www.kaggle.com/startupsci/titanic-data-science-solutions

Sources for pictures:
- Titanic.jpg: https://upload.wikimedia.org/wikipedia/commons/9/92/Titanic.jpg

### Python packages used

This notebook uses several standard Python packages. These are:

* **pandas** is a powerful data analysis package, providing the "DataFrame" structure to store data in memory and work with it easily and efficiently. DataFrame is a 2-dimensional labeled data structure with columns of potentially different types; you can think of it like a spreadsheet.
* **numpy** is an essential package for scientific computing with Python, providing a fast numerical array structure and helper functions.
* **random** generates pseudo-random numbers for various distributions.
* **matplotlib** is the basic plotting library in Python; most other Python plotting libraries are built on top of it.
* **seaborn** is an advanced statistical plotting library.
* **sklearn** (**scikit-learn**) is an essential Machine Learning package in Python.

<mark>To be included</mark>

- ipywidgets


## <mark>===Internal notes section, to be removed in final Challenge===</mark>

### <mark> Check ALL code! </mark>

### <mark> User interactivity </mark>

- Ask user to write some/any code, e.g. "data = pd.read_csv('<file name here>')"? Trivial, but might add to the feeling that you're doing sth. yourself
- ("GUI": Answering questions: Text fields?)


### <mark> Approaches to improve the accuracy of the model </mark>

Learn more about the data > improve understanding > guide experimentation:

- Feature engineering: Design / create some new features
- try different types of preprocessing (different methods to fill in missing values)
- try different types of ML models, then combine / ensemble them
- Learn from others' code



## <mark> To Dos </mark>
    
- Prepare Answer.txt
- ...
- 
    
    
## <mark> Idea collection for Advanced Version </mark>
    
- Use Kaggle's test data set to generate realistic Kaggle scores
- Use cross-correlation
- Use one-hot encoding
- ...
