# Titanic: Machine Learning from Disaster

##### Based on the famous [kaggle Titanic competition](https://www.kaggle.com/c/titanic).

#### Goal: Work through a simple machine learning example from start to finish along a typical data analysis pipeline. 

The sinking of the Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the widely considered "unsinkable" RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren't enough lifeboats for everyone on board, resulting in the death of 1514 of the 2224 passengers and crew. While there was some element of luck involved in surviving, it seems some were more likely to survive than others. We will use machine learning to create a model that predicts which passengers survived the Titanic shipwreck. 

<div align="center">
<img width="1200" title="Titanic at Southampton docks, prior to departure" src="images/titanic.jpg"/>
</div>


**Please note**: This exercise is based on a Jupyter notebook, an interactive environment for writing and running code, and is running in Python. To get familiar with working in Jupyter notebooks, see our short "JupyterLab Tutorial". For a basic introduction to programming in Python, see the "Introduction to Python" notebook.

<div class="alert alert-block alert-info">A cell like this indicates a question you need to answer in the Answer.txt file. Please answer the question <b>before</b> continuing through the notebook. You can <b>double click on Answer.txt</b> in the Left Sidebar now to open it in a new tab. As you go through the notebook, navigate between the tabs to answer questions.
</div>

## Table of contents

1. [Introduction](#1.-Introduction)

2. [Get familiar with the data](#2.-Get-familiar-with-the-data)

3. [Further explore the data](#3.-Further-explore-the-data)

4. [Prepare the data](#4.-Prepare-the-data)

   1. [Remove less relevant features](#4A.-Remove-less-relevant-features)
   2. [Convert text-based features](#4B.-Convert-text-based-features)
   3. [Fill in missing data](#4C.-Fill-in-missing-data)
   4. [Derive new features](#4D.-Derive-new-features)
   
   
5. [Train and evaluate a model](#5.-Train-and-evaluate-a-model)

6. [Sources](#Sources)

## 1. Introduction

[[ go back to the top ]](#Table-of-contents)

In this Challenge, we will build a predictive model to answer the question **"Who was more likely to survive on the Titanic?"** using passenger data. These data contain information about each passenger (e.g., the passenger's name, age, gender, ticket class, etc.). For this exercise, we will use a subset of the full passenger data set (891 of the 2224 passengers and crew on board). Importantly, our data set `titanic.csv` also records whether each passenger survived, i.e. the *label* required for supervised machine learning methods.

To build our model we will work through a number of steps. First, we will familiarize ourselves with the data set and understand what features we are working with. Next, we will explore our data set a little further with some visualizations to gain first insights into what features might affect passenger survival. Then, we will prepare and tidy the data for machine learning. Finally, we will select a machine learning algorithm, then train and assess our model.

## 2. Get familiar with the data

[[ go back to the top ]](#Table-of-contents)

Before we start exploring, we need to import some libraries that will help us with our calculations and visualizations. 

*Remember to press ***Shift+Enter*** to run each code cell.*

In [None]:
# An overview of the libraries used in this notebook is provided in the "Sources" section

# Import data analysis libraries
import pandas as pd
import numpy as np
import random as rnd

# Import visualization libraries
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")

# Note: this cell and some of the cells below produce no visible output;
# the successful execution of a code cell is indicated by the number in the brackets [ ] on the left.

Now, let's import our data set `titanic.csv` and take a look at it:

In [None]:
# Load data from file into a new object called "titanic_data"
titanic_data = pd.read_csv('titanic.csv')

# Show first 15 rows of data set
titanic_data.head(15)

We can see that each row in the table corresponds to one passenger. These data are a mixture of <b>categorical</b> and <b>numerical</b> features:

`PassengerId`: Unique ID of the passenger\
`Survived`: Whether the passenger survived (1=Yes, 0=No)\
`Pclass`: Passenger's class (1=1st, 2=2nd, 3=3rd)\
`Name`: Passenger's name\
`Sex`: Passenger's sex (male or female)\
`Age`: Passenger's age in years\
`SibSp`: Number of siblings or spouses aboard the Titanic\
`Parch`: Number of parents or children aboard the Titanic\
`Ticket`: Ticket number\
`Fare`: Fare paid for ticket\
`Cabin`: Cabin number\
`Embarked`: Port where the passenger embarked (C=Cherbourg, Q=Queenstown, S=Southampton)

From this we can already discern some information about passengers. For example, Mr. Owen Harris Braund was a 22-year-old man traveling in 3rd class who did not survive. Note that `NaN` is the abbreviation for "Not a Number". This is how pandas marks missing data.
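As a small illustration of how missing values behave, the sketch below builds a toy `Series` (the values are made up, not taken from `titanic.csv`) and shows that `NaN` propagates through ordinary arithmetic, while pandas aggregations such as `mean()` skip it by default.

```python
import numpy as np
import pandas as pd

# Toy ages with one missing entry (illustrative values only)
ages = pd.Series([22.0, 38.0, np.nan, 35.0])

# isna() flags missing entries; sum() counts them
n_missing = ages.isna().sum()

# Arithmetic with NaN yields NaN ...
nan_sum = np.nan + 1

# ... but pandas aggregations skip NaN by default (skipna=True)
mean_age = ages.mean()

print(n_missing, nan_sum, mean_age)
```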

*Hint: You can also take a look at the data set directly by double-clicking on `titanic.csv` in the Left Sidebar.* 

<div class="alert alert-block alert-info">Pause! Answer <b>Question 1</b> in the Answer.txt file. 
    
Why could missing data ("NaN") be problematic for machine learning models? </div>

Let's now get an overview of some summary statistics about the numerical features in the data set:

In [None]:
# Show summary statistics of the numerical features in the data set
titanic_data.describe()

This summary includes the total `count` of values for each feature, their `mean`, their standard deviation `std`, the minimum and maximum values `min` and `max`, as well as the 25th, 50th, and 75th percentiles. The 50th percentile is the same as the median.

We can already get some useful pieces of information from this table: Out of the 891 passengers in our data set, 38% survived (the survival rate is simply the `mean` of the `Survived` column because this column contains 1 if passengers survived, and 0 if they did not); more than half of the passengers were traveling in the 3rd class, and the majority were traveling alone. We can also see that we only have age data for 714 of 891 passengers in our data set. 
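To make the point about the `mean` of a 0/1 column concrete, here is a minimal sketch on a made-up outcome column (not the actual `Survived` data): the mean and the explicit fraction of ones agree.

```python
import pandas as pd

# Toy 0/1 outcome column (made-up values): 1 = survived, 0 = did not survive
survived = pd.Series([1, 0, 0, 1, 0])

# The mean of a 0/1 column equals the fraction of ones
rate_from_mean = survived.mean()
rate_from_count = (survived == 1).sum() / len(survived)

print(rate_from_mean, rate_from_count)
```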

Let's also look at the non-numerical, or categorical features:

In [None]:
# Show summary statistics of the categorical features in the data set
# (the `np.object` alias was removed in NumPy 1.24; the built-in `object` works across versions)
titanic_data.describe(include=[object])

In this table, `count` and `unique` are the total and unique number of values for a given feature, `top` is the most common value, and `freq` is the frequency of the most common value.

Here, we can see that a majority of the passengers were male and embarked in Southampton. We also notice that `Cabin` data is missing for most passengers, which means we might not be able to use `Cabin` as a feature in our model.  

## 3. Further explore the data

[[ go back to the top ]](#Table-of-contents)

In addition to descriptive statistics, **data visualization** can be a powerful tool. Visualizations can help identify potential issues in the data, and, importantly, improve our understanding of the problem, guiding experimentation. Let's now further explore our data using some visualizations.

Remember that our aim is to predict passenger survival based on the information we have in our data set. For this, understanding the Titanic disaster and specifically what features might affect the outcome of survival is important. If you have watched the movie Titanic, you may remember that women and children were given priority for the lifeboats, and that the three passenger classes were not treated equally. This suggests that `Sex`, `Pclass`, and `Age` may be good predictors of survival. 

Let's first see how gender affects survival:

In [None]:
# Show bar plot of Sex vs survival rate 
sns.barplot(x='Sex',y='Survived',data=titanic_data,ci=None);

Over 70% of the female passengers survived, but only about 20% of the male passengers. Sex is therefore indeed a strong indicator of survival, and a trivial model deriving its predictions from just this one feature would likely already perform quite well! But we have a lot more data on each passenger than just gender, and by considering multiple features, we should be able to detect more complex patterns in the data that will, hopefully, allow us to improve the accuracy of our predictions. 

Let's look next at the relationship between `Pclass` and passenger survival:

In [None]:
# Show bar plot of Pclass vs survival rate 
sns.barplot(x='Pclass',y='Survived',data=titanic_data,ci=None);

The trend is clear: Passengers in the 1st class had the highest, passengers in the 3rd class the lowest rate of survival. We can also look at both features `Sex` and `Pclass` simultaneously to verify these initial observations:  

In [None]:
# Draw a nested barplot to show survival rate for both Sex and Pclass
sns.catplot(x="Sex", y="Survived", hue="Pclass", data=titanic_data, kind="bar", ci=None);

<div class="alert alert-block alert-success">  <b>Bonus Question 1</b>: 
    
Choose another feature and create a similar bar graph in the code cell below. Save your graph and describe any trends you observe: Does the selected feature predict survival? </div>

In [None]:
# your code here (create another bar graph for a factor of your choice)

# Uncomment the next line to save your graph as a png
# plt.savefig('feature_vs_survival.png')

Let's now investigate how `Age` affects survival. As we have seen above, this feature is represented in a continuous numerical column, containing values from 0.42 to 80.0. We will therefore plot two histograms to compare visually the age distributions of those who survived with those who died: 

In [None]:
# Create two separate data sets for Survived and not Survived
survived = titanic_data[titanic_data["Survived"] == 1]
died = titanic_data[titanic_data["Survived"] == 0]

# Draw histograms
survived["Age"].plot.hist(alpha=0.6,color='red',bins=50)
died["Age"].plot.hist(alpha=0.4,color='blue',bins=50)
plt.legend(['Survived','Died']);

Here, the relationship is not obvious. The considerable fraction of missing `Age` values (only 714 of 891 values were given in our data) may further complicate the interpretation. From the given data, we can see that in some age ranges more passengers survived - where the red bars are higher than the blue bars - but drawing a clear conclusion is difficult. 

For a data set containing multiple features, visualizing several of them simultaneously quickly reaches its limits. One way to address this problem is to use a *correlation matrix* to show how each feature relates to the others. A correlation matrix contains values ranging from +1 (perfect correlation) to -1 (perfect anti-correlation) and is often displayed as a heat map, in which the strength of a relationship is shown as color:

In [None]:
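# Drop PassengerId feature since no meaningful correlation is expected
titanic_data_noId = titanic_data.drop(['PassengerId'], axis=1)

# Compute correlation matrix of the numerical features
# (text columns such as Name or Sex are excluded explicitly; recent pandas versions raise an error otherwise)
corrMatrix = titanic_data_noId.select_dtypes(include='number').corr()

# Show heat map
f, ax = plt.subplots(figsize=(10, 8))
sns.heatmap(corrMatrix, annot=True, cmap="coolwarm");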

In this heat map, darker colors represent a stronger positive (red: closer to 1) or negative (blue: closer to -1) correlation, and lighter colors represent a weaker correlation (closer to 0). The values on the diagonal are exactly 1, since the diagonal represents the correlation of features with themselves. 
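To get a feeling for the extreme values of a correlation matrix, the following sketch uses a made-up toy table (the column names are purely illustrative) in which one column rises with `x`, one falls with `x`, and one is constructed to be uncorrelated with `x`.

```python
import pandas as pd

# Toy table: every column is made up for illustration
toy = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "double_x": [2, 4, 6, 8, 10],       # rises with x   -> correlation +1
    "minus_x": [-1, -2, -3, -4, -5],    # falls with x   -> correlation -1
    "unrelated": [2, 1, 3, 1, 2],       # no linear trend -> correlation 0
})

# Pearson correlation between every pair of columns
corr = toy.corr()
print(corr.round(2))
```

Note that the correlation coefficient only captures *linear* relationships; two features can depend on each other strongly and still show a correlation near 0.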

Take a look at this heat map and check if you can confirm some of your expectations. We will come back to the correlation matrix after we have prepared and cleaned the data. 

<div class="alert alert-block alert-info">Pause! Answer <b>Question 2</b> in the Answer.txt file.

Which are the top 3 feature pairs that are most strongly (positively or negatively) correlated? Try to explain one of these relationships.
    
*Hint: Pay attention to the sign of the correlation!*
</div>

## 4. Prepare the data

[[ go back to the top ]](#Table-of-contents)

Data preparation is an important step in every data analysis pipeline, and often one of the most time-consuming tasks. Decisions taken during data preparation benefit from a good understanding of the problem and play a critical role in the performance of the model. 

Raw data typically cannot be used for machine learning without preparation, for several reasons. For example, most machine learning algorithms cannot work with missing data, prefer to work with numbers instead of text labels, and do not perform well when the numerical input attributes have very different scales. Therefore, preparing the data can entail different tasks: 

- Identifying and correcting mistakes or errors in the data
- Dealing with missing data
- Identifying features that are most relevant to the task
- Removing features that are irrelevant to the task
- Converting text labels into numbers
- Changing the scale or distribution of features
- Deriving new features from existing features (e.g., by combining features)

However, not all data preparation tasks are always required for all data. Below, we will work through a few selected data preparation tasks that are relevant for our data set and the question we are trying to answer. Note that there are many ways to prepare and try to extract more information from this data set, as the [kaggle Titanic competition](https://www.kaggle.com/c/titanic) demonstrates.

#### 4A. Remove less relevant features

We start by removing features that may not contribute much to our machine learning model or that are problematic because they contain many missing values or potential errors. Let's look again at all features and see which ones contain missing values:

In [None]:
# Show number of missing values per feature 
column_names = titanic_data.columns
for column in column_names:
    print(column + ': ' + str(titanic_data[column].isnull().sum()))

We see that for the `Cabin` feature the majority of values are missing (687 out of 891). We are also missing 177 `Age` values and 2 `Embarked` values. We will fix the `Age` and `Embarked` columns later but drop the `Cabin` feature. 

We also remove `PassengerId`, `Name`, and `Ticket` since we do not expect them to contribute much to our model, either because they do not encode meaningful information relevant to a passenger's probability of survival, or because that information is already reflected in some of the remaining features. 

In [None]:
# Remove features PassengerId, Name, Ticket, Cabin from data set
titanic_data = titanic_data.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1)

# List remaining features
list(titanic_data.columns)

<div class="alert alert-block alert-info">Pause! Answer <b>Question 3</b> in the Answer.txt file.

For each of the features `PassengerId`, `Name`, and `Ticket`, can you think of a reason why they may or may not contribute to our model? 
If you were to delete a further feature from the data set, which one would you delete and why? </div>

<div class="alert alert-block alert-success">  <b>Bonus Question 2</b>: 

In general, what could be good reasons to confine ourselves to the most relevant features in the data?</div>

#### 4B. Convert text-based features 

Most machine learning algorithms cannot use text labels. Currently, two of our features contain text labels: `Sex` ("male" and "female") and `Embarked` ("C", "Q", "S" encoding the three ports Cherbourg, Queenstown, and Southampton). We replace these labels with numbers as follows:

- `Sex`: male=0, female=1
- `Embarked`: S=0, C=1, Q=2

In [None]:
# Replace 'Sex' labels with numbers
# (assigning the result back avoids in-place edits on a column selection, which recent pandas versions warn about)
titanic_data['Sex'] = titanic_data['Sex'].map({'male': 0, 'female': 1})

# Replace 'Embarked' labels with numbers (missing values stay NaN for now)
titanic_data['Embarked'] = titanic_data['Embarked'].map({'S': 0, 'C': 1, 'Q': 2})

# List feature types after conversion
titanic_data.info()

All features are now numeric: integer `int64` or floating point numbers `float64`.

<div class="alert alert-block alert-success">  <b>Bonus Question 3</b>:    

One issue with replacing the port designations S, C, and Q by 0, 1, and 2 is that machine learning algorithms will assume that *two nearby values are more similar than two distant values*. This may be fine in some cases (e.g. for ordered categories such as "bad", "average", "good", and "excellent"), but it is obviously not the case for the `Embarked` column. Can you come up with an alternative replacement scheme to avoid this problem? </div>

#### 4C. Fill in missing data

We now address the missing data in the `Age` and `Embarked` columns. We have three options to deal with missing values: 

1. Remove all passenger rows which have missing values (NaN) from the data set 
2. Remove the whole feature
3. Fill in the empty value with some value (zero, the mean, the median, etc.) 

To preserve as much data as possible, we choose option 3. For the `Age` column, we replace all missing values with the median age of the passengers; for the `Embarked` column, we replace the missing values with the mode (the most common value):

In [None]:
# Supplement missing `Age` data with median (median() ignores NaN by default)
titanic_data['Age'] = titanic_data['Age'].fillna(titanic_data['Age'].median())

# Supplement missing `Embarked` data with mode (most common value; mode() ignores NaN by default)
freq_port = titanic_data['Embarked'].mode()[0]
titanic_data['Embarked'] = titanic_data['Embarked'].fillna(freq_port)
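The three options for handling missing values listed above can be compared side by side on a toy table. This is only a sketch with made-up values; the names `rows_dropped`, `cols_dropped`, and `filled` are chosen here for illustration.

```python
import numpy as np
import pandas as pd

# Toy table with one missing Age value (illustrative numbers only)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, 54.0],
    "Fare": [7.25, 71.28, 8.05, 51.86],
})

# Option 1: drop every row that contains a missing value
rows_dropped = toy.dropna(axis=0)

# Option 2: drop every column that contains a missing value
cols_dropped = toy.dropna(axis=1)

# Option 3: fill the gap, here with the median of the column
filled = toy.fillna({"Age": toy["Age"].median()})

print(rows_dropped.shape, cols_dropped.shape, filled.shape)
```

Options 1 and 2 shrink the table (one passenger or one whole feature is lost), whereas option 3 keeps its shape at the cost of inserting a value that was never observed.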

<div class="alert alert-block alert-info">Pause! Answer <b>Question 4</b> in the Answer.txt file.
    
What could be better ways to replace the many missing age values than just using one single value (the median) for all of them?</div>

#### 4D. Derive new features 

Sometimes it can be helpful to derive new features from the original ones. This can mean transforming the scale or the distribution of a given feature to make it more useful for the machine learning algorithm, or combining two or more existing features to produce a more useful one--a process known as *feature engineering*.

Here, we will only modify the `Age` and `Fare` columns by dividing them into ranges, i.e. grouping their values into a few (numerical) categories. We also plot histograms of the `Age` column to visualize the change:

In [None]:
# Plot original `Age` distribution
plt.figure(figsize=(10,6))
plt.subplot(2,2,1)
titanic_data['Age'].hist()
plt.title("Original Age distribution")
plt.xlabel('Age')
plt.ylabel('Number of passengers');

# Split `Age` into five age groups: <=16, 16-32, 32-48, 48-64, >64

# pd.cut with 5 equal-width bins yields band edges close to 16, 32, 48, and 64, which are used as thresholds below
titanic_data['AgeBand'] = pd.cut(titanic_data['Age'], 5)
titanic_data[['AgeBand', 'Survived']].groupby(['AgeBand'], as_index=False).mean().sort_values(by='AgeBand', ascending=True)
titanic_data.loc[ titanic_data['Age'] <= 16, 'Age'] = 0
titanic_data.loc[(titanic_data['Age'] > 16) & (titanic_data['Age'] <= 32), 'Age'] = 1
titanic_data.loc[(titanic_data['Age'] > 32) & (titanic_data['Age'] <= 48), 'Age'] = 2
titanic_data.loc[(titanic_data['Age'] > 48) & (titanic_data['Age'] <= 64), 'Age'] = 3
titanic_data.loc[ titanic_data['Age'] > 64, 'Age'] = 4
titanic_data = titanic_data.drop(['AgeBand'], axis=1)

# Show modified `Age` distribution
plt.subplot(2,2,2)
titanic_data['Age'].hist()
plt.title("Modified Age distribution")
plt.xlabel('Age group')
plt.ylabel('Number of passengers');

We proceed similarly for the `Fare` column, using the 25%, 50%, and 75% percentiles to define the fare "bands":

In [None]:
# Split `Fare` into four fare groups using the 25%, 50%, and 75% percentiles

titanic_data['Fare'] = titanic_data['Fare'].fillna(titanic_data['Fare'].median())
titanic_data['FareBand'] = pd.qcut(titanic_data['Fare'], 4)
titanic_data[['FareBand', 'Survived']].groupby(['FareBand'], as_index=False).mean().sort_values(by='FareBand', ascending=True)
titanic_data.loc[ titanic_data['Fare'] <= 7.91, 'Fare'] = 0
titanic_data.loc[(titanic_data['Fare'] > 7.91) & (titanic_data['Fare'] <= 14.45), 'Fare'] = 1
titanic_data.loc[(titanic_data['Fare'] > 14.45) & (titanic_data['Fare'] <= 31), 'Fare'] = 2
titanic_data.loc[ titanic_data['Fare'] > 31, 'Fare'] = 3
titanic_data['Fare'] = titanic_data['Fare'].astype(int)
titanic_data = titanic_data.drop(['FareBand'], axis=1)

# Check unique values of modified `Fare` 
np.unique(titanic_data['Fare'])
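As an aside, the chains of `.loc` assignments above can also be written more compactly. The sketch below is one possible alternative on made-up values, not the approach used in this notebook: `pd.cut` with explicit bin edges reproduces the fixed age thresholds, and `pd.qcut` with `labels=False` returns quartile codes directly. Note that `pd.qcut` derives its boundaries from the data it is given, so on the real `Fare` column they may differ slightly from the rounded thresholds used above.

```python
import pandas as pd

# Made-up ages, including values that sit exactly on a threshold
ages = pd.Series([0.42, 16.0, 16.5, 32.0, 40.0, 64.0, 80.0])

# Right-inclusive bins: (-inf, 16], (16, 32], (32, 48], (48, 64], (64, inf)
bin_edges = [-float("inf"), 16, 32, 48, 64, float("inf")]
age_group = pd.cut(ages, bins=bin_edges, labels=[0, 1, 2, 3, 4]).astype(int)

# Made-up fares, split into four equally populated groups (quartiles)
fares = pd.Series([5.0, 8.0, 10.0, 20.0, 30.0, 50.0, 100.0, 200.0])
fare_group = pd.qcut(fares, 4, labels=False)

print(age_group.tolist())
print(fare_group.tolist())
```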

<div class="alert alert-block alert-info">Pause! Answer <b>Question 5</b> in the Answer.txt file.
    
Why do you think it is reasonable to split the `Age` and `Fare` features into groups?</div>

Now that we've cleaned the data, let's save it as a separate file `titanic-clean.csv` and work directly with that file from now on:

In [None]:
# Write cleaned data set to new csv file
titanic_data.to_csv('titanic-clean.csv', index=False)

# Load cleaned data into new DataFrame object
titanic_data_clean = pd.read_csv('titanic-clean.csv')

# Show first 15 rows of cleaned data set
titanic_data_clean.head(15)

Let's now take another brief look at the correlation matrix of the prepared data set:

In [None]:
# Compute correlation matrix
corrMatrix = titanic_data_clean.corr()

# Show heat map
f, ax = plt.subplots(figsize=(10, 8))
sns.heatmap(corrMatrix, annot=True, cmap="coolwarm");

We now have an overview of the features that we will use for modeling. The heat map confirms the trends we noticed earlier: sex correlates with survival (women were more likely to survive than men), and passengers in first and second class were more likely to survive than those in third class. We can also see a negative correlation between `Fare` and `Pclass` (-0.63) and a positive correlation between `Fare` and `Survived` (0.3), which makes sense because first class tickets were the most expensive and first class passengers were more likely to survive.

We are now finally ready to train our machine learning model! 

## 5. Train and evaluate a model

[[ go back to the top ]](#Table-of-contents)

A machine learning model is expected not just to explain known data, but to make accurate predictions on new, unseen data. To assess the accuracy of a model, we must not test it on the same data we trained it on: a model can overfit, i.e. perform well on the training data but fail to generalize to new data, and testing on the training data would hide this. 

Instead, we will split our data set into two (random) subsets:

- A **training set** to train our model on (typically 80% of the data).
- A **test set** (disjoint from the training set) to evaluate our model on unseen data (typically 20% of the data).

During training, the algorithm looks for patterns in the *training set* that link the features of each passenger to their survival. Following training, the model is used to predict survival of the passengers from the *test set*. By comparing the model predictions to the `Survived` column of the test set, we can assess the accuracy of our model.
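To see why accuracy measured on the training data can be misleading, consider the following sketch. It deliberately uses synthetic data from scikit-learn's `make_classification` and an unpruned decision tree (neither is part of this notebook's pipeline; both are swapped in purely for illustration), because such a tree can memorize noisy training data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy two-class data (not the Titanic data); flip_y randomly flips some labels
X, y = make_classification(n_samples=500, n_features=10, flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# An unpruned decision tree keeps splitting until it fits the training set
tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

# score() returns the accuracy on the given data
train_acc = tree.score(X_tr, y_tr)
test_acc = tree.score(X_te, y_te)
print(train_acc, test_acc)
```

The gap between the two numbers is what evaluating on the training set alone would hide.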

We split the data randomly using the function `train_test_split` from the *scikit-learn* library. We specify what fraction of the data should be used for the test set using the parameter `test_size`, and set a random seed using the parameter `random_state` so that our results will be reproducible. We then separate the `Survived` column, which contains the labels or the outcome, from the rest of the features in the data set: 

<div align="center">
<img width="800" title="Splitting the data into training and test sets" src="images/train_test_split.png"/>
</div>

In [None]:
from sklearn.model_selection import train_test_split

# Split the data randomly into training set and test set in a 80:20 ratio
train_df, test_df = train_test_split(titanic_data_clean, test_size=0.2, random_state=0)

# Separate the `Survived` feature from the rest of the features
X_train = train_df.drop("Survived", axis=1)
y_train = train_df["Survived"]
X_test = test_df.drop("Survived", axis=1)
y_test = test_df["Survived"]

# Verify dimensions of the four subsets 
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

Now we can select and train a model. The good news is that thanks to all the previous steps, things are going to be much simpler than you might think. The model we will use is called *Logistic Regression*, which is often the first model you will train when performing classification.

### Logistic Regression

Logistic Regression is commonly used to estimate the probability that an instance belongs to a particular class. If the estimated probability is greater than 50%, then the model predicts that the instance belongs to that class, and otherwise it predicts that it does not. This makes it a binary classifier. 
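In its standard formulation (not spelled out above), Logistic Regression computes a weighted sum of the features and passes it through the logistic (sigmoid) function to obtain a probability. The sketch below illustrates only this last step and the 50% decision rule; the helper names `sigmoid` and `predict_class` are made up for this example and are not part of scikit-learn.

```python
import math

def sigmoid(z):
    """Map a real-valued score z to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_class(z, threshold=0.5):
    """Return 1 if the estimated probability exceeds the threshold, else 0."""
    return 1 if sigmoid(z) > threshold else 0

# Negative scores give probabilities below 50%, positive scores above 50%
for score in (-2.0, 0.0, 2.0):
    print(score, round(sigmoid(score), 3), predict_class(score))
```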

We will again be using the *scikit-learn* library. Each model in scikit-learn is implemented as a separate class, which we import and of which we create an instance--this instance is the model that we then train:

In [None]:
# Import LogisticRegression class
from sklearn.linear_model import LogisticRegression

# Create a LogisticRegression object
logreg = LogisticRegression()

# Train the model on the training set
logreg.fit(X_train, y_train);

Done! We now have a working Logistic Regression model. 

Let's use it to generate predictions on the test set. We can then calculate the accuracy of the model, i.e. the percentage of passengers correctly classified, by comparing the predicted values to the true values `y_test`:

In [None]:
# Predict `Survived` feature for the test set
y_pred = logreg.predict(X_test)

# Compare predicted values to true values
from sklearn.metrics import accuracy_score
accuracy = round(accuracy_score(y_test, y_pred) * 100, 3)
accuracy

Our model achieved an accuracy of **80.5%** when tested against the test set. 

With the Logistic Regression model, we can also investigate the relative importance of the features contributing to its prediction of survival. Below, we display the coefficients associated with each feature in its decision function: Positive coefficients increase the log-odds of the response (and thus increase the probability for `Survived`=1), and negative coefficients decrease the log-odds of the response (and thus decrease the probability for `Survived`=1):

In [None]:
# Display the coefficients of the features in the decision function
# (taking the feature names from X_train keeps them aligned with logreg.coef_)
coeff_df = pd.DataFrame({'Feature': X_train.columns, 'Coefficient': logreg.coef_[0]})
coeff_df.sort_values(by='Coefficient', ascending=False)

The coefficients found during training confirm our initial observations:

- `Sex` is associated with the largest positive coefficient, implying that as the `Sex` value increases (from `male`=0 to `female`=1), the probability of `Survived`=1 increases the most.
- `Pclass` is associated with the most negative coefficient, implying that as the class number increases (from 1 to 3), the probability of `Survived`=1 decreases the most.
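One way to read a log-odds coefficient is as a multiplicative change in the odds: increasing a feature by one unit multiplies the odds of `Survived`=1 by the exponential of its coefficient. The sketch below uses hypothetical coefficient values chosen only for illustration; they are not the values produced by the model above.

```python
import math

# Hypothetical coefficients, chosen for illustration only
example_coefficients = {"Sex": 2.5, "Pclass": -1.0}

for feature, coef in example_coefficients.items():
    # exp(coefficient) is the factor by which the odds change per unit increase
    odds_ratio = math.exp(coef)
    print(f"{feature}: coefficient {coef:+.1f} -> odds multiplied by {odds_ratio:.2f}")
```

A factor above 1 raises the odds and a factor below 1 lowers them. Keep in mind that the size of a coefficient depends on the scale of its feature, so comparing coefficients across features is most meaningful when the features span similar ranges.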

<div class="alert alert-block alert-success">  <b>Bonus Question 4</b>:    

The code below repeats the splitting of the data, the training of the model, and its evaluation on the test set 1000 times. The accuracies obtained are displayed as a histogram. Can you explain why we get different accuracy values despite running the same code? What would therefore be a useful piece of information when evaluating the accuracy of a model? </div>


In [None]:
model_accuracies = []

# Split data set randomly 1000 times, and train and test the model
for repetition in range(1000):

    # Split the data and separate the `Survived` feature
    train_df, test_df = train_test_split(titanic_data_clean, test_size=0.2)
    X_train = train_df.drop("Survived", axis=1)
    y_train = train_df["Survived"]
    X_test = test_df.drop("Survived", axis=1)
    y_test = test_df["Survived"]

    # Train and test the model
    logreg = LogisticRegression()
    logreg.fit(X_train, y_train);
    y_pred = logreg.predict(X_test)
    accuracy = round(accuracy_score(y_test, y_pred) * 100, 3)
    
    # Save accuracy value
    model_accuracies.append(accuracy)

# Display the accuracy distribution obtained    
plt.hist(model_accuracies)
plt.xlabel('model accuracy')
plt.ylabel('number of repetitions')
left, right = plt.xlim();

#### You may now wonder...

...what things could we do to improve the accuracy of our model? Here are a few ideas:

- Improving the features
     - create new features from the existing data (feature engineering)
     - experiment with preprocessing of data, e.g. use different methods to fill in missing values
     - try to use information from features not considered here
- Improving the model
     - try a variety of models
     - optimize the settings within each particular model (hyperparameter optimization)
     - combine several models
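As a starting point for the "try a variety of models" idea, the sketch below compares two scikit-learn classifiers using 5-fold cross-validation via `cross_val_score`. It runs on synthetic data rather than the Titanic data, and the choice of a random forest as the second candidate is an assumption made for illustration; on your own data you would pass the feature table and the `Survived` labels instead.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic two-class data standing in for a prepared feature table
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Candidate models to compare (settings are illustrative, not tuned)
candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Average the accuracy over 5 train/test folds for each candidate
mean_scores = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)
    mean_scores[name] = scores.mean()
    print(f"{name}: mean accuracy {mean_scores[name]:.3f}")
```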

#### Congratulations! You have completed the Titanic Challenge!

## Sources

[[ go back to the top ]](#Table-of-contents)

Based on the famous [kaggle Titanic competition](https://www.kaggle.com/c/titanic).

Sources for pictures:
- Titanic.jpg: https://upload.wikimedia.org/wikipedia/commons/9/92/Titanic.jpg

### Python packages used

This notebook uses several standard Python packages. These are:

* **pandas** is a powerful data analysis package, providing the "DataFrame" structure to store data in memory and work with it easily and efficiently. DataFrame is a 2-dimensional labeled data structure with columns of potentially different types; you can think of it like a spreadsheet.
* **numpy** is an essential package for scientific computing with Python, providing a fast numerical array structure and helper functions.
* **random** generates pseudo-random numbers for various distributions.
* **matplotlib** is the basic plotting library in Python; most other Python plotting libraries are built on top of it.
* **seaborn** is an advanced statistical plotting library.
* **sklearn** (**scikit-learn**) is an essential Machine Learning package in Python.