
<div style="text-align: right"><a href="http://ml-school.uni-koeln.de">Virtual Summer School "Deep Learning for
    Language Analysis"</a> <br/><strong>Text Analysis with Deep Learning</strong><br/>Aug 31 — Sep 4, 2020<br/>Nils Reiter<br/><a href="mailto:nils.reiter@uni-koeln.de">nils.reiter@uni-koeln.de</a></div>


# Feedforward Neural Network: Titanic

<br/>

<div style="float:left;margin-right:10px;"><img src="gfx/titanic.jpg" width="200" /></div> This dataset contains information about the Titanic passengers, including names, gender, passenger class and whether they survived <a href="https://en.wikipedia.org/wiki/Sinking_of_the_Titanic">the sinking of the ship</a>.

We will use the dataset to train a feedforward neural network that predicts, given the other information, whether someone survived. It is therefore a binary classification task with the two classes "survived" (encoded as 1) and "drowned" (encoded as 0).

In [1]:
# import some libraries

# fast matrices and arrays (much faster than regular Python lists)
import numpy as np

# what we need for deep learning (public Keras API shipped with TensorFlow)
from tensorflow.keras import models, layers, optimizers

In [2]:
# working with tabular data
import pandas as pd

# read the data from a CSV file (included in the repository)
df = pd.read_csv("data/titanic/train.csv")

# show the table
df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


There are two columns that we do not want to use as features, because they do not allow generalization: `Name` and `PassengerId` identify individual passengers and cannot be expected to contribute useful information about survival. We therefore drop both columns.

In [3]:
# drop both columns by name (a positional axis argument is not accepted by recent pandas versions)
df = df.drop(columns=["Name", "PassengerId"])

Neural networks generally expect numeric input. We therefore convert all non-numeric columns (`Sex`, `Cabin`, `Ticket` and `Embarked`) to numeric columns. This is done with [the pandas function `factorize()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.factorize.html). The function returns a tuple whose first element holds the integer codes, so we assign that element as the new values of the respective column. Missing values are encoded as `-1`.

(It is not strictly necessary to wrap this into a function.)

After calling the function, we again inspect the table to verify that this worked.

In [4]:
def make_numeric(df):
  df["Sex"] = pd.factorize(df["Sex"])[0]
  df["Cabin"] = pd.factorize(df["Cabin"])[0]
  df["Ticket"] = pd.factorize(df["Ticket"])[0]
  df["Embarked"] = pd.factorize(df["Embarked"])[0]
  return df

df = make_numeric(df)
df

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,0,3,0,22.0,1,0,0,7.2500,-1,0
1,1,1,1,38.0,1,0,1,71.2833,0,1
2,1,3,1,26.0,0,0,2,7.9250,-1,0
3,1,1,1,35.0,1,0,3,53.1000,1,0
4,0,3,0,35.0,0,0,4,8.0500,-1,0
...,...,...,...,...,...,...,...,...,...,...
886,0,2,0,27.0,0,0,677,13.0000,-1,0
887,1,1,1,19.0,0,0,678,30.0000,145,0
888,0,3,1,,1,2,614,23.4500,-1,0
889,1,1,0,26.0,0,0,679,30.0000,146,1
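
To see what `factorize()` returns, it can help to try it on a tiny column first. The following is a minimal sketch with made-up values (the series `embarked` is not part of the notebook's data); it shows the tuple of integer codes and unique values, and that a missing value receives the code `-1`.

```python
import numpy as np
import pandas as pd

# Toy column (made up for illustration), including one missing value
embarked = pd.Series(["S", "C", "S", np.nan, "Q"])

# factorize() returns a tuple: (integer codes, unique values)
codes, uniques = pd.factorize(embarked)

print(codes)    # one integer per row, -1 for the missing value
print(uniques)  # the distinct values, in order of first appearance
```

Because missing values become `-1` here, the factorized columns no longer contain NaN; only columns that were numeric from the start can still have gaps.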


If you look closely, you might have seen that some rows still contain NaN (= "not a number"), which marks a missing value, for example in the `Age` column. To remove these rows, we use the [pandas function `dropna()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html).

In [5]:
df = df.dropna()
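
Before dropping rows, it can be useful to check how many values are actually missing. The sketch below uses a small made-up table (`toy` is an assumption for illustration, not the Titanic data) to show `isna().sum()` for counting and the effect of `dropna()`.

```python
import numpy as np
import pandas as pd

# Small made-up table with one missing age
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0],
    "Fare": [7.25, 23.45, 7.925],
})

# Number of missing values per column
missing_per_column = toy.isna().sum()

# dropna() removes every row that contains at least one missing value
cleaned = toy.dropna()

print(missing_per_column)
print(len(toy), "->", len(cleaned))
```

Note that `dropna()` keeps the original row labels, so the index of the cleaned table has gaps.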

Next, we split up the data into the feature values (`x`) and the correct class labels (`y`). In this case, this is straightforward, because all we need is to extract one column and assign it to `y`. For `x`, we simply remove the column.

In [6]:
y = df["Survived"]
x = df.drop(columns="Survived")

## Splitting into train and test

For splitting the data into train and test, we make use of the [scikit-learn function `train_test_split()`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html). We hold out 10% of the rows for testing (`test_size=0.1`) and fix `random_state` so that the split is the same on every run. This is actually the only function from scikit-learn that we are going to use.

In [7]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0, test_size=0.1)
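
The effect of `test_size` and `random_state` is easiest to see on toy data. The following sketch uses 20 made-up instances (all variable names are chosen for this example only) and checks the resulting sizes and that repeating the call with the same `random_state` yields the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 20 toy instances with 2 features each, plus alternating labels
x_toy = np.arange(40).reshape(20, 2)
y_toy = np.arange(20) % 2

a_train, a_test, b_train, b_test = train_test_split(
    x_toy, y_toy, random_state=0, test_size=0.1)

# Same random_state, same split
c_train, c_test, d_train, d_test = train_test_split(
    x_toy, y_toy, random_state=0, test_size=0.1)

print(a_train.shape, a_test.shape)      # sizes of the two parts
print(np.array_equal(a_test, c_test))   # whether both calls agree
```

Features and labels are shuffled together, so each row of `a_test` still belongs to the label at the same position in `b_test`.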

## Create the neural network

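The network is a `Sequential` model, i.e., a plain stack of layers. The input layer expects 9 values per passenger, one for each remaining feature column. It is followed by a hidden `Dense` (fully connected) layer with 50 neurons and an output layer with a single neuron. Both use the sigmoid activation function, so the output is a number between 0 and 1 that we can interpret as a score for the class "survived".

With `compile()`, we specify how the network is trained: mean squared error as loss function, Adam as optimizer, and accuracy as an additional metric that is reported during training and evaluation. Finally, `summary()` prints an overview of the layers and their number of parameters.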

In [12]:
ffnn = models.Sequential()
ffnn.add(layers.Input(shape=(9,)))
ffnn.add(layers.Dense(50, activation="sigmoid"))
ffnn.add(layers.Dense(1, activation="sigmoid"))

ffnn.compile(loss="mean_squared_error", 
             optimizer="adam",
             metrics=["accuracy"])

ffnn.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_2 (Dense)              (None, 50)                500       
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 51        
Total params: 551
Trainable params: 551
Non-trainable params: 0
_________________________________________________________________
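
The parameter counts in the summary can be reproduced by hand: each dense layer has one weight per connection plus one bias per neuron. The sketch below rebuilds the same layer shapes in plain NumPy and runs one forward pass. It is an illustration under assumptions, not what Keras does internally: the weights are drawn from a simple normal distribution, whereas Keras uses its own initialisation, and the input batch is made up.

```python
import numpy as np

def sigmoid(z):
    """Logistic function, squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

n_features, n_hidden, n_out = 9, 50, 1

rng = np.random.default_rng(0)
# Small random weights (assumption; Keras initialises differently)
W1 = rng.normal(scale=0.1, size=(n_features, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.1, size=(n_hidden, n_out))
b2 = np.zeros(n_out)

# Parameter counts: weights plus biases per layer
params_hidden = W1.size + b1.size   # 9 * 50 + 50
params_output = W2.size + b2.size   # 50 * 1 + 1

# Forward pass for a batch of 4 made-up passengers
batch = rng.normal(size=(4, n_features))
hidden = sigmoid(batch @ W1 + b1)
output = sigmoid(hidden @ W2 + b2)

print(params_hidden, params_output, params_hidden + params_output)
print(output.shape)
```

The two counts match the `Param #` column above (500 and 51, 551 in total), and the output has one value between 0 and 1 per passenger.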


Next, we start the training process, using [the `fit()` function from keras](https://keras.io/api/models/model_training_apis/#fit-method). The function takes the feature values (`x`), the correct outcomes (`y`), and three more parameters: `epochs` sets the number of passes over the training data, `batch_size` sets the number of instances that are processed before each weight update, and `verbose` controls the amount of output that is generated.

In [9]:
history = ffnn.fit(x_train.to_numpy(), y_train.to_numpy(), epochs=10, batch_size=3, verbose=1)


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
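
How `epochs` and `batch_size` interact can be worked out with a little arithmetic: in each epoch, the training data is cut into batches, and every batch triggers one weight update. The sketch below computes the number of updates; the helper `updates_per_epoch` and the training set size of 600 are hypothetical and only serve as an example.

```python
import math

def updates_per_epoch(n_instances, batch_size):
    """Number of weight updates in one pass over the training data."""
    # The last batch may be smaller than batch_size, hence the ceiling
    return math.ceil(n_instances / batch_size)

n_train = 600  # hypothetical training set size, not taken from the notebook
per_epoch = updates_per_epoch(n_train, batch_size=3)
total = per_epoch * 10  # epochs=10, as in the call to fit()

print(per_epoch, total)
```

A smaller `batch_size` therefore means more, but noisier, updates per epoch.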


In [11]:
# evaluate on the held-out test data (converted to arrays, as for training)
ffnn.evaluate(x_test.to_numpy(), y_test.to_numpy())



[0.19128428399562836, 0.7083333134651184]
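
The two numbers returned by `evaluate()` are the loss and the metric we specified in `compile()`, both measured on the test data: a mean squared error of about 0.19 and an accuracy of about 0.71. The following sketch shows how such values come about, using made-up network outputs and labels. It assumes that outputs above 0.5 are counted as class 1, which is the usual threshold for a single sigmoid output.

```python
import numpy as np

# Made-up network outputs and gold labels for five test instances
predictions = np.array([0.9, 0.2, 0.6, 0.4, 0.1])
gold = np.array([1, 0, 0, 1, 0])

# Mean squared error: average squared distance between output and label
mse = np.mean((predictions - gold) ** 2)

# Accuracy: outputs above 0.5 count as class 1 ("survived")
predicted_class = (predictions > 0.5).astype(int)
accuracy = np.mean(predicted_class == gold)

print(mse, accuracy)
```

The loss rewards outputs that are close to the label, while accuracy only counts whether the thresholded decision is correct.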