<div style="text-align: right"><a href="http://ml-school.uni-koeln.de">Summer School "Deep Learning for
    Language Analysis"</a> <br/><strong>Text Analysis with Deep Learning</strong><br/>Sep 5 - 9, 2022<br/>Nils Reiter<br/><a href="mailto:nils.reiter@uni-koeln.de">nils.reiter@uni-koeln.de</a></div>


# Feedforward Neural Network: Titanic

<br/>

<div style="float:left;margin-right:10px;"><img src="gfx/titanic.jpg" width="200" /></div> This dataset contains information about the Titanic passengers, including their names, gender, passenger class, and whether they survived <a href="https://en.wikipedia.org/wiki/Sinking_of_the_Titanic">the sinking of the ship</a>.

We will use the dataset to train a feedforward neural network that predicts, given the other information, whether someone survived. It is therefore a binary classification task with the two classes "survived" (encoded as 1) and "drowned" (encoded as 0).

In [None]:
# import some libraries

# fast matrices and arrays (much faster than regular Python lists)
import numpy as np

# what we need for deep learning (use the public Keras API of TensorFlow)
from tensorflow.keras import models, layers, optimizers

In [None]:
# working with tabular data
import pandas as pd

# read the data from a CSV file (included in the repository)
df = pd.read_csv("data/titanic/train.csv")

# show the table
df

There are two columns that we do not want to use as features because they do not generalize: `Name` and `PassengerId` cannot be expected to carry useful information about whether a passenger survived. We therefore drop both columns.

In [None]:
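# pass the column names by keyword; the positional axis argument is no longer accepted by recent pandas versions
df = df.drop(columns=["Name", "PassengerId"])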

Neural networks generally expect numeric input. We therefore convert all non-numeric columns (`Sex`, `Cabin`, `Ticket` and `Embarked`) to numeric columns. This is done with [the pandas function `factorize()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.factorize.html). The function returns a tuple whose first element holds the numeric codes. We therefore assign this first element as the new values of the respective column.

(It is not strictly necessary to wrap this into a function.)

After calling the function, we again inspect the table to verify that this worked.

In [None]:
def make_numeric(df):
  df["Sex"] = pd.factorize(df["Sex"])[0]
  df["Cabin"] = pd.factorize(df["Cabin"])[0]
  df["Ticket"] = pd.factorize(df["Ticket"])[0]
  df["Embarked"] = pd.factorize(df["Embarked"])[0]
  return df

df = make_numeric(df)
df
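
To see what `factorize()` returns, here is a minimal sketch on a made-up column (the values are illustrative, not taken from the dataset). Note that a missing value is encoded as `-1` rather than kept as NaN, so after the conversion only columns that were numeric from the start can still contain NaN.

```python
import numpy as np
import pandas as pd

# Toy column with a repeated value and one missing entry (illustrative data)
embarked = pd.Series(["S", "C", "S", np.nan, "Q"])

# factorize() returns a tuple: (integer codes, distinct values)
codes, uniques = pd.factorize(embarked)

print(codes)    # codes are assigned in order of first appearance; NaN becomes -1
print(uniques)  # the distinct non-missing values
```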

If you look closely at the table, you will notice that some rows contain NaN ("not a number"), which pandas uses to mark missing values. To remove these rows, we use the [pandas function `dropna()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html).

In [None]:
df = df.dropna()
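
Because `dropna()` discards every row that has at least one missing value, it can be worth checking how many rows are affected. A minimal sketch on a made-up table (the column names are borrowed from the dataset, the values are illustrative):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0],
    "Fare": [7.25, 71.28, np.nan],
})

# Count the missing values per column before dropping anything
missing_per_column = toy.isna().sum()
print(missing_per_column)

# dropna() keeps only the rows without any missing value
clean = toy.dropna()
print(len(toy), "->", len(clean))
```

In the notebook, the same check could be run as `df.isna().sum()` just before the call to `dropna()`.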

Next, we split up the data into the feature values (`x`) and the correct class labels (`y`). In this case, this is straightforward, because all we need is to extract one column and assign it to `y`. For `x`, we simply remove the column.

In [None]:
y = df["Survived"]
x = df.drop(columns=["Survived"])

## Splitting into train and test

For splitting the data into train and test, we make use of the [scikit-learn function `train_test_split()`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html). With `test_size=0.1`, 10% of the rows are held out for testing, and a fixed `random_state` makes the split reproducible. This is actually the only function from scikit-learn that we are going to use.

In [None]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0, test_size=0.1)
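
As a quick sanity check of what the split does, the following sketch applies the same call to a small made-up array (20 samples with 2 features each; the data is purely illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 20 illustrative samples with 2 features each, and alternating labels
x_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0, 1] * 10)

x_tr, x_te, y_tr, y_te = train_test_split(x_toy, y_toy, random_state=0, test_size=0.1)

# 10% of 20 samples go to the test set, the rest to the training set
print(x_tr.shape, x_te.shape)
```

In the notebook, `x_train.shape` and `x_test.shape` show the corresponding sizes for the Titanic data.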

## Create the neural network

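We create a simple neural network here, consisting of one hidden layer with 50 neurons. The input layer expects the 9 feature columns that remain in `x`, and the output layer has a single sigmoid neuron whose value lies between 0 ("drowned") and 1 ("survived").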

In [None]:
ffnn = models.Sequential()
ffnn.add(layers.Input(shape=(9,)))
ffnn.add(layers.Dense(50, activation="sigmoid"))
ffnn.add(layers.Dense(1, activation="sigmoid"))

# no optimizer is given, so Keras falls back to its default ("rmsprop")
ffnn.compile(loss="mean_squared_error",
             metrics=["accuracy"])

ffnn.summary()
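
To make the architecture concrete, the following NumPy sketch performs the same kind of computation as the two `Dense` layers for a single input: a weighted sum plus bias, followed by the sigmoid. The weights are random placeholders, not the values Keras would initialize or learn. The sketch also counts the trainable parameters, which should match the total reported by `summary()`.

```python
import numpy as np

def sigmoid(z):
    # squashes any real number into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# Random placeholder weights with the shapes of the two Dense layers
W1, b1 = rng.normal(size=(9, 50)), np.zeros(50)   # hidden layer: 9 inputs -> 50 neurons
W2, b2 = rng.normal(size=(50, 1)), np.zeros(1)    # output layer: 50 inputs -> 1 neuron

x = rng.normal(size=(1, 9))          # one made-up passenger with 9 features
hidden = sigmoid(x @ W1 + b1)        # shape (1, 50)
output = sigmoid(hidden @ W2 + b2)   # shape (1, 1), value between 0 and 1

# (9 * 50 + 50) + (50 * 1 + 1) = 551 trainable parameters
n_params = W1.size + b1.size + W2.size + b2.size
print(output.shape, n_params)
```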

Next, we start the training process, using [the `fit()` function from Keras](https://keras.io/api/models/model_training_apis/#fit-method). The function takes the feature values (`x`), the correct outcomes (`y`), and three more parameters: `epochs` sets the number of passes over the training data, `batch_size` sets how many samples are processed per weight update, and `verbose` controls the amount of output that is generated.

In [None]:
history = ffnn.fit(x_train.to_numpy(), y_train.to_numpy(), epochs=10, batch_size=3, verbose=1)
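
With `batch_size=3`, each epoch consists of many small weight updates. As a rough sketch, the number of updates per epoch is the number of training samples divided by the batch size, rounded up. The helper `updates_per_epoch` and the sample count below are hypothetical and only serve as illustration; in the notebook, the real count is `len(x_train)`.

```python
import math

def updates_per_epoch(n_samples, batch_size):
    # The last batch may be smaller than batch_size, hence rounding up
    return math.ceil(n_samples / batch_size)

n_samples = 600   # hypothetical; use len(x_train) in the notebook
per_epoch = updates_per_epoch(n_samples, batch_size=3)
total = per_epoch * 10   # 10 epochs, as in the call to fit()

print(per_epoch, total)
```

A smaller batch size means more, but noisier, updates per epoch; a larger one means fewer updates that each average over more samples.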


In [None]:
# evaluate on the held-out test data; returns the loss and the accuracy
ffnn.evaluate(x_test.to_numpy(), y_test.to_numpy())
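
`evaluate()` returns the loss followed by the metrics given to `compile()`, here the accuracy. As a sketch of what these two numbers mean for a single sigmoid output: the mean squared error compares the raw outputs with the labels, while the accuracy first turns each output into a class by thresholding (0.5 is assumed here, the usual default for binary accuracy). The numbers are made up for illustration.

```python
import numpy as np

# Made-up sigmoid outputs and true labels
predictions = np.array([0.9, 0.2, 0.6, 0.4])
labels = np.array([1, 0, 0, 0])

# Mean squared error between raw outputs and labels
mse = ((predictions - labels) ** 2).mean()

# Threshold at 0.5: 1 = survived, 0 = drowned
predicted_classes = (predictions > 0.5).astype(int)
accuracy = (predicted_classes == labels).mean()

print(mse, predicted_classes, accuracy)
```

In the notebook, the raw outputs for the test set could be obtained with `ffnn.predict(x_test.to_numpy())`.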