# Binary Classification using TensorFlow

We are going to do something similar to the Regression TensorFlow workbook, but this time we are going to work on a classification problem

Back to the titanic.csv file we used before

In [None]:
import tensorflow as tf
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

### Read and clean data

We'll need to read in the data and do the following
- Remove any rows without Embarked filled in
- Fill in any missing ages with a median value
- Pick the columns we are going to build our model with
- Encode some columns - convert from text to numbers
  - Sex using LabelEncoder
  - Embarked using OneHotEncoder

In [None]:
df = pd.read_csv("titanic.csv")
df = df[df["Embarked"].notnull()]   #Remove any rows without Embarked filled in
df['Age'] = df['Age'].fillna(df['Age'].median())    #Fill in any missing ages with a median value

#Pick the columns we are going to build our model with
y = df["Survived"]
X = df[["Pclass","Age","Sex","SibSp","Parch","Fare","Embarked"]].copy()   #.copy() so later column edits work on our own DataFrame, not a slice of df

Convert Sex to numeric values using LabelEncoder from sklearn's preprocessing module
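
As a rough illustration of what the encoder does: LabelEncoder sorts the distinct labels and replaces each value with its position in that sorted list, so `female` becomes 0 and `male` becomes 1. The sketch below mimics that idea with `np.unique` on a made-up sample; it is not sklearn's implementation and the sample values are not rows from titanic.csv.

```python
import numpy as np

# Made-up sample of the Sex column (not rows from titanic.csv)
sex_sample = np.array(["male", "female", "female", "male", "male"])

# np.unique returns the sorted distinct labels; return_inverse gives each
# value's index into that sorted array, which is the integer code we want
classes, codes = np.unique(sex_sample, return_inverse=True)

print(classes)  # the sorted labels
print(codes)    # one integer code per row
```

Because the codes come from alphabetical order, they carry no meaning beyond "two different groups", which is fine for a two-valued column like this one.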

In [None]:
from sklearn import preprocessing
le_sex = preprocessing.LabelEncoder()
le_sex.fit(X['Sex'])
sex = le_sex.transform(X['Sex'])
X = X.drop(['Sex'], axis = 1)
X['Sex'] = sex

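Convert Embarked to numeric values using OneHotEncoder from sklearn's preprocessing module. The encoder sorts the categories, so the three output columns are C, Q and S in that order. There's probably a better way of doing this, but this will work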
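
One commonly used alternative is pandas' `get_dummies`, which builds the indicator columns and names them in a single call. The sketch below uses a made-up four-row sample rather than the real file, and the `prefix` settings are just chosen to reproduce the EmbarkC / EmbarkQ / EmbarkS names used in this notebook.

```python
import pandas as pd

# Made-up sample of the Embarked column (not rows from titanic.csv)
toy = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# prefix="Embark" with prefix_sep="" gives EmbarkC / EmbarkQ / EmbarkS;
# dtype=float gives 0.0 / 1.0 values instead of booleans
dummies = pd.get_dummies(toy["Embarked"], prefix="Embark", prefix_sep="", dtype=float)

# Swap the text column for the indicator columns
toy = pd.concat([toy.drop(columns="Embarked"), dummies], axis=1)
print(toy)
```

The trade-off is that `get_dummies` does not remember the categories it saw, whereas a fitted OneHotEncoder can be reused on new data with the same column layout.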

In [None]:
le_embark = preprocessing.OneHotEncoder(sparse_output=False)
le_embark.fit(X["Embarked"].values.reshape(-1,1))
embarked = le_embark.transform(X["Embarked"].values.reshape(-1,1))
X = X.drop(["Embarked"], axis = 1)
X["EmbarkC"] = embarked[:,0]
X["EmbarkQ"] = embarked[:,1]
X["EmbarkS"] = embarked[:,2]
X

### Machine Learning 
Then we can start doing our Machine Learning process
- Train_test_split
- Normalise the data
- Write the base models
- Compile
- Fit
- Evaluate

Usual train/test split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1138, train_size=0.8)

Normalisation, as was done in the Regression workbook
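
As a reminder of what the normalisation step does, here is a small NumPy sketch under simplifying assumptions: learn each column's mean and variance from the training data, then subtract the mean and divide by the standard deviation. The numbers are made up, and the real Keras layer has extra details (such as guarding against zero variance) that are left out here.

```python
import numpy as np

# Made-up feature matrix: 4 rows, 2 columns (think Age and Fare)
X_toy = np.array([[22.0, 7.25],
                  [38.0, 71.28],
                  [26.0, 7.93],
                  [35.0, 53.10]])

# "adapt" step: learn the per-column mean and variance from the training data
mean = X_toy.mean(axis=0)
var = X_toy.var(axis=0)

# Forward pass: subtract the mean, divide by the standard deviation
X_norm = (X_toy - mean) / np.sqrt(var)

print(X_norm.mean(axis=0))  # approximately 0 for each column
print(X_norm.std(axis=0))   # approximately 1 for each column
```

The statistics come from the training rows only, which is why the notebook calls `adapt` on `X_train` and not on the full data.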

In [None]:
#Normalization is a regular Keras layer in TF 2.6+. Using it through tf.keras also
#avoids overwriting the sklearn `preprocessing` name imported earlier
normalizer = tf.keras.layers.Normalization()
normalizer.adapt(X_train.values)
print('Normalized:', normalizer(X_train.values).numpy())

## Model Number 1 - Logistic Regression

We are just going to do a regular Logistic Regression model to start with

This just requires one output layer with one unit in it so it is the same as
$$\log\left(\frac{p}{1-p}\right) = w^Tx + b $$

This should be similar to using sklearn LogisticRegression()
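
To make the equation concrete, here is a minimal NumPy sketch with made-up weights and inputs (`w`, `b` and `x` are hypothetical values, not learnt from the Titanic data). It computes the logit $w^Tx + b$, turns it into a probability with the sigmoid, and checks that the log-odds of that probability gives the logit back.

```python
import numpy as np

def sigmoid(z):
    """Map a logit in (-inf, +inf) to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights, bias and a single (already normalised) input row
w = np.array([0.5, -1.0, 2.0])
b = 0.1
x = np.array([1.0, 2.0, 0.5])

logit = w @ x + b               # the right-hand side of the equation
p = sigmoid(logit)              # predicted probability of the positive class
log_odds = np.log(p / (1 - p))  # the left-hand side of the equation

print(logit, p, log_odds)
```

So a Dense(1) layer with no activation outputs the logit, and applying the sigmoid to it recovers the probability.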

We put the normalizer in as a layer

In [None]:
model = tf.keras.Sequential([
    normalizer,
    tf.keras.layers.Dense(1)
])
model.summary()

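Trainable params is 10: we have 9 features (so 9 weights) plus the bias b to be learnt as well. The mean and variance stored by the normalizer are not learnt by gradient descent, so they show up as non-trainable params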

Our loss is going to be BinaryCrossentropy (https://www.tensorflow.org/api_docs/python/tf/keras/losses/BinaryCrossentropy) as it is a binary problem

Read the manual: even though the default is from_logits=False, the docs say Recommended Usage: set from_logits=True (meaning the output data is unscaled). Which setting is right depends on our output layer. If you look above, I did not put any activation on the output layer, so the outputs will be logits, i.e. anywhere from -infinity to +infinity

If I had instead used tf.keras.layers.Dense(1, activation='sigmoid') above, then from_logits=False would be needed, but that is not the recommended usage according to the manual
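
A small NumPy sketch of why the two set-ups describe the same loss. Route 1 applies the sigmoid and then the usual cross-entropy formula; route 2 works on the logits directly using an algebraically equivalent expression that avoids taking the log of a value that has rounded to 0 or 1. The labels and logits are made up, and this is an illustration of the maths, not TensorFlow's actual code.

```python
import numpy as np

y = np.array([1.0, 0.0, 1.0, 0.0])    # made-up labels
z = np.array([2.0, -1.0, -0.5, 3.0])  # made-up logits (model outputs)

# Route 1: squash to probabilities first, then take the cross-entropy
p = 1.0 / (1.0 + np.exp(-z))
bce_from_probs = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Route 2: stay on the logits, using the equivalent form
#   max(z, 0) - z*y + log(1 + exp(-|z|))
bce_from_logits = np.mean(np.maximum(z, 0) - z * y + np.log1p(np.exp(-np.abs(z))))

print(bce_from_probs, bce_from_logits)  # the two routes agree
```

For moderate logits the two routes agree; the logit form matters when the logits are large in magnitude, where route 1 can end up evaluating log(0).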

In [None]:
model.compile(
    optimizer=tf.optimizers.Adam(learning_rate=0.1),
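    #Pass metrics as a list; newer Keras versions reject a bare string.
    #Note 'accuracy' thresholds the model output at 0.5, and our output is a logit, not a probability
    metrics=['accuracy'],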
    loss=tf.keras.losses.BinaryCrossentropy(from_logits=True)
)

model.fit(X_train, y_train, epochs=100)

Loss keeps decreasing, but we haven't used a validation set, so we can't tell whether the model is starting to overfit. We could've used a different number of epochs

Be careful if you run .fit again. It continues on from the weights where you left off, not starting from scratch again

In [None]:
logRegEval = model.evaluate(X_test, y_test)
print("Logistic Regression Loss: ", round(logRegEval[0],4), "Accuracy: ", round(logRegEval[1],4))

## Model Number 2 - Neural Network

Now we are going to build a Neural Network

4 Dense layers: 3 hidden layers with 100 units each and relu as the activation function, plus the 1-unit output layer (you can try different structures if you want, but for now I'm just demonstrating how TensorFlow works as well as some other things we can try)

In [None]:
model_tf = tf.keras.Sequential([
    normalizer,
    tf.keras.layers.Dense(100,activation='relu'),
    tf.keras.layers.Dense(100,activation='relu'),
    tf.keras.layers.Dense(100,activation='relu'),
    tf.keras.layers.Dense(1)
])

In [None]:
model_tf.summary()

21301 trainable params, a lot more than the previous 10! Let's look at each layer

The input has 9 features and hidden layer 1 has 100 units. Therefore the weight matrix is going to have 100 rows and 9 columns to match
$$ Wx + b $$

That gives us 900 parameters for the weights. Then we have 100 biases, giving the total of 1000 parameters to go from the input layer to hidden layer 1

Now from hidden layer 1 to hidden layer 2. 100 features, 100 units, gives us a 100x100 matrix so 10000 parameters; add in the 100 biases and we get 10100. Hidden layer 2 to hidden layer 3 is the same again (10100), and the output layer has 100 weights plus 1 bias (101), so 1000 + 10100 + 10100 + 101 = 21301
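
The same counting can be written as a few lines of plain Python. `dense_param_count` is a helper made up for this illustration (it is not part of TensorFlow); it takes the layer sizes from input to output and returns the parameter count for each Dense layer.

```python
def dense_param_count(layer_sizes):
    """Parameters per Dense layer: (inputs * units) weights plus (units) biases."""
    return [n_in * n_out + n_out
            for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

# 9 input features -> 100 -> 100 -> 100 -> 1 output
per_layer = dense_param_count([9, 100, 100, 100, 1])
print(per_layer, sum(per_layer))  # [1000, 10100, 10100, 101] 21301

# The logistic regression model: 9 input features -> 1 output
print(dense_param_count([9, 1]))  # [10]
```

This matches the trainable params reported by model.summary() for both models.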

Let's compile and fit

In [None]:
model_tf.compile(
    optimizer=tf.optimizers.Adam(learning_rate=0.01),
    metrics=['accuracy'],   #as a list; newer Keras versions reject a bare string
    loss=tf.keras.losses.BinaryCrossentropy(from_logits=True)
)

model_tf.fit(X_train, y_train, epochs=100)

Again the loss keeps decreasing, but we haven't used a validation set. We could've used a different number of epochs

In [None]:
print("Logistic Regression Loss: ", round(logRegEval[0],4), "Accuracy: ", round(logRegEval[1],4))

In [None]:
annEval = model_tf.evaluate(X_test, y_test)
print("ANN Loss: ", round(annEval[0],4), "Accuracy: ", round(annEval[1],4))

These models are potentially overfit; we should have done some validation to check

While the accuracy for the test data is better, the loss is actually worse!
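
How can accuracy go up while loss gets worse? Accuracy only looks at which side of 0.5 each prediction falls on, while cross-entropy also cares how confident the prediction was. The made-up example below (the labels and probabilities are invented for illustration, not taken from the Titanic models) has one set of predictions with a mistake but good confidence, and another that gets everything right but only just.

```python
import numpy as np

def bce(y, p):
    """Mean binary cross-entropy for labels y and predicted probabilities p."""
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

def accuracy(y, p):
    """Fraction of rows where thresholding p at 0.5 matches the label."""
    return float(np.mean((p > 0.5) == (y == 1)))

y = np.array([1, 1, 1, 0, 0])

p_a = np.array([0.70, 0.70, 0.45, 0.30, 0.30])  # fairly confident, one mistake
p_b = np.array([0.51, 0.51, 0.51, 0.49, 0.49])  # all correct, but only just

print("A: accuracy", accuracy(y, p_a), "loss", round(bce(y, p_a), 4))
print("B: accuracy", accuracy(y, p_b), "loss", round(bce(y, p_b), 4))
```

B wins on accuracy and loses on loss, which is the same pattern as the two models above.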

## Using Validation
### Logistic Regression with Validation

Let's add in validation. k-fold cross validation would be nice, but that's more difficult to do with TensorFlow and requires writing our own functions, so let's just use validation_split=0.2, which holds out the last 20% of the training data as a validation set. Since we used train_test_split already, the data is already shuffled so it should be ok
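
A rough sketch of how that hold-out works, as I understand the Keras behaviour: the validation rows are sliced off the end of the arrays before any per-epoch shuffling happens. The ten-row array is a stand-in for the training data, and the exact rounding Keras uses is an assumption here.

```python
import math
import numpy as np

X_toy = np.arange(10)    # stand-in for 10 training rows, already shuffled
validation_split = 0.2

# Everything before split_at is used for training, the rest for validation
split_at = int(math.floor(len(X_toy) * (1.0 - validation_split)))
train_part, val_part = X_toy[:split_at], X_toy[split_at:]

print(train_part)  # first 8 rows
print(val_part)    # last 2 rows
```

This is why the earlier shuffle matters: if the rows were still in file order, the validation set would just be whatever happened to be at the bottom of the file.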

In [None]:
model = tf.keras.Sequential([
    normalizer,
    tf.keras.layers.Dense(1)
])
model.summary()

In [None]:
model.compile(
    optimizer=tf.optimizers.Adam(learning_rate=0.1),
    metrics=['accuracy'],   #as a list; newer Keras versions reject a bare string
    loss=tf.keras.losses.BinaryCrossentropy(from_logits=True)
)

history = model.fit(X_train, y_train, epochs=100, validation_split=0.2)

Same function that was in the previous notebook

In [None]:
def plot_loss(history, which='loss'):
    plt.plot(history.history[which], label='train')
    #Only plot the validation curve if fit was run with validation data
    if 'val_'+which in history.history:
        plt.plot(history.history['val_'+which], label='validation')
    plt.xlabel('Epoch')
    plt.ylabel(which)
    plt.legend()
    plt.grid(True)

In [None]:
plot_loss(history)

In [None]:
logRegValidationEval = model.evaluate(X_test, y_test)
print("Logistic Regression with Validation Loss: ", round(logRegValidationEval[0],4), "Accuracy: ", round(logRegValidationEval[1],4))

Seems about the same performance on the test set. Your numbers will vary because the weight initialisation and batch shuffling are random

Now ANN

### ANN with Validation

In [None]:
model_tf = tf.keras.Sequential([
    normalizer,
    tf.keras.layers.Dense(100,activation='relu'),
    tf.keras.layers.Dense(100,activation='relu'),
    tf.keras.layers.Dense(100,activation='relu'),
    tf.keras.layers.Dense(1)
])

model_tf.compile(
    optimizer=tf.optimizers.Adam(learning_rate=0.01),
    metrics=['accuracy'],   #as a list; newer Keras versions reject a bare string
    loss=tf.keras.losses.BinaryCrossentropy(from_logits=True)
)

history = model_tf.fit(X_train, y_train, epochs=100, validation_split=0.2)

Let's plot the loss graph

In [None]:
plot_loss(history)

Validation loss is a bit all over the place. This suggests overfitting to me

Neural networks are not always better, even though the training loss and accuracy were better with the NN

In [None]:
plot_loss(history, 'accuracy')

In [None]:
print("Logistic Regression with Validation Loss: ", round(logRegValidationEval[0],4), "Accuracy: ", round(logRegValidationEval[1],4))

In [None]:
annValidationEval = model_tf.evaluate(X_test, y_test)
print("ANN with Validation Loss: ", round(annValidationEval[0],4), "Accuracy: ", round(annValidationEval[1],4))

The NN did do better on the test set in terms of accuracy, but its loss is quite a bit worse

## ANN with Regularisation

OK, let's try smoothing some of that out by adding L2 regularisation to the layers
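
For reference, an L2 regulariser with factor 0.01 adds 0.01 times the sum of the squared values of whatever it is attached to. The NumPy sketch below works that out for a made-up kernel and bias; `W`, `b` and `data_loss` are hypothetical numbers, not values from the model.

```python
import numpy as np

l2_factor = 0.01

W = np.array([[0.5, -1.0],
              [2.0,  0.0]])   # made-up kernel (weight matrix)
b = np.array([0.1, -0.2])     # made-up bias vector

kernel_penalty = l2_factor * np.sum(W ** 2)  # 0.01 * 5.25
bias_penalty = l2_factor * np.sum(b ** 2)    # 0.01 * 0.05

data_loss = 0.45  # hypothetical cross-entropy on a batch
total_loss = data_loss + kernel_penalty + bias_penalty

print(kernel_penalty, bias_penalty, total_loss)
```

Large weights now cost something, so the optimiser is pushed towards smaller ones. Keras includes these penalties in the loss it reports, so the loss of a regularised model is not measured on quite the same scale as the loss of an unregularised one.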

In [None]:
#Defining model with L2 regularisation
model_tf = tf.keras.Sequential([
    normalizer,
    tf.keras.layers.Dense(100,activation='relu',kernel_regularizer=tf.keras.regularizers.l2(0.01), bias_regularizer=tf.keras.regularizers.l2(0.01)),
    tf.keras.layers.Dense(100,activation='relu',kernel_regularizer=tf.keras.regularizers.l2(0.01), bias_regularizer=tf.keras.regularizers.l2(0.01)),
    tf.keras.layers.Dense(100,activation='relu',kernel_regularizer=tf.keras.regularizers.l2(0.01), bias_regularizer=tf.keras.regularizers.l2(0.01)),
    tf.keras.layers.Dense(1)
])

model_tf.compile(
    optimizer=tf.optimizers.Adam(learning_rate=0.01),
    metrics=['accuracy'],   #as a list; newer Keras versions reject a bare string
    loss=tf.keras.losses.BinaryCrossentropy(from_logits=True)
)

history = model_tf.fit(X_train, y_train, epochs=100, validation_split=0.2)

In my run the last epoch gave val_loss: 0.4876 - val_accuracy: 0.8462, so better than the last NN we built

In [None]:
plot_loss(history)

That graph looks a lot better with some regularisation

In [None]:
annValidationRegularisationEval = model_tf.evaluate(X_test, y_test)
print("ANN with Regularisation & Validation Loss: ", round(annValidationRegularisationEval[0],4), "Accuracy: ", round(annValidationRegularisationEval[1],4))

In [None]:
print("Logistic Regression Loss: ", round(logRegEval[0],4), "Accuracy: ", round(logRegEval[1],4))
print("ANN Loss: ", round(annEval[0],4), "Accuracy: ", round(annEval[1],4))
print("Logistic Regression with Validation Loss: ", round(logRegValidationEval[0],4), "Accuracy: ", round(logRegValidationEval[1],4))
print("ANN with Validation Loss: ", round(annValidationEval[0],4), "Accuracy: ", round(annValidationEval[1],4))
print("ANN with Regularisation & Validation Loss: ", round(annValidationRegularisationEval[0],4), "Accuracy: ", round(annValidationRegularisationEval[1],4))

Interesting to see that the ANN doesn't do the best in terms of loss but does well in accuracy performance. This shows that optimising for loss and optimising for accuracy are not the same thing. It also shows that a simpler model can sometimes be the better choice