# Intro to Neural Networks with Keras

In [1]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split

As in last week's tutorial, we will try to predict `total_amount` from `fare_amount`, `tip_amount`, `tolls_amount`, `trip_distance`, and `VendorID`.

In [2]:
COLS = ['total_amount', 'fare_amount', 'tip_amount', 'tolls_amount', 'trip_distance', 'VendorID']

df = pd.read_parquet("../../data/tute_data/sample_data.parquet")
df = df[COLS]
df

Unnamed: 0,total_amount,fare_amount,tip_amount,tolls_amount,trip_distance,VendorID
0,30.30,23.50,3.00,0.00,4.30,2
1,9.79,4.50,1.49,0.00,0.50,2
2,25.80,22.00,0.00,0.00,7.37,2
3,16.56,10.00,2.76,0.00,1.85,2
4,29.76,21.00,4.96,0.00,5.88,2
...,...,...,...,...,...,...
123340,13.18,8.97,0.91,0.00,1.59,2
123341,20.08,14.78,2.00,0.00,3.74,2
123342,48.16,29.76,8.55,6.55,7.97,2
123343,16.76,10.51,2.95,0.00,2.28,2


One-hot encode the categorical `VendorID` column. We could also choose to handle the instances with IDs 5 and 6 separately.

In [3]:
# dtype=int keeps the dummy columns as 0/1 integers (pandas >= 2.0 defaults to bool)
df = pd.get_dummies(df, columns=['VendorID'], dtype=int)
df

Unnamed: 0,total_amount,fare_amount,tip_amount,tolls_amount,trip_distance,VendorID_1,VendorID_2,VendorID_5,VendorID_6
0,30.30,23.50,3.00,0.00,4.30,0,1,0,0
1,9.79,4.50,1.49,0.00,0.50,0,1,0,0
2,25.80,22.00,0.00,0.00,7.37,0,1,0,0
3,16.56,10.00,2.76,0.00,1.85,0,1,0,0
4,29.76,21.00,4.96,0.00,5.88,0,1,0,0
...,...,...,...,...,...,...,...,...,...
123340,13.18,8.97,0.91,0.00,1.59,0,1,0,0
123341,20.08,14.78,2.00,0.00,3.74,0,1,0,0
123342,48.16,29.76,8.55,6.55,7.97,0,1,0,0
123343,16.76,10.51,2.95,0.00,2.28,0,1,0,0
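
One possible way to handle vendor IDs 5 and 6 is to merge them into a single category before encoding. The sketch below is only an illustration: the toy DataFrame, the `RARE_IDS` list, and the `"other"` label are all assumptions, and whether IDs 5 and 6 are actually infrequent in the real data is not shown above.

```python
import pandas as pd

# Hypothetical toy data standing in for the taxi sample
toy = pd.DataFrame({"VendorID": [1, 2, 2, 5, 6, 2, 1]})

# Assumption: these are the vendor IDs we want to group together
RARE_IDS = [5, 6]

# Replace the chosen IDs with one shared label, keep the rest as strings
is_rare = toy["VendorID"].isin(RARE_IDS)
toy["VendorID"] = toy["VendorID"].astype(str).mask(is_rare, "other")

# One-hot encode as before; dtype=int gives 0/1 columns
encoded = pd.get_dummies(toy, columns=["VendorID"], dtype=int)
print(encoded.columns.tolist())
```

Grouping trades a little information for fewer, better-populated input columns; dropping those rows entirely would be another option.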


Prepare the training and test datasets (the validation set is split off from the training data later, when we fit the model):

In [4]:
TARGET_COLS = ['total_amount']

train, test = train_test_split(df, train_size=0.8, random_state=0)

X_train, y_train = train.drop(TARGET_COLS, axis=1), train[TARGET_COLS]
X_test, y_test = test.drop(TARGET_COLS, axis=1), test[TARGET_COLS]

print(f'{len(X_train)} training instances, {len(X_test)} test instances')

98676 training instances, 24669 test instances


### Prepare NN
We will use `keras`, which is built on top of `TensorFlow` and provides a very beginner-friendly interface for building NNs.

In [6]:
from tensorflow import keras
from tensorflow.keras.layers import Dense, Normalization

In most cases it is recommended to normalise the inputs to a NN, so that features with a larger magnitude don't dominate when the weights are initialised to similar values.

We can use the `Normalization` layer from `keras` to do this automatically.

In [8]:
# Set up a normalization layer and adapt it to the training set so that it knows
# which mean and variance to use when normalising
norm_layer = Normalization()
norm_layer.adapt(X_train)
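
To make the normalisation step concrete, here is a minimal NumPy sketch of roughly what an adapted `Normalization` layer computes: each feature has its mean subtracted and is divided by its standard deviation. The small array reuses a few `fare_amount` and `trip_distance` values from the sample rows above; the variable names are illustrative only.

```python
import numpy as np

# A few (fare_amount, trip_distance) rows taken from the sample shown earlier
X = np.array([
    [23.5, 4.30],
    [4.5, 0.50],
    [22.0, 7.37],
    [10.0, 1.85],
])

# "Adapting" amounts to computing per-feature statistics on the training data
mean = X.mean(axis=0)
std = X.std(axis=0)  # population standard deviation

# Applying the layer then rescales every feature to mean 0 and sd 1
X_norm = (X - mean) / std
print(X_norm.mean(axis=0).round(6), X_norm.std(axis=0).round(6))
```

The statistics come from the training set only, which is why `adapt` is called with `X_train` rather than the full dataset.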

Now we can assemble a simple sequential NN using our normalisation layer.

There are a lot of design decisions to experiment with here, including:
- the number of (hidden) layers,
- the number of nodes in each layer,
- the activation functions in each layer,
- the type of layers we use,
- ...
___

1. We know we want to normalise the data first, so put that at the start.
2. We know we are trying to predict a single target variable `total_amount`, so our final layer will have a single node.
3. The target, `total_amount`, should be non-negative, so it makes sense to use `relu` activation for the output ($f(x) = \max(0, x)$).
    - Other options are `linear`, `softmax`, `tanh`, `sigmoid`, and more
        - `sigmoid`, for example, is useful when predicting a target probability $p \in (0, 1)$, while `softmax` produces a probability distribution over several classes
4. This model is quite simple (modelling total amount from features which sum to the total) so we can try just a single hidden layer.
    - We can also see what happens if we don't include a hidden layer (the model will then only be able to represent a linear function of the inputs, passed through the output activation)
5. Generally we'll pick a number of nodes in the hidden layer which is between the sizes of the input and output layers. There are 8 features and 1 output, so let's start with 5 and experiment.

Take a look at https://stats.stackexchange.com/a/180052 for some rules of thumb about setting our layer and node counts.

In [16]:
model = keras.Sequential(
    [   
        norm_layer,                   # our normalisation layer receives the input
        Dense(5, activation='relu'),  # the hidden layer gets the normalised result
        # Dense(3, activation='relu'),  # (in case you want to try an extra hidden layer)
        Dense(1, activation='relu')   # and the output layer has a single node which will estimate total_amount
    ]
)
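
For intuition, the two `Dense` layers above can be written out by hand. The following NumPy sketch uses randomly initialised weights (not the weights Keras would learn) and assumes the inputs have already been normalised; it shows the shapes involved and how many trainable parameters an 8 → 5 → 1 network has.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_hidden, n_out = 8, 5, 1

# Randomly initialised weights and zero biases (illustrative, not trained)
W1 = rng.normal(size=(n_features, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(size=(n_hidden, n_out))
b2 = np.zeros(n_out)


def relu(z):
    """ReLU activation: element-wise max(0, z)."""
    return np.maximum(0.0, z)


def forward(x):
    """Forward pass: hidden Dense(5, relu) followed by output Dense(1, relu)."""
    hidden = relu(x @ W1 + b1)
    return relu(hidden @ W2 + b2)


x = rng.normal(size=(4, n_features))  # a batch of 4 already-normalised rows
y_hat = forward(x)
n_params = W1.size + b1.size + W2.size + b2.size

print(y_hat.shape, n_params)  # (4, 1) 51
```

The 51 trainable parameters are 8 × 5 + 5 for the hidden layer plus 5 × 1 + 1 for the output layer.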

Now we need to decide what optimiser to use with our model, and what loss function we want to try and minimise. `keras` gives us lots of options which we can look at here:
- https://www.tensorflow.org/api_docs/python/tf/keras/optimizers
- https://www.tensorflow.org/api_docs/python/tf/keras/losses

In [17]:
model.compile(
    optimizer='adam',  # Adam is an adaptive variant of stochastic gradient descent; generally fast and a good default
    loss='MSE'  # Mean Squared Error makes sense for this problem,
                # though we could use Mean Absolute Error, or many other choices.
                # Classification outputs would use a different loss (e.g. BinaryCrossentropy)
)
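
To see how the choice between Mean Squared Error and Mean Absolute Error matters, the sketch below computes both on a small set of made-up values where one prediction is far off. The numbers are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import numpy as np

# Hypothetical targets and predictions; the last prediction is a large miss
y_true = np.array([10.0, 12.0, 15.0, 20.0])
y_pred = np.array([11.0, 13.0, 16.0, 30.0])

errors = y_pred - y_true          # [1, 1, 1, 10]
mse = np.mean(errors ** 2)        # (1 + 1 + 1 + 100) / 4
mae = np.mean(np.abs(errors))     # (1 + 1 + 1 + 10) / 4

print(f"MSE: {mse}, MAE: {mae}")  # MSE: 25.75, MAE: 3.25
```

Because the errors are squared, MSE penalises the single large miss much more heavily than MAE does, so a model trained with MSE works harder to avoid large errors.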

Now we can fit the model. We process the instances in batches of 16 and use a validation split of 0.25, which makes the validation set the same size as our test set (0.25 × 98676 = 24669), but these are hyperparameters.

The optimal number of epochs can be determined experimentally (often to minimise validation loss) or we could use `tf.keras.callbacks.EarlyStopping` to do this automatically.

(https://machinelearningmastery.com/difference-between-a-batch-and-an-epoch/ is a good article explaining batches and epochs!)
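
As a rough sketch of the `EarlyStopping` option mentioned above, the example below trains a small model on synthetic data and stops once the validation loss has not improved for a few epochs. The data, the layer sizes, `patience=3`, and `MAX_EPOCHS` are all assumptions made for illustration, not settings taken from this tutorial.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras.layers import Dense

# Hypothetical synthetic regression data: the target is the sum of three features
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 3)).astype("float32")
y = X.sum(axis=1, keepdims=True)

model = keras.Sequential([
    keras.Input(shape=(3,)),
    Dense(5, activation="relu"),
    Dense(1),
])
model.compile(optimizer="adam", loss="mse")

# Stop when val_loss has not improved for 3 consecutive epochs,
# and roll back to the weights from the best epoch seen
early_stop = keras.callbacks.EarlyStopping(
    monitor="val_loss",
    patience=3,
    restore_best_weights=True,
)

MAX_EPOCHS = 50  # upper bound only; training may stop sooner
history = model.fit(
    X, y,
    batch_size=32,
    validation_split=0.25,
    epochs=MAX_EPOCHS,
    callbacks=[early_stop],
    verbose=0,
)

epochs_run = len(history.history["val_loss"])
print(f"stopped after {epochs_run} of {MAX_EPOCHS} epochs")
```

With this callback, `epochs` acts as a maximum rather than a fixed count, so it can be set generously.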

In [18]:
history = model.fit(
    x=X_train,
    y=y_train,
    batch_size=16,
    validation_split=0.25,
    epochs=10
)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
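
The sizes behind this `fit` call can be checked with a little arithmetic. The sketch below assumes Keras rounds the split point down; Keras takes the validation samples from the end of the training data, before any shuffling.

```python
import math

n_train_total = 98676     # training instances reported after train_test_split
validation_split = 0.25
batch_size = 16

# Rows used for gradient updates vs. rows held out for validation
n_fit = math.floor(n_train_total * (1 - validation_split))
n_val = n_train_total - n_fit

# Number of batches (weight updates) per epoch
steps_per_epoch = math.ceil(n_fit / batch_size)

print(n_fit, n_val, steps_per_epoch)  # 74007 24669 4626
```

So each of the 10 epochs performs 4626 weight updates, and the 24669 validation rows match the size of the test set.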


Let's look at some predictions:

In [19]:
comparison = y_test.iloc[:5].copy()
comparison.loc[:, 'prediction'] = model.predict(X_test.head())
comparison



Unnamed: 0,total_amount,prediction
65969,18.95,18.212635
18804,12.8,13.362486
31167,14.8,14.997655
85799,14.3,14.520094
110326,10.8,10.398303


We can also evaluate the performance of the model on the test dataset. `evaluate` returns the loss the model was compiled with, which here is the Mean Squared Error:

In [20]:
model.evaluate(
    x=X_test,
    y=y_test,
    batch_size=16,
)



3.073211193084717

And we can verify the MSE ourselves:

In [21]:
predictions = model.predict(X_test)
errors = np.array(predictions - y_test)
squared_errors = errors**2
mean_squared_error = squared_errors.mean()

print(f'MSE: {mean_squared_error}')

MSE: 3.0732131331264596


Finally, so that we can compare with last week, let's compute $R^2$:

In [22]:
tot_sum_squares = (np.array(y_test - y_test.mean())**2).sum()
r2 = 1 - (squared_errors.sum() / tot_sum_squares)
print(f'Model R^2: {r2:.4f}')

Model R^2: 0.9872
