In [71]:
%load_ext tensorboard

<h2> Defining the base models </h2>

<h4>In this chapter we build a few base models, including a naive baseline, using helper functions from the project to load and split the data. </h4>
This is a classification problem, and that drives the modelling choices below.

The following models are built:

<li> A naive/simple model </li>
<li> A random forest model </li>
<li> An initial simple neural network </li>
<li> A deeper neural network </li>

<p>
The simple model serves as a baseline. We should be able to beat it; otherwise the project would not make sense at all.
We also train a random forest to see how a traditional model performs on the same data.

In [3]:
import numpy as np
from pathlib import Path
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import sys, os
from loguru import logger
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers  # public Keras API instead of the private tensorflow.python.keras
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from tensorflow.keras.layers import Dense, Flatten, Input
from tensorflow.keras.models import Sequential
from tensorflow.keras.callbacks import EarlyStopping, TensorBoard

sys.path.append('..')

from definitions import get_project_root
from src.data.make_dataset import create_train_test_validation
from src.visualization.visualize import plot_results
from src.models.train_model import simple_baseline

root = get_project_root()


 The versions of TensorFlow you are currently using is 2.8.0 and is not supported. 
Some things might work, some things might not.
If you were to encounter a bug, do not file an issue.
If you want to make sure you're using a tested and supported configuration, either change the TensorFlow version or the TensorFlow Addons's version. 
You can find the compatibility matrix in TensorFlow Addon's readme:
https://github.com/tensorflow/addons
2022-02-13 13:39:14.914 | INFO     | src.data.make_dataset:create_train_test_validation:73 - found file labeled_data.csv, procceed with creating train, test and validation sets


In [5]:
## Create train, validation and test sets
x_train, x_valid, x_test, y_train, y_valid, y_test = create_train_test_validation()
x_train.shape, y_train.shape, x_valid.shape, y_valid.shape, x_test.shape, y_test.shape

2022-02-13 13:39:21.608 | INFO     | src.data.make_dataset:create_train_test_validation:73 - found file labeled_data.csv, procceed with creating train, test and validation sets


((61711, 23), (61711, 1), (13225, 23), (13225, 1), (13224, 23), (13224, 1))
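
The implementation of create_train_test_validation is not shown here, but the shapes printed above tell us how the data was divided. A minimal sketch that recomputes the split proportions from those row counts (the dictionary keys are just illustrative names):

```python
# Row counts copied from the shapes printed above
sizes = {"train": 61711, "valid": 13225, "test": 13224}

total = sum(sizes.values())
fractions = {name: n / total for name, n in sizes.items()}

# Print each split's share of the full dataset
for name, frac in fractions.items():
    print(f"{name}: {frac:.3f}")
```

The sizes work out to roughly 70% training, 15% validation and 15% test data.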

In [1]:
## Initialize empty scores
result = {}
score = {}

<h2> Simple model </h2>

In [28]:
## Load in the simple model (prediction: song = rock)

score['simple_baseline'] = simple_baseline()
score['simple_baseline']

0.3842256503327284

So if we always predict the genre 'Rock', we get an accuracy of 38.4%. That is due to the significant class imbalance we already observed during the EDA.
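
The source of simple_baseline is not shown in this notebook, so the following is only a sketch of the idea behind it: always predicting the most frequent class yields an accuracy equal to that class's share of the labels. The function name and the toy labels are assumptions made for illustration, not the project's code or data.

```python
from collections import Counter

def majority_class_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    _, majority_count = counts.most_common(1)[0]
    return majority_count / len(labels)

# Toy labels for illustration only (not the project's data)
toy_labels = ["Rock"] * 5 + ["Pop"] * 3 + ["Jazz"] * 2
baseline_acc = majority_class_accuracy(toy_labels)
print(baseline_acc)
```

Any model that scores below this number has learned less than the class distribution alone provides.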

In [6]:
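## Random forest with 100 trees
rf_clf = RandomForestClassifier(n_estimators=100, random_state=0)
## y_train has shape (n, 1); flatten it to the 1-D label array scikit-learn expects
rf_clf.fit(x_train, np.ravel(y_train))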


RandomForestClassifier(random_state=0)
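
The labels produced earlier have shape (61711, 1), while scikit-learn classifiers expect a 1-D label array and otherwise emit a DataConversionWarning. A small sketch of flattening such a column vector with NumPy; the toy array below stands in for y_train and is not the project's data:

```python
import numpy as np

# Toy column vector standing in for y_train, shape (n, 1)
y_column = np.array([[3], [0], [7], [3]])

# np.ravel returns a flattened 1-D view where possible
y_flat = np.ravel(y_column)
print(y_column.shape, y_flat.shape)
```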

In [7]:
score['randomforest'] = rf_clf.score(x_test,y_test)
score['randomforest']

0.5569419237749547

The random forest reaches a test accuracy of 55.7%.
Since the objective of this course is to build neural networks, we won't spend more time optimizing it.

Can our first base model beat the random forest?

<h2> Baseline model </h2>

For the base model, we made the following design choices:

<li> We use the Sequential API. </li>
<li> We use the 'relu' activation function in the input and hidden layers. </li>
<li> We need a softmax activation on the output layer, with one unit for each of the 15 classes. </li>
<li> We use sparse_categorical_crossentropy as the loss function because our target is categorical and stored as integer class labels. </li>
<li> We use early stopping to halt training when the model stops improving. </li>
<li> We use three layers: 1 input, 1 hidden and 1 output layer. </li>
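
To make the softmax and loss choices in the list above concrete, here is a minimal NumPy sketch of what the two steps compute. It is an illustration with made-up logits, not the Keras implementation; the function names mirror the Keras ones only for readability.

```python
import numpy as np

def softmax(logits):
    """Turn each row of raw scores into probabilities that sum to 1."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def sparse_categorical_crossentropy(y_true, probs):
    """Per-sample loss: negative log-probability of the true integer label."""
    y_true = np.asarray(y_true).reshape(-1)  # accepts shape (n,) or (n, 1)
    picked = probs[np.arange(len(y_true)), y_true]
    return -np.log(picked)

# Made-up logits for two samples and three classes
logits = np.array([[2.0, 1.0, 0.1], [0.0, 0.0, 0.0]])
probs = softmax(logits)
losses = sparse_categorical_crossentropy([[0], [2]], probs)
print(probs, losses)
```

The "sparse" variant takes the integer label directly, so the targets do not need to be one-hot encoded first.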

In [30]:
## first neural network.

early_stop = EarlyStopping(patience=5,restore_best_weights=True)
tensorboard_callback = TensorBoard(log_dir = root / 'src' / 'logs',histogram_freq=1) 

base_model = Sequential(
    [   
        Dense(23,activation='relu', name = 'input',input_shape=(len(x_train[0]),)),
        Dense(100,activation='relu', name = 'hidden_1'),
        Dense(15, activation='softmax',name='output')
    ]
)

base_model.summary()
base_model.compile(optimizer='adam',loss='sparse_categorical_crossentropy',metrics=['accuracy']) 
result['base'] = base_model.fit(x_train, y_train, epochs = 100, validation_data=(x_valid,y_valid),callbacks=[early_stop,tensorboard_callback],verbose=1)
score['base'] = base_model.evaluate(x_test,y_test)

Model: "sequential_4"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input (Dense)                (None, 23)                552       
_________________________________________________________________
hidden_1 (Dense)             (None, 100)               2400      
_________________________________________________________________
output (Dense)               (None, 15)                1515      
Total params: 4,467
Trainable params: 4,467
Non-trainable params: 0
_________________________________________________________________
Epoch 1/100 through Epoch 25/100 (per-epoch progress output not preserved)


In [31]:
score['base']

[1.3466856479644775, 0.5502873659133911]

During training the validation accuracy moves towards 56%, which would beat the random forest by a small margin. Unfortunately, on the test set the base model scores an accuracy of 55.0%, just below the random forest's 55.7%.
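
Training ended at epoch 25 of 100 because of the EarlyStopping callback, which by default monitors the validation loss. The sketch below mimics the patience logic in plain Python to show how the stopping epoch and the restored epoch relate. It is a simplified illustration with made-up loss values, not the Keras source code.

```python
def early_stopping_epoch(val_losses, patience=5):
    """Return (epoch at which training stops, epoch with the best loss)."""
    best_loss = float("inf")
    best_epoch = None
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            # Improvement: remember this epoch and reset the counter
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch

# Made-up validation losses for illustration
toy_losses = [1.50, 1.42, 1.38, 1.36, 1.37, 1.36, 1.38, 1.39, 1.40, 1.41]
stopped_at, best_at = early_stopping_epoch(toy_losses, patience=5)
print(stopped_at, best_at)
```

With restore_best_weights=True, the model keeps the weights from the best epoch rather than those from the final epoch.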

In [74]:
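## Let's have a look at the results
##plot_results(result,ymin=0,ymax=1,yscale="linear")
## %tensorboard needs the log directory; the relative path assumes this notebook lives one level below the project root
%tensorboard --logdir ../src/logs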


ERROR: Could not find `tensorboard`. Please ensure that your PATH
contains an executable `tensorboard` program, or explicitly specify
the path to a TensorBoard binary by setting the `TENSORBOARD_BINARY`
environment variable.

<h2> Adding more complexity to baseline model </h2>

To see if we can improve on this, we deepen the model to check whether more layers give better results.

In [32]:
## second neural network.

tensorboard_callback = TensorBoard(log_dir = root / 'src' / 'logs',histogram_freq=1) 

base_model_deep = Sequential(
    [   
        Dense(23,activation='relu', name = 'input',input_shape=(len(x_train[0]),)),
        Dense(150,activation='relu'),
        Dense(50,activation='relu'),
        Dense(50,activation='relu'),
        Dense(50,activation='relu'),
        Dense(15, activation='softmax',name='output')
    ]
)

base_model_deep.build()
base_model_deep.summary()

base_model_deep.compile(optimizer='adam',loss='sparse_categorical_crossentropy',metrics=['accuracy'])
base_model_deep.fit(x_train,y_train,epochs=100,validation_data=(x_valid,y_valid),callbacks=[early_stop,tensorboard_callback],verbose=1)

score['base_deep'] = base_model_deep.evaluate(x_test,y_test)

Model: "sequential_5"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input (Dense)                (None, 23)                552       
_________________________________________________________________
dense_4 (Dense)              (None, 150)               3600      
_________________________________________________________________
dense_5 (Dense)              (None, 50)                7550      
_________________________________________________________________
dense_6 (Dense)              (None, 50)                2550      
_________________________________________________________________
dense_7 (Dense)              (None, 50)                2550      
_________________________________________________________________
output (Dense)               (None, 15)                765       
Total params: 17,567
Trainable params: 17,567
Non-trainable params: 0
__________________________________________________
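
The parameter counts in both model summaries can be checked by hand: a Dense layer has one weight per input-unit pair plus one bias per unit. A short sketch that recomputes the totals, assuming 23 input features as in the shapes printed earlier (the helper names are illustrative):

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

def count_params(n_features, layer_units):
    """Total parameters for a stack of Dense layers."""
    total, n_inputs = 0, n_features
    for units in layer_units:
        total += dense_params(n_inputs, units)
        n_inputs = units  # this layer's output feeds the next layer
    return total

base_total = count_params(23, [23, 100, 15])
deep_total = count_params(23, [23, 150, 50, 50, 50, 15])
print(base_total, deep_total)
```

The deeper model has roughly four times as many parameters as the base model.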

In [None]:
score

Unfortunately, adding more layers slightly decreases performance rather than improving it. During the hypertuning step we will see if we can find an optimum.

In [36]:
## Saving base model
file_model = root / 'src' / 'models' / 'best_base_model.model' 
base_model.save(file_model)

INFO:tensorflow:Assets written to: c:\Users\huube\OneDrive\Master of Informatics\Machine Learning\Eindopdracht\src\models\best_base_model.model\assets


{'dummy_baseline': 0.3842256503327284,
 'base': [1.3469150066375732, 0.5529340505599976],
 'base2': [1.3545866012573242, 0.5489261746406555]}
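
The score dictionary mixes two formats: a plain accuracy for the baseline and a [loss, accuracy] pair for the Keras models. A small sketch of ranking the entries by accuracy, using the values printed above (the helper name accuracy_of is an assumption for illustration):

```python
def accuracy_of(entry):
    """Keras evaluate() returns [loss, accuracy]; other entries are plain accuracies."""
    if isinstance(entry, (list, tuple)):
        return entry[1]
    return entry

# Values copied from the output above
scores = {
    "dummy_baseline": 0.3842256503327284,
    "base": [1.3469150066375732, 0.5529340505599976],
    "base2": [1.3545866012573242, 0.5489261746406555],
}

# Model names sorted from highest to lowest accuracy
ranking = sorted(scores, key=lambda name: accuracy_of(scores[name]), reverse=True)
print(ranking)
```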

<h2> Conclusions </h2>

<p> In this notebook we tried a few models and saved the best one for the hypertuning phase. The simple base model performed slightly better than the deeper one, but by such a small margin that we let the hypertuner settle the choice of architecture.