# More Models!

Now let's try to come up with the best model we can for predicting expected goals! In this section, we try several different models, along with feature selection methods and hyperparameter tuning.

## Neural Network

First, we implemented a neural network with the Keras library. After tuning the hyperparameters, we found that the best results came from the SGD optimizer (learning rate of 0.0001) with one hidden layer of 16 neurons (ReLU activation) and binary cross-entropy as the loss function. The model trains for up to 50 epochs with early stopping: once at least 10 epochs have run, training stops if the model has not improved over the last 10 epochs, and the weights of the best model are kept.
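
The early-stopping rule described above can be illustrated without any deep-learning library. The sketch below is only an assumption about how the rule behaves: the actual `train` helper is not shown in this notebook, the function name `early_stopping_epoch` is hypothetical, and monitoring the validation loss is a guess. In Keras, a rule like this is typically configured through the `EarlyStopping` callback (`patience`, `restore_best_weights`).

```python
def early_stopping_epoch(val_losses, patience=10, min_epochs=10):
    """Return (stop_epoch, best_epoch), both 1-indexed.

    Assumed rule: track the lowest validation loss seen so far. Once at
    least `min_epochs` epochs have run, stop as soon as `patience` epochs
    have passed without a new best. The weights from `best_epoch` are kept.
    """
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        epochs_without_improvement = epoch - best_epoch
        if epoch >= min_epochs and epochs_without_improvement >= patience:
            return epoch, best_epoch
    # Ran through every epoch without triggering the rule
    return len(val_losses), best_epoch


# Hypothetical validation losses: improve for 15 epochs, then plateau
losses = [1.0 - 0.05 * e for e in range(15)] + [0.5] * 35
stop_epoch, best_epoch = early_stopping_epoch(losses)
print(f"stopped after epoch {stop_epoch}, keeping weights from epoch {best_epoch}")
```

With these made-up losses, training stops well before the 50-epoch budget and falls back to the weights from the last improving epoch.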

In [None]:
# for preprocessing
from sklearn.compose import ColumnTransformer
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import StandardScaler
from utils.model_utils import *
from neural_network import *

# for model training
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_curve, auc
from sklearn.model_selection import train_test_split
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.optimizers import SGD  # same Keras namespace as the imports above

# for plotting
from utils.plot_utils import *

In [None]:
# Load dataset
df = pd.read_csv('advanced_models_data.csv')

# Preprocess data
X_res_scaled, y_res = preprocess_neural_network_rfc(df)

# Split the data into training, validation and testing sets (60% / 20% / 20%)
X_train, X_test, y_train, y_test = train_test_split(X_res_scaled, y_res, test_size=0.2, shuffle=True)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, shuffle=True)  # 0.25 x 0.8 = 0.2

# Balance only the training set so that validation and test keep the real class ratio
X_train, y_train = balance_data(X_train, y_train)

# train model
model, history = train(X_train, y_train, X_val, y_val)

# save model
model.save("models/neural_network.h5")
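
Rebalancing only the training split keeps the validation and test sets at the real goal rate. The sketch below illustrates the idea on toy data. It is not the `balance_data` helper (whose code is not shown here, though the imports suggest SMOTE); it swaps in plain random oversampling, and the function name `random_oversample` and all numbers are hypothetical.

```python
import numpy as np


def random_oversample(X, y, seed=0):
    """Duplicate minority-class rows at random until all classes have equal size."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    keep = []
    for cls, count in zip(classes, counts):
        idx = np.flatnonzero(y == cls)
        extra = rng.choice(idx, size=target - count, replace=True)
        keep.append(np.concatenate([idx, extra]))
    keep = rng.permutation(np.concatenate(keep))
    return X[keep], y[keep]


# Toy data: 1000 "shots" with roughly 10% goals (made-up numbers)
rng = np.random.default_rng(42)
n = 1000
X = rng.normal(size=(n, 4))
y = (rng.random(n) < 0.1).astype(int)

# 60% / 20% / 20% split on shuffled indices
order = rng.permutation(n)
train_idx, val_idx, test_idx = np.split(order, [n * 6 // 10, n * 8 // 10])

# Oversample the training split only
X_train_bal, y_train_bal = random_oversample(X[train_idx], y[train_idx])
print("train before:", np.bincount(y[train_idx]), "after:", np.bincount(y_train_bal))
print("val:", np.bincount(y[val_idx]), "test:", np.bincount(y[test_idx]))
```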

In [None]:
# make predictions
predictions = model.predict(X_test)

In [None]:
# plot the training accuracy
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'val'], loc='upper left')
plt.savefig('model_accuracy_corr.png')  # save before show(), otherwise the saved figure is blank
plt.show()


# plot the training loss
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'val'], loc='upper left')
plt.savefig('model_loss_corr.png')  # save before show(), otherwise the saved figure is blank
plt.show()

In [None]:
# Evaluate the model
test_loss, test_accuracy = model.evaluate(X_test, y_test)
print(f'Test accuracy: {test_accuracy * 100:.2f}%')

# threshold the predicted probabilities at 0.5 and flatten to a 1D array of class labels
preds = np.round(model.predict(X_test), 0).flatten()

# confusion matrix
from sklearn.metrics import ConfusionMatrixDisplay

print(f1_score(y_test, preds, average="macro"))
ConfusionMatrixDisplay(confusion_matrix(y_test, preds)).plot()
plt.show()
print(classification_report(y_test, preds))
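
To make the numbers above easier to interpret, here is a minimal NumPy sketch of how a confusion matrix and the macro-averaged F1 score are obtained from predicted probabilities. The function name `binary_report` and the toy data are hypothetical; the rows of the matrix are true classes and the columns are predicted classes, which is the same layout scikit-learn uses.

```python
import numpy as np


def binary_report(y_true, probs, threshold=0.5):
    """Return (2x2 confusion matrix, macro F1) for binary labels."""
    y_true = np.asarray(y_true).astype(int).ravel()
    # Strictly greater than 0.5 matches np.round on probabilities in [0, 1]
    y_pred = (np.asarray(probs).ravel() > threshold).astype(int)

    cm = np.zeros((2, 2), dtype=int)
    np.add.at(cm, (y_true, y_pred), 1)  # cm[true, predicted] += 1

    f1_scores = []
    for cls in (0, 1):
        tp = cm[cls, cls]
        fp = cm[:, cls].sum() - tp
        fn = cm[cls, :].sum() - tp
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    # Macro average: unweighted mean of the per-class F1 scores
    return cm, float(np.mean(f1_scores))


# Toy example: 4 misses and 2 goals
y_true = [0, 0, 0, 0, 1, 1]
probs = [0.1, 0.2, 0.6, 0.3, 0.8, 0.4]
cm, macro_f1 = binary_report(y_true, probs)
print(cm)
print(macro_f1)
```

Because the macro average weights both classes equally, a model that misses most goals scores poorly here even when its overall accuracy looks high.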

### Evaluation plots

We plot the ROC curve, the goal rate vs. probability percentile, the cumulative proportion of goals vs. probability percentile, and the reliability curve.

In [None]:
# ROC curve
plot_roc_curve_nn(predictions, y_test)
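
The plotting helper lives in `utils.plot_utils` and is not shown here. As a rough illustration of what a ROC curve computes, the sketch below builds the curve points and the area under it with NumPy only. The function names `roc_points` and `auc_trapezoid` are hypothetical and the data is a toy example.

```python
import numpy as np


def roc_points(y_true, scores):
    """Return (fpr, tpr) arrays, one point per distinct score threshold."""
    y_true = np.asarray(y_true).astype(float).ravel()
    scores = np.asarray(scores).astype(float).ravel()

    # Sort shots from highest to lowest predicted probability
    order = np.argsort(-scores, kind="stable")
    y_sorted, s_sorted = y_true[order], scores[order]

    # Keep one point per distinct score so tied scores share a threshold
    last = np.r_[np.flatnonzero(np.diff(s_sorted)), len(s_sorted) - 1]
    tps = np.cumsum(y_sorted)[last]
    fps = np.cumsum(1 - y_sorted)[last]

    tpr = np.concatenate([[0.0], tps / tps[-1]])
    fpr = np.concatenate([[0.0], fps / fps[-1]])
    return fpr, tpr


def auc_trapezoid(fpr, tpr):
    """Area under the curve using the trapezoidal rule."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))


fpr, tpr = roc_points([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc_trapezoid(fpr, tpr))
```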

In [None]:
# make the probability predictions 1D
predictions = predictions.flatten()
predictions

In [None]:
# goal rate vs probability percentile
shot_prob_model_percentile_nn(predictions, y_test)
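
For readers without access to `utils.plot_utils`, the following is one plausible way to compute the goal rate per probability percentile: rank shots by predicted probability, group them into equal-sized percentile bins, and take the share of goals in each bin. The function name `goal_rate_by_percentile`, the bin count, and the simulated data are assumptions, not the notebook's actual helper.

```python
import numpy as np


def goal_rate_by_percentile(probs, y_true, n_bins=10):
    """Goal rate (in %) for each percentile bin, from lowest to highest probability."""
    probs = np.asarray(probs).ravel()
    y_true = np.asarray(y_true).ravel()
    n = len(probs)

    # Rank of each shot (0 = lowest predicted probability)
    ranks = probs.argsort(kind="stable").argsort()
    # Map ranks to equal-sized bins using integer arithmetic
    bin_idx = ranks * n_bins // n

    return np.array([100.0 * y_true[bin_idx == b].mean() for b in range(n_bins)])


# Simulated shots whose outcomes follow the predicted probabilities
rng = np.random.default_rng(0)
sim_probs = rng.random(5000) * 0.5
sim_goals = (rng.random(5000) < sim_probs).astype(int)
print(np.round(goal_rate_by_percentile(sim_probs, sim_goals), 1))
```

For a model that ranks shots well, the goal rate should rise from the lowest bin to the highest.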

In [None]:
# cumulative proportion of goals vs probability percentile
plot_cumulative_sum_nn(predictions, y_test)
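
The cumulative curve answers the question "what share of all goals comes from the top x% of shots, ranked by the model?". A minimal sketch of that computation, assuming shots are sorted from highest to lowest predicted probability (the function name `cumulative_goal_share` is hypothetical, and the real helper may orient its axis differently):

```python
import numpy as np


def cumulative_goal_share(probs, y_true):
    """Return (share of shots considered, share of goals captured)."""
    probs = np.asarray(probs).ravel()
    y_true = np.asarray(y_true).ravel()

    # Highest-probability shots first
    order = np.argsort(-probs, kind="stable")
    goals_sorted = y_true[order]

    goal_share = np.cumsum(goals_sorted) / goals_sorted.sum()
    shot_share = np.arange(1, len(probs) + 1) / len(probs)
    return shot_share, goal_share


shot_share, goal_share = cumulative_goal_share([0.9, 0.1, 0.8, 0.2], [1, 0, 1, 0])
print(shot_share)
print(goal_share)
```

In this toy case the two highest-ranked shots (half of all shots) already account for every goal, which is what a perfectly ranked model would look like.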

In [None]:
# reliability curve
plot_calibration_curve_nn(predictions, y_test)
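
A reliability (calibration) curve compares the mean predicted probability in each bin with the observed fraction of goals in that bin; a well-calibrated model lies close to the diagonal. The sketch below uses equal-width bins on [0, 1], which is an assumption about the binning strategy; the function name `reliability_curve` is hypothetical.

```python
import numpy as np


def reliability_curve(probs, y_true, n_bins=10):
    """Return (mean predicted probability, observed goal fraction) per non-empty bin."""
    probs = np.asarray(probs, dtype=float).ravel()
    y_true = np.asarray(y_true, dtype=float).ravel()

    # Interior edges of equal-width bins; digitize yields indices 0..n_bins-1
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_idx = np.digitize(probs, edges[1:-1])

    mean_pred, frac_goals = [], []
    for b in range(n_bins):
        mask = bin_idx == b
        if mask.any():  # skip bins with no shots
            mean_pred.append(probs[mask].mean())
            frac_goals.append(y_true[mask].mean())
    return np.array(mean_pred), np.array(frac_goals)


mean_pred, frac_goals = reliability_curve([0.05, 0.15, 0.85, 0.95], [0, 0, 1, 1], n_bins=2)
print(mean_pred, frac_goals)
```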