![](https://miro.medium.com/max/1200/0*3I4P4pkL1xySQS9B.png)

# How to Structure a Machine Learning project

Here we study the best strategies to work on a big problem in machine learning applications.

[Here](https://jamboard.google.com/d/1z45e4QmQ0iZAVbgoVi4QxD7DjhW9sX8i2-1_TKKkmlo/edit?usp=sharing) is a brief case study for this lecture.

## Cross Validation and Parameter Choice

![title](https://www.researchgate.net/publication/307087929/figure/fig6/AS:399685689856008@1472303902570/For-cross-validation-and-cross-testing-data-are-divided-into-two-separate-sets-only.png)

## Cross validation

In the previous discussion, we set the validation set aside. It is now time to deal with it.

The train/test split may introduce an error, because it may exclude data that are crucial for the algorithm. For example, think about a binary classification problem in which the split completely excludes one class from the training set.
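A minimal sketch of this failure mode, assuming the iris dataset (three classes rather than two, stored sorted by label) and `train_test_split` from `sklearn`. The variable names are illustrative only. Without shuffling, one class never reaches the training set; the `stratify` argument is one way to keep every class on both sides of the split.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Iris labels are stored sorted: 50 samples of class 0, then 50 of class 1, then 50 of class 2
X, y = load_iris(return_X_y=True)

# Unshuffled split: the last 50 rows become the test set
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=50, shuffle=False)
classes_train_bad = set(np.unique(y_tr).tolist())
classes_test_bad = set(np.unique(y_te).tolist())
print("unshuffled -> train classes:", classes_train_bad, "test classes:", classes_test_bad)

# Stratified split: class proportions are preserved in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=50, stratify=y, random_state=0
)
counts_test_strat = np.bincount(y_te)
print("stratified -> test class counts:", counts_test_strat)
```

In the unshuffled case the training set never sees the third class, so no model trained on it can predict that class.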

A model trained on such a split fits only the data it happened to see and generalizes poorly, which is exactly what we are trying to avoid!

To reduce this risk, we can perform something called __cross validation__. It is very similar to a train/test split, but it is applied to more subsets: we split our data into $k$ subsets, train on $k-1$ of them, and hold out the remaining one for testing. We repeat this so that each subset is held out once.

There are several cross validation methods; we are going to go over two of them: the first is _K-Folds Cross Validation_ and the second is _Leave One Out Cross Validation_ (LOOCV).

### K-fold cross validation

In $K$-Folds Cross Validation we split our data into $k$ different subsets (or folds). We use $k-1$ subsets to train our model and leave the remaining subset (fold) as test data. We repeat this $k$ times, so that every fold is used once for testing, then average the scores over the folds and finalize our model. After that we test it against the test set.

![title](https://miro.medium.com/max/1400/1*J2B_bcbd1-s1kpWOu_FZrg.png)
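The averaging step can be sketched with `cross_val_score`, which fits a fresh copy of the estimator on each set of $k-1$ folds and scores it on the held-out fold. The iris dataset, the logistic regression classifier and the choice $k=5$ are assumptions made for illustration, not part of the lecture material.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)

# shuffle=True matters here because the iris labels are stored sorted by class
kf = KFold(n_splits=5, shuffle=True, random_state=0)
clf = LogisticRegression(max_iter=1000)

# One accuracy value per fold: train on 4 folds, score on the held-out one
scores = cross_val_score(clf, X, y, cv=kf)
print("accuracy per fold:", scores)
print(f"mean accuracy: {scores.mean():.3f} (+- {scores.std():.3f})")
```

The mean is the cross-validated estimate of the score, and the standard deviation gives a rough idea of how much that estimate depends on the particular split.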

#### Example

To have a concrete idea about how this works, we take an example directly from [sklearn documentation for $k$-fold](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html).

In [1]:
import numpy as np
from sklearn.model_selection import KFold # import KFold

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]]) # create an array
y = np.array([1, 2, 3, 4]) # Create another array

kf = KFold(n_splits=3) # Define the split - into 3 folds
kf.get_n_splits(X) # returns the number of splitting iterations in the cross-validator

print(kf)

KFold(n_splits=3, random_state=None, shuffle=False)


One can print out the folds.

In [2]:
for train_index, test_index in kf.split(X):
    print('TRAIN:', train_index, 'TEST:', test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

TRAIN: [2 3] TEST: [0 1]
TRAIN: [0 1 3] TEST: [2]
TRAIN: [0 1 2] TEST: [3]


As one can see, the function splits the original data into different subsets.
This is a very simple example, but it explains the concept pretty well.

### Leave One Out Cross Validation (LOOCV)

Another method we want to analyse is the so-called [Leave One Out Cross Validation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.LeaveOneOut.html).

In this type of cross validation, the number of folds (subsets) equals the number of observations we have in the dataset. Each observation is held out once as a single-sample test set while the model is trained on all the others, and the resulting scores are averaged. Because the number of training runs equals the number of samples, this method is computationally very expensive and should be used on small datasets. If the dataset is big, it is most likely better to use a different method, like $k$-fold.

Again, let's take the example from the [`sklearn` documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.LeaveOneOut.html).

In [3]:
from sklearn.model_selection import LeaveOneOut # Import LeaveOneOut

X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
y = np.array([1, 2, 3, 4])
loo = LeaveOneOut()
loo.get_n_splits(X)

4

In [4]:
for train_index, test_index in loo.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    print('X_train: ', X_train, 'X_test: ', X_test, 'y_train: ', y_train, 'y_test: ', y_test)
    print('='*25)

TRAIN: [1 2 3] TEST: [0]
X_train:  [[3 4]
 [5 6]
 [7 8]] X_test:  [[1 2]] y_train:  [2 3 4] y_test:  [1]
TRAIN: [0 2 3] TEST: [1]
X_train:  [[1 2]
 [5 6]
 [7 8]] X_test:  [[3 4]] y_train:  [1 3 4] y_test:  [2]
TRAIN: [0 1 3] TEST: [2]
X_train:  [[1 2]
 [3 4]
 [7 8]] X_test:  [[5 6]] y_train:  [1 2 4] y_test:  [3]
TRAIN: [0 1 2] TEST: [3]
X_train:  [[1 2]
 [3 4]
 [5 6]] X_test:  [[7 8]] y_train:  [1 2 3] y_test:  [4]
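The splits above only index the data; to obtain a LOOCV estimate one still has to fit and score a model on each split. A minimal sketch, assuming the iris dataset and a k-nearest-neighbours classifier (neither choice comes from the example above):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

loo = LeaveOneOut()
clf = KNeighborsClassifier(n_neighbors=5)

# One fit per observation: each score is 1.0 (hit) or 0.0 (miss) on a single sample
scores = cross_val_score(clf, X, y, cv=loo)
print("number of fits:", len(scores))
print(f"LOOCV accuracy: {scores.mean():.3f}")
```

The number of fits equals the number of samples, which is why the cost grows quickly with the dataset size.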


## Further Cross Validation methods

We presented two of the most used approaches to cross validation. However, one can check further methods on the [`sklearn` documentation webpage](https://scikit-learn.org/stable/modules/classes.html).
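As one illustration of these further methods, `StratifiedKFold` keeps the class proportions roughly equal in every fold. The sketch below compares it with plain `KFold` on made-up, imbalanced labels; the toy arrays are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy imbalanced labels: 8 samples of class 0, 4 of class 1 (made up for illustration)
y = np.array([0] * 8 + [1] * 4)
X = np.zeros((len(y), 1))  # features are irrelevant for the split itself

# Count how many samples of each class end up in every test fold
plain_counts = [
    np.bincount(y[test], minlength=2).tolist()
    for _, test in KFold(n_splits=4).split(X)
]
strat_counts = [
    np.bincount(y[test], minlength=2).tolist()
    for _, test in StratifiedKFold(n_splits=4).split(X, y)
]

print("KFold test class counts:          ", plain_counts)
print("StratifiedKFold test class counts:", strat_counts)
```

With plain `KFold` some test folds contain a single class, while the stratified version gives every fold the same 2:1 ratio as the full dataset.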

## Working Example

We want to use the known and loved iris dataset to build a neural network classifier. We will make use of cross-validation to choose hyperparameter values:
1. how many layers
2. how many hidden units
3. dropout rate

In [1]:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, KFold
from sklearn.preprocessing import LabelEncoder

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.losses import categorical_crossentropy
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.utils import to_categorical
import matplotlib.pyplot as plt

%matplotlib inline
%config InlineBackend.figure_format = "retina"

# Model configuration
batch_size = 10
loss_function = categorical_crossentropy
n_classes = 3
n_epochs = 20
n_folds = 7
optimizer = 'adam'  # passed by name so that every fold's model gets its own fresh Adam optimizer
validation_split = 0.2
verbosity = 1

# Load Iris data
X, Y = load_iris(return_X_y=True)

# Encode class values as integers
encoder = LabelEncoder()
encoder.fit(Y)
encoded_Y = encoder.transform(Y)
# Convert integers to dummy variables (i.e. one hot encoded)
Y = to_categorical(encoded_Y)

## Train-Test split
#X_train, X_test, Y_train, Y_test=train_test_split(X,Y,test_size=0.2)

# Define per-fold score containers
acc_per_fold = []
loss_per_fold = []

# Determine shape of the data
input_shape = X.shape[1]

# Define the K-fold Cross Validator
kfold = KFold(n_splits=n_folds, shuffle=True)

# K-fold Cross Validation model evaluation
fold = 1
for train, test in kfold.split(X, Y):

    # Create the model
    model = Sequential()
    model.add(Dense(32, activation='relu', input_dim = input_shape))
    model.add(Dropout(0.3))
    model.add(Dense(8, activation='relu'))
    model.add(Dense(n_classes, activation='softmax'))

    # Compile the model
    model.compile(loss=loss_function,
                  optimizer=optimizer,
                  metrics=['accuracy'])


    # Generate a print
    print('------------------------------------------------------------------------')
    print(f'Training for fold {fold} ...')

    # Fit data to model
    history = model.fit(X[train], Y[train],
              batch_size=batch_size,
              epochs=n_epochs,
              verbose=verbosity,
              validation_split=validation_split)

    # Generate generalization metrics
    scores = model.evaluate(X[test], Y[test], verbose=0)
    print(f'Score for fold {fold}: {model.metrics_names[0]} of {scores[0]}; {model.metrics_names[1]} of {scores[1]*100}%')
    acc_per_fold.append(scores[1] * 100)
    loss_per_fold.append(scores[0])

    # Increase fold number
    fold += 1

# == Provide average scores ==
print('------------------------------------------------------------------------')
print('Score per fold')
for i in range(0, len(acc_per_fold)):
    print('------------------------------------------------------------------------')
    print(f'> Fold {i+1} - Loss: {loss_per_fold[i]} - Accuracy: {acc_per_fold[i]}%')
print('------------------------------------------------------------------------')
print('Average scores for all folds:')
print(f'> Accuracy: {np.mean(acc_per_fold)} (+- {np.std(acc_per_fold)})')
print(f'> Loss: {np.mean(loss_per_fold)}')
print('------------------------------------------------------------------------')

------------------------------------------------------------------------
Training for fold 1 ...
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
Score for fold 1: loss of 0.6166592240333557; accuracy of 59.090906381607056%
------------------------------------------------------------------------
Training for fold 2 ...
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
Score for fold 2: loss of 0.33529379963874817; accuracy of 100.0%
------------------------------------------------------------------------
Training for fold 3 ...
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoc

### Exercises

1. Change hyperparameters and check accuracy scores.
2. (Harder) Write a script searching for the best hyperparameter configuration.

#### Hint for exercise 2

Think about the hyperparameter search.
Which kind of approach would you choose? A grid search or an exploration on a random set of points?

[Answer here](https://analyticsindiamag.com/why-is-random-search-better-than-grid-search-for-machine-learning/#:~:text=One%20of%20the%20drawbacks%20of,aliasing%20around%20the%20right%20set.). Try to answer on your own before opening the link.

### Click to visualise the discussion
<details>
  <summary>Click to expand!</summary>
  
    In <a href="https://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf">this</a> great and well-written paper, the authors show, both empirically and theoretically, why random search is more efficient than grid search for hyperparameter tuning.
  
</details>
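One intuition behind this result can be sketched numerically: with the same budget of trials, a grid tests only a few distinct values of each hyperparameter, while random sampling tests a new value on every trial. The two hyperparameters, their range $[0, 1]$ and the budget of 9 trials are hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
budget = 9  # same number of trials for both strategies

# Grid search: 3 x 3 grid over two hypothetical hyperparameters in [0, 1]
axis = np.linspace(0.0, 1.0, 3)
grid = np.array([(a, b) for a in axis for b in axis])

# Random search: 9 points drawn uniformly from the same square
random_points = rng.uniform(0.0, 1.0, size=(budget, 2))

# Distinct values tried along the first hyperparameter
grid_distinct = len(np.unique(grid[:, 0]))
random_distinct = len(np.unique(random_points[:, 0]))
print("grid:  ", grid_distinct, "distinct values per hyperparameter")
print("random:", random_distinct, "distinct values per hyperparameter")
```

If only one of the two hyperparameters really matters, the grid has effectively explored 3 settings of it, whereas random search has explored 9.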

