# Leave-one-out
Leave-one-out is a special case of the k-fold [cross-validation](https://scikit-learn.org/stable/modules/cross_validation.html#leave-one-out) technique for training and validating a machine learning model. The k-fold technique splits the data into k separate folds (subsets): k-1 folds are used for training and the remaining one for validation. The procedure is repeated k times, changing the validation fold each time, so that after k training steps every data point has been used for validation exactly once. With leave-one-out the number of folds equals the number of data points, so the validation set consists of a single data point at each training and validation step. Cross-validation, and leave-one-out in particular, make the most of the available data for training. Because leave-one-out requires one model fit per data point it is computationally expensive, so it is typically used only with very small datasets.

In this notebook we will create an ensemble of [Neural Network](https://scikit-learn.org/stable/modules/neural_networks_supervised.html) models (also known as Multi-layer Perceptrons) whose weights are randomly initialized, each trained on a small dataset using the leave-one-out method. The final output of the ensemble is the mean of the outputs of its members.
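
The relationship between k-fold and leave-one-out described above can be checked directly. The following is a minimal sketch on a toy array (the names `X_demo`, `kfold`, and `loo_demo` are illustrative and not part of this notebook): k-fold without shuffling, with k set to the number of samples, should yield the same splits as leave-one-out.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

# Toy data: 6 samples, 2 features (illustrative values only)
X_demo = np.arange(12).reshape(6, 2)

# k-fold with k equal to the number of samples, no shuffling
kfold = KFold(n_splits=len(X_demo))
loo_demo = LeaveOneOut()

# Collect the (train indices, validation indices) pairs of each splitter
kfold_splits = [(tr.tolist(), va.tolist()) for tr, va in kfold.split(X_demo)]
loo_splits = [(tr.tolist(), va.tolist()) for tr, va in loo_demo.split(X_demo)]

print('Same splits:', kfold_splits == loo_splits)
print('Number of folds:', len(loo_splits))
```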

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.neural_network import MLPRegressor
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score 
from time import time
import matplotlib as mpl
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure
from matplotlib import dates
import matplotlib.dates as mdates
import warnings
warnings.filterwarnings('ignore')
print("NumPy version: %s"%np.__version__)
print("Pandas version: %s"%pd.__version__)
print("Matplotlib version: %s"%mpl.__version__)

NumPy version: 1.25.2
Pandas version: 2.1.1
Matplotlib version: 3.8.0


## The Faostat dataset

In [16]:
global_crop_yield_path = 'data/global_dataset.csv'
global_crop_yield_df = pd.read_csv(global_crop_yield_path, index_col=0)
global_crop_yield_df.head(2)

| | Maize (100g/ha) | Millet (100g/ha) | Temp. Anom. (°C) | Prec.Anom. (mm) | CO2 (ppm) | Manure (Mt) | Nitrogen (tons) | Phosphate (tons) | Potash (tons) |
|---|---|---|---|---|---|---|---|---|---|
| 1961-12-31 | 19423 | 5925 | 0.211 | 15.318908 | 317.64 | 18350920000.0 | 11486265.27 | 10888968.81 | 8626724.57 |
| 1962-12-31 | 19796 | 5619 | 0.038 | 0.7689 | 318.45 | 18729180000.0 | 12969831.11 | 11534554.43 | 9146891.38 |


In [17]:
MAIZE = 0
MILLET = 1
TEMP_ANOM = 2
PRECIP_ANOM = 3
CO2 = 4
MANURE = 5
NITROGEN = 6
PHOSPHATE = 7
POTASH = 8

In [18]:
X_tmp = global_crop_yield_df.iloc[:, [TEMP_ANOM, PRECIP_ANOM, MANURE, NITROGEN, CO2]].to_numpy()
y_tmp = global_crop_yield_df.iloc[:, [MAIZE]].to_numpy()

In [31]:
X_tmp[0]

array([2.11000000e-01, 1.53189083e+01, 1.83509152e+10, 1.14862653e+07,
       3.17640000e+02])

In [20]:
num_inputs = X_tmp.shape[1]
num_inputs

5

## Data preprocessing
We use the scikit-learn [StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler) as the transformer in our pipeline. It standardizes each feature by subtracting its mean and dividing by its standard deviation.

In [21]:
scaler_x = preprocessing.StandardScaler().fit(X_tmp)

In [22]:
mean_x = scaler_x.mean_[0]
mean_x

0.5707741935483871

In [23]:
# scale_ holds the per-feature standard deviation (the variance is stored in var_)
std_x = scaler_x.scale_[0]
std_x

0.5387043721971724

In [24]:
X = scaler_x.transform(X_tmp)
X[0]

array([-0.66785089,  0.77731939, -2.40701305, -1.98495063, -1.40643863])

In [25]:
scaler_y = preprocessing.StandardScaler().fit(y_tmp)
y = scaler_y.transform(y_tmp)
y[0]

array([-1.63191475])

In [26]:
mean_y = scaler_y.mean_[0]
mean_y

38951.403225806454

In [27]:
# scale_ holds the standard deviation of the target (the variance is stored in var_)
std_y = scaler_y.scale_[0]
std_y

11966.558404448891
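
As a sanity check on the scaler attributes read above, the sketch below uses a toy column (the names `values`, `scaler`, `z`, `manual`, and `recovered` are illustrative) to show that `scale_` is the population standard deviation, that `var_` is the variance, and that `inverse_transform` maps scaled values back to the original units, which is what is needed to convert scaled predictions back to yields.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy single-feature column (illustrative values only)
values = np.array([[1.0], [2.0], [3.0], [4.0]])
scaler = StandardScaler().fit(values)

# scale_ is the population standard deviation (ddof=0); var_ is the variance
print('std matches scale_:', np.isclose(scaler.scale_[0], values.std(ddof=0)))
print('var matches var_:', np.isclose(scaler.var_[0], values.var(ddof=0)))

# transform() computes z = (x - mean) / std
z = scaler.transform(values)
manual = (values - scaler.mean_) / scaler.scale_
print('manual formula matches:', np.allclose(z, manual))

# inverse_transform() recovers the original units
recovered = scaler.inverse_transform(z)
print('round trip matches:', np.allclose(recovered, values))
```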

## The Multi-layer Perceptron estimator 
The [MLPRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPRegressor.html#sklearn.neural_network.MLPRegressor) is the scikit-learn implementation of a multi-layer perceptron (a feed-forward artificial neural network) for regression.

### Hold-out training method with train and test sets
The basic procedure to train an ML model is to split the dataset into a training set and a test set, fit the model's parameters on the training set, and finally assess the model's generalization performance on the test set.

In [41]:
X_train, X_test, y_train, y_test = train_test_split(X_tmp, y_tmp, test_size=0.20)
len(X_train), len(X_test)

(49, 13)

In [65]:
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

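pipe_lr = make_pipeline(StandardScaler(),
                        # Note: (4, 1) defines two hidden layers (4 units, then 1 unit);
                        # the output layer is added automatically by scikit-learn.
                        MLPRegressor(hidden_layer_sizes=(4, 1),
                                     activation='tanh',
                                     random_state=None,
                                     solver='lbfgs',
                                     max_iter=200,
                                     verbose=True))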

pipe_lr.fit(X_train, y_train)
y_pred = pipe_lr.predict(X_test)
# For a regressor, score() returns the R2 coefficient of determination, not an accuracy
test_r2 = pipe_lr.score(X_test, y_test)
print(f'Test R2 score: {test_r2:.3f}')

Test R2 score: -0.118


### Hold-out training method with train, validation, and test sets
We may have to iterate this process several times, tuning the model against the test score, to achieve a good result. At that point, however, information from the test set has leaked into the model, and its generalization performance on new data might not be as good as expected. A better approach is to split the dataset into three sets: train, validation, and test. The validation set is used for tuning, while the test set is kept aside for the final assessment.
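
A three-way split can be obtained by calling `train_test_split` twice. The sketch below is one possible way to do it, on synthetic data with the same number of observations as our dataset; the variable names, the fixed `random_state`, and the split proportions are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 62 observations, 5 features (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(62, 5))
y_demo = rng.normal(size=62)

# Step 1: hold out the test set, which is not touched during model tuning
X_rest, X_test_demo, y_rest, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=0)

# Step 2: carve a validation set out of the remaining data
X_train_demo, X_val_demo, y_train_demo, y_val_demo = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_train_demo), len(X_val_demo), len(X_test_demo))
```

With so few observations each of the three sets becomes very small, which is the motivation for using cross-validation on the training data instead of a fixed validation set.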

## Ensemble model training and validation

In [45]:
num_observations = X.shape[0]
num_observations

62

In [46]:
import random 
from random import randint
num_ensemble_members = 5
# One random seed per ensemble member, used to initialize the network weights
random_states = [randint(2, num_observations) for _ in range(num_ensemble_members)]
len(random_states)

5

In [25]:
print(random_states[:])

[29, 54, 60, 41, 28]


In [26]:
ensemble_members = np.zeros((num_observations,num_ensemble_members))
ensemble_members.shape

(62, 5)
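
The array above has one column per ensemble member. As a minimal sketch of how such an array could be filled and averaged (synthetic data; the names `seeds`, `member_predictions`, and `ensemble_prediction`, and the single hidden layer of 4 units, are assumptions made for illustration), each member is trained with a different `random_state` and the ensemble output is the row-wise mean.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Synthetic stand-in data: 40 observations, 3 features (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.uniform(-1.0, 1.0, size=(40, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + 0.05 * rng.normal(size=40)

# One seed per member, so that each network starts from different weights
seeds = [1, 2, 3]
member_predictions = np.zeros((X_demo.shape[0], len(seeds)))

for member_id, seed in enumerate(seeds):
    model = MLPRegressor(hidden_layer_sizes=(4,),
                         activation='tanh',
                         solver='lbfgs',
                         max_iter=500,
                         random_state=seed)
    model.fit(X_demo, y_demo)
    # Store this member's predictions in its own column
    member_predictions[:, member_id] = model.predict(X_demo)

# The ensemble output is the mean over the members for each observation
ensemble_prediction = member_predictions.mean(axis=1)
print(member_predictions.shape, ensemble_prediction.shape)
```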

In [27]:
from sklearn.model_selection import LeaveOneOut
loo = LeaveOneOut()
num_splits = loo.get_n_splits(X)

def ensemble_train_validation(ensemble_members, num_inputs, X_train, y_train):
    # This function instantiates an ensemble of MLP models. After the instantiation each
    # ensemble member model is trained and validated with leave-one-out cross-validation
    # on the training set passed as an argument to the function: at each fold one sample
    # is held out for validation and the model is fitted on the remaining ones.
    # The training set is expected to be scaled beforehand.
    num_ensemble_members = ensemble_members.shape[1]
    for member_id in range(0, num_ensemble_members):
        random_state = random_states[member_id]
        tic = time()
        # Ensemble member model instantiation
        mlp_model = MLPRegressor(
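               # Note: (num_inputs, 1) defines two hidden layers (num_inputs units, then 1 unit)
               hidden_layer_sizes=(num_inputs, 1),
               activation='tanh',
               #learning_rate_init=0.01, not used with lbfgs solver
               early_stopping=True,  # only effective with solver='sgd' or 'adam'; ignored by lbfgs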
               random_state=random_state,
               solver='lbfgs',
               max_iter=20
            )
        # Ensemble member model training with leave-one-out cross-validation
        print('Ensemble member {0:d}'.format(member_id))
        for i, (train_index, val_index) in enumerate(loo.split(X_train)):
            print('Fold {0:d}'.format(i))
            X_train_member = X_train[train_index]
            X_val = X_train[val_index]
            y_train_member = y_train[train_index]
            y_val = y_train[val_index]
            mlp_model.fit(X_train_member, y_train_member)
            time_stop = time() - tic  # cumulative time since this member's first fold
            print('done in {:.3f}'.format(time_stop))
            # Ensemble member model performances - training set
            print('X_train_ens_member: {}'.format(X_train_member.shape))
            print('y_train_ens_member: {}'.format(y_train_member.shape))
            member_model_train_score = mlp_model.score(X_train_member, y_train_member)
            print('Train R2 score: {:.2f}'.format(member_model_train_score))
            # Ensemble member model performances - validation set (one sample).
            # Note: R2 is not defined for fewer than two samples, so score() returns nan here.
            print('X_val: {}'.format(X_val.shape))
            print('y_val: {}'.format(y_val.shape))
            member_model_validation_score = mlp_model.score(X_val, y_val) 
            print('Validation R2 score: {:.2f}'.format(member_model_validation_score))
            # Ensemble model predictions
            #model_sample = mlp_model.predict(X)
            #ensemble_members[:, member_id] = model_sample
        mlp_model=None

In [28]:
ensemble_train_validation(ensemble_members, num_inputs, X_train, y_train)

Ensemble member 0
Fold 0
done in 0.016
X_train_ens_member: (48, 5)
y_train_ens_member: (48, 1)
Train R2 score: 0.98
X_val: (1, 5)
y_val: (1, 1)
Validation R2 score: nan
Fold 1
done in 0.035
X_train_ens_member: (48, 5)
y_train_ens_member: (48, 1)
Train R2 score: 0.99
X_val: (1, 5)
y_val: (1, 1)
Validation R2 score: nan
Fold 2
done in 0.073
X_train_ens_member: (48, 5)
y_train_ens_member: (48, 1)
Train R2 score: 0.98
X_val: (1, 5)
y_val: (1, 1)
Validation R2 score: nan
Fold 3
done in 0.105
X_train_ens_member: (48, 5)
y_train_ens_member: (48, 1)
Train R2 score: 0.98
X_val: (1, 5)
y_val: (1, 1)
Validation R2 score: nan
Fold 4
done in 0.125
X_train_ens_member: (48, 5)
y_train_ens_member: (48, 1)
Train R2 score: 0.99
X_val: (1, 5)
y_val: (1, 1)
Validation R2 score: nan
Fold 5
done in 0.145
X_train_ens_member: (48, 5)
y_train_ens_member: (48, 1)
Train R2 score: 0.98
X_val: (1, 5)
y_val: (1, 1)
Validation R2 score: nan
Fold 6
done in 0.163
X_train_ens_member: (48, 5)
y_train_ens_member: (48, 1)
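
The validation R2 score printed above is `nan` because R2 is not defined for a single sample. One common way around this, sketched below under stated assumptions, is to store the prediction for each held-out sample and compute the metric once over all folds. The example uses synthetic data and a `LinearRegression` as a fast, deterministic stand-in for the MLP; the names `held_out_predictions`, `loo_r2`, and `loo_mse` are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import LeaveOneOut

# Synthetic stand-in data: 20 observations, 2 features (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(20, 2))
y_demo = 3.0 * X_demo[:, 0] - X_demo[:, 1] + 0.1 * rng.normal(size=20)

# One out-of-sample prediction per observation
held_out_predictions = np.zeros_like(y_demo)
for train_index, val_index in LeaveOneOut().split(X_demo):
    model = LinearRegression()  # stand-in estimator, refitted from scratch on each fold
    model.fit(X_demo[train_index], y_demo[train_index])
    held_out_predictions[val_index] = model.predict(X_demo[val_index])

# The metrics are computed once, over all the held-out predictions
loo_r2 = r2_score(y_demo, held_out_predictions)
loo_mse = mean_squared_error(y_demo, held_out_predictions)
print(f'LOO R2: {loo_r2:.3f}, LOO MSE: {loo_mse:.4f}')
```

Per-fold errors that are defined for a single sample, such as the squared or absolute error, are another option if a score for each fold is needed.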

## Leave-one-out
A quick test of the scikit-learn [leave-one-out](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.LeaveOneOut.html) splitter on a small array, to show which rows end up in the train and test sets of each fold.

In [90]:
from sklearn.model_selection import LeaveOneOut
loo = LeaveOneOut()

In [89]:
A = np.array([
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [10, 11, 12, 13],
    [14, 15, 16, 17]])
print(A)

[[ 1  2  3  4]
 [ 5  6  7  8]
 [10 11 12 13]
 [14 15 16 17]]


In [86]:
b = np.array([6, 7, 8, 9])
print(b)

[6 7 8 9]


In [91]:
num_splits = loo.get_n_splits(A)
num_splits

4

In [98]:
for i, (train_index, test_index) in enumerate(loo.split(A)):
    print('Fold {:d}'.format(i))
    print('Train: index={}. Test: index={}'.format(train_index, test_index))
    print('Train set:\n{}'.format(A[train_index]))
    print('Test set:\n{}'.format(A[test_index]))

Fold 0
Train: index=[1 2 3]. Test: index=[0]
Train set:
[[ 5  6  7  8]
 [10 11 12 13]
 [14 15 16 17]]
Test set:
[[1 2 3 4]]
Fold 1
Train: index=[0 2 3]. Test: index=[1]
Train set:
[[ 1  2  3  4]
 [10 11 12 13]
 [14 15 16 17]]
Test set:
[[5 6 7 8]]
Fold 2
Train: index=[0 1 3]. Test: index=[2]
Train set:
[[ 1  2  3  4]
 [ 5  6  7  8]
 [14 15 16 17]]
Test set:
[[10 11 12 13]]
Fold 3
Train: index=[0 1 2]. Test: index=[3]
Train set:
[[ 1  2  3  4]
 [ 5  6  7  8]
 [10 11 12 13]]
Test set:
[[14 15 16 17]]
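
The splitter can also be passed to the scikit-learn cross-validation helpers instead of being iterated by hand. The sketch below is an illustration on synthetic data with a `LinearRegression` stand-in (both are assumptions, not part of this notebook): `cross_val_score` returns one score per fold, and a scoring such as `neg_mean_squared_error` is used because it is defined for a single held-out sample.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Synthetic stand-in data: 10 observations, 1 feature (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(10, 1))
y_demo = 2.0 * X_demo[:, 0] + 0.1 * rng.normal(size=10)

# One score per fold; scikit-learn negates the error so that higher is better
scores = cross_val_score(LinearRegression(), X_demo, y_demo,
                         cv=LeaveOneOut(),
                         scoring='neg_mean_squared_error')

print('Number of folds:', len(scores))
print('Mean squared error over the folds: {:.4f}'.format(-scores.mean()))
```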
