# Cross-validation

With the code developed so far, it is possible to train an ANN and provide an estimate of the results it would offer in its real execution (with unseen patterns, represented by a test set). However, in this last aspect there are two factors to consider, as a consequence of the non-deterministic nature of the process we are following:

- The partitioning of the set of patterns into training/test is random (hold out), and is therefore overly dependent on good or bad luck in choosing training and test patterns.
- ANN training is not deterministic, as the initialisation of the weights is random. As before, the result depends too much on the luck of starting the training from a good or a bad point.

For these two reasons, the test result of a single training is not significant when assessing the goodness of fit of the model in the presence of unseen patterns. To solve this problem, the experiment is repeated several times and the results are averaged. This can be implemented in a simple way by means of a loop; however, it is necessary to do this in an orderly way as there are two different sources of randomness.

Firstly, to minimise the randomness due to the partitioning of the data set, it is necessary to have a method that ensures that each pattern is used for training at least once and for testing at least once. The most commonly used method is cross-validation. In this method, the data set is split into k disjoint subsets and k experiments are performed. In the i-th experiment, the i-th subset is held out for testing and the remaining k-1 subsets are used for training; this is known as k-fold cross-validation. A common value is k=10, which gives a 10-fold cross-validation. Finally, the test value for the chosen metric is the average of the values obtained in the k experiments.
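As a minimal sketch of this scheme on a toy problem, the following code checks that every pattern is tested exactly once and used for training k-1 times. The fold assignment with `mod1` and all variable names are assumptions made for illustration only; no shuffling or model training is involved.

```julia
# Sketch of a k-fold loop on a toy problem (all names here are illustrative)
N, k = 10, 5

# Hypothetical fold assignment: pattern i goes to fold mod1(i, k), without shuffling
folds = mod1.(1:N, k)

timesTested = zeros(Int, N)
timesTrained = zeros(Int, N)
for fold in 1:k
    testMask = folds .== fold      # patterns held out in this experiment
    trainMask = .!testMask         # patterns of the remaining k-1 subsets
    timesTested[testMask] .+= 1
    timesTrained[trainMask] .+= 1
end

# Each pattern is tested exactly once and used for training k-1 times
println(all(timesTested .== 1), " ", all(timesTrained .== k - 1))   # true true
```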

A widely used variant of cross-validation is stratified cross-validation. In this case, each subset is created in such a way as to keep the proportion of patterns of each class the same (or similar) as in the original dataset. This is particularly used when the data set is imbalanced.

It is usual to save not only the mean, but also the k values, in order to subsequently perform a paired hypothesis test with another model. To do this, it is necessary that both models have been trained using the same training and test sets.
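To illustrate why the k individual values are worth keeping, here is a small sketch that computes the statistic of a paired t-test from the per-fold results of two models. The accuracy values are made up for the example, and the variable names are hypothetical; the resulting statistic would then be compared against a Student's t distribution with k-1 degrees of freedom.

```julia
using Statistics

# Made-up per-fold accuracies for two models evaluated on the SAME k = 5 folds
accModelA = [0.90, 0.85, 0.88, 0.92, 0.87]
accModelB = [0.86, 0.84, 0.85, 0.90, 0.83]

# Paired differences, fold by fold
d = accModelA .- accModelB

# Paired t statistic: mean difference divided by its standard error
k = length(d)
tStatistic = mean(d) / (std(d) / sqrt(k))

println("t = ", round(tStatistic, digits=2), " with ", k - 1, " degrees of freedom")
```

The pairing only makes sense because element i of both vectors comes from the same test fold.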

This way of evaluating the model is often considered slightly pessimistic, i.e. the test results are slightly worse than those that would be obtained by actually training with all the available data. In a hold out experiment, as mentioned above, some data are set aside for testing. This means that the model is trained with less data than is available, and that, by chance, the data set aside for testing may be of great importance (especially if there is little data). Because the model is trained with less data, and possibly without some "important" patterns, hold out is considered a pessimistic assessment. In the same way, cross-validation also sets data aside for testing, so it does not train on all the available data and is therefore also pessimistic. However, it guarantees that every pattern is used at least once for training and once for testing, which minimises the impact of chance in the split, so it is considered only a slightly pessimistic evaluation.

Doing this is as simple as splitting the data set and running a loop with k iterations, in which the i-th iteration trains and evaluates a model with the corresponding sets. However, if the model is not deterministic, the result obtained in a single iteration will not be meaningful, since it again depends on chance. In this case, a second loop must be nested inside each iteration, in which the model is trained repeatedly; the results of these trainings are then averaged to give the result of that iteration. The number of trainings must be high for the average to be really significant: at least 50.
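The structure of the two nested loops can be sketched as follows. The function `trainAndEvaluate` is a hypothetical stand-in that returns a noisy number instead of training a real model; the noise only imitates the effect of the random weight initialisation.

```julia
using Random, Statistics

Random.seed!(1)

# Hypothetical stand-in for "train a model on this fold and return its test accuracy"
trainAndEvaluate(fold) = 0.8 + 0.05 * randn()

k = 10
numRepetitions = 50

foldResults = zeros(k)
for fold in 1:k
    # Inner loop: repeat the non-deterministic training on the same fold
    repetitionResults = zeros(numRepetitions)
    for i in 1:numRepetitions
        repetitionResults[i] = trainAndEvaluate(fold)
    end
    # The result of this fold is the average over all repetitions
    foldResults[fold] = mean(repetitionResults)
end

println("mean = ", mean(foldResults), ", std = ", std(foldResults))
```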

### Question

If this second loop is performed with a deterministic model, what will be the standard deviation of the test results obtained? Is there a difference between performing this second loop and averaging the results, or doing a single training?

`Answer:` A deterministic model produces the same outputs whenever it is given the same inputs, so every training in the inner loop yields the same model and therefore the same test result. The mean of all the results is then equal to the result of a single training, and the standard deviation is zero.

Consequently, there is no difference between performing this second loop and averaging the results, or doing a single training: a deterministic model only needs to be trained once.

In this way, it is possible to evaluate a model together with its hyperparameters in solving a problem. A very common situation is to compare several models (or the same model with different hyperparameters), for which this scheme has to be applied with an important caveat: **the sets used in the cross-validation must be the same for each model**. Since the distribution of patterns in different sets is random, having the same subsets in different runs is achieved by setting the random seed at the beginning of the program to be executed. Setting the random seed not only allows the same subsets to be generated, but is also important in order to be able to repeat the results in different runs.
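A short sketch of the effect of the seed, using only the standard `Random` module: resetting the same seed before shuffling reproduces exactly the same permutation, which is what makes the fold assignment repeatable. The seed value 123 is arbitrary.

```julia
using Random

Random.seed!(123)
firstRun = shuffle(1:10)

Random.seed!(123)          # same seed again
secondRun = shuffle(1:10)

println(firstRun == secondRun)   # true
```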

It is also important to bear in mind that this methodology allows estimating the real performance of a model (although slightly pessimistic). The final model that would be used in production would be the result of training it with all the available patterns, since, as seen in the theory class, and very generally speaking, the more patterns you train with, the better the model will be.

In this assignment, you are asked to:

1. Develop a function called `crossvalidation` that receives a value `N` (equal to the number of patterns), and a value `k` (number of subsets into which the dataset is to be split), and returns a vector of length N, where each element indicates in which subset that pattern should be included.

    To do this function, one possibility is to perform the following steps:
    
    - Create a vector with k sorted elements, from 1 to k.
    - Create a new vector with repetitions of the previous vector until its length is greater than or equal to N. The functions `repeat` and `ceil` can be used for this purpose.
    - Take the first N values of this vector.
    - Shuffle this vector (using the function `shuffle!`) and return it. To use this function, the module `Random` must be loaded.
    
    No loop function should be used in the developed function.

In [1]:
using Random

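function crossvalidation(N::Int64, k::Int64)
    # Repeat 1:k enough times to cover N patterns and keep the first N values
    idx = repeat(1:k, ceil(Int, N / k))[1:N]
    # Shuffle in place so that fold membership is random
    shuffle!(idx)
    return idx
end;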
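The steps listed above can be followed one at a time on a small case. This is only a walk-through with illustrative values (N = 7, k = 3) and hypothetical intermediate variable names.

```julia
using Random

N, k = 7, 3

base = collect(1:k)                          # [1, 2, 3]
repeated = repeat(base, ceil(Int, N / k))    # [1, 2, 3, 1, 2, 3, 1, 2, 3]
folds = repeated[1:N]                        # [1, 2, 3, 1, 2, 3, 1]
shuffle!(folds)                              # same values, random order

# Fold sizes differ by at most one pattern, whatever the shuffle
foldSizes = [count(==(f), folds) for f in 1:k]
println(foldSizes)                           # [3, 2, 2]
```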

2. Create a new function called `crossvalidation`, which in this case receives as first argument `targets` of type `AbstractArray{Bool,2}` with the desired outputs, and as second argument a value `k` (number of subsets into which the dataset will be split), and returns a vector of length N (equal to the number of rows of `targets`), where each element indicates in which subset that pattern must be included. This partition must also be stratified. To do this, the following steps can be followed:

    - Create a vector of indices, with as many values as rows in the `targets` matrix.
    - Write a loop that iterates over the classes (columns in the `targets` matrix), and does the following:
        - Take the number of elements belonging to that class. This can be done by making a call to the `sum` function applied to the corresponding column.
        - Make a call to the `crossvalidation` function developed earlier passing as parameters this number of elements and the value of k.
        - Update the index vector positions indicated by the corresponding column of the `targets` matrix with the values of the vector resulting from the call to the `crossvalidation` function.
        
        ### Question
        
        Could you perform these 3 operations in a single function call?
        
        ```Answer:``` Yes: inside the loop, the three operations fit in a single line, as shown in the code below.
    - Return the vector of indices.
    
    As can be seen in this explanation, this function can use a loop that iterates over all the classes. However, you need to make sure that each class has at least k patterns. A usual value is k=10. Therefore, it is important to make sure that you have at least 10 patterns of each class.
        
    ### Question
    
    What would happen if any class has a number of patterns less than k? What would be the consequences for calculating metrics?
    
    ```Answer:``` Some folds will contain no patterns of that class. With fewer patterns than folds (`N < k`, where `N` is the number of instances of the class), the function `crossvalidation` keeps only the first `N` values of the vector `1:k`, so the last `k - N` folds receive no patterns of that class. As a consequence, in the iterations where one of those folds is the test set, the metrics computed for that class will be 0.
    
    > If, for whatever reason, it is impossible to ensure that you have at least 10 patterns of each class, one possibility would be to lower the value of k. In this case, consult with the teacher to assess this option, and what impact it might have on the final result of the trained models. 
    
    ```Answer:``` It could be done. However, note that a lower value of k is cheaper to compute (fewer models are trained), but the results will depend more on the random initialisation of the network and on the random split into subsets, since fewer trainings are averaged.

In [2]:
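function crossvalidation(targets::AbstractArray{Bool,2}, k::Int64)
    # A single column encodes a two-class problem: add the negative class as a second
    # column so that both classes are stratified (otherwise negatives would keep index 0)
    if size(targets, 2) == 1
        targets = hcat(targets, .!targets)
    end
    @assert all(sum(targets, dims=1) .>= k) "There are not enough instances per class to perform a $(k)-fold cross-validation"
    
    # Fold index of each pattern, filled in class by class
    idx = zeros(Int, size(targets, 1))
    for class in 1:size(targets, 2)
        idx[targets[:, class]] .= crossvalidation(sum(targets[:, class]), k)
    end
    
    return idx
end;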
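The situation discussed in the question above, a class with fewer than k patterns, can be reproduced directly. This sketch reuses the same construction as the non-stratified function (before shuffling) with illustrative values N = 3 and k = 10, and lists the folds that would receive no pattern of that class.

```julia
N, k = 3, 10     # a class with only 3 patterns in a 10-fold cross-validation

# Same construction as in the non-stratified function, before shuffling
folds = repeat(1:k, ceil(Int, N / k))[1:N]   # [1, 2, 3]

# Folds that receive no pattern of this class
emptyFolds = setdiff(1:k, folds)
println(emptyFolds)    # [4, 5, 6, 7, 8, 9, 10]
```

Shuffling only changes which pattern goes to which fold, not the set of fold values, so the same folds stay empty.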

3. Develop a final function called `crossvalidation`, in this case with the first parameter `targets` of type `AbstractArray{<:Any,1}` (i.e. a vector with heterogeneous elements) and the same second argument, that also performs stratified cross-validation.

    In this case, the steps to follow in this function are not specified. However, they are similar to the previous one. A simple way to do it would be to call the function `oneHotEncoding` passing the vector `targets` as an argument.
    
      ### Question
      
      Could you develop this function without calling oneHotEncoding?
      
      ```Answer:``` Yes, as shown in the commented code below (adaptation of the previous crossvalidation function).

In [3]:
include("utils.jl")  # helper functions from previous assignments (e.g. oneHotEncoding)

function crossvalidation(targets::AbstractArray{<:Any,1}, k::Int64)
    # Encode the labels as a boolean matrix and reuse the stratified method above
    return crossvalidation(oneHotEncoding(targets), k)
    
    #=
    Alternative without oneHotEncoding (kept commented out):
    
    classes = unique(targets)
    idx = zeros(Int, length(targets))
    
    for class in classes
        classIdx = (targets .== class)
        numInstances = sum(classIdx)
        @assert numInstances >= k "There are not enough instances per class to perform a $(k)-fold cross-validation"
        
        idx[classIdx] .= crossvalidation(numInstances, k)
    end
    
    return idx
    =#
end;

4. **Integrate these functions into the code developed so far** and define two functions to train ANNs following the stratified cross-validation strategy. To do this:

- First, it is necessary to set the random seed to ensure that the experiments are repeatable. This can be done with the `seed!` function of the `Random` module.
- Once the data is loaded and encoded, generate an index vector by calling the `crossvalidation` function.
- Create a function called `trainClassANN`, which receives as parameters the topology, the training set and the indices used for cross-validation. Optionally, it can receive the rest of the parameters used in previous assignments. Inside this function, the following steps may be followed:
    - Create a vector with k elements, which will contain the test results of the cross-validation process with the selected metric. If more than one metric is to be used, create one vector per metric.
    - Make a loop with k iterations (k folds) where, within each iteration, 4 matrices are created from the desired input and output matrices by means of the index vector resulting from the previous function. Namely, the desired inputs and outputs for training and test. As always, do this process of creating new matrices without loops.
    - Within this loop, add a call to generate the model with the training set, and test with the corresponding test set according to the value of k. This can be done by calling the `trainClassANN` function developed in previous assignments, passing as parameters the corresponding sets.
    - As indicated in the previous assignment, the training of ANNs is not deterministic, so that, for each iteration k of the cross-validation, it will be necessary to train several ANNs and return the average of the test results (with the selected metric or metrics) in order to have the test value corresponding to this k.
    - Furthermore, in the case of training ANNs, the training set can be split into training and validation if the ratio of patterns to be used for the validation set is greater than 0. To do this, use the `holdOut` function developed in a previous assignment.
    - Once the model has been trained (several times) on each fold, take the result and fill in the vector(s) created earlier (one for each metric).
    - Finally, provide the result of averaging the values of these vectors for each metric together with their standard deviations.
    - As a result of this call, at least the test value in the selected metric(s) should be returned. If the model is not deterministic (as is the case for the ANNs), it will be the average of the results of several trainings.
- Once this function is done, develop a second one with the same name that accepts the desired outputs as a vector instead of a matrix, as in a previous assignment, and simply calls the function just developed.

> **Remarks**:
> - Although we have only seen how to train ANNs, in the next assignment we will use other models contained in another library (Scikit-Learn). The idea is to use the same code used for cross-validation with this global loop, changing only the line in which the model is generated.
> - Note that other Machine Learning models are deterministic, so they do not need the inner loop (whenever they are trained with the same data they return the same outputs), but only the loop for each fold.

In [4]:
using Statistics  # provides mean and std

function trainClassANN(topology::AbstractArray{<:Int,1}, 
        trainingDataset::Tuple{AbstractArray{<:Real,2}, AbstractArray{Bool,2}}, 
        kFoldIndices::Array{Int64,1}; 
        transferFunctions::AbstractArray{<:Function,1}=fill(σ, length(topology)), 
        maxEpochs::Int=1000, minLoss::Real=0.0, learningRate::Real=0.01, repetitionsTraining::Int=1, 
        validationRatio::Real=0.0, maxEpochsVal::Int=20)
    
    inputs, targets = trainingDataset
    
    # crossvalidation variables
    k = maximum(kFoldIndices)
    testAccsK = zeros(k)
    
    # Train with k different splits
    for ki in 1:k
        
        # Use the patterns whose fold index is not ki for training
        trainingInputs = inputs[kFoldIndices .!= ki, :]
        trainingTargets = targets[kFoldIndices .!= ki, :]
        
        # Split the training subset into train and validation
        trainingIdx, validationIdx = holdOut(size(trainingInputs, 1), validationRatio) 
        trainingInputs, validationInputs = trainingInputs[trainingIdx, :], trainingInputs[validationIdx, :]
        trainingTargets, validationTargets = trainingTargets[trainingIdx, :], trainingTargets[validationIdx, :]
        
        # Use the patterns whose fold index is ki for test
        testInputs = inputs[kFoldIndices .== ki, :]
        testTargets = targets[kFoldIndices .== ki, :]
        
        # Train each network several times and save the accuracy
        accs = zeros(repetitionsTraining)
        
        for i in 1:repetitionsTraining
            ann, trainingLosses, validationLosses, testLosses, trainingAccs, validationAccs, testAccs = trainClassANN(
                topology, (trainingInputs, trainingTargets), validationDataset=(validationInputs, validationTargets), 
                testDataset=(testInputs, testTargets), transferFunctions=transferFunctions, maxEpochs=maxEpochs, 
                minLoss=minLoss, learningRate=learningRate, maxEpochsVal=maxEpochsVal)
            
            # If using validation, the model may not correspond to the last epoch so we cannot just get
            # testAccs[end] as the accuracy of the model
            accs[i] = accuracy(ann(testInputs')', testTargets)
        end
            
        # Compute the mean of the results and save them
        testAccsK[ki] = mean(accs)
    end
    
    # Return the average and std of the metrics in the different k folds
    return mean(testAccsK), std(testAccsK)
end;
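The loop-free construction of the training and test matrices used in the function above relies on boolean masks over the fold index vector. A minimal sketch with made-up toy data (6 patterns, 2 features, 3 folds):

```julia
# Toy data: 6 patterns with 2 features each, already assigned to k = 3 folds (made-up values)
inputs = [1.0 2.0; 3.0 4.0; 5.0 6.0; 7.0 8.0; 9.0 10.0; 11.0 12.0]
kFoldIndices = [1, 2, 3, 1, 2, 3]

ki = 2   # fold currently used for test

testInputs = inputs[kFoldIndices .== ki, :]       # rows 2 and 5
trainingInputs = inputs[kFoldIndices .!= ki, :]   # rows 1, 3, 4 and 6

println(size(testInputs), " ", size(trainingInputs))   # (2, 2) (4, 2)
```

The same masks applied to the targets matrix keep inputs and desired outputs aligned row by row.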

In [5]:
function trainClassANN(topology::AbstractArray{<:Int,1},
        trainingDataset::Tuple{AbstractArray{<:Real,2}, AbstractArray{Bool,1}},
        kFoldIndices::Array{Int64,1};
        transferFunctions::AbstractArray{<:Function,1}=fill(σ, length(topology)),
        maxEpochs::Int=1000, minLoss::Real=0.0, learningRate::Real=0.01, repetitionsTraining::Int=1, 
        validationRatio::Real=0.0, maxEpochsVal::Int=20)
    
    trainingInputs, trainingTargets = trainingDataset
    trainingTargets = reshape(trainingTargets, (length(trainingTargets), 1))
    
    return trainClassANN(topology, (trainingInputs, trainingTargets), kFoldIndices, transferFunctions=transferFunctions, 
        maxEpochs=maxEpochs, minLoss=minLoss, learningRate=learningRate, repetitionsTraining=repetitionsTraining,
        validationRatio=validationRatio, maxEpochsVal=maxEpochsVal)
end;