# Multiclass classification

When solving a classification problem, many existing machine learning models allow only two classes to be separated, usually referred to as "positive" and "negative". Positive patterns are commonly those related to what is to be detected, such as disease, alarm, or a type of object in an image. Negative patterns are often characterised by the absence of this characteristic that positive patterns have. To develop an ANN that classifies into two classes, a single neuron is needed in the output layer, with a logarithmic (or similar) sigmoidal transfer function, such that the output of the ANN will be between 0 and 1, and can be interpreted as the ANN's certainty in classifying a pattern as "positive". The classification into "negative" or "positive" is done in a simple way, by applying a threshold which is typically 0.5, although this can be changed.
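As a minimal sketch of this thresholding, assuming a hypothetical raw value `z` reaching the single output neuron (the names `sigmoid`, `z` and `threshold` are illustrative and not part of the assignment code):

```Julia
# Logistic sigmoid: maps any real value into the interval (0, 1)
sigmoid(z) = 1 / (1 + exp(-z))

z = 0.8                          # hypothetical value before the transfer function
confidence = sigmoid(z)          # certainty that the pattern is "positive"
threshold = 0.5                  # typical value, it can be changed
isPositive = confidence >= threshold
println((confidence, isPositive))
```

Raising the threshold makes the system more conservative about reporting "positive"; lowering it has the opposite effect.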

However, there are many occasions when a system that is able to classify into more than two classes is desired. A simple example is a system that wants to classify an image according to whether a dog, cat or mouse, or some other type of animal is observed. In this case, you want to develop a 4-class classification system: "dog"/"cat"/"mouse"/"other". If an ANN to distinguish between these 3 animals is required, an output neuron for each class is needed, including the "other" class (4 output neurons in total).

In the multiclass classification scheme, as has been done in previous assignments, an encoding called one-hot-encoding is generally used, which is based on obtaining a boolean value for each pattern and each class, in such a way that each boolean value will be equal to 1 if that pattern belongs to that class, and 0 otherwise. When training an ANN with this scheme, each output neuron can be understood as a model specialised in classifying in a given class. In this type of network, a linear transfer function is usually used in the output layer, whereby negative outputs indicate that a neuron does not classify the pattern into that class (i.e. from the point of view of that class it classifies it as "negative"), and positive outputs indicate that a neuron classifies the pattern as that class (i.e. from the point of view of that class it classifies it as "positive"). The absolute value of a neuron's output indicates that neuron's confidence in the classification. Finally, the softmax function receives these classification values and transforms them in such a way that they are between 0 and 1 and add up to 1, so that they can be interpreted as the probability of belonging to each class. The pattern will be classified into the class whose output value is the highest. The softmax function is defined as follows:

$$
\mathrm{softmax}(y^i) = \frac{e^{y^i}}{\sum_j e^{y^j}}
$$

where $y^i$ is the output of the $i$-th neuron. For example, in a 3-class classification problem, if the outputs from the 3 neurons are `[2, 1, 0.2]`, they would classify the inputs as belonging to their respective classes, although the first one with much greater certainty. After applying the softmax function, the respective probabilities will be `[0.65, 0.24, 0.11]`, so the pattern will be classified as the first class.
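The numbers of this example can be reproduced with a hand-written version of the formula. This is only a sketch: `softmax_sketch` is a hypothetical helper written for illustration, not the Flux implementation, and subtracting the maximum is a common numerical precaution that does not change the result.

```Julia
# Softmax written out from its definition, without external packages
function softmax_sketch(y::AbstractVector{<:Real})
    e = exp.(y .- maximum(y))   # shifting by the maximum avoids overflow
    return e ./ sum(e)
end

probs = softmax_sketch([2, 1, 0.2])
println(round.(probs, digits=2))   # [0.65, 0.24, 0.11]
println(argmax(probs))             # 1
```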

In this way, the softmax function converts the real values produced by the output neurons into probability values, so that the more negative a value is (the more certainty of not belonging to that class), the closer it is to 0, and the more positive a value is (the more certainty of belonging to that class), the closer it is to 1. As indicated above, the sum of output probabilities will be equal to 1. Because of this fact, a fourth special class "other" is needed in the example above and in any other example where a pattern may not belong to any of the predefined classes.

### Question

Why is this extra class necessary when using the softmax function?

`Tip:`  write in Julia `softmax([-1, -1, -0.2])`, and interpret the inputs (what does the vector `[-1, -1, -0.2]` represent and how is it interpreted?) and outputs of the function (how much do the values add up to? what does each say?). To use this function, import it from the Flux library.

`Answer:` The vector `[-1, -1, -0.2]` means that none of the 3 neurons classifies the pattern into the class it models, since all the values are negative. However, the `softmax` function makes the final values add up to 1.0 (0.9999999999999999 in this case, due to floating-point rounding), so it 'forces' a classification in most cases: the greatest value (even if negative) is translated into the class with the highest confidence and is detected as such. In this case, the value -0.2 is the highest, and it is converted into a probability of ~0.53, so the model finally decides that the detected class is the third one, when none of the neurons really detected it.

The best case for detecting that no class is present is when the three negative values are very close to each other, so `softmax` transforms them into confidences of about 0.333 (since we are working with 3 values); if a threshold (e.g. 0.5) is then applied, all of them are detected as negative. When those 3 negative values are not close to each other, the greatest one is likely to be translated into a confidence > 0.5, and the final result will indicate that the pattern belongs to that category, even though that neuron actually returned a negative (no class) value.

For this reason, an extra 'other' class is needed: when the other neurons return negative values for their classes, the neuron assigned to the 'other' class will return the greatest value and, after applying `softmax`, it will be the one with the highest confidence, so we are able to detect when the model returns no classification.

In [12]:
using Flux: softmax;
x = softmax([-1, -1, -0.2])
s = sum(x)
x,s

([0.23665609135556676, 0.23665609135556676, 0.5266878172888664], 0.9999999999999999)

### Question

Might it not be necessary to create the additional class? What modification would have to be made to the ANN? How would the output be interpreted? How would the output class be generated based on the outputs of the output neurons?

`Answer:` If softmax is used, as discussed in the previous question, the model must consider all possible classes in the domain to avoid returning erroneous classifications. However, there are other approaches that do not require covering every class of the domain. Starting from the softmax-based ANN, the possible modifications are the following:

1. We could generate independent outputs with one neuron per category considered. To do so, a sigmoid function is used on every neuron of the last layer instead of a softmax, so each output is interpreted as the confidence of that neuron that the pattern belongs to the class it models. This way, the sum of all the outputs can be greater than 1. The final category is then the class with the highest confidence among all the independent predictions.

2. Similar to the previous approach, we could train as many binary predictors as needed to perform a one-vs-all classification, so each model would use just one output neuron with the sigmoid activation. The final output could then be, for example, the class of the classifier that returns a positive result (or, if there are several, the one with the highest confidence).

In both cases, if all the confidences are very low and do not reach the minimum threshold, all the results will be negative so we can detect that no classification was performed (or 'other' class is detected).
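A minimal sketch of this decision rule, assuming sigmoid outputs and a threshold of 0.5 (the function `classifyWithRejection` and its arguments are hypothetical names used only for illustration):

```Julia
sigmoid(z) = 1 / (1 + exp(-z))

# Returns the class with the highest confidence, or "other" when no confidence reaches the threshold
function classifyWithRejection(rawOutputs::AbstractVector{<:Real}, classes::AbstractVector; threshold::Real=0.5)
    confidences = sigmoid.(rawOutputs)   # independent values, they need not add up to 1
    best = argmax(confidences)
    return confidences[best] >= threshold ? classes[best] : "other"
end

classes = ["dog", "cat", "mouse"]
println(classifyWithRejection([-1, -1, -0.2], classes))   # other
println(classifyWithRejection([2, 1, 0.2], classes))      # dog
```

With the vector `[-1, -1, -0.2]` of the earlier question, no confidence reaches 0.5, so the pattern is rejected instead of being forced into the third class.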

### Question

In general, how does the output of a model have to be in order not to need this fourth class?

`Answer:` if using a softmax function in the last layer, then there must be a neuron for every possible class in the domain. If that is not possible, then it has to generate independent confidences for each class. This way, a sigmoid function is used on every neuron instead of a softmax, giving independent confidences for every class among which the greatest one is considered as the final result.

### Question

Does a kNN model need this fourth class?

`Answer:` kNN, as its name suggests, classifies a pattern according to its k nearest neighbours. Therefore, if we want to be able to distinguish between these 3 classes + 'other', we must take these 4 classes into account and have samples that are as representative as possible for each of them. However, if the 'other' class is very sparse because it can contain every kind of pattern not belonging to the other 3 classes (with features that differ even between instances of the same class), then the neighbourhood will hardly ever be dominated by the 'other' class, so this class will probably not affect the output of the model.

In contrast, the ANN uses all the weights/knowledge of the network to perform the classification, so in that model it makes sense to use the 'other' class.

### Question

How many classes would be necessary if an ANN wanted to recognise those 3 types of animals, and, if it is not one of them, to say whether it is an animal or not? What if the model is a kNN?

`Answer:` using a softmax function, a total of 5 classes would be necessary: three for the specific types of animals, one for any other type of animal, and one more for everything that is not an animal.

As commented previously for the kNN, we should have those 5 classes as well, but due to the fact that the 'other' class is very sparse it may be useless. To detect if it is just an animal, we must consider if the features of all the possible animals may form a cluster that can be distinguished from non-animal classes. If it is possible, then it would make sense to use an 'animal' class, otherwise it would be the same as with 'other'.

Therefore, the "positive"/"negative" scheme no longer applies if more than two classes are required. The problem in these cases is that many machine learning models are only capable of separating two classes, so theoretically they could not be used. An example of such a system is the Support Vector Machine (SVM), which is discussed in more detail in the theory class. Modifications have been made to the formulation of this model to allow multi-class classification; however, in practice they are not commonly used, and instead a strategy that allows binary SVMs to be used to classify into multiple classes is often employed.

There are two main strategies for converting multi-class problems into binary classification problems. These strategies are called "one-against-one" and "one-against-all". Both are explained in theory class, but since "one-against-all" is much more widely used, this strategy will be used in the following.

The "one-against-all" strategy is based on generating L binary classifiers for a classification problem of L classes, one per class. In the l-th problem, class l must be separated from the rest, i.e., the patterns belonging to that class will be considered "positive", and those not belonging to it will be considered "negative". Continuing with the previous example of animals, 3 different classification problems would have to be solved: one to classify "dog"/"not dog", one to classify "cat"/"not cat", and one to classify "mouse"/"not mouse". Three classifiers would therefore be trained with the same inputs but with different desired outputs for each problem.
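The desired outputs of these three problems can be built from a single vector of labels. The following is a sketch with made-up data; the names `labels`, `classes` and `targets` are assumptions chosen for illustration:

```Julia
labels = ["dog", "cat", "mouse", "cat", "dog"]   # hypothetical class of each pattern
classes = ["dog", "cat", "mouse"]

# One Boolean column per class: column l holds the desired outputs of the l-th binary problem
targets = falses(length(labels), length(classes))
for (l, class) in enumerate(classes)
    targets[:, l] .= (labels .== class)
end

println(targets[:, 2])   # desired outputs of the "cat"/"not cat" problem
```

Each row contains exactly one `true` here because the classes are mutually exclusive and every pattern belongs to one of them; a pattern of the "other" kind would simply produce a row of `false` values.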

### Question

In the previously described problem, 4 classes were used for these 3 animals, including the class "other". Why not train a classifier for this class in a "one-against-all" scheme?

`Answer:` because in this case the classifiers are independent, and we know that no classification is made (or 'other' is detected) when none of the classifiers returns a positive classification, so there is no positive class to train for. In addition, being able to use a softmax, which makes the outputs depend on each other, facilitates the training and the creation of the most appropriate weights.

Once the binary classifiers are trained, any given pattern is fed into all the classifiers and, depending on the output, a decision is made. If only one of the systems has positive output, or none of the three classifies it as positive, the decision is clear. However, sometimes more than one classifier will give a positive output for the same pattern. Fortunately, many classifiers give information about the level of certainty or confidence they have that the pattern is classified as "positive". If more than one binary model classifies the pattern as positive, it will be assigned to the class corresponding to the classifier that has a higher certainty in its classification.

### Question

Would it be possible to use the outputs of those 3 classifiers as the input of the softmax function? What would be the consequences?

`Answer:` We could, but it doesn't make any sense since the classifiers are independent, so their weights and biases represent different knowledge. In addition, the softmax function will force to classify into one class, even if all the outputs are negative. Then, if all the 3 classifiers return negative values, the softmax function will take the highest one and return the positive class of its corresponding classifier as the detected class. **Doing this will not solve the problem of the fourth class, so, as we have said, it does not make sense.**

### Question

In general, when there are L classes and a pattern may not belong to any of them, what is the impact of using the softmax function on the outputs? In which cases could it be used? Why?

`Answer:` the softmax function will always try to detect one class, giving the highest probability to the highest output, even if all the outputs are negative, so the detected class will be wrong if the pattern does not belong to any of the classes considered by the model.

It can be used when we are sure that every pattern passed to the model belongs to one of the classes represented by the output neurons, because then there will be a real positive prediction among them and the softmax will correctly assign it the highest probability. If the domain is not restricted, then independent classifications could be used instead of the softmax, or an 'other' class could be added.

### Question

The softmax function is useful to get a loss value to train the ANN. However, if it were not used in the animal example above, would the fourth class "other" be necessary?

`Answer:` no, because we could tell that no class was predicted by checking whether the outputs for all the other classes are negative. However, another function should be used to scale the outputs to the target range, so the loss can be calculated and the model trained correctly.

Finally, it is necessary to consider a different scenario when assigning patterns to classes. So far, and in most situations, the classes considered are mutually exclusive, i.e. in the example above, an animal is either a dog, a cat, a mouse, or none of the 3, but it cannot be of several classes at the same time. This is the most common case, but occasionally a problem will have classes that are not mutually exclusive. For example, when classifying animal sounds according to the animal that makes them, it may happen that several animals are mixed in one sound. In these cases, the use of a linear transfer function in the last layer together with the softmax function would not work, since, naturally, the sum of the probabilities of belonging to the classes may be greater than 1 (it may belong to several classes at the same time). For these cases, the scheme that can be used to train ANNs is to use logarithmic sigmoidal transfer functions in the last layer (instead of linear), which give an output between 0 and 1, and not to perform transformation using the softmax function. In this way, the final output of each output neuron is independent of the rest of the output neurons, and more than one can take values close to 1. The output of each neuron would again be interpreted as the probability of belonging to that class, but in this case the sum of the probabilities does not have to be 1 (they are independent). Not applying the softmax function has two advantages: the first, already mentioned, is that it allows classification into non-mutually exclusive classes; the second is that an additional class ("other" in the example above) is no longer needed for cases where a set of inputs may not belong to any of the given classes.
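As a rough illustration of non-mutually exclusive outputs, the sketch below thresholds each sigmoid output independently. The raw values are invented and the 0.5 threshold is an assumption:

```Julia
sigmoid(z) = 1 / (1 + exp(-z))

classes = ["dog", "cat", "mouse"]
rawOutputs = [1.5, 0.7, -2.0]          # hypothetical values of the output neurons before the transfer function
confidences = sigmoid.(rawOutputs)     # each value is independent of the others

detected = classes[confidences .>= 0.5]
println(sum(confidences) > 1)   # true
println(detected)               # ["dog", "cat"]
```

An empty `detected` vector would mean that the sound belongs to none of the classes, which is why no explicit "other" neuron is required.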

### Question

Why is this extra class no longer needed?

`Answer:` because we can detect that it does not belong to any class by thresholding the outputs of every neuron and checking that all of them return a negative result.

Given a set of inputs, as always, it is classified into the class whose output neuron has shown the highest confidence. This scheme of non-mutually exclusive outputs is similar to the "one-against-all" scheme, in which one classifier per class is trained in parallel. The classifiers are independent and the final class is that of the classifier that has the highest certainty of belonging to that class. If all classifiers return "negative" and there is no possibility of not belonging to any class, the pattern is assigned to the class of the classifier with the lowest certainty of being negative. If all classifiers return "negative" and non-membership is possible, the pattern is simply classified as "other".

The following table shows a summary of the different scenarios when using an ANN to solve a classification problem. Note that in the case of binary classification, the possibility that a set of entries do not belong to any class is not considered, since in this case we would be in multi-class classification.

In the case of using a "one-against-all" strategy, this would be similar to the last row, except that the interval would not necessarily be `[0, 1]`, but would be conditioned by the model used, and therefore the threshold as well. For example, the outputs of a SVM range from $-\infty$ to $+\infty$, so the typical threshold is set to 0.

Another factor to consider when dealing with multiclass problems is the performance metric. Most of the metrics studied (PPV, sensitivity, etc.) correspond to binary classification problems. When the number of classes is greater than 2, these metrics can still be used; however, their use is slightly different.

When the number of classes is greater than two, the PPV, NPV, sensitivity and specificity metrics can be calculated separately for each class. Thus, from the point of view of a particular class, that class will be referred to as the positive class and the rest of classes will be put together in the negative class. In this way, from the exclusive point of view of that class, TP, TN, FP and FN can be calculated, and from them the sensitivity, specificity, PPV and NPV values for that particular class, and finally the F-score value. This way of treating classes separately is similar to the development of several classifiers in the "one-against-all" strategy (in the case of training binary classifiers that do not allow multi-class classification). Once these values have been calculated, they can be combined into a single value that will be used to evaluate the performance of the classifier. In this regard, there are 3 strategies: macro, weighted, and micro. We will use only the first two:

- **Macro**. In this strategy, those metrics such as the PPV or the F-score are calculated as the arithmetic mean of the metrics of each class. As it is an arithmetic average, it does not consider the possible imbalance between classes.
- **Weighted**. In this strategy, the metrics corresponding to each class are averaged, weighting them with the number of patterns that belong (desired output) to each class. It is therefore suitable when classes are unbalanced.
- **Micro**. TP, FN, and FP are calculated globally. When the classes are mutually exclusive, the micro-PPV and the micro-F-score are equal to the accuracy value. Therefore, this metric is mainly useful when the classes are not mutually exclusive.
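The three aggregations can be compared on a small example. The confusion matrix below is invented for illustration and only the PPV is computed; the variable names are assumptions:

```Julia
# Hypothetical 3-class confusion matrix: rows are desired classes, columns are predicted classes
confusion = [8 1 1;
             2 5 3;
             0 0 80]

tp = [confusion[c, c] for c in 1:3]        # true positives per class
predicted = vec(sum(confusion, dims=1))    # patterns assigned to each class (TP + FP)
support = vec(sum(confusion, dims=2))      # patterns that really belong to each class

ppv = tp ./ predicted
macroPPV = sum(ppv) / length(ppv)                   # plain arithmetic mean
weightedPPV = sum(ppv .* support) / sum(support)    # weighted by patterns per class
microPPV = sum(tp) / sum(predicted)                 # global counts
acc = sum(tp) / sum(confusion)
println((macroPPV, weightedPPV, microPPV, acc))
```

Since every pattern receives exactly one predicted class in this matrix, the global TP + FP count equals the number of patterns and the micro-PPV coincides with the accuracy. The weighted value is dominated by the large third class, while the macro value treats the three classes equally.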

In this assignment, you are asked to:

1. Develop the code necessary to perform a "one-against-all" strategy. Although it is not necessary to develop it for multiclass classification with ANNs, it will be used in future assignments. A simple way of doing it is the following:

    - Calculate the number of classes and create a 2-dimensional matrix of real values, with as many rows as patterns and as many columns as classes.
    
    ```Julia
    outputs = Array{Float32,2}(undef, numInstances, numClasses);
    ```
    
    - Make a loop that iterates over each class. Inside this loop, the desired outputs corresponding to that class are created and the corresponding model is trained with those inputs and the new desired outputs corresponding to that class. In other words, a model is created for each class that indicates whether or not the pattern belongs to that class. Subsequently, this model is applied to the inputs (training and/or test) to calculate the outputs, which will be copied into the previously created matrix. The code would be similar to the following, in which a supposed fit function has been used to train a binary classification model:
    
    ```Julia
    for numClass in 1:numClasses
        model = fit(inputs, targets[:,[numClass]]); 
        outputs[:,numClass] .= model(inputs);
    end;
    ```
    
    ### Question

    In this code it has also been assumed that `targets` is of type `AbstractArray{Bool,2}`. How could this be done if it were a vector with classes of any type (e.g. containing `["car", 17, "motorbike"]`), i.e. of type `Array{Any,1}`?
    
    `Answer:` it could be done by computing `numClasses` as the number of unique values in the targets (`uniqueClasses = unique(targets)`, `numClasses = length(uniqueClasses)`). Then, inside the loop over these classes, `targets .== uniqueClasses[numClass]` would be used as the desired outputs of each binary problem.
    <br/>
    
     - Once the outputs are in the `outputs` matrix, the highest value is taken for each row (each pattern), i.e. the class of the model that has the highest certainty that it belongs to "its" class is taken.
    
        - Optionally, the softmax function can be applied. The end result is the same: the class with the highest value will be taken. However, the softmax function allows you to interpret the outputs as the probability of belonging to each class. One problem in using softmax is that it is prepared for use with ANNs, so it expects each pattern to be in a column. To solve this, you would have to transpose the outputs matrix and transpose the result back, as follows:
        
        ```Julia
        outputs = softmax(outputs')';
        ```
        
     - To take the highest output for each pattern, the same approach can be used as in practice 2, where the accuracy was calculated for more than 2 classes with the patterns arranged in rows. The code could be similar to the following:
     
     ```Julia
     vmax = maximum(outputs, dims=2);
     outputs = (outputs .== vmax);
     ```
     In this way, a matrix of Boolean outputs is generated with the class to which each pattern belongs, which can be used to compare with the target matrix to calculate the different performance metrics.
     
     ### Question
     
     The last piece of code may present problems in case several models generate the same output. Where would the problem be, and how could it be solved?
     
     `Answer:` if the maximum of a row is reached by more than one value, then after `outputs = (outputs .== vmax)` several columns will have a 1 in that row (multiple classes detected). It can be solved by keeping only the first 1 and setting the others to 0 whenever such a tie occurs in a row.
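The tie problem and one possible fix can be seen on a tiny matrix. This is a sketch with invented values; `tied` and `classified` are illustrative names:

```Julia
outputs = [0.2 0.9 0.9;
           0.7 0.1 0.3]

# Comparing against the row maximum marks both tied columns of the first row
vmax = maximum(outputs, dims=2)
tied = outputs .== vmax
println(vec(sum(tied, dims=2)))         # [2, 1]

# findmax returns a single index per row, so exactly one column is marked
maxIdx = findmax(outputs, dims=2)[2]
classified = falses(size(outputs))
classified[maxIdx] .= true
println(vec(sum(classified, dims=2)))   # [1, 1]
```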

In [1]:
# Treat this as pseudo-code: `fit!` and the callable `model_trained(inputs)` stand for the
# training and prediction functions of the library used (ScikitLearn, see Unit 6)

function oneVSall(inputs::AbstractArray{<:Real,2}, targets::AbstractArray{Bool,2}, model)
    # The number of classes comes from the targets; the columns of inputs are features
    numInstances, numClasses = size(targets)
    @assert (size(inputs, 1) == numInstances) "inputs and targets have different number of patterns"
    outputs = Array{Float32,2}(undef, numInstances, numClasses)
    
    for numClass in 1:numClasses
        model_trained = deepcopy(model)  # train each binary model from scratch
        model_trained = fit!(model_trained, inputs, targets[:,[numClass]]);
        outputs[:,numClass] .= model_trained(inputs);
    end
    
    # Mark only the first maximum found in each row, so a tie cannot yield two classes
    maxIdx = findmax(outputs, dims=2)[2]
    classified = falses(size(outputs))
    classified[maxIdx] .= true
    
    return classified
end;

2. Develop a function called `confusionMatrix` (same name as in the previous assignment) that returns the values of the metrics adapted to the condition of having more than two classes. To do so, include an additional parameter that selects whether they are aggregated in the *macro* or the *weighted* form.

    This function should receive two matrices: model outputs (`outputs`) and desired outputs (`targets`), both of Boolean elements and dimension 2, with each pattern in a row and each class in a column. The first thing this function should do is to check that the number of columns of both matrices is equal and is different from 2. In case they have only one column, these columns are taken as vectors and the confusionMatrix function developed in the previous assignment is called.
    
    ### Question
    
    Why are two-column matrices invalid?
    
    `Answer:` because it means that there are two classes, so it can be coded as a 1/0 with just one column. For more classes, at least a three-column matrix is needed, with one column per class.
    
    If both matrices have more than 2 columns, the following steps can be followed:
    
    - Reserve memory for the sensitivity, specificity, PPV, NPV and F-score vectors, with one value per class, initially equal to 0. To do this, the `zeros` function can be used.
    
    - Iterate for each class, and, if there are patterns in that class, make a call to the `confusionMatrix` function of the previous assignment passing as vectors the columns corresponding to the class of that iteration of the outputs and targets matrices. Assign the result to the corresponding element of the sensitivity, specificity, PPV, NPV and F1 vectors.
    - Reserve memory for the confusion matrix.
    - Perform a double loop in which both loops iterate over the classes, to fill all the confusion matrix elements.
    - Aggregate the values of sensitivity, specificity, PPV, NPV, and F-score for each class into a single value according to the *macro* or *weighted* strategy, as specified in the input argument.
    - Finally, calculate the accuracy value with the `accuracy` function developed in a previous assignment, and calculate the error rate from this value.

In [2]:
include("utils.jl")
using Statistics: mean  # used by the macro average below

function confusionMatrix(outputs::AbstractArray{Bool,2}, targets::AbstractArray{Bool,2}; weighted::Bool=true)
    
    # Check that outputs and targets have the same number of patterns and classes,
    # and that the number of classes is valid
    numClasses = size(outputs, 2)
    @assert (size(outputs, 1) == size(targets, 1)) "outputs and targets have different number of patterns"
    @assert (numClasses == size(targets, 2)) "outputs and targets have different number of classes"
    @assert (numClasses != 2) "outputs and targets cannot have 2 columns"
    
    if (numClasses == 1)
        return confusionMatrix(outputs[:,1], targets[:,1])
    end
    
    # Calculate metrics
    sensitivity = zeros(numClasses)
    specificity = zeros(numClasses)
    ppv = zeros(numClasses)
    npv = zeros(numClasses)
    f1Score = zeros(numClasses)
    
    numInstances = sum(targets, dims=1)
    for numClass in 1:numClasses
        # Skip classes without patterns: their metrics stay at 0
        if (numInstances[numClass] == 0)
            continue
        end
        _, _, sensitivity[numClass], specificity[numClass], ppv[numClass], npv[numClass], f1Score[numClass], _ = confusionMatrix(outputs[:,numClass], targets[:,numClass])
    end
    
    # Fill confusion matrix
    matrix = zeros(numClasses, numClasses)
    for numClassTarget in 1:numClasses, numClassOutput in 1:numClasses
        matrix[numClassTarget, numClassOutput] = sum(targets[:, numClassTarget] .* outputs[:, numClassOutput])
    end
    
    # Aggregate metrics according to the strategy specified
    if (weighted)
        weightClasses = vec(numInstances ./ sum(numInstances))
        sensitivity = sum(sensitivity .* weightClasses)
        specificity = sum(specificity .* weightClasses)
        ppv = sum(ppv .* weightClasses)
        npv = sum(npv .* weightClasses)
        f1Score = sum(f1Score .* weightClasses)
    else
        sensitivity = mean(sensitivity)
        specificity = mean(specificity)
        ppv = mean(ppv)
        npv = mean(npv)
        f1Score = mean(f1Score)
    end
    
    acc = accuracy(outputs, targets)
    errorRate = 1 - acc
    
    return acc, errorRate, sensitivity, specificity, ppv, npv, f1Score, matrix
end;

3. Develop another function called `confusionMatrix` in which the first parameter `outputs` is of type `AbstractArray{<:Real,2}`, and `targets` is of type `AbstractArray{Bool,2}` (the same as before). What this function should do is to convert the first parameter to an array of boolean values (using the function `classifyOutputs`) and call the previous function.

In [3]:
function confusionMatrix(outputs::AbstractArray{<:Real,2}, targets::AbstractArray{Bool,2}; weighted::Bool=true)
    return confusionMatrix(classifyOutputs(outputs), targets; weighted=weighted)
end;

4. Overload this function once again by developing another function of the same name that performs the same task, but this time taking as inputs two vectors (`outputs` and `targets`) of the same length, whose elements are of any type (i.e., they are of type `AbstractArray{<:Any,1}`), plus the additional parameter that selects whether the metrics are aggregated through the *macro* or the *weighted* strategy. The elements of these vectors represent the classes, which can be of any type. For example, classes can be `["dog", "cat", 3]`.

    Obviously, it is necessary that all the output classes (vector `outputs`) are included in the desired output classes (vector `targets`). Include, therefore, a defensive programming line to ensure this.
    
      - Write this line without any loop. To do this, it may be useful to refer to the functions `all`, `in` and `unique`. At the end of this assignment, the solution of how to do this line is given.
        
      - As you will see in the following assignment, this line should not really be there, since it is possible that some produced output is not among the desired outputs. This line is another example of a small exercise to practice vector programming, but once the practice is done it should be temporarily removed. The following assignment excludes the possibility of this happening by splitting the patterns in a stratified way, so this line can be added again.
        
        ### Question
        
        How is it possible that an output is not among the desired outputs? In which cases can this occur?
        
        `Answer:` This can happen when validating or testing the model. If the subset of desired outputs is not representative, as can happen for example with cross-validation, it may lack some class of the domain; if the model then assigns a pattern to that missing class, the output will not be among the desired outputs. In summary, it occurs when the validation or test subset is non-representative and lacks some of the classes of the domain.
        
    To develop this function, it is necessary to first take the possible classes of both `outputs` and `targets` by means of the `unique` function. Once this is done, both vectors, `outputs` and `targets`, will be encoded through the function `oneHotEncoding`, passing as argument this vector of classes just calculated. With the result of these two encodings, the `confusionMatrix` function can be called.
    
    ### Question
    
    It is important that the class vector is calculated first and passed in both calls. What could happen if this is not done in this way?
    
    `Answer:` the order of the classes in the vectors could change, so the columns after the one-hot-encoding would belong to different classes in the outputs and in the targets.
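The effect of the column order can be checked with a small sketch. Here `encode` is a hypothetical stand-in for `oneHotEncoding`, assumed to produce one Boolean column per entry of `classes`, in that order:

```Julia
# Stand-in for oneHotEncoding (assumed behaviour): rows are patterns, columns follow `classes`
encode(feature::AbstractVector, classes::AbstractVector) = [f == c for f in feature, c in classes]

targets = ["cat", "dog", "mouse", "dog"]
outputs = ["dog", "dog", "mouse", "cat"]

classes = unique(targets)                     # ["cat", "dog", "mouse"], shared by both encodings
sameOrder = encode(outputs, classes)
ownOrder = encode(outputs, unique(outputs))   # columns follow ["dog", "mouse", "cat"] instead

println(sameOrder == ownOrder)   # false
```

With its own class order, the first column of the encoded outputs would mean "dog" while the first column of the encoded targets means "cat", so every metric computed from them would be meaningless.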

In [4]:
function confusionMatrix(outputs::AbstractArray{<:Any,1}, targets::AbstractArray{<:Any,1}; weighted::Bool=true)
    @assert (all([in(output, unique(targets)) for output in outputs])) "targets does not contain all the classes in outputs"
    
    classes = unique(targets);
    return confusionMatrix(oneHotEncoding(outputs, classes), oneHotEncoding(targets, classes); weighted=weighted);
end;

### Learn Julia

The defensive programming line to ensure that all classes of the `outputs` vector are included in the desired output vector is as follows:

```Julia
@assert(all([in(output, unique(targets)) for output in outputs]))
```
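The same check could probably also be written with the `issubset` function from Julia's Base, which tests every element of its first argument with `in`. The vectors below are invented for illustration:

```Julia
outputs = ["dog", "cat", "dog"]
targets = ["dog", "cat", 3]

# Passes: every class produced by the model appears among the desired classes
@assert issubset(outputs, targets) "targets does not contain all the classes in outputs"

# A class that never appears in targets makes the check fail
println(issubset(["horse"], targets))   # false
```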