# 4. Understanding Classification Problems

In previous workshops, we focused on regression problems, learning how to predict continuous variables using methods like Random Forest and Neural Networks. Today, we will work on a different type of problem: **classification**. Specifically, we will use machine learning to predict a **categorical sediment characteristic** based on the sample's **location** and some of its **physical characteristics**.

Our dataset comes from the Geological Survey of the Netherlands and contains descriptions of sediments from the North Sea. Today, we will use a small, pre-processed subset of the dataset, but you can download the full dataset (and many other geological datasets!) at [DINOloket](https://www.dinoloket.nl/en/subsurface-data).


![DINOloket](images/5_DINOloket.png)

## 4.1 Problem Definition

In this workshop, we will use a dataset containing sample descriptions of sediments from the North Sea. When a sample is collected, the Geological Survey of the Netherlands (GDN, its Dutch abbreviation) follows a standard method to describe the sediment. Using this "Standard Drill Description Method" ([Standaard Boor Beschrijvingsmethode](https://www.grondwatertools.nl/sites/default/files/GDN_SBB-NITG-00-141-A-(3)_20161111.pdf)), the GDN aims to systematically capture multiple characteristics of the collected samples. This method applies not only to marine sediments but to any sample described by the GDN. Of course, some characteristics only apply to certain types of samples.

While some of these descriptions can be made quickly, others require laboratory analysis, which is more time-consuming and resource-intensive. Today, we will try to predict one of the time-consuming measurements (i.e. **Medium sand size category**) based on **location** and some easy-to-describe **sediment properties**.

The **Medium sand size category** takes **7** different values in our dataset, based on the median sand grain size of the sample. This measurement only applies to samples described as sand and to those that contain a representative portion of sand admixture.

| Class             | Sand median (µm) | Code |
|-------------------|------------------|------|
| Extremely fine    | 63 ≤ x < 105     | ZUF  |
| Very fine         | 105 ≤ x < 150    | ZZF  |
| Moderately fine   | 150 ≤ x < 210    | ZMF  |
| Moderately coarse | 210 ≤ x < 300    | ZMG  |
| Very coarse       | 300 ≤ x < 420    | ZZG  |
| Extremely coarse  | 420 ≤ x < 2000   | ZUG  |

**Other categories (ABM = NEN209 and ONB)**:

- Coarse category: 210 ≤ x < 2000 µm (ZGC)


Below are the predictor variables and the target variable for this exercise. Note that the sediment properties (e.g., color, calcareous portion) are also classified according to the categories in the 'Standard Drill Description Method'. If you want more details about these features, refer to the [document](https://www.grondwatertools.nl/sites/default/files/GDN_SBB-NITG-00-141-A-(3)_20161111.pdf) (information in Dutch).


| Feature Name (English)       | Feature Name (Dutch)              | Explanation                                | Reference (Page) |
|-------------------------------|------------------------------------|--------------------------------------------|------------------|
| Sample ID                    | NITG.nr                           | Sample ID                                 |                  |
| X coordinate                 | X.coordinaat                      | X coordinate (CRS: EPSG:28992)            |                  |
| Y coordinate                 | Y.coordinaat                      | Y coordinate (CRS: EPSG:28992)            |                  |
| Height with respect to NAP   | Maaiveldhoogte..m.tov.NAP         | Z coordinate (depth)                      |                  |
| Color                        | Kleur                             | Color [SBB format: L4]                    | 47               |
| Calcareous portion           | Kalkgehalte                       | Calcareous content [SBB format: L14]      | 75               |
| Main soil type               | Hoofdgrondsoort                   | Main soil type [SBB format: L3.1]         | 35               |
| Organic portion              | Organische Stof                   | Organic portion [SBB format: L9]          | 65               |
| Sand median class            | Zandmediaanklasse                 | Sand median [SBB format: L7.2.1]          | 52               |


## 5.2 Predicting probabilities

Bridging the gap between regression and classification tasks might sound difficult, but it can be achieved in a few simple steps:

1. **From class to number**: To use our regression models to predict classes, we need to convert the classes to numbers. If the classification task is binary (only two possibilities), this is as simple as using 1 and 0. For multi-class problems a single number is not enough, so we assign each class a vector whose length equals the number of classes. These vectors are filled with 0s except for a single 1 in the position corresponding to the respective class. This is why the approach is known as *one-hot encoding*.

> **Attention**: You might be wondering why we do not predict a single number and simply assign each additional class another value, for example 2. This is actually a very bad idea, as it would assume that your classes are ordered and would penalize errors unevenly during training.

2. **From number to probability**: In the previous step we converted our classes to numbers, but only to 1s and 0s, whereas our regression models predict continuous numbers. The trick is that, instead of predicting the class directly, we predict the probability of that class. To convert this probability back to a binary output, we set a *probability threshold*, usually 0.5, to decide between the two classes. For multiple classes, we simply take the class with the highest probability.

3. **From probability to regression**: The final step is to make our regression model predict only values (or vectors of values) between zero and one. For binary problems, this is easily solved by applying the *sigmoid* function to the output of the regression model. For multi-class problems, another function called *softmax* can be applied to the predicted vector to ensure that its components lie between zero and one and sum to one, as we would expect from a set of probabilities.
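
The three steps above can be sketched with NumPy on made-up numbers. The class codes are taken from the table in Section 4.1, while the raw regression outputs (`raw_binary`, `raw_multi`) are invented purely for illustration; this is a minimal sketch, not the workshop's own implementation.

```python
import numpy as np

# Step 1: from class to number (one-hot encoding).
# Three of the sand median codes are used here as example classes.
classes = np.array(["ZUF", "ZZF", "ZMF"])
labels = np.array(["ZZF", "ZUF", "ZMF", "ZZF"])
one_hot = (labels[:, None] == classes[None, :]).astype(int)
print(one_hot)

# Step 3 (binary): the sigmoid squeezes any real number into (0, 1).
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

raw_binary = np.array([-2.0, 0.0, 3.0])  # hypothetical regression outputs
prob_binary = sigmoid(raw_binary)

# Step 2 (binary): threshold the probability to obtain a class.
pred_binary = (prob_binary > 0.5).astype(int)
print(prob_binary, pred_binary)

# Step 3 (multi-class): softmax turns a vector into probabilities summing to 1.
def softmax(z):
    shifted = z - np.max(z)  # subtracting the maximum avoids overflow
    exp_z = np.exp(shifted)
    return exp_z / exp_z.sum()

raw_multi = np.array([1.0, 2.5, 0.3])  # hypothetical outputs, one per class
prob_multi = softmax(raw_multi)

# Step 2 (multi-class): pick the class with the highest probability.
print(prob_multi, classes[np.argmax(prob_multi)])
```

Note that a raw output of exactly 0 gives a probability of exactly 0.5, which the strict comparison `> 0.5` assigns to class 0.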

Let's see this concept in practice by training the simplest classifier available, logistic regression. In this case, the regressor used to predict the probabilities is a simple multiple linear regression. To get started, let's see what our output actually looks like, and transform it to a numerical value as explained above.

In [None]:
# Transform the output to a binary number


Perfect, now that we have our outputs as numbers, we can train the logistic regression model on our binary data. Once the training is complete, we can check the linear coefficients of such model.

In [None]:
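# Define the logistic regression model and train it with our data
from sklearn.linear_model import LogisticRegression

log_reg_binary = LogisticRegression()
log_reg_binary.fit(X_train_binary, y_train_binary.ravel())

# Print the coefficients of the fitted linear model (one per predictor)
print(log_reg_binary.coef_)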

This already gives us important information: the sign of each coefficient tells us which variables are positively associated with #OUTCOME# and which are negatively associated with it.

Now let's try to calculate what the prediction should be for the first point in the test set by following the steps that we introduced before - but now the other way around!

In [None]:
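import numpy as np

# Define the predictors for the value we want to predict
# (assumes X_test_binary is a NumPy array, so X is a 1D vector)
X = X_test_binary[0]

# Calculate the output of the regression model (intercept + weighted sum of predictors)
regression_output = log_reg_binary.intercept_[0] + np.dot(log_reg_binary.coef_[0], X)
print(f"Regression output: {regression_output}")

# Calculate the probability by using the sigmoid function
probability = 1 / (1 + np.exp(-regression_output))
print(f"Probability of class 1: {probability}")

# Calculate the final class according to a predefined probability threshold
probability_threshold = 0.5
predicted_class = int(probability > probability_threshold)
print(f"Predicted class: {predicted_class}")

# Check that the logistic regression returns the same prediction
# (predict expects a 2D array, so the single sample is reshaped to one row)
log_reg_class = log_reg_binary.predict(X.reshape(1, -1))
print(f"Class predicted by the model: {log_reg_class[0]}")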

Great! Now that we understand how the prediction is made, let's apply it to all the test data and examine the effect of changing the probability threshold.

In [None]:
# Plot a map of the test data classes


# Plot a map of the predicted classes with a threshold of 0.25, 0.5 and 0.75


As you might have expected, as we increase the probability threshold we predict the class we assigned to 0 more often. Essentially, we require the model to be increasingly confident that the data corresponds to class 1 to classify it as such.
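
To make the effect of the threshold concrete, the sketch below applies the three thresholds to a handful of made-up class-1 probabilities (the values in `probabilities` are invented, not model output). With a fitted scikit-learn classifier, such probabilities would typically come from the second column of `predict_proba`.

```python
import numpy as np

# Hypothetical predicted probabilities of class 1 for six samples
probabilities = np.array([0.10, 0.30, 0.45, 0.55, 0.70, 0.90])

# Count how many samples are assigned to class 1 for each threshold
counts = {}
for threshold in (0.25, 0.5, 0.75):
    predicted = (probabilities > threshold).astype(int)
    counts[threshold] = int(predicted.sum())
    print(f"threshold={threshold}: predicted classes {predicted}")
```

The number of samples assigned to class 1 shrinks as the threshold rises, while the underlying probabilities stay the same.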

It is best to tune the probability threshold to ensure that we don't overpredict one of the classes, especially if the data is imbalanced. Stay tuned for a future workshop on imbalanced data if you are interested in the topic!

## 5.3 Evaluating the models

### The confusion matrix

The best way to visualize the results of a classifier is through a "confusion matrix". This is nothing more than a table whose columns indicate the classes predicted by the ML model and whose rows indicate the actual classes from the data. This is what it looks like for a binary problem:

| Class           | Predicted Positive         | Predicted Negative         |
|------------------------|----------------------------|----------------------------|
| **Actual Positive**        | **TP** (True Positive)     | **FN** (False Negative)    |
| **Actual Negative**        | **FP** (False Positive)    | **TN** (True Negative)     |

Although the two binary classes are usually referred to as "positive" and "negative", they can be any two classes, so negative does not necessarily imply "bad". In our case, for example, the classes refer to #INSERT BINARY CLASS DESCRIPTION#. Let's generate a confusion matrix with the predictions from the logistic regression model we trained before. This can be easily computed with the `confusion_matrix` function in `sklearn.metrics`.
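
One detail is worth keeping in mind before reading the output: scikit-learn sorts the class labels, so for labels 0 and 1 the array returned by `confusion_matrix` has the negative class in the first row and column, i.e. `[[TN, FP], [FN, TP]]`, which is the reverse of the table above. A small sketch with made-up labels (`y_true` and `y_pred` are invented for illustration):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels (1 = positive, 0 = negative)
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 1, 0, 0, 0, 0])

cm = confusion_matrix(y_true, y_pred)
print(cm)

# For a binary problem, ravel() unpacks the counts in this order
tn, fp, fn, tp = cm.ravel()
print(f"TP={tp}, FN={fn}, FP={fp}, TN={tn}")

# Passing labels=[1, 0] puts the positive class first, matching the table above
cm_table_order = confusion_matrix(y_true, y_pred, labels=[1, 0])
print(cm_table_order)
```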

In [None]:
from sklearn.metrics import confusion_matrix

# Make predictions with the logistic regression model for the test set
y_pred_binary = log_reg_binary.predict(X_test_binary)


# Create the confusion matrix with the logistic regression model predictions
confusion_matrix(y_test_binary, y_pred_binary)

While simply displaying the confusion matrix is already quite informative about model performance, sometimes we want to summarize it with specific metrics. Many of the common metrics used to evaluate classifier models can be computed directly from the confusion matrix. Here are some examples:

* **Accuracy**: Probably the best known classification metric, it measures the fraction of samples that were classified into the correct class.

$$Accuracy = \frac{TP+TN}{TP+FP+TN+FN}$$

* **Precision**: A high precision indicates there were not many false alarms of a specific (in binary, the positive) class.

$$Precision = \frac{TP}{TP+FP}$$

* **Recall**: A high recall indicates that there were not many missed cases of a specific (in binary, the positive) class. 

$$Recall = \frac{TP}{TP+FN}$$
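
As a quick check of the three formulas, here is a worked example with made-up counts (the values of `tp`, `fn`, `fp` and `tn` are hypothetical and do not come from our dataset):

```python
# Hypothetical confusion matrix counts
tp, fn, fp, tn = 30, 10, 20, 40

example_accuracy = (tp + tn) / (tp + fp + tn + fn)
example_precision = tp / (tp + fp)
example_recall = tp / (tp + fn)

print(f"Accuracy:  {example_accuracy:.2f}")   # 70 correct out of 100 samples
print(f"Precision: {example_precision:.2f}")  # 30 of the 50 predicted positives are correct
print(f"Recall:    {example_recall:.2f}")     # 30 of the 40 actual positives are found
```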

First, try to compute those metrics with pen and paper from the confusion matrix that we just saw. When you are done, you can compare your results with those obtained by using the corresponding Scikit-learn functions.

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Compute the accuracy of the logistic regression model
accuracy = accuracy_score(y_test_binary, y_pred_binary)
print(f"The accuracy of the logistic regression model is: {accuracy}")

# Compute the precision of the logistic regression model
precision = precision_score(y_test_binary, y_pred_binary)
print(f"The precision of the logistic regression model is: {precision}")

# Compute the recall of the logistic regression model
recall = recall_score(y_test_binary, y_pred_binary)
print(f"The recall of the logistic regression model is: {recall}")


### A more balanced metric

While accuracy is the most straightforward way to measure the performance of a classification model, it is not always the most suitable. This is especially true when the data being predicted is imbalanced, that is, when we have many more instances of one class than of the others. In that case, the classifier commonly learns to favour the majority class and performs very poorly on the rest. For those cases, there is a better metric that combines precision and recall into a more balanced measure of performance, the *F1-score*.

$$F1\ score = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall}$$
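
To see why accuracy can be misleading, consider a made-up test set of 100 samples with only 5 positives. The counts below are hypothetical, and the helper `f1_from_counts` is written here for illustration only; it defines the score as 0 when there are no true positives, which is a common convention.

```python
def f1_from_counts(tp, fp, fn):
    """F1 score from confusion matrix counts; 0.0 if there are no true positives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# A classifier that always predicts "negative" on 95 negatives and 5 positives
tp, fp, fn, tn = 0, 0, 5, 95
accuracy_majority = (tp + tn) / (tp + fp + fn + tn)
f1_majority = f1_from_counts(tp, fp, fn)
print(f"Accuracy: {accuracy_majority:.2f} - F1 score: {f1_majority:.2f}")

# A classifier that finds 4 of the 5 positives at the cost of 6 false alarms
tp, fp, fn, tn = 4, 6, 1, 89
accuracy_detector = (tp + tn) / (tp + fp + fn + tn)
f1_detector = f1_from_counts(tp, fp, fn)
print(f"Accuracy: {accuracy_detector:.2f} - F1 score: {f1_detector:.2f}")
```

The second classifier has a slightly lower accuracy but a much higher F1 score, because it actually detects the minority class.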

Let's compute both accuracy and F1-score for our current data, then introduce a class imbalance artificially and compare them again. Which of the two do you feel better relates to model performance, from what you can see in the confusion matrix?

In [None]:
from sklearn.metrics import f1_score

# Calculate the F1 score of the logistic regression model
# (stored as `f1` so that the imported f1_score function is not overwritten)
f1 = f1_score(y_test_binary, y_pred_binary)

# Print both accuracy and F1 score
print(f"Accuracy: {accuracy} - F1 score: {f1}")

In [None]:
# Re-assign classes to artificially unbalance the dataset

# Train a new logistic regression model with the unbalanced dataset

# Make predictions with the logistic regression model for the test set

# Create the confusion matrix with the logistic regression model predictions

In [None]:
# Compute the accuracy and F1 score of the logistic regression model

## 5.4 Multi-class tasks

### Predicting multiple classes

While binary problems are relatively common, many times we would like to predict multiple classes. This is the case, for example, if we want to determine the sand grain size category from our soil data. In this case we also want to use a more powerful algorithm, so we will compare the results of artificial neural networks (ANN) with those of a random forest (RF). Let's start by using *one-hot encoding* to convert our classes to numerical vectors and then train our models. The random forest model can work fine with classes, so we do not need to encode our outputs for it.
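
A minimal sketch of this encoding, assuming scikit-learn's `LabelBinarizer` as one possible tool (the workshop solution may use a different one). The labels in `y` and the values in `predicted_probabilities` are invented for illustration.

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

# Hypothetical sand median codes for five samples
y = np.array(["ZMF", "ZUF", "ZZF", "ZMF", "ZMG"])

encoder = LabelBinarizer()
y_one_hot = encoder.fit_transform(y)

print(encoder.classes_)  # column order of the one-hot vectors (sorted codes)
print(y_one_hot)

# Hypothetical ANN output for one sample: one probability per class
predicted_probabilities = np.array([[0.1, 0.6, 0.2, 0.1]])

# For multi-class data, inverse_transform picks the class with the largest value
predicted_code = encoder.inverse_transform(predicted_probabilities)
print(predicted_code)
```

Keeping the fitted encoder around is convenient, since the same object maps the model's output vectors back to the original class codes.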

In [None]:
# Use one-hot encoding on the sand size data and print the results

 

In [None]:
# Train the ANN and RF models



Now that the models are trained, we can show the predictions on a map.

In [None]:
# Plot a map of the test data classes


# Plot a map of the predicted classes for both ANN and RF


### Generalizing the metrics

To evaluate the models we can use the same tools that we did for the binary problem with some slight differences. Let's see how the confusion matrix looks for this task. 

In [None]:
# Show the confusion matrix


As we keep adding classes, the confusion matrix gets more and more cluttered, which makes summary metrics increasingly useful. All the metrics that we have seen previously can be generalized to multi-class problems, with some slight differences. For the accuracy, for example, we simply sum all the elements on the diagonal and divide by the total number of samples.
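
As a small illustration of the diagonal rule, here is a sketch on a made-up 3-class confusion matrix (the counts in `cm` are hypothetical):

```python
import numpy as np

# Hypothetical 3-class confusion matrix (rows: actual, columns: predicted)
cm = np.array([
    [50,  5,  0],
    [10, 30,  5],
    [ 0,  5, 20],
])

# Correct predictions sit on the diagonal
multiclass_accuracy = np.trace(cm) / cm.sum()
print(f"Accuracy: {multiclass_accuracy:.2f}")
```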

In [None]:
# Compute the accuracy for each of the models


Recall and precision are now defined for each class. In our example, we might be especially concerned about correctly classifying extremely fine-grained sands, since those can easily infiltrate machinery that might be installed on these sites, greatly reducing its useful lifespan. Which of the two metrics should we then compare?
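
The per-class definitions can be read directly off the confusion matrix: recall divides each diagonal element by its row total (actual samples of that class), precision by its column total (samples predicted as that class). A sketch with made-up counts, using three of the class codes purely as labels:

```python
import numpy as np

class_names = ["ZUF", "ZZF", "ZMF"]  # subset of codes, for illustration only

# Hypothetical confusion matrix (rows: actual, columns: predicted)
cm = np.array([
    [8,  2,  0],
    [4, 30,  6],
    [0,  5, 45],
])

true_positives = np.diag(cm)
recall_per_class = true_positives / cm.sum(axis=1)     # divide by actual totals (rows)
precision_per_class = true_positives / cm.sum(axis=0)  # divide by predicted totals (columns)

for name, p, r in zip(class_names, precision_per_class, recall_per_class):
    print(f"{name}: precision={p:.2f}, recall={r:.2f}")
```

With scikit-learn, the same per-class values can be obtained by passing `average=None` to `precision_score` and `recall_score`.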

In [None]:
# Compute the recall of both models for the extremely fine grain size


## 4.4 Final remarks

In this workshop you have learnt the basics of how to tackle a classification task, from the output definition to training and of course evaluating your model. To cement this knowledge, try and use the same data but choose another variable as your target, for example the color. You can re-use some of the code above. Good luck!

In [None]:
# Define the predictor and target variables

# Split between train and test sets

# Train the classification model of your choice

# Evaluate the model performance on the test set
