## 5. Understanding Classification Problems

In previous workshops, we focused on regression problems, learning how to predict continuous variables using methods like Random Forest and Neural Networks. Today, we will work on a different type of problem: **classification**. Specifically, we will use machine learning to predict a **categorical sediment characteristic** from the sample's **location** and some of its **physical characteristics**.

Our dataset comes from the Geological Survey of the Netherlands and contains descriptions of sediments from the North Sea. Today, we will use a small, pre-processed subset of the dataset, but you can download the full dataset (and many other geological datasets!) from [DINOloket](https://www.dinoloket.nl/en/subsurface-data).


![DINOloket](images/5_DINOloket.png)

## 5.1 Problem Definition

The dataset for this workshop contains sample descriptions of sediments from the North Sea. When a sample is collected, the Geological Survey of the Netherlands (GDN, its Dutch abbreviation) describes the sediment following a standard method. With this "Standard Drill Description Method" ([Standaard Boor Beschrijvingsmethode](https://www.grondwatertools.nl/sites/default/files/GDN_SBB-NITG-00-141-A-(3)_20161111.pdf)), the GDN aims to systematically capture multiple characteristics of each collected sample. The method applies not only to marine sediments but to any sample described by the GDN, although some characteristics are relevant only for certain types of samples.

While some of these descriptions can be made quickly, others require laboratory analysis, which is more time-consuming and resource-intensive. Today, we will try to predict one of these time-consuming measurements, the **Medium sand size category**, based on **location** and some easy-to-describe **sediment properties**.

The **Medium sand size category** (the sand median class, *Zandmediaanklasse*) takes **7** different values in our dataset, based on the median sand grain size of the sample: the six classes in the table below plus the broader coarse class listed after it. This measurement applies only to samples described as sand and to those with a representative sand admixture.

| Class             | Sand median (µm) | Code |
|-------------------|------------------|------|
| Extremely fine    | 63 ≤ x < 105     | ZUF  |
| Very fine         | 105 ≤ x < 150    | ZZF  |
| Moderately fine   | 150 ≤ x < 210    | ZMF  |
| Moderately coarse | 210 ≤ x < 300    | ZMG  |
| Very coarse       | 300 ≤ x < 420    | ZZG  |
| Extremely coarse  | 420 ≤ x < 2000   | ZUG  |

**Other categories (ABM = NEN209 and ONB)**:

- Coarse category: 210 ≤ x < 2000 µm (ZGC)
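
The class boundaries above can be expressed as a small lookup. The following is a minimal sketch, not part of the GDN method itself: the function name `sand_median_class` is hypothetical, and it covers only the six main classes from the table (the broader `ZGC` class overlaps with them, so it is left out here).

```python
from bisect import bisect_right
from typing import Optional

# Lower bounds (in µm) of the six main classes, copied from the table above.
LOWER_BOUNDS_UM = [63, 105, 150, 210, 300, 420]
CODES = ["ZUF", "ZZF", "ZMF", "ZMG", "ZZG", "ZUG"]
UPPER_LIMIT_UM = 2000  # upper bound of the coarsest class (exclusive)


def sand_median_class(median_um: float) -> Optional[str]:
    """Return the class code for a sand median in µm, or None if out of range."""
    if not (LOWER_BOUNDS_UM[0] <= median_um < UPPER_LIMIT_UM):
        return None
    # bisect_right counts how many lower bounds are <= median_um,
    # so subtracting 1 gives the index of the matching class.
    return CODES[bisect_right(LOWER_BOUNDS_UM, median_um) - 1]
```

Because each interval is closed on the left and open on the right, a median of exactly 210 µm falls in `ZMG`, not `ZMF`.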


Below are the predictor variables and the target variable for this exercise. Note that the sediment properties (e.g., color, calcareous portion) are also classified according to the categories in the 'Standard Drill Description Method'. If you want more details about these features, refer to the [document](https://www.grondwatertools.nl/sites/default/files/GDN_SBB-NITG-00-141-A-(3)_20161111.pdf) (information in Dutch).


| Feature Name (English)       | Feature Name (Dutch)              | Explanation                                | Reference (Page) |
|-------------------------------|------------------------------------|--------------------------------------------|------------------|
| Sample ID                    | NITG.nr                           | Sample ID                                 |                  |
| X coordinate                 | X.coordinaat                      | X coordinate (EPSG:28992)                 |                  |
| Y coordinate                 | Y.coordinaat                      | Y coordinate (EPSG:28992)                 |                  |
| Height with respect to NAP   | Maaiveldhoogte..m.tov.NAP         | Z coordinate (depth)                      |                  |
| Color                        | Kleur                             | Color [SBB format: L4]                    | 47               |
| Calcareous portion           | Kalkgehalte                       | Calcareous content [SBB format: L14]      | 75               |
| Main soil type               | Hoofdgrondsoort                   | Main soil type [SBB format: L3.1]         | 35               |
| Organic portion              | Organische Stof                   | Organic portion [SBB format: L9]          | 65               |
| Sand median class            | Zandmediaanklasse                 | Sand median [SBB format: L7.2.1]          | 52               |
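
To make the split between predictors and target concrete, here is a minimal sketch using pandas. It assumes the data has been loaded into a DataFrame whose column names match the Dutch names in the table exactly; the function name `split_features_target` and the choice of one-hot encoding are illustrative assumptions, not necessarily what the workshop itself does.

```python
import pandas as pd

# Column names as listed in the table above (assumed to match the file exactly).
ID_COLUMN = "NITG.nr"  # identifier only, not used as a predictor
TARGET = "Zandmediaanklasse"
NUMERIC_FEATURES = ["X.coordinaat", "Y.coordinaat", "Maaiveldhoogte..m.tov.NAP"]
CATEGORICAL_FEATURES = ["Kleur", "Kalkgehalte", "Hoofdgrondsoort", "Organische Stof"]


def split_features_target(df: pd.DataFrame):
    """Return (X, y): one-hot encoded predictors and the class labels."""
    # Keep only rows where the target class was actually described.
    df = df.dropna(subset=[TARGET])
    X = df[NUMERIC_FEATURES + CATEGORICAL_FEATURES]
    # One-hot encode the categorical SBB codes; numeric columns pass through.
    X = pd.get_dummies(X, columns=CATEGORICAL_FEATURES)
    y = df[TARGET].astype("category")
    return X, y
```

With a DataFrame `df` read from the workshop file (for example via `pd.read_csv`, with whatever file name your copy uses), `X, y = split_features_target(df)` gives a feature matrix and a label vector that a classifier can be fitted on. The sample ID is dropped because it identifies a sample rather than describing it.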
