# Multiclass classification on Wine

The popularity of wine has been decreasing recently, and maybe unjustly so. It may come across as a snobbish version of beer, but it is in fact a drink for the masses (or at least it was in the Middle Ages). Let's reminisce about this topic for a while. Perhaps even while enjoying a glass, or a bottle?

You might need it after having so much data and code flung at your head at high speed.

## Load the dataset

Once again, this dataset can easily be loaded from sklearn.

In [None]:
from sklearn.datasets import load_wine
import pandas as pd

wine = load_wine(as_frame=True)
df = pd.concat([wine.data, wine.target.rename("target")], axis=1)

## Explore

Let's do some graphs to explore the dataset first.

Show the class distribution. This means counting how often each value occurs in the 'target' column. This is most easily done by using a [seaborn countplot](https://seaborn.pydata.org/generated/seaborn.countplot.html).

In [None]:
# Up to you!



Next up is a very nice plot: try the [seaborn pairplot](https://seaborn.pydata.org/generated/seaborn.pairplot.html) for "alcohol", "malic_acid", "color_intensity" and "hue". Set the "hue" of the graph to our 'target' column.

In [None]:
# Up to you!



You'll get a 4x4 matrix. On the main diagonal you'll see the distribution of each feature per wine type. Ideally these are three completely separate normal distributions, but that isn't always the case. The further the peaks are apart (and the less the mountains overlap), the better this feature will be at distinguishing (and predicting) our wine type.

Where two different features intersect we see a scatter plot. These dots are also colored by wine type. The same goes here: the further the colors are apart, the better this combination of features will be at predicting!

Don't worry if there's no real separation though. We are looking at 2D-data, but the computer can use many more dimensions, even if we can't imagine them anymore (or draw them on nice graphs).
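
The separation of the peaks can also be put into numbers. The block below is a minimal sketch, not part of the exercise: it reloads the data so it runs on its own, and the names `wine_data`, `frame`, `cols` and `summary` are made up for this illustration. Class means that lie far apart relative to the standard deviations hint at a feature that separates the wine types well.

```python
import pandas as pd
from sklearn.datasets import load_wine

# Reload the dataset so this block runs stand-alone (illustrative names)
wine_data = load_wine(as_frame=True)
frame = pd.concat([wine_data.data, wine_data.target.rename("target")], axis=1)

# The same four features as in the pairplot
cols = ["alcohol", "malic_acid", "color_intensity", "hue"]

# Mean and standard deviation of each feature, per wine type
summary = frame.groupby("target")[cols].agg(["mean", "std"])
print(summary.round(2))
```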

Data visualisation is largely about simplifying the data: showing the underlying truth without the noise. And one of the best plots for simplifying is the boxplot.

For "alcohol", "malic_acid", "color_intensity" and "proline", draw the boxplots for the three different types of wine.

In [None]:
# Up to you!



Some skewing, some outliers, but nothing too exciting.

A very interesting graph is the correlation matrix. It shows to what extent the different features move together (e.g. one goes up, the other goes up too) or in opposite directions (one goes up, the other goes down). It's mainly interesting once we get started on feature selection, because a bunch of closely related features is dangerous for the model.

Draw a seaborn heatmap of the correlations between all columns except the target column.

In [None]:
# Up to you!



Some strong correlations (flavanoids and od280/od315_of...), some of them negative (hue and malic_acid). We left out the target column because it is categorical, whereas a correlation assumes numeric, continuous variables. Applying it directly to a categorical target can be:

- Misleading (especially if the classes are just codes without numerical meaning).
- Statistically inappropriate unless you encode the target in a meaningful way.

At this point we're just exploring the data, meaning we want to understand how features relate to each other, independently of the target. This helps detect:

- Redundancies or high multicollinearity (e.g. if two features are strongly correlated, one might be dropped).
- Underlying structure or clusters in the data.
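
Next to the heatmap, the strongest pairs can also be listed as plain numbers. This is a minimal sketch, assuming pandas' default Pearson correlation; it reloads the features so it runs on its own, and the names `features`, `corr` and `pairs` are only illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_wine

# Reload the features only (no target column), so this runs stand-alone
features = load_wine(as_frame=True).data

# Pearson correlation between every pair of features
corr = features.corr()

# Indices of the upper triangle (k=1 skips the diagonal), so each pair appears once
cols = corr.columns
i, j = np.triu_indices(len(cols), k=1)
pairs = pd.Series(
    corr.to_numpy()[i, j],
    index=pd.MultiIndex.from_arrays([cols[i], cols[j]]),
)

# Sort by absolute strength, so strong negative correlations rank high too
pairs = pairs.sort_values(key=lambda s: s.abs(), ascending=False)
print(pairs.head(5))
```

Pairs at the top of this list are the first candidates to look at when dropping redundant features.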


## Predicting

Let's try to predict the wine type by using a decision tree. (We'll go over decision trees in the next chapter. For now we just need some results to assess whether this decision tree is any good at predicting the wine type.) We've already loaded the data, but concatenated it into one dataframe. Now we need to split it again into features and target. We'll also need a train/test split.

In [None]:
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    confusion_matrix, classification_report
)

import matplotlib.pyplot as plt

# Load the dataset
# wine = load_wine() -> already loaded in first cell
X = wine.data
y = wine.target
feature_names = wine.feature_names
class_names = wine.target_names

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Train a decision tree classifier using a max depth of 4 and our usual random state (42). Predict on the test set and print a confusion matrix and a classification report!

In [None]:
# Up to you!



Looking good. Our precision is fine and the confusion matrix shows no real problems (only 2 false predictions).

We can visualize the tree itself using "plot_tree".

In [None]:
# Up to you!



When choosing a model there's a trade-off between understandability and quality. A neural network is often a very good model, but its decisions are hard for humans to explain. A decision tree is, generally speaking, not the best of models (it does well in our case, but that's because the dataset is not too hard), but you could explain it to a 10-year-old.

Funniest thing: split the data again using a random state of 24, retrain the model and show the decision tree again.

Note: make sure to use another name for your X and Y variables as we'll be getting back to the original model later on.

In [None]:
# Up to you!



You get a subtly different tree. How the model trains depends on the train/test split, which makes sense: we're giving the model different training data.

## Data mutilation

When we look back at the first graph we made we see that class 2 is slightly underrepresented. Find out how many samples we have of every class.

In [None]:
# Up to you!



We have 48 samples of class 2, which is 27% of the data. If we had only 4 samples in class 2, that would be only 3% of the data. The following code randomly removes rows from the dataset to make sure only 4 rows of class 2 remain.

(We're introducing a class imbalance here.)

In [None]:
# Filter rows where target is 2
class_2_rows = df[df['target'] == 2]

# Randomly sample 4 rows from class 2
class_2_sample = class_2_rows.sample(n=4, random_state=42)

# Keep all rows of class 0 and class 1
class_0_and_1_rows = df[df['target'] != 2]

# Combine the sampled class 2 rows with class 0 and class 1 rows
df_unbalanced = pd.concat([class_0_and_1_rows, class_2_sample], axis=0)

# Shuffle the resulting dataframe (still unbalanced, only the row order changes)
df_unbalanced = df_unbalanced.sample(frac=1, random_state=42).reset_index(drop=True)

print(df_unbalanced['target'].value_counts())

# Split the unbalanced data into features and target again
X_ub = df_unbalanced.drop("target", axis=1)
y_ub = df_unbalanced["target"]
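
As a quick check on the percentages mentioned before this cell, the share of class 2 before and after the removal can be computed directly. A minimal sketch, assuming the class counts of the sklearn Wine dataset (59, 71 and 48 samples); the variable names are illustrative.

```python
# Class counts in the original Wine dataset (class 0, 1, 2)
counts_before = {0: 59, 1: 71, 2: 48}
# Class counts after keeping only 4 rows of class 2
counts_after = {0: 59, 1: 71, 2: 4}

# Share of class 2 = rows of class 2 / all rows
share_before = counts_before[2] / sum(counts_before.values())
share_after = counts_after[2] / sum(counts_after.values())

print(f"Class 2 before: {share_before:.1%}")  # 27.0%
print(f"Class 2 after:  {share_after:.1%}")   # 3.0%
```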

Now build another decision tree based on this imbalanced dataset. Use a max_depth of 2 (otherwise the data is too easy for the model).

In [None]:
# Up to you!



Note that some errors or warnings occur because there is a class for which no predictions were made. This was to be expected: a weak (shallow) model with only a little bit of data... In a final notebook we'd look into it and fix it.
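
To see where such a message comes from: when a class is never predicted, its precision is 0 divided by 0. The sketch below uses hypothetical toy labels (not the wine data) and the `zero_division` parameter of `classification_report` to report 0 in that case. This only silences the message; it does not fix the underlying lack of data.

```python
from sklearn.metrics import classification_report

# Hypothetical toy labels: class 2 exists in y_true but is never predicted
y_true = [0, 0, 1, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 1]

# zero_division=0 reports 0 for the undefined precision instead of warning
report = classification_report(y_true, y_pred, zero_division=0, output_dict=True)

# Class 2: nothing predicted, so precision is set to 0; recall is 0 out of 1
print(report["2"]["precision"], report["2"]["recall"])
```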

## Comparing micro and macro average

First, print out the confusion matrices for the first and the unbalanced model.

In [None]:
# Up to you!



And just because we can (and because coding is nice), calculate the macro and micro-averages by hand.

In [None]:
# Up to you!



If all went well you should see that the micro-averaged sensitivity (recall) stayed high, while the macro average went down because of the class imbalance. The same misclassifications are also counted in the micro average, but they barely register there because they concern only a couple of rows.
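
The difference is easiest to see on a small made-up example. The confusion matrix below is hypothetical (it is not the result of the models above) and the variable names are illustrative. Micro recall pools all samples, whereas macro recall first computes the recall per class and then averages, so one badly predicted small class weighs as much as a large one.

```python
import numpy as np

# Hypothetical confusion matrix (rows = true class, columns = predicted class),
# with class 2 heavily underrepresented and never predicted correctly
cm = np.array([
    [12, 0, 0],
    [1, 13, 0],
    [0, 1, 0],
])

true_positives = np.diag(cm)   # correct predictions per class
support = cm.sum(axis=1)       # number of true samples per class

# Recall per class, then the unweighted mean over the classes (macro)
recall_per_class = true_positives / support
macro_recall = recall_per_class.mean()

# Pool all samples first, then compute recall once (micro)
micro_recall = true_positives.sum() / cm.sum()

print(f"recall per class: {recall_per_class.round(3)}")
print(f"macro recall: {macro_recall:.3f}")
print(f"micro recall: {micro_recall:.3f}")
```

In this toy case the single misclassified row of class 2 pulls the macro average down by a third, while the micro average hardly moves.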