# Winetrees

This notebook works with the wine dataset and aims to show you the benefits of k-fold cross-validation and grid search.

Let's begin by importing the data.

In [1]:
from sklearn.datasets import load_wine
X, y = load_wine(return_X_y=True)

In [4]:
import pandas as pd

# Convert X to a DataFrame
df_wine = pd.DataFrame(X)

# Add y as a new column named "type"
df_wine['type'] = y

# Save first: only the last expression of a cell is displayed
df_wine.to_csv('wine.csv', index=False)

df_wine.head()

## Data exploration

Get the info and the size of the wine-dataset.

In [None]:
#DELETE
print(f"Feature matrix shape: {X.shape}")
print(f"Target vector shape: {y.shape}")
print(f"Number of classes: {len(set(y))}")

Now explore the data. Create three different plots of your dataset, for example:

- A histogram of the first feature
- A boxplot of every feature
- A correlation matrix

In [None]:
#DELETE
import seaborn as sns
import numpy as np

import matplotlib.pyplot as plt

# Histogram of the first feature
plt.figure(figsize=(8, 5))
plt.hist(X[:, 0], bins=20, color='skyblue', edgecolor='black')
plt.title('Histogram of the First Feature')
plt.xlabel('Feature Value')
plt.ylabel('Frequency')
plt.show()

# Boxplot of all features
for i in range(X.shape[1]):
    plt.figure(figsize=(8, 2))
    plt.boxplot(X[:, i], vert=False, patch_artist=True, boxprops=dict(facecolor='lightblue'))
    plt.title(f'Boxplot of Feature {i + 1}')
    plt.xlabel('Feature Value')
    plt.ylabel('Feature Index')
    plt.show()

# Heatmap of the correlation matrix
correlation_matrix = np.corrcoef(X, rowvar=False)
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', cbar=True)
plt.title('Heatmap of Feature Correlation')
plt.show()

## Model creation

Normally we'd now split the dataset into a train and a test set, but the dataset is very small. Create a decision tree with a maximum depth of 3 and a random state of 42. Also set up KFold with 5 splits and shuffling.

In [None]:
#DELETE
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score

# Values match the exercise text: depth 3, 5 shuffled folds
model = DecisionTreeClassifier(max_depth=3, random_state=42)
kf = KFold(n_splits=5, shuffle=True, random_state=42)

Now go over all the splits in the KFold object you just made and store the accuracies in a list. Print the list and the mean of the list.

In [None]:
#DELETE
scores = []

for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    # Fit the model
    model.fit(X_train, y_train)

    # Predict and score
    y_pred = model.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    scores.append(acc)

# Output results
print("Fold accuracies:", scores)
print("Average accuracy:", np.mean(scores))

Now play around with the tree depth and number of folds a bit. The idea is to get the highest possible average accuracy.

But wait, you may think, why not take the fold with the highest accuracy? Because the model in every fold is trained on only part of the dataset and scored on a small held-out part. If you were to pick that particular model, you'd be overfitting to that particular fold.

That's why we are playing around with the numbers. What we want to achieve is **hyperparameter tuning**, and the hyperparameter here is the depth of the tree. Too deep and we're overfitting, too shallow and we'll have a bad fit. By trying out the values by hand we can see which one works best.
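To make the "too deep versus too shallow" trade-off visible, here is a minimal sketch that compares the accuracy on each fold's training part with the accuracy on its held-out part. It assumes the same wine data and a 5-fold shuffled KFold. `cross_validate` with `return_train_score=True` is a convenience that is not used elsewhere in this notebook, and the depth values are only illustrative.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import KFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=42)

train_acc, test_acc = {}, {}
for depth in [1, 2, 3, 5, 10]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    # return_train_score=True also scores each fold's own training part
    cv = cross_validate(tree, X, y, cv=kf, return_train_score=True)
    train_acc[depth] = cv["train_score"].mean()
    test_acc[depth] = cv["test_score"].mean()
    print(f"max_depth={depth:2d}  train={train_acc[depth]:.3f}  held-out={test_acc[depth]:.3f}")
```

A training accuracy that keeps climbing while the held-out accuracy stalls or drops is the usual sign of overfitting.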

K-fold lets us do this without splitting the data into three parts (train, validation, test), because we don't have enough data for that (only 178 rows).
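To see how the rows are shared out, the following sketch prints the size of the training and test part of every fold. It assumes 5 shuffled splits, as in the exercise above; the variable `fold_sizes` is introduced here only for illustration.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import KFold

X, y = load_wine(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=42)

fold_sizes = []
for train_index, test_index in kf.split(X):
    # Each row lands in the test part of exactly one fold
    fold_sizes.append((len(train_index), len(test_index)))
    print(f"train rows: {len(train_index)}, test rows: {len(test_index)}")

print("Total rows:", X.shape[0])
```

Every row is used for testing once and for training in all other folds, so no data has to be set aside permanently.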

Another question is probably nagging at you right now: why am I changing numbers manually and testing them? Can't we automate that? Yes, we can.

## Gridsearch

Grid search is a way to automate hyperparameter tuning. We supply a list of candidate values for each parameter, and the model is tested on every combination of them.

We'll be using it to tune another decision tree parameter: min_samples_split. This parameter means "don't split a node unless it has at least this many samples". If we allow the model to keep on splitting nodes we'll end up with a lot of leaves in our tree, which means we're probably overfitting. Setting it too high will lead to underfitting.
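The effect of min_samples_split on the size of the tree can be checked with a small sketch like the one below. It fits trees on the full dataset with no depth limit, so that only min_samples_split restricts growth. The values 50 and 100 are deliberately extreme and chosen for illustration only; they are not part of the grid used in this notebook.

```python
from sklearn.datasets import load_wine
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)

n_leaves = {}
for min_split in [2, 10, 50, 100]:
    # max_depth is left unlimited so only min_samples_split restricts growth
    tree = DecisionTreeClassifier(min_samples_split=min_split, random_state=42)
    tree.fit(X, y)
    n_leaves[min_split] = tree.get_n_leaves()
    print(f"min_samples_split={min_split:3d} -> {n_leaves[min_split]} leaves")
```

Fewer leaves mean a simpler tree, which is less prone to overfitting but may miss real structure in the data.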

We'll try max_depth values of 2, 3, 4, 5, 6 and min_samples_split values of 2, 4, 6, 10.


In [None]:
#DELETE
from sklearn.model_selection import GridSearchCV

model = DecisionTreeClassifier(random_state=42)

param_grid = {
    'max_depth': [2, 3, 4, 5, 6],
    'min_samples_split': [2, 4, 6, 10]
}

# Set up the grid search
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5)

# Fit to the full dataset (CV will split internally)
grid_search.fit(X, y)

# Output
print("Best parameters:", grid_search.best_params_)
print("Best cross-validated accuracy:", grid_search.best_score_)

What happened is that grid_search built a grid of 20 (= 5 x 4) cells, combining every max_depth with every min_samples_split. From this grid it deduced that a max_depth of 4 and a min_samples_split of 6 work best.
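The combinations in the grid can be listed without fitting anything. This sketch uses scikit-learn's `ParameterGrid` helper on the same `param_grid`; the variable names `combinations` and `n_cv_fits` are introduced here for illustration.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    'max_depth': [2, 3, 4, 5, 6],
    'min_samples_split': [2, 4, 6, 10]
}

# Every pairing of a max_depth with a min_samples_split
combinations = list(ParameterGrid(param_grid))
print("Number of combinations:", len(combinations))

# With cv=5, every combination is fitted once per fold
n_cv_fits = len(combinations) * 5
print("Cross-validation fits:", n_cv_fits)
print("First three combinations:", combinations[:3])
```

By default GridSearchCV also refits the best combination on the full dataset (`refit=True`), which is why `grid_search.best_estimator_` is ready to use afterwards.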

Try to show the full grid as well.

In [None]:
#DELETE
for mean, std, params in zip(
    grid_search.cv_results_['mean_test_score'],
    grid_search.cv_results_['std_test_score'],
    grid_search.cv_results_['params']
):
    print(f"{params} => Accuracy: {mean:.3f} (+/- {std:.3f})")

Can you show it as a heatmap as well?

In [None]:
#DELETE

import pandas as pd

results = pd.DataFrame(grid_search.cv_results_)

heatmap_data = results.pivot(
    index="param_max_depth",
    columns="param_min_samples_split",
    values="mean_test_score"
)

# Plot heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(heatmap_data, annot=True, fmt=".3f", cmap="viridis")
plt.title("GridSearchCV Mean Accuracy")
plt.ylabel("max_depth")
plt.xlabel("min_samples_split")
plt.tight_layout()
plt.show()