# The wine dataset

For the second exercise we'll be revisiting the wine dataset.

## Data import

First, import the data and explore it.

In [None]:
from sklearn.datasets import load_wine
import pandas as pd

X, y = load_wine(return_X_y=True)

df = pd.DataFrame(X)
df['target'] = y
df.head()

The boxplot is a great way of showing distributions. Show the boxplots for columns 0 through 12 in one graph.

In [None]:
#DELETE
df.drop('target', axis=1).boxplot(figsize=(15, 6))

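You'll see one normal-looking boxplot, for column 12, and twelve very small ones, for columns 0 through 11. The values in column 12 are so much larger than the rest that the shared y-axis squashes the other boxes.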
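
The difference in scale that the boxplot hints at can also be put into numbers. The following is a minimal sketch, assuming the same `load_wine` data as above; the names `features`, `spread` and `widest` are introduced here for illustration only.

```python
from sklearn.datasets import load_wine
import pandas as pd

X, _ = load_wine(return_X_y=True)
features = pd.DataFrame(X)

# Per-column spread: standard deviation, minimum and maximum
spread = features.agg(['std', 'min', 'max']).T
print(spread.round(2))

# Column with the largest standard deviation
widest = spread['std'].idxmax()
print("Widest column:", widest)
```

A column whose standard deviation is far larger than the others will dominate any method that relies on plain Euclidean distances.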

## 2D clustering

Let's create a scatterplot of column 2 vs column 12. Use "plt.axis('equal')" to make sure both axes have the same scaling. (Also comment the line out to compare.)

In [None]:
#DELETE
import matplotlib.pyplot as plt

plt.figure(figsize=(8, 6))
plt.scatter(df[2], df[12], c=df['target'], cmap='viridis', edgecolor='k')
plt.xlabel('Column 2')
plt.ylabel('Column 12')
plt.title('Scatterplot of Column 2 vs Column 12')
plt.colorbar(label='Target')
plt.axis('equal')  # Make x and y axes the same scale
plt.show()

Train a clustering model on this unaltered data. Create three clusters.

In [None]:
#DELETE
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=3, random_state=42)
labels = kmeans.fit_predict(df.drop('target', axis=1))


And now compare the predicted clusters with the actual groups in a crosstab.

In [None]:
#DELETE
df['kmeans_labels'] = labels

pd.crosstab(df['target'], df['kmeans_labels'], rownames=['True'], colnames=['Predicted'])

## Scaling

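The clusters match the actual groups poorly, because the large values in column 12 dominate the distances k-means works with. The solution is scaling. Apply this to the wine dataset. Copy the results into a new dataframe where you also drop the "kmeans_labels" column you got before.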

In [None]:
#DELETE
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
df_scaled = df.copy()
df_scaled.loc[:, 0:12] = scaler.fit_transform(df.loc[:, 0:12])
df_scaled.drop('kmeans_labels', axis=1, inplace=True)
df_scaled.head()
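
What the scaler does can be checked by hand: `StandardScaler` subtracts each column's mean and divides by its standard deviation. The sketch below assumes the same wine data and uses a separate, hypothetical variable `scaler_demo` so it does not overwrite the `scaler` fitted above.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

X, _ = load_wine(return_X_y=True)

# Hypothetical name, kept apart from the notebook's own `scaler`
scaler_demo = StandardScaler().fit(X)
X_sklearn = scaler_demo.transform(X)

# Same computation by hand: subtract the column mean,
# divide by the population standard deviation (ddof=0)
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

print("Same result:", np.allclose(X_sklearn, X_manual))
print("Column means after scaling:", X_manual.mean(axis=0).round(6))
print("Column stds after scaling: ", X_manual.std(axis=0).round(6))
```

After scaling, every column has a mean of about 0 and a standard deviation of about 1, so no single column outweighs the others.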

Now recreate the scatter plot from before. Does it still matter whether you ask the plot to scale both axes equally?

In [None]:
#DELETE
import matplotlib.pyplot as plt

plt.figure(figsize=(8, 6))
plt.scatter(df_scaled[2], df_scaled[12], c=df_scaled['target'], cmap='viridis', edgecolor='k')
plt.xlabel('Column 2')
plt.ylabel('Column 12')
plt.title('Scatterplot of Column 2 vs Column 12')
plt.colorbar(label='Target')
plt.axis('equal')  # Make x and y axes the same scale
plt.show()

Now retrain the clustering model, this time on the scaled data.

In [None]:
#DELETE
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=3, random_state=42)
labels = kmeans.fit_predict(df_scaled.drop('target', axis=1))


Also create a crosstab to compare.

In [None]:
df_scaled['kmeans_labels'] = labels

pd.crosstab(df_scaled['target'], df_scaled['kmeans_labels'], rownames=['True'], colnames=['Predicted'])

That's more like it.
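
The improvement can also be summarised in a single number. The sketch below uses the adjusted Rand index from `sklearn.metrics`, which is not part of the exercise itself; the helper `cluster_ari` and the explicit `n_init=10` are assumptions made for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_wine
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

def cluster_ari(features):
    """Fit a 3-cluster k-means and score its agreement with the true classes."""
    # n_init=10 is an assumed setting, chosen for stable results
    model = KMeans(n_clusters=3, n_init=10, random_state=42)
    return adjusted_rand_score(y, model.fit_predict(features))

ari_raw = cluster_ari(X)
ari_scaled = cluster_ari(StandardScaler().fit_transform(X))
print(f"ARI unscaled: {ari_raw:.3f}")
print(f"ARI scaled:   {ari_scaled:.3f}")
```

An adjusted Rand index of 1 means the clusters match the true groups perfectly, regardless of how the cluster numbers are assigned; values near 0 mean the match is no better than chance.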

## Inference

Using the following line, we'll randomly select one row per group, so three rows in total, from the original dataset.

In [None]:
sampled_df = df.groupby('target').apply(lambda x: x.sample(1, random_state=42)).reset_index(drop=True)
sampled_df

We won't be using the first model, as it performs poorly, so let's see where these rows end up using the second model.

In [None]:
#DELETE
# Predict cluster labels using the fitted kmeans model
predicted_labels = kmeans.predict(sampled_df.drop('target', axis=1).drop('kmeans_labels', axis=1))
predicted_labels

It appears they all belong to the second group. But do they?

They do not. The data we are using for inference is unscaled, whereas the data we used to train the model was scaled. The problem is that if we simply fit a new scaler on this data, it learns different parameters (mean and standard deviation) than the original one. Let's try it anyway:

In [None]:
#DELETE
from sklearn.preprocessing import StandardScaler

new_scaler = StandardScaler()
df_scaled_sample = sampled_df.copy()
df_scaled_sample.loc[:, 0:12] = new_scaler.fit_transform(sampled_df.loc[:, 0:12])
df_scaled_sample.drop('kmeans_labels', axis=1, inplace=True)
df_scaled_sample.head()

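If we compare the scaled values of column 12 in both dataframes, we get the following table:

| Dataframe | Unscaled value | Scaled value |
| -----|-----|-----|
| Original | 1065.0 | 1.013009 |
| Original | 735.0 | -0.037874 |
| Sample | 1065.0 | 1.246235 |
| Sample | 714.0 | -1.202059 |

The same unscaled value, 1065.0, ends up as 1.013009 with the original scaler but as 1.246235 with the scaler fitted on the sample.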
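
The reason for the mismatch is visible in the fitted parameters themselves. The sketch below is a standalone illustration, assuming the first row of each class as a hypothetical three-row sample (not the rows drawn above); `full_scaler` and `sample_scaler` are names introduced here.

```python
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# Scaler fitted on all rows, as in the training step
full_scaler = StandardScaler().fit(X)

# Hypothetical three-row sample: the first row of each class
first_rows = [int((y == k).argmax()) for k in range(3)]
X_sample = X[first_rows]
sample_scaler = StandardScaler().fit(X_sample)

col = 12
print("mean  full:", full_scaler.mean_[col], " sample:", sample_scaler.mean_[col])
print("scale full:", full_scaler.scale_[col], " sample:", sample_scaler.scale_[col])

# The same raw row lands in a different place under each scaler
raw = X_sample[:1]
z_full = full_scaler.transform(raw)[0, col]
z_sample = sample_scaler.transform(raw)[0, col]
print("scaled with full scaler:", z_full, " with sample scaler:", z_sample)
```

A scaler fitted on three rows only knows the mean and spread of those three rows, so its output is not comparable to the data the model was trained on.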

We have to use the original scaler. If you stored the sample scaler in the same variable as the original one, you have to refit it on the original data first. If not, you can call the transform method on the original scaler directly.

In [None]:
#DELETE
# Scale the sampled rows with the ORIGINAL scaler (transform only, no refit)
sample_features = sampled_df.drop(['target', 'kmeans_labels'], axis=1)
df_scaled_sample = scaler.transform(sample_features)
cluster = kmeans.predict(df_scaled_sample)
print("Predicted cluster:", cluster)
sampled_df['kmeans_labels'] = cluster
# Compare the true groups with the predicted clusters
pd.crosstab(sampled_df['target'], cluster, rownames=['True'], colnames=['Predicted'])


## Exporting

Scaling, although very beneficial for our model, adds an extra step in inference. It also requires us to export our scaler as well as our model when we want to deploy this model.

In [None]:
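import os
import joblib

# Make sure the export directory exists before writing to it
os.makedirs('../exports', exist_ok=True)
joblib.dump(scaler, '../exports/scaler.pkl')
joblib.dump(kmeans, '../exports/kmeans_model.pkl')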

And how would inference work now?

In [None]:
import joblib
# Load model and scaler if needed
scaler = joblib.load('../exports/scaler.pkl')
kmeans = joblib.load('../exports/kmeans_model.pkl')

# New sample(s): a nested list here, but a NumPy array or DataFrame works too
X_new = [[13.2, 2.7, 2.5, 16.0, 100.0, 2.8, 3.2, 0.3, 2.1, 5.0, 1.1, 2.6, 1200.0]]

# Apply the SAME scaling
X_new_scaled = scaler.transform(X_new)

# Predict cluster
cluster = kmeans.predict(X_new_scaled)
print("Predicted cluster:", cluster)

Try restarting the Jupyter kernel to remove all existing variables. It should still work!
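
One possible way to avoid exporting two separate files, not used in this exercise, is to chain the scaler and the model in a scikit-learn `Pipeline`. The sketch below is an illustration under stated assumptions: the file name `wine_pipeline.pkl`, the temporary directory and `n_init=10` are choices made here, not part of the original setup.

```python
import os
import tempfile

import joblib
from sklearn.cluster import KMeans
from sklearn.datasets import load_wine
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, _ = load_wine(return_X_y=True)

# Scaler and model chained into one object; n_init=10 is an assumed setting
pipeline = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=3, n_init=10, random_state=42),
)
pipeline.fit(X)

# Export and reload a single file (temporary directory used for illustration)
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'wine_pipeline.pkl')
    joblib.dump(pipeline, path)
    loaded = joblib.load(path)

# Raw, unscaled input: the pipeline applies the fitted scaler before predicting
X_new = [[13.2, 2.7, 2.5, 16.0, 100.0, 2.8, 3.2, 0.3, 2.1, 5.0, 1.1, 2.6, 1200.0]]
print("Predicted cluster:", loaded.predict(X_new))
```

With this approach the scaling step cannot be forgotten at inference time, because it travels with the model.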