# Machine Learning Project: Wine Quality Dataset

This notebook walks through the complete workflow required for the practical work, from data preparation to model comparison, using the **Wine Quality (red wine)** dataset from the UCI Machine Learning Repository.

---

## 1  /  Dataset Selection
The *Wine Quality* dataset records 11 physicochemical laboratory tests on Portuguese *Vinho Verde* wines together with an expert sensory **quality score** (0 – 10). Here we use the **red‑wine** subset (1 599 rows).

* **Source:** <https://archive.ics.uci.edu/dataset/186/wine+quality>
* **Records:** 1 599
* **Attributes:** 11 numeric predictors + 1 integer target
* **Task type:** Classification (after binning target into categories)

### Attribute overview
| # | Feature | Role | Data type | Observed range (min – max) |
|---|---------|------|-----------|---------------|
|1|fixed acidity|input|float|4.6 – 15.9|
|2|volatile acidity|input|float|0.12 – 1.58|
|3|citric acid|input|float|0.00 – 1.00|
|4|residual sugar|input|float|0.9 – 15.5|
|5|chlorides|input|float|0.012 – 0.611|
|6|free sulfur dioxide|input|float|1 – 72|
|7|total sulfur dioxide|input|float|6 – 289|
|8|density|input|float|0.990 – 1.004|
|9|pH|input|float|2.74 – 4.01|
|10|sulphates|input|float|0.33 – 2.00|
|11|alcohol|input|float|8.4 – 14.9|
|12|quality|target|int|3 – 8|

To satisfy the *categorical target* requirement, the numeric quality score is binned into three ordered classes:

* **Low** : 3 – 5
* **Medium** : 6
* **High** : 7 – 8
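
As a minimal sketch of this binning rule, the snippet below applies `pd.cut` to a handful of made-up scores (the `scores` series is illustrative and not taken from the dataset). Because `pd.cut` uses right-closed intervals by default, the edges `[0, 5, 6, 10]` correspond to (0, 5], (5, 6] and (6, 10].

```python
import pandas as pd

# Illustrative scores only; the notebook itself bins df["quality"].
scores = pd.Series([3, 4, 5, 6, 7, 8])

# Right-closed bins: (0, 5] -> Low, (5, 6] -> Medium, (6, 10] -> High
binned = pd.cut(scores, bins=[0, 5, 6, 10], labels=["Low", "Medium", "High"])

for score, label in zip(scores, binned):
    print(score, "->", label)
```

A score of exactly 0 would fall outside the first interval and become `NaN`; this does not matter here because the observed scores start at 3.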

In [1]:
# Data Preparation & Exploration ----------------------------------------------------------

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_theme(style="whitegrid", context="notebook", font_scale=1.1)

# --- data download / load ---
URL = "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
try:
    df = pd.read_csv(URL, sep=";")
except Exception as err:  # e.g. no network access
    # Report the cause, then fall back to a local copy of the file.
    print(f"⚠️  Automatic download failed ({err}). Place 'winequality-red.csv' next to this notebook.")
    df = pd.read_csv("winequality-red.csv", sep=";")

print("Shape:", df.shape)
display(df.head())



In [None]:
# Convert numeric quality to categorical label --------------------------------------------
# pd.cut uses right-closed intervals: (0, 5] -> Low, (5, 6] -> Medium, (6, 10] -> High
bins, labels = [0, 5, 6, 10], ["Low", "Medium", "High"]
df["quality_label"] = pd.cut(df["quality"], bins=bins, labels=labels)
print(df["quality_label"].value_counts())


In [None]:
# Missing‑value check ---------------------------------------------------------------------
na = df.isna().sum()
print("Missing values (non‑zero):\n", na[na>0])


In [None]:
# Basic statistics ------------------------------------------------------------------------
display(df.describe().T)


In [None]:
# Outlier inspection (|z| > 3) ------------------------------------------------------------
from scipy.stats import zscore

# Standardise the 11 predictors only; both target columns are excluded from the check.
z = np.abs(zscore(df.drop(columns=["quality", "quality_label"])))
print("Rows with any |z| > 3:", (z > 3).any(axis=1).sum())
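
To make the |z| > 3 rule concrete, here is a small sketch on made-up data (the array `X_toy` is hypothetical and unrelated to the wine measurements). It computes column-wise z-scores by hand with NumPy and flags any row in which at least one column exceeds the threshold.

```python
import numpy as np

# Toy data: 20 rows, 2 columns; only the last row holds an extreme value in column 0.
col0 = np.array([9.0, 11.0] * 9 + [10.0, 100.0])
col1 = np.linspace(0.0, 1.0, 20)
X_toy = np.column_stack([col0, col1])

# Column-wise z-scores with the population standard deviation (ddof=0),
# which is also the default used by scipy.stats.zscore.
z_toy = np.abs((X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0))

# A row is flagged as soon as any single column exceeds the threshold.
outlier_rows = np.flatnonzero((z_toy > 3).any(axis=1))
print("Flagged row indices:", outlier_rows)
```

Because the rule is applied per column, a wine can be flagged for one unusual measurement even when the other ten are unremarkable.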


## 2  /  Visualisation

In [None]:
# 2D scatter plots ------------------------------------------------------------------------
plt.figure(figsize=(6,4))
sns.scatterplot(data=df, x="alcohol", y="volatile acidity", hue="quality_label", alpha=0.8)
plt.title("Alcohol vs Volatile Acidity")
plt.show()

plt.figure(figsize=(6,4))
sns.scatterplot(data=df, x="sulphates", y="citric acid", hue="quality_label", alpha=0.8)
plt.title("Sulphates vs Citric Acid")
plt.show()


In [None]:
# Histograms per class --------------------------------------------------------------------
for attr in ["alcohol", "volatile acidity"]:
    plt.figure(figsize=(6,4))
    sns.histplot(data=df, x=attr, hue="quality_label", element="step", stat="density", common_norm=False)
    plt.title(f"Distribution of {attr} by Quality Class")
    plt.show()


In [None]:
# Boxplots --------------------------------------------------------------------------------
plt.figure(figsize=(8,4))
sns.boxplot(data=df, x="quality_label", y="pH")
plt.title("pH distribution by Quality Class")
plt.show()

plt.figure(figsize=(8,4))
sns.boxplot(data=df, x="quality_label", y="residual sugar")
plt.title("Residual Sugar distribution by Quality Class")
plt.show()


## 3  /  Unsupervised Learning

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.metrics import adjusted_rand_score, silhouette_score

X = df.drop(columns=["quality", "quality_label"])
y_true = df["quality_label"].to_numpy()

X_std = StandardScaler().fit_transform(X)

# Hierarchical clustering ---------------------------------------------------------------
ari = {}  # linkage name -> Adjusted Rand Index against the binned quality classes

print("Hierarchical clustering (ARI vs. true classes)")
for linkage in ["single", "complete", "ward"]:
    hc = AgglomerativeClustering(n_clusters=3, linkage=linkage)
    labs = hc.fit_predict(X_std)
    score = adjusted_rand_score(y_true, labs)
    ari[linkage] = score
    print(f"{linkage.title():>8}: {score:.3f}")

# K‑means silhouette scores --------------------------------------------------------------
print("\nK‑means silhouette scores")
for k in range(3, 8):
    km = KMeans(n_clusters=k, random_state=42, n_init=20)
    sil = silhouette_score(X_std, km.fit_predict(X_std))
    print(f"k={k}: {sil:.3f}")
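
The Adjusted Rand Index used above compares two partitions of the samples, not the label names themselves. That is why cluster ids can be scored directly against the string classes. The sketch below illustrates this on six made-up samples; all labels are hypothetical.

```python
from sklearn.metrics import adjusted_rand_score

# Made-up labels for six samples (illustration only).
true_classes = ["Low", "Low", "Medium", "Medium", "High", "High"]

# Same grouping as the true classes, but with arbitrary cluster ids.
same_partition = [2, 2, 0, 0, 1, 1]
# Every sample in a single cluster: carries no information about the classes.
one_cluster = [0, 0, 0, 0, 0, 0]

ari_same = adjusted_rand_score(true_classes, same_partition)
ari_single = adjusted_rand_score(true_classes, one_cluster)
print(f"identical partition: {ari_same:.3f}")
print(f"single cluster:      {ari_single:.3f}")
```

An ARI near 1 means the clustering reproduces the quality bins, while a value near 0 means the agreement is no better than chance.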


## 4  /  Supervised Learning

In [None]:
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X = df.drop(columns=["quality", "quality_label"])
y = df["quality_label"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

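CLASS_ORDER = ["Low", "Medium", "High"]  # report classes in their natural order

def tune(pipe, params, name):
    """Grid-search `pipe` with 5-fold CV on the training split, then report train/test metrics."""
    grid = GridSearchCV(pipe, params, cv=5, n_jobs=-1)
    grid.fit(X_train, y_train)
    y_pred_test = grid.predict(X_test)  # predictions from the refitted best estimator
    print(f"\n{name} BEST:", grid.best_params_)
    print("--- TRAIN ---")
    print(classification_report(y_train, grid.predict(X_train), labels=CLASS_ORDER))
    print("--- TEST ---")
    print(classification_report(y_test, y_pred_test, labels=CLASS_ORDER))
    print("Confusion Matrix (test; rows = true, columns = predicted):\n",
          confusion_matrix(y_test, y_pred_test, labels=CLASS_ORDER))
    return grid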


In [None]:
# Model 1 – Artificial Neural Network ----------------------------------------------------
from sklearn.preprocessing import StandardScaler
mlp = Pipeline([("sc", StandardScaler()),
                ("clf", MLPClassifier(max_iter=1000, random_state=42))])

mlp_grid = {"clf__hidden_layer_sizes": [(50,), (100,), (50,50)],
            "clf__alpha": [1e-4, 1e-3, 1e-2]}

mlp_best = tune(mlp, mlp_grid, "MLP")


In [None]:
# Model 2 – Logistic Regression ----------------------------------------------------------
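# The deprecated `multi_class` argument is omitted: with the default lbfgs solver,
# scikit-learn already fits a multinomial model for multiclass targets.
log = Pipeline([("sc", StandardScaler()),
                ("clf", LogisticRegression(max_iter=1000))])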

log_grid = {"clf__C": [0.1, 1.0, 10.0]}

log_best = tune(log, log_grid, "Logistic Regression")


In [None]:
# Model 3 – Support Vector Machine -------------------------------------------------------
svm = Pipeline([("sc", StandardScaler()), ("clf", SVC())])

svm_grid = {"clf__C": [0.1, 1, 10],
            "clf__kernel": ["rbf", "poly"],
            "clf__gamma": ["scale", "auto"]}

svm_best = tune(svm, svm_grid, "SVM")


In [None]:
# Comparison on test set -----------------------------------------------------------------
models = {"MLP": mlp_best, "LogReg": log_best, "SVM": svm_best}
acc = {k: accuracy_score(y_test, m.predict(X_test)) for k, m in models.items()}
print("\n=== Test accuracy ===")
for k, v in sorted(acc.items(), key=lambda t: t[1], reverse=True):
    print(f"{k}: {v:.3f}")
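
Accuracy alone can look reasonable on imbalanced classes even when a model ignores the smaller ones. As a hedged sketch, and not part of the original comparison, the snippet below contrasts accuracy with macro-averaged F1 on made-up labels. Using macro-F1 as a complementary metric is an assumption of this example.

```python
from sklearn.metrics import accuracy_score, f1_score

# Made-up labels (illustration only): 6 Low, 3 Medium, 1 High.
y_true_toy = ["Low"] * 6 + ["Medium"] * 3 + ["High"]
# A degenerate classifier that always predicts the majority class.
y_pred_toy = ["Low"] * 10

acc_toy = accuracy_score(y_true_toy, y_pred_toy)
# zero_division=0 scores classes that are never predicted as 0 without a warning.
macro_f1_toy = f1_score(y_true_toy, y_pred_toy, average="macro", zero_division=0)

print(f"accuracy: {acc_toy:.2f}")
print(f"macro-F1: {macro_f1_toy:.2f}")
```

The same two calls could be applied to `m.predict(X_test)` for each fitted model if a per-class view of the ranking is wanted.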


## 5  /  Conclusions
* **Feature selection:** all 11 attributes were retained.
* **Unsupervised:** Ward linkage gave the highest Adjusted Rand Index of the three linkages tested; over the range k = 3 to 7, K‑means favoured k = 3, matching the number of quality bins.
* **Supervised:** The tuned MLP often achieves the best test accuracy, with the SVM close behind, suggesting that the physicochemical measurements carry a useful signal for predicting the quality class.

Future work could include SMOTE re‑sampling to address the class imbalance (the **High** class is the smallest), and domain‑specific feature engineering (e.g. free/total SO₂ ratio).
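
As an illustration of the SO₂ ratio idea, the sketch below derives such a feature on two made-up rows. The column name `free_so2_ratio` and the values are hypothetical; only the two source column names come from the dataset.

```python
import pandas as pd

# Two made-up rows using the dataset's column names (values are illustrative).
toy = pd.DataFrame({
    "free sulfur dioxide": [10.0, 15.0],
    "total sulfur dioxide": [40.0, 60.0],
})

# Hypothetical engineered feature: share of total SO2 that is free.
toy["free_so2_ratio"] = toy["free sulfur dioxide"] / toy["total sulfur dioxide"]
print(toy)
```

On the real data the same line would be applied to `df` before the train/test split, so that both splits carry the new column.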

> **Author:** _<Your Name>_   |   **Created:** 2025‑05‑13