## Introduction

This project aims to build a classification model that predicts wine quality based on its physicochemical properties using machine learning techniques. The dataset used is the Wine Quality dataset from the UCI Machine Learning Repository, which contains measurements like acidity, sugar content, sulfur dioxide levels, pH, and alcohol percentage for various red wine samples.

In the original data, quality is a numeric score on a 0 to 10 scale, assigned through human sensory evaluation. Because these scores are subjective, ordered, and unevenly distributed across the scale, we collapse them into two classes and treat the problem as binary classification, which simplifies modeling.

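Using pandas.cut() with the bin edges (2, 6.5, 8), we divide the quality scores into two categories. Each bin excludes its lower edge and includes its upper edge:
- "bad": quality scores above 2 up to and including 6.5 (whole-number scores 3 to 6)
- "good": quality scores above 6.5 up to and including 8 (whole-number scores 7 and 8)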

This transformation allows us to frame the problem as a binary classification task, where we train machine learning models to classify whether a wine is of good quality or not.
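The edge behavior of the binning described above can be checked on a few hand-picked scores. This is a minimal sketch: the `scores` series is made up to probe the bin edges and is not taken from the dataset.

```python
import pandas as pd

# Toy scores chosen to probe the bin edges; these are not rows from the dataset.
scores = pd.Series([2, 3, 6, 7, 8])

# Same bins and labels as in this project: the intervals are (2, 6.5] and (6.5, 8].
labels = pd.cut(scores, bins=(2, 6.5, 8), labels=['bad', 'good'])

# A score equal to the lowest edge (2) falls outside every bin and becomes NaN.
result = labels.tolist()
print(result)
```

If a score of exactly 2 could occur, passing `include_lowest=True` to `pd.cut` would place it in the first bin instead of producing a missing value.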

The project applies several classification algorithms such as Decision Tree, Logistic Regression, Naive Bayes, and K-Nearest Neighbors (KNN). The performance of these models is evaluated using metrics like accuracy, precision, recall, F1-score, and ROC-AUC, allowing us to understand which physicochemical factors best predict wine quality and which model generalizes well.
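To make the listed metrics concrete, the sketch below computes accuracy, precision, recall, and F1 from the four cells of a binary confusion matrix. The helper name `binary_metrics` and the counts are hypothetical and chosen only for illustration; they are not results from this project.

```python
def binary_metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall, f1) for the positive class."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against empty denominators when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    return accuracy, precision, recall, f1


# Hypothetical counts: 100 wines, 14 of them truly "good" (positive class).
acc, prec, rec, f1 = binary_metrics(tp=6, fp=4, fn=8, tn=82)
print(f"accuracy={acc:.2f} precision={prec:.2f} recall={rec:.2f} f1={f1:.2f}")
```

With these made-up counts, accuracy looks high while recall on the "good" class stays below one half, which is why accuracy alone is not enough when the classes are unbalanced.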

# Importing required packages.

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
%matplotlib inline

# Loading the dataset

In [None]:
wine = pd.read_csv('winequality-red.csv')
wine.head()

# Information on the dataset

In [None]:
wine.info()

# Column distribution

In [None]:
fig = plt.figure(figsize = (10,6))
sns.barplot(hue='quality', x = 'quality', y = 'fixed acidity', data = wine)

In [None]:
fig = plt.figure(figsize = (10,6))
sns.barplot(hue='quality', x = 'quality', y = 'citric acid', data = wine)

In [None]:
fig = plt.figure(figsize = (10,6))
sns.barplot(hue='quality', x = 'quality', y = 'residual sugar', data = wine)

In [None]:
fig = plt.figure(figsize = (10,6))
sns.barplot(hue='quality', x = 'quality', y = 'chlorides', data = wine)

In [None]:
fig = plt.figure(figsize = (10,6))
sns.barplot(hue='quality', x = 'quality', y = 'free sulfur dioxide', data = wine)

In [None]:
fig = plt.figure(figsize = (10,6))
sns.barplot(hue='quality', x = 'quality', y = 'total sulfur dioxide', data = wine)

In [None]:
fig = plt.figure(figsize = (10,6))
sns.barplot(hue='quality', x = 'quality', y = 'sulphates', data = wine)

In [None]:
fig = plt.figure(figsize = (10,6))
sns.barplot(hue='quality', x = 'quality', y = 'alcohol', data = wine)
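By default, `sns.barplot` draws the mean of the `y` column for each `x` category, so the bar heights in the plots above can also be read as a table. The sketch below shows the idea on a toy frame; `toy` and its values are invented stand-ins for the `wine` DataFrame.

```python
import pandas as pd

# Toy stand-in for the wine DataFrame; the values are made up for illustration.
toy = pd.DataFrame({
    'quality': [5, 5, 6, 6, 7, 7],
    'alcohol': [9.0, 10.0, 10.0, 11.0, 11.0, 13.0],
    'chlorides': [0.10, 0.08, 0.08, 0.06, 0.07, 0.05],
})

# Mean of every feature per quality score: the bar heights that
# sns.barplot draws with its default mean estimator.
means = toy.groupby('quality').mean()
print(means)
```

Applied to the real data, the same `groupby('quality').mean()` call would summarize every feature at once, including those without a dedicated plot.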

# Preprocessing Data

In [None]:
# Define the bin edges. pd.cut uses right-inclusive intervals by default:
# scores in (2, 6.5] are labeled 'bad' and scores in (6.5, 8] are labeled 'good'
bins = (2, 6.5, 8)

# Define the corresponding labels for the bins
group_names = ['bad', 'good']
wine['quality'] = pd.cut(wine['quality'], bins = bins, labels = group_names)

# LabelEncoder sorts the labels alphabetically, so 'bad' becomes 0 and 'good' becomes 1
label_quality = LabelEncoder()
wine['quality'] = label_quality.fit_transform(wine['quality'])

sns.countplot(hue='quality', x='quality', data=wine)
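The count plot shows the class balance visually; the same information can be read as proportions. The sketch below uses a hypothetical `encoded` series rather than the real labels, so the printed shares are not the dataset's actual class distribution.

```python
import pandas as pd

# Hypothetical encoded labels (0 = bad, 1 = good); not the real class counts.
encoded = pd.Series([0, 0, 0, 0, 1, 0, 0, 1, 0, 0])

# Relative frequency of each class, ordered by label.
shares = encoded.value_counts(normalize=True).sort_index()
print(shares)

# Accuracy of a trivial model that always predicts the majority class.
majority_baseline = shares.max()
print(f"Majority-class baseline accuracy: {majority_baseline:.2f}")
```

Any trained model should be compared against this majority-class baseline, since a classifier can reach that accuracy without learning anything about the features.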

In [None]:
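# Separate the features from the binary target
X = wine.drop('quality', axis = 1)
y = wine['quality']

# Hold out 20% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

# Fit the scaler on the training split only, then apply the same scaling to the test split
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)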
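Because the two classes are unbalanced, one optional variation on the split above is to pass `stratify` so that both splits keep the same class ratio. This is a sketch on synthetic data, not a change to the notebook's results; `X_demo` and `y_demo` are invented for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data: 100 samples, 3 features, 20% positives (illustrative only).
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
y_demo = np.array([1] * 20 + [0] * 80)

# stratify=y_demo keeps the class ratio the same in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Fit the scaler on the training split only, then reuse it on the test split.
scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)
X_te_scaled = scaler.transform(X_te)

print(f"positive share, train: {y_tr.mean():.2f}, test: {y_te.mean():.2f}")
```

Fitting the scaler only on the training split, as the notebook already does, keeps test-set statistics from leaking into preprocessing.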

# Random Forest Classifier

In [None]:
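def evaluate_model(model, X_train, y_train, X_test, y_test):
    """Fit the model on the training split and print test-set metrics for the positive class (1 = good)."""
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)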

    print(f"🔍 Model: {model.__class__.__name__}")
    print(f"✅ Accuracy:  {accuracy_score(y_test, y_pred):.4f}")
    print(f"✅ Precision: {precision_score(y_test, y_pred):.4f}")
    print(f"✅ Recall:    {recall_score(y_test, y_pred):.4f}")
    print(f"✅ F1-score:  {f1_score(y_test, y_pred):.4f}")
    print("\n🧩 Confusion Matrix:")
    print(confusion_matrix(y_test, y_pred))
    print("\n📋 Classification Report:")
    print(classification_report(y_test, y_pred))

In [None]:
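# RandomForestClassifier is already imported at the top of the notebook.
# Fixing random_state makes the reported metrics reproducible across runs.
evaluate_model(RandomForestClassifier(random_state=42), X_train, y_train, X_test, y_test)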
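The introduction lists ROC-AUC among the evaluation metrics, but `evaluate_model` prints only label-based scores. A minimal sketch of how ROC-AUC could be computed is shown below. It assumes synthetic data from `make_classification` as a stand-in for the wine features, so the printed value says nothing about this dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced data standing in for the wine features (assumption).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=11, weights=[0.85, 0.15], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

model = RandomForestClassifier(random_state=42)
model.fit(X_tr, y_tr)

# ROC-AUC needs scores, not hard labels: use the predicted probability of class 1.
proba_good = model.predict_proba(X_te)[:, 1]
auc = roc_auc_score(y_te, proba_good)
print(f"ROC-AUC: {auc:.4f}")
```

The same two lines (`predict_proba` followed by `roc_auc_score`) could be added inside `evaluate_model` for any classifier that exposes `predict_proba`.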