Wine Quality Dataset

In [None]:
#importing basic libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [None]:
df=pd.read_csv('/content/drive/MyDrive/Colab Notebooks/winequality-red.csv')

Below is an outline of my approach to the wine quality classification project.

I began with the wine quality dataset, which includes various physicochemical measurements (e.g., alcohol, volatile acidity, sulphates) as features and a “quality” score (ranging from 3 to 8) as the target label. Upon inspection, I confirmed that there were no missing values. However, I did find duplicate records and removed them.



Since the original quality scores span from 3 to 8, I decided to convert this into a binary classification task. I labeled wines with a quality score of 7 or higher as “good,” and all others as “bad.” My thinking was that, although scores of 5 or 6 may be somewhat subjective, a score of 7 or above generally indicates a broadly acceptable wine. After binarization, I discovered that only about 13% of the samples fell into the “good” category, resulting in a class imbalance.


I plotted histograms for each feature and observed that several of them exhibit skewed distributions. Given my initial plan to use tree-based models (which are generally robust to skewness), I chose not to apply any transformations at this stage.
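
One way to put a number on the skew seen in the histograms is the sample skewness. The sketch below is only an illustration on synthetic data (the `demo` frame and its `skewed_feature` column are made up, not taken from the wine dataset); it compares the skewness of a right-skewed column before and after `np.log1p`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a right-skewed feature (NOT the wine data).
rng = np.random.default_rng(42)
demo = pd.DataFrame({"skewed_feature": rng.lognormal(mean=0.0, sigma=1.0, size=1000)})

# Sample skewness: values near 0 suggest a roughly symmetric distribution,
# positive values indicate a long right tail.
skew_before = demo["skewed_feature"].skew()
skew_after = np.log1p(demo["skewed_feature"]).skew()

print(f"skewness before log1p: {skew_before:.2f}")
print(f"skewness after  log1p: {skew_after:.2f}")
```

On the real data the same call would be `df.skew()`, which returns one value per column and could help decide which features, if any, are worth transforming.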


To identify features with the greatest impact, I generated a correlation matrix. This revealed that alcohol content, volatile acidity, and sulphates were among the features most strongly correlated with quality. Despite this insight, I did not drop or modify any features, because I expected decision trees to handle this themselves and split on the most informative features accordingly.



After splitting the data into training and test sets, I applied the Synthetic Minority Over-sampling Technique (SMOTE) to the training data only, to balance the “good” and “bad” classes. Once SMOTE had been performed, I standardized all features (mean = 0, standard deviation = 1) before proceeding to modelling.

In my first modelling pipeline, I trained three classifiers:
•	Decision Tree
•	Random Forest
•	AdaBoost


I used F1 score as my primary evaluation metric because the class distribution was highly imbalanced. Unfortunately, each of these models achieved an F1 score of only approximately 0.50 on the test set, which I consider unsatisfactory.
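
To make the choice of metric concrete, here is a small worked example in plain Python. The counts are hypothetical (a made-up test set of 100 wines with 13 "good" ones), chosen only to show how two models can share the same accuracy while differing sharply in F1 for the positive class.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) for the positive class from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical test set: 100 wines, 13 of them "good" (illustrative counts only).
# Model A labels every wine "bad": 87 true negatives, 13 false negatives.
acc_a = 87 / 100
p_a, r_a, f1_a = precision_recall_f1(tp=0, fp=0, fn=13)

# Model B finds 8 of the 13 good wines at the cost of 8 false alarms
# (79 true negatives remain).
acc_b = (8 + 79) / 100
p_b, r_b, f1_b = precision_recall_f1(tp=8, fp=8, fn=5)

print(f"Model A: accuracy={acc_a:.2f}, F1={f1_a:.3f}")
print(f"Model B: accuracy={acc_b:.2f}, precision={p_b:.3f}, recall={r_b:.3f}, F1={f1_b:.3f}")
```

Both models score 0.87 on accuracy, yet only Model B identifies any good wines, which is what the F1 score picks up.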


Hoping to improve performance, I created a second pipeline that began with principal component analysis (PCA) for dimensionality reduction. My intention was that reducing dimensionality might indirectly address skewness without requiring explicit transformations for each feature. On the resulting principal components, I then trained:
•	k-Nearest Neighbors (KNN)
•	Support Vector Classifier (SVC)
•	Gradient Boosting Classifier


Despite this change, the F1 scores on the test set again hovered around 0.50, indicating that the adjustments did not substantially improve performance.


At this point, I am not seeing any meaningful results and would like to seek your guidance. My initial thought process was to train a range of models, select the best-performing one, and then tune it for better results. I would like to hear your thoughts on whether my approach was correct and where I could improve.






In [None]:
df.info()

In [None]:
df.head()

In [None]:
df.tail()

In [None]:
df.isnull().sum()
# no null values found

In [None]:
df.shape

In [None]:
df.describe()

In [None]:
# hist plot to understand distribution
df.hist(bins=10, figsize=(10, 10))
plt.show()


In [None]:
df.duplicated().sum()

240 rows are duplicates; removing them.

In [None]:
df = df.drop_duplicates()
#check current shape of dataset
df.shape

In [None]:
import matplotlib.pyplot as plt
df.hist(bins=10, figsize=(10, 10))
plt.show()

In [None]:
# the mean and std deviation have changed but only slightly
df.describe()

In [None]:
# Bar plot for quality vs features
fig, axes = plt.subplots(nrows=4, ncols=3, figsize=(10,10)) # create a figure (size can be anything) and a 4x3 grid of axes
axes = axes.flatten() # convert 2d to 1d so no need to do matrix like iteration

features = df.columns.tolist() #convert columns into a list so that they can be iterated
features.remove('quality') # drop target label

for i, col in enumerate(features):
    sns.barplot(x='quality', y=col, data=df, ax=axes[i])
    axes[i].set_title(f'{col} vs Quality')
    axes[i].set_xlabel('Quality')
    axes[i].set_ylabel(col)

plt.tight_layout() #adjusts the plot, prevents overlapping
plt.show()

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Box plot for quality vs features
fig, axes = plt.subplots(nrows=4, ncols=3, figsize=(10, 10))
axes = axes.flatten()

features = df.columns.tolist()
features.remove('quality')

for i, col in enumerate(features):
    # Changed sns.barplot to sns.boxplot
    sns.boxplot(x='quality', y=col, data=df, ax=axes[i])
    axes[i].set_title(f'{col} vs Quality')
    axes[i].set_xlabel('Quality')
    axes[i].set_ylabel(col)

plt.tight_layout()
plt.show()

In [None]:
# constructing a heatmap to understand the correlation between the columns
correlation = df.corr()
plt.figure(figsize=(10,10))
sns.heatmap(correlation, cbar=True, square=True, fmt = '.2f', annot = True, annot_kws={'size':8}, cmap = 'Blues')

In [None]:
import matplotlib.pyplot as plt
# Get the absolute correlation values with 'quality'
quality_correlation = correlation['quality'].abs().sort_values(ascending=False)

# Remove the correlation of 'quality' with itself
quality_correlation = quality_correlation.drop('quality')

# Select the top N most important features (you can adjust N)
n = 10
most_important_features = quality_correlation.head(n)

print("Most important features based on correlation with quality:")
print(most_important_features)

# Create a pie chart of the top most important features
plt.figure(figsize=(8, 8))
plt.pie(most_important_features, labels=most_important_features.index, autopct='%1.1f%%', startangle=140)
plt.title(f'Top {n} Most Important Features for Quality (Correlation)')
plt.show()

In [None]:
# checking the distribution of quality column
df['quality'].value_counts()


In [None]:
import matplotlib.pyplot as plt
# Visualize the distribution of quality
plt.figure(figsize=(5, 5))
sns.countplot(x='quality', data=df)
plt.title('Distribution of Wine Quality')
plt.xlabel('Quality')
plt.ylabel('Count')
plt.show()

# Print the count of each quality value in a table
quality_counts = df['quality'].value_counts().sort_index()
quality_counts = quality_counts[(quality_counts.index >= 3) & (quality_counts.index <= 8)]

print("Quality Value Counts:")
print(quality_counts.to_markdown(numalign="left", stralign="left"))

total_count = quality_counts.sum()
print(f"\nTotal Count: {total_count}")

In [None]:
# binarizing the target variable as good (1) or bad (0)
# "good" means the original score is 7 or above

df['quality'] = [1 if x>=7 else 0 for x in df['quality']]
df['quality'].value_counts()

In [None]:
# plot the countplot of quality values

plt.figure(figsize=(5, 5))
sns.countplot(x='quality', data=df)
plt.title('Distribution of Binarized Wine Quality (0: Bad, 1: Good)')
plt.xlabel('Quality (0: Bad, 1: Good)')
plt.ylabel('Count')
plt.show()


The classes are imbalanced and will need handling, but the train/test split comes first.

In [None]:
# naming convention X for features and lower case y for targets
X = df.drop("quality", axis=1)
y = df["quality"]

X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=42)
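
The split above does not pass `stratify`, so the share of "good" wines in the test set depends on the random draw. One option, sketched below on synthetic labels (the `X_demo` and `y_demo` arrays are made up for illustration, not the wine data), is to pass `stratify=` so that both parts keep roughly the same class ratio.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic features and labels with roughly 13% positives (NOT the wine data).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 3))
y_demo = (rng.random(1000) < 0.13).astype(int)

# stratify=y_demo asks the splitter to preserve the class ratio in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(f"positive share overall: {y_demo.mean():.3f}")
print(f"positive share train:   {y_tr.mean():.3f}")
print(f"positive share test:    {y_te.mean():.3f}")
```

With a minority class this small, a stable class ratio in the test set makes the reported F1 less sensitive to the particular `random_state`.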

In [None]:
print(X_train.shape)
y_train.value_counts()

In [None]:
# APPLY LOG TRANSFORMATION TO SKEWED FEATURES
# List of skewed features identified from the histograms
skewed_features = [
    'fixed acidity', 'volatile acidity', 'residual sugar', 'chlorides',
    'free sulfur dioxide', 'total sulfur dioxide', 'sulphates', 'alcohol'
]

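# Work on copies so the column assignments below do not raise SettingWithCopyWarning
X_train = X_train.copy()
X_test = X_test.copy()
# Apply log transformation (np.log1p handles zero values gracefully)
for col in skewed_features:
    X_train[col] = np.log1p(X_train[col])
    X_test[col] = np.log1p(X_test[col])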

In [None]:
# Plot histograms after the log transformation
X_train.hist(bins=10, figsize=(5, 5))
plt.suptitle('Histograms of Features After Log Transformation (Training Data)', y=1.02)
plt.tight_layout()
plt.show()

In [None]:
#Standardize the feature data
scaler = StandardScaler()
feature_names = X_train.columns

# Scale the data and immediately wrap it in a DataFrame to preserve feature names
X_train_scaled_df = pd.DataFrame(scaler.fit_transform(X_train), columns=feature_names)
X_test_scaled = pd.DataFrame(scaler.transform(X_test), columns=feature_names)

print("Class distribution before SMOTE:\n", y_train.value_counts())

In [None]:

print("Descriptive statistics of scaled training data:")
X_train_scaled_df.describe()

In [None]:
# handling imbalance using smote
from imblearn.over_sampling import SMOTE

# Apply SMOTE to the SCALED training data
# SMOTE is only applied to the training set to prevent the model from seeing synthetic
# versions of the test data.
# Because the input is a DataFrame, SMOTE will also output a DataFrame
smote = SMOTE(random_state=42)
X_train_scaled, y_train_smote = smote.fit_resample(X_train_scaled_df, y_train)

# From here on, X_train_scaled refers to the scaled and SMOTE-resampled training set (scaling first, then SMOTE)
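
For intuition about what the resampling step does, the sketch below shows the core interpolation idea behind SMOTE in plain NumPy. It is a simplified illustration, not imbalanced-learn's implementation: the function name `smote_like_samples` and the toy `minority` points are hypothetical.

```python
import numpy as np

def smote_like_samples(X_minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)
    n = len(X_minority)
    k = min(k, n - 1)  # cannot have more neighbours than other points

    # Pairwise Euclidean distances within the minority class.
    dists = np.linalg.norm(X_minority[:, None, :] - X_minority[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, n, size=n_new)                      # starting samples
    picked = neighbours[base, rng.integers(0, k, size=n_new)]  # one of each sample's k neighbours
    gap = rng.random((n_new, 1))                               # position along the segment
    return X_minority[base] + gap * (X_minority[picked] - X_minority[base])

# Toy minority class: the four corners of the unit square.
minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like_samples(minority, n_new=6, k=2)
print(synthetic)
```

Every synthetic point lies on a line segment between two existing minority samples, which is why the technique adds no information beyond what the minority class already contains.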

In [None]:
# Print the class counts to confirm that SMOTE balanced the training set

print("Count of the target variable after SMOTE:")
y_train_smote.value_counts()

In [None]:
# Superseded: this cell scaled the data after SMOTE; scaling is now done before SMOTE in the cells above
#scaler = StandardScaler()
#X_train_scaled = scaler.fit_transform(X_train_smote)
#X_test_scaled = scaler.transform(X_test)  # Important: transform only for the test data so that there is no leak

In [None]:
# Decision Tree classifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, confusion_matrix , accuracy_score , precision_score, recall_score, f1_score

# the default splitter is "best", but without random_state the output can differ between runs because tie-breaking is random
dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train_scaled, y_train_smote)
y_pred_dt = dt.predict(X_test_scaled)

print("Decision Tree classifier")
print("confusion matrix (0  1)") # remember scikit learn uses 0 1 by default
print(confusion_matrix(y_test, y_pred_dt))
# look at the class 1 metrics
print(classification_report(y_test, y_pred_dt))
#printing all the main metrics for quick reference
print("\nImportant Model Evaluation Metrics:")
print(f"Accuracy: {accuracy_score(y_test, y_pred_dt):.4f}")
print(f"Precision: {precision_score(y_test, y_pred_dt):.4f}")
print(f"Recall: {recall_score(y_test, y_pred_dt):.4f}")
print(f"F1-Score: {f1_score(y_test, y_pred_dt):.4f}")

For imbalanced classes, accuracy is not enough; we need to look at precision and recall, and specifically at the F1 score, since it is high only when both recall and precision are high. That makes it a better indicator for our use case.


In [None]:
# random forest
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train_scaled, y_train_smote)
y_pred_rf =rf.predict(X_test_scaled)
print("Random Forest")
print(confusion_matrix(y_test, y_pred_rf))
print(classification_report(y_test, y_pred_rf))

print("\nImportant Model Evaluation Metrics:")
print(f"Accuracy: {accuracy_score(y_test, y_pred_rf):.4f}")
print(f"Precision: {precision_score(y_test, y_pred_rf):.4f}")
print(f"Recall: {recall_score(y_test, y_pred_rf):.4f}")
print(f"F1-Score: {f1_score(y_test, y_pred_rf):.4f}")


In [None]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform

# Assuming X_train_scaled, y_train_smote, X_test_scaled, y_test are already defined
# (These would come from your data loading and preprocessing steps)

# --- Hyperparameter Tuning for Random Forest ---

# Define the candidate values for RandomizedSearchCV
# A small set of discrete values per parameter keeps the search fast
param_grid_rf_conservative = {
    'n_estimators': [100, 150], # Only two values - the low and high from your original randint
    'max_features': ['sqrt', 0.8], # Focus on 'sqrt' (default-like) and a slightly higher fraction
    'max_depth': [15, 30], # Two mid-range values from your original 10-50 range
    'min_samples_split': [2, 10], # Default and a more regularized value
    'min_samples_leaf': [1, 5], # Default and a more regularized value
    'bootstrap': [True], # True is almost always preferred for Random Forests. Remove False for speed.
    'class_weight': ['balanced'] # Prioritize 'balanced' given your imbalanced data context. Remove None for speed.
}

# Initialize the RandomForestClassifier
rf_base = RandomForestClassifier(random_state=42)

# Initialize RandomizedSearchCV
# We target 'f1' as the scoring metric due to class imbalance
# n_iter: Number of parameter settings that are sampled. Increase for more exhaustive search.
# cv: Number of folds for cross-validation
# n_jobs: -1 means use all available processors
random_search_rf = RandomizedSearchCV(
    estimator=rf_base,
    param_distributions=param_grid_rf_conservative,
    n_iter=100, # Increased iterations for better exploration
    cv=5,
    scoring='f1',
    n_jobs=-1,
    verbose=2,
    random_state=42
)

# Fit RandomizedSearchCV to the SMOTE-treated training data
print("\n--- Starting Random Forest Hyperparameter Tuning (Randomized Search) ---")
random_search_rf.fit(X_train_scaled, y_train_smote)

print("\n--- Tuning Complete ---")
print("Best parameters found for Random Forest:")
print(random_search_rf.best_params_)
print(f"Best F1-Score on training data (cross-validated): {random_search_rf.best_score_:.4f}")

# Get the best Random Forest model
best_rf_model = random_search_rf.best_estimator_

# Evaluate the best model on the test set
print("\n--- Evaluating Best Tuned Random Forest Model on Test Set ---")
y_pred_rf_tuned = best_rf_model.predict(X_test_scaled)

print("Tuned Random Forest")
print(confusion_matrix(y_test, y_pred_rf_tuned))
print(classification_report(y_test, y_pred_rf_tuned))

print("\nImportant Model Evaluation Metrics (Tuned Random Forest):")
print(f"Accuracy: {accuracy_score(y_test, y_pred_rf_tuned):.4f}")
print(f"Precision: {precision_score(y_test, y_pred_rf_tuned):.4f}")
print(f"Recall: {recall_score(y_test, y_pred_rf_tuned):.4f}")
print(f"F1-Score: {f1_score(y_test, y_pred_rf_tuned):.4f}")


--- Starting Random Forest Hyperparameter Tuning (Randomized Search) ---
Fitting 5 folds for each of 100 candidates, totalling 500 fits


In [None]:
from sklearn.ensemble import AdaBoostClassifier
# AdaBoost Classifier
ab = AdaBoostClassifier(random_state=42)
ab.fit(X_train_scaled, y_train_smote)
y_pred_ab = ab.predict(X_test_scaled)

print("📌 AdaBoost Classifier")
print(confusion_matrix(y_test, y_pred_ab))
print(classification_report(y_test, y_pred_ab))

print("\nImportant Model Evaluation Metrics:")
print(f"Accuracy: {accuracy_score(y_test, y_pred_ab):.4f}")
print(f"Precision: {precision_score(y_test, y_pred_ab):.4f}")
print(f"Recall: {recall_score(y_test, y_pred_ab):.4f}")
print(f"F1-Score: {f1_score(y_test, y_pred_ab):.4f}")


PCA TRANSFORM


In [None]:
from sklearn.decomposition import PCA
pca = PCA(n_components=0.90, random_state=42)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)
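
Passing a float such as `0.90` as `n_components` tells PCA to keep the smallest number of components whose cumulative explained variance reaches that fraction. The sketch below, on synthetic correlated features (`X_demo` is made up, not the wine data), shows how one might inspect how many components were kept and how much variance each carries.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic correlated features (NOT the wine data): 3 latent factors, 8 columns.
rng = np.random.default_rng(42)
latent = rng.normal(size=(500, 3))
mixed = latent @ rng.normal(size=(3, 5)) + 0.1 * rng.normal(size=(500, 5))
X_demo = StandardScaler().fit_transform(np.hstack([latent, mixed]))

# Keep as many components as needed to explain at least 90% of the variance.
pca_demo = PCA(n_components=0.90)
X_demo_pca = pca_demo.fit_transform(X_demo)

cumulative = np.cumsum(pca_demo.explained_variance_ratio_)
print("components kept:", pca_demo.n_components_)
print("cumulative explained variance:", cumulative.round(3))
```

Running the same two print statements on the notebook's fitted `pca` object would show how many of the original features' worth of dimensions survive the 90% cut.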

In [None]:
from sklearn.neighbors import KNeighborsClassifier

# K-Nearest Neighbors with PCA transformed data
knc = KNeighborsClassifier()
knc.fit(X_train_pca, y_train_smote)  # fit on the PCA-transformed training data
y_pred_knc_pca = knc.predict(X_test_pca)  # predict on the PCA-transformed test data

print("\n K-Nearest Neighbors with PCA transformed data")
print("confusion matrix (0  1)")
print(confusion_matrix(y_test, y_pred_knc_pca))
print(classification_report(y_test, y_pred_knc_pca))

print("\nImportant Model Evaluation Metrics (K-Nearest Neighbors with PCA):")
print(f"Accuracy: {accuracy_score(y_test, y_pred_knc_pca):.4f}")
print(f"Precision: {precision_score(y_test, y_pred_knc_pca):.4f}")
print(f"Recall: {recall_score(y_test, y_pred_knc_pca):.4f}")
print(f"F1-Score: {f1_score(y_test, y_pred_knc_pca):.4f}")


In [None]:
# Train and evaluate KNN on the regular scaled dataset (without PCA)
knc_regular = KNeighborsClassifier()
knc_regular.fit(X_train_scaled, y_train_smote) # Use scaled data without PCA
y_pred_knc_regular = knc_regular.predict(X_test_scaled) # Use scaled test data without PCA

print("\n K-Nearest Neighbors on Regular Scaled Data")
print("confusion matrix (0  1)")
print(confusion_matrix(y_test, y_pred_knc_regular))
print(classification_report(y_test, y_pred_knc_regular))

print("\nImportant Model Evaluation Metrics (K-Nearest Neighbors on Regular Scaled Data):")
print(f"Accuracy: {accuracy_score(y_test, y_pred_knc_regular):.4f}")
print(f"Precision: {precision_score(y_test, y_pred_knc_regular):.4f}")
print(f"Recall: {recall_score(y_test, y_pred_knc_regular):.4f}")
print(f"F1-Score: {f1_score(y_test, y_pred_knc_regular):.4f}")

In [None]:
from sklearn.svm import SVC

# Support Vector Classifier with PCA transformed data
svc = SVC(random_state=42)
svc.fit(X_train_pca, y_train_smote)
y_pred_svc_pca = svc.predict(X_test_pca)

print("\nSupport Vector Classifier with PCA transformed data")
print("confusion matrix (0  1)")
print(confusion_matrix(y_test, y_pred_svc_pca))
print(classification_report(y_test, y_pred_svc_pca))

print("\nImportant Model Evaluation Metrics (Support Vector Classifier with PCA):")
print(f"Accuracy: {accuracy_score(y_test, y_pred_svc_pca):.4f}")
print(f"Precision: {precision_score(y_test, y_pred_svc_pca):.4f}")
print(f"Recall: {recall_score(y_test, y_pred_svc_pca):.4f}")
print(f"F1-Score: {f1_score(y_test, y_pred_svc_pca):.4f}")

In [None]:
# Support Vector Classifier on regular scaled data (without PCA)
svc_regular = SVC(random_state=42)
svc_regular.fit(X_train_scaled, y_train_smote) # Use scaled data without PCA
y_pred_svc_regular = svc_regular.predict(X_test_scaled) # Use scaled test data without PCA

print("\nSupport Vector Classifier on Regular Scaled Data")
print("confusion matrix (0  1)")
print(confusion_matrix(y_test, y_pred_svc_regular))
print(classification_report(y_test, y_pred_svc_regular))

print("\nImportant Model Evaluation Metrics (Support Vector Classifier on Regular Scaled Data):")
print(f"Accuracy: {accuracy_score(y_test, y_pred_svc_regular):.4f}")
print(f"Precision: {precision_score(y_test, y_pred_svc_regular):.4f}")
print(f"Recall: {recall_score(y_test, y_pred_svc_regular):.4f}")
print(f"F1-Score: {f1_score(y_test, y_pred_svc_regular):.4f}")

In [None]:
# apply gradient boost to data without pca and report metrics

from sklearn.ensemble import GradientBoostingClassifier

# Gradient Boosting Classifier without PCA
gb = GradientBoostingClassifier(random_state=42)
gb.fit(X_train_scaled, y_train_smote)
y_pred_gb = gb.predict(X_test_scaled)

print("\nGradient Boosting Classifier without PCA")
print("confusion matrix (0  1)")
print(confusion_matrix(y_test, y_pred_gb))
print(classification_report(y_test, y_pred_gb))

print("\nImportant Model Evaluation Metrics (Gradient Boosting without PCA):")
print(f"Accuracy: {accuracy_score(y_test, y_pred_gb):.4f}")
print(f"Precision: {precision_score(y_test, y_pred_gb):.4f}")
print(f"Recall: {recall_score(y_test, y_pred_gb):.4f}")
print(f"F1-Score: {f1_score(y_test, y_pred_gb):.4f}")


In [None]:
# Define model names
model_names = ['Decision Tree', 'Random Forest', 'AdaBoost',
               'K-Nearest Neighbors (PCA)', 'Support Vector Classifier (PCA)',
               'Gradient Boosting']

# Collect evaluation metrics
f1_scores = [
    f1_score(y_test, y_pred_dt),
    f1_score(y_test, y_pred_rf),
    f1_score(y_test, y_pred_ab),
    f1_score(y_test, y_pred_knc_pca),
    f1_score(y_test, y_pred_svc_pca),
    f1_score(y_test, y_pred_gb)
]

recall_scores = [
    recall_score(y_test, y_pred_dt),
    recall_score(y_test, y_pred_rf),
    recall_score(y_test, y_pred_ab),
    recall_score(y_test, y_pred_knc_pca),
    recall_score(y_test, y_pred_svc_pca),
    recall_score(y_test, y_pred_gb)
]

precision_scores = [
    precision_score(y_test, y_pred_dt),
    precision_score(y_test, y_pred_rf),
    precision_score(y_test, y_pred_ab),
    precision_score(y_test, y_pred_knc_pca),
    precision_score(y_test, y_pred_svc_pca),
    precision_score(y_test, y_pred_gb)
]

accuracy_scores = [
    accuracy_score(y_test, y_pred_dt),
    accuracy_score(y_test, y_pred_rf),
    accuracy_score(y_test, y_pred_ab),
    accuracy_score(y_test, y_pred_knc_pca),
    accuracy_score(y_test, y_pred_svc_pca),
    accuracy_score(y_test, y_pred_gb)
]

# Create a DataFrame with all metrics
results_df = pd.DataFrame({
    'Model': model_names,
    'Accuracy': accuracy_scores,
    'Precision': precision_scores,
    'Recall': recall_scores,
    'F1-Score': f1_scores
})

# Rank by F1-Score
results_df_ranked = results_df.sort_values(by='F1-Score', ascending=False).reset_index(drop=True)

# Display table in terminal
print("Model Performance Comparison (Ranked by F1-Score):")
print(results_df_ranked.to_markdown(index=False, floatfmt=".4f"))

# Melt the DataFrame for easier Seaborn plotting
melted_df = results_df_ranked.melt(id_vars='Model',
                                   value_vars=['Accuracy', 'Precision', 'Recall', 'F1-Score'],
                                   var_name='Metric',
                                   value_name='Score')

# Set up a 2x2 subplot grid for the metrics
fig, axes = plt.subplots(2, 2, figsize=(10, 10))
metrics = ['Accuracy', 'Precision', 'Recall', 'F1-Score']
palettes = ['inferno', 'cividis', 'magma', 'viridis']

# Loop through and create one barplot per metric
for ax, metric, palette in zip(axes.flat, metrics, palettes):
    sns.barplot(data=melted_df[melted_df['Metric'] == metric],
                y='Model', x='Score', hue='Model', palette=palette, legend=False, ax=ax)

    ax.set_title(f'{metric} Comparison')
    ax.set_xlim(0, 1)
    ax.set_xlabel(metric)
    ax.set_ylabel('')

plt.suptitle('Model Comparison Across Metrics (Ranked by F1-Score)', fontsize=14)
plt.tight_layout(rect=[0, 0.03, 1, 0.97])
plt.show()