## Wine Quality Classification
Two datasets are included, related to red and white vinho verde wine samples, from the north of Portugal. The goal is to model wine quality based on physicochemical tests.

Download the dataset from: https://www.kaggle.com/code/anatpeled/multi-class-classification-for-wine-quality


In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


In [None]:
path = 'winequalityN.csv'
wine_data = pd.read_csv(path)


In [None]:

# Display the first few rows
print("First few rows of the dataset:")
wine_data.head()

In [None]:
wine_data.describe()

In [None]:
wine_data.info()

In [None]:
wine_data.duplicated().sum()

In [None]:
wine_data['type'].loc[wine_data.duplicated()==1].value_counts()

### Handle Duplicate Data

In [None]:
# Check for duplicates
#TODO


In [None]:
# Check for missing values in the dataset
#TODO


In [None]:
# Drop rows with any missing values
#TODO
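The three checks above (duplicates, missing values, dropping incomplete rows) can be sketched as follows. This is a minimal illustration on a made-up `toy` frame that stands in for `wine_data`; the column values are invented. Whether identical rows should really be dropped is a judgment call, since two different wines could share the same measurements.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for wine_data (values are invented)
toy = pd.DataFrame({
    "type": ["white", "white", "red", "red", "white"],
    "alcohol": [9.5, 9.5, 10.1, np.nan, 11.2],
    "quality": [5, 5, 6, 6, 7],
})

# Count rows that are exact copies of an earlier row
n_duplicates = toy.duplicated().sum()

# Count missing values per column
missing_per_column = toy.isnull().sum()

# Drop exact duplicates first, then any row that still has a missing value
cleaned = toy.drop_duplicates().dropna().reset_index(drop=True)

print("Duplicate rows:", n_duplicates)
print(missing_per_column)
print("Shape after cleaning:", cleaned.shape)
```

On the real data, the same calls would be applied to `wine_data` instead of `toy`.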


### Exploratory data analysis


### Type

In [None]:
# View unique types in the dataset
#TODO

This is the only categorical feature in the dataset, except for the target. There are two types of wines - red wine and white wine.


In [None]:
#Setting Up the Plot
fig, ax = plt.subplots(figsize=(12,4))

#Display Options for Data Formatting
pd.options.display.float_format = '{:,.2f}'.format

#Preparing the Data for Plotting
bar_chart = wine_data.groupby(['type','quality'])['quality'].count().unstack('type')

#Normalization: (bar_chart.T / bar_chart.T.sum()).T calculates the percentage of each wine type
#within each quality level by dividing the counts by the total for each row (i.e., quality level).
bar_chart= (bar_chart.T/bar_chart.T.sum()).T

#Plotting the Stacked Bar Chart
ax = bar_chart.plot(kind='bar', stacked=True, color=['r','w'], edgecolor='black', ax=ax)

#Adding Percentage Labels
labels = []
for j in bar_chart.columns:
    for i in bar_chart.index:
        label = '{0:.2%}'.format(bar_chart.loc[i, j])
        labels.append(label)

#Positioning the Labels on the Plot
patches = ax.patches

for label, rect in zip(labels, patches):
    width = rect.get_width()
    if width > 0:
        x = rect.get_x()
        y = rect.get_y()
        height = rect.get_height()
        ax.text(x + width/2., y + height/2., label, ha='center', va='center', color='black')

#Customizing the Axis Labels and Legend
ax.set_xticklabels(labels=ax.get_xticklabels(), rotation=0)
ax.set_yticklabels(labels='')
ax.set_ylabel('% of records')
plt.legend(bbox_to_anchor = (1, 1.01), edgecolor='black')
plt.show()

A **pie** chart to show the distribution of red and white wines in the dataset. 

In [None]:
#Grouping and Counting Wine Types
data = wine_data.groupby('type')['quality'].count()

#Setting Up the Plot
fig, ax = plt.subplots(figsize=[10,6])

#Defining Labels
labels = ['red','white']

#Creating the Pie Chart
ax = plt.pie(x=data, autopct="%.1f%%", explode=[0.05]*2, labels=labels, colors=['darkred','white'],
             wedgeprops={"edgecolor":"black"},pctdistance=0.5)
plt.show()

A **violin plot** to show the distribution of wine quality ratings for both red and white wines.

In [None]:
#Setting Up the Figure
fig, ax = plt.subplots(figsize=(10,5))

#Defining Colors
colors = ['w', 'r']

#Setting the Color Palette
sns.set_palette(sns.color_palette(colors))

#Creating the Violin Plot
ax = sns.violinplot(data=wine_data, x='type', y='quality')

#Adding Labels
ax.set_ylabel('Quality', fontsize=14)

#Adjusting the Layout
plt.tight_layout()

The plot shows the distribution of quality ratings for both red and white wines:

- The white violin (for white wine) is on the left, and the red violin (for red wine) is on the right.
- The width of each violin shows where the quality ratings are concentrated. For example, if the white-wine violin is widest around a rating of 5 or 6, most white wines received those ratings. The red violin is read the same way: red wines are concentrated at the quality levels where it is widest.

### Understanding the Target (Wine Quality Ratings)

In [None]:
# View unique values in the quality column
#TODO


In [None]:
# Checking the distribution of wine quality ratings 
# to understand the frequency of each wine quality rating and observe if there is an imbalance.

#TODO

# value_counts() will display the count of each unique value in the quality column, 
# which helps you understand how common each rating is in the dataset.

This output indicates:

- **Imbalance:** Quality ratings of 5 and 6 are the most common, while ratings of 3 and 8 are rare.
- **Implication:** A classification model trained on this data may **struggle to accurately predict the minority classes**. Resampling methods or class weighting can improve performance on the under-represented classes.
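The imbalance check and the class-weighting idea can be sketched together. This is a minimal example on an invented `quality` series (not the real ratings); it uses `compute_class_weight` from scikit-learn with the `"balanced"` heuristic, which weights each class by `n_samples / (n_classes * class_count)`.

```python
import numpy as np
import pandas as pd
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical ratings standing in for wine_data['quality']
quality = pd.Series([5, 6, 6, 5, 7, 6, 3, 5, 6, 8], name="quality")

# Frequency of each rating, as counts and as shares of all rows
counts = quality.value_counts().sort_index()
shares = quality.value_counts(normalize=True).sort_index()

# "balanced" weights grow as a class becomes rarer
classes = np.unique(quality)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=quality.to_numpy())
class_weights = dict(zip(classes.tolist(), weights.tolist()))

print(counts)
print(shares.round(2))
print(class_weights)
```

Many scikit-learn classifiers accept the same heuristic directly through `class_weight='balanced'`, so the explicit dictionary is mainly useful for inspecting how strongly each class would be up-weighted.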

In [None]:
# Print the lowest and highest quality ratings
#TODO

print("Lowest wine quality rating:", lowest_quality)
print("Highest wine quality rating:", highest_quality)


### Visualizing the Distribution of the Target 
Let's use a plot to visually assess the distribution of wine quality ratings.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Plot the distribution of quality ratings
plt.figure(figsize=(10, 6))
sns.countplot(data=wine_data, x='quality', palette='viridis')
plt.title("Distribution of Wine Quality Ratings")
plt.xlabel("Quality")
plt.ylabel("Count")
plt.show()

# A count plot makes it easy to see if certain ratings are more common than others. 
# This visualization is particularly helpful for identifying any imbalances in the quality ratings, 
# which is useful for discussions on class imbalance in machine learning.

### Feature Exploration 

### Visualize Alcohol vs. Quality
What is the relationship between individual alcohol values and quality ratings?


In [None]:
# Calculate Correlation: Quantify the Relationship Between Alcohol and Quality
# Calculate the correlation between alcohol content and quality

#TODO

print("Correlation between Alcohol Content and Wine Quality:", correlation)
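One way to quantify the relationship is `Series.corr`, which computes the Pearson coefficient by default. The sketch below uses an invented `toy` frame; the column names `alcohol` and `quality` are assumed to match the dataset. Because quality is an ordinal rating, the Spearman rank correlation is shown as an alternative.

```python
import pandas as pd

# Hypothetical values standing in for the real alcohol and quality columns
toy = pd.DataFrame({
    "alcohol": [9.0, 9.5, 10.0, 11.0, 12.5, 13.0],
    "quality": [5, 5, 6, 6, 7, 8],
})

# Pearson correlation (linear relationship)
correlation = toy["alcohol"].corr(toy["quality"])

# Spearman correlation (monotonic relationship, based on ranks)
rank_correlation = toy["alcohol"].corr(toy["quality"], method="spearman")

print("Pearson:", round(correlation, 3))
print("Spearman:", round(rank_correlation, 3))
```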


In [None]:
# Convert the 'type' column from categorical to numerical, keeping both categories
#TODO
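A possible approach is one-hot encoding with `pd.get_dummies`, which creates one indicator column per category and so keeps both wine types. The frame below is a made-up stand-in for `wine_data`.

```python
import pandas as pd

# Hypothetical stand-in for wine_data
toy = pd.DataFrame({"type": ["white", "red", "white"], "quality": [6, 5, 7]})

# One indicator column per category; dtype=int gives 0/1 instead of booleans
encoded = pd.get_dummies(toy, columns=["type"], dtype=int)

print(encoded)
```

The two indicator columns always sum to 1, so they carry the same information. That is harmless for tree-based models and for the correlation matrix, but for unregularized linear models one of them is usually dropped (`drop_first=True`).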


In [None]:
#all correlation
#TODO
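One reading of this step is "the correlation of every feature with the target". Under that assumption, the `quality` column of the correlation matrix can be extracted and sorted. The data below is synthetic and generated only to make the example runnable.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the numeric wine features (not real measurements)
rng = np.random.default_rng(0)
alcohol = rng.uniform(8, 14, size=100)
toy = pd.DataFrame({
    "alcohol": alcohol,
    "density": 1.0 - 0.001 * alcohol + rng.normal(0, 0.0005, size=100),
    "quality": np.clip(np.round(0.5 * alcohol + rng.normal(0, 1, size=100)), 3, 9),
})

# Correlation of each feature with quality, strongest positive first
quality_corr = toy.corr()["quality"].drop("quality").sort_values(ascending=False)

print(quality_corr)
```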

In [None]:
# Calculate the correlation matrix
correlation_matrix = wine_data.corr()

# Plot the heatmap
plt.figure(figsize=(12, 10))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', linewidths=0.5, vmax=1, vmin=-1)
plt.title("Correlation Heatmap of Wine Quality Dataset")
plt.show()


**Practical Recommendation:** Start with all features, then rely on regularization or feature importances to identify the ones that matter.
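As a rough illustration of the feature-importance route, the sketch below fits a random forest on synthetic data from `make_classification` and ranks the features by `feature_importances_`. The feature names `f0` to `f5` are placeholders, not columns of the wine dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 6 features, of which 3 are informative
features, labels = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
feature_names = [f"f{i}" for i in range(features.shape[1])]

forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(features, labels)

# Impurity-based importances sum to 1 across all features
importances = pd.Series(forest.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)

print(importances)
```

Impurity-based importances can favor features with many distinct values, so they are best treated as a first indication rather than a final ranking.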

## Prepare the Data
- Separate features and target: The quality column is the target, and all other columns are features.
- Train-test split: Split the data into training and testing sets.
- Feature Scaling: Scale the features, as KNN and Logistic Regression are sensitive to feature magnitudes.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

#TODO
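The three preparation steps can be sketched as below. The data is synthetic; the 80/20 split, `random_state=42`, and the stratified split are assumptions rather than choices stated in this notebook. The variable names `X_train_scaled` and `X_test_scaled` follow the ones used in the grid-search cells.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the cleaned, fully numeric wine_data
features, labels = make_classification(
    n_samples=200, n_features=5, n_informative=3, n_redundant=0,
    n_classes=3, random_state=0
)
toy = pd.DataFrame(features, columns=[f"f{i}" for i in range(5)])
toy["quality"] = labels

# Separate features and target
X = toy.drop(columns="quality")
y = toy["quality"]

# Stratify so that each class keeps roughly the same share in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fit the scaler on the training set only, then reuse it for the test set
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```

Fitting the scaler on the training data alone keeps test-set statistics out of the model. Note that `stratify` requires every class to have at least two samples.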


### Train and Evaluate Models


In [None]:
import warnings

warnings.filterwarnings('ignore')


First, we define a function that evaluates every model in the same way:

In [None]:
#a function for evaluation
#TODO
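A minimal sketch of such a helper is shown below. The function name `evaluate` and its return format are hypothetical; it fits a model, predicts on the test set, and collects the three metrics imported at the top of the notebook. The iris data bundled with scikit-learn is used only so the example runs on its own.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split


def evaluate(model, X_train, y_train, X_test, y_test):
    """Fit the model, predict on the test set, and return common metrics."""
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return {
        "accuracy": accuracy_score(y_test, y_pred),
        "confusion_matrix": confusion_matrix(y_test, y_pred),
        # zero_division=0 avoids warnings for classes that are never predicted
        "report": classification_report(y_test, y_pred, zero_division=0),
    }


# Stand-in data; in the notebook this would be the scaled wine features
X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

results = evaluate(LogisticRegression(max_iter=1000), X_tr, y_tr, X_te, y_te)
print("Accuracy:", results["accuracy"])
print(results["report"])
```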


**Logistic Regression**

In [None]:
#TODO


**K-Nearest Neighbors**

In [None]:
#TODO


**Random Forest**

In [None]:
#TODO
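The three baseline models can be trained and compared in one loop, sketched here on synthetic data. The hyperparameters (`max_iter=1000`, `n_neighbors=5`, `n_estimators=100`) are placeholder assumptions, and macro-averaged F1 is reported next to accuracy because it weights every class equally, which matters when classes are imbalanced.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic three-class data standing in for the wine features
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_redundant=0,
    n_classes=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)
scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)
X_te_scaled = scaler.transform(X_te)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

scores = {}
for name, model in models.items():
    model.fit(X_tr_scaled, y_tr)
    y_pred = model.predict(X_te_scaled)
    scores[name] = {
        "accuracy": accuracy_score(y_te, y_pred),
        "macro_f1": f1_score(y_te, y_pred, average="macro"),
    }

for name, metrics in scores.items():
    print(f"{name}: accuracy={metrics['accuracy']:.3f}, macro F1={metrics['macro_f1']:.3f}")
```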


## Grid search

https://scikit-learn.org/0.15/modules/grid_search.html

In [None]:
from sklearn.model_selection import GridSearchCV

**Define the Hyperparameter Grids**

Set up the parameter grids for each model

In [None]:
# Logistic Regression parameter grid
param_grid_log_reg = [
    {
        'penalty': ['l1', 'l2'],
        'C': [0.01, 0.1, 1, 10, 100],
        'solver': ['liblinear'],  # Compatible with 'l1' and 'l2' penalties
        'max_iter': [1000, 2000]
    },
    {
        'penalty': ['l2'],
        'C': [0.01, 0.1, 1, 10, 100],
        'solver': ['lbfgs', 'newton-cg', 'sag'],  # Compatible only with 'l2'
        'max_iter': [500, 1000]
    },
    {
        'penalty': ['elasticnet'],
        'C': [0.01, 0.1, 1, 10, 100],
        'solver': ['saga'],  # 'saga' is required for 'elasticnet'
        'l1_ratio': [0.1, 0.5, 0.9],  # Only include l1_ratio for elasticnet
        'max_iter': [500, 1000]
    }
]


# K-Nearest Neighbors parameter grid
param_grid_knn = {
    'n_neighbors': [3, 5, 7, 9],
    'weights': ['uniform', 'distance'],
    'p': [1, 2]  # 1: Manhattan distance, 2: Euclidean distance
}

# Random Forest parameter grid
param_grid_rf = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'class_weight': ['balanced', 'balanced_subsample']
}
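Before running a search it helps to know how many candidates each grid contains, since every candidate is fitted once per cross-validation fold. The sketch below repeats the grids from this section and counts the combinations with `ParameterGrid` from scikit-learn; the fold count of 5 is taken from the `cv=5` setting used later.

```python
from sklearn.model_selection import ParameterGrid

c_values = [0.01, 0.1, 1, 10, 100]
param_grid_log_reg = [
    {"penalty": ["l1", "l2"], "C": c_values, "solver": ["liblinear"], "max_iter": [1000, 2000]},
    {"penalty": ["l2"], "C": c_values, "solver": ["lbfgs", "newton-cg", "sag"], "max_iter": [500, 1000]},
    {"penalty": ["elasticnet"], "C": c_values, "solver": ["saga"],
     "l1_ratio": [0.1, 0.5, 0.9], "max_iter": [500, 1000]},
]
param_grid_knn = {
    "n_neighbors": [3, 5, 7, 9],
    "weights": ["uniform", "distance"],
    "p": [1, 2],
}
param_grid_rf = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 10, 20, 30],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "class_weight": ["balanced", "balanced_subsample"],
}

n_folds = 5
grid_sizes = {
    "Logistic Regression": len(ParameterGrid(param_grid_log_reg)),  # 20 + 30 + 30 = 80
    "KNN": len(ParameterGrid(param_grid_knn)),                      # 4 * 2 * 2 = 16
    "Random Forest": len(ParameterGrid(param_grid_rf)),             # 3 * 4 * 3 * 3 * 2 = 216
}

for name, size in grid_sizes.items():
    print(f"{name}: {size} candidates, {size * n_folds} fits with {n_folds}-fold CV")
```

A list of dictionaries is searched as separate sub-grids whose sizes add up, which is why the logistic regression grid stays small despite covering three solver families.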


**Set up GridSearchCV for Each Model**

Run the grid search for each model with cross-validation.

In [None]:
#Logistic Regression

log_reg = LogisticRegression(random_state=42)
grid_search_log_reg = GridSearchCV(log_reg, param_grid_log_reg, cv=5, scoring='accuracy', n_jobs=-1, verbose=2)
grid_search_log_reg.fit(X_train_scaled, y_train)

print("Best Parameters for Logistic Regression:", grid_search_log_reg.best_params_)
print("Best Cross-Validation Accuracy for Logistic Regression:", grid_search_log_reg.best_score_)


In [None]:
#K-Nearest Neighbors
knn = KNeighborsClassifier()
grid_search_knn = GridSearchCV(knn, param_grid_knn, cv=5, scoring='accuracy', n_jobs=-1, verbose=2)
grid_search_knn.fit(X_train_scaled, y_train)

print("Best Parameters for KNN:", grid_search_knn.best_params_)
print("Best Cross-Validation Accuracy for KNN:", grid_search_knn.best_score_)


In [None]:
#Random Forest
rf = RandomForestClassifier(random_state=42)
grid_search_rf = GridSearchCV(rf, param_grid_rf, cv=5, scoring='accuracy', n_jobs=-1, verbose=2)
grid_search_rf.fit(X_train_scaled, y_train)

print("Best Parameters for Random Forest:", grid_search_rf.best_params_)
print("Best Cross-Validation Accuracy for Random Forest:", grid_search_rf.best_score_)


### Evaluate the Best Models on the Test Set

In [None]:
# Best models
#TODO

# Predictions
#TODO

# Evaluation
print("Logistic Regression Test Accuracy:", accuracy_score(y_test, y_pred_log_reg))
print("KNN Test Accuracy:", accuracy_score(y_test, y_pred_knn))
print("Random Forest Test Accuracy:", accuracy_score(y_test, y_pred_rf))

# Optional: Detailed reports
print("\nLogistic Regression Classification Report:\n", classification_report(y_test, y_pred_log_reg))
print("\nKNN Classification Report:\n", classification_report(y_test, y_pred_knn))
print("\nRandom Forest Classification Report:\n", classification_report(y_test, y_pred_rf))
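For the "best models" and "predictions" steps, a fitted `GridSearchCV` exposes the winning model as `best_estimator_`. With the default `refit=True` it has already been retrained on the full training set. The sketch below shows this for a single small KNN search on the iris data; the grid and the data are placeholders.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data; in the notebook these would be the scaled wine features
X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

grid_search_demo = GridSearchCV(
    KNeighborsClassifier(), {"n_neighbors": [3, 5]}, cv=3, scoring="accuracy"
)
grid_search_demo.fit(X_tr, y_tr)

# Best model, already refit on the whole training set
best_knn_demo = grid_search_demo.best_estimator_

# Predictions on the held-out test set
y_pred_demo = best_knn_demo.predict(X_te)

print("Best parameters:", grid_search_demo.best_params_)
print("Test accuracy:", accuracy_score(y_te, y_pred_demo))
```

Applied to this notebook, the same pattern would use `grid_search_log_reg`, `grid_search_knn`, and `grid_search_rf` together with `X_test_scaled`.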


### Confusion Matrices for Each Model

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

def plot_confusion_matrix(model_name, y_true, y_pred):
    # Use the labels actually present so the tick labels always match the matrix shape
    labels = sorted(set(y_true) | set(y_pred))
    cm = confusion_matrix(y_true, y_pred, labels=labels)
    plt.figure(figsize=(8, 6))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=labels, yticklabels=labels)
    plt.xlabel('Predicted')
    plt.ylabel('True')
    plt.title(f"Confusion Matrix for {model_name}")
    plt.show()

plot_confusion_matrix("Logistic Regression", y_test, y_pred_log_reg)
plot_confusion_matrix("KNN", y_test, y_pred_knn)
plot_confusion_matrix("Random Forest", y_test, y_pred_rf)


## SMOTE technique
To apply SMOTE (Synthetic Minority Over-sampling Technique) to your dataset, you can use the SMOTE class from the imblearn library, which is part of the imbalanced-learn package. This technique will help by oversampling the minority classes in the training set, creating synthetic samples to balance the class distribution.
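To make "creating synthetic samples" concrete, here is a small NumPy illustration of the interpolation idea behind SMOTE. It is not the imbalanced-learn implementation: the function `smote_like_samples` is hypothetical and handles a single minority class only. Each synthetic point lies on the line segment between a minority sample and one of its `k` nearest minority neighbours.

```python
import numpy as np


def smote_like_samples(X_minority, n_new, k=3, seed=0):
    """Create n_new points by interpolating between minority samples and their neighbours."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)

    # Pairwise Euclidean distances within the minority class
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour

    # Indices of the k nearest neighbours of every minority sample
    neighbours = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, X_minority.shape[1]))
    for row in range(n_new):
        i = rng.integers(len(X_minority))   # random minority sample
        j = neighbours[i, rng.integers(k)]  # one of its k nearest neighbours
        gap = rng.random()                  # position along the segment, in [0, 1)
        synthetic[row] = X_minority[i] + gap * (X_minority[j] - X_minority[i])
    return synthetic


# Four made-up minority samples with two features each
minority = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]])
new_points = smote_like_samples(minority, n_new=6, k=3)
print(new_points)
```

Because every new point is a convex combination of two existing minority samples, the synthetic data stays inside the region already occupied by that class. This is also why `k_neighbors` must be smaller than the number of samples in the rarest class.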

In [None]:
# pip install -U scikit-learn
# pip install -U imbalanced-learn






In [None]:
from imblearn.over_sampling import SMOTE

# Apply SMOTE to the training data
smote = SMOTE(random_state=21, k_neighbors=3)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)

# Scale the resampled training data and the original test data
scaler = StandardScaler()
X_train_resampled_scaled = scaler.fit_transform(X_train_resampled)
X_test_scaled = scaler.transform(X_test)

# Initialize parameter grids for each model
param_grid_log_reg = {
    'penalty': ['l1', 'l2', 'elasticnet'],
    'C': [0.01, 0.1, 1, 10, 100],
    'solver': ['saga'],
    'max_iter': [1000],
    'tol': [0.001, 0.01],
    'l1_ratio': [0.5]  # Only relevant if using 'elasticnet' penalty
}

param_grid_knn = {
    'n_neighbors': [3, 5, 7, 9],
    'weights': ['uniform', 'distance'],
    'p': [1, 2]
}

param_grid_rf = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'class_weight': ['balanced']
}

# Define a function for grid search and model evaluation
def evaluate_model(name, model, param_grid, X_train, y_train, X_test, y_test):
    grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5, scoring='accuracy', n_jobs=-1, verbose=2)
    grid_search.fit(X_train, y_train)
    
    best_model = grid_search.best_estimator_
    y_pred = best_model.predict(X_test)
    
    print(f"\n{name} - Best Parameters: {grid_search.best_params_}")
    print(f"Mean Cross-Validation Accuracy: {grid_search.best_score_}")
    print(f"{name} Model Performance on Test Set:")
    print("Accuracy:", accuracy_score(y_test, y_pred))
    print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
    print("Classification Report:\n", classification_report(y_test, y_pred, zero_division=0))
    
    return best_model

# Logistic Regression
log_reg = LogisticRegression(random_state=42)
best_log_reg = evaluate_model("Logistic Regression", log_reg, param_grid_log_reg, X_train_resampled_scaled, y_train_resampled, X_test_scaled, y_test)

# K-Nearest Neighbors
knn = KNeighborsClassifier()
best_knn = evaluate_model("KNN", knn, param_grid_knn, X_train_resampled_scaled, y_train_resampled, X_test_scaled, y_test)

# Random Forest
rf = RandomForestClassifier(random_state=42)
best_rf = evaluate_model("Random Forest", rf, param_grid_rf, X_train_resampled_scaled, y_train_resampled, X_test_scaled, y_test)