# 🍷 Wine Classification using Machine Learning

## 🎯 Objective

- Train multiple machine learning models on the Wine dataset
- Evaluate models using metrics: Accuracy, Precision, Recall, and F1-score
- Optimize model performance using GridSearchCV and RandomizedSearchCV
- Select the best-performing model based on evaluation results

In [1]:
# Data handling
import pandas as pd
import numpy as np

# Visualization (optional but useful)
import matplotlib.pyplot as plt
import seaborn as sns

# ML models
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

# Evaluation metrics
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report

# Model selection and preprocessing
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# Dataset
from sklearn.datasets import load_wine


In [2]:
wine = load_wine()
df = pd.DataFrame(data=wine.data, columns=wine.feature_names)
df['target'] = wine.target
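
As an alternative to assembling the DataFrame by hand, `load_wine` can return pandas objects directly through its `as_frame` argument. The sketch below assumes a scikit-learn version that supports `as_frame`; the names `wine_bunch`, `wine_frame`, and `class_names` are illustrative and do not appear elsewhere in this notebook.

```python
from sklearn.datasets import load_wine

# as_frame=True returns pandas objects; .frame holds the 13 features plus 'target'
wine_bunch = load_wine(as_frame=True)
wine_frame = wine_bunch.frame

# Map the integer class codes to the dataset's own class names
class_names = dict(enumerate(wine_bunch.target_names))

print(wine_frame.shape)
print(class_names)
```

The resulting frame should have the same 178 rows and 14 columns as `df` above.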

In [3]:
df.head()

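|   | alcohol | malic_acid | ash | alcalinity_of_ash | magnesium | total_phenols | flavanoids | nonflavanoid_phenols | proanthocyanins | color_intensity | hue | od280/od315_of_diluted_wines | proline | target |
|---|---------|------------|-----|-------------------|-----------|---------------|------------|----------------------|-----------------|-----------------|-----|------------------------------|---------|--------|
| 0 | 14.23 | 1.71 | 2.43 | 15.6 | 127.0 | 2.8 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065.0 | 0 |
| 1 | 13.2 | 1.78 | 2.14 | 11.2 | 100.0 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.4 | 1050.0 | 0 |
| 2 | 13.16 | 2.36 | 2.67 | 18.6 | 101.0 | 2.8 | 3.24 | 0.3 | 2.81 | 5.68 | 1.03 | 3.17 | 1185.0 | 0 |
| 3 | 14.37 | 1.95 | 2.5 | 16.8 | 113.0 | 3.85 | 3.49 | 0.24 | 2.18 | 7.8 | 0.86 | 3.45 | 1480.0 | 0 |
| 4 | 13.24 | 2.59 | 2.87 | 21.0 | 118.0 | 2.8 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735.0 | 0 |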


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 178 entries, 0 to 177
Data columns (total 14 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   alcohol                       178 non-null    float64
 1   malic_acid                    178 non-null    float64
 2   ash                           178 non-null    float64
 3   alcalinity_of_ash             178 non-null    float64
 4   magnesium                     178 non-null    float64
 5   total_phenols                 178 non-null    float64
 6   flavanoids                    178 non-null    float64
 7   nonflavanoid_phenols          178 non-null    float64
 8   proanthocyanins               178 non-null    float64
 9   color_intensity               178 non-null    float64
 10  hue                           178 non-null    float64
 11  od280/od315_of_diluted_wines  178 non-null    float64
 12  proline                       178 non-null    float64
 13  target

In [5]:
print("Dataset Shape:", df.shape)
print("Target distribution:\n", df['target'].value_counts())

Dataset Shape: (178, 14)
Target distribution:
 target
1    71
0    59
2    48
Name: count, dtype: int64


## Step 3: Feature/Target Split

In [6]:
X = df.drop('target', axis=1)
y = df['target']

## Step 4: Train-Test Split


In [7]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
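
Passing `stratify=y` asks `train_test_split` to keep the class proportions roughly equal in both splits, which matters here because the three classes are uneven (71, 59, and 48 samples). The following sketch checks this on a fresh copy of the data; the `_demo`, `_tr`, and `_te` names are illustrative, chosen so the notebook's own variables are left untouched.

```python
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

X_demo, y_demo = load_wine(return_X_y=True, as_frame=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Class proportions in the full data, the training split, and the test split
proportions = pd.DataFrame({
    "full": y_demo.value_counts(normalize=True),
    "train": y_tr.value_counts(normalize=True),
    "test": y_te.value_counts(normalize=True),
}).sort_index()

print(proportions.round(3))
```

If the three columns are close to each other, the stratification worked as intended.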

In [8]:
# Scale for SVM & Logistic Regression
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
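
The `Pipeline` class imported at the top is not used in this notebook, but it offers a way to bundle the scaler with a model so that the scaler is always fitted on training data only. This is a minimal sketch of that idea, not the approach the notebook follows; `svm_pipeline` and the step names `"scaler"` and `"svm"` are assumptions made for illustration.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_demo, y_demo = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# fit() scales the training data and trains the SVM;
# score() reuses the fitted scaler on the test data before predicting
svm_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("svm", SVC()),
])
svm_pipeline.fit(X_tr, y_tr)
pipeline_accuracy = svm_pipeline.score(X_te, y_te)

print(f"Pipeline test accuracy: {pipeline_accuracy:.4f}")
```

A pipeline like this can also be passed to `GridSearchCV` or `cross_val_score`, in which case the scaler is refitted inside every fold.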

## Step 5: Train Models

In [9]:
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(),
    "SVM": SVC()
}
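
`RandomForestClassifier()` is created here without a `random_state`, so its scores can change from one run to the next. The sketch below illustrates how fixing the seed makes a forest repeatable; the seed value `0`, the smaller `n_estimators=50`, and the names `rf_a` and `rf_b` are arbitrary choices for the example.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier

X_demo, y_demo = load_wine(return_X_y=True)

# Two forests built with the same seed grow the same trees
rf_a = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_demo, y_demo)
rf_b = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_demo, y_demo)

same_probabilities = np.array_equal(
    rf_a.predict_proba(X_demo), rf_b.predict_proba(X_demo)
)
print("Identical probabilities:", same_probabilities)
```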

In [10]:
results={}

In [11]:
for name, model in models.items():
    # SVM and Logistic Regression are sensitive to feature scale; Random Forest is not
    if name in ["SVM", "Logistic Regression"]:
        model.fit(X_train_scaled, y_train)
        y_pred = model.predict(X_test_scaled)
    else:
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)

    results[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred, average='weighted'),
        "recall": recall_score(y_test, y_pred, average='weighted'),
        "f1": f1_score(y_test, y_pred, average='weighted')
    }
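
The metrics above use `average='weighted'`, which computes the score for each class and then averages them with weights proportional to the number of true samples per class. The following sketch reproduces that calculation by hand for precision. The labels are made-up toy data, not wine predictions.

```python
import numpy as np
from sklearn.metrics import precision_score

# Toy labels (hypothetical, for illustration only)
y_true_toy = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2, 2])
y_pred_toy = np.array([0, 0, 1, 1, 1, 2, 2, 2, 0, 2])

# Per-class precision, then a mean weighted by each class's support
per_class = precision_score(y_true_toy, y_pred_toy, average=None)
support = np.bincount(y_true_toy)
manual_weighted = np.sum(per_class * support) / support.sum()

sklearn_weighted = precision_score(y_true_toy, y_pred_toy, average="weighted")
print(f"manual: {manual_weighted:.4f}, sklearn: {sklearn_weighted:.4f}")
```

The two values should agree. Recall and F1 follow the same weighting scheme.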


## Step 6: Print Model Performance

In [12]:
print("\n📊 Model Performance Summary:")
for name, metrics in results.items():
    print(f"\n{name}:")
    for metric, value in metrics.items():
        print(f"  {metric.capitalize()}: {value:.4f}")


📊 Model Performance Summary:

Logistic Regression:
  Accuracy: 0.9722
  Precision: 0.9741
  Recall: 0.9722
  F1: 0.9720

Random Forest:
  Accuracy: 1.0000
  Precision: 1.0000
  Recall: 1.0000
  F1: 1.0000

SVM:
  Accuracy: 0.9722
  Precision: 0.9741
  Recall: 0.9722
  F1: 0.9720
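
With only 36 test samples, each prediction is worth about 0.028 accuracy, so 0.9722 corresponds to a single misclassified wine (35/36). Differences this small may not be meaningful. One way to get a steadier comparison is cross-validation over the whole dataset, sketched below. This is an additional check rather than part of the notebook's workflow, and the `candidates` dictionary, the fixed seeds, and the use of `make_pipeline` are assumptions for the example.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_demo, y_demo = load_wine(return_X_y=True)

# Scale-sensitive models get a scaler that is refitted inside every fold
candidates = {
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "Random Forest": RandomForestClassifier(random_state=42),
    "SVM": make_pipeline(StandardScaler(), SVC()),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = {
    name: cross_val_score(model, X_demo, y_demo, cv=cv, scoring="f1_weighted")
    for name, model in candidates.items()
}

for name, scores in cv_scores.items():
    print(f"{name}: {scores.mean():.4f} +/- {scores.std():.4f}")
```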


## Step 7: Hyperparameter Tuning

In [13]:
# GridSearchCV on RandomForest
param_grid = {
    'n_estimators': [50, 100, 150],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5]
}

grid_search = GridSearchCV(RandomForestClassifier(), param_grid, cv=5, scoring='f1_weighted')
grid_search.fit(X_train, y_train)
print("\n✅ Best parameters (GridSearchCV):", grid_search.best_params_)


✅ Best parameters (GridSearchCV): {'max_depth': 10, 'min_samples_split': 2, 'n_estimators': 150}
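
`GridSearchCV` tries every combination in `param_grid`, so the cost grows multiplicatively with each added value. The sketch below counts the combinations for the grid used above with scikit-learn's `ParameterGrid`; `param_grid_demo` is a copy made for illustration.

```python
from sklearn.model_selection import ParameterGrid

param_grid_demo = {
    "n_estimators": [50, 100, 150],
    "max_depth": [None, 10, 20],
    "min_samples_split": [2, 5],
}

# 3 * 3 * 2 parameter combinations, each fitted once per CV fold
n_candidates = len(ParameterGrid(param_grid_demo))
n_folds = 5
print(f"{n_candidates} candidates x {n_folds} folds = {n_candidates * n_folds} fits")
```

With the default `refit=True`, one more fit on the full training set follows the search.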


In [17]:
# RandomizedSearchCV on RandomForest
from scipy.stats import randint
param_dist = {
    'n_estimators': randint(50, 200),
    'max_depth': randint(5, 30),
    'min_samples_split': randint(2, 10)
}

random_search = RandomizedSearchCV(RandomForestClassifier(), param_distributions=param_dist, n_iter=10, cv=5, random_state=42, scoring='f1_weighted')
random_search.fit(X_train, y_train)
print("✅ Best parameters (RandomizedSearchCV):", random_search.best_params_)


✅ Best parameters (RandomizedSearchCV): {'max_depth': 19, 'min_samples_split': 4, 'n_estimators': 121}
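
Unlike the grid search, `RandomizedSearchCV` draws a fixed number of settings (`n_iter=10` here) from the given distributions. Note that `scipy.stats.randint(low, high)` excludes `high`, so `randint(50, 200)` yields values from 50 to 199. The sketch below uses `ParameterSampler` to show what such draws look like; `param_dist_demo` and `sampled` are illustrative names.

```python
from scipy.stats import randint
from sklearn.model_selection import ParameterSampler

param_dist_demo = {
    "n_estimators": randint(50, 200),
    "max_depth": randint(5, 30),
    "min_samples_split": randint(2, 10),
}

# Draw 10 candidate settings, matching n_iter=10 in the search above
sampled = list(ParameterSampler(param_dist_demo, n_iter=10, random_state=42))

for params in sampled[:3]:
    print(params)
```

Ten candidates over five folds give 50 fits, compared with 90 for the grid search.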


## Step 8: Final Model Evaluation

In [18]:
best_model = random_search.best_estimator_
y_pred_final = best_model.predict(X_test)

print("\n🎯 Tuned Random Forest Performance:")
print(classification_report(y_test, y_pred_final))


🎯 Tuned Random Forest Performance:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        12
           1       1.00      1.00      1.00        14
           2       1.00      1.00      1.00        10

    accuracy                           1.00        36
   macro avg       1.00      1.00      1.00        36
weighted avg       1.00      1.00      1.00        36
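
A confusion matrix complements the classification report by showing which classes get mistaken for which. The sketch below builds one for a stand-in forest with a fixed seed, since the tuned estimator from the search is not reproduced here. In the notebook itself, `confusion_matrix(y_test, y_pred_final)` would be the equivalent call.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X_demo, y_demo = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Stand-in model (assumption): a default-sized forest with a fixed seed
forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_te, forest.predict(X_te))
print(cm)
```

Off-diagonal entries count misclassifications, and each row sums to that class's support in the report.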



## 🧪 Model Comparison Summary

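| Model               | Accuracy | Precision | Recall | F1-score |
|---------------------|----------|-----------|--------|----------|
| Logistic Regression | 0.97     | 0.97      | 0.97   | 0.97     |
| Random Forest       | 1.00     | 1.00      | 1.00   | 1.00     |
| SVM                 | 0.97     | 0.97      | 0.97   | 0.97     |
| Tuned RF (RandomCV) | 1.00     | 1.00      | 1.00   | 1.00     |

Precision, recall, and F1-score are weighted averages over the three classes, measured on the 36-sample test set.

✅ **Best Model:** Random Forest. Both the default forest and the one tuned with `RandomizedSearchCV` classify all 36 test samples correctly, so tuning shows no measurable gain on this split.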
