# Wine Cultivar Classification Using Machine Learning

This notebook analyzes the sklearn wine dataset to classify wines into three cultivars using three machine learning models:
- Logistic Regression (LR)
- Decision Tree (DT)
- Support Vector Machine (SVM)

## About the Dataset
The wine dataset is a classic multiclass classification dataset from the UCI Machine Learning Repository. It contains the results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars.

### What are Cultivars?
**Cultivars** (cultivated varieties) are plant varieties that have been produced in cultivation by selective breeding. In the context of wine:
- The three classes in this dataset are commonly reported to correspond to Barolo, Grignolino, and Barbera wines from Italy; the dataset itself labels them only as `class_0`, `class_1`, and `class_2`
- Different cultivars have distinct chemical compositions that influence:
  - **Flavors**: From fruity to earthy notes
  - **Aromas**: Floral, spicy, or woody scents
  - **Overall wine profiles**: Body, tannin levels, acidity, and color
- The chemical analysis of wines can help identify their cultivar origin, which is crucial for quality control and authenticity verification

### Dataset Features
The dataset contains 13 features representing the chemical properties of wines:
1. Alcohol
2. Malic acid
3. Ash
4. Alcalinity of ash
5. Magnesium
6. Total phenols
7. Flavanoids
8. Nonflavanoid phenols
9. Proanthocyanins
10. Color intensity
11. Hue
12. OD280/OD315 of diluted wines
13. Proline

Reference: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_wine.html

## 1. Import Required Libraries

In [None]:
# Import necessary libraries
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.metrics import classification_report
import pandas as pd
import numpy as np

## 2. Load the Wine Dataset

In [None]:
# Load the wine dataset
wine_data = load_wine()

# Display basic information about the dataset
print("Wine Dataset Information:")
print("="*50)
print(f"Number of samples: {wine_data.data.shape[0]}")
print(f"Number of features: {wine_data.data.shape[1]}")
print(f"Number of classes (cultivars): {len(wine_data.target_names)}")
print(f"\nClass names: {wine_data.target_names}")
print(f"\nFeature names: {wine_data.feature_names}")

## 3. Explore the Data

In [None]:
# Create a DataFrame for better visualization
df = pd.DataFrame(wine_data.data, columns=wine_data.feature_names)
df['cultivar'] = wine_data.target

print("\nFirst few rows of the dataset:")
print(df.head())

print("\nDataset statistics:")
print(df.describe())

print("\nClass distribution:")
print(df['cultivar'].value_counts().sort_index())

## 4. Prepare Data for Training

In [None]:
# Separate features and target
X = wine_data.data
y = wine_data.target

# Split the data into training and testing sets (80-20 split)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training set size: {X_train.shape[0]} samples")
print(f"Testing set size: {X_test.shape[0]} samples")

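# Scale features for the models that are sensitive to feature scale (Logistic Regression and SVM).
# The scaler is fit on the training set only, so no test-set statistics leak into training.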
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("\nFeatures have been scaled for optimal model performance.")
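The manual fit/transform steps above can also be bundled with the model. The following is a minimal sketch, not part of the original notebook, that assumes `make_pipeline` and `cross_val_score` from scikit-learn; the names `svm_pipeline`, `pipeline_accuracy`, and `cv_scores` are illustrative. It shows how a pipeline refits the scaler on each training portion, which keeps scaling leak-free during cross-validation as well.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# The pipeline fits the scaler only on the data passed to fit(),
# then applies the same transformation before predicting.
svm_pipeline = make_pipeline(StandardScaler(), SVC(kernel="rbf", random_state=42))
svm_pipeline.fit(X_train, y_train)
pipeline_accuracy = svm_pipeline.score(X_test, y_test)
print(f"Pipeline test accuracy: {pipeline_accuracy:.4f}")

# 5-fold cross-validation on the training set; the scaler is refit inside each fold.
cv_scores = cross_val_score(svm_pipeline, X_train, y_train, cv=5)
print(f"CV accuracy: {cv_scores.mean():.4f} +/- {cv_scores.std():.4f}")
```

With only 178 samples in total, the cross-validated mean is likely a steadier estimate than a single 36-sample test split.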

## 5. Model 1: Logistic Regression (LR)

In [None]:
# Create and train Logistic Regression model
lr_model = LogisticRegression(max_iter=10000, random_state=42)
lr_model.fit(X_train_scaled, y_train)

# Make predictions
lr_predictions = lr_model.predict(X_test_scaled)

# Display classification report
print("\n" + "="*70)
print("LOGISTIC REGRESSION - Classification Report")
print("="*70)
print(classification_report(y_test, lr_predictions, target_names=wine_data.target_names))

# Calculate and display accuracy
lr_accuracy = lr_model.score(X_test_scaled, y_test)
print(f"Overall Accuracy: {lr_accuracy:.4f}")
print("="*70)
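Because the features are standardized, the fitted coefficients of a logistic regression can be compared across features. The sketch below retrains the same model in isolation and tabulates its coefficients per cultivar; `coef_df` and `top_features` are illustrative names, and reading coefficient magnitude as importance is a rough heuristic rather than a formal test.

```python
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

wine = load_wine()
X_train, _, y_train, _ = train_test_split(
    wine.data, wine.target, test_size=0.2, random_state=42, stratify=wine.target
)
X_train_scaled = StandardScaler().fit_transform(X_train)

lr = LogisticRegression(max_iter=10000, random_state=42)
lr.fit(X_train_scaled, y_train)

# coef_ has one row per class and one column per feature; transpose for readability.
coef_df = pd.DataFrame(
    lr.coef_, index=wine.target_names, columns=wine.feature_names
).T
print(coef_df.round(2))

# Feature with the largest absolute coefficient for each cultivar
top_features = coef_df.abs().idxmax()
print(top_features)
```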

## 6. Model 2: Decision Tree (DT)

In [None]:
# Create and train Decision Tree model
dt_model = DecisionTreeClassifier(random_state=42)
dt_model.fit(X_train, y_train)

# Make predictions
dt_predictions = dt_model.predict(X_test)

# Display classification report
print("\n" + "="*70)
print("DECISION TREE - Classification Report")
print("="*70)
print(classification_report(y_test, dt_predictions, target_names=wine_data.target_names))

# Calculate and display accuracy
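# Note: Decision Trees split on per-feature thresholds, so they don't require
# feature scaling; the unscaled data is used for both training and evaluation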
dt_accuracy = dt_model.score(X_test, y_test)
print(f"Overall Accuracy: {dt_accuracy:.4f}")
print("="*70)
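A fitted decision tree can be inspected directly. The sketch below, which assumes scikit-learn's `export_text` helper and retrains the same tree in isolation, prints the learned rules and the impurity-based feature importances; `rules` and `ranked` are illustrative names.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.2, random_state=42, stratify=wine.target
)

tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Text rendering of the learned threshold rules
rules = export_text(tree, feature_names=list(wine.feature_names))
print(rules)

# Features ranked by impurity-based importance; unused features score 0 and are skipped
ranked = sorted(
    zip(wine.feature_names, tree.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked:
    if importance > 0:
        print(f"{name:30s} {importance:.3f}")
```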

## 7. Model 3: Support Vector Machine (SVM)

In [None]:
# Create and train SVM model
# SVM benefits from feature scaling, so we use the scaled data
svm_model = SVC(kernel='rbf', random_state=42)
svm_model.fit(X_train_scaled, y_train)

# Make predictions
svm_predictions = svm_model.predict(X_test_scaled)

# Display classification report
print("\n" + "="*70)
print("SUPPORT VECTOR MACHINE - Classification Report")
print("="*70)
print(classification_report(y_test, svm_predictions, target_names=wine_data.target_names))

# Calculate and display accuracy
svm_accuracy = svm_model.score(X_test_scaled, y_test)
print(f"Overall Accuracy: {svm_accuracy:.4f}")
print("="*70)
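The claim that SVM benefits from feature scaling can be checked by training the same model on raw and standardized features. This is a minimal sketch under the same split settings as above; `unscaled_acc` and `scaled_acc` are illustrative names, and the size of the gap will depend on the split.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# RBF SVM on raw features: large-valued columns such as proline dominate the distances
unscaled_model = SVC(kernel="rbf", random_state=42).fit(X_train, y_train)
unscaled_acc = unscaled_model.score(X_test, y_test)

# Same model on standardized features (scaler fit on the training set only)
scaler = StandardScaler().fit(X_train)
scaled_model = SVC(kernel="rbf", random_state=42).fit(scaler.transform(X_train), y_train)
scaled_acc = scaled_model.score(scaler.transform(X_test), y_test)

print(f"Unscaled accuracy: {unscaled_acc:.4f}")
print(f"Scaled accuracy:   {scaled_acc:.4f}")
```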

## 8. Performance Comparison

In [None]:
# Create a comparison summary
print("\n" + "="*70)
print("MODEL PERFORMANCE COMPARISON")
print("="*70)

comparison_df = pd.DataFrame({
    'Model': ['Logistic Regression', 'Decision Tree', 'Support Vector Machine'],
    'Accuracy': [lr_accuracy, dt_accuracy, svm_accuracy]
}).sort_values('Accuracy', ascending=False)

print(comparison_df.to_string(index=False))
print("="*70)

print("\n📊 Key Insights:")
print("\n1. Precision: Indicates how many of the predicted instances of each cultivar")
print("   were actually correct. High precision means fewer false positives.")

print("\n2. Recall: Shows how many of the actual instances of each cultivar were")
print("   correctly identified. High recall means fewer false negatives.")

print("\n3. F1-Score: Harmonic mean of precision and recall, providing a balanced")
print("   measure of model performance across all cultivars.")

print("\n4. Accuracy: Overall percentage of correct predictions across all three")
print("   cultivars.")

print("\n💡 Analysis:")
best_model = comparison_df.iloc[0]['Model']
best_accuracy = comparison_df.iloc[0]['Accuracy']
print(f"\nThe {best_model} achieved the highest accuracy of {best_accuracy:.4f}")
print("\nHigh accuracy across all three models would indicate that chemical properties")
print("carry enough signal to distinguish the cultivars. Note that the test set is small")
print("(20% of 178 samples), so small accuracy differences between models may not be")
print("meaningful. Reliable cultivar classification is valuable for:")
print("  - Quality control in wine production")
print("  - Authentication and fraud detection")
print("  - Understanding the relationship between chemistry and wine characteristics")
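The metric definitions printed above can be reproduced by hand from a confusion matrix. The sketch below uses small hypothetical label arrays (illustrative values, not output from any of the models) and compares the manual results with scikit-learn's `precision_recall_fscore_support`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

# Hypothetical true and predicted labels for three cultivars (illustration only)
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2])
y_pred = np.array([0, 0, 0, 1, 1, 1, 1, 0, 2, 2, 2, 2])

cm = confusion_matrix(y_true, y_pred)  # rows: actual class, columns: predicted class
true_pos = np.diag(cm)

precision = true_pos / cm.sum(axis=0)  # correct predictions / all predictions of that class
recall = true_pos / cm.sum(axis=1)     # correct predictions / all actual members of that class
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
accuracy = true_pos.sum() / cm.sum()

print("Confusion matrix:\n", cm)
print("Precision:", precision.round(3))
print("Recall:   ", recall.round(3))
print("F1-score: ", f1.round(3))
print(f"Accuracy:  {accuracy:.3f}")

# Cross-check against scikit-learn's own per-class computation
sk_precision, sk_recall, sk_f1, _ = precision_recall_fscore_support(y_true, y_pred)
```

The per-class values here are the same quantities that `classification_report` lists in its precision, recall, and f1-score columns.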

## Conclusion

This analysis demonstrates how machine learning can be used to classify wines into their respective cultivars based on chemical analysis. Each model has its strengths:

- **Logistic Regression**: Simple, interpretable, and efficient for linearly separable data
- **Decision Tree**: Captures non-linear relationships and provides interpretable decision rules
- **Support Vector Machine**: Effective for high-dimensional data and can handle non-linear boundaries

The classification reports provide detailed metrics (precision, recall, F1-score) for each cultivar, helping identify which cultivars are easier or harder to classify. This information is valuable for understanding the distinctiveness of different wine varieties based on their chemical profiles.