# Wine Dataset Classification Analysis

## Overview
The Wine dataset contains the results of chemical analysis of wines grown in the same region of Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wines.

## Dataset Details
- **Samples**: 178 wine samples
- **Features**: 13 chemical properties
- **Target**: 3 wine classes (cultivars)
- **Task**: Multi-class classification

## Features (Chemical Properties)
1. Alcohol
2. Malic acid
3. Ash
4. Alcalinity of ash
5. Magnesium
6. Total phenols
7. Flavanoids
8. Nonflavanoid phenols
9. Proanthocyanins
10. Color intensity
11. Hue
12. OD280/OD315 of diluted wines
13. Proline

## Step 1: Import Required Libraries
Import all necessary libraries for data analysis, visualization, and machine learning.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
np.random.seed(42)

# Configure plotting
plt.style.use('default')
sns.set_palette('viridis')
plt.rcParams['figure.figsize'] = (12, 8)

## Step 2: Load and Explore the Dataset
Load the Wine dataset and examine its structure and basic properties.

In [None]:
# Load the Wine dataset
wine = load_wine()

# Create a DataFrame for easier manipulation
df = pd.DataFrame(wine.data, columns=wine.feature_names)
df['target'] = wine.target
df['wine_class'] = df['target'].map({0: 'Class 0', 1: 'Class 1', 2: 'Class 2'})

print("Dataset Shape:", df.shape)
print("\nFirst 5 rows:")
print(df.head())

print("\nDataset Info:")
df.info()  # info() prints its report directly and returns None

print("\nClass Distribution:")
print(df['wine_class'].value_counts())
print("\nClass Distribution (counts):")
print(df['target'].value_counts().sort_index())

## Step 3: Statistical Summary and Data Quality Check
Examine statistical properties and check for data quality issues.

In [None]:
# Statistical summary
print("Statistical Summary:")
print(df.describe())

# Check for missing values
print("\nMissing Values:")
print(df.isnull().sum())

# Check for duplicates
print(f"\nDuplicate rows: {df.duplicated().sum()}")

# Feature ranges (to understand scaling needs)
print("\nFeature Ranges:")
feature_ranges = pd.DataFrame({
    'Min': df[wine.feature_names].min(),
    'Max': df[wine.feature_names].max(),
    'Range': df[wine.feature_names].max() - df[wine.feature_names].min()
})
print(feature_ranges.sort_values('Range', ascending=False))

## Step 4: Exploratory Data Analysis and Visualization
Create comprehensive visualizations to understand data patterns and relationships.

In [None]:
# Class distribution visualization
plt.figure(figsize=(10, 6))
plt.subplot(1, 2, 1)
df['wine_class'].value_counts().plot(kind='bar', color='skyblue')
plt.title('Wine Class Distribution')
plt.xlabel('Wine Class')
plt.ylabel('Count')
plt.xticks(rotation=0)

plt.subplot(1, 2, 2)
plt.pie(df['wine_class'].value_counts(), labels=df['wine_class'].value_counts().index, autopct='%1.1f%%')
plt.title('Wine Class Distribution (Pie Chart)')

plt.tight_layout()
plt.show()

In [None]:
# Feature distribution by class - Box plots
# Select top 6 features by raw (unscaled) variance; this ranking is dominated by measurement scale
feature_variance = df[wine.feature_names].var().sort_values(ascending=False)
top_features = feature_variance.head(6).index

fig, axes = plt.subplots(2, 3, figsize=(18, 12))
axes = axes.ravel()

for i, feature in enumerate(top_features):
    sns.boxplot(data=df, x='wine_class', y=feature, ax=axes[i])
    axes[i].set_title(f'{feature} by Wine Class')
    axes[i].tick_params(axis='x', rotation=45)

plt.suptitle('Feature Distributions by Wine Class (Top 6 by Variance)', fontsize=16, y=1.02)
plt.tight_layout()
plt.show()

print("Top features by variance:")
for i, (feature, variance) in enumerate(feature_variance.head(6).items()):
    print(f"{i+1}. {feature}: {variance:.2f}")

In [None]:
# Correlation matrix heatmap
plt.figure(figsize=(15, 12))
correlation_matrix = df[wine.feature_names].corr()
mask = np.triu(np.ones_like(correlation_matrix, dtype=bool))
sns.heatmap(correlation_matrix, mask=mask, annot=True, cmap='coolwarm', center=0, 
            square=True, fmt='.2f', cbar_kws={'shrink': 0.8})
plt.title('Feature Correlation Matrix (Wine Dataset)')
plt.tight_layout()
plt.show()

# Identify highly correlated features
high_corr_pairs = []
for i in range(len(correlation_matrix.columns)):
    for j in range(i+1, len(correlation_matrix.columns)):
        corr_val = correlation_matrix.iloc[i, j]
        if abs(corr_val) > 0.7:
            high_corr_pairs.append((correlation_matrix.columns[i], correlation_matrix.columns[j], corr_val))

print("\nHighly correlated feature pairs (|correlation| > 0.7):")
for feat1, feat2, corr in high_corr_pairs:
    print(f"{feat1} - {feat2}: {corr:.3f}")

## Step 5: Feature Scaling and Data Preprocessing
Prepare the data for machine learning by scaling features and splitting the dataset.

In [None]:
# Separate features and target
X = wine.data
y = wine.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

print(f"Training set size: {X_train.shape[0]}")
print(f"Test set size: {X_test.shape[0]}")
print(f"Feature dimensions: {X_train.shape[1]}")

# Check class distribution in splits
print("\nClass distribution in training set:")
unique, counts = np.unique(y_train, return_counts=True)
for cls, count in zip(unique, counts):
    print(f"Class {cls}: {count} samples ({count/len(y_train)*100:.1f}%)")

print("\nClass distribution in test set:")
unique, counts = np.unique(y_test, return_counts=True)
for cls, count in zip(unique, counts):
    print(f"Class {cls}: {count} samples ({count/len(y_test)*100:.1f}%)")

# Apply different scaling methods
standard_scaler = StandardScaler()
minmax_scaler = MinMaxScaler()

X_train_std = standard_scaler.fit_transform(X_train)
X_test_std = standard_scaler.transform(X_test)

X_train_minmax = minmax_scaler.fit_transform(X_train)
X_test_minmax = minmax_scaler.transform(X_test)

print("\nFeature scaling completed with both StandardScaler and MinMaxScaler.")
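
One caveat about the scaling step: when cross-validation later runs on data that was scaled once up front, each validation fold has already influenced the scaler's mean and variance. A minimal sketch of one way around this, assuming a scikit-learn `Pipeline` is acceptable here, refits the scaler inside every fold. The names `scaled_knn`, `raw_knn`, and `cv` are illustrative and not part of the notebook above.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Shuffled, stratified folds with a fixed seed for repeatability
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# The pipeline refits StandardScaler on the training part of every fold
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
raw_knn = KNeighborsClassifier(n_neighbors=5)

scaled_scores = cross_val_score(scaled_knn, X_train, y_train, cv=cv)
raw_scores = cross_val_score(raw_knn, X_train, y_train, cv=cv)

print(f"KNN with in-fold scaling: {scaled_scores.mean():.4f}")
print(f"KNN without scaling:      {raw_scores.mean():.4f}")
```

Passing the pipeline, rather than pre-scaled arrays, to `cross_val_score` keeps the preprocessing and the model together, so the same object can later be fit on the full training set and applied to the test set unchanged.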

## Step 6: Model Training and Evaluation
Train multiple classification models and compare their performance.

In [None]:
# Initialize models
models = {
    'Logistic Regression': LogisticRegression(random_state=42, max_iter=1000),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'SVM': SVC(random_state=42),
    'K-Nearest Neighbors': KNeighborsClassifier(n_neighbors=5),
    'Gradient Boosting': GradientBoostingClassifier(random_state=42),
    'Naive Bayes': GaussianNB()
}

# Train and evaluate models with different scaling
scaling_methods = {
    'Standard Scaling': (X_train_std, X_test_std),
    'MinMax Scaling': (X_train_minmax, X_test_minmax),
    'No Scaling': (X_train, X_test)
}

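from sklearn.base import clone

results = {}

for scale_name, (X_tr, X_te) in scaling_methods.items():
    print(f"\n{'='*50}")
    print(f"Results with {scale_name}")
    print(f"{'='*50}")
    
    scale_results = {}
    
    for name, base_model in models.items():
        print(f"\n--- {name} ---")
        
        # Fit a fresh copy so each scaling method keeps its own fitted model;
        # reusing one instance would leave only the last fit in `results`
        model = clone(base_model)
        model.fit(X_tr, y_train)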
        
        # Make predictions
        y_pred = model.predict(X_te)
        
        # Calculate accuracy
        accuracy = accuracy_score(y_test, y_pred)
        print(f"Test Accuracy: {accuracy:.4f}")
        
        # Cross-validation
        cv_scores = cross_val_score(model, X_tr, y_train, cv=5)
        print(f"Cross-validation Accuracy: {cv_scores.mean():.4f} (+/- {cv_scores.std() * 2:.4f})")
        
        # Store results
        scale_results[name] = {
            'model': model,
            'accuracy': accuracy,
            'predictions': y_pred,
            'cv_mean': cv_scores.mean(),
            'cv_std': cv_scores.std()
        }
    
    results[scale_name] = scale_results

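# Find best performing model overall
# Note: ranking by test accuracy makes the reported score optimistic;
# 'cv_mean' is a less biased selection criterion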
best_accuracy = 0
best_model_info = None

for scale_name, scale_results in results.items():
    for model_name, model_results in scale_results.items():
        if model_results['accuracy'] > best_accuracy:
            best_accuracy = model_results['accuracy']
            best_model_info = (scale_name, model_name, model_results)

print(f"\n{'='*60}")
print(f"BEST PERFORMING MODEL")
print(f"{'='*60}")
print(f"Scaling: {best_model_info[0]}")
print(f"Model: {best_model_info[1]}")
print(f"Test Accuracy: {best_model_info[2]['accuracy']:.4f}")
print(f"CV Accuracy: {best_model_info[2]['cv_mean']:.4f} (+/- {best_model_info[2]['cv_std'] * 2:.4f})")

## Step 7: Detailed Analysis of Best Model
Perform detailed analysis including classification report and confusion matrix.

In [None]:
# Use the best performing model for detailed analysis
best_scale, best_model_name, best_results = best_model_info
best_predictions = best_results['predictions']

print(f"Detailed Analysis - {best_model_name} with {best_scale}")
print("="*60)

# Classification report
print("\nClassification Report:")
print(classification_report(y_test, best_predictions, target_names=wine.target_names))

# Confusion Matrix
cm = confusion_matrix(y_test, best_predictions)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=wine.target_names, yticklabels=wine.target_names)
plt.title(f'Confusion Matrix - {best_model_name} ({best_scale})')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()

# Per-class accuracy
class_accuracies = cm.diagonal() / cm.sum(axis=1)
print("\nPer-class Accuracy:")
for i, acc in enumerate(class_accuracies):
    print(f"Class {i} ({wine.target_names[i]}): {acc:.4f}")
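
The per-class accuracy computed above (diagonal of the confusion matrix divided by the row sums) is the same quantity that scikit-learn reports as per-class recall. The sketch below checks this on a small set of made-up labels; `y_true` and `y_hat` are hypothetical values chosen for illustration, not results from the wine models.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical labels, for illustration only
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 2, 2, 2])
y_hat = np.array([0, 0, 1, 1, 1, 1, 2, 2, 2, 0])

cm = confusion_matrix(y_true, y_hat)

# Rows are actual classes, so row sums are the per-class sample counts
per_class = cm.diagonal() / cm.sum(axis=1)
recall = recall_score(y_true, y_hat, average=None)

print("Diagonal / row sum:", per_class)
print("recall_score:      ", recall)
print("Identical:", np.allclose(per_class, recall))
```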

## Step 8: Model Performance Comparison Visualization
Create comprehensive visualizations comparing all models and scaling methods.

In [None]:
# Create comparison DataFrame
comparison_data = []
for scale_name, scale_results in results.items():
    for model_name, model_results in scale_results.items():
        comparison_data.append({
            'Scaling': scale_name,
            'Model': model_name,
            'Test_Accuracy': model_results['accuracy'],
            'CV_Mean': model_results['cv_mean'],
            'CV_Std': model_results['cv_std']
        })

comparison_df = pd.DataFrame(comparison_data)

# Model performance heatmap
plt.figure(figsize=(12, 8))
pivot_test = comparison_df.pivot(index='Model', columns='Scaling', values='Test_Accuracy')
sns.heatmap(pivot_test, annot=True, cmap='RdYlGn', center=0.9, fmt='.3f')
plt.title('Test Accuracy Comparison Across Models and Scaling Methods')
plt.tight_layout()
plt.show()

# Bar plot comparison
plt.figure(figsize=(15, 10))
for i, scale_name in enumerate(scaling_methods.keys()):
    plt.subplot(2, 2, i+1)
    scale_data = comparison_df[comparison_df['Scaling'] == scale_name]
    plt.bar(scale_data['Model'], scale_data['Test_Accuracy'], alpha=0.7)
    plt.title(f'Model Performance with {scale_name}')
    plt.ylabel('Test Accuracy')
    plt.xticks(rotation=45)
    plt.ylim(0.0, 1.05)  # start at 0 so low-scoring models (e.g. unscaled SVM or KNN) are not clipped

plt.tight_layout()
plt.show()

# Summary table
print("\nComplete Performance Summary:")
summary_table = comparison_df.pivot(index='Model', columns='Scaling', values='Test_Accuracy')
print(summary_table.round(4))

## Step 9: Dimensionality Reduction and Visualization
Apply PCA and t-SNE for data visualization and dimensionality reduction analysis.

In [None]:
# Apply PCA
pca = PCA()
X_pca = pca.fit_transform(X_train_std)

# Plot explained variance
plt.figure(figsize=(15, 5))

plt.subplot(1, 3, 1)
plt.plot(range(1, len(pca.explained_variance_ratio_) + 1), 
         np.cumsum(pca.explained_variance_ratio_), 'bo-')
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance')
plt.title('PCA: Cumulative Explained Variance')
plt.grid(True, alpha=0.3)
plt.axhline(y=0.95, color='r', linestyle='--', label='95% threshold')
plt.legend()

plt.subplot(1, 3, 2)
plt.bar(range(1, len(pca.explained_variance_ratio_) + 1), pca.explained_variance_ratio_)
plt.xlabel('Component')
plt.ylabel('Explained Variance Ratio')
plt.title('PCA: Individual Component Variance')

# 2D PCA visualization
plt.subplot(1, 3, 3)
colors = ['red', 'green', 'blue']
for i, (color, target_name) in enumerate(zip(colors, wine.target_names)):
    mask = y_train == i
    plt.scatter(X_pca[mask, 0], X_pca[mask, 1], c=color, label=target_name, alpha=0.7)

plt.xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.1%} variance)')
plt.ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.1%} variance)')
plt.title('2D PCA Visualization')
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Find number of components for 95% variance
cumsum_var = np.cumsum(pca.explained_variance_ratio_)
n_components_95 = np.argmax(cumsum_var >= 0.95) + 1
print(f"Number of components needed for 95% variance: {n_components_95}")
print(f"First 5 components explain {cumsum_var[4]:.1%} of variance")
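
The component count for a 95% threshold can also be obtained by passing a float to `PCA`, which asks scikit-learn to keep just enough components to reach that fraction of variance. The sketch below rebuilds the same train split so it runs on its own and compares the manual `argmax` count with the built-in one; `manual_count` and `pca_95` are names introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, _, y_train, _ = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
X_train_std = StandardScaler().fit_transform(X_train)

# Manual count: first index where cumulative variance reaches 95%
pca_full = PCA().fit(X_train_std)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
manual_count = int(np.argmax(cumulative >= 0.95) + 1)

# Built-in: a float in (0, 1) selects the number of components automatically
pca_95 = PCA(n_components=0.95, svd_solver="full").fit(X_train_std)

print(f"Manual count:   {manual_count}")
print(f"PCA(0.95) count: {pca_95.n_components_}")
```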

In [None]:
# Apply t-SNE for non-linear dimensionality reduction
tsne = TSNE(n_components=2, random_state=42, perplexity=30)
X_tsne = tsne.fit_transform(X_train_std)

# Create visualization comparing PCA and t-SNE
plt.figure(figsize=(15, 6))

# PCA plot
plt.subplot(1, 2, 1)
for i, (color, target_name) in enumerate(zip(colors, wine.target_names)):
    mask = y_train == i
    plt.scatter(X_pca[mask, 0], X_pca[mask, 1], c=color, label=target_name, alpha=0.7, s=50)

plt.xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.1%})')
plt.ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.1%})')
plt.title('PCA - Linear Dimensionality Reduction')
plt.legend()
plt.grid(True, alpha=0.3)

# t-SNE plot
plt.subplot(1, 2, 2)
for i, (color, target_name) in enumerate(zip(colors, wine.target_names)):
    mask = y_train == i
    plt.scatter(X_tsne[mask, 0], X_tsne[mask, 1], c=color, label=target_name, alpha=0.7, s=50)

plt.xlabel('t-SNE 1')
plt.ylabel('t-SNE 2')
plt.title('t-SNE - Non-linear Dimensionality Reduction')
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("Dimensionality Reduction Analysis:")
print(f"- PCA: First 2 components explain {(pca.explained_variance_ratio_[0] + pca.explained_variance_ratio_[1]):.1%} of variance")
print(f"- t-SNE: Non-linear 2D embedding, colored by the {len(np.unique(y_train))} known classes (no clustering is performed)")
print(f"- Both methods show good class separation")
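
The statement about class separation is a visual judgment. One way to put a number on it, offered here as a sketch rather than as part of the original analysis, is the silhouette score computed with the true class labels: values near 1 mean tight, well-separated classes, and values near 0 mean overlapping ones. The sketch applies it to the 2D PCA projection and to the full standardized feature space; t-SNE is left out because distances in a t-SNE embedding are not reliable for this purpose.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, _, y_train, _ = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
X_train_std = StandardScaler().fit_transform(X_train)

# Project onto the first two principal components
X_pca2 = PCA(n_components=2).fit_transform(X_train_std)

# Silhouette score using the known cultivar labels as the grouping
sil_pca = silhouette_score(X_pca2, y_train)
sil_full = silhouette_score(X_train_std, y_train)

print(f"Silhouette (2D PCA projection): {sil_pca:.3f}")
print(f"Silhouette (all 13 features):   {sil_full:.3f}")
```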

## Step 10: Feature Importance Analysis
Analyze which features are most important for wine classification.

In [None]:
# Get feature importance from Random Forest (tree splits are unaffected by feature scaling)
rf_results = results['No Scaling']['Random Forest']  # RF doesn't require scaling
rf_model = rf_results['model']

# Feature importance from Random Forest
feature_importance_rf = pd.DataFrame({
    'feature': wine.feature_names,
    'importance': rf_model.feature_importances_
}).sort_values('importance', ascending=False)

# PCA component analysis
pca_components = pd.DataFrame(
    pca.components_[:3].T,  # First 3 components
    columns=['PC1', 'PC2', 'PC3'],
    index=wine.feature_names
)

# Create visualization
plt.figure(figsize=(18, 12))

# Random Forest Feature Importance
plt.subplot(2, 2, 1)
sns.barplot(data=feature_importance_rf.head(10), x='importance', y='feature')
plt.title('Top 10 Features - Random Forest Importance')
plt.xlabel('Importance Score')

# PCA Component Loadings
plt.subplot(2, 2, 2)
pca_abs = pca_components.abs()
top_features_pca = pca_abs.sum(axis=1).sort_values(ascending=False).head(10)
sns.barplot(x=top_features_pca.values, y=top_features_pca.index)
plt.title('Top 10 Features - PCA Loading Magnitude')
plt.xlabel('Sum of Absolute Loadings (PC1-PC3)')

# Feature importance heatmap
plt.subplot(2, 2, 3)
top_10_features = feature_importance_rf.head(10)['feature']
correlation_subset = df[top_10_features].corr()
sns.heatmap(correlation_subset, annot=True, cmap='coolwarm', center=0, fmt='.2f')
plt.title('Correlation Matrix - Top 10 Important Features')

# Compare different importance measures
plt.subplot(2, 2, 4)
# Combine different importance measures
importance_comparison = pd.DataFrame({
    'RF_Importance': feature_importance_rf.set_index('feature')['importance'],
    'PCA_Loading': pca_abs.sum(axis=1)
}).fillna(0)

# Normalize for comparison
importance_comparison = importance_comparison.div(importance_comparison.max())
importance_comparison['Average'] = importance_comparison.mean(axis=1)
importance_comparison = importance_comparison.sort_values('Average', ascending=False)

x = np.arange(len(importance_comparison.head(8)))
width = 0.35

plt.bar(x - width/2, importance_comparison.head(8)['RF_Importance'], 
        width, label='Random Forest', alpha=0.7)
plt.bar(x + width/2, importance_comparison.head(8)['PCA_Loading'], 
        width, label='PCA Loading', alpha=0.7)

plt.xlabel('Features')
plt.ylabel('Normalized Importance')
plt.title('Feature Importance Comparison')
plt.xticks(x, importance_comparison.head(8).index, rotation=45)
plt.legend()

plt.tight_layout()
plt.show()

print("Top 10 Most Important Features (Random Forest):")
for i, row in feature_importance_rf.head(10).iterrows():
    print(f"{row['feature']}: {row['importance']:.4f}")

print("\nTop 5 Features by PCA Loading Magnitude:")
for feature, loading in top_features_pca.head(5).items():
    print(f"{feature}: {loading:.4f}")
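
Impurity-based importances from a Random Forest are computed on the training data and can favor some features for reasons unrelated to predictive value. A common cross-check, sketched here as an assumption about what a reader might want next, is permutation importance on the held-out test set using `sklearn.inspection.permutation_importance`. The variable names `perm` and `ranking` are illustrative.

```python
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.3, random_state=42, stratify=wine.target
)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Shuffle each feature in turn and measure the drop in test accuracy
perm = permutation_importance(
    rf, X_test, y_test, n_repeats=10, random_state=42, scoring="accuracy"
)

ranking = pd.Series(perm.importances_mean, index=wine.feature_names)
ranking = ranking.sort_values(ascending=False)

print("Permutation importance (mean accuracy drop on test set):")
print(ranking.head(5).round(4))
```

Because several wine features are strongly correlated, shuffling one of them may cost little accuracy even if it is informative, so small permutation scores should be read with that in mind.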

## Step 11: Hyperparameter Tuning
Perform hyperparameter tuning on the best performing model.

In [None]:
# Hyperparameter tuning for the best model type
# Let's tune both SVM and Random Forest as they often perform well

# Define parameter grids
param_grids = {
    'SVM': {
        'C': [0.1, 1, 10, 100],
        'gamma': ['scale', 'auto', 0.001, 0.01, 0.1, 1],
        'kernel': ['rbf', 'poly', 'linear']
    },
    'Random Forest': {
        'n_estimators': [50, 100, 200],
        'max_depth': [None, 10, 20, 30],
        'min_samples_split': [2, 5, 10],
        'min_samples_leaf': [1, 2, 4]
    }
}

models_for_tuning = {
    'SVM': SVC(random_state=42),
    'Random Forest': RandomForestClassifier(random_state=42)
}

tuning_results = {}

# Use standard scaled data for tuning
for model_name, model in models_for_tuning.items():
    print(f"\nTuning {model_name}...")
    
    # Grid search with cross-validation
    grid_search = GridSearchCV(
        model, 
        param_grids[model_name], 
        cv=5, 
        scoring='accuracy',
        n_jobs=-1,
        verbose=1
    )
    
    grid_search.fit(X_train_std, y_train)
    
    # Test the best model
    best_model = grid_search.best_estimator_
    y_pred_tuned = best_model.predict(X_test_std)
    tuned_accuracy = accuracy_score(y_test, y_pred_tuned)
    
    tuning_results[model_name] = {
        'best_params': grid_search.best_params_,
        'best_cv_score': grid_search.best_score_,
        'test_accuracy': tuned_accuracy,
        'model': best_model,
        'predictions': y_pred_tuned
    }
    
    print(f"Best parameters: {grid_search.best_params_}")
    print(f"Best CV score: {grid_search.best_score_:.4f}")
    print(f"Test accuracy: {tuned_accuracy:.4f}")

# Compare with original results
print("\n" + "="*60)
print("HYPERPARAMETER TUNING RESULTS COMPARISON")
print("="*60)

for model_name in tuning_results.keys():
    original_acc = results['Standard Scaling'][model_name]['accuracy']
    tuned_acc = tuning_results[model_name]['test_accuracy']
    improvement = tuned_acc - original_acc
    
    print(f"\n{model_name}:")
    print(f"  Original accuracy: {original_acc:.4f}")
    print(f"  Tuned accuracy: {tuned_acc:.4f}")
    print(f"  Improvement: {improvement:+.4f}")
    print(f"  Best parameters: {tuning_results[model_name]['best_params']}")
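
The grid search above runs on data scaled before cross-validation, so every inner validation fold has already contributed to the scaler statistics. A sketch of the alternative, assuming a `Pipeline` is used, tunes the scaler and the SVM together; grid keys take the `stepname__parameter` form. The grid is deliberately smaller than the one above to keep the example quick, and the step names `scaler` and `svc` are choices made here.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("svc", SVC(random_state=42)),
])

# Parameters are addressed as <step name>__<parameter name>
param_grid = {
    "svc__C": [0.1, 1, 10],
    "svc__kernel": ["rbf", "linear"],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)  # unscaled input; scaling happens inside each fold

print("Best parameters:", search.best_params_)
print(f"Best CV accuracy: {search.best_score_:.4f}")
print(f"Test accuracy:    {search.score(X_test, y_test):.4f}")
```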

## Step 12: Final Model Evaluation and Summary
Comprehensive evaluation of the best performing model with all metrics.

In [None]:
# Find the best model overall (including tuned models)
all_results = []

# Add original results
for scale_name, scale_results in results.items():
    for model_name, model_results in scale_results.items():
        all_results.append({
            'model_type': f"{model_name} ({scale_name})",
            'accuracy': model_results['accuracy'],
            'cv_mean': model_results['cv_mean'],
            'source': 'Original'
        })

# Add tuned results
for model_name, tuned_results in tuning_results.items():
    all_results.append({
        'model_type': f"{model_name} (Tuned)",
        'accuracy': tuned_results['test_accuracy'],
        'cv_mean': tuned_results['best_cv_score'],
        'source': 'Tuned'
    })

# Create final comparison
final_comparison = pd.DataFrame(all_results).sort_values('accuracy', ascending=False)

print("FINAL MODEL RANKING (Top 10):")
print("="*60)
print(final_comparison.head(10).to_string(index=False))

# Best model analysis
best_model_row = final_comparison.iloc[0]
print(f"\n\nBEST OVERALL MODEL: {best_model_row['model_type']}")
print(f"Test Accuracy: {best_model_row['accuracy']:.4f}")
print(f"CV Score: {best_model_row['cv_mean']:.4f}")

# Get the actual best model for detailed analysis
if 'Tuned' in best_model_row['model_type']:
    model_name = best_model_row['model_type'].split(' (')[0]
    final_best_model = tuning_results[model_name]['model']
    final_predictions = tuning_results[model_name]['predictions']
else:
    # Parse original model info
    parts = best_model_row['model_type'].split(' (')
    model_name = parts[0]
    scale_name = parts[1].rstrip(')')
    final_best_model = results[scale_name][model_name]['model']
    final_predictions = results[scale_name][model_name]['predictions']

# Final confusion matrix and classification report
plt.figure(figsize=(12, 5))

# Confusion matrix
plt.subplot(1, 2, 1)
cm_final = confusion_matrix(y_test, final_predictions)
sns.heatmap(cm_final, annot=True, fmt='d', cmap='Blues',
            xticklabels=wine.target_names, yticklabels=wine.target_names)
plt.title(f'Final Model Confusion Matrix\n{best_model_row["model_type"]}')
plt.xlabel('Predicted')
plt.ylabel('Actual')

# Performance comparison visualization
plt.subplot(1, 2, 2)
top_models = final_comparison.head(8)
colors = plt.cm.RdYlGn(np.linspace(0.3, 0.9, len(top_models)))

bars = plt.barh(range(len(top_models)), top_models['accuracy'], color=colors)
plt.yticks(range(len(top_models)), [name.replace(' (', '\n(') for name in top_models['model_type']])
plt.xlabel('Test Accuracy')
plt.title('Top 8 Models Performance')
plt.xlim(0.9, 1.0)

# Add accuracy labels on bars
for i, (bar, acc) in enumerate(zip(bars, top_models['accuracy'])):
    plt.text(acc + 0.001, i, f'{acc:.3f}', va='center', ha='left', fontsize=9)

plt.tight_layout()
plt.show()

print("\nFinal Classification Report:")
print(classification_report(y_test, final_predictions, target_names=wine.target_names))

# Per-class performance
per_class_acc = cm_final.diagonal() / cm_final.sum(axis=1)
print("\nPer-class Performance:")
for i, (acc, name) in enumerate(zip(per_class_acc, wine.target_names)):
    print(f"{name}: {acc:.4f} ({cm_final.diagonal()[i]}/{cm_final.sum(axis=1)[i]} correct)")

## Key Findings and Conclusions

### Dataset Characteristics
- **Size**: 178 samples with 13 chemical features
- **Balance**: Relatively balanced classes (Class 0: 59, Class 1: 71, Class 2: 48)
- **Quality**: No missing values in any of the 13 features
- **Complexity**: More challenging than Iris due to higher dimensionality and subtle class differences

### Feature Analysis
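- **High Variance Features**: Proline, magnesium, and alcalinity of ash show the highest raw variance, which mostly reflects their larger measurement scales rather than their discriminative power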
- **Correlations**: Several strong correlations exist (e.g., flavanoids with total phenols)
- **Importance**: Flavanoids, Proline, and Color intensity are consistently most important
- **Dimensionality**: First 5 PCA components capture ~80% of variance

### Model Performance
- **Best Models**: SVM and ensemble methods generally perform best
- **Scaling Impact**: Standard scaling substantially improves scale-sensitive models such as SVM and KNN; tree-based models are largely unaffected
- **Accuracy Range**: With scaled features, most models achieve 90-100% test accuracy
- **Stability**: Cross-validation scores are consistent with test performance
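
Accuracy figures in this range come from a test set of only 54 samples (30% of 178), so each one carries noticeable uncertainty. A rough way to see how much is a Wilson score interval for a binomial proportion, sketched below in plain Python. The helper `wilson_interval` and the count of 53 correct predictions are hypothetical, chosen only to illustrate the arithmetic.

```python
import math


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half_width = (
        z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    )
    return center - half_width, center + half_width


# Hypothetical outcome: 53 of 54 test samples classified correctly
low, high = wilson_interval(53, 54)
print(f"Observed accuracy: {53 / 54:.4f}")
print(f"Approximate 95% interval: [{low:.4f}, {high:.4f}]")
```

An interval several percentage points wide means that two models differing by a single test error cannot be ranked with confidence on this split alone.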

### Hyperparameter Tuning
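- **Improvement**: Tuning can yield modest gains, but because the default models already score near the ceiling, differences may fall within the noise of a 54-sample test set
- **Best Parameters**: Typically favor balanced complexity (moderate C for SVM, reasonable depth for RF)
- **Overfitting Risk**: High accuracy alone does not indicate overfitting; the main risks are the small test set and choosing the best model by its test accuracy, both of which make the reported score optimistic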

### Practical Implications
- **Wine Classification**: Chemical analysis can reliably distinguish wine cultivars
- **Feature Selection**: Focus on flavanoids, proline, and color-related measurements
- **Model Choice**: SVM with RBF kernel or ensemble methods recommended
- **Data Collection**: Additional samples would improve model robustness

### Recommendations
1. **Production Use**: Collect more data before deploying in production
2. **Feature Engineering**: Consider ratios and interactions between chemical compounds
3. **Validation**: Use external validation set from different harvest years/regions
4. **Monitoring**: Track model performance over time as wine characteristics may change
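
As a starting point for the feature engineering recommendation, the sketch below adds a single ratio feature and compares cross-validated accuracy with and without it. The ratio of flavanoids to total phenols is an arbitrary illustrative choice, not a feature proposed by the analysis above, and the column name `flavanoid_phenol_ratio` is made up here.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

frame = load_wine(as_frame=True).frame
X = frame.drop(columns="target")
y = frame["target"]

# Hypothetical engineered feature: share of total phenols that are flavanoids
X_eng = X.copy()
X_eng["flavanoid_phenol_ratio"] = X["flavanoids"] / X["total_phenols"]

# Scaling lives inside the pipeline so it is refit in every fold
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

base_cv = cross_val_score(model, X, y, cv=5).mean()
eng_cv = cross_val_score(model, X_eng, y, cv=5).mean()

print(f"CV accuracy, original 13 features: {base_cv:.4f}")
print(f"CV accuracy, with ratio feature:   {eng_cv:.4f}")
```

With baseline accuracy already high, a single extra feature may make no measurable difference; the pattern matters more than this particular ratio.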