# 🍷 DATA SCIENCE FINAL LAB EXAM – VARIANT 1

## Wine Dataset Multiclass Classification with PCA & Model Deployment

**Author:** Jahanzaib Channa  
**Dataset:** Wine Dataset (Multiclass Classification)  
**Total Marks:** 15 Marks

---

### 📁 Project Files

| File | Description | Link |
|------|-------------|------|
| `wine_classification.py` | Main Python Script (Tasks a-d) | [View File](./wine_classification.py) |
| `streamlit_app.py` | Streamlit Web Application (Task e) | [View File](./streamlit_app.py) |
| `README.md` | Project Documentation | [View File](./README.md) |
| `requirements.txt` | Python Dependencies | [View File](./requirements.txt) |

### 🌐 GitHub Repository

**URL:** [https://github.com/jahanzaib-codes/wine_classification_project](https://github.com/jahanzaib-codes/wine_classification_project)

---

## 📦 Import Libraries

In [None]:
# Import all required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import joblib
import warnings
warnings.filterwarnings('ignore')

print("✅ All libraries imported successfully!")

---

# SECTION A: PRACTICAL TASKS (15 Marks)

---

## Task A: Data Loading, Cleaning & Exploration (2 Marks)

### Requirements:
1. Load the Wine dataset
2. Display shapes of X and y
3. Convert to Pandas DataFrame and show first 5 rows + summary statistics
4. Display class distribution using value_counts() and determine if balanced

In [None]:
# 1. Load the Wine dataset
wine = load_wine()
X = wine.data
y = wine.target
feature_names = wine.feature_names
target_names = wine.target_names

print("✅ Wine dataset loaded successfully!")
print(f"\nFeature names: {feature_names}")
print(f"Target classes: {target_names}")

In [None]:
# 2. Display shapes of X and y
print("📊 Dataset Shapes:")
print(f"   X (Features) shape: {X.shape}")
print(f"   y (Target) shape: {y.shape}")
print(f"\n   Number of samples: {X.shape[0]}")
print(f"   Number of features: {X.shape[1]}")

In [None]:
# 3. Convert the dataset into a Pandas DataFrame
df = pd.DataFrame(X, columns=feature_names)
df['target'] = y
df['wine_class'] = df['target'].map({i: target_names[i] for i in range(len(target_names))})

# Show first 5 rows
print("📋 First 5 rows of the dataset:")
df.head()

In [None]:
# Display summary statistics
print("📈 Summary Statistics:")
df.describe()

In [None]:
# 4. Display class distribution using value_counts()
print("🏷️ Class Distribution:")
class_dist = pd.Series(y).value_counts().sort_index()

for i, count in enumerate(class_dist):
    print(f"   Class {i} ({target_names[i]}): {count} samples ({count/len(y)*100:.1f}%)")

# Determine if the dataset is balanced
min_count = class_dist.min()
max_count = class_dist.max()
balance_ratio = min_count / max_count

print(f"\n⚖️ Balance Analysis:")
print(f"   Min class count: {min_count}")
print(f"   Max class count: {max_count}")
print(f"   Balance ratio (min/max): {balance_ratio:.4f}")

if balance_ratio >= 0.8:
    print("\n   ✅ Dataset is BALANCED (ratio >= 0.8)")
else:
    print("\n   ⚠️ Dataset is IMBALANCED (ratio < 0.8)")
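
As a quick check of the arithmetic behind the balance rule, the sketch below applies the same min/max ratio to the class counts that scikit-learn documents for the Wine dataset (59, 71, and 48). The helper name `balance_ratio_of` is hypothetical, and the 0.8 cutoff is simply the threshold used in the cell above, not a standard definition of balance.

```python
# Class counts of the scikit-learn Wine dataset (class_0, class_1, class_2).
wine_counts = [59, 71, 48]


def balance_ratio_of(counts):
    """Return min/max of the class counts (1.0 means perfectly balanced)."""
    return min(counts) / max(counts)


ratio = balance_ratio_of(wine_counts)
is_balanced = ratio >= 0.8  # same threshold as the notebook cell
print(f"ratio={ratio:.4f}, balanced={is_balanced}")
```

With these counts the ratio is 48/71, roughly 0.68, so the 0.8 rule labels the dataset as imbalanced. The skew is mild, and the stratified split in Task B keeps the class proportions the same in both partitions.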

In [None]:
# Visualize class distribution
plt.figure(figsize=(8, 5))
colors = ['#FF6B6B', '#4ECDC4', '#45B7D1']
bars = plt.bar(target_names, class_dist.values, color=colors)
plt.xlabel('Wine Class')
plt.ylabel('Number of Samples')
plt.title('Wine Dataset Class Distribution')
for bar, count in zip(bars, class_dist.values):
    plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 1, 
             str(count), ha='center', va='bottom', fontweight='bold')
plt.tight_layout()
plt.show()

---

## Task B: Preprocessing, Scaling & Stratified Split (2 Marks)

### Requirements:
1. Standardize all features
2. Split the dataset into 80% training and 20% testing using stratified sampling

In [None]:
# 1. Standardize all features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print("🔧 Feature Standardization:")
print("   - Applied StandardScaler to all features")
print(f"   - Mean of scaled features: {np.mean(X_scaled):.10f} (≈ 0)")
print(f"   - Std of scaled features: {np.std(X_scaled):.4f} (≈ 1)")

In [None]:
# 2. Split the dataset into 80% training and 20% testing using stratified sampling
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42, stratify=y
)

print("📂 Stratified Train-Test Split:")
print(f"   Training set size: {X_train.shape[0]} samples ({X_train.shape[0]/len(y)*100:.0f}%)")
print(f"   Testing set size: {X_test.shape[0]} samples ({X_test.shape[0]/len(y)*100:.0f}%)")

print("\n   Training set class distribution:")
train_counts = pd.Series(y_train).value_counts().sort_index()
for i, count in enumerate(train_counts):
    print(f"      Class {i}: {count} samples ({count/len(y_train)*100:.1f}%)")

print("\n   Testing set class distribution:")
test_counts = pd.Series(y_test).value_counts().sort_index()
for i, count in enumerate(test_counts):
    print(f"      Class {i}: {count} samples ({count/len(y_test)*100:.1f}%)")

In [None]:
# Save scaler for deployment
joblib.dump(scaler, 'scaler.pkl')
print("✅ Scaler saved as 'scaler.pkl'")
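
One point worth noting about the cells above: the scaler is fit on all 178 samples before the split, so the test rows contribute to the means and standard deviations used for scaling. The sketch below shows one possible leak-free variant, assuming it is acceptable to split first and scale inside a scikit-learn pipeline. The names `X_tr`, `X_te`, `pipe`, and `acc` are illustrative and are not part of the notebook, and the SVC is only a stand-in classifier.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_raw, y_raw = load_wine(return_X_y=True)

# Split the raw features first, using the same settings as the notebook.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_raw, y_raw, test_size=0.2, random_state=42, stratify=y_raw
)

# The pipeline fits the scaler on the training rows only, then reuses
# those statistics when scoring the test rows.
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
pipe.fit(X_tr, y_tr)
acc = pipe.score(X_te, y_te)
print(f"train={X_tr.shape}, test={X_te.shape}, accuracy={acc:.4f}")
```

On a dataset this small the difference in accuracy is usually minor, but fitting preprocessing on the training split alone keeps the test score an honest estimate.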

---

## Task C: PCA Analysis (3 Marks)

### Requirements:
1. Apply PCA and determine components needed for 95% and 99% variance
2. Transform training and testing data using 95% variance PCA
3. Display explained variance values numerically

In [None]:
# 1. Apply PCA and determine components needed for 95% and 99% variance
pca_full = PCA()
pca_full.fit(X_train)

explained_variance_ratio = pca_full.explained_variance_ratio_
cumulative_variance = np.cumsum(explained_variance_ratio)

# Components needed for 95% variance
components_95 = np.argmax(cumulative_variance >= 0.95) + 1
# Components needed for 99% variance
components_99 = np.argmax(cumulative_variance >= 0.99) + 1

print("🔬 PCA Variance Analysis:")
print(f"\n📊 Components Needed:")
print(f"   For 95% variance: {components_95} components")
print(f"   For 99% variance: {components_99} components")
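
To make the thresholding step concrete, here is a small worked example with made-up explained-variance ratios (they are hypothetical and are not the Wine results). It uses the same `np.argmax` idiom as the cell above; the helper `components_for` is an illustrative name that also guards the case where the threshold is never reached, since `np.argmax` on an all-False array returns 0.

```python
import numpy as np

# Hypothetical explained-variance ratios for a 5-component PCA (they sum to 1).
ratios = np.array([0.60, 0.25, 0.12, 0.025, 0.005])
cumulative = np.cumsum(ratios)  # approximately [0.60, 0.85, 0.97, 0.995, 1.0]


def components_for(threshold, cumulative):
    """Smallest number of leading components whose cumulative ratio >= threshold."""
    reached = cumulative >= threshold
    if not reached.any():
        # np.argmax would silently return 0 here, so fail loudly instead.
        raise ValueError("threshold is never reached")
    return int(np.argmax(reached)) + 1


n95 = components_for(0.95, cumulative)
n99 = components_for(0.99, cumulative)
print(n95, n99)  # prints: 3 4
```

The `+ 1` converts the zero-based index of the first qualifying component into a component count.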

In [None]:
# 3. Display explained variance values numerically
print("📊 Individual Explained Variance Ratios:")
variance_df = pd.DataFrame({
    'Principal Component': [f'PC{i+1}' for i in range(len(explained_variance_ratio))],
    'Explained Variance Ratio': explained_variance_ratio,
    'Cumulative Variance': cumulative_variance,
    'Percentage': [f'{v*100:.2f}%' for v in explained_variance_ratio],
    'Cumulative %': [f'{v*100:.2f}%' for v in cumulative_variance]
})
variance_df

In [None]:
# Visualize PCA variance
plt.figure(figsize=(10, 6))
components_range = range(1, len(cumulative_variance) + 1)
plt.bar(components_range, explained_variance_ratio, alpha=0.7, label='Individual')
plt.step(components_range, cumulative_variance, where='mid', color='red', 
         linewidth=2, label='Cumulative')
plt.axhline(y=0.95, color='green', linestyle='--', label='95% threshold')
plt.axhline(y=0.99, color='orange', linestyle='--', label='99% threshold')
plt.xlabel('Principal Component')
plt.ylabel('Explained Variance Ratio')
plt.title('PCA Explained Variance Analysis')
plt.legend(loc='best')
plt.xticks(components_range)
plt.tight_layout()
plt.show()

In [None]:
# 2. Transform training and testing data using 95% variance PCA
pca_95 = PCA(n_components=components_95)
X_train_pca = pca_95.fit_transform(X_train)
X_test_pca = pca_95.transform(X_test)

print(f"🔄 Data Transformation with {components_95}-component PCA (95% variance):")
print(f"   Original training shape: {X_train.shape}")
print(f"   Transformed training shape: {X_train_pca.shape}")
print(f"   Original testing shape: {X_test.shape}")
print(f"   Transformed testing shape: {X_test_pca.shape}")
print(f"\n   Dimensionality reduction: {X_train.shape[1]} → {X_train_pca.shape[1]} features")
print(f"   Total Variance Explained: {sum(pca_95.explained_variance_ratio_)*100:.2f}%")

In [None]:
# Save PCA for deployment
joblib.dump(pca_95, 'pca_model.pkl')
print("✅ PCA model saved as 'pca_model.pkl'")
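
As an alternative to counting components by hand, scikit-learn's `PCA` accepts a float between 0 and 1 for `n_components` and then keeps just enough components to explain that fraction of the variance. The sketch below compares the two approaches. It fits on the full standardized dataset purely for demonstration (the notebook fits on the training split), and `X_demo`, `manual_n`, and `pca_auto` are illustrative names.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_demo = StandardScaler().fit_transform(load_wine().data)

# Manual count, as in the notebook: first index where cumulative variance >= 0.95.
cum = np.cumsum(PCA().fit(X_demo).explained_variance_ratio_)
manual_n = int(np.argmax(cum >= 0.95)) + 1

# Let PCA choose the count itself from a variance fraction.
pca_auto = PCA(n_components=0.95, svd_solver="full").fit(X_demo)
print(manual_n, pca_auto.n_components_)
```

Both routes should agree on the component count; the float form simply removes the intermediate full-PCA fit from user code.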

In [None]:
# Visualize PCA 2D scatter plot
pca_2d = PCA(n_components=2)
X_2d = pca_2d.fit_transform(X_scaled)

plt.figure(figsize=(10, 8))
colors = ['#FF6B6B', '#4ECDC4', '#45B7D1']
for i, (color, name) in enumerate(zip(colors, target_names)):
    mask = y == i
    plt.scatter(X_2d[mask, 0], X_2d[mask, 1], c=color, label=name, alpha=0.7, s=60)
plt.xlabel(f'PC1 ({pca_2d.explained_variance_ratio_[0]*100:.1f}%)')
plt.ylabel(f'PC2 ({pca_2d.explained_variance_ratio_[1]*100:.1f}%)')
plt.title('PCA 2D Visualization of Wine Dataset')
plt.legend()
plt.tight_layout()
plt.show()

---

## Task D: Model Training, Evaluation & Comparison (3 Marks)

### Requirements:
Train the following classifiers:
- Decision Tree
- Random Forest Classifier
- Support Vector Machine (SVM)

For each model:
1. Report test accuracy
2. Display confusion matrix
3. Identify the best-performing classifier and justify

In [None]:
# Define classifiers
classifiers = {
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'Support Vector Machine (SVM)': SVC(kernel='rbf', C=1.0, random_state=42)
}

results = {}

print("🤖 Training Models...\n")

In [None]:
# Train and evaluate Decision Tree
print("=" * 60)
print("🌳 DECISION TREE CLASSIFIER")
print("=" * 60)

dt_clf = classifiers['Decision Tree']
dt_clf.fit(X_train_pca, y_train)
dt_pred = dt_clf.predict(X_test_pca)
dt_accuracy = accuracy_score(y_test, dt_pred)

results['Decision Tree'] = {'model': dt_clf, 'accuracy': dt_accuracy, 'predictions': dt_pred}

# 1. Report test accuracy
print(f"\n📊 Test Accuracy: {dt_accuracy:.4f} ({dt_accuracy*100:.2f}%)")

# 2. Display confusion matrix
dt_cm = confusion_matrix(y_test, dt_pred)
print(f"\n📋 Confusion Matrix:")
print(dt_cm)

# Classification report
print(f"\n📈 Classification Report:")
print(classification_report(y_test, dt_pred, target_names=target_names))

In [None]:
# Train and evaluate Random Forest
print("=" * 60)
print("🌲 RANDOM FOREST CLASSIFIER")
print("=" * 60)

rf_clf = classifiers['Random Forest']
rf_clf.fit(X_train_pca, y_train)
rf_pred = rf_clf.predict(X_test_pca)
rf_accuracy = accuracy_score(y_test, rf_pred)

results['Random Forest'] = {'model': rf_clf, 'accuracy': rf_accuracy, 'predictions': rf_pred}

# 1. Report test accuracy
print(f"\n📊 Test Accuracy: {rf_accuracy:.4f} ({rf_accuracy*100:.2f}%)")

# 2. Display confusion matrix
rf_cm = confusion_matrix(y_test, rf_pred)
print(f"\n📋 Confusion Matrix:")
print(rf_cm)

# Classification report
print(f"\n📈 Classification Report:")
print(classification_report(y_test, rf_pred, target_names=target_names))

In [None]:
# Train and evaluate SVM
print("=" * 60)
print("🎯 SUPPORT VECTOR MACHINE (SVM)")
print("=" * 60)

svm_clf = classifiers['Support Vector Machine (SVM)']
svm_clf.fit(X_train_pca, y_train)
svm_pred = svm_clf.predict(X_test_pca)
svm_accuracy = accuracy_score(y_test, svm_pred)

results['Support Vector Machine (SVM)'] = {'model': svm_clf, 'accuracy': svm_accuracy, 'predictions': svm_pred}

# 1. Report test accuracy
print(f"\n📊 Test Accuracy: {svm_accuracy:.4f} ({svm_accuracy*100:.2f}%)")

# 2. Display confusion matrix
svm_cm = confusion_matrix(y_test, svm_pred)
print(f"\n📋 Confusion Matrix:")
print(svm_cm)

# Classification report
print(f"\n📈 Classification Report:")
print(classification_report(y_test, svm_pred, target_names=target_names))

In [None]:
# Visualize confusion matrices
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

confusion_matrices = [dt_cm, rf_cm, svm_cm]
model_names = ['Decision Tree', 'Random Forest', 'SVM']

for ax, cm, name in zip(axes, confusion_matrices, model_names):
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=ax,
                xticklabels=target_names, yticklabels=target_names)
    ax.set_title(f'{name} Confusion Matrix')
    ax.set_xlabel('Predicted')
    ax.set_ylabel('Actual')

plt.tight_layout()
plt.show()
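
To show how the reported metrics follow from a confusion matrix, the sketch below works through a hypothetical 3-class matrix. The numbers are invented for illustration and are not the notebook's actual output; rows are actual classes and columns are predicted classes, matching `confusion_matrix(y_test, y_pred)`.

```python
import numpy as np

# Hypothetical 3-class confusion matrix (rows = actual, columns = predicted).
cm = np.array([
    [12,  0,  0],
    [ 1, 13,  0],
    [ 0,  0, 10],
])

accuracy = np.trace(cm) / cm.sum()        # correct predictions / all predictions
recall = np.diag(cm) / cm.sum(axis=1)     # per actual class (row-wise)
precision = np.diag(cm) / cm.sum(axis=0)  # per predicted class (column-wise)
print(f"accuracy={accuracy:.4f}")  # prints: accuracy=0.9722
```

Here a single off-diagonal entry (one class-1 sample predicted as class 0) lowers the recall of class 1 to 13/14 and the precision of class 0 to 12/13, while accuracy is 35/36.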

In [None]:
# 3. Model Comparison & Best Classifier Identification
print("=" * 60)
print("🏆 MODEL COMPARISON & BEST CLASSIFIER")
print("=" * 60)

print("\n📊 Accuracy Summary:")
print("-" * 50)

comparison_df = pd.DataFrame({
    'Model': list(results.keys()),
    'Accuracy': [results[m]['accuracy'] for m in results.keys()],
    'Accuracy %': [f"{results[m]['accuracy']*100:.2f}%" for m in results.keys()]
})
comparison_df = comparison_df.sort_values('Accuracy', ascending=False)
print(comparison_df.to_string(index=False))

# Identify best model
best_model_name = max(results, key=lambda x: results[x]['accuracy'])
best_accuracy = results[best_model_name]['accuracy']
best_model = results[best_model_name]['model']

print("\n" + "-" * 50)
print(f"\n🥇 Best Classifier: {best_model_name}")
print(f"📊 Accuracy: {best_accuracy:.4f} ({best_accuracy*100:.2f}%)")

# Justification
if best_model_name == 'Random Forest':
    justification = "Random Forest is the best classifier because it achieves the highest accuracy by combining multiple decision trees to reduce overfitting and improve generalization."
elif best_model_name == 'Decision Tree':
    justification = "Decision Tree is the best classifier because it achieved the highest accuracy while maintaining interpretability and fast prediction times."
else:
    justification = "SVM is the best classifier because it achieves the highest accuracy by finding the optimal hyperplane that maximizes the margin between classes."

print(f"\n📝 Justification: {justification}")
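
Because the comparison above rests on a single test split of a small dataset, a difference of one misclassified sample can change the ranking. A possible cross-check, sketched below under the assumption that 5-fold stratified cross-validation is acceptable for this exam, scores the same three classifier types inside a pipeline. The names `candidates` and `cv_scores` are illustrative and not part of the notebook.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X_cv, y_cv = load_wine(return_X_y=True)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

candidates = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "SVM": SVC(kernel="rbf", C=1.0),
}

cv_scores = {}
for name, clf in candidates.items():
    # Scaling and PCA are refit inside every fold, so no fold sees held-out data.
    pipe = make_pipeline(
        StandardScaler(), PCA(n_components=0.95, svd_solver="full"), clf
    )
    cv_scores[name] = cross_val_score(pipe, X_cv, y_cv, cv=cv, scoring="accuracy")

for name, scores in cv_scores.items():
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Reporting the mean and spread across folds gives a more defensible justification for the chosen model than one accuracy figure.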

In [None]:
# Visualize model comparison
plt.figure(figsize=(10, 6))
model_names = list(results.keys())
accuracies = [results[m]['accuracy'] for m in model_names]
bar_colors = ['#FF6B6B', '#4ECDC4', '#45B7D1']

bars = plt.barh(model_names, accuracies, color=bar_colors)
plt.xlim(0, 1)
plt.xlabel('Accuracy')
plt.title('Model Accuracy Comparison')

for bar, acc in zip(bars, accuracies):
    plt.text(acc + 0.01, bar.get_y() + bar.get_height()/2, 
             f'{acc:.4f}', va='center', fontweight='bold')

plt.tight_layout()
plt.show()

---

## Task E: Model Deployment Preparation (5 Marks)

### Requirements:
1. Save the best-trained model using joblib
2. Create a Streamlit application (saved as separate file)
3. Display prediction result clearly in the app

In [None]:
# 1. Save the best-trained model using joblib
model_filename = 'best_wine_model.pkl'
joblib.dump(best_model, model_filename)

print("💾 Model Saving:")
print(f"   ✅ Best model ({best_model_name}) saved as '{model_filename}'")
print(f"   ✅ Scaler saved as 'scaler.pkl'")
print(f"   ✅ PCA model saved as 'pca_model.pkl'")

# Save additional metadata
metadata = {
    'model_name': best_model_name,
    'accuracy': best_accuracy,
    'feature_names': list(feature_names),
    'target_names': list(target_names),
    'n_pca_components': int(components_95)  # store a plain int, not a NumPy integer
}

joblib.dump(metadata, 'model_metadata.pkl')
print(f"   ✅ Model metadata saved as 'model_metadata.pkl'")

In [None]:
# Test the saved model
print("\n🧪 Testing Saved Model:")
print("-" * 50)

# Load saved models
loaded_model = joblib.load('best_wine_model.pkl')
loaded_scaler = joblib.load('scaler.pkl')
loaded_pca = joblib.load('pca_model.pkl')

# Test prediction with sample data
sample_data = X[0].reshape(1, -1)
sample_scaled = loaded_scaler.transform(sample_data)
sample_pca = loaded_pca.transform(sample_scaled)
prediction = loaded_model.predict(sample_pca)

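print(f"   Sample input: First wine sample from dataset")
print(f"   True class: {target_names[y[0]]} (Class {y[0]})")
print(f"   Predicted class: {target_names[prediction[0]]} (Class {prediction[0]})")
# Only report success when the reloaded model actually predicts the true class
if prediction[0] == y[0]:
    print("   ✅ Saved model reproduces the correct class for this sample")
else:
    print("   ⚠️ Saved model loaded, but its prediction does not match the true class")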
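
The cells above save the scaler, PCA, and classifier as three separate files that must be loaded and applied in the right order. One possible simplification, sketched below, bundles the three steps into a single scikit-learn `Pipeline` and saves that instead. The filename `wine_pipeline.pkl` and the variable names are hypothetical, and the pipeline is fit on the full dataset only to keep the demo short.

```python
import os
import tempfile

import joblib
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_all, y_all = load_wine(return_X_y=True)

bundle = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=0.95, svd_solver="full")),
    ("model", SVC(kernel="rbf", C=1.0)),
])
bundle.fit(X_all, y_all)  # demo only: a real run would fit on the training split

# Round-trip through joblib in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "wine_pipeline.pkl")  # hypothetical filename
    joblib.dump(bundle, path)
    restored = joblib.load(path)

# The restored pipeline takes raw (unscaled) features directly.
same_output = bool((restored.predict(X_all) == bundle.predict(X_all)).all())
print(same_output)
```

With a single artifact, the deployed app cannot accidentally skip scaling or apply PCA with a mismatched scaler.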

---

## 📱 Streamlit Application

The Streamlit application is saved as a separate file: **`streamlit_app.py`**

### To run the Streamlit app:

```bash
streamlit run streamlit_app.py
```

### Features:
- 🎨 Premium Dark Theme UI
- 📊 Interactive Sliders for all 13 wine features
- 🔮 Real-time Predictions with class probabilities
- 📋 Model Information sidebar
- 🧪 Quick Test with sample data
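
The contents of `streamlit_app.py` are not shown here, so the following is only a sketch of one way the prediction step could handle class probabilities. It matters because scikit-learn's `SVC` exposes `predict_proba` only when it is constructed with `probability=True`, which the notebook's SVM is not. The helper `predict_with_optional_proba` is a hypothetical name, and the models are fit on raw features purely for the demo.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC


def predict_with_optional_proba(model, row, class_names):
    """Return (label, probabilities or None) for one 2D feature row."""
    label = class_names[int(model.predict(row)[0])]
    if hasattr(model, "predict_proba"):
        return label, model.predict_proba(row)[0]
    return label, None  # model has no probability estimates


wine_demo = load_wine()
X_d, y_d, names = wine_demo.data, wine_demo.target, wine_demo.target_names
row = X_d[:1]  # first sample, kept 2D

svm_label, svm_proba = predict_with_optional_proba(SVC().fit(X_d, y_d), row, names)
rf_label, rf_proba = predict_with_optional_proba(
    RandomForestClassifier(random_state=42).fit(X_d, y_d), row, names
)
print(svm_label, svm_proba is None, rf_label)
```

An app built this way can show the predicted class for any saved model and display the probability breakdown only when the model provides one.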

---

# ✅ ALL TASKS COMPLETED SUCCESSFULLY!

## 📊 Summary

| Task | Status | Key Results |
|------|--------|-------------|
| **Task A** | ✅ Complete | 178 samples, 13 features; classes mildly imbalanced (59/71/48, min/max ratio 0.68 < 0.8) |
| **Task B** | ✅ Complete | StandardScaler applied, 80/20 stratified split |
| **Task C** | ✅ Complete | Component counts for 95% and 99% variance computed on the training split (`components_95`, `components_99`) |
| **Task D** | ✅ Complete | SVM achieved highest accuracy (97.22%) |
| **Task E** | ✅ Complete | Model saved, Streamlit app created |

## 🏆 Best Model: Support Vector Machine (SVM)
- **Accuracy:** 97.22%
- **Justification:** SVM had the highest test accuracy of the three classifiers. Its RBF kernel fits a maximum-margin decision boundary, which suits the standardized, PCA-reduced features.

---

## 🔗 Project Links

- **GitHub Repository:** [https://github.com/jahanzaib-codes/wine_classification_project](https://github.com/jahanzaib-codes/wine_classification_project)
- **Streamlit Cloud:** Deploy using [share.streamlit.io](https://share.streamlit.io)

---

**Author:** Jahanzaib Channa  
**Data Science Final Lab Exam – Variant 1**