# Supervised Learning Demo: Step-by-Step Guide

This notebook demonstrates the key concepts of supervised learning through an interactive classification example.

## What is Supervised Learning?

Supervised learning is a machine learning approach where we:
1. **Train** a model using labeled data (input-output pairs)
2. **Learn** patterns from this training data
3. **Predict** outputs for new, unseen inputs
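
To make these three steps concrete before we bring in scikit-learn, here is a minimal from-scratch sketch using a nearest-centroid rule. The data, the labels, and the helper names (`fit_centroids`, `predict_label`) are made up for illustration; this is not the model used later in the notebook.

```python
from collections import defaultdict
from math import dist


def fit_centroids(features, labels):
    """Train/learn: average the feature vectors of each class into one centroid."""
    grouped = defaultdict(list)
    for row, label in zip(features, labels):
        grouped[label].append(row)
    return {
        label: [sum(column) / len(rows) for column in zip(*rows)]
        for label, rows in grouped.items()
    }


def predict_label(centroids, row):
    """Predict: return the class whose centroid is closest to the new input."""
    return min(centroids, key=lambda label: dist(centroids[label], row))


# Hypothetical labeled data: [petal length, petal width] in cm.
train_features = [[1.4, 0.2], [1.5, 0.3], [4.5, 1.5], [4.7, 1.4]]
train_labels = ["small", "small", "large", "large"]

centroids = fit_centroids(train_features, train_labels)
prediction = predict_label(centroids, [1.6, 0.25])  # a new, unseen flower
print(prediction)
```

The "pattern" learned here is just one average point per class. Real models learn richer patterns, but the train-then-predict shape is the same.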

Today we'll classify flowers based on their measurements!

In [None]:
# Import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.preprocessing import StandardScaler

# Set style for better plots
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

print("✅ Libraries imported successfully!")

## Step 1: Understanding Our Data

We'll use the classic Iris dataset: 150 flowers from 3 species (50 of each), with 4 measurements per flower (sepal length, sepal width, petal length, and petal width, all in cm).

In [None]:
# Load the Iris dataset
iris = load_iris()
X = iris.data  # Features (measurements)
y = iris.target  # Labels (species)

# Create a DataFrame for easier visualization
df = pd.DataFrame(X, columns=iris.feature_names)
df['species'] = [iris.target_names[i] for i in y]

print("📊 Dataset Overview:")
print(f"Number of samples: {len(df)}")
print(f"Number of features: {len(iris.feature_names)}")
print(f"Species: {list(iris.target_names)}")
print("\n🔍 First 5 rows:")
df.head()

In [None]:
# Visualize the data distribution
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
fig.suptitle('🌸 Iris Dataset: Feature Distributions by Species', fontsize=16, fontweight='bold')

features = iris.feature_names
for i, feature in enumerate(features):
    row, col = i // 2, i % 2
    for species in iris.target_names:
        data = df[df['species'] == species][feature]
        axes[row, col].hist(data, alpha=0.7, label=species, bins=15)
    
    # Feature names look like 'sepal length (cm)'; drop the unit before title-casing
    axes[row, col].set_title(feature.replace(' (cm)', '').title())
    axes[row, col].set_xlabel('Measurement (cm)')
    axes[row, col].set_ylabel('Frequency')
    axes[row, col].legend()

plt.tight_layout()
plt.show()

print("💡 Key Observation: Different species have different measurement patterns!")
print("This is what makes supervised learning possible - patterns in the data.")

## Step 2: The Supervised Learning Process

### 🎯 The Goal
Given flower measurements → Predict the species

### 📚 Training vs Testing
We split our data into two parts:
- **Training set**: Used to teach the model
- **Test set**: Used to evaluate how well the model learned
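
A stratified split keeps the class proportions the same in both parts. The sketch below shows one simple way to do that in plain Python. The function name `stratified_split` is hypothetical, and this is not necessarily how scikit-learn implements `train_test_split(..., stratify=y)` internally.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_fraction, seed=42):
    """Return (train_indices, test_indices) that keep class proportions equal."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train_indices, test_indices = [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # shuffle within each class
        n_test = round(len(indices) * test_fraction)
        test_indices.extend(indices[:n_test])
        train_indices.extend(indices[n_test:])
    return sorted(train_indices), sorted(test_indices)


# Same shape as Iris: 3 classes with 50 samples each.
labels = [0] * 50 + [1] * 50 + [2] * 50
train_idx, test_idx = stratified_split(labels, test_fraction=0.3)

print(len(train_idx), len(test_idx))
print(Counter(labels[i] for i in test_idx))
```

Because the split is done per class, every species contributes the same share to the test set.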

In [None]:
# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

print("📊 Data Split:")
print(f"Training samples: {len(X_train)} ({len(X_train)/len(X)*100:.1f}%)")
print(f"Testing samples: {len(X_test)} ({len(X_test)/len(X)*100:.1f}%)")

# Visualize the split
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

# Training set
train_counts = np.bincount(y_train)
ax1.bar(iris.target_names, train_counts, color=['skyblue', 'lightgreen', 'salmon'])
ax1.set_title('🎓 Training Set Distribution')
ax1.set_ylabel('Number of Samples')

# Test set
test_counts = np.bincount(y_test)
ax2.bar(iris.target_names, test_counts, color=['skyblue', 'lightgreen', 'salmon'])
ax2.set_title('🧪 Test Set Distribution')
ax2.set_ylabel('Number of Samples')

plt.tight_layout()
plt.show()

print("\n✅ Stratified split: each species keeps the same share in both sets!")

## Step 3: Training Our First Model

Let's start with a simple but powerful algorithm: **Logistic Regression**

### How it works:
1. Learns a weight for every feature and class, which gives each class a linear score
2. The scores define linear decision boundaries that separate the classes
3. Converts the scores into class probabilities and predicts the most probable class
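
The three steps above can be sketched in plain Python for the multinomial (softmax) form of logistic regression. The weights and biases below are invented for illustration; a trained model would learn them from the data.

```python
import math


def softmax(scores):
    """Turn real-valued scores into probabilities that sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_proba(weights, biases, x):
    """One linear score per class, then softmax."""
    scores = [
        sum(w * v for w, v in zip(row, x)) + b
        for row, b in zip(weights, biases)
    ]
    return softmax(scores)


# Hypothetical parameters for 3 classes and 2 scaled features.
weights = [[-2.0, -1.0], [0.5, 0.0], [1.5, 1.0]]
biases = [0.0, 0.5, -0.5]

probabilities = predict_proba(weights, biases, x=[1.0, 1.0])
predicted_class = probabilities.index(max(probabilities))
print(probabilities, predicted_class)
```

The decision boundary between two classes is where their scores are equal, which is a straight line (or plane) in feature space.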

In [None]:
# Scale the features, then create and train the model
print("🤖 Training Logistic Regression Model...")

# Scale the features for better performance
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Create and train the model
lr_model = LogisticRegression(random_state=42)
lr_model.fit(X_train_scaled, y_train)

print("✅ Model training completed!")
print("\n🧠 What the model learned:")
print("The model found mathematical patterns that connect flower measurements to species.")
print("It can now use these patterns to classify new flowers!")
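
The scaling step deserves a closer look: the mean and standard deviation are learned from the training data only and then reused for the test data. Here is a from-scratch sketch of that idea, assuming no column is constant. The helper names `fit_scaler` and `apply_scaler` and the numbers are hypothetical.

```python
from statistics import fmean, pstdev


def fit_scaler(rows):
    """Learn each column's mean and population standard deviation."""
    columns = list(zip(*rows))
    return [fmean(c) for c in columns], [pstdev(c) for c in columns]


def apply_scaler(rows, means, stds):
    """Standardize rows with previously learned statistics."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


# Hypothetical measurements in cm.
train_rows = [[5.0, 1.0], [6.0, 3.0], [7.0, 5.0]]
test_rows = [[6.0, 7.0]]

means, stds = fit_scaler(train_rows)                 # statistics from training data only
train_scaled = apply_scaler(train_rows, means, stds)
test_scaled = apply_scaler(test_rows, means, stds)   # reuse them for the test data
print(train_scaled, test_scaled)
```

Fitting the scaler on the test set as well would leak information about the test data into training.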

## Step 4: Making Predictions

Now let's see how our trained model performs on the test set, which it never saw during training!

In [None]:
# Make predictions
y_pred = lr_model.predict(X_test_scaled)
y_pred_proba = lr_model.predict_proba(X_test_scaled)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)

print(f"🎯 Model Accuracy: {accuracy:.2%}")
print(f"The model correctly classified {(y_pred == y_test).sum()} of {len(y_test)} test flowers!")

# Show some example predictions
print("\n🔍 Example Predictions:")
for i in range(5):
    actual = iris.target_names[y_test[i]]
    predicted = iris.target_names[y_pred[i]]
    confidence = np.max(y_pred_proba[i]) * 100
    
    status = "✅" if actual == predicted else "❌"
    print(f"{status} Actual: {actual:12} | Predicted: {predicted:12} | Confidence: {confidence:.1f}%")

In [None]:
# Visualize predictions vs actual
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=iris.target_names, 
            yticklabels=iris.target_names, ax=ax1)
ax1.set_title('🎯 Confusion Matrix')
ax1.set_xlabel('Predicted')
ax1.set_ylabel('Actual')

# Prediction Confidence Distribution
max_probas = np.max(y_pred_proba, axis=1)
ax2.hist(max_probas, bins=20, alpha=0.7, color='green', edgecolor='black')
ax2.set_title('📊 Prediction Confidence Distribution')
ax2.set_xlabel('Confidence Level')
ax2.set_ylabel('Number of Predictions')
ax2.axvline(np.mean(max_probas), color='red', linestyle='--', 
           label=f'Average: {np.mean(max_probas):.2f}')
ax2.legend()

plt.tight_layout()
plt.show()

print("\n💡 Insights:")
print("• Diagonal values in confusion matrix = correct predictions")
print("• Confidence = highest predicted class probability (how sure the model is, not whether it is right)")
print(f"• Average confidence: {np.mean(max_probas):.1%}")
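
To see where the numbers in a confusion matrix come from, here is a small from-scratch sketch. The labels are made up, and `confusion_counts` is a hypothetical helper, not the scikit-learn function.

```python
def confusion_counts(actual, predicted, n_classes):
    """Rows are actual classes, columns are predicted classes."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for a, p in zip(actual, predicted):
        matrix[a][p] += 1
    return matrix


# Hypothetical class indices for 8 test flowers.
actual = [0, 0, 1, 1, 1, 2, 2, 2]
predicted = [0, 0, 1, 2, 1, 2, 2, 1]

matrix = confusion_counts(actual, predicted, n_classes=3)
correct = sum(matrix[i][i] for i in range(3))  # the diagonal holds correct predictions
accuracy_from_matrix = correct / len(actual)

for row in matrix:
    print(row)
print(accuracy_from_matrix)
```

Accuracy is simply the diagonal sum divided by the total, while the off-diagonal cells show which classes get confused with each other.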

## Step 5: Comparing Different Algorithms

Let's try a different algorithm: **Random Forest**
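- Combines many decision trees, each trained on a random sample of the training data
- Averaging over trees makes it less prone to overfitting than a single tree
- Can capture non-linear patterns and does not need feature scaling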
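
The sketch below shows only the combination step: several trees each output class probabilities, and the forest averages them. The three "trees" are hand-written rules with invented thresholds; a real forest would learn its trees from bootstrap samples of the training data.

```python
def tree_a(row):
    """Hypothetical tree that looks at sepal length (index 0)."""
    if row[0] < 5.5:
        return [0.8, 0.2, 0.0]
    if row[0] < 6.3:
        return [0.1, 0.6, 0.3]
    return [0.0, 0.3, 0.7]


def tree_b(row):
    """Hypothetical tree that looks at petal width (index 3)."""
    if row[3] < 0.8:
        return [1.0, 0.0, 0.0]
    if row[3] < 1.7:
        return [0.0, 0.9, 0.1]
    return [0.0, 0.1, 0.9]


def tree_c(row):
    """Hypothetical tree that looks at petal length (index 2)."""
    if row[2] < 2.5:
        return [1.0, 0.0, 0.0]
    if row[2] < 4.8:
        return [0.0, 0.95, 0.05]
    return [0.0, 0.2, 0.8]


def forest_proba(trees, row):
    """Average the class probabilities predicted by every tree."""
    votes = [tree(row) for tree in trees]
    return [sum(class_probs) / len(trees) for class_probs in zip(*votes)]


flower = [6.0, 3.0, 4.0, 1.2]  # sepal length, sepal width, petal length, petal width
averaged = forest_proba([tree_a, tree_b, tree_c], flower)
forest_class = averaged.index(max(averaged))
print(averaged, forest_class)
```

No single tree has to be right every time; the average tends to smooth out individual mistakes.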

In [None]:
# Train Random Forest model
print("🌳 Training Random Forest Model...")

rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)  # Random Forest doesn't need scaling

# Make predictions
rf_pred = rf_model.predict(X_test)
rf_accuracy = accuracy_score(y_test, rf_pred)

print(f"✅ Random Forest Accuracy: {rf_accuracy:.2%}")

# Compare models
print("\n🏆 Model Comparison:")
print(f"Logistic Regression: {accuracy:.2%}")
print(f"Random Forest:       {rf_accuracy:.2%}")

if rf_accuracy > accuracy:
    print("🎉 Random Forest performs better!")
elif accuracy > rf_accuracy:
    print("🎉 Logistic Regression performs better!")
else:
    print("🤝 Both models perform equally well!")

In [None]:
# Feature importance from Random Forest
feature_importance = rf_model.feature_importances_

plt.figure(figsize=(10, 6))
bars = plt.bar(iris.feature_names, feature_importance, 
               color=['skyblue', 'lightgreen', 'salmon', 'gold'])
plt.title('🌟 Feature Importance (Random Forest)', fontsize=14, fontweight='bold')
plt.xlabel('Features')
plt.ylabel('Importance Score')
plt.xticks(rotation=45)

# Add value labels on bars
for bar, importance in zip(bars, feature_importance):
    plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.01,
             f'{importance:.3f}', ha='center', va='bottom')

plt.tight_layout()
plt.show()

print("\n🔍 Feature Importance Insights:")
most_important = iris.feature_names[np.argmax(feature_importance)]
print(f"• Most important feature: {most_important}")
print("• This tells us which measurements are most useful for classification")
print("• Higher importance = more discriminative power")
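
Random Forest importances are impurity-based: a feature scores highly when splitting on it makes the resulting groups much purer. Here is a minimal sketch of that building block using Gini impurity. The labels and helper names are hypothetical, and the real score also aggregates over all splits and trees.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 0 for a pure group, higher for mixed groups."""
    total = len(labels)
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())


def impurity_decrease(parent, left, right):
    """How much a split lowers the size-weighted Gini impurity."""
    n = len(parent)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(parent) - weighted


parent = ["setosa"] * 4 + ["versicolor"] * 4

# A useful feature separates the species perfectly.
good_split = impurity_decrease(parent, parent[:4], parent[4:])
# A useless feature leaves both sides as mixed as before.
poor_split = impurity_decrease(parent, parent[::2], parent[1::2])

print(good_split, poor_split)
```

Features that repeatedly produce large decreases like `good_split` end up with high importance scores.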

## Step 6: Interactive Prediction

Let's create a function to predict species for any flower measurements!

In [None]:
def predict_flower_species(sepal_length, sepal_width, petal_length, petal_width):
    """
    Predict the species from four measurements (in cm) with both
    trained models and print the results.
    """
    # Prepare the input
    measurements = np.array([[sepal_length, sepal_width, petal_length, petal_width]])
    
    # Scale for logistic regression
    measurements_scaled = scaler.transform(measurements)
    
    # Get predictions from both models
    lr_pred = lr_model.predict(measurements_scaled)[0]
    lr_proba = lr_model.predict_proba(measurements_scaled)[0]
    
    rf_pred = rf_model.predict(measurements)[0]
    rf_proba = rf_model.predict_proba(measurements)[0]
    
    print("🌸 Flower Measurements:")
    print(f"   Sepal Length: {sepal_length} cm")
    print(f"   Sepal Width:  {sepal_width} cm")
    print(f"   Petal Length: {petal_length} cm")
    print(f"   Petal Width:  {petal_width} cm")
    print("\n🤖 Model Predictions:")
    print(f"   Logistic Regression: {iris.target_names[lr_pred]} ({np.max(lr_proba):.1%} confidence)")
    print(f"   Random Forest:       {iris.target_names[rf_pred]} ({np.max(rf_proba):.1%} confidence)")
    
    # Show probability breakdown
    print("\n📊 Probability Breakdown (Random Forest):")
    for i, species in enumerate(iris.target_names):
        print(f"   {species:12}: {rf_proba[i]:.1%}")

# Example predictions
print("🧪 Example Predictions:\n")

print("Example 1: Small flower")
predict_flower_species(4.5, 2.5, 1.2, 0.3)

print("\n" + "="*50 + "\n")

print("Example 2: Medium flower")
predict_flower_species(6.0, 3.0, 4.0, 1.2)

print("\n" + "="*50 + "\n")

print("Example 3: Large flower")
predict_flower_species(7.5, 3.2, 6.0, 2.0)

## Step 7: Key Takeaways

### 🎓 What We Learned About Supervised Learning:

1. **Data is Key**: Quality labeled data is essential
2. **Train-Test Split**: Always evaluate on unseen data
3. **Multiple Algorithms**: Different algorithms have different strengths
4. **Feature Importance**: Some measurements matter more than others
5. **Confidence Matters**: Models provide probability estimates

### 🔄 The Supervised Learning Workflow:
```
Data Collection → Data Preparation → Model Training → Evaluation → Prediction
```

### 🚀 Next Steps:
- Try with your own dataset
- Experiment with different algorithms
- Learn about feature engineering
- Explore cross-validation techniques
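
As a starting point for the cross-validation idea, here is a sketch of how k-fold index splitting works: every sample is used for validation exactly once. The function `k_fold_indices` is a hypothetical helper written for this illustration, not a scikit-learn API.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


folds = list(k_fold_indices(n_samples=10, k=3))
for train, validation in folds:
    print(f"train={train} validation={validation}")
```

This sketch does not shuffle. For data sorted by class, like Iris, you would shuffle or stratify first so that each fold contains all species.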

In [None]:
# Final visualization: Decision boundary (2D projection)
from sklearn.decomposition import PCA

# Reduce to 2D for visualization (fit PCA on the training data only to avoid leakage)
pca = PCA(n_components=2)
X_train_2d = pca.fit_transform(X_train)
X_test_2d = pca.transform(X_test)
X_2d = pca.transform(X)  # all samples, used for plotting

# Train a simple model on 2D data
simple_model = LogisticRegression()
simple_model.fit(X_train_2d, y_train)

# Create decision boundary
h = 0.02
x_min, x_max = X_2d[:, 0].min() - 1, X_2d[:, 0].max() + 1
y_min, y_max = X_2d[:, 1].min() - 1, X_2d[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))

Z = simple_model.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

# Plot
plt.figure(figsize=(12, 8))
plt.contourf(xx, yy, Z, alpha=0.3, cmap='viridis')

# Plot data points
colors = ['red', 'green', 'blue']
for i, species in enumerate(iris.target_names):
    mask = y == i
    plt.scatter(X_2d[mask, 0], X_2d[mask, 1], 
               c=colors[i], label=species, s=50, alpha=0.7)

plt.title('🎯 Decision Boundaries in 2D Space', fontsize=16, fontweight='bold')
plt.xlabel(f'First Principal Component ({pca.explained_variance_ratio_[0]:.1%} variance)')
plt.ylabel(f'Second Principal Component ({pca.explained_variance_ratio_[1]:.1%} variance)')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

print("🎨 This visualization shows the decision boundaries of a model trained")
print("on a 2D PCA projection of the four features!")
print(f"\n📈 Total variance explained by 2D projection: {sum(pca.explained_variance_ratio_):.1%}")

## 🎉 Congratulations!

You've completed the supervised learning demo!

You now understand:
- How supervised learning works
- The importance of train-test splits
- How to evaluate model performance
- How different algorithms compare
- How to make predictions on new data

**Try modifying the code above to:**
- Use different train-test split ratios
- Try other algorithms (SVM, Decision Trees, etc.)
- Experiment with feature selection
- Test with your own measurements!