# 📘 Day 1: Classification Algorithms

**🎯 Goal:** Master the fundamental classification algorithms used in modern AI systems

**⏱️ Time:** 60-90 minutes

**🌟 Why This Matters for AI:**
- Classification powers spam detection, sentiment analysis, and content moderation
- Used in RAG systems to classify document relevance
- Foundation for Agentic AI decision-making (which action to take?)
- Pre-processing step before Transformer models
- Essential for building production AI systems in 2024-2025

---

## 🤔 What is Classification?

**Classification = Assigning items to categories**

Real-world examples:
- Email: Spam or Not Spam?
- Review: Positive, Negative, or Neutral?
- Image: Cat, Dog, or Bird?
- Customer: Will Buy or Won't Buy?

Think of it as a **sorting machine**:
- Input: Data (email text, customer info, image pixels)
- Output: Category label (spam/not spam, buy/won't buy)

Let's build classification models! 👇

In [None]:
# Import the tools we need
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.preprocessing import StandardScaler

# Classifiers we'll use
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Make plots look nice
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

print("✅ Libraries imported successfully!")
print("Let's build some AI classifiers! 🚀")

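## 📊 Our Dataset: Email Spam Detection

We'll build a **spam classifier**, the kind of AI application that email providers such as Gmail and Outlook rely on. To keep the notebook self-contained, the data is synthetic: we generate it ourselves instead of loading real emails.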

**Features:**
- `word_freq_money`: How often "money" appears
- `word_freq_free`: How often "free" appears
- `word_freq_winner`: How often "winner" appears
- `capital_run_length`: Longest sequence of CAPS
- `exclamation_marks`: Number of !!!

**Target:**
- `is_spam`: 1 = Spam, 0 = Not Spam

In [None]:
# Create a synthetic spam detection dataset with realistic-looking feature distributions
np.random.seed(42)

n_samples = 1000

# Generate features
# Spam emails have more "money", "free", caps, etc.
data = {
    'word_freq_money': np.concatenate([
        np.random.exponential(2, 400),  # Spam
        np.random.exponential(0.3, 600)  # Not spam
    ]),
    'word_freq_free': np.concatenate([
        np.random.exponential(1.5, 400),
        np.random.exponential(0.2, 600)
    ]),
    'word_freq_winner': np.concatenate([
        np.random.exponential(1, 400),
        np.random.exponential(0.1, 600)
    ]),
    'capital_run_length': np.concatenate([
        np.random.poisson(8, 400),
        np.random.poisson(2, 600)
    ]),
    'exclamation_marks': np.concatenate([
        np.random.poisson(5, 400),
        np.random.poisson(1, 600)
    ]),
    'is_spam': [1] * 400 + [0] * 600  # 40% spam, 60% not spam
}

df = pd.DataFrame(data)

# Shuffle the data
df = df.sample(frac=1, random_state=42).reset_index(drop=True)

print("📧 Spam Detection Dataset Created!")
print(f"Total emails: {len(df)}")
print(f"Spam emails: {df['is_spam'].sum()}")
print(f"Not spam: {(df['is_spam'] == 0).sum()}")
print("\nFirst few emails:")
df.head()

In [None]:
# Visualize the difference between spam and not spam
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
fig.suptitle('📊 Spam vs Not Spam: Feature Comparison', fontsize=16, fontweight='bold')

features = ['word_freq_money', 'word_freq_free', 'word_freq_winner', 
            'capital_run_length', 'exclamation_marks']

for idx, feature in enumerate(features):
    ax = axes[idx // 3, idx % 3]
    
    df[df['is_spam'] == 1][feature].hist(ax=ax, alpha=0.6, label='Spam', bins=20, color='red')
    df[df['is_spam'] == 0][feature].hist(ax=ax, alpha=0.6, label='Not Spam', bins=20, color='green')
    
    ax.set_title(feature.replace('_', ' ').title())
    ax.set_xlabel('Value')
    ax.set_ylabel('Frequency')
    ax.legend()

# Remove empty subplot
fig.delaxes(axes[1, 2])

plt.tight_layout()
plt.show()

print("📈 Notice how spam emails have higher values for suspicious words!")

## 🔧 Prepare Data for Training

Before training, we need to:
1. **Split data**: Training set (80%) and Test set (20%)
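2. **Scale features**: Standardize each feature to zero mean and unit variance (fitting the scaler on the training set only), which helps scale-sensitive models like Logistic Regression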
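
To make the scaling step concrete, here is a minimal sketch of what `StandardScaler` computes. The tiny arrays and the `toy_` variable names are made up for illustration and are not part of the spam dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up toy data: two features on very different scales
X_toy_train = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]])
X_toy_test = np.array([[2.0, 200.0]])

toy_scaler = StandardScaler()
X_toy_train_scaled = toy_scaler.fit_transform(X_toy_train)  # learns mean/std from training data
X_toy_test_scaled = toy_scaler.transform(X_toy_test)        # reuses the training mean/std

# Manual equivalent: z = (x - mean) / std, using training statistics
manual_scaled = (X_toy_train - X_toy_train.mean(axis=0)) / X_toy_train.std(axis=0)

print("Matches manual formula:", np.allclose(X_toy_train_scaled, manual_scaled))
print("Scaled test row:", X_toy_test_scaled)
```

The test row is transformed with the training statistics, so no information from the test set leaks into the scaler.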

In [None]:
# Separate features (X) and target (y)
X = df.drop('is_spam', axis=1)
y = df['is_spam']

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Scale the features (important for many algorithms)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("✅ Data prepared!")
print(f"Training samples: {len(X_train)}")
print(f"Test samples: {len(X_test)}")
print(f"\nFeatures: {list(X.columns)}")

## 1️⃣ Logistic Regression

**What it does:** Calculates the probability that something belongs to a category

**How it works:**
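- Computes a weighted sum of the features and passes it through a sigmoid function, which gives a straight-line (linear) boundary between the categories
- Outputs probability: 0.0 (definitely not spam) to 1.0 (definitely spam)
- If probability > 0.5 → Spam, else → Not Spam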
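
The steps above can be sketched by hand. This example assumes a made-up one-feature dataset (`X_demo`, `y_demo` are illustrative names) and checks that "weighted sum, then sigmoid" reproduces what scikit-learn's `predict_proba` returns for a binary problem.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical data: class 1 tends to have larger feature values than class 0
X_demo = np.concatenate([rng.normal(2.0, 1.0, 100),
                         rng.normal(-2.0, 1.0, 100)]).reshape(-1, 1)
y_demo = np.array([1] * 100 + [0] * 100)

demo_model = LogisticRegression().fit(X_demo, y_demo)

# Step 1: linear score = weights * features + intercept
scores = X_demo @ demo_model.coef_.ravel() + demo_model.intercept_[0]

# Step 2: the sigmoid squashes the score into a probability between 0 and 1
manual_proba = 1.0 / (1.0 + np.exp(-scores))

# Step 3: threshold at 0.5 to get the label
manual_labels = (manual_proba > 0.5).astype(int)

sklearn_proba = demo_model.predict_proba(X_demo)[:, 1]
print("Probabilities match:", np.allclose(manual_proba, sklearn_proba))
print("Labels match:", (manual_labels == demo_model.predict(X_demo)).all())
```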

**Best for:**
- Fast predictions
- When you need probability scores
- Binary classification (2 categories)

**🎯 Real AI Use Cases:**
- **Sentiment analysis** in social media monitoring (2024 trend)
- **Document classification** in RAG systems
- **Action classification** for Agentic AI (which action to take?)

In [None]:
# Create and train Logistic Regression model
log_reg = LogisticRegression(random_state=42, max_iter=1000)
log_reg.fit(X_train_scaled, y_train)

# Make predictions
y_pred_log = log_reg.predict(X_test_scaled)
y_pred_proba_log = log_reg.predict_proba(X_test_scaled)[:, 1]

# Evaluate
accuracy_log = accuracy_score(y_test, y_pred_log)

print("🎯 Logistic Regression Results:")
print(f"Accuracy: {accuracy_log:.2%}")
print("\n📊 Classification Report:")
print(classification_report(y_test, y_pred_log, target_names=['Not Spam', 'Spam']))

# Show some predictions with probabilities
print("\n🔍 Sample Predictions (first 5 test emails):")
for i in range(5):
    actual = "Spam" if y_test.iloc[i] == 1 else "Not Spam"
    predicted = "Spam" if y_pred_log[i] == 1 else "Not Spam"
    confidence = y_pred_proba_log[i] if y_pred_log[i] == 1 else 1 - y_pred_proba_log[i]
    print(f"Email {i+1}: Actual={actual}, Predicted={predicted}, Confidence={confidence:.1%}")

## 2️⃣ Decision Trees

**What it does:** Makes decisions by asking yes/no questions

**How it works:**
- Like a flowchart: "Does it have 'money' > 2 times?"
  - Yes → "Does it have CAPS?"
    - Yes → SPAM
    - No → Check more...
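
You can see this flowchart for real with scikit-learn's `export_text`. The sketch below uses a hypothetical two-feature dataset (`money`, `caps`) whose labels follow a simple made-up rule, so the learned questions are easy to read.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)

# Hypothetical data: an email is "spam" when 'money' appears more than
# 2 times AND the longest run of capitals is longer than 5
money = rng.integers(0, 6, 500)
caps = rng.integers(0, 11, 500)
X_rules = np.column_stack([money, caps])
y_rules = ((money > 2) & (caps > 5)).astype(int)

rule_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
rule_tree.fit(X_rules, y_rules)

# Print the learned yes/no questions as an indented text flowchart
rules_text = export_text(rule_tree, feature_names=["money", "caps"])
print(rules_text)
```

Each indented line is one question on a feature threshold, and each `class:` line is a final answer at a leaf.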

**Best for:**
- Easy to understand and visualize
- Handles non-linear patterns
- No need to scale data

**🎯 Real AI Use Cases:**
- **Customer segmentation** for personalized AI
- **Fraud detection** in financial AI systems
- **Rule extraction** from multimodal AI outputs

In [None]:
# Create and train Decision Tree
dt = DecisionTreeClassifier(random_state=42, max_depth=5)
dt.fit(X_train, y_train)  # No scaling needed!

# Make predictions
y_pred_dt = dt.predict(X_test)

# Evaluate
accuracy_dt = accuracy_score(y_test, y_pred_dt)

print("🌳 Decision Tree Results:")
print(f"Accuracy: {accuracy_dt:.2%}")
print("\n📊 Classification Report:")
print(classification_report(y_test, y_pred_dt, target_names=['Not Spam', 'Spam']))

# Feature importance
feature_importance = pd.DataFrame({
    'Feature': X.columns,
    'Importance': dt.feature_importances_
}).sort_values('Importance', ascending=False)

print("\n🔥 Most Important Features:")
print(feature_importance)

In [None]:
# Visualize feature importance
plt.figure(figsize=(10, 6))
plt.barh(feature_importance['Feature'], feature_importance['Importance'])
plt.xlabel('Importance Score', fontsize=12)
plt.ylabel('Feature', fontsize=12)
plt.title('🎯 Decision Tree: Feature Importance for Spam Detection', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

print("💡 Higher importance = More useful for detecting spam!")

## 3️⃣ Random Forests

**What it does:** Combines many decision trees for better accuracy

**How it works:**
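- Creates many decision trees (100 by default in scikit-learn), each trained on a random sample of the data (a "forest")
- Each tree makes its own prediction
- Final prediction = the combined vote (scikit-learn averages the trees' predicted probabilities and picks the most likely class)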

**Analogy:** Instead of asking 1 expert, ask 100 experts and use the majority opinion!
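
Here is a small sketch of the "ask many experts" idea. It assumes a made-up dataset (`X_vote`, `y_vote`) and a deliberately tiny forest of 5 trees, then combines the individual trees by hand and compares the result with the forest's own output.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Hypothetical data: the label depends on the sum of the first two features
X_vote = rng.normal(size=(200, 3))
y_vote = (X_vote[:, 0] + X_vote[:, 1] > 0).astype(int)

small_forest = RandomForestClassifier(n_estimators=5, max_depth=3, random_state=0)
small_forest.fit(X_vote, y_vote)

# Ask each tree ("expert") separately for its class probabilities
first_five = X_vote[:5]
per_tree_proba = np.stack([tree.predict_proba(first_five)
                           for tree in small_forest.estimators_])

# Combine the experts: average their probabilities, pick the most likely class
averaged_proba = per_tree_proba.mean(axis=0)
combined_labels = small_forest.classes_[averaged_proba.argmax(axis=1)]

print("Probabilities match:",
      np.allclose(averaged_proba, small_forest.predict_proba(first_five)))
print("Labels match:",
      (combined_labels == small_forest.predict(first_five)).all())
```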

**Best for:**
- High accuracy
- More robust to overfitting than a single decision tree
- Handles complex patterns

**🎯 Real AI Use Cases:**
- **Content moderation** on social media platforms
- **Query routing** in RAG systems (which documents to retrieve?)
- **Ensemble methods** combined with Transformers in 2024-2025
- **Feature extraction** for Agentic AI decision-making

In [None]:
# Create and train Random Forest
rf = RandomForestClassifier(n_estimators=100, random_state=42, max_depth=10)
rf.fit(X_train, y_train)

# Make predictions
y_pred_rf = rf.predict(X_test)
y_pred_proba_rf = rf.predict_proba(X_test)[:, 1]

# Evaluate
accuracy_rf = accuracy_score(y_test, y_pred_rf)

print("🌲 Random Forest Results:")
print(f"Accuracy: {accuracy_rf:.2%}")
print(f"Number of trees: {rf.n_estimators}")
print("\n📊 Classification Report:")
print(classification_report(y_test, y_pred_rf, target_names=['Not Spam', 'Spam']))

# Feature importance
rf_importance = pd.DataFrame({
    'Feature': X.columns,
    'Importance': rf.feature_importances_
}).sort_values('Importance', ascending=False)

print("\n🔥 Most Important Features:")
print(rf_importance)

## 📊 Compare All Models

Let's see which algorithm performs best on our spam detection task!

In [None]:
# Compare all models
results = pd.DataFrame({
    'Model': ['Logistic Regression', 'Decision Tree', 'Random Forest'],
    'Accuracy': [accuracy_log, accuracy_dt, accuracy_rf]
}).sort_values('Accuracy', ascending=False)

print("🏆 Model Comparison:")
print(results.to_string(index=False))
print(f"\n🥇 Best Model: {results.iloc[0]['Model']} with {results.iloc[0]['Accuracy']:.2%} accuracy")

# Visualize comparison
plt.figure(figsize=(10, 6))
bars = plt.bar(results['Model'], results['Accuracy'], color=['#3498db', '#e74c3c', '#2ecc71'])
plt.ylabel('Accuracy', fontsize=12)
plt.title('🎯 Classification Algorithm Comparison: Spam Detection', fontsize=14, fontweight='bold')
plt.ylim(0.7, 1.0)

# Add accuracy labels on bars
for bar in bars:
    height = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2., height,
             f'{height:.2%}',
             ha='center', va='bottom', fontsize=12, fontweight='bold')

plt.tight_layout()
plt.show()

## 🎯 Confusion Matrix: Understanding Errors

A confusion matrix shows:
- **True Positives (TP)**: Correctly identified spam
- **True Negatives (TN)**: Correctly identified not spam
- **False Positives (FP)**: Incorrectly flagged as spam (legitimate email marked spam!)
- **False Negatives (FN)**: Missed spam (spam reached inbox!)
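
These four counts are all you need to compute the headline metrics yourself. A minimal sketch on made-up labels (the `_toy` lists are illustrative, with 1 = spam and 0 = not spam):

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for 8 emails
y_true_toy = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred_toy = [1, 1, 0, 0, 0, 1, 0, 1]

# For binary labels [0, 1], ravel() returns the counts in this order
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()

precision = tp / (tp + fp)   # of the emails flagged as spam, how many really were?
recall = tp / (tp + fn)      # of the real spam emails, how many did we catch?
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")
print(f"Precision={precision:.2f}, Recall={recall:.2f}, Accuracy={accuracy:.2f}")
```

Here one legitimate email was flagged (FP = 1) and one spam email slipped through (FN = 1), so precision, recall, and accuracy all come out to 0.75.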

In [None]:
# Create confusion matrices for all models
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

models = [
    ('Logistic Regression', y_pred_log),
    ('Decision Tree', y_pred_dt),
    ('Random Forest', y_pred_rf)
]

for idx, (name, predictions) in enumerate(models):
    cm = confusion_matrix(y_test, predictions)
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[idx],
                xticklabels=['Not Spam', 'Spam'],
                yticklabels=['Not Spam', 'Spam'])
    axes[idx].set_title(f'{name}', fontsize=12, fontweight='bold')
    axes[idx].set_ylabel('Actual', fontsize=10)
    axes[idx].set_xlabel('Predicted', fontsize=10)

plt.tight_layout()
plt.show()

print("📊 Reading the Confusion Matrix:")
print("  Top-left: Correctly identified NOT spam")
print("  Bottom-right: Correctly identified spam")
print("  Top-right: False alarm (legitimate email marked spam) ⚠️")
print("  Bottom-left: Missed spam (spam reached inbox) ⚠️")

## 🌟 Real AI Example: Sentiment Analysis Pipeline

Let's build a **sentiment classifier** for product reviews - used in 2024-2025 AI systems!

**Use Case:** E-commerce platforms analyze millions of reviews to:
- Route negative reviews to customer service (Agentic AI)
- Identify trending products from positive sentiment
- Feed into RAG systems for customer support

In [None]:
# Create a sentiment analysis dataset
np.random.seed(42)

# Simulate review features
n_reviews = 800

sentiment_data = {
    'positive_words': np.concatenate([
        np.random.poisson(8, 400),   # Positive reviews
        np.random.poisson(2, 400)    # Negative reviews
    ]),
    'negative_words': np.concatenate([
        np.random.poisson(1, 400),   # Positive reviews
        np.random.poisson(6, 400)    # Negative reviews
    ]),
    'exclamation_marks': np.concatenate([
        np.random.poisson(3, 400),
        np.random.poisson(2, 400)
    ]),
    'review_length': np.concatenate([
        np.random.normal(150, 30, 400),
        np.random.normal(100, 25, 400)
    ]),
    'rating_stars': np.concatenate([
        np.random.choice([4, 5], 400, p=[0.3, 0.7]),
        np.random.choice([1, 2, 3], 400, p=[0.5, 0.3, 0.2])
    ]),
    'sentiment': ['Positive'] * 400 + ['Negative'] * 400
}

sentiment_df = pd.DataFrame(sentiment_data)
sentiment_df = sentiment_df.sample(frac=1, random_state=42).reset_index(drop=True)

# Convert sentiment to binary
sentiment_df['sentiment_binary'] = (sentiment_df['sentiment'] == 'Positive').astype(int)

print("💬 Sentiment Analysis Dataset Created!")
print(f"Total reviews: {len(sentiment_df)}")
print(f"Positive: {(sentiment_df['sentiment'] == 'Positive').sum()}")
print(f"Negative: {(sentiment_df['sentiment'] == 'Negative').sum()}")
print("\nSample reviews:")
sentiment_df.head()

In [None]:
# Train sentiment classifier
X_sent = sentiment_df.drop(['sentiment', 'sentiment_binary'], axis=1)
y_sent = sentiment_df['sentiment_binary']

X_train_sent, X_test_sent, y_train_sent, y_test_sent = train_test_split(
    X_sent, y_sent, test_size=0.2, random_state=42
)

# Try all three algorithms
print("🚀 Training Sentiment Classifiers...\n")

# Logistic Regression (features are unscaled here, so allow extra iterations to converge)
sent_log = LogisticRegression(random_state=42, max_iter=1000)
sent_log.fit(X_train_sent, y_train_sent)
sent_log_acc = accuracy_score(y_test_sent, sent_log.predict(X_test_sent))

# Decision Tree
sent_dt = DecisionTreeClassifier(random_state=42, max_depth=5)
sent_dt.fit(X_train_sent, y_train_sent)
sent_dt_acc = accuracy_score(y_test_sent, sent_dt.predict(X_test_sent))

# Random Forest
sent_rf = RandomForestClassifier(n_estimators=100, random_state=42)
sent_rf.fit(X_train_sent, y_train_sent)
sent_rf_acc = accuracy_score(y_test_sent, sent_rf.predict(X_test_sent))

print("🎯 Sentiment Analysis Results:")
print(f"  Logistic Regression: {sent_log_acc:.2%}")
print(f"  Decision Tree: {sent_dt_acc:.2%}")
print(f"  Random Forest: {sent_rf_acc:.2%}")

# Demo: Classify sample reviews
print("\n💬 Sample Predictions:")
sample_reviews = X_test_sent.head()
predictions = sent_rf.predict(sample_reviews)

for i, (idx, row) in enumerate(sample_reviews.iterrows()):
    sentiment = "😊 Positive" if predictions[i] == 1 else "😞 Negative"
    actual = "😊 Positive" if y_test_sent.iloc[i] == 1 else "😞 Negative"
    print(f"\nReview {i+1}:")
    print(f"  Positive words: {int(row['positive_words'])}, Negative words: {int(row['negative_words'])}")
    print(f"  Stars: {int(row['rating_stars'])}⭐")
    print(f"  Predicted: {sentiment}  |  Actual: {actual}")

## 🎯 YOUR TURN: Exercise 1

**Challenge:** Build a customer churn predictor!

**Scenario:** Predict if a customer will cancel their subscription

**Your Task:**
1. Use the dataset below
2. Train all 3 classifiers (Logistic Regression, Decision Tree, Random Forest)
3. Compare their accuracy
4. Which model works best?

Don't worry - experiment and learn! 💪

In [None]:
# Customer churn dataset
np.random.seed(42)

churn_data = {
    'months_subscribed': np.concatenate([
        np.random.randint(1, 6, 300),    # Will churn (short subscription)
        np.random.randint(12, 60, 500)   # Won't churn (long subscription)
    ]),
    'monthly_usage_hours': np.concatenate([
        np.random.randint(1, 10, 300),   # Will churn (low usage)
        np.random.randint(20, 80, 500)   # Won't churn (high usage)
    ]),
    'support_tickets': np.concatenate([
        np.random.poisson(5, 300),       # Will churn (many complaints)
        np.random.poisson(1, 500)        # Won't churn (few complaints)
    ]),
    'payment_failures': np.concatenate([
        np.random.poisson(2, 300),
        np.random.poisson(0.2, 500)
    ]),
    'will_churn': [1] * 300 + [0] * 500  # 1 = Will cancel, 0 = Will stay
}

churn_df = pd.DataFrame(churn_data)
churn_df = churn_df.sample(frac=1, random_state=42).reset_index(drop=True)

print("📊 Customer Churn Dataset:")
print(churn_df.head())
print(f"\nTotal customers: {len(churn_df)}")
print(f"Will churn: {churn_df['will_churn'].sum()}")
print(f"Will stay: {(churn_df['will_churn'] == 0).sum()}")

In [None]:
# YOUR CODE HERE!
# Hint: Follow the same steps as spam detection above

# Step 1: Separate X and y
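X_churn = None  # YOUR CODE (placeholder so the cell runs; replace None)
y_churn = None  # YOUR CODE (placeholder so the cell runs; replace None)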

# Step 2: Split data
# YOUR CODE

# Step 3: Train models
# YOUR CODE

# Step 4: Compare accuracy
# YOUR CODE

<details>
<summary>📖 Click here for solution</summary>

```python
# Step 1: Separate X and y
X_churn = churn_df.drop('will_churn', axis=1)
y_churn = churn_df['will_churn']

# Step 2: Split data
X_train_ch, X_test_ch, y_train_ch, y_test_ch = train_test_split(
    X_churn, y_churn, test_size=0.2, random_state=42
)

# Step 3: Train models
models_churn = {
    'Logistic Regression': LogisticRegression(random_state=42, max_iter=1000),  # unscaled features: allow more iterations
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'Random Forest': RandomForestClassifier(random_state=42)
}

results_churn = {}
for name, model in models_churn.items():
    model.fit(X_train_ch, y_train_ch)
    accuracy = accuracy_score(y_test_ch, model.predict(X_test_ch))
    results_churn[name] = accuracy
    print(f"{name}: {accuracy:.2%}")
```
</details>

## 🎓 Key Takeaways

**You just learned:**

1. **Logistic Regression**
   - ✅ Fast and simple
   - ✅ Outputs probabilities
   - ❌ Limited to linear patterns
   - **Use for:** Quick baselines, probability scores

2. **Decision Trees**
   - ✅ Easy to understand
   - ✅ Handles non-linear patterns
   - ❌ Can overfit
   - **Use for:** Interpretability, feature importance

3. **Random Forests**
   - ✅ High accuracy
   - ✅ Robust and reliable
   - ❌ Slower, less interpretable
   - **Use for:** Production systems, complex patterns
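
To see "limited to linear patterns" versus "handles non-linear patterns" in action, here is a small sketch on a made-up XOR-style dataset (`X_xor`, `y_xor` are illustrative names). No single straight line separates the two classes, so Logistic Regression should land near coin-flip accuracy while a Decision Tree does much better.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Hypothetical XOR-style data: class 1 when both features share the same sign
X_xor = rng.uniform(-1, 1, size=(1000, 2))
y_xor = (X_xor[:, 0] * X_xor[:, 1] > 0).astype(int)

X_xor_train, X_xor_test, y_xor_train, y_xor_test = train_test_split(
    X_xor, y_xor, test_size=0.25, random_state=0, stratify=y_xor
)

# A linear model cannot carve out the four quadrants
linear_acc = LogisticRegression().fit(X_xor_train, y_xor_train).score(X_xor_test, y_xor_test)

# A tree can, by asking about each feature in turn
tree_acc = DecisionTreeClassifier(random_state=0).fit(X_xor_train, y_xor_train).score(X_xor_test, y_xor_test)

print(f"Logistic Regression accuracy: {linear_acc:.2%}")
print(f"Decision Tree accuracy: {tree_acc:.2%}")
```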

**🌟 Real-World AI Applications (2024-2025):**
- **RAG Systems:** Classify document relevance before retrieval
- **Agentic AI:** Route user queries to the right action/agent
- **Content Moderation:** Classify toxic content on social media
- **Sentiment Analysis:** Analyze customer feedback at scale
- **Multimodal AI:** Pre-filter data before expensive Transformer processing

## 🚀 Next Steps

**Practice Exercises:**
1. Try adjusting `max_depth` in Decision Trees - what happens?
2. Change `n_estimators` in Random Forest (50, 200, 500)
3. Create your own dataset and train classifiers
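
As a starting point for the first exercise, here is one possible sketch of a `max_depth` sweep. It assumes a made-up noisy dataset (15% of labels are flipped on purpose) so that overfitting is visible; the variable names are illustrative and separate from the spam data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Hypothetical noisy data: label depends on x0 + x1, then 15% of labels are flipped
X_depth = rng.normal(size=(600, 4))
y_depth = (X_depth[:, 0] + X_depth[:, 1] > 0).astype(int)
flip = rng.random(600) < 0.15
y_depth[flip] = 1 - y_depth[flip]

X_tr, X_te, y_tr, y_te = train_test_split(
    X_depth, y_depth, test_size=0.25, random_state=0, stratify=y_depth
)

# Train one tree per depth and record (train accuracy, test accuracy)
depth_scores = {}
for depth in [1, 3, 5, 10, None]:  # None = grow until every leaf is pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    depth_scores[depth] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))
    print(f"max_depth={depth}: train={depth_scores[depth][0]:.2%}, test={depth_scores[depth][1]:.2%}")
```

Watch the gap between the two columns: a tree with no depth limit memorizes the training set, noise included, while its test accuracy lags behind. The same loop works for `n_estimators` in a Random Forest.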

**Coming Next:**
- **Day 2:** Advanced Classifiers (SVM, KNN, Naive Bayes)
- **Day 3:** Regression Algorithms (predict continuous values!)

---

**🎉 Congratulations!** You can now build spam filters, sentiment analyzers, and churn predictors - all real AI applications!

**💬 Questions?** Review the notebook, experiment with the code, and see what happens when you change parameters!

---

*Remember: Large models such as ChatGPT and Claude get the headlines, but many production AI pipelines, including RAG systems, still often rely on simple classifiers like these for routing and filtering!* 🌟