# Week 4 Homework Exercises
## Advanced Machine Learning & Algorithm Implementation

**Duration:** 4-6 hours
**Prerequisites:** Completion of Week 4 Lab Exercises

---

### Learning Objectives:
1. Implement advanced search algorithms and graph traversal
2. Build and compare multiple ML models for spam detection
3. Perform feature engineering and model optimization
4. Create a complete ML pipeline with cross-validation
5. Analyze model performance and interpret results

---

## Exercise 1: Advanced Search Algorithms (60 minutes)

### Task 1.1: Implement A* Search Algorithm

**Instructions:** Implement the A* search algorithm for finding the shortest path in a weighted graph.
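
**Hint:** One possible shape for A* is sketched below: a priority queue ordered by f = g + h, plus a dictionary of the cheapest known cost to each node. It assumes the graph is a dict of `(neighbor, weight)` lists and the heuristic is a dict of per-node estimates, as in the sample data for this task. The name `a_star_sketch` and the three-node `demo_graph` are illustrative only and not part of the assignment; treat this as a reference sketch, not the required solution.

```python
import heapq

def a_star_sketch(graph, start, goal, heuristic):
    """Return (path, cost) for the cheapest path found, or (None, inf) if unreachable."""
    # Each heap entry is (f = g + h, g, node); g is the cost from start so far.
    frontier = [(heuristic.get(start, 0), 0, start)]
    best_g = {start: 0}      # cheapest known cost to reach each node
    came_from = {}           # node -> predecessor on the cheapest known path

    while frontier:
        _, g, node = heapq.heappop(frontier)
        if g > best_g.get(node, float("inf")):
            continue         # stale entry: a cheaper route to this node was found later
        if node == goal:
            # Walk predecessors back to the start to rebuild the path.
            path = [node]
            while node in came_from:
                node = came_from[node]
                path.append(node)
            return path[::-1], g
        for neighbor, weight in graph.get(node, []):
            new_g = g + weight
            if new_g < best_g.get(neighbor, float("inf")):
                best_g[neighbor] = new_g
                came_from[neighbor] = node
                heapq.heappush(frontier, (new_g + heuristic.get(neighbor, 0), new_g, neighbor))
    return None, float("inf")

# Hypothetical three-node example (not the assignment graph)
demo_graph = {"S": [("M", 1), ("T", 5)], "M": [("S", 1), ("T", 2)], "T": [("S", 5), ("M", 2)]}
demo_h = {"S": 3, "M": 2, "T": 0}
print(a_star_sketch(demo_graph, "S", "T", demo_h))
```

The sketch skips only stale heap entries instead of keeping a permanent closed set, so a node can be expanded again when a cheaper route to it turns up. That keeps the result optimal when the heuristic never overestimates but is not consistent along every edge.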

In [None]:
import heapq
from collections import defaultdict

# Sample weighted graph
graph = {
    'A': [('B', 4), ('C', 2)],
    'B': [('A', 4), ('D', 3), ('E', 1)],
    'C': [('A', 2), ('D', 1), ('F', 5)],
    'D': [('B', 3), ('C', 1), ('E', 2), ('F', 3)],
    'E': [('B', 1), ('D', 2), ('F', 4)],
    'F': [('C', 5), ('D', 3), ('E', 4)]
}

# Heuristic: estimated remaining cost from each node to the goal 'F'
# (think of it as straight-line distance; it never overestimates the true cost)
heuristic = {
    'A': 6, 'B': 4, 'C': 4, 'D': 2, 'E': 3, 'F': 0
}

def a_star_search(graph, start, goal, heuristic):
    """
    TODO: Implement A* search algorithm
    Args:
        graph: dict mapping each node to a list of (neighbor, weight) tuples
        start: starting node
        goal: goal node
        heuristic: dict mapping each node to its estimated cost to the goal
    Returns:
        path: list of nodes from start to goal
        cost: total cost of the path
    """
    # Your implementation here
    pass

# Test your implementation
path, cost = a_star_search(graph, 'A', 'F', heuristic)
print(f"A* Path: {path}")
print(f"Total Cost: {cost}")

### Task 1.2: Compare Search Algorithms

**Instructions:** Implement a function to compare BFS, DFS, and A* search algorithms.
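
**Hint:** The three metrics can be collected inside the search loop itself. The sketch below shows one way to do that for BFS and DFS with a single helper that switches between a FIFO queue and a LIFO stack. The helper name `traverse`, its `strategy` argument, and the small `demo_graph` are assumptions made for illustration.

```python
import time
from collections import deque

def traverse(graph, start, goal, strategy="bfs"):
    """Unweighted search; returns metrics for BFS (FIFO queue) or DFS (LIFO stack)."""
    t0 = time.perf_counter()
    frontier = deque([[start]])        # each entry is the path taken so far
    visited = set()
    explored = 0
    while frontier:
        path = frontier.popleft() if strategy == "bfs" else frontier.pop()
        node = path[-1]
        if node in visited:
            continue
        visited.add(node)
        explored += 1
        if node == goal:
            return {"path": path, "path_length": len(path) - 1,
                    "nodes_explored": explored,
                    "time_s": time.perf_counter() - t0}
        for neighbor, _weight in graph.get(node, []):
            if neighbor not in visited:
                frontier.append(path + [neighbor])
    return {"path": None, "path_length": None, "nodes_explored": explored,
            "time_s": time.perf_counter() - t0}

# Hypothetical example graph (not the assignment graph)
demo_graph = {"S": [("M", 1), ("T", 5)], "M": [("S", 1), ("T", 2)], "T": []}
for name in ("bfs", "dfs"):
    print(name, traverse(demo_graph, "S", "T", strategy=name))
```

BFS and DFS ignore edge weights, so the path they return can cost more than the one A* finds. Reporting both the number of edges and the total weight makes the comparison fair.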

In [None]:
import time

def compare_search_algorithms(graph, start, goal):
    """
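    TODO: Compare BFS, DFS, and A* search algorithms
    - Measure execution time
    - Count nodes explored
    - Record path length (number of edges) and total path cost
    Note: A* also needs the `heuristic` dict defined in Task 1.1.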
    """
    results = {}
    
    # TODO: Implement BFS comparison
    
    # TODO: Implement DFS comparison
    
    # TODO: Implement A* comparison
    
    return results

# Test comparison
comparison_results = compare_search_algorithms(graph, 'A', 'F')
print("Algorithm Comparison:")
for algo, metrics in comparison_results.items():
    print(f"\n{algo}:")
    for metric, value in metrics.items():
        print(f"  {metric}: {value}")

## Exercise 2: Advanced Email Spam Detection (120 minutes)

### Task 2.1: Multiple Model Comparison

**Instructions:** Implement and compare multiple ML models for spam detection.
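
**Hint:** The train, predict, and score steps follow the same pattern for every model. The sketch below runs that loop on a handful of made-up messages so it finishes instantly. The toy texts, labels, and the names `toy_models` and `toy_results` are illustrative assumptions; the exercise itself uses the TF-IDF matrices built from `mail_data.csv`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Made-up toy messages for illustration only (1 = spam, 0 = ham)
train_texts = ["win free cash now", "free prize click here", "claim your free reward",
               "lunch at noon tomorrow", "see you at the meeting", "please review my notes"]
train_labels = [1, 1, 1, 0, 0, 0]
test_texts = ["free cash prize", "meeting notes for tomorrow"]
test_labels = [1, 0]

vec = TfidfVectorizer()
X_tr = vec.fit_transform(train_texts)   # fit the vocabulary on training text only
X_te = vec.transform(test_texts)        # reuse the fitted vocabulary for test text

toy_models = {"Logistic Regression": LogisticRegression(random_state=42),
              "Naive Bayes": MultinomialNB()}
toy_results = {}
for name, model in toy_models.items():
    model.fit(X_tr, train_labels)
    pred = model.predict(X_te)
    toy_results[name] = {
        "accuracy": accuracy_score(test_labels, pred),
        "precision": precision_score(test_labels, pred, zero_division=0),
        "recall": recall_score(test_labels, pred, zero_division=0),
        "f1": f1_score(test_labels, pred, zero_division=0),
    }
print(toy_results)
```

Storing one dict of metrics per model gives a structure that `pd.DataFrame(...).T` turns directly into a comparison table with one row per model.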

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split, cross_val_score
import pandas as pd
import numpy as np

# Load and preprocess data (reuse from lab)
df = pd.read_csv("mail_data.csv")
data = df.where((pd.notnull(df)), '')
data.loc[data['Category'] == 'spam', 'Category'] = 1
data.loc[data['Category'] == 'ham', 'Category'] = 0

X = data['Message']
y = data['Category'].astype(int)  # cast object labels to int so scikit-learn treats them as classes

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# TF-IDF vectorization
vectorizer = TfidfVectorizer(max_features=5000, stop_words='english')
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

print("Data prepared successfully!")
print(f"Training set: {X_train_tfidf.shape}")
print(f"Testing set: {X_test_tfidf.shape}")

In [None]:
# TODO: Define multiple models
models = {
    'Logistic Regression': LogisticRegression(random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'SVM': SVC(kernel='linear', random_state=42),
    'Naive Bayes': MultinomialNB()
}

# TODO: Train and evaluate all models
results = {}

for name, model in models.items():
    print(f"\nTraining {name}...")
    
    # TODO: Train model
    
    # TODO: Make predictions
    
    # TODO: Calculate metrics
    
    # TODO: Store results
    
    print(f"{name} - Accuracy: {accuracy:.4f}, F1: {f1:.4f}")

# TODO: Create comparison table
results_df = pd.DataFrame(results).T
print("\nModel Comparison:")
print(results_df)

### Task 2.2: Feature Engineering

**Instructions:** Create additional features to improve model performance.
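
**Hint:** Each feature in the list below can be computed with plain string methods and a regular expression or two. The sketch is one possible version. The keyword set `SPAM_KEYWORDS`, the URL pattern, and the function name `extract_features_sketch` are assumptions chosen for illustration, so pick keywords that suit your own data.

```python
import re

SPAM_KEYWORDS = {"free", "win", "winner", "prize", "cash", "urgent"}  # illustrative list, an assumption
URL_PATTERN = re.compile(r"(https?://|www\.)\S+", re.IGNORECASE)

def extract_features_sketch(text):
    """Return a dict of simple hand-crafted features for one message."""
    words = re.findall(r"[A-Za-z']+", text)   # runs of letters and apostrophes
    return {
        "text_length": len(text),
        "word_count": len(words),
        "exclamation_count": text.count("!"),
        "capital_count": sum(ch.isupper() for ch in text),
        "has_spam_keyword": int(any(w.lower() in SPAM_KEYWORDS for w in words)),
        "has_url": int(bool(URL_PATTERN.search(text))),
    }

print(extract_features_sketch("FREE prize!! Visit www.example.com now"))
```

Returning a dict per message means a list of these dicts can be passed straight to `pd.DataFrame`, giving one column per feature.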

In [None]:
import re

def extract_features(text):
    """
    TODO: Extract additional features from text
    - Text length
    - Word count
    - Number of exclamation marks
    - Number of capital letters
    - Presence of spam keywords
    - Presence of URLs
    """
    features = {}
    
    # TODO: Implement feature extraction
    
    return features

# TODO: Apply feature extraction to dataset
def create_enhanced_features(texts):
    """Create enhanced feature matrix"""
    enhanced_features = []
    
    for text in texts:
        # TODO: Extract features for each text
        pass
    
    return pd.DataFrame(enhanced_features)

# Create enhanced features
enhanced_train = create_enhanced_features(X_train)
enhanced_test = create_enhanced_features(X_test)

print("Enhanced features created!")
print(f"Training features: {enhanced_train.shape}")
print(f"Testing features: {enhanced_test.shape}")

In [None]:
# TODO: Combine TF-IDF and enhanced features
# Hint: Use hstack or concatenate

# TODO: Train model with combined features

# TODO: Compare performance with original model

print("Feature engineering results:")
print(f"Original accuracy: {original_accuracy:.4f}")
print(f"Enhanced accuracy: {enhanced_accuracy:.4f}")
print(f"Improvement: {enhanced_accuracy - original_accuracy:.4f}")
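
**Hint:** The combination step can be sketched with `scipy.sparse.hstack`, which appends dense hand-crafted columns to the sparse TF-IDF matrix without converting it to a dense array. The two example messages and the two extra columns below are made up for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["win free cash now!!", "see you at the meeting"]   # made-up examples
tfidf = TfidfVectorizer().fit_transform(texts)               # sparse, shape (2, n_terms)

# Hypothetical hand-crafted columns: text length and exclamation count
extra = np.array([[len(t), t.count("!")] for t in texts], dtype=float)

# Convert the dense block to sparse first, then stack column-wise
combined = hstack([tfidf, csr_matrix(extra)]).tocsr()
print(tfidf.shape, extra.shape, combined.shape)
```

Raw counts such as text length are much larger than TF-IDF weights, so scaling the extra columns may help linear models. `MultinomialNB` also requires every feature to be non-negative.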

### Task 2.3: Cross-Validation and Hyperparameter Tuning

**Instructions:** Implement cross-validation and hyperparameter tuning for the best model.
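
**Hint:** Before running a full grid search, it can help to see cross-validation on its own. The sketch below scores a small pipeline with `cross_val_score` on made-up messages. The toy data, the `MultinomialNB` classifier, and the 4-fold split are assumptions chosen to keep the example fast.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up toy data (1 = spam, 0 = ham); the exercise uses X_train / y_train
texts = ["win free cash now", "free prize click here", "claim your free reward",
         "urgent free offer inside", "lunch at noon tomorrow", "see you at the meeting",
         "please review my notes", "call me after class"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# With the vectorizer inside the pipeline, it is refit on each training fold,
# so no vocabulary or IDF statistics leak in from the validation fold.
toy_pipeline = Pipeline([("vectorizer", TfidfVectorizer()),
                         ("classifier", MultinomialNB())])

cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)
scores = cross_val_score(toy_pipeline, texts, labels, cv=cv, scoring="f1")
print(f"F1 per fold: {scores}")
print(f"Mean F1: {scores.mean():.4f}")
```

`GridSearchCV` repeats this procedure for every parameter combination, which is why grid keys use the `step__parameter` naming that matches the pipeline step names.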

In [None]:
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline

# TODO: Create pipeline with vectorizer and classifier
pipeline = Pipeline([
    ('vectorizer', TfidfVectorizer()),
    ('classifier', RandomForestClassifier(random_state=42))
])

# TODO: Define parameter grid for tuning
param_grid = {
    'vectorizer__max_features': [3000, 5000, 7000],
    'vectorizer__ngram_range': [(1, 1), (1, 2)],
    'classifier__n_estimators': [50, 100, 200],
    'classifier__max_depth': [10, 20, None]
}

# TODO: Perform grid search with cross-validation
grid_search = GridSearchCV(
    pipeline, param_grid, cv=5, scoring='f1', n_jobs=-1, verbose=1
)

# TODO: Fit the grid search
grid_search.fit(X_train, y_train)

print(f"Best parameters: {grid_search.best_params_}")
print(f"Best cross-validation score: {grid_search.best_score_:.4f}")

# TODO: Evaluate on test set
best_model = grid_search.best_estimator_
test_score = best_model.score(X_test, y_test)
print(f"Test set accuracy: {test_score:.4f}")

## Exercise 3: Model Analysis and Interpretation (60 minutes)

### Task 3.1: Feature Importance Analysis

**Instructions:** Analyze and visualize feature importance for the Random Forest model.

In [None]:
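import matplotlib.pyplot as plt

# TODO: Get feature names and importances from the best model
# Hint: best_model is a Pipeline; its fitted steps are available via best_model.named_steps
feature_names = None       # TODO: replace with names from the fitted 'vectorizer' step
feature_importance = None  # TODO: replace with importances from the fitted 'classifier' step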

# TODO: Create feature importance DataFrame
importance_df = pd.DataFrame({
    'feature': feature_names,
    'importance': feature_importance
}).sort_values('importance', ascending=False)

# TODO: Plot top 20 most important features
plt.figure(figsize=(12, 8))
top_features = importance_df.head(20)
plt.barh(range(len(top_features)), top_features['importance'])
plt.yticks(range(len(top_features)), top_features['feature'])
plt.xlabel('Feature Importance')
plt.title('Top 20 Most Important Features for Spam Detection')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()

print("Top 10 most important features:")
print(top_features.head(10))

### Task 3.2: Error Analysis

**Instructions:** Analyze misclassified emails to understand model limitations.
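
**Hint:** A useful first step is to split the errors into false positives (ham flagged as spam) and false negatives (spam that was missed), since the two usually have different causes. The sketch below does this with a confusion matrix on made-up labels; `y_true` and `y_hat` are illustrative stand-ins for `y_test` and `y_pred`.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix

# Made-up labels and predictions for illustration (1 = spam, 0 = ham)
y_true = np.array([1, 0, 1, 0, 0, 1])
y_hat = np.array([1, 0, 0, 0, 1, 1])

# With labels=[0, 1], ravel() returns counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat, labels=[0, 1]).ravel()
print(f"False positives (ham flagged as spam): {fp}")
print(f"False negatives (spam missed): {fn}")

# Keep only the misclassified rows and tag each with its error type
errors = pd.DataFrame({"true": y_true, "pred": y_hat})
errors = errors[errors["true"] != errors["pred"]].copy()
errors["error_type"] = np.where(errors["pred"] == 1, "false positive", "false negative")
print(errors)
```

Grouping the misclassified messages by error type makes patterns easier to spot, for example short ham messages that contain a promotional word.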

In [None]:
# TODO: Get predictions from best model
y_pred = best_model.predict(X_test)

# TODO: Find misclassified examples
misclassified_indices = np.where(y_test != y_pred)[0]

print(f"Total misclassified: {len(misclassified_indices)}")
print(f"Misclassification rate: {len(misclassified_indices)/len(y_test):.4f}")

# TODO: Analyze some misclassified examples
print("\nSample misclassified emails:")
for i in misclassified_indices[:5]:
    true_label = "SPAM" if y_test.iloc[i] == 1 else "HAM"
    pred_label = "SPAM" if y_pred[i] == 1 else "HAM"
    email = X_test.iloc[i][:100] + "..."
    
    print(f"\nTrue: {true_label}, Predicted: {pred_label}")
    print(f"Email: {email}")

# TODO: Analyze patterns in misclassifications
# Your analysis here


## Exercise 4: Real-World Application (60 minutes)

### Task 4.1: Create a Spam Detection API

**Instructions:** Build a simple API for spam detection using Flask.
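
**Hint:** The endpoint needs to validate the JSON body, run the model, and return a label with a confidence value. The sketch below shows one way to do that and exercises it with Flask's built-in test client, so no server has to be running. The stand-in `toy_model`, the app name `sketch_app`, and the error message format are assumptions for illustration; the exercise loads the tuned model from disk instead.

```python
from flask import Flask, jsonify, request
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Stand-in model trained on made-up messages (1 = spam, 0 = ham)
toy_model = Pipeline([("vectorizer", TfidfVectorizer()), ("classifier", MultinomialNB())])
toy_model.fit(["win free cash now", "free prize click here",
               "lunch at noon tomorrow", "see you at the meeting"], [1, 1, 0, 0])

sketch_app = Flask(__name__)

@sketch_app.route("/predict", methods=["POST"])
def predict_sketch():
    payload = request.get_json(silent=True) or {}
    message = payload.get("message")
    if not isinstance(message, str) or not message.strip():
        return jsonify({"error": "JSON body must include a non-empty 'message' string"}), 400
    proba = toy_model.predict_proba([message])[0]            # one probability per class
    best = int(proba.argmax())
    label = toy_model.named_steps["classifier"].classes_[best]
    return jsonify({"prediction": "spam" if label == 1 else "ham",
                    "confidence": float(proba[best])})

# The test client calls the route in-process, without starting a server
client = sketch_app.test_client()
print(client.post("/predict", json={"message": "free cash prize"}).get_json())
print(client.post("/predict", json={}).get_json())
```

The confidence here comes from `predict_proba`, which `RandomForestClassifier` and `MultinomialNB` provide. `SVC` only offers it when constructed with `probability=True`.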

In [None]:
from flask import Flask, request, jsonify
import joblib

# TODO: Save the best model
joblib.dump(best_model, 'spam_detector_model.pkl')
print("Model saved successfully!")

# TODO: Create Flask app
app = Flask(__name__)

# TODO: Load the model
model = joblib.load('spam_detector_model.pkl')

@app.route('/predict', methods=['POST'])
def predict_spam():
    """
    TODO: Implement prediction endpoint
    - Accept JSON with 'message' field
    - Return prediction (spam/ham) and confidence
    """
    # Your implementation here
    pass

@app.route('/health', methods=['GET'])
def health_check():
    return jsonify({'status': 'healthy', 'model': 'spam_detector'})

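# app.run() blocks until the server is stopped, so start the API in a separate
# process (for example, save this cell as a script and run it from a terminal)
# before running the test cell below. use_reloader=False disables the debug
# reloader, which does not work when Flask is launched from a notebook.
if __name__ == '__main__':
    print("Starting API on port 5000. Run the test cell below from another session.")
    app.run(debug=True, use_reloader=False, port=5000)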

In [None]:
# TODO: Test the API
import requests

# Test messages
test_messages = [
    "Meeting tomorrow at 3 PM",
    "FREE MONEY NOW! CLICK HERE!",
    "Please review the quarterly report"
]

for message in test_messages:
    # TODO: Make API request
    response = requests.post('http://localhost:5000/predict', 
                            json={'message': message})
    
    if response.status_code == 200:
        result = response.json()
        print(f"Message: {message[:30]}...")
        print(f"Prediction: {result['prediction']}")
        print(f"Confidence: {result['confidence']:.4f}\n")
    else:
        print(f"Error: {response.status_code}")

## Exercise 5: Documentation and Report (30 minutes)

### Task 5.1: Create Project Documentation

**Instructions:** Create comprehensive documentation for your spam detection project.

In [None]:
# TODO: Create project documentation
documentation = """
# Email Spam Detection Project Report

## Project Overview
This project implements a machine learning system for email spam detection using various algorithms and feature engineering techniques.

## Dataset
- Source: mail_data.csv
- Size: {dataset_size} emails
- Classes: Spam ({spam_count}), Ham ({ham_count})

## Models Evaluated
{model_comparison}

## Best Model
- Algorithm: {best_model_name}
- Accuracy (test set): {best_accuracy:.4f}
- F1 Score (cross-validated): {best_f1:.4f}
- Parameters: {best_params}

## Feature Engineering
- TF-IDF vectorization
- Text length analysis
- Spam keyword detection
- URL detection

## Key Findings
1. {finding_1}
2. {finding_2}
3. {finding_3}

## Future Improvements
1. {improvement_1}
2. {improvement_2}
3. {improvement_3}
"""

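# TODO: Fill in the documentation with actual values
# (the model name, findings, and improvements below are placeholders; replace them with your own results)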
documentation = documentation.format(
    dataset_size=len(data),
    spam_count=len(data[data['Category'] == 1]),
    ham_count=len(data[data['Category'] == 0]),
    model_comparison=results_df.to_string(),
    best_model_name="Random Forest",
    best_accuracy=test_score,
    best_f1=grid_search.best_score_,
    best_params=str(grid_search.best_params_),
    finding_1="Random Forest performed best with feature engineering",
    finding_2="TF-IDF features are crucial for text classification",
    finding_3="Cross-validation helps prevent overfitting",
    improvement_1="Try deep learning models (LSTM, BERT)",
    improvement_2="Collect more diverse training data",
    improvement_3="Implement ensemble methods"
)

# Save documentation
with open('spam_detection_report.md', 'w', encoding='utf-8') as f:
    f.write(documentation)

print("Documentation saved as 'spam_detection_report.md'")

## Homework Completion Checklist

- [ ] Implemented A* search algorithm
- [ ] Compared BFS, DFS, and A* algorithms
- [ ] Built multiple ML models (Logistic Regression, Random Forest, SVM, Naive Bayes)
- [ ] Implemented feature engineering techniques
- [ ] Performed cross-validation and hyperparameter tuning
- [ ] Analyzed feature importance
- [ ] Conducted error analysis on misclassified emails
- [ ] Created a Flask API for spam detection
- [ ] Generated comprehensive project documentation
- [ ] All code runs without errors

**Total Time:** ~4-6 hours
**Submission:** Upload your completed notebook and generated files

---

## Bonus Challenges (Optional)

1. **Deep Learning:** Implement an LSTM or BERT model for spam detection
2. **Real-time Processing:** Create a real-time email processing system
3. **Multi-language Support:** Extend the model to handle multiple languages
4. **Web Interface:** Build a web interface using Streamlit or Flask
5. **Deployment:** Deploy your model to a cloud platform (AWS, Google Cloud, etc.)

**Note:** Bonus challenges can earn extra credit and demonstrate advanced skills!