# Class 4: Capstone Project Work and Presentation Prep

## Welcome to Week 10, Class 4!
This notebook is your workspace for implementing your **capstone project**, the culmination of Week 10. You’ll collect and preprocess data, train and evaluate a model, create visualizations, and prepare a presentation to showcase your work. This class ties together everything you’ve learned: NLP (Class 1), RNNs/transformers (Class 2), and deployment/ethics (Class 3).

**Objectives**:
- Implement an end-to-end AI project (data → model → results).
- Create meaningful visualizations to communicate findings.
- Prepare a clear, concise presentation of your project.
- Reflect on ethical considerations in your work.

**Let’s build something awesome!**

## 1. Capstone Project Overview
Your capstone project is a chance to apply AI to a problem you care about. Examples include:
- **Text Sentiment Analysis**: Classify movie reviews as positive/negative (as in Classes 1–2).
- **Image Classification**: Identify objects in photos (e.g., cats vs. dogs).
- **Predictive Modeling**: Forecast house prices or stock trends.

**Project Steps**:
1. Collect and preprocess a dataset.
2. Train and evaluate a machine learning (ML) or deep learning (DL) model.
3. Visualize results (e.g., confusion matrix, loss curves).
4. Prepare a 5-minute presentation (problem, approach, results, challenges, ethics).

**Discussion Question**: What’s your project idea? Share it with a peer!

## 2. Setup
Run the cell below to install common libraries. Add any specific ones your project needs (e.g., `tensorflow` for DL, `transformers` for NLP).

**Tip**: Use Google Colab if your local setup is limited.

In [None]:
# Install libraries (uncomment as needed)
# !pip install numpy pandas scikit-learn tensorflow matplotlib seaborn nltk transformers

# Import common libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import warnings
warnings.filterwarnings('ignore')  # Silence library warnings to keep output readable

# Add your project-specific imports here
# e.g., import tensorflow as tf
# e.g., from transformers import pipeline

print("Setup complete!")

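# Optional: Set plot style
# The 'seaborn' style name is no longer available in recent Matplotlib releases;
# sns.set_theme() applies the seaborn look instead.
sns.set_theme()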

## 3. Project Implementation
This section guides you through the key steps of your project. Use the code cells below or create new ones as needed. If you’re stuck, refer to Classes 1–3 or ask the instructor.

**Note**: We’ll provide an example (sentiment analysis) to illustrate, but you should adapt the code to your project.

### 3.1 Step 1: Collect and Preprocess Data
Find a dataset for your project (e.g., Kaggle, UCI, or custom data). Preprocess it to prepare for modeling.

**Tips**:
- For text: Tokenize, remove stop words, lemmatize (Class 1).
- For images: Resize, normalize (e.g., scale pixels to 0–1).
- For tabular data: Handle missing values, encode categories.
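
The tabular and image tips above can be sketched roughly as follows. This is a minimal illustration, not a required recipe: the `toy` table, its column names (`sqft`, `city`, `price`), and the pixel values are all made up for the example, and median/placeholder imputation is just one reasonable choice.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table (column names and values are invented for illustration)
toy = pd.DataFrame({
    "sqft": [850.0, np.nan, 1200.0, 1500.0],
    "city": ["Austin", "Boston", None, "Austin"],
    "price": [200, 340, 310, 420],
})

# Handle missing values: median for numbers, an explicit placeholder for categories
toy["sqft"] = toy["sqft"].fillna(toy["sqft"].median())
toy["city"] = toy["city"].fillna("Unknown")

# Encode categories: one-hot encode the text column
features = pd.get_dummies(toy.drop(columns="price"), columns=["city"])
target = toy["price"]
print(features)

# Images: scale 8-bit pixel values from 0-255 down to 0-1
pixels = np.array([[0, 128, 255]], dtype=np.uint8)  # stand-in for a real image
scaled = pixels.astype("float32") / 255.0
print(scaled)
```

In a real project, consider computing fill values such as the median on the training split only, so information from the test set does not leak into preprocessing.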

**Example**: Sentiment analysis on a toy dataset.

In [None]:
# Example: Toy sentiment dataset
data = {
    "review": [
        "Loved the movie so much",
        "Terrible film, waste of time",
        "Amazing acting and story",
        "Hated it, really boring"
    ],
    "sentiment": [1, 0, 1, 0]  # 1 = positive, 0 = negative
}
df = pd.DataFrame(data)
print("Sample data:\n", df.head())

# Your code: Load your dataset
# e.g., df = pd.read_csv('your_dataset.csv')
# e.g., from tensorflow.keras.datasets import cifar10

# Your preprocessing code here
# e.g., handle missing values, tokenize text, normalize images

# Example preprocessing (text)
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(df["review"])
y = df["sentiment"]

print("\nFeatures shape:", X.shape)

**Your Turn**: Load your dataset and preprocess it. Describe your steps in a comment.

In [None]:
# Your code here
# Describe your preprocessing steps (e.g., 'Removed nulls, scaled features')

# Example placeholder
# df = pd.read_csv('my_data.csv')
# X = df.drop('target', axis=1)
# y = df['target']

### 3.2 Step 2: Train and Evaluate a Model
Choose a model based on your task:
- **ML**: Logistic Regression, Random Forest (scikit-learn).
- **DL**: LSTM, CNN (TensorFlow/Keras).
- **NLP**: Pre-trained transformer (Hugging Face).

Split your data, train the model, and evaluate it with metrics like accuracy, F1-score, or RMSE.
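
As a rough sketch of the split, train, and evaluate loop on a larger dataset, the cell below uses synthetic data from `make_classification` as a stand-in for your own features and labels. The variable names (`X_demo`, `clf`, and so on) and the choice of Logistic Regression are assumptions for illustration. The last lines show how RMSE could be computed for a regression task, using three made-up values.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (replace with your own X and y)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)

# stratify keeps the class balance the same in the train and test splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_tr, y_tr)
pred = clf.predict(X_te)

acc = accuracy_score(y_te, pred)
f1 = f1_score(y_te, pred)
print(f"Accuracy: {acc:.3f}, F1: {f1:.3f}")

# Regression metric: RMSE is the square root of the mean squared error
actual = np.array([3.0, 5.0, 2.0])     # made-up target values
predicted = np.array([2.0, 5.0, 4.0])  # made-up predictions
rmse = np.sqrt(mean_squared_error(actual, predicted))
print(f"RMSE: {rmse:.3f}")
```

Accuracy can look good on imbalanced data even when the minority class is mostly missed, which is one reason to report F1 alongside it.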

**Example**: Train a Naive Bayes model for sentiment.

In [None]:
# Example: Train Naive Bayes
from sklearn.naive_bayes import MultinomialNB

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train
model = MultinomialNB()
model.fit(X_train, y_train)

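# Evaluate
# Note: with only 4 reviews the test set holds a single example, so we pass
# labels explicitly to keep both classes in the report.
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred, labels=[0, 1], target_names=["Negative", "Positive"], zero_division=0))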

**Your Turn**: Train and evaluate your model. Include at least one metric.

In [None]:
# Your code here
# Describe your model choice and metric (e.g., 'Used Random Forest, evaluated with F1-score')

# Example placeholder
# from sklearn.ensemble import RandomForestClassifier
# model = RandomForestClassifier()
# model.fit(X_train, y_train)
# y_pred = model.predict(X_test)
# print("Accuracy:", accuracy_score(y_test, y_pred))

### 3.3 Step 3: Visualize Results
Visualizations make your results clear and engaging. Ideas:
- **Classification**: Confusion matrix, ROC curve.
- **Regression**: Predicted vs. actual scatter plot.
- **DL**: Training/validation loss curves.
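
For the ROC curve idea listed above, one possible sketch is shown below. It again relies on synthetic data from `make_classification` and a Logistic Regression model purely as placeholders; the names `X_roc`, `roc_model`, and `scores` are illustrative assumptions. The key point is that an ROC curve needs predicted probabilities (or scores), not hard class labels.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (replace with your own features and labels)
X_roc, y_roc = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr_roc, X_te_roc, y_tr_roc, y_te_roc = train_test_split(
    X_roc, y_roc, test_size=0.3, stratify=y_roc, random_state=0
)

roc_model = LogisticRegression(max_iter=1000).fit(X_tr_roc, y_tr_roc)

# Probability of the positive class for each test example
scores = roc_model.predict_proba(X_te_roc)[:, 1]

fpr, tpr, _ = roc_curve(y_te_roc, scores)
auc = roc_auc_score(y_te_roc, scores)

fig, ax = plt.subplots(figsize=(5, 4))
ax.plot(fpr, tpr, label=f"ROC (AUC = {auc:.2f})")
ax.plot([0, 1], [0, 1], linestyle="--", label="Chance")  # random-guess baseline
ax.set_xlabel("False Positive Rate")
ax.set_ylabel("True Positive Rate")
ax.set_title("ROC Curve")
ax.legend()
plt.show()
```

A curve that hugs the top-left corner indicates better separation between classes; the dashed diagonal is what random guessing would produce.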

**Example**: Confusion matrix for sentiment analysis.

In [None]:
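# Example: Confusion matrix (labels fixed so the matrix is always 2x2,
# even when the tiny test set contains only one class)
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])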
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", xticklabels=["Negative", "Positive"], yticklabels=["Negative", "Positive"])
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix")
plt.show()

**Your Turn**: Create at least one visualization for your project.

In [None]:
# Your code here
# Describe your visualization (e.g., 'Plotted loss curve to show training progress')

# Example placeholder
# plt.plot(history.history['loss'], label='Training Loss')
# plt.plot(history.history['val_loss'], label='Validation Loss')
# plt.xlabel('Epoch')
# plt.ylabel('Loss')
# plt.legend()
# plt.show()

## 4. Presentation Preparation
Your presentation should be **5 minutes** and cover:
1. **Problem**: What are you solving? Why is it important?
2. **Approach**: How did you preprocess data and choose your model?
3. **Results**: Show key metrics and visualizations.
4. **Challenges**: What was hard? How did you overcome it?
5. **Ethics**: Any biases or fairness issues? How would you address them?

**Tips**:
- Use 5–7 slides (e.g., Google Slides, PowerPoint).
- Keep visuals simple (e.g., one chart per slide).
- Practice explaining your project clearly to a non-technical audience.

**Example Outline**:
- Slide 1: Title, your name, project goal.
- Slide 2: Problem statement and dataset.
- Slide 3: Preprocessing and model choice.
- Slide 4: Results (metric + visualization).
- Slide 5: Challenges faced.
- Slide 6: Ethical considerations.
- Slide 7: Conclusion and next steps.

### 4.1 Draft Your Presentation Notes
Use the cell below to jot down ideas for each slide.

**Your Turn**: Write a brief note for each presentation section.

In [None]:
# Your notes here (edit the text below, or convert this cell to Markdown)
"""
Problem: [e.g., Classifying spam emails to improve user experience]
Approach: [e.g., Used TF-IDF and Logistic Regression]
Results: [e.g., 95% accuracy, confusion matrix shows low false positives]
Challenges: [e.g., Imbalanced dataset, fixed with oversampling]
Ethics: [e.g., Risk of misclassifying legitimate emails, need diverse data]
"""

## 5. Ethical Reflection
Consider potential ethical issues in your project, such as:
- **Bias**: Could your data favor one group? (e.g., reviews from only one demographic)
- **Fairness**: Does your model treat all inputs equitably?
- **Transparency**: Can users understand your predictions?

**Example**: A sentiment model might misclassify slang-heavy reviews if trained on formal text, affecting fairness.
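
One simple way to check for this kind of gap is to compare accuracy across groups of inputs. The sketch below assumes you have tagged each test example with a group label; the `results` table, its `style` column, and every prediction in it are invented for illustration and are not output from the model in this notebook.

```python
import pandas as pd

# Hypothetical evaluation table: one row per test example (all values made up)
results = pd.DataFrame({
    "style": ["formal", "formal", "formal", "slang", "slang", "slang"],
    "actual": [1, 0, 1, 1, 0, 1],
    "predicted": [1, 0, 1, 0, 0, 0],
})

# Mark each prediction as right or wrong, then average within each group
results["correct"] = results["actual"] == results["predicted"]
group_accuracy = results.groupby("style")["correct"].mean()
print(group_accuracy)

# A large gap between groups is a signal worth investigating
gap = group_accuracy.max() - group_accuracy.min()
print(f"Accuracy gap between groups: {gap:.2f}")
```

A gap on its own does not prove the model is unfair, especially with few examples per group, but it points to where more data or closer inspection may be needed.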

**Your Turn**: Test your model with edge cases to spot biases.

In [None]:
# Example: Test sentiment model with edge cases
test_reviews = ["This movie slaps!", "Not my vibe but okay"]
for review in test_reviews:
    vector = vectorizer.transform([review])
    pred = model.predict(vector)
    print(f"Review: {review}\nPrediction: {'Positive' if pred[0] == 1 else 'Negative'}\n")

# Your code here
# Test your model with edge cases (e.g., slang, ambiguous inputs)
# Describe findings (e.g., 'Model misclassified slang as negative')

## 6. Wrap-Up
You’ve made great progress on your capstone project! By now, you should have:
- A preprocessed dataset and trained model.
- At least one visualization.
- Draft notes for your presentation.
- Reflections on ethical issues.

**Next Steps**:
- Finalize your code and visualizations.
- Create your presentation slides.
- Practice your 5-minute talk.

**Deliverables**:
- Submit this notebook with your project code, visualizations, and presentation notes.
- Submit your presentation slides (PDF or link).
- Optional: Share your project code in a GitHub repo.

**Homework**:
- Finish your project implementation.
- Prepare and rehearse your presentation.
- Read: [How to Give a Great Data Science Talk](https://towardsdatascience.com/how-to-give-a-great-data-science-presentation).

**Questions?** Ask the instructor or collaborate with peers. This is your chance to shine!

Amazing work, and good luck presenting! 🚀