# 📧 P2.1.1.6 – Machine Learning Foundations

## Topic: Spam Email Classifier Example


## 🎯 Learning Objectives

By the end of this notebook, you will be able to:

- Understand the steps in a real-world ML pipeline
- Explain how text data is vectorized for ML
- Build a simple spam classifier using Naive Bayes
- Evaluate the performance of a classification model


## 📝 Problem Statement

We want to build a program that classifies emails as **Spam** or **Ham** (not spam) using machine learning. The goal is to automate the detection of unwanted emails.


**Why is this important?**

- Spam emails waste time and can be dangerous: they may carry phishing links or malware. Automating detection helps keep inboxes clean and safe.


## 🔍 Choosing the ML Type

For this problem, we use **Supervised Learning** because:
- We have labeled examples (Spam or Ham)
- The model learns from past emails to predict new ones

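**Why not Unsupervised or Reinforcement Learning?**
- Unsupervised learning looks for patterns in unlabeled data; we already have labels, so ignoring them would waste information
- Reinforcement learning trains an agent through rewards from interacting with an environment, which does not match a fixed set of labeled emails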

## 🤖 Choosing the Model & Why

We use the **Naive Bayes** model because:
- It works well for text classification
- It is fast to train and simple to implement
- It models word frequencies directly, which suits the word-count features produced by vectorization

**Why not other models?**
- Decision Trees, SVMs, and other models can also be used, but Naive Bayes is a classic baseline for spam detection: it trains quickly and tends to perform well on sparse word-count features
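
To make the role of word frequencies concrete, here is a minimal sketch of how a multinomial Naive Bayes model scores a message: the log prior of a class plus the sum of the log likelihoods of each word, with add-one (Laplace) smoothing. The toy corpus and the helper names `train_counts` and `log_score` are assumptions made up for illustration; they are not part of scikit-learn.

```python
import math
from collections import Counter

# Toy labeled corpus (illustrative only, not the notebook's dataset)
corpus = [
    ("win money now", "Spam"),
    ("win a free prize", "Spam"),
    ("meeting tomorrow at office", "Ham"),
    ("project meeting scheduled", "Ham"),
]


def train_counts(corpus):
    """Count words per class and documents per class."""
    word_counts = {"Spam": Counter(), "Ham": Counter()}
    doc_counts = Counter()
    for text, label in corpus:
        word_counts[label].update(text.split())
        doc_counts[label] += 1
    vocab = set(word_counts["Spam"]) | set(word_counts["Ham"])
    return word_counts, doc_counts, vocab


def log_score(message, label, word_counts, doc_counts, vocab, alpha=1.0):
    """Return log P(label) + sum of log P(word | label) with add-alpha smoothing."""
    total_docs = sum(doc_counts.values())
    score = math.log(doc_counts[label] / total_docs)
    total_words = sum(word_counts[label].values())
    for word in message.split():
        if word not in vocab:
            continue  # words never seen in training are ignored
        count = word_counts[label][word]
        score += math.log((count + alpha) / (total_words + alpha * len(vocab)))
    return score


word_counts, doc_counts, vocab = train_counts(corpus)
message = "win free money"
scores = {
    label: log_score(message, label, word_counts, doc_counts, vocab)
    for label in ("Spam", "Ham")
}
predicted = max(scores, key=scores.get)  # class with the highest log score
print(scores)
print("Predicted:", predicted)
```

The smoothing term `alpha` keeps a word that never appeared in one class from driving that class's probability to zero.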

## 🛠️ Example: Spam Email Classifier Pipeline

This example walks through four steps:
1. Convert emails to numbers (vectorization)
2. Split data into train/test
3. Train Naive Bayes model
4. Predict and evaluate


In [None]:
"""
Spam Email Classifier using Scikit-learn
----------------------------------------
This program classifies emails as Spam or Ham
using text vectorization and Naive Bayes.

Author: AI Course
"""

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report


def main():
    print("SPAM EMAIL CLASSIFIER")
    print("----------------------")

    # Dataset
    emails = [
        "Win money now",
        "Limited offer win big",
        "Win a free prize now",
        "Meeting tomorrow at office",
        "Project discussion scheduled",
        "Let us plan the meeting"
    ]

    labels = [
        "Spam",
        "Spam",
        "Spam",
        "Ham",
        "Ham",
        "Ham"
    ]

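    # Convert text to numerical features (word counts).
    # Simplification: the vocabulary is learned from all six emails here;
    # in a real project, fit the vectorizer on the training split only.
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(emails)

    # Train-Test Split: stratify so the 2-email test set holds one Spam and one Ham
    X_train, X_test, y_train, y_test = train_test_split(
        X, labels, test_size=0.33, random_state=42, stratify=labels
    )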

    # Train model
    model = MultinomialNB()
    model.fit(X_train, y_train)

    # Predictions
    predictions = model.predict(X_test)

    # Evaluation
    print("\nAccuracy:", accuracy_score(y_test, predictions))
    print("\nConfusion Matrix:\n", confusion_matrix(y_test, predictions))
    print("\nClassification Report:\n", classification_report(y_test, predictions))

    # Predict new email
    new_email = ["Win free money now"]
    new_email_vector = vectorizer.transform(new_email)
    new_prediction = model.predict(new_email_vector)

    print("\nNew Email:", new_email[0])
    print("Prediction:", new_prediction[0])


if __name__ == "__main__":
    main()
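
A possible variation on the program above, offered as a sketch and not as the course's own method: scikit-learn's `make_pipeline` can chain the vectorizer and the classifier into one object. Splitting the raw text first and then fitting the pipeline means the vocabulary is learned from the training emails only. The name `spam_pipeline` is an assumption for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

emails = [
    "Win money now",
    "Limited offer win big",
    "Win a free prize now",
    "Meeting tomorrow at office",
    "Project discussion scheduled",
    "Let us plan the meeting",
]
labels = ["Spam", "Spam", "Spam", "Ham", "Ham", "Ham"]

# Split the raw text so the test emails stay unseen during fitting
emails_train, emails_test, y_train, y_test = train_test_split(
    emails, labels, test_size=0.33, random_state=42, stratify=labels
)

# The pipeline vectorizes and classifies in one step
spam_pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
spam_pipeline.fit(emails_train, y_train)

test_accuracy = spam_pipeline.score(emails_test, y_test)
new_prediction = spam_pipeline.predict(["Win free money now"])[0]

print("Test accuracy:", test_accuracy)
print("Prediction:", new_prediction)
```

Because the pipeline accepts raw strings, new emails can be passed to `predict` directly without a separate `transform` call.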

## 📊 Understanding Accuracy & Evaluation Metrics

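- **Accuracy:** Fraction of predictions that are correct. It can be misleading when classes are imbalanced: if only 10% of emails are spam, a model that labels every email as Ham still scores 90%.
- **Confusion Matrix:** Table of actual versus predicted labels. It shows how many emails were classified correctly and which mistakes were made: Ham flagged as Spam (false positives) and Spam that slipped through (false negatives).
- **Classification Report:** Lists precision, recall, and F1-score for each class. Precision is the share of emails flagged as Spam that really are Spam; recall is the share of actual Spam that was caught; F1 is the harmonic mean of the two.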
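
The metrics above can be computed by hand from the four cells of a confusion matrix. The sketch below uses eight made-up labels and predictions (hypothetical data, not output from the classifier) and treats Spam as the positive class.

```python
# Hypothetical true labels and predictions for eight emails (made up for illustration)
y_true = ["Spam", "Spam", "Spam", "Ham", "Ham", "Ham", "Ham", "Ham"]
y_pred = ["Spam", "Spam", "Ham", "Spam", "Ham", "Ham", "Ham", "Ham"]

pairs = list(zip(y_true, y_pred))

# Confusion-matrix cells, with Spam as the positive class
tp = sum(1 for t, p in pairs if t == "Spam" and p == "Spam")  # spam caught
fp = sum(1 for t, p in pairs if t == "Ham" and p == "Spam")   # ham flagged as spam
fn = sum(1 for t, p in pairs if t == "Spam" and p == "Ham")   # spam missed
tn = sum(1 for t, p in pairs if t == "Ham" and p == "Ham")    # ham kept

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```

Here accuracy is 0.75 while precision and recall for Spam are both about 0.67, which shows how a single accuracy figure can hide weaker performance on one class.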

**Why do we need these?**
- To measure how well the model works
- To identify areas for improvement
- To ensure the model is reliable before using it in real-world scenarios