<div align="center">

# Demystifying AI - Session 0
## Programming Paradigms in AI: From Classical ML to Deep Learning


### Pate Motter, PhD  

AI Performance Engineer @ Google

[LinkedIn](https://www.linkedin.com/in/patemotter/) | [GitHub](https://github.com/patemotter)

---

</div>

## About This Notebook
This notebook provides a practical comparison of three major programming paradigms in AI development:
1. Traditional Programming
2. Classical Machine Learning
3. Deep Learning

We'll implement spam detection using each approach, highlighting the strengths, weaknesses, and key differences between these paradigms.

---

## Getting Started
This notebook is designed to run in Google Colab. You will need:
- A Google account
- About 60 minutes to go through the material

NOTES:
1. This colab is designed to run in the free tier of Google Colab.
2. You are free to take this notebook and do whatever you want with it.

Follow the instructions below to run this Colab:

<details>
<summary>1. Click Runtime -> Change runtime type</summary>

![Screenshot](https://drive.google.com/uc?export=view&id=13tysKrMzwMkGRQo8qmll1-YvUeabQEh5)

</details>

<details>
<summary>2. Change selection to CPU</summary>

For this notebook, we'll use the CPU runtime as we don't need GPU acceleration.

</details>

<details>
<summary>3. Click Runtime -> Run all</summary>

![Screenshot](https://drive.google.com/uc?export=view&id=1q0X-Rtzt3KgOnGPiM_uyGbA_kSFj4mlG)

</details>

---

## What You'll Learn in this Notebook
This interactive notebook will teach you:
- How different programming paradigms approach the same problem
- When to use each approach
- The evolution from rules to learning
- Practical implementation differences

----

# Set up the environment

In [None]:
# Install required packages
!pip install -q torch scikit-learn nltk pandas numpy matplotlib seaborn

In [None]:
# Import required libraries
import warnings
warnings.filterwarnings('ignore')

import torch
import nltk
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

from google.colab import data_table
data_table.enable_dataframe_formatter()

# Download required NLTK data
nltk.download('punkt')
nltk.download('stopwords')

### Load Sample Data

We'll use a sample dataset of spam and non-spam messages for our examples.

In [None]:
import pandas as pd

# Sample data creation
spam_texts = [
    "CONGRATULATIONS! You've WON $1,000,000! Click here to claim: www.claim-prize.com",
    "FREE MONEY! Limited time offer! Visit now: www.free-cash.com!!!",
    "You are the lucky winner of our daily prize! Send your details NOW!",
    "Get RICH Quick! 100% Guaranteed! Click here: www.get-rich.com",
    "URGENT: Your bank account is at risk! Verify now: www.secure-bank.com",
    "Lose weight FAST with this miracle pill! Order now!",
    "You've been selected for a FREE vacation! Claim here: www.free-trip.com",
    "Make $$$ from home! Easy work, high pay. Apply now.",
    "Your package has been delayed. Track its status: www.track-package.com",
    "Your credit card has been charged. Call this number if it wasn't you.",
    "Hot singles in your area! Chat now: www.dating-site.com",
    "Pre-approved for a loan with 0% interest! Apply today.",
    "Your lottery ticket is a winner! Claim your prize here.",
    "Invest in this once-in-a-lifetime opportunity and become a millionaire!",
    "Secret to eternal youth discovered! Learn more here.",
    "Exclusive offer: 90% off on all designer brands!",
    "Your account has been compromised. Please reset your password: www.account-reset.com",
    "You've won a gift card! Redeem it now.",
    "Double your money in 24 hours! Guaranteed returns.",
    "Eliminate debt with this one simple trick!"
]

ham_texts = [
    "Hi, can we meet at 3pm tomorrow to discuss the project?",
    "Remember to pick up milk on your way home",
    "The meeting has been rescheduled to next Monday",
    "Great work on the presentation yesterday!",
    "Don't forget to submit your report by Friday",
    "Can you send me the meeting minutes from last week?",
    "What time is the team lunch today?",
    "I'll be out of the office next week. Please contact Sarah for urgent matters.",
    "Have you seen the latest project proposal?",
    "Let's grab coffee and catch up soon.",
    "The deadline for the proposal is approaching. Please review the document.",
    "Did you receive my email about the budget update?",
    "Please confirm your attendance for the training session.",
    "The client called and wants to schedule another meeting.",
    "What's the status of the marketing campaign?",
    "I've attached the revised contract. Please take a look.",
    "Can you help me with this technical issue?",
    "Reminder: Team building activity this Friday.",
    "How's the new project going?",
    "Just wanted to say thank you for your help."
]

# Create DataFrame
data = pd.DataFrame({
    'text': spam_texts + ham_texts,
    'true_label': ['spam'] * len(spam_texts) + ['ham'] * len(ham_texts)
})

data

# 1. Traditional Programming

## What You'll Learn in This Section
- How rule-based systems work
- Writing explicit logic for spam detection
- Advantages and limitations of hard-coded rules
- When this approach makes sense

## What is Traditional Programming?
In traditional programming, we explicitly write rules that define what spam looks like. For example:
- Contains specific keywords
- Uses all caps
- Has many exclamation marks
- Contains suspicious URLs

In [None]:
def is_spam_traditional(text):
    # Convert to lowercase for consistent checking
    text = text.lower()

    # Define spam indicators
    spam_keywords = ['won', 'winner', 'cash', 'prize', 'money', 'click', 'free', 'urgent', 'now']
    suspicious_patterns = {
        "exclamation_marks": text.count('!') > 2,  # Too many exclamation marks
        "dollar_signs": text.count('$') > 1,  # Multiple dollar signs
        "keywords_matched": any(word in text for word in spam_keywords),  # Contains spam keywords
        "urls": 'www.' in text or 'http' in text,  # Contains URLs
    }

    # If at least 2 patterns match, classify as spam
    if sum(suspicious_patterns.values()) >= 2:
        return 'spam'
    else:
        return 'ham'

# Run the rules on all of the text examples
data['traditional_label'] = data['text'].apply(is_spam_traditional)
data

Great, but now how do we measure the success of our new spam filter?

We can use a few standard statistical measures to answer this. scikit-learn can produce something called a "Classification Report" (the `classification_report` function imported above) that provides some of these key insights.

## How to read a classification report:
---

### Precision
When our filter says an email is spam, how often is it actually spam?
* High precision: Our filter is very picky and only flags emails as spam if it's very sure. You can trust its "spam" label.
* Low precision: Our filter flags a lot of emails as spam, but many of them are actually legitimate.
---

### Recall
Out of all the actual spam emails, how many does our filter successfully catch?
* High recall: Our filter is very good at catching almost all the spam, even if it sometimes makes mistakes.
* Low recall: Our filter misses a lot of spam, and those spam emails end up in the inbox.
---

### F1-Score

The harmonic mean of precision and recall. It is high only when both are high, which makes it useful when you want to weigh false positives and false negatives equally.

---

### Accuracy

The overall percentage of emails that our filter classified correctly (both spam and ham).
* Accuracy is a general measure of correctness, but it can be misleading if you have a lot more of one type of email than the other.

---
### Support
* This simply tells you how many emails of each type (spam or ham) were in your test set.
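
To make these definitions concrete, here is a minimal sketch that computes the same quantities by hand, treating "spam" as the positive class. The helper name `spam_metrics` and the toy labels are made up for illustration; they are not the notebook's data.

```python
def spam_metrics(y_true, y_pred, positive="spam"):
    """Compute precision, recall, F1 and accuracy for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # spam correctly flagged
    fp = sum(t != positive and p == positive for t, p in pairs)  # ham wrongly flagged
    fn = sum(t == positive and p != positive for t, p in pairs)  # spam that slipped through
    correct = sum(t == p for t, p in pairs)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    accuracy = correct / len(pairs)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# Hypothetical labels: 4 spam and 6 ham messages
y_true = ["spam"] * 4 + ["ham"] * 6
y_pred = ["spam", "spam", "ham", "ham"] + ["ham"] * 5 + ["spam"]

metrics = spam_metrics(y_true, y_pred)
print(metrics)
```

In this toy case the filter flags three messages, two of them correctly, so precision is 2/3, while it catches only two of the four spam messages, so recall is 1/2.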


In [None]:
print("Traditional programming classification report:\n")
traditional_cr = classification_report(data['true_label'], data['traditional_label'], labels=["spam", "ham"])
print(traditional_cr)

So what can we take away from these results?

Spam:
1. Precision=1.0: When the filter predicts "spam," it is always correct.
2. Recall≈0.3: The filter catches only about a third of the actual spam. The rest is misclassified as ham.

Ham:
1. Precision≈0.6: When the filter predicts "ham," it is correct only about 60% of the time. The remaining "ham" predictions are actually spam.
2. Recall=1.00: The filter correctly identifies all the actual ham emails.

The filter never flags a legitimate email (perfect ham recall, perfect spam precision), but this comes at the cost of very low spam recall: most of the actual spam gets through.

## Analysis of Traditional Programming

### Advantages
- Simple to implement
- No training data needed
- Fast execution
- Easy to modify rules
- Completely transparent decision-making

### Limitations
- Cannot handle unseen patterns
- Requires manual rule updates
- Rules may conflict
- Cannot learn from mistakes
- Difficult to maintain as rules grow
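
The first limitation is easy to demonstrate. The sketch below uses a deliberately tiny, hypothetical keyword rule (`keyword_rule` and `SPAM_KEYWORDS` are illustrative names, not the filter defined above) to show how a small change in spelling slips past a fixed rule, while a harmless message trips it.

```python
# Hypothetical keyword list for illustration only
SPAM_KEYWORDS = ["free", "money", "winner"]

def keyword_rule(text):
    """Return True if any spam keyword occurs as a substring of the text."""
    text = text.lower()
    return any(keyword in text for keyword in SPAM_KEYWORDS)

print(keyword_rule("FREE MONEY inside"))     # matched by the rule
print(keyword_rule("FR33 M0NEY inside"))     # obfuscated spelling is not matched
print(keyword_rule("Feel free to call me"))  # legitimate message is matched anyway
```

Each new evasion or false alarm needs another hand-written rule, which is how rule sets grow hard to maintain.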

---

# 2. Classical Machine Learning

## What You'll Learn in This Section

*   How traditional ML approaches text classification
*   Feature extraction techniques, including TF-IDF
*   Training a basic classifier (Naive Bayes)
*   Advantages over rule-based systems
*   How TF-IDF and Naive Bayes work together for text classification

## What is Classical Machine Learning?

Instead of writing explicit rules to classify text (like in rule-based systems), classical machine learning takes a different approach:

1.  **Convert text to numbers (features):** We transform text data into a numerical representation that the machine learning model can understand.
2.  **Train a model on examples:** We provide the model with a dataset of labeled examples (e.g., spam and ham emails with their corresponding labels).
3.  **Let the model find patterns:** The model learns the statistical relationships between the numerical features and the labels, effectively discovering patterns that distinguish between different classes of text.
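
As a rough illustration of step 1, the simplest way to turn text into numbers is to count words. This is a minimal sketch with made-up messages; `to_count_vector` is a hypothetical helper, and real vectorizers also handle punctuation, casing and more.

```python
from collections import Counter

messages = ["free money now", "meeting at noon", "free lunch at noon"]

# Build a sorted vocabulary from every word seen in the messages
vocabulary = sorted({word for message in messages for word in message.split()})

def to_count_vector(message):
    """Represent a message as word counts, one position per vocabulary word."""
    counts = Counter(message.split())
    return [counts[word] for word in vocabulary]

vectors = [to_count_vector(message) for message in messages]

print(vocabulary)
for message, vector in zip(messages, vectors):
    print(vector, message)
```

Every message becomes a vector of the same length, which is the form a classifier can be trained on in steps 2 and 3.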

## Implementation

We'll use a basic ML pipeline with:

*   **TF-IDF vectorization for feature extraction:**  We'll use TF-IDF (Term Frequency-Inverse Document Frequency) to convert text into numerical vectors.
*   **Naive Bayes classifier for prediction:** We'll employ a Naive Bayes algorithm to classify the text based on the learned patterns.




### TF-IDF (Term Frequency-Inverse Document Frequency)

**What it is:**
TF-IDF is a numerical statistic that reflects how important a word is to a document within a collection of documents (a corpus). It represents text as numerical vectors, which most machine learning algorithms require.

For our spam filter, TF-IDF turns each email into a vector of word weights. Words that are concentrated in a few messages (e.g., "free," "money," "urgent") receive higher weights in the emails that contain them, while words that appear across most emails receive lower weights.

**How it works:**

TF-IDF calculates a weight for each word in each document based on two factors:

*   **Term Frequency (TF):** How often a word appears in a specific document. A higher TF suggests the word is more important to that document.

    *   *Formula (one common variation):*  `TF(t, d) = (Number of times term t appears in document d) / (Total number of terms in document d)`

*   **Inverse Document Frequency (IDF):** How common or rare a word is across the entire corpus. Words that appear in many documents get a lower IDF score, while words that appear in only a few documents get a higher IDF score. This helps to give more weight to distinctive words.

    *   *Formula (one common variation):* `IDF(t) = log_e(Total number of documents / Number of documents with term t in it)`

*   **TF-IDF Score:** The TF-IDF score for a word in a document is calculated by multiplying its TF and IDF scores.

    *   *Formula:* `TF-IDF(t, d) = TF(t, d) * IDF(t)`
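
The three formulas above can be checked by hand on a tiny corpus. This is a sketch of exactly those textbook variants, using a made-up four-message corpus. Note that scikit-learn's `TfidfVectorizer` uses a smoothed IDF and normalizes each vector by default, so its numbers will differ from this hand calculation.

```python
import math

def tf(term, doc_tokens):
    """Share of the document's tokens that are `term`."""
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, corpus_tokens):
    """Natural log of (number of documents / documents containing `term`).

    Assumes the term occurs in at least one document.
    """
    docs_with_term = sum(term in doc for doc in corpus_tokens)
    return math.log(len(corpus_tokens) / docs_with_term)

def tf_idf(term, doc_tokens, corpus_tokens):
    return tf(term, doc_tokens) * idf(term, corpus_tokens)

# Hypothetical corpus for illustration
corpus = [
    "free money now",
    "free vacation offer",
    "meeting at noon",
    "project meeting notes now",
]
corpus_tokens = [doc.split() for doc in corpus]

first_doc = corpus_tokens[0]
for term in first_doc:
    print(term, round(tf_idf(term, first_doc, corpus_tokens), 3))
```

In the first message, "money" appears in only one of the four documents, so it receives a higher weight than "free" or "now," which each appear in two.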

**Why it's useful:**

*   **Transforms text into numbers:** Machine learning models generally work with numerical data, not raw text. TF-IDF converts text into numerical vectors, where each dimension corresponds to a word in the vocabulary and the value represents the word's importance (TF-IDF weight).
*   **Highlights important words:** TF-IDF gives higher weights to words that are frequent in a specific document but relatively rare in the overall corpus. This helps to identify words that are likely to be more relevant to the meaning of that document.
*   **Reduces the impact of common words:** Common words like "the," "a," and "is" often appear in many documents and don't carry much specific meaning. IDF helps to downweight these words, preventing them from dominating the representation.


### Naive Bayes

**What it is:**

Naive Bayes is a probabilistic classification algorithm based on Bayes' theorem. It's called "naive" because it makes a simplifying assumption that all features (in this case, words) are independent of each other, which is often not true in reality but can still work surprisingly well in practice.

**How it works (for text classification):**

1.  **Training Phase:**
    *   The algorithm calculates the prior probability of each class (e.g., the probability of an email being spam or ham based on the training data).
    *   For each word in the vocabulary, it calculates the conditional probability of that word given each class (e.g., the probability of the word "free" appearing in a spam email, and the probability of it appearing in a ham email). These probabilities are often estimated using the frequency of the words in the training data.

2.  **Prediction Phase:**
    *   When a new email comes in, it's converted into a numerical vector using TF-IDF (or another method like `CountVectorizer`).
    *   The algorithm then uses Bayes' theorem to calculate the posterior probability of each class given the words in the email (and their TF-IDF weights).
    *   It classifies the email into the class with the highest posterior probability.

**Bayes' Theorem (Simplified):**

`P(Class | Words) = [P(Words | Class) * P(Class)] / P(Words)`

*   `P(Class | Words)`: The probability that the email belongs to a specific class (spam or ham) given the words in the email.
*   `P(Words | Class)`: The probability of observing those words given that the email is of a specific class (calculated during training).
*   `P(Class)`: The prior probability of that class (calculated during training).
*   `P(Words)`: The probability of observing those words (often ignored for comparison, as it's the same for all classes).
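
A small hand-rolled version can make the two phases and the formula tangible. The sketch below is an assumption-laden illustration, not scikit-learn's implementation: it works on raw word counts rather than TF-IDF weights, uses add-one (Laplace) smoothing, skips words never seen in training, and uses a made-up four-message training set.

```python
import math
from collections import Counter

# Hypothetical training data
train = [
    ("free money now", "spam"),
    ("claim free prize", "spam"),
    ("meeting at noon", "ham"),
    ("project meeting notes", "ham"),
]

# Training phase: count messages per class and words per class
class_counts = Counter(label for _, label in train)
word_counts = {label: Counter() for label in class_counts}
for text, label in train:
    word_counts[label].update(text.split())
vocab = {word for counts in word_counts.values() for word in counts}

def log_posterior(text, label):
    """log P(Class) + sum of log P(word | Class). P(Words) is dropped
    because it is the same for every class."""
    log_prob = math.log(class_counts[label] / len(train))  # prior
    total_words = sum(word_counts[label].values())
    for word in text.split():
        if word not in vocab:
            continue  # ignore words never seen in training
        # Add-one smoothing keeps unseen word/class pairs from zeroing the product
        likelihood = (word_counts[label][word] + 1) / (total_words + len(vocab))
        log_prob += math.log(likelihood)
    return log_prob

def classify(text):
    """Prediction phase: pick the class with the highest (log) posterior."""
    return max(class_counts, key=lambda label: log_posterior(text, label))

print(classify("free prize meeting"))
print(classify("project meeting at noon"))
```

Working in log space turns the product of many small probabilities into a sum, which avoids numerical underflow on longer messages.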

**Why it's useful for text classification:**

*   **Simple and Efficient:** Naive Bayes is relatively simple to implement and computationally efficient, especially for high-dimensional data like text.
*   **Works Well with Text:** Despite the "naive" independence assumption, it often performs surprisingly well for text classification tasks.
*   **Good with Limited Data:** It can perform reasonably well even with relatively small datasets, making it a good choice when you don't have a massive amount of training data.

### How TF-IDF and Naive Bayes Work Together

1.  **TF-IDF creates the input:** You use TF-IDF to transform your raw text data (emails) into numerical vectors. Each email is represented by a vector where each element corresponds to the TF-IDF weight of a word in the vocabulary.
2.  **Naive Bayes uses the TF-IDF vectors:** The Naive Bayes classifier is trained on these TF-IDF vectors and their corresponding labels (spam or ham). It learns the probabilities of words (and their weights) given each class.
3.  **Classification:** When a new email arrives, it's first converted into a TF-IDF vector, and then the Naive Bayes classifier uses the learned probabilities to predict the class (spam or ham) to which the email most likely belongs.

**In essence:**

*   TF-IDF provides a meaningful numerical representation of the text data.
*   Naive Bayes uses these numerical representations to learn a probabilistic model for classifying text.

By combining these two techniques, you can build a relatively simple yet effective text classification system, such as your spam filter.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

class SpamClassifierML:
    def __init__(self):
        # Create a pipeline with vectorizer and classifier
        self.pipeline = Pipeline([
            ('vectorizer', TfidfVectorizer(max_features=1000)),
            ('classifier', MultinomialNB())
        ])

    def train(self, texts, labels):
        """Train the classifier on the provided texts and labels"""
        self.pipeline.fit(texts, labels)

    def predict(self, texts):
        """Predict labels for the provided texts"""
        return self.pipeline.predict(texts)

# Create train/test split using the full dataset
X_train, X_test, y_train, y_test = train_test_split(
    data['text'], data['true_label'], test_size=0.2, random_state=42
)

# Train the ML model
ml_classifier = SpamClassifierML()
ml_classifier.train(X_train, y_train)

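# Predict on the held-out test set (messages the model did not see during training)
ml_predictions_test = ml_classifier.predict(X_test)
print("\nMachine Learning Results (Test Set):\n")
print(classification_report(y_test, ml_predictions_test))

# Also predict on the entire dataset so the labels can be compared row by row.
# This includes the training messages, so these scores are optimistic.
ml_predictions_all = ml_classifier.predict(data['text'])
data['ml_label'] = ml_predictions_all

print("\nMachine Learning Results (Full Dataset):\n")
print(classification_report(data['true_label'], data['ml_label']))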
data

## Analysis of Machine Learning

### Advantages
- Learns from data
- Can handle new patterns
- Relatively simple to implement
- Fast training and inference
- Works well with limited data

### Limitations
- Requires good feature engineering
- May miss complex patterns
- Limited by feature design
- Cannot handle very long-range dependencies

---

# 3. Deep Learning

## What You'll Learn in This Section

*   The basics of deep learning for text classification
*   How to build a neural network using PyTorch
*   Key components of a deep learning model:
    *   Neural network architecture
    *   Loss function
    *   Optimizer
*   Training and evaluating a deep learning model
*   Data augmentation techniques for text
*   Why deep learning needs a lot of data

## What is Deep Learning?

Deep learning is a subset of machine learning that uses artificial neural networks with multiple layers (hence "deep") to learn complex patterns from data. Unlike classical machine learning, where we often manually engineer features (like with TF-IDF), deep learning models can automatically learn hierarchical representations of data.

**Neural Networks:**

Neural networks are inspired by the structure of the human brain. They consist of interconnected nodes (neurons) organized in layers. Each connection has a weight, which is adjusted during training.

*   **Input Layer:** Receives the input data (e.g., numerical representation of text).
*   **Hidden Layers:** Multiple layers that perform computations on the input and learn increasingly complex features.
*   **Output Layer:** Produces the final prediction (e.g., probability of spam or ham).
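
To show what "layers of weighted connections" means in practice, here is a minimal forward pass written in plain Python. The weights are hand-picked, hypothetical values (a real network learns them during training), and the layer sizes are far smaller than in the model built later in this section.

```python
import math

def relu(values):
    """Keep positive values, replace negatives with zero."""
    return [max(0.0, v) for v in values]

def sigmoid(value):
    """Squash a number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-value))

def dense(inputs, weights, biases):
    """Fully connected layer: each output is a weighted sum of all inputs plus a bias."""
    return [sum(w * x for w, x in zip(row, inputs)) + b for row, b in zip(weights, biases)]

# Hypothetical weights: 3 inputs -> 2 hidden neurons -> 1 output
hidden_weights = [[1.0, -1.0, 0.5], [-0.5, 1.0, 1.0]]
hidden_biases = [0.0, 0.0]
output_weights = [[2.0, -2.0]]
output_biases = [0.0]

def forward(inputs):
    hidden = relu(dense(inputs, hidden_weights, hidden_biases))  # hidden layer
    logit = dense(hidden, output_weights, output_biases)[0]      # output layer
    return sigmoid(logit)                                        # probability-like score

print(forward([1.0, 0.0, 1.0]))
print(forward([0.0, 1.0, 0.0]))
```

Training consists of adjusting the numbers in the weight and bias lists so that the output moves closer to the correct label for each example.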

## Data Augmentation for Text

Since our dataset is small, we'll use data augmentation to artificially increase its size and improve the model's ability to generalize. Common text augmentation techniques include:

*   **Synonym Replacement:** Replacing words with their synonyms.
*   **Random Deletion:** Randomly removing words from a sentence.
*   **Random Swap:** Randomly swapping the positions of words in a sentence.
*   **Random Insertion:** Randomly inserting new words (often synonyms of existing words) into a sentence.
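
As an illustration of the first technique, here is a minimal sketch of synonym replacement. It assumes a small hand-written synonym table (`SYNONYMS` is hypothetical) so that it runs without any downloads; a lexical database such as WordNet could supply the synonyms instead.

```python
import random

# Hypothetical hand-written synonym table for illustration
SYNONYMS = {
    "free": ["complimentary", "gratis"],
    "prize": ["reward", "award"],
    "meeting": ["appointment", "session"],
}

def synonym_replacement(words, n=1, rng=random):
    """Replace up to n words that have an entry in SYNONYMS with a random synonym."""
    new_words = words.copy()
    candidates = [i for i, word in enumerate(words) if word.lower() in SYNONYMS]
    for idx in rng.sample(candidates, min(n, len(candidates))):
        new_words[idx] = rng.choice(SYNONYMS[words[idx].lower()])
    return new_words

rng = random.Random(0)  # seeded so repeated runs give the same result
original = "claim your free prize today".split()
print(" ".join(synonym_replacement(original, n=1, rng=rng)))
```

The augmented sentence keeps its length and word order, so the label ("spam" or "ham") can reasonably be carried over unchanged.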

## Implementation

Our deep learning model will use:

*   **PyTorch:** A popular deep learning framework.
*   **`CountVectorizer`:** To convert text into numerical vectors (simpler than TF-IDF for this example).
*   **A simple feedforward neural network:** With a few hidden layers.
*   **Data augmentation:** To increase the size of our training set.

In [None]:
import random
import nltk
from nltk.corpus import wordnet
import pandas as pd
from sklearn.model_selection import train_test_split
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report

# Download NLTK data (if you haven't already)
nltk.download('wordnet')
nltk.download('omw-1.4')

# --- Data Augmentation Functions ---

def get_synonyms(word):
    """Get synonyms for a word using WordNet."""
    synonyms = set()
    for syn in wordnet.synsets(word):
        for lemma in syn.lemmas():
            synonym = lemma.name().replace("_", " ").replace("-", " ").lower()
            synonyms.add(synonym)
    if word in synonyms:
        synonyms.remove(word)
    return list(synonyms)

def random_deletion(words, p=0.2):
    """Randomly delete words from a sentence with probability p."""
    if len(words) <= 1:  # nothing to delete from an empty or single-word sentence
        return words
    new_words = []
    for word in words:
        if random.uniform(0, 1) > p:
            new_words.append(word)
    if len(new_words) == 0:
        return [random.choice(words)]
    return new_words

def random_swap(words, n=2):
    """Randomly swap n pairs of words in a sentence."""
    new_words = words.copy()
    for _ in range(n):
        if len(words) < 2:
            break
        idx1, idx2 = random.sample(range(len(words)), 2)
        new_words[idx1], new_words[idx2] = new_words[idx2], new_words[idx1]
    return new_words

def random_insertion(words, n=1):
    """Randomly insert n words into a sentence."""
    new_words = words.copy()
    for _ in range(n):
        if not words:
            continue
        random_word = random.choice(words)
        synonyms = get_synonyms(random_word)
        if synonyms:
            new_word = random.choice(synonyms)
            insert_idx = random.randint(0, len(new_words))
            new_words.insert(insert_idx, new_word)
    return new_words

def augment_text(text, p_del=0.2, n_swap=1, n_ins=1):
    """Apply augmentations to a text."""
    words = text.split()
    augmented_texts = [text]
    augmented_texts.append(" ".join(random_deletion(words, p=p_del)))
    augmented_texts.append(" ".join(random_swap(words, n=n_swap)))
    augmented_texts.append(" ".join(random_insertion(words, n=n_ins)))
    return augmented_texts

# --- Your Original Data ---
spam_texts = [
    "CONGRATULATIONS! You've WON $1,000,000! Click here to claim: www.claim-prize.com",
    "FREE MONEY! Limited time offer! Visit now: www.free-cash.com!!!",
    "You are the lucky winner of our daily prize! Send your details NOW!",
    "Get RICH Quick! 100% Guaranteed! Click here: www.get-rich.com",
    "URGENT: Your bank account is at risk! Verify now: www.secure-bank.com",
    "Lose weight FAST with this miracle pill! Order now!",
    "You've been selected for a FREE vacation! Claim here: www.free-trip.com",
    "Make $$$ from home! Easy work, high pay. Apply now.",
    "Your package has been delayed. Track its status: www.track-package.com",
    "Your credit card has been charged. Call this number if it wasn't you.",
    "Hot singles in your area! Chat now: www.dating-site.com",
    "Pre-approved for a loan with 0% interest! Apply today.",
    "Your lottery ticket is a winner! Claim your prize here.",
    "Invest in this once-in-a-lifetime opportunity and become a millionaire!",
    "Secret to eternal youth discovered! Learn more here.",
    "Exclusive offer: 90% off on all designer brands!",
    "Your account has been compromised. Please reset your password: www.account-reset.com",
    "You've won a gift card! Redeem it now.",
    "Double your money in 24 hours! Guaranteed returns.",
    "Eliminate debt with this one simple trick!"
]

ham_texts = [
    "Hi, can we meet at 3pm tomorrow to discuss the project?",
    "Remember to pick up milk on your way home",
    "The meeting has been rescheduled to next Monday",
    "Great work on the presentation yesterday!",
    "Don't forget to submit your report by Friday",
    "Can you send me the meeting minutes from last week?",
    "What time is the team lunch today?",
    "I'll be out of the office next week. Please contact Sarah for urgent matters.",
    "Have you seen the latest project proposal?",
    "Let's grab coffee and catch up soon.",
    "The deadline for the proposal is approaching. Please review the document.",
    "Did you receive my email about the budget update?",
    "Please confirm your attendance for the training session.",
    "The client called and wants to schedule another meeting.",
    "What's the status of the marketing campaign?",
    "I've attached the revised contract. Please take a look.",
    "Can you help me with this technical issue?",
    "Reminder: Team building activity this Friday.",
    "How's the new project going?",
    "Just wanted to say thank you for your help."
]

# --- Create Augmented DataFrame ---
# Use a separate name so the `data` DataFrame from the earlier sections is not overwritten
augmented_rows = []

for text in spam_texts:
    for aug_text in augment_text(text):
        augmented_rows.append({'text': aug_text, 'label': 'spam'})

for text in ham_texts:
    for aug_text in augment_text(text):
        augmented_rows.append({'text': aug_text, 'label': 'ham'})

df = pd.DataFrame(augmented_rows)

# --- Split Data ---
X_train, X_test, y_train, y_test = train_test_split(df['text'], df['label'], test_size=0.2, random_state=42)

# --- Text Dataset Class ---
class TextDataset(Dataset):
    def __init__(self, texts, labels, vectorizer):
        self.texts = texts
        self.labels = labels
        self.vectorizer = vectorizer
        self.label_map = {label: i for i, label in enumerate(set(labels))}

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.texts.iloc[idx] if isinstance(self.texts, pd.Series) else self.texts[idx]
        label = self.labels.iloc[idx] if isinstance(self.labels, pd.Series) else self.labels[idx]
        if isinstance(label, str):
            label = self.label_map[label]
        vector = torch.tensor(self.vectorizer.transform([text]).toarray()[0])
        return vector.float(), torch.tensor(label).float()

# --- Spam Classifier Model ---
class SpamClassifierDL(nn.Module):
    def __init__(self, input_dim):
        super().__init__()
        self.layer1 = nn.Linear(input_dim, 64)
        self.layer2 = nn.Linear(64, 16)
        self.layer3 = nn.Linear(16, 1)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(0.2)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        x = self.relu(self.layer1(x))
        x = self.dropout(x)
        x = self.relu(self.layer2(x))
        x = self.dropout(x)
        x = self.sigmoid(self.layer3(x))
        return x

# --- Initialize Vectorizer and Create Datasets ---
vectorizer = CountVectorizer(max_features=1000)
vectorizer.fit(X_train)
vocab_size = len(vectorizer.vocabulary_)
print(f"Vocabulary size: {vocab_size}")

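train_dataset = TextDataset(X_train, y_train, vectorizer)
test_dataset = TextDataset(X_test, y_test, vectorizer)
# Reuse the training label encoding so both datasets map 'spam'/'ham' to the same integers
# (each TextDataset otherwise builds its own map from an unordered set)
test_dataset.label_map = train_dataset.label_map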

# --- Create Data Loaders ---
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=32)

# --- Initialize Model, Criterion, and Optimizer ---
model = SpamClassifierDL(input_dim=vocab_size)
criterion = nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters())

# --- Train the Model ---
for epoch in range(30):
    model.train()
    for inputs, labels in train_loader:
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels.unsqueeze(1))
        loss.backward()
        optimizer.step()

# --- Evaluate on Test Set ---
model.eval()
predictions = []
actuals = []

with torch.no_grad():
    for inputs, labels in test_loader:
        outputs = model(inputs)
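        # squeeze(1) drops only the output dimension, so a batch of size 1 stays a list
        predicted = (outputs.squeeze(1) >= 0.5).float()
        predictions.extend(predicted.tolist())
        actuals.extend(labels.tolist())

# Decode integer class ids with the inverse of the training label encoding
label_map = {i: label for label, i in train_dataset.label_map.items()}
predictions = [label_map[int(p)] for p in predictions]
actuals = [label_map[int(a)] for a in actuals]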

print("\nDeep Learning Results (Test Set):\n")
print(classification_report(actuals, predictions))

# --- Evaluate on Full Dataset ---
model.eval()
all_predictions = []
all_actuals = []

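full_dataset = TextDataset(df['text'], df['label'], vectorizer)
# Reuse the training label encoding so the integer labels match what the model learned
full_dataset.label_map = train_dataset.label_map
full_loader = DataLoader(full_dataset, batch_size=32)

with torch.no_grad():
    for inputs, labels in full_loader:
        outputs = model(inputs)
        predicted = (outputs.squeeze(1) >= 0.5).float()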
        all_predictions.extend(predicted.tolist())
        all_actuals.extend(labels.tolist())

all_predictions = [label_map[int(p)] for p in all_predictions]
all_actuals = [label_map[int(a)] for a in all_actuals]

print("\nDeep Learning Results (Full Dataset):\n")
print(classification_report(all_actuals, all_predictions))

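We did it, gang! We built the perfect email filter.

1.00s in every category, what could possibly go wrong?

Actually, a lot. This is called overfitting. We trained on a very small dataset, and the model has essentially memorized it. The scores are also inflated for two reasons: the data was augmented before it was split, so near-copies of the training messages appear in the test set, and the full-dataset report includes the training examples themselves. We need genuinely new data to test our model.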
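
One contributor to scores like these is that the augmentation code in this notebook runs before `train_test_split`, so variants of the same original message can land on both sides of the split. A common remedy is to split the original messages first and augment only the training portion. The sketch below shows that ordering. `split_then_augment` and `toy_augment` are hypothetical helpers written for this illustration; a real augmenter such as `augment_text` could be passed in place of the stand-in.

```python
import random

def split_then_augment(texts, augment_fn, test_fraction=0.2, seed=42):
    """Split the original texts first, then augment only the training portion."""
    rng = random.Random(seed)
    shuffled = texts[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    test_texts = shuffled[:n_test]        # untouched originals, never augmented
    train_originals = shuffled[n_test:]
    train_texts = [variant for text in train_originals for variant in augment_fn(text)]
    return train_texts, test_texts

def toy_augment(text):
    """Stand-in augmenter: returns the original plus a lowercased copy."""
    return [text, text.lower()]

messages = [f"Message number {i}" for i in range(10)]
train_texts, test_texts = split_then_augment(messages, toy_augment)

print(len(train_texts), "training texts,", len(test_texts), "test texts")
print("Overlap:", set(train_texts) & set(test_texts))
```

With this ordering, no test message has a near-copy in the training set, so the test score says more about how the model handles unseen text.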


In [None]:
new_test_data = pd.DataFrame({
    'text': [
        "The mitochondria is the powerhouse of the cell. Eukaryotic organisms leverage oxidative phosphorylation for ATP synthesis.",  # Ham - scientific, very specific jargon
        "Quasar 3C 273 is an active galactic nucleus exhibiting relativistic jets and strong radio emissions.",  # Ham - astrophysics, highly technical
        "Epistemological considerations in qualitative research methodologies require reflexivity and bracketing of researcher bias.",  # Ham - academic, philosophical
        "My dude, that party last night was totally lit 🔥! We should do it again sometime.",  # Ham - very informal, slang, emoji
        "Has anyone seen my keys? I think I left them somewhere in the house. 🤔",  # Ham - common everyday question, emoji
        "Pneumonoultramicroscopicsilicovolcanoconiosis is a lung disease caused by the inhalation of very fine silica dust.", # Ham - extremely long, technical word
        "The quick brown fox jumps over the lazy dog. 1234567890 !@#$%^&*()",  # Ham - pangram, numbers, symbols
        "To be or not to be, that is the question. Whether 'tis nobler in the mind to suffer the slings and arrows of outrageous fortune...", # Ham - famous quote, Shakespearean English
        "In a hole in the ground there lived a hobbit. Not a nasty, dirty, wet hole, filled with the ends of worms and an oozy smell...", # Ham - famous quote, fantasy literature
        "Just finished a great workout at the gym 💪. Feeling energized! #fitness #healthyliving",  # Ham - social media style, hashtag
        "OMG! Did you hear about the latest celebrity gossip? 😲 Spilling the tea ☕ on my blog: www.gossip-central.com",  # Spam - informal, internet slang, clickbaity
        "SUPER EXCLUSIVE!!! ONE-TIME OFFER!!! Get a FREE sample of our revolutionary new cryptocurrency! www.definitely-not-a-pyramid-scheme.com",  # Spam - different topic, very spammy
        "You are hereby cordially invited to an evening of intrigue and mystery. RSVP at www.this-sounds-suspicious.com",  # Spam - different style, formal but unusual
        "Participate in our survey for a chance to win an all-expenses-paid trip to a remote, undisclosed location! www.enter-at-your-own-risk.com",  # Spam - vague, potentially dangerous
        "This ancient herbal remedy can cure any ailment! Limited supply, order now! www.snake-oil-emporium.com",  # Spam - implausible claim, different topic
        "Foreign dignitary seeks assistance in transferring large sum of money. Generous compensation offered. Contact: www.not-a-scam-at-all.com", # Spam - a twist on the classic Nigerian prince scam
        "BREAKING NEWS: Evidence of extraterrestrial life discovered! Read the full story here: www.definitely-not-fake-news.com", # Spam - outrageous claim
        "Your social security number has been flagged for suspicious activity. Call this number immediately to avoid legal action." # Spam - no link, different tactic, still spammy

    ],
    'label': ['ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam']
})

new_test_dataset = TextDataset(new_test_data['text'], new_test_data['label'], vectorizer)
new_test_loader = DataLoader(new_test_dataset, batch_size=32)

model.eval()
new_predictions = []
new_actuals = []

with torch.no_grad():
    for inputs, labels in new_test_loader:
        outputs = model(inputs)
        # Flatten to 1-D so thresholding also works when a batch holds a single example
        predicted = (outputs.reshape(-1) >= 0.5).float()
        new_predictions.extend(predicted.tolist())
        new_actuals.extend(labels.reshape(-1).tolist())

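# Sort the labels so the index-to-name mapping is deterministic (set order is not).
# Assumes TextDataset encodes 'ham' as 0 and 'spam' as 1.
label_map = {i: label for i, label in enumerate(sorted(set(new_test_data['label'])))}
new_predictions = [label_map[int(p)] for p in new_predictions]
new_actuals = [label_map[int(a)] for a in new_actuals]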

print("\nDeep Learning Results (New Test Set):\n")
print(classification_report(new_actuals, new_predictions))
new_test_data["dl_label"] = new_predictions
new_test_data

## When to Use Each Approach

### Traditional Programming
✅ **Best For**:
- Simple, rule-based decisions
- Need for complete transparency
- Limited, well-defined patterns
- No training data available
- Quick prototyping

### Classical Machine Learning
✅ **Best For**:
- Moderate amounts of data
- Clear feature patterns
- Need for balance of performance and simplicity
- Limited compute or memory budget
- Well-understood problem domain

### Deep Learning
✅ **Best For**:
- Large datasets available
- Complex patterns
- Sequential or hierarchical data
- High performance requirements
- Ample compute resources available (e.g. a GPU)
- Need for state-of-the-art accuracy

## Decision Framework

When choosing between these approaches, consider:

1. **Data Availability**
   - No data → Traditional Programming
   - Small dataset → Classical ML
   - Large dataset → Deep Learning

2. **Problem Complexity**
   - Simple rules exist → Traditional Programming
   - Clear features exist → Classical ML
   - Complex patterns → Deep Learning

3. **Resource Constraints**
   - Limited computing power → Traditional Programming
   - Moderate resources → Classical ML
   - GPU available → Deep Learning

4. **Maintenance Requirements**
   - Frequent rule updates → Consider Classical ML or Deep Learning
   - Need for transparency → Traditional Programming or Classical ML
   - Automated learning needed → Classical ML or Deep Learning
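
The framework above can be sketched as a small helper function. This is only an illustration: the function name `recommend_approach`, its parameters, and the 100,000-example threshold are assumptions made for this example, not values taken from the notebook.

```python
def recommend_approach(num_labeled_examples, rules_are_simple, has_gpu):
    """Suggest a paradigm using the decision framework's rules of thumb.

    The dataset-size threshold below is an illustrative assumption.
    """
    # No data: hand-written rules are the only option.
    if num_labeled_examples == 0:
        return "Traditional Programming"
    # Simple, well-defined rules rarely justify a learned model.
    if rules_are_simple:
        return "Traditional Programming"
    # A large dataset plus an accelerator makes deep learning practical.
    if num_labeled_examples >= 100_000 and has_gpu:
        return "Deep Learning"
    # Everything in between: a classical model is a reasonable default.
    return "Classical ML"


print(recommend_approach(num_labeled_examples=5_000, rules_are_simple=False, has_gpu=False))
```

In practice these criteria interact (for example, a transparency requirement may outweigh dataset size), so treat the result as a starting point rather than a verdict.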

# Glossary

## Traditional Programming Terms
- **Rule-Based System**: Program that uses manually defined rules to make decisions
- **Boolean Logic**: True/false conditions used in rules
- **Control Flow**: How program decisions are made
- **Deterministic**: Same input always produces same output
- **Pattern Matching**: Finding specific text patterns using rules
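
As a rough illustration of these terms, the sketch below combines pattern matching, boolean logic, and control flow into a tiny rule-based system. The patterns in `SPAM_PATTERNS` and the function name `is_spam_by_rules` are hypothetical and chosen only for this example.

```python
import re

# Hypothetical hand-written patterns (pattern matching).
SPAM_PATTERNS = [
    re.compile(r"\bfree\b", re.IGNORECASE),     # the word "free"
    re.compile(r"!{3,}"),                       # three or more exclamation marks
    re.compile(r"\bact now\b", re.IGNORECASE),  # urgency phrase
]


def is_spam_by_rules(text):
    """Rule-based system: flag the text if any pattern matches (boolean OR)."""
    return any(pattern.search(text) for pattern in SPAM_PATTERNS)


# Deterministic: the same input always produces the same output.
print(is_spam_by_rules("Get a FREE sample!!!"))
print(is_spam_by_rules("Has anyone seen my keys?"))
```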

## Machine Learning Terms
- **Feature**: Numerical representation of data
- **Feature Engineering**: Process of creating features from raw data
- **Training**: Process of learning from examples
- **Classification**: Assigning categories to inputs
- **Supervised Learning**: Learning from labeled examples
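
To make "feature" and "feature engineering" concrete, here is a minimal hand-rolled bag-of-words sketch that turns raw text into word-count vectors. It is a simplified stand-in for a library vectorizer, and the helper names and the two-message corpus are assumptions made for this example.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def build_vocabulary(texts):
    """Map each distinct token in the corpus to a column index."""
    tokens = sorted({tok for text in texts for tok in tokenize(text)})
    return {tok: i for i, tok in enumerate(tokens)}


def to_feature_vector(text, vocabulary):
    """Feature engineering: turn raw text into a vector of word counts."""
    counts = Counter(tokenize(text))
    return [counts.get(tok, 0) for tok in vocabulary]


# Tiny hypothetical labeled corpus (supervised learning needs labels).
texts = ["free prize free entry", "see you at lunch"]
labels = ["spam", "ham"]

vocab = build_vocabulary(texts)
features = [to_feature_vector(t, vocab) for t in texts]
print(list(vocab))
print(features)
```

A classifier would then be trained on `features` and `labels`; each column of a feature vector corresponds to one vocabulary word.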

## Deep Learning Terms
- **Neural Network**: Computing system inspired by biological brains
- **Layer**: Processing level in neural network
- **Activation Function**: Non-linear function (such as ReLU, the Rectified Linear Unit) applied to each neuron's output, which lets the model learn complex, non-linear relationships
- **Batch**: Group of examples processed together
- **Batch Size**: Number of training examples processed in one forward/backward pass
- **Weights and Biases**: Parameters the model learns during training; each connection between neurons has a weight, and each neuron has a bias
- **Forward Pass**: Feeding input data through the network, performing calculations at each layer, and producing an output
- **Loss Function**: Function that measures the difference between the model's predictions and the actual labels; training aims to minimize it (e.g. Binary Cross-Entropy Loss is often used for binary classification)
- **Optimizer**: Algorithm that adjusts the model's weights and biases to minimize the loss function (e.g. Adam is a popular optimization algorithm)
- **Backpropagation**: Calculating the gradients of the loss function with respect to the model's weights and biases, which the optimizer uses to update the parameters
- **Epoch**: One complete pass through the entire training dataset
- **Learning Rate**: Hyperparameter that controls how much the model's weights are adjusted in each update step
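
The terms above fit together in a training loop. The sketch below trains a single sigmoid neuron from scratch in plain Python so each step is visible. It is not the PyTorch model used in this notebook: it uses plain gradient descent instead of Adam, and the toy data and hyperparameter values are assumptions made for this example.

```python
import math


def sigmoid(z):
    """Activation function: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def forward(weights, bias, x):
    """Forward pass for one neuron: weighted sum, then activation."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def bce_loss(pred, target):
    """Loss function: binary cross-entropy for one example."""
    eps = 1e-12  # guards against log(0)
    return -(target * math.log(pred + eps) + (1 - target) * math.log(1 - pred + eps))


def train(data, epochs=200, batch_size=2, learning_rate=0.5):
    """Mini-batch gradient descent; returns weights, bias, and the mean loss seen in each epoch."""
    weights, bias = [0.0] * len(data[0][0]), 0.0
    history = []
    for _ in range(epochs):  # one epoch = one full pass over the data
        total = 0.0
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            grad_w, grad_b = [0.0] * len(weights), 0.0
            for x, y in batch:
                pred = forward(weights, bias, x)
                total += bce_loss(pred, y)
                # Backpropagation: for sigmoid + BCE, dLoss/dz = pred - y.
                error = pred - y
                grad_w = [g + error * xi for g, xi in zip(grad_w, x)]
                grad_b += error
            # Optimizer step: move parameters against the averaged gradient.
            weights = [w - learning_rate * g / len(batch) for w, g in zip(weights, grad_w)]
            bias -= learning_rate * grad_b / len(batch)
        history.append(total / len(data))
    return weights, bias, history


# Hypothetical features: [count of "free", count of "meeting"]; label 1 = spam.
data = [([2, 0], 1), ([1, 0], 1), ([0, 1], 0), ([0, 2], 0)]
weights, bias, history = train(data)
print(f"first-epoch loss: {history[0]:.3f}, last-epoch loss: {history[-1]:.3f}")
```

Frameworks such as PyTorch automate the gradient computation and the update step, but the sequence of forward pass, loss, backpropagation, and optimizer step is the same.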

## Additional Resources

- [scikit-learn Documentation](https://scikit-learn.org/)
- [PyTorch Tutorials](https://pytorch.org/tutorials/)
- [Machine Learning Mastery](https://machinelearningmastery.com/)

---

# License Information

<details>
<summary>License Information</summary>

MIT License

Copyright (c) 2024

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
</details>