# CS145 Introduction to Data Mining - Assignment 5
## Deadline: 11:59PM, June 1, 2025

---

## Instructions
Each assignment is structured as a Jupyter notebook (though we display it here in Markdown form). You will encounter two types of problems: **write-up problems** and **coding problems**:

1. **Write-up Problems**: These are theoretical questions where you should demonstrate understanding of lecture concepts. Provide explanations, derivations, and proofs where necessary. Use LaTeX math for clarity.

2. **Coding Problems**: You will implement and test data mining or machine learning algorithms. The code must be runnable (i.e., no syntax errors). **TODO** blocks indicate sections for you to complete.

### Submission Requirements
- Submit your `.ipynb` file (and any supplementary files, if needed) to Gradescope via BruinLearn before the deadline.
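- Submissions up to 24 hours late are accepted with a penalty factor of $\mathbf{1}(t \le 24)\, e^{-(\ln(2)/12)t}$, where $t$ is the number of hours past the deadline (the factor halves every 12 hours and drops to zero after 24).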
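
The late-penalty factor can be tabulated with a short sketch. It assumes $t$ is measured in hours past the deadline and that an on-time submission keeps a factor of 1; the function name `late_penalty_factor` is illustrative only.

```python
import math

def late_penalty_factor(hours_late: float) -> float:
    """Multiplier 1(t <= 24) * exp(-(ln 2 / 12) * t) for a submission t hours late."""
    if hours_late <= 0:
        return 1.0   # assumption: no penalty at or before the deadline
    if hours_late > 24:
        return 0.0   # indicator term: nothing is accepted after 24 hours
    return math.exp(-(math.log(2) / 12) * hours_late)

for t in (1, 6, 12, 24, 25):
    print(f"{t:>2} h late -> factor {late_penalty_factor(t):.3f}")
```

Under these assumptions a submission 12 hours late keeps half of its score and one 24 hours late keeps a quarter.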

### Collaboration and Integrity
- Collaborating on ideas is encouraged, but all submitted work must be your own. If you discuss with peers or use external references, clearly cite them.
- Any form of cheating (e.g., unauthorized code or solutions) will be reported to the university's Office of the Dean of Students.

---

# Outline

1. **Part 1: Write-up**
   - Q1: Large Language Models (LLMs): A Pro and a Con
   - Q2: pLSA Model: Manual Calculation of $\beta$ and $p(z \mid d)$

2. **Part 2: Coding**
   - Q3: Spam Detection with Logistic Regression & Naive Bayes
   - Q4: Implementing Transformers (Attention and Multi-Head Attention)
   - Q5: Time-Series Sequence Prediction with Yahoo Stock Prices (AR vs. RNN)

---

# Part 1: Write-up

## Q1: Large Language Models: A Pro and a Con
**Objective**  
Large Language Models (LLMs) such as GPT-4, PaLM, etc., are widely used in industry. Their abilities, however, come with potential pitfalls.

**Tasks**  
1. **Pro Example**: Provide an example of how an LLM might excel in a practical task (e.g., drafting emails, summarizing documents, coding assistance, chat-based tutoring, etc.). Write a short paragraph describing what you did, the prompt you used, and how the model responded. **(8 pts)**
2. **Con Example**: Provide an example scenario where an LLM's limitations became apparent (e.g., factual inaccuracies, biased output, difficulty with reasoning tasks, or security concerns). Again, share your prompt and the response to highlight the limitation. **(8 pts)**
3. **Experiment Setup**: Describe the interface or Web UI you used (e.g., ChatGPT online, Bard, Bing Chat, etc.). No need to show detailed logs—just summarize your approach. **(4 pts)**

Ensure you highlight what was "good" or "bad" about each scenario.


**[TODO: Write your responses here. ]**

---

## Q2: pLSA Model - Manual Calculation
**Objective**  
Recall that Probabilistic Latent Semantic Analysis (pLSA) is fit with the EM algorithm, which alternates an E-step and an M-step. You will work through a small calculation on a made-up dataset to solidify your understanding.

Let's define:
- $K = 2$ topics
- A small vocabulary with 3 words ($w_1$, $w_2$, $w_3$)
- A document-term matrix that reflects how many times a word appears in each document (table below)

| Document | $w_1$ | $w_2$ | $w_3$ |
|----------|-------|-------|-------|
| $d_1$    | 3     | 1     | 0     |
| $d_2$    | 0     | 2     | 2     |

We assume some initial values for $p(z \mid d)$ and $p(w \mid z)$. Suppose the E-step then yields the following $c(d,z,w)$ values (the expected number of occurrences of word $w$ in document $d$ that are assigned to topic $z$):

**For Topic $z_1$**:
| $c(d, z_1, w)$ | $w_1$ | $w_2$ | $w_3$ |
|----------------|-------|-------|-------|
| $d_1$          | 0.3   | 0.1   | 0     |
| $d_2$          | 0     | 0.8   | 1.5   |

**For Topic $z_2$**:
| $c(d, z_2, w)$ | $w_1$ | $w_2$ | $w_3$ |
|----------------|-------|-------|-------|
| $d_1$          | 2.7   | 0.9   | 0     |
| $d_2$          | 0     | 1.2   | 0.5   |
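
A quick sanity check on the two tables, assuming the usual E-step relation $c(d,z,w) = n(d,w)\,p(z \mid d,w)$: because the topic posteriors sum to one, the expected counts for the two topics should add back up to the document-term matrix. The NumPy sketch below only verifies this property (it does not perform the M-step), and the array names are illustrative.

```python
import numpy as np

# Document-term counts n(d, w): rows d1, d2; columns w1, w2, w3.
n_dw = np.array([[3, 1, 0],
                 [0, 2, 2]], dtype=float)

# Expected counts c(d, z, w), indexed as [topic, document, word].
c = np.array([
    [[0.3, 0.1, 0.0],   # topic z1, document d1
     [0.0, 0.8, 1.5]],  # topic z1, document d2
    [[2.7, 0.9, 0.0],   # topic z2, document d1
     [0.0, 1.2, 0.5]],  # topic z2, document d2
])

# Summing over the topic axis should recover the observed counts.
recovered = c.sum(axis=0)
print(recovered)
print("consistent with n(d, w):", np.allclose(recovered, n_dw))
```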

**Tasks**  
1. **M-Step for $p(w \mid z)$**: Show how you calculate $\beta_{z,w} = p(w \mid z)$ from the expected counts $c(d,z,w)$.  **(8 pts)**  
2. **M-Step for $p(z \mid d)$**: Show how you calculate $p(z \mid d)$ from the same counts.  **(6 pts)**  
3. **Interpretation**: Suppose one topic's distribution heavily favors $w_1$, and the other heavily favors $w_3$. How would you interpret that in terms of the potential "themes" of the documents?  **(6 pts)**

**Deliverable**: Provide your handwritten or typed calculations in your write-up. Reference the relevant equations from class for pLSA's E-step and M-step, and show your derivations.

**[TODO: Write your responses here. ]**


# Part 2: Coding

## Q3: Spam Detection with Logistic Regression & Naive Bayes (Coding)
**Objective**  
You will analyze a real, moderately imbalanced spam-detection dataset and compare the performance of two classifiers: **Logistic Regression** (discriminative) and **Naive Bayes** (generative).

**Dataset**  
We will use the **UCI Spambase** dataset (57 continuous features extracted from e-mails; about 39% of the samples are spam).  
Load it directly from OpenML with `fetch_openml`.

**Tasks (20 pts total)**  
1. **Data Loading & Exploration**: Download the dataset as shown in the loading cell below, print basic descriptive statistics, and confirm the degree of class imbalance. **(3 pts)**  
2. **Train/Validation Split**: Create an 80 / 20 stratified split that preserves the spam / non-spam ratio. **(2 pts)**  
3. **Logistic Regression Model**: Train a scikit-learn `LogisticRegression` (solver = `liblinear`, `class_weight='balanced'`) on the training set. **(4 pts)**  
4. **Naive Bayes Model**: Train a `GaussianNB` (or `MultinomialNB` after min-max scaling and discretization) on the same training data. **(3 pts)**  
5. **Evaluation Metrics**: For both classifiers, compute **precision, recall, F1-score, ROC AUC**, and plot the ROC curve on the **validation** set. **(4 pts)**  
6. **Discussion**: In 2-3 sentences, compare the two models' performance and comment on how class imbalance affected each approach. **(4 pts)**
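
To make the `class_weight='balanced'` option in step 3 concrete: scikit-learn documents the balanced weight for a class as `n_samples / (n_classes * count_of_that_class)`. The sketch below applies that formula to a hypothetical 1,000-sample label vector with 39% positives; these are not the real Spambase counts.

```python
import numpy as np

# Hypothetical labels: 1,000 e-mails, 39% spam (label 1), 61% non-spam (label 0).
y_toy = np.array([1] * 390 + [0] * 610)

counts = np.bincount(y_toy)                  # per-class counts: [610, 390]
n_samples, n_classes = y_toy.size, counts.size
balanced = n_samples / (n_classes * counts)  # documented "balanced" heuristic

for label, weight in enumerate(balanced):
    print(f"class {label}: count={counts[label]}, weight={weight:.3f}")
```

The minority class receives the larger weight, so each spam example contributes more to the loss than each non-spam example.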

**Starter Code Skeleton**  
Fill in the TODO blocks, replacing each `raise NotImplementedError` with your implementation; any that remain allow automated grading to detect incomplete work.



In [None]:
from sklearn.datasets import fetch_openml

# Returns X (pandas.DataFrame) and y (pandas.Series)
X, y = fetch_openml("spambase", version=1, as_frame=True, return_X_y=True, parser="auto")
print(X.shape, y.value_counts(normalize=True))

In [None]:
import pandas as pd
import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import (precision_score, recall_score, f1_score,
                             roc_auc_score, RocCurveDisplay)
import matplotlib.pyplot as plt

# ---------------- 1) Load data ----------------
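X, y = fetch_openml("spambase", version=1, as_frame=True, return_X_y=True, parser="auto")
# OpenML typically returns the label as a categorical of strings ("0"/"1");
# cast to int so the metric functions treat 1 (spam) as the positive class.
y = y.astype(int)
print("Class distribution:\n", y.value_counts())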

# ---------------- 2) Train / Val split --------
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Optional: standardize features (helps logistic regression converge).
# The scaler is fit on the training split only to avoid leaking validation statistics.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled   = scaler.transform(X_val)

# ---------------- 3) Logistic Regression ------
log_reg = LogisticRegression(solver="liblinear", class_weight="balanced")
# TODO: fit and generate validation predictions  # **(4 pts)**
raise NotImplementedError

# ---------------- 4) Naive Bayes --------------
nb = GaussianNB()
# TODO: fit and generate validation predictions  # **(3 pts)**
raise NotImplementedError

# ---------------- 5) Evaluation ---------------
# TODO: compute precision, recall, F1, ROC-AUC for *each* model
#       and plot ROC curves on the same figure.                  # **(4 pts)**
raise NotImplementedError
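
The metrics requested in step 5 have simple count-based definitions. As a minimal sketch on toy labels (not Spambase output), the following computes precision, recall, and F1 by hand; in the assignment itself the imported scikit-learn functions do the same bookkeeping.

```python
import numpy as np

# Toy labels (1 = spam); chosen by hand for illustration.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

tp = int(np.sum((y_pred == 1) & (y_true == 1)))  # spam correctly flagged
fp = int(np.sum((y_pred == 1) & (y_true == 0)))  # non-spam flagged as spam
fn = int(np.sum((y_pred == 0) & (y_true == 1)))  # spam that slipped through

precision = tp / (tp + fp)   # of everything flagged, how much was spam
recall = tp / (tp + fn)      # of all spam, how much was flagged
f1 = 2 * precision * recall / (precision + recall)

print(f"TP={tp} FP={fp} FN={fn}")
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```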

---


## Q4: Implementing Transformers (Attention and Multi-Head Attention)
In this section, you will implement the core components of a Transformer, focusing specifically on attention and multi-head attention. We will not implement positional encodings or feed-forward networks (unless you want to explore further).

### Q4.1 Single-Head Attention
**Recap**  
For queries $Q$, keys $K$, and values $V$, we define:

$$
\text{Attention}(Q,K,V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
$$

where $d_k$ is the dimensionality of the key vectors.
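
The $\sqrt{d_k}$ factor can be motivated numerically. Assuming query and key components are independent with zero mean and unit variance (an idealization, not something the assignment guarantees), a dot product over $d_k$ components has variance close to $d_k$, and dividing by $\sqrt{d_k}$ brings it back to roughly 1. The NumPy sketch below checks this on random vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k, n = 64, 10_000

# Assumption: i.i.d. standard-normal query and key components.
q = rng.standard_normal((n, d_k))
k = rng.standard_normal((n, d_k))

raw = np.sum(q * k, axis=1)   # n independent dot products q . k
scaled = raw / np.sqrt(d_k)   # the scaling used inside the softmax

print(f"variance of raw scores:    {raw.var():.2f}")     # close to d_k
print(f"variance of scaled scores: {scaled.var():.2f}")  # close to 1
```

Keeping the scores at unit scale stops the softmax from saturating as $d_k$ grows.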

**Starter Code (PyTorch)**  
The cell below provides a minimal scaffold that you must complete. **Fill in each TODO** and leave the `raise NotImplementedError` statements in place so that automated grading can detect incomplete work.

In [None]:
from typing import Optional

import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q: torch.Tensor,
                                 K: torch.Tensor,
                                 V: torch.Tensor,
                                 mask: Optional[torch.Tensor] = None) -> torch.Tensor:
    """Compute scaled dot-product attention (single head).

    Parameters
    ----------
    Q : torch.Tensor, shape (..., N, d_k), e.g. (B, N, d_k)
    K : torch.Tensor, shape (..., M, d_k)
    V : torch.Tensor, shape (..., M, d_v)
    mask : optional additive tensor broadcastable to (..., N, M); positions that
            should be *ignored* hold -inf (all others 0), and the mask is added
            to the scaled scores **before** the softmax.

    Returns
    -------
    out : torch.Tensor, shape (..., N, d_v)
    """
    # TODO 1) compute raw attention scores    **(3 pts)**
    # TODO 2) scale by sqrt(d_k)              **(2 pts)**
    # TODO 3) (optional) add mask             **(1 pt)**
    # TODO 4) apply softmax over keys (dim=-1) **(2 pts)**
    # TODO 5) multiply by V to obtain output  **(0 pts - included above)**
    raise NotImplementedError

# Quick shape check (should run without modification once you finish the TODOs)  # **(2 pts)**
if __name__ == "__main__":
    B, N, M, d_k, d_v = 2, 4, 4, 8, 8
    Q = torch.randn(B, N, d_k)
    K = torch.randn(B, M, d_k)
    V = torch.randn(B, M, d_v)
    out = scaled_dot_product_attention(Q, K, V)
    print("Output shape:", out.shape)  # Expected: (2, 4, 8)
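
The masking convention in the docstring can be illustrated outside PyTorch. The sketch below uses NumPy with a hand-written `softmax` helper (illustrative, not part of the starter code) and assumes an additive mask: adding $-\infty$ to a score drives its weight to exactly zero, and the remaining weights renormalize.

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    shifted = x - x.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

scores = np.array([[1.0, 2.0, 3.0]])     # one query, three keys
mask = np.array([[0.0, 0.0, -np.inf]])   # additive mask: hide the third key

weights = softmax(scores + mask)
print(weights)                # third weight is exactly 0
print(weights.sum(axis=-1))   # each row still sums to 1
```

Note that a row in which every key is masked would produce NaNs with this helper, so at least one key per query should stay visible.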

### Q4.2 Multi-Head Attention
**Starter Code (PyTorch)**

In [None]:
class MultiHeadAttention(torch.nn.Module):
    def __init__(self, d_model: int, num_heads: int = 8):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads

        # Learned projections
        self.W_Q = torch.nn.Linear(d_model, d_model)
        self.W_K = torch.nn.Linear(d_model, d_model)
        self.W_V = torch.nn.Linear(d_model, d_model)

        # Final output projection
        self.W_O = torch.nn.Linear(d_model, d_model)

    def forward(self, Q, K, V, mask=None):
        """Applies multi-head attention.

        Q       : shape (B, N, d_model)
        K, V    : shape (B, M, d_model)  (M = N for self-attention)
        mask    : optional, broadcastable to (B, num_heads, N, M)
        returns : shape (B, N, d_model)
        """
        # TODO 1) project Q, K, V and split into heads (B, num_heads, N, d_k)  **(5 pts)**
        # TODO 2) call scaled_dot_product_attention on each head (vectorised)  **(3 pts)**
        # TODO 3) concatenate heads and apply final linear projection  **(0 pts - included above)**
        raise NotImplementedError

# Synthetic test (runs after you implement forward)  # **(2 pts)**
if __name__ == "__main__":
    B, N, d_model = 2, 5, 32
    mha = MultiHeadAttention(d_model=d_model, num_heads=4)
    x = torch.randn(B, N, d_model)
    out = mha(x, x, x)  # named `out` so the label variable `y` from Q3 is not overwritten
    print("Multi-head output shape:", out.shape)  # Expected: (2, 5, 32)
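
The shape bookkeeping behind "split into heads" and "concatenate heads" is easy to get wrong, so here is a minimal sketch of just that step. It uses NumPy rather than PyTorch and omits the learned projections and the attention itself; the variable names are illustrative.

```python
import numpy as np

B, N, d_model, num_heads = 2, 5, 32, 4
d_k = d_model // num_heads

x = np.arange(B * N * d_model, dtype=float).reshape(B, N, d_model)

# Split the feature axis into (num_heads, d_k), then move the head axis forward.
heads = x.reshape(B, N, num_heads, d_k).transpose(0, 2, 1, 3)   # (B, H, N, d_k)

# Concatenation is the inverse: move the head axis back, then flatten.
merged = heads.transpose(0, 2, 1, 3).reshape(B, N, d_model)     # (B, N, d_model)

print(heads.shape, merged.shape)
print("round trip is lossless:", np.array_equal(merged, x))
```

Head `h` sees features `h * d_k` through `(h + 1) * d_k - 1` of every token, which is why `d_model` must be divisible by `num_heads`.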


---

## Q5: Time-Series Sequence Prediction with Yahoo Stock Prices (AR vs. RNN)

You will use daily adjusted closing prices for **Apple Inc. (ticker: `AAPL`)** obtained from Yahoo Finance via the [`yfinance`](https://pypi.org/project/yfinance/) API. If you prefer to study another large-cap equity, simply change the ticker symbol in the starter code.  

> **Package note**: Install missing Python packages with `pip install yfinance statsmodels scikit-learn torch` (or the equivalent `conda`/`mamba` command).

### Q5.1 Data Pre-processing
Starter code below illustrates how to download the data, create sliding-window sequences, and generate train/validation splits.


In [None]:
import yfinance as yf
import pandas as pd
import numpy as np

from sklearn.metrics import mean_squared_error
from typing import Tuple

# --------------- Download daily data ---------------
TICKER = "AAPL"
# Recent yfinance releases default to `auto_adjust=True`, which drops the separate "Adj Close"
# column; we pass `auto_adjust=False` to keep it and fall back to "Close" if it is still missing.
DATA = yf.download(TICKER, start="2010-01-01", end="2024-01-01", progress=False, auto_adjust=False)
prices = DATA.get("Adj Close", DATA["Close"])
if isinstance(prices, pd.DataFrame):  # some yfinance versions return (field, ticker) MultiIndex columns
    prices = prices.iloc[:, 0]        # keep the single ticker's column as a Series
prices = prices.dropna().reset_index(drop=True)

# --------------- Train / Validation split ----------
train_ratio = 0.8
split_idx = int(len(prices) * train_ratio)
train_series = prices.iloc[:split_idx]
val_series   = prices.iloc[split_idx:]

WINDOW = 3  # lag order p

def make_sequences(series: pd.Series, window: int = WINDOW) -> Tuple[np.ndarray, np.ndarray]:
    """Create (X, y) pairs using a sliding window."""
    X, y = [], []
    for i in range(window, len(series)):
        X.append(series.iloc[i-window:i].values)  # previous p prices
        y.append(series.iloc[i])                  # current price
    return np.array(X), np.array(y)

X_train, y_train = make_sequences(train_series)
X_val,   y_val   = make_sequences(val_series)
print("Train sequences:", X_train.shape, "| Val sequences:", X_val.shape)  # **(Data preprocessing: 4 pts)**
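
To see what the sliding window produces, the sketch below runs the same idea on a five-value toy series (made-up numbers, not market data). It uses NumPy's `sliding_window_view` as an equivalent vectorized formulation rather than the loop in `make_sequences`, and also shows the extra trailing axis that the LSTM in Q5.3 expects.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

series = np.array([10.0, 11.0, 12.0, 13.0, 14.0])  # toy prices
window = 3

windows = sliding_window_view(series, window)  # every length-3 run, shape (3, 3)
X = windows[:-1]          # drop the last run: no next value follows it
y = series[window:]       # the value that follows each window

X_lstm = X[..., np.newaxis]   # (num_sequences, window, 1) for batch_first LSTMs

print(X)
print(y)
print(X_lstm.shape)
```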


### Q5.2 Autoregressive (AR) Model
Fill in the missing sections of the function below **or** feel free to re-implement using `statsmodels`.
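
For reference, the AR($p$) model in its standard textbook form predicts each price from its $p$ predecessors. The symbols $c$, $\phi_i$, and $\varepsilon_t$ are notation introduced here, not names from the starter code:

$$
x_t = c + \sum_{i=1}^{p} \phi_i\, x_{t-i} + \varepsilon_t
$$

With `WINDOW = 3` this reads $x_t = c + \phi_1 x_{t-1} + \phi_2 x_{t-2} + \phi_3 x_{t-3} + \varepsilon_t$, so the AR model sees the same three lagged prices as each row of `X_train`.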


In [None]:
from statsmodels.tsa.ar_model import AutoReg

def evaluate_ar(train_series: pd.Series, val_series: pd.Series, p: int = WINDOW) -> float:
    """Fit an AR(p) model on `train_series` and return the validation MSE."""  # **(6 pts)**
    # TODO: fit model (hint: AutoReg in statsmodels)
    # TODO: generate forecasts of length = len(val_series)
    # TODO: compute and return mean_squared_error
    raise NotImplementedError

### Q5.3 RNN / LSTM Model
The skeleton below defines a lightweight LSTM-based regressor. Complete the forward pass **and** the training loop of `train_rnn`.


In [None]:
import torch
from torch import nn

class PriceLSTM(nn.Module):
    def __init__(self, hidden_dim: int = 32, num_layers: int = 1):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden_dim, num_layers=num_layers, batch_first=True)
        self.fc   = nn.Linear(hidden_dim, 1)

    def forward(self, x):
        """x: (batch, seq_len=WINDOW, 1)"""
        # TODO: implement forward pass (LSTM -> final hidden state -> fc)  # **(5 pts)**
        raise NotImplementedError


def train_rnn(model: nn.Module, X_train: np.ndarray, y_train: np.ndarray,
              X_val: np.ndarray,   y_val: np.ndarray,
              epochs: int = 15, lr: float = 1e-3) -> None:
    """Train `model` using mean-squared error loss and print val MSE each epoch."""  # **(3 pts)**
    # TODO: write training loop (optimizer, criterion, batching optional)
    raise NotImplementedError


After training both models, report the validation MSEs and **briefly discuss** which approach performed better and why.  **(2 pts)**

**[TODO: Write your responses here. ]**