# VL07 – Logistic Regression & Neural Networks

In this seminar, we explore how classical machine learning (logistic regression) and modern neural networks approach the same NLP task: **sentiment analysis**.  We will be doing the following:

- training a logistic regression model
- preparing text for neural models (tokenization, vocabulary, padding)  
- defining simple neural architectures in PyTorch  
- understanding forward and backward passes  
- training with mini-batch gradient descent  
- evaluating and comparing different models  

The goal is not to build the most powerful model, but to understand **how** neural networks process text and how they differ from traditional linear classifiers.


## 1. Loading a Sentiment Dataset (IMDB)

For the sentiment analysis task, we will use a real-world dataset:  
**IMDB movie reviews**. It is hosted on Hugging Face as `stanfordnlp/imdb`. The dataset contains a `text` column with the review and a `label` column with the values:

- `0` = negative review  
- `1` = positive review

Use the `download_dataset.py` script if you don't have Internet access from your Jupyter notebook.

In [None]:
from datasets import load_dataset, load_from_disk

#ds = load_dataset("stanfordnlp/imdb")
ds = load_from_disk("../../data/standfordnlp_imdb")

### Create train / test sets
We take the train and test splits already provided by the dataset and convert them to DataFrames. We can then explore some properties of the dataset.

In [None]:
import pandas as pd

# Convert splits to pandas DataFrames
df_train_full = ds["train"].to_pandas()
df_test_full = ds["test"].to_pandas()

print("Train shape:", df_train_full.shape)
print("Test  shape:", df_test_full.shape)

df_train_full.head()

In [None]:
df_train_full["label"].value_counts()

### Create small datasets
To make the training during class manageable, we can take a subset of the original train and test datasets. We can use `train_test_split()` to get a stratified sample, controlling the sample size with the `train_size` parameter.

In [None]:
from sklearn.model_selection import train_test_split

df_train, _ = train_test_split(
    df_train_full,
    train_size=20_000,
    stratify=df_train_full["label"],
    random_state=42
)

df_test, _ = train_test_split(
    df_test_full,
    train_size=4_000,
    stratify=df_test_full["label"],
    random_state=42
)
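
To see what `stratify` does, here is a minimal sketch on a made-up, imbalanced DataFrame (the `toy` data and its 70/30 split are assumptions for illustration, not the IMDB data). The sample should keep the same class proportions as the full data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 70% negative (0), 30% positive (1)
toy = pd.DataFrame({
    "text": [f"review {i}" for i in range(100)],
    "label": [0] * 70 + [1] * 30,
})

# Draw a stratified sample of 20 rows; the remainder is discarded
sample, _ = train_test_split(
    toy,
    train_size=20,
    stratify=toy["label"],
    random_state=42,
)

# Class proportions in the sample mirror those of the full data
print(sample["label"].value_counts(normalize=True))
```

Without `stratify`, a small random sample could by chance over- or under-represent one class.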

## 2. Baseline Sentiment Classifier (Logistic Regression)

We now build a simple baseline sentiment classifier:

- Input: IMDB movie review text  
- Representation: TF–IDF features  
- Model: Logistic Regression (linear classifier)

We will:
1. Fit the vectorizer and model on the training split.
2. Use the model to classify custom phrases that we provide.

### 2.1 Prepare the TF-IDF representation

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# TF–IDF vectorizer
tfidf = TfidfVectorizer(
    max_features=20_000,   # cap vocabulary size for speed
    ngram_range=(1, 2),    # unigrams + bigrams
    stop_words="english"   # simple English stopword removal
)

# Fit vectorizer on training texts
X_train = tfidf.fit_transform(df_train["text"])
y_train = df_train["label"].values

X_train.shape

### 2.2 Fit the Logistic regression model

In [None]:
from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression(
    max_iter=1000,
    n_jobs=-1
)

log_reg.fit(X_train, y_train)

### 2.3 Testing sentiment prediction
Now that we have trained the model, we can test the sentiment prediction. The interface is similar to that of Naive Bayes (NB): we can call `predict()` and `predict_proba()` to obtain the predicted label and its probability.

In [None]:
text = "I am very happy"
v = tfidf.transform([text])

print ("Prob: ", log_reg.predict_proba(v))
print ("Pred: ", log_reg.predict(v))
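
For a binary problem, the probability returned by `predict_proba()` is the sigmoid of the linear score $w \cdot x + b$. The sketch below checks this on a tiny made-up dataset (`X_toy`, `y_toy` and `clf` are hypothetical names, not the IMDB model).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: 2 features, 6 examples
X_toy = np.array([[0.0, 1.0], [0.2, 0.8], [0.1, 0.9],
                  [1.0, 0.0], [0.9, 0.2], [0.8, 0.1]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression().fit(X_toy, y_toy)

# Linear score z = w . x + b, then the sigmoid squashes it into (0, 1)
z = X_toy @ clf.coef_[0] + clf.intercept_[0]
p_manual = 1.0 / (1.0 + np.exp(-z))

# Probability of the positive class as reported by scikit-learn
p_sklearn = clf.predict_proba(X_toy)[:, 1]
print(np.allclose(p_manual, p_sklearn))
```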

Below we have a helper function to predict the sentiment of an array of texts.

In [None]:
import numpy as np

label_names = {0: "negative", 1: "positive"}

def predict_sentiment(texts):
    """
    texts: list of strings (sentences or reviews)
    """
    if isinstance(texts, str):
        texts = [texts]
    
    X = tfidf.transform(texts)
    probs = log_reg.predict_proba(X)
    preds = log_reg.predict(X)
    
    for text, pred, prob in zip(texts, preds, probs):
        label = label_names.get(int(pred), str(pred))
        confidence = float(np.max(prob))
        print(f"Text: {text}")
        print(f" → Predicted: {label} (confidence: {confidence:.3f})")
        print("-" * 60)

In [None]:
# Change the list to test it. What if it is not movie related?
examples = [
    "I absolutely loved this movie, it was fantastic!",
    "This was the worst film I have seen in years.",
    "The story was a bit slow, but the acting was great.",
    "Not bad, but I wouldn't watch it again."
]

predict_sentiment(examples)

### 2.4 Inspecting the model
We can access the learned model coefficients (weights) and see which terms *explain* the positive and negative sentiment.

In [None]:
feature_names = np.array(tfidf.get_feature_names_out())
coefs = log_reg.coef_[0]  # since this is a binary classifier

# Sort coefficients (strongest first in each list)
top_pos_indices = np.argsort(coefs)[-20:][::-1]
top_neg_indices = np.argsort(coefs)[:20]

print("Top POSITIVE words:")
for idx in top_pos_indices:
    print(f"{feature_names[idx]:<20}  weight= {coefs[idx]:.3f}")

print("\nTop NEGATIVE words:")
for idx in top_neg_indices:
    print(f"{feature_names[idx]:<20}  weight= {coefs[idx]:.3f}")

We can also write a helper function that shows the influence of each word on a specific document. The contribution of term $i$ is the product $w_i \cdot x_i$, i.e. `weight * tfidf` value.

In [None]:
def show_word_influences(text):
    # Transform the text into tf-idf vector
    X = tfidf.transform([text])
    
    # Extract non-zero features
    indices = X.nonzero()[1]
    words = feature_names[indices]
    weights = coefs[indices] * X.data  # weight * tfidf value
    
    # Sort by absolute contribution
    sorted_idx = np.argsort(np.abs(weights))[::-1]
    
    print(f"Text: {text}")
    print("\nMost influential words:")
    print("-" * 40)
    for idx in sorted_idx[:10]:
        w = words[idx]
        c = weights[idx]
        direction = "POSITIVE" if c > 0 else "NEGATIVE"
        print(f"{w:<20}  contribution: {c:.4f} ({direction})")

In [None]:
show_word_influences("I thought the movie was dull and boring but the ending was great.")

### 2.5 Evaluating the model
To evaluate the model on our test dataset, we need to transform it with the same TF-IDF vectorizer that we fitted on the training dataset.

We can then use the metrics functions from sklearn, as we have been doing in previous classes. 

In [None]:
from sklearn.metrics import confusion_matrix, classification_report, ConfusionMatrixDisplay, accuracy_score

# True labels
y_test = df_test["label"].values
# Transform test split
X_test_tfidf = tfidf.transform(df_test["text"])

# LogReg predictions
y_pred_lr = log_reg.predict(X_test_tfidf)

cm_lr = confusion_matrix(y_test, y_pred_lr)
print("LogReg confusion matrix:\n", cm_lr)
print("\nLogReg classification report:\n",
      classification_report(y_test, y_pred_lr, target_names=["negative", "positive"]))

ConfusionMatrixDisplay(confusion_matrix=cm_lr,
                       display_labels=['negative', 'positive']).plot()
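
As a reminder of how the numbers in the classification report relate to the confusion matrix, here is a small sketch with made-up labels (`y_true_toy` and `y_pred_toy` are hypothetical). Precision and recall for the positive class are computed by hand and compared with scikit-learn.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical true and predicted labels
y_true_toy = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
y_pred_toy = np.array([0, 0, 0, 1, 1, 1, 1, 1, 0, 0])

# Rows are true labels, columns are predicted labels
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()

precision = tp / (tp + fp)   # of everything predicted positive, how much is right
recall = tp / (tp + fn)      # of everything truly positive, how much was found

print("precision:", precision, precision_score(y_true_toy, y_pred_toy))
print("recall:   ", recall, recall_score(y_true_toy, y_pred_toy))
```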

## 3. Neural Network Classifier (Feedforward Neural Network)

In this section we build a simple neural classifier for sentiment:

**Text → tokens → embedding vectors → mean pooling → hidden layer → output layer**

This model goes beyond logistic regression:

- Logistic regression learns only a *linear* relationship.
- Neural networks can learn *non-linear* patterns and interactions.
- Embeddings allow the model to learn semantic similarities between words.

We will:
1. Tokenize the text.
2. Build a vocabulary.
3. Train an embedding + feedforward neural network using PyTorch.
4. Compare its predictions to the logistic regression model qualitatively.

Before that, let's make sure we have torch available.

In [None]:
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

import numpy as np
from sklearn.model_selection import train_test_split

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

### 3.1. Preparing the data

#### Step 1: Text → Tokens
We start with raw text like:

> "The movie was fantastic and I loved every minute."

We split it into **tokens** (words):

`["the", "movie", "was", "fantastic", "and", "i", "loved", "every", "minute"]`

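This is done with a simple tokenizer. Note that the tokenizer defined below also drops stop words and non-alphabetic tokens, so its actual output for this sentence is shorter than the list above.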

---


In [None]:
import spacy
from collections import Counter

nlp = spacy.load("en_core_web_sm")

def tokenize(text):
    # lowercase, then keep alphabetic tokens only and drop stop words
    text = text.lower()
    doc = nlp.make_doc(text)
    return [tok.text for tok in doc if tok.is_alpha and not tok.is_stop]

#### Step 2: Tokens → Indices
The model does not understand strings.
We build a **vocabulary** of all words in the training set:

`word2idx = {'<pad>':0, '<unk>':1, 'the':2, 'movie':3, …, 'fantastic':1234}`

This can later help us convert each token into an **integer**:

`[2, 3, 10, 1234, 5, 7, 99, 22, 18]`

In [None]:
from collections import Counter

# 1. Tokenize all texts in the training set
# df_train["text"] is a Series of strings.
# We apply our tokenize() function to each text so that we get lists of tokens.
tokenized_docs = df_train["text"].apply(tokenize)

# 2. Count how often each word appears in the training data
word_counts = Counter()
for tokens in tokenized_docs:
    for word in tokens:
        word_counts[word] += 1

print("Number of unique words before filtering:", len(word_counts))

# 3. Build the vocabulary list
#    We add special tokens first, then all words that appear at least 5 times.
vocab = []

# Special tokens
vocab.append("<pad>")  # index 0
vocab.append("<unk>")  # index 1

# Add words that are frequent enough
min_freq = 5
for word, count in word_counts.items():
    if count >= min_freq:
        vocab.append(word)

print("Number of words in vocabulary (after filtering):", len(vocab))

# 4. Create the mapping: word -> index
word2idx = {}

for idx, word in enumerate(vocab):
    word2idx[word] = idx

# Quick sanity check
print("Index of <pad>:", word2idx["<pad>"])
print("Index of <unk>:", word2idx["<unk>"])
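
The same idea can be written more compactly. The sketch below is an alternative, not the construction used in this notebook: it builds a vocabulary from a made-up tokenized corpus (`toy_docs`), ordering words by frequency, and shows how an unseen word falls back to `<unk>`.

```python
from collections import Counter

# Hypothetical tokenized documents
toy_docs = [["good", "movie"], ["bad", "movie"], ["good", "good", "plot"]]

# Count every token across all documents
counts = Counter(tok for doc in toy_docs for tok in doc)

# Keep words seen at least min_freq times, most frequent first
min_freq = 2
specials = ["<pad>", "<unk>"]
kept = [w for w, c in counts.most_common() if c >= min_freq]
toy_word2idx = {w: i for i, w in enumerate(specials + kept)}
print(toy_word2idx)

# Rare or unseen words map to the <unk> index
encoded = [toy_word2idx.get(tok, toy_word2idx["<unk>"]) for tok in ["bad", "movie"]]
print(encoded)
```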

#### Step 3: Encoding the documents
Documents (reviews) vary in length, but the input to the neural network has a fixed size. We therefore choose a length based on the document length distribution, and then pad shorter documents and truncate longer ones.

##### Inspect the document length distribution
We can settle on a length that covers `90-95%` of the documents.

In [None]:
doc_lengths = [len(tokens) for tokens in tokenized_docs]

import matplotlib.pyplot as plt

plt.hist(doc_lengths, bins=50)
plt.xlabel("Number of tokens per document")
plt.ylabel("Count")
plt.title("Distribution of document lengths (train set)")
plt.show()

import numpy as np

# We can have a look at the percentiles.
for p in [50, 75, 90, 95, 99]:
    value = np.percentile(doc_lengths, p)
    print(f"{p}th percentile: {value:.1f} tokens")
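
One way to turn the percentiles into a concrete choice is sketched below, using randomly generated lengths (`toy_lengths` is made up; with the real data you would pass `doc_lengths`). It picks the 95th percentile and reports how many documents fit without truncation.

```python
import numpy as np

# Hypothetical document lengths, for illustration only
rng = np.random.default_rng(0)
toy_lengths = rng.integers(20, 400, size=1000)

# Choose the smallest integer length that covers 95% of the documents
max_len = int(np.ceil(np.percentile(toy_lengths, 95)))

# Fraction of documents that fit without being truncated
coverage = np.mean(toy_lengths <= max_len)
print(f"max_len={max_len}, coverage={coverage:.1%}")
```

A larger value truncates fewer reviews but makes every batch longer, so this is a trade-off between coverage and speed.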

##### Encode and pad documents to a fixed size
Here we encode the documents as token indices, padding or truncating them to the `MAX_LEN` that we chose. For example:

`MAX_LEN=9`

Example 1: 

`["the", "movie", "was", "fantastic", "and", "i", "loved", "every", "minute"]`

`encode + pad -> [2, 3, 10, 1234, 5, 7, 99, 22, 18]`

Example 2:

`["the", "movie", "was", "bad"]`

`encode + pad -> [2, 3, 10, 28, <pad>, <pad>, <pad>, <pad>, <pad>]`

In [None]:
MAX_LEN = 200

def encode(tokens):
    """
    Convert a list of token strings into a list of integer IDs
    using the word2idx vocabulary.

    Unknown words are mapped to the <unk> index.
    """    
    return [word2idx.get(tok, word2idx["<unk>"]) for tok in tokens]

def pad(seq):
    """
    Make all sequences the same length (MAX_LEN).

    - If seq is shorter than MAX_LEN: append <pad> tokens at the end.
    - If seq is longer than MAX_LEN: truncate it to MAX_LEN.
    """    
    if len(seq) < MAX_LEN:
        return seq + [word2idx["<pad>"]] * (MAX_LEN - len(seq))
    return seq[:MAX_LEN]

# We can test it
tok_example = ["the", "movie", "was", "fantastic", "and", "i", "loved", "every", "minute"]
tok_encoded = encode(tok_example)
print("ENCODED -> ", tok_encoded)
print("PADDED  -> ", pad(tok_encoded))



#### Now we can encode our dataset

In [None]:
X_train_ids = [pad(encode(toks)) for toks in tokenized_docs]
y_train_ids = df_train["label"].values


X_test_ids = [pad(encode(tokenize(t))) for t in df_test["text"]]
y_test_ids = df_test["label"].values

### 3.2 Wrapping our data in a `Dataset` (so PyTorch can use it)

So far, we have:

- `X_train_ids`: a list/array of input sequences  
  – each element is a list of token IDs of length `MAX_LEN`  
- `y_train_ids`: a list/array of labels  
  – each element is `0` (negative) or `1` (positive)

To train with PyTorch, we put this data into a small **wrapper class** that tells PyTorch two things:

1. **How many examples** are in the dataset  
   → this is what `__len__` returns  
2. **How to get one example** (input and label) by index  
   → this is what `__getitem__` returns  

PyTorch has a standard interface for this called `torch.utils.data.Dataset`.
We create our own subclass:

In [None]:
class SentimentDataset(torch.utils.data.Dataset):
    """
    Small wrapper around our data so that PyTorch knows:
    - how many examples we have
    - how to get one example (inputs + label) by index
    """
    def __init__(self, X, y):
        # X: list/array of sequences (lists of token IDs)
        # y: list/array of labels (0 or 1)
        self.X = torch.tensor(X, dtype=torch.long)
        self.y = torch.tensor(y, dtype=torch.long)

    def __len__(self):
        """
        Return the number of examples in the dataset.
        This is used by DataLoader to know when an epoch is finished.
        """
        return len(self.X)

    def __getitem__(self, idx):
        """
        Return one example (inputs, label) at position idx.
        This is used when DataLoader creates batches.
        """
        return self.X[idx], self.y[idx]

train_ds = SentimentDataset(X_train_ids, y_train_ids)
print("Number of training examples:", len(train_ds))
print("Example 0 shapes:", train_ds[0][0].shape, train_ds[0][1])

### 3.3 Defining the Feed Forward Neural Network

The neural network we use here follows the same structure introduced in the lecture:

**Embedding → Mean Pooling → Hidden Layer (ReLU) → Output Layer**

It takes a sequence of token IDs as input (one padded review per row) and converts it into a single prediction: *negative* or *positive*.

Here is what each part of the model does:

1. **Embedding layer**  
   Turns each word index into a dense vector of size `embed_dim`.  
   The embedding matrix has shape **[vocab_size × embed_dim]**.

2. **Mean pooling**  
   Reviews contain different numbers of real tokens, so we average all word embeddings  
   across the sequence to get **one fixed-size vector per document**.  
   Note that `emb.mean(dim=1)` averages over all `MAX_LEN` positions, padding included.

3. **Hidden layer + ReLU**  
   A linear transformation followed by a non-linearity (ReLU).  
   This is the part that allows the model to learn **non-linear decision boundaries**.

4. **Output layer**  
   Maps the hidden representation to **2 logits**, one for each class  
   (negative/positive).  
   A later softmax will convert these logits into probabilities.


In [None]:
import torch
import torch.nn as nn

class FFNN(nn.Module):
    def __init__(self, vocab_size, embed_dim=100, hidden_dim=64, num_classes=2):
        """
        vocab_size: size of the vocabulary
        embed_dim: dimensionality of word embeddings
        hidden_dim: size of the hidden layer
        num_classes: output classes (positive/negative = 2)
        """
        super().__init__()

        # 1) Embedding layer: matrix of shape [vocab_size × embed_dim]
        # Each word index maps to a trainable vector.
        self.embedding = nn.Embedding(
            vocab_size, 
            embed_dim, 
            padding_idx=word2idx["<pad>"]
        )

        # 2) Hidden layer: Linear transformation from embedding → hidden
        # This is exactly W1 x + b1 in the slides.
        self.hidden = nn.Linear(embed_dim, hidden_dim)

        # 3) Non-linearity: ReLU
        # This creates curved / non-linear decision boundaries.
        self.relu = nn.ReLU()

        # 4) Output layer: hidden_dim → num_classes (logits)
        # This is W2 h + b2 in the slides.
        self.output = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):
        """
        x shape: (batch, seq_len), containing token indices
        """

        # Convert indices to embeddings: (batch, seq_len, embed_dim)
        emb = self.embedding(x)

        # Mean pooling across the sequence dimension:
        # (batch, embed_dim)
        pooled = emb.mean(dim=1)

        # Hidden layer + non-linearity
        h = self.relu(self.hidden(pooled))

        # Output layer: class scores (batch, num_classes)
        out = self.output(h)

        return out

model = FFNN(len(vocab)).to(device)
model
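
One detail of the model above: `emb.mean(dim=1)` divides by `MAX_LEN`, so the zero vectors of `<pad>` positions shrink the pooled vector of short reviews. A possible variant, sketched below as an assumption rather than the approach used in this notebook, averages over real tokens only. The helper name `masked_mean` and the toy tensors are hypothetical.

```python
import torch
import torch.nn as nn

def masked_mean(emb, token_ids, pad_idx=0):
    """Average embeddings over real tokens only, ignoring <pad> positions."""
    mask = (token_ids != pad_idx).unsqueeze(-1).to(emb.dtype)  # (batch, seq_len, 1)
    summed = (emb * mask).sum(dim=1)                           # (batch, embed_dim)
    counts = mask.sum(dim=1).clamp(min=1.0)                    # avoid division by zero
    return summed / counts

torch.manual_seed(0)
toy_embedding = nn.Embedding(10, 4, padding_idx=0)
toy_ids = torch.tensor([[3, 5, 0, 0],    # 2 real tokens + 2 pads
                        [7, 2, 4, 1]])   # 4 real tokens
toy_emb = toy_embedding(toy_ids)         # (2, 4, 4)

plain = toy_emb.mean(dim=1)              # divides by seq_len = 4
masked = masked_mean(toy_emb, toy_ids)   # divides by the number of real tokens

print(plain[0])    # half the size of masked[0], because of the two pads
print(masked[0])
```

To try it in `FFNN.forward`, one could replace the pooling line with `pooled = masked_mean(emb, x)` and compare the results.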

### 3.4 Training the FFNN

To train our neural network, we follow the steps seen in the lecture: forward pass, loss computation and backpropagation. Learning occurs by minimising the loss over several passes over the training dataset (epochs).

#### Defining the data loader
The `DataLoader` is PyTorch’s mechanism for serving data to the model during training.  
Instead of feeding all training examples at once, the DataLoader:

- **splits the dataset into mini-batches** (here of size 32),  
- **shuffles the data** at the start of each epoch to avoid learning order-based patterns, and 
- **delivers one batch at a time** inside the training loop.

This batching is crucial: it makes training faster and helps stabilize the gradient estimates.

In [None]:
from torch.utils.data import DataLoader

# Wrap it in a DataLoader that:
# - serves batches of size 32
# - shuffles the data at the beginning of each epoch
train_dl = DataLoader(
    train_ds,
    batch_size=32,
    shuffle=True
)

# Let's inspect one batch
for X_batch, y_batch in train_dl:
    print("Batch X shape:", X_batch.shape)  #(batch_size, MAX_LEN)
    print("Batch y shape:", y_batch.shape)  # (batch_size,)
    break
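
The number of batches per epoch is the dataset size divided by the batch size, rounded up, so the last batch can be smaller. A minimal sketch with a made-up dataset of 10 examples (it uses PyTorch's built-in `TensorDataset` instead of our `SentimentDataset` to stay self-contained):

```python
import math
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical data: 10 examples with 4 token IDs each
toy_X = torch.arange(40).reshape(10, 4)
toy_y = torch.tensor([0, 1] * 5)
toy_dl = DataLoader(TensorDataset(toy_X, toy_y), batch_size=4, shuffle=True)

# Size of each batch served in one epoch
batch_sizes = [len(yb) for _, yb in toy_dl]

print("Batches per epoch:", len(toy_dl), "=", math.ceil(10 / 4))
print("Batch sizes:", batch_sizes)
```

If the smaller last batch is unwanted, `DataLoader` accepts `drop_last=True` to skip it.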

#### Training loop

During training, we repeatedly show the model small batches of data and update its 
parameters using gradient descent.

The key components are:

- **Loss function (`CrossEntropyLoss`)**  
  Measures how far the model’s predictions are from the true labels.  
  In our case, the loss compares the predicted class scores (logits) with the correct class.

- **Optimizer (`SGD` or `Adam`)**  
  Applies gradient descent to update the model’s weights.  
  Adam is a widely used variant of SGD that adapts the step size for each parameter, which often makes
  training more stable and faster.

Inside each epoch, the following steps occur for every mini-batch:

1. **Forward pass:** compute logits from the model.  
2. **Loss computation:** evaluate how wrong the predictions are.  
3. **Backward pass:** compute the gradients of the loss with respect to all trainable parameters.  
4. **Optimizer step:** update the parameters in the direction that reduces the loss.  
5. **Repeat** for all batches.

After each epoch, we report the **average training loss**, which gives a sense of how well the model is fitting the data.


In [None]:
criterion = nn.CrossEntropyLoss()
#optimizer = torch.optim.SGD(model.parameters(), lr=0.5)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

EPOCHS = 10

for epoch in range(EPOCHS):
    total_loss = 0.0
    model.train()

    for X_batch, y_batch in train_dl:   # batches come from the DataLoader
        X_batch = X_batch.to(device)
        y_batch = y_batch.to(device)

        optimizer.zero_grad()

        # 1. Forward pass
        logits = model(X_batch)

        # 2. Compute loss
        loss = criterion(logits, y_batch)

        # 3. Backward pass
        loss.backward()

        # 4. Update parameters
        optimizer.step()

        total_loss += loss.item()

    print(f"Epoch {epoch+1}: Average loss = {total_loss/len(train_dl):.4f}")
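
To make the loss less of a black box: `CrossEntropyLoss` applies a log-softmax to the logits and then averages the negative log-probability of the correct class over the batch. The sketch below checks this on made-up logits (`toy_logits` and `toy_targets` are hypothetical values).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical logits for 3 examples and 2 classes, with their true labels
toy_logits = torch.tensor([[2.0, 0.5], [0.1, 1.2], [1.0, 1.0]])
toy_targets = torch.tensor([0, 1, 0])

# Log-probabilities per class, then pick the entry of the correct class
log_probs = F.log_softmax(toy_logits, dim=1)
manual = -log_probs[torch.arange(3), toy_targets].mean()

# The built-in loss (default reduction is the mean over the batch)
builtin = nn.CrossEntropyLoss()(toy_logits, toy_targets)

print(manual.item(), builtin.item())
```

This is also why the model outputs raw logits: the softmax is already part of the loss.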

### 3.5 Predicting with our neural network
Unlike scikit-learn, PyTorch models do **not** come with a built-in `predict` or `predict_proba` function.  
We only have a `forward` pass (`model(x)`) that takes tensors and returns **logits** (raw scores for each class).

To make this easier to use with text, we define a small helper function that:

1. **Preprocesses the input text**  
   - `tokenize(text)` → list of tokens  
   - `encode(tokens)` → list of word IDs  
   - `pad(ids)` → fixed-length sequence (`MAX_LEN`)  
   - wrap into a tensor of shape `(1, MAX_LEN)` and move it to the right device

2. **Runs the model to get logits**  
   - `logits = model(x)` gives unnormalised scores for each class

3. **Normalises the scores with softmax**  
   - `softmax` turns logits into a probability distribution over the two classes  
   - we use `F.softmax(logits, dim=1)` to get probabilities that sum to 1


In [None]:
import torch.nn.functional as F

def preprocess_text(text):
    tokens = tokenize(text)
    ids = encode(tokens)
    ids_padded = pad(ids)
    x = torch.tensor([ids_padded], dtype=torch.long).to(device)  # shape: (1, MAX_LEN)
    return x

def predict_proba_nn(text):
    model.eval()
    x = preprocess_text(text)

    with torch.no_grad():
        logits = model(x)                       # shape (1, 2)
        probs = F.softmax(logits, dim=1)[0]     # shape (2,)

    # move to CPU and convert to a NumPy array
    return probs.cpu().numpy()


In [None]:
probs = predict_proba_nn("I absolutely loved this movie!")
print("Negative:", probs[0])
print("Positive:", probs[1])

## 4. Evaluating the trained neural network

After training, we want to know:

**How well does the model do on unseen data?**  
For this, we use the **test set** (`X_test_ids`, `y_test_ids`).


Testing on the test set has three steps:

1. Put the model in **evaluation mode** with `model.eval()`.  
   (This tells PyTorch we are not training anymore.)
2. Loop over the test data, compute the model’s predictions, and compare them
   to the true labels.
3. Compute a simple metric, e.g. **accuracy** = number of correct predictions / total.

We do all of this inside a `torch.no_grad()` block so that PyTorch:
- does not track gradients, and
- runs faster and uses less memory during evaluation.

### 4.1 Evaluation loop
We first wrap the test data in the format PyTorch expects (`SentimentDataset`) and create a DataLoader for the evaluation loop, the same way we did for training.

In [None]:
# Create test dataset and dataloader (similar to train_ds/train_dl)
test_ds = SentimentDataset(X_test_ids, y_test_ids)
test_dl = DataLoader(test_ds, batch_size=32, shuffle=False) # no need to shuffle

In [None]:
model.eval()  # set model to evaluation mode

all_y_true = []
all_y_pred = []

with torch.no_grad():  # no gradients needed for testing
    for X_batch, y_batch in test_dl:
        X_batch = X_batch.to(device)  # so that model and batch are on the same device
        y_batch = y_batch.to(device)

        # 1. Forward pass: get logits
        logits = model(X_batch)              # shape: (batch_size, num_classes)

        # 2. Take the argmax to get predicted class index
        preds = torch.argmax(logits, dim=1)  # shape: (batch_size,)

        # 3. Store results as plain Python ints
        all_y_true.extend(y_batch.cpu().numpy().tolist())
        all_y_pred.extend(preds.cpu().numpy().tolist())

len(all_y_true), len(all_y_pred)

Note above that:
- `.to(device)` → move tensors to where the model is (CPU/GPU) for computation.
- `.cpu().numpy()` → move results back to CPU and convert them so libraries like NumPy / sklearn can use them.
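
Once the predictions are plain Python lists, accuracy is just the fraction of matching positions. A minimal sketch with made-up labels (`toy_true` and `toy_pred` are hypothetical; with the real results you would pass `all_y_true` and `all_y_pred`):

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical true and predicted labels
toy_true = [1, 0, 1, 1, 0, 0, 1, 0]
toy_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Accuracy = number of correct predictions / total
acc_manual = np.mean(np.array(toy_true) == np.array(toy_pred))
acc_sklearn = accuracy_score(toy_true, toy_pred)

print(acc_manual, acc_sklearn)
```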

### 4.2 Computing metrics
We again apply the same metrics to the predicted labels.

In [None]:
from sklearn.metrics import confusion_matrix, classification_report, ConfusionMatrixDisplay

cm = confusion_matrix(all_y_true, all_y_pred)
print("Confusion matrix:\n", cm)

print(classification_report(all_y_true, all_y_pred, target_names=["negative", "positive"]))

ConfusionMatrixDisplay(confusion_matrix=cm,
                       display_labels=['negative', 'positive']).plot()

## 5. Further experiments and reflection

1. **Vary the training set size**
Train both models on:
- 1,000 examples  
- 5,000 examples  
- 10,000 examples  
- 20,000 examples  

Questions:
- Which model benefits more from additional data?  
- At what point does the NN start outperforming TF–IDF + LogReg (if at all)?  
- What happens to overfitting as data grows?
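
A possible skeleton for this experiment is sketched below. It uses synthetic data from `make_classification` and smaller sizes so that it runs on its own; these are assumptions for illustration. In the notebook you would instead sample from `df_train_full`, refit the vectorizer (or vocabulary) for each size, and evaluate on the fixed test set.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data, NOT the IMDB reviews
X_all, y_all = make_classification(n_samples=6_000, n_features=20, random_state=42)
X_pool, X_held, y_pool, y_held = train_test_split(
    X_all, y_all, test_size=1_000, stratify=y_all, random_state=42
)

results = {}
for n in (500, 1_000, 2_500, 5_000):
    if n < len(X_pool):
        # Stratified subsample of size n from the training pool
        X_sub, _, y_sub, _ = train_test_split(
            X_pool, y_pool, train_size=n, stratify=y_pool, random_state=42
        )
    else:
        X_sub, y_sub = X_pool, y_pool  # use the whole pool
    clf = LogisticRegression(max_iter=1000).fit(X_sub, y_sub)
    results[n] = clf.score(X_held, y_held)  # accuracy on the same held-out set

for n, acc in results.items():
    print(f"train size {n:>5}: accuracy = {acc:.3f}")
```

Keeping the held-out set fixed across sizes is what makes the accuracies comparable.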

2. **Change the Neural Network Architecture**
Try adjusting:
- Embedding dimension (50, 100, 200)
- Hidden layer size (32, 64, 128)

Questions:
- Does a bigger network always help?  
- When do you start to see signs of overfitting?
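
When comparing architectures, it can help to know how many trainable parameters each configuration has. The sketch below counts them for the Embedding → Hidden → Output structure used in `FFNN`; the helper `count_ffnn_params` and the vocabulary size of 20,000 are assumptions for illustration (use `len(vocab)` for the real number).

```python
def count_ffnn_params(vocab_size, embed_dim, hidden_dim, num_classes=2):
    """Trainable parameters of an Embedding -> Linear -> ReLU -> Linear model."""
    embedding = vocab_size * embed_dim               # one vector per word
    hidden = embed_dim * hidden_dim + hidden_dim     # weights + biases
    output = hidden_dim * num_classes + num_classes  # weights + biases
    return embedding + hidden + output

hypothetical_vocab_size = 20_000  # assumption; replace with len(vocab)

for embed_dim in (50, 100, 200):
    for hidden_dim in (32, 64, 128):
        n = count_ffnn_params(hypothetical_vocab_size, embed_dim, hidden_dim)
        print(f"embed_dim={embed_dim:>3}, hidden_dim={hidden_dim:>3}: {n:,} parameters")
```

The count is dominated by the embedding matrix, so changing `embed_dim` affects model size far more than changing `hidden_dim`.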