# 🧠 LLM Project — Language Model Exploration

This notebook loads pretrained language models (LMs) and analyzes their behavior. It supports DistilBERT / BERT (masked LM) and GPT-2 (causal LM). The notebook contains:
- Installation & setup
- Loading one of (DistilBERT, BERT, GPT-2)
- Text classification (sentiment) and Cloze tasks (masked LM)
- Text generation (GPT-2)
- Pseudo-perplexity evaluation (approximate for causal LM)
- Visualizations: bar charts, attention heatmap / token importance
- Research questions and conclusion template

> Make sure you have internet access on the first run so Hugging Face can download the models.

In [None]:
# Install dependencies (run once)
!pip install -q transformers torch matplotlib numpy

## Imports & utility functions

In [None]:
import math
import torch
import numpy as np
import matplotlib.pyplot as plt
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    AutoModelForCausalLM,
    AutoModel,
    pipeline,
)
# Device config
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print('Using device:', device)

## Model selection
Change `MODEL_NAME` to one of: `distilbert-base-uncased`, `bert-base-uncased`, `gpt2`.

In [None]:
# Choose model
MODEL_NAME = 'distilbert-base-uncased'  # or 'bert-base-uncased' or 'gpt2'

is_masked_lm = 'bert' in MODEL_NAME  # also true for 'distilbert-...'
is_causal_lm = MODEL_NAME.startswith('gpt')
print('Model:', MODEL_NAME, 'MaskedLM:', is_masked_lm, 'CausalLM:', is_causal_lm)

# Load tokenizer and model/backbone (backbone for attention visualization)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
if is_masked_lm:
    model = AutoModelForMaskedLM.from_pretrained(MODEL_NAME).to(device)
    # Separate encoder without the LM head, configured to return attention weights
    # (this loads the pretrained weights a second time)
    backbone = AutoModel.from_pretrained(MODEL_NAME, output_attentions=True).to(device)
elif is_causal_lm:
    model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).to(device)
    backbone = None
else:
    raise ValueError('Unsupported model choice')

model.eval()
print('Loaded model and tokenizer')

## Text classification (Sentiment) — quick pipeline
- We'll use Hugging Face `pipeline('sentiment-analysis')` for a quick classification baseline.
- Called without a `model` argument, the pipeline downloads its own default sentiment checkpoint. It is independent of the LM you chose above, but it covers the assignment section on text classification.

In [None]:
try:
    sentiment = pipeline('sentiment-analysis')
    examples = [
        'I love this product! It works great and exceeded my expectations.',
        'This was a waste of money. Very disappointed.'
    ]
    for ex in examples:
        print(ex)
        print(sentiment(ex))
        print()
except Exception as e:
    print('Sentiment pipeline failed:', e)
    print('You can still run model-specific tasks below.')

## Cloze task (Masked LM) — example probes
If you selected a masked LM (BERT/DistilBERT), we'll run a few cloze-style sentences with `[MASK]` and show top-k predictions.

In [None]:
def top_mask_predictions(text, k=5):
    """Return top-k token predictions for the first mask in `text` (uses tokenizer's mask token)."""
    if tokenizer.mask_token is None:
        raise ValueError('Selected tokenizer has no mask token')
    t = text.replace('[MASK]', tokenizer.mask_token)
    inputs = tokenizer(t, return_tensors='pt').to(device)
    with torch.no_grad():
        outputs = model(**inputs)
        logits = outputs.logits
    mask_indices = (inputs['input_ids'][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
    if len(mask_indices) == 0:
        raise ValueError('No mask in input')
    idx = mask_indices[0].item()
    probs = torch.softmax(logits[0, idx], dim=-1)
    topk = torch.topk(probs, k)
    tokens = [tokenizer.decode([i]) for i in topk.indices.tolist()]
    scores = topk.values.tolist()
    return list(zip(tokens, scores))

if is_masked_lm:
    probes = [
        ('The capital of France is [MASK].', 'paris'),
        ('An apple a day keeps the [MASK] away.', 'doctor'),
        ('The quick brown fox jumps over the lazy [MASK].', 'dog'),
    ]
    for text, answer in probes:
        try:
            preds = top_mask_predictions(text, k=5)
            print('\nPrompt:', text)
            print('Top-5:', preds)
            print('Expected answer:', answer)
        except Exception as e:
            print('Error on', text, e)
else:
    print('Cloze task is for masked LMs (BERT/DistilBERT). Change MODEL_NAME to a masked model to run this.')
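
To make the ranking step in `top_mask_predictions` concrete, here is a minimal pure-Python sketch of "softmax over the logits at the mask position, then take the top-k". The vocabulary and logits are made-up toy values, and `softmax` / `top_k` are illustrative helpers, not part of the notebook's code or of any library.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities (subtracting the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(probs, vocab, k=5):
    """Return the k most probable (token, probability) pairs, best first."""
    ranked = sorted(zip(vocab, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Toy vocabulary and logits (made-up numbers, not real model output).
toy_vocab = ['paris', 'lyon', 'london', 'berlin']
toy_logits = [4.0, 2.0, 1.0, 0.5]

toy_probs = softmax(toy_logits)
for token, p in top_k(toy_probs, toy_vocab, k=2):
    print(f'{token}: {p:.3f}')
```

The real function does the same thing over the full vocabulary with `torch.softmax` and `torch.topk`.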

## Text generation (GPT-2)
If you selected GPT-2, we'll generate continuations for a few prompts.

In [None]:
def generate_with_gpt2(prompt, max_new_tokens=50, temperature=0.8, top_p=0.9, num_return_sequences=1):
    """Sample continuations of `prompt` with temperature and nucleus (top-p) sampling."""
    gen = pipeline('text-generation', model=model, tokenizer=tokenizer, device=0 if device=='cuda' else -1)
    # max_new_tokens bounds the continuation directly; GPT-2 has no pad token, so reuse EOS.
    out = gen(prompt, max_new_tokens=max_new_tokens,
              do_sample=True, temperature=temperature, top_p=top_p,
              num_return_sequences=num_return_sequences,
              pad_token_id=tokenizer.eos_token_id)
    return [o['generated_text'] for o in out]

if is_causal_lm:
    prompts = [
        'Once upon a time in a small village,',
        'Advances in machine learning have shown that',
        'The future of education will likely be',
    ]
    for p in prompts:
        print('\nPrompt:', p)
        try:
            gens = generate_with_gpt2(p, max_new_tokens=40, num_return_sequences=2)
            for g in gens:
                print('-', g)
        except Exception as e:
            print('Generation failed:', e)
else:
    print('Text generation section: change MODEL_NAME to a GPT-2 model to run generation.')
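
The `temperature` and `top_p` arguments above are easier to reason about on a toy distribution. The sketch below is a simplified pure-Python illustration of the idea (rescale logits by the temperature, then keep the smallest set of top tokens whose probability mass reaches `top_p`); it is not the library's exact implementation, and the logits are made up.

```python
import math

def apply_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a softmax."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def nucleus_filter(probs, top_p):
    """Keep the most probable tokens until their cumulative mass reaches top_p.

    Returns {token_index: renormalised_probability} for the kept tokens.
    """
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return {i: probs[i] / mass for i in kept}

# Toy logits for a 4-token vocabulary (made-up numbers).
toy_logits = [2.0, 1.0, 0.5, -1.0]
for temp in (0.5, 1.0, 1.5):
    probs = apply_temperature(toy_logits, temp)
    nucleus = nucleus_filter(probs, top_p=0.9)
    print(f'T={temp}: keeps token indices {sorted(nucleus)}')
```

Lower temperatures sharpen the distribution, so fewer tokens are needed to reach the same `top_p` mass and sampling becomes more conservative.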

## Pseudo-perplexity evaluation (approximate)
For causal LMs we score each token conditioned on its prefix and report the exponentiated mean negative log-likelihood. The first token has no prefix and is not scored, and the sentences are short, so treat the number as a rough pseudo-perplexity for relative comparison rather than a corpus-level PPL. For masked LMs we instead compute top-1 cloze accuracy on a small probe set.

In [None]:
def pseudo_perplexity_causal(text, model, tokenizer):
    """Approximate perplexity for causal models by scoring each token with the prefix."""
    enc = tokenizer(text, return_tensors='pt').to(device)
    input_ids = enc['input_ids'][0]
    nlls = []
    # we iterate from 1..N-1 (score each token given previous tokens)
    for i in range(1, input_ids.size(0)):
        prefix = input_ids[:i].unsqueeze(0)
        target_id = input_ids[i].unsqueeze(0)
        with torch.no_grad():
            outputs = model(prefix)
            logits = outputs.logits[:, -1, :]
            log_probs = torch.log_softmax(logits, dim=-1)
            nll = -log_probs[0, target_id]
            nlls.append(nll.item())
    if len(nlls) == 0:
        return float('nan')
    return float(np.exp(np.mean(nlls)))

def masked_lm_cloze_accuracy(probes):
    correct = 0
    details = []
    for text, answer in probes:
        try:
            preds = top_mask_predictions(text, k=5)
            top1 = preds[0][0].strip().lower()
            gold = answer.strip().lower()
            # Lenient match: a top-1 token that contains the answer (e.g. a plural) also counts.
            ok = gold in top1
            correct += int(ok)
            details.append((text, answer, preds, ok))
        except Exception as e:
            details.append((text, answer, str(e), False))
    # Guard against an empty probe list.
    acc = correct / len(probes) if probes else float('nan')
    return acc, details

if is_causal_lm:
    texts = [
        'Natural language processing enables machines to understand human language.',
        'Transformers have changed the field of deep learning for sequence modeling.'
    ]
    for t in texts:
        try:
            ppl = pseudo_perplexity_causal(t, model, tokenizer)
            print(f'Text: "{t}" -> Pseudo-PPL: {ppl:.2f}')
        except Exception as e:
            print('Pseudo-PPL failed on', t, e)
elif is_masked_lm:
    probes = [
        ('The capital of France is [MASK].', 'paris'),
        ('An apple a day keeps the [MASK] away.', 'doctor'),
        ('He drank a cup of [MASK] in the morning.', 'coffee')
    ]
    acc, details = masked_lm_cloze_accuracy(probes)
    print(f'Masked LM Top-1 Cloze Accuracy on {len(probes)} probes: {acc:.2%}')
    for d in details:
        print(d)
else:
    print('No evaluation run.')
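
The final step of `pseudo_perplexity_causal` is just "exponentiate the mean negative log-likelihood". The pure-Python sketch below isolates that step on hand-picked log-probabilities so the number is easy to sanity-check; `perplexity_from_log_probs` is an illustrative helper and the probabilities are made up.

```python
import math

def perplexity_from_log_probs(token_log_probs):
    """Return exp(mean negative log-likelihood); NaN if no tokens were scored."""
    if not token_log_probs:
        return float('nan')
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)

# A model that gives every token probability 1/4 is as uncertain as a 4-way guess.
uniform = [math.log(0.25)] * 6
# A model that gives every token probability 0.9 is far more confident.
confident = [math.log(0.9)] * 6

print('uniform  :', perplexity_from_log_probs(uniform))
print('confident:', perplexity_from_log_probs(confident))
```

Perplexity can be read as the effective number of equally likely choices per token, which is why lower values indicate a better fit to the text.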

## Visualization: Performance bar chart
Example showing how you might compare models. Replace `example_metrics` with real computed values from your experiments.

In [None]:
def plot_metrics(metrics_dict, title='Model performance'):
    labels = list(metrics_dict.keys())
    values = [metrics_dict[k] for k in labels]
    plt.figure(figsize=(7,4))
    plt.bar(labels, values)
    plt.ylim(0, max(1.0, max(values)*1.1))
    plt.ylabel('Score')
    plt.title(title)
    plt.show()

# Illustrative placeholder values only (not measured results); replace with your own.
# Accuracy (higher is better) and perplexity (lower is better) live on different
# scales, so the pseudo-PPL entry must be rescaled before it shares an axis.
example_metrics = {
    'DistilBERT': 0.67,
    'BERT': 0.72,
    'GPT-2 (pseudo-PPL scaled)': 0.55,
}
plot_metrics(example_metrics, title='Example comparison (illustrative)')
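
The chart above puts a "scaled" pseudo-PPL next to accuracies without saying how it was scaled. As one possible approach, the sketch below maps a perplexity into (0, 1] so that higher is better. The function `ppl_to_unit_score` and its formula are hypothetical choices made for illustration; they are not the scaling behind the placeholder value in the chart.

```python
import math

def ppl_to_unit_score(ppl):
    """Map a perplexity (>= 1) into (0, 1], with 1.0 meaning a perfect model.

    Hypothetical monotone mapping: score = 1 / (1 + ln(ppl)).
    """
    if ppl < 1.0:
        raise ValueError('perplexity must be >= 1')
    return 1.0 / (1.0 + math.log(ppl))

for ppl in (1.0, 10.0, 100.0):
    print(f'ppl={ppl:>6}: score={ppl_to_unit_score(ppl):.3f}')
```

Any such mapping is arbitrary, so scores produced this way are comparable only to each other. Plotting accuracy and perplexity on separate axes avoids the issue entirely.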

## Attention / Token importance visualization (for masked LMs)
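We plot the attention matrix of the last layer's first head as a heatmap. Each row is a query token and each column a key token, so a row shows how that token distributes its attention over the sentence. A single head is only a snapshot, not a full account of what the model uses.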

In [None]:
if is_masked_lm and backbone is not None:
    sentence = 'Transformers are powerful models for natural language processing.'
    inputs = tokenizer(sentence, return_tensors='pt').to(device)
    with torch.no_grad():
        out = backbone(**inputs)
    # out.attentions: tuple of length num_layers; each tensor has shape (batch, num_heads, seq, seq)
    if hasattr(out, 'attentions') and out.attentions is not None:
        atts = out.attentions  # tuple
        last = atts[-1][0, 0].cpu().numpy()  # head 0
        tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])
        plt.figure(figsize=(6,5))
        plt.imshow(last, aspect='auto')
        plt.xticks(range(len(tokens)), tokens, rotation=90)
        plt.yticks(range(len(tokens)), tokens)
        plt.title('Attention (last layer, head 0)')
        plt.colorbar()
        plt.tight_layout()
        plt.show()
    else:
        print('No attention available from backbone for this model.')
else:
    print('Attention visualization is available only for masked LMs (BERT/DistilBERT).')
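
The heatmap shows the full matrix, but a per-token "importance" number can be easier to compare. One crude option, sketched below in pure Python, is the average attention each token receives (the column mean of a `[query, key]` matrix). The matrix is a made-up toy example, `attention_received` is an illustrative helper, and attention received is only a rough proxy for importance.

```python
def attention_received(attn, tokens):
    """Average each column of a [query][key] attention matrix.

    Returns (token, mean attention received) pairs, highest first.
    """
    n = len(tokens)
    scores = [sum(attn[q][k] for q in range(n)) / n for k in range(n)]
    return sorted(zip(tokens, scores), key=lambda pair: pair[1], reverse=True)

# Toy 4x4 attention matrix; each row sums to 1 like a softmax output.
toy_tokens = ['[CLS]', 'cats', 'sleep', '[SEP]']
toy_attn = [
    [0.70, 0.10, 0.10, 0.10],
    [0.20, 0.30, 0.40, 0.10],
    [0.10, 0.50, 0.30, 0.10],
    [0.40, 0.10, 0.10, 0.40],
]

for token, score in attention_received(toy_attn, toy_tokens):
    print(f'{token:>6}: {score:.3f}')
```

With real model output, the same calculation would run on `atts[-1][0, 0]` (converted to nested lists or handled with NumPy). Special tokens such as `[CLS]` and `[SEP]` often collect a large share of attention, so it can help to exclude them before ranking.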

## Research questions & experimental ideas
- **Contextual understanding**: How well does the LM fill cloze tasks that require world knowledge vs. local syntax?
- **Robustness**: How sensitive is cloze accuracy / generation to small paraphrases, typos, or domain shift?
- **Creativity & coherence**: For GPT-2, evaluate generated text quality for different temperatures and nucleus sampling (top-p).
- **Interpretability**: Do attention heads focus on syntax (function words) or key content words? Visualize per-head attention.
- **Bias and safety**: Probe models for stereotypical outputs and add filters / post-processing to mitigate.
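
For the robustness question, a simple way to create perturbed probes is to inject a single typo while leaving the mask token intact. The sketch below swaps one random pair of adjacent letters. `add_typo` is a hypothetical helper, it protects only the first occurrence of the literal string `[MASK]`, and a fixed seed keeps the perturbation reproducible.

```python
import random

def add_typo(text, rng, protected='[MASK]'):
    """Swap one random pair of adjacent letters, never touching `protected`."""
    start = text.find(protected)
    end = start + len(protected) if start != -1 else -1
    positions = [
        i for i in range(len(text) - 1)
        if text[i].isalpha() and text[i + 1].isalpha()
        and not (start <= i < end)
    ]
    if not positions:
        return text  # nothing safe to swap
    i = rng.choice(positions)
    chars = list(text)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return ''.join(chars)

rng = random.Random(0)  # fixed seed for reproducible typos
original = 'The capital of France is [MASK].'
for _ in range(3):
    print(add_typo(original, rng))
```

Running the cloze probes on both the original and the perturbed sentences, then comparing accuracy, gives a first robustness measurement.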

Suggested experiments:
1. Create a larger curated probe set (100+ cloze sentences) across several domains (general, medical, legal, conversational) and compute cloze accuracy.
2. Compare pseudo-perplexity of GPT-2 across domains.
3. Try few-shot prompting (for GPT-2) and measure whether in-context examples improve a chosen task metric (e.g., how often the continuation contains the expected answer).
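
For the multi-domain probe set in experiment 1, accuracy is most informative when reported per domain. The sketch below aggregates top-1 accuracy by domain. `accuracy_by_domain` and the canned `fake_predict` are hypothetical stand-ins, and the probes are made up. In the notebook, the predictor could wrap `top_mask_predictions(text, k=1)`.

```python
from collections import defaultdict

def accuracy_by_domain(probes, predict_top1):
    """Compute exact-match top-1 accuracy per domain.

    probes: iterable of (domain, text, answer) tuples.
    predict_top1: callable mapping text -> predicted token (string).
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for domain, text, answer in probes:
        totals[domain] += 1
        prediction = predict_top1(text).strip().lower()
        hits[domain] += int(prediction == answer.strip().lower())
    return {domain: hits[domain] / totals[domain] for domain in totals}

# Made-up probes and a canned predictor standing in for a real model.
toy_probes = [
    ('general', 'The capital of France is [MASK].', 'paris'),
    ('medical', 'The patient was given [MASK] for the infection.', 'antibiotics'),
    ('medical', 'The surgeon made an [MASK] in the abdomen.', 'incision'),
]
canned = {
    toy_probes[0][1]: 'paris',
    toy_probes[1][1]: 'antibiotics',
    toy_probes[2][1]: 'opening',
}

def fake_predict(text):
    return canned.get(text, '')

print(accuracy_by_domain(toy_probes, fake_predict))
```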

## Conclusion & next steps
- Summarize your findings here (e.g., DistilBERT achieved X% top-1 cloze accuracy on small probe; GPT-2 produced fluent continuations but had a pseudo-PPL of Y).
- Limitations: small probe set, approximate metrics, no fine-tuning.
- Next steps: expand probes, fine-tune on domain data (if compute permits), compare more models, add human evaluation for generation quality.