# Intro to Hugging Face Transformers

This notebook covers core transformer tasks using the [Hugging Face Transformers](https://huggingface.co/docs/transformers/) library:

1. **Text Generation**: Generate text with GPT-2
2. **Text Embeddings**: Extract BERT embeddings and measure semantic similarity
3. **Sentiment Analysis**: Classify text with a pretrained DistilBERT pipeline
4. **Fine-Tuning**: Fine-tune GPT-2 on your own text data

## Setup

In [None]:
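# pandas, matplotlib, and jinja2 are needed for the styled similarity table in Part 2
# !pip install transformers torch pandas matplotlib jinja2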

In [None]:
import torch
from transformers import pipeline

# Prefer an NVIDIA GPU, then Apple Silicon (MPS), then fall back to the CPU.
if torch.cuda.is_available():
    device = "cuda"
elif torch.backends.mps.is_available():
    device = "mps"
else:
    device = "cpu"
print(f"Using device: {device}")

---
## Part 1: Text Generation with GPT-2

GPT-2 is an autoregressive language model: it generates text one token at a time, predicting each new token from all the tokens that precede it.
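
As a rough illustration of that loop (not GPT-2 itself), the sketch below repeatedly picks the most likely next token from a hand-written lookup table and appends it to the sequence. The table `NEXT_TOKEN_PROBS` and its values are hypothetical, and it only looks at the last token to keep the example tiny, whereas a real model conditions on the whole prefix.

```python
# Toy next-token table (hypothetical values, for illustration only).
NEXT_TOKEN_PROBS = {
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.7, "ran": 0.3},
    "sat": {"down": 0.9, "<eos>": 0.1},
    "down": {"<eos>": 1.0},
}


def greedy_generate(prompt_tokens, max_new_tokens=10):
    """Append the most likely next token until <eos> or the token budget is hit."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        candidates = NEXT_TOKEN_PROBS.get(tokens[-1])
        if not candidates:
            break  # no known continuation for this token
        next_token = max(candidates, key=candidates.get)
        if next_token == "<eos>":
            break  # end-of-sequence: stop generating
        tokens.append(next_token)
    return tokens


print(greedy_generate(["the"]))  # ['the', 'cat', 'sat', 'down']
```

Greedy selection always returns the same continuation; the sampling parameters discussed later in this part are what introduce variety.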

| Model | Parameters | HF Name |
|-------|-----------|----------|
| Small | 124M | `gpt2` |
| Medium | 355M | `gpt2-medium` |
| Large | 774M | `gpt2-large` |
| XL | 1.5B | `gpt2-xl` |

In [None]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer

gpt2_tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
gpt2_model = GPT2LMHeadModel.from_pretrained("gpt2").to(device)

print(f"Loaded gpt2 ({gpt2_model.num_parameters():,} parameters)")

In [None]:
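def generate(model, tokenizer, prompt, max_new_tokens=100, temperature=0.7,
             top_k=50, top_p=0.9, num_samples=1):
    """Sample `num_samples` continuations of `prompt` and print each one."""
    # Move the encoded prompt to whichever device the model lives on.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)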
    outputs = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        temperature=temperature,
        top_k=top_k,
        top_p=top_p,
        do_sample=True,
        num_return_sequences=num_samples,
        pad_token_id=tokenizer.eos_token_id,
    )
    for i, output in enumerate(outputs):
        text = tokenizer.decode(output, skip_special_tokens=True)
        if num_samples > 1:
            print(f"--- Sample {i + 1} ---")
        print(text)
        print()

In [None]:
generate(gpt2_model, gpt2_tokenizer, "The secret of life is", max_new_tokens=50)

### Generation Parameters

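- **temperature**: Scales the logits before sampling. Lower values (e.g. 0.3) concentrate probability on the likeliest tokens and give more focused output; at 1.0 the model's distribution is used unchanged, and higher values make the output increasingly random.
- **top_k**: Sample only from the k most likely next tokens.
- **top_p**: Nucleus sampling. Sample only from the smallest set of most likely tokens whose cumulative probability reaches p.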
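
To make the three parameters concrete, here is a minimal plain-Python sketch of the idea on a toy four-token vocabulary. It is a simplified illustration, not the library's implementation; the function names `softmax_with_temperature` and `filter_top_k_top_p` and the logits are made up for this example.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def filter_top_k_top_p(probs, top_k=50, top_p=0.9):
    """Return {token_index: renormalized prob} for tokens surviving both filters."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    ranked = ranked[:top_k]  # top-k: keep the k most likely tokens
    mass = sum(probs[i] for i in ranked)
    kept, cumulative = [], 0.0
    for i in ranked:  # top-p: keep the smallest prefix whose mass reaches p
        kept.append(i)
        cumulative += probs[i] / mass
        if cumulative >= top_p:
            break
    kept_mass = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_mass for i in kept}


logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for a 4-token vocabulary
probs = softmax_with_temperature(logits, temperature=1.0)
kept = filter_top_k_top_p(probs, top_k=3, top_p=0.8)
print(sorted(kept))  # [0, 1]
```

Sampling then draws from the surviving tokens in proportion to their renormalized probabilities, so tightening either filter narrows the pool of candidates.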

In [None]:
generate(gpt2_model, gpt2_tokenizer, "Once upon a time",
         max_new_tokens=80, temperature=0.9, num_samples=3)

---
## Part 2: Text Embeddings with BERT

BERT produces contextual embeddings: vector representations in which each token's vector depends on the words around it, so the same word can map to different vectors in different sentences. These embeddings power similarity search, clustering, and classification.

We use the `[CLS]` token's embedding as a representation of the entire input sequence.
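
The `[CLS]` vector is one common choice. Another is to average the token vectors while ignoring padding positions. The sketch below shows that masked mean on plain Python lists; `token_vectors` and `attention_mask` are hypothetical stand-ins for one row of `last_hidden_state` and the tokenizer's attention mask. Which pooling works better depends on the model and the task, so treat this as an option to try rather than a recommendation.

```python
def masked_mean_pool(token_vectors, attention_mask):
    """Average token vectors where attention_mask is 1, skipping padding (0)."""
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m == 1]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


# Three tokens with 2-dimensional vectors; the last position is padding.
token_vectors = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
attention_mask = [1, 1, 0]
print(masked_mean_pool(token_vectors, attention_mask))  # [2.0, 3.0]
```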

In [None]:
from transformers import BertModel, BertTokenizer

bert_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert_model = BertModel.from_pretrained("bert-base-uncased").to(device)

print(f"Loaded bert-base-uncased ({bert_model.num_parameters():,} parameters)")

In [None]:
def get_embeddings(texts, tokenizer, model):
    """Get [CLS] token embeddings for a list of texts."""
    inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True, max_length=512).to(device)
    with torch.no_grad():
        outputs = model(**inputs)
    # [CLS] token is at position 0
    return outputs.last_hidden_state[:, 0, :]

In [None]:
sentences = [
    "The cat sat on the mat.",
    "A kitten rested on the rug.",
    "Stock prices rose sharply today.",
    "The financial markets surged.",
]

embeddings = get_embeddings(sentences, bert_tokenizer, bert_model)
print(f"Embedding shape: {embeddings.shape}")
print(f"(batch_size={embeddings.shape[0]}, hidden_size={embeddings.shape[1]})")

### Semantic Similarity

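Cosine similarity compares the direction of two embedding vectors: values near 1 suggest semantically similar sentences, and lower values suggest less related ones. Raw BERT `[CLS]` embeddings tend to score high for almost any pair of sentences, so the heatmap below starts its color scale at 0.8 to keep the differences visible.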
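
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. A minimal plain-Python version, using a helper named `cosine` that is introduced here only for illustration:

```python
import math


def cosine(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


print(cosine([1.0, 0.0], [2.0, 0.0]))   # 1.0  (same direction)
print(cosine([1.0, 0.0], [0.0, 3.0]))   # 0.0  (orthogonal)
print(cosine([1.0, 0.0], [-1.0, 0.0]))  # -1.0 (opposite direction)
```

Because the result depends only on direction, scaling a vector does not change its similarity to any other vector.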

In [None]:
import pandas as pd
from torch.nn.functional import normalize

# Cosine similarity for every pair at once: scale each embedding to unit
# length, then take all pairwise dot products. Move the result to the CPU
# so it can be converted to NumPy below.
unit = normalize(embeddings, dim=1)
sim_matrix = (unit @ unit.T).cpu()

labels = [s[:30] + "..." if len(s) > 30 else s for s in sentences]
sim_df = pd.DataFrame(sim_matrix.numpy(), index=labels, columns=labels)
sim_df.style.background_gradient(cmap="YlOrRd", vmin=0.8, vmax=1.0).format("{:.3f}")

---
## Part 3: Sentiment Analysis

The Hugging Face `pipeline` function provides a high-level API for common tasks. When no model is specified, the sentiment analysis pipeline defaults to a DistilBERT model fine-tuned on SST-2, a dataset of sentences from movie reviews.
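
The `score` a classification pipeline reports is typically a softmax probability computed from the model's output logits. A minimal sketch with made-up logits follows; the helper `to_prediction`, the logit values, and the `LABELS` order are assumptions for illustration, not output from the actual model.

```python
import math

LABELS = ["NEGATIVE", "POSITIVE"]  # assumed label order, for illustration


def to_prediction(logits):
    """Convert raw logits into a {'label', 'score'} dict via softmax + argmax."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": LABELS[best], "score": probs[best]}


print(to_prediction([-2.0, 3.0]))
```

A score close to 0.5 therefore means the two logits were nearly equal, which is worth keeping in mind for lukewarm reviews.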

In [None]:
sentiment = pipeline("sentiment-analysis", device=device)

reviews = [
    "This movie was absolutely wonderful! The acting was superb.",
    "Terrible film. I walked out after 30 minutes.",
    "It was okay, nothing special but not bad either.",
    "A masterpiece of modern cinema. Truly breathtaking.",
    "The plot made no sense and the dialogue was awful.",
]

results = sentiment(reviews)

for review, result in zip(reviews, results):
    print(f"{result['label']:8} ({result['score']:.3f})  {review}")

### Other Pipelines

Pipelines cover many other tasks as well. Two examples follow: named entity recognition and zero-shot classification.

In [None]:
# Named Entity Recognition
ner = pipeline("ner", aggregation_strategy="simple", device=device)
entities = ner("Barack Obama graduated from Harvard Law School and served as President of the United States.")

for ent in entities:
    print(f"{ent['entity_group']:10} {ent['word']:20} (score: {ent['score']:.3f})")

In [None]:
# Zero-shot classification — classify text without training
classifier = pipeline("zero-shot-classification", device=device)

result = classifier(
    "The new iPhone features a faster processor and improved camera system.",
    candidate_labels=["technology", "politics", "sports", "science"]
)

for label, score in zip(result["labels"], result["scores"]):
    print(f"{label:15} {score:.3f}")
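
Zero-shot classification is usually framed as natural language inference: each candidate label is turned into a hypothesis sentence, an NLI model scores how strongly the input entails it, and the scores are normalized across labels. The sketch below imitates that flow with made-up entailment logits in place of a real model. The `TEMPLATE` string is assumed to match the pipeline's default hypothesis template, and `FAKE_ENTAILMENT_LOGITS` and `zero_shot_rank` are hypothetical names.

```python
import math

TEMPLATE = "This example is {}."  # assumed default hypothesis template

# Made-up entailment logits standing in for an NLI model's output.
FAKE_ENTAILMENT_LOGITS = {
    "technology": 4.0,
    "science": 1.5,
    "sports": -2.0,
    "politics": -2.5,
}


def zero_shot_rank(candidate_labels):
    """Build one hypothesis per label, then softmax the entailment logits."""
    hypotheses = {label: TEMPLATE.format(label) for label in candidate_labels}
    logits = [FAKE_ENTAILMENT_LOGITS[label] for label in candidate_labels]
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    scores = {label: e / total for label, e in zip(candidate_labels, exps)}
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return hypotheses, ranked


hypotheses, ranked = zero_shot_rank(["technology", "politics", "sports", "science"])
print(hypotheses["technology"])
for label, score in ranked:
    print(f"{label:15} {score:.3f}")
```

Because the scores are normalized across the candidate labels, adding or removing a label changes every score, not just one.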

---
## Part 4: Fine-Tune GPT-2 on Custom Text

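Fine-tuning continues training the pretrained model on your own text so that it generates text in the style of your dataset. The training loop below updates `gpt2_model` in place, so any generation cell from Part 1 that you re-run afterwards will use the fine-tuned weights.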

Set `TRAIN_FILE` to the path of a `.txt` file you want to train on.

In [None]:
TRAIN_FILE = "train.txt"  # path to your text file
OUTPUT_DIR = "gpt2-finetuned"
EPOCHS = 3
BATCH_SIZE = 2
BLOCK_SIZE = 128  # sequence length for training chunks
LEARNING_RATE = 5e-5
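
These settings determine how long training runs. As a back-of-the-envelope check, the sketch below estimates the number of training chunks and optimizer steps for a hypothetical corpus of 100,000 tokens. It assumes full, non-overlapping chunks and that the final, smaller batch is kept (PyTorch's `DataLoader` default); `training_steps` is a helper introduced only for this example.

```python
import math


def training_steps(num_tokens, block_size, batch_size, epochs):
    """Estimate chunks, steps per epoch, and total optimizer steps."""
    num_chunks = num_tokens // block_size  # full, non-overlapping chunks
    steps_per_epoch = math.ceil(num_chunks / batch_size)  # last batch may be smaller
    return num_chunks, steps_per_epoch, steps_per_epoch * epochs


# Hypothetical corpus size combined with the settings above.
print(training_steps(100_000, block_size=128, batch_size=2, epochs=3))
# (781, 391, 1173)
```

Doubling `BLOCK_SIZE` roughly halves the number of steps but increases the memory needed per batch.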

In [None]:
from torch.utils.data import Dataset, DataLoader
import os


class TextDataset(Dataset):
    """Tokenize a text file and split it into fixed-length chunks for training."""

    def __init__(self, file_path, tokenizer, block_size):
        with open(file_path, "r", encoding="utf-8") as f:
            text = f.read()

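        # Tokenize the whole file at once. The tokenizer may warn that the
        # sequence exceeds the model's maximum length; that is harmless here
        # because the tokens are split into block_size chunks before training.
        tokens = tokenizer.encode(text)
        # The "+ 1" keeps the final chunk when len(tokens) is an exact
        # multiple of block_size; a trailing partial chunk is still dropped.
        self.examples = [
            torch.tensor(tokens[i : i + block_size])
            for i in range(0, len(tokens) - block_size + 1, block_size)
        ]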
        print(f"Loaded {len(tokens):,} tokens -> {len(self.examples)} training chunks")

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, idx):
        return self.examples[idx]

In [None]:
train_dataset = TextDataset(TRAIN_FILE, gpt2_tokenizer, BLOCK_SIZE)
train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)

In [None]:
from torch.optim import AdamW

optimizer = AdamW(gpt2_model.parameters(), lr=LEARNING_RATE)
gpt2_model.train()

for epoch in range(EPOCHS):
    total_loss = 0
    for step, batch in enumerate(train_loader):
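        batch = batch.to(device)
        # Passing the inputs as labels is standard for causal LM training:
        # the model shifts them internally to score next-token predictions.
        outputs = gpt2_model(batch, labels=batch)
        loss = outputs.loss

        loss.backward()        # compute gradients
        optimizer.step()       # update the weights
        optimizer.zero_grad()  # clear gradients before the next batch
        total_loss += loss.item()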

        if (step + 1) % 50 == 0:
            print(f"  Epoch {epoch + 1}, Step {step + 1}, Loss: {loss.item():.4f}")

    avg_loss = total_loss / len(train_loader)
    print(f"Epoch {epoch + 1}/{EPOCHS} complete. Avg loss: {avg_loss:.4f}")
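
Passing the same tensor as both input and `labels` works because the causal language-modeling head shifts the labels internally, so the prediction at position t is scored against the token at position t + 1. The plain-Python sketch below mimics that shift on a toy sequence, with a hypothetical `uniform_model` standing in for a real network, and shows how an average loss converts to perplexity via `exp(loss)`.

```python
import math


def causal_lm_loss(token_ids, prob_of_next):
    """Mean negative log-likelihood of each token given its prefix."""
    losses = []
    for t in range(len(token_ids) - 1):
        prefix, target = token_ids[: t + 1], token_ids[t + 1]  # the one-step shift
        losses.append(-math.log(prob_of_next(prefix, target)))
    return sum(losses) / len(losses)


def uniform_model(prefix, target, vocab_size=4):
    """Hypothetical model that assigns equal probability to every token."""
    return 1.0 / vocab_size


loss = causal_lm_loss([0, 1, 2, 3], uniform_model)
print(f"loss={loss:.4f}  perplexity={math.exp(loss):.1f}")
# loss=1.3863  perplexity=4.0
```

A model that guesses uniformly over a vocabulary of size V has perplexity V, so a falling average loss during fine-tuning means the model is narrowing down its next-token choices.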

### Save and Load the Fine-Tuned Model

In [None]:
os.makedirs(OUTPUT_DIR, exist_ok=True)
gpt2_model.save_pretrained(OUTPUT_DIR)
gpt2_tokenizer.save_pretrained(OUTPUT_DIR)
print(f"Model saved to {OUTPUT_DIR}/")

In [None]:
ft_model = GPT2LMHeadModel.from_pretrained(OUTPUT_DIR).to(device)
ft_tokenizer = GPT2Tokenizer.from_pretrained(OUTPUT_DIR)
ft_model.eval()

generate(ft_model, ft_tokenizer, "The", max_new_tokens=100, temperature=0.7)