# Intro to Hugging Face Transformers

This notebook covers core transformer tasks using the [Hugging Face Transformers](https://huggingface.co/docs/transformers/) library:

1. **Text Generation** — Generate text with GPT-2
2. **Text Embeddings** — Extract BERT embeddings and measure semantic similarity
3. **Sentiment Analysis** — Classify text with a pretrained BERT pipeline
4. **Fine-Tuning** — Fine-tune GPT-2 on your own text data

## Setup

In [1]:
# !pip install transformers torch

In [None]:
import os
os.environ.setdefault("TRANSFORMERS_NO_TF", "1")

import torch
from transformers import pipeline, GPT2LMHeadModel, GPT2Tokenizer, BertModel, BertTokenizer, T5Tokenizer, T5ForConditionalGeneration
import numpy as np
import pandas as pd
from tqdm.notebook import tqdm
from IPython.display import display, Markdown
import json
import random

device = "cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu"
print(f"Using device: {device}")



Using device: mps


---
## Part 1: Text Generation with GPT-2

GPT-2 is an autoregressive language model that generates text by predicting the next token.

| Model | Parameters | HF Name |
|-------|-----------|----------|
| Small | 124M | `gpt2` |
| Medium | 355M | `gpt2-medium` |
| Large | 774M | `gpt2-large` |
| XL | 1.5B | `gpt2-xl` |
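
To make "predicting the next token" concrete, here is a minimal sketch of the autoregressive loop with a toy stand-in for the model. The `NEXT_TOKEN_PROBS` table and the `greedy_generate` helper are hypothetical and exist only for illustration. The toy conditions only on the last token, whereas GPT-2 conditions on the whole preceding context, but the predict, append, repeat loop is the same idea.

```python
# Toy next-token distributions keyed by the previous token (hypothetical values).
NEXT_TOKEN_PROBS = {
    "the": {"cat": 0.6, "mat": 0.4},
    "cat": {"sat": 0.9, "the": 0.1},
    "sat": {"on": 1.0},
    "on": {"the": 1.0},
    "mat": {"<eos>": 1.0},
}


def greedy_generate(prompt_tokens, max_new_tokens=5):
    """Repeatedly append the most likely next token (greedy decoding)."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = NEXT_TOKEN_PROBS.get(tokens[-1])
        if probs is None:
            break  # the toy table has no prediction for this token
        next_token = max(probs, key=probs.get)
        if next_token == "<eos>":
            break  # stop at the end-of-sequence marker
        tokens.append(next_token)
    return tokens


generated = greedy_generate(["the"], max_new_tokens=5)
print(" ".join(generated))  # the cat sat on the cat
```

Greedy decoding always picks the single most likely token, which is why the toy output loops back on itself. Sampling from the distribution instead is what introduces variety.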

In [3]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer

gpt2_tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
gpt2_model = GPT2LMHeadModel.from_pretrained("gpt2").to(device)

print(f"Loaded gpt2 ({gpt2_model.num_parameters():,} parameters)")

Loaded gpt2 (124,439,808 parameters)


In [4]:
def generate(model, tokenizer, prompt, max_new_tokens=100, temperature=0.7,
             top_k=50, top_p=0.9, num_samples=1):
    """Generate text from a prompt."""
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    outputs = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        temperature=temperature,
        top_k=top_k,
        top_p=top_p,
        do_sample=True,
        num_return_sequences=num_samples,
        pad_token_id=tokenizer.eos_token_id,
    )
    for i, output in enumerate(outputs):
        text = tokenizer.decode(output, skip_special_tokens=True)
        if num_samples > 1:
            print(f"--- Sample {i + 1} ---")
        print(text)
        print()

In [5]:
generate(gpt2_model, gpt2_tokenizer, "The secret of life is", max_new_tokens=50)

The secret of life is not a secret. It is the secret of the human race. The secret of life is the secret of the human race.

We are all humans.

We are all human beings.

We are all human beings.





### Generation Parameters

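- **temperature**: Scales the logits before sampling. Lower values (such as 0.3) concentrate probability on the likeliest tokens; 1.0 samples from the model's unmodified distribution, and higher values flatten it further
- **top_k**: Sample only from the k most likely next tokens
- **top_p**: Nucleus sampling; sample only from the smallest set of most likely tokens whose cumulative probability reaches p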
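
The effect of these three parameters can be sketched on a tiny vocabulary. This is a simplified illustration in plain Python, not the library's implementation: the `softmax` and `filter_top_k_top_p` helpers and the logits are hypothetical, and the real code works on tensors of logits.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def filter_top_k_top_p(probs, top_k, top_p):
    """Keep the top_k most likely tokens, then the smallest prefix of them whose
    renormalized cumulative probability reaches top_p, and renormalize again."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)[:top_k]
    mass = sum(p for _, p in ranked)
    kept, cumulative = [], 0.0
    for index, p in ranked:
        kept.append((index, p))
        cumulative += p / mass
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {index: p / total for index, p in kept}


logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical scores for a 4-token vocabulary
sharp = softmax(logits, temperature=0.3)
flat = softmax(logits, temperature=1.0)
candidates = filter_top_k_top_p(flat, top_k=3, top_p=0.8)

print([round(p, 3) for p in sharp])
print([round(p, 3) for p in flat])
print(candidates)
```

At temperature 0.3 almost all of the probability lands on the first token, while at 1.0 the alternatives stay in play. With `top_k=3` and `top_p=0.8`, only the two most likely tokens remain as sampling candidates.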

In [6]:
generate(gpt2_model, gpt2_tokenizer, "Once upon a time",
         max_new_tokens=80, temperature=0.9, num_samples=3)

--- Sample 1 ---
Once upon a time, people who thought they were smarter were actually the ones who got the most bang for their buck with the rest of us.

I have no idea if it's because they were in a hurry to do the right thing or because they thought they'd get the job done. Either way, I feel like the people who've had to work so hard for so long aren't paying it any attention

--- Sample 2 ---
Once upon a time the game was all about making a run at being the most powerful character in the game and we didn't have an easy time in the beginning. In the end, we found a way to let players do whatever they wanted, regardless of how big a role they were playing.

We did a lot of early testing with characters such as the Night Lords, Shadow Lords, and the Guardian.


--- Sample 3 ---
Once upon a time a lot of people were saying, "You can't go to school," but I got to talk to some of the most wonderful people in the world. I can't imagine anything that was more rewarding, especially to see t

---
## Part 2: Text Embeddings with BERT

BERT produces contextual embeddings — vector representations where meaning depends on surrounding context. These embeddings power similarity search, clustering, and classification.

We use the `[CLS]` token's embedding as a representation of the entire input sequence.

In [7]:
from transformers import BertModel, BertTokenizer

bert_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert_model = BertModel.from_pretrained("bert-base-uncased").to(device)

print(f"Loaded bert-base-uncased ({bert_model.num_parameters():,} parameters)")

Loaded bert-base-uncased (109,482,240 parameters)


In [8]:
def get_embeddings(texts, tokenizer, model):
    """Get [CLS] token embeddings for a list of texts."""
    inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True, max_length=512).to(device)
    with torch.no_grad():
        outputs = model(**inputs)
    # [CLS] token is at position 0
    return outputs.last_hidden_state[:, 0, :]

In [9]:
sentences = [
    "The cat sat on the mat.",
    "A kitten rested on the rug.",
    "Stock prices rose sharply today.",
    "The financial markets surged.",
]

embeddings = get_embeddings(sentences, bert_tokenizer, bert_model)
print(f"Embedding shape: {embeddings.shape}")
print(f"(batch_size={embeddings.shape[0]}, hidden_size={embeddings.shape[1]})")

Embedding shape: torch.Size([4, 768])
(batch_size=4, hidden_size=768)


### Semantic Similarity

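Cosine similarity measures the angle between two embedding vectors: 1.0 means they point in the same direction, and lower values mean they are less alike. Raw BERT `[CLS]` embeddings were not trained specifically for sentence similarity, so the scores tend to cluster in a narrow, high range (0.77 to 1.0 in the output below). Compare the scores against each other rather than reading them as absolute measures.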
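
As a worked example of the formula, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below uses plain Python and made-up 3-dimensional vectors in place of the 768-dimensional embeddings; the `cosine` helper is hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity: dot(u, v) / (||u|| * ||v||)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical vectors standing in for sentence embeddings.
a = [1.0, 2.0, 0.0]
b = [2.0, 4.0, 0.0]  # same direction as a, twice the length
c = [0.0, 0.0, 3.0]  # orthogonal to a

print(round(cosine(a, b), 6))  # same direction, so the similarity is 1.0
print(cosine(a, c))            # orthogonal, so the similarity is 0.0
```

Because the lengths are divided out, only direction matters: `b` is twice as long as `a` but still scores 1.0.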

In [10]:
from torch.nn.functional import cosine_similarity
import pandas as pd

# Pairwise cosine similarity via broadcasting: (n, 1, d) against (1, n, d) -> (n, n)
sim_matrix = cosine_similarity(embeddings.unsqueeze(1), embeddings.unsqueeze(0), dim=-1).cpu()

labels = [s[:30] + "..." if len(s) > 30 else s for s in sentences]
sim_df = pd.DataFrame(sim_matrix.numpy(), index=labels, columns=labels)
sim_df.style.background_gradient(cmap="YlOrRd", vmin=0.8, vmax=1.0).format("{:.3f}")

| | The cat sat on the mat. | A kitten rested on the rug. | Stock prices rose sharply toda... | The financial markets surged. |
|---|---|---|---|---|
| The cat sat on the mat. | 1.000 | 0.894 | 0.838 | 0.820 |
| A kitten rested on the rug. | 0.894 | 1.000 | 0.773 | 0.797 |
| Stock prices rose sharply toda... | 0.838 | 0.773 | 1.000 | 0.864 |
| The financial markets surged. | 0.820 | 0.797 | 0.864 | 1.000 |


---
## Part 3: Sentiment Analysis

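Hugging Face's `pipeline` provides a high-level API for common tasks. When no model is specified, the sentiment analysis pipeline defaults to a DistilBERT model fine-tuned on SST-2, a dataset of movie-review sentences labeled positive or negative.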
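
Conceptually, a text-classification pipeline tokenizes the input, runs the model, and converts the output logits into a label and a score. Below is a minimal sketch of that last step, assuming a two-class, single-label model where the score is the softmax probability of the winning class. The logits are made up for illustration, and the label names follow the output format used in this notebook.

```python
import math

# Hypothetical raw scores (logits) from a two-class sentiment classifier.
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
logits = [-2.1, 3.4]

# Softmax turns logits into probabilities that sum to 1.
peak = max(logits)
exps = [math.exp(x - peak) for x in logits]
probs = [e / sum(exps) for e in exps]

# Report the most probable class and its probability.
best = max(range(len(probs)), key=probs.__getitem__)
result = {"label": id2label[best], "score": probs[best]}
print(f"{result['label']:8} ({result['score']:.3f})")
```

A model with only two labels has no way to say "neutral", so lukewarm text still receives one of the two labels, often with a high score.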

In [11]:
sentiment = pipeline("sentiment-analysis", device=device)

reviews = [
    "This movie was absolutely wonderful! The acting was superb.",
    "Terrible film. I walked out after 30 minutes.",
    "It was okay, nothing special but not bad either.",
    "A masterpiece of modern cinema. Truly breathtaking.",
    "The plot made no sense and the dialogue was awful.",
]

results = sentiment(reviews)

for review, result in zip(reviews, results):
    print(f"{result['label']:8} ({result['score']:.3f})  {review}")

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


Device set to use mps


POSITIVE (1.000)  This movie was absolutely wonderful! The acting was superb.
NEGATIVE (1.000)  Terrible film. I walked out after 30 minutes.
POSITIVE (0.989)  It was okay, nothing special but not bad either.
POSITIVE (1.000)  A masterpiece of modern cinema. Truly breathtaking.
NEGATIVE (1.000)  The plot made no sense and the dialogue was awful.


### Other Pipelines

Hugging Face provides pipelines for many tasks. Here are a few:

In [12]:
# Named Entity Recognition
ner = pipeline("ner", aggregation_strategy="simple", device=device)
entities = ner("Barack Obama graduated from Harvard Law School and served as President of the United States.")

for ent in entities:
    print(f"{ent['entity_group']:10} {ent['word']:20} (score: {ent['score']:.3f})")

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision 4c53496 (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Device set to use mps


PER        Barack Obama         (score: 0.999)
ORG        Harvard Law School   (score: 0.876)
LOC        United States        (score: 0.999)


In [13]:
# Zero-shot classification — classify text without training
classifier = pipeline("zero-shot-classification", device=device)

result = classifier(
    "The new iPhone features a faster processor and improved camera system.",
    candidate_labels=["technology", "politics", "sports", "science"]
)

for label, score in zip(result["labels"], result["scores"]):
    print(f"{label:15} {score:.3f}")

No model was supplied, defaulted to facebook/bart-large-mnli and revision d7645e1 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.


Device set to use mps


technology      0.973
science         0.014
sports          0.009
politics        0.004


---
## Advanced Zero-Shot Prompting (FLAN-T5)
The following section adds a deeper zero-shot evaluation workflow using an encoder-decoder instruction-tuned model (FLAN-T5).
It includes: data download, prompt templates, loss-by-choice evaluation, and dataset-level accuracy measurement.
Run these cells if you want to reproduce the experiments from the AI-for-Humanists tutorial. (GPU recommended for larger models.)
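
Loss-by-choice evaluation scores every candidate answer by how "surprised" the model is to produce it after the prompt, then picks the least surprising one. Here is a minimal sketch of that decision rule in plain Python. The per-token probabilities are invented for illustration, and `mean_nll` is a hypothetical helper; in the notebook the loss comes from the model itself.

```python
import math

# Hypothetical per-token probabilities the model assigns to each candidate
# answer given the same prompt. Real values would come from the model's logits.
token_probs = {
    "poetry": [0.40, 0.70],
    "history/biography": [0.20, 0.50, 0.60, 0.30],
}


def mean_nll(probs):
    """Average negative log-likelihood per token (the cross-entropy loss)."""
    return -sum(math.log(p) for p in probs) / len(probs)


losses = {choice: mean_nll(probs) for choice, probs in token_probs.items()}
prediction = min(losses, key=losses.get)  # lowest loss wins

print({choice: round(loss, 3) for choice, loss in losses.items()})
print("Predicted:", prediction)
```

Averaging over tokens keeps a longer answer such as `history/biography` from being penalized merely for having more tokens than `poetry`.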

In [None]:
# Install helper packages if needed (uncomment when running in a fresh environment)
# %%capture
# !pip install gdown sentencepiece transformers tqdm

In [None]:
# Imports for the advanced zero-shot workflow (only what's not already imported above)
import gdown
# json, random, torch, numpy, pandas, tqdm, display and Markdown are already imported in the top imports cell

In [None]:
# Load FLAN-T5 (instruction-tuned encoder-decoder)
model_id = "google/flan-t5-large"
print('Loading', model_id, 'onto', device)
model = T5ForConditionalGeneration.from_pretrained(model_id).to(device)
tokenizer = T5Tokenizer.from_pretrained(model_id)

### Dataset: Goodreads reviews (example)
The cells below download prepared JSON files used in the AI-for-Humanists examples. If you prefer, upload your own JSON/CSV and adapt the loading code.

In [None]:
# Download example Goodreads subset used in the tutorial (if needed)
texts_url = 'https://drive.google.com/uc?id=1qEZ3k9fZa_KITSQtFlhHq7zImUbBYCY2'
labels_url = 'https://drive.google.com/uc?id=1d-6abYwcwKfbYYdVH7mys7IeF4j9BMXP'
texts_filename = 'book_review_texts.json'
labels_filename = 'book_review_labels.json'
try:
    gdown.download(texts_url, texts_filename, quiet=True)
    gdown.download(labels_url, labels_filename, quiet=True)
    with open(texts_filename, 'r') as f:
        all_texts = json.load(f)
    with open(labels_filename, 'r') as f:
        all_labels = json.load(f)
    print('Downloaded and loaded Goodreads example files')
except Exception as e:
    print('Could not download Goodreads example files; please upload JSONs or set paths.', e)
    all_texts = []
    all_labels = []

In [None]:
# Helper: balanced subsample of two classes
def subsample_two_classes(all_texts, all_labels, label_1, label_2, n):
    """Return up to n // 2 examples per class (the first ones found), in shuffled order."""
    all_texts = np.array(all_texts)
    all_labels = np.array(all_labels)
    idxs_label_1 = np.where(all_labels == label_1)[0].tolist()
    idxs_label_2 = np.where(all_labels == label_2)[0].tolist()
    n_each_class = n // 2
    # Take the first n_each_class examples of each class, then shuffle their order
    subset_idxs = idxs_label_1[:n_each_class] + idxs_label_2[:n_each_class]
    random.shuffle(subset_idxs)
    subset_texts = list(all_texts[subset_idxs])
    subset_labels = list(all_labels[subset_idxs])
    return subset_texts, subset_labels

In [None]:
# Prompt templates (examples used in the tutorial)
def apply_prompt_1(text, possible_choices):
  return f'Which genre of book is the following review about?\nReview: {text}\nChoices: {possible_choices[0]} or {possible_choices[1]}\nAnswer:'

def apply_prompt_2(text, possible_choices):
  return f'Review: {text}\nChoices: {possible_choices[0]} or {possible_choices[1]}\nGenre:'

def apply_prompt_3(text, possible_choices):
  return f'Review: {text}\nGenre:'

def apply_prompt_4(text, possible_choices):
  return f'\nReview: {text}\nWhich genre of book is the review about?'

def apply_prompt_5(text, possible_choices):
  return f'Review: {text}\nChoices: {possible_choices[0]} or {possible_choices[1]}\nAnswer:'
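
To see what the model actually receives, it helps to render one template on a sample input. The sketch below repeats the wording of prompt 1 under a hypothetical name, `render_prompt`, so that it runs on its own; the review text is invented for illustration.

```python
def render_prompt(text, possible_choices):
    """Same wording as prompt 1, repeated here so the example is self-contained."""
    return (
        "Which genre of book is the following review about?\n"
        f"Review: {text}\n"
        f"Choices: {possible_choices[0]} or {possible_choices[1]}\n"
        "Answer:"
    )


# Hypothetical review; the choices match the label names used in the evaluation cell.
example_review = "A sweeping account of the Roman Republic's final decades."
choices = ["history/biography", "poetry"]
prompt = render_prompt(example_review, choices)
print(prompt)
```

The prompt ends with `Answer:` so that the candidate label is the natural continuation, which is what the loss-by-choice scoring measures.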

In [None]:
# Loss-by-choice classification utilities
def classify_example(text, label, possible_choices, verbose=False):
    """Score each candidate answer by its loss given the prompt.

    Returns True if the lowest-loss choice equals label.
    """
    # text is expected to already have a prompt template applied
    inputs = tokenizer(text, return_tensors='pt', truncation=True).to(device)
    input_ids = inputs.input_ids
    losses_and_targets = []
    for target_pretokenized in possible_choices:
        target = tokenizer(target_pretokenized, return_tensors='pt', truncation=True).to(device)
        target_ids = target.input_ids
        with torch.no_grad():
            outputs = model(input_ids=input_ids, labels=target_ids)
        loss = outputs.loss.item()  # mean cross-entropy over the target tokens
        losses_and_targets.append((loss, target_pretokenized))
    # The choice with the lowest loss is the prediction
    _, best_choice = min(losses_and_targets)
    if verbose:
        print(f'losses={losses_and_targets} predicted={best_choice!r} label={label!r}')
    return best_choice == label

def classify_dataset(prompted_examples, labels, possible_choices, verbose=False):
    num_examples = len(prompted_examples)
    correct = 0
    for i in tqdm(range(num_examples)):
        prompted_example = prompted_examples[i]
        label = labels[i]
        is_correct = classify_example(prompted_example, label, possible_choices, verbose=(i<5 and verbose))
        correct += int(is_correct)
    return correct / num_examples

In [None]:
# Example: run prompt comparisons on the Goodreads subset (if loaded)
if all_texts and all_labels:
    task_texts, task_labels = subsample_two_classes(all_texts, all_labels, 'history_biography', 'poetry', n=100)
    original_label_to_new_name = {'history_biography': 'history/biography', 'poetry': 'poetry'}
    possible_choices = list(original_label_to_new_name.values())
    task_labels = [original_label_to_new_name[l] for l in task_labels]
    task_texts_prompt_1 = [apply_prompt_1(t, possible_choices) for t in task_texts]
    display(Markdown('**Prompt 1:**'))
    accuracy = classify_dataset(task_texts_prompt_1, task_labels, possible_choices, verbose=True)
    display(Markdown(f'**Prompt 1 accuracy: {accuracy*100:.2f}%**'))
else:
    print('Goodreads example not loaded; skip evaluation cells.')

---
## Part 4: Fine-Tune GPT-2 on Custom Text

Fine-tuning adapts the pretrained model to generate text in the style of your dataset.

Set `TRAIN_FILE` to the path of a `.txt` file you want to train on.

In [14]:
import os

TRAIN_FILE = "train.txt"  # path to your text file
OUTPUT_DIR = "gpt2-finetuned"
EPOCHS = 3
BATCH_SIZE = 2
BLOCK_SIZE = 128  # sequence length for training chunks
LEARNING_RATE = 5e-5
RUN_FINETUNE = os.path.exists(TRAIN_FILE)

if not RUN_FINETUNE:
    print(f"Skipping fine-tuning; {TRAIN_FILE} not found.")

Skipping fine-tuning; train.txt not found.


In [15]:
from torch.utils.data import Dataset, DataLoader
import os


class TextDataset(Dataset):
    """Tokenize a text file and split it into fixed-length chunks for training."""

    def __init__(self, file_path, tokenizer, block_size):
        with open(file_path, "r", encoding="utf-8") as f:
            text = f.read()

        tokens = tokenizer.encode(text)
        # Non-overlapping blocks; the + 1 keeps a final block that ends exactly at the last token
        self.examples = [
            torch.tensor(tokens[i : i + block_size])
            for i in range(0, len(tokens) - block_size + 1, block_size)
        ]
        print(f"Loaded {len(tokens):,} tokens -> {len(self.examples)} training chunks")

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, idx):
        return self.examples[idx]
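
Fixed-length chunking is easier to see on a toy list than on real token IDs. The sketch below uses a hypothetical `chunk_tokens` helper on plain Python lists. It keeps every full block, including one that ends exactly at the last token, and drops any leftover tokens that do not fill a block.

```python
def chunk_tokens(tokens, block_size):
    """Split a token list into consecutive, non-overlapping blocks of block_size.

    Leftover tokens that do not fill a whole block are dropped.
    """
    return [
        tokens[i : i + block_size]
        for i in range(0, len(tokens) - block_size + 1, block_size)
    ]


toy_tokens = list(range(10))  # stand-in for token IDs
chunks = chunk_tokens(toy_tokens, block_size=4)
print(chunks)  # [[0, 1, 2, 3], [4, 5, 6, 7]]
```

Tokens 8 and 9 are discarded because they cannot fill a third block. Equal-length chunks let the data loader stack examples into a batch without padding.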

In [16]:
if RUN_FINETUNE:
    train_dataset = TextDataset(TRAIN_FILE, gpt2_tokenizer, BLOCK_SIZE)
    train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
else:
    train_loader = None

In [17]:
from torch.optim import AdamW

if RUN_FINETUNE:
    optimizer = AdamW(gpt2_model.parameters(), lr=LEARNING_RATE)
    gpt2_model.train()

    for epoch in range(EPOCHS):
        total_loss = 0
        for step, batch in enumerate(train_loader):
            batch = batch.to(device)
            outputs = gpt2_model(batch, labels=batch)
            loss = outputs.loss

            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
            total_loss += loss.item()

            if (step + 1) % 50 == 0:
                print(f"  Epoch {epoch + 1}, Step {step + 1}, Loss: {loss.item():.4f}")

        avg_loss = total_loss / len(train_loader)
        print(f"Epoch {epoch + 1}/{EPOCHS} complete. Avg loss: {avg_loss:.4f}")
else:
    print("Skipping fine-tuning loop.")

Skipping fine-tuning loop.
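
The training loop passes the same tensor as both the input and the `labels`. For causal language models such as `GPT2LMHeadModel`, the library shifts the labels internally, so each position is scored on predicting the token that follows it. The sketch below shows the resulting (context, target) pairs; the token IDs are made up for illustration.

```python
token_ids = [464, 3797, 3332, 319, 262, 2603]  # hypothetical token IDs for one chunk

# Position t sees tokens 0..t and is trained to predict the token at t + 1.
pairs = [(token_ids[: t + 1], token_ids[t + 1]) for t in range(len(token_ids) - 1)]

for context, target in pairs:
    print(f"context={context} -> target={target}")
```

A chunk of N tokens therefore yields N - 1 training targets, and the reported loss is the average over them.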


### Save and Load the Fine-Tuned Model

In [18]:
if RUN_FINETUNE:
    os.makedirs(OUTPUT_DIR, exist_ok=True)
    gpt2_model.save_pretrained(OUTPUT_DIR)
    gpt2_tokenizer.save_pretrained(OUTPUT_DIR)
    print(f"Model saved to {OUTPUT_DIR}/")
else:
    print("Skipping save; no fine-tuned model.")

Skipping save; no fine-tuned model.


In [19]:
if RUN_FINETUNE:
    ft_model = GPT2LMHeadModel.from_pretrained(OUTPUT_DIR).to(device)
    ft_tokenizer = GPT2Tokenizer.from_pretrained(OUTPUT_DIR)
    ft_model.eval()

    generate(ft_model, ft_tokenizer, "Once upon a time", max_new_tokens=80, temperature=0.9)
else:
    print("Skipping fine-tuned generation; no model available.")

Skipping fine-tuned generation; no model available.


In [None]:
# Example: Using zai-org/GLM-OCR for OCR with Hugging Face
from transformers import pipeline

# Load the OCR pipeline with the specified model
ocr = pipeline("image-to-text", model="zai-org/GLM-OCR")

# Example usage: replace 'your_image_path.jpg' with your image file
image_path = "your_image_path.jpg"
result = ocr(image_path)
print(result)

# The result will be a list of dicts with recognized text
# Example output: [{'generated_text': 'Recognized text here'}]