<a href="https://colab.research.google.com/github/nicolaiberk/llm_ws/blob/main/notebooks/05a_prompting.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Annotation with Generative Models

Today, we are going to see how to generate text and annotations with generative LLMs.

> ❗ ACTIVATE THE GPU BY SELECTING RUNTIME IN THE UPPER RIGHT > CONNECT TO RUNTIME > T4 GPU

In [None]:
!pip install transformers accelerate setfit "pydantic-ai-slim[huggingface]"

> ❗ RESTART THE NOTEBOOK (DROPDOWN NEXT TO RUN ALL > RESTART SESSION)

In [None]:
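import os

# `!export` runs in a throwaway subshell, so set the variable in the Python process instead
os.environ["HF_TOKEN"] = "insert your hf_token"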

## Generating text with generative models

We will start by generating some text with SmolLM2, a family of small generative models developed by Hugging Face.

### Simple Inference

The all-powerful `pipeline` is again the simplest way to get inference running quickly:

In [None]:
from transformers import pipeline

pipe = pipeline("text-generation", model="HuggingFaceTB/SmolLM2-360M-Instruct")
messages = [
    "Let me tell you a story. Once upon a time,"
]
pipe(messages)

### Chat

Many applications of LLMs require a chat template. We can use the tokenizer to enforce this template. The template simply indicates which parts of the text come from the user and which come (or should come) from the assistant.

Remember: LLM chats are just roleplay with special tokens!

In [None]:
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-360M-Instruct", padding_side='left')
model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM2-360M-Instruct")

In [None]:
messages = [
    {"role": "user", "content": "Who are you?"},
]
tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=False)
tokenized_chat

In this context, it is helpful to add the generation prompt indicating that the text should be generated in the role of the assistant. Otherwise, the model might generate more text as the user instead ([more](https://huggingface.co/docs/transformers/en/chat_templating?template=Mistral#addgenerationprompt)).

In [None]:
tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

In [None]:
tokenized_chat
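
To make the "roleplay with special tokens" idea concrete, here is a minimal sketch of what a chat template does. It assumes a ChatML-style layout with `<|im_start|>` and `<|im_end|>` markers; the real template ships with the tokenizer and may differ (for example by adding a default system prompt). `render_chatml` and `demo_messages` are hypothetical names used only for illustration.

```python
def render_chatml(messages, add_generation_prompt=False):
    """Render a list of {"role", "content"} dicts as a ChatML-style string."""
    text = ""
    for message in messages:
        # Each turn is wrapped in start/end markers that name the speaker
        text += f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
    if add_generation_prompt:
        # Open an assistant turn so the model continues in the assistant role
        text += "<|im_start|>assistant\n"
    return text


demo_messages = [{"role": "user", "content": "Who are you?"}]
print(render_chatml(demo_messages))
print(render_chatml(demo_messages, add_generation_prompt=True))
```

The second string ends with an open assistant turn, which is the only difference `add_generation_prompt=True` makes in this sketch.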

In [None]:
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_dict=True, # retains attention mask
    return_tensors="pt", # returns tensors
).to(model.device) # more efficient to put on device

In [None]:
inputs

In [None]:
outputs = model.generate(**inputs, max_new_tokens=30) # note the max_new_tokens parameter
print(tokenizer.decode(outputs[0])) # note that the entire conversation is returned, including the system prompt.

### Zero-shot prompting

In order to get proper annotations from our model, we can simply ask the model to generate the relevant outputs. This is as simple as writing a prompt. Remember the best practices we discussed earlier today.

In [None]:
messages = [
    [{"role": "user", "content": """You are an expert annotator with years of experience annotating social science data.
    Your main task is to annotate whether the following text is about AI or not.
    Respond ONLY with the label "AI" or "NOT AI". Do NOT provide an explanation

    Text: "SmolLM is a pretty impressive model!"
    """}
    ],
    [{"role": "user", "content": """You are an expert annotator with years of experience annotating social science data.
    Your main task is to annotate whether the following text is about AI or not.
    Respond ONLY with the label "AI" or "NOT AI". Do NOT provide an explanation

    Text: "The weather is horrible today!"
    """}]
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    padding=True,
    return_dict=True, # retains attention mask
    return_tensors="pt", # returns tensors
).to(model.device) # more efficient to put on device

In [None]:
outputs = model.generate(**inputs, max_new_tokens=3)
print(tokenizer.decode(outputs[0][inputs['input_ids'].shape[1]:]))
print(tokenizer.decode(outputs[1][inputs['input_ids'].shape[1]:]))
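
Even with a strict prompt, the decoded text can carry special tokens, extra whitespace, or different casing. The sketch below shows one way to normalise raw output into the two labels used above; `parse_label` is a hypothetical helper, not part of `transformers`.

```python
import re


def parse_label(raw_output):
    """Map raw generated text to "AI", "NOT AI", or None if neither is found."""
    # Drop special tokens such as <|im_end|>, then normalise case and whitespace
    cleaned = re.sub(r"<\|.*?\|>", " ", raw_output)
    cleaned = " ".join(cleaned.upper().split())
    # Check the longer label first, because "NOT AI" contains "AI"
    if re.search(r"\bNOT AI\b", cleaned):
        return "NOT AI"
    if re.search(r"\bAI\b", cleaned):
        return "AI"
    return None


print(parse_label("AI<|im_end|>"))
print(parse_label(" not ai\n"))
print(parse_label("I cannot tell"))
```

Returning `None` for unparseable output makes failures easy to count, instead of silently assigning a label.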

### Few-shot

In [None]:
messages = [
    [{"role": "user", "content": """You are an expert annotator with years of experience annotating social science data.
    Your main task is to annotate whether the following text is about artificial intelligence or not.
    Respond ONLY with the label "AI" or "NOT AI". Do NOT provide an explanation

    Example:
    "SmolLM is a pretty impressive model!"
    """},
     {"role": "assistant", "content": "AI"},
     {"role": "user", "content": """
    Example:
    "The weather is horrible today!"
    """},
     {"role": "assistant", "content": "NOT AI"},
     {"role": "user", "content": """
     Text: "The impact of the new wave of automation on the labour market is not yet clear."
     """}
    ]]

In [None]:
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    padding=True,
    return_dict=True, # retains attention mask
    return_tensors="pt", # returns tensors
).to(model.device) # more efficient to put on device

In [None]:
outputs = model.generate(**inputs, max_new_tokens=3)
print(tokenizer.decode(outputs[0][inputs['input_ids'].shape[1]:]))
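
Writing the few-shot conversation by hand gets tedious once there are more than a couple of examples. As a rough sketch, the same user/assistant structure can be built from a list of labelled examples; `build_few_shot_messages` is a hypothetical helper and the example texts are the ones used above.

```python
def build_few_shot_messages(instruction, examples, text):
    """Build a chat where each example is a user turn answered by the assistant."""
    messages = []
    for i, (example_text, label) in enumerate(examples):
        # The instruction goes into the first user turn only
        prefix = instruction + "\n\n" if i == 0 else ""
        messages.append({"role": "user", "content": f'{prefix}Example:\n"{example_text}"'})
        messages.append({"role": "assistant", "content": label})
    # The final user turn holds the text we actually want annotated
    prefix = instruction + "\n\n" if not examples else ""
    messages.append({"role": "user", "content": f'{prefix}Text: "{text}"'})
    return messages


few_shot_chat = build_few_shot_messages(
    'Respond ONLY with the label "AI" or "NOT AI".',
    [("SmolLM is a pretty impressive model!", "AI"),
     ("The weather is horrible today!", "NOT AI")],
    "The impact of the new wave of automation on the labour market is not yet clear.",
)
for turn in few_shot_chat:
    print(turn["role"], "->", turn["content"][:40])
```

The result is a single conversation; wrap it in a list before passing it to `apply_chat_template` if you batch several conversations, as in the cell above.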

### Dynamic Few-shot

### BONUS: Setfit

## Controlling model output with `pydantic`

In [None]:
from pydantic import BaseModel
from pydantic_ai import Agent


class CityLocation(BaseModel):
    city: str
    country: str


# Use the Hugging Face provider installed above (it reads HF_TOKEN);
# output_type makes the agent return a validated CityLocation.
agent = Agent('huggingface:Qwen/Qwen3-235B-A22B', output_type=CityLocation)
# In a notebook with an already-running event loop, `await agent.run(...)` may be needed instead.
result = agent.run_sync('Where were the Olympics held in 2012?')
print(result.output)
#> city='London' country='United Kingdom'
print(result.usage())
#> e.g. RunUsage(input_tokens=57, output_tokens=8, requests=1); exact counts vary by model

In [None]:
from pydantic import BaseModel, Field

# Define what we want to extract
class Sentiment(BaseModel):
    """Simple sentiment analysis"""
    sentiment: str = Field(description="Is this POSITIVE, NEGATIVE, or NEUTRAL?")
    confidence: float = Field(description="How confident are you? (0.0 to 1.0)")

# Test texts to analyze
test_texts = [
    "I absolutely love this new policy! It will help so many families.",
    "This decision is terrible and will hurt our community.",
    "The committee met yesterday to discuss the budget proposal."
]

print("Analyzing sentiments:")
print("-" * 30)

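for text in test_texts:
    # Ask the model for a JSON object that matches the Sentiment schema
    inputs = tokenizer.apply_chat_template(
        [
            {
                "role": "system",
                "content": (
                    "Classify sentiment as POSITIVE, NEGATIVE, or NEUTRAL. "
                    'Respond ONLY with JSON like {"sentiment": "POSITIVE", "confidence": 0.9}.'
                )
            },
            {
                "role": "user",
                "content": text
            }
        ],
        add_generation_prompt=True,
        padding=True,
        return_dict=True, # retains attention mask
        return_tensors="pt", # returns tensors
    ).to(model.device)

    # A JSON object needs more than a handful of tokens
    response = model.generate(**inputs, max_new_tokens=40)

    # Decode only the newly generated tokens, without special tokens such as <|im_end|>
    generated = tokenizer.decode(
        response[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True
    )

    # Validate the result; a small model will not always return valid JSON
    try:
        result = Sentiment.model_validate_json(generated)
    except ValueError:  # pydantic's ValidationError is a ValueError
        print(f"\nText: '{text}'")
        print(f"→ Could not parse model output: {generated!r}")
        continue

    # Print results
    print(f"\nText: '{text}'")
    print(f"→ Sentiment: {result.sentiment}")
    print(f"→ Confidence: {result.confidence:.2f}")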
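
Small models often wrap JSON in extra text ("Sure! Here is the result: ..."), which makes strict JSON validation fail. One possible workaround, sketched here with only the standard library, is to pull the first parseable JSON object out of the raw output before validating it. `extract_first_json_object` is a hypothetical helper.

```python
import json


def extract_first_json_object(raw_output):
    """Return the first parseable JSON object in raw_output as a dict, else None."""
    decoder = json.JSONDecoder()
    start = raw_output.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and ignores trailing text
            obj, _ = decoder.raw_decode(raw_output, start)
            if isinstance(obj, dict):
                return obj
        except json.JSONDecodeError:
            pass
        # Try the next opening brace
        start = raw_output.find("{", start + 1)
    return None


raw = 'Sure! {"sentiment": "POSITIVE", "confidence": 0.9} Hope this helps.'
print(extract_first_json_object(raw))
```

The returned dict can then be checked against the schema, for example with `Sentiment.model_validate(...)`, while a `None` result signals that the generation should be retried or discarded.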

## Training Encoders with Synthetic Annotations

In [None]:
# NLP Workshop: LLM Inference, Text Similarity, and Model Training
# A hands-on introduction to modern NLP techniques with Hugging Face

"""
Workshop Outline:
1. LLM Inference with Zero-shot and Few-shot Prompting
2. Text Similarity using Transformer Embeddings
3. Dynamic Few-shot Prompting
4. Training BERT on Synthetic LLM Labels

Prerequisites: Basic Python knowledge, familiarity with transformer concepts
"""

# ============================================================================
# SETUP AND IMPORTS
# ============================================================================

# Install required packages (run in terminal or uncomment below)
# !pip install transformers torch sentence-transformers datasets scikit-learn

import torch
import numpy as np
import pandas as pd
from transformers import (
    AutoTokenizer, AutoModelForCausalLM, AutoModel,
    AutoModelForSequenceClassification, TrainingArguments, Trainer
)
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import accuracy_score, classification_report
from datasets import Dataset
import random

print("Setup complete! 🚀")

# ============================================================================
# PART 1: LLM INFERENCE WITH ZERO-SHOT AND FEW-SHOT PROMPTING
# ============================================================================

print("\n" + "="*60)
print("PART 1: LLM INFERENCE - ZERO-SHOT AND FEW-SHOT PROMPTING")
print("="*60)

# Load a smaller language model for demonstration
model_name = "HuggingFaceTB/SmolLM2-360M-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Add padding token if it doesn't exist
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

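def generate_text(prompt, max_length=100, temperature=0.7):
    """Generate text using the loaded model.

    `max_length` is applied as the number of NEW tokens, so prompts longer
    than `max_length` tokens (such as the few-shot prompt) still work.
    """
    # Calling the tokenizer returns both input_ids and the attention mask
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_length,
            temperature=temperature,
            pad_token_id=tokenizer.eos_token_id,
            do_sample=True
        )

    # The decoded text contains the prompt followed by the continuation
    return tokenizer.decode(outputs[0], skip_special_tokens=True)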

# ZERO-SHOT PROMPTING EXAMPLE
print("\n📝 Zero-shot Prompting Example:")
zero_shot_prompt = "Classify the sentiment of this text as positive, negative, or neutral: 'I love this product!' Answer:"

result = generate_text(zero_shot_prompt, max_length=50)
print(f"Prompt: {zero_shot_prompt}")
print(f"Result: {result}")

# STUDENT INTERACTION 1
print("\n🤔 STUDENT EXERCISE 1:")
print("Try creating your own zero-shot prompt for a different task (e.g., topic classification, question answering)")
print("Modify the 'your_prompt' variable below and run the cell!")

# TODO: Students fill this in
your_prompt = "Classify this email as spam or not spam: 'Get rich quick! Click here now!' Answer:"
your_result = generate_text(your_prompt, max_length=50)
print(f"Your result: {your_result}")

# FEW-SHOT PROMPTING EXAMPLE
print("\n📝 Few-shot Prompting Example:")
few_shot_prompt = """
Classify sentiment as positive, negative, or neutral:

Text: "This movie was amazing!"
Sentiment: positive

Text: "I hated every minute of it."
Sentiment: negative

Text: "It was okay, nothing special."
Sentiment: neutral

Text: "Best purchase ever!"
Sentiment:"""

result = generate_text(few_shot_prompt, max_length=60)
print(f"Few-shot result: {result}")

# STUDENT INTERACTION 2
print("\n🤔 STUDENT EXERCISE 2:")
print("Compare zero-shot vs few-shot results. Which performs better? Why?")
print("Try adding more examples to the few-shot prompt and observe the difference.")

# ============================================================================
# PART 2: TEXT SIMILARITY USING TRANSFORMER EMBEDDINGS
# ============================================================================

print("\n" + "="*60)
print("PART 2: TEXT SIMILARITY WITH TRANSFORMER EMBEDDINGS")
print("="*60)

# Load sentence transformer model
sentence_model = SentenceTransformer('all-MiniLM-L6-v2')

def calculate_similarity(text1, text2):
    """Calculate cosine similarity between two texts"""
    embeddings = sentence_model.encode([text1, text2])
    similarity = cosine_similarity([embeddings[0]], [embeddings[1]])[0][0]
    return similarity

# Example texts for similarity comparison
texts = [
    "The cat sat on the mat.",
    "A feline rested on the rug.",
    "Dogs are great pets.",
    "I love pizza and pasta.",
    "Italian food is delicious."
]

print("\n📊 Text Similarity Matrix:")
print("Comparing different text pairs:")

for i in range(len(texts)):
    for j in range(i+1, len(texts)):
        similarity = calculate_similarity(texts[i], texts[j])
        print(f"'{texts[i][:30]}...' vs '{texts[j][:30]}...': {similarity:.3f}")

# STUDENT INTERACTION 3
print("\n🤔 STUDENT EXERCISE 3:")
print("Add your own texts to the list and see how they compare!")
print("Which pairs have the highest/lowest similarity? Does it make sense?")

# TODO: Students add their texts here
student_texts = [
    "Your text 1 here",
    "Your text 2 here",
    # Add more texts...
]

# Semantic search example
def semantic_search(query, documents, top_k=3):
    """Find most similar documents to a query"""
    query_embedding = sentence_model.encode([query])
    doc_embeddings = sentence_model.encode(documents)

    similarities = cosine_similarity(query_embedding, doc_embeddings)[0]
    top_indices = np.argsort(similarities)[::-1][:top_k]

    results = []
    for idx in top_indices:
        results.append({
            'document': documents[idx],
            'similarity': similarities[idx]
        })
    return results

# Example semantic search
documents = [
    "Machine learning is a subset of artificial intelligence.",
    "Deep learning uses neural networks with multiple layers.",
    "Natural language processing helps computers understand text.",
    "Computer vision enables machines to interpret visual information.",
    "Reinforcement learning trains agents through trial and error."
]

query = "How do computers understand language?"
search_results = semantic_search(query, documents)

print(f"\n🔍 Semantic Search Results for: '{query}'")
for i, result in enumerate(search_results, 1):
    print(f"{i}. (Score: {result['similarity']:.3f}) {result['document']}")

# ============================================================================
# PART 3: DYNAMIC FEW-SHOT PROMPTING
# ============================================================================

print("\n" + "="*60)
print("PART 3: DYNAMIC FEW-SHOT PROMPTING")
print("="*60)

class DynamicFewShotPrompter:
    def __init__(self, examples, sentence_model):
        self.examples = examples
        self.sentence_model = sentence_model

    def get_relevant_examples(self, query, k=3):
        """Retrieve k most similar examples to the query"""
        query_embedding = self.sentence_model.encode([query])
        example_texts = [ex['input'] for ex in self.examples]
        example_embeddings = self.sentence_model.encode(example_texts)

        similarities = cosine_similarity(query_embedding, example_embeddings)[0]
        top_indices = np.argsort(similarities)[::-1][:k]

        return [self.examples[idx] for idx in top_indices]

    def create_prompt(self, query, task_description, k=3):
        """Create a dynamic few-shot prompt"""
        relevant_examples = self.get_relevant_examples(query, k)

        prompt = f"{task_description}\n\n"

        for ex in relevant_examples:
            prompt += f"Input: {ex['input']}\nOutput: {ex['output']}\n\n"

        prompt += f"Input: {query}\nOutput:"
        return prompt

# Example dataset for sentiment analysis
sentiment_examples = [
    {"input": "I love this movie!", "output": "positive"},
    {"input": "This food tastes terrible", "output": "negative"},
    {"input": "The weather is nice today", "output": "positive"},
    {"input": "I'm feeling sad", "output": "negative"},
    {"input": "This book is okay", "output": "neutral"},
    {"input": "Amazing service at this restaurant", "output": "positive"},
    {"input": "The product broke after one day", "output": "negative"},
    {"input": "Not bad, could be better", "output": "neutral"},
    {"input": "Absolutely fantastic experience", "output": "positive"},
    {"input": "Waste of money", "output": "negative"}
]

# Initialize dynamic prompter
prompter = DynamicFewShotPrompter(sentiment_examples, sentence_model)

# Test dynamic prompting
test_query = "This pizza is incredibly delicious"
dynamic_prompt = prompter.create_prompt(
    test_query,
    "Classify the sentiment of the following text as positive, negative, or neutral:",
    k=3
)

print(f"📝 Dynamic Few-shot Prompt for: '{test_query}'")
print(f"\n{dynamic_prompt}")

# Compare with random few-shot
random_examples = random.sample(sentiment_examples, 3)
random_prompt = "Classify the sentiment of the following text as positive, negative, or neutral:\n\n"
for ex in random_examples:
    random_prompt += f"Input: {ex['input']}\nOutput: {ex['output']}\n\n"
random_prompt += f"Input: {test_query}\nOutput:"

print(f"\n📝 Random Few-shot Prompt (for comparison):")
print(f"\n{random_prompt}")

# STUDENT INTERACTION 4
print("\n🤔 STUDENT EXERCISE 4:")
print("Try different queries and compare dynamic vs random few-shot selection.")
print("Do you notice any differences in the selected examples?")

# TODO: Students test with their own queries
student_query = "I'm not sure how I feel about this"
student_prompt = prompter.create_prompt(student_query, "Classify sentiment:", k=3)
print(f"\nYour dynamic prompt preview:\n{student_prompt[:200]}...")

# ============================================================================
# PART 4: TRAINING BERT ON SYNTHETIC LLM LABELS
# ============================================================================

print("\n" + "="*60)
print("PART 4: TRAINING BERT ON SYNTHETIC LLM LABELS")
print("="*60)

# Create synthetic dataset (simulating LLM-generated labels)
synthetic_data = [
    {"text": "I absolutely love this product", "label": 1},
    {"text": "This is terrible quality", "label": 0},
    {"text": "Not sure about this purchase", "label": 2},
    {"text": "Best decision ever", "label": 1},
    {"text": "Completely disappointed", "label": 0},
    {"text": "It's alright, nothing special", "label": 2},
    {"text": "Highly recommend to everyone", "label": 1},
    {"text": "Worst experience ever", "label": 0},
    {"text": "Could be better or worse", "label": 2},
    {"text": "Exceeded my expectations", "label": 1}
]

# Convert to dataset format
df = pd.DataFrame(synthetic_data)
dataset = Dataset.from_pandas(df)

print("📊 Synthetic Dataset Overview:")
print(f"Total samples: {len(dataset)}")
print(f"Label distribution:")
print(df['label'].value_counts().sort_index())
print(f"\nSample data:")
print(df.head())

# Load DistilBERT (a smaller BERT variant) for sequence classification
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=3  # positive, negative, neutral
)

def tokenize_function(examples):
    """Tokenize the input texts"""
    return tokenizer(examples["text"], truncation=True, padding=True)

# Tokenize dataset
tokenized_dataset = dataset.map(tokenize_function, batched=True)

# Split into train/test (small dataset, so simple split)
train_size = int(0.8 * len(tokenized_dataset))
train_dataset = tokenized_dataset.select(range(train_size))
test_dataset = tokenized_dataset.select(range(train_size, len(tokenized_dataset)))

print(f"\n📚 Dataset Split:")
print(f"Training samples: {len(train_dataset)}")
print(f"Test samples: {len(test_dataset)}")

# Define training arguments (simplified for workshop)
training_args = TrainingArguments(
    output_dir="./results",
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=10,
    eval_strategy="epoch",  # named `evaluation_strategy` in older transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,
)

# Metric computation function
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    accuracy = accuracy_score(labels, predictions)
    return {"accuracy": accuracy}

# Initialize trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    compute_metrics=compute_metrics,
)

print("\n🏋️ Starting BERT Training...")
print("Note: This is a simplified example. In practice, you'd use larger datasets!")

# Train the model (commented out to avoid long execution time in demo)
# trainer.train()

print("✅ Training setup complete!")
print("\nIn a real scenario, you would:")
print("1. Generate more synthetic labels using an LLM")
print("2. Clean and validate the synthetic data")
print("3. Train on a larger dataset")
print("4. Evaluate on human-labeled test data")
print("5. Compare performance with the original LLM")

# STUDENT INTERACTION 5
print("\n🤔 FINAL STUDENT EXERCISE:")
print("Discussion Questions:")
print("1. What are the advantages of training BERT on LLM-generated labels?")
print("2. What potential issues should we watch out for?")
print("3. How would you validate that the synthetic labels are good quality?")
print("4. In what scenarios would this approach be most useful?")

# Quick inference example (without training)
def predict_sentiment(text):
    """Quick inference example"""
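    # Move inputs to the model's device (the Trainer may have moved the model to the GPU)
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True).to(model.device)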
    with torch.no_grad():
        outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
    predicted_class = torch.argmax(predictions, dim=-1).item()

    labels = {0: "negative", 1: "positive", 2: "neutral"}
    confidence = predictions[0][predicted_class].item()

    return labels[predicted_class], confidence

# Test the model (before training, so results will be random)
test_text = "I think this workshop was helpful"
prediction, confidence = predict_sentiment(test_text)
print(f"\n🔮 Model Prediction (before fine-tuning):")
print(f"Text: '{test_text}'")
print(f"Prediction: {prediction} (confidence: {confidence:.3f})")

print("\n" + "="*60)
print("🎉 WORKSHOP COMPLETE!")
print("="*60)
print("Key Takeaways:")
print("✅ Zero-shot vs Few-shot prompting strategies")
print("✅ Text similarity with transformer embeddings")
print("✅ Dynamic example selection for better prompting")
print("✅ Training smaller models on LLM-generated data")
print("\nNext steps: Experiment with larger datasets and different model architectures!")
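
The discussion questions above ask how to validate synthetic labels. One common approach is to compare them with a small human-labeled sample. The sketch below computes raw agreement and Cohen's kappa (agreement corrected for chance) in plain Python; `agreement` is a hypothetical helper and both label lists are made up for illustration.

```python
from collections import Counter


def agreement(human_labels, llm_labels):
    """Return accuracy and Cohen's kappa between two equally long label lists."""
    if len(human_labels) != len(llm_labels) or not human_labels:
        raise ValueError("Both label lists must be non-empty and of equal length")
    n = len(human_labels)
    # Observed agreement: share of items where both annotators agree
    observed = sum(h == l for h, l in zip(human_labels, llm_labels)) / n
    # Expected agreement by chance, from each annotator's label frequencies
    human_counts = Counter(human_labels)
    llm_counts = Counter(llm_labels)
    expected = sum(human_counts[label] * llm_counts[label] for label in human_counts) / (n * n)
    if expected == 1.0:
        kappa = 1.0  # both annotators used a single identical label throughout
    else:
        kappa = (observed - expected) / (1 - expected)
    return {"accuracy": observed, "kappa": kappa}


# Hypothetical labels: 0 = negative, 1 = positive, 2 = neutral
human = [1, 0, 2, 1, 0, 2, 1, 0]
llm = [1, 0, 2, 1, 0, 1, 1, 2]
print(agreement(human, llm))
```

A kappa close to zero means the synthetic labels agree with humans no better than chance, even if raw accuracy looks acceptable on an imbalanced sample.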