<a href="https://colab.research.google.com/github/nicolaiberk/llm_ws/blob/main/notebooks/05a_prompting.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Annotation with Generative Models

Today, we are going to see how to generate text and annotations with generative LLMs.

> ❗ ACTIVATE THE GPU BY SELECTING RUNTIME IN THE UPPER RIGHT > CONNECT TO RUNTIME > T4 GPU

In [None]:
  !pip install transformers accelerate setfit

> ❗ RESTART THE NOTEBOOK (DROPDOWN NEXT TO RUN ALL > RESTART SESSION)

## Generating text with generative models

We will start by generating some text with SmolLM2, a family of small generative models developed by Hugging Face.

### Simple Inference

The all-powerful `pipeline` is again the simplest way to get inference running quickly:

In [None]:
from transformers import pipeline

pipe = pipeline("text-generation", model="HuggingFaceTB/SmolLM2-360M-Instruct")
messages = [
    "Let me tell you a story. Once upon a time,"
]
pipe(messages)

### Chat

Many applications of LLMs require a chat template. We can use the tokenizer to enforce this template. The template simply indicates which parts of the text are from the user and which are (or should be) from the assistant.

Remember: LLM chats are just roleplay with special tokens!

In [None]:
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-360M-Instruct", padding_side='left')
model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM2-360M-Instruct")

In [None]:
messages = [
    {"role": "user", "content": "Who are you?"},
]
tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=False)
tokenized_chat

In this context, it is helpful to add the generation prompt indicating that the text should be generated in the role of the assistant. Otherwise, the model might generate more text as the user instead ([more](https://huggingface.co/docs/transformers/en/chat_templating?template=Mistral#addgenerationprompt)).

In [None]:
tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

In [None]:
tokenized_chat
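
To make the "roleplay with special tokens" idea concrete, here is a minimal sketch of what a chat template does. It assumes a ChatML-style format with `<|im_start|>` and `<|im_end|>` markers; the function `render_chatml` is hypothetical, and the real template shipped with the tokenizer may differ in detail (for example, it may prepend a default system prompt).

```python
# Hypothetical re-implementation of a ChatML-style template, for illustration only.
# The actual template is stored with the tokenizer and may differ in detail.

def render_chatml(messages, add_generation_prompt=False):
    """Render a list of {"role", "content"} dicts as one ChatML-style string."""
    text = ""
    for message in messages:
        text += f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
    if add_generation_prompt:
        # Open an assistant turn so the model continues in the assistant role
        text += "<|im_start|>assistant\n"
    return text

demo_messages = [{"role": "user", "content": "Who are you?"}]
print(render_chatml(demo_messages))
print(render_chatml(demo_messages, add_generation_prompt=True))
```

The only difference between the two printed strings is the trailing assistant header, which is what nudges the model to answer rather than continue the user's turn.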

In [None]:
inputs = tokenizer.apply_chat_template(
	messages,
	add_generation_prompt=True,
	return_dict=True, # retains attention mask
	return_tensors="pt", # returns tensors
).to(model.device) # more efficient to put on device

In [None]:
inputs

In [None]:
outputs = model.generate(**inputs, max_new_tokens=100) # note the max_new_tokens parameter
print(tokenizer.decode(outputs[0])) # note that the entire conversation is returned, including the system prompt.

### Zero-shot prompting

In order to get proper annotations from our model, we can simply ask the model to generate the relevant outputs. This is as simple as writing a prompt. Remember the best practices we discussed earlier today.

In [None]:
messages = [
    [{"role": "user", "content": """You are an expert annotator with years of experience annotating social science data.
    Your main task is to annotate whether the following text is about AI or not.
    Respond ONLY with the label "AI" or "NOT AI". Do NOT provide an explanation

    Text: "SmolLM is a pretty impressive model!"
    """}
    ],
    [{"role": "user", "content": """You are an expert annotator with years of experience annotating social science data.
    Your main task is to annotate whether the following text is about AI or not.
    Respond ONLY with the label "AI" or "NOT AI". Do NOT provide an explanation

    Text: "The weather is horrible today!"
    """}]
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    padding=True,
    return_dict=True, # retains attention mask
    return_tensors="pt", # returns tensors
).to(model.device) # more efficient to put on device

In [None]:
messages

In [None]:
inputs

In [None]:
outputs = model.generate(**inputs, max_new_tokens=3)
print(tokenizer.decode(outputs[0][inputs['input_ids'].shape[1]:]))
print(tokenizer.decode(outputs[1][inputs['input_ids'].shape[1]:]))
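
The slicing above relies on left padding: when every prompt is padded on the left, all prompts end at the same position, so one shared offset separates prompt from generated tokens. The following toy sketch uses plain lists and made-up token ids (the pad id `0` and the helper `left_pad` are assumptions for illustration, not the tokenizer's actual behaviour).

```python
PAD = 0  # hypothetical pad token id

def left_pad(sequences, pad_id=PAD):
    """Pad every sequence on the left to the length of the longest one."""
    width = max(len(seq) for seq in sequences)
    return [[pad_id] * (width - len(seq)) + seq for seq in sequences]

# Two toy prompts of different length (token ids are made up)
toy_prompts = [[11, 12, 13, 14], [21, 22]]
batch = left_pad(toy_prompts)
prompt_width = len(batch[0])  # plays the role of inputs['input_ids'].shape[1]

# Pretend the model appended two new tokens to each row
generated = [row + new for row, new in zip(batch, [[91, 92], [93, 94]])]

# Slicing at the shared prompt width isolates the new tokens in every row
new_tokens = [row[prompt_width:] for row in generated]
print(batch)
print(new_tokens)
```

With right padding, the pad tokens would sit between the shorter prompt and its continuation, which is why the tokenizer was loaded with `padding_side='left'`.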

### Few-shot

In [None]:
messages = [
    [{"role": "user", "content": """You are an expert annotator with years of experience annotating social science data.
    Your main task is to annotate whether the following text is about artificial intelligence or not.
    Respond ONLY with the label "AI" or "NOT AI". Do NOT provide an explanation

    Example:
    "SmolLM is a pretty impressive model!"
    """},
     {"role": "assistant", "content": "AI"},
     {"role": "user", "content": """
    Example:
    "The weather is horrible today!"
    """},
     {"role": "assistant", "content": "NOT AI"},
     {"role": "user", "content": """
     Text: "The impact of the new wave of automation on the labour market is not yet clear."
     """}
    ]]
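
Writing the alternating user and assistant turns by hand gets tedious with more examples. As a sketch, the same structure can be assembled from a list of `(text, label)` pairs; the helper `build_few_shot_messages` is hypothetical and its exact wording of the turns is an assumption, not the format used in the cell above.

```python
def build_few_shot_messages(instruction, examples, text):
    """Return a chat in which each example is a user turn answered by the assistant."""
    messages = []
    for i, (example_text, example_label) in enumerate(examples):
        # The instruction is only stated once, in the first user turn
        prefix = instruction + "\n\n" if i == 0 else ""
        messages.append({"role": "user", "content": f'{prefix}Text: "{example_text}"'})
        messages.append({"role": "assistant", "content": example_label})
    # The final user turn holds the text we actually want annotated
    prefix = instruction + "\n\n" if not examples else ""
    messages.append({"role": "user", "content": f'{prefix}Text: "{text}"'})
    return messages

demo_instruction = 'Respond ONLY with the label "AI" or "NOT AI".'
demo_examples = [
    ("SmolLM is a pretty impressive model!", "AI"),
    ("The weather is horrible today!", "NOT AI"),
]
demo_chat = build_few_shot_messages(
    demo_instruction, demo_examples, "Robots are replacing factory workers."
)
for turn in demo_chat:
    print(turn["role"], "->", turn["content"])
```

The returned list has the same shape as a single conversation, so it could be wrapped in an outer list and passed to `apply_chat_template` in the same way as the hand-written version.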

In [None]:
tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=False)
tokenized_chat

In [None]:
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    padding=True,
    return_dict=True, # retains attention mask
    return_tensors="pt", # returns tensors
).to(model.device) # more efficient to put on device

In [None]:
outputs = model.generate(**inputs, max_new_tokens=3)
print(tokenizer.decode(outputs[0][inputs['input_ids'].shape[1]:]))
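
Small models do not always return the label cleanly: the decoded text may carry whitespace, different casing, or an end-of-turn marker. A minimal sketch of a clean-up step follows; the function `parse_label` is hypothetical, and the `<|im_end|>` marker is an assumption about the chat format.

```python
def parse_label(raw, labels=("NOT AI", "AI")):
    """Map raw generated text to one of the allowed labels, or None if none matches."""
    cleaned = raw.replace("<|im_end|>", " ").strip().upper()
    # Check longer labels first so that "NOT AI" is not mistaken for "AI"
    for label in sorted(labels, key=len, reverse=True):
        if cleaned.startswith(label.upper()):
            return label
    return None

print(parse_label("AI<|im_end|>"))
print(parse_label(" not ai\n"))
print(parse_label("I think"))
```

Returning `None` for anything unexpected makes it easy to count how often the model fails to follow the output format, instead of silently forcing every answer into a category.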

# Informed Prompting

In [None]:
from sentence_transformers import SentenceTransformer, CrossEncoder

We load a pretrained model:

In [None]:
model = SentenceTransformer("all-MiniLM-L6-v2")

Then we define some sentences of interest:

In [None]:
sentences = [
    "The Great Wall of China was built over several dynasties, with most of the existing structure dating from the Ming Dynasty (1368-1644).",
    "The blue whale's heart alone can weigh as much as an automobile and is roughly the size of a small car.",
    "Studies show that the Dunning-Kruger effect causes people with low ability in a domain to overestimate their competence in that area.",
]

And encode them as embeddings:

In [None]:
# 2. Calculate embeddings by calling model.encode()
embeddings = model.encode(sentences)
print(embeddings.shape)

We can then calculate the cosine similarity of the sentences with each other:

In [None]:
# 3. Calculate the embedding similarities
similarities = model.similarity(embeddings, embeddings)
print(similarities)
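
For intuition, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below computes it by hand on toy two-dimensional vectors; it illustrates the measure itself and is not the library's implementation.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal
print(cosine_similarity([1.0, 1.0], [1.0, 0.0]))  # 45 degrees apart
```

Because the measure only depends on direction, two embeddings of very different magnitude can still be maximally similar.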

## Similarity Search

This is particularly useful if we are searching for something with a query:

In [None]:
query = "How large is a blue whale's heart?"
query_embedding = model.encode([query])
similarities = model.similarity(query_embedding, embeddings)
print(similarities)

Looks good! Now we can select the most similar context to add to the prompt:

In [None]:
best_index = similarities.squeeze().argmax().item() # get the index of the highest similarity

In [None]:
prompt = "Answer the Question. \nQuery:" + query + "\nContext: " + sentences[best_index]
print(prompt)
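
Taking only the single best match is the simplest choice. As a sketch of a small extension, the helpers below select the `k` highest-scoring contexts and list them in the prompt. The names `top_k_indices` and `build_prompt`, the toy sentences, and the scores are all made up for illustration.

```python
def top_k_indices(scores, k=2):
    """Return the indices of the k highest scores, best first."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]

def build_prompt(query, contexts):
    """Combine a query and several context passages into one prompt string."""
    context_block = "\n".join(f"- {context}" for context in contexts)
    return f"Answer the question.\nQuery: {query}\nContext:\n{context_block}"

toy_sentences = ["A sentence about walls.", "A sentence about whales.", "A sentence about psychology."]
toy_scores = [0.05, 0.71, 0.12]  # made-up similarity scores

best = top_k_indices(toy_scores, k=2)
toy_prompt = build_prompt("How large is a blue whale's heart?", [toy_sentences[i] for i in best])
print(best)
print(toy_prompt)
```

With the real similarity tensor, a plain list of scores could be obtained with `similarities.squeeze().tolist()` before ranking.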

### BONUS: Setfit

SetFit is a particularly efficient solution for few-shot learning. You can find a brief explainer with code [here](https://huggingface.co/blog/setfit).

# Exercise

1. Think of a concept of interest for your research. Operationalize it with some labels. Define some example texts with the associated labels for each category. These texts will serve as the context explaining to our model how to annotate.

In [None]:
# Define the context
annotated_news_articles = [
    "Border patrol agents apprehended over 2,000 migrants attempting to cross the southern border illegally last week. The surge comes amid renewed debates over immigration policy reform in Congress.",
    "A new study reveals that climate change is accelerating the melting of Antarctic ice sheets at an unprecedented rate. Scientists warn this could lead to significant sea level rise within the next decade.",
    "The Federal Reserve announced a 0.25% interest rate cut to stimulate economic growth amid concerns about inflation. Market analysts expect this move to boost consumer spending during the holiday season.",
    "Immigration courts are facing a backlog of over 3 million cases, with average wait times exceeding four years. Legal advocates are calling for increased funding to hire more immigration judges.",
    "Tech giant announces breakthrough in quantum computing that could revolutionize data processing capabilities. The new chip design promises to solve complex problems exponentially faster than traditional computers.",
    "A bipartisan group of senators introduced legislation to streamline the legal immigration process for skilled workers. The bill aims to reduce visa processing times and increase annual caps for certain categories.",
    "Wildfire season has started earlier than expected across the western United States, with three major blazes already burning thousands of acres. Drought conditions and high temperatures are contributing to the increased fire risk.",
    "New archaeological discoveries in Egypt have uncovered a previously unknown pharaoh's tomb dating back 3,400 years. The tomb contains well-preserved artifacts that could reshape understanding of ancient Egyptian history.",
    "Local authorities rescued 45 undocumented immigrants from a suspected human trafficking operation in a warehouse outside Houston. The investigation has led to multiple arrests and is ongoing.",
    "Major automaker recalls 500,000 vehicles due to faulty brake systems that could increase accident risk. The company will provide free repairs at authorized dealerships nationwide."
]

# Associated labels
labels = [
    "immigration",
    "NOT immigration",
    "NOT immigration",
    "immigration",
    "NOT immigration",
    "immigration",
    "NOT immigration",
    "NOT immigration",
    "immigration",
    "NOT immigration"
]

2. Create a query to pass to our model and an example text you wish to classify.

In [None]:
query = """
You are an expert annotator of news content, having years of experience as a research assistant in social science projects.
You assess whether the below text is about immigration.
Only answer with "immigration" or "NOT immigration". Do not provide an explanation.

Here is the text:
"""

In [None]:
article_to_classify = "Asylum seekers at the U.S.-Mexico border are experiencing longer wait times due to new processing requirements implemented by immigration officials. Advocacy groups report that families are waiting up to six months in temporary shelters before their initial hearings."

3. Using the sentence-transformer model, assess how similar each example is to the text you wish to classify.

In [None]:
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

In [None]:
context_embeddings = embedding_model.encode(annotated_news_articles)

In [None]:
query_embedding = embedding_model.encode([article_to_classify]) ## note that we do NOT use the query here as we are interested in the example we annotate
similarities = embedding_model.similarity(query_embedding, context_embeddings)

In [None]:
similarities

4. Add the most similar example with the right annotation to the prompt. If you can, use the chat template from above.

In [None]:
best_index = similarities.squeeze().argmax().item() # get the position of the most similar embedding
best_index

In [None]:
example_text = annotated_news_articles[best_index]
example_label = labels[best_index]

In [None]:
example_text

In [None]:
chat_template = [
    {"role": "user", "content": query + example_text},
     {"role": "assistant", "content": example_label},
     {"role": "user", "content": article_to_classify}
    ]

In [None]:
chat_template

5. Post the message to the model. Are you happy with the model's annotation?

In [None]:
## model definition
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-360M-Instruct", padding_side='left')
model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM2-360M-Instruct")

In [None]:
## tokenization
inputs = tokenizer.apply_chat_template(
    chat_template,
    add_generation_prompt=True,
    padding=True,
    return_dict=True, # retains attention mask
    return_tensors="pt", # returns tensors
).to(model.device) # more efficient to put on device

In [None]:
## inference
outputs = model.generate(**inputs, max_new_tokens=3)
print(tokenizer.decode(outputs[0][inputs['input_ids'].shape[1]:]))
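
One way to judge whether you are happy with the annotations is to classify several texts whose labels you already know and compare. The sketch below does this with made-up predictions; the function `accuracy` and the lists `gold` and `predicted` are hypothetical and not outputs of the model above.

```python
def accuracy(predictions, gold):
    """Share of predictions that match the gold labels."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)

gold = ["immigration", "NOT immigration", "immigration", "NOT immigration"]
predicted = ["immigration", "NOT immigration", "NOT immigration", "NOT immigration"]  # made up

# Positions where the prediction disagrees with the gold label
disagreements = [i for i, (p, g) in enumerate(zip(predicted, gold)) if p != g]

print(accuracy(predicted, gold))
print(disagreements)
```

Reading the texts at the disagreeing positions is often more informative than the score itself, since it shows which kinds of cases the prompt handles poorly.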