<a href="https://colab.research.google.com/github/nicolaiberk/llm_ws/blob/main/notebooks/05a_prompting.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Annotation with Generative Models

Today, we are going to see how to generate text and annotations with generative LLMs.

> ❗ ACTIVATE THE GPU BY SELECTING RUNTIME IN THE UPPER RIGHT > CONNECT TO RUNTIME > T4 GPU

In [1]:
  !pip install transformers accelerate setfit

Collecting setfit
  Downloading setfit-1.1.3-py3-none-any.whl.metadata (12 kB)
Collecting evaluate>=0.3.0 (from setfit)
  Downloading evaluate-0.4.5-py3-none-any.whl.metadata (9.5 kB)
Downloading setfit-1.1.3-py3-none-any.whl (75 kB)
Downloading evaluate-0.4.5-py3-none-any.whl (84 kB)
Installing collected packages: evaluate, setfit
Successfully installed evaluate-0.4.5 setfit-1.1.3


> ❗ RESTART THE NOTEBOOK (DROPDOWN NEXT TO RUN ALL > RESTART SESSION)

## Generating text with generative models

We will start by generating some text with a model from SmolLM2, a family of small generative models developed by Hugging Face.

### Simple Inference

The all-powerful `pipeline` is again the simplest way to get inference running quickly:

In [1]:
from transformers import pipeline

pipe = pipeline("text-generation", model="HuggingFaceTB/SmolLM2-360M-Instruct")
messages = [
    "Let me tell you a story. Once upon a time,"
]
pipe(messages)

Device set to use cuda:0


[[{'generated_text': 'Let me tell you a story. Once upon a time, in a faraway land, there lived two friends, Pierre and Jacques. Pierre was a farmer, and Jacques was a baker. They loved their farm and their bread, and they always had a big feast together.\n\nOne day, a big storm came and ruined all their crops. Pierre and Jacques were very sad and worried what would happen to their farm. They didn\'t know what to do.\n\nPierre said, "Jacques, we can\'t eat the same bread again. We need to find a way to make bread without our crops." Jacques replied, "Maybe we can use our skills to make bread from other things like wheat from other farms, or maybe we can even make bread from the same kind of wheat that grew after the storm."\n\nPierre and Jacques thought this was a great idea! They started working together, using their farming skills to make bread from other kinds of wheat, and even made bread from the same wheat that grew after the storm. The people of their village loved their new bre

### Chat

Many applications of LLMs require a chat template. We can use the tokenizer to enforce this template. The template simply indicates which parts of the text are from the user and which are (or should be) from the assistant.

Remember: LLM chats are just roleplay with special tokens!

In [2]:
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-360M-Instruct", padding_side='left')
model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM2-360M-Instruct")

In [13]:
messages = [
    {"role": "user", "content": "Who are you?"},
]
tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=False)
tokenized_chat

'<|im_start|>system\nYou are a helpful AI assistant named SmolLM, trained by Hugging Face<|im_end|>\n<|im_start|>user\nWho are you?<|im_end|>\n'

In this context, it is helpful to add the generation prompt indicating that the text should be generated in the role of the assistant. Otherwise, the model might generate more text as the user instead ([more](https://huggingface.co/docs/transformers/en/chat_templating?template=Mistral#addgenerationprompt)).

In [14]:
tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

In [15]:
tokenized_chat

'<|im_start|>system\nYou are a helpful AI assistant named SmolLM, trained by Hugging Face<|im_end|>\n<|im_start|>user\nWho are you?<|im_end|>\n<|im_start|>assistant\n'

In [16]:
inputs = tokenizer.apply_chat_template(
	messages,
	add_generation_prompt=True,
	return_dict=True, # retains attention mask
	return_tensors="pt", # returns tensors
).to(model.device) # more efficient to put on device

In [17]:
inputs

{'input_ids': tensor([[    1,  9690,   198,  2683,   359,   253,  5356,  5646, 11173,  3365,
          3511,   308, 34519,    28,  7018,   411,   407, 19712,  8182,     2,
           198,     1,  4093,   198, 10576,   359,   346,    47,     2,   198,
             1,   520,  9531,   198]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

In [18]:
outputs = model.generate(**inputs, max_new_tokens=100) # note the max_new_tokens parameter
print(tokenizer.decode(outputs[0])) # note that the entire conversation is returned, including the system prompt.

<|im_start|>system
You are a helpful AI assistant named SmolLM, trained by Hugging Face<|im_end|>
<|im_start|>user
Who are you?<|im_end|>
<|im_start|>assistant
I'm a chatbot designed to assist users with their language learning needs. I was trained on a vast amount of text data, including various languages, grammar rules, and vocabulary. I can help with language learning, grammar, and vocabulary exercises, as well as provide explanations for various language concepts.<|im_end|>
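
To make the "roleplay with special tokens" point concrete, the template strings shown above can be reproduced by hand. The following is a minimal sketch in plain Python, not the tokenizer's actual implementation: the function name `render_chat` and the constant `DEFAULT_SYSTEM` are made up for illustration, and the format is simply copied from the strings printed in this section.

```python
# Hand-rolled rendering of the chat format printed above (illustration only).
DEFAULT_SYSTEM = "You are a helpful AI assistant named SmolLM, trained by Hugging Face"

def render_chat(messages, add_generation_prompt=False, system=DEFAULT_SYSTEM):
    """Render a list of {"role", "content"} dicts as one template string."""
    turns = list(messages)
    # Prepend the default system prompt if the conversation has none
    if not turns or turns[0]["role"] != "system":
        turns = [{"role": "system", "content": system}] + turns
    # Each turn is wrapped in <|im_start|>ROLE ... <|im_end|>
    text = "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in turns
    )
    # The generation prompt opens an assistant turn for the model to complete
    if add_generation_prompt:
        text += "<|im_start|>assistant\n"
    return text

rendered = render_chat(
    [{"role": "user", "content": "Who are you?"}],
    add_generation_prompt=True,
)
print(repr(rendered))
```

In practice you should keep using `tokenizer.apply_chat_template`, since every model family defines its own template. The sketch only shows that there is nothing more to a "chat" than a formatted string.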


### Zero-shot prompting

In order to get proper annotations from our model, we can simply ask the model to generate the relevant outputs. This is as simple as writing a prompt. Remember the best practices we discussed earlier today.

In [33]:
messages = [
    [{"role": "user", "content": """You are an expert annotator with years of experience annotating social science data.
    Your main task is to annotate whether the following text is about AI or not.
    Respond ONLY with the label "AI" or "NOT AI". Do NOT provide an explanation

    Text: "SmolLM is a pretty impressive model!"
    """}
    ],
    [{"role": "user", "content": """You are an expert annotator with years of experience annotating social science data.
    Your main task is to annotate whether the following text is about AI or not.
    Respond ONLY with the label "AI" or "NOT AI". Do NOT provide an explanation

    Text: "The weather is horrible today!"
    """}]
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    padding=True,
    return_dict=True,  # retains attention mask
    return_tensors="pt",  # returns tensors
).to(model.device)  # more efficient to put on device

In [34]:
messages

[[{'role': 'user',
   'content': 'You are an expert annotator with years of experience annotating social science data.\n    Your main task is to annotate whether the following text is about AI or not.\n    Respond ONLY with the label "AI" or "NOT AI". Do NOT provide an explanation\n\n    Text: "SmolLM is a pretty impressive model!"\n    '}],
 [{'role': 'user',
   'content': 'You are an expert annotator with years of experience annotating social science data.\n    Your main task is to annotate whether the following text is about AI or not.\n    Respond ONLY with the label "AI" or "NOT AI". Do NOT provide an explanation\n\n    Text: "The weather is horrible today!"\n    '}]]

In [20]:
inputs

{'input_ids': tensor([[    1,  9690,   198,  2683,   359,   253,  5356,  5646, 11173,  3365,
          3511,   308, 34519,    28,  7018,   411,   407, 19712,  8182,     2,
           198,     1,  4093,   198,  2683,   359,   354,  4507, 13666,  1508,
           351,   929,   282,  1786, 13666,   674,  1329,  2092,   940,    30,
           472,  2789,  1085,  3856,   314,   288, 13666,   368,  1991,   260,
          1695,  1694,   314,   563,  5646,   355,   441,    30,   472, 30417,
         44339,   351,   260,  4368,   476, 13701,    18,   355,   476, 18083,
          5646,  2227,  3315,  9695,  1538,   354,  7718,  1004,  9378,    42,
           476,  9207,   308, 34519,   314,   253,  5740, 10402,  1743,  6653,
          2367,     2,   198,     1,   520,  9531,   198],
        [    2,     2,     2,     1,  9690,   198,  2683,   359,   253,  5356,
          5646, 11173,  3365,  3511,   308, 34519,    28,  7018,   411,   407,
         19712,  8182,     2,   198,     1,  4093,   198, 
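
Note the run of `2`s at the start of the second row: because the two prompts differ in length, the shorter one is padded on the left (we set `padding_side='left'` when loading the tokenizer). Decoder-only models continue from the last position, so the real tokens need to sit at the right end of each row. A minimal sketch of what left-padding does, using a hypothetical helper `left_pad` and toy token IDs:

```python
def left_pad(batch, pad_id):
    """Pad lists of token IDs on the left to a common length.

    Returns the padded IDs and an attention mask (0 = padding, 1 = real token).
    """
    width = max(len(seq) for seq in batch)
    input_ids, attention_mask = [], []
    for seq in batch:
        n_pad = width - len(seq)
        input_ids.append([pad_id] * n_pad + list(seq))
        attention_mask.append([0] * n_pad + [1] * len(seq))
    return input_ids, attention_mask

# Toy example: two "prompts" of different length, pad ID 2 as in the output above
ids, mask = left_pad([[1, 9690, 198, 2683], [1, 9690]], pad_id=2)
print(ids)
print(mask)
```

The attention mask is what tells the model to ignore the padding, which is why we pass `return_dict=True` and unpack everything with `**inputs`.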

In [21]:
outputs = model.generate(**inputs, max_new_tokens=3)
print(tokenizer.decode(outputs[0][inputs['input_ids'].shape[1]:]))
print(tokenizer.decode(outputs[1][inputs['input_ids'].shape[1]:]))

AI<|im_end|><|im_end|>
NOT AI<|im_end|>
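
The slice `outputs[i][inputs['input_ids'].shape[1]:]` drops the prompt tokens so that only the newly generated tokens are decoded. The decoded strings still contain special tokens; the doubled `<|im_end|>` in the first row is most likely padding added after that sequence had already finished, since the padding ID `2` in the tensors above coincides with the ID of `<|im_end|>`. Before using such outputs as annotations, it helps to clean and validate them. A minimal sketch, where `parse_label` and `ALLOWED_LABELS` are illustrative names:

```python
import re

# Check the longer label first so "NOT AI" is never cut down to a shorter match
ALLOWED_LABELS = ("NOT AI", "AI")

def parse_label(raw, allowed=ALLOWED_LABELS):
    """Strip special tokens such as <|im_end|> and return a valid label or None."""
    cleaned = re.sub(r"<\|[^|]*\|>", "", raw).strip()
    for label in allowed:
        if cleaned.upper().startswith(label.upper()):
            return label
    return None  # the model produced something outside the label set

print(parse_label("AI<|im_end|><|im_end|>"))
print(parse_label("NOT AI<|im_end|>"))
print(parse_label("I think"))
```

Returning `None` for anything unexpected makes it easy to count how often the model ignores the instructions, which small models do fairly regularly.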


### Few-shot

In [29]:
messages = [
    [{"role": "user", "content": """You are an expert annotator with years of experience annotating social science data.
    Your main task is to annotate whether the following text is about artificial intelligence or not.
    Respond ONLY with the label "AI" or "NOT AI". Do NOT provide an explanation

    Example:
    "SmolLM is a pretty impressive model!"
    """},
     {"role": "assistant", "content": "AI"},
     {"role": "user", "content": """
    Example:
    "The weather is horrible today!"
    """},
     {"role": "assistant", "content": "NOT AI"},
     {"role": "user", "content": """
     Text: "The impact of the new wave on automation on the labour market is not yet clear."
     """}
    ]]

In [30]:
tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=False)
tokenized_chat

['<|im_start|>system\nYou are a helpful AI assistant named SmolLM, trained by Hugging Face<|im_end|>\n<|im_start|>user\nYou are an expert annotator with years of experience annotating social science data.\n    Your main task is to annotate whether the following text is about artificial intelligence or not.\n    Respond ONLY with the label "AI" or "NOT AI". Do NOT provide an explanation\n\n    Example:\n    "SmolLM is a pretty impressive model!"\n    <|im_end|>\n<|im_start|>assistant\nAI<|im_end|>\n<|im_start|>user\n\n    Example:\n    "The weather is horrible today!"\n    <|im_end|>\n<|im_start|>assistant\nNOT AI<|im_end|>\n<|im_start|>user\n\n     Text: "The impact of the new wave on automation on the labour market is not yet clear."\n     <|im_end|>\n']

In [31]:
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    padding=True,
    return_dict=True,  # retains attention mask
    return_tensors="pt",  # returns tensors
).to(model.device)  # more efficient to put on device

In [32]:
outputs = model.generate(**inputs, max_new_tokens=3)
print(tokenizer.decode(outputs[0][inputs['input_ids'].shape[1]:]))

NOT AI<|im_end|>
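
Writing the few-shot conversation by hand gets tedious once you have more than a couple of examples. The structure used above (instruction and first example in the first user turn, the label as the assistant turn, the text to annotate in the last user turn) can be generated from a list of labelled examples. A minimal sketch, with `build_few_shot` as a hypothetical helper and a shortened instruction:

```python
def build_few_shot(instruction, examples, text):
    """Build a chat with one user/assistant pair per labelled example."""
    messages = []
    for i, (example_text, label) in enumerate(examples):
        # Only the first user turn carries the task instruction
        prefix = instruction + "\n\n" if i == 0 else ""
        messages.append({"role": "user", "content": f'{prefix}Example:\n"{example_text}"'})
        messages.append({"role": "assistant", "content": label})
    # Final user turn: the text to annotate (with the instruction if there are no examples)
    prefix = instruction + "\n\n" if not examples else ""
    messages.append({"role": "user", "content": f'{prefix}Text: "{text}"'})
    return messages

instruction = 'Respond ONLY with the label "AI" or "NOT AI".'
examples = [
    ("SmolLM is a pretty impressive model!", "AI"),
    ("The weather is horrible today!", "NOT AI"),
]
few_shot_messages = build_few_shot(
    instruction, examples, "The impact of automation on the labour market is not yet clear."
)
for m in few_shot_messages:
    print(m["role"], "|", m["content"][:50])
```

The resulting list can be passed to `tokenizer.apply_chat_template` exactly like the hand-written `messages` above.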


# Informed Prompting

In [35]:
from sentence_transformers import SentenceTransformer, CrossEncoder

We load a pretrained model:

In [36]:
model = SentenceTransformer("all-MiniLM-L6-v2")

Then we define some sentences of interest:

In [37]:
sentences = [
    "The Great Wall of China was built over several dynasties, with most of the existing structure dating from the Ming Dynasty (1368-1644).",
    "The blue whale's heart alone can weigh as much as an automobile and is roughly the size of a small car.",
    "Studies show that the Dunning-Kruger effect causes people with low ability in a domain to overestimate their competence in that area.",
]

And encode them as embeddings:

In [38]:
# 2. Calculate embeddings by calling model.encode()
embeddings = model.encode(sentences)
print(embeddings.shape)

(3, 384)


We can then calculate the cosine similarity of the sentences with each other:

In [39]:
# 3. Calculate the embedding similarities
similarities = model.similarity(embeddings, embeddings)
print(similarities)

tensor([[ 1.0000, -0.0797, -0.0810],
        [-0.0797,  1.0000,  0.0047],
        [-0.0810,  0.0047,  1.0000]])
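
Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it ranges from -1 (opposite direction) to 1 (same direction). That is why the diagonal above is exactly 1: each sentence is compared with itself. A minimal sketch in plain Python with toy three-dimensional vectors (the real embeddings have 384 dimensions):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]   # same direction as a, twice as long
c = [3.0, 0.0, -1.0]  # orthogonal to a

print(cosine_similarity(a, b))  # close to 1: length does not matter
print(cosine_similarity(a, c))  # 0: unrelated directions
```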


## Similarity Search

This is particularly useful when we want to search a set of texts with a query:

In [40]:
query = "How large is a blue whales heart?"
query_embedding = model.encode([query])
similarities = model.similarity(query_embedding, embeddings)
print(similarities)

tensor([[ 0.0311,  0.6708, -0.0386]])


Looks good! The second sentence is by far the most similar to the query. Now we can select the most similar context to add to the prompt:

In [41]:
best_index = similarities.squeeze().argmax().item() # get the index of the highest similarity

In [42]:
prompt = "Answer the Question. \nQuery:" + query + "\nContext: " + sentences[best_index]
print(prompt)

Answer the Question. 
Query:How large is a blue whales heart?
Context: The blue whale's heart alone can weigh as much as an automobile and is roughly the size of a small car.
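
With a larger collection of texts, you will often want to pass more than one context to the model. The following sketch generalizes the `argmax` step to the top `k` matches. The helper names `top_k_indices` and `build_prompt` are illustrative, the scores are the similarity values printed above, and the documents are shortened stand-ins for the three sentences.

```python
def top_k_indices(scores, k=1):
    """Return the indices of the k highest scores, best first."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]

def build_prompt(query, contexts):
    """Combine a query with one or more retrieved context passages."""
    context_block = "\n".join(f"- {c}" for c in contexts)
    return f"Answer the question.\nQuery: {query}\nContext:\n{context_block}"

scores = [0.0311, 0.6708, -0.0386]  # similarities from the cell above
docs = ["Great Wall sentence", "Blue whale sentence", "Dunning-Kruger sentence"]  # stand-ins

best = top_k_indices(scores, k=2)
rag_prompt = build_prompt("How large is a blue whale's heart?", [docs[i] for i in best])
print(best)
print(rag_prompt)
```

How many contexts to include is a trade-off: more context raises the chance that the answer is in the prompt, but small models are easily distracted by irrelevant passages.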


### BONUS: SetFit

SetFit is a particularly efficient solution for few-shot learning. You can find a brief explainer with code [here](https://huggingface.co/blog/setfit).

# Exercise

1. Think of a concept of interest for your research. Operationalize it with some labels. Define some example texts with the associated labels for each category. These texts will serve as the context explaining to our model how to annotate.

In [48]:
# Define the context
annotated_news_articles = [
    "Border patrol agents apprehended over 2,000 migrants attempting to cross the southern border illegally last week. The surge comes amid renewed debates over immigration policy reform in Congress.",
    "A new study reveals that climate change is accelerating the melting of Antarctic ice sheets at an unprecedented rate. Scientists warn this could lead to significant sea level rise within the next decade.",
    "The Federal Reserve announced a 0.25% interest rate cut to stimulate economic growth amid concerns about inflation. Market analysts expect this move to boost consumer spending during the holiday season.",
    "Immigration courts are facing a backlog of over 3 million cases, with average wait times exceeding four years. Legal advocates are calling for increased funding to hire more immigration judges.",
    "Tech giant announces breakthrough in quantum computing that could revolutionize data processing capabilities. The new chip design promises to solve complex problems exponentially faster than traditional computers.",
    "A bipartisan group of senators introduced legislation to streamline the legal immigration process for skilled workers. The bill aims to reduce visa processing times and increase annual caps for certain categories.",
    "Wildfire season has started earlier than expected across the western United States, with three major blazes already burning thousands of acres. Drought conditions and high temperatures are contributing to the increased fire risk.",
    "New archaeological discoveries in Egypt have uncovered a previously unknown pharaoh's tomb dating back 3,400 years. The tomb contains well-preserved artifacts that could reshape understanding of ancient Egyptian history.",
    "Local authorities rescued 45 undocumented immigrants from a suspected human trafficking operation in a warehouse outside Houston. The investigation has led to multiple arrests and is ongoing.",
    "Major automaker recalls 500,000 vehicles due to faulty brake systems that could increase accident risk. The company will provide free repairs at authorized dealerships nationwide."
]

# Associated labels
labels = [
    "immigration",
    "NOT immigration",
    "NOT immigration",
    "immigration",
    "NOT immigration",
    "immigration",
    "NOT immigration",
    "NOT immigration",
    "immigration",
    "NOT immigration"
]

2. Create a query to pass to our model and an example text you wish to classify.

In [62]:
query = """
You are an expert annotator of news content, having years of experience as research assistant in social science projects.
You assess whether the below text is about immigration.
Only aswer with "immigration" or "NOT immigration". Do not provide an explanation.

Here is the text:
"""

In [67]:
article_to_classify = "Asylum seekers at the U.S.-Mexico border are experiencing longer wait times due to new processing requirements implemented by immigration officials. Advocacy groups report that families are waiting up to six months in temporary shelters before their initial hearings."

3. Using the sentence-transformer model, assess how similar each example is to the text you wish to classify.

In [51]:
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

In [52]:
context_embeddings = embedding_model.encode(annotated_news_articles)

In [68]:
# Embed the article itself, not the instruction prompt: we want the examples most similar to the text we annotate
query_embedding = embedding_model.encode([article_to_classify])
similarities = embedding_model.similarity(query_embedding, context_embeddings)

In [54]:
similarities

tensor([[ 0.4319,  0.0703,  0.0522,  0.5475,  0.1120,  0.2951,  0.0755,  0.0734,
          0.3540, -0.0090]])

4. Add the most similar example with the right annotation to the prompt. If you can, use the chat template from above.

In [55]:
best_index = similarities.squeeze().argmax().item() # get the position of the most similar embedding
best_index

3

In [56]:
example_text = annotated_news_articles[best_index]
example_label = labels[best_index]

In [57]:
example_text

'Immigration courts are facing a backlog of over 3 million cases, with average wait times exceeding four years. Legal advocates are calling for increased funding to hire more immigration judges.'

In [65]:
chat_template = [
    {"role": "user", "content": query + example_text},
    {"role": "assistant", "content": example_label},
    {"role": "user", "content": article_to_classify},
]

In [66]:
chat_template

[{'role': 'user',
  'content': '\nYou are an expert annotator of news content, having years of experience as research assistant in social science projects. \nYou assess whether the below text is about immigration. \nOnly aswer with "immigration" or "NOT immigration". Do not provide an explanation.\n\nHere is the text:\nImmigration courts are facing a backlog of over 3 million cases, with average wait times exceeding four years. Legal advocates are calling for increased funding to hire more immigration judges.'},
 {'role': 'assistant', 'content': 'immigration'},
 {'role': 'user',
  'content': 'Asylum seekers at the U.S.-Mexico border are experiencing longer wait times due to new processing requirements implemented by immigration officials. Advocacy groups report that families are waiting up to six months in temporary shelters before their initial hearings.'}]
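
One limitation of picking only the single most similar example is that the prompt then shows the model just one of the two labels. A possible extension, sketched here as an assumption and not part of the exercise solution, is to pick the most similar example for each label. The helper `best_example_per_label` is hypothetical; the scores and labels are copied from the cells above.

```python
def best_example_per_label(scores, example_labels):
    """Return, for each label, the index of the example with the highest score."""
    best = {}
    for i, (score, label) in enumerate(zip(scores, example_labels)):
        if label not in best or score > scores[best[label]]:
            best[label] = i
    return best

# Similarities and labels as shown above
scores = [0.4319, 0.0703, 0.0522, 0.5475, 0.1120, 0.2951, 0.0755, 0.0734, 0.3540, -0.0090]
example_labels = [
    "immigration", "NOT immigration", "NOT immigration", "immigration", "NOT immigration",
    "immigration", "NOT immigration", "NOT immigration", "immigration", "NOT immigration",
]

best = best_example_per_label(scores, example_labels)
print(best)
```

Each selected index can then be turned into a user/assistant pair in the chat, in the same way as the single example above.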

5. Post the message to the model. Are you happy with the model's annotation?

In [69]:
## model definition
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-360M-Instruct", padding_side='left')
model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM2-360M-Instruct")

In [72]:
## tokenization
inputs = tokenizer.apply_chat_template(
    chat_template,
    add_generation_prompt=True,
    padding=True,
    return_dict=True,  # retains attention mask
    return_tensors="pt",  # returns tensors
).to(model.device)  # more efficient to put on device

In [73]:
## inference
outputs = model.generate(**inputs, max_new_tokens=3)
print(tokenizer.decode(outputs[0][inputs['input_ids'].shape[1]:]))

immigration<|im_end|>
