<a href="https://colab.research.google.com/github/nicolaiberk/llm_ws/blob/main/notebooks/06a_rag.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Informed Prompting

In an earlier session, we explored how to query generative models and how these queries can be enriched with examples (or 'context') to give the model more information in one- or few-shot queries. In those cases, we provided the *same* context regardless of the query. Today, we will see that model responses can be substantially improved by carefully selecting the context provided to the model.

> ❗ ACTIVATE THE GPU BY SELECTING RUNTIME IN THE UPPER RIGHT > CONNECT TO RUNTIME > T4 GPU

In [1]:
!pip install sentence_transformers datasets faiss-gpu-cu12 transformers



> ❗ RESTART THE NOTEBOOK (DROPDOWN NEXT TO RUN ALL > RESTART SESSION)

The [sentence-transformers](https://sbert.net/) library provides an ecosystem of models designed specifically for efficient embedding generation. It works very similarly to `transformers`:

In [2]:
from sentence_transformers import SentenceTransformer, CrossEncoder

We load a pretrained model:

In [3]:
similarity_model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")


Then we define some sentences of interest:

In [4]:
sentences = [
    "The Great Wall of China was built over several dynasties, with most of the existing structure dating from the Ming Dynasty (1368-1644).",
    "The blue whale's heart alone can weigh as much as an automobile and is roughly the size of a small car.",
    "Studies show that the Dunning-Kruger effect causes people with low ability in a domain to overestimate their competence in that area.",
]

And encode them as embeddings:

In [5]:
# 2. Calculate embeddings by calling model.encode()
embeddings = similarity_model.encode(sentences)
print(embeddings.shape)

(3, 384)


We can then calculate the cosine similarity of the sentences with each other:

In [6]:
# 3. Calculate the embedding similarities
similarities = similarity_model.similarity(embeddings, embeddings)
print(similarities)

tensor([[ 1.0000, -0.0797, -0.0810],
        [-0.0797,  1.0000,  0.0047],
        [-0.0810,  0.0047,  1.0000]])
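To make the numbers in this matrix concrete: cosine similarity is the dot product of two vectors divided by the product of their lengths. The following is a minimal sketch in plain Python; the three toy vectors are invented for illustration and are not real embeddings (those have 384 dimensions here).

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors, invented for illustration only
v1 = [1.0, 0.0, 0.0]
v2 = [1.0, 1.0, 0.0]
v3 = [0.0, 0.0, 2.0]

print(round(cosine_similarity(v1, v1), 4))  # same direction
print(round(cosine_similarity(v1, v2), 4))  # 45 degrees apart
print(round(cosine_similarity(v1, v3), 4))  # orthogonal vectors
```

Values close to 1 mean the vectors point in nearly the same direction, and values near 0 mean they are unrelated, which is why the diagonal of the matrix above is 1 and the off-diagonal entries for our three unrelated sentences are close to 0.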


## Similarity Search

This is particularly useful if we are searching for something using a query:

In [7]:
query = "How large is a blue whales heart?"
query_embedding = similarity_model.encode([query])  # use the embedding model loaded above
similarities = similarity_model.similarity(query_embedding, embeddings)
print(similarities)

tensor([[ 0.0311,  0.6708, -0.0386]])


Looks good! The second sentence is by far the most similar to the query. Now we can select the most similar context to add to the prompt:

In [8]:
best_index = similarities.squeeze().argmax().item() # get the index of the highest similarity
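With a larger pool of candidates, one often wants the best few matches rather than a single one. Here is a small sketch in plain Python, using the three similarity scores printed above; the helper name `top_k_indices` is hypothetical and not part of any library.

```python
# Similarity scores copied from the output above (query vs. the three sentences)
scores = [0.0311, 0.6708, -0.0386]

def top_k_indices(scores, k):
    """Return the indices of the k highest scores, best first."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]

top2 = top_k_indices(scores, k=2)
print(top2)
```

With `k=1` this reduces to the `argmax` used above.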

We can now add this context to our query, providing the relevant information to our model:



In [20]:
prompt = [
    {"role": "system", "content": "Answer the Question."},
    {"role": "user", "content": query},
    {"role": "system", "content": "Context: " + sentences[best_index]}
]
print(prompt)

[{'role': 'system', 'content': 'Answer the Question.'}, {'role': 'user', 'content': 'How large is a blue whales heart?'}, {'role': 'system', 'content': "Context: The blue whale's heart alone can weigh as much as an automobile and is roughly the size of a small car."}]
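If you build such prompts for many queries, it can help to wrap the assembly in a function. The sketch below is one possible variant, not the layout used above: the helper `build_prompt` is hypothetical, and it places the context in the user turn instead of a second system message.

```python
def build_prompt(query, context):
    """Assemble chat messages with the retrieved context placed before the question."""
    return [
        {"role": "system", "content": "Answer the question using the context."},
        {"role": "user", "content": f"Context: {context}\n\nQuestion: {query}"},
    ]

# Toy inputs taken from the example above
example_prompt = build_prompt(
    "How large is a blue whales heart?",
    "The blue whale's heart alone can weigh as much as an automobile "
    "and is roughly the size of a small car.",
)
for message in example_prompt:
    print(message["role"], "->", message["content"])
```

The cells that follow continue with the `prompt` list defined above.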


Let's provide this prompt to the model and see how it responds:

In [27]:
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
import torch

# Check for GPU availability and set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-360M-Instruct")
model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM2-360M-Instruct", torch_dtype=torch.float16).to(device)

Using device: cuda


In [28]:
inputs = tokenizer.apply_chat_template(
    prompt,
    add_generation_prompt=True,
    padding=True,
    return_dict=True,  # retains attention mask
    return_tensors="pt",  # returns tensors
).to(model.device)  # more efficient to put on device

In [32]:
output = model.generate(**inputs, max_new_tokens=100)

In [33]:
tokenizer.decode(output[0])

"<|im_start|>system\nAnswer the Question.<|im_end|>\n<|im_start|>user\nHow large is a blue whales heart?<|im_end|>\n<|im_start|>system\nContext: The blue whale's heart alone can weigh as much as an automobile and is roughly the size of a small car.<|im_end|>\n<|im_start|>assistant\nA blue whale's heart is approximately 15 feet long and weighs around 1,000 pounds.<|im_end|>"

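The decoded string contains the whole conversation, including the chat markers. A minimal sketch of how the assistant's reply could be pulled out with plain string operations follows; the helper `extract_reply` is hypothetical, and the decoded string is shortened from the output above (the system turns are left out).

```python
# Decoded output shortened from the cell above; system turns are omitted here
decoded = (
    "<|im_start|>user\nHow large is a blue whales heart?<|im_end|>\n"
    "<|im_start|>assistant\nA blue whale's heart is approximately 15 feet long "
    "and weighs around 1,000 pounds.<|im_end|>"
)

def extract_reply(decoded, start="<|im_start|>assistant\n", end="<|im_end|>"):
    """Return the text of the last assistant turn without the chat markers."""
    reply = decoded.rsplit(start, 1)[-1]  # everything after the last assistant marker
    return reply.split(end, 1)[0].strip()  # cut at the end-of-turn marker

answer = extract_reply(decoded)
print(answer)
```

Note that the reply in this run cites figures that do not appear in the supplied context, so it is worth checking generated answers against the context rather than trusting them blindly.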
Selecting context this way is, of course, more useful when you have a larger set of information to choose from. Let's therefore get a mini-version of Wikipedia content to choose the relevant context from. This data is conveniently available on the Hugging Face Hub:

In [60]:
from datasets import load_dataset

dataset = load_dataset("rag-datasets/rag-mini-wikipedia", "text-corpus")

As you can see below, the data consists of different text passages from Wikipedia articles:

In [71]:
dataset['passages'][1234]

{'passage': 'The ears are also used in certain displays of aggression and during the males\' mating period. If an elephant wants to intimidate a predator or rival, it will spread its ears out wide to make itself look more massive and imposing. During the breeding season, males give off an odour from a gland located behind their eyes. Joyce Poole, a well-known elephant researcher, has theorized that the males will fan their ears in an effort to help propel this "elephant cologne" great distances.',
 'id': 1235}

Let's clean this corpus up a little bit and encode all texts to embeddings. We start by writing the cleaning function:

In [72]:
import re

## cleanup function
def clean_text(example):
    text = example["passage"]
    text = re.sub(r"[^a-zA-Z0-9\s.,!?;:'\"-]", "", text)  # remove weird chars
    text = re.sub(r"\s+", " ", text).strip()  # normalize spaces
    example["passage"] = text
    return example
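To see what this function does to a single record, here is a small self-contained check. The sample passage is invented for illustration; the function body is repeated only so that the snippet runs on its own.

```python
import re

def clean_text(example):
    text = example["passage"]
    text = re.sub(r"[^a-zA-Z0-9\s.,!?;:'\"-]", "", text)  # drop characters outside the allowed set
    text = re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace
    example["passage"] = text
    return example

# Invented sample with a double space, en dashes, parentheses and a plus sign
sample = {"passage": "Polar  bears weigh 350\u2013650 kg (770\u20131500+ lb).", "id": 0}
cleaned = clean_text(dict(sample))  # copy so the original record stays untouched
print(cleaned["passage"])
```

The en dashes, parentheses and plus sign are removed and the double space is collapsed. One side effect worth knowing about: because removed characters are replaced with nothing, a numeric range written with an en dash and no surrounding spaces is fused into a single number.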

And apply it to our texts:

In [73]:
dataset = dataset.map(clean_text)

Map:   0%|          | 0/3200 [00:00<?, ? examples/s]

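Now we can define a function that embeds each passage with the embedding model loaded earlier:

In [None]:
def embed(example):
    text = example["passage"]  # the text column of this corpus is called "passage"
    example["embedding"] = similarity_model.encode(text)
    return example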

In [None]:
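# Build the corpus that we embed and search below, and use it consistently.
# Note: str(item) stringifies the whole record (passage and id), which is why the
# retrieved docs further down print as dict strings; item["passage"] would keep only the text.
corpus = []
for item in dataset["passages"]:
    text = str(item).strip()
    if text:
        corpus.append(text)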


In [36]:
import faiss



# Embedding model
print("Encoding corpus...")
embedder = SentenceTransformer("all-MiniLM-L6-v2").to(device)
corpus_embeddings = embedder.encode(corpus, convert_to_tensor=True).to(device)
corpus_embeddings_np = corpus_embeddings.cpu().numpy()

# FAISS index
index = faiss.IndexFlatL2(corpus_embeddings_np.shape[1])
index.add(corpus_embeddings_np)

# Reranker model (if uncommented)
# reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2").to(device)

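# Generator (choose one: local HF model or OpenAI).
# Assumption: a local text-generation pipeline wrapping the SmolLM2 model and tokenizer loaded above.
generator = pipeline("text-generation", model=model, tokenizer=tokenizer, device=device)
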
# Embed query
query_embedding = embedder.encode([query], convert_to_tensor=True).to(device).cpu().numpy()

# Retrieve top-k from FAISS
D, I = index.search(query_embedding, k=5)
retrieved_docs = [corpus[idx] for idx in I[0]]

print("Retrieved indices:", I[0])
print("Retrieved docs:")
for doc in retrieved_docs:
    print("-", repr(doc))

Encoding corpus...
Retrieved indices: [1240 2594 3181 2590  953]
Retrieved docs:
- '{\'passage\': "With a mass just over 5 kg (11 lb), elephant brains are larger than those of any land animal, and although the largest whales have body masses twentyfold those of a typical elephant, whale brains are barely twice the mass of an elephant\'s. A wide variety of behaviour, including those associated with grief, making music, art, altruism, allomothering, play, use of tools,     compassion and self-awareness      evidence a highly intelligent species on par with cetaceans    and primates   .", \'id\': 1241}'
- "{'passage': 'Polar bears rank with the Kodiak bear as among the largest living land carnivores, and male polar bears may weigh twice as much as a Siberian tiger. Most adult males weigh 350 650 kg (770 1500+ lb) and measure 2.5 3.0 m (8.2 9.8 ft) in length. Adult females are roughly half the size of males and normally weigh 150 250 kg (330 550 lb), measuring 2 2.5 m (6.6 8.2 ft), but dou
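`IndexFlatL2` performs an exhaustive search and reports squared Euclidean (L2) distances, so smaller values in `D` mean closer matches. The sketch below mimics that behaviour in plain Python on invented two-dimensional unit vectors; `l2_search` is a hypothetical helper, not a FAISS function.

```python
def l2_search(query_vec, vectors, k):
    """Exhaustive nearest-neighbour search by squared L2 distance, closest first."""
    dists = [sum((q - v) ** 2 for q, v in zip(query_vec, vec)) for vec in vectors]
    order = sorted(range(len(vectors)), key=lambda i: dists[i])[:k]
    return [dists[i] for i in order], order

# Toy unit-length vectors, invented for illustration
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
toy_query = [0.8, 0.6]

distances, indices = l2_search(toy_query, toy_vectors, k=2)
print(indices)
print([round(d, 2) for d in distances])
```

For unit-length vectors, the squared L2 distance equals 2 minus twice the cosine similarity, so ranking by L2 distance and ranking by cosine similarity give the same order. This relies on the embeddings being normalized, which is an assumption you may want to verify for the model you use.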

In [None]:
# # Rerank
# rerank_pairs = [[str(query), str(doc)] for doc in retrieved_docs]
# scores = reranker.predict(rerank_pairs)
# reranked_docs = [doc for _, doc in sorted(zip(scores, retrieved_docs), reverse=True)]
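The reranking step above boils down to sorting the retrieved documents by a second set of relevance scores. Here is a minimal sketch of just that sorting logic; the documents and scores are invented placeholders, not output of the cross-encoder.

```python
# Placeholder documents and invented scores, standing in for reranker.predict(...)
docs = ["doc about elephants", "doc about whale hearts", "doc about polar bears"]
scores = [0.12, 0.91, 0.05]

# Sort by score only (the key avoids comparing documents when scores tie)
ranked = sorted(zip(scores, docs), key=lambda pair: pair[0], reverse=True)
reranked_docs = [doc for _, doc in ranked]
print(reranked_docs)
```

Passing an explicit `key` is a small design choice: without it, tied scores would fall back to comparing the documents themselves.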

In [None]:
# Combine for context
context = "\n\n".join(retrieved_docs[:2])
prompt = f"""Answer the following question using the provided context.\n\nContext:\n{context}\n\nQuestion: {query}\nAnswer:"""

# Generate
response = generator(prompt)[0]["generated_text"]


In [None]:
response

'Answer the following question using the provided context.\n\nContext:\n{\'passage\': "With a mass just over 5 kg (11 lb), elephant brains are larger than those of any land animal, and although the largest whales have body masses twentyfold those of a typical elephant, whale brains are barely twice the mass of an elephant\'s. A wide variety of behaviour, including those associated with grief, making music, art, altruism, allomothering, play, use of tools,     compassion and self-awareness      evidence a highly intelligent species on par with cetaceans    and primates   .", \'id\': 1241}\n\n{\'passage\': \'Polar bears rank with the Kodiak bear as among the largest living land carnivores, and male polar bears may weigh twice as much as a Siberian tiger. Most adult males weigh 350 650 kg (770 1500+ lb) and measure 2.5 3.0 m (8.2 9.8 ft) in length. Adult females are roughly half the size of males and normally weigh 150 250 kg (330 550 lb), measuring 2 2.5 m (6.6 8.2 ft), but double their 

In [None]:
generator(prompt)

[{'generated_text': 'Answer the following question using the provided context.\n\nContext:\n{\'passage\': "With a mass just over 5 kg (11 lb), elephant brains are larger than those of any land animal, and although the largest whales have body masses twentyfold those of a typical elephant, whale brains are barely twice the mass of an elephant\'s. A wide variety of behaviour, including those associated with grief, making music, art, altruism, allomothering, play, use of tools,     compassion and self-awareness      evidence a highly intelligent species on par with cetaceans    and primates   .", \'id\': 1241}\n\n{\'passage\': \'Polar bears rank with the Kodiak bear as among the largest living land carnivores, and male polar bears may weigh twice as much as a Siberian tiger. Most adult males weigh 350 650 kg (770 1500+ lb) and measure 2.5 3.0 m (8.2 9.8 ft) in length. Adult females are roughly half the size of males and normally weigh 150 250 kg (330 550 lb), measuring 2 2.5 m (6.6 8.2 ft

In [None]:
response.split("Answer:")

['Answer the following question using the provided context.\n\nContext:\n{\'passage\': "With a mass just over 5 kg (11 lb), elephant brains are larger than those of any land animal, and although the largest whales have body masses twentyfold those of a typical elephant, whale brains are barely twice the mass of an elephant\'s. A wide variety of behaviour, including those associated with grief, making music, art, altruism, allomothering, play, use of tools,     compassion and self-awareness      evidence a highly intelligent species on par with cetaceans    and primates   .", \'id\': 1241}\n\n{\'passage\': \'Polar bears rank with the Kodiak bear as among the largest living land carnivores, and male polar bears may weigh twice as much as a Siberian tiger. Most adult males weigh 350 650 kg (770 1500+ lb) and measure 2.5 3.0 m (8.2 9.8 ft) in length. Adult females are roughly half the size of males and normally weigh 150 250 kg (330 550 lb), measuring 2 2.5 m (6.6 8.2 ft), but double their
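Since the generated text repeats the prompt, the answer has to be separated from it. Splitting on `"Answer:"` works, but it can misfire if that string also occurs in the context or in the answer. A sketch of a slightly more robust variant removes the known prompt from the front instead; the strings below are toy placeholders, not real model output.

```python
# Toy prompt and response; the bracketed parts are placeholders, not real output
toy_prompt = (
    "Answer the following question using the provided context.\n\n"
    "Context:\n(retrieved passages would appear here)\n\n"
    "Question: How large is a blue whales heart?\nAnswer:"
)
toy_response = toy_prompt + " (generated text would appear here)"

def strip_prompt(response, prompt):
    """Remove the prompt from the start of the response if it is present."""
    if response.startswith(prompt):
        return response[len(prompt):].strip()
    return response.strip()

answer_only = strip_prompt(toy_response, toy_prompt)
print(answer_only)
```

The text-generation pipeline in `transformers` also accepts `return_full_text=False`, which returns only the newly generated text and makes this post-processing unnecessary.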