<a href="https://colab.research.google.com/github/nicolaiberk/llm_ws/blob/main/notebooks/06a_rag.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Informed Prompting

In an earlier session, we explored how to query generative models and how these queries can be enriched with examples (or 'context') to give the model more information in one- or few-shot queries. In those cases, we provided the *same* context regardless of the query. Today, we will see that model responses can be substantially improved by carefully selecting the context provided to the model.

> ❗ ACTIVATE THE GPU BY SELECTING RUNTIME IN THE UPPER RIGHT > CONNECT TO RUNTIME > T4 GPU

In [None]:
!pip install sentence_transformers datasets faiss-gpu-cu12 transformers torch

> ❗ RESTART THE NOTEBOOK (DROPDOWN NEXT TO RUN ALL > RESTART SESSION)

The [sentence-transformers](https://sbert.net/) library provides an ecosystem of models designed specifically for efficient embedding generation. It works very similarly to transformers:

In [None]:
from sentence_transformers import SentenceTransformer, CrossEncoder
import torch

# Check for GPU availability and set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

We load a pretrained model:

In [None]:
similarity_model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2").to(device)

Then we define some sentences of interest:

In [None]:
sentences = [
    "The Great Wall of China was built over several dynasties, with most of the existing structure dating from the Ming Dynasty (1368-1644).",
    "The blue whale's heart alone can weigh as much as an automobile and is roughly the size of a small car.",
    "Studies show that the Dunning-Kruger effect causes people with low ability in a domain to overestimate their competence in that area.",
]

And encode them as embeddings:

In [None]:

# Calculate embeddings by calling similarity_model.encode()
embeddings = similarity_model.encode(sentences)
print(embeddings.shape)

We can then calculate the cosine similarity of the sentences with each other:

In [None]:
# Calculate the pairwise cosine similarities of the embeddings
similarities = similarity_model.similarity(embeddings, embeddings)
print(similarities)
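To make the similarity computation less of a black box, here is a minimal sketch of cosine similarity written with plain NumPy. The function name `cosine_similarity_matrix` and the two-dimensional toy vectors are made up for illustration; real embeddings have many more dimensions, and the library's own implementation may differ in detail.

```python
import numpy as np

def cosine_similarity_matrix(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Return the cosine similarity between every row of a and every row of b."""
    # Scale each row to unit length, so the dot product equals the cosine of the angle.
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_unit @ b_unit.T

# Toy 2-dimensional "embeddings" (made up for illustration).
toy = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]])
toy_similarities = cosine_similarity_matrix(toy, toy)
print(toy_similarities.round(3))
```

Each vector has similarity 1 with itself, the two perpendicular vectors have similarity 0, and the vector length plays no role: only the direction matters.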

## Similarity Search

This is particularly useful if we are searching for something using a query:

In [None]:
query = "How large is a blue whale's heart?"
query_embedding = similarity_model.encode([query])
similarities = similarity_model.similarity(query_embedding, embeddings)
print(similarities)

Looks good! Now we can select the most similar context to add to the prompt:

In [None]:
best_index = similarities.squeeze().argmax().item() # get the index of the highest similarity
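Taking the single best match is the simplest case. If several candidates should be kept, the same idea extends to a top-k selection. The sketch below uses made-up scores and an arbitrary `k`; it is one possible way to do this, not part of the original notebook.

```python
import numpy as np

# Made-up similarity scores of one query against five candidate sentences.
toy_scores = np.array([0.12, 0.81, 0.05, 0.64, 0.33])

k = 3  # number of candidates to keep (an arbitrary choice for this sketch)
top_k_indices = np.argsort(toy_scores)[::-1][:k]  # sort descending, keep the first k
print(top_k_indices.tolist())
```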

We can now add this context to our query, providing the relevant information to our model:



In [None]:
prompt = [
    {"role": "system", "content": "Answer the Question."},
    {"role": "user", "content": query},
    {"role": "system", "content": "Context: " + sentences[best_index]}
]
print(prompt)
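Since we will build prompts like this repeatedly, it can help to wrap the pattern in a small function. `build_prompt` below is a hypothetical helper (not part of any library) that mirrors the message structure used above; the example strings are made up.

```python
def build_prompt(query: str, context: str) -> list[dict[str, str]]:
    """Hypothetical helper: wrap a question and its retrieved context as chat messages."""
    return [
        {"role": "system", "content": "Answer the Question."},
        {"role": "user", "content": query},
        {"role": "system", "content": "Context: " + context},
    ]

example_prompt = build_prompt(
    "How large is a blue whale's heart?",
    "The blue whale's heart is roughly the size of a small car.",
)
for message in example_prompt:
    print(message["role"], "->", message["content"])
```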

Let's provide this prompt to the model and see how it responds (it will take a moment to load the model):

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
import torch

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-1.7B-Instruct")
model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM2-1.7B-Instruct", dtype=torch.float16).to(device)

In [None]:
inputs = tokenizer.apply_chat_template(
    prompt,
    add_generation_prompt=True,
    padding=True,
    return_dict=True,  # retains attention mask
    return_tensors="pt",  # returns PyTorch tensors
).to(model.device)  # move the inputs to the same device as the model

In [None]:
output = model.generate(**inputs, max_new_tokens=100)

In [None]:
tokenizer.decode(output[0])
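For causal language models, the generated sequence starts with the prompt tokens, so the decoded string above contains the whole conversation and not only the answer. A minimal sketch of how the answer could be isolated, using made-up token ids and a hypothetical helper `new_tokens_only`:

```python
def new_tokens_only(output_ids: list[int], prompt_length: int) -> list[int]:
    """Hypothetical helper: drop the prompt tokens that precede the generated answer."""
    return output_ids[prompt_length:]

# Made-up token ids: the first four stand for the prompt, the rest for the answer.
toy_output = [101, 7, 8, 9, 42, 43, 44]
answer_ids = new_tokens_only(toy_output, prompt_length=4)
print(answer_ids)

# With the notebook's variables, the same idea would read roughly:
# prompt_length = inputs["input_ids"].shape[1]
# tokenizer.decode(output[0][prompt_length:], skip_special_tokens=True)
```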

## Retrieval-Augmented Generation

This is, of course, more useful when you have a larger set of information from which to choose the context. Let's therefore get a mini-version of Wikipedia content to select the relevant context from. This data is conveniently available on the Hugging Face Hub:

In [None]:
from datasets import load_dataset

dataset = load_dataset("rag-datasets/rag-mini-wikipedia", "text-corpus")

As you can see below, the data consists of different text passages from Wikipedia articles:

In [None]:
dataset['passages'][1234]

Let's clean this corpus up a little bit and encode all texts to embeddings. We start by writing a cleaning function that removes unusual characters and normalizes whitespace:

In [None]:
import re

## cleanup function
def clean_text(example):
    text = example["passage"]
    text = re.sub(r"[^a-zA-Z0-9\s.,!?;:'\"-]", "", text)  # remove weird chars
    text = re.sub(r"\s+", " ", text).strip()  # normalize spaces
    example["passage"] = text
    return example
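To see what the cleanup does in practice, here is a small sketch that applies the same two substitutions to a made-up passage. The function is repeated so that the snippet runs on its own.

```python
import re

def clean_text(example):
    # Same steps as the cleanup function above, repeated so this snippet is self-contained.
    text = example["passage"]
    text = re.sub(r"[^a-zA-Z0-9\s.,!?;:'\"-]", "", text)  # drop characters outside the allowed set
    text = re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace
    example["passage"] = text
    return example

sample = {"passage": "  Café   (est. 1999)  \n serves tea! "}
cleaned = clean_text(sample)
print(cleaned["passage"])
```

Note that the character filter also drops accented letters and parentheses (`Café` becomes `Caf`). That is acceptable for a quick demo, but a gentler filter may be preferable for corpora with many non-ASCII names.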

And apply it to our texts:

In [None]:
dataset = dataset.map(clean_text)

Lastly, we remove texts that are empty after cleaning:

In [None]:
dataset = dataset.filter(lambda example: example["passage"].strip() != "")

Now we can use the embedding model from above to generate the embeddings:

In [None]:
corpus_embeddings = similarity_model.encode([text for text in dataset["passages"]['passage']], convert_to_tensor=True).cpu().numpy()

In [None]:
corpus_embeddings.shape # we get our vectors

We then use a library called `faiss` to search our vectors efficiently. This matters most when the context dataset is large.

In [None]:
import faiss

# FAISS index
index = faiss.IndexFlatL2(corpus_embeddings.shape[1])
index.add(corpus_embeddings)
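As far as the faiss documentation describes it, a flat L2 index performs an exact, exhaustive comparison and reports squared Euclidean distances. The NumPy sketch below imitates that behaviour on made-up vectors; `flat_l2_search` is a hypothetical stand-in, not the faiss implementation.

```python
import numpy as np

def flat_l2_search(corpus: np.ndarray, queries: np.ndarray, k: int):
    """Hypothetical stand-in for an exact L2 search: compare every query with every corpus vector."""
    # Pairwise squared Euclidean distances, shape (n_queries, n_corpus).
    diffs = queries[:, None, :] - corpus[None, :, :]
    squared_distances = (diffs ** 2).sum(axis=2)
    # Indices of the k closest corpus vectors per query, closest first.
    nearest = np.argsort(squared_distances, axis=1)[:, :k]
    return np.take_along_axis(squared_distances, nearest, axis=1), nearest

# Made-up unit-length vectors standing in for sentence embeddings.
toy_corpus = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]], dtype="float32")
toy_query = np.array([[0.8, 0.6]], dtype="float32")

distances, indices = flat_l2_search(toy_corpus, toy_query, k=2)
print(indices)    # positions of the two nearest corpus vectors
print(distances)  # their squared L2 distances
```

For unit-length vectors, the squared L2 distance equals 2 minus twice the cosine similarity, so ranking by smallest distance gives the same order as ranking by largest cosine similarity.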

In [None]:
# Encode the query and convert it to a NumPy array, which is what FAISS expects
query_embedding = similarity_model.encode([query], convert_to_tensor=True).cpu().numpy()

In [None]:
# Retrieve top-k from FAISS
D, I = index.search(query_embedding, k=5)
retrieved_docs = [dataset['passages'][int(idx)]['passage'] for idx in I[0]]

In [None]:
context = '\n'.join(retrieved_docs)
context
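Joining all retrieved passages works for five short texts, but the prompt can grow quickly with larger `k` or longer passages. One possible refinement, sketched below with a hypothetical helper `format_context` and made-up passages, is to number the passages and stop once a character budget is reached. The budget value is an arbitrary assumption.

```python
def format_context(docs: list[str], max_chars: int = 2000) -> str:
    """Hypothetical helper: number the passages and stop before exceeding a character budget."""
    lines = []
    used = 0
    for rank, doc in enumerate(docs, start=1):
        line = f"[{rank}] {doc}"
        if used + len(line) > max_chars:
            break  # keep the highest-ranked passages, drop the rest
        lines.append(line)
        used += len(line) + 1  # +1 for the newline that joins the lines
    return "\n".join(lines)

toy_docs = [
    "Blue whales are the largest animals.",
    "Their hearts are very large.",
    "Unrelated passage.",
]
formatted = format_context(toy_docs, max_chars=80)
print(formatted)
```

Because the passages arrive sorted by distance, cutting from the end removes the least relevant ones first.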

In [None]:
prompt = [
    {"role": "system", "content": "Answer the Question. If no relevant information is provided in the context, respond with 'I cannot answer this question based on the provided context'."},
    {"role": "user", "content": query},
    # "context" is not a standard chat role, so we pass the passages as a system message, as in the first prompt
    {"role": "system", "content": "Context: " + context}
]

Tokenize the chat template and provide it to the model:

In [None]:
inputs = tokenizer.apply_chat_template(
    prompt,
    add_generation_prompt=True,
    padding=True,
    return_dict=True,  # retains attention mask
    return_tensors="pt",  # returns PyTorch tensors
).to(model.device)  # move the inputs to the same device as the model

In [None]:
output = model.generate(**inputs, max_new_tokens=1000)

In [None]:
print(tokenizer.decode(output[0]))