<a href="https://colab.research.google.com/github/lisabecker/nlp-fundamentals/blob/main/0504_llms_and_rag.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Module 5 Use Case - Introduction to Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG)

In this exercise, we'll get familiar with a popular method of employing Large Language Models (LLMs): Retrieval-Augmented Generation (RAG). LLMs like GPT-2 have revolutionized how machines understand and generate human language, with impressive capabilities in text generation. However, on their own they have limited ability to incorporate external, specific knowledge into their responses.

This is where RAG comes into play. By combining the generative capabilities of LLMs with the precision of information retrieval, RAG setups can provide contextually enriched and more accurate responses. Throughout this notebook, you'll get hands-on experience with both LLMs and a simple RAG setup. You'll learn how to define a small knowledge base, embed it, retrieve the most relevant entry for an incoming question, and pass that entry to the LLM as additional context. This exercise will not only deepen your understanding of these models but also equip you with practical skills in implementing them.

## 1. Set up the environment

In [None]:
# Install the necessary packages
!pip install --q transformers==4.36.2 sentence-transformers==2.2.2


## 2. Exploring a Pre-Trained Large Language Model (LLM)

In this section, we initialize and explore the capabilities of GPT-2, a pre-trained Large Language Model (LLM). GPT-2 is well-known for its ability to generate coherent and contextually relevant text.

First, we set up the GPT-2 tokenizer and model. The tokenizer converts our input text into a format that the model can understand (tokenization), and the model is then used to generate a text response.

We also include a special configuration to use the end-of-sequence token as a padding token. This adjustment is necessary because GPT-2 does not have a default padding token, and padding is crucial for handling variable-length inputs.
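To see why a padding token matters, here is a minimal pure-Python sketch of what padding does to a batch of token ID lists. The helper `pad_batch` and the short ID lists are hypothetical and only for illustration; the real tokenizer handles this internally. The sketch pads on the right for simplicity, whereas batched generation with decoder-only models such as GPT-2 usually pads on the left.

```python
EOS_TOKEN_ID = 50256  # GPT-2's end-of-sequence token id, reused here as the padding id

def pad_batch(sequences, pad_id=EOS_TOKEN_ID):
    """Pad token ID lists to equal length and build matching attention masks."""
    longest = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = longest - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask

# Made-up token IDs of different lengths
ids, mask = pad_batch([[10, 11, 12], [20]])
print(ids)
print(mask)
```

The attention mask is what lets the model distinguish real tokens from padding, which is why the tokenizer returns it alongside the input IDs.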

The function `generate_text` is defined to encapsulate the text generation process. It takes a prompt as input, tokenizes it, generates a response using GPT-2, and then decodes this response back into human-readable text.

![LLM Visualisation](https://github.com/lisabecker/nlp-fundamentals/blob/main/graphics/llm.png?raw=true)

In [None]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Initialize GPT-2 model and tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# GPT-2 has no padding token by default, so reuse the end-of-sequence token
tokenizer.pad_token = tokenizer.eos_token

# Function to generate text using GPT-2
def generate_text(prompt, max_length=50):
    # Encode the input (with attention mask), truncating overly long prompts
    encoding = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=max_length)
    input_ids = encoding['input_ids']
    attention_mask = encoding['attention_mask']

    # Generate at most max_length tokens in total (prompt included)
    output_ids = model.generate(
        input_ids,
        attention_mask=attention_mask,
        max_length=max_length,
        pad_token_id=tokenizer.eos_token_id,
    )

    # Decode and return the generated text
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

In [None]:
# Example usage
prompt = "Queen Elizabeth II died on"
print(f"\n=== Prompt ===\n'{prompt}'\n")
response = generate_text(prompt)
print(f"\n=== GPT2 Response ===\n'{response}...'")


=== Prompt ===
'Queen Elizabeth II died on'


=== GPT2 Response ===
'Queen Elizabeth II died on the 15th of July, 1714, at the age of sixty-five.

The first of the three daughters of the Queen Elizabeth II, Elizabeth, was born in 1714, and was the daughter of the...'


As you can see, the answer is factually incorrect. GPT-2 was released in 2019, so its training data cannot contain an event from 2022. More generally, an LLM has no built-in mechanism for looking up facts: it predicts the most likely next token given the text so far, which typically yields grammatically correct but sometimes factually incorrect output. This is where RAG comes into play.
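To make "predicting the next word" concrete, here is a toy sketch. It is a simple bigram counter over an invented corpus, not how GPT-2 works internally (GPT-2 is a neural network over subword tokens), but it shows the same effect: the continuation is fluent yet not grounded in anything the model has seen.

```python
from collections import Counter, defaultdict

# Tiny invented corpus, for illustration only
corpus = "the cat sat on the mat . the dog sat on the rug . the cat ate the fish ."
tokens = corpus.split()

# Count which word follows which
next_word_counts = defaultdict(Counter)
for current, following in zip(tokens, tokens[1:]):
    next_word_counts[current][following] += 1

def continue_text(prompt, steps=4):
    """Greedily append the most frequent follower of the last word."""
    words = prompt.split()
    for _ in range(steps):
        candidates = next_word_counts.get(words[-1])
        if not candidates:
            break
        words.append(candidates.most_common(1)[0][0])
    return " ".join(words)

print(continue_text("the dog"))
```

The result, "the dog sat on the cat", reads naturally but never occurs in the corpus: the model only follows word statistics and has no notion of what is true.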

## 3. Retrieval-Augmented Generation (RAG)

Retrieval-Augmented Generation (RAG) combines the power of language models with external knowledge retrieval to generate responses that are not only contextually relevant but also more likely to be factually accurate. In this section, we will create a simple RAG setup.

<img src="https://github.com/lisabecker/nlp-fundamentals/blob/main/graphics/rag.png?raw=true" width="75%">

### Dataset Definition

We start by defining a simple Q&A dataset. This dataset is a collection of question-answer pairs where each question has a corresponding factual answer. The dataset is represented as a dictionary where questions are keys and answers are values.

This dataset serves as our knowledge base, from which the model will retrieve information to augment its responses.


In [None]:
# Define your dataset
qa_dataset_dict = {
    "When did Queen Elizabeth II. die?": "Queen Elizabeth II. died on September 8th 2022.",
    "Who created the 'Fundamentals of Natural Language Processing' course for O'Reilly?": "Lisa Becker.",
    "Who won the 2022 FIFA World Cup?": "Argentina.",
    # Add more Q&A pairs as needed!
}

# Extracting questions and answers
questions = list(qa_dataset_dict.keys())
answers = list(qa_dataset_dict.values())
print(f"Questions: {questions}")
print(f"Answers: {answers}")

Questions: ['When did Queen Elizabeth II. die?', "Who created the 'Fundamentals of Natural Language Processing' course for O'Reilly?", 'Who won the 2022 FIFA World Cup?']
Answers: ['Queen Elizabeth II. died on September 8th 2022.', 'Lisa Becker.', 'Argentina.']


### 3.1 Embedding Model

To enable our RAG setup to retrieve relevant information from our Q&A dataset, we need to convert our questions into a format that allows for similarity comparison. This process is known as embedding.

We use the `SentenceTransformer` model to generate embeddings for each question in our dataset. These embeddings are high-dimensional vector representations that capture the semantic meaning of the questions.

Once we have these embeddings, we can compare any new incoming question's embedding with them to find the most semantically similar existing question.

In [None]:
from sentence_transformers import SentenceTransformer

# Initialize a sentence transformer model for embeddings
embedder = SentenceTransformer('all-MiniLM-L6-v2')

# Extract questions from the dataset
questions = list(qa_dataset_dict.keys())

# Embedding questions
question_embeddings = embedder.encode(questions)

In [None]:
print(f"Question: {list(qa_dataset_dict.keys())[0]}")
print(f"Embedding dimensions: {len(question_embeddings[0])}")
print(f"Embeddings:\n {question_embeddings[0]}")

Question: When did Queen Elizabeth II. die?
Embedding dimensions: 384
Embeddings:
 [ 5.65683953e-02 -1.36119612e-02  9.75991692e-03  6.66206181e-02
  1.73170958e-02  7.42301624e-03  1.11426355e-03 -1.52329309e-02
 -5.11999354e-02  4.05504033e-02 -7.08348751e-02 -9.96007863e-03
 -2.76151877e-02 -6.15488850e-02 -7.34820263e-03  3.64168845e-02
 -1.82407089e-02 -1.01685524e-02 -6.06066883e-02  4.16703485e-02
  3.53916511e-02  1.46009438e-02  4.23478596e-02  6.47900328e-02
 -1.80922337e-02 -4.05574255e-02 -9.04035866e-02  5.66097796e-02
  2.46191174e-02 -4.26385701e-02 -5.78839518e-02 -6.97086975e-02
 -6.07863143e-02  2.82645710e-02  1.19467629e-02 -9.81370546e-03
  4.25702445e-02  4.75785658e-02 -1.67285893e-02 -4.29072268e-02
 -6.23256639e-02 -2.55048499e-02  7.53114140e-03  3.82317007e-02
  4.61017899e-02  7.28345802e-03 -3.10447197e-02  6.07980751e-02
  2.05025878e-02 -9.08783078e-03  3.26229353e-03  3.46134230e-03
  2.20970344e-03 -7.34083503e-02  2.35510282e-02 -1.40735498e-02
 -7.024

### 3.2 Cosine Similarity

With our questions embedded, the next step is to embed an incoming question and compare it to our dataset. We use cosine similarity for this comparison.

Cosine similarity measures the cosine of the angle between two vectors: in our case, the embedding of the incoming question and the embedding of each question in our dataset. Scores range from -1 (opposite direction) to 1 (same direction), so the dataset question with the highest score is the one most similar to the incoming question.

We print the cosine similarity scores for illustration, showing how similar each dataset question is to the incoming question.
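The definition can be written out by hand. For two vectors $u$ and $v$, the cosine similarity is $\frac{u \cdot v}{\|u\| \, \|v\|}$. The following pure-Python sketch (the function name `cosine` and the two-dimensional vectors are chosen for illustration only) shows the three reference cases:

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine([1.0, 0.0], [2.0, 0.0]))   # same direction
print(cosine([1.0, 0.0], [0.0, 3.0]))   # orthogonal
print(cosine([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
```

Note that the length of the vectors does not matter, only their direction, which is why `[1.0, 0.0]` and `[2.0, 0.0]` count as identical.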



![Cosine Similarity Visualisation](https://github.com/lisabecker/nlp-fundamentals/blob/main/graphics/cosine_similarity.png?raw=true)

In [None]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Embed an incoming question
incoming_question = "Queen Elizabeth II died on"
incoming_question_embedding = embedder.encode([incoming_question])
print(f"{incoming_question}\n")

# Compute cosine similarity
cos_similarities = cosine_similarity(incoming_question_embedding, question_embeddings)
for index, q in enumerate(questions):
    print(f"Question: {q}\nCosine Similarity: {cos_similarities[0][index]}\n")

Queen Elizabeth II died on

Question: When did Queen Elizabeth II. die?
Cosine Similarity: 0.8671935796737671

Question: Who created the 'Fundamentals of Natural Language Processing' course for O'Reilly?
Cosine Similarity: -0.06880398839712143

Question: Who won the 2022 FIFA World Cup?
Cosine Similarity: 0.1470877081155777



### 3.3 Finding the Most Relevant Q&A Pair

After calculating the cosine similarities, we identify the question in our dataset that has the highest similarity score with the incoming question. This is our most relevant question, and its corresponding answer is likely to contain pertinent information related to the incoming question.

We then extract this most relevant question and its answer from our dataset, as they will be used to generate an informed response.
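One caveat: picking the highest score always returns some entry, even when no dataset question is close to the incoming one. The sketch below shows one possible guard. The helper `pick_best` and the cutoff `min_score=0.5` are assumptions for illustration, not part of the notebook; a suitable cutoff depends on the embedding model and the data.

```python
def pick_best(scores, min_score=0.5):
    """Return the index of the highest score, or None if it is below min_score."""
    best_idx = max(range(len(scores)), key=lambda i: scores[i])
    if scores[best_idx] < min_score:
        return None  # nothing in the knowledge base is similar enough
    return best_idx

# Scores rounded from the cosine similarity output in section 3.2
print(pick_best([0.867, -0.069, 0.147]))

# Hypothetical scores for a question unrelated to the dataset
print(pick_best([0.12, 0.05, 0.2]))
```

When the function returns `None`, a caller could fall back to the plain LLM or report that no relevant entry was found.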


In [None]:
# Find the index of the most relevant question
# cos_similarities has shape (1, number of questions), so take its only row
most_relevant_idx = np.argmax(cos_similarities[0])

# Extract the most relevant question and its answer from your dictionary
relevant_question = questions[most_relevant_idx]
relevant_answer = answers[most_relevant_idx]
print(f"Relevant question: {relevant_question}")
print(f"Relevant answer: {relevant_answer}")

Relevant question: When did Queen Elizabeth II. die?
Relevant answer: Queen Elizabeth II. died on September 8th 2022.


### 3.4 Generate Answer with Additional Knowledge

Finally, we combine the answer to the most relevant question from our dataset with the incoming question. This combined input is then fed into the GPT-2 model to generate a response.

By doing this, we combine the retrieved factual information (the answer from the dataset) with the text generation of the LLM (GPT-2), so the response is informed by external knowledge. As the output below shows, GPT-2 now repeats the retrieved date instead of inventing one, although a model this small tends to loop on the same sentence. This process exemplifies a simple RAG system.

In [None]:
# Combine the relevant answer and incoming question
combined_input = relevant_answer + " " + incoming_question

# Generate a response using the LLM
extended_response = generate_text(combined_input)
print(extended_response)


Queen Elizabeth II. died on September 8th 2022. Queen Elizabeth II died on September 8th 2022.

The Queen Elizabeth II. died on September 8th 2022. Queen Elizabeth II died on September 8th 2022.

The Queen Elizabeth
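The steps of section 3 can be collected into a single function. The sketch below is one possible arrangement, not part of the original notebook: `rag_answer` takes the embedding and generation steps as arguments, and `toy_embed` and `echo_generate` are hypothetical stand-ins so that the example runs without downloading any model.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rag_answer(question, knowledge, embed, generate):
    """Retrieve the best-matching answer and prepend it to the question."""
    query_vec = embed(question)
    # Retrieval: the stored question most similar to the incoming one
    best_question = max(knowledge, key=lambda q: cosine(query_vec, embed(q)))
    # Augmentation: retrieved answer first, then the incoming question
    combined_input = knowledge[best_question] + " " + question
    # Generation: hand the combined text to the language model
    return generate(combined_input)

# Hypothetical stand-ins for the embedding model and the LLM
KEYWORDS = ["queen", "cup"]

def toy_embed(text):
    lowered = text.lower()
    return [1.0 if word in lowered else 0.0 for word in KEYWORDS]

def echo_generate(prompt):
    return prompt  # a real LLM would continue the prompt

knowledge = {
    "When did Queen Elizabeth II. die?": "Queen Elizabeth II. died on September 8th 2022.",
    "Who won the 2022 FIFA World Cup?": "Argentina.",
}
print(rag_answer("Queen Elizabeth II died on", knowledge, toy_embed, echo_generate))
```

In the notebook, passing `embedder.encode` as `embed` and `generate_text` as `generate` should reproduce the pipeline above, at the cost of re-embedding the stored questions on every call.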


## 4. Additional Resources

- [Hugging Face Transformers](https://huggingface.co/docs/transformers/index)
- [Sentence-Transformers](https://www.sbert.net/)
- [Cosine Similarity](https://en.wikipedia.org/wiki/Cosine_similarity)