<a href="https://colab.research.google.com/github/lisabecker/nlp-fundamentals/blob/main/0504_llms.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to Retrieval-Augmented Generation (RAG) and Large Language Models (LLMs)

This section provides an overview of what LLMs are and introduces the concept of Retrieval-Augmented Generation (RAG).

## 1. Setting up the environment

In [1]:
# Install the necessary packages
!python3 -m pip install --quiet transformers sentence-transformers

## 2. Exploring a Pre-Trained Large Language Model (LLM)

In this section, we initialize and explore GPT-2, a pre-trained Large Language Model (LLM). GPT-2 is well known for generating fluent, contextually plausible text, although fluency does not guarantee factual accuracy.

First, we set up the GPT-2 tokenizer and model. The tokenizer converts our input text into a sequence of integer token IDs that the model can process (tokenization), and the model then generates a continuation of the prompt.

GPT-2 has no default padding token, and padding matters when batching variable-length inputs. The code below does not set one explicitly, so `generate` falls back to the end-of-sequence token (ID 50256) as the padding token and logs a message to that effect.
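
To make the role of padding concrete, here is a minimal pure-Python sketch of what batching variable-length inputs involves. The token IDs are made up for illustration, and `pad_batch` is a hypothetical helper rather than a `transformers` function; only the ID 50256 (GPT-2's end-of-sequence token, as reported in this notebook's log output) comes from the source.

```python
# Hypothetical helper: pads token-ID lists to a common length and builds
# the matching attention masks (1 = real token, 0 = padding).
EOS_TOKEN_ID = 50256  # GPT-2's end-of-sequence ID, reused here as the pad ID


def pad_batch(sequences, pad_id=EOS_TOKEN_ID):
    longest = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = longest - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


# Made-up token IDs for two prompts of different lengths
batch = [[101, 102, 103, 104], [201, 202]]
ids, mask = pad_batch(batch)
print(ids)   # [[101, 102, 103, 104], [201, 202, 50256, 50256]]
print(mask)  # [[1, 1, 1, 1], [1, 1, 0, 0]]
```

The attention mask is what tells the model to ignore the padded positions. This sketch pads on the right for readability; for batched generation with a decoder-only model such as GPT-2, padding on the left is usually preferred.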

The function `generate_text` is defined to encapsulate the text generation process. It takes a prompt as input, tokenizes it, generates a response using GPT-2, and then decodes this response back into human-readable text.

![LLM Visualisation](https://github.com/lisabecker/nlp-fundamentals/blob/main/graphics/llm.png?raw=true)

In [2]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Initialize GPT-2 model and tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Function to generate text using GPT-2
def generate_text(prompt, max_length=50):
    # Encode the input with attention mask
    encoding = tokenizer(prompt, return_tensors="pt", max_length=max_length)
    input_ids = encoding['input_ids']
    attention_mask = encoding['attention_mask']

    # Generate response using the model
    output_ids = model.generate(input_ids, attention_mask=attention_mask, max_length=max_length)

    # Decode and return the generated text
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

# Example usage
prompt = "Queen Elizabeth II died on"
response = generate_text(prompt)
print(f"\nGPT2 Response: '{response}...'")

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.



GPT2 Response: 'Queen Elizabeth II died on the 15th of July, 1714, at the age of sixty-five.

The first of the three daughters of the Queen Elizabeth II, Elizabeth, was born in 1714, and was the daughter of the...'
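
Conceptually, `generate_text` runs a tokenize, predict, decode loop. The sketch below imitates that loop with a toy word-level vocabulary and a hand-written next-token table. `VOCAB`, `NEXT_TOKEN`, and `toy_generate` are all hypothetical stand-ins: GPT-2 itself uses subword tokens and a neural network rather than a lookup table.

```python
# Toy stand-ins (assumptions for illustration, not GPT-2 internals)
VOCAB = ["<eos>", "the", "cat", "sat", "down"]
TOKEN_TO_ID = {token: idx for idx, token in enumerate(VOCAB)}

# Hand-written "model": maps the last token ID to the most likely next one
NEXT_TOKEN = {1: 2, 2: 3, 3: 4, 4: 0}  # the -> cat -> sat -> down -> <eos>


def toy_generate(prompt, max_length=10):
    # 1. Tokenize: text -> token IDs
    ids = [TOKEN_TO_ID[word] for word in prompt.split()]
    # 2. Generate: greedily append the next token until <eos> or max_length
    while len(ids) < max_length:
        next_id = NEXT_TOKEN[ids[-1]]
        if next_id == TOKEN_TO_ID["<eos>"]:
            break
        ids.append(next_id)
    # 3. Decode: token IDs -> text
    return " ".join(VOCAB[i] for i in ids)


print(toy_generate("the cat"))                # the cat sat down
print(toy_generate("the cat", max_length=3))  # the cat sat
```

As in the notebook's `generate` call, `max_length` here counts the prompt tokens as well as the generated ones.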


## 3. Retrieval-Augmented Generation (RAG)

Retrieval-Augmented Generation (RAG) combines a language model with external knowledge retrieval, so that responses are grounded in retrieved facts rather than relying only on what the model memorized during training. This matters because, as the GPT-2 output above shows, a fluent completion can still be factually wrong. In this section, we will create a simple RAG setup.

![RAG Visualisation](https://github.com/lisabecker/nlp-fundamentals/blob/main/graphics/rag.png?raw=true)

### Dataset Definition

We start by defining a simple Q&A dataset: a dictionary in which each key is a question and each value is the corresponding factual answer.

This dataset serves as our knowledge base, from which the model will retrieve information to augment its responses.


In [4]:
# Define your dataset
qa_dataset_dict = {
    "When did Queen Elizabeth II. die?": "Queen Elizabeth II. died on September 8th 2022.",
    "Who created the 'Fundamentals of Natural Language Processing' course for O'Reilly?": "Lisa Becker.",
    "Who won the 2022 FIFA World Cup?": "Argentina.",
    # Add more Q&A pairs as needed!
}

# Extracting questions and answers
questions = list(qa_dataset_dict.keys())
answers = list(qa_dataset_dict.values())
print(f"Questions: {questions}")
print(f"Answers: {answers}")

Questions: ['When did Queen Elizabeth II. die?', "Who created the 'Fundamentals of Natural Language Processing' course for O'Reilly?", 'Who won the 2022 FIFA World Cup?']
Answers: ['Queen Elizabeth II. died on September 8th 2022.', 'Lisa Becker.', 'Argentina.']


### 3.1 Embedding Model

To enable our RAG setup to retrieve relevant information from our Q&A dataset, we need to convert our questions into a format that allows for similarity comparison. This process is known as embedding.

We use the `all-MiniLM-L6-v2` model from the `sentence-transformers` library to generate an embedding for each question in our dataset. Each embedding is a 384-dimensional vector that captures the semantic meaning of the question.

Once we have these embeddings, we can compare any new incoming question's embedding with them to find the most semantically similar existing question.
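
As a rough intuition for what "text as a vector" means, the sketch below builds simple word-count vectors over a shared vocabulary. This is a deliberately crude stand-in: `tokenize` and `bag_of_words` are hypothetical helpers, and a sentence-transformer produces dense learned vectors that capture meaning rather than counting words.

```python
import re


def tokenize(text):
    # Lowercase and keep alphanumeric runs only
    return re.findall(r"[a-z0-9]+", text.lower())


def bag_of_words(texts):
    # Shared, sorted vocabulary across all texts
    vocab = sorted({word for text in texts for word in tokenize(text)})
    vectors = []
    for text in texts:
        words = tokenize(text)
        vectors.append([words.count(term) for term in vocab])
    return vocab, vectors


vocab, vectors = bag_of_words([
    "Who won the World Cup?",
    "Who won the cup?",
])
print(vocab)    # ['cup', 'the', 'who', 'won', 'world']
print(vectors)  # [[1, 1, 1, 1, 1], [1, 1, 1, 1, 0]]
```

Every text maps to a vector of the same length, which is the property that makes a numeric similarity comparison possible.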

In [5]:
from sentence_transformers import SentenceTransformer

# Initialize a sentence transformer model for embeddings
embedder = SentenceTransformer('all-MiniLM-L6-v2')

# Extract questions from the dataset
questions = list(qa_dataset_dict.keys())

# Embedding questions
question_embeddings = embedder.encode(questions)

In [6]:
print(f"Question: {list(qa_dataset_dict.keys())[0]}")
print(f"Embedding dimensions: {len(question_embeddings[0])}")
print(f"Embeddings:\n {question_embeddings[0]}")

Question: When did Queen Elizabeth II. die?
Embedding dimensions: 384
Embeddings:
 [ 5.65683655e-02 -1.36119397e-02  9.75994393e-03  6.66206256e-02
  1.73170771e-02  7.42301857e-03  1.11427810e-03 -1.52328601e-02
 -5.11999205e-02  4.05504145e-02 -7.08348826e-02 -9.96009167e-03
 -2.76151896e-02 -6.15489483e-02 -7.34817935e-03  3.64168398e-02
 -1.82406902e-02 -1.01685338e-02 -6.06067143e-02  4.16703038e-02
  3.53916809e-02  1.46008953e-02  4.23478410e-02  6.47900179e-02
 -1.80922579e-02 -4.05573919e-02 -9.04035941e-02  5.66098392e-02
  2.46191192e-02 -4.26385589e-02 -5.78839742e-02 -6.97087497e-02
 -6.07863590e-02  2.82646026e-02  1.19467815e-02 -9.81372688e-03
  4.25702818e-02  4.75785695e-02 -1.67285465e-02 -4.29072306e-02
 -6.23256750e-02 -2.55048871e-02  7.53114000e-03  3.82316783e-02
  4.61017787e-02  7.28343846e-03 -3.10446993e-02  6.07980192e-02
  2.05025449e-02 -9.08780005e-03  3.26228957e-03  3.46131041e-03
  2.20972463e-03 -7.34083578e-02  2.35510189e-02 -1.40735703e-02
 -7.024... (output truncated)

### 3.2 Cosine Similarity

With our questions embedded, the next step is to embed an incoming question and compare it to our dataset. We use cosine similarity for this comparison.

Cosine similarity measures the cosine of the angle between two vectors; in our case, these are the embedding of the incoming question and the embedding of each question in our dataset. The score ranges from -1 (opposite direction) to 1 (same direction), so the dataset question with the highest score is the one most similar to the incoming question.

We print the cosine similarity scores for illustration, showing how similar each dataset question is to the incoming question.
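
For reference, the cosine similarity of two non-zero vectors $a$ and $b$ is $\frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}$. The pure-Python sketch below computes it by hand on made-up two-dimensional vectors; `cosine` is a hypothetical helper written for illustration and assumes neither vector is all zeros.

```python
import math


def cosine(a, b):
    # dot(a, b) / (|a| * |b|); assumes both vectors are non-zero
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


print(cosine([1.0, 0.0], [2.0, 0.0]))   # 1.0  (same direction, different length)
print(cosine([1.0, 0.0], [0.0, 3.0]))   # 0.0  (orthogonal)
print(cosine([1.0, 0.0], [-1.0, 0.0]))  # -1.0 (opposite direction)
```

The first example shows that only direction matters: scaling a vector does not change its cosine similarity to another vector.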



![Cosine Similarity Visualisation](https://github.com/lisabecker/nlp-fundamentals/blob/main/graphics/cosine_similarity.png?raw=true)

In [7]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Embed an incoming question
incoming_question = "Queen Elizabeth II died on"
incoming_question_embedding = embedder.encode([incoming_question])
print(f"{incoming_question}\n")

# Compute cosine similarity
cos_similarities = cosine_similarity(incoming_question_embedding, question_embeddings)
for index, q in enumerate(questions):
    print(f"Question: {q}\nCosine Similarity: {cos_similarities[0][index]}\n")

Queen Elizabeth II died on

Question: When did Queen Elizabeth II. die?
Cosine Similarity: 0.8671935796737671

Question: Who created the 'Fundamentals of Natural Language Processing' course for O'Reilly?
Cosine Similarity: -0.068804070353508

Question: Who won the 2022 FIFA World Cup?
Cosine Similarity: 0.14708779752254486



### 3.3 Finding the Most Relevant Q&A Pair

After calculating the cosine similarities, we identify the question in our dataset that has the highest similarity score with the incoming question. This is our most relevant question, and its corresponding answer is likely to contain pertinent information related to the incoming question.

We then extract this most relevant question and its answer from our dataset, as they will be used to generate an informed response.


In [8]:
# Find the index of the most relevant question
# (row 0 of cos_similarities holds the scores for the incoming question)
most_relevant_idx = int(np.argmax(cos_similarities[0]))

# Extract the most relevant question and its answer from your dictionary
relevant_question = questions[most_relevant_idx]
relevant_answer = answers[most_relevant_idx]
print(f"Relevant question: {relevant_question}")
print(f"Relevant answer: {relevant_answer}")

Relevant question: When did Queen Elizabeth II. die?
Relevant answer: Queen Elizabeth II. died on September 8th 2022.
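
Note that an argmax lookup always returns some entry, even when nothing in the dataset is really related to the incoming question. One possible guard, sketched below in pure Python, is to require a minimum similarity score. The `retrieve` helper, the `kb_answers` name, and the `min_score=0.5` threshold are all assumptions made for illustration; a suitable threshold would have to be chosen empirically.

```python
def retrieve(scores, answers, min_score=0.5):
    # Index of the highest similarity score
    best_idx = max(range(len(scores)), key=lambda i: scores[i])
    if scores[best_idx] < min_score:
        return None  # nothing in the knowledge base is similar enough
    return answers[best_idx]


kb_answers = [
    "Queen Elizabeth II. died on September 8th 2022.",
    "Lisa Becker.",
    "Argentina.",
]

# Scores printed in section 3.2 (rounded)
print(retrieve([0.867, -0.069, 0.147], kb_answers))
# Made-up scores for an unrelated question
print(retrieve([0.10, 0.05, 0.12], kb_answers))
```

The first call returns the Queen Elizabeth answer; the second returns `None`, signalling that the caller should fall back to the plain LLM prompt or decline to answer.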


### 3.4 Generate Answer with Additional Knowledge

Finally, we combine the answer to the most relevant question from our dataset with the incoming question. This combined input is then fed into the GPT-2 model to generate a response.

By doing this, we combine the retrieved fact (the answer from the dataset) with the text-generation ability of the LLM (GPT-2), so the response is informed by external knowledge rather than by the model's training data alone. This is a deliberately minimal RAG setup: as the output shows, GPT-2 now reproduces the correct date, but it also repeats the retrieved sentence rather than producing a polished answer.

In [9]:
# Combine the relevant answer and incoming question
combined_input = relevant_answer + " " + incoming_question

# Generate a response using the LLM
extended_response = generate_text(combined_input)
print(extended_response)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Queen Elizabeth II. died on September 8th 2022. Queen Elizabeth II died on September 8th 2022.

The Queen Elizabeth II. died on September 8th 2022. Queen Elizabeth II died on September 8th 2022.

The Queen Elizabeth
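
The plain concatenation above leaves it to the model to tell the retrieved fact apart from the prompt. One common alternative is to label the parts explicitly. The sketch below shows such a template; `build_prompt` and its exact wording are hypothetical, and whether a small model like GPT-2 actually follows this format is not something this notebook tests.

```python
def build_prompt(context, question):
    # Hypothetical template; the exact wording is an assumption
    return f"Context: {context}\nQuestion: {question}\nAnswer:"


prompt = build_prompt(
    "Queen Elizabeth II. died on September 8th 2022.",
    "When did Queen Elizabeth II die?",
)
print(prompt)
```

The resulting string could be passed to `generate_text` in place of `combined_input`, with a larger `max_length` if the prompt gets long.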


## 4. Additional Resources

- [Hugging Face Transformers](https://huggingface.co/docs/transformers/index)
- [Sentence-Transformers](https://www.sbert.net/)
- [Cosine Similarity](https://en.wikipedia.org/wiki/Cosine_similarity)