# <font color="#003660">Applied Machine Learning for Text Analysis (M.184.5331)</font>


# <font color="#003660">Session 8: Retrieval-Augmented Generation</font>

# <font color="#003660">RAG Basics</font>

<center><br><img width=256 src="https://raw.githubusercontent.com/olivermueller/aml4ta-2021/main/resources/dag.png"/><br></center>

<p>

<div>
    <font color="#085986"><b>By the end of this lesson, you ...</b><br><br>
        ... will know what Retrieval-Augmented Generation (RAG) is. <br>
        ... will know how to implement a RAG chain from scratch.
    </font>
</div>
</p>

The following content is heavily inspired by the following excellent sources:

* [HuggingFace (2024): NLP Course](https://huggingface.co/learn/nlp-course/)
* [HuggingFace (2024): Open-Source AI Cookbook](https://huggingface.co/learn/cookbook/index)
* [Nguyen (2024): Code a simple RAG from scratch](https://huggingface.co/blog/ngxson/make-your-own-rag)

In [None]:
!pip install -U pymupdf4llm datasets transformers faiss-cpu accelerate langchain langchain-community langchain-huggingface

## Answering Questions using LLMs

In [1]:
import os
import re
from tqdm.notebook import tqdm
import pymupdf4llm

from transformers import AutoTokenizer, AutoModel, AutoModelForCausalLM, set_seed

DEVICE = "cuda"

In [None]:
# The tokenizer runs on the CPU, so it needs no dtype or device arguments
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-1.5B-Instruct",
    torch_dtype="auto",
    device_map=DEVICE,
)

In [None]:
def generate_response(messages):    
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

    generated_ids = model.generate(
        **model_inputs,
        max_new_tokens=512
    )
    generated_ids = [
        output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
    ]

    response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
    return response

In [None]:
set_seed(0)
prompt = "Who plays Daenerys Targaryen?"
messages = [
    {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
response = generate_response(messages)
print(response)

WOW! How wrong this answer is! (The actress is Emilia Clarke.)

Before we continue, we need to restart the runtime to free the GPU memory. The next cell kills the current kernel process.

In [None]:
os.kill(os.getpid(), 9)

## Retrieval-Augmented Generation
![](https://github.com/olivermueller/amlta-2024/blob/main/Session_08/imgs/RAG.png?raw=true)

(Image adapted from [Kaltenpoth and Müller (2024)](https://energy.acm.org/eir/dont-touch-the-power-line-a-proof-of-concept-for-aligned-llm-based-assistance-systems-to-support-the-maintenance-in-the-electricity-distribution-system/))

Retrieval-augmented generation (RAG), introduced by [Lewis et al. (2020)](https://proceedings.neurips.cc/paper/2020/hash/6b493230205f780e1bc26945df7481e5-Abstract.html), incorporates external knowledge, typically stored in a vector database, into the answers of a language model. A retriever (usually an encoder-only transformer language model) retrieves the k documents most similar to the query. These documents then serve as the context for an LLM that answers the user's question.

Now let's check if it improves the answer.

# Implementing a basic RAG system from scratch

In [2]:
import os
import re
from tqdm.notebook import tqdm
import pymupdf4llm
import urllib.request  # the request submodule must be imported explicitly
from IPython.display import display, Markdown

from transformers import AutoTokenizer, AutoModel, AutoModelForCausalLM, set_seed

DEVICE = "cuda"

In [None]:
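# exist_ok=True lets this cell be re-run without raising FileExistsError
os.makedirs("documents", exist_ok=True)
os.makedirs("imgs", exist_ok=True)
os.makedirs("markdown_documents", exist_ok=True)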
urllib.request.urlretrieve("https://raw.githubusercontent.com/olivermueller/amlta-2024/refs/heads/main/Session_08/documents/Game_of_Thrones.pdf", "documents/Game_of_Thrones.pdf")
urllib.request.urlretrieve("https://raw.githubusercontent.com/olivermueller/amlta-2024/refs/heads/main/Session_08/documents/How_I_Met_Your_Mother.pdf", "documents/How_I_Met_Your_Mother.pdf")
urllib.request.urlretrieve("https://raw.githubusercontent.com/olivermueller/amlta-2024/refs/heads/main/Session_08/markdown_documents/Game_of_Thrones.md", "markdown_documents/Game_of_Thrones.md")
urllib.request.urlretrieve("https://raw.githubusercontent.com/olivermueller/amlta-2024/refs/heads/main/Session_08/markdown_documents/How_I_Met_Your_Mother.md", "markdown_documents/How_I_Met_Your_Mother.md")

# Processing PDFs

We provide two PDF files: the Wikipedia article on the TV series Game of Thrones and the article on How I Met Your Mother. The second document lets us check that retrieval actually picks the relevant source and does not succeed by luck.
Since LLMs cannot read PDFs directly, we need to convert them into a readable text format such as Markdown. For that we can use [PyMuPDF4LLM](https://pymupdf.readthedocs.io/en/latest/pymupdf4llm/), a library explicitly designed to convert PDFs to LLM-readable Markdown.

In [4]:
documents_path = "documents"
markdown_documents_path = "markdown_documents"

In [None]:
documents = os.listdir(documents_path)

for document in documents:
    document_path = os.path.join(documents_path, document)
    md_file = pymupdf4llm.to_markdown(
        document_path
    )
    md_file_path = os.path.join(markdown_documents_path, document.replace(".pdf", ".md"))
    with open(md_file_path, "w", encoding="utf-8") as file:
        file.write(md_file)

## Designing a Vector Storage (Retrieval Database)

Now we need to design the vector storage based on our documents.

![](https://github.com/olivermueller/amlta-2024/blob/main/Session_08/imgs/vectordb.png?raw=true)

(Adapted from [Xie et al. (2023)](https://doi.org/10.1109/BigDIA60676.2023.10429609))

A vector database for retrieval is split into two phases: indexing and querying. Indexing is done once, using an encoder to embed the documents; querying is done for each request via similarity search, e.g., cosine similarity ([Xie et al., 2023](https://doi.org/10.1109/BigDIA60676.2023.10429609)).
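
The two phases can be sketched on toy data. This is only an illustration: `toy_encode`, `VOCAB`, and `query_index` are hypothetical names, and the keyword-counting "encoder" merely stands in for a real embedding model so that the structure (index once, query many times) is visible.

```python
import math

# Hypothetical stand-in for an encoder: counts a few hand-picked keywords.
VOCAB = ["dragon", "throne", "mother", "bar"]

def toy_encode(text):
    words = text.lower().split()
    return [float(words.count(term)) for term in VOCAB]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Indexing: done once, every document is embedded and stored.
docs = ["the dragon guards the throne", "the mother met him at the bar"]
index = [(doc, toy_encode(doc)) for doc in docs]

# Querying: done per question, the query is embedded and compared to the index.
def query_index(question, k=1):
    q = toy_encode(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]

print(query_index("who sits on the throne"))
```

The cells below follow the same pattern, with the keyword counter replaced by a transformer encoder.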

### Loading Documents

In [None]:
markdown_documents = os.listdir(markdown_documents_path)

md_files = []

for markdown_document in markdown_documents:
    markdown_document_path = os.path.join(markdown_documents_path, markdown_document)
    with open(markdown_document_path, encoding="utf-8") as file:
        md_files.append([markdown_document, file.read()])

In [None]:
display(Markdown(md_files[0][1][:1000]))

In [None]:
print(md_files[0][1][:1000])

As we can see in the rendered Markdown above, the Wikipedia articles contain many cross-reference links that would distract the LLM. Therefore, we remove the Markdown links (with a little help from ChatGPT).

In [None]:
def remove_markdown_links(text):
    """
    Removes Markdown links from the given text while keeping the link text.
    
    Args:
        text (str): The input Markdown text.
        
    Returns:
        str: The text with Markdown links removed. 
    
    Yeah this was ChatGPT ;)
    """
    # Regex to match Markdown links [text](link)
    pattern = r'\[([^\]]+)\]\([^\)]+\)'
    # Replace the matched pattern with just the text inside the brackets
    cleaned_text = re.sub(pattern, r'\1', text)
    return cleaned_text
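
A quick check of the pattern on two made-up strings (the `example.org` URLs are purely illustrative) shows what it does and where it falls short. The second case is an assumption about what Wikipedia-style URLs with parentheses may look like after conversion; whether it occurs in our documents would need to be checked.

```python
import re

# Same pattern as in remove_markdown_links above.
pattern = r'\[([^\]]+)\]\([^\)]+\)'

simple = "played by [Emilia Clarke](https://example.org/Emilia_Clarke)."
nested = "see [Season 1](https://example.org/Season_(1)) for details"

cleaned_simple = re.sub(pattern, r'\1', simple)
cleaned_nested = re.sub(pattern, r'\1', nested)

print(cleaned_simple)  # link text is kept, URL is dropped
print(cleaned_nested)  # a stray ")" remains: the URL match stops at the first ")"
```

For the purposes of this lesson such leftovers are tolerable, but a stricter cleaner would need to handle parentheses inside URLs.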

In [None]:
display(Markdown(remove_markdown_links(md_files[0][1])[:1000]))

In [None]:
print(remove_markdown_links(md_files[0][1])[:1000])

Now the text looks more readable.

In [None]:
markdown_documents = os.listdir(markdown_documents_path)

md_files = []

for markdown_document in markdown_documents:
    markdown_document_path = os.path.join(markdown_documents_path, markdown_document)
    with open(markdown_document_path, encoding="utf-8") as file:
        md_files.append([markdown_document, remove_markdown_links(file.read())])

### Chunking Texts

Since neither encoder nor decoder models can usually fit complete documents into their context windows, the documents are split into chunks.

We chunk by token length. It is also common to let consecutive chunks overlap so that information spanning a chunk boundary is not cut apart ([Wang et al., 2024](https://doi.org/10.18653/v1/2024.emnlp-main.981)).
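
To make the overlap concrete, here is a minimal sketch on a plain list of integers standing in for token ids. The helper name `chunk_tokens` is hypothetical and not part of any library; the notebook cell below does the equivalent directly on the tokenizer output.

```python
def chunk_tokens(tokens, chunk_length, overlap):
    """Split a token list into windows of chunk_length that share `overlap` tokens."""
    if overlap >= chunk_length:
        raise ValueError("overlap must be smaller than chunk_length")
    step = chunk_length - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_length])
        if start + chunk_length >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks

tokens = list(range(10))  # stand-in for token ids
chunks = chunk_tokens(tokens, chunk_length=4, overlap=1)
print(chunks)
```

With a chunk length of 4 and an overlap of 1, each window starts 3 tokens after the previous one, so the last token of one chunk reappears as the first token of the next. With the values used below (512 and 32), the windows advance by 480 tokens.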

In [None]:
embedding_tokenizer = AutoTokenizer.from_pretrained("jinaai/jina-embeddings-v2-base-en", model_max_length=999999999)
print("File name:", md_files[0][0], "Tokens:", len(embedding_tokenizer(md_files[0][1], truncation=False)["input_ids"]))

In [8]:
OVERLAP = 32
CHUNK_LENGTH = 512

md_files_chunked = []
for md_file in md_files:
    md_file_tokenized = embedding_tokenizer(md_file[1], truncation=False)["input_ids"]
    # iterate over chunk start positions so the tail of each document is kept
    for start in range(0, len(md_file_tokenized), CHUNK_LENGTH - OVERLAP):
        chunk_ids = md_file_tokenized[start:start + CHUNK_LENGTH]
        md_files_chunked.append([md_file[0], embedding_tokenizer.decode(chunk_ids, skip_special_tokens=True)])

In [None]:
print(md_files_chunked[0])
print(md_files_chunked[1])

### Loading Embedding Models

In [None]:
embedding_tokenizer = AutoTokenizer.from_pretrained("jinaai/jina-embeddings-v2-base-en")
embedding_model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-en', trust_remote_code=True)
embedding_model.to(DEVICE)

In [11]:
# each entry is [file name, chunk text]; only the text is embedded
embeddings = embedding_model.encode([md_files_chunked[0][1]])

In [None]:
print(list(embeddings[0][:10]) + ["..."], len(embeddings[0]))

### Creating the final vector storage (Retrieval Database)

We take the simplest (and slowest) approach: storing the vectors in a plain Python list.

In [None]:
VECTOR_DB = []

for md_file in tqdm(md_files_chunked):
    VECTOR_DB.append({"embeddings": embedding_model.encode([md_file[1]])[0], "content": md_file[1], "metadata":{"source": md_file[0]}})

In [14]:
def cosine_similarity(a, b):
  dot_product = sum([x * y for x, y in zip(a, b)])
  norm_a = sum([x ** 2 for x in a]) ** 0.5
  norm_b = sum([x ** 2 for x in b]) ** 0.5
  return dot_product / (norm_a * norm_b)
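
The function above computes $\cos(a, b) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}$ and divides by zero if either vector has norm zero. As a hedged variant, the sketch below adds a guard for that case; the name `safe_cosine_similarity` and the choice to return `0.0` for zero vectors are assumptions made for illustration.

```python
import math

def safe_cosine_similarity(a, b):
    """Cosine similarity that returns 0.0 instead of dividing by zero."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot_product = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: a zero vector is treated as similar to nothing
    return dot_product / (norm_a * norm_b)

same_direction = safe_cosine_similarity([1.0, 0.0], [2.0, 0.0])
orthogonal = safe_cosine_similarity([1.0, 0.0], [0.0, 3.0])
zero_vector = safe_cosine_similarity([0.0, 0.0], [1.0, 1.0])
print(same_direction, orthogonal, zero_vector)
```

Vectors pointing in the same direction score 1 regardless of their length, and orthogonal vectors score 0, which is why cosine similarity is a convenient relevance measure for embeddings.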

Your TODO:

In [17]:
def retrieve(query, top_n=3, embedding_model=embedding_model):
  query_embedding = embedding_model.encode([query])[0]
  # temporary list to store (chunk, similarity) pairs
  similarities = []
  # TODO: calculate cosine similarity between query and each chunk in the VECTOR_DB
  # Hint: each VECTOR_DB entry is a dictionary with keys "embeddings" (embeddings) and "content" (text chunk)



  #write your code here
  
  
  
  # sort by similarity in descending order, because higher similarity means more relevant chunks
  similarities.sort(key=lambda x: x[1], reverse=True)
  # finally, return the top N most relevant chunks
  return similarities[:top_n]
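
If you get stuck, the following sketch shows one possible shape of the scoring step on toy data. It is deliberately independent of the notebook state: `TOY_DB`, `retrieve_from`, and the hand-made 2D vectors are hypothetical, and the query embedding is passed in directly instead of being computed by the embedding model.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Toy database with hand-made 2D "embeddings" (illustrative values only).
TOY_DB = [
    {"embeddings": [1.0, 0.0], "content": "chunk about casting"},
    {"embeddings": [0.0, 1.0], "content": "chunk about filming locations"},
    {"embeddings": [0.8, 0.6], "content": "chunk about characters"},
]

def retrieve_from(db, query_embedding, top_n=2):
    # score every entry, then keep the top_n highest similarities
    similarities = [
        (entry["content"], cosine(query_embedding, entry["embeddings"]))
        for entry in db
    ]
    similarities.sort(key=lambda pair: pair[1], reverse=True)
    return similarities[:top_n]

results = retrieve_from(TOY_DB, [0.9, 0.1])
for content, score in results:
    print(f"{score:.2f}  {content}")
```

In the notebook, the same loop would run over `VECTOR_DB` and compare against `query_embedding`.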

In [None]:
retrieved_docs = retrieve("Who plays Daenerys Targaryen?")

for doc in retrieved_docs:
    print(doc)

## Determining the Generator

In [19]:
# The tokenizer runs on the CPU, so it needs no dtype or device arguments
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-1.5B-Instruct",
    torch_dtype="auto",
    device_map=DEVICE,
)

## Defining a Generation Prompt

In [None]:
prompt = '''Use only the following context chunks to answer the question: {input_query}
Don't make up any new information. Here are the chunks:
{chunks}
Now answer the question: {input_query}'''

# Generating based on Documents

So now let's bring everything together. We redefine the ``generate_response`` function from the first part for querying an LLM, since the runtime restart removed it.

In [None]:
def generate_response(messages):    
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

    generated_ids = model.generate(
        **model_inputs,
        max_new_tokens=512
    )
    generated_ids = [
        output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
    ]

    response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
    return response

In [None]:
input_query = "Who plays Daenerys Targaryen?"
retrieved_knowledge = retrieve(input_query)

print('Retrieved knowledge:')
for chunk, similarity in retrieved_knowledge:
    print(f' - (similarity: {similarity:.2f}) {chunk}')

In [21]:
chunks = '\n'.join([f' - {chunk}' for chunk, similarity in retrieved_knowledge])
# str.format fills every {input_query} placeholder from a single keyword argument
question = prompt.format(input_query=input_query, chunks=chunks)

In [None]:
set_seed(0) # for reproducibility
messages = [
    {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    {"role": "user", "content": question}
]
response = generate_response(messages)
print(response)

# Creating a RAG Chain

Your TODO:

In [23]:
def generate_rag_response(input_query):
    # TODO: retrieve relevant chunks from the VECTOR_DB
    # TODO: create a prompt using the retrieved chunks and the input query
    # TODO: generate a response using the RAG model
    # TODO: return the generated response


    # write your code here



    pass
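
As a hint for the structure, here is a minimal sketch of such a chain that can be tried without a GPU. The names `PROMPT_TEMPLATE`, `build_rag_messages`, `rag_chain`, `fake_retrieve`, and `fake_generate` are hypothetical; the retriever and generator are passed in as functions so that simple stubs can stand in for the embedding model and the LLM.

```python
PROMPT_TEMPLATE = (
    "Use only the following context chunks to answer the question: {input_query}\n"
    "Here are the chunks:\n{chunks}\n"
    "Now answer the question: {input_query}"
)

def build_rag_messages(input_query, retrieved):
    """Turn (chunk, similarity) pairs into chat messages for the generator."""
    chunks = "\n".join(f" - {chunk}" for chunk, _similarity in retrieved)
    question = PROMPT_TEMPLATE.format(input_query=input_query, chunks=chunks)
    return [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": question},
    ]

def rag_chain(input_query, retrieve_fn, generate_fn, top_n=3):
    # retrieve_fn and generate_fn are injected so the chain can be tried with stubs
    retrieved = retrieve_fn(input_query, top_n)
    return generate_fn(build_rag_messages(input_query, retrieved))

# Stubs standing in for the notebook's retriever and LLM (hypothetical).
def fake_retrieve(query, top_n):
    return [("Emilia Clarke plays Daenerys.", 0.9)][:top_n]

def fake_generate(messages):
    return messages[-1]["content"]  # echo the user prompt instead of calling a model

answer = rag_chain("Who plays Daenerys Targaryen?", fake_retrieve, fake_generate)
print(answer)
```

In the notebook, the stubs would be replaced by the real pieces, for example `rag_chain(query, lambda q, n: retrieve(q, top_n=n), generate_response)`, assuming `retrieve` and `generate_response` are defined as above.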

# Finally a RAG Chain

In [None]:
set_seed(0)
input_query = input('Ask me a question: ')
print(generate_rag_response(input_query))

<a href="https://imgflip.com/i/9fsgxc"><img src="https://i.imgflip.com/9fsgxc.jpg" title="made at imgflip.com"/></a><div><a href="https://imgflip.com/memegenerator">from Imgflip Meme Generator</a></div>