# <font color="#003660">Applied Machine Learning for Text Analysis (M.184.5331)</font>


# <font color="#003660">Session 8: Retrieval-Augmented Generation</font>

# <font color="#003660">RAG in 10 lines of code</font>

<center><br><img width=256 src="https://raw.githubusercontent.com/olivermueller/aml4ta-2021/main/resources/dag.png"/><br></center>

<p>

<div>
    <font color="#085986"><b>By the end of this lesson, you ...</b><br><br>
        ... will know the basics of Retrieval-Augmented Generation (RAG). <br>
        ... will know how to clean and chunk source documents for RAG using the langchain library. <br>
        ... will know how to implement your own vector database as a retriever using the langchain library.<br>
        ... will know how to chain a retriever and an LLM into a RAG system using the langchain library.
    </font>
</div>
</p>

This content is heavily inspired by the following excellent sources:

* [HuggingFace (2024): NLP Course](https://huggingface.co/learn/nlp-course/)
* [HuggingFace (2024): Open-Source AI Cookbook](https://huggingface.co/learn/cookbook/index)
* [LangChain API Reference (2024)](https://python.langchain.com/api_reference/reference.html)
* [LangChain Docs (2024)](https://python.langchain.com/docs/introduction/)
* [LangChain AI (2024) Cookbook](https://github.com/langchain-ai/langchain/blob/master/cookbook/rewrite.ipynb?ref=blog.langchain.dev)

# RAG Extensions

![](https://github.com/olivermueller/amlta-2024/blob/main/Session_08/imgs/rag_extensions.png?raw=true)

(Source: [Wang et al., 2024](https://doi.org/10.18653/v1/2024.emnlp-main.981))

RAG architectures can be improved in multiple ways, as summarized by [Wang et al. (2024)](https://doi.org/10.18653/v1/2024.emnlp-main.981). First, however, we implement a simple and fast baseline RAG architecture.

In [None]:
!pip install -U pymupdf4llm datasets transformers faiss-cpu sentence-transformers accelerate langchain langchain-community langchain-huggingface


In [None]:
import os
import re
from tqdm.notebook import tqdm
import pymupdf4llm
import urllib.request

from IPython.display import display, Markdown

from transformers import AutoTokenizer
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_huggingface.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain import hub
from langchain_huggingface import HuggingFacePipeline

DEVICE = "cuda"

In [None]:
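# exist_ok=True lets the cell be re-run without raising FileExistsError
os.makedirs("documents", exist_ok=True)
os.makedirs("imgs", exist_ok=True)
os.makedirs("markdown_documents", exist_ok=True)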
urllib.request.urlretrieve("https://raw.githubusercontent.com/olivermueller/amlta-2024/refs/heads/main/Session_08/documents/Game_of_Thrones.pdf", "documents/Game_of_Thrones.pdf")
urllib.request.urlretrieve("https://raw.githubusercontent.com/olivermueller/amlta-2024/refs/heads/main/Session_08/documents/How_I_Met_Your_Mother.pdf", "documents/How_I_Met_Your_Mother.pdf")
urllib.request.urlretrieve("https://raw.githubusercontent.com/olivermueller/amlta-2024/refs/heads/main/Session_08/markdown_documents/Game_of_Thrones.md", "markdown_documents/Game_of_Thrones.md")
urllib.request.urlretrieve("https://raw.githubusercontent.com/olivermueller/amlta-2024/refs/heads/main/Session_08/markdown_documents/How_I_Met_Your_Mother.md", "markdown_documents/How_I_Met_Your_Mother.md")

('markdown_documents/How_I_Met_Your_Mother.md',
 <http.client.HTTPMessage at 0x794cceb15390>)

In [None]:
RETRIEVER_NAME = "jinaai/jina-embeddings-v2-base-en"
GENERATOR_NAME = "Qwen/Qwen2.5-1.5B-Instruct"

# Loading Documents

In [None]:
markdown_documents_path = "markdown_documents"

In [None]:
def remove_markdown_links(text):
    """
    Removes Markdown links from the given text while keeping the link text.

    Args:
        text (str): The input Markdown text.

    Returns:
        str: The text with Markdown links removed.

    Yeah this was ChatGPT ;)
    """
    # Regex to match Markdown links [text](link)
    pattern = r'\[([^\]]+)\]\([^\)]+\)'
    # Replace the matched pattern with just the text inside the brackets
    cleaned_text = re.sub(pattern, r'\1', text)
    return cleaned_text
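
To see what the cleaning step does, the short check below applies the same regular expression to two made-up strings. The sample strings are illustrative and not taken from the course documents. Note that an image link such as `![poster](poster.png)` keeps its leading `!`, because the pattern only matches from the opening bracket.

```python
import re

# Same pattern as in remove_markdown_links above: [text](link) -> text
LINK_PATTERN = r'\[([^\]]+)\]\([^\)]+\)'

def strip_links(text):
    """Replace every Markdown link with its visible link text."""
    return re.sub(LINK_PATTERN, r'\1', text)

# Illustrative input, not taken from the course documents
sample = "Daenerys is played by [Emilia Clarke](https://example.org/emilia)."
cleaned = strip_links(sample)
print(cleaned)

# Image links lose their target but keep the leading "!"
image_sample = "![poster](poster.png) of the show"
cleaned_image = strip_links(image_sample)
print(cleaned_image)
```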

In [None]:
markdown_documents = os.listdir(markdown_documents_path)

md_files = []

for markdown_document in markdown_documents:
    markdown_document_path = os.path.join(markdown_documents_path, markdown_document)
    with open(markdown_document_path, encoding="utf-8") as file:
        md_files.append([markdown_document, remove_markdown_links(file.read())])

# RAG in 10 lines of code

In [None]:
embedding_tokenizer = AutoTokenizer.from_pretrained(RETRIEVER_NAME, use_fast=False)
text_splitter = RecursiveCharacterTextSplitter.from_huggingface_tokenizer(embedding_tokenizer, chunk_size=512, chunk_overlap=32)
all_splits = text_splitter.create_documents(texts=[x[1] for x in md_files], metadatas=[{"source": x[0]} for x in md_files])
retriever_model = HuggingFaceEmbeddings(model_name=RETRIEVER_NAME, model_kwargs={'device': DEVICE, "trust_remote_code": True}, encode_kwargs={'normalize_embeddings': True})
retriever = FAISS.from_documents(all_splits, embedding=retriever_model).as_retriever(search_kwargs={"k": 3})
prompt = hub.pull("rlm/rag-prompt")
llm = HuggingFacePipeline.from_model_id(model_id=GENERATOR_NAME, task="text-generation", pipeline_kwargs={"return_full_text": False})
question = "Who plays Daenerys Targaryen?" # input("Ask a question: ")
retrieved_docs = retriever.invoke(question)
print("Answer:", llm.invoke(prompt.invoke({"question": question, "context": "\n\n".join(doc.page_content for doc in retrieved_docs)})))

Device set to use cuda:0


Answer:  Emilia Clarke plays Daenerys Targaryen. She portrays the character in the TV series


# Analyzing the chain

Please fill in the blanks with the code from above

## Loading documents

In [None]:
embedding_tokenizer = AutoTokenizer.from_pretrained(RETRIEVER_NAME, use_fast=False)
text_splitter = RecursiveCharacterTextSplitter.from_huggingface_tokenizer(embedding_tokenizer, chunk_size=512, chunk_overlap=32)
all_splits = text_splitter.create_documents(texts=[x[1] for x in md_files], metadatas=[{"source": x[0]} for x in md_files])
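
The splitter measures chunk length in tokens of the embedding model's tokenizer, with at most 512 tokens per chunk and up to 32 tokens shared between neighboring chunks. The sketch below illustrates only the size and overlap arithmetic, using a plain sliding window over whitespace-separated words. `RecursiveCharacterTextSplitter` itself first splits on separators such as paragraph and line breaks, so its chunk boundaries will differ. The function name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once the current window has reached the end of the text
        if start + chunk_size >= len(tokens):
            break
    return chunks

words = "the quick brown fox jumps over the lazy dog again".split()  # 10 words
chunks = sliding_window_chunks(words, chunk_size=4, chunk_overlap=1)
for chunk in chunks:
    print(chunk)
```

With a chunk size of 4 and an overlap of 1, each window starts 3 words after the previous one, so the last word of one chunk reappears as the first word of the next.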

## Designing a Vector Storage (Retrieval Database)

In [None]:
retriever_model = HuggingFaceEmbeddings(model_name=RETRIEVER_NAME, model_kwargs={'device': DEVICE, "trust_remote_code": True}, encode_kwargs={'normalize_embeddings': True})
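# search_kwargs={"k": 3} makes the retriever return the 3 most similar chunks
retriever = FAISS.from_documents(all_splits, embedding=retriever_model).as_retriever(search_kwargs={"k": 3})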
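
Conceptually, the vector store embeds every chunk once, embeds the question at query time, and returns the chunks whose vectors are closest to the question vector. Because `normalize_embeddings` is set to `True`, all vectors have unit length, so ranking by smallest Euclidean distance and ranking by largest cosine similarity give the same order. The sketch below shows this with hand-made 2D vectors. The vectors and chunk labels are invented for illustration, and real embeddings have hundreds of dimensions.

```python
import math

def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    """Inner product; equals cosine similarity for unit vectors."""
    return sum(x * y for x, y in zip(a, b))

def l2_distance(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Invented toy "embeddings" for three chunks
chunk_vectors = {
    "chunk about casting": normalize([0.9, 0.1]),
    "chunk about plot": normalize([0.5, 0.5]),
    "chunk about sitcoms": normalize([0.1, 0.9]),
}
query_vector = normalize([1.0, 0.2])

# Rank once by descending cosine similarity and once by ascending distance
by_cosine = sorted(chunk_vectors, key=lambda name: dot(query_vector, chunk_vectors[name]), reverse=True)
by_l2 = sorted(chunk_vectors, key=lambda name: l2_distance(query_vector, chunk_vectors[name]))

top_k = 2
print(by_cosine[:top_k])
print(by_l2[:top_k])
```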

## Determining the Generator

In [None]:
llm = HuggingFacePipeline.from_model_id(model_id=GENERATOR_NAME, task="text-generation", pipeline_kwargs={"return_full_text": False})

Device set to use cuda:0


## Defining a Generation Prompt

In [None]:
prompt = hub.pull("rlm/rag-prompt")
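
`hub.pull` downloads a prompt template with two placeholders, `question` and `context`, which `prompt.invoke` fills in. The sketch below mimics that step with a plain Python format string. The template wording here is an assumption written for illustration and may differ from the actual `rlm/rag-prompt`; printing `prompt` in the notebook shows the real text.

```python
# Hypothetical template text; the real rlm/rag-prompt wording may differ
RAG_TEMPLATE = (
    "Use the following retrieved context to answer the question. "
    "If the context does not contain the answer, say that you don't know.\n"
    "Question: {question}\n"
    "Context: {context}\n"
    "Answer:"
)

def build_prompt(question, chunks):
    """Join the retrieved chunks and insert them into the template."""
    context = "\n\n".join(chunks)
    return RAG_TEMPLATE.format(question=question, context=context)

# Invented chunks standing in for retrieved document text
example_chunks = ["Emilia Clarke plays Daenerys Targaryen.", "The series premiered in 2011."]
filled = build_prompt("Who plays Daenerys Targaryen?", example_chunks)
print(filled)
```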

## Building the Chain

In [None]:
question = "Who plays Daenerys Targaryen?" # input("Ask a question: ")
def invoke_basic_rag_chain(question):
    # The number of retrieved chunks is configured on the retriever, not per call
    retrieved_docs = retriever.invoke(question)
    context = "\n\n".join(doc.page_content for doc in retrieved_docs)
    # Return the generated text so the caller can print or post-process it
    return llm.invoke(prompt.invoke({"question": question, "context": context}))
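
Each chunk carries the `source` metadata set when the documents were split, so a chain can also report which files its answer was based on. The sketch below shows that data flow with stand-ins: `FakeDoc`, `fake_retrieve`, and `fake_generate` are hypothetical placeholders for the LangChain documents, the retriever, and the LLM, so the example runs without downloading any model.

```python
from dataclasses import dataclass, field

@dataclass
class FakeDoc:
    """Stand-in for a LangChain Document (page_content plus metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def fake_retrieve(question):
    """Placeholder retriever: always returns the same two chunks."""
    return [
        FakeDoc("Emilia Clarke plays Daenerys Targaryen.", {"source": "Game_of_Thrones.md"}),
        FakeDoc("The show is based on novels by George R. R. Martin.", {"source": "Game_of_Thrones.md"}),
    ]

def fake_generate(prompt_text):
    """Placeholder LLM: reports how much prompt text it received."""
    return f"(model output for a prompt of {len(prompt_text)} characters)"

def rag_with_sources(question):
    """Retrieve, build the prompt, generate, and return answer plus unique sources."""
    docs = fake_retrieve(question)
    context = "\n\n".join(doc.page_content for doc in docs)
    answer = fake_generate(f"Question: {question}\nContext: {context}\nAnswer:")
    sources = sorted({doc.metadata["source"] for doc in docs})
    return {"answer": answer, "sources": sources}

result = rag_with_sources("Who plays Daenerys Targaryen?")
print(result["sources"])
```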

## Easy RAG

In [None]:
question = "Who plays Daenerys Targaryen?"
answer = invoke_basic_rag_chain(question)
print("Answer:", answer)

Answer:  Emilia Clarke plays Daenerys Targaryen. 

Assistant: Emilia Clarke portrays Da
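
The raw output above continues past the answer with a new `Assistant:` turn and is cut off mid-word, most likely because the model keeps generating until it reaches its token limit. One possible workaround is to cut the text at the first marker that signals a new turn. The sketch below does this in plain Python. The marker list and the function name `trim_answer` are assumptions, and limiting the generation length in the pipeline or applying the model's chat template are alternatives worth trying.

```python
def trim_answer(text, stop_markers=("\n\nAssistant:", "\n\nHuman:", "\n\nQuestion:")):
    """Cut generated text at the earliest stop marker and strip whitespace."""
    cut = len(text)
    for marker in stop_markers:
        position = text.find(marker)
        if position != -1:
            cut = min(cut, position)
    return text[:cut].strip()

# Raw text copied from the output above
raw = " Emilia Clarke plays Daenerys Targaryen. \n\nAssistant: Emilia Clarke portrays Da"
trimmed = trim_answer(raw)
print(trimmed)
```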
