<a href="https://colab.research.google.com/github/lrdplopes/llm-notebooks/blob/main/notebooks/rag_langchain.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Advanced RAG on HuggingFace documentation using LangChain 🤗

A hands-on walkthrough of building an advanced RAG (Retrieval Augmented Generation) system with LangChain that answers a user's question about a specific knowledge base.

<img src="https://huggingface.co/datasets/huggingface/cookbook-images/resolve/main/RAG_workflow.png" height="700">

In [None]:
# Install the dependencies required by this notebook.
!pip install -q torch transformers accelerate bitsandbytes langchain langchain-community sentence-transformers faiss-gpu openpyxl pacmap plotly ragatouille

In [None]:
# Install the `datasets` library, which we use to load the dataset from the Hugging Face Hub.
!pip install -q datasets

In [None]:
from tqdm.notebook import tqdm
import pandas as pd
from typing import Optional, List, Tuple
from datasets import Dataset
import matplotlib.pyplot as plt

pd.set_option(
    "display.max_colwidth", None
)  # this will be helpful when visualizing retriever outputs

The `set_option` function sets `display.max_colwidth`, which controls the maximum width, in characters, of columns shown when a DataFrame or Series is printed to the console or a Jupyter notebook. Setting it to `None` disables truncation, so long text cells are displayed in full.
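As a quick illustration of this option, here is a minimal sketch. The sample DataFrame `df_demo` and the width of 20 are made up for the example and are not part of the notebook's pipeline.

```python
import pandas as pd

# Hypothetical sample data: one long text cell, similar to a retrieved chunk.
df_demo = pd.DataFrame({"text": ["word " * 9 + "word"]})

# With a small limit, pandas shortens the cell and ends it with "...".
with pd.option_context("display.max_colwidth", 20):
    truncated_view = repr(df_demo)

# With None, the full cell content is displayed.
with pd.option_context("display.max_colwidth", None):
    full_view = repr(df_demo)

print(truncated_view)
print(full_view)
```

`option_context` restores the previous setting when the `with` block ends, whereas `set_option` changes it for the rest of the session.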

## Loading the Knowledge-Base

In [None]:
# Export your HF_TOKEN.
# Assumption: the token is stored as a Colab secret named "HF_TOKEN".
import os

from google.colab import userdata

os.environ["HF_TOKEN"] = userdata.get("HF_TOKEN")

In [None]:
import datasets
ds = datasets.load_dataset("m-ric/huggingface_doc", split="train")

The `load_dataset()` function accepts the argument `split="train"` to indicate which part of the dataset should be loaded. Datasets used for machine learning are often split into several parts for training, validation, and testing.

In [None]:
from langchain.docstore.document import Document as LangchainDocument

RAW_KNOWLEDGE_BASE = [
    LangchainDocument(page_content=doc["text"], metadata={"source": doc["source"]})
    for doc in tqdm(ds)
]

- `RAW_KNOWLEDGE_BASE`: A list to store the documents in the KB.
- `ds`: The loaded dataset of structured documents.
- `tqdm`: A Python library that provides a progress bar for iterating over large datasets.
- `LangchainDocument`: A class from the langchain library that represents a document in the KB.
- `doc["text"]`: The textual content of a document.
- `doc["source"]`: The metadata about the source of a document.

# Retriever Embeddings 📖

1. The Retriever's Function
- **Analogy**: Imagine the Retriever as a search engine within a library. Instead of a large website, your "library" is the pre-structured knowledge base. Instead of a general search, the user is asking a specific question.
- **Implementation**: The implementation typically involves transforming both questions and knowledge base documents into embeddings (vector representations) so that they can be efficiently compared.
- **Goal**: The primary goal of the Retriever is to find the most relevant snippets in your knowledge base that can help answer the user's question.

How is it done? Through a technique called "Retrieval Embeddings".

2. What are "Retrieval Embeddings"?
- **Embeddings**: These are numerical representations of text. Think of them as "digital signatures" that capture the meaning and semantic essence of a sentence, paragraph, or even entire documents. Semantically similar texts are mapped to nearby points in vector space. This allows for efficient similarity-based searches.
- **How they are used**: The Retriever
  - creates embeddings for its knowledge base;
  - generates an embedding from the user's question;
  - compares the question embedding with the knowledge base embeddings and returns the most similar ones (those closest in meaning).
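The comparison step can be sketched in a few lines. This is a minimal illustration that assumes hand-made 3-dimensional vectors and hypothetical names (`toy_knowledge_base`, `retrieve`); real embedding models produce vectors with hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings, written by hand for the example.
toy_knowledge_base = {
    "How to load a pipeline": [0.9, 0.1, 0.0],
    "Installing the library": [0.1, 0.9, 0.1],
    "Pipeline usage examples": [0.8, 0.2, 0.1],
}
toy_query_embedding = [1.0, 0.0, 0.0]


def retrieve(query_embedding, knowledge_base, top_k):
    """Return the `top_k` texts whose embeddings are most similar to the query."""
    scored = [
        (cosine_similarity(query_embedding, embedding), text)
        for text, embedding in knowledge_base.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [text for _, text in scored[:top_k]]


toy_results = retrieve(toy_query_embedding, toy_knowledge_base, top_k=2)
print(toy_results)
```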

3. Fine-Tuning the Search
- `top_k`: This parameter determines how many snippets are retrieved. When experimenting, we usually start with a small `top_k` value and adjust as needed.
- `chunk size`: The length of each snippet. Chunk sizes often vary, since some relevant sections are shorter or longer than others.

**Finding the Balance:**

- **`top_k` vs. chunk size**: Finding the right balance between `top_k` and chunk size is crucial. You want to retrieve the most informative snippets without overwhelming the Reader Model with irrelevant information.
- **Quality of Embeddings**: The quality of embeddings has a significant impact on the effectiveness of the retriever. More advanced embedding models can capture semantic nuances better, resulting in more accurate retrievals.
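One concrete constraint behind this balance is the reader's context window: the retrieved chunks, the prompt, and the generated answer all have to fit in it. The arithmetic below is a rough sketch; every number in it (`context_window`, `prompt_overhead`, `max_new_tokens`) is an assumption chosen for illustration.

```python
# Assumed numbers for illustration only.
context_window = 4096   # hypothetical reader context limit, in tokens
prompt_overhead = 150   # instructions plus question, assumed
max_new_tokens = 500    # room reserved for the generated answer


def max_top_k(chunk_size):
    """Largest top_k such that all chunks fit in the remaining token budget."""
    budget = context_window - prompt_overhead - max_new_tokens
    return budget // chunk_size


for chunk_size in (128, 256, 512):
    print(f"chunk_size={chunk_size}: at most {max_top_k(chunk_size)} chunks")
```

Fitting in the window is only an upper bound. A smaller `top_k` may still give better answers if the extra chunks are mostly noise.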

4. Why the LangChain Library
- **Flexibility**: LangChain makes it easier to work with different types of "databases" that store embeddings (the numerical representations of text).
- **Metadata Preservation**: LangChain allows tracking relevant information about our snippets (like their original source), which can be very useful for our system.

## Splitting the Documents into `Chunks`

### Let's explore Recursive Chunking

- Recursive Chunking: This technique divides text in successive layers. It uses a list of separators (such as paragraph breaks `\n\n`, line breaks `\n`, or sentence-ending periods `.`) to split text hierarchically.
- Adaptability: If one separator does not yield fragments of the ideal size, it applies the next separator to the obtained fragments, and so on.
- Benefit: It helps preserve the general structure of the text, while still allowing some variation in the size of the chunks.
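The idea can be sketched in plain Python. This is a simplified illustration, not LangChain's implementation: the hypothetical `recursive_split` function drops the separators, does not merge small neighbouring pieces, and adds no overlap, all of which the real splitter handles.

```python
def recursive_split(text, separators, chunk_size):
    """Split `text` into pieces of at most `chunk_size` characters.

    Separators are tried in order; a piece that is still too long is split
    again with the next separator in the list.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separator left: fall back to a hard cut every `chunk_size` characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    separator, remaining = separators[0], separators[1:]
    chunks = []
    for piece in text.split(separator):
        chunks.extend(recursive_split(piece, remaining, chunk_size))
    return chunks


sample_text = (
    "Intro paragraph.\n\n"
    "Second paragraph, line one.\n"
    "Second paragraph, line two."
)
sample_chunks = recursive_split(sample_text, ["\n\n", "\n", " "], chunk_size=30)
for chunk in sample_chunks:
    print(repr(chunk))
```

The first paragraph already fits, so it is kept whole. The second one is too long and is split again on line breaks.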

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# We use a hierarchical list of separators specifically tailored for splitting Markdown documents
# This list is taken from LangChain's MarkdownTextSplitter class.
MARKDOWN_SEPARATORS = [
    "\n#{1,6} ",
    "```\n",
    "\n\\*\\*\\*+\n",
    "\n---+\n",
    "\n___+\n",
    "\n\n",
    "\n",
    " ",
    "",
]
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,  # the maximum number of characters in a chunk: we selected this value arbitrarily
    chunk_overlap=100,  # the number of characters to overlap between chunks
    add_start_index=True,  # If `True`, includes chunk's start index in metadata
    strip_whitespace=True,  # If `True`, strips whitespace from the start and end of every document
    separators=MARKDOWN_SEPARATORS,
)
docs_processed = []
for doc in RAW_KNOWLEDGE_BASE:
    docs_processed += text_splitter.split_documents([doc])

> `max_seq_length`: Refers to the maximum number of tokens that the embedding model can process in a single sequence.



In [None]:
from sentence_transformers import SentenceTransformer

# To get the maximum sequence length, we query the `SentenceTransformer` object of the embedding model.
print(
    f"Model's maximum sequence length: {SentenceTransformer('thenlper/gte-small').max_seq_length}"
)

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("thenlper/gte-small")
lengths = [len(tokenizer.encode(doc.page_content)) for doc in tqdm(docs_processed)]

# Plot the distribution of document lengths, counted as the number of tokens
fig = pd.Series(lengths).hist()
plt.title("Distribution of document lengths in the knowledge base (in count of tokens)")
plt.show()

The histogram shows that some chunks exceed the model's limit of 512 tokens. The model truncates anything beyond that limit, so part of the content of those chunks is lost.

To avoid this, we can configure `RecursiveCharacterTextSplitter` to measure chunk length in tokens instead of characters.
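The difference between the two ways of measuring length can be shown with a toy example. The sketch below assumes a stand-in tokenizer that yields one token per whitespace-separated word. A real subword tokenizer usually yields more tokens, and the helper names are hypothetical.

```python
def count_characters(text):
    """Length as the character-based splitter sees it."""
    return len(text)


def count_tokens(text):
    # Stand-in tokenizer (assumption): one token per whitespace-separated word.
    return len(text.split())


def fits(text, limit, length_function):
    """Check a chunk against a limit using the given way of measuring length."""
    return length_function(text) <= limit


sample_chunk = "retrieval augmented generation " * 4

print(count_characters(sample_chunk), count_tokens(sample_chunk))
print(fits(sample_chunk, 20, count_characters))  # limit read as characters
print(fits(sample_chunk, 20, count_tokens))      # limit read as tokens
```

The same numeric limit gives different verdicts depending on the unit. This is why a character-based `chunk_size` does not guarantee that a chunk stays under the embedding model's token limit.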

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from transformers import AutoTokenizer

EMBEDDING_MODEL_NAME = "thenlper/gte-small"


def split_documents(
    chunk_size: int,
    knowledge_base: List[LangchainDocument],
    tokenizer_name: Optional[str] = EMBEDDING_MODEL_NAME,
) -> List[LangchainDocument]:
    """
    Split documents into chunks of maximum size `chunk_size` tokens and return a list of documents.
    """
    text_splitter = RecursiveCharacterTextSplitter.from_huggingface_tokenizer(
        AutoTokenizer.from_pretrained(tokenizer_name),
        chunk_size=chunk_size,
        chunk_overlap=int(chunk_size / 10),
        add_start_index=True,
        strip_whitespace=True,
        separators=MARKDOWN_SEPARATORS,
    )

    docs_processed = []
    for doc in knowledge_base:
        docs_processed += text_splitter.split_documents([doc])

    # Remove duplicates
    unique_texts = {}
    docs_processed_unique = []
    for doc in docs_processed:
        if doc.page_content not in unique_texts:
            unique_texts[doc.page_content] = True
            docs_processed_unique.append(doc)

    return docs_processed_unique


docs_processed = split_documents(
    512,  # We choose a chunk size adapted to our model
    RAW_KNOWLEDGE_BASE,
    tokenizer_name=EMBEDDING_MODEL_NAME,
)

# Let's visualize the chunk sizes we would have in tokens from a common model
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(EMBEDDING_MODEL_NAME)
lengths = [len(tokenizer.encode(doc.page_content)) for doc in tqdm(docs_processed)]
fig = pd.Series(lengths).hist()
plt.title("Distribution of document lengths in the knowledge base (in count of tokens)")
plt.show()

## Building the Vector Database

To build the vector database, it's necessary to compute the embeddings for each piece of knowledge, transforming the text into information that machine learning models can process. These embeddings are then stored in a vector database, which serves as a library for finding relevant information.

When a user makes a query, it is converted into an embedding and a `nearest-neighbor` search algorithm finds the most similar chunks. This process needs to be fast and efficient, for which the `FAISS` library is used.

The next step involves turning this concept into code, generating the embeddings, normalizing them, building the database with `FAISS` and writing the logic for the user's query.
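Normalization matters because, once vectors have unit length, their inner product equals their cosine similarity. The sketch below checks this on two made-up 2-dimensional vectors; the helper names are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(vector, vector))
    return [x / norm for x in vector]


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Made-up vectors standing in for a chunk embedding and a query embedding.
chunk_vec = [3.0, 4.0]
query_vec = [4.0, 3.0]

cosine_value = cosine(chunk_vec, query_vec)
inner_product_normalized = dot(normalize(chunk_vec), normalize(query_vec))

print(cosine_value, inner_product_normalized)
```

Both values agree up to floating-point rounding, so an index that ranks by inner product over normalized vectors ranks by cosine similarity.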

In [None]:
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores.utils import DistanceStrategy

embedding_model = HuggingFaceEmbeddings(
    model_name=EMBEDDING_MODEL_NAME,
    multi_process=True,
    model_kwargs={"device": "cuda"},
    encode_kwargs={"normalize_embeddings": True},  # set True for cosine similarity
)

KNOWLEDGE_VECTOR_DATABASE = FAISS.from_documents(
    docs_processed, embedding_model, distance_strategy=DistanceStrategy.COSINE
)

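- `PaCMAP`: we use it to project the chunk embeddings and the query embedding down to 2 dimensions for visualization.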

In [None]:
# embed a user query in the same space
user_query = "How to create a pipeline object?"
query_vector = embedding_model.embed_query(user_query)

In [None]:
import pacmap
import numpy as np
import plotly.express as px

embedding_projector = pacmap.PaCMAP(
    n_components=2, n_neighbors=None, MN_ratio=0.5, FP_ratio=2.0, random_state=1
)

embeddings_2d = [
    list(KNOWLEDGE_VECTOR_DATABASE.index.reconstruct_n(idx, 1)[0])
    for idx in range(len(docs_processed))
] + [query_vector]

# fit the data (The index of transformed data corresponds to the index of the original data)
documents_projected = embedding_projector.fit_transform(
    np.array(embeddings_2d), init="pca"
)

In [None]:
df = pd.DataFrame.from_dict(
    [
        {
            "x": documents_projected[i, 0],
            "y": documents_projected[i, 1],
            "source": docs_processed[i].metadata["source"].split("/")[1],
            "extract": docs_processed[i].page_content[:100] + "...",
            "symbol": "circle",
            "size_col": 4,
        }
        for i in range(len(docs_processed))
    ]
    + [
        {
            "x": documents_projected[-1, 0],
            "y": documents_projected[-1, 1],
            "source": "User query",
            "extract": user_query,
            "size_col": 100,
            "symbol": "star",
        }
    ]
)

# visualize the embedding
fig = px.scatter(
    df,
    x="x",
    y="y",
    color="source",
    hover_data="extract",
    size="size_col",
    symbol="symbol",
    color_discrete_map={"User query": "black"},
    width=1000,
    height=700,
)
fig.update_traces(
    marker=dict(opacity=1, line=dict(width=0, color="DarkSlateGrey")),
    selector=dict(mode="markers"),
)
fig.update_layout(
    legend_title_text="<b>Chunk source</b>",
    title="<b>2D Projection of Chunk Embeddings via PaCMAP</b>",
)
fig.show()

In [None]:
print(f"\nStarting retrieval for {user_query=}...")
retrieved_docs = KNOWLEDGE_VECTOR_DATABASE.similarity_search(query=user_query, k=5)
print(
    "\n==================================Top document=================================="
)
print(retrieved_docs[0].page_content)
print("==================================Metadata==================================")
print(retrieved_docs[0].metadata)

# Reader LLM

## Reader Model

The reader LLM needs adequate natural language understanding and must be able to process the entire context (the user query plus the retrieved chunks) without truncating it.

Key considerations include the model's **maximum sequence length** (to ensure all context is processed), computational efficiency, and the balance between model size and capability to generate accurate responses.

We choose `HuggingFaceH4/zephyr-7b-beta` for its long-sequence processing capability and cost-effectiveness. To reduce the model's size and speed up inference, we load it quantized, using `BitsAndBytesConfig` from the `transformers` library.
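A back-of-the-envelope estimate shows why quantization helps with model size. The sketch below counts the weights only and assumes a round figure of 7 billion parameters; it ignores activations, the KV cache, and quantization overhead, so real memory use will be higher.

```python
def weight_memory_gb(num_parameters, bits_per_parameter):
    """Approximate memory for the weights alone, in gigabytes (10^9 bytes)."""
    return num_parameters * bits_per_parameter / 8 / 1e9


num_parameters = 7e9  # assumed round figure for a "7B" model

memory_16bit = weight_memory_gb(num_parameters, 16)  # 16-bit weights
memory_4bit = weight_memory_gb(num_parameters, 4)    # 4-bit quantized weights

print(f"16-bit: ~{memory_16bit:.1f} GB, 4-bit: ~{memory_4bit:.1f} GB")
```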

The cell below configures and loads the quantized model with `AutoModelForCausalLM` and `AutoTokenizer`, then wraps both in a `pipeline` for text generation.

New models are released regularly, so it is worth monitoring them and benchmarking performance to check that the chosen model remains the best fit for the project.

In [None]:
from transformers import pipeline
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

READER_MODEL_NAME = "HuggingFaceH4/zephyr-7b-beta"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    READER_MODEL_NAME, quantization_config=bnb_config
)
tokenizer = AutoTokenizer.from_pretrained(READER_MODEL_NAME)

READER_LLM = pipeline(
    model=model,
    tokenizer=tokenizer,
    task="text-generation",
    do_sample=True,
    temperature=0.2,
    repetition_penalty=1.1,
    return_full_text=False,
    max_new_tokens=500,
)

In [None]:
READER_LLM("What is 4+4? Answer:")

## Prompt

The RAG prompt template below is what we will feed to the Reader LLM.

In [None]:
prompt_in_chat_format = [
    {
        "role": "system",
        "content": """Using the information contained in the context,
give a comprehensive answer to the question.
Respond only to the question asked, response should be concise and relevant to the question.
Provide the number of the source document when relevant.
If the answer cannot be deduced from the context, do not give an answer.""",
    },
    {
        "role": "user",
        "content": """Context:
{context}
---
Now here is the question you need to answer.

Question: {question}""",
    },
]
RAG_PROMPT_TEMPLATE = tokenizer.apply_chat_template(
    prompt_in_chat_format, tokenize=False, add_generation_prompt=True
)
print(RAG_PROMPT_TEMPLATE)

#### Let's test it.

In [None]:
retrieved_docs_text = [
    doc.page_content for doc in retrieved_docs
]  # we only need the text of the documents
context = "\nExtracted documents:\n"
context += "".join(
    [f"Document {str(i)}:::\n" + doc for i, doc in enumerate(retrieved_docs_text)]
)

final_prompt = RAG_PROMPT_TEMPLATE.format(
    question="How to create a pipeline object?", context=context
)

# Generate an answer
answer = READER_LLM(final_prompt)[0]["generated_text"]
print(answer)

## Reranking

Reranking refines the results of the initial search: it promotes the most relevant documents and so improves the quality of the answers. `ColBERTv2` is well suited to this task. Unlike a bi-encoder, which compares a single vector per text, it is a late-interaction model that compares the question and the document token by token, resulting in a more accurate ranking. The `RAGatouille` library simplifies the integration of `ColBERTv2`.
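Token-level scoring of this kind can be pictured with a MaxSim-style score: each query token is matched with its best document token, and the matches are summed. The sketch below uses made-up 2-dimensional token vectors and hypothetical names. It illustrates the idea only and is not RAGatouille's implementation.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_token_vecs, doc_token_vecs):
    """Sum, over query tokens, of the best dot product with any document token."""
    return sum(
        max(dot(q, d) for d in doc_token_vecs)
        for q in query_token_vecs
    )


# Hypothetical token embeddings: the query has two tokens.
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
candidate_docs = {
    "doc_a": [[1.0, 0.0], [0.0, 1.0]],  # has a match for both query tokens
    "doc_b": [[1.0, 0.0], [1.0, 0.0]],  # only matches the first query token
}

# Rerank the candidates by descending score.
reranked_names = sorted(
    candidate_docs,
    key=lambda name: maxsim_score(query_tokens, candidate_docs[name]),
    reverse=True,
)
print(reranked_names)
```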

In [None]:
from ragatouille import RAGPretrainedModel
RERANKER = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")

# Assembling

In [None]:
from transformers import Pipeline


def answer_with_rag(
    question: str,
    llm: Pipeline,
    knowledge_index: FAISS,
    reranker: Optional[RAGPretrainedModel] = None,
    num_retrieved_docs: int = 30,
    num_docs_final: int = 5,
) -> Tuple[str, List[LangchainDocument]]:
    # Gather documents with retriever
    print("=> Retrieving documents...")
    relevant_docs = knowledge_index.similarity_search(
        query=question, k=num_retrieved_docs
    )
    relevant_docs = [doc.page_content for doc in relevant_docs]  # keep only the text

    # Optionally rerank results
    if reranker:
        print("=> Reranking documents...")
        relevant_docs = reranker.rerank(question, relevant_docs, k=num_docs_final)
        relevant_docs = [doc["content"] for doc in relevant_docs]

    relevant_docs = relevant_docs[:num_docs_final]

    # Build the final prompt
    context = "\nExtracted documents:\n"
    context += "".join(
        [f"Document {str(i)}:::\n" + doc for i, doc in enumerate(relevant_docs)]
    )

    final_prompt = RAG_PROMPT_TEMPLATE.format(question=question, context=context)

    # Generate an answer
    print("=> Generating answer...")
    answer = llm(final_prompt)[0]["generated_text"]

    return answer, relevant_docs

In [None]:
question = "how to create a pipeline object?"

answer, relevant_docs = answer_with_rag(
    question, READER_LLM, KNOWLEDGE_VECTOR_DATABASE, reranker=RERANKER
)

In [None]:
print("==================================Answer==================================")
print(f"{answer}")
print("==================================Source docs==================================")
for i, doc in enumerate(relevant_docs):
    print(f"Document {i}------------------------------------------------------------")
    print(doc)