# Understanding Retrieval Question Answering


## Setup


In [1]:
%pip install -Uqqq rich tiktoken wandb langchain unstructured tabulate pdf2image chromadb

Note: you may need to restart the kernel to use updated packages.


In [None]:
# to use huggingface models:
%pip install --upgrade transformers

In [None]:
%pip install -U "huggingface_hub[cli]" 
!huggingface-cli login

In [2]:
import os, random
from pathlib import Path
import tiktoken
from getpass import getpass
from rich.markdown import Markdown

In [None]:
# Check if the Hugging Face API token is set
if os.getenv("HUGGINGFACE_API_TOKEN") is None:
    if any(['VSCODE' in x for x in os.environ.keys()]):
        print('Please enter password in the VS Code prompt at the top of your VS Code window!')
    os.environ["HUGGINGFACE_API_TOKEN"] = getpass("Paste your Hugging Face API token from: https://huggingface.co/settings/tokens\n")

assert os.getenv("HUGGINGFACE_API_TOKEN", "").startswith("hf_"), "This doesn't look like a valid Hugging Face API token"
print("Hugging Face API token is configured")

Please enter password in the VS Code prompt at the top of your VS Code window!


## LangChain


LangChain is a framework for developing applications powered by language models. We will use some of its features in the code below. Let's start by configuring W&B tracing.

In [3]:
# we need a single line of code to start tracing langchain with W&B
os.environ["LANGCHAIN_WANDB_TRACING"] = "true"

# wandb documentation to configure wandb using env variables
# https://docs.wandb.ai/guides/track/advanced/environment-variables
# here we are configuring the wandb project name
os.environ["WANDB_PROJECT"] = "llmapps"

## Parsing documents


We will use a small sample of Markdown documents in this notebook. Let's load them and check whether they can be stuffed into a prompt. Any document that exceeds the model's token limit will need to be split into smaller chunks.

In [None]:
%pip install -U langchain-community

In [4]:
import torch
import transformers

from transformers import set_seed
from langchain_community.document_loaders import DirectoryLoader, TextLoader

# Set random seed for reproducibility
set_seed(17)

# Initialize the tokenizer and model
# model_id = "meta-llama/Meta-Llama-3.1-70B-Instruct"
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)
model = transformers.AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")


Loading checkpoint shards: 100%|██████████| 4/4 [00:02<00:00,  1.38it/s]
Some parameters are on the meta device because they were offloaded to the cpu and disk.


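The next cell loads the files with `DirectoryLoader`. See the LangChain how-to guide on loading documents from a directory: https://python.langchain.com/v0.2/docs/how_to/document_loader_directory/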

In [5]:
import time

# Function to find all markdown files in a directory and return them as LangChain Documents
def find_md_files(directory):
    "Find all markdown files in a directory and return a list of LangChain Documents"
    start_time = time.time()
    # dl = DirectoryLoader(directory, "**/*.md")
    loader = DirectoryLoader(directory,  glob="**/*.md", loader_cls=TextLoader, show_progress=True)
    documents = loader.load()
    end_time = time.time()
    print(f"Time taken to load documents: {end_time - start_time:.2f} seconds")
    return documents

# Load documents from the specified directory
documents = find_md_files(directory="docs_sample/")
print(f"Number of documents loaded: {len(documents)}")

100%|██████████| 11/11 [00:00<00:00, 5508.94it/s]

Time taken to load documents: 0.01 seconds
Number of documents loaded: 11





In [6]:
# Function to count tokens in each document
def count_tokens(documents):
    token_counts = [len(tokenizer.encode(document.page_content)) for document in documents]
    return token_counts

# Count tokens in the documents
token_counts = count_tokens(documents)
print(f"Token counts: {token_counts}")

Token counts: [366, 2597, 2939, 4180, 802, 1206, 538, 957, 2092, 2526, 1645]
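
With the per-document token counts in hand, one simple way to see how much fits into a single prompt is to add documents in order until a budget is reached. The sketch below is illustrative only: `pack_within_budget` and `sample_counts` are hypothetical names, and the 8192-token budget is an assumed context limit, not a value taken from this notebook.

```python
def pack_within_budget(token_counts, budget):
    """Take documents in order while the running token total stays within `budget`."""
    selected, total = [], 0
    for index, count in enumerate(token_counts):
        if total + count > budget:
            break  # the next document would overflow the budget
        selected.append(index)
        total += count
    return selected, total

# Token counts copied from the output above.
sample_counts = [366, 2597, 2939, 4180, 802, 1206, 538, 957, 2092, 2526, 1645]

# Assumption: an 8192-token context window.
selected, used = pack_within_budget(sample_counts, budget=8192)
print(selected, used)  # prints: [0, 1, 2] 5902
```

Under that assumed budget only the first three documents fit, which is why the notebook goes on to split the documents and retrieve just the relevant sections.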


We will use LangChain's built-in MarkdownTextSplitter to split the documents into sections. Splitting Markdown without breaking its syntax is not easy. This splitter strips out syntax.

- We can pass the `chunk_size` param to avoid lengthy chunks. By default the size is measured in characters, not tokens.
- The `chunk_overlap` param is useful so you don't cut sentences at arbitrary points. This is less necessary with Markdown.

The MarkdownTextSplitter also removes double line breaks, which saves us some tokens.
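
To make the two parameters concrete, here is a minimal sliding-window sketch of what a chunk size and a chunk overlap mean. It is not how MarkdownTextSplitter works internally (that splitter looks for Markdown-aware separators); `chunk_text` is a hypothetical helper that only illustrates the parameters, measured in characters.

```python
def chunk_text(text, chunk_size, chunk_overlap=0):
    """Split `text` into windows of at most `chunk_size` characters.

    Each window repeats the last `chunk_overlap` characters of the previous one.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks

chunks = chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)  # prints: ['abcd', 'defg', 'ghij']
```

Each chunk starts with the last character of the previous one, so text cut at a boundary still appears intact in at least one chunk once the overlap is large enough.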

In [7]:
from langchain.text_splitter import MarkdownTextSplitter

md_text_splitter = MarkdownTextSplitter(chunk_size=1000)
document_sections = md_text_splitter.split_documents(documents)
# len(document_sections), max(count_tokens(document_sections))
print(f"Number of document sections: {len(document_sections)}")
print(f"Max tokens in a section: {max(count_tokens(document_sections))}")

Number of document sections: 124
Max tokens in a section: 384


Take a look at the first section:



In [8]:
Markdown(document_sections[0].page_content)

## Embeddings


Now we will use embeddings with a vector database retriever to find relevant documents for a query.



In [None]:
%pip install -U langchain_huggingface

In [None]:
%pip install sentence-transformers

In [None]:
from langchain.vectorstores import Chroma
from langchain_huggingface import HuggingFaceEmbeddings


# Custom embedding function class
# class CustomEmbeddingFunction:
#     def embed_documents(self, texts):
#         return [self.generate_embeddings(text) for text in texts]

#     def embed_query(self, query):
#         return self.generate_embeddings(query)

#     def generate_embeddings(self, text):
#         inputs = tokenizer(text, return_tensors="pt").to("cuda")
#         with torch.no_grad():
#             outputs = model(**inputs, output_hidden_states=True)
#         # Use the last hidden state as the embedding and convert to float32
#         embeddings = outputs.hidden_states[-1].mean(dim=1).squeeze().to(torch.float32).cpu().numpy()
#         return embeddings

# # Create an instance of the custom embedding function
# embedding_function = CustomEmbeddingFunction()

# Generate embeddings for each document section using HuggingFaceEmbeddings from langchain_huggingface
model_name = "sentence-transformers/all-mpnet-base-v2"
model_kwargs = {"device": "cuda"}  # Use 'cuda' for GPU or 'cpu' for CPU
encode_kwargs = {"normalize_embeddings": False}
embeddings = HuggingFaceEmbeddings(model_name=model_name, model_kwargs=model_kwargs, encode_kwargs=encode_kwargs)



# Store the document sections in a Chroma vector store
db = Chroma.from_documents(document_sections, embeddings)

We can now create a retriever from the db. The `k` param sets how many of the most relevant sections the similarity search returns.



In [10]:
retriever = db.as_retriever(search_kwargs=dict(k=3))
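
Conceptually, a vector-store retriever embeds the query and returns the `k` stored sections whose embeddings are closest to it. The following is a minimal sketch of that idea in plain Python, assuming cosine similarity and hand-made 2-D toy vectors. `cosine_similarity`, `top_k`, and `toy_vectors` are illustrative names, and Chroma's actual index and default distance metric may differ.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k=3):
    """Return the indices of the k vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), i) for i, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]

# Toy "embeddings": real ones have hundreds of dimensions.
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
nearest = top_k([1.0, 0.0], toy_vectors, k=2)
print(nearest)  # prints: [0, 2]
```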


In [None]:
query = "How can I share my W&B report with my team members in a public W&B project?"
docs = retriever.invoke(query)

In [12]:
# Let's see the results
for doc in docs:
    print(doc.metadata["source"])

docs_sample\collaborate-on-reports.md
docs_sample\teams.md
docs_sample\collaborate-on-reports.md


In [None]:
# Debugging: Print the embeddings for all document sections
for i, doc in enumerate(document_sections):
    doc_embedding = embeddings.embed_documents([doc.page_content])[0]
    print(f"Document {i} Embedding:", doc_embedding)
    print(f"Document {i} Source:", doc.metadata["source"])

## Stuff Prompt


We'll now take the content of the retrieved documents, stuff it into a prompt template along with the query, and pass the result to an LLM to obtain the answer.



In [13]:
from langchain.prompts import PromptTemplate

PromptTemplate:

Prompt template for a language model.

A prompt template consists of a string template. It accepts a set of parameters from the user that can be used to generate a prompt for a language model.

The template can be formatted using either f-strings (default) or jinja2 syntax.

Example:

        from langchain_core.prompts import PromptTemplate

        # Instantiation using from_template (recommended)
        prompt = PromptTemplate.from_template("Say {foo}")
        prompt.format(foo="bar")

        # Instantiation using initializer
        prompt = PromptTemplate(template="Say {foo}")

In [14]:
prompt_template = """Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.

{context}

Question: {question}
Helpful Answer:"""
PROMPT = PromptTemplate(
    template=prompt_template, input_variables=["context", "question"]
)

context = "\n\n".join([doc.page_content for doc in docs])
prompt = PROMPT.format(context=context, question=query)

Now pass the formatted prompt, which contains the retrieved context and the question, directly to our Hugging Face model to generate an answer.



In [None]:
# Generate the response using the Hugging Face model
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
# inputs = tokenizer(prompt, return_tensors="pt").to("cpu")

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=150)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
# Markdown(response)


In [20]:
# Post-process the response to remove the prompt and question
helpful_response = response.split("Helpful Answer:")[1].strip()

# Display the response
display(Markdown(helpful_response))
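
Indexing the result of `split("Helpful Answer:")` with `[1]` raises an `IndexError` if the marker is missing from the text. A slightly more defensive variant is sketched below. `extract_answer` is a hypothetical helper, not part of any library, and it assumes that falling back to the full text is acceptable when the marker is absent.

```python
def extract_answer(generated_text, marker="Helpful Answer:"):
    """Return the text after the first `marker`, or all of the text if it is absent."""
    _, found, tail = generated_text.partition(marker)
    return tail.strip() if found else generated_text.strip()

sample = "Question: What is W&B?\nHelpful Answer: An experiment tracking tool."
print(extract_answer(sample))            # prints: An experiment tracking tool.
print(extract_answer("no marker here"))  # prints: no marker here
```

Unlike `split(...)[1]`, `partition` keeps everything after the first marker, even if the model repeats the marker later in its output.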

## Using LangChain

With LangChain's tools, we can do the same retrieval, prompt stuffing, and generation in a few lines of code.

In [28]:
model_id = "google/gemma-2-2b"
tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)
model = transformers.AutoModelForCausalLM.from_pretrained(model_id, device_map="cuda")


Loading checkpoint shards: 100%|██████████| 3/3 [01:17<00:00, 25.84s/it]


In [29]:
# Print device map for debugging
print(f"Device map: {model.hf_device_map}")

Device map: {'': device(type='cuda')}


In [30]:
from langchain.chains import RetrievalQA
from langchain_huggingface import HuggingFaceEndpoint, HuggingFacePipeline
from transformers import pipeline
from tqdm import tqdm


# Initialize the HuggingFace pipeline
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, max_new_tokens=70)  

# Initialize the HuggingFace LLM
# model_kwargs = {"tokenizer": tokenizer}
# llm = HuggingFaceEndpoint(model=model, model_kwargs=model_kwargs)
llm = HuggingFacePipeline(pipeline=pipe)


# Create a RetrievalQA chain
qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=retriever)
# result = qa.run(query)
# Add a progress bar
with tqdm(total=1, desc="Running RetrievalQA") as pbar:
    result = qa.run(query)
    pbar.update(1)

# Display the result
display(Markdown(result))

Running RetrievalQA:   0%|          | 0/1 [00:00<?, ?it/s]The 'max_batch_size' argument of HybridCache is deprecated and will be removed in v4.46. Use the more precisely named 'batch_size' argument instead.
Starting from v4.46, the `logits` model output will have the same type as the model (except at train time, where it will always be FP32)
Running RetrievalQA: 100%|██████████| 1/1 [00:52<00:00, 52.64s/it]
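
As a rough mental model, a "stuff" retrieval chain performs the same three steps as the manual version earlier: retrieve, fill the template, generate. The sketch below mimics that flow with stand-in functions. It is not LangChain's implementation, and `answer_with_context`, `fake_retrieve`, `fake_generate`, and `TEMPLATE` are all hypothetical names used only for illustration.

```python
TEMPLATE = "{context}\n\nQuestion: {question}\nHelpful Answer:"

def answer_with_context(question, retrieve, generate, template=TEMPLATE):
    """Retrieve documents, stuff them into the template, and call the model."""
    docs = retrieve(question)
    context = "\n\n".join(docs)
    prompt = template.format(context=context, question=question)
    return generate(prompt)

def fake_retrieve(question):
    # Stand-in for a real retriever: always returns the same two snippets.
    return ["Reports can be shared via a link.", "Teams collaborate in projects."]

def fake_generate(prompt):
    # Stand-in for an LLM that returns the prompt followed by its completion.
    return prompt + " Share the report link."

output = answer_with_context("How do I share a report?", fake_retrieve, fake_generate)
print(output)
```

Because the stand-in model echoes the prompt before its completion, the output still has to be trimmed at "Helpful Answer:", which mirrors the post-processing step used in this notebook.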


In [31]:
# Post-process the response to remove the prompt and question
helpful_result = result.split("Helpful Answer:")[1].strip()

# Display the response
display(Markdown(helpful_result))

In [32]:
import wandb
wandb.finish()

In [None]:
%pip list --format=freeze > requirements.txt