# Question Answering on HuggingFace blog posts using RAG

In this notebook we demonstrate building a simple Question Answering application with GenAI.

## Building a RAG application with Anthropic Claude v2 and Cohere Embed v3 on Amazon Bedrock

Models used:
- Large Language Model: **anthropic.claude-v2** on Amazon Bedrock
- Embedding Model: **cohere.embed-english-v3** on Amazon Bedrock


Vector Store (to store embeddings): **[Qdrant](https://qdrant.tech/documentation/quick-start/)**

We use LangChain's [LCEL](https://python.langchain.com/docs/expression_language/why#lcel) to implement a sequential chain that answers questions about blog posts from the internet.

In [16]:
!pip install boto3 anthropic_bedrock qdrant_client transformers langchain rich  "unstructured[all-docs]" --quiet



In [2]:
import os
import sys
from typing import List

import boto3
from anthropic_bedrock import AI_PROMPT, HUMAN_PROMPT
from rich import print

# Extend sys.path before importing the local utils module so the import can resolve
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), os.pardir)) + "/utils")
from utils import utils

%load_ext autoreload
%autoreload 2
%load_ext rich

### Get Bedrock Model IDs for Cohere and Anthropic

We need the model ids to instantiate both the LLM and the embeddings model with LangChain.

The `utils.get_model_ids` helper function returns the model ids for a given provider and output modality.
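
The `utils` module is local to this notebook and its source is not shown here. As a rough sketch of what a helper like `get_model_ids` might do, the function below filters the result of the Bedrock control-plane call `list_foundation_models` by provider and output modality. The name `list_model_ids` and the choice to pass the client in as an argument are assumptions for illustration, not the notebook's actual implementation.

```python
from typing import Any, List


def list_model_ids(bedrock_client: Any, provider: str, output_modality: str) -> List[str]:
    """Return the Bedrock model ids for one provider and output modality.

    Hypothetical stand-in for `utils.get_model_ids`; `bedrock_client` is
    expected to be a boto3 "bedrock" (control plane) client.
    """
    response = bedrock_client.list_foundation_models(
        byProvider=provider,
        byOutputModality=output_modality,
    )
    return [summary["modelId"] for summary in response.get("modelSummaries", [])]


# Usage (needs AWS credentials and Bedrock access, so it is left commented out):
# import boto3
# bedrock = boto3.client("bedrock", region_name="us-west-2")
# print(list_model_ids(bedrock, "Cohere", "EMBEDDING"))
```

Passing the client in keeps the function easy to exercise with a stub when no AWS account is available.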

In [3]:
# Model ids for Embeddings
provider = "Cohere"  # Providers can be Amazon, Anthropic, Cohere
output_modality = "EMBEDDING"  # Can be TEXT, EMBEDDING
embed_model_ids = utils.get_model_ids(provider, output_modality)
print(embed_model_ids)

In [4]:
# Model ids for LLMs
provider = "Anthropic"
output_modality = "TEXT"
llm_model_ids = utils.get_model_ids(provider, output_modality)
print(llm_model_ids)

### Instantiate LLM and Embeddings

In [5]:
from langchain.embeddings.bedrock import BedrockEmbeddings
from langchain.llms.bedrock import Bedrock

region = "us-west-2"
b_client = boto3.client("bedrock-runtime", region_name=region)
# Model-specific inference parameters, passed to the LLM as model_kwargs
model_kwargs = utils.get_inference_parameters("anthropic")
llm_model_id = "anthropic.claude-v2"
embed_model_id = "cohere.embed-english-v3"

llm = Bedrock(
    client=b_client,
    model_kwargs=model_kwargs,
    model_id=llm_model_id,
    region_name=region,
)
embeddings = BedrockEmbeddings(
    client=b_client, model_id=embed_model_id, region_name=region
)

### Scrape a few blog posts for encoding

We use LangChain's `AsyncHtmlLoader` document loader to download the blog posts as HTML.

In [6]:
from langchain.document_loaders import AsyncHtmlLoader

urls = [
    "https://huggingface.co/blog/moe",
    "https://huggingface.co/blog/setfit-absa",
    "https://huggingface.co/blog/prodigy-hf",
    "https://huggingface.co/blog/personal-copilot",
    "https://aws.amazon.com/blogs/machine-learning/mitigate-hallucinations-through-retrieval-augmented-generation-using-pinecone-vector-database-llama-2-from-amazon-sagemaker-jumpstart/",
]
html_loader = AsyncHtmlLoader(urls)
html_docs = html_loader.load()



### Convert HTML docs into Text

We use Unstructured's [partition_html](https://unstructured-io.github.io/unstructured/core/partition.html#partition-html) to extract text from the HTML. `partition_html` helps clean and group the HTML text:

- Group articles by title using `chunking_strategy="by_title"`
- Assemble article content with `html_assemble_articles=True`
- Skip headers and footers with `skip_headers_and_footers=True`
- Remove any non-ASCII characters from the text with `clean_non_ascii_chars`

In [7]:
from langchain.docstore.document import Document
from unstructured.cleaners.core import clean_non_ascii_chars
from unstructured.partition.html import partition_html


def extract_text_chunks_from_html(urls, html_docs) -> List[Document]:
    """
    Convert downloaded HTML documents to cleaned plain text.

    Input: urls, html_docs (as returned by AsyncHtmlLoader)
    Output: List[Document] with plain-text page_content and updated metadata
    """
    extracted_docs = []
    for url, doc in zip(urls, html_docs):
        elements = partition_html(
            text=doc.page_content,
            html_assemble_articles=True,
            skip_headers_and_footers=True,
            chunking_strategy="by_title",
        )
        extracted_text = "".join([e.text for e in elements])
        # extract metadata
        metadata = doc.metadata
        metadata["language"] = "en"
        # extract links if available and append to metadata
        extracted_links = []
        for element in elements:
            if element.metadata.links is not None:
                print(element.metadata.links)
                link = element.metadata.links[0]["url"][1:]
                extracted_links.append(link)
        # Add extracted links to metadata as references
        if len(extracted_links) > 0:
            metadata["references"] = extracted_links
        doc.page_content = clean_non_ascii_chars(extracted_text)
        doc.metadata = metadata
        extracted_docs.append(doc)
    return extracted_docs

In [8]:
extracted_docs = extract_text_chunks_from_html(urls, html_docs)

### Split docs into chunks that fit the embedding model's max length (512 tokens)


In [9]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Note: chunk_size is measured in characters, not tokens, so it is not the same as the model's max length
splitter = RecursiveCharacterTextSplitter(
    add_start_index=True, chunk_size=2048, chunk_overlap=0
)
doc_chunks = splitter.split_documents(documents=extracted_docs)
print(len(doc_chunks))

The `cohere.embed-english-v3` model accepts a maximum input length of **512** tokens and outputs a **1024**-dimensional vector.

To calculate the number of tokens we need the model's tokenizer.

We use Cohere's tokenizer from HuggingFace, available [here](https://huggingface.co/Cohere/Cohere-embed-english-v3.0/tree/main).

In [10]:
from transformers import AutoTokenizer
from unstructured.staging.huggingface import chunk_by_attention_window

hf_model_id = "Cohere/Cohere-embed-english-v3.0"
tokenizer = AutoTokenizer.from_pretrained(hf_model_id)


def get_token_len(text) -> int:
    tokens = tokenizer.tokenize(text)
    return len(tokens)


# Sanity check: flag chunks that have more tokens than the model can accept
for idx, _chunk in enumerate(doc_chunks):
    num_tokens = get_token_len(_chunk.page_content)
    if num_tokens > 512:
        print(f"Chunk {idx} has {num_tokens} tokens")
        # print(_chunk)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Token indices sequence length is longer than the specified maximum sequence length for this model (578 > 512). Running this sequence through the model will result in indexing errors
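
The check above only reports chunks that exceed the limit. One possible follow-up, sketched here under assumptions, is to re-split any oversized chunk by token count rather than by characters. `split_by_token_budget` is a hypothetical helper that accepts any token-counting callable; a whitespace counter stands in for the Cohere tokenizer so the snippet runs without downloads. Note that it normalizes whitespace and keeps a single over-long word as its own chunk.

```python
from typing import Callable, List


def split_by_token_budget(
    text: str,
    count_tokens: Callable[[str], int],
    max_tokens: int = 512,
) -> List[str]:
    """Greedily pack whitespace-separated words into chunks of at most `max_tokens`."""
    chunks: List[str] = []
    current: List[str] = []
    for word in text.split():
        candidate = " ".join(current + [word])
        if current and count_tokens(candidate) > max_tokens:
            # Adding this word would overflow the budget: close the chunk.
            chunks.append(" ".join(current))
            current = [word]
        else:
            current.append(word)
    if current:
        chunks.append(" ".join(current))
    return chunks


def whitespace_token_count(text: str) -> int:
    """Stand-in tokenizer: one token per whitespace-separated word."""
    return len(text.split())


sample = "one two three four five six seven"
print(split_by_token_budget(sample, whitespace_token_count, max_tokens=3))
# ['one two three', 'four five six', 'seven']
```

In the notebook, `get_token_len` could be passed as `count_tokens`. Since `tokenizer.tokenize` does not count special tokens, leaving a little headroom below 512 is probably wise.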


### Add docs to vectorstore (Qdrant)

Install and run the `qdrant` vector store locally using Docker.

Refer here for installation: <https://qdrant.tech/documentation/quick-start/>

Qdrant should be running on port `6333` on localhost.

In [11]:
from langchain.vectorstores.qdrant import Qdrant
from qdrant_client import QdrantClient

collection_name = "mlblogs_coherev3"  # define collection name
qclient = QdrantClient(location="localhost", port=6333)

# Wrap the Qdrant client in a LangChain vector store
db = Qdrant(
    client=qclient,
    collection_name=collection_name,
    distance_strategy="cosine",
    embeddings=embeddings,
)

# Add documents to vector db with force_recreate=True for testing
print(
    f"Adding [b]{len(doc_chunks)}[/b] chunks to Qdrant collection: [b green]{collection_name}[/b green]"
)
db = db.from_documents(
    documents=doc_chunks,
    embedding=embeddings,
    collection_name=collection_name,
    force_recreate=True,  # Set this to False in PROD
)
print("Done")

### Testing Qdrant retriever

For a given query, retrieve the **top 5** documents and check whether the returned document chunks are relevant to the query.

We set `search_type` to `similarity`; `mmr` is another option worth trying.
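
For intuition on how `mmr` differs from plain similarity search: maximal marginal relevance picks documents one at a time, trading off similarity to the query against similarity to documents already picked. The following is a minimal pure-Python sketch of that idea, not LangChain's or Qdrant's implementation; the function names and the `lambda_mult` values are assumptions chosen for illustration.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr_select(
    query: Sequence[float],
    docs: List[Sequence[float]],
    k: int = 5,
    lambda_mult: float = 0.5,
) -> List[int]:
    """Return indices of up to `k` docs chosen by maximal marginal relevance."""
    selected: List[int] = []
    candidates = list(range(len(docs)))
    while candidates and len(selected) < k:

        def score(i: int) -> float:
            relevance = cosine(query, docs[i])
            # Penalize candidates that resemble something already selected.
            redundancy = max((cosine(docs[i], docs[j]) for j in selected), default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


query_vec = [1.0, 0.0]
doc_vecs = [[1.0, 0.0], [1.0, 0.05], [0.6, 0.8]]
print(mmr_select(query_vec, doc_vecs, k=2, lambda_mult=0.3))  # [0, 2]
```

With `lambda_mult=1.0` the redundancy penalty vanishes and the same call returns `[0, 1]`, which is what pure similarity ranking would pick.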

In [12]:
# Define retriever args; the number of results is set through search_kwargs["k"]
retriever_kwargs = {"search_type": "similarity", "search_kwargs": {"k": 5}}
retriever = db.as_retriever(**retriever_kwargs)
query = "How to do RAG in Amazon SageMaker"  # Change this to another question to test
retriever.get_relevant_documents(query)


[1m[[0m
    [1;35mDocument[0m[1m([0m
        [33mpage_content[0m=[32m'Above output shows that were returning relevant contexts to help us answer our question. Since we top_k = 1, index.query returned the top result along side the metadata which reads Managed Spot Training can be used with all instances supported in Amazon.Augmenting the Prompt\n\nUse the retrieved contexts to augment the prompt and decide on a maximum amount of context to feed into the LLM. Use the 1000 characters limit to iteratively add each returned context to the prompt until you exceed the content length.Augmenting the Prompt\n\nFeed the context_str into the LLM prompt as shown in the following screen capture:\n\n[0m[32m[[0m[32mInput[0m[32m][0m[32m: Which instances can I use with Managed Spot Training in SageMaker?\n\n[0m[32m[[0m[32mOutput[0m[32m][0m[32m: Based on the context provided, you can use Managed Spot Training with all instances supported in Amazon SageMaker. Therefore, the answe

### Create RAG prompt

Here we create a RAG prompt with re-ranking built in.

We ask Claude to first rerank the context documents from 1 to 5 by their relevance to the question, and then to use *only the top-ranked documents* to answer it.

Placing the `<question>` at the end of the prompt tends to improve the LLM's output quality (a prompting best practice).

In [13]:
from langchain.prompts.prompt import PromptTemplate

rag_template = """Given the following retrieved context documents, your task is to rerank the contexts based on their relevance to truthfully and completely answering the user's question provided in the <question> tags.
Then use only the top ranked context to provide an answer to the question.
If you don't have the information just say so. Sometimes the retrieved documents may not contain the information you need. In such cases, say 'Sorry, I don't have enough information'.

Retrieved documents: 

{context}

please rerank the documents above from most (1) to least (5) relevant in directly and fully answering the user's specific question "<question>".
Evaluate relevance based on how precisely each document answers this question if taken alone.

Document ranking:
1.
2.
3.
4.
5.

Now using only the top ranked documents, please provide a clear and concise answer to the question in <answer> tags.

Do NOT output <answer> with any preamble. Just answer the question in a direct manner.

User's question: <question> {question} </question>"""

# We need to add HUMAN_PROMPT and AI_PROMPT to the template (Anthropic specific)
rag_prompt = PromptTemplate.from_template(
    template=f"{HUMAN_PROMPT}{rag_template}{AI_PROMPT}"
)
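
Wrapping the template in an f-string is safe here: only `HUMAN_PROMPT`, `rag_template`, and `AI_PROMPT` are interpolated, and the `{context}` and `{question}` placeholders inside `rag_template` survive for later formatting. The sketch below reproduces this with plain `str.format`. The two constants are written out as the literal strings the Anthropic SDK defines, and the shortened template and sample context are assumptions for illustration.

```python
# Literal values of the Anthropic SDK constants used in the notebook.
HUMAN_PROMPT = "\n\nHuman:"
AI_PROMPT = "\n\nAssistant:"

# Shortened stand-in for rag_template (illustration only).
short_template = (
    "Retrieved documents:\n{context}\n\n"
    "User's question: <question> {question} </question>"
)

# The f-string inserts the template text verbatim; its braces are not re-evaluated.
full_template = f"{HUMAN_PROMPT}{short_template}{AI_PROMPT}"

rendered = full_template.format(
    context="<context1> MoE layers replace dense FFN layers. </context1>",
    question="What is MoE?",
)
print(rendered)
```

The rendered prompt starts with the human turn marker and ends with the assistant turn marker, which is the layout the Claude v2 text-completion format expects.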

### Create sequential chain

- Formatting the retrieved context docs into `<context1>..</context1>`, `<context2>..</context2>` tags helps the model re-rank them.
- The `format_context_docs` function does that and returns the formatted string.
- Before the prompt, we call this function as a `RunnableLambda`, which injects the formatted string into the `{context}` variable of the prompt.
- We also define `question` as a `RunnablePassthrough`, which lets the question be passed directly into the `rag_chain.invoke` call.

In [14]:
from langchain.schema.output_parser import StrOutputParser
from langchain.schema.runnable import RunnableLambda, RunnablePassthrough

retriever_kwargs = {"search_type": "similarity", "search_kwargs": {"k": 5}}
retriever = db.as_retriever(
    **retriever_kwargs
)  # Passing this to context will get back 5 docs


def format_context_docs(query, retriever):
    docs = retriever.get_relevant_documents(query)
    context_string = ""
    for idx, _d in enumerate(docs):
        otag = f"<context{idx+1}>"
        ctag = f"</context{idx+1}>"
        c_text = f"{otag} {_d.page_content} {ctag}\n"
        context_string += c_text
    return context_string


rag_chain = (
    {
        "question": RunnablePassthrough(),
        "context": RunnableLambda(
            lambda output: format_context_docs(query=output, retriever=retriever)
        ),
    }
    | rag_prompt
    | llm
    | StrOutputParser()
)
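
Conceptually, the chain above fans the input question out to both dictionary keys, fills the prompt, calls the model, and returns a string. The plain-Python sketch below mirrors that data flow with stub components; `fake_retrieve`, `fake_llm`, and `run_chain` are hypothetical placeholders, not LangChain or Bedrock APIs.

```python
from typing import Callable, Dict, List


def fake_retrieve(query: str) -> List[str]:
    """Stub retriever: returns canned passages regardless of the query."""
    return ["MoE layers contain several experts.", "A router sends tokens to experts."]


def format_context(passages: List[str]) -> str:
    """Wrap each passage in numbered <contextN> tags, one per line."""
    return "".join(
        f"<context{i}> {p} </context{i}>\n" for i, p in enumerate(passages, start=1)
    )


def fake_llm(prompt: str) -> str:
    """Stub model: reports how many context blocks it was given."""
    return f"<answer>received {prompt.count('</context')} context blocks</answer>"


def run_chain(question: str, llm: Callable[[str], str] = fake_llm) -> str:
    # Step 1: the mapping step hands the same input to every key.
    inputs: Dict[str, str] = {
        "question": question,
        "context": format_context(fake_retrieve(question)),
    }
    # Step 2: fill the prompt. Step 3: call the model and return its text.
    template = "Retrieved documents:\n{context}\nUser's question: <question> {question} </question>"
    return llm(template.format(**inputs))


print(run_chain("What is MoE?"))
# <answer>received 2 context blocks</answer>
```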

### Let's ask a few questions based on the following posts

- https://huggingface.co/blog/moe
- https://huggingface.co/blog/setfit-absa

In [15]:
from IPython.display import Markdown, display

queries = [
    "What is MoE?",
    "What is SetFit? How does it work",
    "What are model sizes of SetFit compared to the others",
]

for q in queries:
    print(f"[b]Question: [b green]{q}[/b green]")
    output = rag_chain.invoke(q)
    display(Markdown(output))
    print("===" * 15)

 <answer>A Mixture of Experts (MoE) consists of two main elements:

1. Sparse MoE layers that replace dense feed-forward network (FFN) layers. MoE layers contain a number of experts, where each expert is a neural network (typically FFNs). 

2. A gate network or router that determines which tokens are sent to which expert. 

So in summary, in a MoE model, every FFN layer is replaced with a MoE layer containing multiple experts and a gating network to route tokens to experts. This enables efficient pretraining and faster inference compared to dense transformer models.
</answer>

 <answer>SetFit is a few-shot learning framework for training sentence classification models. It works in 3 main steps:

1. Extract aspect candidates from text using spaCy. 

2. Train a SetFit model to classify candidates as aspects or non-aspects. This is done by concatenating each candidate with the full text to create training instances.

3. Train another SetFit model to classify sentiment polarity for extracted aspects.</answer>

 <answer>We see a clear advantage of SetFitABSA when the number of training instances is low, despite being 2x smaller than T5 and x3 smaller than GPT2-medium.  Even when compared to Llama 2, which is x64 larger, the performance is on par or better.</answer>
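
The raw outputs above still carry the `<answer>` tags requested in the prompt. If only the answer text is wanted, a small post-processing step can strip them. `extract_answer` below is a hypothetical helper, and falling back to the full text when no tags are found is an assumption about the desired behavior.

```python
import re


def extract_answer(text: str) -> str:
    """Return the text inside the first <answer>...</answer> pair, else the stripped input."""
    match = re.search(r"<answer>(.*?)</answer>", text, flags=re.DOTALL)
    return match.group(1).strip() if match else text.strip()


raw = " <answer>SetFit is a few-shot learning framework.</answer>"
print(extract_answer(raw))
# SetFit is a few-shot learning framework.
```

Such a function could plausibly be appended to the chain as a final `RunnableLambda(extract_answer)` step after `StrOutputParser()`.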