# Question answering with Retrieval Augmented Generation (RAG)

This notebook provides a practical example of how to build a question answering system using Retrieval Augmented Generation. To create a RAG-based Q&A tool, you need to do two main things: first, index your own set of documents for the system to use, and second, set up a way to retrieve the right documents when the application needs to answer a question.

## Overview of document indexing

Indexing your documents involves several steps:

1. Gather the documents you want to use.
2. Convert these documents to plain text or Markdown.
3. Break the text down into smaller parts called "chunks".
4. Create document embeddings for these chunks.
5. Load the embeddings into a vector store system.

<img src="./img/rag_indexing.png" alt="RAG indexing overview" width="800"/>

## Overview of document retrieval

At retrieval time, the system calculates the embedding of the query. The vector store then looks up the document chunks that are semantically most similar to the query, using a k-nearest-neighbor search over the document embeddings. The most relevant chunks are passed to a large language model (LLM) together with an instruction to answer the original question based on them.
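
The nearest-neighbor lookup can be illustrated with a minimal, brute-force sketch in plain Python. The three-dimensional vectors and the names `toy_index`, `toy_query_vec` and `k_nearest` are made up for illustration; real embedding models return far more dimensions, and vector stores typically rely on optimized (often approximate) indexes rather than a full scan.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def k_nearest(query_vec, index, k):
    """Return the k chunk texts whose embeddings are most similar to query_vec."""
    ranked = sorted(
        index,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]

# Hypothetical (text, embedding) pairs standing in for an indexed document.
toy_index = [
    ("Leonardo da Vinci studied anatomy.", [0.9, 0.1, 0.0]),
    ("The university was founded in 1361.", [0.1, 0.9, 0.1]),
    ("Pavia is a city in Lombardy.", [0.0, 0.2, 0.9]),
]

# Pretend embedding of the query "What did Leonardo study?"
toy_query_vec = [0.8, 0.2, 0.1]

top_chunks = k_nearest(toy_query_vec, toy_index, k=2)
print(top_chunks)
```

The chunks returned by such a search are what ends up in the prompt sent to the LLM.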

In [1]:
from utils.diagrams import rag
rag(theme="light", background_color="!white")

## Document preprocessing

### Acquire source document from the web and convert it into plain text

The following code snippet does three things: it gets a webpage from the internet, reads its HTML code, and then pulls out the main text from that page.

Note that Wikipedia has a public API that lets you download page contents in a structured format. For this demonstration, however, we read the page's HTML directly, as a web browser would receive it. On Wikipedia pages, the main part of the article usually sits in a `div` element with the attribute `id="bodyContent"`. Our code finds this element and extracts its text, discarding the HTML markup inside it.
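
To make the extraction step concrete, here is a small offline sketch of the same idea that uses only the standard library's `html.parser` instead of BeautifulSoup. The class `BodyContentExtractor` and the inline `sample_html` are illustrative assumptions, not part of the notebook's actual pipeline.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag and must not affect nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}

class BodyContentExtractor(HTMLParser):
    """Collect the text found inside the element with the given id."""

    def __init__(self, target_id="bodyContent"):
        super().__init__()
        self.target_id = target_id
        self.depth = 0   # 0 means we are outside the target element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth > 0:
            self.depth += 1          # nested element inside the target
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1           # entering the target element

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0 and data.strip():
            self.parts.append(data.strip())

sample_html = """
<html><body>
<div id="nav">Main page | Contents</div>
<div id="bodyContent"><p>The <b>University of Pavia</b> is a university.</p>
<div>Founded in 1361.</div></div>
<div id="footer">Privacy policy</div>
</body></html>
"""

extractor = BodyContentExtractor()
extractor.feed(sample_html)
extracted_text = " ".join(extractor.parts)
print(extracted_text)
```

Only the text nested under `id="bodyContent"` is kept; the navigation and footer text are dropped.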

In [2]:
import bs4
from langchain_community.document_loaders import WebBaseLoader

doc_url = "https://en.wikipedia.org/wiki/University_of_Pavia"

loader = WebBaseLoader(
    web_paths=(doc_url,),
    bs_kwargs=dict(parse_only=bs4.SoupStrainer(id=("bodyContent"))),
    bs_get_text_kwargs=dict(separator=" ")
)
docs = loader.load()

print(f"Document loader has returned {len(docs)} documents.")
print("The beginning of the first document:")
print("-" * 80)
print(docs[0].page_content[:1500])

Document loader has returned 1 documents.
The beginning of the first document:
--------------------------------------------------------------------------------

 
 
 Coordinates :  45°11′12″N   9°9′23″E ﻿ / ﻿ 45.18667°N 9.15639°E ﻿ /  45.18667; 9.15639 
 
 From Wikipedia, the free encyclopedia 
 
 
 Public university in Pavia, Italy 
 University of Pavia Università di Pavia Seal  of the University of Pavia Latin :  Alma Ticinensis Universitas Type Public Established 13 April 1361 ; 663 years ago  ( 1361-04-13 ) Rector Francesco Svelto Academic staff 981 Students 23,849 Undergraduates 11,983 Postgraduates 9,366 Location Pavia ,  Italy 45°11′12″N   9°9′23″E ﻿ / ﻿ 45.18667°N 9.15639°E ﻿ /  45.18667; 9.15639 Campus Urban/University town Colors    Pavia Yellow Affiliations Coimbra Group ,  EUA ,  Netval Website unipv .eu 
 The  University of Pavia  ( Italian :  Università degli Studi di Pavia ,  UNIPV  or  Università di Pavia ;  Latin :  Alma  Ticinensis  Universitas ) is a university locat

### Split the document into chunks

The following code snippet splits the source document into chunks. It keeps dividing the document at meaningful breaks (double new lines, single new lines, punctuation, and spaces between words) until each piece is smaller than the maximum chunk size. It also keeps a 20% overlap between consecutive chunks.

It is worth noting that the length of a document can be measured in various ways. The most common method is counting the number of characters, and this is what langchain text splitters usually do by default. However, many large language models (LLMs) process the input text differently. They use a tool called a _tokenizer_ to break down the text and then measure its length in _number of tokens_ instead of characters. For instance, the tokenizers used by OpenAI models are available in the `tiktoken` library. To accurately determine the length of a prompt, it's essential to measure it in tokens. Fortunately, langchain can measure chunk lengths in `tiktoken` tokens.
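
The recursive idea can be sketched in a few lines of plain Python. This is a simplified illustration, not langchain's implementation: it measures length in characters rather than tokens, leaves out the overlap between chunks, and the function name `recursive_split` and its default separators are assumptions made for this example.

```python
def recursive_split(text, max_len, separators=("\n\n", "\n", ". ", " ")):
    """Split text into pieces of at most max_len characters.

    Coarse separators are tried first; a piece that is still too long
    is split again with the remaining, finer separators.
    """
    if len(text) <= max_len:
        return [text] if text.strip() else []
    for i, sep in enumerate(separators):
        if sep not in text:
            continue
        chunks, current = [], ""
        for part in text.split(sep):
            candidate = current + sep + part if current else part
            if len(candidate) <= max_len:
                current = candidate          # keep merging small parts
                continue
            if current:
                chunks.append(current)       # close the chunk built so far
                current = ""
            if len(part) <= max_len:
                current = part
            else:
                # The part alone is too long: recurse with finer separators.
                chunks.extend(recursive_split(part, max_len, separators[i + 1:]))
        if current:
            chunks.append(current)
        return chunks
    # No separator applies: fall back to a hard cut.
    return [text[i:i + max_len] for i in range(0, len(text), max_len)]

sample_text = (
    "First paragraph. It has two sentences.\n\n"
    "Second paragraph is a bit longer than the first one.\n\n"
    "Third."
)
chunks = recursive_split(sample_text, max_len=40)
for chunk in chunks:
    print(len(chunk), repr(chunk))
```

The first paragraph fits into one chunk as is, while the longer second paragraph is split further at word boundaries.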

In [3]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

chunk_size = 500
chunk_overlap_rate = 0.2

text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=chunk_size, chunk_overlap=int(chunk_size * chunk_overlap_rate)
)

splits = text_splitter.split_documents(docs)
print(f"The document was split into {len(splits)} chunks.")

The document was split into 50 chunks.


In [4]:
text0 = splits[0].page_content
text1 = splits[1].page_content
print("The first two chunks:")
print("-" * 80)
print(text0)
print("-" * 80)
print(text1)

The first two chunks:
--------------------------------------------------------------------------------
Coordinates :  45°11′12″N   9°9′23″E ﻿ / ﻿ 45.18667°N 9.15639°E ﻿ /  45.18667; 9.15639 
 
 From Wikipedia, the free encyclopedia 
 
 
 Public university in Pavia, Italy 
 University of Pavia Università di Pavia Seal  of the University of Pavia Latin :  Alma Ticinensis Universitas Type Public Established 13 April 1361 ; 663 years ago  ( 1361-04-13 ) Rector Francesco Svelto Academic staff 981 Students 23,849 Undergraduates 11,983 Postgraduates 9,366 Location Pavia ,  Italy 45°11′12″N   9°9′23″E ﻿ / ﻿ 45.18667°N 9.15639°E ﻿ /  45.18667; 9.15639 Campus Urban/University town Colors    Pavia Yellow Affiliations Coimbra Group ,  EUA ,  Netval Website unipv .eu 
 The  University of Pavia  ( Italian :  Università degli Studi di Pavia ,  UNIPV  or  Università di Pavia ;  Latin :  Alma  Ticinensis  Universitas ) is a university located in  Pavia ,  Lombardy ,  Italy . There was evidence of teach

## Calculate embeddings

The langchain abstraction streamlines the process by combining two key steps: calculating the embeddings of the documents and loading these embeddings into a vector store. For the sake of the demonstration, we will also calculate the embedding of the first chunk manually. In typical usage, however, the VectorStore component of langchain takes care of this step for you.

Please note: to execute the rest of the notebook, you will need an OpenAI API key. [Sign up](https://platform.openai.com/signup) on their website; you might get some [free credits](https://help.openai.com/en/articles/4936830-what-happens-after-i-use-my-free-tokens-or-the-3-months-is-up-in-the-free-trial). You will find your secret API key on the user settings page of their site, as [described here](https://help.openai.com/en/articles/4936850-where-do-i-find-my-api-key).

In [5]:
from utils import ensure_openai_api_key
from langchain_openai import OpenAIEmbeddings

ensure_openai_api_key()
embeddings_engine = OpenAIEmbeddings()

In [6]:
emb0 = embeddings_engine.embed_documents([text0])[0]
print("Number of dimensions of the embedding:", len(emb0))
print("The first 50 dimensions of the text embedding:")
print(emb0[:50])

Number of dimensions of the embedding: 1536
The first 50 dimensions of the text embedding:
[0.015962178087853118, -0.009795560624431031, 0.007322455848248781, -0.03688348130969757, -0.02564796651300452, 0.022006111495659085, -0.025028076336916893, -0.00497849524680253, -0.01656915454495908, -0.037710000302717615, -0.0036257133342796526, -0.0044619199448424135, 0.02142496247651644, -0.011784375793091195, 0.004048659517009802, -0.027636781681664682, -0.002524762213246003, -0.010318593251543271, -0.010660824499686408, -0.028489131837938397, -0.0007575093281513071, 0.020908387640217616, -0.01726653076022701, -0.016582070126585913, -0.029418966170747247, 0.004210089386183831, 0.0031446526221642736, -0.00048065724570869477, 0.023775380403114815, -0.0410160837253785, -0.004000230276860818, -0.018454654373830443, -0.02864410391629901, -0.00949207239587805, 0.021773651515472992, -0.007851945334851852, -0.01271421105722263, -0.011306543045059937, 0.027094377544757364, 0.02183822383567164, 0.0201

## Load the embeddings into a vector store

There are many vector databases available, both commercial and open source ones. For this demo, we will use [Chroma](https://www.trychroma.com), an open-source vector database that runs locally and is perfect for proof-of-concept and hobby projects.

langchain already has an integration for Chroma.

In [7]:
from langchain_community.vectorstores import Chroma

!mkdir -p ./data

# Set this variable to True for the first time. The vector store will calculate the embeddings
# for all the documents. Later on, you can use the saved state to load the embeddings.
create_vector_store = True

if create_vector_store:
    # create a vectorstore from scratch
    vectorstore = Chroma.from_documents(
        documents=splits,
        embedding=embeddings_engine,
        persist_directory="./data"
    )
else:
    # later on, you can load the vectorstore from disk with:
    vectorstore = Chroma(
        embedding_function=embeddings_engine,
        persist_directory="./data"
    )

### Testing the vector store

We will ask the vector store to retrieve the document chunk that is semantically most relevant to a particular question.

In [8]:
sample_question = "What did Leonardo da Vinci studied at the University of Pavia?"
results = vectorstore.similarity_search(sample_question, k=1)
print("Sample question:", sample_question)
print("Most relevant chunk:")
print("-" * 80)
print(results[0].page_content)

Sample question: What did Leonardo da Vinci studied at the University of Pavia?
Most relevant chunk:
--------------------------------------------------------------------------------
Renaissance and Modern Period [ edit ] 
 Towards the 15th century, prominent teachers such as  Baldo degli Ubaldi ,  Lorenzo Valla ,  Giasone del Maino  taught students in the fields of law, philosophy and literary studies. In the same years, Elia di Sabato da Fermo, personal doctor of  Filippo Maria Visconti , was the first professor of medicine of the Jewish religion at a European university, while from 1490 a teaching of Hebrew was established at the university. [12] 
Not many years later, probably in 1511,  Leonardo da Vinci  studied anatomy together with  Marcantonio della Torre , professor of anatomy at the university. [13] 
During the ongoing  Italian War of 1521-6 , the authorities in Pavia were forced to close the university in 1524. [6] [14]  However, during the 16th century, after the university 

## Create the RAG application

We will use the [langchain framework](https://python.langchain.com/docs/get_started/introduction) to assemble our RAG application. Langchain provides an architectural framework for creating applications that leverage Large Language Models (LLMs). It provides reasonable abstractions over different components (document loaders, embedding models, vector stores, language models, and many more) and makes it easy to create data pipelines (called "chains") from them. It also integrates with dozens of commercial and open-source implementations of the pipeline components, allowing you to create highly modular LLM applications.

### Structuring the Prompt for LLMs

When crafting a prompt for a Large Language Model (LLM), it's essential to consider various semantic elements. A prompt, essentially the "question" posed to the LLM, can include:

- **Task Description**: Outlines what the task is.
- **Tone of Voice**: Specifies the desired style or persona the model should adopt.
- **Execution Instructions**: Provides detailed guidelines on how the task should be approached.
- **Contextual Information**: Offers additional background relevant to the task.
- **Example Solutions**: Presents examples to guide the expected response format.

Prompt engineering is a specialized area in information technology focused on optimizing how prompts are structured for different LLMs. It involves developing best practices and strategies to create effective prompts.

### Prompt Components in Retrieval Augmented Generation (RAG)

For a RAG application, the prompt should ideally include:

- **Model Persona/Tone of Voice**: Defines the identity or style the model should emulate.
- **Task Description**: Clearly states the task, such as answering a question based on provided document chunks.
- **Detailed Instructions**: Aims to minimize model-generated inaccuracies (hallucinations) and maintain answer conciseness.
- **User Question**: The actual query or problem posed to the model.
- **Extracted Document Chunks**: The relevant pieces of text the model uses to formulate its response.

In [9]:
from langchain_core.prompts import ChatPromptTemplate

prompt_template = """You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.
Question: {question}
Context: {context}
Answer:"""

prompt = ChatPromptTemplate.from_template(prompt_template)
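
To see what the model eventually receives, the same template text can be filled in with plain `str.format`. This is only an offline illustration: the chunks in `example_chunks` are shortened, hand-picked stand-ins for retrieved text, and the template wording is repeated so that the sketch runs on its own.

```python
# Same wording as the template above, repeated to keep this sketch self-contained.
template_text = """You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.
Question: {question}
Context: {context}
Answer:"""

# Hypothetical retrieved chunks, shortened for readability.
example_chunks = [
    "Leonardo da Vinci studied anatomy together with Marcantonio della Torre.",
    "The authorities in Pavia were forced to close the university in 1524.",
]

filled_prompt = template_text.format(
    question="What did Leonardo da Vinci study at the University of Pavia?",
    context="\n\n".join(example_chunks),  # chunks separated by blank lines
)
print(filled_prompt)
```

The persona, task description and instructions form the fixed first paragraph, while the user question and the extracted chunks fill the two placeholders.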

### Prompt length engineering

All current large language models have an upper bound on context length, i.e. the maximum length of the prompt you can send to them. In this demo, we will use OpenAI's [GPT-3.5](https://platform.openai.com/docs/models/gpt-3-5) model. This model has reasonable latency, pricing and cognitive performance for even complex Q&A tasks. It has a context size of 4096 tokens.

It is also important to note that most hosted LLM providers charge for the number of tokens in the prompt (plus the number of tokens in the response), so you have an incentive to keep your prompts short.

When we created the document chunks, we fixed the maximum length of a single chunk at 500 tokens. We plan to retrieve the 5 most relevant chunks from the vector store and add them to the prompt. Below is a calculation that checks that the resulting prompt will fit in the context of the model.

In [10]:
import tiktoken
encoder = tiktoken.encoding_for_model("gpt-3.5-turbo")
raw_instructions = prompt.format(question="", context="")

max_context_length = 4096
number_of_documents_to_retrieve = 5
maximum_question_length = 500  # This is an estimate and should be enforced by the user interface
raw_instructions_length = len(encoder.encode(raw_instructions))
maximum_prompt_length = (
    raw_instructions_length
    + maximum_question_length
    + number_of_documents_to_retrieve * chunk_size
)
print("Estimated upper bound of prompt length:", maximum_prompt_length)
assert maximum_prompt_length < max_context_length, "The prompt is too long."

Estimated upper bound of prompt length: 3060
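
The same arithmetic can be turned around to ask how many chunks fit at most. The sketch below assumes that the context window also has to hold the generated answer and reserves a hypothetical 256 tokens for it; the 60 tokens for the instructions are inferred from the printed estimate (3060 minus 500 for the question and 5 × 500 for the chunks).

```python
context_window = 4096      # context size of the model, as stated above
instructions_tokens = 60   # inferred: 3060 - 500 - 5 * 500
question_tokens = 500      # same question budget as in the estimate
answer_tokens = 256        # hypothetical reserve for the model's answer
chunk_tokens = 500         # maximum chunk size used when splitting

# Tokens that remain for retrieved chunks after all fixed parts are accounted for.
available_tokens = (
    context_window - instructions_tokens - question_tokens - answer_tokens
)
max_chunks = available_tokens // chunk_tokens

print("Tokens left for retrieved chunks:", available_tokens)
print("Chunks that fit:", max_chunks)
```

Under these assumptions, retrieving 5 chunks stays within the budget with some room to spare.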


### Define the RAG application

We will use the langchain wrapper around the OpenAI client library to connect to GPT-3.5 hosted on OpenAI servers. Have your OpenAI API key ready.

In [11]:
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

The following small helper function joins the retrieved document chunks into a single text.

In [12]:
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

Langchain defines a domain-specific language to describe LLM applications in a declarative way. This language, called "Langchain Expression Language" (LCEL), is written as ordinary Python code and uses the pipe operator `|` to define data pipelines (chains), much like Unix shells chain commands.
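
How a pipe operator can compose steps in Python is shown in the toy sketch below, which overloads `__or__`. It is not langchain's implementation; the `Step` class and the three example steps are invented purely to illustrate the mechanism.

```python
class Step:
    """Wrap a function so that steps can be chained with the | operator."""

    def __init__(self, func):
        self.func = func

    def __or__(self, other):
        # a | b builds a new step that runs a first, then feeds its output to b.
        return Step(lambda value: other.invoke(self.invoke(value)))

    def invoke(self, value):
        return self.func(value)

# Three hypothetical processing steps.
strip_spaces = Step(str.strip)
to_upper = Step(str.upper)
add_exclamation = Step(lambda text: text + "!")

pipeline = strip_spaces | to_upper | add_exclamation
result = pipeline.invoke("  hello rag ")
print(result)
```

Each `|` returns a new composite step, so the whole pipeline is itself a step with the same `invoke` method.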

In [13]:
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
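# Retrieve as many chunks as the prompt length estimate above assumes.
retriever = vectorstore.as_retriever(
    search_kwargs={"k": number_of_documents_to_retrieve}
)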

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

In [14]:
print("Question:", sample_question)
answer = rag_chain.invoke(sample_question)
print("Answer:", answer)

Question: What did Leonardo da Vinci studied at the University of Pavia?
Answer: Leonardo da Vinci studied anatomy at the University of Pavia with Marcantonio della Torre, a professor of anatomy. The university was closed in 1524 during the ongoing Italian War, but scholars like Andrea Alciato and Gerolamo Cardano taught there when it reopened in the 16th century.


Dive deeper: https://python.langchain.com/docs/use_cases/question_answering/quickstart