# Basic RAG
Retrieval-augmented generation (RAG) is an AI framework that combines the capabilities of LLMs and information retrieval systems. It is useful for answering questions or generating content that draws on external knowledge. RAG has two main steps: 1) retrieval: retrieve relevant information from a knowledge base, using text embeddings stored in a vector store; 2) generation: insert the retrieved information into the prompt so that the LLM can generate a response. In this file, we will build a basic example of RAG with three implementations:

- RAG from scratch with Mistral
- RAG with Mistral and LlamaIndex
- RAG with Mistral and LangChain

### Import needed packages
The first step is to install the packages `mistralai` and `faiss-cpu` and import the modules we need:



In [7]:
! pip3 install faiss-cpu==1.7.4 mistralai


In [8]:
from mistralai.client import MistralClient
from mistralai.models.chat_completion import ChatMessage
import requests
import numpy as np
import faiss
import os
from getpass import getpass

api_key= getpass("MISTRAL_API_KEY")
client = MistralClient(api_key=api_key)

In [7]:
!pip3 install pymupdf
import fitz
# Define the path to the PDF file
pdf_path = '/content/drive/My Drive/qdrant_policies/policies/Late-Payment-Policy_FM1.pdf'

# Define the path for the output text file
txt_output_path = '/content/drive/My Drive/qdrant_policies/policies/Late-Payment-Policy_FM1.txt'

def pdf_to_text(pdf_path, txt_output_path):
    # Open the PDF file
    doc = fitz.open(pdf_path)

    # Initialize an empty string to hold the text
    text = ""

    # Iterate through each page
    for page in doc:
        # Extract text from the page
        text += page.get_text()

    # Write the extracted text to a text file
    with open(txt_output_path, 'w') as file:
        file.write(text)

    print(f"Text has been written to {txt_output_path}")

# Convert PDF to text and save to file
pdf_to_text(pdf_path, txt_output_path)


Text has been written to /content/drive/My Drive/qdrant_policies/policies/Late-Payment-Policy_FM1.txt


### Get data

In this very simple example, we are getting data from the late payment policy.

In [9]:
file_path='/Users/lexi/Downloads/Late-Payment-Policy_FM1.txt'
def read_text_file(file_path):
    with open(file_path, 'r') as file:
        text_content = file.read()
    return text_content
text = read_text_file(file_path)
print(text)

 
Late Payment Policy (FM1) 
 
1 
July 2019 version 
 
The University of British Columbia  
Board of Governors 
 
 
Policy No.: 
FM1 
Long Title: 
Late Payment of Fees and Accounts 
 
Short Title:  
Late Payment Policy 
 
 
 
1. General 
 
1.1 
Where  fees,  fines,  or  other  indebtedness  to  the  University  remain  unpaid  despite  the 
University having taken reasonable steps to notify the individual concerned, the University 
may report the outstanding obligation to credit reporting agencies, commence legal action, 
or utilize any other remedies that may be available to it, whether the outstanding obligation 
is owed by a faculty member, staff member, student, or other individual. 
 
1.2 
A late payment fee and interest may be charged. 
 
1.3 
In cases where the outstanding obligation is owed by a student, the University will attempt to 
secure payment using internal processes prior to commencing any legal action. Provided that 
the  University  has  first  taken  reasonable  ste

In [10]:
len(text)

4765

### Split document into chunks

In a RAG system, it is crucial to split the document into smaller chunks so that the retrieval step can identify and return the most relevant information more effectively. In this example, we simply split our text by character, combining 512 characters into each chunk. Since the text is 4765 characters long, we get 10 chunks.

In [11]:
chunk_size = 512
chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

In [12]:
len(chunks)

10
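
As a quick check on the chunk count, the arithmetic can be reproduced with a ceiling division. This is only a small sketch using the two numbers reported above (4765 characters, chunks of 512); the variable names are chosen for illustration.

```python
import math

text_length = 4765  # len(text) reported above
chunk_size = 512    # characters per chunk

# Every started block of 512 characters becomes a chunk, hence the ceiling.
num_chunks = math.ceil(text_length / chunk_size)

# The last chunk holds whatever is left after the full-size chunks.
last_chunk_length = text_length - (num_chunks - 1) * chunk_size

print(num_chunks, last_chunk_length)  # prints: 10 157
```

Nine chunks are full (9 × 512 = 4608 characters) and the tenth holds the remaining 157 characters.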

#### Considerations:
- **Chunk size**: Depending on your use case, you may need to experiment with different chunk sizes and chunk overlaps to get the best RAG performance. Smaller chunks can help retrieval, because larger chunks often contain filler text that obscures the semantic representation. Using smaller chunks can therefore let the RAG system identify and extract relevant information more accurately. The trade-off is that smaller chunks mean more chunks to embed and search, which increases processing time and computational cost.
- **How to split**: While the simplest method is to split the text by character, there are other options depending on the use case and document structure. For example, to avoid exceeding token limits in API calls, it may be necessary to split the text by tokens. To keep chunks cohesive, it can be useful to split the text by sentences, paragraphs, or HTML headers. When working with code, it is often recommended to split by meaningful code units, for example by using an Abstract Syntax Tree (AST) parser.
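
To illustrate the chunk overlap mentioned above, here is a minimal sketch of character-based splitting in which consecutive chunks share some characters. The function name `chunk_with_overlap` and its parameters are hypothetical, introduced only for this example; they are not part of the notebook or of any library.

```python
def chunk_with_overlap(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into character chunks where neighbours share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    # Each new chunk starts `step` characters after the previous one.
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


sample = "abcdefghij"
overlapping_chunks = chunk_with_overlap(sample, chunk_size=4, overlap=2)
print(overlapping_chunks)  # prints: ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```

With `overlap=0` this reduces to the plain character split used in this notebook. Overlap can help when a relevant sentence straddles a chunk boundary, at the cost of more chunks to embed.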


### Create embeddings for each text chunk
For each text chunk, we then need to create a text embedding, which is a numeric representation of the text in a vector space. Texts with similar meanings are expected to lie closer together, that is, to have a shorter distance between them in the vector space.
To create an embedding, use Mistral’s embeddings API endpoint with the embedding model `mistral-embed`. We define a `get_text_embedding` function that returns the embedding of a single text chunk, and then use a list comprehension to get the embeddings for all text chunks.


In [13]:
def get_text_embedding(input):
    embeddings_batch_response = client.embeddings(
          model="mistral-embed",
          input=input
      )
    return embeddings_batch_response.data[0].embedding

In [14]:
text_embeddings = np.array([get_text_embedding(chunk) for chunk in chunks])

In [15]:
text_embeddings.shape

(10, 1024)

In [16]:
text_embeddings

array([[-0.05279541,  0.0425415 ,  0.03515625, ..., -0.01850891,
        -0.01600647, -0.01660156],
       [-0.02555847,  0.04238892,  0.0335083 , ...,  0.00045657,
        -0.01304626, -0.03089905],
       [-0.00518799,  0.04684448,  0.0252533 , ..., -0.00501251,
        -0.00649643, -0.03262329],
       ...,
       [-0.04434204,  0.03909302,  0.02493286, ...,  0.00706863,
        -0.00595093, -0.0243988 ],
       [-0.04748535,  0.04962158,  0.02813721, ...,  0.00458908,
        -0.00642776, -0.02920532],
       [-0.01670837,  0.03120422,  0.03117371, ...,  0.00309753,
        -0.00422668, -0.03001404]])
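
To make "closer in the vector space" concrete, the sketch below computes two common measures, squared L2 distance and cosine similarity, on toy vectors. The three 3-dimensional vectors are made up for illustration; real `mistral-embed` vectors have 1024 dimensions, as the shape above shows.

```python
import math


def squared_l2_distance(a, b):
    """Sum of squared coordinate differences (smaller means closer)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (larger means more similar)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings", not produced by any model.
late_fee = [0.9, 0.1, 0.0]
interest_charge = [0.8, 0.2, 0.0]
parking = [0.0, 0.1, 0.9]

print(squared_l2_distance(late_fee, interest_charge), squared_l2_distance(late_fee, parking))
print(cosine_similarity(late_fee, interest_charge), cosine_similarity(late_fee, parking))
```

In this toy setup, the two payment-related vectors are closer under both measures than the payment and parking vectors.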

### Load into a vector database
Once we have the text embeddings, a common practice is to store them in a vector database for efficient processing and retrieval. There are several vector databases to choose from. In our simple example, we use Faiss, an open-source library for efficient similarity search.

With Faiss, we create an instance of an index class (here `IndexFlatL2`), which defines the indexing structure of the vector store. We then add the text embeddings to this index.


In [17]:
d = text_embeddings.shape[1]
index = faiss.IndexFlatL2(d)
index.add(text_embeddings)

#### Considerations:
- **Vector database**: When selecting a vector database, there are several factors to consider including speed, scalability, cloud management, advanced filtering, and open-source vs. closed-source.
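
Conceptually, a flat L2 index compares the query against every stored vector and keeps the closest ones. The pure-Python sketch below mimics that behaviour on toy data so the idea can be followed without Faiss; it is an illustration only, not how Faiss is implemented, and `flat_l2_search` is a hypothetical helper name.

```python
def flat_l2_search(stored_vectors, query, k):
    """Exhaustive search: return (distances, indices) of the k closest vectors."""
    scored = []
    for idx, vec in enumerate(stored_vectors):
        # Squared L2 distance between the stored vector and the query.
        dist = sum((x - y) ** 2 for x, y in zip(vec, query))
        scored.append((dist, idx))
    scored.sort()  # closest first
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]


# Hypothetical 2-dimensional vectors standing in for chunk embeddings.
toy_vectors = [[0.0, 0.0], [1.0, 1.0], [0.2, 0.1], [5.0, 5.0]]
toy_distances, toy_indices = flat_l2_search(toy_vectors, query=[0.1, 0.1], k=2)
print(toy_indices)  # prints: [2, 0]
```

Exhaustive search is exact but scales linearly with the number of stored vectors, which is one reason speed and scalability appear in the considerations above.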

### Create embeddings for a question
Whenever a user asks a question, we also need to create an embedding for that question, using the same embedding model as before so that the question and the chunks live in the same vector space.


In [18]:
question = "What may the University charge when fees, fines, or other indebtedness remain unpaid?"
question_embeddings = np.array([get_text_embedding(question)])
question_embeddings.shape

(1, 1024)

In [19]:
question_embeddings

array([[-0.02053833,  0.0625    ,  0.04031372, ..., -0.01794434,
        -0.00689316, -0.00888824]])

#### Considerations:
- **Hypothetical Document Embeddings (HyDE)**: In some cases, the user’s question might not be the most effective query for identifying the relevant context. Instead, it may be more effective to generate a hypothetical answer or a hypothetical document based on the user’s query, and use the embedding of the generated text to retrieve similar text chunks.
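
The HyDE idea can be sketched as a small function that takes the text generator and the embedder as arguments. Everything here is an assumption made for illustration: `hyde_query_embedding`, the prompt wording, and the two stand-in functions are hypothetical, and the stand-ins return fixed toy values rather than calling a model.

```python
from typing import Callable, Sequence


def hyde_query_embedding(
    question: str,
    generate: Callable[[str], str],
    embed: Callable[[str], Sequence[float]],
) -> Sequence[float]:
    """Embed a generated hypothetical answer instead of the question itself."""
    hyde_prompt = (
        "Write a short passage that plausibly answers the question.\n"
        f"Question: {question}\nPassage:"
    )
    hypothetical_document = generate(hyde_prompt)
    return embed(hypothetical_document)


# Stand-ins so the sketch runs offline; they do not call any API.
def fake_generate(prompt: str) -> str:
    return "A late payment fee and interest may be charged."


def fake_embed(text: str) -> list:
    return [float(len(text)), float(text.count(" "))]


hyde_vector = hyde_query_embedding(
    "What may the University charge when fees remain unpaid?",
    fake_generate,
    fake_embed,
)
print(hyde_vector)
```

In this notebook, one could presumably pass `run_mistral` (defined later) as `generate` and `get_text_embedding` as `embed`, then wrap the result in `np.array([...])` before searching the index.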

### Retrieve similar chunks from the vector database
We can perform a search on the vector database with `index.search`, which takes two arguments: the first is the vector of the question embeddings, and the second is the number of similar vectors to retrieve. This function returns the distances and the indices of the most similar vectors to the question vector in the vector database. Then based on the returned indices, we can retrieve the actual relevant text chunks that correspond to those indices.


In [20]:
D, I = index.search(question_embeddings, k=2)
print(I)

[[1 5]]


In [21]:
retrieved_chunk = [chunks[i] for i in I.tolist()[0]]
print(retrieved_chunk)

['\xa0legal\xa0action,\xa0\nor\xa0utilize\xa0any\xa0other\xa0remedies\xa0that\xa0may\xa0be\xa0available\xa0to\xa0it,\xa0whether\xa0the\xa0outstanding\xa0obligation\xa0\nis\xa0owed\xa0by\xa0a\xa0faculty\xa0member,\xa0staff\xa0member,\xa0student,\xa0or\xa0other\xa0individual.\xa0\n\xa0\n1.2 \nA\xa0late\xa0payment\xa0fee\xa0and\xa0interest\xa0may\xa0be\xa0charged.\xa0\n\xa0\n1.3 \nIn\xa0cases\xa0where\xa0the\xa0outstanding\xa0obligation\xa0is\xa0owed\xa0by\xa0a\xa0student,\xa0the\xa0University\xa0will\xa0attempt\xa0to\xa0\nsecure\xa0payment\xa0using\xa0internal\xa0processes\xa0prior\xa0to\xa0commencing\xa0any\xa0legal\xa0action.\xa0Provided\xa0that\xa0\nthe\xa0 University\xa0 has\xa0 first\xa0 taken\xa0 reasonable\xa0 steps\xa0 to\xa0 notify\xa0 the\xa0 ind', '\xa0unit\xa0in\xa0which\xa0the\xa0outstanding\xa0obligation\xa0was\xa0incurred\xa0shall\xa0\ntake\xa0reasonable\xa0steps\xa0to\xa0notify\xa0the\xa0individual\xa0concerned\xa0before\xa0taking\xa0any\xa0further\xa0steps.\xa0Such\xa0\n

#### Considerations:
- **Retrieval methods**: There are many different retrieval strategies. In our example, we show a simple similarity search with embeddings. When metadata is available for the data, it is often better to filter on the metadata first, before performing the similarity search. There are also statistical retrieval methods, such as TF-IDF and BM25, that use the frequency and distribution of terms in the documents to identify relevant text chunks.
- **Retrieved document**: Do we always retrieve an individual text chunk as it is? Not always.
    - Sometimes, we would like to include more context around the retrieved text chunk. We call the retrieved text chunk the “child chunk”, and our goal is to retrieve the larger “parent chunk” that the “child chunk” belongs to.
    - On occasion, we might also want to weight the retrieved documents. For example, a time-weighted approach would help us retrieve the most recent documents.
    - One common issue in the retrieval process is the “lost in the middle” problem, where information in the middle of a long context gets lost. Our models have tried to mitigate this issue. For example, in the passkey task, our models have demonstrated the ability to find a "needle in a haystack" by retrieving a randomly inserted passkey within a long prompt, up to 32k context length. However, it is worth experimenting with reordering the retrieved chunks to see whether placing the most relevant chunks at the beginning and end improves results.
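
The reordering idea from the last point can be sketched as follows. This is one possible scheme, assuming the input list is already sorted from most to least relevant; the function name `reorder_for_long_context` is hypothetical.

```python
def reorder_for_long_context(chunks_by_relevance):
    """Place the most relevant chunks at both ends and the least relevant in the middle.

    Expects chunks sorted from most relevant to least relevant.
    """
    front, back = [], []
    for rank, chunk in enumerate(chunks_by_relevance):
        # Alternate: even ranks fill the beginning, odd ranks fill the end.
        if rank % 2 == 0:
            front.append(chunk)
        else:
            back.append(chunk)
    return front + back[::-1]


ranked = ["a", "b", "c", "d", "e"]  # "a" is the most relevant
print(reorder_for_long_context(ranked))  # prints: ['a', 'c', 'e', 'd', 'b']
```

Here the top two chunks (`a`, `b`) end up first and last, while the least relevant chunk (`e`) sits in the middle. Whether this helps is an empirical question for a given model and dataset.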
  
### Combine context and question in a prompt and generate response

Finally, we can offer the retrieved text chunks as the context information within the prompt. Here is a prompt template where we can include both the retrieved text and user question in the prompt.



In [23]:
prompt = f"""
Context information is below.
---------------------
{retrieved_chunk}
---------------------
Given the context information and not prior knowledge, answer the query.
Query: {question}
Answer:
"""

In [24]:
def run_mistral(user_message, model="mistral-medium-latest"):
    messages = [
        ChatMessage(role="user", content=user_message)
    ]
    chat_response = client.chat(
        model=model,
        messages=messages
    )
    return (chat_response.choices[0].message.content)

In [25]:
run_mistral(prompt)

'When fees, fines, or other indebtedness remain unpaid, the University may charge a late payment fee and interest on the outstanding obligation. Additionally, the University may decline to provide further services to the individual in question until the obligation is paid. In cases where the obligation is owed by a student, the University will first attempt to secure payment using internal processes prior to commencing any legal action. The administrative unit responsible for the outstanding obligation will take reasonable steps to notify the individual concerned of the late fees or interest charges and the potential consequences of non-payment before taking any further steps. The University may also utilize legal action or other remedies available to it to collect the outstanding obligation.'
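
Note that the prompt above interpolates a Python list into the f-string, so the model sees the list's printed form, including brackets, quotes, and escaped `\xa0` sequences. A possible refinement, sketched here with a hypothetical helper `build_prompt`, is to normalize whitespace and join the chunks into plain text before building the prompt.

```python
def build_prompt(retrieved_chunks, question):
    """Build the RAG prompt from plain-text context instead of a list repr."""
    # str.split() with no arguments splits on any whitespace, including
    # non-breaking spaces, so this collapses them into single spaces.
    cleaned = [" ".join(chunk.split()) for chunk in retrieved_chunks]
    context = "\n\n".join(cleaned)
    return (
        "Context information is below.\n"
        "---------------------\n"
        f"{context}\n"
        "---------------------\n"
        "Given the context information and not prior knowledge, answer the query.\n"
        f"Query: {question}\n"
        "Answer:"
    )


# Toy chunks imitating the non-breaking spaces seen in the extracted text.
demo_chunks = ["A\xa0late\xa0payment\xa0fee\xa0and\xa0interest\xa0may\xa0be\xa0charged.\xa0\n", "1.3 \nIn\xa0cases"]
demo_prompt = build_prompt(demo_chunks, "What may the University charge?")
print(demo_prompt)
```

The prompt wording is the same as in the template above; only the way the context is inserted differs.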

#### Considerations:
- **Prompting techniques**: Most prompting techniques can be used when developing a RAG system as well. For example, we can use few-shot learning to guide the model’s answers by providing a few examples. Additionally, we can explicitly instruct the model to format answers in a certain way.
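
As an illustration of both techniques, the sketch below builds a prompt that contains a worked example and an explicit formatting instruction. The helper `build_few_shot_prompt`, the instruction wording, and the example pair are all hypothetical, written only for this sketch.

```python
def build_few_shot_prompt(examples, context, question):
    """Assemble a RAG prompt that shows worked examples before the real query."""
    shots = "\n\n".join(
        f"Query: {example_query}\nAnswer: {example_answer}"
        for example_query, example_answer in examples
    )
    return (
        "Answer the query using only the context. "
        "Reply in one sentence and cite the section number.\n\n"
        f"Examples:\n{shots}\n\n"
        f"Context:\n{context}\n\n"
        f"Query: {question}\nAnswer:"
    )


# Hypothetical example pair, not taken from any dataset.
example_pairs = [
    (
        "Can a late payment fee be charged?",
        "Yes, a late payment fee and interest may be charged (section 1.2).",
    ),
]
few_shot_prompt = build_few_shot_prompt(
    example_pairs,
    context="1.2 A late payment fee and interest may be charged.",
    question="May interest be charged on unpaid fees?",
)
print(few_shot_prompt)
```

The resulting string could be passed to `run_mistral` in the same way as the basic prompt.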


In the next sections, we show how to build a similar basic RAG pipeline with some of the popular RAG frameworks. We start with LlamaIndex and then do the same with LangChain.


## LlamaIndex

In [5]:
!pip3 install llama-index llama-index-llms-mistralai llama-index-embeddings-mistralai


In [26]:
import os
from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex
from llama_index.llms.mistralai import MistralAI
from llama_index.embeddings.mistralai import MistralAIEmbedding

# Load data
reader = SimpleDirectoryReader(input_files=["Late-Payment-Policy_FM1.txt"])
documents = reader.load_data()
# Define LLM and embedding model
Settings.llm = MistralAI(model="open-mistral-7b", api_key=api_key)
Settings.embed_model = MistralAIEmbedding(model_name='mistral-embed', api_key=api_key)
# Create vector store index
index = VectorStoreIndex.from_documents(documents)


In [27]:
# Create query engine
query_engine = index.as_query_engine(similarity_top_k=2)
response = query_engine.query(
    "What may the University charge when fees, fines, or other indebtedness remain unpaid?"
)



In [28]:
import textwrap

def format_response(response):
    response_text = str(response)
    return '\n'.join(textwrap.wrap(response_text, width=80))  # Wrap text at 80 characters

formatted_response = format_response(response)
print(formatted_response)

The University may charge late fees or interest when fees, fines, or other
indebtedness remain unpaid.


In [29]:
response2 = query_engine.query(
    "Are individual academic departments authorized to withhold grades from Enrolment Services for any reason?"
)

formatted_response2 = format_response(response2)
print(formatted_response2)

No, individual academic departments are not authorized to withhold grades from
Enrolment Services for any reason, according to the Late-Payment-Policy_FM1.txt.
This is stated in section 1.4 of the July 2019 version of the policy.


In [30]:

response3 = query_engine.query(
    "What steps must an administrative unit take before declining further services to an individual with outstanding obligations?"
)
formatted_response3 = format_response(response3)
print(formatted_response3)

An administrative unit must take reasonable steps to notify the individual
concerned before declining further services. This notification should state the
late fees or interest charges, if any, which apply to the outstanding
obligation, as well as the potential consequences of non-payment.


In [31]:

response4 = query_engine.query(
    "What actions may the Department of Housing and Conferences take if a resident has outstanding obligations?")
formatted_response4 = format_response(response4)
print(formatted_response4)

The Department of Housing and Conferences may refuse admission to residences and
may withdraw residence privileges, including dining privileges, requiring a
resident to vacate the premises.


In [32]:
response5 = query_engine.query(
    "What may Parking and Access Control Services do in cases of unpaid obligations?")
formatted_response5 = format_response(response5)
print(formatted_response5)

Parking and Access Control Services may withdraw parking privileges and may tow
vehicles.


## LangChain

In [33]:
!pip3 install langchain_community



In [35]:
!pip3 install langchain_mistralai



In [36]:
from langchain_community.document_loaders import TextLoader
from langchain_mistralai.chat_models import ChatMistralAI
from langchain_mistralai.embeddings import MistralAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate
from langchain.chains import create_retrieval_chain





In [37]:
# Load data
loader = TextLoader("Late-Payment-Policy_FM1.txt")
docs = loader.load()
# Split text into chunks
text_splitter = RecursiveCharacterTextSplitter()
documents = text_splitter.split_documents(docs)
# Define the embedding model
embeddings = MistralAIEmbeddings(model="mistral-embed", mistral_api_key=api_key)
# Create the vector store
vector = FAISS.from_documents(documents, embeddings)
# Define a retriever interface
retriever = vector.as_retriever()



In [38]:
# Define LLM
model = ChatMistralAI(mistral_api_key=api_key)
# Define prompt template
prompt = ChatPromptTemplate.from_template("""Answer the following question based only on the provided context:

<context>
{context}
</context>

Question: {input}""")

# Create a retrieval chain to answer questions
document_chain = create_stuff_documents_chain(model, prompt)
retrieval_chain = create_retrieval_chain(retriever, document_chain)
response = retrieval_chain.invoke({"input": "What may the University charge when fees, fines, or other indebtedness remain unpaid"})
print(response["answer"])

When fees, fines, or other indebtedness to the University remain unpaid, the University may charge late fees or interest, as stated in the Late Payment Policy (FM1) under section 1.5. The specific amount of these charges is not mentioned in the provided context.


In [39]:
response2 = retrieval_chain.invoke({"input": "What actions may the Department of Housing and Conferences take if a resident has outstanding obligations?"})
print(response2["answer"])

The Department of Housing and Conferences may refuse admission to residences and may withdraw residence privileges, including dining privileges, requiring a resident to vacate the premises if the resident has outstanding obligations.


In [40]:
response3 = retrieval_chain.invoke({"input": "What may Parking and Access Control Services do in cases of unpaid obligations?"})
print(response3["answer"])

Parking and Access Control Services may withdraw parking privileges and may tow vehicles in cases of unpaid obligations.
