# Basic RAG
Retrieval-augmented generation (RAG) is an AI framework that synergizes the capabilities of LLMs and information retrieval systems. It’s useful to answer questions or generate content leveraging external knowledge. There are two main steps in RAG: 1) retrieval: retrieve relevant information from a knowledge base with text embeddings stored in a vectore store; 2) generation: insert the relevant information to the prompt for the LLM to generate information. In this guide, we will walk through a very basic example of RAG with two implementations:

- RAG from scratch with Mistral 
- RAG with Mistral and LlamaIndex
- RAG with Mistral and Haystack

## RAG from scratch 

This section aims to guide you through the process of building a basic RAG from scratch. We have two goals: firstly, to offer users a comprehensive understanding of the internal workings of RAG and demystify the underlying mechanisms; secondly, to empower you with the essential foundations needed to build an RAG using the minimum required dependencies.


### Import needed packages 
The first step is to install the needed packages `mistralai` and `faiss-cpu` and import the needed packages: 



In [None]:
! pip install faiss-cpu==1.7.4 mistralai==0.0.12

In [None]:
! pip install requests

In [None]:
! pip install numpy

In [1]:
from mistralai.client import MistralClient, ChatMessage
import requests
import numpy as np
import faiss
import os
from getpass import getpass

api_key= getpass("Type your API Key")
client = MistralClient(api_key=api_key)

### Get data

Using only the FAQs - Must include the rest of the files at some point

In [None]:
def load_data_from_file(file_path: str) -> dict:
    with open(file_path, 'r') as file:
        text = file.read()
        data = {"text": text}
    return data

file_path = "/Users/acrobat/Documents/GitHub/cookbook/data_poppy/faqv3.txt"
data = load_data_from_file(file_path)

print("File contents:\n", data['text'])

In [15]:
#print first 100 chars of in faq poppy file
print("First 100 characters of file:\n", data['text'][:900])

First 100 characters of file:
 Q: When is a good time for a child's first dental visit and what should parents expect?
A: The American Academy of Pediatric Dentistry (AAPD) recommends that every child should have an initial oral evaluation by a pediatric dentist by Age 1 or within 6 months after their first tooth appears. At Poppy Kids Pediatric Dentistry, your child's first visit includes a facility tour, possible cleaning, dental radiographs, fluoride treatment, and an examination by Dr. Andrea. She'll discuss dental care, answer questions, and cover topics like brushing, flossing, and diet. We recommend follow-up visits every 3-6 months, based on individual dental needs. For children under three, we offer a complimentary first appointment with no obligation.

Q: What age should children start getting dental x-rays and why are they important?
A: Children typically start getting dental X-rays around the age of 6, whi


In [17]:
len(data['text'])

11005

In [18]:
text = data['text']
len(text)

11005

## Split document into chunks (Simple split text by character)

In a RAG system, it is crucial to split the document into smaller chunks so that it’s more effective to identify and retrieve the most relevant information in the retrieval process later. In this example, we simply split our text by character, combine 2048 characters into each chunk, and we get 37 chunks. 

In [19]:
chunk_size = 2048
chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

In [20]:
len(chunks)

6

In [32]:
chunks

["Q: When is a good time for a child's first dental visit and what should parents expect?\nA: The American Academy of Pediatric Dentistry (AAPD) recommends that every child should have an initial oral evaluation by a pediatric dentist by Age 1 or within 6 months after their first tooth appears. At Poppy Kids Pediatric Dentistry, your child's first visit includes a facility tour, possible cleaning, dental radiographs, fluoride treatment, and an examination by Dr. Andrea. She'll discuss dental care, answer questions, and cover topics like brushing, flossing, and diet. We recommend follow-up visits every 3-6 months, based on individual dental needs. For children under three, we offer a complimentary first appointment with no obligation.\n\nQ: What age should children start getting dental x-rays and why are they important?\nA: Children typically start getting dental X-rays around the age of 6, which coincides with the eruption of their permanent teeth. X-rays are important for several reas

## LATER - Split docs by Q and A (Turn the data into list of dictionaries - Q and A.)

In [None]:
#looks like I need a list of dictionaries:
def parse_qa_text(text: str) -> list:
    # Split the text into lines
    lines = text.split("\n")
    
    # Initialize an empty list to store the questions and answers
    data = []
    
    # Initialize empty strings to store the current question and answer
    current_question = ""
    current_answer = ""
    
    # Iterate over the lines
    for line in lines:
        # If the line starts with "Q: ", it's a question
        if line.startswith("Q: "):
            # If there's a current question and answer, add them to the data
            if current_question and current_answer:
                data.append({"Q": current_question, "A": current_answer})
            
            # Start a new question, removing the "Q: " prefix
            current_question = line[3:]
            
            # Clear the current answer
            current_answer = ""
        # If the line starts with "A: ", it's an answer
        elif line.startswith("A: "):
            # Start a new answer, removing the "A: " prefix
            current_answer = line[3:]
        # If the line is not empty, it's a continuation of the current answer
        elif line:
            current_answer += " " + line
    
    # If there's a current question and answer at the end, add them to the data
    if current_question and current_answer:
        data.append({"Q": current_question, "A": current_answer})
    
    return data

# Use the function to parse the text data
data = parse_qa_text(data['text'])

# Print the first question-answer pair to check the result
print(data[0])

#### Considerations:
- **Chunk size**: Depending on your specific use case, it may be necessary to customize or experiment with different chunk sizes and chunk overlap to achieve optimal performance in RAG. For example, smaller chunks can be more beneficial in retrieval processes, as larger text chunks often contain filler text that can obscure the semantic representation. As such, using smaller text chunks in the retrieval process can enable the RAG system to identify and extract relevant information more effectively and accurately.  However, it’s worth considering the trade-offs that come with using smaller chunks, such as increasing processing time and computational resources. 
- **How to split**: While the simplest method is to split the text by character, there are other options depending on the use case and document structure. For example, to avoid exceeding token limits in API calls, it may be necessary to split the text by tokens. To maintain the cohesiveness of the chunks, it can be useful to split the text by sentences, paragraphs, or HTML headers. If working with code, it’s often recommended to split by meaningful code chunks for example using an Abstract Syntax Tree (AST) parser. 


### Create embeddings for each text chunk
For each text chunk, we then need to create text embeddings, which are numeric representations of the text in the vector space. Words with similar meanings are expected to be in closer proximity or have a shorter distance in the vector space.
To create an embedding, use Mistral’s embeddings API endpoint and the embedding model `mistral-embed`. We create a `get_text_embedding` to get the embedding from a single text chunk and then we use list comprehension to get text embeddings for all text chunks. 


In [23]:
def get_text_embedding(input):
    embeddings_batch_response = client.embeddings(
          model="mistral-embed",
          input=input
      )
    return embeddings_batch_response.data[0].embedding

In [24]:
text_embeddings = np.array([get_text_embedding(chunk) for chunk in chunks])

In [25]:
text_embeddings.shape

(6, 1024)

In [26]:
text_embeddings

array([[-0.05307007,  0.00943756,  0.05322266, ...,  0.01525116,
        -0.00129986,  0.00590134],
       [-0.02661133,  0.04150391,  0.03811646, ...,  0.00809479,
         0.00528717, -0.00054073],
       [-0.03347778,  0.0280304 ,  0.0222168 , ...,  0.00720215,
         0.00587845, -0.01766968],
       [-0.04309082,  0.03826904,  0.02584839, ...,  0.0166626 ,
         0.00312233,  0.00087881],
       [-0.0475769 ,  0.03210449,  0.03369141, ...,  0.02264404,
         0.00108528,  0.00587845],
       [-0.03253174,  0.02947998,  0.03051758, ..., -0.00907898,
         0.00287437, -0.00576782]])

### Load into a vector database
Once we get the text embeddings, a common practice is to store them in a vector database for efficient processing and retrieval. There are several vector database to choose from. In our simple example, we are using an open-source vector database Faiss, which allows for efficient similarity search.  

With Faiss, we instantiate an instance of the Index class, which defines the indexing structure of the vector database. We then add the text embeddings to this indexing structure. 


In [27]:
d = text_embeddings.shape[1]
index = faiss.IndexFlatL2(d)
index.add(text_embeddings)

#### Considerations:
- **Vector database**: When selecting a vector database, there are several factors to consider including speed, scalability, cloud management, advanced filtering, and open-source vs. closed-source. 

### Create embeddings for a question
Whenever users ask a question, we also need to create embeddings for this question using the same embedding models as before. 


In [28]:
question = "When should my child a stop using a pacifier?"
question_embeddings = np.array([get_text_embedding(question)])
question_embeddings.shape

(1, 1024)

In [29]:
question_embeddings

array([[-0.04406738,  0.03460693,  0.00855255, ...,  0.02774048,
        -0.00434875,  0.00724411]])

#### Considerations:
- Hypothetical Document Embeddings (HyDE): In some cases, the user’s question might not be the most relevant query to use for identifying the relevant context. Instead, it maybe more effective to generate a hypothetical answer or a hypothetical document based on the user’s query and use the embeddings of the generated text to retrieve similar text chunks. 

### Retrieve similar chunks from the vector database
We can perform a search on the vector database with `index.search`, which takes two arguments: the first is the vector of the question embeddings, and the second is the number of similar vectors to retrieve. This function returns the distances and the indices of the most similar vectors to the question vector in the vector database. Then based on the returned indices, we can retrieve the actual relevant text chunks that correspond to those indices. 


In [30]:
D, I = index.search(question_embeddings, k=2) 
print(I)

[[2 4]]


In [31]:
retrieved_chunk = [chunks[i] for i in I.tolist()[0]]
print(retrieved_chunk)

["ng it out before it's ready, as this can cause pain and increase the risk of infection. Natural wiggling and day-to-day activities like eating will usually lead to the tooth coming out on its own. If the loose tooth is causing discomfort or if there are concerns about how it's coming out, it's advisable to consult with a pediatric dentist. Regular dental visits also help monitor the progress of tooth loss and the growth of permanent teeth.\n\nQ: Until what age does your pediatric dental practice provide care for patients?\nA: Our pediatric dental practice typically provides care for patients up to the age of 18. This includes the transition period from childhood into adolescence, ensuring continuous dental health supervision and treatment throughout their formative years. As young patients grow, their dental needs evolve, and our practice is equipped to address these changing needs, from preventive care and treatment of childhood cavities to guidance on orthodontics and wisdom teeth.

#### Considerations:
- **Retrieval methods**: There are a lot different retrieval strategies. In our example, we are showing a simple similarity search with embeddings. Sometimes when there is metadata available for the data, it’s better to filter the data based on the metadata first before performing similarity search. There are also other statistical retrieval methods like TF-IDF and BM25 that use frequency and distribution of terms in the document to identify relevant text chunks. 
- **Retrieved document**: Do we always retrieve individual text chunk as it is? Not always.
    - Sometimes, we would like to include more context around the actual retrieved text chunk. We call the actual retrieve text chunk “child chunk” and our goal is to retrieve a larger “parent chunk” that the “child chunk” belongs to.
    - On occasion, we might also want to provide weights to our retrieve documents. For example, a time-weighted approach would help us retrieve the most recent document.
    - One common issue in the retrieval process is the “lost in the middle” problem where the information in the middle of a long context gets lost. Our models have tried to mitigate this issue. For example, in the passkey task, our models have demonstrated the ability to find a "needle in a haystack" by retrieving a randomly inserted passkey within a long prompt, up to 32k context length. However, it is worth considering experimenting with reordering the document to determine if placing the most relevant chunks at the beginning and end leads to improved results.
  
### Combine context and question in a prompt and generate response

Finally, we can offer the retrieved text chunks as the context information within the prompt. Here is a prompt template where we can include both the retrieved text and user question in the prompt. 



In [33]:
prompt = f"""
Context information is below.
---------------------
{retrieved_chunk}
---------------------
Given the context information and not prior knowledge, answer the query.
Query: {question}
Answer:
"""

In [34]:
def run_mistral(user_message, model="mistral-medium-latest"):
    messages = [
        ChatMessage(role="user", content=user_message)
    ]
    chat_response = client.chat(
        model=model,
        messages=messages
    )
    return (chat_response.choices[0].message.content)

In [35]:
run_mistral(prompt)

"The ideal age for children to stop using a pacifier is typically around the age of 2 to 4 years. Prolonged pacifier use beyond this age range can lead to dental problems, such as misaligned teeth and changes in the roof of the mouth. It's important to gradually wean children off this habit to prevent potential orthodontic issues in the future. Every child is different, so the approach can vary. Encouragement, positive reinforcement, and providing comfort in other ways can be effective strategies. If the habit persists or if you're concerned about its impact on your child's teeth, consulting with a pediatric dentist can provide further guidance."

#### Considerations:
- Prompting techniques: Most of the prompting techniques can be used in developing a RAG system as well. For example, we can use few-shot learning to guide the model’s answers by providing a few examples. Additionally, we can explicitly instruct the model to format answers in a certain way. 


In the next sections, we are going to show you how to do a similar basic RAG with some of the popular RAG frameworks. We will start with LlamaIndex and add other frameworks in the future. 


## LangChain

In [None]:
!pip install langchain langchain-mistralai==0.0.4 

In [19]:
from langchain_community.document_loaders import TextLoader
from langchain_mistralai.chat_models import ChatMistralAI
from langchain_mistralai.embeddings import MistralAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate
from langchain.chains import create_retrieval_chain

# Load data
loader = TextLoader("essay.txt")
docs = loader.load()
# Split text into chunks 
text_splitter = RecursiveCharacterTextSplitter()
documents = text_splitter.split_documents(docs)
# Define the embedding model
embeddings = MistralAIEmbeddings(model="mistral-embed", mistral_api_key=api_key)
# Create the vector store 
vector = FAISS.from_documents(documents, embeddings)
# Define a retriever interface
retriever = vector.as_retriever()
# Define LLM
model = ChatMistralAI(mistral_api_key=api_key)
# Define prompt template
prompt = ChatPromptTemplate.from_template("""Answer the following question based only on the provided context:

<context>
{context}
</context>

Question: {input}""")

# Create a retrieval chain to answer questions
document_chain = create_stuff_documents_chain(model, prompt)
retrieval_chain = create_retrieval_chain(retriever, document_chain)
response = retrieval_chain.invoke({"input": "What were the two main things the author worked on before college?"})
print(response["answer"])

tokenizer.json:   0%|          | 0.00/1.80M [00:00<?, ?B/s]

The two main things the author worked on before college were writing and programming. He wrote short stories and tried programming on an IBM 1401 using Fortran, but he found it difficult to figure out what to do with the machine due to the limited input options. His interest in programming grew with the advent of microcomputers, leading him to write simple games, a program to predict rocket trajectories, and a word processor.


## LlamaIndex

In [None]:
!pip install llama-index==0.10.13 llama-index-llms-mistralai==0.1.4 llama-index-embeddings-mistralai==0.1.3

In [None]:
import os
from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex
from llama_index.llms.mistralai import MistralAI
from llama_index.embeddings.mistralai import MistralAIEmbedding

# Load data
reader = SimpleDirectoryReader(input_files=["essay.txt"])
documents = reader.load_data()
# Define LLM and embedding model
Settings.llm = MistralAI(model="mistral-medium")
Settings.embed_model = MistralAIEmbedding(model_name='mistral-embed')
# Create vector store index 
index = VectorStoreIndex.from_documents(documents)
# Create query engine
query_engine = index.as_query_engine(similarity_top_k=2)
response = query_engine.query(
    "What were the two main things the author worked on before college?"
)
print(str(response))

# Haystack

In [None]:
!pip install mistral-haystack==0.0.1

In [36]:
from haystack import Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.dataclasses import ChatMessage
from haystack.utils.auth import Secret

from haystack.components.builders import DynamicChatPromptBuilder
from haystack.components.converters import TextFileToDocument
from haystack.components.preprocessors import DocumentSplitter
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.components.writers import DocumentWriter
from haystack_integrations.components.embedders.mistral import MistralDocumentEmbedder, MistralTextEmbedder
from haystack_integrations.components.generators.mistral import MistralChatGenerator

document_store = InMemoryDocumentStore()

docs = TextFileToDocument().run(sources=["essay.txt"])
split_docs = DocumentSplitter(split_by="passage", split_length=2).run(documents=docs["documents"])
embeddings = MistralDocumentEmbedder(api_key=Secret.from_token(api_key)).run(documents=split_docs["documents"])
DocumentWriter(document_store=document_store).run(documents=embeddings["documents"])


text_embedder = MistralTextEmbedder(api_key=Secret.from_token(api_key))
retriever = InMemoryEmbeddingRetriever(document_store=document_store)
prompt_builder = DynamicChatPromptBuilder(runtime_variables=["documents"])
llm = MistralChatGenerator(api_key=Secret.from_token(api_key), 
                           model='mistral-small')

chat_template = """Answer the following question based on the contents of the documents.\n
                Question: {{query}}\n
                Documents: 
                {% for document in documents %}
                    {{document.content}}
                {% endfor%}
                """
messages = [ChatMessage.from_user(chat_template)]

rag_pipeline = Pipeline()
rag_pipeline.add_component("text_embedder", text_embedder)
rag_pipeline.add_component("retriever", retriever)
rag_pipeline.add_component("prompt_builder", prompt_builder)
rag_pipeline.add_component("llm", llm)


rag_pipeline.connect("text_embedder.embedding", "retriever.query_embedding")
rag_pipeline.connect("retriever.documents", "prompt_builder.documents")
rag_pipeline.connect("prompt_builder.prompt", "llm.messages")

question = "What were the two main things the author worked on before college?"

result = rag_pipeline.run(
    {
        "text_embedder": {"text": question},
        "prompt_builder": {"template_variables": {"query": question}, "prompt_source": messages},
        "llm": {"generation_kwargs": {"max_tokens": 225}},
    }
)

print(result["llm"]["replies"][0].content)

Calculating embeddings:   0%|          | 0/3 [00:00<?, ?it/s]

Calculating embeddings: 100%|██████████| 3/3 [00:01<00:00,  1.75it/s]


The two main things the author worked on before college were writing and programming. He wrote short stories, which he admitted were awful, and essays about various topics. He also worked on spam filters and painted. Additionally, he started having dinners for a group of friends every Thursday night, which taught him how to cook for groups. He also bought a building in Cambridge to use as an office. The author was drawn to writing essays, which he started publishing online, and this helped him figure out what to work on. He also experimented with painting and studied AI in college.
