# Learn Retrieval-Augmented Generation (RAG) by Building a PDF Chatbot – A Beginner’s Guide with Mistral AI and LangChain

### What is Retrieval-Augmented Generation (RAG)?
Retrieval-Augmented Generation (RAG) is a technique that combines two components: a retrieval system and a language model. Instead of relying only on what the model learned during training, RAG lets it look up relevant information in an external source (like a PDF) before generating a response. First, it retrieves the text snippets most relevant to the input question; then it passes those snippets to the language model so the answer is grounded in them. This makes RAG especially useful for questions about recent events or about specific documents that were not part of the model's training data. It's like giving the AI a quick Google search to help it respond better.

## Retrieval-Augmented Generation (RAG) Architecture

![RAG Architecture](images/rag_architecture.png)

**Step 1: Data Collection**  
Gather your source documents (PDFs, Word files, websites, etc.) that contain the information you want the AI to use.

**Step 2: Chunking (Split the Data)**  
Break down large texts into smaller, manageable pieces called "chunks" (e.g., a few paragraphs per chunk).

**Step 3: Embedding (Vectorization)**  
Convert each chunk into a vector (a numerical representation of the text’s meaning) using an embedding model.

**Step 4: Store in Vector Database**  
Save the vectors in a vector database like FAISS, Pinecone, or Weaviate to allow fast and meaningful searches.

**Step 5: Retrieval**  
When a user asks a question, convert the question into a vector and search the database to retrieve the most relevant chunks.

**Step 6: Generation**  
Send both the original question and the retrieved chunks to a language model (like GPT or Mistral) to generate a response that is informed and accurate.
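
The six steps above can be sketched end to end in plain Python. This is only a toy illustration, not the pipeline built later in this guide: the "embedding" is just the set of words in a chunk, the similarity score is the number of shared words, and the function names (`split_into_chunks`, `embed`, `retrieve`, `build_prompt`) are made up for this example.

```python
import re

def split_into_chunks(text: str) -> list[str]:
    """Step 2: split the text into sentence-sized chunks."""
    return [part.strip() for part in text.split(".") if part.strip()]

def embed(text: str) -> set[str]:
    """Step 3 (toy stand-in): represent text as its set of lowercase words."""
    return set(re.findall(r"[a-z]+", text.lower()))

def retrieve(question: str, store: list[tuple[str, set[str]]], k: int = 1) -> list[str]:
    """Step 5: rank stored chunks by how many words they share with the question."""
    question_words = embed(question)
    ranked = sorted(store, key=lambda item: len(question_words & item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

def build_prompt(question: str, chunks: list[str]) -> str:
    """Step 6 (input side): combine the question and the retrieved chunks."""
    context = "\n".join(chunks)
    return f"Question: {question}\nContext:\n{context}"

# Step 1: the "document" (a made-up example text)
document = "Paris is the capital of France. Python is a programming language. Cats sleep most of the day."

# Steps 2-4: chunk, embed, and store
store = [(chunk, embed(chunk)) for chunk in split_into_chunks(document)]

# Steps 5-6: retrieve the best chunk and build the prompt a language model would receive
question = "What is the capital of France?"
top_chunks = retrieve(question, store, k=1)
prompt = build_prompt(question, top_chunks)
print(prompt)
```

A real system swaps the word-set trick for an embedding model and sends the prompt to a language model, which is what the rest of this guide does.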

### Let's Build a PDF Chatbot to Understand Retrieval-Augmented Generation (RAG) with Mistral AI and LangChain

In [None]:
%pip install langchain            
%pip install langchain_mistralai
%pip install python-dotenv        
%pip install langchain_community 
%pip install pypdf              


Note: you may need to restart the kernel to use updated packages.



In [2]:
from dotenv import load_dotenv   # Import the function to load environment variables
import os                        # Import the os module to access environment variables

load_dotenv()                    # Load the variables from the .env file
api_key = os.getenv("MISTRALAI_API_KEY")  # Read your Mistral API key (create one at https://admin.mistral.ai/organization/api-keys)
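
If the `.env` file is missing or the variable name is misspelled, `os.getenv` quietly returns `None`, and the problem only surfaces later when the API is called. A small helper can fail early instead. This is an optional sketch; `require_env` is a hypothetical name, not part of `python-dotenv` or LangChain.

```python
import os

def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set; check your .env file.")
    return value

# Usage, with the variable name from the cell above:
# api_key = require_env("MISTRALAI_API_KEY")
```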


In [3]:
from langchain_core.vectorstores import InMemoryVectorStore
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from langchain_mistralai import ChatMistralAI
from langchain_mistralai import MistralAIEmbeddings
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters.character import CharacterTextSplitter

The code below loads a PDF file and splits it into pages. PyPDFLoader reads the file, and .load() returns one document per page. print(len(...)) shows how many pages were loaded.

In [4]:
loader_pdf = PyPDFLoader("Introduction_to_Data_and_Data_Science.pdf")
pages_pdf = loader_pdf.load()
print(len(pages_pdf))

6


This code splits the loaded PDF pages into smaller chunks. It splits the text at periods ("."), builds chunks of up to about 500 characters, and overlaps neighboring chunks by up to 50 characters to keep context.

In [5]:
char_splitter = CharacterTextSplitter(separator=".",
                                      chunk_size=500,
                                      chunk_overlap=50)
pages_char_split = char_splitter.split_documents(pages_pdf)
print(len(pages_char_split))

22
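
To see what chunk size and overlap mean, here is a simplified sliding-window sketch in plain Python. It is not how CharacterTextSplitter works internally (that splitter first cuts at the separator and then merges the pieces); `sliding_chunks` is a hypothetical helper that only illustrates how consecutive chunks share characters.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window moves forward
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]

chunks = sliding_chunks("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
# abcdefghij
# hijklmnopq
# opqrstuvwx
# vwxyz
```

Each chunk starts with the last three characters of the previous one, so a sentence cut at a chunk boundary still appears with some of its surrounding text.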


This code creates embeddings (number-based text representations) using Mistral’s embedding model and your API key. Then, it converts the split text chunks into vectors and stores them in an in-memory vector database, allowing fast searching of similar text during retrieval.

In [6]:
embeddings = MistralAIEmbeddings(model="mistral-embed", api_key=api_key)
vector_store = InMemoryVectorStore.from_documents(documents=pages_char_split,
                                                  embedding=embeddings)
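
Conceptually, a vector store keeps each chunk next to its vector and, at search time, ranks chunks by how close their vectors are to the query vector. The sketch below shows that idea using cosine similarity, a common closeness measure. `TinyVectorStore` and the three-number vectors are invented for illustration; real embeddings have hundreds of dimensions, and this is not LangChain's implementation.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: near 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

class TinyVectorStore:
    """Toy in-memory store: a list of (text, vector) pairs."""

    def __init__(self) -> None:
        self.items: list[tuple[str, list[float]]] = []

    def add(self, text: str, vector: list[float]) -> None:
        self.items.append((text, vector))

    def search(self, query_vector: list[float], k: int = 2) -> list[str]:
        """Return the k texts whose vectors are most similar to the query vector."""
        ranked = sorted(
            self.items,
            key=lambda item: cosine_similarity(query_vector, item[1]),
            reverse=True,
        )
        return [text for text, _ in ranked[:k]]

store = TinyVectorStore()
store.add("chunk about data analysis", [0.9, 0.1, 0.8])  # made-up vectors
store.add("chunk about cooking", [0.0, 1.0, 0.0])
store.add("chunk about analytics", [0.7, 0.2, 0.9])

results = store.search([1.0, 0.0, 1.0], k=2)
print(results)
```

The cooking chunk points in a different direction from the query vector, so it is ranked last and left out of the top two results.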



Calling vector_store.as_retriever() creates a retriever: a search tool over the vector store that finds the text chunks most relevant to the user's question, so the AI can base its answer on them.

In [7]:
retriever = vector_store.as_retriever()

This prompt instructs the AI to use only the retrieved context (from the vector store) when answering the question, so the answer is grounded in the specific information found rather than in guesswork. It also asks the AI to mention which lecture the information came from.

In [8]:
TEMPLATE = '''
Answer the following question:
{question}

To answer the question, use only the following context:
{context}

At the end of the response, specify the name of the lecture this context is taken from in the format:
Resources: *Lecture Title*
where *Lecture Title* should be substituted with the title of all resource lectures.
'''

prompt_template = PromptTemplate.from_template(TEMPLATE)
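
To see what the model will actually receive, you can fill a template of the same shape by hand with Python's built-in `str.format`. For a simple template like this one the result should look much the same as what PromptTemplate produces, but treat this as an illustration rather than a description of LangChain internals. `TEMPLATE_DEMO` and the placeholder chunks are made up for the example.

```python
TEMPLATE_DEMO = '''
Answer the following question:
{question}

To answer the question, use only the following context:
{context}
'''

# Stand-ins for the chunks a retriever would return
retrieved_chunks = ["first retrieved chunk", "second retrieved chunk"]

# Join the chunks into one block of text and fill both placeholders
context = "\n\n".join(retrieved_chunks)
filled = TEMPLATE_DEMO.format(question="What is this lecture about?", context=context)
print(filled)
```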

This code initializes the Mistral chat model using your API key and selects the "mistral-large-latest" version, so you can start generating AI responses.

In [9]:
chat = ChatMistralAI(api_key=api_key, model="mistral-large-latest")

In [10]:
question = "Give me a summary of the lecture and the resources used in it."

This code creates a chain that retrieves relevant context, formats a prompt with the user’s question, sends it to the AI model for an answer, and returns the final response as a clean text output.

In [11]:
# Create the chain that links retriever, prompt, chat model, and output parser
chain = (
    {'context': retriever, 'question': RunnablePassthrough()}  # Get context + pass question
    | prompt_template                                         # Format prompt with template
    | chat                                                   # Generate answer from chat model
    | StrOutputParser()                                      # Parse output as string
)

# Use the chain to answer a question
response = chain.invoke(question)
print(response)


The lecture provides an introduction to data and data science, focusing on the concepts of analysis and analytics. It explains that analysis involves breaking down large datasets into smaller, manageable chunks to study them individually and understand their relationships. The lecture emphasizes that analysis is retrospective, focusing on past events to explain how or why something happened. It also discusses the applicability of various programming and software tools in data science, noting that learning software tools can be easier than learning programming languages, especially with relevant business and theoretical knowledge. However, strong theoretical preparation might make software tools feel restrictive.

Resources: Introduction to Data and Data Science
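
The `|` operator in the chain passes the output of each step to the next one. The toy class below mimics that behavior in plain Python to make the data flow visible. It is a sketch of the idea only, not how LangChain implements it, and `Step`, `fake_retriever`, and `fake_model` are hypothetical stand-ins.

```python
class Step:
    """Wrap a function so that steps can be chained with the | operator."""

    def __init__(self, func):
        self.func = func

    def __or__(self, other: "Step") -> "Step":
        # Chaining two steps gives a new step: run self first, then other on its result
        return Step(lambda value: other.invoke(self.invoke(value)))

    def invoke(self, value):
        return self.func(value)

def fake_retriever(question: str) -> list[str]:
    """Stand-in for the retriever: always returns the same two chunks."""
    return ["chunk A", "chunk B"]

# Mirrors {'context': retriever, 'question': RunnablePassthrough()}
gather = Step(lambda q: {"context": fake_retriever(q), "question": q})
# Mirrors the prompt template
format_prompt = Step(lambda d: f"Q: {d['question']} | Context: {', '.join(d['context'])}")
# Stand-in for the chat model: returns a message-like dict
fake_model = Step(lambda prompt: {"content": prompt.upper()})
# Mirrors the output parser: keep only the text
parse = Step(lambda message: message["content"])

toy_chain = gather | format_prompt | fake_model | parse
result = toy_chain.invoke("what is rag?")
print(result)  # Q: WHAT IS RAG? | CONTEXT: CHUNK A, CHUNK B
```

The question enters once, is used both to fetch context and as the question itself, and leaves as a plain string after the last step.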


In [12]:
%pip install gradio


Note: you may need to restart the kernel to use updated packages.


In [13]:
import gradio as gr
from langchain_core.vectorstores import InMemoryVectorStore
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from langchain_mistralai import ChatMistralAI, MistralAIEmbeddings
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters.character import CharacterTextSplitter

# Assume api_key is set in an upper cell and available here

vector_store = None
retriever = None
chain = None
chat = ChatMistralAI(api_key=api_key, model="mistral-large-latest")

TEMPLATE = '''
Answer the following question:
{question}

To answer the question, use only the following context:
{context}
'''

prompt_template = PromptTemplate.from_template(TEMPLATE)

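def embed_pdf(file):
    global vector_store, retriever, chain

    # Guard against clicking "Embed PDF" before a file has been uploaded
    if file is None:
        return "Please upload a PDF first!"

    loader_pdf = PyPDFLoader(file.name)
    pages_pdf = loader_pdf.load()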

    char_splitter = CharacterTextSplitter(separator=".", chunk_size=500, chunk_overlap=50)
    pages_char_split = char_splitter.split_documents(pages_pdf)

    embeddings = MistralAIEmbeddings(model="mistral-embed", api_key=api_key)
    vector_store = InMemoryVectorStore.from_documents(documents=pages_char_split, embedding=embeddings)
    retriever = vector_store.as_retriever()

    chain = (
        {'context': retriever, 'question': RunnablePassthrough()}
        | prompt_template
        | chat
        | StrOutputParser()
    )
    return "Embedding complete! Now you can ask questions."

def ask_question(question):
    global chain
    if not chain:
        return "Please upload and embed a PDF first!"
    response = chain.invoke(question)
    return response

with gr.Blocks() as demo:
    gr.Markdown("# PDF Chat with Retrieval-Augmented Generation (RAG)")
    pdf_input = gr.File(label="Upload a PDF")
    embed_btn = gr.Button("Embed PDF")
    output_embed = gr.Textbox(label="Status")
    
    question_input = gr.Textbox(label="Ask a question about the PDF")
    ask_btn = gr.Button("Ask")
    output_answer = gr.Textbox(label="Answer")

    embed_btn.click(embed_pdf, inputs=pdf_input, outputs=output_embed)
    ask_btn.click(ask_question, inputs=question_input, outputs=output_answer)

demo.launch()


* Running on local URL:  http://127.0.0.1:7860
* To create a public link, set `share=True` in `launch()`.


