## **Experimental RAG Template**

> *Modular Retrieval and Generation Pipeline with LangChain + OpenAI*

**`Date:`** **July 2025**

**`Goal:`** This notebook serves as a **baseline template for building Retrieval-Augmented Generation (RAG) pipelines** with LangChain. 

**`Technical Team:`**


|<img src="https://i.imgur.com/HvkmiXf.png" width="100"> | <img src="https://i.imgur.com/27pmxnN.jpeg" width="100" > | <img src="https://i.imgur.com/enG4rPa.jpeg" width="100"> | <img src="https://i.imgur.com/FSL0u7D.jpeg" width="100" > |
|:--:|:--:|:--:|:--:|
| Maria Júlia Vidal | Éric Kauati | Milena Toledo | Fabio Contrera |



## **RAG (Retrieval-Augmented Generation)** 
**RAG** stands for *Retrieval-Augmented Generation*, a technique that improves the answers of large language models (LLMs) by combining their internal knowledge with information retrieved from external sources.



The operation of RAG can be understood as a collaboration between two components:

- **Retrieval (Retriever):** searches for relevant information in external documents (such as PDFs, texts, or vector databases).
- **Generation (Generator):** the language model uses this information, together with its own knowledge, to produce an accurate and contextualized answer.


### How the process works in practice:

1. The user asks an initial question.  
2. The system identifies what information is relevant from external sources.  
3. The relevant information is passed along as context.  
4. That context is combined with the user’s original input and sent to the LLM.  
5. The language model analyzes everything and produces an integrated answer.

<img src="https://i.imgur.com/lqvlQfk.png" width="1000">




### Fundamental concepts:

- **LangChain:** a framework specialized in coordinating the use of LLMs, simplifying the development of complex applications with language models, such as RAG.
- **Chunking:** splitting long texts into smaller pieces (*chunks*), which makes the search more efficient and ensures the texts fit within the model’s context-window limit.
- **Embeddings:** vector representations of texts used to measure similarity between a question and documents. Embeddings enable semantic search.
- **Vector Store:** a vector database where document embeddings are stored, allowing quick and efficient retrieval of the most relevant chunks during a query.
- **LCEL:** stands for LangChain Expression Language, a modular, declarative, and chainable way to integrate all stages of the pipeline in LangChain.


`Sources:`
- What is RAG?: https://www.geeksforgeeks.org/nlp/what-is-retrieval-augmented-generation-rag/

- RAG: https://medium.com/blog-do-zouza/rag-retrieval-augmented-generation-8238a20e381d


> Next, we'll explore each step required to build our RAG pipeline using LangChain. Before that, **read the README.md** to perform the environment *setup* and understand the basic configuration of this *template*.  



### Installations:

First of all, it’s necessary to install the libraries that will be used. As explained in the README.md, we’ll do this through the requirements.txt file.


In [None]:

%pip install -r requirements.txt
%pip install ipykernel

---

## **`1`** - **Imports**

After installing the libraries, we will import the necessary modules.


In [None]:

import os  # Python standard library for interacting with the operating system (accessing environment variables, working with API keys, or file paths).
from langchain_openai import OpenAIEmbeddings  # OpenAI embedding generator, converting text into numerical vectors.
from langchain_community.document_loaders import PyPDFLoader  # Load content from PDF files, converting it to a LangChain-readable format.
from langchain.text_splitter import RecursiveCharacterTextSplitter  # Chunking tool that splits text into smaller overlapping pieces.
from langchain_core.documents import Document  # Defines the basic document structure in LangChain, containing text and metadata.
from langchain_core.vectorstores import InMemoryVectorStore  # In-memory vector store implementation, enabling similarity search.
from langchain_core.prompts import ChatPromptTemplate  # Used to create custom prompts, combining variables, context, and questions within a single structured template.
from langchain_core.output_parsers import StrOutputParser  # Converts the model's generated response into a clean string.
from langchain_core.runnables import RunnablePassthrough  # Helper component used in LCEL to pass data forward unchanged, connecting pipeline parts that don't yet need processing.
from langchain.chains import create_retrieval_chain  # Creates the RAG retrieval chain, from user input to the final output.
from langchain.chains.combine_documents import create_stuff_documents_chain  # Creates a generation (stuffing) chain where the retrieved documents are packed into the prompt to generate the answer.

from dotenv import load_dotenv  # Library for loading environment variables from a .env file, easing the setup of API keys and other sensitive configs.
load_dotenv()  # Calls load_dotenv to load environment variables from the .env file.

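# Retrieves the OPENAI_API_KEY environment variable and stores it in API_KEY_OPENAI.
API_KEY_OPENAI = os.getenv("OPENAI_API_KEY")
# Fails early with a clear message if the key is missing (os.environ only accepts strings, so assigning None would raise a TypeError).
if not API_KEY_OPENAI:
    raise RuntimeError("OPENAI_API_KEY not found. Add it to your .env file.")
# Explicitly sets the API key in the process environment so the OpenAI clients can read it.
os.environ["OPENAI_API_KEY"] = API_KEY_OPENAI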


---


## **`2`** - **Data Loading (Data Ingestion)**

In this step, we load the content that will serve as the knowledge base for our RAG system.





In our example, we work with a scientific article on RAG in PDF format. The PDF content is converted into `Document` objects, the structures LangChain uses internally to hold text and its associated information (such as file name or page number). There are two common ways to organize them:

- Create **one `Document` per page** when the content makes sense separately  
- **Concatenate all pages** into a single `Document` when the content depends on global context  

> Next, we load the PDF, extract the text from each page, and merge everything into a single `Document` with metadata.


In [None]:
# Data ingestion with a single PDF

loader = PyPDFLoader("./my_files/RAG_LLM.pdf")  # Creates a loader to read the file. Replace with your PDF path.
pages = loader.load()  # Loads the PDF and returns a list of Document objects, one per page. (load_and_split() would additionally run a default text splitter, so a long page could yield more than one Document.)

pdf_source = pages[0].metadata['source']  # From the first page, grab the 'source' metadata, which is the PDF path. It will serve as the source of the final concatenated document.

list_document = []  # Create an empty list to store the final concatenated document.

# Join the content of all pages into a single text, separating each page with six line breaks.
concatenated_text = "\n\n\n\n\n\n".join(
    [page.page_content for page in pages])
# Create a new Document object with the concatenated content of the entire PDF and the 'source' metadata indicating the PDF origin.
document = Document(
    metadata={'source': pdf_source}, page_content=concatenated_text)
# Add the concatenated Document to the list of documents.
list_document.append(document)


**`What if there are multiple documents?`**

To load multiple files, implement a `for` loop that iterates over the list of file paths and applies the same process to each file:

- Create a loader  
- Load the PDF and split its content into pages  
- Grab the metadata from the document’s first page  
- Merge the content of all pages into a single text  
- Create a new `Document` object with the concatenated content of the entire PDF and add it to `list_document`  
    
- **Note:** `list_document` must be created outside the loop



In [None]:
# Data ingestion with multiple PDFs

list_document = []  # Creates an empty list to store the final concatenated documents.
folder = "./my_files"  # Path to the folder containing the PDF files.
pdf_paths = sorted(os.path.join(folder, f) for f in os.listdir(folder) if f.lower().endswith('.pdf'))  # Sorted list of all PDF file paths in the folder (matches .pdf and .PDF).

for path in pdf_paths:  # For each PDF in pdf_paths:
    loader = PyPDFLoader(path)  # Create a loader to read the file.
    pages = loader.load()  # Load the PDF as a list of Documents, one per page.

    pdf_source = pages[0].metadata['source']  # From the first page, grab the 'source' metadata (the PDF path).

    # Join the content of all pages into a single text, separating each page with six line breaks.
    concatenated_text = "\n\n\n\n\n\n".join([page.page_content for page in pages])

    # Create a new Document with the concatenated content of the entire PDF and the 'source' metadata.
    document = Document(metadata={'source': pdf_source}, page_content=concatenated_text)
    # Add the concatenated Document to the list of documents.
    list_document.append(document)


##### **`Metadata: What It Is, How We Handle It, and Why We Need It`**

As seen in the data-loading step above, when we load documents (such as PDF files), each page or text snippet is transformed into a `Document` object. These objects carry not only the textual content but also a crucial field: `metadata`.

##### **What Is Metadata?**

**Metadata** is **additional information about the content**, such as:
- The source file name (`'source'`)
- The original page number (`'page'`)
- Any other contextual information you may want to add (e.g., author, category, date)

Example of a `Document` with metadata:

```
Document(
    page_content="Text of page 1...",
    metadata={
        'source': 'contrato.pdf',
        'page': 0
    }
)
```

##### **How do we handle it in the project?**
During document loading and pre-processing:

- Each page read with PyPDFLoader already comes with metadata automatically  
- When we concatenate pages into a single `Document`, we preserve the `source`  
- During chunking, the smaller pieces automatically inherit the metadata from the original document  

**This is important because it allows us to trace the origin of the information used in an answer, contributes to building more reliable and transparent responses, enables the display of the source (e.g., “Source: guide.pdf”), and makes it possible to filter by document type in more advanced queries.**
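As a minimal sketch of how metadata supports source display and filtering, the example below uses a small `Doc` dataclass as a stand-in for LangChain's `Document`, so it runs without any dependency. The texts, file names, and the `category` key are invented for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for LangChain's Document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


# Illustrative retrieved chunks (all values are made up for this example).
retrieved = [
    Doc("Incidents must be reported within 24 hours.", {"source": "guide.pdf", "category": "policy"}),
    Doc("RAG retrieves chunks before generating.", {"source": "RAG_LLM.pdf", "category": "paper"}),
    Doc("Reports are reviewed monthly.", {"source": "guide.pdf", "category": "policy"}),
]

# Filtering: keep only chunks whose metadata matches a given document type.
policy_chunks = [doc for doc in retrieved if doc.metadata.get("category") == "policy"]

# Traceability: collect the unique sources (order preserved) behind the chunks used.
sources = list(dict.fromkeys(doc.metadata["source"] for doc in policy_chunks))
citation = "Source: " + ", ".join(sources)
print(citation)
```

The same pattern applies to real `Document` objects, since they expose the same `page_content` and `metadata` attributes.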

`Sources:`

- Large-scale data ingestion: https://www.crossml.com/build-a-rag-data-ingestion-pipeline/










---


## **`3`** - **Chunking (Splitting the Documents into Smaller Pieces)**

After loading the documents, the next step is chunking, which involves dividing long texts—such as the `Document`s we just generated—into smaller pieces called chunks.






##### **Why this step is essential:**
- LLMs have a limited context window (measured in tokens), so a long document may not fit into a single input.  
- Less is more: smaller, focused pieces tend to make retrieval more precise, because each chunk covers one topic instead of a blend of many.  
- It preserves coherence and relevance in the answer, since each piece represents a unit of information.  

#### **`How do we perform chunking?`**

We use a LangChain tool called `RecursiveCharacterTextSplitter`, which tries to break texts on *hierarchical separators*, enforces a maximum size for each chunk, and keeps an overlap between consecutive chunks to preserve context across them.  

#### **`But what are hierarchical separators, and how do we choose them?`**

They are *break symbols* used to split text naturally and logically, without breaking words.  
In the code below, we use the default list of separators:

(`"\n\n"`, `"\n"`, `" "`, `""`)  

##### **The process works like this:**

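1. First, it tries to split by `\n\n` (two line breaks, i.e., a paragraph boundary).  
2. If a resulting piece is still larger than `chunk_size`, it splits that piece by `\n` (single line break).  
3. If a piece is still too large, it splits by `" "` (a space, i.e., a word boundary).  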
4. If it can’t split logically anymore, the tool performs a forced break (`""`) according to the maximum size (`chunk_size`).
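The fallback order above can be sketched in plain Python. This is a simplified illustration, not LangChain's implementation: the function name `recursive_split` is hypothetical, overlap is ignored, and separators are dropped at chunk boundaries.

```python
def recursive_split(text, separators=("\n\n", "\n", " ", ""), chunk_size=30):
    """Split text using the strongest separator first, falling back to weaker ones."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    # Last resort: no separators left (or the empty separator) means a forced cut.
    if not separators or separators[0] == "":
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, weaker = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces into one chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too large: retry with the next (weaker) separator.
            chunks.extend(recursive_split(piece, weaker, chunk_size))
    if current:
        chunks.append(current)
    return chunks


sample = "First paragraph.\n\nSecond paragraph is a bit longer than the first.\n\nThird."
for chunk in recursive_split(sample, chunk_size=30):
    print(repr(chunk))
```

In this sample, the first and third paragraphs fit as they are, while the second one is too long and is split at word boundaries instead.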



##### **How to choose good separators?**

The type of separators we choose varies according to the **type of document we are loading**.  
It is important to consider the **natural structure of the text**, the **logical breakpoints between ideas**, and the **data format**.  
Some guidelines when selecting your separators:

- If the text is visibly divided into paragraphs, use `\n\n` as the main separator.  
- If each line is a unit (such as lists, code, or logs), use `\n` as the main separator.  
- The default list sequence (`"\n\n"`, `"\n"`, `" "`, `""`) generally works best for articles and PDFs.  
- A good separator should split the text into blocks that make sense on their own, without cutting important sentences in half.  
- Use an ordered list of separators from strongest (paragraphs) to weakest (words).  
- If in doubt, start with the default sequence and refine it based on your results.  


#### **`Maximum Size (chunk_size) with Overlap: What It Is and How to Set It`**

This is the maximum size (in characters) each chunk can have, ensuring the text is short enough to be processed but long enough to retain context.

##### **How to define an ideal chunk_size?**

The ideal `chunk_size` depends on three main factors:

- The context window limit of the language model you are using  
- The structure and density of your documents’ content  
- The application’s goal (answering questions, summarizing, classifying, etc.)  

Ideally, analyze your document and set an initial parameter, then test different sizes on your real content. Check how many chunks are generated, the quality of retrieval, and whether there are abrupt cuts in sentences or paragraphs; then adjust until you find the optimal size for your document or document type.
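One way to run this kind of experiment is a small report that compares candidate sizes. The sketch below is only an illustration: `fixed_chunks` is a hypothetical fixed-width splitter used so the example runs without dependencies, and the sample text is a placeholder. In the notebook, you would generate the chunks with your real splitter and documents.

```python
def fixed_chunks(text, chunk_size):
    """Hypothetical stand-in splitter: cuts the text every chunk_size characters."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def chunking_report(text, sizes):
    """Return (size, number of chunks, share of chunks ending at a sentence end) per size."""
    rows = []
    for size in sizes:
        chunks = fixed_chunks(text, size)
        clean = sum(1 for c in chunks if c.rstrip().endswith((".", "!", "?")))
        rows.append((size, len(chunks), clean / len(chunks)))
    return rows


sample_text = "Alpha beta gamma. " * 10  # placeholder; use your real document content
for size, count, clean_share in chunking_report(sample_text, [18, 25, 36]):
    print(f"chunk_size={size}: {count} chunks, {clean_share:.0%} end on a sentence boundary")
```

A low share of chunks ending on a sentence boundary is a rough signal of abrupt cuts. It does not replace checking retrieval quality on real questions.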



#### **`What about overlap (chunk_overlap)?`**

`chunk_overlap` defines how many characters from the end of one chunk are repeated at the beginning of the next. This repetition is crucial for preserving the continuity of ideas—especially when information starts at the end of a chunk and continues into the next. Without this overlap, the model may interpret the pieces as disconnected blocks, reducing retrieval accuracy and answer quality.

For example, if `chunk_size = 300` and `chunk_overlap = 50`, the chunks will be created roughly as follows:

- Chunk 1 → characters 0 to 300  
- Chunk 2 → characters 250 to 550  
- Chunk 3 → characters 500 to 800  

Notice that the last 50 characters of the previous chunk reappear at the beginning of the next one, ensuring the model maintains the thread of the narrative or explanation.
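The arithmetic behind these ranges is a sliding window whose step is `chunk_size - chunk_overlap`. The helper below (`chunk_windows`, a hypothetical name) is a minimal sketch of that calculation. A real splitter cuts at separators, so actual boundaries will only be approximately at these offsets.

```python
def chunk_windows(text_length, chunk_size, chunk_overlap):
    """Return (start, end) character offsets for fixed-size chunks with overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new chunk advances
    windows, start = [], 0
    while start < text_length:
        end = min(start + chunk_size, text_length)
        windows.append((start, end))
        if end == text_length:
            break  # the text is fully covered
        start += step
    return windows


print(chunk_windows(text_length=800, chunk_size=300, chunk_overlap=50))
# [(0, 300), (250, 550), (500, 800)]
```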

##### **How to define an ideal chunk_overlap?**

The ideal `chunk_overlap` depends on how much each sentence or passage relies on the previous one to make sense: the more dependent the text, the larger the overlap should be.  
Similar to choosing `chunk_size`, start with an initial value, then test different sizes on your actual content—checking retrieval quality and whether there are abrupt cuts in sentences or paragraphs—until you find the optimal overlap for your document type.


In [None]:
chunk_size = 200  # Sets the maximum size of each chunk to 200 characters.
chunk_overlap = 100  # Sets the overlap between chunks to 100 characters.
separators = ["\n\n", "\n", " ", ""]  # Defines the hierarchical separators.

# Creates a 'text_splitter' object that will hold the rules for cutting the documents into smaller pieces (chunks),
# using LangChain's 'RecursiveCharacterTextSplitter'.
# This splitter is recursive: it tries to use the separators in order without exceeding the chunk_size limit.
text_splitter = RecursiveCharacterTextSplitter(  # The split respects the defined chunk_size, chunk_overlap, and hierarchical separators.
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap,
    separators=separators
)

# The splitting rules defined in 'text_splitter' are executed by 'split_documents' on list_document,
# and the resulting pieces (chunks) are stored in the 'chunks' variable.
chunks = text_splitter.split_documents(list_document)


`Sources:` 

- Definitive Guide to Chunking: https://www.robertodiasduarte.com.br/guia-definitivo-de-chunking-para-rag-e-llms-estrategias-essenciais/
- Five Levels of Chunking Strategies: https://medium.com/@anuragmishra_27746/five-levels-of-chunking-strategies-in-rag-notes-from-gregs-video-7b735895694d
- Guide for Chunking Phase: https://learn.microsoft.com/en-us/azure/architecture/ai-ml/guide/rag/rag-chunking-phase


---

## **`4`** - **Vector Store – Creating and Storing Embeddings**

Now that we have split the documents into chunks, the next step is to transform these pieces into embeddings (numeric vectors) and store them in a structure called a “Vector Store.”




#### **`What is a vector store?`**

A vector store is a database that holds embeddings (numeric vectors) together with their original texts. It lets us transform a new question into a vector and **search for the most similar chunks** based on vector similarity.

#### **`But first: What are embeddings, and how do we create them?`**

Embeddings are numeric vectors that represent text (words, sentences, or paragraphs) so that **closeness between vectors reflects semantic closeness**. They are mathematical values derived from human language after being processed by algorithms.

**`Example:`** the embeddings for “car” and “automobile” will be close in the vector space, even though the words differ.

To generate embeddings, we use an OpenAI embedding model called **`text-embedding-3-small`**, which converts each chunk into a high-dimensional dense vector.


##### **`How was the OpenAI embedding model created?`**

Embedding models, such as the OpenAI model we are using, are machine-learning models: neural networks trained on large volumes of text.

> Let’s understand how this OpenAI embedding model was created

The model is fed vast amounts of text and performs the task of automatically predicting the missing word in a sentence.  
Example: “The sky is ___” → the model tries to predict “blue.”

The training process works as follows:

1. **Text processing through neural layers:** The text is converted into numbers (tokens) and passes through multiple layers of the neural network, where each layer extracts increasingly complex patterns and relationships (such as grammatical structure, meaning, tone, and context).

2. **Generation of an internal vector (embedding):** Before the final layer—where the missing word is chosen—the model produces a vector representation of that text, called an embedding, which captures the text’s meaning.

3. **Error calculation:** In the final layer (prediction layer), the model compares the predicted word with the actual word and computes the error (loss function).

4. **Backpropagation and weight adjustment:** With the final error computed, backpropagation kicks in. Layer by layer, the model determines which connections (weights) helped or hindered the correct answer by calculating the gradient, which indicates whether to increase or decrease each weight, and by how much. The weights are then adjusted accordingly. This happens thousands of times until the model learns.

5. **Clustering of similar meanings in vector space:** After training, the model can generate similar embeddings for texts with similar meanings. For example, sentences like “the car is in the garage” and “the automobile is parked” will have nearby embeddings.

After training, the model is used directly—just as we are doing here: you send in a text and it returns an embedding, a vector that represents its meaning in a high-dimensional mathematical space.
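Steps 3 and 4 (error calculation and weight adjustment) can be illustrated with a deliberately tiny model: a single weight, a squared-error loss, and plain gradient descent. This is a toy sketch of the update rule only, not the training procedure of any real embedding model. The data and the learning rate are arbitrary choices.

```python
# Toy data: the correct relationship is y = 2 * x, so the ideal weight is 2.0.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

weight = 0.0          # the model starts out knowing nothing
learning_rate = 0.1   # how strongly each error adjusts the weight

for step in range(50):
    predictions = [weight * x for x in xs]
    # Error calculation: mean squared difference between prediction and target.
    loss = sum((p - y) ** 2 for p, y in zip(predictions, ys)) / len(xs)
    # Gradient: direction and size of the change in loss with respect to the weight.
    gradient = sum(2 * (p - y) * x for p, y, x in zip(predictions, ys, xs)) / len(xs)
    # Weight adjustment: move against the gradient to reduce the error.
    weight -= learning_rate * gradient

print(round(weight, 4), round(loss, 8))
```

A real network repeats the same idea over millions of weights, with backpropagation computing each weight's gradient layer by layer.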

 
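**`ATTENTION`**: The explanation above is a simplified, general illustration of how neural language models learn text representations. OpenAI has not published the full training recipe for the embedding model used in this template, and other training objectives exist (for example, contrastive learning on pairs of related texts).  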


#### **`Why do we use Embeddings?`**

In a RAG pipeline, we want the language model to generate answers based on **relevant and specific information** from our documents.  
To achieve this, we need a way to **retrieve the passages most similar** to the user’s question—and that’s where **embeddings come in**.

##### **Problems with traditional searches**

A keyword-based search system (like Ctrl-F) **only finds passages containing the exact same terms** and does **not handle synonyms, variations, or indirect questions**, which are common in user queries. Therefore, it fails because it does not understand meaning.
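A short sketch makes the limitation concrete. The `keyword_search` function and the example sentences below are hypothetical: a query for "car" finds nothing, even though one sentence is clearly about a car.

```python
example_texts = [
    "The automobile is parked in the garage.",
    "The sky is blue today.",
]


def keyword_search(query, texts):
    """Return the texts that contain every query term literally (case-insensitive)."""
    terms = query.lower().split()
    return [text for text in texts if all(term in text.lower() for term in terms)]


print(keyword_search("automobile", example_texts))  # exact term: one match
print(keyword_search("car", example_texts))         # synonym: no match
```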

Embeddings solve this by **converting texts into numeric vectors that capture their semantics**.  
With embeddings, we can:

- Convert the **user’s question into a vector**  
- Compare it with the **vectors of stored chunks**  
- Return the **most semantically similar ones**  

##### **Where are embeddings used? In the Retriever.**

The **retriever** is the component of the RAG pipeline responsible for:

1. Receiving the user’s question  
2. Generating its embedding (vector)  
3. Comparing this vector with the embeddings of the chunks (in the Vector Store)  
4. Bringing back the chunks closest in vector space  

In other words, the retriever **uses embeddings to find the most relevant content**, even when the question and the document use different words.


The retriever needs to **understand the context of the question**, not just match exact words.  
Embeddings make this possible, enabling searches by **meaning**, not just terms.  
This improves the application's accuracy, flexibility, and responsiveness.


##### **How do we compare embeddings?**

To determine **which embeddings are closest**, we use a metric called **cosine similarity**.

This metric measures the **angle between two vectors**, and the similarity value ranges from:

- `1` → vectors pointing in the same direction (maximum similarity)  
- `0` → orthogonal vectors (no similarity)  
- `-1` → vectors pointing in opposite directions (rare in text embeddings)  

This technique allows us to mathematically and precisely retrieve **the chunks that are semantically closest to the question**.
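Cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below computes it for three hand-made 2-dimensional vectors chosen to show the three reference values. Real embeddings have hundreds or thousands of dimensions, and the vector store performs this computation internally.

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])   # close to 1
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])       # 0
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])       # close to -1
print(same_direction, orthogonal, opposite)
```

Note that `[1.0, 2.0]` and `[2.0, 4.0]` differ in length but score as maximally similar: the metric depends on direction, not magnitude.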

**`ATTENTION`**: There are other ways to compare embeddings, but in this *template* we use the cosine similarity method.


In [None]:
embeddings_model = OpenAIEmbeddings(model="text-embedding-3-small")  # Creates the variable 'embeddings_model', which accesses the OpenAI embedding model via the OpenAIEmbeddings class, connecting you to the OpenAI API.
vectorstore = InMemoryVectorStore(embeddings_model)  # Creates the variable 'vectorstore' as an InMemoryVectorStore object, which uses the embedding model above to generate vectors from the previously created chunks and stores them in memory.
_ = vectorstore.add_documents(documents=chunks)  # Adds the text chunks to the Vector Store, enabling similarity searches among the text chunks.


#### **`Test the similarity search`**

The next cell acts as a manual test of the vector-store stage and similarity search; it is not part of the production RAG pipeline.

It provides a way to validate—before creating the retriever and the chain—that the vector base is ready and returning coherent results.

**`ATTENTION`**: It is recommended to ask test questions in English, the language of the source document, so that the question and the chunks are compared in the same language.


In [None]:
# Performs a similarity search. The question is converted into a vector (embedding), 
# and this vector is compared with all vectors stored in the Vector Store (the chunks). 
# It computes cosine similarity and returns the 5 chunks most semantically similar to the question.
vectorstore.similarity_search(
    "How does the RAG method combine document retrieval with natural language generation?",
    k=5,
)

# Save the vector store locally so you don't have to keep running the embedding model


[Document(id='b8a9b11c-0561-4ef2-8bd6-659a90d1312e', metadata={'source': './meus_arquivos\\RAG_LLM.pdf'}, page_content='overcome challenges, Retrieval-Augmented Generation (RAG)\nenhances LLMs by retrieving relevant document chunks from\nexternal knowledge base through semantic similarity calcu-'),
 Document(id='e64ce71f-3a0e-4f9e-91ef-63ce7f165a29', metadata={'source': './meus_arquivos\\RAG_LLM.pdf'}, page_content='of developing specialized strategies to integrate retrieval with\nlanguage generation models, highlighting the need for further\nresearch and exploration into the robustness of RAG.'),
 Document(id='206aee48-cd0f-4ec8-8b74-b3ac1873815f', metadata={'source': './meus_arquivos\\RAG_LLM.pdf'}, page_content='3\nFig. 2. A representative instance of the RAG process applied to question answering. It mainly consists of 3 steps. 1) Indexing. Documents are split into chunks,'),
 Document(id='46f523eb-37e1-4391-9f38-61fd2f9a022a', metadata={'source': './meus_arquivos\\RAG_LLM.pdf'}, pa

`Sources:`

- Guide to generating embeddings: https://learn.microsoft.com/en-us/azure/architecture/ai-ml/guide/rag/rag-generate-embeddings  
- How to select an embedding model: https://galileo.ai/blog/mastering-rag-how-to-select-an-embedding-model  
- Embeddings and Vectorization: https://joaomarcuraa.medium.com/rag-embeddings-e-vetoriza%C3%A7%C3%A3o-potencializando-a-ia-com-python-e704a39699dd  
- Vector Search and Embeddings: https://www.thecloudgirl.dev/blog/the-secret-sauce-of-rag-vector-search-and-embeddings  
- Embeddings: https://platform.openai.com/docs/guides/embeddings  
- OpenAI Embedding Model: https://platform.openai.com/docs/guides/embeddings#embedding-models



---

## **`5`** - **Retriever**

Now that we have all our **chunks stored in the Vector Store with their embeddings**, the next step is to set up the **component responsible for automatically retrieving relevant text passages** when the user asks a question: the **Retriever**.

**`ATTENTION`**: The test above that used the similarity_search tool was **manual**. Now this search will be automated and carried out internally by the retriever.


#### **`Recap: What Is a Retriever?`**

The **Retriever** is a layer that connects to the Vector Store to perform **semantic** (meaning-based) searches and **return the most relevant chunks** based on a text query.

> It is the bridge between the user’s question and the stored text pieces that most closely match that question in terms of content.

While the Vector Store **stores** the embeddings, the Retriever **searches through them** in an optimized way.
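The division of roles can be sketched with a toy store. Everything here is hypothetical and dependency-free: `ToyVectorStore` is not a LangChain class, and word overlap (Jaccard score) stands in for embedding similarity. The point is only that the retriever is a thin layer that fixes the search settings (here, `k`) and exposes a single "question in, chunks out" call.

```python
class ToyVectorStore:
    """Hypothetical store: keeps texts and ranks them by word overlap with a query."""

    def __init__(self):
        self.texts = []

    def add_texts(self, texts):
        self.texts.extend(texts)

    def similarity_search(self, query, k=4):
        query_words = set(query.lower().split())

        def score(text):
            words = set(text.lower().split())
            union = query_words | words
            return len(query_words & words) / len(union) if union else 0.0

        return sorted(self.texts, key=score, reverse=True)[:k]

    def as_retriever(self, k):
        # The retriever stores nothing itself; it only calls the store with a fixed k.
        return lambda query: self.similarity_search(query, k=k)


store = ToyVectorStore()
store.add_texts([
    "rag combines retrieval and generation",
    "chunking splits long documents",
    "embeddings are numeric vectors",
])
toy_retriever = store.as_retriever(k=1)
print(toy_retriever("what is retrieval and generation"))
```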

**`ATTENTION`**: There are several types of retrievers (Naive, Parent Document, Self-Query, Contextual Compression), but in this template we implement the Naive Retriever, the simplest one.



In [None]:
# Creates a variable 'retriever' that accesses the vector store and applies the method 'as_retriever'
# to transform the Vector Store into a retriever object.
# The parameter search_kwargs={"k": 5} specifies that, for each query, it should return the 5 most similar chunks.
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})


`Sources:`

- Retriever Techniques: https://towardsdatascience.com/advanced-retriever-techniques-to-improve-your-rags-1fac2b86dd61/


---

## **`6`** - **Prompt Generation – Assembling the Question with Context**

Now that we have our **retriever configured**, ready to automatically fetch the **most relevant chunks** from the vector store based on the user’s question, the next step is to **prepare the input that will be sent to the language model** (LLM).

This step is essential: the **model needs to receive both the context and the question**, organized clearly, to generate a **precise and faithful response** to the content.


#### **`But after all, what is a prompt?`**

A prompt is the input text you send to a language model (LLM), such as OpenAI’s.

It serves to instruct the model what it should do and what information it should use to generate a response.

The language model doesn’t “know” what to do on its own—it only reacts to what’s in the prompt.  
If the prompt is poorly formulated, it may:

- Ignore the context  
- Make up answers  
- Respond generically  

A good prompt ensures the model uses the context correctly and answers accurately.

#### **`And what are we going to do with it now?`**

We will create a **structured prompt**, i.e., a *template* that includes:

- System message → defines the model’s behavior (this is not shown to the end user)  
- User message → contains the question  
- Retrieved context → injected into the template  

First, we import LangChain’s `ChatOpenAI`, which lets us use OpenAI’s chat models inside LangChain by connecting to the OpenAI API, sending the prompt, and receiving the response.  




##### **Importing ChatOpenAI**


In [None]:
# Imports the tool that allows using OpenAI chat models within LangChain (connects to the API, sends a prompt, and receives the response)
from langchain_openai import ChatOpenAI


##### **Creating the Prompt Template**

As we saw above, the prompt template consists of the system message, the user message, and the retrieved context (from the retriever).

**`ATTENTION`**: The system message is usually adapted to the document and the application’s goal and should **not** be shown to the end user.

**Steps:**

1. First, we’ll create the `system_prompt` to hold the system message and the retrieved context.  
2. Then, we’ll combine this `system_prompt` with the user input and store it in `prompt`. We’ll do this using `ChatPromptTemplate`, which lets us build prompts in message format with variable parts.  
3. Following the message format, we set `system` to `system_prompt` (defining the AI’s role and behavior) and `human` to the user’s question.  


In [None]:
# Creating a long string called "system_prompt" to be used as the system message—one part of the prompt that defines the model's role and behavior during the response.
system_prompt = (
    "You are an assistant for question-answering tasks for compliance documents. "  # Defines the AI’s role
    "Use the following pieces of retrieved context to answer "  # Instructs where the AI should source information, in addition to its internal knowledge
    "the question. If you don't know the answer, say that you "  # Safety instruction to prevent the AI from inventing answers
    "don't know. Keep the answer as close to the retrieved context as possible."  # Additional safety instruction against fabricated responses
    "\n\n"
    "Context: {context}"  # Placeholder for inserting the retrieved context—the most relevant chunks for the user’s question, obtained from the retriever.
)

# Combining system_prompt (containing the system message and context) with the user’s question using ChatPromptTemplate,
# a tool to build structured prompts for models.
prompt = ChatPromptTemplate.from_messages(  # 'from_messages' is a method that creates a prompt template from a list of messages.
    [
        ("system", system_prompt),  # First message is of type "system", defining the AI’s role and the context it should use.
        ("human", "Question: {input}"),  # Second message is of type "human", representing the user’s question.
                                         # "{input}" will be replaced with the actual question.
    ]
)


##### **Test the Final Prompt Construction**

In this step, we will test if the prompt was built correctly before sending it to the language model (LLM). This includes checking whether:
- The retrieved context is in the right place.  
- The user’s question is in the right place.  
- The message format follows the expected structure.  



In [None]:
# We call the prompt using the 'invoke' method, passing a dictionary with the keys "context" and "input".
# The "context" key receives the context retrieved by the retriever, and the "input" key receives the user's question.
# Both are saved in 'message_list'.
message_list = prompt.invoke({
    "context": "overcome challenges, Retrieval-Augmented Generation (RAG)\n enhances LLMs by retrieving relevant document chunks from\nexternal knowledge base through semantic similarity calculation.",
    "input": "How does the RAG method combine document retrieval with natural language generation?"
})

# Iterate over the list of messages returned by the prompt and call the 'pretty_print' method
# to display each message in a readable format.
for msg in message_list.messages:
    msg.pretty_print()



You are an assistant for question-answering tasks for compliance documents. Use the following pieces of retrieved context to answer the question. If you don't know the answer, say that you don't know. Keep the answer as close to the retrieved context as possible.

Context: overcome challenges, Retrieval-Augmented Generation (RAG)
 enhances LLMs by retrieving relevant document chunks from
external knowledge base through semantic similarity calculation.

Question: How does the RAG method combine document retrieval with natural language generation?
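
The same three checks can also be reproduced without LangChain. The sketch below is only an illustration: it assumes the default format-style `{placeholder}` syntax, uses a shortened copy of the system text, and the helper names `placeholders` and `render` are hypothetical, not part of any library.

```python
from string import Formatter

# Simplified copies of the templates used above (the system text is shortened here).
system_template = (
    "Use the following pieces of retrieved context to answer the question."
    "\n\n"
    "Context: {context}"
)
human_template = "Question: {input}"


def placeholders(template):
    """Return the set of {placeholder} names found in a format-style template."""
    return {name for _, name, _, _ in Formatter().parse(template) if name}


def render(context, question):
    """Fill both templates and return (role, text) pairs, mimicking the message list."""
    return [
        ("system", system_template.format(context=context)),
        ("human", human_template.format(input=question)),
    ]


# Check 1 and 2: each template exposes exactly the placeholder we expect.
print(placeholders(system_template), placeholders(human_template))

# Check 3: the rendered messages follow the system -> human structure.
messages = render("RAG enhances LLMs by retrieving relevant chunks.", "What is RAG?")
for role, text in messages:
    print(f"[{role}] {text}")
```

If a placeholder name were misspelled in a template, the first check would reveal it before any call to the LLM is made.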


`Sources:`

- Prompt Engineering in RAG: https://ventiladigital.com.br/blog/estrategias-eficazes-para-engenharia-de-prompts-em-rag/  
- Prompt Guide – Techniques for RAG: https://www.promptingguide.ai/pt/techniques/rag




---

## **`7`** - **RAG Chain**

We have reached the final stage of the pipeline: the RAG Chain. This is the component that integrates all parts of the pipeline:
- **Retriever:** fetches the most relevant chunks for the context  
- **Prompt:** organizes the context, the system message, and the user’s question into a single input  
- **LLM:** generates the final answer  

This is where the system takes the user’s question, retrieves the context, assembles the prompt, and generates the answer in a fully automated manner.




The process will work as follows:

1. First we will **choose the model and set some parameters**.  
2. Then we will **connect the pipeline components and get the final answer from the LLM**. We can do this in two different ways, depending on the level of customization required for your application:  
   - **LCEL**: a manual and flexible way to build the pipeline. You connect components using the `|` operator, like chained functional blocks. Ideal for those who need **more control**, want to modify parts of the logic, integrate validations, or add intermediate steps.  
   - **Create Stuff**: uses utility functions such as `create_stuff_documents_chain` and `create_retrieval_chain`, which **assemble and connect the main blocks automatically**. This reduces code and speeds up development, at the cost of less customization and control.  

**`When should I use LCEL, and when should I use Create Stuff?`**

| Use case                                        | Recommendation   |
|-------------------------------------------------|------------------|
| I need control over every step of the pipeline  | **LCEL**         |
| I want to change the parser, prompt, or data flow | **LCEL**       |
| I am building a functional prototype            | **Create Stuff** |
| The project is simple and needs minimal code    | **Create Stuff** |
| I want custom logic (e.g., filters, logs)       | **LCEL**         |
| I want to scale with less manual effort         | **Create Stuff** |

---


> Now let’s put it into practice


##### **1- Choose the Model and Set Parameters**


In [None]:
# Create the variable 'llm' that accesses the OpenAI chat model using LangChain’s `ChatOpenAI`, which connects to the OpenAI API.
llm = ChatOpenAI(model_name="gpt-4.1-mini", temperature=0)  # Set the OpenAI model to use and the temperature, which controls how creative the model should be.


**`What is a model’s temperature?`**

Temperature is a parameter that controls the level of randomness or creativity in responses generated by models like GPT.  
For OpenAI chat models it accepts values from 0 to 2; in practice most applications stay between 0 and 1, where:  
- `0` → more direct, precise, and consistent answers  
- `1` → more creative, open-ended, and varied answers  

The default value depends on the installed version of `langchain-openai` (older releases of `ChatOpenAI` defaulted to 0.7), so it is safer to set the temperature explicitly, as done above.

**`ATTENTION`**: Temperature should be adjusted based on the type of document loaded and the application’s goal. For RAG applications, a low temperature such as 0 is generally recommended, because answers should stay close to the retrieved context.  
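
To build intuition for what temperature does, the sketch below applies a temperature-scaled softmax to a few made-up token scores. It is a simplified illustration, not the provider's actual sampling code: the scores are hypothetical, and treating temperature 0 as "always pick the top score" is an assumption made here to avoid dividing by zero.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        # Assumption: temperature 0 is modeled as greedy selection of the best score.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens
for t in (0, 0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"temperature={t}: {[round(p, 3) for p in probs]}")
```

As the temperature grows, probability mass moves from the top candidate to the others, which is why higher values produce more varied answers.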


##### **2.1- LCEL: LangChain Expression Language**

LCEL is a declarative, functional language built into LangChain, which means that:  
- You declare what should happen, e.g., input → prompt → model → output  
- It follows functional programming principles:  
  - Each component is a function  
  - Functions are chained with the `|` operator, and the resulting chain is executed with `invoke()`  
  - Functions do not mutate each other’s internal state but compose to form larger functions from smaller ones  

This lets you connect components in a chained manner, reducing boilerplate and making the pipeline reusable.  
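
The composition idea can be illustrated in plain Python. The `Step` class below is a hypothetical stand-in written for this example; it is not LangChain's implementation, only a minimal sketch of how overloading `|` can turn small functions into one larger function.

```python
class Step:
    """Hypothetical stand-in for a chainable component: wraps a function and supports `|`."""

    def __init__(self, func):
        self.func = func

    def __or__(self, other):
        # `a | b` builds a new Step that runs `a`, then feeds its output to `b`.
        return Step(lambda value: other.invoke(self.invoke(value)))

    def invoke(self, value):
        return self.func(value)


# Three toy components standing in for prompt, model, and parser.
build_prompt = Step(lambda question: f"Question: {question}")
fake_llm = Step(lambda prompt: {"content": prompt.upper()})  # pretend model reply
parse = Step(lambda message: message["content"])

chain = build_prompt | fake_llm | parse
print(chain.invoke("what is rag?"))  # prints "QUESTION: WHAT IS RAG?"
```

None of the steps knows about the others; each only receives the previous output, which is what makes individual blocks easy to swap or reuse.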

##### **`But in practice, how does it work?`**

In practice, LCEL lets you assemble your RAG pipeline as a **continuous execution flow**, linking blocks with the `|` operator.  
Each block is a component that **transforms or uses the data** before passing it on:

1. Build the input dictionary: the retriever fetches the context for the question, and the user input is passed through unchanged  
2. Replace the placeholders in the prompt with real values to generate the full prompt  
3. Pass the full prompt to the LLM, which produces a raw response  
4. Convert the LLM’s raw response into a clean string ready for display  


In [None]:
# Create a variable 'parser' by initializing the 'StrOutputParser' class,
# which extracts the text content of the model's response message
# and returns it as a plain string, ready for use.
parser = StrOutputParser()

# Create a complete RAG chain that connects all components (retriever, prompt, llm, and parser)
# to form a functional flow using LCEL.
rag_chain = (
    {
        # Define the input format expected by the chain.
        "context": retriever,                # Calls the retriever to fetch relevant context.
        "input": RunnablePassthrough()       # Uses RunnablePassthrough to pass the user input unchanged.
    }
    | prompt     # Passes the {input, context} dictionary to ChatPromptTemplate,
                # filling the {input} and {context} placeholders with real values
                # and generating the full prompt for the LLM.
    | llm        # Invokes the LLM with the full prompt; the result is a chat message object, not yet a plain string.
    | parser     # Uses the parser to extract the text content of that message as a plain string,
                # ready for use or display.
)
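
In the chain above, the retriever's output (a list of documents) goes straight into the `{context}` placeholder, so the prompt receives the list's string representation, metadata included. A common refinement is to join only the text of each chunk before it reaches the prompt. The sketch below shows the idea with a hypothetical `Doc` dataclass standing in for LangChain's document class; `format_docs` is an illustrative helper name, not something defined elsewhere in this notebook.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a retrieved document (text plus metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_docs(docs):
    """Join the text of each retrieved chunk, separated by blank lines."""
    return "\n\n".join(doc.page_content for doc in docs)


docs = [
    Doc("RAG retrieves relevant chunks.", {"page": 1}),
    Doc("The LLM answers from them.", {"page": 3}),
]

print(str(docs))          # what the template receives without formatting
print(format_docs(docs))  # only the chunk text
```

In an LCEL chain, such a function can be piped after the retriever, for example `"context": retriever | format_docs`, since LCEL accepts plain functions as steps when they are combined with a runnable.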


**`Functional Test of the Complete Pipeline`**

Now that all components are connected, it’s time to simulate the application’s real behavior with a sample question. To do this, the chain is executed using the `invoke` method, passing the question “How does the RAG method combine document retrieval with natural language generation?” as a string. The variable `response` holds the result of the chain, which is the LLM’s answer as a plain string. The question relates to the document loaded in this notebook and should be adapted to your own document.


In [None]:

# Call the RAG chain with the user's question
response = rag_chain.invoke("How does the RAG method combine document retrieval with natural language generation?")
print(response)  # Display the final answer generated by the RAG chain, which is a plain string ready for use or display.


The Retrieval-Augmented Generation (RAG) method combines document retrieval with natural language generation by first retrieving relevant document chunks from an external knowledge base through semantic similarity calculations. This retrieval process is integrated with language generation models to enhance responses. Specifically, the RAG process involves indexing documents by splitting them into chunks, retrieving the most relevant chunks based on the input query, and then using these retrieved documents to augment the generation of natural language responses. This synergy between retrieval and generation improves the overall efficiency and quality of the system's outputs.


##### **2.2 - Create Stuff: Utility Functions**

In the previous step we already produced the final LLM answer as a plain string. LCEL is a manual, controlled way to assemble the pipeline: you define each step and can modify the flow as you like, which makes it ideal for customization.

However, there is another way to connect the components and generate this flow in a more simplified and automatic way, without much customization or control, using utility functions like `create_stuff_documents_chain` and `create_retrieval_chain`. Let’s dive into this method now:

**1.** First we use LangChain’s `create_stuff_documents_chain` function, which creates a “document chain,” i.e., a chain responsible for generating the final LLM answer based on context documents. Internally, the function:

- Expects to receive documents (usually coming from the retriever)  
- Inserts these documents into the `{context}` field of the prompt  
- Inserts the user’s question into the `{input}` field  
- Sends the complete prompt to the language model (LLM)  
- Returns only the final answer generated by the LLM  

Think of it as the part that builds the final prompt and generates the answer.

**2.** Then we use the `create_retrieval_chain` function, which connects the retriever to the answer-generation chain (created above), forming the complete RAG flow. Internally, the function:

- Uses the retriever to fetch relevant documents based on the question  
- Sends these documents to the `question_answer_chain`  
- Uses the LLM to generate the final answer  
- Returns a dictionary containing:  
  - **`input`** → the original question  
  - **`answer`** → the answer generated by the LLM  
  - **`context`** → the documents used as support  

Think of it as the complete RAG structure, which combines:  
- context retrieval (retriever)  
- with answer generation (LLM)  
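
The two steps described above can be sketched in plain Python to make the data flow visible. Everything here is a toy stand-in: `toy_retriever` ranks chunks by word overlap rather than embeddings, `fake_llm` does not call any model, and the function names are hypothetical rather than LangChain APIs.

```python
TEMPLATE = "Use the context to answer.\n\nContext: {context}\n\nQuestion: {input}"

CHUNKS = [
    "RAG retrieves relevant document chunks from an external knowledge base.",
    "Temperature controls the randomness of generated text.",
    "Retrieved chunks are added to the prompt before generation.",
]


def toy_retriever(question, k=2):
    """Rank chunks by how many lowercase words they share with the question."""
    words = set(question.lower().split())
    ranked = sorted(CHUNKS, key=lambda c: len(words & set(c.lower().split())), reverse=True)
    return ranked[:k]


def stuff_documents(docs, question):
    """'Stuff' all chunk texts into the {context} placeholder of a single prompt."""
    return TEMPLATE.format(context="\n\n".join(docs), input=question)


def fake_llm(prompt):
    # Hypothetical stand-in for a model call.
    return f"(answer generated from a prompt of {len(prompt)} characters)"


def retrieval_chain(question):
    """Retrieve, stuff, generate, and return the question, the context, and the answer."""
    docs = toy_retriever(question)
    answer = fake_llm(stuff_documents(docs, question))
    return {"input": question, "context": docs, "answer": answer}


result = retrieval_chain("How are retrieved chunks used in generation?")
print(result["answer"])
for chunk in result["context"]:
    print("-", chunk)
```

Because every retrieved chunk is placed into one prompt, this strategy is simple but depends on the chunks fitting within the model's context window.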







In [None]:
# Creates a "stuff" document chain (simple aggregation of documents) using the language model (LLM) and the defined prompt.
# This chain is responsible for generating the final RAG answer based on the retrieved context.
question_answer_chain = create_stuff_documents_chain(llm, prompt)

# Creates the complete RAG chain, connecting the retriever with the answer-generation chain (question_answer_chain) defined above.
rag_chain = create_retrieval_chain(retriever, question_answer_chain)


**`Functional Test of the Complete Pipeline`**

Now that all components are connected, it’s time to simulate the application’s real behavior with an example question.  
For this, the chain is executed via the `invoke` method by passing a dictionary whose `input` key holds the question. The variable `res` holds the result of the chain, a dictionary in which the LLM’s answer is stored under the `answer` key.  
The question relates to the document loaded in this notebook and should be adapted to your own document.


In [None]:
res = rag_chain.invoke(
    {"input": "How does the RAG method combine document retrieval with natural language generation?"}
)  # Call the RAG chain with the user's question
res["answer"]  # Display the final answer generated by the RAG chain, which is a plain string ready for use or display.


'The Retrieval-Augmented Generation (RAG) method combines document retrieval with natural language generation by first retrieving relevant document chunks from an external knowledge base through semantic similarity calculations. This retrieval step provides pertinent information that is then integrated into the language generation process. By combining "Retrieval," "Generation," and "Augmentation," RAG enhances language models to produce more informed and accurate responses. The process typically involves indexing documents by splitting them into chunks, retrieving relevant chunks based on the query, and then generating responses that incorporate the retrieved information, thereby improving the overall efficiency and robustness of the system.'

**`Visualizing the Context`**

After executing the RAG chain and obtaining the answer, you can also inspect the documents that were retrieved as context and used as the basis for the final response.


In [None]:
# Iterate over each document in the context list returned by the RAG
for context in res["context"]:
    print(context.page_content)  # Print the text content (raw text) of each retrieved document


overcome challenges, Retrieval-Augmented Generation (RAG)
enhances LLMs by retrieving relevant document chunks from
external knowledge base through semantic similarity calcu-
of developing specialized strategies to integrate retrieval with
language generation models, highlighting the need for further
research and exploration into the robustness of RAG.
3
Fig. 2. A representative instance of the RAG process applied to question answering. It mainly consists of 3 steps. 1) Indexing. Documents are split into chunks,
to the RAG process, specifically focusing on the aspects
of “Retrieval”, “Generation” and “Augmentation”, and
delve into their synergies, elucidating how these com-
responses, thus improving the overall efficiency of the RAG
system. To capture the logical relationship between document
content and structure, KGP [91] proposed a method of building


`Sources:`

- Building the RAG Chain with LCEL: https://towardsdatascience.com/building-a-rag-chain-using-langchain-expression-language-lcel-3688260cad05/  
- Create Stuff Documentation: https://python.langchain.com/api_reference/langchain/chains/langchain.chains.combine_documents.stuff.create_stuff_documents_chain.html


---

## **`Conclusion`**

With this complete RAG (Retrieval-Augmented Generation) pipeline using LangChain and OpenAI, we were able to:

- Load and prepare PDF documents as a knowledge base.  
- Split the content into chunks optimized for retrieval.  
- Generate vector embeddings and store them in a vector store.  
- Configure a retriever for semantic search.  
- Assemble structured, customized prompts.  
- Create automated execution flows with LCEL and utility functions.  
- Simulate and test the application’s real behavior in a modular, reusable way.  

This pipeline provides a solid starting point for document-based QA applications, such as legal assistants, contract analyzers, and intelligent technical support.

**`References and Recommended Materials`**

**Official Documentation**

- LangChain Documentation – comprehensive documentation with tutorials and practical examples. (https://python.langchain.com/docs/introduction/)  
- LangChain Prompt Guide – official guide to understanding and building effective prompts. (https://js.langchain.com/docs/how_to/graph_prompting/)  
- OpenAI API Reference – API reference for generating embeddings and responses. (https://platform.openai.com/docs/overview)  

**Articles and Guides**

- LangChain: Getting Started (DataCamp) – a practical, well-explained introduction. (https://www.datacamp.com/tutorial/introduction-to-langchain-for-data-engineering-and-data-applications)  
- Advanced Retriever Techniques – explore retriever types beyond the Naive Retriever. (https://towardsdatascience.com/advanced-retriever-techniques-to-improve-your-rags-1fac2b86dd61/)  
- Vector Databases 101 (Pinecone) – excellent explanation of the role of vector stores in AI systems. (https://www.pinecone.io/learn/vector-database/)

**Extras for Deep Dive**

- LangChain YouTube Channel – short, focused videos to understand tool usage. (https://www.youtube.com/@LangChain)  
- LangChain RAG Complete Playlist – a comprehensive playlist on RAG with LangChain. (https://www.youtube.com/playlist?list=PLfaIDFEXuae2LXbO1_PKyVJiQ23ZztA0x)  
