# Simple financial analysis RAG
This notebook builds a lightweight Retrieval-Augmented Generation (RAG) pipeline using LangChain and a local Qdrant vector store.

It’s designed to demonstrate how to:
1. Load content from PDF documents.
2. Split it into manageable chunks for effective semantic search.
3. Embed those chunks and store them in a vector database (Qdrant) for retrieval.
4. Answer a user question by retrieving relevant chunks and prompting a language model.


<p align="center">
  <img src="./docs/simple_rag.png" alt="RAG pipeline" width="500"/>
</p>

## Set up

### Set up environment variables for LangSmith and OpenAI

This cell configures your environment with the necessary API keys for LangSmith and OpenAI:
- `LANGSMITH_TRACING` is enabled to track and visualize LangChain executions.
- `LANGSMITH_API_KEY` is securely requested via `getpass` and stored as an environment variable.
- If `OPENAI_API_KEY` is not already defined in the environment, the user is prompted to enter it securely.

Using `getpass` ensures that sensitive credentials are not echoed to the screen or saved in the notebook.

In [None]:
import getpass
import os

os.environ["LANGSMITH_TRACING"] = "true"
os.environ["LANGSMITH_API_KEY"] = getpass.getpass("Enter API key for LangSmith: ")

# Only prompt for the OpenAI key if it is not already set
if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter API key for OpenAI: ")
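
The "prompt only if missing" pattern used for the OpenAI key can be factored into a small helper. This is a minimal sketch, not part of LangChain or LangSmith: `ensure_env` and its `ask` parameter are hypothetical names introduced here, and `ask` defaults to `getpass.getpass` so the input is not echoed.

```python
import getpass
import os
from typing import Callable


def ensure_env(name: str, ask: Callable[[str], str] = getpass.getpass) -> str:
    """Return os.environ[name], prompting for it only when it is unset or empty."""
    value = os.environ.get(name)
    if not value:
        # Hypothetical prompt text; `ask` is injectable so it can be replaced in tests
        value = ask(f"Enter value for {name}: ")
        os.environ[name] = value
    return value
```

With this helper, `ensure_env("OPENAI_API_KEY")` would prompt once and then reuse the stored value on later runs of the cell.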

### Initialize OpenAI LLM
We use GPT-4o-mini as the LLM: a fast, low-cost model well suited to focused, retrieval-augmented tasks like this one. It pairs well with a retriever because the LLM doesn’t need to “know everything”; it only needs to reason well over the provided context.

In [None]:
from langchain.chat_models import init_chat_model

llm = init_chat_model("gpt-4o-mini", model_provider="openai")

### Initialize OpenAI Embeddings
This cell sets up the embedding model used to convert text into numerical vector representations.
- It uses text-embedding-3-small, a lightweight model from OpenAI optimized for speed and cost.
- These embeddings are used to index and semantically retrieve relevant chunks of documents.
- The OpenAIEmbeddings class provides a simple wrapper for calling the embedding API behind the scenes.

This is a crucial component in a RAG pipeline, as it determines how well similar pieces of information are matched during retrieval.

In [None]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

### Set up Qdrant vector store (local, persistent)

This cell connects to a local Qdrant instance and creates a collection to persist and manage document embeddings:
- `QdrantClient(url="http://localhost:6333")` connects to a Qdrant server listening on the local machine. Because the data lives in the server rather than in the notebook process, it can be reused across notebook runs or from external APIs.
- A collection named "fin-docs" is created to store vectors with:
  - `vector_size = 1536`, which matches the dimensionality of OpenAI’s text-embedding-3-small model.
  - `Distance.COSINE` as the similarity metric, a common choice for comparing text embeddings because it depends on the direction of the vectors rather than their length.

This setup enables fast and efficient semantic search over embedded documents.

In [None]:
from qdrant_client import QdrantClient
from qdrant_client.http import models

vector_size = 1536 # https://www.pinecone.io/learn/openai-embeddings-v3/

client = QdrantClient(url="http://localhost:6333")

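# Create the collection with the specified vector size and distance metric,
# skipping creation if it already exists so the cell can be re-run safely
if not client.collection_exists("fin-docs"):
    client.create_collection(
        collection_name="fin-docs",
        vectors_config=models.VectorParams(size=vector_size, distance=models.Distance.COSINE),
    )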
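
To make the cosine metric concrete, here is a minimal pure-Python sketch of the computation. It is for illustration only: Qdrant computes similarity internally, and the three-dimensional toy vectors below are made up, not real embeddings.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimensionality")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


query = [1.0, 2.0, 0.0]
same_direction = [2.0, 4.0, 0.0]  # the query scaled by 2
orthogonal = [0.0, 0.0, 3.0]      # shares no component with the query

print(cosine_similarity(query, same_direction))  # close to 1: same direction
print(cosine_similarity(query, orthogonal))      # 0: unrelated directions
```

The dimensionality check mirrors why `vector_size` has to match the embedding model: vectors of different lengths cannot be compared.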

### Connect LangChain to Qdrant
This cell links the Qdrant vector store with LangChain, allowing it to store and retrieve document embeddings using the specified collection and embedding model.

In [None]:
from langchain_qdrant import QdrantVectorStore

vector_store = QdrantVectorStore(
    client=client,
    collection_name="fin-docs",
    embedding=embeddings
)

## Prepare data
### Asynchronously load PDF documents from a folder
We define and run a function to load all PDF files found in the data/ directory using LangChain’s PyPDFLoader.
- Each PDF is processed page by page, and every page is returned as a Document object.
- The use of ``alazy_load()`` yields pages lazily and asynchronously, so each page is handed over as soon as it is parsed. Note that the files themselves are still processed one after another.
- The result is a list of pages (``all_pages``) that will later be split, embedded, and indexed.

This step prepares the raw document data for semantic search.

In [None]:
import os
from langchain_community.document_loaders import PyPDFLoader

async def load_all_pdfs_from_folder(folder_path: str):
    pages = []
    for filename in os.listdir(folder_path):
        if filename.endswith(".pdf"):
            file_path = os.path.join(folder_path, filename)
            loader = PyPDFLoader(file_path)
            async for page in loader.alazy_load():
                pages.append(page)
    return pages

all_pages = await load_all_pdfs_from_folder("data")

print(f"Loaded {len(all_pages)} pages")
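
If overlapping the work across files is desirable, one option is to schedule one task per file with `asyncio.gather`. The sketch below shows only the pattern: `load_pages` is a hypothetical stand-in that sleeps instead of parsing a PDF, and the file names are placeholders. Whether this speeds up real PDF parsing depends on the loader and has not been measured here.

```python
import asyncio


async def load_pages(path: str) -> list[str]:
    """Hypothetical stand-in for a per-file loader; returns fake page texts."""
    await asyncio.sleep(0.01)  # simulates waiting on I/O
    return [f"{path}: page {i}" for i in range(2)]


async def load_all(paths: list[str]) -> list[str]:
    # Schedule one coroutine per file and wait for all of them together.
    # gather() returns results in the same order as the inputs.
    per_file = await asyncio.gather(*(load_pages(p) for p in paths))
    # Flatten the per-file lists into a single list of pages
    return [page for pages in per_file for page in pages]


async def main() -> list[str]:
    return await load_all(["a.pdf", "b.pdf"])


pages = asyncio.run(main())
print(pages)
```

Inside a notebook, where an event loop is already running, `pages = await load_all([...])` would replace the `asyncio.run` call.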

Let's check the object types and a sample of the content:

In [None]:
type(all_pages), type(all_pages[0])

In [None]:
print(all_pages[20].page_content[4000:5000])

In [None]:
print(f"Total characters: {len(all_pages[20].page_content)}")

### Split long pages into smaller, overlapping chunks

Each page in the loaded PDFs contains approximately 5,000 characters. Whole pages make poor retrieval units: a page can cover several topics, and sending many full pages as context inflates the prompt. To handle this, we split the pages into smaller, overlapping text chunks using RecursiveCharacterTextSplitter:
- Each chunk is up to 1,000 characters long with a 200-character overlap, preserving context across chunk boundaries.
- ``add_start_index=True`` tracks the position of each chunk within the original document for potential traceability or highlighting.

This chunking strategy is essential for a RAG pipeline, allowing efficient semantic retrieval and ensuring the model receives focused, context-rich inputs without exceeding token limits.

In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,  # chunk size (characters)
    chunk_overlap=200,  # chunk overlap (characters)
    add_start_index=True,  # track index in original document
)
all_splits = text_splitter.split_documents(all_pages)

print(f"Split PDF pages into {len(all_splits)} sub-documents.")
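
To see what overlap and start indices mean, here is a simplified sketch that cuts fixed-size character windows. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter prefers natural separators such as paragraph and line breaks); `sliding_chunks` is a hypothetical helper written only for this illustration.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[tuple[int, str]]:
    """Return (start_index, chunk) pairs from fixed-size windows that overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append((start, text[start:start + chunk_size]))
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
for start, chunk in sliding_chunks(sample, chunk_size=10, chunk_overlap=4):
    print(start, chunk)
# 0 abcdefghij
# 6 ghijklmnop
# 12 mnopqrstuv
# 18 stuvwxyz
```

Each chunk repeats the last four characters of the previous one, which is what keeps a sentence that straddles a boundary retrievable from either side.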

Store all the previously split document chunks in the Qdrant vector store. Each chunk is assigned a unique identifier, which is returned as a list. These IDs can be used later for referencing, tracing, or debugging. Indexing the content in this way enables efficient semantic retrieval when answering queries.

In [None]:
document_ids = vector_store.add_documents(documents=all_splits)

print(document_ids[:3])
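
Because the collection persists between runs, re-running the indexing cell may store the same chunks again under new IDs. One possible mitigation is to derive IDs deterministically from each chunk. The sketch below uses only the standard library; `chunk_id`, the namespace, and the sample values are hypothetical, and it assumes chunk metadata carries `source`, `page`, and `start_index`.

```python
import uuid

# Arbitrary namespace chosen for this sketch; any fixed UUID works
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "fin-docs")


def chunk_id(source: str, page: int, start_index: int, text: str) -> str:
    """Derive a stable UUID string from where a chunk came from and what it says."""
    return str(uuid.uuid5(NAMESPACE, f"{source}:{page}:{start_index}:{text}"))


# Hypothetical values for illustration only
first = chunk_id("data/example.pdf", 0, 0, "some chunk text")
again = chunk_id("data/example.pdf", 0, 0, "some chunk text")
other = chunk_id("data/example.pdf", 0, 800, "another chunk text")

print(first == again)  # True: same input, same ID
print(first == other)  # False: different chunk, different ID
```

Assuming the installed `langchain_qdrant` version accepts an `ids` argument in `add_documents`, a list of such IDs could be passed alongside `all_splits` so that re-indexing overwrites existing points instead of duplicating them.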

### Load and inspect a RAG prompt template from LangChain Hub

The code pulls a **ready-to-use RAG prompt** (rlm/rag-prompt) from the LangChain Hub. This prompt is designed to take in a retrieved context and a user question, format them appropriately, and prepare them for input to a language model.

Using hub.pull simplifies experimentation by providing a standardized prompt format, but:
- You can **fully customize this prompt** to match your domain, tone, or structure.
- Custom prompts are especially useful when targeting specific use cases or needing tighter control over model behavior.

To validate the structure, the prompt is invoked with dummy inputs and the resulting formatted message is printed. This helps ensure the prompt looks correct before plugging in real data. A premade prompt like this one is commonly used in this type of application.

In [None]:
from langchain import hub

prompt = hub.pull("rlm/rag-prompt")

example_messages = prompt.invoke(
    {"context": "(context goes here)", "question": "(question goes here)"}
).to_messages()

assert len(example_messages) == 1
print(example_messages[0].content)
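
As an illustration of customizing the prompt, here is a minimal sketch of a domain-specific template built as a plain Python string. The wording is hypothetical and is not the text of `rlm/rag-prompt`; only the two placeholders, `context` and `question`, are taken from the cell above.

```python
# Hypothetical prompt wording for a finance-focused assistant; adjust freely
CUSTOM_RAG_TEMPLATE = (
    "You are an assistant for questions about financial analysis.\n"
    "Answer using only the context below. If the context does not contain "
    "the answer, say that you don't know.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}\n"
    "Answer:"
)


def render_prompt(context: str, question: str) -> str:
    """Fill the two placeholders that the retrieval step is expected to supply."""
    return CUSTOM_RAG_TEMPLATE.format(context=context, question=question)


print(render_prompt("(context goes here)", "(question goes here)"))
```

To use such a template in the pipeline, the same string could be wrapped with `ChatPromptTemplate.from_template` from `langchain_core.prompts`, which yields an object that can be invoked like `prompt`.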

Let's prepare some questions:

In [None]:
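# "Which accounts does a company have to file?"
question1 = "Cuales son las cuentas que tiene que presentar una empresa?"
# "How is the break-even point calculated and what does it mean? What strategic decisions can be derived from it?"
question2 = "¿Cómo se calcula y qué significa el umbral de rentabilidad? ¿Qué decisiones estratégicas pueden derivarse de él?"
# "What is EBITDA and how is it calculated? Why is it important for assessing a company's profitability?"
question3 = "¿Qué es el EBITDA y cómo se calcula? ¿Por qué es importante para evaluar la rentabilidad de una empresa?"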

### Retrieve the most relevant document chunks for a given question

A semantic **similarity search** is performed against the Qdrant vector store using the user’s query. The top 5 most relevant chunks are returned, along with their cosine similarity scores.
- Each result is a tuple containing a document and its corresponding score.
- Higher scores indicate higher similarity, since cosine similarity is used.
- The snippet prints the first 200 characters of each retrieved chunk to preview the content.

This step helps validate whether the retriever is surfacing the most contextually relevant information for the question.

In [None]:
retrieved_docs = vector_store.similarity_search_with_score(question1, k=5)

for doc, score in retrieved_docs:
    print("Score:", score)
    print("Content:", doc.page_content[:200], "...\n") 
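
Since each result carries a score, weak matches can be dropped before they reach the prompt. The sketch below is one possible post-processing step, not something the notebook does: `Doc` is a minimal stand-in for a LangChain `Document`, the scores are invented, and the 0.6 threshold is an arbitrary assumption that would need tuning on real data.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Minimal stand-in for a LangChain Document (only the field used here)."""
    page_content: str


def filter_by_score(results: list[tuple[Doc, float]], min_score: float) -> list[tuple[Doc, float]]:
    """Keep (document, score) pairs whose score is at least min_score, best first."""
    kept = [(doc, score) for doc, score in results if score >= min_score]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Hypothetical results; the scores are invented for illustration
results = [
    (Doc("chunk A"), 0.42),
    (Doc("chunk B"), 0.81),
    (Doc("chunk C"), 0.67),
]
strong = filter_by_score(results, min_score=0.6)
print([doc.page_content for doc, _ in strong])  # ['chunk B', 'chunk C']
```

The function takes the same `(document, score)` tuple shape that `similarity_search_with_score` returns, so it could be applied to `retrieved_docs` directly.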

### Format context and generate answer using the language model

The **retrieved document chunks** are concatenated into a **single context string** and passed, along with the user’s question, to the prompt template. The resulting prompt is then sent to the **language model** for completion.
- This approach ensures the model receives only the most relevant context, rather than the entire document set.
- By combining retrieval and generation, the RAG pipeline produces answers grounded in the source material, which tends to reduce hallucinations.

In [None]:
docs_content = "\n\n".join(doc.page_content for doc, _ in retrieved_docs)
prompt_invocation = prompt.invoke({"question": question1, "context": docs_content})
answer = llm.invoke(prompt_invocation)

In [None]:
answer.content
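
A possible refinement of the context-formatting step is to label each chunk with where it came from, so the answer can be traced back to a file and page. This is a sketch under assumptions: `Doc` is a minimal stand-in for a LangChain `Document`, the sample chunks are made up, and the metadata is assumed to carry `source` and `page` keys.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Minimal stand-in for a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_context(docs: list[Doc]) -> str:
    """Join chunks into one string, labelling each with its source and page."""
    blocks = []
    for i, doc in enumerate(docs, start=1):
        # Fall back to placeholders when a metadata key is missing
        source = doc.metadata.get("source", "unknown")
        page = doc.metadata.get("page", "?")
        blocks.append(f"[{i}] {source}, page {page}\n{doc.page_content}")
    return "\n\n".join(blocks)


# Hypothetical chunks for illustration only
docs = [
    Doc("First chunk.", {"source": "data/a.pdf", "page": 3}),
    Doc("Second chunk.", {}),
]
context = format_context(docs)
print(context)
```

With the retrieval results from this notebook, the call would be `format_context([doc for doc, _ in retrieved_docs])`, and its output would take the place of `docs_content`.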