# RAG Recursive Retrieve

## Required API Keys

Before running this script, you **must** obtain API keys for the following services and configure them as environment variables:

1.  **`GEMINI_API_KEY`**: Your API key for Google Gemini.
2.  **`GROQ_API_KEY`**: Your API key for [Groq](https://groq.com/).
3.  **`LLAMA_CLOUD_API_KEY`**: Your API key for [Llama Cloud](https://cloud.llamaindex.ai/login).

**Configuration:**

The script uses `python-dotenv` to load these keys. Create a file named `.env` in the same directory as the script and add your keys like this:

```plaintext
GEMINI_API_KEY="YOUR_GEMINI_API_KEY_HERE"
GROQ_API_KEY="YOUR_GROQ_API_KEY_HERE"
LLAMA_CLOUD_API_KEY="YOUR_LLAMA_CLOUD_API_KEY_HERE"
```

In [5]:
from dotenv import load_dotenv
import os
from ingest_documents import convert_pdf

# Load environment variables from .env file
load_dotenv()


# Now try to get the key
gemini_api_key = os.getenv("GEMINI_API_KEY")

if not gemini_api_key:
    raise ValueError("Please set the GEMINI_API_KEY environment variable.")

if "GROQ_API_KEY" not in os.environ:
    raise ValueError("Please set the GROQ_API_KEY environment variable.")

    
if "LLAMA_CLOUD_API_KEY" not in os.environ:
    raise ValueError("Please set the LLAMA_CLOUD_API_KEY environment variable.")
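
The per-key checks above could also be collapsed into a single helper that reports every missing key at once. The sketch below is only an illustration: `REQUIRED_KEYS` and `missing_keys` are hypothetical names that do not appear in the notebook, and the example runs against a plain dictionary rather than the real environment.

```python
import os

# Hypothetical constant: the three keys this notebook expects
REQUIRED_KEYS = ("GEMINI_API_KEY", "GROQ_API_KEY", "LLAMA_CLOUD_API_KEY")


def missing_keys(required, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


# Example with an explicit mapping instead of the real environment
example_env = {"GEMINI_API_KEY": "abc", "GROQ_API_KEY": ""}
print(missing_keys(REQUIRED_KEYS, example_env))
```

Calling `missing_keys(REQUIRED_KEYS)` with no second argument checks `os.environ`, so it can run right after `load_dotenv()`.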


## Load LLMs from Groq
```
pip install torch torchvision torchaudio
pip install --upgrade jupyter ipywidgets
pip install llama-index-llms-groq
pip install llama-index-embeddings-huggingface
pip install llama-index-embeddings-instructor
```

In [6]:
from llama_index.llms.groq import Groq

llm = Groq(model="qwen-qwq-32b", temperature=0.3)

## Embedding Model

An embedding model converts text into numerical vectors (embeddings) that capture its semantic meaning.
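
To make "capturing semantic meaning" concrete, here is a minimal sketch of how two embeddings are usually compared with cosine similarity. The three-dimensional vectors are hand-written toy values, not output from `intfloat/multilingual-e5-base`, and `cosine_similarity` is a helper defined only for this illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-written toy vectors, not real model output
toy_embeddings = {
    "car": [0.9, 0.1, 0.0],
    "vehicle": [0.8, 0.2, 0.1],
    "banana": [0.0, 0.1, 0.9],
}

query_vector = toy_embeddings["car"]
for word, vector in toy_embeddings.items():
    print(word, round(cosine_similarity(query_vector, vector), 3))
```

Texts with related meaning should end up with vectors that point in similar directions, which is what a vector store relies on when it ranks chunks against a query.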

In [9]:
model_name = "intfloat/multilingual-e5-base"

In [10]:
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import Settings
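import torch

# Use the GPU when one is available; otherwise fall back to the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"

embed_model = HuggingFaceEmbedding(
    model_name=model_name,
    embed_batch_size=32,  # Lower this if you run out of memory
    device=device,
)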

Settings.llm = llm
Settings.embed_model = embed_model


## From Document

In [36]:
import asyncio
import nest_asyncio
nest_asyncio.apply()

from llama_cloud_services import LlamaParse
from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import HierarchicalNodeParser, get_leaf_nodes
from llama_index.core.retrievers import RecursiveRetriever

In [18]:
PDF_PATH = "./input_files/parsing_prube.pdf"
OUTPUT_DIR = "./output_files/"  # The converted Markdown file is saved here

markdown_document_path = convert_pdf(PDF_PATH, OUTPUT_DIR, gemini_api_key)



In [19]:
try:
    documents = SimpleDirectoryReader(input_files=[markdown_document_path]).load_data()
except Exception as e:
    print(f"Error loading Markdown file: {e}")
    raise  # Stop here: the following cells need `documents` to be defined


In [21]:
node_parser = HierarchicalNodeParser.from_defaults(
    chunk_sizes=[2048, 512, 128] # Adjust these based on your content!
                                 # Larger numbers for bigger parent chunks.
)

all_nodes = node_parser.get_nodes_from_documents(documents)
leaf_nodes = get_leaf_nodes(all_nodes) # These are the smallest child nodes



In [22]:
print(f"Total documents: {len(documents)}")
print(f"Total nodes created: {len(all_nodes)}")
print(f"Leaf nodes created: {len(leaf_nodes)}")

Total documents: 1
Total nodes created: 12
Leaf nodes created: 9
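
The relationship between total nodes and leaf nodes can be pictured with a simplified, character-based sketch of hierarchical chunking. This is an assumption-laden toy: `HierarchicalNodeParser` works on tokens and sentence boundaries, so its counts will differ, and `split_fixed`, `build_hierarchy`, and `leaves` are hypothetical helpers written only for this example.

```python
def split_fixed(text, size):
    """Cut text into consecutive pieces of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def build_hierarchy(text, chunk_sizes):
    """Split text level by level; each node records its level and parent id."""
    nodes = []
    frontier = [(None, text)]  # (parent id, text still to be split)
    for level, size in enumerate(chunk_sizes):
        next_frontier = []
        for parent_id, parent_text in frontier:
            for piece in split_fixed(parent_text, size):
                node_id = len(nodes)
                nodes.append(
                    {"id": node_id, "level": level, "parent": parent_id, "text": piece}
                )
                next_frontier.append((node_id, piece))
        frontier = next_frontier
    return nodes


def leaves(nodes):
    """Nodes that no other node names as its parent."""
    parent_ids = {node["parent"] for node in nodes}
    return [node for node in nodes if node["id"] not in parent_ids]


toy_nodes = build_hierarchy("x" * 100, chunk_sizes=[64, 16, 4])
print(f"Total nodes: {len(toy_nodes)}, leaf nodes: {len(leaves(toy_nodes))}")
```

Each level re-splits the chunks of the level above it, so only the chunks of the smallest size end up as leaves while the larger chunks remain available as parents.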


## Setup ChromaDB

A vector database is set up using **ChromaDB**.

### Embedding models from Hugging Face

*Windows:*
Make sure the Microsoft C++ Build Tools are installed. If they are not,
open PowerShell as administrator and run:
```
Set-ExecutionPolicy Bypass -Scope Process -Force; [System.Net.ServicePointManager]::SecurityProtocol = [System.Net.ServicePointManager]::SecurityProtocol -bor 3072; iex ((New-Object System.Net.WebClient).DownloadString('https://community.chocolatey.org/install.ps1'))

choco install visualstudio2022buildtools -y
```
Note: on Windows, configure the Visual Studio Build Tools with the Windows 10 SDK, MSVC v142 - VS 2019 C++ x64/x86 build tools, and C++ CMake Tools for Windows.
Then, in your environment's terminal:
```
pip install chromadb
pip install llama-index-vector-stores-chroma
```
### Available models tested
- BGE-m3
- E5-base


In [37]:
import chromadb
from llama_index.core import (
    VectorStoreIndex,
    StorageContext,
)
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.core.storage.docstore import SimpleDocumentStore

In [38]:
# create client
vector_db = chromadb.PersistentClient(path="./databases/prube_RecursiveRetrieve_cromadb_v1")
chroma_collection = vector_db.get_or_create_collection("prube")

## MongoDB docstore database

Download MongoDB:
https://www.mongodb.com/try/download/community

```
pip install llama-index-storage-docstore-mongodb

pip install llama-index-storage-index-store-mongodb
```

In [39]:
from llama_index.storage.docstore.mongodb import MongoDocumentStore
from pymongo import MongoClient  



database_name = "nexos_RecursiveRetrieve_v1"  # Database name
mongo_uri = f"mongodb://127.0.0.1:27017/{database_name}"  # Database name in URI

mongo_docstore = MongoDocumentStore.from_uri(
    uri=mongo_uri,
    db_name=database_name,
)


In [41]:
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)

storage_context = StorageContext.from_defaults(
    docstore=mongo_docstore, 
    vector_store=vector_store
)


In [45]:
storage_context.docstore.set_document_hash(documents[0].id_, documents[0].hash) # For the main doc
storage_context.docstore.add_documents(all_nodes, allow_update=True)

## Index Data 

* get_page_nodes:
  - Function that splits each document into "nodes" (chunks) using a separator (`\n---\n` in this case).
  - Each chunk becomes a TextNode that holds the text and the metadata of the original document.
* MarkdownElementNodeParser:
  - A parser specialized in splitting Markdown documents into structured elements (such as paragraphs, headings, and lists).
  - Uses a language model (LLM) to assist with parsing.
* get_nodes_from_documents:
  - Applies the parser to the documents to generate structured nodes.
* get_nodes_and_objects: processes the nodes and produces two kinds of output.
  - base_nodes: the main nodes, which hold the textual content split into smaller chunks.
  - objects: additional structured elements that may include metadata, relationships between nodes, or other information derived from parsing.
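
As a rough illustration of the `get_page_nodes` idea, the sketch below splits a Markdown string on the `\n---\n` separator and keeps the document metadata on every chunk. It is an assumed implementation: the notebook does not show the real function, and plain dictionaries stand in for `TextNode` so the example runs without LlamaIndex installed.

```python
def get_page_nodes(text, metadata, separator="\n---\n"):
    """Split text on `separator`; each non-empty piece becomes one node dict."""
    nodes = []
    for chunk in text.split(separator):
        chunk = chunk.strip()
        if chunk:
            # Copy the metadata so each node owns an independent dict
            nodes.append({"text": chunk, "metadata": dict(metadata)})
    return nodes


sample = "# Page 1\nIntro text\n---\n# Page 2\nMore text"
pages = get_page_nodes(sample, {"file_name": "example.md"})
for page in pages:
    print(page["metadata"]["file_name"], "->", page["text"].splitlines()[0])
```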



In [46]:
vector_index = VectorStoreIndex(
    leaf_nodes, # Index the leaf nodes for initial retrieval
    storage_context=storage_context
)

In [47]:
# Create a base retriever for the leaf nodes
base_retriever = vector_index.as_retriever(similarity_top_k=5) # Retrieve top 5 leaf nodes

# The node_dict should contain all nodes (parents and children)
# so the retriever can look up parent nodes by their IDs.
node_dict = {node.node_id: node for node in all_nodes}

recursive_retriever = RecursiveRetriever(
    'vector', # A name for the root retriever type
    retriever_dict={'vector': base_retriever},
    node_dict=node_dict,
    verbose=True
)
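
The reason for keeping every node in a dictionary keyed by ID is that a retrieved leaf can then be traced back to a larger parent chunk. The sketch below illustrates that lookup with plain dictionaries. It is a simplified picture of the "retrieve small, return big" idea and not `RecursiveRetriever`'s actual implementation; `expand_to_parents`, `parent_of`, and `node_text` are hypothetical names.

```python
def expand_to_parents(hit_ids, parent_of, node_text):
    """Replace each retrieved leaf with its parent (if any), without duplicates."""
    seen = set()
    results = []
    for leaf_id in hit_ids:
        # Fall back to the leaf itself when it has no recorded parent
        target = parent_of.get(leaf_id, leaf_id)
        if target not in seen:
            seen.add(target)
            results.append((target, node_text[target]))
    return results


# Toy data: two leaves share one parent, a third leaf has none
node_text = {
    "p1": "Full section on vehicle activity and capacity units.",
    "l1": "Activity in vehicle kilometers.",
    "l2": "Capacity in number of cars.",
    "l3": "Standalone note.",
}
parent_of = {"l1": "p1", "l2": "p1"}

for node_id, text in expand_to_parents(["l1", "l2", "l3"], parent_of, node_text):
    print(node_id, "->", text)
```

Deduplicating parents keeps the context passed to the LLM from repeating the same large chunk when several of its children match the query.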

In [None]:
from llama_index.core.postprocessor import SentenceTransformerRerank
from llama_index.core.memory import ChatMemoryBuffer
from llama_index.core.query_engine import RetrieverQueryEngine

# --- 6. Create Query Engine ---

# Create the reranker (requires the sentence-transformers package)
reranker = SentenceTransformerRerank(
    top_n=5,  # Number of nodes kept after reranking
    model="cross-encoder/ms-marco-MiniLM-L-6-v2"  # Cross-encoder model used to score query/node pairs
)

query_engine = RetrieverQueryEngine.from_args(
    recursive_retriever,
    node_postprocessors=[reranker]
)
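
Reranking is a second scoring pass over the retrieved nodes that keeps only the best `top_n`. The sketch below shows that shape with a deliberately crude word-overlap score. A cross-encoder such as the one configured above scores each query/text pair with a neural model instead, so `overlap_score` and `rerank` are hypothetical stand-ins for illustration only.

```python
def overlap_score(query, text):
    """Fraction of distinct query words that also appear in the text."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & text_words) / len(query_words)


def rerank(query, candidates, top_n):
    """Score every candidate against the query and keep the best `top_n`."""
    scored = [(overlap_score(query, text), text) for text in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # Stable: ties keep input order
    return scored[:top_n]


candidates = [
    "capacity in number of cars",
    "commodity unit passenger kilometers",
    "activity in vehicle kilometers",
]
for score, text in rerank("passenger kilometers", candidates, top_n=2):
    print(f"{score:.2f}  {text}")
```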



In [52]:
# --- 7. Query ---
query = "What does mean passenger VKM in the model? and why is raletd with the PKM with a convertion factor of 1.5"
response = query_engine.query(query)

print("\nResponse:")
print(response)

print("\nSource Nodes:")
for source_node in response.source_nodes:
    print("----------------------------------")
    print(f"Node ID: {source_node.node_id}, Score: {source_node.score}")
    print(source_node.text[:250] + "...")  # Print a snippet of the node text
    # Check for parent node information in metadata
    if 'parent_id' in source_node.node.metadata:
        print(f"Parent Node ID: {source_node.node.metadata['parent_id']}")
    print("----------------------------------")

Retrieving with query id None: What does mean passenger VKM in the model? and why is raletd with the PKM with a convertion factor of 1.5
Retrieving text node: The car is labeled as "Car CAR" in the center of a rectangle. On the left side of the rectangle is "DSL" and "Activity in 'vehicle kilometers' VKM". On the right side of the rectangle is "TX1" and "Commodity unit 'Passenger kilometers' PKM". Below the rectangle is "Capacity in '# of cars' NOC". Below the diagram are the following equations: "Definition of process activity PRC\_ACTUNT(r,p,cg,u) = {UTOPIA.
Retrieving text node: Since the capacity and activity units are different (mtoe for the capacity and PJ for the activity), the user has to supply the conversion factor from the energy unit embedded in the capacity unit to the activity unit. This is done by specifying the parameter **prc\_capact(r,p)**. In the example **prc\_capact** has the value 41.868.

Image /page/0/Figure/1 descr