# Playground for parsing and extracting information from PDFs - LlamaIndex


## Parsing PDF document using LlamaParse

LlamaParse is a highly accurate parser for complex documents like financial reports, research papers, and scanned PDFs. It handles images, tables, and charts with ease.

It transforms complex documents into text, markdown, or JSON formats.

In [5]:
%load_ext dotenv
%dotenv

The dotenv extension is already loaded. To reload it, use:
  %reload_ext dotenv


In [3]:
from llama_parse import LlamaParse
from llama_index.core import SimpleDirectoryReader
import os

faq_file = os.path.abspath("../pdfs/novo_regime.pdf")

parser = LlamaParse(result_type="text")

file_extractor = {".pdf": parser}
documents = SimpleDirectoryReader(
    input_files=[faq_file],
    file_extractor=file_extractor
    ).load_data()

documents

Started parsing the file under job_id 14157597-8996-4132-89c6-f66b161e7e96


[Document(id_='c9f1273f-6a50-4fe6-b394-89a82802085d', embedding=None, metadata={'file_path': '/Users/patriciaoliveira/repos/ai-assistant-self-employers-pt/pdfs/novo_regime.pdf', 'file_name': 'novo_regime.pdf', 'file_type': 'application/pdf', 'file_size': 486698, 'creation_date': '2025-07-28', 'last_modified_date': '2025-07-28'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={}, metadata_template='{key}: {value}', metadata_separator='\n', text_resource=MediaResource(embeddings=None, data=None, text='O limite mínimo da base de incidência contributiva dos trabalhadores independentes abrangidos pelo\nregime de contabilidade organizada é 1,5 vezes o valor do IAS.\nnovos\n\n\nPerguntas Frequentes\n\n\nNOVO REGIME DOS TRABALHADORES INDEPENDENTES\n\nINSTITUTO DA SEGURAN

## Indexing documents

This section indexes the parsed documents in two ways: with an in-memory vector index (`VectorStoreIndex`) and with a fully managed vector database from LlamaCloud (`LlamaCloudIndex`).

Both classes, `VectorStoreIndex` and `LlamaCloudIndex`, inherit behavior from the `BaseIndex` class, such as the ability to create a retriever, a query engine backed by an LLM, and a chat engine.

- `as_retriever()`: creates a retriever from the index.

- `as_query_engine()`: converts the index to a query engine.

- `as_chat_engine()`: converts the index to a chat engine.

### Using `VectorStoreIndex`


`VectorStoreIndex` uses an in-memory `SimpleVectorStore` that is initialized as part of the default storage context.

[Documentation](https://docs.llamaindex.ai/en/stable/module_guides/indexing/vector_store_index/)
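To build intuition for what an in-memory vector store does at query time, here is a minimal sketch using only the standard library. It is not how `SimpleVectorStore` is implemented: the `ToyVectorStore` class, the `embed` function, and the bag-of-words "embedding" are all hypothetical stand-ins for a real embedding model, and the sample sentences are taken from the outputs shown in this notebook.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words count vector (stand-in for a real model)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class ToyVectorStore:
    """Hypothetical in-memory store: keeps (text, vector) pairs in a list."""

    def __init__(self) -> None:
        self._entries: list[tuple[str, Counter]] = []

    def add(self, text: str) -> None:
        self._entries.append((text, embed(text)))

    def retrieve(self, query: str, top_k: int = 2) -> list[tuple[str, float]]:
        """Return the top_k stored texts ranked by similarity to the query."""
        query_vector = embed(query)
        scored = [
            (text, cosine_similarity(query_vector, vector))
            for text, vector in self._entries
        ]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:top_k]


store = ToyVectorStore()
store.add("O rendimento é determinado com base nos rendimentos obtidos nos três meses imediatamente anteriores ao mês da declaração trimestral.")
store.add("O limite mínimo da base de incidência contributiva dos trabalhadores independentes abrangidos pelo regime de contabilidade organizada é 1,5 vezes o valor do IAS.")
store.add("O primeiro enquadramento no regime dos trabalhadores independentes só produz efeitos no 1.° dia do 12.º mês posterior ao do início de atividade.")

results = store.retrieve("Como é determinado o rendimento?", top_k=1)
for text, score in results:
    print(f"{score:.3f}  {text}")
```

A real index replaces the word counts with dense embeddings from a model, but the ranking step (score every stored vector against the query vector, keep the best `top_k`) follows the same idea.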


In [6]:
from llama_index.core import VectorStoreIndex

# Create an index from the documents. The VectorStoreIndex converts the parsed documents into a vectorized format that can be queried.
# This index will allow us to perform semantic searches over the content of the documents.
local_index = VectorStoreIndex.from_documents(documents)

local_index

<llama_index.core.indices.vector_store.base.VectorStoreIndex at 0x121bda420>

In [8]:
# This retriever allows us to perform searches on the index using natural language queries.
retriever = local_index.as_retriever()

# It returns the nodes that match the query.
nodes = retriever.retrieve("Como é determinado o rendimento?")

nodes

[NodeWithScore(node=TextNode(id_='f43eb8d5-4206-46af-9062-e35b242561eb', embedding=None, metadata={'file_path': '/Users/patriciaoliveira/repos/ai-assistant-self-employers-pt/pdfs/novo_regime.pdf', 'file_name': 'novo_regime.pdf', 'file_type': 'application/pdf', 'file_size': 486698, 'creation_date': '2025-07-28', 'last_modified_date': '2025-07-28'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='179cd894-c767-4d7a-ace4-79eaf827ea51', node_type='4', metadata={'file_path': '/Users/patriciaoliveira/repos/ai-assistant-self-employers-pt/pdfs/novo_regime.pdf', 'file_name': 'novo_regime.pdf', 'file_type': 'application/pdf', 'file_size': 486698, 'creation_date': '2025-07-28', 'last_modified_date': '2025-07-28'}, has

In [12]:
# It creates a query engine from the index.
# This engine lets us ask natural language questions and returns an answer synthesized from the most relevant nodes.
query_engine = local_index.as_query_engine()

response = query_engine.query("Como é determinado o rendimento?")

response.response

'O rendimento é determinado com base nos rendimentos obtidos nos três meses imediatamente anteriores ao mês da declaração trimestral, de acordo com percentagens específicas para diferentes tipos de rendimentos. Para os trabalhadores independentes abrangidos pelo regime de contabilidade organizada, o rendimento relevante corresponde ao lucro tributável apurado no ano civil imediatamente anterior, declarado no Anexo SS da Declaração Modelo 3 do IRS.'

### Accessing prompts running in the background

In [20]:
# Inspect the prompt templates the query engine uses behind the scenes.
prompts = query_engine.get_prompts()

for k, q in prompts.items():
    print(f"Prompt for {k}:")
    print(q)
    print("\n")

Prompt for response_synthesizer:text_qa_template:
metadata={'prompt_type': <PromptType.QUESTION_ANSWER: 'text_qa'>} template_vars=['context_str', 'query_str'] kwargs={} output_parser=None template_var_mappings={} function_mappings={} default_template=PromptTemplate(metadata={'prompt_type': <PromptType.QUESTION_ANSWER: 'text_qa'>}, template_vars=['context_str', 'query_str'], kwargs={}, output_parser=None, template_var_mappings=None, function_mappings=None, template='Context information is below.\n---------------------\n{context_str}\n---------------------\nGiven the context information and not prior knowledge, answer the query.\nQuery: {query_str}\nAnswer: ') conditionals=[(<function is_chat_model at 0x11a58b880>, ChatPromptTemplate(metadata={'prompt_type': <PromptType.CUSTOM: 'custom'>}, template_vars=['context_str', 'query_str'], kwargs={}, output_parser=None, template_var_mappings=None, function_mappings=None, message_templates=[ChatMessage(role=<MessageRole.SYSTEM: 'system'>, additi
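The `text_qa_template` printed above has two placeholders, `context_str` and `query_str`. As a rough illustration of how such a template gets filled before it reaches the LLM, the sketch below uses plain `str.format` on the template text copied from that output. The `build_prompt` helper and the choice to join chunks with blank lines are assumptions made for this example, not the library's internals.

```python
# Template text copied from the `text_qa_template` output above.
TEXT_QA_TEMPLATE = (
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Given the context information and not prior knowledge, answer the query.\n"
    "Query: {query_str}\n"
    "Answer: "
)


def build_prompt(retrieved_chunks: list[str], query: str) -> str:
    """Join retrieved chunks into one context string and fill the template.

    Joining with a blank line is an assumption made for this sketch.
    """
    context_str = "\n\n".join(retrieved_chunks)
    return TEXT_QA_TEMPLATE.format(context_str=context_str, query_str=query)


filled_prompt = build_prompt(
    [
        "O rendimento é determinado com base nos rendimentos obtidos nos três "
        "meses imediatamente anteriores ao mês da declaração trimestral."
    ],
    "Como é determinado o rendimento?",
)
print(filled_prompt)
```

Seeing the filled prompt makes it clearer why the answer quality depends on retrieval: the model is asked to answer from the supplied context only.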

### Using `LlamaCloudIndex`

`LlamaCloudIndex` offers a fully managed option for storing vectorized data in a vector database. It is also possible to connect our own database instances to LlamaCloud.

[Documentation](https://docs.llamaindex.ai/en/stable/module_guides/indexing/llama_cloud_index/)

In [21]:
from llama_index.indices.managed.llama_cloud import LlamaCloudIndex

# It creates a new LlamaCloudIndex instance
cloud_index = LlamaCloudIndex.from_documents(
    documents=documents,
    name="faqs_index_experiment",
    verbose=True,
)

cloud_index

Created project 2a57e3bb-7adb-4789-9321-ea85439c63a8 with name Default
Created pipeline 0d236c59-b759-4c26-be62-42ab74a4e998 with name faqs_index_experiment
Loading documents
Document ingestion finished for ec7b04ca-17b8-4ee9-a78d-b3ecb32eac36
Document ingestion finished for 8f04f320-b81d-4bd7-931b-65660f05965d
Document ingestion finished for 7c3cf2eb-e6db-4ed1-ac6e-fac11075dea9
Document ingestion finished for 05bd1635-1bd4-4ae2-925e-a664d034c8c7
Document ingestion finished for d9e29c90-04f9-4307-ac77-72773a8052ed
Document ingestion finished for 2be2937c-ed3e-409a-9f35-cd63236f1ff9
Document ingestion finished for 1ccaa777-f6c0-49bd-a946-1da708b31ea1
Document ingestion finished for 0034614b-e3fb-4e0d-8b96-4e3a4c37ef80
Document ingestion finished for 7e17d74b-10de-4ea0-95af-7a55f259f363
Document ingestion finished for d8466a58-7b6c-4bf5-8743-f46c9ca2e81b
Document ingestion finished for 5fb64f91-dcea-427c-813a-1bf20632d1a1
Document ingestion finished for ab0cb148-6f6a-4062-9396-9ef497bc66

<llama_index.indices.managed.llama_cloud.base.LlamaCloudIndex at 0x121d7dbe0>

In [24]:
# This retriever allows us to perform searches on the index using natural language queries.
retriever = cloud_index.as_retriever()

# It returns the nodes that match the query.
nodes = retriever.retrieve("Como é determinado o rendimento?")

nodes

[NodeWithScore(node=TextNode(id_='3fe2ab08-6bc5-42f8-8315-ac09148fd0ce', embedding=None, metadata={'page_label': '14', 'file_name': 'novo_regime.pdf', 'pipeline_id': '0d236c59-b759-4c26-be62-42ab74a4e998', 'document_2_page_label': 1, 'document_id': '0807c990-b4b6-47be-8615-ca203276d3ff'}, excluded_embed_metadata_keys=['document_id', 'file_id', 'pipeline_file_id'], excluded_llm_metadata_keys=['document_id', 'file_id', 'pipeline_file_id'], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='0807c990-b4b6-47be-8615-ca203276d3ff', node_type='4', metadata={'page_label': '14', 'file_name': 'novo_regime.pdf', 'pipeline_id': '0d236c59-b759-4c26-be62-42ab74a4e998'}, hash='4df46699c6e864392a35061d6560d7bfbc52e1713b390cc399acff8f94055878')}, metadata_template='{key}: {value}', metadata_separator='\n', text='Perguntas Frequentes | Novo Regime dos Trabalhadores Independentes  \n \nISS, I.P  | DPC   Pág. 14/27 \n \ndeclaração de rendimen tos correspondente ao ano de 2023, ano em 

In [25]:
# It creates a query engine from the index.
# This engine lets us ask natural language questions and returns an answer synthesized from the most relevant nodes.
query_engine = cloud_index.as_query_engine()

response = query_engine.query("Como é determinado o rendimento?")

response.response

'O rendimento relevante do trabalhador independente é determinado com base nos rendimentos obtidos nos três meses imediatamente anteriores ao mês da Declaração Trimestral. Esse rendimento relevante pode variar dependendo do regime em que o trabalhador independente se encontra, sendo calculado com base nos rendimentos obtidos no ano civil imediatamente anterior para aqueles abrangidos pelo regime de contabilidade organizada.'

## Extracting information from PDF document using Structured Output

The LlamaIndex framework provides two ways to extract information from PDFs:
1. LlamaCloud - [LlamaExtract](https://docs.cloud.llamaindex.ai/llamaextract/getting_started/python)
2. Structured LLMs - [Structured Data Extraction](https://docs.llamaindex.ai/en/stable/understanding/extraction/structured_llms/)

### Common Dependencies

A Pydantic class that represents the structured output for a single FAQ entry (one question and its answer).

In [10]:
from pydantic import BaseModel, Field

class FAQs(BaseModel):
    question: str = Field(description="The question to be answered.")
    answer: str = Field(description="The answer to the question.")

### LlamaExtract

This tool transforms complex documents into well-typed structured data with:
- Customizable extraction agents and schemas
- Batch processing capabilities for scale
- Iterative schema development

**LlamaExtract** provides a simple API for extracting structured data from unstructured documents like PDFs, text files, and images.

In [11]:
from llama_cloud_services import LlamaExtract
import os

In [12]:
prompt = """
You are an AI that extracts structured data from text. The FAQ contains a list of questions and answers. There are a total of 55 questions, so make sure you extract all questions and answers.

Extraction Rules:
1. Every question starts with an incremental number followed by a dot and ends with a question mark.

2. The answer to that question starts with R, (letter "R" followed by a comma) and continues until the next question or the end of the document.

3. Return a JSON array where each element has:
- question: the full question text.
- answer: the full answer text.

Extract ONLY the questions and answers following these rules. Ignore any other text.
"""

extractor = LlamaExtract()
agents = [agent.name for agent in extractor.list_agents()]

# Create the agent only if it does not exist yet.
if "FAQs Seguranca Social" not in agents:
    agent = extractor.create_agent(
        name="FAQs Seguranca Social",
        data_schema=FAQs,
        # Python booleans are capitalized (False), unlike JSON's false.
        config={
            "extraction_target": "PER_PAGE",
            "extraction_mode": "MULTIMODAL",
            "multimodal_fast_mode": False,
            "system_prompt": prompt,
            "use_reasoning": False,
            "cite_sources": False,
            "confidence_scores": False,
            "chunk_mode": "PAGE",
            "high_resolution_mode": False,
            "invalidate_cache": False,
            "page_range": "3-27"
        }
    )

agent = extractor.get_agent("FAQs Seguranca Social")
pdf_path = os.path.abspath("../pdfs/novo_regime.pdf")
result = agent.extract(pdf_path)
print(result.data)

Uploading files: 100%|██████████| 1/1 [00:01<00:00,  1.18s/it]
Creating extraction jobs: 100%|██████████| 1/1 [00:00<00:00,  1.05it/s]
Extracting files: 100%|██████████| 1/1 [00:21<00:00, 21.40s/it]

[{'question': '1. Quem está abrangido pelo novo regime dos trabalhadores independentes?', 'answer': 'R, Estão abrangidos pelo novo regime dos trabalhadores independentes todos os trabalhadores que exerçam atividade por conta própria, incluindo empresários em nome individual, titulares de estabelecimentos individuais de responsabilidade limitada, profissionais liberais, entre outros, que não estejam expressamente excluídos por lei.'}, {'question': '1.  Quando produz efeito o enquadramento no regime?', 'answer': 'R: O primeiro enquadramento no regime dos trabalhadores independentes só produz efeitos no 1.° dia do 12.º mês posterior ao do início de atividade.\n\nExemplo:\n\nSe o trabalhador independente iniciou a sua atividade a 1 de março de 2024, o seu enquadramento produz efeitos a 1 de março de 2025.\n\nSe o trabalhador independente iniciar a sua atividade independente na Autoridade Tributária e Aduaneira a 10 de janeiro de 2024, o seu enquadramento produz efeitos a 1 de janeiro de 20




In [13]:
for item in result.data:
    print(f"Question: {item['question']}")
    print(f"Answer: {item['answer']}")
    print("-" * 40)

Question: 1. Quem está abrangido pelo novo regime dos trabalhadores independentes?
Answer: R, Estão abrangidos pelo novo regime dos trabalhadores independentes todos os trabalhadores que exerçam atividade por conta própria, incluindo empresários em nome individual, titulares de estabelecimentos individuais de responsabilidade limitada, profissionais liberais, entre outros, que não estejam expressamente excluídos por lei.
----------------------------------------
Question: 1.  Quando produz efeito o enquadramento no regime?
Answer: R: O primeiro enquadramento no regime dos trabalhadores independentes só produz efeitos no 1.° dia do 12.º mês posterior ao do início de atividade.

Exemplo:

Se o trabalhador independente iniciou a sua atividade a 1 de março de 2024, o seu enquadramento produz efeitos a 1 de março de 2025.

Se o trabalhador independente iniciar a sua atividade independente na Autoridade Tributária e Aduaneira a 10 de janeiro de 2024, o seu enquadramento produz efeitos a 1 de 
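The extracted records above still carry the source formatting: questions keep their leading number and answers keep the `R,` or `R:` marker. A small post-processing step can strip both. The sketch below is one possible approach; the `clean_faq` helper and the two regular expressions are hypothetical, and the sample records are shaped like `result.data` with the answers shortened for readability.

```python
import re

# Sample records shaped like `result.data` above (answers shortened for the example).
sample_data = [
    {
        "question": "1. Quem está abrangido pelo novo regime dos trabalhadores independentes?",
        "answer": "R, Estão abrangidos pelo novo regime dos trabalhadores independentes todos os trabalhadores que exerçam atividade por conta própria",
    },
    {
        "question": "1.  Quando produz efeito o enquadramento no regime?",
        "answer": "R: O primeiro enquadramento no regime dos trabalhadores independentes só produz efeitos no 1.° dia do 12.º mês posterior ao do início de atividade.",
    },
]

# Leading question number, e.g. "1. " or "12.  ".
QUESTION_NUMBER = re.compile(r"^\s*\d+\.\s*")
# Leading answer marker: the letter R followed by a comma or a colon.
ANSWER_MARKER = re.compile(r"^\s*R\s*[,:]\s*")


def clean_faq(item: dict) -> dict:
    """Return a copy of the record without the number and answer marker."""
    return {
        "question": QUESTION_NUMBER.sub("", item["question"]).strip(),
        "answer": ANSWER_MARKER.sub("", item["answer"]).strip(),
    }


cleaned = [clean_faq(item) for item in sample_data]
for faq in cleaned:
    print(f"Question: {faq['question']}")
    print(f"Answer: {faq['answer']}")
    print("-" * 40)
```

The answer pattern accepts both a comma and a colon because the output above contains both `R,` and `R:`, even though the prompt only mentions the comma form.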

### Structured Data Extraction

The highest-level way to extract structured data in LlamaIndex is to instantiate a Structured LLM. Structured output refers to the practice of guiding LLMs to produce output in a defined format, like JSON or XML, rather than free-form text.

This structured output is more easily parsed and utilized by other systems, making LLM-driven pipelines more deterministic and efficient.
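To make the "more easily parsed" point concrete, the sketch below shows what a downstream consumer might do with JSON output: parse it, check it against a schema, and fail loudly on malformed items. It uses a standard-library dataclass as a stand-in for the Pydantic `FAQs` model defined earlier; the `FAQ` class, the `parse_faqs` helper, and the hand-written `raw` string are all hypothetical and not actual model output.

```python
import json
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class FAQ:
    """Stdlib stand-in for the Pydantic `FAQs` model defined earlier."""

    question: str
    answer: str


def parse_faqs(raw_json: str) -> list[FAQ]:
    """Parse a JSON array and reject items that do not match the schema."""
    expected = {f.name for f in fields(FAQ)}
    faqs = []
    for index, item in enumerate(json.loads(raw_json)):
        # Each item must be an object with exactly the expected keys.
        if not isinstance(item, dict) or set(item) != expected:
            raise ValueError(f"Item {index} does not match the schema: {item!r}")
        # Every field in this schema is a string.
        if not all(isinstance(item[name], str) for name in expected):
            raise ValueError(f"Item {index} has a non-string field: {item!r}")
        faqs.append(FAQ(**item))
    return faqs


# Hand-written JSON for this example, not real LLM output.
raw = (
    '[{"question": "Quando produz efeito o enquadramento no regime?", '
    '"answer": "No 1.° dia do 12.º mês posterior ao do início de atividade."}]'
)
faqs = parse_faqs(raw)
print(faqs[0].question)
```

With Pydantic the validation step comes with the model class itself, which is one reason the notebook defines the schema that way.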

In [14]:
from llama_index.readers.file.docs.base import PDFReader
from pathlib import Path

pdf_reader = PDFReader()
# `pdf_path` was defined in the LlamaExtract cell above; wrap it in a Path for the reader.
documents = pdf_reader.load_data(file=Path(pdf_path))

documents[0].text.strip()

'O limite mínimo da base de incidência contributiva dos trabalhadores independentes abrangidos pelo \nregime de contabilidade organizada é 1,5 vezes o valor do IAS.   \nnovos   \nPerguntas Frequentes \n  \n \nNOVO REGIME DOS TRABALHADORES INDEPENDENTES \n \n \n \nINSTITUTO DA SEGURANÇA SOCIAL, I.P'

In [15]:
from llama_index.llms.openai import OpenAI
from llama_index.core.prompts import PromptTemplate

llm = OpenAI(model="gpt-4o")
# The inner quotes around R are single quotes so they do not end the string early.
prompt_template = PromptTemplate(
    "You are an AI that extracts structured data from text. "
    "The FAQ contains a list of questions and answers: '{document}'. "
    "There are a total of 55 questions, so make sure you extract all questions and answers. "
    "Extraction Rules: Every question starts with an incremental number followed by a dot and ends with a question mark. "
    "The answer to that question starts with R, (letter 'R' followed by a comma) "
    "and continues until the next question or the end of the document."
)

response = llm.structured_predict(
    FAQs,
    prompt_template,
    document=documents[0].text.strip()
)