# LlamaParse - Parsing Complex Documents

## Load and Parse PDFs


In [None]:
!pip install -qU llama-index llama-parse ragas

In [2]:
import os
import getpass

os.environ["LLAMA_CLOUD_API_KEY"] = getpass.getpass("LlamaParse API Key:")

LlamaParse API Key:··········


In [3]:
os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")

OpenAI API Key:··········


In [4]:
import nest_asyncio

nest_asyncio.apply()
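
Notebook kernels such as Jupyter and Colab already run an asyncio event loop, and `asyncio.run` refuses to start a second one inside it. `nest_asyncio.apply()` patches asyncio so that nested calls are allowed, which is presumably why it is applied before parsing. The sketch below reproduces the underlying error using only the standard library; `parse_job` and `notebook_cell` are hypothetical stand-ins, and the snippet is meant to be run as a plain script rather than inside a notebook.

```python
import asyncio


async def parse_job():
    """Stand-in for an async call a library might run internally (hypothetical)."""
    return "parsed"


async def notebook_cell():
    """Simulates code executing inside an already-running event loop."""
    coro = parse_job()
    try:
        asyncio.run(coro)  # nested run attempt inside a running loop
    except RuntimeError as exc:
        coro.close()  # the coroutine never ran; close it to avoid a warning
        return str(exc)
    return "no error"


error_message = asyncio.run(notebook_cell())
print(error_message)
```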

### LlamaParse Initialization


In [18]:
from llama_parse import LlamaParse

parser = LlamaParse(
    result_type="markdown",
    verbose=True,
    language="en",
    num_workers=2,
)

### Uploading Files

In [6]:
from google.colab import files

ships_manual = files.upload()

Saving Ships_3m_manual_tables.pdf to Ships_3m_manual_tables.pdf


### Parsing Our Files



In [7]:
documents = parser.load_data(["/content/Ships_3m_manual_tables.pdf"])

Parsing files: 100%|██████████| 1/1 [00:39<00:00, 39.48s/it]


In [None]:
documents

## LlamaIndex Recursive Query Engine

In [9]:
from llama_index.core import Settings
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core.node_parser import MarkdownElementNodeParser

Settings.llm = OpenAI(model="gpt-3.5-turbo")
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-large")

node_parser = MarkdownElementNodeParser(llm=OpenAI(model="gpt-3.5-turbo"), num_workers=8)

In [None]:
nodes = node_parser.get_nodes_from_documents(documents=[documents[0]])

In [11]:
base_nodes, objects = node_parser.get_nodes_and_objects(nodes)

In [12]:
from llama_index.core import VectorStoreIndex

recursive_index = VectorStoreIndex(nodes=base_nodes+objects)

### Recursive Query Engine

In [None]:
!pip install -qU llama-index-postprocessor-flag-embedding-reranker git+https://github.com/FlagOpen/FlagEmbedding.git

In [None]:
ships_manual_nodes = node_parser.get_nodes_from_documents(documents=[documents[0]])

In [15]:
ships_base_nodes, ships_objects = node_parser.get_nodes_and_objects(ships_manual_nodes)

In [16]:
ships_recursive_index = VectorStoreIndex(nodes=ships_base_nodes + ships_objects, include_content=True)

In [20]:
from llama_index.postprocessor.flag_embedding_reranker import FlagEmbeddingReranker

reranker = FlagEmbeddingReranker(
    top_n=5,
    model="BAAI/bge-reranker-large",
)

ships_recursive_query_engine = ships_recursive_index.as_query_engine(
    similarity_top_k=15,
    node_postprocessors=[reranker],
    verbose=True
)
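
The engine above first retrieves 15 candidates by embedding similarity (`similarity_top_k=15`) and then lets the reranker keep the best 5 (`top_n=5`). Below is a minimal sketch of that two-stage selection, using made-up scores instead of real embeddings or the BGE model. The function name, the tuple layout, and the smaller cut-offs are assumptions for illustration only.

```python
def retrieve_then_rerank(candidates, similarity_top_k, top_n):
    """Two-stage selection over (node_id, similarity, rerank_score) tuples.

    Stage 1 keeps the `similarity_top_k` best nodes by similarity.
    Stage 2 re-orders only those survivors by rerank score and keeps `top_n`.
    """
    retrieved = sorted(candidates, key=lambda c: c[1], reverse=True)[:similarity_top_k]
    reranked = sorted(retrieved, key=lambda c: c[2], reverse=True)[:top_n]
    return [node_id for node_id, _, _ in reranked]


# Hypothetical scores: (node_id, embedding similarity, reranker score)
toy_candidates = [
    ("a", 0.90, 0.20),
    ("b", 0.85, 0.95),
    ("c", 0.80, 0.10),
    ("d", 0.75, 0.70),
    ("e", 0.40, 0.99),
    ("f", 0.30, 0.05),
]

selected = retrieve_then_rerank(toy_candidates, similarity_top_k=4, top_n=2)
print(selected)
```

Node `e` has the highest reranker score but is never selected, because it does not survive the first stage. A larger `similarity_top_k` gives the reranker more to choose from, at the cost of more reranker calls.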


## RAGAS Evaluation

In [None]:
!pip install langchain pypdf

In [24]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader

# Load the same PDF with a plain-text loader to build the evaluation test set.
loader = PyPDFLoader("/content/Ships_3m_manual_tables.pdf")
docs = loader.load()

# Split pages into chunks of at most 700 characters with a 50-character overlap.
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=700,
    chunk_overlap=50,
)

# Note: this rebinds `documents`, which previously held the LlamaParse output.
documents = text_splitter.split_documents(docs)
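
To show what `chunk_size` and `chunk_overlap` mean, here is a simplified fixed-window sketch in plain Python. It is not how `RecursiveCharacterTextSplitter` works internally: that splitter prefers to break on separators such as paragraphs and newlines, so its chunks are usually shorter and end at more natural boundaries. The function name is an assumption for illustration.

```python
def sliding_window_chunks(text, chunk_size=700, chunk_overlap=50):
    """Cut `text` into windows of `chunk_size` characters.

    Each window starts `chunk_size - chunk_overlap` characters after the
    previous one, so neighbours share `chunk_overlap` characters.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


# Deterministic 1,500-character sample text.
sample_text = "".join(chr(97 + i % 26) for i in range(1500))
sample_chunks = sliding_window_chunks(sample_text)
print([len(chunk) for chunk in sample_chunks])
```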

In [26]:
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context

generator = TestsetGenerator.with_openai()

testset = generator.generate_with_langchain_docs(
    documents,
    test_size=10,
    distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25},
)



In [27]:
test_df = testset.to_pandas()

In [28]:
test_df

|  | question | contexts | ground_truth | evolution_type | metadata | episode_done |
|---|---|---|---|---|---|---|
| 0 | What is the purpose of the Screening Action in... | [NAVSEAINST 4790.8D \n ... | The purpose of the Screening Action in mainten... | simple | [{'source': '/content/Ships_3m_manual_tables.p... | True |
| 1 | What does the "W" in Whisky "W" Data Elements ... | [craft or boats without a UIC use the UIC of ... |  | simple | [{'source': '/content/Ships_3m_manual_tables.p... | True |
| 2 | What happens if the serial number exceeds 12 c... | [on a maintenance action; enter "VARIOUS". \n... | If the serial number exceeds 12 characters in ... | simple | [{'source': '/content/Ships_3m_manual_tables.p... | True |
| 3 | What is the purpose of Gyro Inspection and Rep... | [51B Outside Electrical Outside Electrical 3... | The purpose of Gyro Inspection and Repair is t... | simple | [{'source': '/content/Ships_3m_manual_tables.p... | True |
| 4 | What is the role of Fire Control Test and Repa... | [67B Electronics Calibration \nLab (FECL) Ele... | The role of Fire Control Test and Repair in th... | simple | [{'source': '/content/Ships_3m_manual_tables.p... | True |
| 5 | What happens to the operational capability of ... | [malfunction on the operational capability of ... |  | reasoning | [{'source': '/content/Ships_3m_manual_tables.p... | True |
| 6 | What is the significance of Ship's Force Man-H... | [components, assemblies, etc., according to a ... | The Ship's Force Man-Hours is the total man-ho... | multi_context | [{'source': '/content/Ships_3m_manual_tables.p... | True |
| 7 | How does a configuration change affect Ship's ... | [time cannot exceed “ 1” hour.” \n \n (13) ... | A configuration change affects Ship's Force's ... | multi_context | [{'source': '/content/Ships_3m_manual_tables.p... | True |
| 8 | Which fields indicate job association with key... | [(17) Special Requirements (2P) . \n \n (a) ... | The fields that indicate job association with ... | reasoning | [{'source': '/content/Ships_3m_manual_tables.p... | True |


In [29]:
test_questions = test_df["question"].values.tolist()
test_groundtruths = test_df["ground_truth"].values.tolist()
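
Two of the generated rows above (1 and 5) have an empty `ground_truth`. One option, sketched here under assumptions, is to drop such rows before querying and scoring, since metrics that compare against a reference have nothing to compare with. The helper names are hypothetical, and the sketch assumes a missing value arrives as `None`, `NaN`, or a blank string.

```python
import math


def is_missing(value):
    """True for None, NaN, or blank strings (assumed encodings of a missing value)."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return isinstance(value, str) and not value.strip()


def drop_missing_ground_truth(questions, ground_truths):
    """Keep only the (question, ground_truth) pairs that have a reference answer."""
    kept = [(q, gt) for q, gt in zip(questions, ground_truths) if not is_missing(gt)]
    return [q for q, _ in kept], [gt for _, gt in kept]


# Toy data standing in for test_questions / test_groundtruths.
sample_questions = ["q0", "q1", "q2", "q3"]
sample_ground_truths = ["answer 0", float("nan"), "", "answer 3"]

filtered_questions, filtered_ground_truths = drop_missing_ground_truth(
    sample_questions, sample_ground_truths
)
print(filtered_questions)
```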

In [None]:
answers = []
contexts = []
for question in test_questions:
    response = ships_recursive_query_engine.query(question)
    answers.append(response.response)
    # Keep only the top-ranked source node's text as the context for this
    # question, so `contexts` stays aligned one-to-one with `answers`
    # (an empty list is stored if nothing was retrieved).
    top_texts = [
        node_with_score.node.text
        for node_with_score in response.source_nodes[:1]
    ]
    contexts.append(top_texts)
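
The loop above stores only the first source node's text for each question. A variant that keeps every retrieved node, so the context metrics see all of the retrieved text, could look like the sketch below. It uses `SimpleNamespace` stand-ins instead of real query-engine responses; only the attribute names (`source_nodes`, `node`, `text`) come from the code above, and `collect_contexts` is a hypothetical helper.

```python
from types import SimpleNamespace


def collect_contexts(response):
    """Return the text of every source node attached to a response, in rank order."""
    return [node_with_score.node.text for node_with_score in response.source_nodes]


# Stand-in for a query-engine response with two retrieved nodes.
fake_response = SimpleNamespace(
    response="an answer",
    source_nodes=[
        SimpleNamespace(node=SimpleNamespace(text="chunk one"), score=0.9),
        SimpleNamespace(node=SimpleNamespace(text="chunk two"), score=0.7),
    ],
)
empty_response = SimpleNamespace(response="no hits", source_nodes=[])

contexts_for_question = collect_contexts(fake_response)
print(contexts_for_question)
```

Using all retrieved nodes would change the context scores relative to the single-context run reported in this notebook.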

In [31]:
from datasets import Dataset

response_dataset = Dataset.from_dict({
    "question" : test_questions,
    "answer" : answers,
    "contexts" : contexts,
    "ground_truth" : test_groundtruths
})

In [32]:
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    answer_correctness,
    context_recall,
    context_precision,
)

metrics = [
    faithfulness,
    answer_relevancy,
    context_recall,
    context_precision,
    answer_correctness,
]

In [33]:
results = evaluate(response_dataset, metrics)


In [34]:
results

{'faithfulness': 0.8889, 'answer_relevancy': 0.9589, 'context_recall': 0.6667, 'context_precision': 0.6667, 'answer_correctness': 0.7768}

In [35]:
results_df = results.to_pandas()
results_df

|  | question | answer | contexts | ground_truth | faithfulness | answer_relevancy | context_recall | context_precision | answer_correctness |
|---|---|---|---|---|---|---|---|---|---|
| 0 | What is the purpose of the Screening Action in... | The purpose of the Screening Action in mainten... | [NOTE: Codes "6" through "9" may be locally as... | The purpose of the Screening Action in mainten... | 0.666667 | 1.0 | 1.0 | 1.0 | 0.617554 |
| 1 | What does the "W" in Whisky "W" Data Elements ... | The "W" in Whisky "W" Data Elements stands for... | [Table C-21 discovered. (2) When Discovered Da... |  | 1.0 | 1.0 | 0.0 | 0.0 | 0.933887 |
| 2 | What happens if the serial number exceeds 12 c... | If the serial number exceeds 12 characters in ... | [g. Golf “G” Data Elements. Table C-9 (Not cur... | If the serial number exceeds 12 characters in ... | 1.0 | 0.95582 | 1.0 | 1.0 | 0.999814 |
| 3 | What is the purpose of Gyro Inspection and Rep... | The purpose of Gyro Inspection and Repair is t... | [This table provides a list of repair shops an... | The purpose of Gyro Inspection and Repair is t... | 1.0 | 0.999999 | 0.0 | 0.0 | 0.614346 |
| 4 | What is the role of Fire Control Test and Repa... | Fire Control Test and Repair is responsible fo... | [This table provides information about differe... | The role of Fire Control Test and Repair in th... | 0.333333 | 0.91292 | 1.0 | 1.0 | 0.619022 |
| 5 | What happens to the operational capability of ... | The operational capability of equipment or sys... | [g. Golf “G” Data Elements. Table C-9 (Not cur... |  | 1.0 | 0.90268 | 0.0 | 0.0 | 0.928372 |
| 6 | What is the significance of Ship's Force Man-H... | Ship's Force Man-Hours in maintenance represen... | [* This screening code disapproves the accompl... | The Ship's Force Man-Hours is the total man-ho... | 1.0 | 0.958326 | 1.0 | 1.0 | 0.741943 |
| 7 | How does a configuration change affect Ship's ... | A configuration change affects Ship's Force's ... | [* This screening code disapproves the accompl... | A configuration change affects Ship's Force's ... | 1.0 | 0.97627 | 1.0 | 1.0 | 0.540897 |
| 8 | Which fields indicate job association with key... | Fields (a) Key Event, (b) Special Interest, an... | [Ship’s Force Man-Hours Remaining (S/F MHRS. R... | The fields that indicate job association with ... | 1.0 | 0.923679 | 1.0 | 1.0 | 0.995086 |
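
The aggregate scores printed by `results` are consistent with an unweighted mean of the per-question values in this table. The sketch below recomputes them in plain Python from the numbers shown above. That the aggregate is a simple mean is an assumption, although the figures match to four decimal places.

```python
from statistics import mean

# Per-question scores copied from the results table (rows 0 to 8).
per_question = {
    "faithfulness": [0.666667, 1.0, 1.0, 1.0, 0.333333, 1.0, 1.0, 1.0, 1.0],
    "answer_relevancy": [1.0, 1.0, 0.95582, 0.999999, 0.91292,
                         0.90268, 0.958326, 0.97627, 0.923679],
    "context_recall": [1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0],
    "context_precision": [1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0],
    "answer_correctness": [0.617554, 0.933887, 0.999814, 0.614346, 0.619022,
                           0.928372, 0.741943, 0.540897, 0.995086],
}

# Unweighted mean per metric, rounded to four decimals.
aggregate = {name: round(mean(scores), 4) for name, scores in per_question.items()}
print(aggregate)
```

Rows 1, 3, and 5 score 0.0 on both context metrics. Rows 1 and 5 are the ones with an empty `ground_truth`, so part of the gap in `context_recall` and `context_precision` may reflect the test set rather than retrieval quality.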
