# PDF Parsing with LlamaParse

### Method 1: 

- https://www.youtube.com/watch?v=7DJzHncUlpI

- https://colab.research.google.com/drive/18KB9yXxDUeQGrEZEP1eCrXQ0dNB-Oazm?usp=sharing

- https://docs.llamaindex.ai/en/latest/examples/cookbooks/llama3_cookbook_ollama_replicate/

- https://docs.cloud.llamaindex.ai/

Set up the account here:

- https://cloud.llamaindex.ai/

- Install LlamaParse:

%pip install llama-parse

In [1]:
import sys, os
sys.path.append(os.path.abspath(os.path.join('..', 'secret')))
from secret_info import llama_parser
# The video says this step is not needed outside Colab, but without it I got this error:
#       RuntimeError: The event loop is already running. Add `import nest_asyncio; nest_asyncio.apply()` to your code to fix this issue.
# The GitHub instructions include this step:
# https://github.com/run-llama/llama_parse
import nest_asyncio
nest_asyncio.apply()
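
The LlamaParse README notes that the API key can also be set in the environment as `LLAMA_CLOUD_API_KEY`. As an alternative to importing it from a local `secret_info` module, a minimal sketch could read it from there. The helper name `load_api_key` is hypothetical and not part of any library.

```python
import os


def load_api_key(env_var="LLAMA_CLOUD_API_KEY"):
    """Return the API key stored in `env_var`, or raise if it is missing.

    `load_api_key` is an illustrative helper, not a LlamaParse function.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return key
```

The returned value could then be passed as `api_key=` in the cells that follow, in place of `llama_parser`.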

In [2]:
from llama_parse import LlamaParse

document = LlamaParse(api_key=llama_parser, result_type="markdown").load_data("BillExample.pdf")
print(document[0].text[:1000])
file_name = "paper.md"
with open(file_name, 'w') as file:
    file.write(document[0].text)

Started parsing the file under job_id 249c7aee-41c4-4af4-89c5-100bafec4799
# TOYOTA

# TOYOTA (NIGERIA) LIMITED

CFQR+3R9 Ojuari Scheme

Lekki Penninsulla Il, Lekki, Lagos

01 227 2250

# VEHICLE SALES INVOICE

|Bill To|Invoice|
|---|---|
|Invoice Payer:|2149030397|
|Payer Name:|Stanbic IBTC Bank PLC|
|Address:|Plot 1712, Idejo Street Victoria Island; Lagos State|
|Telephone:|08106646964|

# Vehicle Details

|Make:|Toyota|Year Model:|2012|
|---|---|---|---|
|Model:|Camry|Weight:|4630LB|
|VIN:|4TIBFIFK4CU061902|CITR:|218 FBOO|

# Vehicle Cost

|Description:|Toyota Camry 2012|Qty|Price|Total|
|---|---|---|---|---|
| |4TIBFIFK4CU061902| |3,893,600.00|N3,893,600.00|
| |5% VAT| |194,680.00|N194,680.00|
| |Total Amount in Words:|FOUR MILLION; EIGHTY-EIGHT THOUSAND, TWO HUNDRED AND EIGHTY NAIRA ONLY|FOUR MILLION; EIGHTY-EIGHT THOUSAND, TWO HUNDRED AND EIGHTY NAIRA ONLY|FOUR MILLION; EIGHTY-EIGHT THOUSAND, TWO HUNDRED AND EIGHTY NAIRA ONLY|

Note

Received the above Vehicle/Unit

Thank you for
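
The Markdown output above contains pipe tables. To work with those values in Python, one option is to split the table rows into lists of cells. The sketch below uses only the standard library; the function name `parse_markdown_tables` is hypothetical, and it assumes simple tables with no escaped `|` characters inside cells.

```python
def parse_markdown_tables(markdown_text):
    """Return a list of tables; each table is a list of rows (lists of cell strings)."""
    tables, current = [], []
    for line in markdown_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("|") and stripped.endswith("|") and len(stripped) > 1:
            cells = [cell.strip() for cell in stripped[1:-1].split("|")]
            # Skip separator rows such as |---|---|
            if all(cell and set(cell) <= set("-:") for cell in cells):
                continue
            current.append(cells)
        elif current:
            # A non-table line closes the table collected so far
            tables.append(current)
            current = []
    if current:
        tables.append(current)
    return tables
```

For example, `parse_markdown_tables(document[0].text)` would return one entry per table found in the parsed invoice.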

In [3]:
documents_with_instruction = LlamaParse(api_key=llama_parser,
    result_type="markdown",
    parsing_instruction="""
    What are the 5 most important points of the article. Present in bullet points. 
    """
    ).load_data("CoralesCuba.pdf")

Started parsing the file under job_id dd37d817-8369-4f66-8122-0865967ee1fb


In [4]:
print(documents_with_instruction[0].text[:1000])

# Article Summary

# Article Summary: AquaDacs

# Salud de los corales y su investigación en el Caribe y en Cuba

- Authors: Aguilera-Pérez, Gabriela C.; González-Díaz, Patricia
- Rights: Attribution-NonCommercial-NoDerivatives 4.0 International
- Download date: 04/07/2024 18:48:07
- Item License: Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International
- Link to Item
---
# Article Summary

# 5 Most Important Points of the Article:

- The health of corals in the Caribbean and Cuba is crucial for biodiversity and the region's economy.
- Coral reefs are facing threats from climate change, pollution, overfishing, and ocean acidification.
- The Caribbean is a hotspot for coral diseases with rapid emergence of new syndromes.
- Cuba, like other Caribbean regions, is also affected by coral diseases, but the impact is relatively low compared to other areas.
- Consistent monitoring of coral reefs is essential to understand their health status, manage them sustainably, and prev

____ 

### Method 2: 

This is essentially the same as Method 1: the same LlamaParse parser is used, but here it is passed to SimpleDirectoryReader as the extractor for `.pdf` files.

- https://docs.cloud.llamaindex.ai/llamaparse/getting_started/python


In [5]:
from llama_parse import LlamaParse
from llama_index.core import SimpleDirectoryReader

# set up parser
parser = LlamaParse(api_key=llama_parser,
    result_type="markdown"  # "markdown" and "text" are available
)
# use SimpleDirectoryReader to parse our file
file_extractor = {".pdf": parser}
document2 = SimpleDirectoryReader(input_files=['MyLatestPaper.pdf'], file_extractor=file_extractor).load_data()

file_name = "paper2.md"
with open(file_name, 'w') as file:
    file.write(document2[0].text)

Started parsing the file under job_id 0de2cf8e-557d-4708-8627-e8c0b8b57db4
..
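
The cells above write only the first returned document (`[0]`) to disk. If `load_data` returns more than one document, a small helper can join all of them before saving. This is a sketch under the assumption that each returned object exposes a `.text` attribute, as used above; `save_documents_as_markdown` and the separator are illustrative choices.

```python
from pathlib import Path


def save_documents_as_markdown(documents, out_path, separator="\n\n---\n\n"):
    """Write the text of every document to `out_path`; return how many were written."""
    texts = [doc.text for doc in documents]
    # UTF-8 keeps accented characters (e.g. Spanish text) intact on any platform
    Path(out_path).write_text(separator.join(texts), encoding="utf-8")
    return len(texts)
```

For example, `save_documents_as_markdown(document2, "paper2.md")` would replace the `open(...)` / `write(...)` pair.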

In [6]:
print(document2[0].text)

Downloaded from https://royalsocietypublishing.org/ on 19 July 2023

royalsocietypublishing.org/journal/rspb

Research

Cite this article: Turnham KE et al. 2023 High physiological function for corals with thermally tolerant, host-adapted symbionts. Proc. R. Soc. B 290: 20231021. https://doi.org/10.1098/rspb.2023.1021

Received: 09 May 2023

Accepted: 23 June 2023

Subject Category: Ecology

Subject Areas: ecology, ecosystems, evolution

Keywords: functional ecology, mutualism, Pocillopora, thermal tolerance, vertical symbiont transmission

Author for correspondence:

Kira E. Turnham

e-mail: keturnham@gmail.com

# High physiological function for corals with thermally tolerant, host-adapted symbionts

Kira E. Turnham1, Matthew D. Aschaffenburg 2, D. Tye Pettay 3, David A. Paz-García4, Héctor Reyes-Bonilla5, Jorge Pinzón 1, Ellie Timmins 1, Robin T. Smith6, Michael P. McGinley 2, Mark E. Warner 2 and Todd C. LaJeunesse1

Author Affiliations1Department of Biology, The Pennsylvania State 

____

### Method 3:

From the instructions in the GitHub repository:

- https://github.com/run-llama/llama_parse

In [7]:
import nest_asyncio
nest_asyncio.apply()
from llama_parse import LlamaParse

parser = LlamaParse(
    api_key=llama_parser,  # can also be set in your env as LLAMA_CLOUD_API_KEY
    result_type="markdown",  # "markdown" and "text" are available
    num_workers=4,  # if multiple files passed, split in `num_workers` API calls
    verbose=True,
    language="en",  # Optionally you can define a language, default=en
)
# sync
documents = parser.load_data("MyLatestPaper.pdf")

Started parsing the file under job_id 4a1bf7f4-c209-4f11-93a1-9f06442c7ae0


In [8]:
documents

[Document(id_='0876137f-1f49-4086-9c79-d7066d0d9a24', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='Downloaded from https://royalsocietypublishing.org/ on 19 July 2023\n\nroyalsocietypublishing.org/journal/rspb\n\nResearch\n\nCite this article: Turnham KE et al. 2023 High physiological function for corals with thermally tolerant, host-adapted symbionts. Proc. R. Soc. B 290: 20231021. https://doi.org/10.1098/rspb.2023.1021\n\nReceived: 09 May 2023\n\nAccepted: 23 June 2023\n\nSubject Category: Ecology\n\nSubject Areas: ecology, ecosystems, evolution\n\nKeywords: functional ecology, mutualism, Pocillopora, thermal tolerance, vertical symbiont transmission\n\nAuthor for correspondence:\n\nKira E. Turnham\n\ne-mail: keturnham@gmail.com\n\n# High physiological function for corals with thermally tolerant, host-adapted symbionts\n\nKira E. Turnham1, Matthew D. Aschaffenburg 2, D. Tye Pettay 3, David A. Paz-García4, Héctor 

In [10]:
parsersp = LlamaParse(
    api_key=llama_parser,  # can also be set in your env as LLAMA_CLOUD_API_KEY
    result_type="markdown",  # "markdown" and "text" are available
    num_workers=4,  # if multiple files passed, split in `num_workers` API calls
    verbose=True,
    language="es",  # Optionally you can define a language, default=en
)
# sync
documento = parsersp.load_data("CoralesCuba.pdf")

Started parsing the file under job_id 7cc4d0df-14c2-4775-bc2c-959285bea61e
....

In [11]:
documento 

[Document(id_='4b3a64c3-11d6-4b1e-ab74-6b85b1b0dcde', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='# AquaDacs\n\n|Item Type|Journal Contribution|\n|---|---|\n|Authors|Aguilera-Pérez, Gabriela C.; González-Díaz, Patricia|\n|Rights|Attribution-NonCommercial-NoDerivatives 4.0 International|\n|Download date|04/07/2024 18:48:07|\n|Item License|http://creativecommons.org/licenses/by-nc-nd/4.0/|\n|Link to Item|http://hdl.handle.net/1834/43097|\n---\n# Centro de Investigaciones Marinas Universidad de La Habana\n\nREVISIÓN BIBLIOGRÁFICA\n\nSalud de los corales y su investigación en el Caribe y en Cuba Coral health and its research in the Caribbean and Cuba Gabriela C. Aguilera-Pérez Patricia González-Díaz\n\n1 Centro de Investigaciones Marinas de la Universidad de La Habana (CIM-UH). Calle 16 No. 114. Playa. CP 11300. Ciudad Habana. Cuba.\n\nResumen\n\nLa salud de los corales es un tema de gran importancia en el Caribe y en

_____ 

### Method 4: 

Good for extracting tables.

- https://www.llamaindex.ai/blog/mastering-pdfs-extracting-sections-headings-paragraphs-and-tables-with-cutting-edge-parser-faea18870125

- https://github.com/nlmatics/llmsherpa

To install llmsherpa:

!pip install llmsherpa



In [12]:
from llmsherpa.readers import LayoutPDFReader

llmsherpa_api_url = "https://readers.llmsherpa.com/api/document/developer/parseDocument?renderFormat=all"
pdf_path = "CoralesCuba.pdf" 
pdf_reader = LayoutPDFReader(llmsherpa_api_url)
doc = pdf_reader.read_pdf(pdf_path)
file_name = "paper3.md"
with open(file_name, 'w') as file:
    file.write(doc.to_text())

MaxRetryError: HTTPSConnectionPool(host='readers.llmsherpa.com', port=443): Max retries exceeded with url: /api/document/developer/parseDocument?renderFormat=all (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x1495ef860>, 'Connection to readers.llmsherpa.com timed out. (connect timeout=None)'))
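
The `MaxRetryError` above is a connection timeout, so `doc` is never created and the following cells cannot run. One option is to retry the call a few times before giving up. The sketch below is generic: `call_with_retries` is a hypothetical helper, and catching `Exception` by default is an assumption chosen for simplicity rather than something llmsherpa prescribes.

```python
import time


def call_with_retries(func, *args, attempts=3, delay_seconds=2.0, retry_on=(Exception,)):
    """Call `func(*args)`, retrying on failure; raise RuntimeError after `attempts` tries."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return func(*args)
        except retry_on as error:
            last_error = error
            if attempt < attempts:
                # Wait before the next try; no sleep after the final failure
                time.sleep(delay_seconds)
    raise RuntimeError(f"Call failed after {attempts} attempts") from last_error
```

In this notebook it could be used as `doc = call_with_retries(pdf_reader.read_pdf, pdf_path)`. If the endpoint is unreachable altogether, retries will not help and the final `RuntimeError` carries the original error as its cause.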

In [None]:
print(doc.to_text()[:1000])

In [None]:
from IPython.display import HTML
HTML(doc.tables()[6].to_html())

In [None]:
# Show the introduction only 
HTML(doc.sections()[20].to_html(include_children=True, recurse=True))
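
The hard-coded index `20` depends on how this particular PDF was split into sections. A more robust option may be to look the section up by its title. The sketch below assumes each section object has a `title` attribute, as in the llmsherpa README examples; `find_section` is a hypothetical helper.

```python
def find_section(sections, keyword):
    """Return the first section whose title contains `keyword` (case-insensitive), else None."""
    needle = keyword.lower()
    for section in sections:
        # Treat a missing or empty title as an empty string
        title = getattr(section, "title", "") or ""
        if needle in title.lower():
            return section
    return None
```

For example, `find_section(doc.sections(), "introducción")` could replace `doc.sections()[20]`, provided the heading in the PDF contains that word.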


In [None]:
# Ask questions about the document
from langchain_community.llms import Ollama

# Create the model client (requires a local Ollama server with the llama3 model available)
llm3 = Ollama(model="llama3")
sect = doc.sections()[20].to_text(include_children=True, recurse=True)
answer = llm3.invoke(f"Given {sect}, what is the main conclusion of this document?")

In [None]:
print(answer)

In [None]:
sect = doc.to_text()
llm3.invoke(f"Given {sect}, what is the main point of the introduction?")

In [None]:
# Using the document in Spanish; pass the parsed text rather than the list of Document objects
llm3.invoke(f"Given {documento[0].text}, what is the main conclusion of this document?")
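
The prompts above interpolate the whole document text into one string, which can exceed what a local model handles well. A minimal sketch for building the prompt with a capped context is shown below. `build_prompt` is a hypothetical helper, and the default of 8000 characters is an arbitrary assumption, not a limit documented for llama3.

```python
def build_prompt(context, question, max_chars=8000):
    """Build a prompt from `context` and `question`, truncating the context to `max_chars`."""
    trimmed = context[:max_chars]
    return (
        "Use the following document text to answer the question.\n\n"
        f"Document:\n{trimmed}\n\n"
        f"Question: {question}"
    )
```

It could be used as `llm3.invoke(build_prompt(doc.to_text(), "What is the main conclusion of this document?"))`. Simple truncation drops everything past the cutoff, so questions about later sections would need a different slice of the text.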