#### PDF Parser

- [LayoutPDFReader for "Context-aware" chunking](https://blog.llamaindex.ai/mastering-pdfs-extracting-sections-headings-paragraphs-and-tables-with-cutting-edge-parser-faea18870125)
    - Identifying sections and subsections, along with their respective hierarchy levels.
    - Merging lines into coherent paragraphs.
    - Establishing connections between sections and paragraphs.
    - Recognizing tables and associating them with their corresponding sections.
    - Handling lists and nested list structures with precision.

In [3]:
from llmsherpa.readers import LayoutPDFReader
from IPython.display import display, HTML


In [4]:
llmsherpa_api_url = "https://readers.llmsherpa.com/api/document/developer/parseDocument?renderFormat=all"
pdf_url = '/root/workspace/data/DOCs/PDF/USA_2022.pdf'  # a local file path; a URL to a PDF is also allowed
pdf_reader = LayoutPDFReader(llmsherpa_api_url)
doc = pdf_reader.read_pdf(pdf_url)

In [55]:
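## export to txt to check consistency
def to_txt(strings, out_path):
    # utf-8 keeps non-ASCII characters from the report intact regardless of the platform default
    with open(out_path, 'w', encoding='utf-8') as file:
        for string in strings:
            file.write(string + '\n')
    print("export to {}".format(out_path))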

In [56]:
def extract_chunks(chunks):
    contents = []
    for idx,c in enumerate(chunks):
        contents.append("{} : {}".format(idx,c.to_context_text()))  ## for qa purpose, often need to use to_context_text
    return contents 

In [57]:
chunks = doc.chunks()
contents = extract_chunks(chunks)
print(contents[:10])
to_txt(contents,'/root/workspace/data/DOCs/PDF/chunks.txt')

['0 : UNITED STATES > July 2022\n2022 ARTICLE IV CONSULTATION—PRESS RELEASE; STAFF REPORT; AND STATEMENT BY THE EXECUTIVE DIRECTOR FOR THE UNITED STATES', '1 : UNITED STATES > July 2022\nUnder Article IV of the IMF’s Articles of Agreement, the IMF holds bilateral discussions with members, usually every year.\nIn the context of the 2022 Article IV consultation with the United States, the following documents have been released and are included in this package:\n• A Press Release summarizing the views of the Executive Board as expressed during its July 11, 2022 consideration of the staff report that concluded the Article IV consultation with the United States.\n• The Staff Report prepared by a staff team of the IMF for the Executive Board’s consideration on July 11, 2022, following discussions that ended on June 15, 2022, with the officials of the United States on economic developments and policies.\nBased on information available at the time of these discussions, the staff report was com
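The chunk texts contain embedded newlines, so in a plain `.txt` export one chunk can span several lines, which makes it harder to check chunks one by one. A minimal sketch of a JSON Lines export that keeps one chunk per line; `to_jsonl` and `read_jsonl` are hypothetical helper names, and the sample strings are made up for illustration:

```python
import json
import os
import tempfile


def to_jsonl(strings, out_path):
    """Write one JSON record per line; newlines inside a chunk are escaped."""
    with open(out_path, "w", encoding="utf-8") as f:
        for idx, text in enumerate(strings):
            record = {"idx": idx, "text": text}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read the records back, one per line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]


# Illustrative input only (not real parser output)
sample = ["UNITED STATES > July 2022\nfirst line\nsecond line", "plain chunk"]
path = os.path.join(tempfile.mkdtemp(), "chunks.jsonl")
to_jsonl(sample, path)
records = read_jsonl(path)
```

With the real document, the input would be something like `[c.to_context_text() for c in doc.chunks()]`.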

In [74]:

def extract_paragraphs(sections):
    # For a Section block, `sentences` holds the text of the section header itself
    contents = []
    for section_counter, s in enumerate(sections):
        contents.append("Section:{} \n".format(section_counter))
        for p in s.sentences:
            contents.append("     {}".format(p))
    return contents


In [73]:
sections[1].sentences

['UNITED STATES']

In [75]:
sections = doc.sections()
contents = extract_paragraphs(sections)
print(contents[:10])
to_txt(contents,'/root/workspace/data/DOCs/PDF/paragraphs.txt')

['Section:0 \n', '     IMF Country Report No. 22/220', 'Section:1 \n', '     UNITED STATES', 'Section:2 \n', '     July 2022', 'Section:3 \n', '     International Monetary Fund Washington, D.C.', 'Section:4 \n', '     PR22/254']
export to /root/workspace/data/DOCs/PDF/paragraphs.txt


In [130]:
s = sections[25]

In [131]:
s.block_json

{'block_class': 'cls_19',
 'block_idx': 121,
 'level': 3,
 'page_idx': 7,
 'sentences': ['FIGURE'],
 'tag': 'header'}

In [132]:
s.children

[<llmsherpa.readers.layout_reader.Table at 0x7f6e4ddea280>,
 <llmsherpa.readers.layout_reader.Section at 0x7f6e4ddea310>]

In [133]:
for c in s.children:
    print(c.to_text())
    

 | 1. Two Scenarios for the Path of the Federal Funds Rate | 17
 | --- | ---
 | TABLES 1. Selected Economic Indicators | 35
 | 2. Balance of Payments | 36
 | 3. Federal and General Government Finances | 37

4. Core Financial Soundness Indicators for Deposit Takers


In [45]:
HTML(doc.tables()[6].to_html())

0,1,2,3,4,5,6,7,8,9
Goods,-0.1,-10.2,7.6,3.1,1.7,0.9,2.1,2.3,2.2
Services,-0.1,-19.8,-1.6,9.1,8.8,4.9,3.5,2.9,2.7
Real Imports Growth,,,,,,,,,
Goods and services,1.1,-8.9,14.0,9.3,1.0,-0.6,0.9,1.9,1.9
Goods,0.5,-5.6,14.6,8.9,0.1,-1.3,0.4,1.5,1.6
Nonpetroleum goods,1.2,-5.1,15.0,8.4,-0.1,-1.2,0.7,1.8,1.8
Petroleum goods,-5.8,-12.4,5.5,12.2,1.7,-2.5,-3.0,-1.7,-1.6
Services,3.9,-22.6,11.5,11.5,5.4,2.7,2.8,3.4,3.4
Net Exports (contribution to real GDP growth),-0.2,-0.3,-1.4,-0.9,0.3,0.4,0.2,0.0,0.0
Nominal Exports,,,,,,,,,
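Since `to_html()` returns a string, the cell values can be pulled back out as plain Python lists to check a table against the PDF. A minimal sketch using only the standard library; the sample HTML is a hypothetical stand-in (not actual llmsherpa output), and the parser assumes plain `tr`/`td`/`th` markup without nested tables:

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def html_table_to_rows(html):
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows


# Hypothetical sample shaped like two rows of the table above
sample_html = (
    "<table>"
    "<tr><td>Goods</td><td>0.5</td><td>-5.6</td></tr>"
    "<tr><td>Real Imports Growth</td><td></td><td></td></tr>"
    "</table>"
)
rows = html_table_to_rows(sample_html)
```

Empty cells come back as empty strings, so group-header rows such as "Real Imports Growth" keep the same width as data rows.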


### Get the raw JSON

In [97]:
doc.json[:5]

[{'block_class': 'cls_1',
  'block_idx': 0,
  'level': 0,
  'page_idx': 0,
  'sentences': ['IMF Country Report No. 22/220'],
  'tag': 'header'},
 {'block_class': 'cls_2',
  'block_idx': 1,
  'level': 0,
  'page_idx': 0,
  'sentences': ['UNITED STATES'],
  'tag': 'header'},
 {'block_class': 'cls_0',
  'block_idx': 2,
  'level': 1,
  'page_idx': 0,
  'sentences': ['July 2022'],
  'tag': 'header'},
 {'block_class': 'cls_3',
  'block_idx': 3,
  'level': 1,
  'page_idx': 0,
  'sentences': ['2022 ARTICLE IV CONSULTATION—PRESS RELEASE; STAFF REPORT; AND STATEMENT BY THE EXECUTIVE DIRECTOR FOR THE UNITED STATES'],
  'tag': 'para'},
 {'block_class': 'cls_0',
  'block_idx': 4,
  'level': 1,
  'page_idx': 0,
  'sentences': ['Under Article IV of the IMF’s Articles of Agreement, the IMF holds bilateral discussions with members, usually every year.',
   'In the context of the 2022 Article IV consultation with the United States, the following documents have been released and are included in this pack
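The `level` and `tag` fields in these raw blocks look sufficient to rebuild a header path for each paragraph. A minimal sketch, assuming that a header at level n closes any open header at level n or deeper (this is an assumption about the data, not a documented llmsherpa rule); `header_paths` is a hypothetical helper and the sample blocks are abridged from the output above:

```python
def header_paths(blocks):
    """Pair every non-header block with the ' > '-joined headers above it."""
    stack = []   # open headers as (level, title), outermost first
    paired = []
    for block in blocks:
        text = " ".join(block["sentences"])
        if block["tag"] == "header":
            # A new header closes every open header at the same or a deeper level
            while stack and stack[-1][0] >= block["level"]:
                stack.pop()
            stack.append((block["level"], text))
        else:
            path = " > ".join(title for _, title in stack)
            paired.append((path, text))
    return paired


# Abridged sample: blocks 0, 1, 2 and the first sentence of block 4
sample_blocks = [
    {"level": 0, "tag": "header", "sentences": ["IMF Country Report No. 22/220"]},
    {"level": 0, "tag": "header", "sentences": ["UNITED STATES"]},
    {"level": 1, "tag": "header", "sentences": ["July 2022"]},
    {"level": 1, "tag": "para",
     "sentences": ["Under Article IV of the IMF’s Articles of Agreement, "
                   "the IMF holds bilateral discussions with members, usually every year."]},
]
paths = header_paths(sample_blocks)
```

On this sample the paragraph is paired with `UNITED STATES > July 2022`, which is the same prefix that `to_context_text()` printed for the first chunks earlier.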

## Use [Unstructured](https://github.com/Unstructured-IO/unstructured) 
- For PDFs, you need to install some system dependencies first ([instructions](https://github.com/Unstructured-IO/unstructured?tab=readme-ov-file#installing-the-library)).
- For installation issues with pdf2image, see this [link](https://unix.stackexchange.com/questions/754574/pdfinfonotinstallederror-unable-to-get-page-count-is-poppler-installed-and-in).
- Argument explanations are [here](https://unstructured-io.github.io/unstructured/core/partition.html).

In [16]:
import os, sys 
#from unstructured.partition.auto import partition
from unstructured.partition.pdf import partition_pdf

In [17]:
pdf_url = '/root/workspace/data/DOCs/PDF/USA_2022.pdf'
elements = partition_pdf(filename=pdf_url,strategy="hi_res")

- I am not sure this is any better than just using the Docx file,
- as table extraction is often not perfectly accurate.

In [68]:
for e in elements[250:280]:
    e_dict = e.to_dict()
    print(e_dict['type']," : ",e_dict['text'])

NarrativeText  :  21. Contending with these wage and price pressures will require a rapid withdrawal of monetary accommodation. Given the broad-based nature of wage and price inflation, a range of model simulations indicate that quickly bringing inflation back to 2 percent requires an increase in the ex-ante real policy rate to above neutral (Box 4). There is also some evidence to suggest that a secular increase in market concentration among U.S. corporates may dampen the transmission of monetary tightening, requiring a more decisive cooling of labor markets to stabilize the system (Box
NarrativeText  :  INTERNATIONAL MONETARY FUND 15
Title  :  UNITED STATES
NarrativeText  :  5). Staff’s baseline forecast is predicated on the median projection for the federal funds rate published at the June FOMC meeting. That rate path would push the ex ante real policy rate above zero by late-2022 and is expected to bring inflation back to 2 percent by late 2023/early 2024. However, if there is more 
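In the output above, a page footer ("INTERNATIONAL MONETARY FUND 15") and a running title ("UNITED STATES") sit between the two halves of paragraph 21. A minimal sketch of filtering such page furniture out of dicts shaped like `e.to_dict()`; the regex and the `running_titles` default are assumptions specific to this IMF report, `drop_page_furniture` is a hypothetical helper, and the sample texts are abridged:

```python
import re

# Footer pattern seen in the output above: publisher name followed by a page number
PAGE_FOOTER = re.compile(r"^INTERNATIONAL MONETARY FUND\s+\d+$")


def drop_page_furniture(elements, running_titles=("UNITED STATES",)):
    """Keep only elements that are neither page footers nor running titles."""
    kept = []
    for element in elements:
        text = element["text"].strip()
        if PAGE_FOOTER.match(text):
            continue
        if element["type"] == "Title" and text in running_titles:
            continue
        kept.append(element)
    return kept


# Abridged sample in the shape of e.to_dict()
sample_elements = [
    {"type": "NarrativeText", "text": "21. Contending with these wage and price pressures (Box"},
    {"type": "NarrativeText", "text": "INTERNATIONAL MONETARY FUND 15"},
    {"type": "Title", "text": "UNITED STATES"},
    {"type": "NarrativeText", "text": "5). Staff's baseline forecast is predicated on the median projection"},
]
cleaned = drop_page_furniture(sample_elements)
```

With the real elements, the input would be `[e.to_dict() for e in elements]`. Note that the title filter removes every matching `Title`, including the legitimate one on the cover page.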

## Use docx2python
- https://github.com/ShayHill/docx2python

In [166]:
from docx2python import docx2python
from docx2python.iterators import get_html_map

- seems difficult to remove data tables automatically 
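One heuristic for separating data tables from running text relies on the nested-list layout that docx2python documents for `.body` (`[table][row][cell][paragraph]`), where text outside tables is returned as a table with a single row and a single cell. A sketch under that assumption, using a hand-made nested list instead of a real document; `split_text_and_tables` is a hypothetical helper, and a genuine 1x1 table would be misclassified as text:

```python
def split_text_and_tables(body):
    """Split a [table][row][cell][paragraph] nested list into prose and tables."""
    prose, tables = [], []
    for table in body:
        is_single_cell = len(table) == 1 and len(table[0]) == 1
        if is_single_cell:
            # Assumed to be text outside any table: keep non-empty paragraphs
            prose.extend(p for p in table[0][0] if p.strip())
        else:
            tables.append(table)
    return prose, tables


# Hand-made sample in the assumed layout (not real docx2python output)
sample_body = [
    [[["UNITED STATES", "", "July 2022"]]],              # running text
    [[["Goods"], ["0.5"]], [["Services"], ["3.9"]]],      # a 2x2 data table
    [[["Under Article IV, the IMF holds discussions."]]], # running text
]
prose, tables = split_text_and_tables(sample_body)
```

With a real document, the input would presumably be `docx_content.body`, read before the document is closed.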

In [171]:
fp = '/root/workspace/data/DOCs/Word/USA_2022.docx'
docx_content = docx2python(fp)
print(docx_content.text[:1000]) ## this extracts all text, including tables


----media/image1.jpeg----



UNITED STATES


IMF Country Report No. 22/220





July 2022


2022 ARTICLE IV CONSULTATION—PRESS RELEASE; STAFF REPORT; AND STATEMENT BY THE EXECUTIVE DIRECTOR FOR THE UNITED STATES

Under Article IV of the IMF’s Articles of Agreement, the IMF holds bilateral discussions with members, usually every year. In the context of the 2022 Article IV consultation with the United States, the following documents have been released and are included in this package:



--		A Press Release summarizing the views of the Executive Board as expressed during its July 11, 2022 consideration of the staff report that concluded the Article IV consultation with the United States.

--		The Staff Report prepared by a staff team of the IMF for the Executive Board’s consideration on July 11, 2022, following discussions that ended on June 15, 2022, with the officials of the United States on economic developments and policies. Based on information available at the time of these discuss

In [187]:
docx_content.close()

## Use python-docx

In [177]:
from docx import Document

In [178]:
def extract_text_and_tables(doc_path):
    # Load the document
    doc = Document(doc_path)

    # Extract paragraphs
    paragraphs = [para.text for para in doc.paragraphs]

    # Extract tables
    tables = []
    for table in doc.tables:
        table_data = []
        for row in table.rows:
            row_data = []
            for cell in row.cells:
                row_data.append(cell.text)
            table_data.append(row_data)
        tables.append(table_data)

    return paragraphs, tables

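- This is fine, but not everything is in the right order: paragraphs and tables are returned as two separate lists, so their relative position in the document is lost.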

In [181]:
doc_path = '/root/workspace/data/DOCs/Word/USA_2022.docx'
paragraphs, tables = extract_text_and_tables(doc_path)

- Maybe you can also try this:
- https://github.com/kmrambo/Python-docx-Reading-paragraphs-tables-and-images-in-document-order-
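The ordering problem can also be sidestepped by reading the body of `word/document.xml` directly, since a `.docx` file is a zip archive and the body lists paragraphs and tables in document order. A minimal sketch that does not use python-docx at all, only the standard library; `iter_body_in_order` is a hypothetical helper, it only looks at top-level `w:p` and `w:tbl` elements, and the in-memory sample is a stripped-down stand-in for a real file:

```python
import io
import zipfile
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def _text(node):
    """Concatenate all w:t text runs below a node."""
    return "".join(t.text or "" for t in node.iter("{%s}t" % W))


def iter_body_in_order(docx_file):
    """Yield ('paragraph', str) and ('table', rows) items in document order."""
    with zipfile.ZipFile(docx_file) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    body = root.find("w:body", NS)
    for child in body:
        if child.tag == "{%s}p" % W:
            yield ("paragraph", _text(child))
        elif child.tag == "{%s}tbl" % W:
            rows = [[_text(tc) for tc in tr.findall("w:tc", NS)]
                    for tr in child.findall("w:tr", NS)]
            yield ("table", rows)
        # other body children (e.g. section properties) are skipped


# Stripped-down in-memory sample: paragraph, 1x2 table, paragraph
sample_xml = (
    '<w:document xmlns:w="%s"><w:body>'
    "<w:p><w:r><w:t>Intro</w:t></w:r></w:p>"
    "<w:tbl><w:tr>"
    "<w:tc><w:p><w:r><w:t>a</w:t></w:r></w:p></w:tc>"
    "<w:tc><w:p><w:r><w:t>b</w:t></w:r></w:p></w:tc>"
    "</w:tr></w:tbl>"
    "<w:p><w:r><w:t>Outro</w:t></w:r></w:p>"
    "</w:body></w:document>" % W
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("word/document.xml", sample_xml)
buffer.seek(0)
items = list(iter_body_in_order(buffer))
```

A file path such as `doc_path` can be passed in place of the buffer. Cells containing several paragraphs are joined without a separator in this sketch.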

### Slightly customized scripts

In [206]:
from clean_text_utils import Timelimit, cleanup_table,process_doc,process_text,filter_by_tag

In [204]:
exclude_list = ['Download Date','Current Classification', 'MASTER FILES', 'ROOM', 'SM/', 'PUBLIC USE', 'N.A.', '(-)', '+', 'Very truly yours,',
                'INTERNATIONAL MONETARY FUND', 'CONFIDENTIAL', 'INFORMATION', 'Att:', 'Other Distribution:', 'Washington, DC 20431', 'Yours sincerely,',
               'Departments Heads', 'Department Heads', 'ATTACHMENT', 'ARCHIVES', 'Classification', '□JU', '(right axis)', 'RHS', 'LHS'] 

remove_list = ['', 'I', 'II', 'Mexican peso per', 'Inflation', 'indexed,', 'Fixed rate,', 'Floating rate, 61%', 
               'LO', 'or projections.', 'ztTJ', 's s', '8 .J¼l J U.S. dollar peJ \ I •,']

title_list = ['macroeconomic', 'fiscal', 'structural', 'monetary', 'exchange', 'civil', 'external', "enterprise", 'decision', 'capacity building', 'public finance',
              'legal', 'poverty', 'alternative', 'financial', 'reform', 'government', 'real', 'statistical', "prices", 'vulnerability', 'consumption', 'text box']

In [207]:
doc_path = '/root/workspace/data/DOCs/Word/USA_2022.docx'
doc = Document(doc_path)
temp = cleanup_table(doc)
t = [process_doc(p, exclude_list=exclude_list) for p in temp]
t = [p for p in t if p != '']  # drop entries equal to the empty string
text = process_text(t, remove_list=remove_list, title_list=title_list)

In [213]:
prefixes = ["<Para>"]
clean_text = filter_by_tag(text,prefixes,replace_prefix=True)

In [215]:
clean_text[:10]

['IMF Country Report No. 22/220 Under Article IV of the IMF’s Articles of Agreement, the IMF holds bilateral discussions with members, usually every year. In the context of the 2022 Article IV consultation with the United States, the following documents have been released and are included in this package:',
 'A Press Release summarizing the views of the Executive Board as expressed during its July 11, 2022 consideration of the staff report that concluded the Article IV consultation with the United States.',
 'The Staff Report prepared by a staff team of the IMF for the Executive Board’s consideration on July 11, 2022, following discussions that ended on June 15, 2022, with the officials of the United States on economic developments and policies. Based on information available at the time of these discussions, the staff report was completed on June 24, 2022.',
 'An Informational Annex prepared by the IMF staff.',
 'A Staff Supplement updating information on recent developments.',
 'A St