### Objective

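* Create an instruction fine-tuning (IFT) dataset from the Economic Survey of India (ESI) and the RBI annual report
* Run supervised fine-tuning (SFT) with LoRA on a Llama model
* Build retrieval-augmented generation (RAG) over the new ESI once the budget and RBI reports are released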

##### Considerations
* Both documents contain text, tables and images.
* Information given in one format is not necessarily repeated in the others.
* Hence, we need to parse all three formats.

##### We will look at multiple approaches
* LlamaParse, since it processes images, text and tables
* pdfplumber for text and tables, since some tables were not recognized by LlamaParse

##### LlamaParse approach - based on results from llamaparse_uber.ipynb

* Use LlamaParse in this notebook
* Images and tables
    * Parse as JSON
    * Separate images and tables into their own dictionaries
* Text
    * Parse as markdown
    * Replace tables with a space using a regex
    * Chunk the text and store the chunks in a dictionary
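
The text steps listed above could be sketched roughly as follows. This is a minimal illustration and not the notebook's final implementation: `strip_tables`, `chunk_text`, the `TABLE_BLOCK` pattern and the `max_chars` limit are all hypothetical names and values chosen here, and the regex assumes that tables appear as pipe-delimited markdown rows, as in the LlamaParse output shown later in this notebook.

```python
import re

# Assumption: a markdown table is a run of consecutive lines that start and end with "|".
TABLE_BLOCK = re.compile(r"(?:^[ \t]*\|.*\|[ \t]*(?:\n|$))+", re.MULTILINE)


def strip_tables(md: str) -> str:
    """Replace every markdown table block with a single space."""
    return TABLE_BLOCK.sub(" ", md)


def chunk_text(text: str, max_chars: int = 500) -> dict:
    """Greedily pack paragraphs into chunks of roughly max_chars characters.

    A single paragraph longer than max_chars is kept whole in its own chunk.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks, current = {}, ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            # the current chunk is full: store it and start a new one
            chunks[len(chunks)] = current
            current = para
        else:
            current = candidate
    if current:
        chunks[len(chunks)] = current
    return chunks


sample = "Intro paragraph.\n\n| a | b |\n|---|---|\n| 1 | 2 |\n\nClosing paragraph."
stripped = strip_tables(sample)
chunks = chunk_text(stripped, max_chars=20)
```

Splitting on blank lines keeps paragraphs intact, which is usually preferable to cutting at a fixed character offset. The chunk size would need tuning against the embedding or fine-tuning setup actually used.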

In [20]:
import nest_asyncio
nest_asyncio.apply()

import os
from llama_parse import LlamaParse

In [21]:
# read the LlamaParse API key from the environment
LLAMAPARSE_API_KEY = os.environ.get('LLAMAPARSE_API_KEY')
if LLAMAPARSE_API_KEY is not None:
    print('API key found')
else:
    print('LLAMAPARSE_API_KEY is not set; export it before running the parser')

API key found


In [23]:
# check if file exists else download it
if not os.path.exists('../data/esi/esi2023.pdf'):
    !wget --user-agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36" https://www.indiabudget.gov.in/economicsurvey/doc/echapter.pdf -O '../data/esi/esi2023.pdf'
else:
    print('file exists, skipping')

file exists, skipping


In [24]:
import PyPDF2

In [25]:
def extract_pages(input_pdf_path, output_pdf_path, pages_to_extract):
    """Copy the given zero-based page indices from one PDF into a new PDF."""
    # open the input PDF
    with open(input_pdf_path, 'rb') as input_pdf:
        reader = PyPDF2.PdfReader(input_pdf)
        writer = PyPDF2.PdfWriter()

        # copy the requested pages (reader.pages is zero-indexed)
        for page_num in pages_to_extract:
            writer.add_page(reader.pages[page_num])

        # write the output PDF while the input file is still open
        with open(output_pdf_path, 'wb') as output_pdf:
            writer.write(output_pdf)

In [26]:
input_pdf_path = '../data/esi/esi2023.pdf'
output_pdf_path = '../data/esi/esi2023_pages.pdf'
# zero-based indices 42..99, i.e. PDF pages 43 to 100 (58 pages)
pages_to_extract = list(range(42, 100))

In [27]:
extract_pages(input_pdf_path, output_pdf_path, pages_to_extract)

In [28]:
# instantiate the parser; JSON output is requested later via get_json_result()
parser = LlamaParse(verbose=True,
                    api_key=LLAMAPARSE_API_KEY,
                    language='en',
                    # result_type="markdown",  # accepts "markdown" or "text"; there is no JSON option here
                    parsing_instruction="You are parsing an economic survey report released by the government of a developing economy. The document has text, tables and graphs. Try to recognise the graphs and return them as images"
                    )

In [29]:
esi2023_json = parser.get_json_result('../data/esi/esi2023_pages.pdf')

Started parsing the file under job_id 6b4a5641-a621-4e73-8f7d-ba9172bfe294
..

In [30]:
# 'pages' holds one dictionary per parsed page
len(esi2023_json[0]['pages'])

58

In [32]:
no = 5
print(esi2023_json[0]['pages'][no]['md'])

#

# Economic Survey Report

# Economic Survey 2022-23

Per cent hardening of bond yields across economies

| |Euro Area|France|Germany|Brazil|China|India|
|---|---|---|---|---|---|---|
|Jan-20| | | | | | |
|Feb-21| | | | | | |
|Apr-21| | | | | | |
|Jun-20| | | | | | |
|Jun-21| | | | | | |
|Aug-21| | | | | | |
|Oct-21| | | | | | |
|Dec-21| | | | | | |
|Feb-22| | | | | | |
|Apr-22| | | | | | |
|Jun-22| | | | | | |
|Aug-22| | | | | | |
|Oct-22| | | | | | |
|Dec-22| | | | | | |

Per cent Source: Bloomberg

Rising inflation and monetary tightening led to a slowdown in global output beginning in the second half of 2022. The global PMI composite index has been in the contractionary zone since August 2022, while the yearly growth rates of global trade, retail sales, and industrial production have significantly declined in the second half of 2022. The consequent dampening of the global economic outlook, also compounded by expectations of a further increase in borrowing costs, was reflected in 

In [33]:
print(esi2023_json[0]['pages'][no]['text'])

                   6           Economic Survey 2022-23
Per centPer centhardening of bond yields across economies
Jan-20figure I.4a: 10-year Bond yield in aes                                            figure I.4b: 10-year Bond yield in emesFeb-21
Apr-21Euro Area                                   France              Germany                  Brazil              China              IndiaJun-20
                              UK                  US                  Japan                    Indonesia           Mexico             RussiaJun-21
                         6                                                               14Nov-20Aug-21
Oct-214Apr-21
Dec-2110
Feb-22Sep-212
Apr-226Feb-220Jun-22
Aug-22Jul-22
                        -2                                                                2Oct-22
Dec-22Dec-22


Per centSource: Bloomberg
                                         figure I.5: the federal funds Rate was raised by a cumulativeFeb-21425 basis points since Jan 2022 leadin

In [34]:
print(esi2023_json[0]['pages'][no]['images'])

[]
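
The `images` list for this page is empty even though the page contains charts, so the split into image and table dictionaries planned above needs to tolerate missing entries. A minimal sketch of that split is given below. It relies only on the `md` and `images` keys seen in the outputs above. The helper name `split_images_and_tables`, the reuse of a pipe-row regex to find tables, and the `name` field in the synthetic image entry are assumptions made for illustration, not confirmed parts of the LlamaParse schema.

```python
import re

# Assumption: tables show up in the 'md' field as consecutive pipe-delimited rows.
TABLE_BLOCK = re.compile(r"(?:^[ \t]*\|.*\|[ \t]*(?:\n|$))+", re.MULTILINE)


def split_images_and_tables(pages):
    """Return (images, tables) dictionaries keyed by zero-based page index.

    Pages without images or without tables are simply left out of the
    corresponding dictionary.
    """
    images, tables = {}, {}
    for idx, page in enumerate(pages):
        page_images = page.get("images") or []
        if page_images:
            images[idx] = page_images
        md = page.get("md") or ""
        page_tables = [m.group(0).strip() for m in TABLE_BLOCK.finditer(md)]
        if page_tables:
            tables[idx] = page_tables
    return images, tables


# Synthetic pages that mimic the structure printed above (not real parser output).
demo_pages = [
    {"md": "# Title\n\n| a | b |\n|---|---|\n| 1 | 2 |\n\nText", "images": []},
    {"md": "Only text", "images": [{"name": "hypothetical_image.png"}]},
]
demo_images, demo_tables = split_images_and_tables(demo_pages)
```

On the real result, the call would presumably be `split_images_and_tables(esi2023_json[0]['pages'])`. Pages such as the one above, where a chart was returned as an empty table rather than an image, would still need a separate check.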
