### Objective

* Create an instruction fine-tuning (IFT) dataset from the Economic Survey of India (ESI) and the RBI Annual Report
* Run supervised fine-tuning (SFT) with LoRA on a Llama model
* Build RAG over the new ESI once the budget and RBI reports are released

##### Considerations
* Both documents contain text, tables and images.
* Information given in one format is not necessarily repeated in the others.
* Hence, we need to parse all three formats.

##### We will look at multiple approaches
* Using LlamaParse, since it processes images, text and tables
* Using unstructured: can it do the same thing?
* Using PDFPlumber for text and tables, since some tables were not recognized by LlamaParse

##### LlamaParse approach - based on results from llamaparse_uber.ipynb

* Use LlamaParse in this notebook
* Images and tables
    * Parse as JSON
    * Separate out images and tables into separate dictionaries
* Text
    * Parse as markdown
    * Replace tables with space using regex
    * Chunk text and store as dictionary
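
The text steps above (drop tables from the markdown, then chunk) could be sketched as follows. This is only a minimal sketch under stated assumptions, not the code used later in the notebook: `strip_md_tables` and `chunk_text` are hypothetical helper names, the regex assumes pipe-delimited markdown tables like the ones LlamaParse returns, and the chunk size and overlap are arbitrary illustration values.

```python
import re
from typing import Dict

# Assumption: a table is a run of consecutive lines that start with '|'.
TABLE_BLOCK = re.compile(r"(?:^[ \t]*\|.*(?:\n|$))+", flags=re.MULTILINE)


def strip_md_tables(md: str) -> str:
    """Replace every pipe-delimited markdown table block with a single space."""
    return TABLE_BLOCK.sub(" ", md)


def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 100) -> Dict[int, str]:
    """Split text into overlapping character windows keyed by chunk index."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = {}
    for i, start in enumerate(range(0, len(text), step)):
        chunks[i] = text[start:start + chunk_size]
    return chunks
```

Character windows are the simplest option; splitting on headings or sentences may suit the survey's structure better, but that choice is left open here.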

In [1]:
import nest_asyncio
nest_asyncio.apply()

import os
from llama_parse import LlamaParse

In [2]:
# Check that the LlamaParse API key is available as an environment variable
LLAMAPARSE_API_KEY = os.environ.get('LLAMAPARSE_API_KEY')
if LLAMAPARSE_API_KEY is not None:
    print('API key found')
else:
    print('API key not found; set the LLAMAPARSE_API_KEY environment variable')

API key found


In [5]:
!wget --user-agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36" https://www.indiabudget.gov.in/economicsurvey/doc/echapter.pdf -O '../data/esi/esi2023.pdf'

--2024-06-24 17:59:31--  https://www.indiabudget.gov.in/economicsurvey/doc/echapter.pdf
Resolving www.indiabudget.gov.in (www.indiabudget.gov.in)... 2600:140f:2a00::17c6:410, 2600:140f:2a00::17c6:430, 23.52.73.101, ...
Connecting to www.indiabudget.gov.in (www.indiabudget.gov.in)|2600:140f:2a00::17c6:410|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 14302499 (14M) [application/pdf]
Saving to: ‘../data/esi/esi2023_1.pdf’


2024-06-24 17:59:32 (11.7 MB/s) - ‘../data/esi/esi2023_1.pdf’ saved [14302499/14302499]



In [24]:
# Instantiate the parser; JSON results are requested below via get_json_result()
parser = LlamaParse(
    verbose=True,
    api_key=LLAMAPARSE_API_KEY,
    language='en',
    # result_type="markdown",  # or "text"; there is no "json" option
    parsing_instruction="You are parsing an economic survey report released by the government of a developing economy. The document has text, tables and graphs. Try to recognise the graphs"
)

In [25]:
esi2023_json = parser.get_json_result('../data/esi/esi2023.pdf')

Started parsing the file under job_id a89defbb-a174-4f2a-bb9f-da1ed6be37fb
.......

In [26]:
# 'pages' holds one dictionary per page of the PDF
len(esi2023_json[0]['pages'])

414

In [51]:
no = 51  # list index of the page to inspect
print(esi2023_json[0]['pages'][no]['md'])

# Economic Survey 2022-23

|Per cent (YoY)|Per cent|Broad-based growth driven by Demand and Investment|
|---|---|---|
|figure I.10a: yoy growth of Real|figure I.10b: Share of Real|Feb-21|
| |Gva components|GDp components|
|FY19 (3rd RE)|Apr-21| |
|Agriculture and allied activities|FY19 (3rd RE)|FY20 (2nd RE)|
| | |FY21 (1st RE)|
|Jun-21|Industry| |
|FY22 (PE)|FY23 (1st AE)|Aug-21|
|Services| |FY20 (2nd RE)|
|Oct-21| | |
|Dec-21| | |
|FY21 (1st RE)|Feb-22| |
| | | |
|-2| | |
|Apr-22| | |
|-7|Jun-22|FY22 (PE)|
| | |Aug-22|
|-12|Oct-22|FY23 (1st AE)|
|Dec-22| | |

|PFCE|GFCF|Exports of goods and services|Imports of goods and services|
|---|---|---|---|
|Per cent of GDP|Source: NSO, MoSPI| | |

Note: AE stands for Advanced Estimates, PE stands for Provisional Estimates, RE stands for Revised Estimates

|figure I.11: cpI Inflation eased back|figure I.12: Indian Rupee performed|
|---|---|
|to RBI’s target range|well compared to other emes|

|CPI|CPI-Food|Depreciation (+)/Appreciation(-)|
|--

In [52]:
print(esi2023_json[0]['pages'][no]['text'])

                                                                           10                                            Economic Survey 2022-23
Per cent (YoY)Per centBroad-based growth driven by Demand and Investment
                                                                                                                  figure I.10a: yoy growth of Real                                                                                                        figure I.10b: Share of RealFeb-21
                                                                                                                                                           Gva components                                                                                                           GDp componentsFY19 (3rd RE)Apr-21
                                                                                                                                                           Agriculture and allied activities

In [54]:
print(esi2023_json[0]['pages'][no]['images'])

[]
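
Separating images and tables into their own dictionaries could look roughly like the sketch below. It assumes only what the outputs above show: each page dictionary has an `md` string and an `images` list. `split_pages` is a hypothetical helper, and tables are recovered from the page markdown with a regex (an assumption made here for illustration) rather than from any dedicated field in the JSON.

```python
import re
from typing import Any, Dict, List, Tuple

# Assumption: a table is a run of consecutive lines that start with '|'.
TABLE_BLOCK = re.compile(r"(?:^[ \t]*\|.*(?:\n|$))+", flags=re.MULTILINE)


def split_pages(
    pages: List[Dict[str, Any]]
) -> Tuple[Dict[int, list], Dict[int, List[str]]]:
    """Return (images, tables), each keyed by the page's list index.

    Pages with no images or no tables are left out of the respective dict.
    """
    images: Dict[int, list] = {}
    tables: Dict[int, List[str]] = {}
    for idx, page in enumerate(pages):
        page_images = page.get("images", [])
        if page_images:
            images[idx] = page_images
        page_tables = [
            match.group(0).strip()
            for match in TABLE_BLOCK.finditer(page.get("md", ""))
        ]
        if page_tables:
            tables[idx] = page_tables
    return images, tables
```

In this notebook the call would be `images, tables = split_pages(esi2023_json[0]['pages'])`. Since page 51 returned an empty `images` list, it would appear in `tables` only.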
