<a href="https://colab.research.google.com/github/isamdr86/towards-ai/blob/main/notebooks/LlamaParse_ir.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install -U llama-index llama-parse

Collecting llama-index
  Downloading llama_index-0.12.10-py3-none-any.whl.metadata (11 kB)
Collecting llama-parse
  Downloading llama_parse-0.5.19-py3-none-any.whl.metadata (7.0 kB)
Collecting llama-index-agent-openai<0.5.0,>=0.4.0 (from llama-index)
  Downloading llama_index_agent_openai-0.4.1-py3-none-any.whl.metadata (726 bytes)
Collecting llama-index-cli<0.5.0,>=0.4.0 (from llama-index)
  Downloading llama_index_cli-0.4.0-py3-none-any.whl.metadata (1.5 kB)
Collecting llama-index-core<0.13.0,>=0.12.10 (from llama-index)
  Downloading llama_index_core-0.12.10.post1-py3-none-any.whl.metadata (2.5 kB)
Collecting llama-index-embeddings-openai<0.4.0,>=0.3.0 (from llama-index)
  Downloading llama_index_embeddings_openai-0.3.1-py3-none-any.whl.metadata (684 bytes)
Collecting llama-index-indices-managed-llama-cloud>=0.4.0 (from llama-index)
  Downloading llama_index_indices_managed_llama_cloud-0.6.3-py3-none-any.whl.metadata (3.8 kB)
Collecting llama-index-llms-openai<0.4.0,>=0.3.0 (from ll

In [2]:
import os

from google.colab import userdata
os.environ["LLAMA_CLOUD_API_KEY"] = userdata.get('llama_api_key')

In [3]:
# Download the research-paper dataset from the Hugging Face Hub
from huggingface_hub import hf_hub_download

file_path = hf_hub_download(
    repo_id="jaiganesan/ai_tutor_knowledge",
    filename="research_papers_llamaparse.zip",
    repo_type="dataset",
    local_dir="/content",
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


research_papers_llamaparse.zip:   0%|          | 0.00/13.6M [00:00<?, ?B/s]

In [4]:
!unzip research_papers_llamaparse.zip

Archive:  research_papers_llamaparse.zip
   creating: research_papers_llamaparse/
  inflating: research_papers_llamaparse/2106.09685v2.pdf  
  inflating: research_papers_llamaparse/2404.19756v2.pdf  
  inflating: research_papers_llamaparse/2405.07437v2.pdf  


In [5]:
import nest_asyncio

nest_asyncio.apply()

## Parse a directory with LlamaParse

In [6]:
# LlamaParse implementation
from llama_parse import LlamaParse
from llama_index.core import SimpleDirectoryReader

# Parser that returns each PDF as Markdown
parser = LlamaParse(
    result_type="markdown",
    verbose=True,
)

In [7]:
file_extractor = {".pdf": parser}

documents = SimpleDirectoryReader("/content/research_papers_llamaparse", file_extractor=file_extractor).load_data()

Started parsing the file under job_id 48c4c964-ff69-455d-863c-168966585a2c
......Started parsing the file under job_id 9d8a92a8-683e-406c-a3c8-e8559344b58d
.....................Started parsing the file under job_id 768fbe1c-cf2a-4952-978f-e52dddc0ec40
..

In [8]:
documents[0]

Document(id_='7101c2cd-7424-493a-969e-fa5094aaca03', embedding=None, metadata={'file_path': '/content/research_papers_llamaparse/2106.09685v2.pdf', 'file_name': '2106.09685v2.pdf', 'file_type': 'application/pdf', 'file_size': 1609513, 'creation_date': '2025-01-09', 'last_modified_date': '2024-07-06'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={}, metadata_template='{key}: {value}', metadata_separator='\n', text_resource=MediaResource(embeddings=None, data=None, text='# LORA: LOW-RANK ADAPTATION OF LARGE LANGUAGE MODELS\n\nEdward Hu∗ Yelong Shen∗ Phillip Wallis Zeyuan Allen-Zhu Yuanzhi Li Shean Wang Lu Wang Weizhu Chen\n\nMicrosoft Corporation\n\n{edwardhu, yeshe, phwallis, zeyuana, yuanzhil, swang, luw, wzchen}@microsoft.com\n\nyuanzhil@andrew.cmu.edu\n\n(Ve
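
Depending on the parser settings, `load_data()` may return one `Document` per file or several per file, so it can help to check how the results are distributed before indexing them. The following is a minimal sketch, assuming each returned object exposes `.metadata['file_name']` and `.text` as the `Document` printed above does. `summarize_documents` is a hypothetical helper, and the stand-in objects exist only so the snippet runs without an API key.

```python
from collections import defaultdict
from types import SimpleNamespace


def summarize_documents(docs):
    """Return {file_name: (number_of_documents, total_characters)}."""
    summary = defaultdict(lambda: [0, 0])
    for doc in docs:
        name = doc.metadata.get("file_name", "<unknown>")
        summary[name][0] += 1                    # documents seen for this file
        summary[name][1] += len(doc.text or "")  # characters of parsed text
    return {name: tuple(counts) for name, counts in summary.items()}


# Stand-in objects for illustration only (not real parser output).
sample_docs = [
    SimpleNamespace(metadata={"file_name": "a.pdf"}, text="# Title"),
    SimpleNamespace(metadata={"file_name": "a.pdf"}, text="Body"),
    SimpleNamespace(metadata={"file_name": "b.pdf"}, text="Other"),
]
print(summarize_documents(sample_docs))
```

In the notebook, the same call would be `summarize_documents(documents)`.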

## LlamaParse JSON Output

In [9]:
# Using LlamaParse in JSON Mode for PDF Reading

import glob
pdf_files = glob.glob("/content/research_papers_llamaparse/*.pdf")

parser = LlamaParse(verbose=True)

json_objs = []

for pdf_file in pdf_files:
    json_objs.extend(parser.get_json_result(pdf_file))

Started parsing the file under job_id 31155ac7-5814-42fd-ac91-248c6aff7f40
Started parsing the file under job_id 0d494fad-791d-46ea-8328-10f4f198d963
Started parsing the file under job_id 137c3221-61da-4b64-b7a5-eaa9452527c1
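
`glob.glob` does not guarantee any particular file order, so which paper ends up at `json_objs[0]` can differ between runs. One way around this is to key the results by file name instead of list position. This is a minimal sketch, not part of the original notebook: `parse_by_filename` is a hypothetical helper, and `fake_parse` (including its `source` key) is a stand-in so the snippet runs without an API key.

```python
import os


def parse_by_filename(paths, parse_fn):
    """Map each file's base name to whatever parse_fn returns for it."""
    # Sorting makes the iteration order deterministic across runs.
    return {os.path.basename(path): parse_fn(path) for path in sorted(paths)}


# Stand-in parser for illustration only; the "source" key is made up.
def fake_parse(path):
    return [{"pages": [], "source": path}]


results = parse_by_filename(["/tmp/b.pdf", "/tmp/a.pdf"], fake_parse)
print(list(results))
```

In the notebook, `parse_by_filename(pdf_files, parser.get_json_result)` would play the same role, after which a specific paper can be looked up by its file name.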


In [10]:
json_objs[0]['pages'][0]['text']

'                              LORA:              LOW-RANK                     ADAPTATION                       OF       LARGE              LAN-\n                              GUAGE MODELS\n                               Edward Hu∗             Yelong Shen∗            Phillip Wallis          Zeyuan Allen-Zhu\n                               Yuanzhi Li            Shean Wang             Lu Wang            Weizhu Chen\n                               Microsoft Corporation\n                               {edwardhu, yeshe, phwallis, zeyuana,\n                               yuanzhil, swang, luw, wzchen}@microsoft.com\n                               yuanzhil@andrew.cmu.edu\n                               (Version 2)\narXiv:2106.09685v2  [cs.CL]  16 Oct 2021\n                                                                                  ABSTRACT\n                                        An important paradigm of natural language processing consists of large-scale pre-\n                          

In [11]:
json_objs[0]['pages'][4]['text']

'guarantees that we do not introduce any additional latency during inference compared to a fine-tuned\nmodel by construction.\n\n4.2     APPLYING LORA TO TRANSFORMER\nIn principle, we can apply LoRA to any subset of weight matrices in a neural network to reduce the\nnumber of trainable parameters. In the Transformer architecture, there are four weight matrices in\nthe self-attention module (Wq , Wk, Wv × dmodel, even though the output dimension is usually sliced\n                                                                                                   as a single matrix of dimension dmodel , Wo) and two in the MLP module. We treat Wq (or Wk, Wv )\ninto attention heads. We limit our study to only adapting the attention weights for downstream\ntasks and freeze the MLP modules (so they are not trained in downstream tasks) both for simplicity\nand parameter-efficiency.We further study the effect on adapting different types of attention weight\nmatrices in a Transformer in Section 
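
The cells above read one page at a time. To work with a whole paper at once, the per-page text can be concatenated. This is a minimal sketch, assuming each result follows the `{'pages': [{'text': ...}, ...]}` shape used above. `full_text` is a hypothetical helper and `sample_result` is made-up data.

```python
def full_text(json_obj, separator="\n\n"):
    """Concatenate the 'text' field of every page in one parsed result."""
    pages = json_obj.get("pages", [])
    return separator.join(page.get("text", "") for page in pages)


# Made-up data with the same shape as one entry of json_objs.
sample_result = {"pages": [{"text": "Page one."}, {"text": "Page two."}]}
print(full_text(sample_result))
```

In the notebook, `full_text(json_objs[0])` would return the text of every page of the first parsed paper as a single string.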

In [12]:
# Full extracted text of page 7 (index 6) of the first result (the LoRA paper in this run)
json_objs[0]['pages'][6]['text']

'   Model & Method                 # Trainable                            E2E NLG Challenge\n                                   Parameters        BLEU          NIST         MET         ROUGE-L     CIDEr\n   GPT-2 M (FT)*                     354.92M          68.2          8.62         46.2           71.0     2.47\n   GPT-2 M (AdapterL)*                  0.37M         66.3          8.41         45.0           69.8     2.40\n   GPT-2 M (AdapterL)*                 11.09M         68.9          8.71         46.1           71.3     2.47\n   GPT-2 M (AdapterH)                  11.09M       67.3±.6      8.50±.07      46.0±.2        70.7±.2  2.44±.01\n   GPT-2 M (FTTop2)*\n                                       25.19M         68.1          8.59         46.0           70.8     2.41\n   GPT-2 M (PreLayer)*                  0.35M         69.7          8.81         46.1           71.4     2.49\n   GPT-2 M (LoRA)                       0.35M       70.4±.1      8.85±.02      46.8±.2        71.8±.1  2.5

In [13]:
# A single parsed item from page 7 (index 6); in this run, item 1 is a text block rather than the table
json_objs[0]['pages'][6]['items'][1]

{'type': 'text',
 'value': 'RoBERTa (Liu et al., 2019) optimized the pre-training recipe originally proposed in BERT (Devlin et al., 2019a) and boosted the latter’s task performance without introducing many more trainable parameters. While RoBERTa has been overtaken by much larger models on NLP leaderboards such as the GLUE benchmark (Wang et al., 2019) in recent years, it remains a competitive and popular pre-trained model for its size among practitioners. We take the pre-trained RoBERTa base (125M) and RoBERTa large (355M) from the HuggingFace Transformers library (Wolf et al., 2020) and evaluate the performance of different efficient adaptation approaches on tasks from the GLUE benchmark. We also replicate Houlsby et al. (2019) and Pfeiffer et al. (2021) according to their setup. To ensure a fair comparison, we make two crucial changes to how we evaluate LoRA when comparing with adapters. First, we use the same batch size for all tasks and use a sequence length of 128 to match the a

In [14]:
json_objs[0]['pages'][3]['items'][2]

{'type': 'heading',
 'lvl': 1,
 'value': '4 OUR METHOD',
 'md': '# 4 OUR METHOD',
 'bBox': {'x': 108, 'y': 252, 'w': 199.98, 'h': 508.96}}
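
Because every item carries a `type` field, items of one kind can be collected across all pages instead of being indexed by hand. This is a minimal sketch, assuming the `pages` → `items` → `type` structure shown in the outputs above. `find_items` is a hypothetical helper and `sample_result` is made-up data.

```python
def find_items(json_obj, item_type):
    """Yield (page_index, item) for every item of the given type."""
    for page_index, page in enumerate(json_obj.get("pages", [])):
        for item in page.get("items", []):
            if item.get("type") == item_type:
                yield page_index, item


# Made-up data with the same shape as one entry of json_objs.
sample_result = {
    "pages": [
        {"items": [{"type": "heading", "lvl": 1, "value": "1 INTRODUCTION"},
                   {"type": "text", "value": "Some paragraph."}]},
        {"items": [{"type": "heading", "lvl": 1, "value": "4 OUR METHOD"}]},
    ]
}
headings = [(i, item["value"]) for i, item in find_items(sample_result, "heading")]
print(headings)
```

In the notebook, `find_items(json_objs[0], "heading")` would give an outline of the first paper. The same helper could be called with `"table"`, on the assumption that table items are labeled that way; the outputs above only confirm the `text` and `heading` types.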