In [None]:
%pip install llama-index
%pip install llama-index-core
%pip install llama-index-embeddings-openai
%pip install llama-index-postprocessor-flag-embedding-reranker
%pip install git+https://github.com/FlagOpen/FlagEmbedding.git
%pip install llama-parse

In [None]:
%pip install ipywidgets

In [None]:
# Create the data folder, then download the filing and save it under a readable name.
!mkdir -p 'data/'
!wget 'https://d18rn0p25nwr6d.cloudfront.net/CIK-0001326801/c7318154-f6ae-4866-89fa-f0c589f2ee3d.pdf' -O 'data/meta_10k_filings.pdf'

In [2]:
# Only needed inside a Jupyter notebook, which already runs an event loop; enabled here.
import nest_asyncio

nest_asyncio.apply()

In [3]:
import os
import getpass

os.environ["LLAMA_CLOUD_API_KEY"] = getpass.getpass("LlamaParse API Key:")

In [4]:
os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")

In [5]:
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core import VectorStoreIndex
from llama_index.core import Settings

embed_model = OpenAIEmbedding(model="text-embedding-3-small")
llm = OpenAI(model="gpt-3.5-turbo-0125", temperature=0)  # Pinned to the updated GPT-3.5 snapshot, since the current 3.5 alias points to 0613, which will be deprecated.

Settings.llm = llm
Settings.embed_model = embed_model

# Parsing with Instructions:  
Ref: https://github.com/run-llama/llama_parse/blob/main/examples/demo_parsing_instructions.ipynb

Instead of vanilla parsing, I decided to use a prompt in my parsing. I found this option on the LlamaParse website: https://cloud.llamaindex.ai/parse


The instructions below are what I tried in my POC. They did the job, but they also added signature, title, and date as field names whenever a page contained signatures. This did not affect the end results: I am able to return correct answers to the assignment questions as well as to every other question I have tested. However, I could make the instructions a bit more tailored to avoid the unnecessary title, signature, and date fields on every signature page. I don't want to specify the page name as "Power of Attorney", although pretty much every 10-K document will have a Power of Attorney page.

parsingInstructionMetaV0 = """The provided document contains a table listing signatures, titles, and dates. Extract the data from this table and create a Markdown table with the following columns: Name, Title, and Date. For the Name column, remove any signature prefixes (e.g., '/s/' or '/s') and only include the actual name. Preserve the original titles and dates as they appear in the image. The resulting Markdown table should be formatted properly with pipes (|) separating the columns and dashes (-) separating the header row from the data rows."""

If I wanted to create a tailored template, I could feed specific non-text pages to the LlamaParse website, tailor a prompt for each page, and then combine the prompts. For this exercise I will only handle what is asked, since all values are already returned correctly anyway.
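
The per-page idea could be sketched roughly like this: keep one short instruction per page type and join them into a single prompt string. The page-type names, the instruction wording, and the `build_parsing_instruction` helper are all hypothetical (they are not part of LlamaParse); the resulting string would simply be passed as `parsing_instruction`.

```python
# Hypothetical per-page instructions; names and wording are illustrative only.
PAGE_INSTRUCTIONS = {
    "Signature table": (
        "Extract Name, Title, and Date into a Markdown table. "
        "Remove signature prefixes such as '/s/' from the Name column."
    ),
    "Financial tables": (
        "Preserve every row and column as a Markdown table and keep units."
    ),
}


def build_parsing_instruction(sections: dict[str, str]) -> str:
    """Join per-page instructions into a single prompt string."""
    header = "The provided document is a 10-K filing. Parse it as follows:"
    parts = [f"{title}:\n{text}" for title, text in sections.items()]
    return "\n\n".join([header, *parts])


combined_instruction = build_parsing_instruction(PAGE_INSTRUCTIONS)
print(combined_instruction)
```

Keeping the pieces in a dictionary makes it easy to add or drop a page type without rewriting one long prompt.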

In [14]:
parsingInstructionMeta = """ 
The provided document is a 10K filing that contains various sections of text and tables. Parse the document as follows:

Signature Table:

Locate the table with columns for Signature, Title, and Date.
Extract the data from each row of the table and create a structured Markdown table with the following columns: Name, Title, and Date.
For the Name column, remove any signature prefixes (e.g., '/s/' or '/s') and only include the actual name.
Preserve the original titles and dates as they appear in the table.
The resulting Markdown table should be formatted properly with pipes (|) separating the columns and dashes (-) separating the header row from the data rows.


Other Sections:

Identify and extract the main sections of the document, such as the introduction, business description, risk factors, financial statements, etc.
For each section, create a Markdown heading with the section title.
Extract the text content of each section and include it under the corresponding heading.
If a section contains subsections, create subheadings for each subsection and include the relevant text content.
Preserve the original formatting of the text, such as paragraphs, lists, and tables, as much as possible.


Output:

Combine the parsed signature table and the other sections into a single Markdown document.
Ensure that the signature table appears at the appropriate location within the document, typically towards the end.
Use proper Markdown syntax for headings, subheadings, paragraphs, lists, tables, and any other formatting elements.
The final output should be a well-structured Markdown representation of the entire 10K document, including the signature table and all other relevant sections.
"""

In [29]:
from llama_parse import LlamaParse

# This reassignment replaces the longer instruction defined earlier: the run below uses the short signature-table prompt (V0) from my POC.
parsingInstructionMeta = """The provided document contains a table listing signatures, titles, and dates. Extract the data from this table and create a Markdown table with the following columns: Name, Title, and Date. For the Name column, remove any signature prefixes (e.g., '/s/' or '/s') and only include the actual name. Preserve the original titles and dates as they appear in the image. The resulting Markdown table should be formatted properly with pipes (|) separating the columns and dashes (-) separating the header row from the data rows."""

documents = LlamaParse(
    result_type="markdown", parsing_instruction=parsingInstructionMeta
).load_data("data/meta_10k_filings.pdf")  # relative path to the file saved by the wget cell

Started parsing the file under job_id 9153539c-489b-44e8-a48a-d6f4649ef960


In [30]:
target_page = 133
print(documents[0].text.split("\n---\n")[target_page]) # works like a champ!!!


| Name              | Title                                     | Date            |
|-------------------|-------------------------------------------|-----------------|
| Mark Zuckerberg   | Board Chair and Chief Executive Officer  | February 1, 2024 |
| Susan Li          | Chief Financial Officer                   | February 1, 2024 |
| Aaron Anderson    | Chief Accounting Officer                  | February 1, 2024 |
| Peggy Alford      | Director                                  | February 1, 2024 |
| Marc L. Andreessen| Director                                  | February 1, 2024 |
| Andrew W. Houston | Director                                  | February 1, 2024 |
| Nancy Killefer    | Director                                  | February 1, 2024 |
| Robert M. Kimmitt | Director                                  | February 1, 2024 |
| Sheryl K. Sandberg | Director                                 | February 1, 2024 |
| Tracey T. Travis  | Director                                  | Fe
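
Hardcoding `target_page = 133` only works for this particular parse. A minimal sketch of how I could look the page up instead, assuming pages are separated by `"\n---\n"` as in the cell above. The `find_pages` helper and the sample text are hypothetical; in the notebook the input would be `documents[0].text`.

```python
def find_pages(markdown_text: str, required_terms: list[str], page_sep: str = "\n---\n") -> list[int]:
    """Return indices of pages that contain every required term (case-insensitive)."""
    pages = markdown_text.split(page_sep)
    hits = []
    for index, page in enumerate(pages):
        lowered = page.lower()
        if all(term.lower() in lowered for term in required_terms):
            hits.append(index)
    return hits


# Tiny stand-in for documents[0].text, used only to demonstrate the helper.
sample_text = "\n---\n".join(
    [
        "# Risk Factors\nSome text.",
        "| Name | Title | Date |\n|---|---|---|\n| Peggy Alford | Director | February 1, 2024 |",
    ]
)
signature_pages = find_pages(sample_text, ["| Name", "| Title", "| Date"])
print(signature_pages)  # [1]
```

On the real document this may return more than one index if other tables happen to use the same column names, so the result is best treated as a list of candidates to inspect.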

In [31]:
from llama_index.core.node_parser import MarkdownElementNodeParser

node_parser = MarkdownElementNodeParser(
    llm=OpenAI(model="gpt-3.5-turbo-0125"), num_workers=8
)

In [32]:
nodes = node_parser.get_nodes_from_documents(documents)

143it [00:00, 62883.78it/s]
100%|██████████| 143/143 [00:29<00:00,  4.79it/s]


In [33]:
print(len(nodes))

428


In [34]:
base_nodes, objects = node_parser.get_nodes_and_objects(nodes)

In [35]:
recursive_index = VectorStoreIndex(nodes=base_nodes + objects)

In [36]:
from llama_index.postprocessor.flag_embedding_reranker import (
    FlagEmbeddingReranker,
)

reranker = FlagEmbeddingReranker(
    top_n=5,
    model="BAAI/bge-reranker-large",
)

recursive_query_engine = recursive_index.as_query_engine(
    similarity_top_k=15, node_postprocessors=[reranker], verbose=True
)
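
The query engine above retrieves 15 candidates (`similarity_top_k=15`) and lets the reranker keep the best 5 (`top_n=5`). The two-stage idea can be sketched in plain Python as below. This is only a toy: the two scoring functions are hypothetical stand-ins (word overlap and overlap density), not the embedding similarity or the `BAAI/bge-reranker-large` model the notebook actually uses.

```python
def overlap_score(query: str, text: str) -> int:
    """Cheap first-stage score: number of distinct query words found in the text."""
    text_words = set(text.lower().split())
    return sum(1 for word in set(query.lower().split()) if word in text_words)


def density_score(query: str, text: str) -> float:
    """Stand-in for a reranker: overlap normalised by text length."""
    words = text.lower().split()
    return overlap_score(query, text) / len(words) if words else 0.0


def retrieve_then_rerank(query: str, chunks: list[str], similarity_top_k: int, top_n: int) -> list[str]:
    # Stage 1: keep the similarity_top_k best chunks by the cheap score.
    candidates = sorted(chunks, key=lambda c: overlap_score(query, c), reverse=True)[:similarity_top_k]
    # Stage 2: re-score only those candidates and keep the top_n.
    return sorted(candidates, key=lambda c: density_score(query, c), reverse=True)[:top_n]


chunks = [
    "cash and cash equivalents were reported on the balance sheet along with many other line items",
    "cash equivalents",
    "risk factors related to regulation",
    "total cash and cash equivalents",
]
top = retrieve_then_rerank("total cash equivalents", chunks, similarity_top_k=3, top_n=2)
print(top)  # ['cash equivalents', 'total cash and cash equivalents']
```

The first stage casts a wide net cheaply; the second stage spends more effort on a small candidate set, which is why `similarity_top_k` is larger than `top_n`.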

In [37]:
query = "What was the total value of 'Cash and cash equivalents' as of December 31, 2023?"
response = recursive_query_engine.query(query)

Retrieval entering 8c4d199b-e363-44f8-8b45-aa8d2b79bde2: TextNode
Retrieving from object TextNode with query What was the total value of 'Cash and cash equivalents' as of December 31, 2023?
Retrieval entering c95f18bc-2e9f-4774-bdd2-0511647a8231: TextNode
Retrieving from object TextNode with query What was the total value of 'Cash and cash equivalents' as of December 31, 2023?
Retrieval entering 6f3c6828-7c3a-4f68-acf0-3b359af7054d: TextNode
Retrieving from object TextNode with query What was the total value of 'Cash and cash equivalents' as of December 31, 2023?
Retrieval entering 07ad329b-c6b2-444f-bb6c-75a036fa4fcc: TextNode
Retrieving from object TextNode with query What was the total value of 'Cash and cash equivalents' as of December 31, 2023?
Retrieval

In [38]:
print(response)

The total value of 'Cash and cash equivalents' as of December 31, 2023, was $41,862 million.


In [39]:
query = "What are the names of people with the director title at Meta?"
response = recursive_query_engine.query(query)

Retrieval entering d809a57d-767c-44fe-ba04-4d69ea243ad7: TextNode
Retrieving from object TextNode with query What are the names of people with the director title at Meta?
Retrieval entering 60bbc3fb-387b-48f9-9f14-aae72bd9ba53: TextNode
Retrieving from object TextNode with query What are the names of people with the director title at Meta?
Retrieval entering 79f2be4a-2a68-4713-a9ef-39339e3044f8: TextNode
Retrieving from object TextNode with query What are the names of people with the director title at Meta?
Retrieval entering 8981423e-f47b-4bd3-b207-d7f5d33b75a5: TextNode
Retrieving from object TextNode with query What are the names of people with the director title at Meta?
Retrieval entering c89b259d-5bd1-4ff2-8341-8acab7f6f6a8: TextNode

In [40]:
print(response)

Peggy Alford, Marc L. Andreessen, Andrew W. Houston, Nancy Killefer, Robert M. Kimmitt, Sheryl K. Sandberg, Tracey T. Travis, Tony Xu.
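
Since the parsed signature page is a plain Markdown table, an answer like this can be cross-checked without the LLM. A minimal sketch, assuming the table keeps the pipe-separated layout printed earlier. The `parse_markdown_table` helper is hypothetical, and the sample below contains only a few of the complete rows from that output, so it does not reproduce the full list of directors.

```python
def parse_markdown_table(table: str) -> list[dict[str, str]]:
    """Parse a simple pipe-separated Markdown table into a list of row dicts."""
    lines = [line.strip() for line in table.strip().splitlines() if line.strip()]
    header = [cell.strip() for cell in lines[0].strip("|").split("|")]
    rows = []
    for line in lines[2:]:  # skip the header and the dashed separator row
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if len(cells) == len(header):  # ignore truncated or malformed rows
            rows.append(dict(zip(header, cells)))
    return rows


# A few complete rows copied from the parsed page shown earlier.
sample_table = """
| Name              | Title                                     | Date            |
|-------------------|-------------------------------------------|-----------------|
| Mark Zuckerberg   | Board Chair and Chief Executive Officer  | February 1, 2024 |
| Susan Li          | Chief Financial Officer                   | February 1, 2024 |
| Peggy Alford      | Director                                  | February 1, 2024 |
| Marc L. Andreessen| Director                                  | February 1, 2024 |
"""

directors = [row["Name"] for row in parse_markdown_table(sample_table) if row["Title"] == "Director"]
print(directors)  # ['Peggy Alford', 'Marc L. Andreessen']
```

In the notebook, the input would be the page text selected from `documents[0].text` rather than the hardcoded sample.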
