# Project Instructions:
__Data__: Meta 10-K Filings\
__LLM__: OpenAI GPT-3.5-turbo\
__Embedding Model__: text-embedding-3-small\
__Infrastructure__: LlamaIndex\
__Vector Store__: Qdrant - in memory\
__Deployment__: Chainlit, Hugging Face

In [None]:
# Install dependencies
%pip install llama-index
%pip install llama-index-core
%pip install llama-index-embeddings-openai
%pip install llama-index-postprocessor-flag-embedding-reranker
%pip install git+https://github.com/FlagOpen/FlagEmbedding.git
%pip install llama-parse
%pip install ipywidgets
%pip install llama-index-vector-stores-qdrant
%pip install qdrant-client

# Create a data folder, then download the document and save it under a new name:

In [None]:
!mkdir -p 'data/'
!wget 'https://d18rn0p25nwr6d.cloudfront.net/CIK-0001326801/c7318154-f6ae-4866-89fa-f0c589f2ee3d.pdf' -O 'data/meta_10k_filings.pdf'

In [1]:
# Needed when running inside a Jupyter Notebook, which already has an event loop running - I am.
import nest_asyncio

nest_asyncio.apply()

# API keys for OpenAI and Llama Cloud & Settings

In [2]:
import os
import getpass

os.environ["LLAMA_CLOUD_API_KEY"] = getpass.getpass("LLamaParse API Key:")

In [3]:
os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")
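
When the notebook is re-run often, prompting for both keys every time gets tedious. A minimal sketch of one alternative, assuming the keys may already be exported in the shell: read the environment first and prompt only when the variable is missing. The helper name `ensure_api_key` is hypothetical and not part of any library.

```python
import getpass
import os


def ensure_api_key(env_var: str, prompt: str) -> str:
    """Return the key stored in `env_var`, prompting only if it is missing."""
    key = os.environ.get(env_var)
    if not key:
        # Fall back to an interactive prompt, then store the key so that
        # libraries reading the environment can find it.
        key = getpass.getpass(prompt)
        os.environ[env_var] = key
    return key


# Example usage (commented out so the cell does not block on input):
# ensure_api_key("LLAMA_CLOUD_API_KEY", "LlamaParse API Key:")
# ensure_api_key("OPENAI_API_KEY", "OpenAI API Key:")
```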

In [4]:
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core import VectorStoreIndex
from llama_index.core import Settings

embed_model = OpenAIEmbedding(model="text-embedding-3-small")
llm = OpenAI(model="gpt-3.5-turbo-0125", temperature=0)   # I used the updated GPT-3.5 model since the current 3.5 alias points to 0613 and will be deprecated.

Settings.llm = llm
Settings.embed_model = embed_model

# Parsing with Instructions:  -- update!!!
Ref: https://github.com/run-llama/llama_parse/blob/main/examples/demo_parsing_instructions.ipynb

Instead of vanilla parsing, I decided to pass a prompt to the parser. I saw the option on the LlamaParse website: https://cloud.llamaindex.ai/parse


The instructions below did the job, but the parser also inserted name, title, and date entries at random places. This did not affect the end results: I am able to answer the assignment questions correctly, as well as every other question I have tested. Ideally, I need to tailor the instructions to avoid the unnecessary add-ons and make them suitable for any 10-K document.

- To create a tailored template, I can feed specific non-text pages to the LlamaParse website, tailor a prompt for each page, and then combine the prompts.
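
The "tailor a prompt per page, then combine" idea can be sketched as plain string assembly. This is only an illustration: the helper `combine_instructions` and the example prompts are hypothetical, and whether LlamaParse handles one long combined instruction better than the single-table prompt used here has not been tested.

```python
def combine_instructions(general: str, page_specific: dict[str, str]) -> str:
    """Join one general instruction with per-page-type instructions.

    `page_specific` maps a human-readable page label to the prompt that was
    tuned for that kind of page.
    """
    parts = [general.strip()]
    for label, instruction in page_specific.items():
        parts.append(f"For {label}: {instruction.strip()}")
    return "\n".join(parts)


# Hypothetical prompts, for illustration only.
combined = combine_instructions(
    "The document is a 10-K filing. Keep all text as it appears.",
    {
        "signature pages": "Output a Markdown table with Name, Title, and Date.",
        "all other pages": "Do not add Name, Title, or Date columns.",
    },
)
print(combined)
```

Scoping each rule to a page type is one way to keep the signature-table rule from leaking into unrelated pages.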

In [5]:
from llama_parse import LlamaParse

parsingInstructionMeta = """The provided document contains a table listing signatures, titles, and dates. Extract the data from this table and create a Markdown table with the following columns: Name, Title, and Date. For the Name column, remove any signature prefixes (e.g., '/s/' or '/s') and only include the actual name. Preserve the original titles and dates as they appear in the image. The resulting Markdown table should be formatted properly with pipes (|) separating the columns and dashes (-) separating the header row from the data rows."""

documents = LlamaParse(
    result_type="markdown", parsing_instruction=parsingInstructionMeta
).load_data("data/meta_10k_filings.pdf")  # relative path to the file downloaded above

# As Chris mentioned, there must be caching on the LlamaCloud side. My first run with parsing instructions took over 20 minutes, but subsequent runs finished in under 10 seconds.

Started parsing the file under job_id 9176e64b-b34f-4a92-a0e5-1754907a9987
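
Server-side caching is outside the notebook's control, so a local cache can make re-runs predictable. The sketch below is an assumption-laden illustration, not part of LlamaParse: `cached_parse` is a hypothetical helper that keys the cache on the PDF bytes plus the instruction text, and `fake_parse` is a stand-in for the real parsing call.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Callable


def cached_parse(
    pdf_path: str,
    instruction: str,
    parse_fn: Callable[[str, str], str],
    cache_dir: str = "parse_cache",
) -> str:
    """Return parsed markdown, calling `parse_fn` only on a cache miss."""
    pdf_bytes = Path(pdf_path).read_bytes()
    # A change to either the file or the instruction yields a new cache key.
    key = hashlib.sha256(pdf_bytes + instruction.encode("utf-8")).hexdigest()
    cache_file = Path(cache_dir) / f"{key}.md"
    if cache_file.exists():
        return cache_file.read_text(encoding="utf-8")
    markdown = parse_fn(pdf_path, instruction)
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(markdown, encoding="utf-8")
    return markdown


# Demo with a stand-in parser that records how often it is called.
calls = []


def fake_parse(path: str, instruction: str) -> str:
    calls.append(path)
    return "# parsed markdown"


with tempfile.TemporaryDirectory() as tmp:
    pdf = Path(tmp) / "demo.pdf"
    pdf.write_bytes(b"%PDF-demo")
    cache = str(Path(tmp) / "cache")
    first = cached_parse(str(pdf), "extract tables", fake_parse, cache_dir=cache)
    second = cached_parse(str(pdf), "extract tables", fake_parse, cache_dir=cache)

print(len(calls))  # the stand-in parser ran once; the second call hit the cache
```

In this notebook, `parse_fn` could wrap the `LlamaParse(...).load_data(...)` call from the cell above and return `documents[0].text`.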


# Check the Power of Attorney table markdown - check the page targeted by the parsing instructions

In [6]:
target_page = 133
print(documents[0].text.split("\n---\n")[target_page]) # works like a champ!!!


| Name              | Title                                     | Date            |
|-------------------|-------------------------------------------|-----------------|
| Mark Zuckerberg   | Board Chair and Chief Executive Officer  | February 1, 2024 |
| Susan Li          | Chief Financial Officer                   | February 1, 2024 |
| Aaron Anderson    | Chief Accounting Officer                  | February 1, 2024 |
| Peggy Alford      | Director                                  | February 1, 2024 |
| Marc L. Andreessen| Director                                  | February 1, 2024 |
| Andrew W. Houston | Director                                  | February 1, 2024 |
| Nancy Killefer    | Director                                  | February 1, 2024 |
| Robert M. Kimmitt | Director                                  | February 1, 2024 |
| Sheryl K. Sandberg | Director                                 | February 1, 2024 |
| Tracey T. Travis  | Director                                  | Fe
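
The page index 133 was found by hand. A small sketch of how the same lookup could be done by keyword, assuming pages in the LlamaParse markdown are separated by `"\n---\n"` as in the cell above. The helper `find_pages` and the sample text are hypothetical.

```python
PAGE_SEPARATOR = "\n---\n"


def find_pages(markdown: str, keyword: str) -> list[int]:
    """Return the indices of pages whose text contains `keyword` (case-insensitive)."""
    pages = markdown.split(PAGE_SEPARATOR)
    needle = keyword.lower()
    return [i for i, page in enumerate(pages) if needle in page.lower()]


# Tiny stand-in for documents[0].text
sample = PAGE_SEPARATOR.join(
    [
        "Cover page",
        "Risk factors",
        "POWER OF ATTORNEY\n| Name | Title | Date |",
    ]
)
matches = find_pages(sample, "power of attorney")
print(matches)
```

On the real document this would be called as `find_pages(documents[0].text, "Power of Attorney")`, which keeps the check working if the page count changes between parses.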

In [46]:
# Check rest of the document. 
print(documents[0].text[10000:11000] + "...") 

 maintain levels of user engagement with our products; | | |
| the loss of, or reduction in spending by, our marketers; | | |
| reduced availability of data signals used by our ad targeting and measurement tools; | | |
| ineffective operation with mobile operating systems or changes in our relationships with mobile operating system partners; | | |
| failure of our new products, or changes to our existing products, to attract or retain users or generate revenue; | | |
| Risks Related to Our Business Operations and Financial Results | | |
| our ability to compete effectively; | | |
| fluctuations in our financial results; | | |
| unfavorable media coverage and other risks affecting our ability to maintain and enhance our brands; | | |
| our ability to build, maintain, and scale our technical infrastructure, and risks associated with disruptions in our service, catastrophic events, and crises; | | |
| operating our business in multiple countries around the world; | | |
| acquisitions and 

# Markdown parser & node construction - needed for the recursive retriever
At this point all I have is a Markdown document parsed from the PDF and stored in the `documents` variable. I use MarkdownElementNodeParser to parse the LlamaParse Markdown output, then build a recursive retriever query engine for generation.
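
To make the idea of element parsing concrete, here is a rough sketch of separating Markdown tables from the surrounding text. It is not how MarkdownElementNodeParser is implemented; `split_tables_and_text` is a hypothetical helper that treats any run of lines starting with `|` as one table.

```python
def split_tables_and_text(markdown: str) -> tuple[list[str], list[str]]:
    """Split markdown into text blocks and table blocks, in document order."""
    text_blocks: list[str] = []
    table_blocks: list[str] = []
    current: list[str] = []
    in_table = False
    for line in markdown.splitlines():
        is_table_line = line.strip().startswith("|")
        # A switch between table and non-table lines closes the current block.
        if current and is_table_line != in_table:
            (table_blocks if in_table else text_blocks).append("\n".join(current))
            current = []
        in_table = is_table_line
        current.append(line)
    if current:
        (table_blocks if in_table else text_blocks).append("\n".join(current))
    return text_blocks, table_blocks


sample_md = "Intro text\n| A | B |\n|---|---|\n| 1 | 2 |\nClosing text"
texts, tables = split_tables_and_text(sample_md)
print(texts)
print(tables)
```

In the recursive setup, each table is indexed through a short summary and the full table is fetched only when that summary is retrieved, which is why tables need to be isolated first.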

In [21]:
from llama_index.core.node_parser import MarkdownElementNodeParser

node_parser = MarkdownElementNodeParser(
    llm=OpenAI(model="gpt-3.5-turbo-0125"), num_workers=8
)

In [17]:
nodes = node_parser.get_nodes_from_documents(documents)

143it [00:00, 56275.61it/s]
100%|██████████| 143/143 [00:33<00:00,  4.27it/s]


In [29]:
print(len(nodes))

428


In [23]:
base_nodes, objects = node_parser.get_nodes_and_objects(nodes)

### Initializing the `VectorStoreIndex` with Qdrant and creating the collection meta_10k_filings


In [33]:
from llama_index.vector_stores.qdrant import QdrantVectorStore
from qdrant_client import QdrantClient, models

client = QdrantClient(location=":memory:")

client.create_collection(
    collection_name="meta_10k_filings",
    vectors_config=models.VectorParams(size=1536, distance=models.Distance.COSINE)
)

True
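
The collection is created with `size=1536` because that is the default output dimension of text-embedding-3-small, and `Distance.COSINE` ranks vectors by the angle between them. A minimal pure-Python sketch of that metric, for illustration only (Qdrant computes this internally):

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])  # parallel vectors
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])      # unrelated vectors
print(same_direction, orthogonal)
```

The dimension check mirrors what happens in practice: inserting vectors whose length differs from the collection's configured size is rejected.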

# Load nodes to Qdrant to create the recursive_index

# Recursive Index - I will use a recursive index instead of a simple index.

In [34]:
from llama_index.core import StorageContext

vector_store = QdrantVectorStore(client=client, collection_name="meta_10k_filings")

storage_context = StorageContext.from_defaults(vector_store=vector_store)

recursive_index = VectorStoreIndex(
    nodes=base_nodes + objects, storage_context=storage_context
)

In [45]:
print(type(vector_store)) # check what the vector store is, phew!

<class 'llama_index.vector_stores.qdrant.base.QdrantVectorStore'>


# Initialize the reranker 
- Initially built with BAAI/bge-reranker-large. It takes about 3-5 seconds per question.
On the HF website I see other options: for better performance, it recommends BAAI/bge-reranker-v2-minicpm-layerwise and BAAI/bge-reranker-v2-gemma. So I tried gemma and it crashed my computer. Then I realized it has 2.8B parameters. Sticking with reranker-large.
https://huggingface.co/BAAI/bge-reranker-v2-m3
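
A quick back-of-the-envelope check shows why a 2.8B-parameter reranker is hard on a laptop. The sketch below counts model weights only, using the 2.8B figure quoted above. Activations and framework overhead come on top, so the real footprint is higher.

```python
def weight_memory_gb(num_parameters: float, bytes_per_parameter: int) -> float:
    """Memory needed just to hold the weights, in gigabytes (1 GB = 1e9 bytes)."""
    return num_parameters * bytes_per_parameter / 1e9


GEMMA_RERANKER_PARAMS = 2.8e9  # parameter count as quoted in the text above

fp32_gb = weight_memory_gb(GEMMA_RERANKER_PARAMS, 4)  # 32-bit floats
fp16_gb = weight_memory_gb(GEMMA_RERANKER_PARAMS, 2)  # 16-bit floats
print(f"fp32: {fp32_gb:.1f} GB, fp16: {fp16_gb:.1f} GB")
```

At roughly 11 GB in full precision, the weights alone can exhaust the RAM of a typical development machine.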

In [49]:
from llama_index.postprocessor.flag_embedding_reranker import (
    FlagEmbeddingReranker,
)

reranker = FlagEmbeddingReranker(
    top_n=5,
    model="BAAI/bge-reranker-large",
)

recursive_query_engine = recursive_index.as_query_engine(
    similarity_top_k=15, node_postprocessors=[reranker], verbose=True
)
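
The query engine above works in two stages: the vector store returns the 15 most similar nodes (`similarity_top_k=15`), and the reranker keeps the best 5 of those (`top_n=5`). The toy sketch below shows the same pattern with made-up scoring functions. `retrieve_then_rerank`, `overlap`, and `precision` are hypothetical stand-ins, not the embedding model or the BGE reranker.

```python
from typing import Callable

Scorer = Callable[[str, str], float]


def retrieve_then_rerank(
    query: str,
    candidates: list[str],
    first_stage_score: Scorer,
    rerank_score: Scorer,
    similarity_top_k: int = 15,
    top_n: int = 5,
) -> list[str]:
    """Shortlist with a cheap scorer, then reorder the shortlist with a better one."""
    shortlisted = sorted(
        candidates, key=lambda c: first_stage_score(query, c), reverse=True
    )[:similarity_top_k]
    return sorted(shortlisted, key=lambda c: rerank_score(query, c), reverse=True)[:top_n]


def overlap(query: str, text: str) -> float:
    """Toy first stage: number of distinct words shared with the query."""
    return float(len(set(query.split()) & set(text.split())))


def precision(query: str, text: str) -> float:
    """Toy reranker: shared words divided by distinct words in the text."""
    return overlap(query, text) / len(set(text.split()))


docs = [
    "cash and cash equivalents were $41,862 million",
    "cash flows from operating activities",
    "directors and executive officers",
    "cash",
]
top = retrieve_then_rerank(
    "cash and cash equivalents", docs, overlap, precision, similarity_top_k=3, top_n=2
)
print(top)
```

In this toy run the last document never reaches the reranker because the first stage cuts it. The reranker can only reorder what the first stage returns, which is the reason for retrieving 15 nodes and keeping 5.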

In [50]:
query = "What was the total value of 'Cash and cash equivalents' as of December 31, 2023?"
response = recursive_query_engine.query(query)

Retrieval entering 012d253e-1e90-40c2-ac50-ed264369f499: TextNode
Retrieving from object TextNode with query What was the total value of 'Cash and cash equivalents' as of December 31, 2023?
Retrieval entering d28c3033-3ba8-4d2b-8273-c643bcb04a1e: TextNode
Retrieving from object TextNode with query What was the total value of 'Cash and cash equivalents' as of December 31, 2023?
Retrieval entering 543a1f12-0ac1-4b96-8203-2cfcc7cd5cf2: TextNode
Retrieving from object TextNode with query What was the total value of 'Cash and cash equivalents' as of December 31, 2023?
Retrieval entering bfd075d3-88b4-45ad-8b1b-98439c05ff62: TextNode
Retrieving from object TextNode with query What was the total value of 'Cash and cash equivalents' as of December 31, 2023?
Retrieval

In [51]:
print(response)

The total value of 'Cash and cash equivalents' as of December 31, 2023 was $41,862 million.


In [38]:
query = "What are the names of people with the director title at Meta?"
response = recursive_query_engine.query(query)

Retrieval entering 7d4b0904-1e9b-4eff-bfb3-50e515a88dda: TextNode
Retrieving from object TextNode with query What are the names of people with the director title at Meta?
Retrieval entering d8afde8c-1adc-4319-a767-08488f2b26ce: TextNode
Retrieving from object TextNode with query What are the names of people with the director title at Meta?
Retrieval entering 95ee41eb-36bd-4627-a9e0-6ea128972251: TextNode
Retrieving from object TextNode with query What are the names of people with the director title at Meta?
Retrieval entering 997660d2-9888-4091-9c78-548c3ff80a93: TextNode
Retrieving from object TextNode with query What are the names of people with the director title at Meta?
Retrieval entering 746b8f51-0776-43c7-8822-612c55090199: TextNode

In [39]:
print(response)

Peggy Alford, Marc L. Andreessen, Andrew W. Houston, Nancy Killefer, Robert M. Kimmitt, Sheryl K. Sandberg, Tracey T. Travis, Tony Xu


In [40]:
query = "What are the main sections of the document?"
response = recursive_query_engine.query(query)

Retrieval entering a24c2f9b-c5d9-415d-9326-c98297beef22: TextNode
Retrieving from object TextNode with query What are the main sections of the document?
Retrieval entering 26b13ae4-2b75-4c9e-a058-6c2b5f6c342e: TextNode
Retrieving from object TextNode with query What are the main sections of the document?
Retrieval entering c893f3e8-43fe-437b-82fe-bd2c1315a15c: TextNode
Retrieving from object TextNode with query What are the main sections of the document?
Retrieval entering 51787609-c5ef-40b2-9b25-a4781ce71bb7: TextNode
Retrieving from object TextNode with query What are the main sections of the document?
Retrieval entering 8e5192e9-72f5-4aaa-8339-43a637f3c112: TextNode
Retrieving from object TextNode with query What are the main secti

In [41]:
print(response)

The main sections of the document include Directors, Executive Officers and Corporate Governance, Executive Compensation, Security Ownership of Certain Beneficial Owners and Management and Related Stockholder Matters, Certain Relationships and Related Transactions, and Director Independence, and Principal Accountant Fees and Services.


In [42]:
query = "List me the table of contents?"
response = recursive_query_engine.query(query)

Retrieval entering 9c13d365-5d24-4585-a3f4-02f4cfaf735d: TextNode
Retrieving from object TextNode with query List me the table of contents?
Retrieval entering 3b477f19-61f2-4187-b971-40841d652538: TextNode
Retrieving from object TextNode with query List me the table of contents?
Retrieval entering c893f3e8-43fe-437b-82fe-bd2c1315a15c: TextNode
Retrieving from object TextNode with query List me the table of contents?
Retrieval entering 26b13ae4-2b75-4c9e-a058-6c2b5f6c342e: TextNode
Retrieving from object TextNode with query List me the table of contents?
Retrieval entering 1adbd1d3-cc5c-4b6d-b511-f28477f8df77: TextNode
Retrieving from object TextNode with query List me the table of contents?
Retrieval entering

In [43]:
print(response)

The table of contents includes documents such as Reports of Independent Registered Public Accounting Firm, Consolidated Balance Sheets, Consolidated Statements of Income, Consolidated Statements of Comprehensive Income, Consolidated Statements of Stockholders' Equity, Consolidated Statements of Cash Flows, Notes to Consolidated Financial Statements, Amended and Restated Certificate of Incorporation, Amended and Restated Bylaws, Form of Class A Common Stock Certificate, Form of Class B Common Stock Certificate, Indenture, First Supplemental Indenture, Second Supplemental Indenture, Description of Registrant's Capital Stock, Form of Indemnification Agreement, 2012 Equity Incentive Plan, Third Amendment to the 2012 Equity Incentive Plan, 2012 Equity Incentive Plan forms of award agreements, and 2012 Equity Incentive Plan forms of award agreements (Additional Forms).
