# Project Instructions:
__Data__: Meta 10-K Filings\
__LLM__: OpenAI GPT-3.5-turbo\
__Embedding Model__: text-embedding-3-small\
__Infrastructure__: LlamaIndex\
__Vector Store__: Qdrant - Stored in the db\
__Deployment__: Chainlit, Hugging Face

#### I used LlamaParse (LlamaCloud) with parsing instructions and persisted the data in a Qdrant vector DB.

In [None]:
# Install dependencies
%pip install llama-index
%pip install llama-index-core
%pip install llama-index-llms-openai
%pip install llama-index-embeddings-openai
%pip install llama-index-vector-stores-qdrant
%pip install qdrant-client
%pip install llama-index-postprocessor-flag-embedding-reranker
%pip install git+https://github.com/FlagOpen/FlagEmbedding.git
%pip install llama-parse
%pip install ipywidgets

# Create a data folder and then download the document while updating its name:

In [None]:
!mkdir -p 'data/'
!wget 'https://d18rn0p25nwr6d.cloudfront.net/CIK-0001326801/c7318154-f6ae-4866-89fa-f0c589f2ee3d.pdf' -O 'data/meta_10k_filings.pdf'

In [1]:
# Needed when running inside a Jupyter Notebook (as I did): allows nested asyncio event loops.
import nest_asyncio

nest_asyncio.apply()

# API keys for OpenAI, LlamaCloud and Qdrant, plus Settings

In [2]:
import os
import getpass

os.environ["LLAMA_CLOUD_API_KEY"] = getpass.getpass("LlamaParse API Key:")

In [3]:
os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")

In [4]:
os.environ["QDRANT_API_KEY"] = getpass.getpass("Qdrant API Key:")
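
Re-running these cells prompts for all three keys every time. A minimal sketch of a helper that only prompts when the variable is not already set. `require_key` is a hypothetical name I made up for illustration, not part of any library.

```python
import os
import getpass


def require_key(name: str, prompt: str) -> str:
    """Return the value of env var `name`, prompting only if it is missing."""
    value = os.environ.get(name)
    if not value:
        # getpass hides the typed key so it never lands in the notebook output.
        value = getpass.getpass(prompt)
        os.environ[name] = value
    return value
```

For example, `require_key("OPENAI_API_KEY", "OpenAI API Key:")` returns immediately if the key was exported before starting Jupyter.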

In [16]:
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core import VectorStoreIndex
from llama_index.core import Settings

embed_model = OpenAIEmbedding(model="text-embedding-3-small")
llm = OpenAI(model="gpt-3.5-turbo-0125", temperature=0)   # I pinned the updated GPT-3.5 model since the current 3.5 alias points to 0613, which will be deprecated.

Settings.llm = llm
Settings.embed_model = embed_model

# Parsing with Instructions (to be updated)
Ref: https://github.com/run-llama/llama_parse/blob/main/examples/demo_parsing_instructions.ipynb

Instead of vanilla parsing, I decided to use a prompt in my parsing. I saw the option on the LlamaParse website: https://cloud.llamaindex.ai/parse


The instructions below did the job, but they also added names, titles, and dates in seemingly random places. This did not affect the end results: I get correct answers to the assignment questions as well as to every other question I have tested. Ideally, I need to tailor the instructions to avoid the unnecessary additions and make them suitable for any 10-K document.

- To create a tailored template, I can feed specific non-text pages into the LlamaParse website, tailor a prompt for each page, and then combine the prompts.
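
The "tailor per page, then combine" idea could look roughly like this. It is only a sketch: `combine_instructions` is a hypothetical helper, and the rule texts are illustrative placeholders that I have not tested against LlamaParse.

```python
def combine_instructions(general: str, page_rules: dict[str, str]) -> str:
    """Join a general instruction with per-page-type rules into one prompt."""
    lines = [general.strip(), "", "Apply these rules only when the page matches:"]
    for page_type, rule in page_rules.items():
        lines.append(f"- {page_type}: {rule.strip()}")
    return "\n".join(lines)


# Illustrative rule texts (assumptions, not validated prompts).
parsing_instruction = combine_instructions(
    "This document is a 10-K annual report. Keep the original text and order.",
    {
        "Signature page": "Render the table with columns Name, Title, Date and drop '/s/' prefixes.",
        "Financial statement": "Render each statement as one Markdown table with the original units.",
    },
)
```

The resulting string would be passed as `parsing_instruction=` in place of a single page-specific prompt, so the signature-table rule no longer applies to every page.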

In [17]:
from llama_parse import LlamaParse

parsingInstructionMeta = """The provided document contains a table listing signatures, titles, and dates. Extract the data from this table and create a Markdown table with the following columns: Name, Title, and Date. For the Name column, remove any signature prefixes (e.g., '/s/' or '/s') and only include the actual name. Preserve the original titles and dates as they appear in the image. The resulting Markdown table should be formatted properly with pipes (|) separating the columns and dashes (-) separating the header row from the data rows."""

documents = LlamaParse(
    result_type="markdown", parsing_instruction=parsingInstructionMeta
).load_data("/Users/acrobat/Documents/GitHub/AI-Engineering-Cohort-2/midterm/data/meta_10k_filings.pdf")

# As Chris mentioned, there must be caching on the LlamaCloud side: my first run with parsing instructions took over 20 minutes, but subsequent runs finished in under 10 seconds.

Started parsing the file under job_id ff928432-4033-4f20-9986-a1f05b425faf


# Check the power of attorney table Markdown - check the page targeted by the parsing instructions

In [18]:
target_page = 133
print(documents[0].text.split("\n---\n")[target_page]) # works like a champ!!!


| Name              | Title                                     | Date            |
|-------------------|-------------------------------------------|-----------------|
| Mark Zuckerberg   | Board Chair and Chief Executive Officer  | February 1, 2024 |
| Susan Li          | Chief Financial Officer                   | February 1, 2024 |
| Aaron Anderson    | Chief Accounting Officer                  | February 1, 2024 |
| Peggy Alford      | Director                                  | February 1, 2024 |
| Marc L. Andreessen| Director                                  | February 1, 2024 |
| Andrew W. Houston | Director                                  | February 1, 2024 |
| Nancy Killefer    | Director                                  | February 1, 2024 |
| Robert M. Kimmitt | Director                                  | February 1, 2024 |
| Sheryl K. Sandberg | Director                                 | February 1, 2024 |
| Tracey T. Travis  | Director                                  | Fe

In [19]:
# Check rest of the document. 
print(documents[0].text[500:1000] + "...") # one thing to notice is that the text is not in the same order as the original document.

---|-------|------|
| Signatures | | |
| | /s/ Mark Zuckerberg | February 8, 2022 |
| | Mark Zuckerberg | Chief Executive Officer |
| | | (Principal Executive Officer) |
| | | |
| | /s/ David M. Wehner | February 8, 2022 |
| | David M. Wehner | Chief Financial Officer |
| | | (Principal Financial Officer) |
| | | |
| | /s/ Jennifer G. Newstead | February 8, 2022 |
| | Jennifer G. Newstead | Chief Legal Officer |
| | | (Principal Legal Officer) |
| | | |
| | /s/ Erin Egan | February 8, 2022 |
| |...
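
The parsing instruction asks for `/s/` prefixes to be removed, yet the excerpt above still shows them. A small post-check like the following could flag such rows or clean a name cell after parsing. This is a sketch only; both helper names are hypothetical.

```python
import re

# Matches a leading "/s/" or "/s" signature marker plus surrounding whitespace.
_SIGNATURE_PREFIX = re.compile(r"^\s*/s/?\s*", re.IGNORECASE)


def strip_signature_prefix(name: str) -> str:
    """Remove a leading signature marker from a name cell."""
    return _SIGNATURE_PREFIX.sub("", name).strip()


def find_unclean_rows(markdown: str) -> list[str]:
    """Return Markdown table rows that still contain a '/s/' marker."""
    return [
        line
        for line in markdown.splitlines()
        if line.startswith("|") and "/s/" in line
    ]
```

Running `find_unclean_rows(documents[0].text)` would list the rows where the instruction was not applied.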


# Markdown parser & node construction - needed because of the recursive retriever
At this point, all I have is a Markdown document parsed from the PDF and stored in the `documents` variable. I use MarkdownElementNodeParser to parse the LlamaParse Markdown output and to build the recursive retriever query engine for generation.
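
To make "element" parsing concrete, here is a deliberately simplified, pure-Python sketch of separating Markdown tables from surrounding text. It is not how MarkdownElementNodeParser is implemented (as I understand it, the real parser also uses the LLM to summarize each table); `split_markdown_elements` is a hypothetical name.

```python
def split_markdown_elements(markdown: str) -> list[tuple[str, str]]:
    """Group consecutive lines into ("table", ...) and ("text", ...) elements."""
    elements: list[tuple[str, list[str]]] = []
    for line in markdown.splitlines():
        if not line.strip():
            continue  # skip blank lines; adjacent lines of the same kind are merged
        kind = "table" if line.lstrip().startswith("|") else "text"
        if elements and elements[-1][0] == kind:
            elements[-1][1].append(line)
        else:
            elements.append((kind, [line]))
    return [(kind, "\n".join(lines)) for kind, lines in elements]
```

Treating tables as their own elements is what lets a recursive retriever first match a table-level node and then descend into the table itself.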

In [20]:
from llama_index.core.node_parser import MarkdownElementNodeParser

node_parser = MarkdownElementNodeParser(
    llm=OpenAI(model="gpt-3.5-turbo-0125"), num_workers=8
)

In [21]:
nodes = node_parser.get_nodes_from_documents(documents)

143it [00:00, 50019.64it/s]
100%|██████████| 143/143 [00:42<00:00,  3.33it/s]


In [12]:
print(len(nodes))

428


In [22]:
base_nodes, objects = node_parser.get_nodes_and_objects(nodes)

### Initializing the `VectorStoreIndex` with Qdrant and creating the collection meta_10k_filings


## Data in Qdrant memory - POC

In [23]:
from llama_index.vector_stores.qdrant import QdrantVectorStore
from qdrant_client import QdrantClient, models

client = QdrantClient(location=":memory:")

client.create_collection(
    collection_name="meta_10k_filings",
    vectors_config=models.VectorParams(size=1536, distance=models.Distance.COSINE)
)

True
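
The collection uses `size=1536` because that is the default output dimension of text-embedding-3-small, and `Distance.COSINE` ranks vectors by the angle between them. A minimal pure-Python sketch of that metric, for intuition only: Qdrant computes it internally, and `cosine_similarity` is a hypothetical helper.

```python
import math

EMBEDDING_DIM = 1536  # default output size of text-embedding-3-small


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError(f"dimension mismatch: {len(a)} vs {len(b)}")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

The dimension check mirrors what goes wrong in practice: switching to an embedding model with a different output size requires a new collection with a matching `size`.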

## Persist data in Qdrant DB - client = qdrant_client

In [6]:
# connect the db
import os
from qdrant_client import QdrantClient

qdrant_client = QdrantClient(
    url="https://24f997d6-02df-4889-a352-1eac83e0bd37.us-east4-0.gcp.cloud.qdrant.io:6333", 
    api_key=os.environ["QDRANT_API_KEY"],
)

try:
    collections = qdrant_client.get_collections()
    print("Connected successfully to the Qdrant vector database.")
except Exception as e:
    print(f"Failed to connect to the Qdrant vector database: {e}")

Connected successfully to the Qdrant vector database.


In [7]:
collections = qdrant_client.get_collections()
print(collections)

collections=[CollectionDescription(name='meta_10k_filings')]
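
Since the output above shows that `meta_10k_filings` already exists in the cloud DB, creating it again on every run is unnecessary. A sketch of a guard, assuming only that the client exposes `get_collections()` with a `.collections` list of objects that have a `.name`, as the printed output suggests. `ensure_collection` is a hypothetical helper.

```python
def ensure_collection(client, name: str, create) -> bool:
    """Call `create()` only if `name` is missing; return True if it was created."""
    existing = {c.name for c in client.get_collections().collections}
    if name in existing:
        return False
    create()
    return True


# Possible usage with the objects defined in this notebook:
# ensure_collection(
#     qdrant_client,
#     "meta_10k_filings",
#     lambda: qdrant_client.create_collection(
#         collection_name="meta_10k_filings",
#         vectors_config=models.VectorParams(size=1536, distance=models.Distance.COSINE),
#     ),
# )
```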


# Load nodes to Qdrant to create the recursive_index

# Recursive Index - I will use a recursive index instead of a simple index.

In [24]:

from llama_index.vector_stores.qdrant import QdrantVectorStore
from qdrant_client import QdrantClient, models
from llama_index.core import StorageContext

vector_store = QdrantVectorStore(client=client, collection_name="meta_10k_filings")  # pass client=qdrant_client to persist in the Qdrant DB instead of memory

storage_context = StorageContext.from_defaults(vector_store=vector_store)

recursive_index = VectorStoreIndex(
    nodes=base_nodes + objects, storage_context=storage_context
)

In [18]:
print(type(vector_store)) # check what is the vectorstore, pheww!

<class 'llama_index.vector_stores.qdrant.base.QdrantVectorStore'>


# Initialize the reranker 
- Initially built with BAAI/bge-reranker-large. It takes about 3-5 seconds per question.
On the HF website I see other options: "For better performance, recommend BAAI/bge-reranker-v2-minicpm-layerwise and BAAI/bge-reranker-v2-gemma." So I tried gemma and it crashed my computer. Then I realized it has 2.8B parameters. Sticking with reranker-large.
https://huggingface.co/BAAI/bge-reranker-v2-m3

In [25]:
from llama_index.postprocessor.flag_embedding_reranker import (
    FlagEmbeddingReranker,
)

reranker = FlagEmbeddingReranker(
    top_n=5,
    model="BAAI/bge-reranker-large",
)

recursive_query_engine = recursive_index.as_query_engine(
    similarity_top_k=15, node_postprocessors=[reranker], verbose=True
)
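
The query engine above works in two stages: vector search returns `similarity_top_k=15` candidates, then the reranker keeps the best `top_n=5`. A toy, pure-Python sketch of that flow. `word_overlap` is a stand-in for the real cross-encoder model and both function names are hypothetical; this is not LlamaIndex's implementation.

```python
from typing import Callable


def word_overlap(query: str, text: str) -> float:
    """Toy stand-in for a reranker model: number of shared lowercase words."""
    return float(len(set(query.lower().split()) & set(text.lower().split())))


def retrieve_then_rerank(
    query: str,
    candidates: list[tuple[str, float]],
    rerank_score: Callable[[str, str], float],
    similarity_top_k: int = 15,
    top_n: int = 5,
) -> list[str]:
    """Shortlist by vector similarity, then reorder the shortlist with a reranker."""
    # Stage 1: keep the k candidates with the highest vector-similarity score.
    shortlist = sorted(candidates, key=lambda c: c[1], reverse=True)[:similarity_top_k]
    # Stage 2: rescore only the shortlist with the slower, more precise scorer.
    rescored = sorted(shortlist, key=lambda c: rerank_score(query, c[0]), reverse=True)
    return [text for text, _ in rescored[:top_n]]
```

The reranker only ever sees the shortlist, which is why a larger `similarity_top_k` costs more reranking time per question but gives the reranker a chance to rescue chunks that vector search ranked low.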

In [20]:
query = "What was the total value of 'Cash and cash equivalents' as of December 31, 2023?"
response = recursive_query_engine.query(query)

Retrieval entering 8858f292-49fb-497f-b2f9-c42faecd60f9: TextNode
Retrieving from object TextNode with query What was the total value of 'Cash and cash equivalents' as of December 31, 2023?
Retrieval entering 981b39e8-3498-456f-a0ad-f78ee1090706: TextNode
Retrieving from object TextNode with query What was the total value of 'Cash and cash equivalents' as of December 31, 2023?
Retrieval entering 409e93db-2164-4ded-9eb4-d2c0caf7b1fb: TextNode
Retrieving from object TextNode with query What was the total value of 'Cash and cash equivalents' as of December 31, 2023?
Retrieval entering c92bf696-a102-465a-9f04-9ec11fa86fdf: TextNode
Retrieving from object TextNode with query What was the total value of 'Cash and cash equivalents' as of December 31, 2023?
Retrieval

In [21]:
print(response)

The total value of 'Cash and cash equivalents' as of December 31, 2023, was $41,862 million.


In [22]:
query = "What are the names of people with the director title at Meta?"
response = recursive_query_engine.query(query)

Retrieval entering 248c8a12-12c0-4957-b5b7-a4325579b216: TextNode
Retrieving from object TextNode with query What are the names of people with the director title at Meta?
Retrieval entering badfca08-b294-4f2a-81d6-87fb17989c5e: TextNode
Retrieving from object TextNode with query What are the names of people with the director title at Meta?
Retrieval entering b3638b29-4431-41ce-a157-85547ab29073: TextNode
Retrieving from object TextNode with query What are the names of people with the director title at Meta?
Retrieval entering 77c1290f-d5ae-43a1-8373-e38271b24a64: TextNode
Retrieving from object TextNode with query What are the names of people with the director title at Meta?
Retrieval entering 50247b25-e48b-4a61-8f21-ed2f7356ea55: TextNode

In [23]:
print(response)

Peggy Alford, Marc L. Andreessen, Andrew W. Houston, Nancy Killefer, Robert M. Kimmitt, Sheryl K. Sandberg, Tracey T. Travis, Tony Xu.


In [24]:
query = "What are the main sections of the document?"
response = recursive_query_engine.query(query)

Retrieval entering 85d70991-2391-41e8-8022-e0a53f900ec7: TextNode
Retrieving from object TextNode with query What are the main sections of the document?
Retrieval entering a8078aaa-bc49-4d55-b7ef-6759995f7178: TextNode
Retrieving from object TextNode with query What are the main sections of the document?
Retrieval entering e3037d05-7dc2-490d-90f5-1b3841d44851: TextNode
Retrieving from object TextNode with query What are the main sections of the document?
Retrieval entering 14b05db9-b086-4b8b-b72b-7d22ec8001c2: TextNode
Retrieving from object TextNode with query What are the main sections of the document?
Retrieval entering 344ccda5-1ef0-45cb-95fa-97052beef754: TextNode
Retrieving from object TextNode with query What are the main secti

In [25]:
print(response)

The main sections of the document include government regulations, court decisions, and official actions related to data protection and privacy; sections related to corporate governance, executive compensation, security ownership, relationships and transactions, and accountant fees for the 2024 Annual Meeting of Stockholders; various sections of a financial report such as balance sheets, statements of income, stockholders' equity, cash flows, and notes to financial statements; and information on agreements and plans related to executive compensation and operations.


In [26]:
query = "List me the table of contents?"
response = recursive_query_engine.query(query)

Retrieval entering 25116ea7-3ebe-4ad9-9d39-74e4b319d968: TextNode
Retrieving from object TextNode with query List me the table of contents?
Retrieval entering e545fbe8-b944-49bf-98e3-ccbf5a3b1b12: TextNode
Retrieving from object TextNode with query List me the table of contents?
Retrieval entering 04aa459d-4aad-44df-bd92-7085d0ffb573: TextNode
Retrieving from object TextNode with query List me the table of contents?
Retrieval entering 54d711c3-d2d0-43e5-a30b-2654b9ffb8e5: TextNode
Retrieving from object TextNode with query List me the table of contents?
Retrieval entering c8a3588b-467a-4e4c-89de-7ad7d368095c: TextNode
Retrieving from object TextNode with query List me the table of contents?
Retrieval entering

In [27]:
print(response)

The table of contents includes the following tables:
1. Financial Items Table
2. Corporate Governance Provisions Table
3. Corporate Governance Topics Table
4. Individual Information Table (Empty)
5. Individual Information Table with Guy Rosen's Information


In [29]:
response = recursive_query_engine.query(
    "How many pages are in the document?"
)
print(response)

Retrieval entering e3037d05-7dc2-490d-90f5-1b3841d44851: TextNode
Retrieving from object TextNode with query How many pages are in the document?
Retrieval entering dce35d54-eac3-41e0-b2bf-7074c0cc576f: TextNode
Retrieving from object TextNode with query How many pages are in the document?
Retrieval entering 62a2fbe9-06f1-4a46-a5f8-42f29066db03: TextNode
Retrieving from object TextNode with query How many pages are in the document?
Retrieval entering e545fbe8-b944-49bf-98e3-ccbf5a3b1b12: TextNode
Retrieving from object TextNode with query How many pages are in the document?
Retrieval entering ba48ea7e-3f4e-4a1a-b63f-e6aec5374596: TextNode
Retrieving from object TextNode with query How many pages are in the document?

# Okay, now we are using Qdrant to answer the questions. The PDF is loaded into the Qdrant collection.