[Hanane D](https://www.linkedin.com/in/hanane-d-algo-trader)

1- Parsing the Amazon 10-K financial report using:

*   **LlamaParse** with **GPT-4o mode** to improve parsing quality, particularly when the report includes financial charts

*   **SimpleDirectoryReader**: the standard way of parsing.

2- I used a local HuggingFace **embedding** model (through LlamaIndex) to store the data in a vector store.

3- I created a **query engine** with different LLMs and compared their results: **gpt-3.5-turbo**, **GPT-4o**, and **Claude 3.5 Sonnet**.


# Install Libraries

In [None]:
!pip install llama-index llama-index-core llama-parse openai llama-index-embeddings-huggingface -q
!pip install llama-index-llms-anthropic -q

# Specify API Keys

In [None]:
import nest_asyncio
nest_asyncio.apply()

from google.colab import userdata
OPENAI_API_KEY = userdata.get('OPENAI_API_KEY')
LLAMAPARSE_API_KEY = userdata.get('LLAMAPARSE_API_KEY')
ANTHROPIC_API_KEY = userdata.get("CLAUDE_API_KEY")
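The cell above only works inside Google Colab. As a minimal sketch for running the notebook elsewhere, the hypothetical helper `get_secret` below (not part of any library) tries Colab's `userdata` first and falls back to environment variables. The assumption is that the same key names are exported in your shell.

```python
import os


def get_secret(name: str) -> str:
    """Return a secret from Colab's userdata store, else from the environment."""
    try:
        # Only importable inside Google Colab.
        from google.colab import userdata
    except ImportError:
        userdata = None

    if userdata is not None:
        try:
            value = userdata.get(name)
            if value:
                return value
        except Exception:
            # Secret not defined or notebook access not granted: fall through.
            pass

    value = os.environ.get(name)
    if not value:
        raise KeyError(f"Secret {name!r} not found in Colab userdata or environment")
    return value


# Usage (commented out so the sketch runs without any keys configured):
# OPENAI_API_KEY = get_secret("OPENAI_API_KEY")
# LLAMAPARSE_API_KEY = get_secret("LLAMAPARSE_API_KEY")
# ANTHROPIC_API_KEY = get_secret("CLAUDE_API_KEY")
```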

# Loading financial report: Amazon 2023 10K

In [None]:
!wget "https://d18rn0p25nwr6d.cloudfront.net/CIK-0001018724/c7c14359-36fa-40c3-b3ca-5bf7f3fa0b96.pdf" -O amzn_2023_10k.pdf
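If the download fails, `wget -O` can still leave an empty file or an HTML error page under the target name, and the parsers would then fail later with a less obvious error. A small sanity check, sketched here with a hypothetical helper `looks_like_pdf`, is to confirm the file starts with the PDF magic bytes before parsing.

```python
from pathlib import Path


def looks_like_pdf(path: str) -> bool:
    """Return True if the file exists and starts with the PDF magic bytes."""
    p = Path(path)
    if not p.is_file():
        return False
    with p.open("rb") as fh:
        return fh.read(5) == b"%PDF-"


# Prints True only if the download above produced a real PDF.
print(looks_like_pdf("amzn_2023_10k.pdf"))
```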

# Standard Parsing with SimpleDirectoryReader:

## Vector store and LLMs for the query engine

In [None]:
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

pdf_name = "amzn_2023_10k.pdf"
# Use SimpleDirectoryReader to parse the file
documents = SimpleDirectoryReader(input_files=[pdf_name]).load_data()

# Local HuggingFace embedding model
# https://huggingface.co/collections/BAAI/bge-66797a74476eb1f085c7446d
embed_model = "local:BAAI/bge-small-en-v1.5"

vector_index_std = VectorStoreIndex(documents, embed_model=embed_model)

from llama_index.llms.openai import OpenAI

llm_gpt35 = OpenAI(model="gpt-3.5-turbo", api_key = OPENAI_API_KEY)
query_engine_gpt35 = vector_index_std.as_query_engine(similarity_top_k=3, llm=llm_gpt35)

llm_gpt4o = OpenAI(model="gpt-4o", api_key = OPENAI_API_KEY)
query_engine_gpt4o = vector_index_std.as_query_engine(similarity_top_k=3, llm=llm_gpt4o)


from llama_index.llms.anthropic import Anthropic
from llama_index.core import Settings

tokenizer = Anthropic().tokenizer
Settings.tokenizer = tokenizer

llm_claude = Anthropic(model="claude-3-5-sonnet-20240620", api_key=ANTHROPIC_API_KEY)
query_engine_claude = vector_index_std.as_query_engine(similarity_top_k=3, llm=llm_claude)



In [None]:
print(documents[36].text)

Table of Contents
AMAZON.COM, INC.
CONSOLIDATED STATEMENTS OF CASH FLOWS
(in millions)
  Year Ended December 31,
 2021 2022 2023
CASH, CASH EQUIV ALENTS, AND RESTRICTED CASH, BEGINNING OF PERIOD $ 42,377 $ 36,477 $ 54,253 
OPERA TING ACTIVITIES:
Net income (loss) 33,364 (2,722) 30,425 
Adjustments to reconcile net income (loss) to net cash from operating activities:
Depreciation and amortization of property and equipment and capitalized content costs, operating lease
assets, and other 34,433 41,921 48,663 
Stock-based compensation 12,757 19,621 24,023 
Non-operating expense (income), net (14,306) 16,966 (748)
Deferred income taxes (310) (8,148) (5,876)
Changes in operating assets and liabilities:
Inventories (9,487) (2,592) 1,449 
Accounts receivable, net and other (9,145) (8,622) (8,348)
Other assets (9,018) (13,275) (12,265)
Accounts payable 3,602 2,945 5,473 
Accrued expenses and other 2,123 (1,558) (2,428)
Unearned revenue 2,314 2,216 4,578 
Net cash provided by (used in) operating

## Chatting with the LLMs: GPT-3.5-Turbo, GPT-4o, Claude 3.5 Sonnet

In [None]:
query1 = "What was the net income in 2023?"
response = query_engine_gpt35.query(query1)
print("GPT-3.5 Turbo")
print(str(response))

resp = query_engine_gpt4o.query(query1)
print("\nGPT-4o")
print(str(resp))

resp = query_engine_claude.query(query1)
print("\nClaude 3.5 Sonnet")
print(str(resp))

# print(resp.source_nodes[0].get_content()) # to get the source_node used to answer

GPT-3.5 Turbo
The net income for 2023 is $36,813 million.

GPT-4o
The net income for 2023 is not explicitly provided in the given context. However, you can calculate it using the provided information. The income before income taxes for 2023 is $37,557 million, and the provision for income taxes is $7,120 million. Therefore, the net income for 2023 would be $37,557 million minus $7,120 million, which equals $30,437 million.

Claude 3.5 Sonnet
Based on the information provided, Amazon's net income for 2023 can be calculated as $30,437 million. This is derived from the "Income (loss) before income taxes" of $37,557 million for 2023, minus the "Provision (benefit) for income taxes, net" of $7,120 million for the same year.
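The same three-engine pattern is repeated in every query cell of this notebook. A possible refactor, sketched below with hypothetical helpers `ask_all` and `print_answers`, loops over a dict of engines. It only assumes each engine exposes a `.query(question)` method whose result can be converted with `str()`, as the engines above do.

```python
from typing import Dict, Protocol


class QueryEngine(Protocol):
    def query(self, question: str): ...


def ask_all(engines: Dict[str, QueryEngine], question: str) -> Dict[str, str]:
    """Send one question to every engine and return the answers by label."""
    answers = {}
    for label, engine in engines.items():
        try:
            answers[label] = str(engine.query(question))
        except Exception as exc:  # keep going if one provider fails
            answers[label] = f"[error: {exc}]"
    return answers


def print_answers(answers: Dict[str, str]) -> None:
    """Print each answer under its engine label."""
    for label, text in answers.items():
        print(f"\n{label}\n{text}")


# Usage with the engines defined above (commented out: needs API keys):
# engines = {
#     "GPT-3.5 Turbo": query_engine_gpt35,
#     "GPT-4o": query_engine_gpt4o,
#     "Claude 3.5 Sonnet": query_engine_claude,
# }
# print_answers(ask_all(engines, "What was the net income in 2023?"))
```

Catching the exception per engine is a design choice: one provider timing out does not discard the answers already collected from the others.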


In [None]:
query1 = "What was the net income in 2022?"
response = query_engine_gpt35.query(query1)
print("GPT-3.5 Turbo")
print(str(response))

resp = query_engine_gpt4o.query(query1)
print("\nGPT-4o")
print(str(resp))

resp = query_engine_claude.query(query1)
print("\nClaude 3.5 Sonnet")
print(str(resp))

GPT-3.5 Turbo
The net income for 2022 is a provision (benefit) for income taxes of $(3.2) billion.

GPT-4o
The net income for 2022 is not explicitly provided in the given context. However, you can calculate it using the provided information. The income (loss) before income taxes for 2022 is $(5,936) million, and the provision (benefit) for income taxes is $(3,217) million. Therefore, the net income for 2022 would be:

Net Income = Income (Loss) Before Income Taxes - Provision (Benefit) for Income Taxes
Net Income = $(5,936) million - $(3,217) million
Net Income = $(5,936) million + $3,217 million
Net Income = $(2,719) million

So, the net income for 2022 is $(2,719) million.

Claude 3.5 Sonnet
Based on the information provided, Amazon reported a net loss in 2022. The company had a loss before income taxes of $5,936 million in 2022. After accounting for an income tax benefit of $3,217 million that year, the net loss for 2022 would be approximately $2,719 million.
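GPT-4o and Claude both derive net income as income before taxes minus the tax provision, instead of reading the "Net income (loss)" line from the cash flow statement printed earlier. A quick arithmetic check, using only figures that appear in this notebook's outputs (in millions of USD), shows the derived values are close to the reported ones but not identical. The gap suggests there is another line between the tax provision and net income on the statement of operations; that line is not visible in the excerpts here, so treat that explanation as an assumption.

```python
# Figures in millions of USD, copied from the outputs in this notebook.
reported_net_income = {2022: -2_722, 2023: 30_425}   # cash flow statement
income_before_taxes = {2022: -5_936, 2023: 37_557}   # quoted in the answers
tax_provision = {2022: -3_217, 2023: 7_120}          # negative = tax benefit


def derived_net_income(year: int) -> int:
    """Net income as the LLMs computed it: pretax income minus tax provision."""
    return income_before_taxes[year] - tax_provision[year]


for year in sorted(reported_net_income):
    derived = derived_net_income(year)
    gap = reported_net_income[year] - derived
    print(f"{year}: derived={derived:,} reported={reported_net_income[year]:,} gap={gap:,}")

# 2022: derived=-2,719 reported=-2,722 gap=-3
# 2023: derived=30,437 reported=30,425 gap=-12
```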


In [None]:
query2 = "What was the net income in 2023 compared to 2022?"
print("query2:",query2)
response = query_engine_gpt35.query(query2)
print("\nGPT-3.5 Turbo")
print(str(response))

resp = query_engine_gpt4o.query(query2)
print("\nGPT-4o")
print(str(resp))

resp = query_engine_claude.query(query2)
print("\nClaude 3.5 Sonnet")
print(str(resp))

query2: What was the net income in 2023 compared to 2022?

GPT-3.5 Turbo
The net income in 2023 was $37.557 billion, while in 2022, it was $(5.936) billion.

GPT-4o
The net income in 2023 was higher compared to 2022. In 2023, the income before income taxes was $37,557 million, and the provision for income taxes was $7,120 million, resulting in a net income of $30,437 million. In contrast, in 2022, the loss before income taxes was $5,936 million, and the benefit for income taxes was $3,217 million, resulting in a net loss of $2,719 million.

Claude 3.5 Sonnet
Based on the financial information provided, Amazon's net income situation changed significantly from 2022 to 2023:

In 2022, Amazon had a loss before income taxes of $5,936 million and received an income tax benefit of $3,217 million.

In 2023, Amazon had income before income taxes of $37,557 million and incurred an income tax provision of $7,120 million.

While the exact net income figure isn't directly stated, we can infer that 

In [None]:
print(resp.source_nodes[0].get_content())

Table of Contents
The components of the provision (benefit) for income taxes, net are as follows (in millions):
 Year Ended December 31,
2021 2022 2023
U.S. Federal:
Current $ 2,129 $ 2,175 $ 8,652 
Deferred 155 (6,686) (5,505)
Total 2,284 (4,511) 3,147 
U.S. State:
Current 763 1,074 2,158 
Deferred (178) (1,302) (498)
Total 585 (228) 1,660 
International:
Current 2,209 1,682 2,186 
Deferred (287) (160) 127 
Total 1,922 1,522 2,313 
Provision (benefit) for income taxes, net $ 4,791 $ (3,217)$ 7,120 
U.S. and international components of income (loss) before income taxes are as follows (in millions):
 Year Ended December 31,
 2021 2022 2023
U.S. $ 35,879 $ (8,225)$ 32,328 
International 2,272 2,289 5,229 
Income (loss) before income taxes $ 38,151 $ (5,936)$ 37,557 
The items accounting for differences between income taxes computed at the federal statutory rate and the provision (benefit) recorded for income taxes
are as follows (in millions):
 Year Ended December 31,
 2021 2022 2023
Inc
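Only the first source node is printed above, but the query engines retrieve three (`similarity_top_k=3`). To see quickly which chunks an answer was built from, a hypothetical helper like `summarize_sources` below can list every retrieved node with its similarity score. It assumes each node exposes `.score` and `.get_content()`, as the `source_nodes` used above do.

```python
def summarize_sources(response, max_chars: int = 200) -> list:
    """Return (rank, score, snippet) for each retrieved node of a response."""
    rows = []
    for rank, node in enumerate(getattr(response, "source_nodes", []), start=1):
        text = node.get_content()
        # Collapse whitespace so each snippet fits on one line.
        snippet = " ".join(text.split())[:max_chars]
        rows.append((rank, getattr(node, "score", None), snippet))
    return rows


# Usage with a response from any of the query engines (commented out):
# for rank, score, snippet in summarize_sources(resp):
#     print(rank, score, snippet)
```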

# LlamaParse: Amazon 10K Financial report

In [None]:
from llama_parse import LlamaParse
from llama_index.core import VectorStoreIndex

pdf_name = "amzn_2023_10k.pdf"
# Set up the parser: markdown output, with GPT-4o mode enabled
parser = LlamaParse(api_key=LLAMAPARSE_API_KEY, result_type="markdown", gpt4o_mode=True)
documents = parser.load_data(pdf_name)

Started parsing the file under job_id cac11eca-3194-4b60-a73f-76fb55d0bbaa


In [None]:
parser

LlamaParse(is_remote=False, api_key='<REDACTED>', base_url='https://api.cloud.llamaindex.ai', result_type=<ResultType.MD: 'markdown'>, num_workers=4, check_interval=1, max_timeout=2000, verbose=True, show_progress=True, language=<Language.ENGLISH: 'en'>, parsing_instruction='', skip_diagonal_text=False, invalidate_cache=False, do_not_cache=False, fast_mode=False, do_not_unroll_columns=False, page_separator=None, gpt4o_mode=True, gpt4o_api_key=None, bounding_box=None, target_pages=None, ignore_errors=True, split_by_page=True)

In [None]:
print(documents[43].text)

# AMAZON.COM, INC.
## CONSOLIDATED STATEMENTS OF CASH FLOWS
### (in millions)
#### Year Ended December 31,

|                                | 2021     | 2022     | 2023     |
|--------------------------------|----------|----------|----------|
| **CASH, CASH EQUIVALENTS, AND RESTRICTED CASH, BEGINNING OF PERIOD** | $ 42,377 | $ 36,477 | $ 54,253 |
| **OPERATING ACTIVITIES:**      |          |          |          |
| Net income (loss)              | 33,364   | (2,722)  | 30,425   |
| Adjustments to reconcile net income (loss) to net cash from operating activities: |          |          |          |
| Depreciation and amortization of property and equipment and capitalized content costs, operating lease assets, and other | 34,433   | 41,921   | 48,663   |
| Stock-based compensation       | 12,757   | 19,621   | 24,023   |
| Non-operating expense (income), net | (14,306)  | 16,966   | (748)    |
| Deferred income taxes          | (310)    | (8,145)  | (5,876)  |
| Changes in operating asse
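One benefit of the markdown output is that the tables are easy to turn back into numbers. The sketch below uses two hypothetical helpers, `parse_amount` and `parse_markdown_table`, on a few rows copied from the output above. It also compares one row with the SimpleDirectoryReader text printed earlier: the two parses disagree on the 2022 "Deferred income taxes" value ((8,148) in the plain text, (8,145) in the markdown), so it is worth checking such values against the PDF.

```python
import re
from typing import Dict, List, Optional


def parse_amount(cell: str) -> Optional[int]:
    """Convert '$ 42,377' or '(2,722)' to an int; parentheses mean negative."""
    cleaned = cell.replace("$", "").replace(",", "").replace("*", "").strip()
    match = re.fullmatch(r"\((\d+)\)|(\d+)", cleaned)
    if match is None:
        return None
    return -int(match.group(1)) if match.group(1) else int(match.group(2))


def parse_markdown_table(markdown: str) -> Dict[str, List[Optional[int]]]:
    """Map each row label to its numeric cells, skipping header/separator rows."""
    rows = {}
    for line in markdown.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue
        cells = [c.strip() for c in line.strip("|").split("|")]
        if len(cells) < 2 or set(cells[0]) <= set("-: "):
            continue  # separator row, or header row with an empty label
        label = cells[0].strip("* ")
        rows[label] = [parse_amount(c) for c in cells[1:]]
    return rows


sample = """
|                                | 2021     | 2022     | 2023     |
|--------------------------------|----------|----------|----------|
| Net income (loss)              | 33,364   | (2,722)  | 30,425   |
| Deferred income taxes          | (310)    | (8,145)  | (5,876)  |
"""
table = parse_markdown_table(sample)
print(table["Net income (loss)"])  # [33364, -2722, 30425]

# Same row as printed by SimpleDirectoryReader earlier in the notebook.
standard_values = [-310, -8148, -5876]
mismatches = [
    (a, b) for a, b in zip(standard_values, table["Deferred income taxes"]) if a != b
]
print(mismatches)  # [(-8148, -8145)]
```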

## Store documents with embeddings for later retrieval

In [None]:
from llama_index.core import VectorStoreIndex

embed_model = "local:BAAI/bge-small-en-v1.5" #https://huggingface.co/collections/BAAI/bge-66797a74476eb1f085c7446d
vector_index = VectorStoreIndex(documents, embed_model = embed_model)



In [None]:
# Optional: persist the index to disk (path is a directory of your choice)
# vector_index.storage_context.persist(persist_dir=path)

## Specify LLMs for chat

In [None]:
from llama_index.llms.openai import OpenAI

llm_gpt35 = OpenAI(model="gpt-3.5-turbo", api_key = OPENAI_API_KEY)
query_engine_gpt35 = vector_index.as_query_engine(similarity_top_k=3, llm=llm_gpt35)

llm_gpt4o = OpenAI(model="gpt-4o", api_key = OPENAI_API_KEY)
query_engine_gpt4o = vector_index.as_query_engine(similarity_top_k=3, llm=llm_gpt4o)

from llama_index.llms.anthropic import Anthropic
from llama_index.core import Settings

tokenizer = Anthropic().tokenizer
Settings.tokenizer = tokenizer

llm_claude = Anthropic(model="claude-3-5-sonnet-20240620", api_key=ANTHROPIC_API_KEY)
query_engine_claude = vector_index.as_query_engine(similarity_top_k=3, llm=llm_claude)

## Chatting with the LLMs: GPT-3.5-Turbo, GPT-4o, Claude 3.5 Sonnet

In [None]:
query1 = "What is the net income on 2023?"
response = query_engine_gpt35.query(query1)
print("GPT-3.5 Turbo")
print(str(response))

resp = query_engine_gpt4o.query(query1)
print("\nGPT-4o")
print(str(resp))

resp = query_engine_claude.query(query1)
print("\nClaude 3.5 Sonnet")
print(str(resp))

GPT-3.5 Turbo
The net income for 2023 is $16.9 billion.

GPT-4o
The net income for 2023 can be calculated by subtracting the provision for income taxes from the income before income taxes. For 2023, the income before income taxes is $37,557 million, and the provision for income taxes is $7,120 million. Therefore, the net income for 2023 is $37,557 million - $7,120 million = $30,437 million.

Claude 3.5 Sonnet
The net income for 2023 can be calculated by subtracting the provision for income taxes from the income before income taxes.

Income before income taxes in 2023 was $37,557 million.
Provision for income taxes in 2023 was $7,120 million.

Therefore, the net income for 2023 was:
$37,557 million - $7,120 million = $30,437 million

So, the net income for 2023 was $30,437 million.


In [None]:
query2 = "What was the net income in 2023 compared to 2022?"
print("query2:",query2)
response = query_engine_gpt35.query(query2)
print("\nGPT-3.5 Turbo")
print(str(response))

resp = query_engine_gpt4o.query(query2)
print("\nGPT-4o")
print(str(resp))

resp = query_engine_claude.query(query2)
print("\nClaude 3.5 Sonnet")
print(str(resp))

query2: What was the net income in 2023 compared to 2022?

GPT-3.5 Turbo
The net income in 2023 was higher compared to 2022.

GPT-4o
The net income in 2023 was $30.4 billion, compared to a net loss of $2.7 billion in 2022.

Claude 3.5 Sonnet
The net income in 2023 was significantly higher compared to 2022. In 2023, the company reported income before income taxes of $37.557 billion, and after accounting for income taxes of $7.120 billion, the net income would be approximately $30.437 billion. 

In contrast, 2022 saw a loss before income taxes of $5.936 billion. However, there was a tax benefit of $3.217 billion, which would result in a net loss of approximately $2.719 billion for 2022.

This represents a substantial turnaround, with the company moving from a net loss position in 2022 to a significant profit in 2023. The improvement was primarily driven by a large increase in income before taxes, reflecting better overall business performance in 2023.


In [None]:
query2 = "What was the reason of the net income loss in 2022?"
print("query2:",query2)
response = query_engine_gpt35.query(query2)
print("\nGPT-3.5 Turbo")
print(str(response))

resp = query_engine_gpt4o.query(query2)
print("\nGPT-4o")
print(str(resp))

resp = query_engine_claude.query(query2)
print("\nClaude 3.5 Sonnet")
print(str(resp))

query2: What was the reason of the net income loss in 2022?

GPT-3.5 Turbo
The net income loss in 2022 was primarily due to a decrease in pretax income, an increase in the foreign income deduction, a reduction in excess tax benefits from stock-based compensation, and a decrease in the tax impact of foreign earnings and losses driven by a decline in the favorable effects of corporate restructuring transactions.

GPT-4o
The net income loss in 2022 was primarily due to a decrease in pretax income and an increase in the foreign income deduction. This was partially offset by a reduction in excess tax benefits from stock-based compensation and a decrease in the tax impact of foreign earnings and losses driven by a decline in the favorable effects of corporate restructuring transactions.

Claude 3.5 Sonnet
The net income loss in 2022 was primarily due to a significant decrease in pretax income. The company experienced a loss before income taxes of $5,936 million in 2022, compared to an income

In [None]:
print(resp.source_nodes[0].get_content())

# The components of the provision (benefit) for income taxes, net are as follows (in millions):

|                            | Year Ended December 31, |
|----------------------------|-------------------------|
|                            | 2021    | 2022    | 2023    |
| **U.S. Federal:**          |         |         |         |
| Current                    | $ 2,129 | $ 2,175 | $ 8,652 |
| Deferred                   | 155     | (6,686) | (5,505) |
| **Total**                  | 2,284   | (4,511) | 3,147   |
| **U.S. State:**            |         |         |         |
| Current                    | 763     | 1,074   | 2,158   |
| Deferred                   | (178)   | (1,302) | (498)   |
| **Total**                  | 585     | (228)   | 1,660   |
| **International:**         |         |         |         |
| Current                    | 2,209   | 1,682   | 2,186   |
| Deferred                   | (287)   | (160)   | 127     |
| **Total**                  | 1,922   | 1,522   | 2,313 

In [None]:
query3 = "What are the most important takeaways from the report?"
print("query3:",query3)
response = query_engine_gpt35.query(query3)
print("\nGPT-3.5 Turbo")
print(str(response))

resp = query_engine_gpt4o.query(query3)
print("\nGPT-4o")
print(str(resp))

resp = query_engine_claude.query(query3)
print("\nClaude 3.5 Sonnet")
print(str(resp))

query3: What are the most important takeaways from the report?

GPT-3.5 Turbo
The report includes the consolidated financial statements, notes to the financial statements, and the independent auditor's report. Additionally, it lists the documents filed as part of the report, including the index to consolidated financial statements, financial statement schedules, and exhibits. The forward-looking statements section highlights the uncertainties and factors that could impact future results.

GPT-4o
The most important takeaways from the report include the detailed financial statements and supplementary data, such as the consolidated statements of cash flows, operations, comprehensive income (loss), balance sheets, and stockholders' equity. Additionally, the report includes the independent auditor's report from Ernst & Young LLP. The management's discussion and analysis section highlights the company's forward-looking statements, emphasizing the inherent uncertainties and various factors th

In [None]:
# For this question, the cost went from $0.07 to $0.16, i.e. $0.09 for this query

In [None]:
print(resp.source_nodes[0].get_content())

# Item 8. Financial Statements and Supplementary Data

## INDEX TO CONSOLIDATED FINANCIAL STATEMENTS

|                                                                                           | Page |
|-------------------------------------------------------------------------------------------|------|
| Report of Ernst & Young LLP, Independent Registered Public Accounting Firm (PCAOB ID: 42) | 35   |
| Consolidated Statements of Cash Flows                                                     | 37   |
| Consolidated Statements of Operations                                                     | 38   |
| Consolidated Statements of Comprehensive Income (Loss)                                    | 39   |
| Consolidated Balance Sheets                                                               | 40   |
| Consolidated Statements of Stockholders’ Equity                                           | 41   |
| Notes to Consolidated Financial Statements                                              

In [None]:
print(resp.source_nodes[1].get_content())

In [None]:
print(resp.source_nodes[2].get_content())

# Key Takeaways

1- **In terms of parsing**, since this 10-K report contains only tables, LlamaParse with markdown output is sufficient (and visually clear); GPT-4o mode is not needed. If you don't know in advance whether a PDF contains charts, GPT-4o mode can still be worth enabling. In a previous post, though, I found that Claude 3.5 Sonnet was better than GPT-4o at parsing charts as images.

It would be great to have a Claude 3.5 Sonnet mode in LlamaParse, although it could also be more expensive.

Simple parsing (SimpleDirectoryReader) was also good enough for this kind of simple question.

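2- **In the chat comparison**, Claude 3.5 Sonnet was slightly better than GPT-4o. It explains how its calculations are performed and gives justifications for certain losses. Note that both models derive net income as income before taxes minus the tax provision ($30,437 million for 2023), which is close to, but not exactly, the $30,425 million reported on the cash flow statement. GPT-3.5-Turbo lags significantly behind.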