### Objective

* Create chunks of text that we will pass to an LLM to generate IFT data.
* The source documents are complex PDFs that need to be parsed.
* The documents contain tables and images, which makes the task harder.
* Explore both LlamaParse and Unstructured.

https://github.com/run-llama/llama_parse/blob/main/examples/demo_json.ipynb
https://docs.cloud.llamaindex.ai/llamaparse/features/metadata


In [8]:
import nest_asyncio
nest_asyncio.apply()

import os
from llama_parse import LlamaParse

In [9]:
!python --version

Python 3.12.2


In [10]:
# check for key
LLAMAPARSE_API_KEY = os.environ.get('LLAMAPARSE_API_KEY')
if LLAMAPARSE_API_KEY is not None:
    print('API key found')
else:
    print('Check for API key in environment variable')

API key found


In [11]:

# instantiate parser
parser = LlamaParse(verbose=True,
                     api_key=LLAMAPARSE_API_KEY,
                     language='en',
                     # result_type="markdown", # or text; no json
                     parsing_instruction="You are parsing a quarterly report from a listed company. Please extract the tables containing financials and other information as well as images."
                     )

In [12]:
json_obs = parser.get_json_result('../data/test/uber_10q_march_2022.pdf')

Started parsing the file under job_id cac11eca-4d76-429e-9a5b-fff2ca1ff6c6


In [13]:
# parser returns a list of length 1
type(json_obs), len(json_obs)

(list, 1)

In [14]:
# list item is a dictionary
json_obs[0].keys()

dict_keys(['pages', 'job_metadata', 'job_id', 'file_path'])

In [15]:
# response is in 0th element of the list
# 0th element is a dictionary with parsed content in 'pages' key
type(json_obs[0]['pages']),len(json_obs[0]['pages'])

(list, 106)

In [16]:
# each element of pages is a dictionary, one per page, with these keys:
    # page - page number
    # text - parsed text
    # md - markdown
    # images - images on page
    # items - structured elements on the page (headings, text, tables)
type(json_obs[0]['pages'][0]), json_obs[0]['pages'][0].keys()

(dict, dict_keys(['page', 'text', 'md', 'images', 'items']))

In [17]:
# explore how pages with images and tables are stored
# pages are zero-indexed, so index 43 is page 44
json_obs[0]['pages'][43].keys()

dict_keys(['page', 'text', 'md', 'images', 'items'])

In [18]:
# page number
json_obs[0]['pages'][43]['page']

44

In [19]:
# raw text - also includes text from tables
print(json_obs[0]['pages'][43]['text'])

                                                                                        Trips (in millions)
                                                                                                                                1,641               1,769              1,713
                                                                     1,.443              1.447              1511
                                                 1.184
                               737
                            Q2 2020             Q3 2020            Q4 2020             Q1 2021            Q2 2021             Q3 2021            04 2021             Q1 2022
     Gross Bookings. We define Gross Bookings as the total dollar value, including any applicable taxes, tolls, and fees, of: Mobility and New Mobility rides;
Delivery orders (in each case without any adjustment for consumer discounts and refunds); Driver and Merchant earnings; Driver incentives; and Freight revenue.
Gross Bookings do not in

In [20]:
# markdown text - tables are rendered as markdown tables, along with headings and some text
print(json_obs[0]['pages'][43]['md'])

#

# Quarterly Report

# Financial Data

**Trips (in millions)**
| |Q2 2020|Q3 2020|Q4 2020|Q1 2021|Q2 2021|Q3 2021|Q4 2021|Q1 2022|
|---|---|---|---|---|---|---|---|---|
|Gross Bookings (in millions)|25,866|26,449|21,900|23,113|17,152|19,536|14,745|10,224|

**Financial Breakdown (in millions)**
| |Q2 2020|Q3 2020|Q4 2020|Q1 2021|Q2 2021|Q3 2021|Q4 2021|Q1 2022|
|---|---|---|---|---|---|---|---|---|
|Mobility|$3,046|$5,905|$6,789|$6,773|$8,640|$9,883|$11,340|$10,723|
|Delivery|$6,961|$8,550|$10,050|$12,461|$12,912|$12,828|$13,444|$13,903|
|Freight|$212|$290|$313|$302|$348|$402|$1,082|$1,823|
|All Other|$5|-|-|-|-|-|-|-|

# Adjusted EBITDA

**Adjusted EBITDA Comparison (in millions)**
| |Three Months Ended March 31, 2021|Three Months Ended March 31, 2022|% Change|
|---|---|---|---|
|Adjusted EBITDA|($359)|$168|**|

** Percentage not meaningful.

Adjusted EBITDA was $168 million, improving $527 million from an Adjusted EBITDA loss of $359 million in the same period in 2021. The improveme

In [21]:
# 2 images on the page
print(json_obs[0]['pages'][43]['images']), len(json_obs[0]['pages'][43]['images'])

[{'name': 'img_p43_1.png', 'height': 239, 'width': 890}, {'name': 'img_p43_2.png', 'height': 316, 'width': 890}]


(None, 2)

In [22]:
# items - structured elements on the page (headings, text, tables), in order
print(json_obs[0]['pages'][43]['items'])

[{'type': 'heading', 'lvl': 1, 'value': '', 'md': '#'}, {'type': 'heading', 'lvl': 1, 'value': 'Quarterly Report', 'md': '# Quarterly Report'}, {'type': 'heading', 'lvl': 1, 'value': 'Financial Data', 'md': '# Financial Data'}, {'type': 'text', 'value': '**Trips (in millions)**', 'md': '**Trips (in millions)**'}, {'type': 'table', 'rows': [['', 'Q2 2020', 'Q3 2020', 'Q4 2020', 'Q1 2021', 'Q2 2021', 'Q3 2021', 'Q4 2021', 'Q1 2022'], ['Gross Bookings (in millions)', '25,866', '26,449', '21,900', '23,113', '17,152', '19,536', '14,745', '10,224']], 'md': '| |Q2 2020|Q3 2020|Q4 2020|Q1 2021|Q2 2021|Q3 2021|Q4 2021|Q1 2022|\n|---|---|---|---|---|---|---|---|---|\n|Gross Bookings (in millions)|25,866|26,449|21,900|23,113|17,152|19,536|14,745|10,224|', 'isPerfectTable': True, 'csv': '"","Q2 2020","Q3 2020","Q4 2020","Q1 2021","Q2 2021","Q3 2021","Q4 2021","Q1 2022"\n"Gross Bookings (in millions)","25,866","26,449","21,900","23,113","17,152","19,536","14,745","10,224"'}, {'type': 'text', 'value
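
Since each `items` entry carries a `type`, the tables can be pulled out of the JSON response page by page. A minimal sketch, assuming every page dictionary has the `page` and `items` keys printed above and that table items expose `rows` and `md` as in this output; `collect_tables` and `sample_pages` are hypothetical names, and the sample is hand-made rather than real parser output.

```python
from typing import Any


def collect_tables(pages: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Return one record per table item, tagged with its page number."""
    tables = []
    for page in pages:
        for item in page.get("items", []):
            if item.get("type") == "table":
                tables.append(
                    {
                        "page": page.get("page"),
                        "rows": item.get("rows", []),
                        "md": item.get("md", ""),
                    }
                )
    return tables


# Hand-made sample that mimics the structure shown above (not real parser output).
sample_pages = [
    {
        "page": 44,
        "items": [
            {"type": "heading", "lvl": 1, "value": "Financial Data", "md": "# Financial Data"},
            {
                "type": "table",
                "rows": [["", "Col A"], ["Row 1", "1"]],
                "md": "| |Col A|\n|---|---|\n|Row 1|1|",
            },
        ],
    },
    {"page": 45, "items": [{"type": "text", "value": "No tables here.", "md": "No tables here."}]},
]

tables = collect_tables(sample_pages)
print(len(tables), tables[0]["page"])
```

With the real response, the call would be `collect_tables(json_obs[0]['pages'])`.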

In [23]:
# check another page: the markdown has the tables plus some associated text
print(json_obs[0]['pages'][45]['md'])

#

# Quarterly Report

# Adjusted EBITDA Reconciliation

| |Three Months Ended March 31, 2021|Three Months Ended March 31, 2022|
|---|---|---|
|Net loss attributable to Uber Technologies, Inc.|$(108) million|$(5,930) million|
|Net income (loss) attributable to non-controlling interests, net of tax|$(14) million|$12 million|
|Provision for (benefit from) income taxes|$185 million|$(232) million|
|Loss (income) from equity method investments|$8 million|$(18) million|
|Interest expense|$115 million|$129 million|
|Other (income) expense, net|$(1,710) million|$5,557 million|
|Depreciation and amortization|$212 million|$254 million|
|Stock-based compensation expense|$281 million|$359 million|
|Legal, tax, and regulatory reserve changes and settlements|$551 million|$—|
|Goodwill and asset impairments/loss on sale of assets|$57 million|$13 million|
|Acquisition, financing and divestitures related expenses|$36 million|$14 million|
|Accelerated lease costs related to cease-use of ROU assets|$2 m

##### What does this mean for us?
* Tables are essential because they hold much of the information about the business.
* Tables are parsed twice: once as raw text and once as markdown.
* This duplication will not work for us, because the generated questions would be repeated.
* We need to explore a hybrid approach: JSON for images, markdown for text and tables.

In [24]:

# instantiate parser for markdown results
parser = LlamaParse(verbose=True,
                     api_key=LLAMAPARSE_API_KEY,
                     language='en',
                     result_type="markdown", # or text; no json
                     parsing_instruction="You are parsing a quarterly report from a listed company. Please extract the tables containing financials and other information as well as images."
                     )

In [26]:
# parse documents
md_docs = parser.load_data('../data/test/uber_10q_march_2022.pdf')

Started parsing the file under job_id cac11eca-4177-4ed4-83f3-94da9f8ef85c


In [27]:
# type, etc.
type(md_docs), len(md_docs), type(md_docs[0])

(list, 1, llama_index.core.schema.Document)

In [28]:
# displaying the Document directly prints the entire text, so keep this commented out
# md_docs[0]

In [29]:
# since documents is a list with length 1
type(md_docs[0]), print(md_docs[0])

Doc ID: f93f81aa-d3e6-443a-b9af-f6df52110998
Text: #  # Quarterly Report  # Financial Report - Uber Technologies,
Inc.  # Financial Information  |Item|Details| |---|---| |Common
Stock|Trading Symbol: UBER| |Exchange|New York Stock Exchange|  #
Company Information  |Filing Status|Large accelerated filer| |---|---|
|Shell Company|No| |Shares Outstanding|1,963,660,253|  # Contact
Information  Addre...


(llama_index.core.schema.Document, None)

In [30]:
# get actual text and tables in markdown; ignores images as seen below
print(md_docs[0].text)

#

# Quarterly Report

# Financial Report - Uber Technologies, Inc.

# Financial Information

|Item|Details|
|---|---|
|Common Stock|Trading Symbol: UBER|
|Exchange|New York Stock Exchange|

# Company Information

|Filing Status|Large accelerated filer|
|---|---|
|Shell Company|No|
|Shares Outstanding|1,963,660,253|

# Contact Information

Address: 1515 3rd Street, San Francisco, California 94158

Phone: (415) 612-8582

# Images

No images included in the report.
---
#

# Uber Technologies, Inc. - Quarterly Report

# Uber Technologies, Inc. - Quarterly Report

# Special Note Regarding Forward-Looking Statements

Content related to forward-looking statements goes here...

# Part I - Financial Information

# Item 1. Financial Statements (unaudited)

**Condensed Consolidated Balance Sheets**
|Condensed Consolidated Balance Sheets as of December 31, 2021|Content for December 31, 2021|
|---|---|
|Condensed Consolidated Balance Sheets as of March 31, 2022|Content for March 31, 2022|

**Conde

##### So what next?

* The markdown response contains both tables and text.
* We need to extract the tables into a separate list and iterate over them separately.
* We will use the JSON response to iterate over images.
* Focus on text and tables first, then images.

##### How to do this - 3 approaches

* Approach 1: get the markdown result and split it into text nodes and table objects with `MarkdownElementNodeParser`

In [None]:
from llama_index.core.node_parser import MarkdownElementNodeParser
from llama_index.llms.openai import OpenAI

In [None]:
node_parser = MarkdownElementNodeParser(
    llm=OpenAI(model="gpt-3.5-turbo-0125")
)

In [None]:
nodes = node_parser.get_nodes_from_documents(md_docs)
base_nodes, objects = node_parser.get_nodes_and_objects(nodes)

102it [00:00, 62619.88it/s]
100%|██████████| 102/102 [00:54<00:00,  1.87it/s]


In [None]:
type(base_nodes), type(objects)

(list, list)

In [None]:
len(base_nodes), len(objects)

(119, 102)

In [None]:
no=10
base_nodes[no]

TextNode(id_='64ae2dd4-4af6-4d19-b2cd-2cd6886ae424', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='def2fac7-41b4-4ec4-a8e6-35d5d37b7265', node_type=<ObjectType.DOCUMENT: '4'>, metadata={}, hash='e95c462505103919ce55ee4f541e57baa15ddc1be6c02b9c8ce358d3de23f9a1'), <NodeRelationship.PREVIOUS: '2'>: RelatedNodeInfo(node_id='893798d8-7362-41b1-974f-dcf6efdec6a3', node_type=<ObjectType.TEXT: '1'>, metadata={'table_df': '{\' \': {0: \'Accounts payable\', 1: \'Short-term insurance reserves\', 2: \'Operating lease liabilities, current\', 3: \'Accrued and other current liabilities\', 4: \'Total current liabilities\', 5: \'Long-term insurance reserves\', 6: \'Long-term debt, net of current portion\', 7: \'Operating lease liabilities, non-current\', 8: \'Other long-term liabilities\', 9: \'Total liabilities\', 10: \'Redeemable non-controlling interests\', 11: \'Common stock\', 12:

In [None]:
objects[no]

IndexNode(id_='570cea54-ec9a-4715-91f7-4e50305dfb8c', embedding=None, metadata={'col_schema': 'Column: Redeemable Non-Controlling Interests\nType: Numeric\nSummary: None\n\nColumn: Common Stock\nType: Numeric\nSummary: None\n\nColumn: Additional Paid-In Capital\nType: Numeric\nSummary: None\n\nColumn: Other Comprehensive Income (Loss)\nType: Numeric\nSummary: None\n\nColumn: Accumulated Deficit\nType: Numeric\nSummary: None\n\nColumn: Redeemable Non-Controlling Interests\nType: Numeric\nSummary: None\n\nColumn: Total Equity\nType: Numeric\nSummary: None'}, excluded_embed_metadata_keys=['col_schema'], excluded_llm_metadata_keys=[], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='def2fac7-41b4-4ec4-a8e6-35d5d37b7265', node_type=<ObjectType.DOCUMENT: '4'>, metadata={}, hash='e95c462505103919ce55ee4f541e57baa15ddc1be6c02b9c8ce358d3de23f9a1'), <NodeRelationship.PREVIOUS: '2'>: RelatedNodeInfo(node_id='ea5ec3c4-94c3-42e7-b451-57e9128958e3', node_type=<ObjectType.TEXT:

In [None]:
no=100
base_nodes[no]

TextNode(id_='48bbac8a-cbd0-4b30-84da-b8c8866e6c66', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='def2fac7-41b4-4ec4-a8e6-35d5d37b7265', node_type=<ObjectType.DOCUMENT: '4'>, metadata={}, hash='e95c462505103919ce55ee4f541e57baa15ddc1be6c02b9c8ce358d3de23f9a1'), <NodeRelationship.PREVIOUS: '2'>: RelatedNodeInfo(node_id='ff90b2c8-f730-4fa0-ad5b-7199b2ca9383', node_type=<ObjectType.TEXT: '1'>, metadata={}, hash='ca7417d20e0a9e6dfc0a7f1ac263b38ba3cd0ed1021f077742783831251e023e'), <NodeRelationship.NEXT: '3'>: RelatedNodeInfo(node_id='262361fc-80c6-4487-b84f-47c547e50bbb', node_type=<ObjectType.INDEX: '3'>, metadata={'col_schema': 'Column: Item\nType: text\nSummary: Names of the financial metrics\n\nColumn: Amount\nType: text\nSummary: Descriptions indicating the variability or dependency of each metric'}, hash='f47e8f141f81d0fa0ae36a651e86b75e2e6311a1827e7a9e5eb190a29e016

In [None]:
base_nodes[no].text

"For example, we have agreed to not calculate consumer fares in excess of the maximum government-mandated fares in all major Indian cities where legal proceedings have limited the use of surge pricing. Further, in 2018, Honolulu, Hawaii became the first U.S. city to pass legislation to cap surge pricing if increased rates exceed the maximum fare set by the city. Additional regulation of our pricing models could increase our operating costs and adversely affect our business. Furthermore, our pricing model has been the subject of litigation and regulatory inquiries related to, among other things, the calculation of and statements regarding consumer fares and Driver earnings (including rates, fees, surcharges, and tolls), as well as the use of surge pricing during emergencies and natural disasters. In addition, an increasing number of municipalities have proposed delivery network fee caps with respect to our Delivery offering and caps on surge pricing with respect to our Mobility offering

In [None]:
objects[1]

IndexNode(id_='8bfab8b1-c516-4a89-b838-fb5bc573ea40', embedding=None, metadata={'col_schema': 'Column: Filing Status\nType: Text\nSummary: Indicates if the company is a large accelerated filer.\n\nColumn: Shell Company\nType: Text\nSummary: Indicates if the company is a shell company.\n\nColumn: Shares Outstanding\nType: Number\nSummary: Number of shares outstanding for the company.'}, excluded_embed_metadata_keys=['col_schema'], excluded_llm_metadata_keys=[], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='def2fac7-41b4-4ec4-a8e6-35d5d37b7265', node_type=<ObjectType.DOCUMENT: '4'>, metadata={}, hash='e95c462505103919ce55ee4f541e57baa15ddc1be6c02b9c8ce358d3de23f9a1'), <NodeRelationship.PREVIOUS: '2'>: RelatedNodeInfo(node_id='0d1721d2-70a7-4ebe-b8a7-32c94685653a', node_type=<ObjectType.TEXT: '1'>, metadata={}, hash='7a7ea5635526edce663f80175edaed48ef70db19842fa60022f4606f43b37847'), <NodeRelationship.NEXT: '3'>: RelatedNodeInfo(node_id='68ba97b7-45e8-479b-947f-f

##### Approach 2

In [31]:
# save markdown file to disk
md_file_path = '../data/test/uber_parsed_pdf.md'
with open(md_file_path, 'w', encoding='utf-8') as file:
    file.write(md_docs[0].text)


In [None]:
# # source and GPT result
# # https://community.retool.com/t/extract-data-from-markdown-table/28463
# # https://chatgpt.com/share/fd555bec-ece4-4336-953c-79c9f4c5cbb2
# import re
# import pandas as pd

# def extract_tables_from_markdown(md_file):
#     with open(md_file, 'r', encoding='utf-8') as f:
#         md_content = f.read()

#     # Regular expression to find Markdown tables
#     table_regex = r'\|(.+)\|\n\|?( *[-:]+ *\|)+\n((\|.*\|\n)+)'
#     tables = re.findall(table_regex, md_content, re.MULTILINE)

#     for idx, table in enumerate(tables):
#         header = [h.strip() for h in table[0].strip().strip('|').split('|')]
#         rows = [list(map(str.strip, row.strip().strip('|').split('|'))) for row in table[2].strip().split('\n')]
#         df = pd.DataFrame(rows, columns=header)
#         print(f'Table {idx + 1}:')
#         print(df)
#         print()
        
# # Example usage:
# extract_tables_from_markdown('../data/parsed_pdf.md')

In [32]:
import re

In [33]:
# Load the markdown file and replace each table with a single space
md_file_mod_path = '../data/test/uber_parsed_pdf_wo_tables.md'
with open(md_file_path, 'r', encoding='utf-8') as file:
    content = file.read()

# Regular expression to find Markdown tables
table_regex = r'\|(.+)\|\n\|?( *[-:]+ *\|)+\n((\|.*\|\n)+)'
# Replace tables with space
content = re.sub(table_regex, ' ', content, flags=re.MULTILINE)

# Save the modified content back to the file
with open(md_file_mod_path, 'w', encoding='utf-8') as file:
    file.write(content)
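
One limitation of the regex above: it requires every table row to end with a newline, so a table at the very end of the file without a trailing newline is left in place. A line-based alternative can avoid this and return the tables as a separate list in the same pass. This is a sketch under assumptions, not the approach used in this notebook: `split_tables` and `DELIMITER_ROW` are hypothetical names, and a table is assumed to be a header row starting with `|`, followed by a delimiter row, followed by rows starting with `|`.

```python
import re

# A delimiter row such as |---|:--:|---| contains only pipes, dashes, colons and spaces.
DELIMITER_ROW = re.compile(r"^\s*\|?\s*:?-+:?\s*(\|\s*:?-+:?\s*)*\|?\s*$")


def split_tables(md_text: str) -> tuple[str, list[str]]:
    """Return (text without tables, list of markdown tables)."""
    lines = md_text.splitlines()
    text_lines: list[str] = []
    tables: list[str] = []
    i = 0
    while i < len(lines):
        is_header = (
            lines[i].lstrip().startswith("|")
            and i + 1 < len(lines)
            and "|" in lines[i + 1]  # excludes plain '---' page separators
            and DELIMITER_ROW.match(lines[i + 1]) is not None
        )
        if is_header:
            j = i + 2
            # Consume body rows until the first line that is not a table row.
            while j < len(lines) and lines[j].lstrip().startswith("|"):
                j += 1
            tables.append("\n".join(lines[i:j]))
            i = j
        else:
            text_lines.append(lines[i])
            i += 1
    return "\n".join(text_lines), tables


sample_md = (
    "# Title\n\nSome text.\n\n|A|B|\n|---|---|\n|1|2|\n|3|4|\n\n"
    "More text.\n---\n|X|Y|\n|---|---|\n|9|8|"  # last table has no trailing newline
)
text_only, md_tables = split_tables(sample_md)
print(len(md_tables))
print(text_only)
```

The returned `md_tables` list can then be iterated separately from the table-free text.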


##### Approach 3

In [None]:
# parser returns a list of length 1
type(json_obs), len(json_obs)


(list, 1)

In [None]:
type(json_obs[0])

dict

In [None]:
# list item is a dictionary; pages contains details of parsed text; rest is all metadata
json_obs[0].keys()


dict_keys(['pages', 'job_metadata', 'job_id', 'file_path'])

In [None]:
# response is in 0th element of the list; 106 pages matches with raw pdf
# 0th element is a dictionary with parsed content in 'pages' key
type(json_obs[0]['pages']),len(json_obs[0]['pages'])

(list, 106)

In [None]:
# each element of pages is a dictionary, one per page, with these keys:
    # page - page number
    # text - parsed text
    # md - markdown
    # images - images on page
    # items - structured elements on the page (headings, text, tables)
type(json_obs[0]['pages'][0]), json_obs[0]['pages'][0].keys()

(dict, dict_keys(['page', 'text', 'md', 'images', 'items']))

In [None]:
pg = 5
# explore how pages with images and tables are stored
# pages are zero-indexed, so index pg is page pg + 1
json_obs[0]['pages'][pg].keys()

dict_keys(['page', 'text', 'md', 'images', 'items'])

In [None]:
# page number
json_obs[0]['pages'][pg]['page']


6

In [None]:
# raw text - also includes text from tables
print(json_obs[0]['pages'][pg]['text'])


                                                                               UBER TECHNOLOGIES, INC.
                                                      CONDENSED CONSOLIDATED STATEMENTS OF OPERATIONS
                                        (In millions, except share amounts which are reflected in thousands, and per share amounts)
                                                                                            (Unaudited)


                                                                                                                                             Three Months Ended March 31,
                                                                                                                                                2021                      2022
Revenue                                                                                                                                   $            2,903     $               6,854
Costs and expenses
Cost of re

In [None]:
# markdown text - on this page the table comes back as plain lines, not as a markdown table
print(json_obs[0]['pages'][pg]['md'])

#

# Uber Technologies, Inc. - Condensed Consolidated Statements of Operations

# Uber Technologies, Inc. - Condensed Consolidated Statements of Operations

Three Months Ended March 31, 2021 and 2022

2021
2022

Revenue
$2,903
$6,854

Cost of revenue
1,710
4,026

Operations and support
423
574

Sales and marketing
1,103
1,263

Research and development
515
587

General and administrative
464
632

Depreciation and amortization
212
254

Total costs and expenses
4,427
7,336

Loss from operations
(1,524)
(482)

Interest expense
(115)
(129)

Other income (expense), net
1,710
(5,557)

Income (loss) before income taxes and income (loss) from equity method investments
71
(6,168)

Provision for (benefit from) income taxes
185
(232)

Income (loss) from equity method investments
(8)
18

Net loss including non-controlling interests
(122)
(5,918)

Net loss attributable to Uber Technologies, Inc.
$(108)
$(5,930)

Net loss per share attributable to Uber Technologies, Inc. common stockholders:

Basic
$

In [None]:
# no images on the page
print(json_obs[0]['pages'][pg]['images']), len(json_obs[0]['pages'][pg]['images'])


[]


(None, 0)

In [None]:
# items - structured elements on the page (headings, text, tables), in order
print(json_obs[0]['pages'][pg]['items'])

[{'type': 'heading', 'lvl': 1, 'value': '', 'md': '#'}, {'type': 'heading', 'lvl': 1, 'value': 'Uber Technologies, Inc. - Condensed Consolidated Statements of Operations', 'md': '# Uber Technologies, Inc. - Condensed Consolidated Statements of Operations'}, {'type': 'heading', 'lvl': 1, 'value': 'Uber Technologies, Inc. - Condensed Consolidated Statements of Operations', 'md': '# Uber Technologies, Inc. - Condensed Consolidated Statements of Operations'}, {'type': 'text', 'value': 'Three Months Ended March 31, 2021 and 2022\n\n2021\n2022\n\nRevenue\n$2,903\n$6,854\n\nCost of revenue\n1,710\n4,026\n\nOperations and support\n423\n574\n\nSales and marketing\n1,103\n1,263\n\nResearch and development\n515\n587\n\nGeneral and administrative\n464\n632\n\nDepreciation and amortization\n212\n254\n\nTotal costs and expenses\n4,427\n7,336\n\nLoss from operations\n(1,524)\n(482)\n\nInterest expense\n(115)\n(129)\n\nOther income (expense), net\n1,710\n(5,557)\n\nIncome (loss) before income taxes an

* The first half of the table is not recognised as a table.
* Only the rows from net loss onwards are recognised as a table.
* Let's explore approach 3: convert the markdown to HTML.

In [None]:
import marko

In [None]:
html = marko.convert(md_docs[0].text)

In [None]:
print(html)

<h1></h1>
<h1>Quarterly Report</h1>
<h1>Financial Report - Uber Technologies, Inc.</h1>
<h1>Financial Information</h1>
<p>|Item|Details|
|---|---|
|Common Stock|Trading Symbol: UBER|
|Exchange|New York Stock Exchange|</p>
<h1>Company Information</h1>
<p>|Filing Status|Large accelerated filer|
|---|---|
|Shell Company|No|
|Shares Outstanding|1,963,660,253|</p>
<h1>Contact Information</h1>
<p>Address: 1515 3rd Street, San Francisco, California 94158</p>
<p>Phone: (415) 612-8582</p>
<h1>Images</h1>
<h2>No images included in the report.</h2>
<h1></h1>
<h1>Uber Technologies, Inc. - Quarterly Report</h1>
<h1>Uber Technologies, Inc. - Quarterly Report</h1>
<h1>Special Note Regarding Forward-Looking Statements</h1>
<p>Content related to forward-looking statements goes here...</p>
<h1>Part I - Financial Information</h1>
<h1>Item 1. Financial Statements (unaudited)</h1>
<p><strong>Condensed Consolidated Balance Sheets</strong>
|Condensed Consolidated Balance Sheets as of December 31, 2021|Conten
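
The `<p>|Item|Details|...` output above is what a CommonMark-only conversion produces, because pipe tables are a GitHub Flavored Markdown extension rather than part of core CommonMark. A hedged sketch, assuming marko's bundled GFM extension (`marko.ext.gfm`) is available; the import is guarded so the cell still runs without it. This only affects how well-formed markdown tables are rendered, and does not recover tables that LlamaParse returned as plain lines.

```python
# Assumption: the `marko` package is installed and ships the GFM extension.
try:
    from marko.ext.gfm import gfm
except ImportError:  # keep the cell runnable when marko is not installed
    gfm = None

# Small table taken from the markdown shown earlier in this notebook.
md_table = "|Item|Details|\n|---|---|\n|Exchange|New York Stock Exchange|\n"

# With the GFM extension, pipe tables should come back as <table> markup
# instead of being wrapped in a <p> paragraph.
html_with_tables = gfm.convert(md_table) if gfm is not None else None
print(html_with_tables)
```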

#### Finally

* LlamaParse does not parse some tables as tables.
* We found no way around this.

#### Our final approach - LlamaParse edition

* Parse the JSON response
    * Collect the images in a list, get alt text for each, and have the model ask questions about them
    * Collect the tables from each page in a list and have the model ask questions about them
* Parse the markdown response
    * Replace tables with a space
    * Chunk the remaining text and generate questions
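
The image and chunking steps of this plan can be sketched as follows. This is a minimal illustration under assumptions: `collect_images` and `chunk_text` are hypothetical helpers, image entries are assumed to carry only the `name`, `height` and `width` keys seen in the JSON output, and the chunker is a simple greedy paragraph packer with an arbitrary `max_chars` limit rather than a tuned strategy. Alt text and question generation are left out.

```python
from typing import Any


def collect_images(pages: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Flatten the per-page image lists, tagging each image with its page number."""
    images = []
    for page in pages:
        for img in page.get("images", []):
            images.append({"page": page.get("page"), **img})
    return images


def chunk_text(text: str, max_chars: int = 1000) -> list[str]:
    """Pack blank-line-separated paragraphs into chunks of at most max_chars.

    A single paragraph longer than max_chars is kept whole in its own chunk.
    """
    chunks: list[str] = []
    current = ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        candidate = f"{current}\n\n{para}" if current else para
        if not current or len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks


# Image entry copied from the JSON output shown earlier; the text is a toy example.
sample_pages = [
    {"page": 44, "images": [{"name": "img_p43_1.png", "height": 239, "width": 890}]},
    {"page": 45, "images": []},
]
images = collect_images(sample_pages)
chunks = chunk_text("aaa\n\nbbb\n\nccc", max_chars=8)
print(images)
print(chunks)
```

With the real data, the inputs would be `json_obs[0]['pages']` and the table-free markdown saved in Approach 2.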