# Smart Document Auditor Notebook

This notebook accompanies the blog post on building a **Smart Document Auditor**. It demonstrates how to use **BeautifulSoup** to parse HTML, how to download a publicly available financial report as a PDF, and how to integrate **LandingAI's Agentic Document Extraction (ADE)** to extract structured data from that PDF.

> **Note:** In restricted environments, network calls to external sites may return an HTTP 403 error, and the ADE library (`agentic-doc`) may not be installed. Code cells that involve web requests or ADE parsing need to be run in an environment with internet access and the required library installed.

## Setup

Install the required libraries if you haven't already. BeautifulSoup is distributed as the `beautifulsoup4` package and imported as `bs4`, and `requests` is used for HTTP requests. LandingAI's ADE library is installed as `agentic-doc` and imported as `agentic_doc`. You can install all three with:

```bash
pip install beautifulsoup4 requests agentic-doc
```

You also need to set your LandingAI API key in an environment variable called `VISION_AGENT_API_KEY` before using ADE.
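
A small check at the top of the notebook can surface a missing key before any ADE call is made. The sketch below is one possible way to do that; the helper name `require_api_key` is hypothetical and is not part of `agentic-doc`.

```python
import os


def require_api_key(env=os.environ, name="VISION_AGENT_API_KEY"):
    """Return the key stored under `name`, or raise if it is missing or blank.

    Hypothetical helper for this notebook; not an agentic-doc function.
    """
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(
            f"{name} is not set. Export it in your shell before starting the notebook."
        )
    return key
```

Calling `require_api_key()` in the first cell fails fast with a readable message instead of an authentication error later on.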

In [None]:
# Import required libraries
try:
    from bs4 import BeautifulSoup
    import requests
    print('Libraries imported successfully.')
except ImportError as e:
    print('You may need to install missing libraries:', e)


Libraries imported successfully.


## Parsing HTML with BeautifulSoup

Below we construct a simple HTML document and use BeautifulSoup to find all the links. This demonstrates the basic API of BeautifulSoup.

In [None]:
from bs4 import BeautifulSoup

html_doc = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
</p>
</body></html>'''

soup = BeautifulSoup(html_doc, 'html.parser')
links = [a.get('href') for a in soup.find_all('a')]
print('Extracted links:', links)


Extracted links: ['http://example.com/elsie', 'http://example.com/lacie', 'http://example.com/tillie']
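
Beyond `find_all`, BeautifulSoup also supports CSS selectors through `select` and attribute lookups through `find`. The following is a minimal sketch on a shortened copy of the same sample HTML, showing both styles; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

html_doc = """
<p class="story">
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
</p>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# CSS selector: every <a> element that has the class "sister"
names = [a.get_text(strip=True) for a in soup.select("a.sister")]

# Attribute lookup: the single element whose id is "link2" (None if absent)
second = soup.find(id="link2")
second_href = second.get("href") if second else None

print(names)        # ['Elsie', 'Lacie', 'Tillie']
print(second_href)  # http://example.com/lacie
```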


## Downloading a PDF financial statement

In our example we locate Apple’s FY2024 Q1 consolidated financial statements using BeautifulSoup. Rather than scraping a Form 10‑K from EDGAR, we parse Apple’s press‑release page to discover the link to the financial statements PDF. This approach demonstrates how BeautifulSoup can be used to find relevant documents on a company’s website.

In [None]:
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

press_release_url = "https://www.apple.com/newsroom/2024/02/apple-reports-first-quarter-results/"

headers = {
    "User-Agent": "SmartDocumentAuditor/1.0 (ankit.khare@landing.ai)",
    "Accept-Language": "en-US,en;q=0.9",
}

# A timeout keeps the cell from hanging if the site does not respond
resp = requests.get(press_release_url, headers=headers, timeout=30)
resp.raise_for_status()

soup = BeautifulSoup(resp.text, "html.parser")

# Take the first link whose text contains "View PDF"; on this page it
# points to the consolidated financial statements
pdf_url = None
for link in soup.find_all("a", href=True):
    if "View PDF" in link.get_text(strip=True):
        # urljoin resolves relative links against the page URL and
        # leaves absolute links unchanged
        pdf_url = urljoin(press_release_url, link["href"])
        break

if pdf_url:
    print("Found PDF URL:", pdf_url)
else:
    print("Could not find the PDF link.")

Found PDF URL: https://www.apple.com/newsroom/pdfs/fy2024-q1/FY24_Q1_Consolidated_Financial_Statements.pdf
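
The cell above only locates the PDF URL. If you prefer to keep a local copy (ADE also accepts a local file path), a sketch along the following lines could save the downloaded bytes. The helper `save_pdf` is hypothetical; the check relies on the fact that PDF files begin with the marker `%PDF-`, which helps catch an HTML error page saved under a `.pdf` name.

```python
from pathlib import Path


def save_pdf(content: bytes, destination: str) -> Path:
    """Write `content` to `destination` after checking that it looks like a PDF."""
    # PDF files start with the ASCII marker "%PDF-"; an HTML error page will not.
    if not content.startswith(b"%PDF-"):
        raise ValueError("Response body does not look like a PDF.")
    path = Path(destination)
    path.write_bytes(content)
    return path


# Possible use with the variables from the cell above (requires network access):
# pdf_resp = requests.get(pdf_url, headers=headers, timeout=30)
# pdf_resp.raise_for_status()
# save_pdf(pdf_resp.content, "FY24_Q1_Consolidated_Financial_Statements.pdf")
```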


## Integrating ADE for Document Extraction

You can use LandingAI's Agentic Document Extraction (ADE) library to parse the PDF directly from the URL and extract structured data. ADE accepts either a local file path or a direct PDF URL, so you can choose the approach that suits your environment. The optional `extraction_model` parameter takes a Pydantic model that defines the schema of the fields to extract. For the full list of available options, see https://docs.landing.ai/ade/ade-parse-docs

In [None]:
# Example code for using ADE to extract financial metrics
from pydantic import BaseModel, Field
from agentic_doc.parse import parse
import os

class FinancialMetrics(BaseModel):
    total_revenue: float = Field(description="Total revenue in USD")
    net_income: float = Field(description="Net income in USD")
    diluted_eps: float = Field(description="Diluted earnings per share")

# Read the key from the environment; never hardcode credentials in a notebook
if not os.environ.get("VISION_AGENT_API_KEY"):
    raise RuntimeError("Set VISION_AGENT_API_KEY before running this cell.")

# Parse the PDF directly from the URL
results = parse(pdf_url, extraction_model=FinancialMetrics)

2025-08-04 23:17:31 [info] API key is valid. [agentic_doc.utils] (utils.py:42)
2025-08-04 23:17:31 [info] Parsing 1 documents [agentic_doc.parse] (parse.py:280)
2025-08-04 23:17:31 [info] Downloading file from 'https://www.apple.com/newsroom/pdfs/fy2024-q1/FY24_Q1_Consolidated_Financial_Statements.pdf' to '/tmp/tmp2es4novp/FY24_Q1_Consolidated_Financial_Statements.pdf' [agentic_doc.utils] (utils.py:442)
HTTP Request: GET https://www.apple.com/newsroom/pdfs/fy2024-q1/FY24_Q1_Consolidated_Financial_Statements.pdf "HTTP/1.1 200 OK" (_client.py:1025)
2025-08-04 23:17:31 [info] Splitting PDF: '/tmp/tmp2es4novp/FY24_Q1_Consolidated_Financial_Statements.pdf' into 0 parts under '/tmp/tmp578bpuc2' [agentic_doc.utils] (utils.py:236)
2025-08-04 23:17:31 [info] Created /tmp/tmp578bpuc2/FY24_Q1_Consolidated_Financial_Statements_1.pdf [agentic_doc.utils] (utils.py:252)
2025-08-04 23:17:31 [info] Start parsing document part: 'File name: FY24_Q1_Consolidated_Financial_Statements_1.pdf Page: [0:2]' [agentic_doc.parse] (parse.py:670)
HTTP Request: POST https://api.va.landing.ai/v1/tools/agentic-document-analysis "HTTP/1.1 200 OK" (_client.py:1025)
2025-08-04 23:17:56 [info] Time taken to successfully parse a document chunk: 24.98 seconds [agentic_doc.parse] (parse.py:823)
2025-08-04 23:17:56 [info] Successfully parsed document part: 'File name: FY24_Q1_Consolidated_Financial_Statements_1.pdf Page: [0:2]' [agentic_doc.parse] (parse.py:679)

Parsing document parts from 'FY24_Q1_Consolidated_Financial_Statements.pdf': 100%|██████████| 1/1 [00:24<00:00, 24.99s/it]
Parsing documents: 100%|██████████| 1/1 [00:25<00:00, 25.36s/it]

## Analysing the extracted data

After you have extracted the key metrics with ADE, you can perform your own analysis. For example, you might compute year-over-year growth in revenue or profit margins, or compare the values against analyst forecasts. The cell below reads the metrics from the `results` object returned by `parse()` and computes the profit margin.

In [None]:
# After calling parse()
metrics = results[0].extraction
print("Extracted metrics:", metrics)

# Access fields as attributes, not dict keys
revenue = metrics.total_revenue
net_income = metrics.net_income
eps = metrics.diluted_eps

profit_margin = net_income / revenue

print(f"Total revenue: ${revenue/1e9:.2f}B")
print(f"Net income: ${net_income/1e9:.2f}B")
print(f"Diluted EPS: {eps:.2f}")
print(f"Profit margin: {profit_margin:.2%}")


Extracted metrics: total_revenue=119575000000.0 net_income=33916000000.0 diluted_eps=2.18
Total revenue: $119.58B
Net income: $33.92B
Diluted EPS: 2.18
Profit margin: 28.36%
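
Year-over-year growth, mentioned above, can be computed in a similar way once a prior-period figure is available. The sketch below is a minimal example; the function name `yoy_growth` is hypothetical, and `prior_revenue` is a placeholder rather than a reported figure, so replace it with the value from the corresponding prior-year statement.

```python
def yoy_growth(current: float, prior: float) -> float:
    """Return fractional year-over-year growth, e.g. 0.05 for +5%."""
    if prior == 0:
        raise ValueError("Prior-period value must be non-zero.")
    return (current - prior) / abs(prior)


# Current-quarter revenue as extracted above.
current_revenue = 119_575_000_000.0
# Placeholder, NOT a reported figure: substitute the real prior-year value.
prior_revenue = 100_000_000_000.0

growth = yoy_growth(current_revenue, prior_revenue)
print(f"Revenue YoY growth: {growth:.2%}")
```

Dividing by the absolute value of the prior figure keeps the sign of the result meaningful when the prior value is negative, as can happen with net income.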
