# Introducing Parse Jobs API - Quick Start Guide

**Process *large* documents asynchronously with LandingAI ADE**

In [None]:
# ---
# LandingAI Applied AI Content Notebook Template
# ---
# Title: Introducing Parse Jobs API - Quick Start Guide
# Author: Ava Xia
# Description: Simple introduction to ADE Parse Jobs API for processing large documents
# Target Audience: Developers
# Content Type: Tutorial
# Publish Date: 2025-10-10
# ---

## Overview

The **ADE Parse Jobs API** enables processing of large documents (up to 1GB / 1,000 pages) that exceed the limits of the standard synchronous API.

### Key Benefits:
- 📄 **Large Documents**: Up to 1GB files and 1,000 pages
- ⚡ **Non-blocking**: Submit and check status later
- 📊 **Progress Tracking**: Monitor completion (0.0 to 1.0)
- 🔄 **Automatic Retries**: Built-in error handling

### Parse vs Parse Jobs Comparison:

| Feature | Standard Parse API | Parse Jobs API |
|---------|----------|----------|
| Max Size | 50MB | 1GB |
| Max Pages | 50 | 1,000 |
| Response | Immediate | Job ID |
| Best For | Small docs | Large docs |
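
The table above can be turned into a small routing check. The following is a minimal sketch, assuming 1 MB means 1024 × 1024 bytes and that the limits are inclusive; `ApiLimits` and `choose_api` are hypothetical helper names, not part of any LandingAI library.

```python
from dataclasses import dataclass

MB = 1024 * 1024  # assumption: binary megabytes


@dataclass(frozen=True)
class ApiLimits:
    max_bytes: int
    max_pages: int


# Limits copied from the comparison table above.
STANDARD_PARSE = ApiLimits(max_bytes=50 * MB, max_pages=50)
PARSE_JOBS = ApiLimits(max_bytes=1024 * MB, max_pages=1000)


def choose_api(size_bytes: int, page_count: int) -> str:
    """Return 'parse', 'parse_jobs', or 'too_large' for a document."""
    def fits(limits: ApiLimits) -> bool:
        return size_bytes <= limits.max_bytes and page_count <= limits.max_pages

    if fits(STANDARD_PARSE):
        return "parse"
    if fits(PARSE_JOBS):
        return "parse_jobs"
    return "too_large"


print(choose_api(10 * MB, 12))    # small document
print(choose_api(57 * MB, 421))   # over both standard limits
print(choose_api(2048 * MB, 30))  # over the 1GB limit
```

A document needs the Parse Jobs API as soon as it exceeds either standard limit, which is why both size and page count are checked together.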



## 🔧 Setup

In [None]:
# Install/upgrade required libraries: `requests` (HTTP client) and `python-dotenv` (loads .env files)
!pip install -U requests python-dotenv



In [None]:
import os
import json
import time
import requests
from pathlib import Path
from typing import Dict, Any, Optional
from dotenv import load_dotenv

# Load environment variables
load_dotenv()

# Configuration of the LandingAI API key
API_KEY = os.getenv('VISION_AGENT_API_KEY')

if not API_KEY:
    print("⚠️ Please set VISION_AGENT_API_KEY environment variable")
    print("Get your key at: https://docs.landing.ai/ade/agentic-api-key")
else:
    print("✅ API Key configured")


✅ API Key configured


In [None]:
# No trailing slash: endpoint paths below are appended as f'{BASE_URL}/v1/...'
BASE_URL = 'https://api.va.landing.ai'
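
Endpoint URLs in this notebook are built by interpolating the base URL with a path that starts with `/`, so a trailing slash on the base value would produce a double slash. A minimal sketch of a slash-safe join follows; `join_url` is a hypothetical helper, not something the notebook or the API requires.

```python
def join_url(base: str, path: str) -> str:
    """Join a base URL and a path with exactly one slash between them."""
    return f"{base.rstrip('/')}/{path.lstrip('/')}"


print(join_url("https://api.va.landing.ai/", "/v1/ade/parse/jobs"))
# → https://api.va.landing.ai/v1/ade/parse/jobs
```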

## 🔄  Workflow

The Parse Jobs API follows a simple 3-step workflow, plus an optional fourth step that chains the three together:

```
1. SUBMIT → Returns job_id
2. MONITOR → Check progress
3. RETRIEVE → Get results
4. INTEGRATED WORKFLOW → (Optional) Combine the three steps into one function
```

## 📤 Step 1: Submit Document

In [None]:
import os
from pathlib import Path
from typing import Optional

import requests

def submit_document(file_path: str, api_key: str) -> Optional[str]:
    """
    Upload a PDF to LandingAI’s ADE endpoint and get back the job_id.
    Returns the job_id string when the POST succeeds, else None.
    """
    # ── 1.  Resolve & sanity-check the path ───────────────────────────────────
    p = Path(file_path).expanduser().resolve()
    if not p.exists():
        print(f"❌ File not found: {p}")
        return None

    print(f"📄 File: {p.name}  |  📏 {p.stat().st_size / 1_048_576:.1f} MB")

    # ── 2.  Prepare request ──────────────────────────────────────────────────
    url = f'{BASE_URL}/v1/ade/parse/jobs'
    headers = {"Authorization": f"Bearer {api_key}"}

    # open() inside a with-block → auto-close even if an exception fires
    with p.open("rb") as fh:
        files = {"document": fh}
        resp  = requests.post(url, headers=headers, files=files, timeout=30)

    # ── 3.  Handle response ──────────────────────────────────────────────────
    if resp.status_code in (200, 202):                    # 202 = queued/accepted
        data   = resp.json()
        job_id = data.get("job_id")
        if job_id:
            print(f"✅ Job accepted — job_id: {job_id}")
            return job_id
        else:
            print("❌ Response missing job_id:", data)
            return None

    # Any non-success status drops through to here
    print(f"❌ Upload failed ({resp.status_code}): {resp.text}")
    return None


In [None]:
job_id = submit_document('one_huge_file.pdf', API_KEY)

📄 File: one_huge_file.pdf  |  📏 57.6 MB
✅ Job accepted — job_id: cmgmldz540004mj8do5uwzf0l
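
Because processing can take many minutes, it helps to record the `job_id` as soon as it is returned so the job can still be checked after a kernel restart. The sketch below appends records to a local JSON file; the file name `ade_jobs.json` and both helper names are assumptions made for illustration.

```python
import json
import time
from pathlib import Path


def load_job_records(store: str = "ade_jobs.json") -> list:
    """Return all saved job records, or an empty list if the file is missing."""
    path = Path(store)
    if not path.exists():
        return []
    return json.loads(path.read_text(encoding="utf-8"))


def save_job_record(job_id: str, source_file: str, store: str = "ade_jobs.json") -> dict:
    """Append one job record to the local JSON file and return it."""
    records = load_job_records(store)
    record = {
        "job_id": job_id,
        "source_file": source_file,
        "submitted_at": time.time(),  # seconds since the epoch
    }
    records.append(record)
    Path(store).write_text(json.dumps(records, indent=2), encoding="utf-8")
    return record
```

A typical call right after submission would be `save_job_record(job_id, 'one_huge_file.pdf')`.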


## 📊 Step 2: Monitor Job Status

In [None]:
def check_job_status(job_id: str, api_key: str) -> Optional[Dict[str, Any]]:
    """
    Check the status of an async job.

    Args:
        job_id: The job ID from submission
        api_key: Your API key

    Returns:
        Status dictionary with progress and results, or None if the request fails
    """
    url = f'{BASE_URL}/v1/ade/parse/jobs/{job_id}'
    headers = {'Authorization': f'Bearer {api_key}'}

    try:
        response = requests.get(url, headers=headers, timeout=30)

        if response.status_code == 200:
            data = response.json()
            status = data.get('status')
            progress = (data.get('progress') or 0) * 100  # tolerate a missing or null value

            print(f"Status: {status} | Progress: {progress:.0f}%")
            return data
        else:
            print(f"❌ Error checking status: {response.status_code}")
            return None

    except Exception as e:
        print(f"❌ Error: {e}")
        return None

In [None]:
# Example: Poll until complete
def wait_for_completion(job_id: str, api_key: str, timeout: int = 3600) -> Optional[Dict[str, Any]]:
    """
    Wait for job to complete with polling.
    """
    start_time = time.time()

    while time.time() - start_time < timeout:
        status_data = check_job_status(job_id, api_key)

        if status_data:
            status = status_data.get('status')

            if status == 'completed':
                print("✅ Job completed!")
                return status_data
            elif status == 'failed':
                print(f"❌ Job failed: {status_data.get('failure_reason')}")
                return None

        time.sleep(30)  # Poll every 30 seconds

    print("⏱️ Timeout waiting for completion")
    return None
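
Since progress is reported as a fraction between 0.0 and 1.0, a rough time-remaining estimate can be derived while polling. This is a minimal sketch that assumes progress advances roughly linearly, which the API does not promise; `estimate_remaining_seconds` is a hypothetical helper.

```python
from typing import Optional


def estimate_remaining_seconds(progress: float, elapsed_seconds: float) -> Optional[float]:
    """Linear extrapolation of remaining time; None when no estimate is possible."""
    if not 0.0 < progress <= 1.0:
        return None  # nothing to extrapolate from yet, or an out-of-range value
    return elapsed_seconds * (1.0 - progress) / progress


# 25% done after 2 minutes suggests about 6 more minutes.
print(estimate_remaining_seconds(0.25, 120.0))  # → 360.0
```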

In [None]:
# Example usage:
result = wait_for_completion(job_id, API_KEY)

Status: processing | Progress: 0%
Status: processing | Progress: 5%
Status: processing | Progress: 7%
Status: processing | Progress: 12%
Status: processing | Progress: 17%
Status: processing | Progress: 19%
Status: processing | Progress: 21%
Status: processing | Progress: 26%
Status: processing | Progress: 28%
Status: processing | Progress: 31%
Status: processing | Progress: 36%
Status: processing | Progress: 38%
Status: processing | Progress: 40%
Status: processing | Progress: 45%
Status: processing | Progress: 48%
Status: processing | Progress: 52%
Status: processing | Progress: 55%
Status: processing | Progress: 59%
Status: processing | Progress: 62%
Status: processing | Progress: 64%
Status: processing | Progress: 67%
Status: processing | Progress: 69%
Status: processing | Progress: 74%
Status: processing | Progress: 76%
Status: processing | Progress: 78%
Status: processing | Progress: 83%
Status: processing | Progress: 86%
Status: processing | Progress: 88%
Status: processing | Pr

## 📥 Step 3: Retrieve Results

### Handling API Results

The API returns job results in one of two ways, depending on the size of the processed output. Small outputs are returned directly in the status response. Larger outputs are written to a separate file, and a temporary download link is provided in the `output_url` field.

- **`output_url`**: `string | null`  
  A secure, temporary link for downloading the job's results. Its value is either a **string** (the URL) or **`null`** when the results are returned inline.

Think of it like receiving mail. 📬 A small letter fits directly into your mailbox, but for a large package, you get a slip telling you where to pick it up.

- **Direct Results (Output < 1 MB)**:  
  If the processed output is small, the API returns it directly within the `data` field of the response. In this case, `output_url` will be `null`.

- **URL Link (Output ≥ 1 MB)**:  
  If the output is large, the API places the results in a separate JSON file and provides a link to it in the `output_url` field. This prevents the main API response from becoming slow or unwieldy. In this scenario, the `data` field will be `null`.

> **Note:**  
> The URL is generated only when the job’s `status` is **`completed`**. To get the results, make a separate HTTP **GET** request to this URL to download the JSON file containing the final markdown content.
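
The branching described above can be sketched as a pure function that inspects a status payload and reports where the results live. The payloads below are illustrative shapes built from the field names in this section (`status`, `data`, `output_url`), not captured API responses, and `locate_results` is a hypothetical helper.

```python
from typing import Any, Dict, Optional, Tuple


def locate_results(status_data: Dict[str, Any]) -> Tuple[str, Optional[Any]]:
    """Return (location, value) where location is one of
    'not_ready', 'inline', 'url', or 'missing'."""
    if status_data.get("status") != "completed":
        return ("not_ready", None)
    if status_data.get("data") is not None:
        return ("inline", status_data["data"])
    if status_data.get("output_url"):
        return ("url", status_data["output_url"])
    return ("missing", None)


# Illustrative payloads (assumed shapes, for demonstration only).
small = {"status": "completed", "data": {"markdown": "# Title"}, "output_url": None}
large = {"status": "completed", "data": None, "output_url": "https://example.com/results.json"}
running = {"status": "processing", "progress": 0.4}

for payload in (small, large, running):
    print(locate_results(payload)[0])
# → inline
# → url
# → not_ready
```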

In [None]:
import requests
import json
from typing import Optional

def get_results(job_id: str, api_key: str, save_to_file: bool = True) -> Optional[str]:
    """
    Retrieve results from a completed job, handling both direct data responses
    (for small files) and fetching from an output URL (for large files).

    Args:
        job_id: The job ID.
        api_key: Your API key.
        save_to_file: Whether to save the markdown content to a file.

    Returns:
        Markdown content if successful, otherwise None.
    """
    # 1. Check the job status
    status_data = check_job_status(job_id, api_key)
    if not status_data or status_data.get('status') != 'completed':
        status = status_data.get('status', 'unknown') if status_data else 'unknown'
        print(f"⚠️ Job not completed yet. Current status: '{status}'")
        return None

    markdown = ''

    # 2. Check if results are returned directly (for smaller files)
    if status_data.get('data') is not None:
        print("✅ Job complete. Results found directly in API response.")
        data = status_data.get('data', {})
        markdown = data.get('markdown', '')

    # 3. If not, fetch results from the output URL (for larger files)
    else:
        output_url = status_data.get('output_url')
        if not output_url:
            print("❌ Job is complete, but no output URL or direct data was found.")
            return None

        print("✅ Job complete. Fetching results from URL for large file...")
        try:
            response = requests.get(output_url, timeout=60)
            response.raise_for_status()  # Raise an HTTPError for bad responses (4xx or 5xx)
            results_data = response.json()
            markdown = results_data.get('markdown', '')
        except requests.exceptions.RequestException as e:
            print(f"❌ Failed to fetch results from URL: {e}")
            return None
        except json.JSONDecodeError:
            print("❌ Failed to parse the fetched results as JSON.")
            return None

    # 4. Process the markdown, regardless of where it came from
    if markdown:
        print(f"📄 Retrieved {len(markdown)} characters of markdown.")

        if save_to_file:
            output_file = f'{job_id}_output.md'
            with open(output_file, 'w', encoding='utf-8') as f:
                f.write(markdown)
            print(f"💾 Saved to: {output_file}")

        # Display metadata from the top-level status object
        metadata = status_data.get('metadata', {})
        if metadata:
            print(f"\n📊 Processing stats:")
            print(f"  • Pages: {metadata.get('page_count', 'N/A')}")
            print(f"  • Time: {metadata.get('duration_ms', 0) / 1000:.1f}s")
            print(f"  • Credits: {metadata.get('credit_usage', 'N/A')}")

        return markdown
    else:
        print("❌ No markdown content found in the results.")
        return None

In [None]:
# --- Example Usage ---
markdown_content = get_results(job_id, API_KEY, save_to_file=True)

Status: completed | Progress: 100%
✅ Job complete. Fetching results from URL for large file...
📄 Retrieved 1331672 characters of markdown.
💾 Saved to: cmgmldz540004mj8do5uwzf0l_output.md

📊 Processing stats:
  • Pages: 421
  • Time: 946.0s
  • Credits: 1263.0
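
The processing stats above can be reduced to per-page figures for rough capacity planning. This is a sketch of the arithmetic only: `duration_ms=946_000` is inferred from the displayed `946.0s`, the numbers describe a single run, and the result should not be read as a pricing or throughput guarantee. `summarize_stats` is a hypothetical helper.

```python
def summarize_stats(page_count: int, duration_ms: float, credit_usage: float) -> dict:
    """Derive per-page figures from the metadata fields printed by get_results."""
    if page_count <= 0:
        raise ValueError("page_count must be positive")
    seconds = duration_ms / 1000
    return {
        "seconds_per_page": seconds / page_count,
        "pages_per_minute": page_count / seconds * 60,
        "credits_per_page": credit_usage / page_count,
    }


# Values taken from the run shown above (421 pages, 946.0 s, 1263.0 credits).
stats = summarize_stats(page_count=421, duration_ms=946_000, credit_usage=1263.0)
for name, value in stats.items():
    print(f"{name}: {value:.2f}")
```

For this run the ratio works out to 3 credits per page.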


## Utility Function to Preview the Result

In [None]:
def preview_markdown(file_path: str, num_chars: int = 1000):
    """
    Prints the first few characters of a text file.

    Args:
        file_path: The path to the file you want to preview.
        num_chars: The number of characters to display.
    """
    try:
        with open(file_path, 'r', encoding='utf-8') as f:
            # Read the specified number of characters
            preview_content = f.read(num_chars)

            print(f"📄 Previewing first {num_chars} characters of '{file_path}':")
            print("--------------------- START OF FILE ---------------------")
            print(preview_content)
            print("---------------------- END OF PREVIEW ----------------------")

            # Add a small note if the preview is the same length as requested,
            # which implies there's probably more content in the file.
            if len(preview_content) == num_chars:
                print("(File continues...)")

    except FileNotFoundError:
        print(f"❌ Error: The file '{file_path}' was not found.")
    except Exception as e:
        print(f"An error occurred: {e}")

In [None]:
# Use the job_id from the previous step to construct the filename
job_id = 'cmgmldz540004mj8do5uwzf0l'
markdown_filename = f'{job_id}_output.md'

# Call the function to print the first 1000 characters
preview_markdown(markdown_filename)

📄 Previewing first 1000 characters of 'cmgmldz540004mj8do5uwzf0l_output.md':
--------------------- START OF FILE ---------------------
<a id='8f83a392-9759-49c1-b816-aade0dbe1a8f'></a>

MACHINE
LEARNING

<::A diagram resembling a neural network with interconnected nodes.
: figure::>

TOM M. MITCHELL

<a id='99a96c5b-2a2e-44ad-bf1b-da9878254eb6'></a>

Machine Learning

<a id='bca724ab-3e4b-414d-9cc8-d923d7882a40'></a>

Tom M. Mitchell

<a id='a85d4599-c8e4-46d6-8c12-bcfaba697827'></a>

## Product Details
*   **Hardcover**: 432 pages ; Dimensions (in inches): 0.75 x 10.00 x 6.50
*   **Publisher**: McGraw-Hill Science/Engineering/Math; (March 1, 1997)
*   **ISBN**: 0070428077
*   **Average Customer Review**: ⭐⭐⭐⭐⭒ Based on 16 reviews.
*   **Amazon.com Sales Rank**: 42,816
*   **Popular in**: Redmond, WA (#17), Ithaca, NY (#9)

<a id='8d37667c-0bd9-4920-a7ac-b4c1cf2a7db9'></a>

## Editorial Reviews

**From Book News, Inc.** An introductory text on primary approaches to machine learning and
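
The preview shows each block of content preceded by an `<a id='...'></a>` anchor. Assuming that pattern holds throughout the file (it is inferred from this preview, not from a documented format), the markdown can be split into per-anchor chunks with a regular expression. `split_by_anchor` is a hypothetical helper.

```python
import re
from typing import List, Tuple

# Matches anchors of the form seen in the preview: <a id='uuid'></a>
ANCHOR_RE = re.compile(r"<a id='([0-9a-fA-F-]+)'></a>")


def split_by_anchor(markdown: str) -> List[Tuple[str, str]]:
    """Return (anchor_id, text) pairs, one per anchor found in the markdown."""
    matches = list(ANCHOR_RE.finditer(markdown))
    chunks = []
    for i, match in enumerate(matches):
        # A chunk runs from the end of its anchor to the start of the next one.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(markdown)
        chunks.append((match.group(1), markdown[match.end():end].strip()))
    return chunks


sample = (
    "<a id='99a96c5b-2a2e-44ad-bf1b-da9878254eb6'></a>\n\n"
    "Machine Learning\n\n"
    "<a id='bca724ab-3e4b-414d-9cc8-d923d7882a40'></a>\n\n"
    "Tom M. Mitchell\n"
)
for anchor_id, text in split_by_anchor(sample):
    print(anchor_id[:8], "->", text)
# → 99a96c5b -> Machine Learning
# → bca724ab -> Tom M. Mitchell
```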

## Step 4: Assemble the Entire Pipeline

In [None]:
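def process_large_document(file_path: str, api_key: str) -> Optional[Dict[str, str]]:
    """
    Complete workflow for processing a large document.

    Returns a dict with 'job_id' and 'markdown', or None if any step fails.
    """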
    print("🚀 ASYNC DOCUMENT PROCESSING WORKFLOW")
    print("="*50)

    # Step 1: Submit
    print("\n1️⃣ Submitting document...")
    job_id = submit_document(file_path, api_key)

    if not job_id:
        print("Failed to submit document")
        return

    # Step 2: Wait for completion
    print("\n2️⃣ Waiting for processing...")
    result = wait_for_completion(job_id, api_key)

    if not result:
        print("Processing failed or timed out")
        return

    # Step 3: Get results
    print("\n3️⃣ Retrieving results...")
    markdown = get_results(job_id, api_key)

    if not markdown:
        print("Failed to retrieve results")
        return

    return {
        'job_id': job_id,
        'markdown': markdown,
    }
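
When several documents need processing, one option is to submit them all first and then poll the jobs together, rather than running the pipeline once per file. The sketch below assumes the account allows multiple jobs in flight at once, which this guide does not confirm. `process_batch` is a hypothetical helper that takes the submit and status functions as parameters so it can be exercised without network access.

```python
import time
from typing import Callable, Dict, Iterable, Optional


def process_batch(
    file_paths: Iterable[str],
    submit: Callable[[str], Optional[str]],
    get_status: Callable[[str], Optional[dict]],
    poll_seconds: float = 30.0,
    timeout: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, str]:
    """Submit every file, then poll all jobs. Returns {file_path: final_state}."""
    pending: Dict[str, str] = {}  # job_id -> file_path
    final: Dict[str, str] = {}    # file_path -> final state

    for path in file_paths:
        job_id = submit(path)
        if job_id:
            pending[job_id] = path
        else:
            final[path] = "submit_failed"

    deadline = time.monotonic() + timeout
    while pending and time.monotonic() < deadline:
        for job_id in list(pending):
            data = get_status(job_id) or {}
            if data.get("status") in ("completed", "failed"):
                final[pending.pop(job_id)] = data["status"]
        if pending:
            sleep(poll_seconds)

    # Anything still pending when the deadline passes is reported as timed out.
    for path in pending.values():
        final[path] = "timed_out"
    return final
```

With the functions defined in this notebook, a call could look like `process_batch(paths, lambda p: submit_document(p, API_KEY), lambda j: check_job_status(j, API_KEY))`, followed by `get_results` for each completed job.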

In [None]:
# Example usage:
results = process_large_document('one_large_file.pdf', API_KEY)

🚀 ASYNC DOCUMENT PROCESSING WORKFLOW

1️⃣ Submitting document...
📄 File: one_large_file.pdf  |  📏 39.0 MB
✅ Job accepted — job_id: cmgmnkyer0004f4ee511uo9oc

2️⃣ Waiting for processing...
Status: processing | Progress: 0%
Status: processing | Progress: 10%
Status: processing | Progress: 31%
Status: processing | Progress: 41%
Status: processing | Progress: 62%
Status: processing | Progress: 72%
Status: processing | Progress: 72%
Status: processing | Progress: 82%
Status: processing | Progress: 93%
Status: completed | Progress: 100%
✅ Job completed!

3️⃣ Retrieving results...
Status: completed | Progress: 100%
✅ Job complete. Results found directly in API response.
📄 Retrieved 249294 characters of markdown.
💾 Saved to: cmgmnkyer0004f4ee511uo9oc_output.md

📊 Processing stats:
  • Pages: 97
  • Time: 260.0s
  • Credits: 291.0


## 📋 Quick Reference

### API Endpoints:
```
POST /v1/ade/parse/jobs          → Submit document
GET  /v1/ade/parse/jobs/{job_id} → Check status
GET  /v1/ade/parse/jobs          → List all jobs
```

### Python Workflow:
```python
# 1. Submit
job_id = submit_document('doc.pdf', API_KEY)

# 2. Wait
result = wait_for_completion(job_id, API_KEY)

# 3. Retrieve
markdown = get_results(job_id, API_KEY)
```

### 💡 Tips:
- Poll every 10-30 seconds for large documents
- Save job_id immediately after submission
- Check file size and page count before submission (up to 1GB / 1,000 pages)
- Use exponential backoff for retries
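
The polling and backoff tips above can be combined into a delay schedule that starts at 10 seconds and doubles up to a 30-second cap. This is a minimal sketch; `backoff_delays` and its default values are assumptions chosen to match the 10-30 second range, not settings prescribed by the API.

```python
import random
from itertools import islice
from typing import Iterator


def backoff_delays(base: float = 10.0, factor: float = 2.0, cap: float = 30.0,
                   jitter: float = 0.0) -> Iterator[float]:
    """Yield base, base*factor, ... capped at `cap`.

    `jitter` is a fraction (e.g. 0.1 for ±10%) applied randomly to each delay.
    """
    delay = base
    while True:
        spread = delay * jitter
        yield delay + random.uniform(-spread, spread)
        delay = min(delay * factor, cap)


print(list(islice(backoff_delays(), 5)))
# → [10.0, 20.0, 30.0, 30.0, 30.0]
```

In a polling loop, each `time.sleep(30)` call could be replaced with `time.sleep(next(delays))`, where `delays = backoff_delays(jitter=0.1)`.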

### 🔗 Resources:
- Docs: https://docs.landing.ai/ade
- API Key: https://va.landing.ai/settings/api-key

---

**That's it! You're ready to process large documents with the ADE Parse Jobs API.** 🎉