# 6. Document Processing Status & Extraction Jobs

*Track document readiness, pipeline progress, and job lifecycle with clear statuses*

---


## What you will do

By the end of this notebook you will:
- Upload a document and observe initial statuses
- See when and how extraction jobs are created
- Drive status transitions through a simulated pipeline (OCR → LLM → Done)
- Inspect where job errors are recorded (the `error_message` field on extraction jobs)

---


## 1. Prerequisites

- Completed notebooks 1–5
- Docker services running (`docker compose up -d`)
- `data/loan_application.pdf` available

---


## 2. Environment configuration

### 2.1 Imports and project root detection


In [1]:
import sys
import os
from pathlib import Path
import uuid
import json
from datetime import datetime

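# Detect the project root dynamically, starting from the current working directory
current_directory = Path.cwd()
project_root_directory = None

# Walk upward until pyproject.toml is found. The current directory is checked too,
# so re-running this cell after the os.chdir() below still works.
for candidate in (current_directory, *current_directory.parents):
    if (candidate / "pyproject.toml").exists():
        project_root_directory = candidate
        break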

if project_root_directory is None:
    raise RuntimeError("Could not find project root directory")

print(f"Project root: {project_root_directory}")

# Add project root to Python path for imports
sys.path.insert(0, str(project_root_directory))

# Change to project root for relative paths
os.chdir(project_root_directory)


Project root: /Users/markuskuehnle/Documents/projects/credit-ocr-system
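The root detection above can also be packaged as a small helper, which makes the "check the starting directory as well as its parents" rule easy to test. This is a minimal sketch: `find_project_root` is a hypothetical name, not part of the project code, and the demo uses a temporary directory instead of the real repository.

```python
from pathlib import Path
import tempfile


def find_project_root(start: Path, marker: str = "pyproject.toml") -> Path:
    """Return the nearest directory at or above `start` that contains `marker`."""
    start = start.resolve()
    # Include `start` itself: Path.parents alone skips the starting directory.
    for candidate in (start, *start.parents):
        if (candidate / marker).exists():
            return candidate
    raise RuntimeError(f"No {marker} found at or above {start}")


# Demo on a throwaway directory tree (hypothetical layout).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp).resolve()
    (root / "pyproject.toml").touch()
    nested = root / "notebooks" / "06_status"
    nested.mkdir(parents=True)

    found_from_nested = find_project_root(nested)
    found_from_root = find_project_root(root)

print(found_from_nested == root, found_from_root == root)
```

Starting the search at the directory itself matters in a notebook, because the cell changes the working directory to the project root and may be executed more than once.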


### 2.2 Service clients

We connect to the running Postgres and Azurite services from `compose.yml` and construct the DMS service with adapters.


In [2]:
import psycopg2
from azure.storage.blob import BlobServiceClient
import importlib

# Import the adapter module and reload it first, so edits made on disk are picked up
import src.dms.adapters as dms_adapters
importlib.reload(dms_adapters)

# Import names only after the reload; names bound before a reload keep
# pointing at the old class objects
from src.dms.service import DmsService
from src.dms.adapters import AzureBlobStorageClient, PostgresMetadataRepository

# Config for compose-based services
POSTGRES_HOST: str = "localhost"
POSTGRES_PORT: int = 5432
POSTGRES_DBNAME: str = "dms_meta"
POSTGRES_USER: str = "dms"
POSTGRES_PASSWORD: str = "dms"

AZURITE_BLOB_PORT: int = 10000
AZURITE_ACCOUNT_NAME: str = "devstoreaccount1"
AZURITE_ACCOUNT_KEY: str = (
    "Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/"
    "K1SZFPTOtr/KBHBeksoGMGw=="
)
CONTAINER_NAME: str = "documents"

# Optionally reuse an existing connection string if set
existing_conn_str: str | None = os.environ.get("AZURE_STORAGE_CONNECTION_STRING")
if existing_conn_str:
    connection_string: str = existing_conn_str
else:
    connection_string = (
        "DefaultEndpointsProtocol=http;"
        f"AccountName={AZURITE_ACCOUNT_NAME};"
        f"AccountKey={AZURITE_ACCOUNT_KEY};"
        f"BlobEndpoint=http://localhost:{AZURITE_BLOB_PORT}/{AZURITE_ACCOUNT_NAME};"
    )

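# Initialize clients
from azure.core.exceptions import ResourceExistsError

blob_service_client: BlobServiceClient = BlobServiceClient.from_connection_string(connection_string)
container_client = blob_service_client.get_container_client(CONTAINER_NAME)
try:
    container_client.create_container()
except ResourceExistsError:
    # The container already exists from an earlier run; other errors should surface
    pass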

pg_conn = psycopg2.connect(
    host=POSTGRES_HOST,
    port=POSTGRES_PORT,
    database=POSTGRES_DBNAME,
    user=POSTGRES_USER,
    password=POSTGRES_PASSWORD,
)

storage_client = AzureBlobStorageClient(blob_service_client)
metadata_repo = PostgresMetadataRepository(pg_conn)

dms_service = DmsService(storage_client=storage_client, metadata_repository=metadata_repo)

print("Environment ready: connected to Postgres and Azurite")

Environment ready: connected to Postgres and Azurite


In [3]:
# Reload adapter module to pick up latest changes
importlib.reload(dms_adapters)
from src.dms.adapters import AzureBlobStorageClient, PostgresMetadataRepository

# Recreate the service with updated adapter
storage_client = AzureBlobStorageClient(blob_service_client)
metadata_repo = PostgresMetadataRepository(pg_conn)
dms_service = DmsService(storage_client=storage_client, metadata_repository=metadata_repo)

print("DMS service recreated with updated adapter")


DMS service recreated with updated adapter
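Constructing the service from a storage client and a metadata repository is a form of dependency injection: the service only depends on the methods it calls, so the backing stores can be swapped. The sketch below illustrates the idea with in-memory stand-ins. Every class and method name here (`SketchDmsService`, `upload_blob`, `save_document`, `load_document`) is hypothetical; the real interfaces in `src.dms` are not shown in this notebook and may differ.

```python
import uuid
from typing import Optional, Protocol


class StorageClient(Protocol):
    """Hypothetical storage port: anything that can persist bytes under a path."""
    def upload_blob(self, blob_path: str, data: bytes) -> None: ...


class MetadataRepository(Protocol):
    """Hypothetical metadata port: anything that can save and load document records."""
    def save_document(self, document_id: str, record: dict) -> None: ...
    def load_document(self, document_id: str) -> Optional[dict]: ...


class InMemoryStorageClient:
    def __init__(self) -> None:
        self.blobs = {}

    def upload_blob(self, blob_path: str, data: bytes) -> None:
        self.blobs[blob_path] = data


class InMemoryMetadataRepository:
    def __init__(self) -> None:
        self.documents = {}

    def save_document(self, document_id: str, record: dict) -> None:
        self.documents[document_id] = dict(record)

    def load_document(self, document_id: str) -> Optional[dict]:
        return self.documents.get(document_id)


class SketchDmsService:
    """Illustrative service that only talks to the two ports above."""
    def __init__(self, storage_client: StorageClient,
                 metadata_repository: MetadataRepository) -> None:
        self._storage = storage_client
        self._metadata = metadata_repository

    def upload_bytes(self, data: bytes, source_filename: str) -> str:
        document_id = str(uuid.uuid4())
        blob_path = f"{document_id}/{source_filename}"
        self._storage.upload_blob(blob_path, data)
        self._metadata.save_document(
            document_id, {"source_filename": source_filename, "blob_path": blob_path}
        )
        return document_id


storage = InMemoryStorageClient()
metadata = InMemoryMetadataRepository()
service = SketchDmsService(storage_client=storage, metadata_repository=metadata)
doc_id = service.upload_bytes(b"%PDF-1.4 placeholder", "loan_application.pdf")
print(metadata.load_document(doc_id))
```

In-memory adapters like these are mainly useful in unit tests, where running Postgres and Azurite is not practical.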


## 3. Database schema

Apply the schema so that the status columns and the extraction jobs table exist:


In [4]:
# Apply database schema
schema_path = project_root_directory / "database" / "schemas" / "schema.sql"
schema_sql = schema_path.read_text()

with pg_conn.cursor() as cursor:
    cursor.execute(schema_sql)
    pg_conn.commit()

print("Database schema applied successfully")
print("Tables ensured:")
print("- documents (text_extraction_status, processing_status)")
print("- extraction_jobs")
print("- ocr_results")


Database schema applied successfully
Tables ensured:
- documents (text_extraction_status, processing_status)
- extraction_jobs
- ocr_results
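The contents of `schema.sql` are not shown in this notebook. Since the cell may be run many times, the sketch below assumes an idempotent style (`CREATE TABLE IF NOT EXISTS`) and uses `sqlite3` in place of Postgres so that it runs standalone. The column names and default values are assumptions, chosen to match the keys and statuses printed elsewhere in this notebook.

```python
import sqlite3

# Hypothetical, simplified schema; the real schema.sql may differ.
SCHEMA_SQL = """
CREATE TABLE IF NOT EXISTS documents (
    id TEXT PRIMARY KEY,
    source_filename TEXT NOT NULL,
    textextraction_status TEXT NOT NULL DEFAULT 'ready',
    processing_status TEXT NOT NULL DEFAULT 'pending extraction'
);
CREATE TABLE IF NOT EXISTS extraction_jobs (
    id TEXT PRIMARY KEY,
    document_id TEXT NOT NULL REFERENCES documents(id),
    status TEXT NOT NULL DEFAULT 'pending extraction',
    created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
    completed_at TEXT,
    error_message TEXT
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA_SQL)
conn.executescript(SCHEMA_SQL)  # applying the schema a second time changes nothing

conn.execute(
    "INSERT INTO documents (id, source_filename) VALUES (?, ?)",
    ("doc-1", "loan_application.pdf"),
)
row = conn.execute(
    "SELECT textextraction_status, processing_status FROM documents WHERE id = ?",
    ("doc-1",),
).fetchone()
tables = sorted(
    r[0] for r in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
)
conn.close()

print(tables)
print(row)
```

Putting the initial statuses into column defaults means a freshly inserted document always starts in a known state, even if the inserting code forgets to set them.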


## 4. Upload workflow

We will upload a document and observe the initial statuses and created extraction jobs.


In [5]:
# Upload a test document and inspect initial statuses

test_file_path = project_root_directory / "data" / "loan_application.pdf"

if not test_file_path.exists():
    print(f"Test file not found: {test_file_path}")
else:
    document_id = dms_service.upload_document(
        file_path=test_file_path,
        document_type="loan_application",
        source_filename="loan_application.pdf",
    )
    print(f"Document uploaded with ID: {document_id}")
    
    # Retrieve document record
    document = dms_service.get_document(document_id)
    print("\nInitial document status:")
    print(f"- Text extraction status: {document.get('textextraction_status', 'N/A')}")
    print(f"- Processing status: {document.get('processing_status', 'N/A')}")
    
    # List jobs
    jobs = dms_service.get_extraction_jobs(document_id)
    print(f"\nExtraction jobs created: {len(jobs)}")
    for job in jobs:
        print(f"- Job ID: {job['id']}")
        print(f"  Status: {job['status']}")
        print(f"  Created: {job['created_at']}")


Document uploaded with ID: 3348fafb-1452-476c-9a11-368088517fa8

Initial document status:
- Text extraction status: ready
- Processing status: pending extraction

Extraction jobs created: 1
- Job ID: f5542c58-62c6-410d-9060-441992686f15
  Status: pending extraction
  Created: 2025-09-17 20:50:11.176235+00:00
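The output above shows that an upload produces one document record and one extraction job, both starting in a known status. A minimal in-memory sketch of that behaviour follows. `register_upload` and `jobs_for` are hypothetical helpers and do not reflect how `DmsService.upload_document` is implemented; only the status values are taken from the output above.

```python
import uuid
from datetime import datetime, timezone

# In-memory stand-ins for the documents and extraction_jobs tables.
documents = {}
extraction_jobs = {}


def register_upload(source_filename: str, document_type: str) -> str:
    """Create a document record plus one pending extraction job."""
    document_id = str(uuid.uuid4())
    documents[document_id] = {
        "source_filename": source_filename,
        "document_type": document_type,
        "textextraction_status": "ready",
        "processing_status": "pending extraction",
    }
    job_id = str(uuid.uuid4())
    extraction_jobs[job_id] = {
        "id": job_id,
        "document_id": document_id,
        "status": "pending extraction",
        "created_at": datetime.now(timezone.utc),
        "completed_at": None,
        "error_message": None,
    }
    return document_id


def jobs_for(document_id: str) -> list:
    """Return all jobs that belong to one document."""
    return [job for job in extraction_jobs.values() if job["document_id"] == document_id]


sketch_document_id = register_upload("loan_application.pdf", "loan_application")
sketch_jobs = jobs_for(sketch_document_id)
print(documents[sketch_document_id]["processing_status"], len(sketch_jobs))
```

In a real database these two inserts would normally share one transaction, so a document can never exist without its job.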


## 5. Simulate pipeline status updates

We simulate the OCR and LLM steps and update both status models (text extraction status and processing status) along the way.


In [6]:
# Simulate processing status transitions
print("=== Simulating processing ===\n")

# Mark ready (the upload already set this status, so the value does not change here)
print("1. Marking 'ready'...")
dms_service.update_textextraction_status(document_id, "ready")
print("   ✓ Text extraction status → 'ready'")

# OCR running
print("\n2. OCR running...")
dms_service.mark_ocr_running(document_id)
print("   ✓ Processing status → 'ocr running'")

# LLM running
print("\n3. LLM running...")
dms_service.mark_llm_running(document_id)
print("   ✓ Processing status → 'llm running'")

# Finish
print("\n4. Completing processing...")
dms_service.update_textextraction_status(document_id, "completed")
dms_service.mark_processing_done(document_id)
print("   ✓ Text extraction status → 'completed'")
print("   ✓ Processing status → 'done'")

# Mark job done (first job)
jobs = dms_service.get_extraction_jobs(document_id)
if jobs:
    job_id = jobs[0]['id']
    dms_service.update_extraction_job(job_id, "done")
    print(f"   ✓ Extraction job {job_id} → 'done'")


=== Simulating processing ===

1. Marking 'ready'...
   ✓ Text extraction status → 'ready'

2. OCR running...
   ✓ Processing status → 'ocr running'

3. LLM running...
   ✓ Processing status → 'llm running'

4. Completing processing...
   ✓ Text extraction status → 'completed'
   ✓ Processing status → 'done'
   ✓ Extraction job f5542c58-62c6-410d-9060-441992686f15 → 'done'
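The processing status moves through a fixed sequence, which can be modelled as a small state machine that rejects out-of-order updates. The sketch below is illustrative only: the notebook does not show whether `DmsService` validates transitions, and the `failed` status is an assumption that does not appear in the output above.

```python
# Allowed processing-status transitions (the 'failed' state is hypothetical).
ALLOWED_TRANSITIONS = {
    "pending extraction": {"ocr running", "failed"},
    "ocr running": {"llm running", "failed"},
    "llm running": {"done", "failed"},
    "done": set(),
    "failed": set(),
}


def transition(current: str, new: str) -> str:
    """Return the new status, or raise if the move is not allowed."""
    if new not in ALLOWED_TRANSITIONS.get(current, set()):
        raise ValueError(f"Illegal transition: {current!r} -> {new!r}")
    return new


# Walk the happy path used in the simulation above.
status = "pending extraction"
for next_status in ("ocr running", "llm running", "done"):
    status = transition(status, next_status)
final_status = status

# Skipping steps is rejected.
try:
    transition("pending extraction", "done")
    rejected = False
except ValueError:
    rejected = True

print(final_status, rejected)
```

Validating transitions in one place keeps a late or duplicated worker message from moving a finished document back into a running state.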


## 6. Verify results

Check the final document and job statuses.


In [7]:
# Final document status
document = dms_service.get_document(document_id)
print("=== Final Document ===")
print(f"ID: {document_id}")
print(f"Filename: {document.get('source_filename')}")
print(f"Text extraction status: {document.get('textextraction_status', 'N/A')}")
print(f"Processing status: {document.get('processing_status', 'N/A')}")

# Final job status
jobs = dms_service.get_extraction_jobs(document_id)
print("\n=== Extraction Jobs ===")
for job in jobs:
    print(f"Job ID: {job['id']}")
    print(f"Status: {job['status']}")
    print(f"Created: {job['created_at']}")
    print(f"Completed: {job.get('completed_at', 'N/A')}")
    print(f"Error: {job.get('error_message', 'None')}")


=== Final Document ===
ID: 3348fafb-1452-476c-9a11-368088517fa8
Filename: loan_application.pdf
Text extraction status: completed
Processing status: done

=== Extraction Jobs ===
Job ID: f5542c58-62c6-410d-9060-441992686f15
Status: done
Created: 2025-09-17 20:50:11.176235+00:00
Completed: 2025-09-17 20:50:11.184303+00:00
Error: None
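The job record above carries an `error_message` field, but the happy-path run leaves it empty. The following sketch shows one way a failing step could populate it. It works on a plain dictionary rather than the real service; `run_job` is a hypothetical helper, and the `failed` status value is an assumption, since this notebook does not show how `DmsService` records failures.

```python
from datetime import datetime, timezone


def run_job(job: dict, step) -> dict:
    """Run one pipeline step and record the outcome on the job record."""
    try:
        step()
    except Exception as exc:
        # Keep the exception type and message so the failure can be diagnosed later.
        job["status"] = "failed"
        job["error_message"] = f"{type(exc).__name__}: {exc}"
    else:
        job["status"] = "done"
        job["error_message"] = None
    job["completed_at"] = datetime.now(timezone.utc)
    return job


def failing_ocr() -> None:
    raise RuntimeError("OCR engine timed out")


def working_ocr() -> None:
    pass


failed_job = run_job({"id": "job-1", "status": "pending extraction"}, failing_ocr)
ok_job = run_job({"id": "job-2", "status": "pending extraction"}, working_ocr)

print(failed_job["status"], failed_job["error_message"])
print(ok_job["status"], ok_job["error_message"])
```

Setting `completed_at` on both outcomes records when the job stopped, regardless of whether it succeeded.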


## 7. Cleanup

Close the database connection to release its resources.


In [8]:
pg_conn.close()
print("Database connection closed")

Database connection closed
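Outside a notebook, an explicit `close()` call is easy to miss when an earlier statement raises. One option is `contextlib.closing`, which closes the connection when the block exits, whether or not an error occurred. The sketch uses `sqlite3` as a stand-in so it runs without Postgres. Note that with psycopg2, using the connection itself in a `with` statement ends the transaction but does not close the connection.

```python
import sqlite3
from contextlib import closing

# closing() calls conn.close() on exit, even if the body raises.
with closing(sqlite3.connect(":memory:")) as conn:
    value = conn.execute("SELECT 1").fetchone()[0]

# After the block, the connection can no longer be used.
try:
    conn.execute("SELECT 1")
    connection_closed = False
except sqlite3.ProgrammingError:
    connection_closed = True

print(value, connection_closed)
```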


## Summary

- Implemented and demonstrated two status models: text extraction status and processing status
- Showed automatic extraction job creation and lifecycle updates
- Simulated pipeline-driven status updates (OCR → LLM → Done)
- Showed where job errors are recorded (the `error_message` field on extraction jobs)

For a deeper dive (theory, tradeoffs, best practices), see the README in this folder.
