# Legal Transcript Line Number Reflow

This notebook demonstrates how to process legal documents (depositions, court transcripts, trial records) with Azure Content Understanding and reflow the output to include inline line numbers.

## The Challenge

Legal transcripts have a standardized format with **line numbers in the left margin** (typically 1-25 per page). These line numbers are critical for:
- Citing specific testimony in legal briefs
- Cross-referencing during depositions and trials
- Creating accurate legal summaries

By default, Content Understanding's markdown output groups these margin line numbers separately from the main text content. This notebook shows how to **reflow the output** to include line numbers inline with each text line.

## Workflow
1. **Load PDF** - Read the local legal transcript file
2. **Content Extraction** - Use Azure Content Understanding to extract text with position data
3. **Reflow** - Match line numbers with text using bounding box coordinates
4. **Output** - Generate markdown with inline line numbers (e.g., `1 | witness testimony...`)

## Prerequisites
1. Ensure your Azure AI service is configured by following the [configuration steps](../README.md#configure-azure-ai-service-resource).
2. Install the required packages to run this sample.

In [None]:
%pip install -r ../requirements.txt

## Create Azure AI Content Understanding Client

In [2]:
from datetime import datetime
import logging
import os
import sys
from typing import Any, Optional
from dotenv import find_dotenv, load_dotenv

# Add the parent directory to the Python path to import the helper modules
sys.path.append(os.path.join(os.path.dirname(os.getcwd()), 'python'))
from content_understanding_client import AzureContentUnderstandingClient
from extension.sample_helper import save_json_to_file 
from azure.identity import DefaultAzureCredential

load_dotenv(find_dotenv())
logging.basicConfig(level=logging.INFO)

# For authentication, you can use either token-based auth or subscription key
AZURE_AI_ENDPOINT = os.getenv("AZURE_AI_ENDPOINT")
AZURE_AI_API_KEY = os.getenv("AZURE_AI_API_KEY")
API_VERSION = "2025-11-01"

# Create token provider for Azure AD authentication
def token_provider():
    credential = DefaultAzureCredential()
    token = credential.get_token("https://cognitiveservices.azure.com/.default")
    return token.token

# Create the Content Understanding client
try:
    client = AzureContentUnderstandingClient(
        endpoint=AZURE_AI_ENDPOINT,
        api_version=API_VERSION,
        subscription_key=AZURE_AI_API_KEY,
        token_provider=token_provider if not AZURE_AI_API_KEY else None,
        x_ms_useragent="azure-ai-content-understanding-python-sample-legal-reflow"
    )
    credential_type = "Subscription Key" if AZURE_AI_API_KEY else "Azure AD Token"
    print("✅ Client created successfully")
    print(f"   Endpoint: {AZURE_AI_ENDPOINT}")
    print(f"   Credential: {credential_type}")
    print(f"   API Version: {API_VERSION}")
except Exception as e:
    print(f"❌ Failed to create client: {e}")
    raise

✅ Client created successfully
   Endpoint: https://mmi-usw3-eft-foundry.services.ai.azure.com/
   Credential: Subscription Key
   API Version: 2025-11-01


## Configure Model Deployments

> **💡 Note:** This step is only required **once per Azure Content Understanding resource**.

In [4]:
# Get model deployment names from environment variables
GPT_4_1_DEPLOYMENT = os.getenv("GPT_4_1_DEPLOYMENT")
GPT_4_1_MINI_DEPLOYMENT = os.getenv("GPT_4_1_MINI_DEPLOYMENT")
TEXT_EMBEDDING_3_LARGE_DEPLOYMENT = os.getenv("TEXT_EMBEDDING_3_LARGE_DEPLOYMENT")

# Check if required deployments are configured
missing_deployments = []
if not GPT_4_1_DEPLOYMENT:
    missing_deployments.append("GPT_4_1_DEPLOYMENT")
if not GPT_4_1_MINI_DEPLOYMENT:
    missing_deployments.append("GPT_4_1_MINI_DEPLOYMENT")
if not TEXT_EMBEDDING_3_LARGE_DEPLOYMENT:
    missing_deployments.append("TEXT_EMBEDDING_3_LARGE_DEPLOYMENT")

if missing_deployments:
    print(f"⚠️  Warning: Missing model deployment configuration(s): {missing_deployments}")
    print("   Add these to your .env file and restart the kernel.")
else:
    print("📋 Configuring default model deployments...")
    try:
        result = client.update_defaults({
            "gpt-4.1": GPT_4_1_DEPLOYMENT,
            "gpt-4.1-mini": GPT_4_1_MINI_DEPLOYMENT,
            "text-embedding-3-large": TEXT_EMBEDDING_3_LARGE_DEPLOYMENT
        })
        print("✅ Default model deployments configured successfully")
    except Exception as e:
        print(f"❌ Failed to configure defaults: {e}")
        raise

📋 Configuring default model deployments...
✅ Default model deployments configured successfully


## Analyze Legal Transcript

We'll use a publicly available deposition transcript from the Internet Archive. This is a real legal document with the standard line-numbered format used in depositions.

**Sample Document:** [Farr Deposition Transcript](https://archive.org/details/799436-farr-deposition-transcript) (15 pages, Public Domain)

In [5]:
# Analyze legal transcript from local file
# Using the transcript example from the data/legal_examples folder
document_path = os.path.join(os.path.dirname(os.getcwd()), 'data', 'legal_examples', 'Trascript Example.pdf')
analyzer_id = 'prebuilt-layout'

print("🔍 Analyzing legal transcript from local file...")
print(f"   Document: {document_path}")
print(f"   Analyzer: {analyzer_id}")

# Verify file exists
if not os.path.exists(document_path):
    raise FileNotFoundError(f"Document not found: {document_path}")

file_size = os.path.getsize(document_path)
print(f"   File size: {file_size:,} bytes")

# Analyze the document using binary file path
response = client.begin_analyze_binary(
    analyzer_id=analyzer_id,
    file_location=document_path
)

print("⏳ Waiting for analysis to complete...")
result = client.poll_result(response)
print("✅ Analysis completed!")

# Get document info
contents = result.get("result", {}).get("contents", [])
if contents:
    content = contents[0]
    if content.get("kind") == "document":
        print("\n📄 Document Information:")
        print(f"   Pages: {content.get('startPageNumber')} - {content.get('endPageNumber')}")
        print(f"   Total pages: {content.get('endPageNumber') - content.get('startPageNumber') + 1}")

# Save the full result for processing
saved_json_path = save_json_to_file(result, filename_prefix="legal_transcript_analysis")
print(f"\n💾 Full analysis saved to: {saved_json_path}")

🔍 Analyzing legal transcript from local file...
   Document: c:\src\azure-ai-content-understanding-python\data\legal_examples\Trascript Example.pdf
   Analyzer: prebuilt-layout
   File size: 1,666,047 bytes


INFO:content_understanding_client:Analyzing binary file c:\src\azure-ai-content-understanding-python\data\legal_examples\Trascript Example.pdf with analyzer: prebuilt-layout


⏳ Waiting for analysis to complete...


INFO:content_understanding_client:Request 9e72763b-ecdf-4b8d-8f9b-a6733d30b6b1 in progress ...
INFO:content_understanding_client:Request 9e72763b-ecdf-4b8d-8f9b-a6733d30b6b1 in progress ...
INFO:content_understanding_client:Request 9e72763b-ecdf-4b8d-8f9b-a6733d30b6b1 in progress ...
INFO:content_understanding_client:Request result is ready after 7.11 seconds.


✅ Analysis completed!

📄 Document Information:
   Pages: 1 - 52
   Total pages: 52
💾 Analysis result saved to: test_output\legal_transcript_analysis_20260122_162606.json

💾 Full analysis saved to: test_output\legal_transcript_analysis_20260122_162606.json


## View Default Markdown Output

Let's first look at Content Understanding's default markdown output. Notice how the **line numbers are grouped separately** in a single block on each page rather than inline with the text.
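
One rough way to see the separation is to count the lines on each page that contain nothing but digits. The sketch below runs on a small made-up sample string rather than the real output; it assumes pages are separated by the `<!-- PageBreak -->` marker that appears in the default markdown, and `count_standalone_numbers` is a hypothetical helper name. Digit-only lines can also be page numbers, so treat the counts as an indicator only.

```python
# Made-up sample that mimics the shape of the default markdown output.
sample_markdown = """# HEADER

1
2
3

APPEARANCES

On behalf of
<!-- PageBreak -->

# HEADER

1
2

INDEX
"""


def count_standalone_numbers(markdown_text: str) -> list[int]:
    """Return, for each page, how many lines consist only of digits."""
    pages = markdown_text.split("<!-- PageBreak -->")
    return [
        sum(1 for line in page.splitlines() if line.strip().isdigit())
        for page in pages
    ]


counts = count_standalone_numbers(sample_markdown)
print(counts)
```

A page whose count is close to 25 is a likely candidate for reflow.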

In [6]:
# Show the default markdown output (first 2000 characters)
markdown = content.get("markdown", "")

print("📄 Default Markdown Output (first 2000 chars):")
print("=" * 60)
print(markdown[:2000])
print("=" * 60)
print(f"\n... (Total length: {len(markdown)} characters)")

📄 Default Markdown Output (first 2000 chars):
# (B&W) PROTECTED BY MINNESOTA TOBACCO LITIGATION PROTECTIVE ORDER

SUPERIOR COURT OF NEW JERSEY
MERCER COUNTY-LAW DIVISION,
DOCKET NO. L-90-2940

:

:

IN RE:
IN THE MATTER OF
SUSAN MICHAUD

:

DEPOSITION OF:

:

Susan Michaud

:

:

Transcript of proceedings taken on July 13, 1990,
at 1 pm, at the office of Mason, Griffin & Pierson, 101 Poor
Farm Road, Princeton, NJ 08540.

682499390

<!-- PageFooter: http://legacy.library.ucsf.e6u/tid/fuq07a00/pdfv.industrydocuments.ucsf.edu/docs/khhl0001 -->
<!-- PageBreak -->


# (B&W) PROTECTED BY MINNESOTA TOBACCO LITIGATION PROTECTIVE ORDER

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25

2

APPEARANCES

On behalf of
Susan Michaud:

MASON, GRIFFIN & PIERSON
BY: Stephanie J. Briody, Esq.
101 Poor Farm Road
Princton, NJ 08540

On behalf of Dr. Alfred
Cook, Dr. Charles Howard &
Princeton Radiology Assoc.

JACKSON & VAURIO
BY: John Zen Jackson, Esq.
1000 Herrontown Road
Princeton, 

## How Reflow Works

The reflow algorithm uses **bounding box coordinates** from the JSON output to match line numbers with their corresponding text:

### Step 1: Parse Coordinates
Every element in the Content Understanding (CU) JSON output has a `source` field with position data:
```
"source": "D(1,1.0309,1.1277,1.131,1.1277,1.131,1.2711,1.0309,1.2711)"
           D(page, x1,y1, x2,y2, x3,y3, x4,y4)
```
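
As a minimal sketch of this step, the snippet below parses the example string shown above into a page number and four corner points, then derives the left and top edges. The helper name `parse_source` is an assumption made for illustration, and only the eight-value polygon form is handled here.

```python
import re


def parse_source(source: str) -> tuple[int, list[tuple[float, float]]]:
    """Split a D(page,x1,y1,...,x4,y4) string into a page number and four (x, y) points."""
    match = re.fullmatch(r"D\((\d+),([^)]+)\)", source)
    if match is None:
        raise ValueError(f"Unrecognized source string: {source}")
    values = [float(v) for v in match.group(2).split(",")]
    if len(values) != 8:
        raise ValueError(f"Expected 8 coordinates, got {len(values)}")
    # Pair up alternating x and y values into corner points.
    points = list(zip(values[0::2], values[1::2]))
    return int(match.group(1)), points


page, points = parse_source("D(1,1.0309,1.1277,1.131,1.1277,1.131,1.2711,1.0309,1.2711)")
left_x = min(x for x, _ in points)
top_y = min(y for _, y in points)
print(page, left_x, top_y)
```

Taking the minimum over all four corners keeps the result stable even if the quadrilateral is slightly rotated.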

### Step 2: Group by Vertical Position
Elements whose top Y values fall within a small tolerance of each other (0.15 inches by default) are treated as being on the same horizontal line.
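
The grouping idea can be sketched on plain tuples. The coordinates and text below are made up for illustration, and `group_by_y` is a hypothetical helper: each item joins the current group when its Y value lies within the tolerance of the first item in that group.

```python
def group_by_y(items, tolerance=0.15):
    """Group (y, x, text) tuples whose y is within `tolerance` of the group's first item."""
    groups = []
    for item in sorted(items, key=lambda it: it[0]):
        if groups and abs(item[0] - groups[-1][0][0]) <= tolerance:
            groups[-1].append(item)
        else:
            groups.append([item])
    return groups


# Made-up positions in inches: (y, x, text)
items = [
    (1.47, 1.60, "A. Example answer"),
    (1.13, 1.03, "1"),
    (1.45, 1.03, "2"),
    (1.15, 1.60, "Q. Example question"),
]
groups = group_by_y(items)
for group in groups:
    print([text for _, _, text in group])
```

Anchoring the comparison on the first item of each group keeps a long run of slightly drifting Y values from chaining into one oversized group.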

### Step 3: Sort Left-to-Right
Within each group, sort by X coordinate. Line numbers (X ≈ 1.0") come before text content (X ≈ 1.3"+).

### Step 4: Combine
Pair line numbers with their corresponding text and output as `N | text content`.
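
Sorting and combining can be sketched together for a single group. This is an illustrative simplification with made-up values: `combine_group` is a hypothetical helper that treats any all-digit item as the line number, whereas a real pipeline may want a stricter check.

```python
def combine_group(group, separator=" | "):
    """Turn (x, text) items on one visual line into 'N | text', or plain text if unnumbered."""
    ordered = sorted(group, key=lambda it: it[0])  # left to right
    numbers = [text for _, text in ordered if text.isdigit()]
    words = [text for _, text in ordered if not text.isdigit()]
    if not words:
        return ""  # a margin number with no text beside it
    body = " ".join(words)
    return f"{numbers[0]}{separator}{body}" if numbers else body


print(combine_group([(1.60, "Q. Example question"), (1.03, "7")]))
print(combine_group([(1.60, "INDEX")]))
```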

## Reflow Functions

Here are the core functions for reflowing Content Understanding output to include inline line numbers:

In [7]:
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class LineElement:
    """Represents a line or element from the document with its position."""
    content: str
    y_position: float  # Top Y coordinate
    x_position: float  # Left X coordinate
    page_number: int
    is_line_number: bool = False


def parse_source_coordinates(source: str) -> tuple[int, float, float, float, float]:
    """
    Parse the source coordinate string from Content Understanding.
    
    The source format is: D(pageNumber,x1,y1,x2,y2,x3,y3,x4,y4)
    where the points represent a quadrilateral (upper-left, upper-right, lower-right, lower-left)
    
    Returns:
        Tuple of (page_number, left_x, top_y, right_x, bottom_y)
    """
    match = re.match(r'D\((\d+),([^)]+)\)', source)
    if not match:
        raise ValueError(f"Invalid source format: {source}")
    
    page_number = int(match.group(1))
    coords = [float(x) for x in match.group(2).split(',')]
    
    if len(coords) == 8:
        # Bounding polygon: x1,y1,x2,y2,x3,y3,x4,y4
        x1, y1, x2, y2, x3, y3, x4, y4 = coords
        left_x = min(x1, x4)
        top_y = min(y1, y2)
        right_x = max(x2, x3)
        bottom_y = max(y3, y4)
    elif len(coords) == 4:
        # Axis-aligned bounding box: left, top, width, height
        left_x, top_y, width, height = coords
        right_x = left_x + width
        bottom_y = top_y + height
    else:
        raise ValueError(f"Unexpected coordinate count: {source}")
    
    # Callers in this notebook only use left_x and top_y
    return page_number, left_x, top_y, right_x, bottom_y


def is_line_number(content: str) -> bool:
    """Check if content is a line number (1-99)."""
    return content.strip().isdigit() and 1 <= int(content.strip()) <= 99


def is_noise_element(content: str) -> bool:
    """Check if content is noise (bullets, single dots) that should be filtered."""
    content = content.strip()
    return content in ['·', '•', '∙'] or (len(content) == 1 and not content.isalnum())


def extract_lines_from_page(page_data: dict) -> list[LineElement]:
    """Extract all lines from a page with position information."""
    elements = []
    page_number = page_data.get('pageNumber', 1)
    
    for line in page_data.get('lines', []):
        content = line.get('content', '').strip()
        source = line.get('source', '')
        
        if not source or not content or is_noise_element(content):
            continue
            
        try:
            parsed_page, left_x, top_y, _, _ = parse_source_coordinates(source)
            element = LineElement(
                content=content,
                y_position=top_y,
                x_position=left_x,
                page_number=parsed_page,
                is_line_number=is_line_number(content)
            )
            elements.append(element)
        except ValueError:
            continue
    
    return elements


def group_lines_by_vertical_position(elements: list[LineElement], 
                                      y_tolerance: float = 0.15) -> list[list[LineElement]]:
    """Group elements that appear on the same horizontal line."""
    if not elements:
        return []
    
    sorted_elements = sorted(elements, key=lambda e: e.y_position)
    groups = []
    current_group = [sorted_elements[0]]
    current_y = sorted_elements[0].y_position
    
    for element in sorted_elements[1:]:
        if abs(element.y_position - current_y) <= y_tolerance:
            current_group.append(element)
        else:
            groups.append(current_group)
            current_group = [element]
            current_y = element.y_position
    
    if current_group:
        groups.append(current_group)
    
    return groups


def reflow_page_with_line_numbers(page_data: dict, separator: str = " | ") -> str:
    """Reflow a single page's content to include line numbers inline."""
    elements = extract_lines_from_page(page_data)
    if not elements:
        return ""
    
    line_groups = group_lines_by_vertical_position(elements)
    output_lines = []
    
    for group in line_groups:
        # Sort by X position (left to right)
        group.sort(key=lambda e: e.x_position)
        
        line_numbers = [e for e in group if e.is_line_number]
        content_elements = [e for e in group if not e.is_line_number]
        
        if not content_elements:
            continue
        
        combined_content = ' '.join(e.content for e in content_elements)
        
        if line_numbers:
            line_num = line_numbers[0].content
            output_lines.append(f"{line_num}{separator}{combined_content}")
        else:
            output_lines.append(combined_content)
    
    return '\n'.join(output_lines)


def reflow_document(json_data: dict, target_page: Optional[int] = None, 
                    separator: str = " | ") -> str:
    """Reflow an entire document or specific page with line numbers inline."""
    contents = json_data.get('result', {}).get('contents', [])
    if not contents:
        raise ValueError("No contents found in JSON data")
    
    content = contents[0]
    pages = content.get('pages', [])
    if not pages:
        raise ValueError("No pages found in document content")
    
    output_parts = []
    
    for page in pages:
        page_number = page.get('pageNumber', 0)
        if target_page is not None and page_number != target_page:
            continue
        
        page_output = reflow_page_with_line_numbers(page, separator)
        if page_output:
            if target_page is None:
                output_parts.append(f"\n<!-- Page {page_number} -->\n")
            output_parts.append(page_output)
    
    return '\n'.join(output_parts)

print("✅ Reflow functions loaded successfully!")

✅ Reflow functions loaded successfully!


## Reflow a Single Page

Let's reflow page 3 of the transcript to see the line numbers inline with the text:

In [8]:
# Reflow a single page (page 3)
page_to_reflow = 3

print(f"📄 Reflowed Output for Page {page_to_reflow}:")
print("=" * 60)
reflowed_page = reflow_document(result, target_page=page_to_reflow)
print(reflowed_page)
print("=" * 60)

📄 Reflowed Output for Page 3:
(B&W) PROTECTED BY MINNESOTA TOBACCO LITIGATION PROTECTIVE ORDER
1 | INDEX
2 | WITNESS: DIRECT CROSS REDIRECT RECROSS
3 | Susan Michaud
6 | EXHIBITS: EVIDENCE IDENTIFICATION
7 | Diagram (P-1)
682499392
http://legacy.library.ucsf.e6u/tid/fuq07a00/pdf.industrydocuments.ucsf.edu/docs/khhl0001
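
Line numbers 4 and 5 are absent from the page 3 output, because `reflow_page_with_line_numbers` skips any group that contains no text content. A quick check for such gaps can be sketched as follows; `find_missing_line_numbers` is a hypothetical helper, and it assumes the default `" | "` separator.

```python
import re


def find_missing_line_numbers(reflowed_page: str, separator: str = " | ") -> list[int]:
    """List line numbers between 1 and the highest one seen that have no 'N | text' line."""
    pattern = re.compile(rf"^(\d+){re.escape(separator)}")
    seen = set()
    for line in reflowed_page.splitlines():
        match = pattern.match(line)
        if match:
            seen.add(int(match.group(1)))
    if not seen:
        return []
    return [n for n in range(1, max(seen) + 1) if n not in seen]


# Numbered lines copied from the page 3 output above.
sample = """1 | INDEX
2 | WITNESS: DIRECT CROSS REDIRECT RECROSS
3 | Susan Michaud
6 | EXHIBITS: EVIDENCE IDENTIFICATION
7 | Diagram (P-1)"""
missing = find_missing_line_numbers(sample)
print(missing)
```

A gap usually means a blank transcript line, but it can also point to a number that was grouped with the wrong text, so it is worth spot-checking against the PDF.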


## Reflow Entire Document

Now let's reflow the entire document and save it to a file:

In [9]:
# Reflow the entire document
print("📄 Reflowing entire document...")
reflowed_document = reflow_document(result)

# Save to file
output_path = os.path.join(os.getcwd(), 'test_output', 'legal_transcript_reflowed.md')
os.makedirs(os.path.dirname(output_path), exist_ok=True)

with open(output_path, 'w', encoding='utf-8') as f:
    f.write(reflowed_document)

print(f"✅ Reflowed document saved to: {output_path}")
print(f"   Total characters: {len(reflowed_document)}")

# Show first 3000 characters
print("\n📄 Preview (first 3000 characters):")
print("=" * 60)
print(reflowed_document[:3000])
print("=" * 60)

📄 Reflowing entire document...
✅ Reflowed document saved to: c:\src\azure-ai-content-understanding-python\notebooks\test_output\legal_transcript_reflowed.md
   Total characters: 65678

📄 Preview (first 3000 characters):

<!-- Page 1 -->

(B&W) PROTECTED BY MINNESOTA TOBACCO LITIGATION PROTECTIVE ORDER
SUPERIOR COURT OF NEW JERSEY
MERCER COUNTY-LAW DIVISION,
DOCKET NO. L-90-2940
IN RE: IN THE MATTER OF
SUSAN MICHAUD
DEPOSITION OF:
Susan Michaud
Transcript of proceedings taken on July 13, 1990,
at 1 pm, at the office of Mason, Griffin & Pierson, 101 Poor
Farm Road, Princeton, NJ 08540.
682499390
http://legacy.library.ucsf.e6u/tid/fuq07a00/pdfv.industrydocuments.ucsf.edu/docs/khhl0001

<!-- Page 2 -->

(B&W) PROTECTED BY MINNESOTA TOBACCO LITIGATION PROTECTIVE ORDER
2 | APPEARANCES
3 | On behalf of
Susan Michaud: MASON, GRIFFIN & PIERSON
4 | BY: Stephanie J. Briody, Esq.
101 Poor Farm Road
5 | Princton, NJ 08540
6 | On behalf of Dr. Alfred
Cook, Dr. Charles Howard &
7 | Princeton 

## Compare: Before vs After

Let's compare the default output with the reflowed output for a specific page:

In [None]:
# Compare default vs reflowed for page 3
page_number = 3

# Get the original markdown for page 3 by splitting on the page-break marker
# (approximation: assumes one "<!-- PageBreak -->" between consecutive pages)
default_pages = markdown.split('<!-- PageBreak -->')
default_page = default_pages[page_number - 1] if len(default_pages) >= page_number else markdown

print("📊 COMPARISON: Default vs Reflowed Output")
print("\n" + "=" * 30 + " DEFAULT OUTPUT " + "=" * 30)
print("(Line numbers grouped separately from the text)")
print("-" * 76)

# Show a sample of the default output for the same page
print(default_page.strip()[:800])
print("...")

print("\n" + "=" * 30 + " REFLOWED OUTPUT " + "=" * 29)
print("(Line numbers inline with text)")
print("-" * 76)

# Show the reflowed output for page 3
reflowed_page = reflow_document(result, target_page=page_number)
print(reflowed_page[:800])
print("...")

## Using the Standalone Script

For batch processing or command-line usage, you can use the standalone script located at `python/reflow_markdown_with_line_numbers.py`:

```bash
# Process a specific page
python python/reflow_markdown_with_line_numbers.py analysis.json --page 3

# Process all pages and save to file
python python/reflow_markdown_with_line_numbers.py analysis.json --output reflowed.md

# Custom separator
python python/reflow_markdown_with_line_numbers.py analysis.json --separator " | " --output reflowed.md
```
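
For many analysis files, the same invocation can be driven from Python. The sketch below only builds the command lists, using the script path and `--output` flag shown above; the `_reflowed.md` naming scheme and the `build_reflow_commands` helper are assumptions made for illustration.

```python
import sys
import tempfile
from pathlib import Path

# Script path taken from the usage shown above; adjust to your checkout.
SCRIPT = "python/reflow_markdown_with_line_numbers.py"


def build_reflow_commands(json_dir: str, output_dir: str) -> list[list[str]]:
    """Build one command per analysis JSON file; nothing is executed here."""
    commands = []
    for json_path in sorted(Path(json_dir).glob("*.json")):
        # Hypothetical naming scheme: <input stem>_reflowed.md
        output_path = Path(output_dir) / f"{json_path.stem}_reflowed.md"
        commands.append(
            [sys.executable, SCRIPT, str(json_path), "--output", str(output_path)]
        )
    return commands


# Demonstrate with two empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.json", "b.json"):
        (Path(tmp) / name).touch()
    commands = build_reflow_commands(tmp, tmp)

for command in commands:
    print(" ".join(command))
    # To actually run it: subprocess.run(command, check=True)
```

Building the commands separately from running them makes it easy to review what will be executed before processing a large folder.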

In [None]:
# Example: Run the standalone script on our saved JSON
import subprocess

script_path = os.path.join(os.path.dirname(os.getcwd()), 'python', 'reflow_markdown_with_line_numbers.py')
output_file = os.path.join(os.getcwd(), 'test_output', 'legal_transcript_reflowed_script.md')

print("🔧 Running standalone reflow script...")
print(f"   Input: {saved_json_path}")
print(f"   Output: {output_file}")

# Use sys.executable so the script runs with the same interpreter as this kernel
result_code = subprocess.run(
    [sys.executable, script_path, saved_json_path, '--output', output_file],
    capture_output=True,
    text=True
)

if result_code.returncode == 0:
    print("✅ Script completed successfully!")
    print(result_code.stdout)
else:
    print("❌ Script failed:")
    print(result_code.stderr)

## Summary

This notebook demonstrated how to:

1. **Extract content** from legal transcripts using Azure Content Understanding's `prebuilt-layout` analyzer
2. **Understand the JSON structure** including the `source` field with bounding box coordinates
3. **Reflow the output** to include line numbers inline with text by:
   - Parsing bounding box coordinates to determine element positions
   - Grouping elements by vertical position (Y coordinate)
   - Matching line numbers with their corresponding text content
4. **Use the standalone script** for batch processing

### Use Cases

This technique is valuable for:
- **Legal document processing** - Depositions, trial transcripts, court records
- **Academic citations** - Line-numbered source materials
- **Content indexing** - Building searchable databases with line-level citations
- **AI-powered legal research** - RAG applications that need line-accurate references
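
As one illustration of line-level indexing, the reflowed text can be turned into `(page, line, text)` records. This sketch assumes the `<!-- Page N -->` markers that `reflow_document` emits when all pages are processed and the default `" | "` separator; `index_reflowed` is a hypothetical helper, and unnumbered lines such as headers are ignored.

```python
import re

PAGE_MARKER = re.compile(r"^<!-- Page (\d+) -->$")
NUMBERED_LINE = re.compile(r"^(\d+) \| (.*)$")


def index_reflowed(reflowed: str) -> list[dict]:
    """Collect one record per numbered line, tagged with its page number."""
    records = []
    page = None
    for line in reflowed.splitlines():
        page_match = PAGE_MARKER.match(line.strip())
        if page_match:
            page = int(page_match.group(1))
            continue
        line_match = NUMBERED_LINE.match(line)
        if line_match and page is not None:
            records.append(
                {"page": page, "line": int(line_match.group(1)), "text": line_match.group(2)}
            )
    return records


# Excerpt in the same shape as the reflowed preview in this notebook.
sample = """
<!-- Page 2 -->

(B&W) PROTECTED BY MINNESOTA TOBACCO LITIGATION PROTECTIVE ORDER
2 | APPEARANCES
4 | BY: Stephanie J. Briody, Esq.
"""
records = index_reflowed(sample)
for record in records:
    print(f"{record['page']}:{record['line']}  {record['text']}")
```

Records in this form can be loaded into a search index so that each hit carries a page and line reference.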

### Learn More

- [Content Understanding Document Elements](https://learn.microsoft.com/en-us/azure/ai-services/content-understanding/document/elements)
- [Document Markdown Representation](https://learn.microsoft.com/en-us/azure/ai-services/content-understanding/document/markdown)