# PDF Population with JSON Data

This notebook populates a PDF form with data from JSON.

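We'll implement four approaches:
1. **Form field filling (PyMuPDF)** - For PDFs with fillable form fields
2. **Text overlay (PyMuPDF)** - For PDFs without form fields (coordinate-based)
3. **Form field filling (pypdf)** - An alternative library for fillable form fields
4. **Smart auto-detection** - Matching form field names to JSON keys by keyword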

Based on the analysis from notebook 01, we'll choose the best approach.

In [None]:
import json
import os
from datetime import datetime
from pathlib import Path

# PDF libraries
import fitz  # PyMuPDF
from pypdf import PdfReader, PdfWriter
import pdfplumber

print("Libraries imported successfully!")

In [None]:
# File paths
PDF_INPUT = "pdf/A0124_pages_1_to_4.pdf"
JSON_INPUT = "inputs/test.json"
PDF_OUTPUT = f"results/populated_output_{datetime.now().strftime('%Y%m%d_%H%M%S')}.pdf"

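# Create the output directory if needed, otherwise saving the PDFs fails
Path(PDF_OUTPUT).parent.mkdir(parents=True, exist_ok=True)

# Verify input files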
print(f"PDF exists: {os.path.exists(PDF_INPUT)}")
print(f"JSON exists: {os.path.exists(JSON_INPUT)}")
print(f"Output will be: {PDF_OUTPUT}")

## Load JSON Data

In [None]:
# Load the JSON data
with open(JSON_INPUT, 'r', encoding='utf-8') as f:
    json_data = json.load(f)

# Extract the parsed data
data_to_fill = json_data.get('parsedJson', {})

print("Data to populate into PDF:")
for key, value in data_to_fill.items():
    print(f"  {key}: {value}")
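
The cells below expect the keys `name`, `id_number`, `address`, and `phone` in `data_to_fill`. A minimal sketch of an up-front check, assuming those four keys are the ones the form needs; `REQUIRED_KEYS` and `missing_keys` are hypothetical names, and the sample data is made up.

```python
# Hypothetical helper: report which required keys are absent or empty.
REQUIRED_KEYS = ("name", "id_number", "address", "phone")  # assumed from the cells below

def missing_keys(data, required=REQUIRED_KEYS):
    """Return the required keys that are missing from `data` or have empty values."""
    return [key for key in required if data.get(key) in (None, "")]

# Example with made-up data (not taken from the real input file)
sample = {"name": "Hong Gildong", "phone": ""}
print(missing_keys(sample))
```

In the notebook this could be called as `missing_keys(data_to_fill)` right after loading, so that an incomplete input is reported before any PDF is written.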

## Approach 1: Fill Form Fields (PyMuPDF)

Best for PDFs with fillable form fields.

In [None]:
def populate_form_fields_pymupdf(input_pdf, output_pdf, field_mapping):
    """
    Populate PDF form fields using PyMuPDF
    
    Args:
        input_pdf: Path to input PDF
        output_pdf: Path to output PDF
        field_mapping: Dictionary mapping field names to values
    
    Returns:
        Success status and list of populated fields
    """
    doc = fitz.open(input_pdf)
    populated_fields = []
    
    for page_num in range(len(doc)):
        page = doc[page_num]
        widgets = page.widgets()
        
        for widget in widgets:
            field_name = widget.field_name
            
            # Check if we have data for this field
            if field_name in field_mapping:
                value = field_mapping[field_name]
                widget.field_value = str(value)
                widget.update()
                populated_fields.append({
                    'field': field_name,
                    'value': value,
                    'page': page_num + 1
                })
                print(f"  ✓ Populated '{field_name}' with '{value}' on page {page_num + 1}")
    
    # Save the populated PDF
    doc.save(output_pdf)
    doc.close()
    
    return len(populated_fields) > 0, populated_fields

# Define field mapping
# IMPORTANT: Adjust these field names based on the actual PDF field names
# You should have identified these in notebook 01
field_mapping = {
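    # Keys are PDF field names; values are the JSON values to write into them.
    # Several candidate names are listed per value, and names that do not
    # exist in the PDF are never matched (adjust based on your PDF):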
    'name': data_to_fill['name'],
    'Name': data_to_fill['name'],
    'Full Name': data_to_fill['name'],
    'fullname': data_to_fill['name'],
    '성명': data_to_fill['name'],  # Korean for "name"
    
    'id': data_to_fill['id_number'],
    'ID': data_to_fill['id_number'],
    'id_number': data_to_fill['id_number'],
    'ID Number': data_to_fill['id_number'],
    'resident_number': data_to_fill['id_number'],
    '주민번호': data_to_fill['id_number'],  # Korean for "resident number"
    
    'address': data_to_fill['address'],
    'Address': data_to_fill['address'],
    '주소': data_to_fill['address'],  # Korean for "address"
    
    'phone': data_to_fill['phone'],
    'Phone': data_to_fill['phone'],
    'telephone': data_to_fill['phone'],
    'Telephone': data_to_fill['phone'],
    '전화': data_to_fill['phone'],  # Korean for "phone"
}

print("\n" + "="*60)
print("Attempting Form Field Population (PyMuPDF)")
print("="*60 + "\n")

output_method1 = PDF_OUTPUT.replace('.pdf', '_method1_form_fields.pdf')

try:
    success, populated = populate_form_fields_pymupdf(PDF_INPUT, output_method1, field_mapping)
    
    if success:
        print(f"\n✓ Successfully populated {len(populated)} fields!")
        print(f"\n✓ Output saved to: {output_method1}")
    else:
        print("\n✗ No matching form fields found.")
        print("The PDF might not have fillable form fields.")
        print("Try Method 2 (text overlay) below.")
except Exception as e:
    print(f"\n✗ Error: {e}")
    print("Try Method 2 (text overlay) below.")
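
The field mapping above repeats each value under several candidate field names. The same table can be built from a list of aliases per JSON key, which also skips keys that are absent from the data. This is a sketch, not the notebook's method: `FIELD_ALIASES` and `build_field_mapping` are hypothetical names, and the aliases are the same guesses used above.

```python
# Hypothetical alias table: JSON key -> candidate PDF field names.
# Replace the aliases with the real field names found in notebook 01.
FIELD_ALIASES = {
    "name": ["name", "Name", "Full Name", "fullname", "성명"],
    "id_number": ["id", "ID", "id_number", "ID Number", "resident_number", "주민번호"],
    "address": ["address", "Address", "주소"],
    "phone": ["phone", "Phone", "telephone", "Telephone", "전화"],
}

def build_field_mapping(data, aliases=FIELD_ALIASES):
    """Return {pdf_field_name: value} for every JSON key present in `data`."""
    mapping = {}
    for json_key, field_names in aliases.items():
        if json_key not in data:
            continue  # skip missing keys instead of raising KeyError
        for field_name in field_names:
            mapping[field_name] = data[json_key]
    return mapping

# Example with made-up data
example = build_field_mapping({"name": "Hong Gildong", "phone": "010-0000-0000"})
print(sorted(example))
```

The result has the same shape as `field_mapping`, so it could be passed to `populate_form_fields_pymupdf` unchanged.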

## Approach 2: Text Overlay (PyMuPDF)

For PDFs without fillable form fields, we overlay text at specific coordinates.
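
PyMuPDF page coordinates start at the top-left corner with y growing downward, while coordinates stored in raw PDF objects (for example an annotation's `/Rect`) start at the bottom-left with y growing upward. If a position was measured in the second system, it has to be flipped before it is used for an overlay. A minimal sketch, assuming an unrotated page whose MediaBox starts at (0, 0); `pdf_to_pymupdf_point` is a hypothetical helper and the page height is an assumed A4 value.

```python
def pdf_to_pymupdf_point(x, y_from_bottom, page_height):
    """Convert a bottom-left-origin PDF point to a top-left-origin point.

    Assumes an unrotated page whose MediaBox starts at (0, 0).
    """
    return (x, page_height - y_from_bottom)

# Assumed A4 page height in points; in the notebook, page.rect.height
# gives the real value for the page being filled.
A4_HEIGHT = 842
print(pdf_to_pymupdf_point(150, 642, A4_HEIGHT))  # → (150, 200)
```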

In [None]:
def populate_with_text_overlay(input_pdf, output_pdf, text_mappings):
    """
    Add text to PDF at specific coordinates
    
    Args:
        input_pdf: Path to input PDF
        output_pdf: Path to output PDF
        text_mappings: List of dicts with 'page', 'text', 'position', 'fontsize'
    """
    doc = fitz.open(input_pdf)
    
    for mapping in text_mappings:
        page_num = mapping['page'] - 1  # 0-indexed
        page = doc[page_num]
        
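        # Insert text at the specified position. The point marks where the
        # text baseline starts, so glyphs are drawn above and to the right of it.
        point = fitz.Point(mapping['position'][0], mapping['position'][1])
        text = str(mapping['text'])  # JSON values may be numbers
        
        page.insert_text(
            point,
            text,
            fontsize=mapping.get('fontsize', 10),
            # 'helv' (Helvetica) has no Korean glyphs; for Korean values pass
            # a CJK font instead, e.g. PyMuPDF's built-in fontname='korea'
            fontname=mapping.get('fontname', 'helv'),
            color=mapping.get('color', (0, 0, 0)),  # RGB, black by default
        )
        
        preview = text if len(text) <= 30 else text[:30] + '...'
        print(f"  ✓ Added '{preview}' at position {mapping['position']} on page {mapping['page']}")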
    
    doc.save(output_pdf)
    doc.close()

# Define text overlay positions
# IMPORTANT: You need to determine these coordinates from notebook 01
# or by examining the PDF manually
# Format: (x, y) where (0,0) is top-left corner

text_overlays = [
    {
        'page': 1,
        'text': data_to_fill['name'],
        'position': (150, 200),  # Adjust these coordinates!
        'fontsize': 11,
    },
    {
        'page': 1,
        'text': data_to_fill['id_number'],
        'position': (150, 230),  # Adjust these coordinates!
        'fontsize': 11,
    },
    {
        'page': 1,
        'text': data_to_fill['address'],
        'position': (150, 260),  # Adjust these coordinates!
        'fontsize': 10,
    },
    {
        'page': 1,
        'text': data_to_fill['phone'],
        'position': (150, 290),  # Adjust these coordinates!
        'fontsize': 11,
    },
]

print("\n" + "="*60)
print("Text Overlay Population (PyMuPDF)")
print("="*60 + "\n")
print("NOTE: You need to adjust the coordinates based on your PDF!")
print("Use notebook 01 to analyze text positions.\n")

output_method2 = PDF_OUTPUT.replace('.pdf', '_method2_text_overlay.pdf')

try:
    populate_with_text_overlay(PDF_INPUT, output_method2, text_overlays)
    print(f"\n✓ Output saved to: {output_method2}")
    print("\nReview the PDF and adjust coordinates in the text_overlays list above.")
except Exception as e:
    print(f"\n✗ Error: {e}")

## Approach 3: pypdf for Form Filling

An alternative method for filling form fields, using the pypdf library.

In [None]:
def populate_form_fields_pypdf(input_pdf, output_pdf, field_mapping):
    """
    Populate PDF form fields using pypdf
    """
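    reader = PdfReader(input_pdf)
    writer = PdfWriter()
    
    # append() copies the pages together with the document-level form
    # definition (/AcroForm); add_page() alone leaves it behind
    writer.append(reader)
    
    # Collect the values for the fields that exist in this PDF
    fields = reader.get_fields() or {}
    values = {name: str(field_mapping[name]) for name in fields if name in field_mapping}
    
    if values:
        # Fields can sit on any page, so update every page that has annotations
        # (the original code only updated page 0)
        for page in writer.pages:
            if "/Annots" in page:
                writer.update_page_form_field_values(page, values)
        for name, value in values.items():
            print(f"  ✓ Populated '{name}' with '{value}'")
    
    populated_count = len(values)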
    
    # Write to file
    with open(output_pdf, 'wb') as output_file:
        writer.write(output_file)
    
    return populated_count > 0

print("\n" + "="*60)
print("Form Field Population (pypdf)")
print("="*60 + "\n")

output_method3 = PDF_OUTPUT.replace('.pdf', '_method3_pypdf.pdf')

try:
    success = populate_form_fields_pypdf(PDF_INPUT, output_method3, field_mapping)
    
    if success:
        print(f"\n✓ Output saved to: {output_method3}")
    else:
        print("\n✗ No form fields found or populated.")
except Exception as e:
    print(f"\n✗ Error: {e}")

## Approach 4: Smart Auto-Detection and Population

This approach automatically detects form fields and tries to match them with JSON data intelligently.
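
Substring checks such as `'id' in field_name` also match unrelated names like `validity`. One alternative is to split the field name into tokens and compare whole tokens. A minimal sketch under that assumption; `KEYWORDS`, `tokenize`, and `match_json_key` are hypothetical names, and the keyword sets are illustrative rather than taken from a real form.

```python
import re

# Hypothetical keyword table: JSON key -> whole tokens that identify a field.
KEYWORDS = {
    "phone": {"phone", "tel", "telephone", "mobile", "전화"},
    "id_number": {"id", "resident", "registration", "ssn", "주민번호"},
    "address": {"address", "addr", "주소"},
    "name": {"name", "fullname", "성명"},
}

def tokenize(field_name):
    """Split names like 'Phone_Number' or 'fullName' into lowercase tokens."""
    spaced = re.sub(r"([a-z])([A-Z])", r"\1 \2", field_name)  # split camelCase
    return [tok.lower() for tok in re.findall(r"[^\W_]+", spaced)]

def match_json_key(field_name, keywords=KEYWORDS):
    """Return the first JSON key whose keyword set shares a token with the field name."""
    tokens = set(tokenize(field_name))
    for json_key, words in keywords.items():
        if tokens & words:
            return json_key
    return None

for candidate in ["Phone_Number", "ID Number", "fullName", "주소", "Validity"]:
    print(candidate, "->", match_json_key(candidate))
```

Whole-token matching trades recall for precision: a field named `applicantname` would no longer match, so the keyword sets need to reflect how the form actually names its fields.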

In [None]:
def smart_populate_pdf(input_pdf, output_pdf, json_data):
    """
    Intelligently populate PDF by matching field names with JSON keys
    """
    doc = fitz.open(input_pdf)
    populated = []
    
    # Keywords for matching
    name_keywords = ['name', 'fullname', 'full_name', '성명', 'nome', 'nombre']
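    # 'number' is deliberately not an ID keyword: ID keywords are checked
    # before phone keywords, so it would also capture 'phone_number' fields
    id_keywords = ['id', 'resident', 'registration', '주민', 'ssn']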
    address_keywords = ['address', '주소', 'addr', 'location']
    phone_keywords = ['phone', 'tel', 'telephone', '전화', 'mobile', 'contact']
    
    for page_num in range(len(doc)):
        page = doc[page_num]
        widgets = page.widgets()
        
        for widget in widgets:
            field_name = widget.field_name.lower() if widget.field_name else ''
            
            # Try to match field with data
            value = None
            
            if any(kw in field_name for kw in name_keywords):
                value = json_data.get('name')
            elif any(kw in field_name for kw in id_keywords):
                value = json_data.get('id_number')
            elif any(kw in field_name for kw in address_keywords):
                value = json_data.get('address')
            elif any(kw in field_name for kw in phone_keywords):
                value = json_data.get('phone')
            
            if value:
                widget.field_value = str(value)
                widget.update()
                populated.append({
                    'field': widget.field_name,
                    'value': value,
                    'page': page_num + 1
                })
                # str() keeps the preview from failing on non-string JSON values
                print(f"  ✓ Auto-matched '{widget.field_name}' → '{str(value)[:30]}'")
    
    doc.save(output_pdf)
    doc.close()
    
    return populated

print("\n" + "="*60)
print("Smart Auto-Detection Population")
print("="*60 + "\n")

output_method4 = PDF_OUTPUT.replace('.pdf', '_method4_smart.pdf')

try:
    populated_fields = smart_populate_pdf(PDF_INPUT, output_method4, data_to_fill)
    
    if populated_fields:
        print(f"\n✓ Successfully auto-populated {len(populated_fields)} fields!")
        print(f"\n✓ Output saved to: {output_method4}")
    else:
        print("\n⚠ No fields were auto-matched.")
        print("The PDF might not have form fields, or field names don't match keywords.")
except Exception as e:
    print(f"\n✗ Error: {e}")

## Verification: Compare Input vs Output

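Let's verify the populated PDFs by reading back their form field values. Text added by the overlay method is page content rather than a field value, so the Method 2 output is expected to report no populated fields here.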
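
For the overlay output, one option is to compare the expected values against the text extracted from the page. A minimal sketch, assuming the page text comes from PyMuPDF's `page.get_text()`; `find_missing_text` is a hypothetical helper, and a made-up string stands in for the extracted text.

```python
def find_missing_text(expected_values, page_text):
    """Return the expected values that do not appear in the extracted page text."""
    normalized = " ".join(page_text.split())  # collapse line breaks and repeated spaces
    return [
        value for value in expected_values
        if " ".join(str(value).split()) not in normalized
    ]

# In the notebook, page_text would come from PyMuPDF, for example:
#   doc = fitz.open(output_method2)
#   page_text = doc[0].get_text()
#   doc.close()
# Here a made-up string stands in for it.
page_text = "Name  Hong Gildong\nPhone 010-0000-0000\n"
print(find_missing_text(["Hong Gildong", "010-0000-0000", "Seoul"], page_text))
```

This only confirms that the text is present on the page, not that it landed in the right place, so a visual check of the coordinates is still needed.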

In [None]:
def verify_populated_pdf(pdf_path):
    """
    Read and display form field values from a PDF
    """
    if not os.path.exists(pdf_path):
        print(f"File not found: {pdf_path}")
        return
    
    doc = fitz.open(pdf_path)
    print(f"\nVerifying: {pdf_path}")
    print("="*60)
    
    found_values = False
    for page_num in range(len(doc)):
        page = doc[page_num]
        widgets = page.widgets()
        
        for widget in widgets:
            if widget.field_value:
                found_values = True
                print(f"Page {page_num + 1} - {widget.field_name}: {widget.field_value}")
    
    if not found_values:
        print("No populated field values found.")
    
    doc.close()

# Check which output files exist and verify them
output_files = [
    output_method1,
    output_method2,
    output_method3,
    output_method4
]

print("\n" + "="*60)
print("Verification of Generated PDFs")
print("="*60)

for output_file in output_files:
    if os.path.exists(output_file):
        verify_populated_pdf(output_file)

## Summary and Recommendations

### Method Comparison:

1. **Method 1 (PyMuPDF Form Fields)**: Best when the PDF has fillable form fields with known names
2. **Method 2 (Text Overlay)**: Best when the PDF has no form fields; requires coordinate tuning
3. **Method 3 (pypdf)**: Alternative to Method 1 that fills the same form fields with pypdf instead of PyMuPDF
4. **Method 4 (Smart Auto-Detection)**: Best for quick testing; it matches fields by keyword, so review the matches before relying on them

### Next Steps:

1. Check the generated PDFs in the `results/` folder
2. If using text overlay (Method 2), adjust coordinates based on your PDF structure
3. For production use, create a mapping file specific to your PDF form
4. Consider using AWS Textract for complex form analysis
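
The mapping file from step 3 could be a small JSON document that names, per PDF template, which JSON key goes into which form field or overlay position. A sketch of one possible layout, not a fixed format: the structure, the field names `applicant_name` and `applicant_phone`, and the helper `resolve_mapping` are all hypothetical.

```python
import json

# Hypothetical mapping-file content; in practice this would live in its own
# file next to the PDF template and be read with json.load().
MAPPING_JSON = """
{
  "fields": {"applicant_name": "name", "applicant_phone": "phone"},
  "overlays": [
    {"key": "address", "page": 1, "position": [150, 260], "fontsize": 10}
  ]
}
"""

def resolve_mapping(spec, data):
    """Turn a mapping spec plus JSON data into (field_mapping, text_overlays)."""
    field_mapping = {
        pdf_field: data[json_key]
        for pdf_field, json_key in spec.get("fields", {}).items()
        if json_key in data  # skip keys the input does not provide
    }
    text_overlays = [
        {
            "page": item["page"],
            "text": str(data[item["key"]]),
            "position": tuple(item["position"]),
            "fontsize": item.get("fontsize", 10),
        }
        for item in spec.get("overlays", [])
        if item["key"] in data
    ]
    return field_mapping, text_overlays

spec = json.loads(MAPPING_JSON)
fields, overlays = resolve_mapping(spec, {"name": "Hong Gildong", "address": "Seoul"})
print(fields)
print(overlays)
```

The two results have the same shapes as `field_mapping` and `text_overlays` in the cells above, so the existing functions could consume them without changes.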

### For AWS/AI Enhancement:

To use AWS services:
- **AWS Textract**: Analyze PDF forms and extract field locations automatically
- **AWS Lambda**: Automate the population process
- **AWS S3**: Store input/output PDFs
