# PDF Form Field Extraction

This notebook extracts form field information from a PDF document, including:
- Field names
- Field types
- Field positions/coordinates
- Current values (if any)

We'll test multiple Python libraries to find the best approach.

In [3]:
import json
import os
from pathlib import Path

# PDF processing libraries
import fitz  # PyMuPDF
from pypdf import PdfReader
import pdfplumber

print("Libraries imported successfully!")

Libraries imported successfully!


In [4]:
# File paths
PDF_PATH = "pdf/A0124_pages_1_to_4.pdf"
JSON_PATH = "inputs/test.json"

# Verify files exist
print(f"PDF exists: {os.path.exists(PDF_PATH)}")
print(f"JSON exists: {os.path.exists(JSON_PATH)}")

PDF exists: True
JSON exists: True


## Method 1: PyMuPDF (fitz) - Most Comprehensive

PyMuPDF exposes each form field as a widget object that carries the field's name, type, current value, flags, and bounding rectangle.

In [5]:
def extract_fields_pymupdf(pdf_path):
    """
    Extract form fields using PyMuPDF (fitz)
    Returns detailed information about each field
    """
    fields = []

    # The context manager closes the document even if extraction fails
    with fitz.open(pdf_path) as doc:
        for page_num, page in enumerate(doc, start=1):
            # Each widget is one form field placed on this page
            for widget in page.widgets():
                fields.append({
                    'page': page_num,
                    'field_name': widget.field_name,
                    'field_type': widget.field_type_string,
                    'field_value': widget.field_value,
                    'rect': widget.rect,  # Position (x0, y0, x1, y1)
                    'field_flags': widget.field_flags,
                    'field_label': getattr(widget, 'field_label', None),
                })

    return fields

# Extract fields
print("\n" + "="*60)
print("PyMuPDF (fitz) Field Extraction")
print("="*60)

pymupdf_fields = extract_fields_pymupdf(PDF_PATH)

if pymupdf_fields:
    print(f"\nFound {len(pymupdf_fields)} form fields:\n")
    for i, field in enumerate(pymupdf_fields, 1):
        print(f"Field {i}:")
        print(f"  Name: {field['field_name']}")
        print(f"  Type: {field['field_type']}")
        print(f"  Page: {field['page']}")
        print(f"  Value: {field['field_value']}")
        print(f"  Position: {field['rect']}")
        print()
else:
    print("\nNo form fields found with PyMuPDF.")
    print("This PDF might not have fillable form fields.")


PyMuPDF (fitz) Field Extraction

No form fields found with PyMuPDF.
This PDF might not have fillable form fields.
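
The `field_flags` value collected above is a bit field. As a rough sketch of how it could be decoded once a PDF does contain widgets, the helper below checks a few bits defined for the `/Ff` entry in the PDF specification (bit 1 is the lowest bit). The names `FIELD_FLAG_BITS` and `decode_field_flags` are hypothetical and not part of PyMuPDF.

```python
# Bit positions for the /Ff (field flags) entry; bit 1 is the least significant bit.
FIELD_FLAG_BITS = {
    "read_only": 1,
    "required": 2,
    "no_export": 3,
    "multiline": 13,  # meaningful for text fields only
    "password": 14,   # meaningful for text fields only
}


def decode_field_flags(flags):
    """Return the names of the known flags that are set in an integer flags value."""
    return [name for name, bit in FIELD_FLAG_BITS.items() if flags & (1 << (bit - 1))]


print(decode_field_flags(2))     # only the 'required' bit is set
print(decode_field_flags(4097))  # 4096 (bit 13) + 1 (bit 1)
```

Only a handful of flags are listed; button and choice fields define further bits that this sketch ignores.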


## Method 2: pypdf - Alternative Approach

pypdf (successor to PyPDF2) reads form fields through `PdfReader.get_fields()`, which returns them keyed by field name.

In [6]:
def extract_fields_pypdf(pdf_path):
    """
    Extract form fields using pypdf
    """
    reader = PdfReader(pdf_path)
    fields = []
    
    # Read the fields once; get_fields() returns None when the PDF has no form
    form_fields = reader.get_fields()
    if form_fields:
        
        for field_name, field_data in form_fields.items():
            field_info = {
                'field_name': field_name,
                'field_type': field_data.get('/FT', 'Unknown'),
                'field_value': field_data.get('/V', ''),
                'default_value': field_data.get('/DV', ''),
                'field_flags': field_data.get('/Ff', 0),
            }
            fields.append(field_info)
    
    return fields

print("\n" + "="*60)
print("pypdf Field Extraction")
print("="*60)

pypdf_fields = extract_fields_pypdf(PDF_PATH)

if pypdf_fields:
    print(f"\nFound {len(pypdf_fields)} form fields:\n")
    for i, field in enumerate(pypdf_fields, 1):
        print(f"Field {i}:")
        print(f"  Name: {field['field_name']}")
        print(f"  Type: {field['field_type']}")
        print(f"  Value: {field['field_value']}")
        print()
else:
    print("\nNo form fields found with pypdf.")


pypdf Field Extraction

No form fields found with pypdf.
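
When pypdf does return fields, the `/FT` entry holds a short PDF name rather than a readable type. A minimal sketch of translating it, assuming only the four field types defined by the PDF specification; `FIELD_TYPE_NAMES` and `describe_field_type` are hypothetical names used for illustration.

```python
# The four /FT values defined for interactive form fields.
FIELD_TYPE_NAMES = {
    "/Tx": "text",
    "/Btn": "button (checkbox, radio button, or push button)",
    "/Ch": "choice (list box or combo box)",
    "/Sig": "signature",
}


def describe_field_type(ft):
    """Map a raw /FT value to a readable label, falling back to 'Unknown'."""
    return FIELD_TYPE_NAMES.get(str(ft), "Unknown")


print(describe_field_type("/Tx"))
print(describe_field_type("Unknown"))
```

The fallback matches the `'Unknown'` default already used in `extract_fields_pypdf`.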


## Method 3: Text and Layout Analysis with pdfplumber

If the PDF doesn't have fillable form fields, we can analyze text positions to understand where data should go.

In [7]:
def analyze_pdf_text(pdf_path):
    """
    Extract text with positions using pdfplumber
    Useful for understanding PDF structure even without form fields
    """
    with pdfplumber.open(pdf_path) as pdf:
        print(f"Total pages: {len(pdf.pages)}\n")
        
        for page_num, page in enumerate(pdf.pages, 1):
            print(f"\n--- Page {page_num} ---")
            print(f"Size: {page.width} x {page.height}")
            
            # Extract text
            text = page.extract_text()
            if text:
                print(f"\nText content:\n{text[:500]}...")  # First 500 chars
            
            # Extract words with positions (useful for mapping)
            words = page.extract_words()
            if words:
                print(f"\nFirst 10 words with positions:")
                for word in words[:10]:
                    print(f"  '{word['text']}' at ({word['x0']:.2f}, {word['top']:.2f})")

print("\n" + "="*60)
print("Text and Layout Analysis (pdfplumber)")
print("="*60)

analyze_pdf_text(PDF_PATH)


Text and Layout Analysis (pdfplumber)
Total pages: 4


--- Page 1 ---
Size: 595 x 842

Text content:
■ ■
www.lina.co.kr
보험금 청구서
※ 기재하시는 정보를 바르게 정자체로 작성해주시면 빠른 보험금 심사를 받으실 수 있습니다.
아래의 항목을 모두 작성하시고 보험금 청구서와 개인(신용)정보처리동의서를 모두 접수해 주셔야 정상적인 보험금심사 및 지급이 가능합니다.
※ 우편(등기) 보내실 곳 : (우) 03156 서울특별시 종로구 삼봉로 48 라이나타워 18층 ㈜라이나생명 보험금심사담당자 앞 (문의전화 : 고객센터 1588-0058)
■ 피보험자 (보험대상자) ※ 보험금 수령을 위임하시는 경우에는 「보험금 수령 위임장」 및 피위임자의 「개인(신용)정보처리 동의서」를 추가로 제출하셔야 합니다.
성명 주민등록번호 연락처
※ 특정금융정보법 제5조의2 및 동법 시행령 제10조의4에 따른 고객확인의무 이행을 위하여 수익자의 정보를 확인합니다. 기존에 당사에 통지하신 정
보로부터 변경사항이 있으신 경우 반드시 최신의 정보를 기준으로 작성하여 주셔야 하며, 미기재된 항목에 대하여서는 변...

First 10 words with positions:
  '■' at (11.64, 13.26)
  '■' at (571.65, 13.26)
  'www.lina.co.kr' at (435.69, 58.87)
  '보험금' at (41.47, 83.64)
  '청구서' at (114.22, 83.64)
  '※' at (189.81, 94.75)
  '기재하시는' at (196.85, 94.75)
  '정보를' at (233.18, 94.75)
  '바르게' at (255.67, 94.75)
  '정자체로' at (279.23, 93.95)

--- Page 2 ---
Size: 595 x 842

Text content:
www.lina.co.kr
보험금 청구를 위한 개인(신용)정보 처리 동
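
The word positions printed above can be searched for a printed label to find where a value might go. The sketch below works on plain dictionaries shaped like the ones `page.extract_words()` returns (`text`, `x0`, `x1`, `top`, `bottom`). The `x0` and `top` values in the sample come from the output above; the `x1` and `bottom` values are made up for illustration, and `find_words` and `right_of` are hypothetical helpers.

```python
def find_words(words, target):
    """Return every word dict whose text equals target exactly."""
    return [w for w in words if w["text"] == target]


def right_of(word, gap=5.0):
    """Return a point just to the right of a word, level with its top edge."""
    return (word["x1"] + gap, word["top"])


# x0/top are taken from the notebook output; x1/bottom are illustrative assumptions.
sample_words = [
    {"text": "보험금", "x0": 41.47, "x1": 105.0, "top": 83.64, "bottom": 105.0},
    {"text": "청구서", "x0": 114.22, "x1": 178.0, "top": 83.64, "bottom": 105.0},
]

for match in find_words(sample_words, "청구서"):
    print(match["text"], "->", right_of(match))
```

Exact matching is a simplification: a label split across several words, or repeated on the page, would need extra handling.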

## Load JSON Data to Map

Let's load the JSON data we need to populate into the PDF.

In [8]:
# Load JSON data
with open(JSON_PATH, 'r', encoding='utf-8') as f:
    json_data = json.load(f)

print("JSON Data to populate:")
print(json.dumps(json_data, indent=2, ensure_ascii=False))

parsed_data = json_data.get('parsedJson', {})
print("\nParsed fields:")
for key, value in parsed_data.items():
    print(f"  {key}: {value}")

JSON Data to populate:
{
  "fileName": "36d3bc15-37ad-40e5-ae3e-94ea4f9a1264.png",
  "extractedText": "1. 성명 Jeon Chulmin 주민 번호 940101-1111111 성별 2. 주소 48, Sambong-ro, Jongno-gu, Seoul, 03156, Rep. of KOREA 전화: 010-1234-1234",
  "confidence": 0.8400000000000001,
  "language": "kor+eng",
  "parsedJson": {
    "name": "Jeon Chulmin",
    "id_number": "940101-1111111",
    "address": "48, Sambong-ro, Jongno-gu, Seoul, 03156, Rep. of KOREA",
    "phone": "010-1234-1234"
  }
}

Parsed fields:
  name: Jeon Chulmin
  id_number: 940101-1111111
  address: 48, Sambong-ro, Jongno-gu, Seoul, 03156, Rep. of KOREA
  phone: 010-1234-1234


## Field Mapping Analysis

Now let's create a mapping between JSON fields and PDF fields.

In [10]:
# JSON fields we need to map, taken from the parsed data loaded above
json_fields = dict(parsed_data)

print("Fields to map from JSON:")
for field, value in json_fields.items():
    print(f"  {field}: {value}")

print("\n" + "="*60)
print("Field Mapping Strategy")
print("="*60)

if pymupdf_fields:
    print("\nPDF has fillable form fields. We can use direct field population.")
    print("\nSuggested mapping (you may need to adjust based on actual field names):")
    
    # Try to suggest mappings
    for json_key in json_fields.keys():
        print(f"\n  JSON '{json_key}' could map to:")
        for pdf_field in pymupdf_fields:
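            # field_name can be None for unnamed widgets, so fall back to ''
            field_name = (pdf_field['field_name'] or '').lower()
            # Match on the full JSON key or its first token (e.g. 'id' from 'id_number')
            if any(keyword in field_name for keyword in (json_key, json_key.split('_')[0])):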
                print(f"    - PDF field: '{pdf_field['field_name']}'")
else:
    print("\nPDF does NOT have fillable form fields.")
    print("We'll need to use coordinate-based text insertion or PDF creation.")
    print("\nCheck the text analysis above to understand the PDF structure.")

Fields to map from JSON:
  name: Jeon Chulmin
  id_number: 940101-1111111
  address: 48, Sambong-ro, Jongno-gu, Seoul, 03156, Rep. of KOREA
  phone: 010-1234-1234

Field Mapping Strategy

PDF does NOT have fillable form fields.
We'll need to use coordinate-based text insertion or PDF creation.

Check the text analysis above to understand the PDF structure.
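
For the coordinate-based route, one possible first step is to pair each JSON key with a label printed on the form and derive a target point from that label's position. In the sketch below, the labels 성명, 주민등록번호, and 연락처 appear in the page 1 text above, but pairing them with these JSON keys is an assumption. The label position and the `dx` offset are placeholders rather than measured values, and `LABEL_FOR_KEY` and `plan_placements` are hypothetical names.

```python
# Assumed pairing of JSON keys with labels seen in the page 1 text.
LABEL_FOR_KEY = {
    "name": "성명",
    "id_number": "주민등록번호",
    "phone": "연락처",
}


def plan_placements(values, label_positions, dx=40.0):
    """Pair each value with a point dx to the right of its label.

    Returns (placements, unmapped_keys). Keys without a known label or
    without a located label end up in unmapped_keys.
    """
    placements, unmapped = [], []
    for key, value in values.items():
        label = LABEL_FOR_KEY.get(key)
        if label is None or label not in label_positions:
            unmapped.append(key)
            continue
        x, y = label_positions[label]
        placements.append({"key": key, "text": value, "point": (x + dx, y)})
    return placements, unmapped


# Placeholder position; real values would come from the pdfplumber word list.
positions = {"성명": (50.0, 200.0)}
plan, missing = plan_placements(
    {"name": "Jeon Chulmin", "address": "48, Sambong-ro"}, positions
)
print(plan)
print(missing)
```

Returning the unmapped keys separately makes it visible which values still need a manually chosen position.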


## Save Field Information

Save the extracted field information for reference.

In [11]:
# Save field information to JSON for reference
output_data = {
    'pdf_file': PDF_PATH,
    'has_form_fields': len(pymupdf_fields) > 0,
    'pymupdf_fields': [
        {
            'name': f['field_name'],
            'type': f['field_type'],
            'page': f['page'],
            'value': f['field_value'],
            'position': str(f['rect'])
        } for f in pymupdf_fields
    ],
    'pypdf_fields': [
        {
            'name': f['field_name'],
            'type': str(f['field_type']),
            'value': f['field_value']
        } for f in pypdf_fields
    ]
}

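output_path = 'results/pdf_field_analysis.json'
# Create the output directory if it does not exist yet
os.makedirs(os.path.dirname(output_path), exist_ok=True)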
with open(output_path, 'w', encoding='utf-8') as f:
    json.dump(output_data, f, indent=2, ensure_ascii=False)

print(f"Field analysis saved to: {output_path}")

Field analysis saved to: results/pdf_field_analysis.json


## Summary

Based on the analysis above:

1. **If form fields were found**: Proceed to notebook 02 to populate them directly
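2. **If no form fields were found** (the case for this PDF): The document is a non-fillable form. pdfplumber could still extract its text, so it has a text layer rather than being a pure scan. We'll need to: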
   - Option A: Use coordinate-based text overlay
   - Option B: Create a new PDF from scratch with the data
   - Option C: Use AWS Textract or similar services to understand the form structure better

Run this notebook first to understand the PDF structure, then proceed to the population notebook.
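
Option A could look roughly like the sketch below, which uses PyMuPDF's `Page.insert_text`. It assumes the positions are already known in top-left-origin page coordinates, that the page is not rotated, and that the values are Latin text (the built-in `helv` font does not cover Korean). `overlay_text` and `baseline_from_top` are hypothetical helpers, and adding `fontsize` to the top edge is only a rough approximation of the text baseline.

```python
def baseline_from_top(x, top, fontsize=10):
    """Approximate the baseline point for text whose top edge sits at `top`."""
    return (x, top + fontsize)


def overlay_text(pdf_path, out_path, placements, fontsize=10):
    """Write text onto a copy of the PDF.

    placements: iterable of dicts with 'page' (1-based), 'point' (x, y)
    and 'text'. The point is the left end of the text baseline.
    """
    import fitz  # PyMuPDF; imported lazily so baseline_from_top works without it

    with fitz.open(pdf_path) as doc:
        for p in placements:
            page = doc[p["page"] - 1]
            page.insert_text(p["point"], p["text"], fontsize=fontsize, fontname="helv")
        # Save to a new file so the original form stays untouched
        doc.save(out_path)


print(baseline_from_top(100.0, 50.0))
```

A call such as `overlay_text(PDF_PATH, "results/filled.pdf", [{"page": 1, "point": baseline_from_top(100.0, 50.0), "text": "Jeon Chulmin"}])` would then write one value; the output path and coordinates here are placeholders.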