# W2 Form OCR with Amazon Nova Lite - Model Evaluation

## Introduction

In this final notebook, we'll evaluate the performance of our fine-tuned Amazon Nova Lite model on W2 tax form OCR tasks. We'll compare the base model against our custom fine-tuned model to measure the improvements achieved through the fine-tuning process.

The evaluation will focus on:
- Overall field extraction accuracy
- Accuracy by field category (employee information, employer information, earnings, etc.)
- Analysis of numerical field errors and their distribution
- Comparative performance metrics between base and fine-tuned models

This analysis will help us understand how effectively our model has learned to extract structured data from tax documents and where further improvements might be needed.

## Environment Setup

First, let's set up our environment by importing the necessary libraries and initializing the Amazon Bedrock clients:

In [None]:
import boto3
import os
import json
import time
import shutil
from tqdm import tqdm
from datasets import load_dataset
from PIL import Image
import io
import uuid
import warnings
import numpy as np
import re
warnings.filterwarnings('ignore')

# Set AWS region
region = "us-east-1"

# Create AWS clients
session = boto3.session.Session(region_name=region)
s3_client = session.client('s3')
sts_client = session.client('sts')
bedrock = session.client(service_name="bedrock")
bedrock_runtime = session.client(service_name="bedrock-runtime", region_name=region)

In [None]:
# Retrieve stored variables from previous notebook
%store -r bucket_name
%store -r test_data_uri
%store -r role_arn
%store -r role_name
%store -r policy_arn
%store -r text_prompt
%store -r test_s3_paths
%store -r account_id
%store -r deployment_arn

print(f"Test data URI: {test_data_uri}")
print(f"Role ARN: {role_arn}")
print(f"Custom model deployment arn: {deployment_arn}")

## Load Test Dataset

We'll load our test dataset from the JSONL file we created in the data preparation notebook. This dataset contains W2 form images and their corresponding ground truth structured data:

In [None]:
local_test_data_path = "./test.jsonl"
test_data = []
with open(local_test_data_path, 'r') as f:
    for line in f:
        line = line.strip()
        if line:  # Skip blank lines, which json.loads would reject
            test_data.append(json.loads(line))

### Helper functions for evaluation and test data formatting

In [None]:
# Function to flatten nested dictionaries
def flatten_dict(d, parent_key='', sep='_'):
    items = []
    for k, v in d.items():
        new_key = parent_key + sep + k if parent_key else k
        if isinstance(v, dict):
            items.extend(flatten_dict(v, new_key, sep=sep).items())
        else:
            items.append((new_key, v))
    return dict(items)

In [None]:
# Function to evaluate a single prediction against ground truth
def evaluate_prediction(gt, pred):
    # Flatten both dictionaries to make comparison easier
    flat_gt = flatten_dict(gt)
    flat_pred = flatten_dict(pred)
    
    # Get all unique keys
    all_keys = set(flat_gt.keys()).union(set(flat_pred.keys()))
    
    # Track correct predictions
    correct = 0
    total = 0
    errors = {}
    
    # Compare each field
    for key in all_keys:
        total += 1
        
        # If key exists in both dictionaries
        if key in flat_gt and key in flat_pred:
            # For numerical values, allow small tolerance
            if isinstance(flat_gt[key], (int, float)) and isinstance(flat_pred[key], (int, float)):
                # Calculate percentage error for numerical values
                if abs(flat_gt[key]) > 0:  # Avoid division by zero
                    pct_error = abs(flat_gt[key] - flat_pred[key]) / abs(flat_gt[key]) * 100
                    if pct_error < 0.1:  # 0.1% tolerance for numerical values
                        correct += 1
                    else:
                        errors[key] = (flat_gt[key], flat_pred[key], pct_error)
                else:
                    # If ground truth is 0, check if prediction is very close to 0
                    if abs(flat_pred[key]) < 0.1:
                        correct += 1
                    else:
                        errors[key] = (flat_gt[key], flat_pred[key], float('inf'))
            # For strings, exact match required
            elif flat_gt[key] == flat_pred[key]:
                correct += 1
            else:
                errors[key] = (flat_gt[key], flat_pred[key], "mismatch")
        else:
            # Missing or extra field
            gt_val = flat_gt.get(key, None)
            pred_val = flat_pred.get(key, None)
            errors[key] = (gt_val, pred_val, "missing" if key in flat_gt else "extra")
    
    accuracy = correct / total if total > 0 else 0
    return {
        'accuracy': accuracy,
        'correct': correct,
        'total': total,
        'errors': errors
    }
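
The numerical comparison rule used above can be isolated into a small helper to make the tolerance behaviour easier to see. This is a minimal sketch: `within_tolerance` and its parameter names are hypothetical and not part of the notebook. It mirrors the 0.1% relative tolerance from `evaluate_prediction`, along with the 0.1 absolute tolerance applied when the ground truth is zero.

```python
def within_tolerance(gt_value, pred_value, rel_tol_pct=0.1, abs_tol=0.1):
    """Return True if pred_value matches gt_value under the notebook's rule.

    Hypothetical helper mirroring evaluate_prediction: a relative tolerance
    (in percent) when the ground truth is non-zero, and an absolute
    tolerance when the ground truth is exactly zero.
    """
    if gt_value != 0:
        pct_error = abs(gt_value - pred_value) / abs(gt_value) * 100
        return pct_error < rel_tol_pct
    return abs(pred_value) < abs_tol


# Illustrative values, not taken from the dataset
print(within_tolerance(50000.00, 50000.40))  # off by 0.0008%
print(within_tolerance(50000.00, 50500.00))  # off by 1%
print(within_tolerance(0, 0.05))             # ground truth is zero
```

Because the tolerance is relative, a ground truth value of 50,000 accepts any prediction that is off by less than 50, which is worth keeping in mind when interpreting accuracy on large dollar amounts.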

In [None]:
# Function to categorize fields
def get_field_category(field):
    if field.startswith('employee'):
        return 'Employee Information'
    elif field.startswith('employer'):
        return 'Employer Information'
    elif field.startswith('earnings'):
        return 'Earnings'
    elif field.startswith('benefits'):
        return 'Benefits'
    elif field.startswith('multiStateEmployment'):
        return 'Multi-State Employment'
    else:
        return 'Other'

In [None]:
def reorder_content(entry):
    # Make a deep copy to avoid modifying the original
    import copy
    new_entry = copy.deepcopy(entry)

    # Find text and image items in content
    if 'content' in new_entry and isinstance(new_entry['content'], list):
        text_items = [i for i in new_entry['content'] if 'text' in i]
        image_items = [i for i in new_entry['content'] if 'image' in i]

        # Create new content list with image first, then text
        new_entry['content'] = image_items + text_items

    return new_entry

### Comprehensive Model Evaluation Function

This is our main evaluation function that:
1. Takes a sample of test data examples
2. Passes them to the model for inference
3. Compares the predictions against ground truth
4. Computes overall and categorical accuracy metrics
5. Analyzes numerical field errors
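
Steps 2 and 3 both depend on pulling a JSON object out of text: the ground truth arrives as a complete fenced `json` block, while the model returns a bare continuation of the prefilled assistant turn. The sketch below shows one way to handle both cases with a single helper. `extract_json` is a hypothetical name, and the sample strings are illustrative rather than taken from the dataset.

```python
import json
import re

FENCE = "`" * 3  # Built programmatically so the example embeds cleanly in Markdown


def extract_json(text):
    """Parse a JSON object from text that may be wrapped in a fenced json block.

    Hypothetical helper: returns the parsed object, or None if nothing parses.
    """
    # Case 1: a complete fenced block, as in the ground-truth assistant message
    pattern = re.escape(FENCE) + r"json\s*(.*?)\s*" + re.escape(FENCE)
    match = re.search(pattern, text, re.DOTALL)
    # Case 2 (fallback): a bare continuation, as returned after the prefilled turn
    candidate = match.group(1) if match else text.replace(FENCE, "").strip()
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        return None


ground_truth_text = FENCE + 'json\n{"employee": {"name": "Jane Doe"}}\n' + FENCE
model_output_text = '\n{"employee": {"name": "Jane Doe"}}\n'

print(extract_json(ground_truth_text))
print(extract_json(model_output_text))
print(extract_json("not json"))
```

Returning `None` instead of raising lets the caller decide whether to skip the sample or count every field as missing.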

In [None]:
# Function to run evaluation on a sample of the test dataset
def evaluate_model_on_test_data(num_samples, model_id):
    np.random.seed(42)  # For reproducibility
    
    # Select random samples from test data
    if num_samples > len(test_data):
        num_samples = len(test_data)
    
    sample_indices = np.random.choice(len(test_data), num_samples, replace=False)
    
    results = []
    field_categories = {
        'Employee Information': {'correct': 0, 'total': 0},
        'Employer Information': {'correct': 0, 'total': 0},
        'Earnings': {'correct': 0, 'total': 0},
        'Benefits': {'correct': 0, 'total': 0},
        'Multi-State Employment': {'correct': 0, 'total': 0},
        'Other': {'correct': 0, 'total': 0}
    }
    
    numerical_errors = []
    
    for i in tqdm(sample_indices, desc="Evaluating model"):
        # Get the test sample
        test_sample = test_data[i]
        messages = [reorder_content(entry) for entry in test_sample["messages"][:1]]
        messages.append(
            {
                "role": "assistant",
                "content": [
                    {"text": "```json"}
                ]
            }
        )

        pattern = r"```json\n(.*?)\n```"
        match = re.search(pattern, test_sample["messages"][1]["content"][0]["text"], re.DOTALL)
        if not match:
            # Skip samples without a parsable ground truth block; otherwise gt would be
            # undefined or left over from the previous iteration
            continue
        gt = json.loads(match.group(1))
        
        # Get model prediction
        response = bedrock_runtime.converse(
            modelId=model_id,
            messages=messages,
            inferenceConfig={"maxTokens": 5000, "temperature": 0.0, "topP": 0.1, "stopSequences": ["```"]},
        )
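        # Extract the prediction from the response
        prediction_text = response["output"]["message"]["content"][0]["text"]
        try:
            prediction = json.loads(prediction_text.replace("```", "").strip())
        except json.JSONDecodeError:
            # Treat unparsable output as an empty prediction so every field counts as missing
            prediction = {}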
        
        # Evaluate the prediction
        eval_result = evaluate_prediction(gt, prediction)
        results.append({
            'index': i,
            'accuracy': eval_result['accuracy'],
            'correct': eval_result['correct'],
            'total': eval_result['total'],
            'errors': eval_result['errors']
        })
        
        # Update field category stats
        flat_gt = flatten_dict(gt)
        for key in flat_gt:
            category = get_field_category(key)
            field_categories[category]['total'] += 1
            if key not in eval_result['errors']:
                field_categories[category]['correct'] += 1
        
        # Collect numerical errors
        for key, (gt_val, pred_val, error) in eval_result['errors'].items():
            if isinstance(error, (int, float)) and error != float('inf'):
                numerical_errors.append({
                    'field': key,
                    'gt': gt_val,
                    'pred': pred_val,
                    'error_pct': error,
                    'category': get_field_category(key)
                })
    
    # Calculate overall accuracy (guard against an empty result set)
    total_fields = sum(r['total'] for r in results)
    overall_accuracy = sum(r['correct'] for r in results) / total_fields if total_fields > 0 else 0
    
    # Calculate category accuracies
    category_accuracies = {}
    for category, stats in field_categories.items():
        accuracy = stats['correct'] / stats['total'] if stats['total'] > 0 else 0
        category_accuracies[category] = accuracy
    
    return {
        'results': results,
        'overall_accuracy': overall_accuracy,
        'category_accuracies': category_accuracies,
        'numerical_errors': numerical_errors
    }

## Base Model Evaluation

First, let's evaluate the performance of Amazon Nova Lite's base model without fine-tuning. This will serve as our baseline for comparison to measure the improvements achieved through fine-tuning.

The base model evaluation will:
1. Process 100 randomly selected test samples
2. Record accuracy metrics for each field category
3. Calculate overall extraction accuracy
4. Collect percentage errors on numerical fields for further analysis

This baseline measurement is critical for understanding the impact of our fine-tuning efforts.

In [None]:
nova_lite_id = "us.amazon.nova-lite-v1:0"

In [None]:
# Run evaluation on 100 test samples
print("Running base model evaluation on test dataset...")
eval_results = evaluate_model_on_test_data(100, nova_lite_id)

# Print overall accuracy
print(f"\nOverall Field Extraction Accuracy: {eval_results['overall_accuracy']:.2%}")

# Print category accuracies
print("\nAccuracy by Field Category:")
for category, accuracy in eval_results['category_accuracies'].items():
    print(f"  - {category}: {accuracy:.2%}")
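
The `numerical_errors` list returned by `evaluate_model_on_test_data` is collected but not summarized elsewhere in this notebook. The sketch below shows one possible way to look at its distribution by field category. `summarize_numerical_errors` is a hypothetical helper, and the sample records (including their field names) are made up for illustration.

```python
from collections import defaultdict
import statistics


def summarize_numerical_errors(numerical_errors):
    """Group percentage errors by category and compute simple statistics.

    Hypothetical helper; expects records shaped like the entries of the
    'numerical_errors' list built in evaluate_model_on_test_data.
    """
    by_category = defaultdict(list)
    for record in numerical_errors:
        by_category[record['category']].append(record['error_pct'])

    summary = {}
    for category, errors in by_category.items():
        summary[category] = {
            'count': len(errors),
            'mean_pct': statistics.mean(errors),
            'median_pct': statistics.median(errors),
            'max_pct': max(errors),
        }
    return summary


# Made-up records for illustration only; field names are placeholders
sample_errors = [
    {'field': 'earnings_wages', 'gt': 1000.0, 'pred': 1100.0, 'error_pct': 10.0, 'category': 'Earnings'},
    {'field': 'earnings_tips', 'gt': 200.0, 'pred': 210.0, 'error_pct': 5.0, 'category': 'Earnings'},
    {'field': 'benefits_retirement', 'gt': 500.0, 'pred': 400.0, 'error_pct': 20.0, 'category': 'Benefits'},
]

for category, stats in summarize_numerical_errors(sample_errors).items():
    print(f"{category}: n={stats['count']}, mean={stats['mean_pct']:.2f}%, max={stats['max_pct']:.2f}%")
```

With real results, this would be called as `summarize_numerical_errors(eval_results['numerical_errors'])`.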

## Custom Fine-tuned Model Evaluation

Now, let's evaluate our fine-tuned model using the same methodology and test dataset. The custom model has been specifically trained on W2 tax form data to improve its extraction accuracy and field recognition.

By using identical evaluation parameters and test samples, we can directly compare:
1. Overall accuracy improvements
2. Field-specific accuracy gains
3. Reduction in error rates across different data types
4. Improvements in handling complex structures like multi-state employment data

The results will demonstrate the effectiveness of our fine-tuning approach and identify any remaining areas for improvement.

In [None]:
# Run evaluation on 100 test samples
print("Running custom model evaluation on test dataset...")
eval_results = evaluate_model_on_test_data(100, deployment_arn)

# Print overall accuracy
print(f"\nOverall Field Extraction Accuracy: {eval_results['overall_accuracy']:.2%}")

# Print category accuracies
print("\nAccuracy by Field Category:")
for category, accuracy in eval_results['category_accuracies'].items():
    print(f"  - {category}: {accuracy:.2%}")

## Resource Cleanup

After completing our evaluation, it's important to clean up the AWS resources we created to avoid unnecessary costs. This includes:

1. Deleting the custom model deployment
2. Cleaning up IAM resources (roles and policies)
3. Properly terminating any active services

This helps ensure we don't incur unexpected charges for resources we no longer need.

In [None]:
def clean_up():
    """
    Clean up all AWS resources created during the fine-tuning process to avoid unnecessary costs
    """
    print("Cleaning up resources...")
    
    # Delete on-demand deployment if it exists (deployment_arn is restored by %store -r as a global)
    if 'deployment_arn' in globals():
        try:
            print(f"Deleting on-demand deployment: {deployment_arn}...")
            bedrock.delete_custom_model_deployment(
                customModelDeploymentIdentifier=deployment_arn
            )
            print("On-demand deployment deletion initiated")
        except Exception as e:
            print(f"Error deleting on-demand deployment: {e}")
    
    # Clean up IAM resources
    iam = session.client('iam')
    try:
        print("Detaching policy from role...")
        iam.detach_role_policy(RoleName=role_name, PolicyArn=policy_arn)
        
        print("Deleting policy...")
        iam.delete_policy(PolicyArn=policy_arn)
        
        print("Deleting role...")
        iam.delete_role(RoleName=role_name)
        
        print("IAM resources cleaned up")
    except Exception as e:
        print(f"Error cleaning up IAM resources: {e}")
    
    # We're not deleting the S3 bucket here as you might want to keep your data and model
    
    print("Cleanup completed")

In [None]:
clean_up()

## Conclusion

In this notebook, we've evaluated and compared the performance of the base Amazon Nova Lite model and our custom fine-tuned version for W2 tax form OCR tasks. 

### Key Findings:

1. **Overall Accuracy Improvement**: The fine-tuned model achieved 85.31% field extraction accuracy compared to 55.87% for the base model, an absolute improvement of 29.44 percentage points.

2. **Field Category Improvements**:
   - Employee Information: 81.67% (↑ from 57.00%)
   - Employer Information: 92.00% (↑ from 57.33%)
   - Earnings: 85.14% (↑ from 60.57%)
   - Benefits: 60.50% (↑ from 45.00%)
   - Multi-State Employment: 94.19% (↑ from 61.88%)
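
The percentage-point gains implied by the figures above can be computed directly. The sketch below hard-codes the accuracies reported in this section; the dictionary and variable names are illustrative.

```python
# Accuracies (in percent) copied from the Key Findings above
base = {
    'Overall': 55.87,
    'Employee Information': 57.00,
    'Employer Information': 57.33,
    'Earnings': 60.57,
    'Benefits': 45.00,
    'Multi-State Employment': 61.88,
}
fine_tuned = {
    'Overall': 85.31,
    'Employee Information': 81.67,
    'Employer Information': 92.00,
    'Earnings': 85.14,
    'Benefits': 60.50,
    'Multi-State Employment': 94.19,
}

# Absolute gain in percentage points for each category
gains = {name: round(fine_tuned[name] - base[name], 2) for name in base}

for name, gain in gains.items():
    print(f"{name}: {base[name]:.2f}% -> {fine_tuned[name]:.2f}% (+{gain:.2f} percentage points)")
```

Employer Information shows the largest gain (34.67 points) and Benefits the smallest (15.50 points).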

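These results show that fine-tuning substantially improves the model's ability to extract structured information from W2 forms, making it a stronger candidate for automated document processing workflows. Benefits fields remain the weakest category at 60.50% and are the most obvious target for further improvement.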