# Convert Unstructured Data to Structured Data using Amazon Bedrock Data Automation 

#### This demo transforms a PDF document into a CSV file using Amazon Bedrock Data Automation (BDA). BDA is an end-to-end document processing service powered by generative AI. Given a document and a blueprint schema that defines the fields to extract, BDA returns structured output. In this notebook, we will explore how to:
1. Create and register a blueprint schema
2. Invoke a Bedrock Data Automation job
3. Evaluate the job results and iterate
#### At the end, you will update the blueprint schema with additional fields and instructions for extracting them. Your goal is to keep iterating on the blueprint schema until the results reach 100% accuracy.

#### Directory structure:
```
📁 mayo-clinic-ai-summit-idp-demo/
│
├── 📁 input_files/
│   ├── 📄 pathology_report.pdf
│   └── 📊 ground_truth.csv
│
├── 📁 output/
│   └── 📊 processed_pathology_report.csv
│
├── 📁 src/
│   ├── 📄 bda_processor.py
│   ├── 📄 evaluator.py
│   └── 📄 requirements.txt
│
└── 📓 bda-notebook.ipynb
```

## Initial Setup

In [None]:
!pip install -qq -r src/requirements.txt

In [None]:
from src.bda_processor import BDAProcessor
from src.evaluator import Evaluator

In [None]:
bda_processor = BDAProcessor()
evaluator = Evaluator()

## Create Bedrock Data Automation Blueprint 

#### Define blueprint schema

In [None]:
blueprint_schema = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "description": "This is a blueprint for a pathology report",
    "class": "Pathology Report",
    "type": "object",
    "definitions": {},
    "properties": {
        "hospital_name": {
            "type": "string",
            "instruction": "Name of hospital"
        },
        "lab_name": {
            "type": "string",
            "instruction": "Name of lab"
        },
        "physician_name": {
            "type": "string",
            "instruction": "Name of physician. Return first name and last name as a single string value"
        },
        "has_serum_specimen": {
            "type": "string",
            "instruction": "Whether a serum specimen was collected. Return Yes or No"
        }
    }
}
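A quick structural check before registering a blueprint can catch a field that is missing its `type` or `instruction`. The sketch below is only an illustration: `find_schema_problems` is a hypothetical helper, not part of `src/bda_processor.py`, and the keys it checks are simply the ones used in the schema above, not an official list of what BDA requires.

```python
def find_schema_problems(schema):
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []
    # Top-level keys used by the schema in this notebook (assumed, not an official list).
    for key in ("class", "description", "properties"):
        if key not in schema:
            problems.append(f"missing top-level key: {key}")
    # Every field in this notebook carries a type and an extraction instruction.
    for name, spec in schema.get("properties", {}).items():
        for key in ("type", "instruction"):
            if not str(spec.get(key, "")).strip():
                problems.append(f"{name}: missing or empty '{key}'")
    return problems


good_schema = {
    "class": "Pathology Report",
    "description": "This is a blueprint for a pathology report",
    "properties": {
        "hospital_name": {"type": "string", "instruction": "Name of hospital"},
    },
}
bad_schema = {
    "class": "Pathology Report",
    "properties": {
        "lab_name": {"type": "string", "instruction": ""},
    },
}

print(find_schema_problems(good_schema))
print(find_schema_problems(bad_schema))
```

Running a check like this on `blueprint_schema` before calling `create_blueprint` keeps simple typos from costing a full BDA job.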


#### Create blueprint

In [None]:
blueprint_arn = bda_processor.create_blueprint(
    blueprint_name="bda-blueprint-demo", 
    blueprint_schema=blueprint_schema)

In [None]:
blueprint_arn

## Invoke Bedrock Data Automation Job

#### The BDA job is asynchronous. Starting it returns a job ID, which is used later to retrieve the job results.

In [None]:
job_id = bda_processor.start_data_automation(
    file_path="input_files/pathology_report.pdf", 
    blueprint_arn=blueprint_arn)

In [None]:
job_id

#### Get BDA job results (this may take a few minutes)

In [None]:
bda_processor.get_data_automation_results(job_id=job_id)
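Because the job is asynchronous, retrieving results generally means polling until the job reaches a terminal state. How `BDAProcessor` does this internally is not shown in this notebook, so the following is a generic sketch of the submit-then-poll pattern. `wait_for_job` and `check_status` are hypothetical names, and the status strings are assumptions that should be confirmed against the service documentation.

```python
import time


def wait_for_job(check_status, job_id, poll_seconds=5.0, timeout_seconds=600.0,
                 terminal_states=("Success", "ServiceError", "ClientError"),
                 sleep=time.sleep):
    """Poll check_status(job_id) until a terminal state is reported or the timeout elapses.

    check_status is any callable returning the current status string for a job.
    The default terminal_states are assumed values, not verified here.
    """
    waited = 0.0
    while True:
        status = check_status(job_id)
        if status in terminal_states:
            return status
        if waited >= timeout_seconds:
            raise TimeoutError(f"job {job_id} still '{status}' after {waited:.0f}s")
        sleep(poll_seconds)
        waited += poll_seconds


# Simulated status sequence so the sketch runs without calling any AWS service.
fake_statuses = iter(["Created", "InProgress", "InProgress", "Success"])
final_status = wait_for_job(
    lambda _job_id: next(fake_statuses),
    "demo-job",
    sleep=lambda _seconds: None,  # skip real waiting in the demo
)
print(final_status)
```

Passing `sleep` as a parameter keeps the loop testable. In real use the default `time.sleep` applies and `check_status` would wrap whatever status call the processor exposes.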

## Evaluate Results

#### Compare ground truth and BDA results

In [None]:
comparison_df = evaluator.create_comparison_df(
    ground_truth_path="input_files/ground_truth.csv",
    results_path="output/processed_pathology_report.csv"
)

In [None]:
comparison_df

#### Minor differences between the ground truth and the BDA results are expected, so the evaluation uses a multi-tiered approach. The following values can be specified for `match_type` when calculating the accuracy of the extraction results:
* EXACT
* FUZZY
* LLM
* FUZZY_AND_LLM
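To make the difference between the first two tiers concrete, here is a minimal sketch of exact versus fuzzy comparison using only the standard library. It is not the implementation in `src/evaluator.py`. The function names, the normalization step, and the 0.9 threshold are all assumptions chosen for illustration, and the sample values are made up. The LLM tier is not sketched because it would need a model call.

```python
from difflib import SequenceMatcher


def normalize(value):
    """Lowercase and collapse whitespace so cosmetic differences are ignored."""
    return " ".join(str(value).lower().split())


def exact_match(expected, actual):
    """Strict string equality, with no normalization."""
    return str(expected) == str(actual)


def fuzzy_match(expected, actual, threshold=0.9):
    """True when the similarity ratio of the normalized strings meets the threshold."""
    ratio = SequenceMatcher(None, normalize(expected), normalize(actual)).ratio()
    return ratio >= threshold


# Made-up values: the same name with different casing and spacing.
print(exact_match("Dr. Jane Doe", "dr. jane  doe"))
print(fuzzy_match("Dr. Jane Doe", "dr. jane  doe"))
# Made-up values: a genuinely different string stays a mismatch.
print(fuzzy_match("General Hospital", "General Hospital Laboratory Services"))
```

A tiered evaluator would typically try the cheapest check first and escalate only for fields that fail it, which is presumably the idea behind `FUZZY_AND_LLM`.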

In [None]:
exact_match_df = evaluator.calculate_accuracy(comparison_df, match_type="EXACT")
fuzzy_match_df = evaluator.calculate_accuracy(comparison_df, match_type="FUZZY")
llm_match_df = evaluator.calculate_accuracy(comparison_df, match_type="LLM")
llm_and_fuzzy_df = evaluator.calculate_accuracy(comparison_df, match_type="FUZZY_AND_LLM")

## Your turn!

#### Your task is to update the blueprint schema with additional fields from the ground truth file. You can invoke a new BDA job as many times as you want. The goal is to reach 100% accuracy!

In [None]:
blueprint_schema = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "description": "This is a blueprint for a pathology report",
    "class": "Pathology Report",
    "type": "object",
    "definitions": {},
    "properties": {
        "hospital_name": {
            "type": "string",
            "instruction": "Name of hospital"
        },
        "has_serum_specimen": {
            "type": "string",
            "instruction": "Whether a serum specimen was collected. Return Yes or No"
        },
        "serum_receiving_date": {
            "type": "string",
            "instruction": "Date on which the serum specimen was received. Return the date in this format: MM/DD/YYYY. If the date is not explicitly labeled 'Receiving Date', return 'Unknown'."
        },
        "bilirubin_total": {
            "type": "string",
            "instruction": "Total bilirubin level"
        },
        "enter-field-name": {
            "type": "enter-output-data-type",
            "instruction": "enter-clear-instructions-or-definitions-for-field"
        }
    }
}
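The last entry in the schema above is a template to be filled in, so it is worth confirming that no placeholder text remains before updating the blueprint. The sketch below assumes placeholders are recognizable by the `enter-` prefix used in this cell. `find_placeholders` is a hypothetical helper, not part of the demo's source files.

```python
def find_placeholders(schema, prefix="enter-"):
    """Return names of properties whose name, type, or instruction still starts with prefix."""
    leftovers = []
    for name, spec in schema.get("properties", {}).items():
        values = [name, spec.get("type", ""), spec.get("instruction", "")]
        if any(str(value).startswith(prefix) for value in values):
            leftovers.append(name)
    return leftovers


# Trimmed-down copy of the schema above, used only to demonstrate the check.
draft_schema = {
    "properties": {
        "bilirubin_total": {
            "type": "string",
            "instruction": "Total bilirubin level",
        },
        "enter-field-name": {
            "type": "enter-output-data-type",
            "instruction": "enter-clear-instructions-or-definitions-for-field",
        },
    }
}

print(find_placeholders(draft_schema))
```

An empty list from `find_placeholders(blueprint_schema)` suggests every template entry has been replaced with a real field.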

#### Update blueprint

In [None]:
bda_processor.update_blueprint(
    blueprint_arn=blueprint_arn, 
    blueprint_schema=blueprint_schema)

#### Start job

In [None]:
job_id = bda_processor.start_data_automation(
    file_path="input_files/pathology_report.pdf", 
    blueprint_arn=blueprint_arn)

#### Get job results

In [None]:
bda_processor.get_data_automation_results(job_id=job_id)

#### Compare results to the ground truth

In [None]:
comparison_df = evaluator.create_comparison_df(
    ground_truth_path="input_files/ground_truth.csv",
    results_path="output/processed_pathology_report.csv"
)

In [None]:
comparison_df

#### Calculate accuracies

In [None]:
exact_match_df = evaluator.calculate_accuracy(comparison_df, match_type="EXACT")
fuzzy_match_df = evaluator.calculate_accuracy(comparison_df, match_type="FUZZY")
llm_match_df = evaluator.calculate_accuracy(comparison_df, match_type="LLM")
llm_and_fuzzy_df = evaluator.calculate_accuracy(comparison_df, match_type="FUZZY_AND_LLM")