In [1]:
!pip install -qq -r requirements.txt

In [2]:
from src.textract_processor import TextractProcessor
from src.extractor import Extractor
from src.evaluator import Evaluator



# Intelligent Document Processing using Amazon Textract and Bedrock

#### This notebook demonstrates how to transform unstructured medical documents into structured data. It relies on the following classes, which have already been implemented:
* TextractProcessor: Text extraction using Amazon Textract
* Extractor: Metadata extraction using Amazon Bedrock and Anthropic Claude
* Evaluator: Evaluation of the LLM extraction results using exact match, fuzzy match, and LLM match

## Convert PDF document to markdown file using Amazon Textract

#### Initialize TextractProcessor class

In [3]:
textractor = TextractProcessor()

#### Start document analysis and get Textract Job ID

In [4]:
job_id = textractor.start_textract(file_path = "input_files/patient_02.pdf")

In [5]:
job_id

'47396342adf5c076ffe7bafbbd13c8fe930c120fc18d73ebd9c16c6e683d7ecd'

#### Retrieve Textract response using the Job ID

In [6]:
textract_response = textractor.get_textract_response(job_id = job_id)

In [7]:
textract_response

{'DocumentMetadata': {'Pages': 4},
 'JobStatus': 'SUCCEEDED',
 'NextToken': 'wLq2OMFGVHifKZBMCvOdA/TU8jiEIauMVsS7PWBokcanFUIqYoCa7Wz94+1XkOomvxAQw+4s5HP729EfOh8rGw/TqbY/opwVacUGxrF1OM8afdumSTKyw84MHx3zg9eEiDA4UnI=',
 'Blocks': [{'BlockType': 'PAGE',
   'Geometry': {'BoundingBox': {'Width': 1.0,
     'Height': 1.0,
     'Left': 0.0,
     'Top': 0.0},
    'Polygon': [{'X': 0.0, 'Y': 2.0999438277158333e-07},
     {'X': 1.0, 'Y': 0.0},
     {'X': 1.0, 'Y': 1.0},
     {'X': 3.6217792853676656e-07, 'Y': 1.0}]},
   'Id': '8442723f-24a7-4e59-8892-6d9d2ba13a09',
   'Relationships': [{'Type': 'CHILD',
     'Ids': ['a97ea69a-9ca1-4016-8d04-c4f0dc17f1f6',
      '7f699814-2a98-4f90-ba66-48996f3e92b3',
      'de4c7f82-0ebc-4633-8227-bda91814db0b',
      '3558d0e4-0271-4a76-a6a2-01387625bc75',
      '3abb4177-4a24-4f1c-98e1-39f0a7b325e1',
      '7df3cb05-de16-42ef-b1e7-a766beb4d2e6',
      '0bfad731-4a5a-49fa-8df5-5bf7bb0c104c',
      '96daeebb-d56f-4623-8bd2-7e8db531a298',
      'e208ccdd-422c-4a9e-
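
The response above carries a `NextToken`, which means it holds only part of the result and further pages have to be requested with that token. Whether `TextractProcessor` does this internally is not shown here, so the following is only a minimal sketch of the pagination loop. `fetch_page` is a hypothetical callable; in practice it could wrap the boto3 Textract client's `get_document_analysis(JobId=..., NextToken=...)` call. The fake pages stand in for the service so the loop can run offline.

```python
from typing import Callable, Optional


def collect_all_blocks(fetch_page: Callable[[Optional[str]], dict]) -> list:
    """Follow NextToken links until every page of results has been read.

    fetch_page is a hypothetical callable: it takes the token of the page to
    fetch (None for the first page) and returns one Textract-style response.
    """
    blocks = []
    token = None
    while True:
        page = fetch_page(token)
        blocks.extend(page.get("Blocks", []))
        token = page.get("NextToken")
        if not token:  # the last page carries no NextToken
            return blocks


# Stand-in for the real service: two fake result pages keyed by token.
_fake_pages = {
    None: {"Blocks": [{"Id": "a"}, {"Id": "b"}], "NextToken": "t1"},
    "t1": {"Blocks": [{"Id": "c"}]},
}

all_blocks = collect_all_blocks(lambda token: _fake_pages[token])
print(len(all_blocks))  # 3
```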

#### Convert the Textract response into a markdown file

In [8]:
markdown_response = textractor.parse_textract_response(textract_response = textract_response)
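
The internals of `parse_textract_response` are not shown in this notebook. As a rough illustration of the idea, the sketch below groups Textract `LINE` blocks by their `Page` attribute and writes each group under a page marker. The function name and the sample blocks are made up for this example, and the sketch ignores tables, which a real parser would have to rebuild from `TABLE` and `CELL` blocks.

```python
from collections import defaultdict


def lines_to_paged_text(blocks: list) -> str:
    """Group LINE blocks by page and join them under page-marker headers."""
    pages = defaultdict(list)
    for block in blocks:
        if block.get("BlockType") == "LINE":
            # Page may be absent for single-page input, so default to 1.
            pages[block.get("Page", 1)].append(block.get("Text", ""))
    parts = []
    for page_number in sorted(pages):
        parts.append(f"--- Page {page_number} ---")
        parts.extend(pages[page_number])
    return "\n".join(parts)


# Made-up blocks, shaped like the entries of the 'Blocks' list.
sample_blocks = [
    {"BlockType": "PAGE", "Page": 1},
    {"BlockType": "LINE", "Page": 1, "Text": "Line on page one"},
    {"BlockType": "LINE", "Page": 2, "Text": "Line on page two"},
]
paged_text = lines_to_paged_text(sample_blocks)
print(paged_text)
```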

In [9]:
print(markdown_response)

--- Page 1---
T


ISO:9001-2008 

# Diagnostic Point Pathology Labs 

EQUIPPED WITH COMPUTERISED AUTO CHEMISTRY, BLOOD GAS & HAEMATOLOGY ANALYSERS 

101, 102. 107, 108, 109, 1st Floor. Agarwal & P Mkt. Dilshad Garden, Delhi-95 Tower, Pkt. O Ph.: 

SARVODAYA HOSPITAL KJ-7, Kavi Nagar, Ghaziabad (U.P.) 


Dr. (Capt) Atul Kapila (Retd.) MD Path MCI 3456 

Regn No. 




| Investigation                | Observed Value    | Unit    | Biological Ref Interval    |
|------------------------------|-------------------|---------|----------------------------|
| BIOCHEMISTRY                 |                   |         |                            |
| VER FUNCTION TEST (LFT)      |                   |         |                            |
| BILIRUBIN TOTAL              | 9.42 H            | mg/dl   | 0.30 1.20                  |
| CONJUGATED (D. BILIRUBIN)    | 7.79 H            | mg/dL   | 0.00 0,30                  |
| UNCONJUGATED (I.D.BILIRUBIN) | 1.63 H            | mg/dl   | 0.00 0,70       

## Configure Bedrock Prompt

#### TODO: Your task is to edit the sample prompts and the field_definitions.csv file.

In [10]:
system_prompt = """
You are a medical data extraction assistant. 
Your task is to extract specific medical terms from the provided document according to the defined schema.
Please pay careful attention to:
1. Field locations and contexts specified in section headers
2. Special instructions for each field
3. Inclusion and exclusion criteria
4. Exact matching of field types and formats
5. Validation against any provided choices or calculations

For each field you extract, you must provide:
- variable_field_name: The extracted value from the document
- reasoning: Your explanation for why you chose this value and how it matches the field definition
- page_citation: Specific reference to where in the document you found this information

Format your response as a JSON object where each field contains these three elements.
"""

In [11]:
extraction_prompt = """
Analyze the following medical document segments and extract the requested fields. 
Each segment is preceded by a page number in the format --- Page <page number> ---.

Schema Information:
{schema}

Document Text:
{document}

For each field:
1. Extract the value according to the field definition and constraints
2. Explain your reasoning for choosing this value
3. Cite where in the document you found it

Return a JSON where each field has:
- variable_field_name: The extracted value (or null if not found)
- reasoning: Your one-sentence explanation for this extraction. DO NOT USE double quotes in the explanation.
- page_citation: Where you found it in the document

For fields with specific choices or calculations, ensure the extracted value matches one of the allowed options.
"""

## Invoke Bedrock Prompt

#### Initialize Extractor class

In [12]:
extractor = Extractor()
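
Before a prompt reaches the model, the `{schema}` and `{document}` placeholders in `extraction_prompt` have to be filled in. How `Extractor` does this is not shown, and it may well build one prompt per field. The sketch below shows one possible approach with `str.format`. The column names `field_name` and `description`, the sample rows, and the shortened template are all assumptions for illustration; the real field_definitions.csv may look different.

```python
import csv
import io

# Hypothetical field definitions; the real file may use different columns.
FIELD_DEFINITIONS_CSV = """field_name,description
hospital_name,Name of the hospital shown in the report header
bilirubin_total,Total bilirubin value without the unit
"""

# Shortened stand-in for extraction_prompt with the same two placeholders.
PROMPT_TEMPLATE = "Schema Information:\n{schema}\n\nDocument Text:\n{document}\n"


def build_prompt(template: str, definitions_csv: str, document: str) -> str:
    """Render the field definitions as a bullet list and fill the template."""
    rows = csv.DictReader(io.StringIO(definitions_csv))
    schema = "\n".join(f"- {row['field_name']}: {row['description']}" for row in rows)
    return template.format(schema=schema, document=document)


prompt = build_prompt(
    PROMPT_TEMPLATE,
    FIELD_DEFINITIONS_CSV,
    "--- Page 1 ---\nBILIRUBIN TOTAL 9.42 H mg/dl",
)
print(prompt)
```

One caveat with `str.format`: any literal curly braces in the template itself, for example a JSON example, must be doubled (`{{` and `}}`), otherwise they are treated as placeholders.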

#### Extract metadata from the markdown file, using the prompts and field definitions as inputs

In [13]:
extractor.extract_metadata(
    document_path="output/markdown_output/patient_02.md",
    system_prompt=system_prompt,
    extraction_prompt=extraction_prompt,
    field_definitions_path="field_definitions.csv"
)

Processing field: hospital_name
Processing field: lab_name
Processing field: physician_name
Processing field: has_serum_specimen
Processing field: serum_receiving_date
Processing field: serum_reporting_date
Processing field: serum_turnaround_time
Processing field: bilirubin_total
Processing field: bilirubin_total_unit
Processing field: bilirubin_level
Processing field: bilirubin_conjugated
Processing field: bilirubin_conjugated_unit
Processing field: bilirubin_conjugated_level
Processing field: bilirubin_unconjugated
Processing field: bilirubin_unconjugated_unit
Processing field: bilirubin_unconjugated_level
Processing field: has_blood_specimen
Processing field: blood_receiving_date
Processing field: blood_reporting_date
Processing field: blood_turnaround_time
Processing field: ammonia
Processing field: ammonia_unit
Processing field: ammonia_level
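
The prompts ask the model to reply with a JSON object, but models sometimes wrap the JSON in extra text. The sketch below shows one defensive way to pull the object out of a reply before parsing it. The function name, the sample reply, and its exact key layout are made up for illustration; the notebook does not show how `Extractor` parses responses.

```python
import json


def parse_model_json(reply: str) -> dict:
    """Parse the JSON object spanning the first '{' to the last '}' in a reply."""
    start = reply.find("{")
    end = reply.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in reply")
    return json.loads(reply[start : end + 1])


# Made-up reply, loosely shaped like the format requested in the prompts.
sample_reply = """Here is the result:
{"bilirubin_total": {"variable_field_name": "9.42",
  "reasoning": "Reported in the biochemistry table",
  "page_citation": "Page 1"}}
"""
parsed = parse_model_json(sample_reply)
print(parsed["bilirubin_total"]["variable_field_name"])  # 9.42
```

This also hints at why the extraction prompt forbids double quotes in the reasoning text: an unescaped quote inside a string value would make the reply invalid JSON.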


## LLM Extraction Results

#### Initialize Evaluator class

In [14]:
evaluator = Evaluator()

#### Create comparison dataframe that joins the ground truth and LLM results

In [16]:
comparison_df = evaluator.create_comparison_df(
    ground_truth_path="ground_truth.csv",
    results_path="output/results/llm_results/patient_02_llm_results.csv"
)

In [17]:
comparison_df

| Unnamed: 0 | field_name | field_value | llm_extraction | reasoning | page_citation |
|---|---|---|---|---|---|
| 0 | hospital_name | Sarvodaya Hospital | SARVODAYA HOSPITAL | Identified the hospital name in the header sec... | Page 1, repeated header sections |
| 1 | lab_name | Diagnostic Point Pathology Labs | Diagnostic Point Pathology Labs | The lab name is clearly printed in the header ... | Page 1, top of document header |
| 2 | physician_name | Atul Kapila | {'first_name': 'Atul', 'last_name': 'Kapila'} | Full physician name appears multiple times in ... | Multiple pages, e.g. Page 1 first section |
| 3 | has_serum_specimen | Yes | Yes | Serum is explicitly listed as a specimen type ... | --- Page 1 ---, under Biochemistry results sec... |
| 4 | serum_receiving_date | Unknown | 18/Apr/2025 | The receiving date for the serum specimen is c... | Page 1, last section with receiving date table |
| 5 | serum_reporting_date | Unknown | 04/18/2025 | Found the reporting date for the serum specime... | Page 1, Reporting Date line: 18/Apr/2025 3:17:... |
| 6 | serum_turnaround_time | Unknown | 2 Hours 29 Minutes | The turnaround time for the serum specimen is ... | Page 1, line with Date TAT 2 Hours 29 Minute |
| 7 | bilirubin_total | 9.42 | 9.42 | Total bilirubin level is directly reported in ... | --- Page 1 --- Biochemistry section of lab rep... |
| 8 | bilirubin_total_unit | mg/dl | mg/dl | The total bilirubin value is reported with the... | Page 1, Biochemistry Table, Bilirubin Total co... |
| 9 | bilirubin_level | H | H | Total bilirubin level is marked as 9.42 H, whi... | --- Page 1--- Biochemistry section, Bilirubin ... |
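
The comparison table pairs each ground-truth value with the model's extraction for the same field. `create_comparison_df` is not shown in this notebook, so the following is only a sketch of the underlying join, written with the standard `csv` module and keyed on `field_name`. The two inline CSV strings are shortened stand-ins for ground_truth.csv and the LLM results file, using two rows from the table above.

```python
import csv
import io

# Shortened stand-ins for the two input files (assumed column layout).
GROUND_TRUTH_CSV = (
    "field_name,field_value\n"
    "lab_name,Diagnostic Point Pathology Labs\n"
    "bilirubin_total,9.42\n"
)
LLM_RESULTS_CSV = (
    "field_name,llm_extraction\n"
    "bilirubin_total,9.42\n"
    "lab_name,Diagnostic Point Pathology Labs\n"
)


def join_on_field_name(ground_truth_csv: str, results_csv: str) -> list:
    """Left-join the LLM results onto the ground truth rows by field_name."""
    results = {
        row["field_name"]: row
        for row in csv.DictReader(io.StringIO(results_csv))
    }
    joined = []
    for truth in csv.DictReader(io.StringIO(ground_truth_csv)):
        match = results.get(truth["field_name"], {})
        joined.append({
            "field_name": truth["field_name"],
            "field_value": truth["field_value"],
            # None marks a field the model returned no row for.
            "llm_extraction": match.get("llm_extraction"),
        })
    return joined


comparison_rows = join_on_field_name(GROUND_TRUTH_CSV, LLM_RESULTS_CSV)
for row in comparison_rows:
    print(row)
```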


## Perform LLM evaluation 

#### The extracted results often do not match the ground truth exactly, for example because of differences in letter case or formatting. Additional evaluation methods, such as fuzzy match or LLM match, tolerate these differences. The following values can be specified for "match_type" when calculating the accuracy of the extraction results:
* EXACT
* FUZZY
* LLM
* FUZZY_AND_LLM

In [18]:
exact_match_df = evaluator.calculate_accuracy(comparison_df, match_type="EXACT")
fuzzy_match_df = evaluator.calculate_accuracy(comparison_df, match_type="FUZZY")
llm_match_df = evaluator.calculate_accuracy(comparison_df, match_type="LLM")
llm_and_fuzzy_df = evaluator.calculate_accuracy(comparison_df, match_type="FUZZY_AND_LLM")

Exact match accuracy: 56.52%
Fuzzy match accuracy: 82.61%
LLM match accuracy: 86.96%
LLM and Fuzzy match accuracy: 86.96%
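
To make the gap between the exact and fuzzy scores concrete, here is a minimal sketch of how the two match types could work. It is not the `Evaluator` implementation: the use of `difflib.SequenceMatcher`, the case folding, and the 0.8 threshold are all assumptions chosen for illustration. The value pairs are taken from the comparison table earlier in this notebook.

```python
from difflib import SequenceMatcher


def exact_match(truth: str, predicted: str) -> bool:
    """Strings must be identical after trimming surrounding whitespace."""
    return truth.strip() == predicted.strip()


def fuzzy_match(truth: str, predicted: str, threshold: float = 0.8) -> bool:
    """Case-insensitive similarity ratio must reach the (assumed) threshold."""
    a, b = truth.strip().casefold(), predicted.strip().casefold()
    return SequenceMatcher(None, a, b).ratio() >= threshold


def accuracy(pairs, matcher) -> float:
    """Percentage of (truth, predicted) pairs accepted by the matcher."""
    hits = sum(1 for truth, predicted in pairs if matcher(truth, predicted))
    return 100.0 * hits / len(pairs)


pairs = [
    ("Sarvodaya Hospital", "SARVODAYA HOSPITAL"),
    ("Diagnostic Point Pathology Labs", "Diagnostic Point Pathology Labs"),
    ("9.42", "9.42"),
    ("Unknown", "2 Hours 29 Minutes"),
]
print(f"Exact match accuracy: {accuracy(pairs, exact_match):.2f}%")  # 50.00%
print(f"Fuzzy match accuracy: {accuracy(pairs, fuzzy_match):.2f}%")  # 75.00%
```

The hospital name differs only in letter case, so it fails the exact comparison but passes the fuzzy one. Cases such as the structured physician name would likely need the LLM match, since string similarity alone cannot recognize that the two values describe the same person.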
