PHI Deidentification Platform

Index	Description
Overview	See what this project does and its key capabilities
Demo	View the demo experience and walkthrough video
Description	Learn about the problem and our approach
Architecture	View the system architecture diagram
Tech Stack	Technologies and services used
Deployment	How to install and deploy the solution
Usage	How to process notes and approve redactions
Costs	Cost drivers and estimation guidance
Credits	Project ownership and contributors
License	Current license status for this repository
Disclaimers	Important legal and operational disclaimers

Overview

PHI Deidentification Platform is an AI-driven system for detecting and redacting sensitive information in clinical and operational text documents. The solution uses large language models (LLMs) through AWS services to identify PHI with context awareness, generate redacted outputs, and support human review before release.

A University of Pittsburgh research team requested this project to address challenges with existing deidentification approaches: static PHI identifier lists that miss context-specific or unique information, long processing times for large note volumes, and mid-processing failures that require full restarts.

The platform provides end-to-end capabilities including batch ingestion, asynchronous processing, LLM-based detection, redacted artifact generation, reviewer approval workflows, and operational metrics for monitoring system health.

Key capabilities include:

Automated PHI Detection: Uses Claude via Amazon Bedrock to identify sensitive entities in clinical or operational notes.
Redacted Output Generation: Produces redacted text and entity metadata for each input note.
Human-in-the-Loop Review: Provides a dashboard to compare original vs redacted text, edit, and approve.
Async Processing at Scale: Uses S3, SQS, and Lambda for asynchronous, serverless processing.
Operational Visibility: Tracks batch stats and emits CloudWatch metrics for throughput, latency, retries, and failures.

Demo

pii-deid-demo.mp4

Description

Problem Statement

Medical research teams handling protected health information (PHI) and personally identifiable information (PII) must de-identify notes before secondary use, collaboration, and analytics. Existing solutions struggle with context-specific identifiers, produce inconsistent results, and require long runtimes. We needed a context-aware automated pipeline with human-in-the-loop validation.

Our Approach

PHI Deidentification Platform addresses these challenges through a context-aware, serverless redaction pipeline that combines LLM-based entity detection, asynchronous processing, and structured human review.

Asynchronous Ingestion Pipeline: Users upload notes in batch folders to Amazon S3, where Amazon SQS queues them for processing. This design handles high note volumes without blocking user workflows and supports reliable retries for long-running jobs.
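
The retry behavior described above can be sketched as a worker handler that reports failures per message, so that SQS redelivers only the notes that failed instead of the whole batch. This is a minimal sketch, not the project's actual handler: the message body shape (`bucket`, `key`) and the `process_note` placeholder are hypothetical, and it assumes the Lambda event source mapping has `ReportBatchItemFailures` enabled.

```python
import json
from typing import Any, Callable


def process_note(bucket: str, key: str) -> None:
    """Hypothetical placeholder for the per-note work (detection and redaction)."""
    if not key.endswith(".txt"):
        raise ValueError(f"unsupported note type: {key}")


def handle_sqs_batch(
    event: dict[str, Any],
    worker: Callable[[str, str], None] = process_note,
) -> dict[str, list[dict[str, str]]]:
    """Process each SQS record and report only the failed message IDs."""
    failures = []
    for record in event.get("Records", []):
        try:
            # Assumed body shape: {"bucket": "...", "key": "..."}
            body = json.loads(record["body"])
            worker(body["bucket"], body["key"])
        except Exception:
            # Listing the messageId asks SQS to redeliver just this message.
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}
```

Messages that are not listed in `batchItemFailures` are treated as processed and removed from the queue, which is what lets a long-running batch resume without a full restart.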

AI Detection and Redaction Layer: Worker Lambda functions call Claude Sonnet 4.5 via Amazon Bedrock to identify PHI in context, then generate redacted outputs and entity artifacts per note. This improves detection quality beyond static pattern matching and supports all 18 HIPAA identifier categories.
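
To illustrate how detected entities might become a redacted note, the sketch below replaces each entity span with a category placeholder. The `Entity` record (character offsets plus a category) and the `[CATEGORY]` placeholder format are assumptions made for illustration; the platform's actual entity artifact format may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """Hypothetical entity record: character offsets plus a PHI category."""
    start: int
    end: int
    category: str


def redact(text: str, entities: list[Entity]) -> str:
    """Replace each entity span with a [CATEGORY] placeholder."""
    # Work right to left so earlier offsets stay valid after each replacement.
    ordered = sorted(entities, key=lambda e: e.start, reverse=True)
    redacted = text
    upper_bound = len(text)
    for entity in ordered:
        if not 0 <= entity.start < entity.end <= upper_bound:
            raise ValueError(f"invalid or overlapping span: {entity}")
        redacted = redacted[:entity.start] + f"[{entity.category}]" + redacted[entity.end:]
        upper_bound = entity.start  # the next span must end before this one starts
    return redacted


note = "Pt John Doe seen on 03/14/2024."
entities = [Entity(3, 11, "NAME"), Entity(20, 30, "DATE")]
print(redact(note, entities))  # prints "Pt [NAME] seen on [DATE]."
```

Rejecting overlapping spans, rather than merging them silently, keeps ambiguous detections visible to the human reviewer.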

Secure Access: The platform uses Cognito for authentication and access control across the dashboard and API workflows. This ensures only authorized users can sign in, process batches, and approve redacted outputs.

Human-in-the-Loop Review Workflow: The review interface lets users inspect original vs. redacted text, edit redactions, and approve notes or full batches. The system tracks approved outputs separately, supporting controlled release workflows and reviewer quality oversight.

Architecture

Tech Stack

Category	Technology	Purpose
Amazon Web Services (AWS)	AWS CDK	Infrastructure as code
	Amazon Bedrock	Claude-based PHI detection
	AWS Lambda	Ingestion, processing, and API compute
	Amazon S3	Input/output note storage
	Amazon SQS	Asynchronous note processing queue
	Amazon API Gateway	Authenticated REST API
	Amazon Cognito	Authentication and user management
	Amazon DynamoDB	Batch statistics/state
	AWS Amplify	Frontend hosting
	Amazon CloudWatch	Metrics and operational dashboard
Backend	Python 3.12	Lambda runtime
	pydantic-ai	Agent orchestration around Bedrock
	boto3	AWS SDK
	aws-lambda-powertools	Logging, metrics, and partial batch handling
Frontend	React	Review dashboard UI
	Vite	Frontend build and dev server
	TypeScript	Type-safe frontend development
	TanStack Query	API data synchronization and caching
	AWS Amplify SDK	Cognito authentication integration

Deployment

Prerequisites

Prepare the following tools and accounts before deploying:

An active AWS account
Node.js (18+) from the official download page, or install it with nvm
AWS CDK v2, installed globally:
```
npm install -g aws-cdk
```
AWS CLI, installed by following the official AWS CLI installation guide
Docker from docker.com/get-started
Git from git-scm.com

AWS Configuration

Configure AWS CLI with your credentials:
```
aws configure
```
Provide your AWS Access Key ID, Secret Access Key, and default region (for example, us-east-1) when prompted.
Bootstrap CDK for your target account/region (required once per account/region):
```
cdk bootstrap aws://ACCOUNT_ID/REGION
```
Replace ACCOUNT_ID and REGION with the AWS account and region where you are deploying.

Quick Start (Recommended)

Clone the repository:

```
git clone git@github.com:pitt-cic/phi-deidentification.git
cd phi-deidentification
```

Deploy infrastructure with CDK:
```
cd cdk
npm install
npm run deploy
```

Retrieve stack outputs (API URL, Cognito IDs, bucket name, region, dashboard URL):

```
aws cloudformation describe-stacks \
  --stack-name PHIDeidentificationStack \
  --query "Stacks[0].Outputs[].[OutputKey,OutputValue]" \
  --output table
```

Deploy the frontend to AWS Amplify by running the deployment script:

```
cd ../frontend
chmod +x ./deploy-frontend.sh
./deploy-frontend.sh
```

Local Frontend Setup

Install dependencies (from frontend/):
```
npm install
```

Add environment variables to frontend/.env:

```
VITE_API_URL=<ApiUrl>
VITE_USER_POOL_ID=<UserPoolId>
VITE_USER_POOL_CLIENT_ID=<UserPoolClientId>
```

Start development server:
```
npm run dev
```
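
The `frontend/.env` file described above could also be generated from the stack outputs instead of being copied by hand. The sketch below is a hedged illustration: `render_env` is a hypothetical helper, and the output key names (`ApiUrl`, `UserPoolId`, `UserPoolClientId`) are taken from the placeholders in this section and should be checked against your deployed stack.

```python
import json

# Mapping from Vite variable names to CloudFormation output keys.
# The output keys mirror the placeholders above; verify them against your stack.
ENV_TO_OUTPUT_KEY = {
    "VITE_API_URL": "ApiUrl",
    "VITE_USER_POOL_ID": "UserPoolId",
    "VITE_USER_POOL_CLIENT_ID": "UserPoolClientId",
}


def render_env(describe_stacks_json: str) -> str:
    """Build .env contents from `aws cloudformation describe-stacks --output json`."""
    stacks = json.loads(describe_stacks_json)["Stacks"]
    outputs = {o["OutputKey"]: o["OutputValue"] for o in stacks[0].get("Outputs", [])}
    missing = [key for key in ENV_TO_OUTPUT_KEY.values() if key not in outputs]
    if missing:
        raise KeyError(f"stack outputs missing: {', '.join(missing)}")
    lines = [f"{name}={outputs[key]}" for name, key in ENV_TO_OUTPUT_KEY.items()]
    return "\n".join(lines) + "\n"
```

Failing loudly on a missing output is deliberate: an `.env` file with an empty value would otherwise only surface as an authentication error in the browser.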

Local Testing

For local evaluation tooling, synthetic data generation, and running the dashboard locally, see tooling/README.md.

Usage

Open the application:
- Primary path: use AmplifyAppUrl from stack outputs
- Local: use the URL printed by npm run dev

Create a Cognito user (admin action):

```
aws cognito-idp admin-create-user \
  --user-pool-id <UserPoolId> \
  --username "user@example.com" \
  --user-attributes \
    Name=email,Value=user@example.com \
    Name=given_name,Value=First \
    Name=family_name,Value=Last \
  --desired-delivery-mediums EMAIL
```

Replace <UserPoolId> with the output from your deployed stack.

Sign in:

Log in as the invited user with the temporary password, then set a permanent password when prompted on first login.

Create a batch and upload .txt notes:

```
./scripts/create_batch.sh --notes-dir /PATH/TO/NOTES
```

To upload to an existing batch:

```
./scripts/create_batch.sh --batch-id "<batch-id>" --notes-dir /PATH/TO/NOTES
```

Additional Options

--stack-name <name> when stack name differs from PHIDeidentificationStack
--profile <profile> and --region <region> for non-default AWS CLI contexts
--bucket <bucket-name> to bypass stack output lookup

Manual CLI Method (No Helper Script)

```
BUCKET=$(aws cloudformation describe-stacks \
  --stack-name PHIDeidentificationStack \
  --query "Stacks[0].Outputs[?OutputKey=='BucketName'].OutputValue | [0]" \
  --output text)

BATCH_ID="batch-$(date -u +%Y%m%d%H%M%S)"

aws s3api put-object --bucket "$BUCKET" --key "$BATCH_ID/input/"
aws s3 cp /PATH/TO/NOTES "s3://$BUCKET/$BATCH_ID/input/" \
  --recursive --exclude "*" --include "*.txt"
```
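
For teams scripting uploads from Python rather than the shell, the naming convention used by the commands above can be sketched as follows. The helper names `new_batch_id` and `plan_uploads` are hypothetical; only the `batch-<UTC timestamp>` ID and the `<batch-id>/input/` prefix come from the commands above.

```python
from datetime import datetime, timezone
from typing import Optional


def new_batch_id(now: Optional[datetime] = None) -> str:
    """Mirror the shell snippet: batch-<UTC timestamp as YYYYmmddHHMMSS>."""
    now = now or datetime.now(timezone.utc)
    return "batch-" + now.astimezone(timezone.utc).strftime("%Y%m%d%H%M%S")


def plan_uploads(batch_id: str, relative_paths: list[str]) -> dict[str, str]:
    """Map relative note paths to S3 keys under <batch-id>/input/.

    Only paths ending in ".txt" are kept, matching the --include "*.txt" filter.
    """
    return {
        path: f"{batch_id}/input/{path}"
        for path in relative_paths
        if path.endswith(".txt")
    }
```

Each resulting pair could then be passed to an S3 client, for example boto3's `upload_file(Filename, Bucket, Key)`, using the bucket name from the stack outputs.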

Start deidentification:

Select the batch in the dashboard and click Start Deidentification.

Review and approve outputs:

Open the review page, compare original vs redacted text, edit if needed, and approve note-by-note or use Approve All after processing completes.

Costs

The following costs are based on AWS pricing as of March 2026 and a test run of 1,277 clinical notes processed at SQS concurrency of 10 (~60 notes/minute). Actual costs vary by AWS region, note length, and batch size.

Estimated Monthly Recurring Costs

Assumes fewer than 10,000 users processing 50,000 notes/month.

Service	Estimated Cost	Notes
AWS Amplify	~$0	Free tier covers most small deployments
Amazon Cognito	~$0	Free tier covers first 10,000 MAUs
Amazon S3	<$1	$0.023/GB per month
Amazon SQS	~$0	First 1 million requests free
AWS Lambda	<$1	Ingestion/worker/API invocations and duration
API Gateway	<$1	Based on request volume
Amazon DynamoDB	<$1	Minimal read/write activity
Amazon CloudWatch	~$0	Free tier includes logging and metrics
Total Baseline	$0–$5/month	Excludes variable Bedrock inference spend

Per-Note Model Costs (Amazon Bedrock)

Costs below reflect a test run of 1,277 notes using Claude Sonnet 4.5. Token counts vary by note length; these are observed averages.

Component	Avg Tokens/Note	Cost per Note
Input (non-cached)	~2,750	$0.0083
Input (cached)	~1,050	$0.0003
Cache write	~5	<$0.0001
Output	~650	$0.0098
Total per Note	~4,455	~$0.018
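
The per-note figures above can be reproduced from the average token counts. The sketch below assumes per-million-token prices of $3.00 (input), $0.30 (cache read), $3.75 (cache write), and $15.00 (output); these prices are assumptions that happen to match the table and should be checked against current Amazon Bedrock pricing for your region.

```python
# Assumed on-demand prices in USD per million tokens (verify before relying on them).
PRICE_PER_MILLION = {
    "input": 3.00,
    "cache_read": 0.30,
    "cache_write": 3.75,
    "output": 15.00,
}

# Observed average tokens per note, taken from the table above.
AVG_TOKENS = {"input": 2750, "cache_read": 1050, "cache_write": 5, "output": 650}


def cost_per_note(tokens: dict[str, int], prices: dict[str, float]) -> dict[str, float]:
    """Return the USD cost of each token component plus a total."""
    costs = {name: count * prices[name] / 1_000_000 for name, count in tokens.items()}
    costs["total"] = sum(costs.values())
    return costs


breakdown = cost_per_note(AVG_TOKENS, PRICE_PER_MILLION)
for name, usd in breakdown.items():
    print(f"{name:12s} ${usd:.4f}")
```

Under these assumptions the total comes to roughly $0.018 per note, with output tokens as the largest single component.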

Batch Cost Examples

Batch Size	Estimated Cost	Notes
100 notes	~$1.80	Minimal cache benefit
1,000 notes	~$18.00	Cache warming reduces cost
10,000 notes	~$150–$170	Higher cache hit rate reduces input cost

How Prompt Caching Reduces Cost

Amazon Bedrock caches repeated prompt content (system instructions, few-shot examples) across requests within a batch. In our 1,277-note test run:

Cache hit rate: 27.6%
Savings from caching: $3.63 (13% reduction vs. no caching)

Larger batches benefit more from caching because the system prompt is reused across more notes. A batch of 10,000 notes will see a higher cache hit rate, and therefore a lower per-note cost, than a batch of 100.
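
The test-run figures above can be approximated from the average token counts. This sketch makes two assumptions: that the cache hit rate is the share of input tokens served from the cache, and that input and cache-read prices are $3.00 and $0.30 per million tokens. Because it starts from rounded averages, the result lands within about a cent of the reported savings rather than matching it exactly.

```python
NOTES = 1277
AVG_UNCACHED_INPUT = 2750  # tokens per note, from the per-note cost table
AVG_CACHED_INPUT = 1050    # tokens per note, from the per-note cost table

# Assumed USD prices per million tokens; verify against current Bedrock pricing.
INPUT_PRICE = 3.00
CACHE_READ_PRICE = 0.30


def cache_hit_rate(cached: int, uncached: int) -> float:
    """Share of input tokens served from the prompt cache (assumed definition)."""
    return cached / (cached + uncached)


def caching_savings(notes: int, cached_per_note: int) -> float:
    """USD saved by paying the cache-read price instead of the full input price."""
    return notes * cached_per_note * (INPUT_PRICE - CACHE_READ_PRICE) / 1_000_000


rate = cache_hit_rate(AVG_CACHED_INPUT, AVG_UNCACHED_INPUT)
savings = caching_savings(NOTES, AVG_CACHED_INPUT)
print(f"hit rate: {rate:.1%}, savings: ${savings:.2f}")
```

The same two functions can be rerun with token counts from your own batches to see how the hit rate and savings change with batch size.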

Credits

PHI Deidentification Platform is an open-source project developed by the University of Pittsburgh Health Sciences and Sports Analytics Cloud Innovation Center.

Development Team:

Project Leadership:

Technical Lead: Maciej Zukowski - Solutions Architect, Amazon Web Services (AWS)
Program Manager: Kate Ulreich - Program Leader, University of Pittsburgh Health Sciences and Sports Analytics Cloud Innovation Center

Special Thanks:

Dr. Gilles Clermont - Professor of Critical Care Medicine and Vice Chair for Research Operations at the University of Pittsburgh

This project is designed and developed with guidance and support from the Health Sciences and Sports Analytics Cloud Innovation Center, powered by AWS.

License

This project is licensed under the MIT License.

MIT License

Copyright (c) 2026 University of Pittsburgh Health Sciences and Sports Analytics Cloud Innovation Center

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

Disclaimers

Customers are responsible for making their own independent assessment of the information in this document.

This document:

(a) is for informational purposes only,

(b) references AWS product offerings and practices, which are subject to change without notice,

(c) does not create any commitments or assurances from AWS and its affiliates, suppliers or licensors. AWS products or services are provided "as is" without warranties, representations, or conditions of any kind, whether express or implied. The responsibilities and liabilities of AWS to its customers are controlled by AWS agreements, and this document is not part of, nor does it modify, any agreement between AWS and its customers, and

(d) is not to be considered a recommendation or viewpoint of AWS.

Additionally, you are solely responsible for testing, securing, and optimizing all code and assets in the GitHub repo, and all such code and assets should be considered:

(a) as-is and without warranties or representations of any kind,

(b) not suitable for production environments, or on production or other critical data, and

(c) to include shortcuts in order to support rapid prototyping such as, but not limited to, relaxed authentication and authorization and a lack of strict adherence to security best practices.

All work produced is open source. More information can be found in the GitHub repo.

For questions, issues, or contributions, please visit our GitHub repository or contact the development team.
