| Index | Description |
|---|---|
| Overview | See what this project does and its key capabilities |
| Demo | View the demo experience and walkthrough video |
| Description | Learn about the problem and our approach |
| Architecture | View the system architecture diagram |
| Tech Stack | Technologies and services used |
| Deployment | How to install and deploy the solution |
| Usage | How to process notes and approve redactions |
| Costs | Cost drivers and estimation guidance |
| Credits | Project ownership and contributors |
| License | Current license status for this repository |
| Disclaimers | Important legal and operational disclaimers |
PHI Deidentification Platform is an AI-driven system for detecting and redacting sensitive information in clinical and operational text documents. The solution uses large language models (LLMs) through AWS services to identify PHI with context awareness, generate redacted outputs, and support human review before release.
A University of Pittsburgh research team requested this project to address challenges with existing deidentification approaches: static PHI identifier lists that miss context-specific or unique information, long processing times for large note volumes, and mid-processing failures that require full restarts.
The platform provides end-to-end capabilities including batch ingestion, asynchronous processing, LLM-based detection, redacted artifact generation, reviewer approval workflows, and operational metrics for monitoring system health.
Key capabilities include:
- Automated PHI Detection: Uses Claude via Amazon Bedrock to identify sensitive entities in clinical or operational notes.
- Redacted Output Generation: Produces redacted text and entity metadata for each input note.
- Human-in-the-Loop Review: Provides a dashboard to compare original and redacted text, edit redactions, and approve notes.
- Async Processing at Scale: Uses S3, SQS, and Lambda for asynchronous, serverless processing.
- Operational Visibility: Tracks batch stats and emits CloudWatch metrics for throughput, latency, retries, and failures.
Demo video: pii-deid-demo.mp4
Medical research teams handling protected health information (PHI) and personally identifiable information (PII) must de-identify notes before secondary use, collaboration, and analytics. Existing solutions struggle with context-specific identifiers, produce inconsistent results, and require long runtimes. We needed a context-aware automated pipeline with human-in-the-loop validation.
PHI Deidentification Platform addresses these challenges through a context-aware, serverless redaction pipeline that combines LLM-based entity detection, asynchronous processing, and structured human review.
Asynchronous Ingestion Pipeline: Users upload notes in batch folders to Amazon S3, and the notes are then queued in Amazon SQS for processing. This design handles high note volumes without blocking user workflows and supports reliable retries for long-running jobs.
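The retry behavior described above can be sketched as an SQS-triggered Lambda handler that reports only the messages that failed, so successful notes in the same batch are not reprocessed. This is a minimal sketch, not the project's actual worker: the message body shape (`bucket`, `key`) and the `process_note` helper are hypothetical. The `batchItemFailures` response shape is the one Lambda defines for SQS partial batch responses.

```python
import json
from typing import Any, Callable


def process_note(bucket: str, key: str) -> None:
    """Hypothetical placeholder for per-note work (detect PHI, write outputs)."""
    if not key.endswith(".txt"):
        raise ValueError(f"unsupported note type: {key}")


def handler(
    event: dict[str, Any],
    context: Any = None,
    process: Callable[[str, str], None] = process_note,
) -> dict[str, Any]:
    """Process each SQS record independently and report only the failures."""
    failures = []
    for record in event.get("Records", []):
        try:
            # Assumed message shape: {"bucket": "...", "key": "..."}
            body = json.loads(record["body"])
            process(body["bucket"], body["key"])
        except Exception:
            # Listing the message ID asks SQS to redeliver just this message.
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}


if __name__ == "__main__":
    sample_event = {
        "Records": [
            {"messageId": "m1", "body": json.dumps({"bucket": "b", "key": "batch-1/input/a.txt"})},
            {"messageId": "m2", "body": json.dumps({"bucket": "b", "key": "batch-1/input/a.pdf"})},
        ]
    }
    print(handler(sample_event))  # only m2 is reported as failed
```

For SQS to honor this response, the Lambda event source mapping must have `ReportBatchItemFailures` enabled. Otherwise any raised error causes the whole batch to be retried.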
AI Detection and Redaction Layer: Worker Lambda functions call Claude Sonnet 4.5 via Amazon Bedrock to identify PHI in context, then generate redacted outputs and entity artifacts per note. This improves detection quality beyond static pattern matching and supports all 18 HIPAA identifier categories.
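To make the redaction step concrete, here is a minimal sketch of turning detected entities into redacted text. The `Entity` schema (character offsets plus a label) and the `[LABEL]` placeholder format are assumptions for illustration. The platform's actual entity artifacts may be structured differently. The sketch also assumes spans do not overlap.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    start: int  # inclusive character offset into the note
    end: int    # exclusive character offset into the note
    label: str  # hypothetical category name, e.g. "NAME" or "DATE"


def redact(text: str, entities: list[Entity]) -> str:
    """Replace each (non-overlapping) entity span with a [LABEL] placeholder."""
    out = text
    # Work right to left so earlier offsets stay valid after each replacement.
    for ent in sorted(entities, key=lambda e: e.start, reverse=True):
        if not 0 <= ent.start < ent.end <= len(text):
            raise ValueError(f"span out of range: {ent}")
        out = out[:ent.start] + f"[{ent.label}]" + out[ent.end:]
    return out


if __name__ == "__main__":
    note = "John Smith was seen on 03/14/2024."
    date_start = note.index("03/14/2024")
    found = [
        Entity(0, len("John Smith"), "NAME"),
        Entity(date_start, date_start + len("03/14/2024"), "DATE"),
    ]
    print(redact(note, found))  # [NAME] was seen on [DATE].
```

Keeping the entity list separate from the redacted text lets a reviewer see why each span was removed and adjust individual redactions without rerunning detection.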
Secure Access: The platform uses Cognito for authentication and access control across the dashboard and API workflows. This ensures only authorized users can sign in, process batches, and approve redacted outputs.
Human-in-the-Loop Review Workflow: The review interface lets users inspect original vs. redacted text, edit redactions, and approve notes or full batches. The system tracks approved outputs separately, supporting controlled release workflows and reviewer quality oversight.
| Category | Technology | Purpose |
|---|---|---|
| Amazon Web Services (AWS) | AWS CDK | Infrastructure as code |
| | Amazon Bedrock | Claude-based PHI detection |
| | AWS Lambda | Ingestion, processing, and API compute |
| | Amazon S3 | Input/output note storage |
| | Amazon SQS | Asynchronous note processing queue |
| | Amazon API Gateway | Authenticated REST API |
| | Amazon Cognito | Authentication and user management |
| | Amazon DynamoDB | Batch statistics/state |
| | AWS Amplify | Frontend hosting |
| | Amazon CloudWatch | Metrics and operational dashboard |
| Backend | Python 3.12 | Lambda runtime |
| | pydantic-ai | Agent orchestration around Bedrock |
| | boto3 | AWS SDK |
| | aws-lambda-powertools | Logging, metrics, and partial batch handling |
| Frontend | React | Review dashboard UI |
| | Vite | Frontend build and dev server |
| | TypeScript | Type-safe frontend development |
| | TanStack Query | API data synchronization and caching |
| | AWS Amplify SDK | Cognito authentication integration |
Prepare the following tools and accounts before deploying:
- An active AWS account
- Node.js 18 or later, from the official download page or installed with nvm
- AWS CDK v2, installed globally:

      npm install -g aws-cdk

- AWS CLI, installed by following the official AWS installation guide
- Docker from docker.com/get-started
- Git from git-scm.com
- Configure the AWS CLI with your credentials:

      aws configure

  Provide your AWS Access Key ID, Secret Access Key, and default region (for example, `us-east-1`) when prompted.

- Bootstrap CDK for your target account and region (required once per account/region):

      cdk bootstrap aws://ACCOUNT_ID/REGION

  Replace `ACCOUNT_ID` and `REGION` with the AWS account and region where you are deploying.
- Clone the repository:

      git clone git@github.com:pitt-cic/phi-deidentification.git
      cd phi-deidentification

- Deploy the infrastructure with CDK:

      cd cdk
      npm install
      npm run deploy

- Retrieve the stack outputs (API URL, Cognito IDs, bucket name, region, dashboard URL):

      aws cloudformation describe-stacks \
        --stack-name PHIDeidentificationStack \
        --query "Stacks[0].Outputs[].[OutputKey,OutputValue]" \
        --output table

- Deploy the frontend to AWS Amplify by running the deployment script:

      cd ../frontend
      chmod +x ./deploy-frontend.sh
      ./deploy-frontend.sh
To run the frontend locally:

- Install dependencies (from `frontend/`):

      npm install

- Add environment variables to `frontend/.env`:

      VITE_API_URL=<ApiUrl>
      VITE_USER_POOL_ID=<UserPoolId>
      VITE_USER_POOL_CLIENT_ID=<UserPoolClientId>

- Start the development server:

      npm run dev

For local evaluation tooling, synthetic data generation, and running the dashboard locally, see tooling/README.md.
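Copying stack outputs into `frontend/.env` by hand is error-prone, so the mapping can be scripted. The sketch below assumes the CloudFormation output keys are named `ApiUrl`, `UserPoolId`, and `UserPoolClientId`, as the placeholders in the `.env` template suggest. Verify them against your own stack outputs. The sample values are placeholders, not real resources.

```python
import json

# Vite variable name -> CloudFormation output key (assumed from the .env placeholders).
ENV_TO_OUTPUT = {
    "VITE_API_URL": "ApiUrl",
    "VITE_USER_POOL_ID": "UserPoolId",
    "VITE_USER_POOL_CLIENT_ID": "UserPoolClientId",
}


def env_lines(describe_stacks_json: str) -> list[str]:
    """Build .env lines from `aws cloudformation describe-stacks --output json`."""
    stacks = json.loads(describe_stacks_json)["Stacks"]
    outputs = {o["OutputKey"]: o["OutputValue"] for o in stacks[0].get("Outputs", [])}
    missing = [key for key in ENV_TO_OUTPUT.values() if key not in outputs]
    if missing:
        raise KeyError(f"stack outputs missing: {missing}")
    return [f"{name}={outputs[key]}" for name, key in ENV_TO_OUTPUT.items()]


if __name__ == "__main__":
    # Placeholder payload in the shape returned by describe-stacks.
    sample = json.dumps({
        "Stacks": [{
            "Outputs": [
                {"OutputKey": "ApiUrl", "OutputValue": "https://api.example.com/prod"},
                {"OutputKey": "UserPoolId", "OutputValue": "us-east-1_EXAMPLE"},
                {"OutputKey": "UserPoolClientId", "OutputValue": "exampleclientid"},
            ]
        }]
    })
    print("\n".join(env_lines(sample)))
```

Failing loudly on a missing output key is deliberate: a partially filled `.env` tends to surface later as a confusing sign-in error in the browser.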
- Open the application:
  - Primary path: use `AmplifyAppUrl` from the stack outputs
  - Local: use the URL printed by `npm run dev`

- Create a Cognito user (admin action):

      aws cognito-idp admin-create-user \
        --user-pool-id <UserPoolId> \
        --username "user@example.com" \
        --user-attributes \
          Name=email,Value=user@example.com \
          Name=given_name,Value=First \
          Name=family_name,Value=Last \
        --desired-delivery-mediums EMAIL

  Replace `<UserPoolId>` with the output from your deployed stack.

- Sign in: log in with the invited user and temporary password, then set a permanent password on first login.
- Create a batch and upload `.txt` notes:

      ./scripts/create_batch.sh --notes-dir /PATH/TO/NOTES

  To upload to an existing batch:

      ./scripts/create_batch.sh --batch-id "<batch-id>" --notes-dir /PATH/TO/NOTES

  Additional options:
  - `--stack-name <name>` when the stack name differs from `PHIDeidentificationStack`
  - `--profile <profile>` and `--region <region>` for non-default AWS CLI contexts
  - `--bucket <bucket-name>` to bypass stack output lookup

  Manual CLI method (no helper script):

      BUCKET=$(aws cloudformation describe-stacks \
        --stack-name PHIDeidentificationStack \
        --query "Stacks[0].Outputs[?OutputKey=='BucketName'].OutputValue | [0]" \
        --output text)
      BATCH_ID="batch-$(date -u +%Y%m%d%H%M%S)"
      aws s3api put-object --bucket "$BUCKET" --key "$BATCH_ID/input/"
      aws s3 cp /PATH/TO/NOTES "s3://$BUCKET/$BATCH_ID/input/" \
        --recursive --exclude "*" --include "*.txt"
- Start deidentification: select the batch in the dashboard and click Start Deidentification.

- Review and approve outputs: open the review page, compare original vs. redacted text, edit if needed, and approve note by note or use Approve All after processing completes.
The following costs are based on AWS pricing as of March 2026 and a test run of 1,277 clinical notes processed at SQS concurrency of 10 (~60 notes/minute). Actual costs vary by AWS region, note length, and batch size.
The baseline estimate below assumes fewer than 10,000 monthly active users processing 50,000 notes per month.
| Service | Estimated Cost | Notes |
|---|---|---|
| AWS Amplify | ~$0 | Free tier covers most small deployments |
| Amazon Cognito | ~$0 | Free tier covers first 10,000 MAUs |
| Amazon S3 | <$1 | $0.023/GB per month |
| Amazon SQS | ~$0 | First 1 million requests free |
| AWS Lambda | <$1 | Ingestion/worker/API invocations and duration |
| API Gateway | <$1 | Based on request volume |
| Amazon DynamoDB | <$1 | Minimal read/write activity |
| Amazon CloudWatch | ~$0 | Free tier includes logging and metrics |
| Total Baseline | $0–$5/month | Excludes variable Bedrock inference spend |
Costs below reflect a test run of 1,277 notes using Claude Sonnet 4.5. Token counts vary by note length; these are observed averages.
| Component | Avg Tokens/Note | Cost per Note |
|---|---|---|
| Input (non-cached) | ~2,750 | $0.0083 |
| Input (cached) | ~1,050 | $0.0003 |
| Cache write | ~5 | <$0.0001 |
| Output | ~650 | $0.0098 |
| Total per Note | ~4,455 | ~$0.018 |
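The per-note figures can be reproduced from the average token counts. The per-million-token rates in this sketch are assumptions chosen because they match the table's per-note costs. They are not quoted from a price list, so check current Amazon Bedrock pricing for your region before relying on them.

```python
# Assumed USD rates per million tokens (inferred from the table, not authoritative).
RATES = {"input": 3.00, "cache_read": 0.30, "cache_write": 3.75, "output": 15.00}

# Observed average tokens per note from the 1,277-note test run.
TOKENS = {"input": 2750, "cache_read": 1050, "cache_write": 5, "output": 650}


def cost_per_note(tokens: dict[str, int], rates: dict[str, float]) -> dict[str, float]:
    """Return the cost of each component plus a 'total' entry, in USD."""
    costs = {name: count * rates[name] / 1_000_000 for name, count in tokens.items()}
    costs["total"] = sum(costs.values())
    return costs


if __name__ == "__main__":
    costs = cost_per_note(TOKENS, RATES)
    print(f"total tokens per note: {sum(TOKENS.values())}")  # 4455
    print(f"total cost per note: ${costs['total']:.3f}")      # $0.018
    print(f"1,000 notes: ${costs['total'] * 1000:.2f}")
```

Output tokens are about 15% of the total but account for more than half the cost under these rates, so long redacted outputs affect spend more than long inputs.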
| Batch Size | Estimated Cost | Notes |
|---|---|---|
| 100 notes | ~$1.80 | Minimal cache benefit |
| 1,000 notes | ~$18.00 | Cache warming reduces cost |
| 10,000 notes | ~$150–$170 | Higher cache hit rate reduces input cost |
Amazon Bedrock caches repeated prompt content (system instructions, few-shot examples) across requests within a batch. In our 1,277-note test run:
- Cache hit rate: 27.6%
- Savings from caching: $3.63 (13% reduction vs. no caching)
Larger batches benefit more from caching because the system prompt is reused across more notes. A batch of 10,000 notes should therefore see a higher cache hit rate, and a lower per-note cost, than a batch of 100.
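The cache figures above follow from the average token counts. In this sketch the hit rate is the share of input tokens served from cache, and the savings are the difference between paying the full input rate and the cached-read rate on those tokens. Both rates are assumptions for illustration. Because the token averages are rounded, the result lands within about a cent of the reported $3.63.

```python
NOTES = 1277
INPUT_TOKENS = 2750   # average non-cached input tokens per note
CACHED_TOKENS = 1050  # average cached input tokens per note

# Assumed USD rates per million tokens (illustrative, not authoritative).
INPUT_RATE = 3.00
CACHE_READ_RATE = 0.30


def cache_hit_rate(cached: int, uncached: int) -> float:
    """Fraction of input tokens that were read from the prompt cache."""
    return cached / (cached + uncached)


def caching_savings(notes: int, cached: int, full_rate: float, cached_rate: float) -> float:
    """USD saved by billing cached tokens at the cached-read rate instead of the full rate."""
    return notes * cached * (full_rate - cached_rate) / 1_000_000


if __name__ == "__main__":
    rate = cache_hit_rate(CACHED_TOKENS, INPUT_TOKENS)
    saved = caching_savings(NOTES, CACHED_TOKENS, INPUT_RATE, CACHE_READ_RATE)
    print(f"cache hit rate: {rate:.1%}")      # 27.6%
    print(f"savings from caching: ${saved:.2f}")  # $3.62
```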
PHI Deidentification Platform is an open-source project developed by the University of Pittsburgh Health Sciences and Sports Analytics Cloud Innovation Center.
Development Team:
Project Leadership:
- Technical Lead: Maciej Zukowski - Solutions Architect, Amazon Web Services (AWS)
- Program Manager: Kate Ulreich - Program Leader, University of Pittsburgh Health Sciences and Sports Analytics Cloud Innovation Center
Special Thanks:
- Dr. Gilles Clermont - Professor of Critical Care Medicine and Vice Chair for Research Operations at the University of Pittsburgh
This project is designed and developed with guidance and support from the Health Sciences and Sports Analytics Cloud Innovation Center, powered by AWS.
This project is licensed under the MIT License.
MIT License
Copyright (c) 2026 University of Pittsburgh Health Sciences and Sports Analytics Cloud Innovation Center
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
Customers are responsible for making their own independent assessment of the information in this document.
This document:
(a) is for informational purposes only,
(b) references AWS product offerings and practices, which are subject to change without notice,
(c) does not create any commitments or assurances from AWS and its affiliates, suppliers or licensors. AWS products or services are provided "as is" without warranties, representations, or conditions of any kind, whether express or implied. The responsibilities and liabilities of AWS to its customers are controlled by AWS agreements, and this document is not part of, nor does it modify, any agreement between AWS and its customers, and
(d) is not to be considered a recommendation or viewpoint of AWS.
Additionally, you are solely responsible for testing, securing, and optimizing all code and assets in the GitHub repo, and all such code and assets should be considered:
(a) as-is and without warranties or representations of any kind,
(b) not suitable for production environments, or on production or other critical data, and
(c) to include shortcuts in order to support rapid prototyping such as, but not limited to, relaxed authentication and authorization and a lack of strict adherence to security best practices.
All work produced is open source. More information can be found in the GitHub repo.
For questions, issues, or contributions, please visit our GitHub repository or contact the development team.