Automated PII detection and redaction system for Indian Government IDs with OpenMetadata integration
Morolo is a PII governance system designed to detect and redact sensitive information from Indian documents. It integrates with OpenMetadata for metadata management and provides risk scoring for compliance purposes.
- Detects Indian Government IDs (Aadhaar, PAN, Driving License) plus email and phone numbers
- Calculates risk scores based on detected PII types
- Integrates with OpenMetadata for metadata tracking
- Provides document redaction capabilities
- Offers REST API for document processing
- This is a proof-of-concept/hackathon project, not production-ready
- Detection accuracy varies based on document quality
- No real-time processing (uses async task queue)
- Limited to text-based documents (OCR quality dependent)
docker --version # 20.10+
docker-compose --version # 2.0+

git clone https://github.com/idhanx/Morolo.git
cd Morolo

cp .env.example .env
# Edit .env and set:
# - OM_TOKEN (get from OpenMetadata UI)
# - JWT_SECRET_KEY (generate with: openssl rand -hex 32)

# Start OpenMetadata infrastructure first
docker-compose -f docker-compose-postgres.yml up -d
# Wait for services to be healthy (~60 seconds)
sleep 60
# Start application services
docker-compose up -d
# Check status
docker-compose ps

# Check health
curl http://localhost:8000/health
# Upload a document (you'll need a PDF with PII)
curl -X POST "http://localhost:8000/upload" \
-F "file=@your_document.pdf" \
-F "redaction_level=FULL"
# Check status (replace {doc_id} with response from upload)
curl "http://localhost:8000/status/{doc_id}"

| Service | URL | Credentials |
|---|---|---|
| FastAPI Docs | http://localhost:8000/docs | None |
| OpenMetadata | http://localhost:8585 | admin / admin |
| MinIO Console | http://localhost:9001 | minioadmin / minioadmin |
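
The status check from the quick-start commands can also be scripted. The following is a minimal sketch, not part of the project: `fetch_status` and `wait_for_result` are hypothetical helper names, and it assumes that any status other than `PENDING` means a result is ready to read, since the full set of status values is not listed here.

```python
import json
import time
import urllib.request

BASE_URL = "http://localhost:8000"  # backend address used in the curl examples


def fetch_status(doc_id, base_url=BASE_URL):
    """GET /status/{doc_id} and return the decoded JSON body."""
    with urllib.request.urlopen(f"{base_url}/status/{doc_id}", timeout=10) as resp:
        return json.load(resp)


def wait_for_result(doc_id, fetch=fetch_status, interval=2.0, max_attempts=30,
                    sleep=time.sleep):
    """Poll until the status is no longer PENDING, or give up.

    Assumption: "PENDING" is the only status that means "keep waiting".
    """
    for attempt in range(max_attempts):
        body = fetch(doc_id)
        if body.get("status") != "PENDING":
            return body
        if attempt < max_attempts - 1:
            sleep(interval)  # back off before the next poll
    raise TimeoutError(f"{doc_id} still PENDING after {max_attempts} polls")
```

With the defaults this polls for roughly a minute, which leaves headroom over the 10-30 second processing time the project reports as typical.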

┌─────────────────────────────────────────┐
│ FastAPI Backend                         │
│ - Document upload                       │
│ - PII detection (Presidio)              │
│ - Risk scoring                          │
└─────────────────────────────────────────┘
                     ↓
┌─────────────────────────────────────────┐
│ Celery Worker                           │
│ - Async document processing             │
│ - Redaction                             │
│ - OpenMetadata integration              │
└─────────────────────────────────────────┘
                     ↓
┌─────────────────────────────────────────┐
│ Storage & Metadata                      │
│ - PostgreSQL (metadata)                 │
│ - MinIO (documents)                     │
│ - OpenMetadata (governance)             │
└─────────────────────────────────────────┘

- Backend: FastAPI (Python)
- PII Detection: Microsoft Presidio with custom recognizers
- Task Queue: Celery + Redis
- Database: PostgreSQL
- Storage: MinIO (S3-compatible)
- Metadata: OpenMetadata 1.3.3
- Frontend: Next.js (basic UI)
POST /upload
Content-Type: multipart/form-data
Parameters:
- file: PDF/DOCX/PNG/JPG (max 10MB)
- redaction_level: LIGHT|FULL|SYNTHETIC|NONE
Response:
{
"doc_id": "uuid",
"filename": "document.pdf",
"status": "PENDING"
}

GET /status/{doc_id}
Response:
{
"doc_id": "uuid",
"status": "PII_DETECTED",
"risk_score": 85.0,
"risk_band": "HIGH",
"pii_summary": {
"AADHAAR": 1,
"EMAIL": 2
}
}

GET /risk/{doc_id}
Response:
{
"risk_score": 85.0,
"risk_band": "HIGH",
"entity_breakdown": {...}
}

# Original document
GET /documents/{doc_id}/download?format=original
# Redacted document
GET /documents/{doc_id}/download?format=redacted

| Entity Type | Detection Method | Notes |
|---|---|---|
| Aadhaar | Regex pattern | 12-digit format (spaced/hyphenated) |
| PAN | Regex pattern | 10-character alphanumeric |
| Driving License | Regex pattern | State-based format |
| Email | Presidio built-in | Standard email detection |
| Phone | Presidio built-in | Indian phone formats |
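
To make the regex rows in the table concrete, here is a rough sketch of format-only matching for Aadhaar and PAN using the standard library. These are simplified patterns written for illustration, not the project's actual Presidio recognizers, and `find_ids` is a hypothetical helper.

```python
import re

# Simplified, illustrative patterns (assumptions, not Morolo's recognizers).
# Aadhaar: 12 digits, optionally grouped 4-4-4 with spaces or hyphens.
AADHAAR_RE = re.compile(r"\b\d{4}[ -]?\d{4}[ -]?\d{4}\b")
# PAN: five letters, four digits, one letter.
PAN_RE = re.compile(r"\b[A-Z]{5}[0-9]{4}[A-Z]\b")


def find_ids(text):
    """Return (entity_type, matched_text) pairs in order of appearance."""
    hits = [(m.start(), "AADHAAR", m.group()) for m in AADHAAR_RE.finditer(text)]
    hits += [(m.start(), "PAN", m.group()) for m in PAN_RE.finditer(text)]
    return [(kind, value) for _, kind, value in sorted(hits)]


if __name__ == "__main__":
    print(find_ids("Aadhaar: 1234-5678-9012, PAN: ABCDE1234F"))
```

A pattern like this accepts any 12-digit number, so a real recognizer would likely add context words or validate the Aadhaar check digit (Verhoeff) to cut false positives.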
Risk scores are calculated based on:
- Type of PII detected (Aadhaar has highest weight)
- Number of entities found
- Diversity of PII types
- Confidence scores
Risk Bands:
- LOW (<5): Minimal PII
- MEDIUM (5-15): Some PII detected
- HIGH (15-30): Government ID detected
- CRITICAL (≥30): Multiple government IDs
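
As an illustration of how the factors and bands above could fit together, here is a small sketch. The weights, the diversity bonus, and the function names are all assumptions made for this example; the source only states that Aadhaar carries the highest weight. Band edges are treated as half-open intervals (a score of exactly 15 counts as HIGH), which is also an assumption.

```python
# Hypothetical per-type weights; only "Aadhaar is highest" comes from the README.
WEIGHTS = {
    "AADHAAR": 10.0,
    "PAN": 8.0,
    "DRIVING_LICENSE": 6.0,
    "EMAIL": 2.0,
    "PHONE": 2.0,
}
DIVERSITY_BONUS = 1.0  # assumed bonus for each additional distinct PII type


def risk_score(entities):
    """Score an iterable of (entity_type, confidence) pairs.

    Each entity contributes weight * confidence, so both the number of
    entities and their confidence raise the score.
    """
    entities = list(entities)
    base = sum(WEIGHTS.get(kind, 1.0) * conf for kind, conf in entities)
    distinct = len({kind for kind, _ in entities})
    return base + DIVERSITY_BONUS * max(distinct - 1, 0)


def risk_band(score):
    """Map a score to the bands listed above (lower edge inclusive)."""
    if score < 5:
        return "LOW"
    if score < 15:
        return "MEDIUM"
    if score < 30:
        return "HIGH"
    return "CRITICAL"
```

For example, one Aadhaar match at confidence 1.0 plus one email at 0.5 scores 10 + 1 + 1 = 12 under these made-up weights, which lands in MEDIUM.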
- Container Entities: Documents are registered as Container entities
- Custom Properties: Risk score, risk band, detected PII types
- Classifications: Hierarchical PII classifications (when working)
- Lineage: Tracks original → redacted transformation
- Classification tagging via API has compatibility issues with OM v1.3.3
- Manual tagging in the OM UI works as a workaround
- Lineage tracking is basic (no complex transformations)
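
For orientation, the Container registration and lineage items above might translate into request bodies like the ones sketched here. This is a guess at the shape, not the project's code: the property names (`riskScore`, `riskBand`, `piiTypes`) and both helper functions are hypothetical, and the field layout follows a general reading of the OpenMetadata REST API that should be checked against the 1.3.3 schema before use.

```python
def container_request(doc_id, filename, service_fqn, risk_score, risk_band, pii_types):
    """Build a create-or-update body for a Container entity.

    Assumptions: custom properties travel in the `extension` object and were
    already defined on the Container type; the property names are made up.
    """
    return {
        "name": doc_id,
        "displayName": filename,
        "service": service_fqn,  # fully qualified name of the storage service
        "extension": {
            "riskScore": risk_score,
            "riskBand": risk_band,
            "piiTypes": sorted(pii_types),
        },
    }


def lineage_request(original_id, redacted_id):
    """Build an edge body linking the original container to the redacted one."""
    return {
        "edge": {
            "fromEntity": {"id": original_id, "type": "container"},
            "toEntity": {"id": redacted_id, "type": "container"},
        }
    }
```

Keeping payload construction separate from the HTTP call makes it easy to unit test the mapping without a running OpenMetadata instance.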
# Check Docker is running
docker ps
# View logs
docker-compose logs backend
docker-compose logs -f celery_worker
# Restart everything
docker-compose down
docker-compose -f docker-compose-postgres.yml down
docker-compose -f docker-compose-postgres.yml up -d
sleep 60
docker-compose up -d

The docker-compose files use a shared network. Make sure:
- Start docker-compose-postgres.yml first
- Wait for services to be healthy
- Then start docker-compose.yml
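
Instead of a fixed sleep, the "wait for services to be healthy" step could poll the OpenMetadata health endpoint. The following is a sketch under that assumption; `is_up` and `wait_until_up` are hypothetical helpers, and the timeout values are arbitrary.

```python
import time
import urllib.request

OM_HEALTH_URL = "http://localhost:8585/api/v1/health-check"


def is_up(url):
    """Return True if the URL answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except OSError:  # URLError and HTTPError are OSError subclasses
        return False


def wait_until_up(url, probe=is_up, timeout=120.0, interval=3.0,
                  sleep=time.sleep, clock=time.monotonic):
    """Probe the URL until it is healthy or the timeout elapses."""
    deadline = clock() + timeout
    while True:
        if probe(url):
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


if __name__ == "__main__":
    ok = wait_until_up(OM_HEALTH_URL)
    print("OpenMetadata is up" if ok else "Timed out waiting for OpenMetadata")
```

Run it between the two `docker-compose up -d` commands; a False result is a hint to inspect the OpenMetadata logs before starting the application stack.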
- Ensure document has clear text (not scanned images)
- For scanned PDFs, OCR quality matters (300+ DPI recommended)
- Check confidence threshold in .env (default: 0.7)
- Verify OM_TOKEN is set correctly in .env
- Check OpenMetadata is running: curl http://localhost:8585/api/v1/health-check
- System will work without OM (graceful degradation)
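
Graceful degradation of this kind is commonly done by treating the metadata sync as best-effort. The sketch below shows the general pattern only; `sync_best_effort` and the `push` callable are hypothetical and do not reflect how Morolo implements it.

```python
import logging

logger = logging.getLogger("morolo.openmetadata")  # logger name is an assumption


def sync_best_effort(push, record):
    """Call push(record); on failure, log a warning and return False.

    The broad except is deliberate: a metadata outage should never fail
    the document-processing task that calls this.
    """
    try:
        push(record)
    except Exception as exc:
        logger.warning("OpenMetadata sync skipped for %s: %s",
                       record.get("doc_id"), exc)
        return False
    return True
```

The boolean result lets the caller record whether a document still needs to be synced once OpenMetadata is reachable again.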
- Not Production Ready: This is a hackathon/proof-of-concept project
- No Authentication: API endpoints are open (add auth for production)
- Limited Error Handling: Some edge cases not covered
- OCR Quality: Depends on Tesseract, may miss text in poor scans
- Classification Tagging: API compatibility issue with OM v1.3.3
- No Real-time Updates: Uses polling, not WebSockets
- Single Server: Not designed for horizontal scaling
- Basic Frontend: Minimal UI, needs improvement
- Default credentials in .env.example are for development only
- No rate limiting on API endpoints
- No input validation for file types
- No virus scanning on uploads
- Audit logging is basic
- Document processing is async (10-30 seconds typical)
- Large documents (>5MB) may time out
- Concurrent uploads limited by Celery worker count
- No caching implemented
This project is configured for local development. For production deployment, you would need:
- Proper authentication and authorization
- SSL/TLS certificates
- Managed databases (not Docker containers)
- Load balancing
- Monitoring and alerting
- Backup and disaster recovery
- Security hardening
- Rate limiting
- Input validation
- Virus scanning
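
As one example of the input validation item, an upload check could compare the file extension, size, and leading bytes against the formats the /upload endpoint accepts (PDF, DOCX, PNG, JPG, max 10MB). This is a sketch only; `validate_upload` is a hypothetical helper and treating 10MB as 10 * 1024 * 1024 bytes is an assumption.

```python
from pathlib import Path

MAX_BYTES = 10 * 1024 * 1024  # 10MB limit from the /upload description

# Leading bytes expected for each accepted extension.
SIGNATURES = {
    ".pdf": b"%PDF-",
    ".docx": b"PK\x03\x04",  # DOCX files are ZIP archives
    ".png": b"\x89PNG\r\n\x1a\n",
    ".jpg": b"\xff\xd8\xff",
    ".jpeg": b"\xff\xd8\xff",
}


def validate_upload(filename, data):
    """Return None if the upload looks acceptable, else a reason string."""
    suffix = Path(filename).suffix.lower()
    signature = SIGNATURES.get(suffix)
    if signature is None:
        return f"unsupported file type: {suffix or 'none'}"
    if len(data) > MAX_BYTES:
        return "file exceeds 10MB"
    if not data.startswith(signature):
        return "content does not match file extension"
    return None
```

A signature check only catches mislabeled files; it is not a substitute for virus scanning or full parsing of the document.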
# Run unit tests
docker-compose exec backend pytest backend/tests/unit/ -v
# Check specific component
docker-compose exec backend python -c "
from backend.services.pii_detector import PIIDetector
detector = PIIDetector()
print('PII Detector initialized')
"

# Test database connection
docker-compose exec backend python -c "
from backend.core.database import get_db_engine
engine = get_db_engine()
print('Database connected')
"
# Test MinIO connection
docker-compose exec backend python -c "
from backend.core.storage import get_storage_client
client = get_storage_client()
print('MinIO connected')
"

Morolo/
├── backend/                    # FastAPI application
│   ├── api/                    # REST endpoints
│   ├── services/               # Business logic
│   ├── models/                 # Database models
│   ├── tasks/                  # Celery tasks
│   └── tests/                  # Unit tests
├── frontend/                   # Next.js UI (basic)
├── mcp-server/                 # MCP interface (experimental)
├── docker-compose.yml          # Application services
├── docker-compose-postgres.yml # OpenMetadata stack
└── .env.example                # Configuration template

This is a hackathon project. Contributions welcome but note the limitations above.
- Fork the repository
- Create a feature branch
- Make your changes
- Submit a pull request
MIT License - See LICENSE file for details
- Microsoft Presidio - PII detection framework
- OpenMetadata - Metadata governance platform
- Tesseract - OCR engine
- FastAPI - Web framework
For issues or questions:
- Check logs: docker-compose logs [service]
- Review troubleshooting section above
- Create GitHub issue with details
Morolo - PII Governance for Indian Documents
Note: This is a proof-of-concept project developed for a hackathon. It demonstrates PII detection and OpenMetadata integration but is not production-ready.