# Max Winning Project: Gemini-Centric Document Classification System

A comprehensive, enterprise-grade document classification system powered by Google's Gemini 2.0 Flash, featuring RAG (Retrieval Augmented Generation), CAG (Context Augmented Generation), and Solana blockchain audit trails.

## Features

- Policy Knowledge Base: Comprehensive category definitions, PII patterns, and SME-validated examples
- Multi-Modal Document Processing: PDF parsing with OCR for text and images
- Citation Mapping: Precise source location tracking with bounding boxes
- Gemini File Search Store: RAG-based policy grounding
- Dynamic Prompt Tree: Sequential classification flow (UNSAFE → CONFIDENTIAL → SENSITIVE → PUBLIC)
- RAG + CAG Grounding: Policy knowledge base + cached document content
- Structured JSON Output: Category, confidence, reasoning, and citations
- Dual-Layer Validation: Consensus-based auto-approval (90%+ confidence threshold)
- Solana Blockchain: Immutable audit trails on Solana devnet
- SQLite Audit Logs: Complete classification history and HITL reviews
- Web UI: Flask-based interface with HITL feedback loop

## Architecture

    ┌─────────────────────────────────────────────────────────────┐
    │                    Document Upload (PDF)                    │
    └─────────────────┬───────────────────────────────────────────┘
                      │
                      ▼
    ┌─────────────────────────────────────────────────────────────┐
    │   Document Processing (PyMuPDF + OCR + Citation Mapping)    │
    └─────────────────┬───────────────────────────────────────────┘
                      │
                      ▼
    ┌─────────────────────────────────────────────────────────────┐
    │           Gemini Classifier (RAG + CAG Pipeline)            │
    │  ┌──────────────┐   ┌──────────────┐   ┌──────────────┐     │
    │  │ Policy RAG   │   │ Cached Doc   │   │ Dual Layer   │     │
    │  │ (File Search)│ + │ (CAG)        │ + │ Validation   │     │
    │  └──────────────┘   └──────────────┘   └──────────────┘     │
    └─────────────────┬───────────────────────────────────────────┘
                      │
                      ▼
    ┌─────────────────────────────────────────────────────────────┐
    │              Classification Result + Metadata               │
    └───────────┬────────────────────┬────────────────────────────┘
                │                    │
                ▼                    ▼
        ┌──────────────┐     ┌──────────────┐
        │ Solana       │     │ SQLite Audit │
        │ Blockchain   │     │ Logger       │
        └──────────────┘     └──────────────┘
                          │
                          ▼
    ┌─────────────────────────────────────────────────────────────┐
    │           Web UI (Dashboard + HITL Review Queue)            │
    └─────────────────────────────────────────────────────────────┘

## Prerequisites

- Python 3.9+
- Tesseract OCR
- API Keys:
  - Google Gemini API
  - Solana Devnet access

## Installation

Change into the project directory:

    cd gemini-classifier

Install Tesseract OCR.

macOS:

    brew install tesseract

Ubuntu/Debian:

    sudo apt-get update
    sudo apt-get install tesseract-ocr

Windows: Download and install Tesseract from https://github.com/UB-Mannheim/tesseract/wiki

Create and activate a virtual environment:

    python3 -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate

Install the Python dependencies:

    pip install -r requirements.txt

Configure the `.env` file with your API keys:

    GEMINI_API_KEY=your-gemini-api-key
    SOLANA_CLUSTER_URL=https://api.devnet.solana.com

Note: In production, use environment variables or secure secret management instead of committing API keys.

Start the application:

    python main.py

The application will be available at:
- Main Upload: http://localhost:5000
- Dashboard: http://localhost:5000/dashboard
- HITL Queue: http://localhost:5000/hitl/queue

## Usage

### Classify a Document

1. Navigate to http://localhost:5000
2. Upload a PDF file (drag-and-drop or click to browse)
3. Wait for processing (typically 5-15 seconds)
4. Review the classification result:
   - Category (UNSAFE/CONFIDENTIAL/SENSITIVE/PUBLIC)
   - Confidence score
   - Reasoning and citations
   - Blockchain audit hash

### Review a Classification (HITL)

1. Navigate to the HITL Queue (http://localhost:5000/hitl/queue)
2. Click "Review Document" on any pending classification
3. Verify or correct the classification
4. Add reviewer notes
5. Submit the review

Important: Corrected classifications are automatically added to the RAG knowledge base as new few-shot examples, improving future accuracy.
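
A minimal sketch of how a reviewer correction might be appended to a few-shot examples file. The helper name `append_few_shot_example`, the record fields (`text`, `category`, `reviewer_notes`), and the assumption that the file holds a JSON list are all hypothetical; the actual schema of `policies/few_shot_examples.json` is not described in this README.

```python
import json
from pathlib import Path

VALID_CATEGORIES = {"UNSAFE", "CONFIDENTIAL", "SENSITIVE", "PUBLIC"}


def append_few_shot_example(path, text, category, reviewer_notes=""):
    """Append one corrected example to a JSON list file (hypothetical schema)."""
    if category not in VALID_CATEGORIES:
        raise ValueError(f"Unknown category: {category}")

    path = Path(path)
    # Start from an empty list if the file does not exist yet.
    examples = json.loads(path.read_text(encoding="utf-8")) if path.exists() else []
    examples.append(
        {"text": text, "category": category, "reviewer_notes": reviewer_notes}
    )
    path.write_text(json.dumps(examples, indent=2), encoding="utf-8")
    return len(examples)
```

Rejecting unknown categories before writing keeps a typo in a review form from polluting the examples that later prompts rely on.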

## Classification Categories

### UNSAFE

- Harmful, violent, or threatening content
- Illegal activity instructions
- Malware or security exploits
- Action: Immediate rejection and escalation

### CONFIDENTIAL

- Trade secrets and proprietary algorithms
- Financial records (with SSN, credit cards)
- Legal documents (attorney-client privilege)
- M&A plans, executive compensation
- Source code and IP
- PII: SSN, credit cards, bank accounts, medical records, passports

### SENSITIVE

- Internal memos and communications
- Employee directories
- Draft documents
- Internal project plans
- Non-executive budgets
- PII: Emails, phone numbers, addresses, employee IDs

### PUBLIC

- Published marketing materials
- Public website content
- Press releases
- Open-source code
- Public documentation
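
The split between high-risk PII (CONFIDENTIAL) and lower-risk PII (SENSITIVE) can be sketched with a few regular expressions. The patterns below are simplified assumptions for illustration and are not the contents of `policies/pii_patterns.json`; `pii_tier` is a hypothetical helper. It reports only what the PII alone implies and does not replace the model's classification, which also weighs document context.

```python
import re

# Simplified, illustrative patterns (assumptions; real patterns would be stricter).
HIGH_RISK_PII = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d{4}[- ]?){3}\d{4}\b"),
}
LOW_RISK_PII = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}


def pii_tier(text):
    """Return (category implied by PII alone, matched pattern names)."""
    # High-risk PII is checked first, mirroring the most-restrictive-first order.
    high = sorted(name for name, rx in HIGH_RISK_PII.items() if rx.search(text))
    if high:
        return "CONFIDENTIAL", high
    low = sorted(name for name, rx in LOW_RISK_PII.items() if rx.search(text))
    if low:
        return "SENSITIVE", low
    # No PII found: the PII signal says nothing about the category.
    return None, []
```

For example, `pii_tier("John Smith - SSN: 123-45-6789")` returns `("CONFIDENTIAL", ["ssn"])`, while text without any matching pattern returns `(None, [])`.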

## API Endpoints

Upload and classify a document:

    POST /upload
    Content-Type: multipart/form-data

    Response: {
      "document_id": "DOC_abc123...",
      "classification": "CONFIDENTIAL",
      "confidence": 0.95,
      "reasoning": "...",
      "citation": "...",
      "blockchain": {...},
      "audio_available": true
    }

Get classification statistics:

    GET /api/statistics

    Response: {
      "total_classifications": 42,
      "auto_approval_rate": 85.5,
      "avg_processing_time": 8.3,
      "by_category": {...}
    }

List classifications (paginated):

    GET /api/classifications?limit=100&offset=0

Get a single classification:

    GET /api/classification/<document_id>

Submit a HITL review:

    POST /hitl/submit

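
The read-only endpoints can be called with nothing more than the Python standard library. This is a sketch under assumptions: the helper names `classifications_url` and `fetch_json` are hypothetical, the base URL is the default local address listed earlier, and only the `/api/statistics` response fields shown in this README are relied on.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://localhost:5000"  # Default local address of the Flask app


def classifications_url(limit=100, offset=0, base_url=BASE_URL):
    """Build the paginated listing URL for GET /api/classifications."""
    query = urlencode({"limit": limit, "offset": offset})
    return f"{base_url}/api/classifications?{query}"


def fetch_json(url, timeout=10):
    """GET a URL and decode its JSON body (requires the app to be running)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)
```

With the application running, `fetch_json(f"{BASE_URL}/api/statistics")["total_classifications"]` reads the total count, and `classifications_url(limit=100, offset=100)` gives the URL of the second page.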
## Testing
### Test with Sample Documents
Create test PDFs with different content types:
**Confidential Example:**

    CONFIDENTIAL - Board Meeting Minutes
    Acquisition Target: TechCorp
    Offer: $500M
    Employee Data: John Smith - SSN: 123-45-6789
    Credit Card: 4532-1234-5678-9010

**Public Example:**

    FOR IMMEDIATE RELEASE
    Product Launch Announcement
    Contact: press@company.com

### Verify System Components
1. **RAG Policy Upload**: Check console for "Policy uploaded successfully"
2. **Classification**: Verify JSON output with category, confidence, reasoning
3. **Blockchain**: Check for transaction hash (may be simulated if devnet is down)
4. **Database**: SQLite file at `data/audit_logs.db`
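
The database check can be scripted. The sketch below confirms that the SQLite file exists and lists its tables through SQLite's built-in `sqlite_master` catalog, so it makes no assumption about the audit schema; the helper name `list_tables` is hypothetical.

```python
import sqlite3
from pathlib import Path


def list_tables(db_path="data/audit_logs.db"):
    """Return the table names in the audit database, or raise if it is missing."""
    path = Path(db_path)
    if not path.is_file():
        raise FileNotFoundError(f"Audit database not found: {path}")
    conn = sqlite3.connect(path)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]
```

Checking for the file first matters because `sqlite3.connect` silently creates an empty database when the path does not exist, which would hide a misconfigured data directory.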
## Project Structure

    gemini-classifier/
    ├── main.py                        # Main entry point
    ├── requirements.txt               # Python dependencies
    ├── .env                           # Environment variables (API keys)
    ├── README.md                      # This file
    ├── src/
    │   ├── config.py                  # Configuration management
    │   ├── audit_logger.py            # SQLite audit logging
    │   ├── processing/
    │   │   ├── __init__.py
    │   │   └── document_processor.py  # PDF/OCR processing
    │   ├── classification/
    │   │   ├── __init__.py
    │   │   ├── policy_rag.py          # RAG knowledge base
    │   │   └── classifier.py          # Core AI classifier
    │   ├── blockchain/
    │   │   ├── __init__.py
    │   │   └── solana_audit.py        # Solana integration
    │   └── ui/
    │       ├── __init__.py
    │       └── app.py                 # Flask web application
    ├── policies/
    │   ├── categories.json            # Category definitions
    │   ├── pii_patterns.json          # PII detection patterns
    │   └── few_shot_examples.json     # SME-validated examples
    ├── templates/
    │   ├── base.html
    │   ├── index.html                 # Upload page
    │   ├── dashboard.html             # Statistics dashboard
    │   ├── hitl_queue.html            # Review queue
    │   └── hitl_review.html           # Review detail page
    └── data/
        ├── uploads/                   # Uploaded PDFs
        ├── cache/                     # Cached content & audio
        ├── audit_logs/                # Log files
        └── audit_logs.db              # SQLite database

## Key Technologies
| Component | Technology | Purpose |
|-----------|-----------|---------|
| AI Model | Gemini 2.0 Flash | Fast, high-quality classification |
| RAG | Gemini File Search | Policy knowledge grounding |
| CAG | Gemini Caching API | Document context optimization |
| Blockchain | Solana (Devnet) | Immutable audit trails |
| Database | SQLite | Local audit logging |
| Web Framework | Flask | REST API & web UI |
| OCR | Tesseract + PyMuPDF | Multi-modal document processing |
## Security Considerations
1. **API Keys**: Never commit API keys to version control. Use environment variables.
2. **PII Detection**: High-risk PII triggers CONFIDENTIAL classification.
3. **Audit Trail**: All decisions are logged to SQLite and Solana blockchain.
4. **HITL Review**: Human oversight for low-confidence or mismatched validations.
5. **Safety Checks**: UNSAFE content is detected first and rejected immediately.
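
For the API key guidance, a minimal sketch of reading secrets from environment variables and failing fast when one is missing. The variable names match the `.env` keys in this README; the helpers `require_env` and `load_settings` are hypothetical and are not the contents of `src/config.py`.

```python
import os


def require_env(name):
    """Return a required environment variable or fail fast with a clear error."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value


def load_settings():
    """Collect settings; the Solana URL falls back to the public devnet endpoint."""
    return {
        "gemini_api_key": require_env("GEMINI_API_KEY"),
        "solana_cluster_url": os.environ.get(
            "SOLANA_CLUSTER_URL", "https://api.devnet.solana.com"
        ),
    }
```

Failing at startup gives a clearer error than a rejected Gemini request later, and the key never has to appear in a tracked file.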
## Performance Metrics
- **Processing Speed**: ~5-15 seconds per document (depends on page count)
- **Auto-Approval Rate**: Target 85%+ with dual validation
- **Confidence Threshold**: 90% for auto-approval
## Troubleshooting
### "Tesseract not found"
Install Tesseract OCR (see Installation section)
### "File processing failed"
Check that the PDF is not corrupted or password-protected
### "Blockchain recording error"
The system will create a simulated transaction hash if Solana devnet is unavailable. This is normal for demo purposes.
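
This README does not say how the simulated hash is produced. One possible approach, shown purely as an assumption, is to hash a canonical JSON form of the classification record so the fallback value is deterministic and clearly marked. The `SIMULATED_` prefix, the helper names, and the `send_to_chain` callable are all hypothetical.

```python
import hashlib
import json


def simulated_tx_hash(record):
    """Derive a deterministic placeholder hash from a classification record."""
    # Canonical JSON (sorted keys, fixed separators) so equal records hash equally.
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"SIMULATED_{digest}"


def record_audit(record, send_to_chain):
    """Try the real submission first; fall back to a simulated hash on failure."""
    try:
        return {"tx_hash": send_to_chain(record), "simulated": False}
    except Exception:  # Broad on purpose in this sketch, e.g. devnet unreachable
        return {"tx_hash": simulated_tx_hash(record), "simulated": True}
```

Carrying an explicit `simulated` flag alongside the hash lets the dashboard distinguish placeholder entries from transactions that actually reached devnet.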
### Gemini API errors
- Check API key validity
- Verify quota/billing is enabled
- Ensure Gemini 2.0 Flash access is enabled
## HITL Feedback Loop
The system implements a continuous improvement cycle:
1. Document is classified by AI
2. If confidence is below 90% or the two validation layers disagree, the document goes to the HITL queue
3. SME reviews and corrects classification
4. Correction is added to `policies/few_shot_examples.json`
5. Policy RAG is updated automatically
6. Future similar documents benefit from the correction
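
The routing rule in step 2 can be sketched as a small function. The dict shape for each validation result, the use of the lower of the two confidence values, and the return labels are assumptions for illustration; only the 90% threshold and the mismatch rule come from this README.

```python
AUTO_APPROVE_THRESHOLD = 0.90  # 90% confidence threshold for auto-approval


def route_classification(primary, secondary, threshold=AUTO_APPROVE_THRESHOLD):
    """Decide between auto-approval and the HITL queue.

    `primary` and `secondary` are dicts with "category" and "confidence" keys,
    standing in for the two validation layers (hypothetical shape).
    """
    agree = primary["category"] == secondary["category"]
    # Assumption: the weaker of the two confidence scores must clear the bar.
    confident = min(primary["confidence"], secondary["confidence"]) >= threshold
    if agree and confident:
        return "AUTO_APPROVED"
    return "HITL_QUEUE"
```

For example, two layers that both return CONFIDENTIAL at 0.95 and 0.92 are auto-approved, while a category mismatch or a score of 0.85 from either layer sends the document to review.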
## License
This is a demonstration project for educational purposes.
## Acknowledgments
- **Google Gemini**: Advanced AI classification engine
- **Solana**: Blockchain infrastructure for audit trails
- **Tesseract OCR**: Open-source OCR engine
## Support
For issues or questions about this implementation, please review:
1. This README
2. The code comments (extensively documented)
3. The policy JSON files in `policies/` directory
---
**Built with ❤️ using Gemini 2.0 Flash and Solana**