missaimaker/doc-classifier

Torpe Hitachi Classifier

Max Winning Project: Gemini-Centric Document Classification System

A comprehensive, enterprise-grade document classification system powered by Google's Gemini 2.0 Flash, featuring RAG (Retrieval Augmented Generation), CAG (Context Augmented Generation), and Solana blockchain audit trails.

🌟 Features

Phase 1: Foundation & Policy RAG

  • ✅ Policy Knowledge Base: Comprehensive category definitions, PII patterns, and SME-validated examples
  • ✅ Multi-Modal Document Processing: PDF parsing with OCR for text and images
  • ✅ Citation Mapping: Precise source location tracking with bounding boxes
  • ✅ Gemini File Search Store: RAG-based policy grounding

Phase 2: Core AI Engine with RAG/CAG

  • ✅ Dynamic Prompt Tree: Sequential classification flow (UNSAFE → CONFIDENTIAL → SENSITIVE → PUBLIC)
  • ✅ RAG + CAG Grounding: Policy knowledge base + cached document content
  • ✅ Structured JSON Output: Category, confidence, reasoning, and citations
  • ✅ Dual-Layer Validation: Consensus-based auto-approval (90%+ confidence threshold)
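
As a rough illustration of the structured JSON output listed above, the sketch below parses a model reply into a typed record and rejects malformed replies. The field names mirror the /upload response documented under API Endpoints; the `ClassificationResult` class and `parse_result` helper are hypothetical names introduced here, not the project's actual code.

```python
import json
from dataclasses import dataclass

# Categories in the order the prompt tree checks them.
CATEGORIES = ("UNSAFE", "CONFIDENTIAL", "SENSITIVE", "PUBLIC")


@dataclass(frozen=True)
class ClassificationResult:  # hypothetical container, not the project's class
    classification: str
    confidence: float
    reasoning: str
    citation: str


def parse_result(raw: str) -> ClassificationResult:
    """Parse a JSON reply and validate the fields this README lists."""
    data = json.loads(raw)
    category = data["classification"]
    if category not in CATEGORIES:
        raise ValueError(f"unknown category: {category!r}")
    confidence = float(data["confidence"])
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence out of range: {confidence}")
    return ClassificationResult(
        classification=category,
        confidence=confidence,
        reasoning=str(data.get("reasoning", "")),
        citation=str(data.get("citation", "")),
    )
```

Rejecting unknown categories and out-of-range confidences at this point keeps a malformed reply from reaching the auto-approval step.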

Phase 3: Auditability, UX & Compliance

  • ✅ Solana Blockchain: Immutable audit trails on Solana devnet
  • ✅ SQLite Audit Logs: Complete classification history and HITL reviews
  • ✅ Web UI: Flask-based interface with HITL feedback loop
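
One way to picture the audit trail is a digest of each classification record that is stored locally and could also be anchored on chain. The sketch below is only an assumption about how that might look: the `audit_log` table layout and both helper names are hypothetical, and how a digest is actually submitted to Solana is not described in this README, so the example stops at computing and storing it.

```python
import hashlib
import json
import sqlite3


def record_digest(record: dict) -> str:
    """SHA-256 of a canonical JSON encoding, so key order does not matter."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def log_classification(conn: sqlite3.Connection, record: dict) -> str:
    """Store the record with its digest; the table layout is an assumption."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS audit_log ("
        "document_id TEXT PRIMARY KEY, payload TEXT NOT NULL, digest TEXT NOT NULL)"
    )
    digest = record_digest(record)
    conn.execute(
        "INSERT INTO audit_log (document_id, payload, digest) VALUES (?, ?, ?)",
        (record["document_id"], json.dumps(record, sort_keys=True), digest),
    )
    conn.commit()
    return digest


# In-memory database for the example; the project keeps its log in data/audit_logs.db.
conn = sqlite3.connect(":memory:")
digest = log_classification(
    conn,
    {"document_id": "DOC_example", "classification": "PUBLIC", "confidence": 0.97},
)
```

Because the digest depends only on the record's content, anyone holding the SQLite row can recompute it and compare it with whatever was written elsewhere.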

🏗️ Architecture

┌─────────────────────────────────────────────────────────────┐
│                    Document Upload (PDF)                    │
└─────────────────┬───────────────────────────────────────────┘
                  │
                  ▼
┌─────────────────────────────────────────────────────────────┐
│  Document Processing (PyMuPDF + OCR + Citation Mapping)     │
└─────────────────┬───────────────────────────────────────────┘
                  │
                  ▼
┌─────────────────────────────────────────────────────────────┐
│         Gemini Classifier (RAG + CAG Pipeline)              │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐       │
│  │ Policy RAG   │  │ Cached Doc   │  │ Dual Layer   │       │
│  │ (File Search)│ +│ (CAG)        │ +│ Validation   │       │
│  └──────────────┘  └──────────────┘  └──────────────┘       │
└─────────────────┬───────────────────────────────────────────┘
                  │
                  ▼
┌─────────────────────────────────────────────────────────────┐
│              Classification Result + Metadata               │
└─────────┬───────────┬───────────┬───────────────────────────┘
          │           │           │
          ▼           ▼           ▼
┌──────────────┐ ┌──────────────┐
│   Solana     │ │SQLite Audit  │
│ Blockchain   │ │   Logger     │
└──────────────┘ └──────────────┘
                  │
                  ▼
┌─────────────────────────────────────────────────────────────┐
│        Web UI (Dashboard + HITL Review Queue)               │
└─────────────────────────────────────────────────────────────┘

📋 Requirements

  • Python 3.9+
  • Tesseract OCR
  • API Keys:
    • Google Gemini API
    • Solana Devnet access

🚀 Installation

1. Clone/Navigate to Project

cd gemini-classifier

2. Install System Dependencies

macOS:

brew install tesseract

Ubuntu/Debian:

sudo apt-get update
sudo apt-get install tesseract-ocr

Windows: Download and install Tesseract from: https://github.com/UB-Mannheim/tesseract/wiki

3. Create Virtual Environment

python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

4. Install Python Dependencies

pip install -r requirements.txt

5. Configure Environment Variables

Create a .env file in the project root containing your own credentials (a placeholder value is shown for the API key):

GEMINI_API_KEY=your-gemini-api-key
SOLANA_CLUSTER_URL=https://api.devnet.solana.com

Note: In production, use environment variables or secure secret management instead of committing API keys.
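
A minimal sketch of reading these two settings from the process environment, failing fast when the key is missing. The `load_settings` helper and the returned dictionary keys are hypothetical; the project's own `src/config.py` may work differently.

```python
import os


def load_settings(env=os.environ) -> dict:
    """Read the two settings named above; raise if the API key is absent."""
    api_key = env.get("GEMINI_API_KEY", "").strip()
    if not api_key:
        raise RuntimeError("GEMINI_API_KEY is not set")
    return {
        "gemini_api_key": api_key,
        # Fall back to the devnet endpoint shown above (an assumption of this sketch).
        "solana_cluster_url": env.get(
            "SOLANA_CLUSTER_URL", "https://api.devnet.solana.com"
        ),
    }
```

Passing the environment mapping as a parameter keeps the function easy to test without touching real credentials.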

🎯 Usage

Start the Web Application

python main.py

The application will be available at http://localhost:5000.

Classify a Document

  1. Navigate to http://localhost:5000
  2. Upload a PDF file (drag-and-drop or click to browse)
  3. Wait for processing (typically 5-15 seconds)
  4. Review the classification result with:
    • Category (UNSAFE/CONFIDENTIAL/SENSITIVE/PUBLIC)
    • Confidence score
    • Reasoning and citations
    • Blockchain audit hash

HITL Review Process

  1. Navigate to HITL Queue (http://localhost:5000/hitl/queue)
  2. Click "Review Document" on any pending classification
  3. Verify or correct the classification
  4. Add reviewer notes
  5. Submit review

Important: Corrected classifications are automatically added to the RAG knowledge base as new few-shot examples, improving future accuracy.

📊 Classification Categories

1. UNSAFE (Priority 1)

  • Harmful, violent, or threatening content
  • Illegal activity instructions
  • Malware or security exploits
  • Action: Immediate rejection and escalation

2. CONFIDENTIAL (Priority 2)

  • Trade secrets and proprietary algorithms
  • Financial records (with SSN, credit cards)
  • Legal documents (attorney-client privilege)
  • M&A plans, executive compensation
  • Source code and IP
  • PII: SSN, credit cards, bank accounts, medical records, passports

3. SENSITIVE (Priority 3)

  • Internal memos and communications
  • Employee directories
  • Draft documents
  • Internal project plans
  • Non-executive budgets
  • PII: Emails, phone numbers, addresses, employee IDs

4. PUBLIC (Priority 4)

  • Published marketing materials
  • Public website content
  • Press releases
  • Open-source code
  • Public documentation
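
The priority numbers above imply that when a document matches several categories, the most restrictive one wins. A small sketch of that rule follows; the `resolve_category` helper is hypothetical, and raising on an empty match set is a choice made here because this README does not say how unmatched documents are handled.

```python
# Lower number means higher priority, matching the numbering above.
PRIORITY = {"UNSAFE": 1, "CONFIDENTIAL": 2, "SENSITIVE": 3, "PUBLIC": 4}


def resolve_category(matched: set) -> str:
    """Return the highest-priority category among those that matched."""
    if not matched:
        raise ValueError("no category matched")
    unknown = matched - PRIORITY.keys()
    if unknown:
        raise ValueError(f"unknown categories: {sorted(unknown)}")
    return min(matched, key=PRIORITY.__getitem__)
```

For example, a document that looks both SENSITIVE and CONFIDENTIAL resolves to CONFIDENTIAL.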

🔧 API Endpoints

Upload and Classify

POST /upload
Content-Type: multipart/form-data

Response: {
  "document_id": "DOC_abc123...",
  "classification": "CONFIDENTIAL",
  "confidence": 0.95,
  "reasoning": "...",
  "citation": "...",
  "blockchain": {...},
  "audio_available": true
}

Get Statistics

GET /api/statistics

Response: {
  "total_classifications": 42,
  "auto_approval_rate": 85.5,
  "avg_processing_time": 8.3,
  "by_category": {...}
}
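
To make the statistics fields concrete, here is a sketch that derives them from a list of classification records. It assumes `auto_approval_rate` is a percentage and `avg_processing_time` is in seconds, which the sample values suggest but this README does not state; the per-record keys `auto_approved` and `processing_time` are hypothetical.

```python
from collections import Counter


def summarize(records: list) -> dict:
    """Aggregate records into the shape of the /api/statistics response."""
    total = len(records)
    if total == 0:
        return {
            "total_classifications": 0,
            "auto_approval_rate": 0.0,
            "avg_processing_time": 0.0,
            "by_category": {},
        }
    approved = sum(1 for r in records if r["auto_approved"])
    total_time = sum(r["processing_time"] for r in records)
    return {
        "total_classifications": total,
        "auto_approval_rate": round(100.0 * approved / total, 1),
        "avg_processing_time": round(total_time / total, 1),
        "by_category": dict(Counter(r["classification"] for r in records)),
    }
```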

Get All Classifications

GET /api/classifications?limit=100&offset=0

Get Specific Classification

GET /api/classification/<document_id>
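
A client paging through the two GET endpoints above might build its URLs as sketched below. The base URL assumes the local development server from the Usage section, and all three helper names are hypothetical.

```python
from urllib.parse import quote, urlencode

BASE_URL = "http://localhost:5000"  # local development server (assumption)


def classifications_url(limit: int = 100, offset: int = 0) -> str:
    """URL for one page of GET /api/classifications."""
    query = urlencode({"limit": limit, "offset": offset})
    return f"{BASE_URL}/api/classifications?{query}"


def classification_url(document_id: str) -> str:
    """URL for GET /api/classification/<document_id>, with the ID escaped."""
    return f"{BASE_URL}/api/classification/{quote(document_id, safe='')}"


def page_offsets(total: int, limit: int = 100) -> range:
    """Offsets needed to walk `total` records in pages of `limit`."""
    return range(0, total, limit)
```

With 250 records and the default page size, `page_offsets(250)` yields the offsets 0, 100, and 200.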

Submit HITL Review

POST /hitl/submit

## 🧪 Testing

### Test with Sample Documents

Create test PDFs with different content types:

**Confidential Example:**

    CONFIDENTIAL - Board Meeting Minutes
    Acquisition Target: TechCorp
    Offer: $500M
    Employee Data: John Smith - SSN: 123-45-6789
    Credit Card: 4532-1234-5678-9010


**Public Example:**

    FOR IMMEDIATE RELEASE
    Product Launch Announcement
    Contact: press@company.com


### Verify System Components

1. **RAG Policy Upload**: Check console for "Policy uploaded successfully"
2. **Classification**: Verify JSON output with category, confidence, reasoning
3. **Blockchain**: Check for transaction hash (may be simulated if devnet is down)
4. **Database**: SQLite file at `data/audit_logs.db`

## 📁 Project Structure

    gemini-classifier/
    ├── main.py                        # Main entry point
    ├── requirements.txt               # Python dependencies
    ├── .env                           # Environment variables (API keys)
    ├── README.md                      # This file
    ├── src/
    │   ├── config.py                  # Configuration management
    │   ├── audit_logger.py            # SQLite audit logging
    │   ├── processing/
    │   │   ├── __init__.py
    │   │   └── document_processor.py  # PDF/OCR processing
    │   ├── classification/
    │   │   ├── __init__.py
    │   │   ├── policy_rag.py          # RAG knowledge base
    │   │   └── classifier.py          # Core AI classifier
    │   ├── blockchain/
    │   │   ├── __init__.py
    │   │   └── solana_audit.py        # Solana integration
    │   └── ui/
    │       ├── __init__.py
    │       └── app.py                 # Flask web application
    ├── policies/
    │   ├── categories.json            # Category definitions
    │   ├── pii_patterns.json          # PII detection patterns
    │   └── few_shot_examples.json     # SME-validated examples
    ├── templates/
    │   ├── base.html
    │   ├── index.html                 # Upload page
    │   ├── dashboard.html             # Statistics dashboard
    │   ├── hitl_queue.html            # Review queue
    │   └── hitl_review.html           # Review detail page
    └── data/
        ├── uploads/                   # Uploaded PDFs
        ├── cache/                     # Cached content & audio
        ├── audit_logs/                # Log files
        └── audit_logs.db              # SQLite database


## 🎓 Key Technologies

| Component | Technology | Purpose |
|-----------|-----------|---------|
| AI Model | Gemini 2.0 Flash | Fast, high-quality classification |
| RAG | Gemini File Search | Policy knowledge grounding |
| CAG | Gemini Caching API | Document context optimization |
| Blockchain | Solana (Devnet) | Immutable audit trails |
| Database | SQLite | Local audit logging |
| Web Framework | Flask | REST API & web UI |
| OCR | Tesseract + PyMuPDF | Multi-modal document processing |

## 🔐 Security Considerations

1. **API Keys**: Never commit API keys to version control. Use environment variables.
2. **PII Detection**: High-risk PII triggers CONFIDENTIAL classification.
3. **Audit Trail**: All decisions are logged to SQLite and Solana blockchain.
4. **HITL Review**: Human oversight for low-confidence or mismatched validations.
5. **Safety Checks**: UNSAFE content is detected first and rejected immediately.
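
To illustrate the PII point above, the following sketch maps a few regular expressions to the two PII tiers from the category definitions. The patterns are illustrative assumptions, not the contents of `policies/pii_patterns.json`, and a hit is treated as a hint for the classifier rather than a final verdict.

```python
import re

# Illustrative patterns only; the project's real patterns live in policies/pii_patterns.json.
HIGH_RISK = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d{4}[- ]?){3}\d{4}\b"),
}
LOWER_RISK = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
}


def pii_tier(text: str):
    """Return 'CONFIDENTIAL', 'SENSITIVE', or None based on pattern hits."""
    if any(p.search(text) for p in HIGH_RISK.values()):
        return "CONFIDENTIAL"
    if any(p.search(text) for p in LOWER_RISK.values()):
        return "SENSITIVE"
    return None
```

High-risk patterns are checked first so that a document containing both an SSN and an email address lands in the stricter tier.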

## 📈 Performance Metrics

- **Processing Speed**: ~5-15 seconds per document (depends on page count)
- **Auto-Approval Rate**: Target 85%+ with dual validation
- **Confidence Threshold**: 90% for auto-approval

## 🐛 Troubleshooting

### "Tesseract not found"
Install Tesseract OCR (see Installation section)

### "File processing failed"
Check that the PDF is not corrupted or password-protected

### "Blockchain recording error"
The system will create a simulated transaction hash if Solana devnet is unavailable. This is normal for demo purposes.

### Gemini API errors
- Check API key validity
- Verify quota/billing is enabled
- Ensure Gemini 2.0 Flash access is enabled

## 🀝 HITL Feedback Loop

The system implements a continuous improvement cycle:

1. Document is classified by AI
2. If confidence < 90% or dual validation mismatch → HITL queue
3. SME reviews and corrects classification
4. Correction is added to `policies/few_shot_examples.json`
5. Policy RAG is updated automatically
6. Future similar documents benefit from the correction
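
Steps 2 and 4 of this cycle can be sketched as follows. The sketch assumes that each validation pass returns a dict with `classification` and `confidence`, that the lower of the two confidences is compared with the threshold, and that `few_shot_examples.json` is a flat JSON list; none of these details are confirmed by this README, and both helper names are hypothetical.

```python
import json
from pathlib import Path

CONFIDENCE_THRESHOLD = 0.90  # auto-approval threshold stated in this README


def needs_review(primary: dict, secondary: dict) -> bool:
    """True when the two passes disagree or either falls below the threshold."""
    if primary["classification"] != secondary["classification"]:
        return True
    lowest = min(primary["confidence"], secondary["confidence"])
    return lowest < CONFIDENCE_THRESHOLD


def append_correction(path: Path, text: str, corrected: str, notes: str = "") -> int:
    """Append an SME correction to a JSON list file; return the new example count."""
    examples = json.loads(path.read_text(encoding="utf-8")) if path.exists() else []
    examples.append({"text": text, "category": corrected, "reviewer_notes": notes})
    path.write_text(json.dumps(examples, indent=2), encoding="utf-8")
    return len(examples)
```

Re-uploading the updated examples to the policy store (step 5) is left out because it depends on the Gemini File Search setup.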

## 📝 License

This is a demonstration project for educational purposes.

## 🙏 Acknowledgments

- **Google Gemini**: Advanced AI classification engine
- **Solana**: Blockchain infrastructure for audit trails
- **Tesseract OCR**: Open-source OCR engine

## 📞 Support

For issues or questions about this implementation, please review:
1. This README
2. The code comments (extensively documented)
3. The policy JSON files in `policies/` directory

---

**Built with ❤️ using Gemini 2.0 Flash and Solana**

About

AI-powered assistant that dynamically analyzes multi-page, multi-modal documents to classify them into Public, Confidential, Highly Sensitive, or Unsafe categories.
