A human-in-the-loop data cleansing tool for Azure Blob Storage that leverages LLMs for scale.
DataAtelier is a lightweight, terminal-based (TUI) application for cleaning up Azure Blob Storage with intelligent LLM-assisted triage and human oversight. It helps you:
- Inventory and preview blobs in Azure Storage
- Automatically classify files with an LLM (keep/delete/review)
- Review and label files in an interactive TUI
- Safely delete files with confirmation and audit trails
- Learn from your decisions with few-shot examples
```bash
# Install
pip install -r requirements.txt

# Set up credentials
export AZURE_STORAGE_CONNECTION_STRING="..."
export OPENAI_API_KEY="sk-..."

# Run the tool
python azure_blob_cleanup_tui.py run --container mycontainer
```

See examples/ for detailed usage examples and tests/ for integration tests.
- Single TUI Application: Terminal-based interface drives the entire workflow
- Conservative LLM Triage: Auto-accepts a label only at high confidence (default ≥ 95%)
- Human-in-the-Loop: Final decisions always under human control
- Local State Management: CSV/JSON/Parquet files for auditability
- Policy & Examples: Evolving policy file and few-shot examples improve LLM accuracy
- Safe Deletion: 100% labeling required, confirmation gate, full audit log
- Python 3.9 or higher
- Azure Storage account with container access
- (Optional) OpenAI API key or Azure OpenAI access for LLM features
```bash
# Clone the repository
git clone https://github.com/jwgwalton/DataAtelier.git
cd DataAtelier

# Install dependencies
pip install -r requirements.txt

# Or install with pip (editable mode)
pip install -e .
```

Set up your environment variables:
```bash
# Azure Storage (required)
export AZURE_STORAGE_CONNECTION_STRING="DefaultEndpointsProtocol=https;AccountName=..."
# OR
export AZURE_STORAGE_ACCOUNT_URL="https://youraccount.blob.core.windows.net"
export AZURE_STORAGE_SAS_TOKEN="?sv=..."  # Optional

# Azure Blob Container (required)
export AZURE_BLOB_CONTAINER="mycontainer"

# Optional: prefix filter
export AZURE_BLOB_PREFIX="logs/"

# OpenAI (for LLM features)
export OPENAI_API_KEY="sk-..."
export OPENAI_MODEL="gpt-4"  # or gpt-3.5-turbo

# OR Azure OpenAI
# export AZURE_OPENAI_API_KEY="..."
# export AZURE_OPENAI_ENDPOINT="https://..."
# export AZURE_OPENAI_API_VERSION="2023-12-01-preview"

# Optional: tuning
export CERTAINTY_CONF_THRESHOLD="0.95"
export SELF_CONSISTENCY_PASSES="3"
export PREVIEW_MAX_BYTES="4096"
export WORK_DIR="."
```

Or create a .env file with these variables.
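For illustration, the settings above could be collected with sensible defaults along these lines (the `Config` dataclass and `load_config` helper are hypothetical sketches, not the project's actual configuration code):

```python
import os
from dataclasses import dataclass


@dataclass
class Config:
    """Illustrative container for the environment settings above."""
    container: str
    prefix: str
    conf_threshold: float
    self_consistency: int
    preview_max_bytes: int
    work_dir: str


def load_config() -> Config:
    # Required: fail fast with a clear message if the container is unset.
    container = os.environ.get("AZURE_BLOB_CONTAINER")
    if not container:
        raise SystemExit("AZURE_BLOB_CONTAINER is required")
    return Config(
        container=container,
        prefix=os.environ.get("AZURE_BLOB_PREFIX", ""),
        conf_threshold=float(os.environ.get("CERTAINTY_CONF_THRESHOLD", "0.95")),
        self_consistency=int(os.environ.get("SELF_CONSISTENCY_PASSES", "3")),
        preview_max_bytes=int(os.environ.get("PREVIEW_MAX_BYTES", "4096")),
        work_dir=os.environ.get("WORK_DIR", "."),
    )
```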
```bash
# Run the TUI application
python azure_blob_cleanup_tui.py --container mycontainer

# Or if installed
blob-cleanup run --container mycontainer

# Full options
blob-cleanup run \
  --container mycontainer \
  --prefix "logs/2023/" \
  --max-files 100 \
  --preview-bytes 4096 \
  --work-dir ./cleanup-session \
  --llm-model gpt-4 \
  --confidence 0.95 \
  --self-consistency 3

# Skip inventory (use existing manifest)
blob-cleanup run --container mycontainer --skip-inventory

# Disable LLM (manual mode only)
blob-cleanup run --container mycontainer --llm-off

# Use Azure OpenAI
blob-cleanup run --container mycontainer --azure-openai

# Show statistics
blob-cleanup stats

# Archive artifacts
blob-cleanup archive
```

| Key | Action |
|---|---|
| j/k or ↑/↓ | Navigate items |
| Enter | View item details |
| K | Mark as Keep |
| D | Mark as Delete |
| H or Space | Mark for Human Review |
| R | Refresh queues |
| / | Search/filter |
| F | Toggle filters |
| B | Batch action mode |
| A | Add as few-shot example |
| E | Edit policy (external editor) |
| G | Run LLM triage now |
| T | Toggle LLM worker on/off |
| C | View coverage stats |
| X | Execute deletion (when 100% labeled) |
| ? | Show help |
| Q | Quit |
- Inventory: App lists blobs and generates previews
- LLM Triage: Background worker classifies files (keep/delete/review)
- Human Review: Review and label items in TUI
- Learn: Add examples and edit policy to improve LLM
- Delete: When 100% labeled, execute deletion with confirmation
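Stripped to its essentials, that workflow can be sketched as a single loop (the function names and signatures below are illustrative, not the project's actual API):

```python
def cleanup_session(blobs, llm_classify, human_review):
    """Hypothetical sketch of the inventory -> triage -> review -> delete flow."""
    labels = {}
    review_queue = []
    # LLM triage: auto-accept only confident keep/delete labels,
    # everything else goes to the human review queue.
    for blob in blobs:
        label, confident = llm_classify(blob)
        if confident and label in ("keep", "delete"):
            labels[blob] = label
        else:
            review_queue.append(blob)
    # Human review: every remaining blob gets an explicit decision.
    for blob in review_queue:
        labels[blob] = human_review(blob)
    # Deletion is only reachable once coverage is 100%.
    assert len(labels) == len(blobs)
    return [b for b, label in labels.items() if label == "delete"]
```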
```text
azure_blob_cleanup_tui.py    - Main entry point
lib/
├── storage_io.py            - Azure Blob Storage operations
├── state.py                 - State management (CSV/Parquet)
├── policy.py                - Policy file management
├── examples.py              - Few-shot examples (JSONL)
├── triage_llm.py            - LLM classification logic
└── tui_app.py               - Textual TUI application

Runtime artifacts:
├── manifest.csv             - Blob inventory
├── to_review.csv            - Items pending review
├── to_delete.csv            - Items marked for deletion
├── to_keep.csv              - Items marked to keep
├── audit_log.csv            - Full audit trail
├── llm_predictions.parquet  - LLM predictions cache
├── few_shot_examples.jsonl  - Training examples
└── llm_policy.md            - Classification policy
```
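As a rough illustration of how a coverage figure could be derived from these files (the `coverage` function below is hypothetical; the real state.py may work differently):

```python
import csv


def coverage(work_dir="."):
    """Count labeled vs. total blobs from the runtime CSVs above."""
    def rows(name):
        # Data rows only (DictReader consumes the header line).
        try:
            with open(f"{work_dir}/{name}", newline="") as f:
                return sum(1 for _ in csv.DictReader(f))
        except FileNotFoundError:
            return 0

    total = rows("manifest.csv")
    labeled = rows("to_keep.csv") + rows("to_delete.csv")
    return labeled, total
```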
The LLM auto-accepts keep/delete decisions only when ALL of these are true:
- Confidence ≥ 0.95 (configurable)
- Self-consistency: Multiple passes agree on the same label
- Policy alignment: No violation of never-delete clauses
Otherwise, items go to the human review queue.
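That gate can be expressed as a single predicate; this is a sketch under the criteria above, and the function and parameter names are assumptions, not the actual triage_llm.py interface:

```python
def auto_accept(labels, confidences, threshold=0.95, never_delete=False):
    """Return the label to auto-accept, or None to route to human review.

    labels: one label per self-consistency pass, e.g. ["delete", "delete", "delete"]
    confidences: the confidence reported on each pass
    never_delete: True if a policy never-delete clause matches this blob
    """
    # Self-consistency: all passes must agree on the same label.
    if len(set(labels)) != 1:
        return None
    label = labels[0]
    # Only keep/delete can be auto-accepted; "review" always goes to a human.
    if label not in ("keep", "delete"):
        return None
    # Confidence: the weakest pass must still clear the threshold.
    if min(confidences) < threshold:
        return None
    # Policy alignment: a never-delete match blocks automatic deletion.
    if label == "delete" and never_delete:
        return None
    return label
```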
- 100% Labeling Required: Cannot delete until every blob is labeled
- Confirmation Gate: Must type "DELETE {count}" to proceed
- Audit Log: Every action logged with timestamp and actor
- Archive: All artifacts can be archived before deletion
- Dry Run: Review deletion list before executing
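The first three safeguards can be sketched as follows (illustrative names and file layout, not the project's actual code):

```python
import csv
from datetime import datetime, timezone


def confirm_deletion(pending_count, labeled_count, total_count, typed):
    """Both deletion gates described above must pass."""
    # Gate 1: 100% labeling - every blob must carry a label first.
    if labeled_count != total_count:
        return False
    # Gate 2: the operator must type the exact phrase, count included.
    return typed == f"DELETE {pending_count}"


def audit(path, action, blob, actor):
    """Append one audit row: timestamp, actor, action, blob."""
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow(
            [datetime.now(timezone.utc).isoformat(), actor, action, blob]
        )
```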
```text
# Start the application
$ blob-cleanup run --container mydata --prefix "temp/"

Starting inventory for container: mydata
   Prefix filter: temp/
⏳ Listing blobs...
✓ Found 247 blobs
⏳ Generating previews...
   Progress: 247/247
✓ Manifest updated

LLM triage started (worker ON)
   Auto-classified: 180 keep, 45 delete, 22 review

Human review: 22 items
   [K] Keep | [D] Delete | [H] Review
   [A] Add example | [E] Edit policy

✓ Coverage: 247/247 (100%)
   To delete: 45 blobs

Execute deletion?
   Type "DELETE 45" to confirm: DELETE 45
✓ Deleted 45 blobs

Archive artifacts? [Y/n]: y
✓ Archived to: ./archive/20260128_162500/
✓ Session complete
```

```bash
# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest tests/ -v

# Format code
black .

# Lint
ruff check .
```

See tests/README.md for testing documentation.
Check the examples/ directory for:

- `basic_usage.py` - Quick start guide
- `manual_mode.py` - No LLM required
- `with_policy.py` - Custom policy tutorial

Run any example:

```bash
python examples/basic_usage.py
```

The project includes comprehensive integration tests with mocked Azure Blob Storage and LLM:

```bash
# Run all tests (17 tests)
pytest tests/ -v

# Run a specific test class
pytest tests/test_integration.py::TestStorageIntegration -v
```

All tests use mocks - no Azure credentials or API keys required! See tests/README.md for details.
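As an illustration of this mocking style (the helper function and fake client below are invented for the sketch, not taken from the actual test suite):

```python
from types import SimpleNamespace
from unittest import mock


def list_blob_names(container_client):
    """Toy function under test: collect blob names from a container client."""
    return [b.name for b in container_client.list_blobs()]


def test_list_blob_names():
    # Stand-in for a ContainerClient - no Azure credentials required.
    fake = mock.Mock()
    # SimpleNamespace avoids the Mock(name=...) gotcha, where the `name`
    # kwarg configures the mock's repr rather than a .name attribute.
    fake.list_blobs.return_value = [
        SimpleNamespace(name="logs/a.txt"),
        SimpleNamespace(name="logs/b.txt"),
    ]
    assert list_blob_names(fake) == ["logs/a.txt", "logs/b.txt"]
```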
MIT
Contributions welcome! Please open an issue or PR.
Built with:
- Textual - Modern TUI framework
- Rich - Beautiful terminal output
- Azure SDK for Python
- OpenAI Python