DataAtelier

A human-in-the-loop data cleansing tool for Azure Blob Storage that leverages LLMs for scale.

Overview

DataAtelier is a lightweight, terminal-based (TUI) application for cleaning up Azure Blob Storage with intelligent LLM-assisted triage and human oversight. It helps you:

  • 📋 Inventory and preview blobs in Azure Storage
  • 🤖 Automatically classify files using LLM (keep/delete/review)
  • 👀 Review and label files with an interactive TUI
  • 🔒 Safely delete files with confirmation and audit trails
  • 📚 Learn from your decisions with few-shot examples

Quick Start

# Install
pip install -r requirements.txt

# Set up credentials
export AZURE_STORAGE_CONNECTION_STRING="..."
export OPENAI_API_KEY="sk-..."

# Run the tool
python azure_blob_cleanup_tui.py run --container mycontainer

👉 See examples/ for detailed usage examples
👉 See tests/ for integration tests

Features

  • Single TUI Application: Terminal-based interface drives the entire workflow
  • Conservative LLM Triage: Auto-accepts only at high certainty (≥95% confidence by default)
  • Human-in-the-Loop: Final decisions always under human control
  • Local State Management: CSV/JSON/Parquet files for auditability
  • Policy & Examples: Evolving policy file and few-shot examples improve LLM accuracy
  • Safe Deletion: 100% labeling required, confirmation gate, full audit log

Installation

Prerequisites

  • Python 3.9 or higher
  • Azure Storage account with container access
  • (Optional) OpenAI API key or Azure OpenAI access for LLM features

Install

# Clone the repository
git clone https://github.com/jwgwalton/DataAtelier.git
cd DataAtelier

# Install dependencies
pip install -r requirements.txt

# Or install with pip (editable mode)
pip install -e .

Configuration

Set up your environment variables:

# Azure Storage (required)
export AZURE_STORAGE_CONNECTION_STRING="DefaultEndpointsProtocol=https;AccountName=..."
# OR
export AZURE_STORAGE_ACCOUNT_URL="https://youraccount.blob.core.windows.net"
export AZURE_STORAGE_SAS_TOKEN="?sv=..."  # Optional

# Azure Blob Container (required)
export AZURE_BLOB_CONTAINER="mycontainer"

# Optional: prefix filter
export AZURE_BLOB_PREFIX="logs/"

# OpenAI (for LLM features)
export OPENAI_API_KEY="sk-..."
export OPENAI_MODEL="gpt-4"  # or gpt-3.5-turbo

# OR Azure OpenAI
# export AZURE_OPENAI_API_KEY="..."
# export AZURE_OPENAI_ENDPOINT="https://..."
# export AZURE_OPENAI_API_VERSION="2023-12-01-preview"

# Optional: tuning
export CERTAINTY_CONF_THRESHOLD="0.95"
export SELF_CONSISTENCY_PASSES="3"
export PREVIEW_MAX_BYTES="4096"
export WORK_DIR="."

Or create a .env file with these variables.
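As a minimal illustration, a .env file with the variables above can be parsed using only the standard library (load_env_file is a hypothetical helper; the tool itself may load the file differently, e.g. via python-dotenv):

```python
import os

def load_env_file(path=".env"):
    # Hypothetical helper: parse simple KEY=VALUE lines into os.environ.
    # Existing environment variables take precedence over file values.
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue  # skip blanks, comments, and malformed lines
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip().strip('"'))
```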

Usage

Quick Start

# Run the TUI application
python azure_blob_cleanup_tui.py run --container mycontainer

# Or if installed
blob-cleanup run --container mycontainer

CLI Options

# Full options
blob-cleanup run \
  --container mycontainer \
  --prefix "logs/2023/" \
  --max-files 100 \
  --preview-bytes 4096 \
  --work-dir ./cleanup-session \
  --llm-model gpt-4 \
  --confidence 0.95 \
  --self-consistency 3

# Skip inventory (use existing manifest)
blob-cleanup run --container mycontainer --skip-inventory

# Disable LLM (manual mode only)
blob-cleanup run --container mycontainer --llm-off

# Use Azure OpenAI
blob-cleanup run --container mycontainer --azure-openai

Other Commands

# Show statistics
blob-cleanup stats

# Archive artifacts
blob-cleanup archive

TUI Key Bindings

| Key        | Action |
|------------|--------|
| j/k or ↓/↑ | Navigate items |
| Enter      | View item details |
| K          | Mark as Keep |
| D          | Mark as Delete |
| H or Space | Mark for Human Review |
| R          | Refresh queues |
| /          | Search/filter |
| F          | Toggle filters |
| B          | Batch action mode |
| A          | Add as few-shot example |
| E          | Edit policy (external editor) |
| G          | Run LLM triage now |
| T          | Toggle LLM worker on/off |
| C          | View coverage stats |
| X          | Execute deletion (when 100% labeled) |
| ?          | Show help |
| Q          | Quit |

Workflow

  1. Inventory: App lists blobs and generates previews
  2. LLM Triage: Background worker classifies files (keep/delete/review)
  3. Human Review: Review and label items in TUI
  4. Learn: Add examples and edit policy to improve LLM
  5. Delete: When 100% labeled, execute deletion with confirmation
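The five steps above can be sketched as one pass over the inventory. This is an illustrative outline only; cleanup_session and its callback parameters are made up for the sketch and are not the tool's actual API:

```python
def cleanup_session(blobs, triage, human_review, threshold=0.95):
    # 1. Inventory: every blob starts unlabeled
    labels = {blob: None for blob in blobs}
    # 2. LLM triage: auto-accept only confident keep/delete calls
    for blob in blobs:
        label, confidence = triage(blob)
        if label in ("keep", "delete") and confidence >= threshold:
            labels[blob] = label
    # 3./4. Human review: everything else is labeled by a person, who may
    # also add few-shot examples or edit the policy along the way
    for blob in blobs:
        if labels[blob] is None:
            labels[blob] = human_review(blob)
    # 5. Delete: coverage must be 100% before anything is removed
    assert all(label is not None for label in labels.values())
    return sorted(blob for blob, label in labels.items() if label == "delete")
```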

Architecture

azure_blob_cleanup_tui.py  - Main entry point
lib/
  ├── storage_io.py         - Azure Blob Storage operations
  ├── state.py              - State management (CSV/Parquet)
  ├── policy.py             - Policy file management
  ├── examples.py           - Few-shot examples (JSONL)
  ├── triage_llm.py         - LLM classification logic
  └── tui_app.py            - Textual TUI application

Runtime artifacts:
  ├── manifest.csv          - Blob inventory
  ├── to_review.csv         - Items pending review
  ├── to_delete.csv         - Items marked for deletion
  ├── to_keep.csv           - Items marked to keep
  ├── audit_log.csv         - Full audit trail
  ├── llm_predictions.parquet - LLM predictions cache
  ├── few_shot_examples.jsonl - Training examples
  └── llm_policy.md         - Classification policy
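For illustration, an audit_log.csv row might be appended like this (the column set here is assumed for the sketch; the tool's real schema may differ):

```python
import csv
from datetime import datetime, timezone

def append_audit(path, blob_name, action, actor):
    # Append one timestamped row recording who did what to which blob.
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow(
            [datetime.now(timezone.utc).isoformat(), blob_name, action, actor]
        )
```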

LLM Certainty Gates

The LLM auto-accepts keep/delete decisions only when ALL of these are true:

  1. Confidence ≥ 0.95 (configurable)
  2. Self-consistency: Multiple passes agree on the same label
  3. Policy alignment: No violation of never-delete clauses

Otherwise, items go to the human review queue.
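The three gates can be sketched as a single predicate. The names here are illustrative and not the actual lib/triage_llm.py interface:

```python
def passes_certainty_gates(passes, policy_forbids_delete, threshold=0.95):
    # passes: list of (label, confidence) pairs from repeated LLM calls
    labels = {label for label, _ in passes}
    if len(labels) != 1:
        return False                      # gate 2: all passes must agree
    label = labels.pop()
    if min(conf for _, conf in passes) < threshold:
        return False                      # gate 1: confidence >= threshold
    if label == "delete" and policy_forbids_delete:
        return False                      # gate 3: never-delete clauses win
    return label in ("keep", "delete")    # "review" is never auto-accepted
```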

Safety Features

  • 100% Labeling Required: Cannot delete until every blob is labeled
  • Confirmation Gate: Must type "DELETE {count}" to proceed
  • Audit Log: Every action logged with timestamp and actor
  • Archive: All artifacts can be archived before deletion
  • Dry Run: Review deletion list before executing
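The first two safety checks can be sketched together (can_execute_deletion is a hypothetical function; the tool's own gate lives in the TUI):

```python
def can_execute_deletion(labels, typed_confirmation):
    # Gate 1: 100% coverage, i.e. no blob may be left unlabeled
    if any(label is None for label in labels.values()):
        return False
    # Gate 2: the typed phrase must match "DELETE {count}" exactly
    count = sum(1 for label in labels.values() if label == "delete")
    return typed_confirmation.strip() == f"DELETE {count}"
```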

Example Session

# Start the application
$ blob-cleanup run --container mydata --prefix "temp/"

📋 Starting inventory for container: mydata
   Prefix filter: temp/
⏳ Listing blobs...
✓ Found 247 blobs

⏳ Generating previews...
   Progress: 247/247
✓ Manifest updated

🤖 LLM triage started (worker ON)
   Auto-classified: 180 keep, 45 delete, 22 review

👀 Human review: 22 items
   [K] Keep | [D] Delete | [H] Review
   [A] Add example | [E] Edit policy

✓ Coverage: 247/247 (100%)
   To delete: 45 blobs

🗑️  Execute deletion?
   Type "DELETE 45" to confirm: DELETE 45
   ✓ Deleted 45 blobs

📦 Archive artifacts? [Y/n]: y
   ✓ Archived to: ./archive/20260128_162500/

✓ Session complete

Development

# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest tests/ -v

# Format code
black .

# Lint
ruff check .

See tests/README.md for testing documentation

Examples

Check the examples/ directory for:

  • basic_usage.py - Quick start guide
  • manual_mode.py - No LLM required
  • with_policy.py - Custom policy tutorial

Run any example:

python examples/basic_usage.py

Testing

The project includes comprehensive integration tests with mocked Azure Blob Storage and LLM:

# Run all tests (17 tests)
pytest tests/ -v

# Run specific test class
pytest tests/test_integration.py::TestStorageIntegration -v

All tests use mocks - no Azure credentials or API keys required!

See tests/README.md for details
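As an illustration of the mocking approach, blob listing can be exercised without any Azure credentials via unittest.mock (list_blob_names is a made-up helper for this sketch, not the project's real code):

```python
from unittest.mock import MagicMock

def list_blob_names(container_client):
    # Works against the azure.storage.blob ContainerClient shape, where
    # list_blobs() yields objects carrying a .name attribute.
    return [blob.name for blob in container_client.list_blobs()]

def test_list_blob_names_with_mock():
    fake_client = MagicMock()
    fake_blob = MagicMock()
    fake_blob.name = "temp/old.log"
    fake_client.list_blobs.return_value = [fake_blob]
    assert list_blob_names(fake_client) == ["temp/old.log"]
```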

License

MIT

Contributing

Contributions welcome! Please open an issue or PR.

Acknowledgments

Built with Textual (TUI), the Azure Storage Blob SDK for Python, and the OpenAI API.
