DataDoctor is a powerful Rust-based data validation and cleaning engine designed to detect, diagnose, and fix data quality issues in your datasets.
DataDoctor helps you maintain high-quality data by:
- Validating data against defined rules and schemas
- Detecting common data quality issues (missing values, invalid formats, duplicates)
- Cleaning and fixing data problems automatically
- Analyzing data quality metrics and generating reports
- ✅ Remove trailing commas
- ✅ Insert missing commas between fields
- ✅ Convert single quotes to double quotes
- ✅ Quote unquoted keys
- ✅ Fix unclosed quotes
- ✅ Normalize booleans (True → true, False → false)
- ✅ Normalize NULL values (NULL → null)
- ✅ Fix unclosed braces/brackets
- ✅ Detect duplicate keys
- ✅ Pretty format JSON
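To make one of the JSON repairs above concrete, here is a minimal sketch of trailing-comma removal. It is an illustration under stated assumptions, not DataDoctor's actual implementation: the function name `remove_trailing_commas` is hypothetical, and the approach (a single pass that tracks whether the cursor is inside a string literal) is just one reasonable way to do it.

```rust
/// Removes commas that are followed only by whitespace and then `}` or `]`.
/// Commas inside string literals are left untouched.
/// NOTE: hypothetical helper for illustration; not part of the DataDoctor API.
fn remove_trailing_commas(input: &str) -> String {
    let chars: Vec<char> = input.chars().collect();
    let mut out = String::with_capacity(input.len());
    let mut in_string = false;
    let mut escaped = false;

    for (i, &c) in chars.iter().enumerate() {
        if in_string {
            // Copy string contents verbatim, tracking escapes so that an
            // escaped quote does not end the string.
            out.push(c);
            if escaped {
                escaped = false;
            } else if c == '\\' {
                escaped = true;
            } else if c == '"' {
                in_string = false;
            }
            continue;
        }
        match c {
            '"' => {
                in_string = true;
                out.push(c);
            }
            ',' => {
                // Look ahead past whitespace to see what follows the comma.
                let next = chars[i + 1..]
                    .iter()
                    .copied()
                    .find(|ch| !ch.is_whitespace());
                // Drop the comma only when it directly precedes a closer.
                if !matches!(next, Some('}') | Some(']')) {
                    out.push(c);
                }
            }
            _ => out.push(c),
        }
    }
    out
}

fn main() {
    let broken = r#"{"name": "a,]", "tags": [1, 2,],}"#;
    println!("{}", remove_trailing_commas(broken));
}
```

Tracking the in-string state matters because a comma followed by `]` can legitimately appear inside a string value, as in the `"a,]"` field of the example input.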
- ✅ Pad missing columns with empty values
- ✅ Trim extra columns
- ✅ Normalize whitespace
- ✅ Fix simple type mismatches
- ✅ Convert boolean variants (yes/no, 1/0 → true/false)
- ✅ Remove BOM headers
- ✅ Auto-detect delimiters
- ✅ Detect duplicate headers
- ✅ Fix trailing delimiters
- ✅ Normalize field values
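The column repairs above (padding short rows, trimming long ones, normalizing whitespace) can be sketched as follows. This is a simplified illustration, not the engine's real code: `normalize_row` and `normalize_csv` are hypothetical names, the header row is assumed to define the expected width, and the naive `split` ignores quoted fields that contain the delimiter.

```rust
/// Pads a short row with empty strings and truncates a long row so that it
/// has exactly `width` fields.
/// NOTE: hypothetical helper for illustration; not part of the DataDoctor API.
fn normalize_row(mut fields: Vec<String>, width: usize) -> Vec<String> {
    // `Vec::resize` both truncates and pads, depending on the current length.
    fields.resize(width, String::new());
    fields
}

/// Normalizes every data row to the width of the header row.
/// Assumption: fields never contain the delimiter inside quotes.
fn normalize_csv(input: &str, delimiter: char) -> String {
    let mut lines = input.lines();
    let header = match lines.next() {
        Some(h) => h,
        None => return String::new(),
    };
    let width = header.split(delimiter).count();
    let sep = delimiter.to_string();

    let mut out = vec![header.to_string()];
    for line in lines {
        // Trim surrounding whitespace from each field before re-joining.
        let fields: Vec<String> = line
            .split(delimiter)
            .map(|f| f.trim().to_string())
            .collect();
        out.push(normalize_row(fields, width).join(sep.as_str()));
    }
    out.join("\n")
}

fn main() {
    let broken = "id,name,active\n1, Ann\n2,Bob,yes,extra";
    println!("{}", normalize_csv(broken, ','));
}
```

A production implementation would need a quote-aware parser; the sketch only shows the pad-and-trim step itself.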
- ✅ Type validation (Integer, Float, Boolean, Email, URL)
- ✅ Required field validation
- 📊 Detailed error tracking with line/column numbers
- 🏷️ Readable error codes (e.g., JSON_TRAILING_COMMA, CSV_MISMATCHED_COLUMNS)
- 📈 Statistics on total issues, auto-fixes, and severity levels
- 🎯 Schema-based validation for structured data
- 💻 Human-friendly colored terminal output
- 📄 Machine-readable JSON reports
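As a rough illustration of how type validation and issue tracking with codes, locations, and severities might fit together, consider the sketch below. Every type and function in it (`Severity`, `FieldType`, `Issue`, `matches_type`, `validate`) is hypothetical and not taken from the DataDoctor crates. The code `CSV_MISMATCHED_COLUMNS` comes from the list above, while `TYPE_MISMATCH` and the `Warning` level are assumed names used only for this example.

```rust
// All names below are hypothetical; they are not the DataDoctor API.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Severity {
    Warning, // assumed level name
    Error,
}

#[derive(Debug, Clone, Copy)]
enum FieldType {
    Integer,
    Float,
    Boolean,
}

#[derive(Debug)]
struct Issue {
    code: &'static str,
    severity: Severity,
    row: usize,            // 1-based line number in the input
    column: Option<usize>, // 1-based field index; None for row-level issues
    message: String,
}

fn matches_type(value: &str, ty: FieldType) -> bool {
    match ty {
        FieldType::Integer => value.parse::<i64>().is_ok(),
        FieldType::Float => value.parse::<f64>().is_ok(),
        FieldType::Boolean => matches!(value, "true" | "false"),
    }
}

/// Validates comma-separated data rows (line 1 is treated as the header)
/// against a per-column schema and collects every issue found.
fn validate(input: &str, schema: &[FieldType]) -> Vec<Issue> {
    let mut issues = Vec::new();
    for (idx, line) in input.lines().enumerate().skip(1) {
        let row = idx + 1;
        let fields: Vec<&str> = line.split(',').map(str::trim).collect();
        if fields.len() != schema.len() {
            issues.push(Issue {
                code: "CSV_MISMATCHED_COLUMNS",
                severity: Severity::Error,
                row,
                column: None,
                message: format!("expected {} fields, found {}", schema.len(), fields.len()),
            });
            continue; // Column positions are unreliable, so skip type checks.
        }
        for (col, (value, ty)) in fields.iter().zip(schema).enumerate() {
            if !matches_type(value, *ty) {
                issues.push(Issue {
                    code: "TYPE_MISMATCH", // hypothetical code for this sketch
                    severity: Severity::Warning,
                    row,
                    column: Some(col + 1),
                    message: format!("{:?} is not a valid {:?}", value, ty),
                });
            }
        }
    }
    issues
}

fn main() {
    let csv = "id,score,active\n1,9.5,true\nx,oops\n3,abc,false";
    let schema = [FieldType::Integer, FieldType::Float, FieldType::Boolean];
    let issues = validate(csv, &schema);
    for issue in &issues {
        println!(
            "[{:?}] {} at row {}, col {:?}: {}",
            issue.severity, issue.code, issue.row, issue.column, issue.message
        );
    }
    let errors = issues.iter().filter(|i| i.severity == Severity::Error).count();
    println!("Total issues: {}, errors: {}", issues.len(), errors);
}
```

Collecting issues into a plain vector keeps the summary statistics (totals, counts per severity) a matter of simple iteration, and the same records could be serialized for a machine-readable report.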
data-doctor/
├── core/ # Rust library crate (core engine)
│ ├── src/
│ │ └── lib.rs # Core data processing logic
│ └── Cargo.toml
├── cli/ # Rust binary crate (CLI tool)
│ ├── src/
│ │ └── main.rs # Command-line interface
│ └── Cargo.toml
├── examples/ # Example data files and usage demos
│ ├── sample_data.csv
│ └── README.md
├── Cargo.toml # Workspace configuration
├── LICENSE # MIT License
├── README.md # This file
└── .gitignore
- Rust 1.70 or higher
- Cargo (comes with Rust)
cargo install data-doctor-cli
That's it! DataDoctor is now globally available:
data-doctor --version
data-doctor --help
# Clone the repository
git clone https://github.com/jeevanms003/data-doctor.git
cd data-doctor
# Build and install
cargo install --path cli
# The binary will be available as 'data-doctor'
data-doctor --version
# Clone the repository
git clone https://github.com/jeevanms003/data-doctor.git
cd data-doctor
# Build in release mode
cargo build --release
# Binary will be at: target/release/data-doctor
./target/release/data-doctor --version
Download the latest release for your platform:
- Linux (x86_64): data-doctor-linux-x86_64.tar.gz
- macOS (x86_64): data-doctor-macos-x86_64.tar.gz
- macOS (ARM64): data-doctor-macos-arm64.tar.gz
- Windows (x86_64): data-doctor-windows-x86_64.zip
# Linux/macOS
tar -xzf data-doctor-*.tar.gz
sudo mv data-doctor /usr/local/bin/
# Windows
# Extract ZIP and add to PATH
cargo install data-doctor
After installation, use the data-doctor command:
# Show version
data-doctor --version
# Show help
data-doctor --help
# Run validation
data-doctor validate examples/broken.json
During development, use cargo run:
cargo run --bin data-doctor -- validate examples/broken.json
DataDoctor provides three main commands:
# Validate a JSON file
data-doctor validate examples/valid.json
# Validate a CSV file
data-doctor validate examples/broken.csv
# Get machine-readable JSON report
data-doctor validate examples/broken.json --report-json
# Manually specify format
data-doctor validate data.txt --format json
# Fix broken JSON
data-doctor fix examples/broken.json --out fixed.json
# Fix broken CSV
data-doctor fix examples/broken.csv --out cleaned.csv
# Fix with format specification
data-doctor fix data.txt --format csv --out output.csv
# Full analysis of JSON file
data-doctor doctor examples/broken.json --out repaired.json
# Full analysis with JSON report
data-doctor doctor examples/broken.csv --out fixed.csv --report-json
During development, use cargo run (arguments after -- are passed to data-doctor rather than to cargo):
# Validate
cargo run --bin data-doctor -- validate examples/valid.json
# Fix
cargo run --bin data-doctor -- fix examples/broken.json --out fixed.json
# Doctor
cargo run --bin data-doctor -- doctor examples/broken.csv --out repaired.csv
# Show help
cargo run --bin data-doctor -- --help
Validate Command:
📊 Validation Report
==================================================
Summary:
File: examples/broken.json
Status: FAILED ✗
Statistics:
Total Records: 1
Valid Records: 0
Invalid Records: 1
Total Issues: 4
Issues:
1. [Error] JSON_PARSE_ERROR JSON parse error: ...
Location: Row 5, Col 2
Fix Command:
✓ File fixed successfully!
Input: examples/broken.json
Output: fixed.json
📊 Fix Report
==================================================
Total Issues: 4
Auto-Fixed: 4
Doctor Command:
🩺 DataDoctor Analysis
==================================================
Step 1: Validating...
⚠ 4 issue(s) found
Step 2: Applying auto-fixes...
✓ 4 issue(s) auto-fixed
Step 3: Generating report...
[Detailed report shown]
✓ Analysis complete!
Output saved to: repaired.json
- High Performance: Leverage Rust's speed and memory safety for processing large datasets efficiently
- Extensibility: Provide a modular architecture for adding custom validation rules and cleaning strategies
- Developer-Friendly: Offer both a library (core) and CLI (cli) for different use cases
- Production-Ready: Build a robust tool suitable for real-world data pipelines
- Type Safety: Utilize Rust's type system to prevent common data processing errors
Run all tests:
cargo test
Run tests for a specific crate:
cargo test -p data-doctor-core
cargo test -p data-doctor-cli
The core library providing data validation and cleaning functionality. Can be used as a dependency in other Rust projects.
Command-line interface for DataDoctor. Provides an easy-to-use CLI for common data quality tasks.
Contributions are welcome! Please feel free to submit issues and pull requests.
This project is licensed under the MIT License - see the LICENSE file for details.
- Implement CSV parsing and validation
- Add support for JSON formats
- Develop rule-based validation engine
- Create data cleaning strategies (missing values, type mismatches)
- Build data quality reporting with error codes
- Implement auto-fix capabilities for JSON and CSV
- Add support for XML formats
- Add configuration file support
- Implement parallel processing for large datasets
- Create web API for remote data validation
- Add support for Parquet and other binary formats
- Implement machine learning-based anomaly detection
For questions and support, please open an issue on GitHub.
Made with ❤️ and Rust