DataDoctor 🩺


DataDoctor is a Rust-based data validation and cleaning engine that detects, diagnoses, and fixes data quality issues in JSON and CSV datasets.

📋 Overview

DataDoctor helps you maintain high-quality data by:

  • Validating data against defined rules and schemas
  • Detecting common data quality issues (missing values, invalid formats, duplicates)
  • Cleaning and fixing data problems automatically
  • Analyzing data quality metrics and generating reports

✨ Features

JSON Auto-Fix Capabilities (10 features)

  • ✅ Remove trailing commas
  • ✅ Insert missing commas between fields
  • ✅ Convert single quotes to double quotes
  • ✅ Quote unquoted keys
  • ✅ Fix unclosed quotes
  • ✅ Normalize booleans (True → true, False → false)
  • ✅ Normalize NULL values (NULL → null)
  • ✅ Fix unclosed braces/brackets
  • ✅ Detect duplicate keys
  • ✅ Pretty format JSON
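
To give a feel for what one of these fixes involves, here is a minimal sketch of trailing-comma removal. It is not DataDoctor's actual implementation: the function name `remove_trailing_commas` is hypothetical, and the approach (a single pass that tracks whether the cursor is inside a string literal) is just one reasonable way to do it.

```rust
/// Hypothetical helper (not part of DataDoctor's API): drops any comma whose
/// next non-whitespace character is `}` or `]`, leaving string contents alone.
fn remove_trailing_commas(input: &str) -> String {
    let chars: Vec<char> = input.chars().collect();
    let mut out = String::with_capacity(input.len());
    let mut in_string = false; // inside a double-quoted JSON string
    let mut escaped = false; // previous character was a backslash

    for (i, &c) in chars.iter().enumerate() {
        if in_string {
            out.push(c);
            if escaped {
                escaped = false;
            } else if c == '\\' {
                escaped = true;
            } else if c == '"' {
                in_string = false;
            }
            continue;
        }
        match c {
            '"' => {
                in_string = true;
                out.push(c);
            }
            ',' => {
                // Look ahead past whitespace to the next significant character.
                let next = chars[i + 1..]
                    .iter()
                    .copied()
                    .find(|ch| !ch.is_whitespace());
                if !matches!(next, Some('}') | Some(']')) {
                    out.push(c);
                }
            }
            _ => out.push(c),
        }
    }
    out
}

fn main() {
    let broken = r#"{"name": "a,]", "tags": ["x", "y",], }"#;
    println!("{}", remove_trailing_commas(broken));
}
```

Tracking string state matters here: a naive search-and-replace would also rewrite the `,]` inside the `"a,]"` value.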

CSV Auto-Fix Capabilities (12 features)

  • ✅ Pad missing columns with empty values
  • ✅ Trim extra columns
  • ✅ Normalize whitespace
  • ✅ Fix simple type mismatches
  • ✅ Convert boolean variants (yes/no, 1/0 → true/false)
  • ✅ Remove BOM headers
  • ✅ Auto-detect delimiters
  • ✅ Detect duplicate headers
  • ✅ Fix trailing delimiters
  • ✅ Normalize field values
  • ✅ Type validation (Integer, Float, Boolean, Email, URL)
  • ✅ Required field validation
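
As an illustration of the column and boolean fixes listed above, the following sketch pads short rows, trims extra columns, normalizes whitespace, and maps boolean variants. The names `normalize_row` and `normalize_bool` are hypothetical and do not come from the DataDoctor codebase. The example also splits on commas naively, so it ignores quoted fields that a real CSV parser would handle.

```rust
/// Hypothetical helper: trims whitespace, drops fields beyond `width`,
/// and pads short rows with empty strings so every row has `width` columns.
fn normalize_row(fields: &[&str], width: usize) -> Vec<String> {
    let mut row: Vec<String> = fields
        .iter()
        .take(width)
        .map(|f| f.trim().to_string())
        .collect();
    row.resize(width, String::new());
    row
}

/// Hypothetical helper: maps common boolean spellings to a `bool`,
/// returning `None` for anything it does not recognize.
fn normalize_bool(value: &str) -> Option<bool> {
    match value.trim().to_ascii_lowercase().as_str() {
        "true" | "yes" | "1" => Some(true),
        "false" | "no" | "0" => Some(false),
        _ => None,
    }
}

fn main() {
    let header_width = 3;
    for line in ["1, Ada ,yes", "2,Bob", "3,Cy,no,extra"] {
        // Naive split: assumes no quoted fields containing commas.
        let fields: Vec<&str> = line.split(',').collect();
        let row = normalize_row(&fields, header_width);
        println!("{:?} -> active = {:?}", row, normalize_bool(&row[2]));
    }
}
```

Returning `Option<bool>` rather than defaulting to `false` keeps unrecognized values visible, so they can be reported as issues instead of being silently coerced.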

Validation & Reporting

  • 📊 Detailed error tracking with line/column numbers
  • 🏷️ Readable error codes (e.g., JSON_TRAILING_COMMA, CSV_MISMATCHED_COLUMNS)
  • 📈 Statistics on total issues, auto-fixes, and severity levels
  • 🎯 Schema-based validation for structured data
  • 💻 Human-friendly colored terminal output
  • 📄 Machine-readable JSON reports
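
One way to picture the reporting model is as a list of issue records that gets folded into summary statistics. The sketch below is an assumption about how such data could be shaped, not the types exposed by data-doctor-core: `Issue`, `Severity`, `Summary`, and `summarize` are all hypothetical names. Only the two error codes are taken from the list above.

```rust
/// Hypothetical severity levels; the real crate may define different ones.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Severity {
    Warning,
    Error,
}

/// Hypothetical issue record carrying a readable code and a location.
#[derive(Debug)]
struct Issue {
    code: &'static str,
    severity: Severity,
    line: usize,
    column: usize,
    auto_fixed: bool,
}

/// Hypothetical aggregate statistics for a report.
#[derive(Debug, Default, PartialEq, Eq)]
struct Summary {
    total: usize,
    auto_fixed: usize,
    errors: usize,
    warnings: usize,
}

/// Counts issues by severity and by whether they were auto-fixed.
fn summarize(issues: &[Issue]) -> Summary {
    let mut summary = Summary::default();
    for issue in issues {
        summary.total += 1;
        if issue.auto_fixed {
            summary.auto_fixed += 1;
        }
        match issue.severity {
            Severity::Error => summary.errors += 1,
            Severity::Warning => summary.warnings += 1,
        }
    }
    summary
}

fn main() {
    let issues = [
        Issue { code: "JSON_TRAILING_COMMA", severity: Severity::Warning, line: 3, column: 18, auto_fixed: true },
        Issue { code: "CSV_MISMATCHED_COLUMNS", severity: Severity::Error, line: 7, column: 1, auto_fixed: false },
    ];
    for issue in &issues {
        println!(
            "[{:?}] {} at line {}, col {}",
            issue.severity, issue.code, issue.line, issue.column
        );
    }
    println!("{:?}", summarize(&issues));
}
```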

🏗️ Project Structure

data-doctor/
├── core/              # Rust library crate (core engine)
│   ├── src/
│   │   └── lib.rs    # Core data processing logic
│   └── Cargo.toml
├── cli/               # Rust binary crate (CLI tool)
│   ├── src/
│   │   └── main.rs   # Command-line interface
│   └── Cargo.toml
├── examples/          # Example data files and usage demos
│   ├── sample_data.csv
│   └── README.md
├── Cargo.toml         # Workspace configuration
├── LICENSE            # MIT License
├── README.md          # This file
└── .gitignore

🚀 Getting Started

Prerequisites

  • Rust 1.70 or higher
  • Cargo (comes with Rust)

Installation

Option 1: Install from crates.io (Recommended) ⭐

cargo install data-doctor-cli

That's it! DataDoctor is now globally available:

data-doctor --version
data-doctor --help

Option 2: Install from Source

# Clone the repository
git clone https://github.com/jeevanms003/data-doctor.git
cd data-doctor

# Build and install
cargo install --path cli

# The binary will be available as 'data-doctor'
data-doctor --version

Option 3: Build from Source

# Clone the repository
git clone https://github.com/jeevanms003/data-doctor.git
cd data-doctor

# Build in release mode
cargo build --release

# Binary will be at: target/release/data-doctor
./target/release/data-doctor --version

Option 4: Download Pre-built Binaries

Download the latest release for your platform:

  • Linux (x86_64): data-doctor-linux-x86_64.tar.gz
  • macOS (x86_64): data-doctor-macos-x86_64.tar.gz
  • macOS (ARM64): data-doctor-macos-arm64.tar.gz
  • Windows (x86_64): data-doctor-windows-x86_64.zip

# Linux/macOS
tar -xzf data-doctor-*.tar.gz
sudo mv data-doctor /usr/local/bin/

# Windows
# Extract ZIP and add to PATH

Running the CLI

After installation, use the data-doctor command:

# Show version
data-doctor --version

# Show help
data-doctor --help

# Run validation
data-doctor validate examples/broken.json

During development, use cargo run:

cargo run --bin data-doctor -- validate examples/broken.json

📖 Usage

CLI Commands

DataDoctor provides three main commands:

1. Validate - Check data without making changes

# Validate a JSON file
data-doctor validate examples/valid.json

# Validate a CSV file
data-doctor validate examples/broken.csv

# Get machine-readable JSON report
data-doctor validate examples/broken.json --report-json

# Manually specify format
data-doctor validate data.txt --format json

2. Fix - Auto-correct issues and save to new file

# Fix broken JSON
data-doctor fix examples/broken.json --out fixed.json

# Fix broken CSV
data-doctor fix examples/broken.csv --out cleaned.csv

# Fix with format specification
data-doctor fix data.txt --format csv --out output.csv

3. Doctor - Comprehensive analysis (validate + fix + report)

# Full analysis of JSON file
data-doctor doctor examples/broken.json --out repaired.json

# Full analysis with JSON report
data-doctor doctor examples/broken.csv --out fixed.csv --report-json

Running from Source

During development, use cargo run. Arguments after -- are passed to data-doctor rather than to Cargo:

# Validate
cargo run --bin data-doctor -- validate examples/valid.json

# Fix
cargo run --bin data-doctor -- fix examples/broken.json --out fixed.json

# Doctor
cargo run --bin data-doctor -- doctor examples/broken.csv --out repaired.csv

# Show help
cargo run --bin data-doctor -- --help

Example Outputs

Validate Command:

📊 Validation Report
==================================================

Summary:
  File: examples/broken.json
  Status: FAILED ✗

Statistics:
  Total Records:   1
  Valid Records:   0
  Invalid Records: 1
  Total Issues:    4

Issues:
  1. [Error] JSON_PARSE_ERROR JSON parse error: ...
     Location: Row 5, Col 2

Fix Command:

✓ File fixed successfully!
  Input:  examples/broken.json
  Output: fixed.json

📊 Fix Report
==================================================
  Total Issues:    4
  Auto-Fixed:      4

Doctor Command:

🩺 DataDoctor Analysis
==================================================

Step 1: Validating...
  ⚠ 4 issue(s) found

Step 2: Applying auto-fixes...
  ✓ 4 issue(s) auto-fixed

Step 3: Generating report...
  [Detailed report shown]

✓ Analysis complete!
  Output saved to: repaired.json

🎯 Project Goals

  1. High Performance: Leverage Rust's speed and memory safety for processing large datasets efficiently
  2. Extensibility: Provide a modular architecture for adding custom validation rules and cleaning strategies
  3. Developer-Friendly: Offer both a library (core) and CLI (cli) for different use cases
  4. Production-Ready: Build a robust tool suitable for real-world data pipelines
  5. Type Safety: Utilize Rust's type system to prevent common data processing errors

🧪 Testing

Run all tests:

cargo test

Run tests for a specific crate:

cargo test -p data-doctor-core
cargo test -p data-doctor-cli

📦 Crates

data-doctor-core

The core library providing data validation and cleaning functionality. Can be used as a dependency in other Rust projects.

data-doctor-cli

Command-line interface for DataDoctor. Provides an easy-to-use CLI for common data quality tasks.

🤝 Contributing

Contributions are welcome! Please feel free to submit issues and pull requests.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🔮 Roadmap

  • ✅ Implement CSV parsing and validation
  • ✅ Add support for JSON
  • ✅ Develop rule-based validation engine
  • ✅ Create data cleaning strategies (missing values, type mismatches)
  • ✅ Build data quality reporting with error codes
  • ✅ Implement auto-fix capabilities for JSON and CSV
  • ⬜ Add support for XML
  • ⬜ Add configuration file support
  • ⬜ Implement parallel processing for large datasets
  • ⬜ Create web API for remote data validation
  • ⬜ Add support for Parquet and other binary formats
  • ⬜ Implement machine learning-based anomaly detection

📞 Support

For questions and support, please open an issue on GitHub.


Made with ❤️ and Rust
