Profile any dataset in seconds — powered by a C++ parallel engine.
CSV • Parquet • Arrow | 1TB files | One line of code
Every Data Scientist's first step is understanding the data. But existing tools force a painful tradeoff:
| Tool | 500MB CSV | 5GB Parquet | RAM Usage |
|---|---|---|---|
| Pandas Profiling | 12 min | ❌ Crash | 8 GB+ |
| ydata-profiling | 8 min | ❌ Crash | 6 GB+ |
| ZEDDA | 3 sec | 5 sec | < 200 MB |
ZEDDA achieves this through a multi-threaded C++ core that processes data in parallel, combined with intelligent sampling that gives you statistically accurate results without reading every single row.
pip install zedda
import zedda as zd
zd.profile("transactions.csv")
Output:
⚡ zedda v0.2.0
Scanning transactions.csv...
┌─────────── Dataset Overview ⚡ SAMPLED ───────────┐
│ File: transactions.csv │
│ ⚠ SAMPLED MODE (stratified, exact nulls & range)│
│ Rows: 6,362,620 │
│ Cols: 11 (8 numeric, 3 string) │
│ Nulls: 0.0% (0 cells) │
│ Scanned: 4,231 ms │
└────────────────────────────────────────────────────┘
Column Type Nulls Mean (±95% CI) Min Max Flags
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
step int 0.0% 192.6 ± 0.24 1 353 ok
amount float 0.0% 1.793e+05 ± 701 0 1.55e+07 HIGH CARD
oldbalanceOrg float 0.0% 8.553e+05 ± 5,714 0 3.894e+07 ok
isFraud int 0.0% 0.000659 ± 5.03e-05 0 1 ok
...
ℹ Means show 95% confidence interval. Null counts and min/max are exact (from Parquet footer).
ZEDDA's profiling engine is written entirely in C++17 and compiled natively for your platform. It uses BS::thread_pool to parse data across all CPU cores simultaneously — achieving 5–8x speedup over single-threaded Python.
Python (Pandas): 1 core → 12 seconds for 500MB
ZEDDA (C++): 8 cores → 1.5 seconds for 500MB
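The split-and-merge pattern behind those numbers can be sketched in Python. This is an illustration, not ZEDDA's C++ code: `RunningStats`, `profile_chunk`, and `profile_parallel` are hypothetical names, and Python threads stand in for the C++ thread pool only to show the structure (because of the GIL, this version will not reproduce the speedup). Each worker keeps a Welford accumulator for its chunk, and the partial results are combined with the standard pairwise merge formula.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
import math


@dataclass
class RunningStats:
    """Welford accumulator: count, mean, and sum of squared deviations (m2)."""
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0

    def push(self, x: float) -> None:
        # Single-pass, numerically stable update.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def merge(self, other: "RunningStats") -> "RunningStats":
        # Pairwise combination of two partial results (Chan et al.).
        if other.n == 0:
            return self
        if self.n == 0:
            return other
        n = self.n + other.n
        delta = other.mean - self.mean
        mean = self.mean + delta * other.n / n
        m2 = self.m2 + other.m2 + delta * delta * self.n * other.n / n
        return RunningStats(n, mean, m2)

    @property
    def stddev(self) -> float:
        # Sample standard deviation (n - 1 in the denominator).
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0


def profile_chunk(values):
    """Profile one chunk independently of all others."""
    stats = RunningStats()
    for v in values:
        stats.push(v)
    return stats


def profile_parallel(values, workers=4):
    """Split values into chunks, profile each in a worker, merge the results."""
    size = max(1, math.ceil(len(values) / workers))
    chunks = [values[i:i + size] for i in range(0, len(values), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(profile_chunk, chunks))
    total = RunningStats()
    for part in partials:
        total = total.merge(part)
    return total
```

Because the merge step is exact, the combined mean and standard deviation match a single-threaded pass up to floating-point rounding, regardless of how the data is chunked.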
Files over 500 MB automatically trigger stratified sampling — ZEDDA reads 1 million representative rows instead of the entire file. This is configurable:
# Auto (default) — ZEDDA decides based on file size
zd.profile("huge_file.csv")
# Force exact scan — no sampling, read every row
zd.profile("huge_file.csv", sample_size=-1)
# Custom sample — e.g. 5 million rows
zd.profile("huge_file.csv", sample_size=5_000_000)
Why is this safe?
- Statistics back it up: the 95% margin on a sampled mean is roughly 1.96·σ/√n, so at n = 1M rows it is about 0.2% of the column's standard deviation.
- 95% Confidence Intervals: Every mean is shown as Mean ± CI so you can see exactly how precise the estimate is.
- Parquet Footer Cheat Code: Min, Max, and Null counts are always exact, read directly from Parquet metadata in milliseconds, even for TB-scale files.
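To make the Mean ± CI column concrete, here is a minimal sketch of a 95% interval for a sampled mean. It assumes the usual normal approximation (z = 1.96) on a simple random sample; ZEDDA's exact formula is not documented here and may differ, and `mean_with_ci` is a hypothetical helper name.

```python
import math


def mean_with_ci(sample, z=1.96):
    """Return (mean, margin) so the interval is mean ± margin.

    Assumes a normal approximation; z=1.96 corresponds to 95% confidence.
    """
    n = len(sample)
    if n < 2:
        raise ValueError("need at least two values to estimate a margin")
    mean = sum(sample) / n
    # Sample variance with n - 1 in the denominator.
    variance = sum((x - mean) ** 2 for x in sample) / (n - 1)
    # Standard error of the mean, scaled by the z value.
    margin = z * math.sqrt(variance / n)
    return mean, margin


mean, margin = mean_with_ci([1.0, 2.0, 3.0, 4.0, 5.0])
print(f"{mean:.3f} ± {margin:.3f}")
```

The margin shrinks with the square root of the sample size, which is why a column with a rare value (such as a fraud flag) shows a wider relative interval than a dense numeric column.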
ZEDDA automatically detects data quality issues and flags them:
| Flag | Meaning | When |
|---|---|---|
| HIGH NULL | Column has too many missing values | Null% > 20% |
| CONST | Column has only one unique value | Useless for ML |
| HIGH CARD | Column has very high cardinality | May need encoding |
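The rules in the table can be sketched as a small function. Only the 20% null threshold comes from the table; the cardinality cutoff (`high_card_ratio`) is a hypothetical value chosen for illustration, `quality_flags` is not part of the ZEDDA API, and `null_pct` is assumed to be on a 0 to 100 scale.

```python
def quality_flags(null_pct, unique_approx, num_rows, high_card_ratio=0.1):
    """Return the list of flags for one column, or ["ok"] if none apply.

    null_pct        -- percentage of null cells, assumed 0-100
    unique_approx   -- approximate distinct count
    num_rows        -- rows in the dataset
    high_card_ratio -- hypothetical cutoff, not ZEDDA's documented value
    """
    flags = []
    if null_pct > 20.0:
        flags.append("HIGH NULL")
    if unique_approx == 1:
        flags.append("CONST")
    elif num_rows > 0 and unique_approx / num_rows > high_card_ratio:
        flags.append("HIGH CARD")
    return flags or ["ok"]


print(quality_flags(null_pct=0.0, unique_approx=978372, num_rows=6362620))
```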
Compare two datasets (e.g., train vs test, v1 vs v2) and detect schema changes, null rate shifts, and distribution drift:
zd.compare("train.csv", "test.csv")
⚡ zedda compare
A: train.csv (800,000 rows)
B: test.csv (200,000 rows)
Column Type A Type B Nulls A Nulls B Mean A Mean B Drift
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
age int int 0.0% 0.0% 29.7 29.4 ok
fare float float 0.0% 2.1% 32.2 35.8 SHIFT
cabin str MISSING 77.1% — — — REMOVED
embarked str str 0.2% 0.0% — — ok
- DRIFT: Mean shifted significantly (z-score > 1.0); model retraining may be needed.
- SHIFT: Moderate change detected (z-score > 0.3).
- NEW/REMOVED: Column added or dropped between datasets.
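The thresholds above can be read as a simple classification rule. The sketch below assumes the z-score is the absolute difference in means divided by dataset A's standard deviation; that definition is an assumption, since the exact formula is not spelled out here, and `classify_drift` is a hypothetical helper.

```python
def classify_drift(mean_a, mean_b, stddev_a):
    """Label the change in a column's mean between datasets A and B.

    Assumes z = |mean_b - mean_a| / stddev_a (an illustrative definition).
    Thresholds 1.0 and 0.3 are the ones listed for DRIFT and SHIFT.
    """
    if stddev_a <= 0:
        # A constant column in A: any change in the mean is treated as drift.
        return "ok" if mean_a == mean_b else "DRIFT"
    z = abs(mean_b - mean_a) / stddev_a
    if z > 1.0:
        return "DRIFT"
    if z > 0.3:
        return "SHIFT"
    return "ok"


print(classify_drift(mean_a=10.0, mean_b=10.5, stddev_a=1.0))
```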
No Python script needed. Profile any file directly from the command line:
# Profile a file
zedda run data.csv
# Compare two files
zedda compare train.csv test.csv
# Quick file info (rows, size)
zedda info data.csv
# AI-powered insights (requires OPENAI_API_KEY)
zedda run data.csv --ai
Scan a file, print a beautiful terminal report, and return the result.
result = zd.profile("data.csv")
# Prints colored table to terminal
# Returns DatasetProfile object
Scan a file and return the result without printing.
p = zd.scan("data.parquet")
# Access dataset-level stats
print(p.num_rows) # 6362620
print(p.num_cols) # 11
print(p.scan_time_ms) # 4231.5
print(p.is_sampled) # True
# Access column-level stats
for col in p.columns:
    print(col.name)           # "amount"
    print(col.type_str)       # "float"
    print(col.mean)           # 179329.4
    print(col.stddev)         # 603858.2
    print(col.val_min)        # 0.0
    print(col.val_max)        # 15500000.0
    print(col.null_pct)       # 0.0
    print(col.unique_approx)  # 978372
Compare two datasets side by side with drift detection.
zd.compare("january_sales.csv", "february_sales.csv")

| Parameter | Type | Default | Description |
|---|---|---|---|
| path | str | required | Path to CSV, Parquet, or Arrow file |
| sample_size | int | None | Max rows to sample. None = auto, -1 = read all |
| Format | Extension | Zero-Copy |
|---|---|---|
| CSV | .csv | — |
| Parquet | .parquet | ✅ via Arrow C Data Interface |
| Arrow IPC | .arrow | ✅ via Arrow C Data Interface |
┌──────────────────────────────────────────────────────────┐
│                     Python API Layer                     │
│         zd.profile() / zd.scan() / zd.compare()          │
└────────────────────────────┬─────────────────────────────┘
                             │
                  ┌──────────▼──────────┐
                  │    Auto-Sampling    │
                  │   Decision Engine   │
                  │  (>500MB = sample)  │
                  └──────────┬──────────┘
                             │
              ┌──────────────┼──────────────┐
              ▼              ▼              ▼
        ┌───────────┐  ┌───────────┐  ┌───────────┐
        │ CSV Path  │  │  Parquet  │  │   Arrow   │
        │           │  │   Path    │  │   Path    │
        └─────┬─────┘  └─────┬─────┘  └─────┬─────┘
              │              │              │
              ▼              ▼              ▼
        ┌───────────┐  ┌───────────┐  ┌───────────┐
        │ C++ Multi │  │ PyArrow   │  │ PyArrow   │
        │ Threaded  │  │ Stratified│  │ Batched   │
        │ Chunked   │  │ Row Group │  │ Reader    │
        │ Parser    │  │ Sampling  │  │           │
        └─────┬─────┘  └─────┬─────┘  └─────┬─────┘
              │              │              │
              └──────────────┼──────────────┘
                             ▼
                   ┌───────────────────┐
                   │    C++ Profile    │
                   │  Builder Engine   │
                   │ (BS::thread_pool) │
                   │ ───────────────── │
                   │ • Welford Online  │
                   │   Mean/Variance   │
                   │ • HyperLogLog     │
                   │   Unique Approx   │
                   │ • Streaming       │
                   │   Min/Max/Nulls   │
                   └─────────┬─────────┘
                             ▼
                   ┌───────────────────┐
                   │  DatasetProfile   │
                   │   Result Object   │
                   └─────────┬─────────┘
                             ▼
                   ┌───────────────────┐
                   │   Rich Terminal   │
                   │  Pretty Printer   │
                   │ (colored tables)  │
                   └───────────────────┘
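The Auto-Sampling Decision Engine and the Parquet row-group path in the diagram can be sketched as two small functions. This is a guess at the shape of the logic, not ZEDDA's implementation: the function names are hypothetical, "500 MB" is assumed to mean 500 × 1024 × 1024 bytes, and "stratified" is interpreted here as evenly spaced row groups so that the start, middle, and end of the file are all covered.

```python
import math

SAMPLE_THRESHOLD_BYTES = 500 * 1024 * 1024  # assumed meaning of "500 MB"
DEFAULT_SAMPLE_ROWS = 1_000_000             # default sample size from the docs


def resolve_sample_rows(file_size_bytes, sample_size=None):
    """Return the row budget for a scan, or None for a full (exact) scan."""
    if sample_size == -1:
        return None                 # caller forced an exact scan
    if sample_size is not None:
        return sample_size          # caller chose a custom sample
    if file_size_bytes > SAMPLE_THRESHOLD_BYTES:
        return DEFAULT_SAMPLE_ROWS  # auto mode: large file, sample it
    return None                     # auto mode: small file, read everything


def pick_row_groups(row_group_sizes, row_budget):
    """Pick evenly spaced row-group indices that roughly cover row_budget rows."""
    total_groups = len(row_group_sizes)
    if total_groups == 0:
        return []
    total_rows = sum(row_group_sizes)
    if row_budget is None or row_budget >= total_rows:
        return list(range(total_groups))
    avg_rows = total_rows / total_groups
    wanted = max(1, min(total_groups, math.ceil(row_budget / avg_rows)))
    if wanted == 1:
        return [0]
    # Spread the picks from the first group to the last one.
    step = (total_groups - 1) / (wanted - 1)
    return sorted({int(i * step + 0.5) for i in range(wanted)})


budget = resolve_sample_rows(file_size_bytes=2 * 1024**3)
print(budget, pick_row_groups([100_000] * 50, budget))
```

With PyArrow, the per-group row counts would typically come from the Parquet file metadata, and only the selected groups would then be read.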
| Component | Algorithm | Why |
|---|---|---|
| Mean & Variance | Welford's Online Algorithm | Numerically stable, single-pass |
| Unique Count | HyperLogLog (approx) | O(1) memory, works on billions of values |
| Thread Pool | BS::thread_pool | Header-only, lightweight task scheduling |
| Parquet I/O | Arrow C Data Interface | True zero-copy — no serialization |
| Sampling | Stratified Row Groups | Covers start, middle, and end of file |
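To show why HyperLogLog fits in fixed memory, here is a compact textbook-style sketch. It is not ZEDDA's C++ implementation: the register count (2^12), the SHA-1 based hash, and the class name are illustrative assumptions. Each value is hashed, the first bits pick a register, and the register remembers the longest run of leading zeros seen in the remaining bits.

```python
import hashlib
import math


class HyperLogLog:
    """Approximate distinct counter using 2**p small registers."""

    def __init__(self, p=12):
        self.p = p                       # bits used to choose a register
        self.m = 1 << p                  # number of registers
        self.registers = [0] * self.m

    def add(self, value):
        digest = hashlib.sha1(str(value).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")      # 64-bit hash
        idx = h >> (64 - self.p)                   # first p bits
        rest = h & ((1 << (64 - self.p)) - 1)      # remaining 64 - p bits
        # Rank: 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def estimate(self):
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)           # bias constant for m >= 128
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:
            return m * math.log(m / zeros)         # small-range correction
        return raw


hll = HyperLogLog()
for i in range(100_000):
    hll.add(i)
print(round(hll.estimate()))
```

With 4,096 registers the expected relative error is about 1.04/√4096 ≈ 1.6%, and memory use stays the same whether the column holds thousands or billions of values.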
zedda/
├── src/core/ # C++ engine
│ ├── profile_builder.cpp # Multi-threaded profiling logic
│ ├── arrow_profiler.cpp # Arrow C Data Interface consumer
│ └── stream_reader.cpp # CSV chunked reader
├── include/zedda/ # C++ headers
│ ├── profile_builder.hpp
│ ├── profile_result.hpp # DatasetProfile struct
│ ├── stream_reader.hpp
│ └── BS_thread_pool.hpp # Thread pool (MIT, header-only)
├── python/zedda/ # Python package
│ ├── __init__.py # Public API (profile, scan, compare)
│ └── cli.py # Typer CLI app
├── tests/ # Test suites
├── CMakeLists.txt # Build configuration
└── pyproject.toml # Package metadata
# Clone with submodules
git clone --recursive https://github.com/prince3235/fasteda.git
cd fasteda
# Install in editable mode
pip install -e . --no-build-isolation
# Run tests
python -X utf8 tests/test_phase3.py
- Python ≥ 3.9
- C++ Compiler with C++17 support (MSVC 19+, GCC 9+, Clang 10+)
- CMake ≥ 3.21
- Phase 1 — Multi-threaded CSV parsing (5–8x speedup)
- Phase 2 — Zero-copy Parquet via Arrow C Data Interface
- Phase 3 — Intelligent Sampling Engine (1TB support)
- Phase 4 — SIMD/AVX-512 vectorized numeric parsing
- Phase 5 — Interactive HTML reports & dashboards
- Phase 6 — AI-powered data insights (GPT integration)
We welcome contributions! Here's how:
- Fork the repo
- Create your feature branch (git checkout -b feat/amazing-feature)
- Commit your changes (git commit -m 'feat: add amazing feature')
- Push to the branch (git push origin feat/amazing-feature)
- Open a Pull Request
MIT License — see LICENSE for details.
Built with ❤️ and C++ by Prince Patel
