Original project site: http://www.datasetgenerator.com/
DatGen is one of the earliest synthetic data generators, developed between 1997 and 1999, before synthetic data became a mainstream concept. Several of its core ideas, such as generating labeled datasets with known statistical properties, are still used by modern synthetic data platforms.
- 1997: Initial development for classification algorithm research
- 1998: Published as "Synthetic Classification Data Sets (SCDS) generator"
- 1999: Version 3.1 released (current C implementation)
- 2025: Modernization initiative launched
DatGen was designed to solve critical problems in machine learning research:
- Controlled Experimentation: Generate datasets with known statistical properties
- Reproducible Benchmarks: Create consistent test data for algorithm comparison
- Privacy-Safe Testing: Test algorithms without real-world data concerns
- Scalability Testing: Generate datasets of arbitrary size
These needs remain relevant today, making DatGen's concepts timeless.
# Compile the classic version
cd src
make
# Generate a simple dataset
./datgen -n 1000 -m 10 -c 3 > dataset.csv
- -n: Number of instances to generate
- -m: Number of attributes/features
- -c: Number of classes
- -s: Random seed for reproducibility
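For scripted or repeated runs, the invocation above can be wrapped in a small helper. The following is a minimal Python sketch and not part of DatGen itself: the names `build_datgen_command` and `run_datgen` and the default binary path `./datgen` are assumptions for illustration, and it relies only on the four flags described above.

```python
import subprocess
from pathlib import Path


def build_datgen_command(n, m, c, seed=None, binary="./datgen"):
    """Build the argument list for the classic datgen CLI.

    Hypothetical helper: uses only the -n, -m, -c and -s flags
    described in this README.
    """
    for name, value in (("n", n), ("m", m), ("c", c)):
        if value <= 0:
            raise ValueError(f"{name} must be positive, got {value}")
    cmd = [binary, "-n", str(n), "-m", str(m), "-c", str(c)]
    if seed is not None:
        # Pass the seed only when one was requested
        cmd += ["-s", str(seed)]
    return cmd


def run_datgen(output_path, **kwargs):
    """Run datgen and write its standard output to output_path."""
    cmd = build_datgen_command(**kwargs)
    with Path(output_path).open("w") as out:
        # check=True raises CalledProcessError on a non-zero exit code
        subprocess.run(cmd, stdout=out, check=True)
```

With the binary compiled, a call such as `run_datgen("dataset.csv", n=1000, m=10, c=3, seed=42)` would mirror the shell command above. Fixing the seed is what should make repeated runs comparable, assuming `-s` behaves as described.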
- Core generation engine in Python
- NumPy/Pandas integration
- Maintain backward compatibility with C version
- Add JSON/Parquet output formats
- Scikit-learn compatible API
- Time series data generation
- Text/NLP synthetic data
- Graph/network data structures
- REST API interface
- Docker containerization
- Streaming data generation
- Distributed generation for big data
- GAN-based generation options
- Differential privacy guarantees
- LLM integration for semantic data
from datgen import DataGenerator
# Classic compatible mode
gen = DataGenerator.classic(
    n_samples=1000,
    n_features=10,
    n_classes=3,
    random_state=42
)
df = gen.generate()
# Modern fluent API
gen = (
    DataGenerator()
    .with_samples(10000)
    .with_features(20, types=['numeric', 'categorical'])
    .with_classes(5, balance='imbalanced')
    .with_noise(0.1)
)
# Generate different formats
df = gen.to_pandas()
gen.to_parquet("data.parquet")
gen.to_json("data.json", streaming=True)

Modern use cases that DatGen addresses:
- Privacy Compliance: GDPR/CCPA compliant testing
- ML Pipeline Testing: CI/CD for ML systems
- Edge Case Generation: Test rare events
- Fairness Testing: Generate balanced datasets
- Teaching Tool: Learn ML without real data
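Several of these use cases (pipeline testing, rare-event generation, balanced datasets) come down to a seeded generator with controllable class proportions. The sketch below illustrates that idea with the Python standard library only. It is not DatGen's algorithm or API: the function name `generate_classification_rows`, the `class_weights` parameter, and the Gaussian-cluster scheme are all assumptions made for the example.

```python
import random
from collections import Counter


def generate_classification_rows(n_samples, n_features, class_weights, seed=0):
    """Return a list of (features, label) rows.

    Illustrative only: each class is a Gaussian cluster centered at its
    class index. This is not DatGen's generation algorithm.
    """
    rng = random.Random(seed)  # local RNG, so the seed fully determines output
    labels = list(range(len(class_weights)))
    rows = []
    for _ in range(n_samples):
        # Draw the class label according to the requested proportions
        label = rng.choices(labels, weights=class_weights, k=1)[0]
        features = [rng.gauss(float(label), 1.0) for _ in range(n_features)]
        rows.append((features, label))
    return rows


# Balanced data for fairness-style checks, and a rare-event variant
balanced = generate_classification_rows(1000, 4, [1, 1], seed=42)
rare_event = generate_classification_rows(1000, 4, [0.99, 0.01], seed=42)

print(Counter(label for _, label in balanced))
print(Counter(label for _, label in rare_event))
```

Because the generator owns its own `random.Random` instance, calling it twice with the same arguments returns identical rows, which is the property a CI test fixture needs.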
DatGen influenced early synthetic data research and classification algorithm benchmarking.
@software{melli1997datgen,
  author = {Melli, Gabor},
  title = {DatGen: A Synthetic Data Generator},
  year = {1997},
  url = {https://github.com/gmelli/DatGen}
}

This project bridges 28 years of synthetic data evolution. Contributions welcome for:
- Python port development
- Modern feature additions
- Documentation improvements
- Use case examples
MIT License - Same as 1997, because good ideas should be free.
"Perfect data for an imperfect world" - Still true after 28 years
Repository Status: 🏛️ Historical Artifact | 🚧 Active Modernization | 📚 Educational Resource