Artificial Data Generation

A Python package for generating synthetic VIP customer data, with Bayesian analysis, imputation, visualization, and experiment tracking.

What the package provides

This package focuses on reusable Python functionality for:

generating synthetic VIP customer data
bootstrapping seed and checkpoint datasets
imputing missing values
comparing real vs. synthetic data with Bayesian fidelity analysis
producing comparison plots and correlation heatmaps
persisting metrics, tables, and plots to structured experiment outputs with MLflow

Primary business-facing columns used across the package:

avg_transaction
frequency
credit_score
vip_status

Package modules

`src/generators.py`

Synthetic data generators:

RandomGenerator — independent statistical baseline
BayesianNetworkGenerator — posterior-inspired multivariate generator conditioned on seed data
GANGenerator — neural-network GAN generator with a structured fallback when PyTorch is unavailable

Usage:

from src.generators import get_generator

gen = get_generator("gan")
df = gen.generate(1000, random_state=42)
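
To make the "independent statistical baseline" idea concrete, here is a minimal standard-library sketch. It is not the package's `RandomGenerator`: the function name and the distribution chosen for each column are assumptions for illustration, and only the four column names come from this README.

```python
import random


def generate_independent_rows(n_samples, random_state=None):
    """Draw each column independently of the others (illustrative only)."""
    rng = random.Random(random_state)
    rows = []
    for _ in range(n_samples):
        # Assumed distributions; the real package may use different ones.
        credit_score = min(850, max(300, round(rng.gauss(680, 60))))
        rows.append(
            {
                "avg_transaction": round(rng.lognormvariate(4.0, 0.5), 2),
                "frequency": rng.randint(1, 30),
                "credit_score": credit_score,
                "vip_status": int(rng.random() < 0.1),
            }
        )
    return rows


# The same random_state reproduces the same rows.
sample = generate_independent_rows(5, random_state=42)
```

Because every column is sampled on its own, any correlation between columns in the real data is lost, which is why a generator of this kind serves only as a baseline.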

`src/workflows.py`

Reusable workflow functions:

create_vip_seed_data(...)
create_corrupted_vip_data(...)
impute_vip_data(...)
compare_numeric_columns(...)
run_generation_workflow(...)
analyze_and_log_experiment(...)
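
The signatures above are elided in this README, so the following is only a guess at the kind of summary a column comparison could produce. The function name `compare_numeric_columns_sketch`, its arguments, and the output keys are hypothetical, and rows are plain dictionaries to keep the sketch dependency-free.

```python
import statistics


def compare_numeric_columns_sketch(real_rows, synth_rows, columns):
    """Return one summary row per column with means, stds, and the mean gap."""
    report = []
    for col in columns:
        real = [row[col] for row in real_rows]
        synth = [row[col] for row in synth_rows]
        real_mean = statistics.fmean(real)
        synth_mean = statistics.fmean(synth)
        report.append(
            {
                "column": col,
                "real_mean": real_mean,
                "synth_mean": synth_mean,
                "mean_diff": synth_mean - real_mean,
                "real_std": statistics.pstdev(real),
                "synth_std": statistics.pstdev(synth),
            }
        )
    return report


# Tiny made-up rows, used only to exercise the function.
real_rows = [{"frequency": 2, "credit_score": 700}, {"frequency": 4, "credit_score": 720}]
synth_rows = [{"frequency": 3, "credit_score": 690}, {"frequency": 5, "credit_score": 710}]
report = compare_numeric_columns_sketch(real_rows, synth_rows, ["frequency", "credit_score"])
```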

`src/analysis.py`

Analysis utilities:

calculate_kl_divergence(...)
recover_parameters(...)
bayesian_fidelity_report(...)
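
As a rough illustration of a KL divergence between a real and a synthetic column, the sketch below estimates it from histograms over shared bins. The binning, the smoothing constant `eps`, and both function names are assumptions. The package's `calculate_kl_divergence` may use a different estimator.

```python
import math


def histogram_probs(values, lo, hi, bins, eps=1e-9):
    """Bin values on [lo, hi] and return smoothed, normalized probabilities."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        idx = min(int((v - lo) / width), bins - 1)  # clamp the right edge
        counts[idx] += 1
    probs = [c / len(values) + eps for c in counts]  # eps avoids log(0)
    norm = sum(probs)
    return [p / norm for p in probs]


def kl_divergence_sketch(real, synth, bins=20):
    """Estimate KL(real || synth) using histograms over a shared range."""
    lo = min(min(real), min(synth))
    hi = max(max(real), max(synth))
    if hi == lo:
        return 0.0  # both samples are a single identical constant
    p = histogram_probs(real, lo, hi, bins)
    q = histogram_probs(synth, lo, hi, bins)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


a = [1, 2, 3, 4, 5] * 10
b = [3, 4, 5, 6, 7] * 10
same = kl_divergence_sketch(a, a)
shifted = kl_divergence_sketch(a, b)
```

Identical samples give a divergence of zero, and the value grows as the synthetic distribution drifts away from the real one. Note that KL divergence is asymmetric, so swapping the arguments generally changes the result.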

`src/processing.py`

Lower-level preprocessing helpers:

introduce_nans(...)
bayesian_impute(...)
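
The corrupt-then-impute round trip can be sketched with the standard library alone. This is a simplified stand-in, not the package's `introduce_nans` or `bayesian_impute`: the function names are hypothetical, and missing values are filled with draws from a normal distribution fitted to the observed values rather than from a full Bayesian model.

```python
import math
import random
import statistics


def introduce_missing(values, fraction, random_state=None):
    """Return a copy of `values` with a fixed fraction replaced by NaN."""
    rng = random.Random(random_state)
    n_missing = round(fraction * len(values))
    missing_idx = set(rng.sample(range(len(values)), n_missing))
    return [math.nan if i in missing_idx else v for i, v in enumerate(values)]


def impute_from_normal(values, random_state=None):
    """Fill NaNs with draws from a normal fitted to the observed values."""
    rng = random.Random(random_state)
    observed = [v for v in values if not math.isnan(v)]
    mu = statistics.fmean(observed)
    sigma = statistics.stdev(observed)  # needs at least two observed values
    return [rng.gauss(mu, sigma) if math.isnan(v) else v for v in values]


clean = [float(v) for v in range(10, 20)]
corrupted = introduce_missing(clean, fraction=0.3, random_state=0)
imputed = impute_from_normal(corrupted, random_state=0)
```

Sampling from a fitted distribution, rather than filling with the mean, keeps the imputed column from having artificially low variance.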

`src/viz.py`

Plotting helpers:

plot_comparison(...)
plot_correlation_heatmap(...)

`src/cli.py`

Command-line entry points:

generate
bootstrap
impute
analyze
run-notebook
export-notebooks
test
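
For readers unfamiliar with subcommand-style CLIs, the sketch below shows how entry points like these are commonly wired with `argparse`. It covers only `generate` and `impute`, and it is not the contents of `src/cli.py`: the flag names are taken from the CLI examples in this README, while the defaults and which flags are required are assumptions.

```python
import argparse


def build_parser():
    """Build a parser with one subparser per command (two shown here)."""
    parser = argparse.ArgumentParser(prog="python -m src.cli")
    subparsers = parser.add_subparsers(dest="command", required=True)

    generate = subparsers.add_parser("generate", help="Generate synthetic data")
    generate.add_argument("--method", required=True)
    generate.add_argument("--samples", type=int, default=1000)  # assumed default
    generate.add_argument("--seed-data")
    generate.add_argument("--epochs", type=int)
    generate.add_argument("--output", required=True)

    impute = subparsers.add_parser("impute", help="Impute missing values")
    impute.add_argument("--input", required=True)
    impute.add_argument("--output", required=True)
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(
    ["generate", "--method", "gan", "--samples", "500", "--output", "out.csv"]
)
```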

Installation

Base package

pip install -e .

Notebook tooling

pip install -e ".[notebooks]"

GAN support

pip install -e ".[gan]"

CLI examples

Bootstrap seed and checkpoint data

python -m src.cli bootstrap \
  --samples 1000 \
  --raw-output data/raw/vip_seed.csv \
  --checkpoint-output data/checkpoints/vip_seed.csv \
  --random-state 42

Generate synthetic data

python -m src.cli generate \
  --method specialized \
  --samples 1000 \
  --seed-data data/raw/vip_seed.csv \
  --output data/synthetic/spec_out.csv

Generate neural GAN data

python -m src.cli generate \
  --method gan \
  --samples 1000 \
  --seed-data data/raw/vip_seed.csv \
  --epochs 200 \
  --output data/synthetic/gan_out.csv

Impute missing values

python -m src.cli impute \
  --input data/checkpoints/vip_seed.csv \
  --output data/synthetic/imputed.csv

Analyze and log experiments with MLflow

python -m src.cli analyze \
  --real data/raw/vip_seed.csv \
  --synth data/synthetic/gan_out.csv \
  --column avg_transaction \
  --plot-output data/synthetic/comparison.png \
  --experiment-dir experiments/latest \
  --experiment-name artificial-data-generation

This analysis command:

computes KL divergence
runs PyMC-based posterior recovery
writes plots and tables into a structured experiment folder
logs metrics and artifacts to MLflow
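
The idea behind posterior recovery can be illustrated without PyMC. The sketch below swaps in a closed-form conjugate update (normal likelihood with known noise standard deviation, normal prior on the mean) in place of PyMC's sampling, so it is a simplified stand-in rather than what the `analyze` command runs. The function name, prior settings, and data are assumptions for illustration.

```python
import statistics


def normal_posterior(data, noise_sd, prior_mean=0.0, prior_sd=100.0):
    """Posterior mean and sd of a normal mean, given a known noise sd."""
    n = len(data)
    prior_precision = 1.0 / prior_sd**2
    data_precision = n / noise_sd**2
    post_var = 1.0 / (prior_precision + data_precision)
    post_mean = post_var * (
        prior_mean * prior_precision + statistics.fmean(data) * data_precision
    )
    return post_mean, post_var**0.5


# Made-up samples standing in for one real and one synthetic column.
real = [9.0, 10.0, 11.0]
synth = [9.5, 10.5, 11.5]
real_mean, real_sd = normal_posterior(real, noise_sd=1.0)
synth_mean, synth_sd = normal_posterior(synth, noise_sd=1.0)
gap = abs(real_mean - synth_mean)
```

If the synthetic data is faithful, the parameters recovered from it should sit close to those recovered from the real data, relative to the posterior uncertainty.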

Experiment outputs

Structured outputs are written under the selected experiment directory, for example:

experiments/latest/
├── plots/
│   ├── avg_transaction_comparison.png
│   ├── frequency_comparison.png
│   ├── credit_score_comparison.png
│   ├── real_correlation.png
│   └── synthetic_correlation.png
└── tables/
    ├── comparison_metrics.csv
    └── bayesian_fidelity.csv
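
A layout like the one above can be produced with `pathlib` and `csv`. The following is a minimal sketch, not the package's own writer: the function name `write_experiment_tables`, the `kl_divergence` column, and the numeric value are placeholders invented for the example.

```python
import csv
import tempfile
from pathlib import Path


def write_experiment_tables(experiment_dir, metrics_rows):
    """Create plots/ and tables/ and write tables/comparison_metrics.csv."""
    root = Path(experiment_dir)
    plots_dir = root / "plots"
    tables_dir = root / "tables"
    plots_dir.mkdir(parents=True, exist_ok=True)
    tables_dir.mkdir(parents=True, exist_ok=True)

    table_path = tables_dir / "comparison_metrics.csv"
    with table_path.open("w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(metrics_rows[0]))
        writer.writeheader()
        writer.writerows(metrics_rows)
    return table_path


# Write into a temporary directory so the example leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp:
    path = write_experiment_tables(
        Path(tmp) / "latest",
        [{"column": "frequency", "kl_divergence": 0.12}],  # placeholder value
    )
    written = path.read_text().splitlines()
    has_plots_dir = (Path(tmp) / "latest" / "plots").is_dir()
```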

MLflow is used to persist:

metrics
run parameters
comparison tables
generated plot artifacts

Current implementation notes

Bayesian generation is implemented as a posterior-inspired multivariate sampler that is optionally conditioned on seed data.
Bayesian analysis uses PyMC for posterior recovery of selected columns.
GAN generation uses a lightweight neural network implemented with PyTorch when available.
If PyTorch is not installed, the GAN generator falls back to a structured simulator so the package remains usable.
MLflow is a required dependency for experiment tracking in the current package implementation.
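
The PyTorch fallback described above is an instance of the optional-dependency pattern. One common way to implement the check is shown below as a sketch: the helper names and the `"torch"` / `"fallback"` return values are hypothetical, and the package may detect PyTorch differently (for example with a `try`/`except ImportError` around the import).

```python
import importlib.util


def module_available(name):
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(name) is not None


def select_gan_backend():
    """Pick the neural backend when PyTorch is installed, else the fallback."""
    return "torch" if module_available("torch") else "fallback"


backend = select_gan_backend()
```

Checking availability up front lets the rest of the code branch on a single value instead of scattering import guards across modules.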

Development

Run tests:

pytest

Show CLI help:

python -m src.cli --help
