A minimal Python Protocol for data-generating processes (DGPs).
The package provides a Protocol (DataGeneratingProcess) with two
members: data (a frozen property returning the observed realization)
and draw(size=...) (a method returning a fresh realization; the DGP
owns its RNG, so draw takes no rng argument). It also provides a
small set of composition primitives (TwoStageDGP, with_data) and
thin convenience wrappers (EmpiricalDGP, ParametricDGP) for working
with DGPs as first-class objects.
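To make the two-member contract concrete, here is a minimal sketch of what a Protocol of this shape could look like. It is not the package's source: DataGeneratingProcessSketch and CoinFlips are hypothetical names, the meaning of size (number of observations) is an assumption, and the standard-library random module stands in for NumPy so the block is self-contained. The draw signature follows the quickstart usage, in which the DGP owns its RNG.

```python
import random
from typing import Any, Optional, Protocol, Tuple, runtime_checkable


@runtime_checkable
class DataGeneratingProcessSketch(Protocol):
    """Hypothetical stand-in for DataGeneratingProcess (not the package's code)."""

    @property
    def data(self) -> Any:
        """The frozen, observed realization."""
        ...

    def draw(self, size: Optional[int] = None) -> Any:
        """A fresh realization; `size` is assumed to be an observation count."""
        ...


class CoinFlips:
    """Toy DGP: i.i.d. fair coin flips. Conforms structurally, no inheritance."""

    def __init__(self, n: int, seed: Optional[int] = None) -> None:
        self._rng = random.Random(seed)  # the DGP owns its RNG
        self._data = tuple(self._rng.randint(0, 1) for _ in range(n))

    @property
    def data(self) -> Tuple[int, ...]:
        return self._data

    def draw(self, size: Optional[int] = None) -> Tuple[int, ...]:
        # Default to a realization of the same length as the observed one.
        n = len(self._data) if size is None else size
        return tuple(self._rng.randint(0, 1) for _ in range(n))


dgp = CoinFlips(n=10, seed=0)
print(isinstance(dgp, DataGeneratingProcessSketch))  # True (checks member names only)
print(len(dgp.data), len(dgp.draw()), len(dgp.draw(size=4)))  # 10 10 4
```

Because the check is structural, a consumer package can satisfy the contract without importing or subclassing anything; it only needs to expose data and draw.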
The package is not a library of working DGPs. Concrete DGPs
live in consumer packages – e.g.
ManifoldGMM ships its
own moment-side DGPs. The role of DGP_Protocol is to define
the contract that lets such consumers interoperate.
The Protocol promotes the stand-in distribution from
Manski’s analog estimation framework (Manski 1988,
Analog Estimation Methods in Econometrics) to a first-class
Python object. In that framework, an estimator is defined by a
population functional plus a sample-based stand-in for the
population; DataGeneratingProcess is that stand-in. Different
stand-ins yield different analog estimators:
- The empirical distribution -> nonparametric plug-in estimators.
- A parametric family fitted to the data -> MLE-style estimators.
- A bootstrap distribution -> bootstrap inference.
- A null-imposed restriction -> constrained estimators.
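The first two items in the list above can be illustrated with one population functional, P(X > 1), evaluated on two different stand-ins. This is a sketch under stated assumptions: EmpiricalStandIn, NormalStandIn, analog_estimate, and tail_prob are hypothetical names written with the standard library, not part of DGP_Protocol, and the functional is approximated by simulating from draw.

```python
import random
import statistics
from typing import Callable, List, Sequence


class EmpiricalStandIn:
    """Hypothetical: resamples the observed data with replacement."""

    def __init__(self, data: Sequence[float], seed: int) -> None:
        self.data = tuple(data)
        self._rng = random.Random(seed)

    def draw(self, size: int) -> List[float]:
        return self._rng.choices(self.data, k=size)


class NormalStandIn:
    """Hypothetical: a normal family fitted to the data by maximum likelihood."""

    def __init__(self, data: Sequence[float], seed: int) -> None:
        self.data = tuple(data)
        self._rng = random.Random(seed)
        self.mu = statistics.fmean(self.data)
        self.sigma = statistics.pstdev(self.data)  # the MLE uses the 1/n variance

    def draw(self, size: int) -> List[float]:
        return [self._rng.gauss(self.mu, self.sigma) for _ in range(size)]


def analog_estimate(stand_in, functional: Callable[[List[float]], float],
                    n_sim: int = 20_000) -> float:
    """Approximate functional(stand-in distribution) by simulation."""
    return functional(stand_in.draw(n_sim))


def tail_prob(xs: List[float]) -> float:
    """Population functional of interest: P(X > 1)."""
    return sum(x > 1.0 for x in xs) / len(xs)


rng = random.Random(0)
observed = [rng.gauss(0.0, 1.0) for _ in range(200)]

empirical = EmpiricalStandIn(observed, seed=1)
normal = NormalStandIn(observed, seed=2)

# Same functional, different stand-ins, different analog estimators.
empirical_hat = analog_estimate(empirical, tail_prob)
parametric_hat = analog_estimate(normal, tail_prob)
print(f"plug-in estimate:    {empirical_hat:.3f}")
print(f"parametric estimate: {parametric_hat:.3f}")
```

The functional is written once and never inspects which stand-in it receives; swapping the stand-in is what changes the estimator.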
pip install DGP_Protocol

The import path is PEP-8 lowercase:

from dgp_protocol import DataGeneratingProcess, EmpiricalDGP, TwoStageDGP

import numpy as np
from dgp_protocol import EmpiricalDGP
data = np.random.default_rng(0).standard_normal(size=(100, 3))
# The DGP owns its own RNG. Pass `seed` for reproducibility;
# `draw()` itself takes no `rng` argument.
dgp = EmpiricalDGP(observation=data, seed=1)
print(dgp.data.shape) # (100, 3) -- the frozen realization
print(dgp.draw().shape) # (100, 3) -- a fresh bootstrap resample
# Rebind to a different realization while keeping the distributional
# structure. The child gets an independent (spawned) Generator.
fresh = dgp.with_data(np.random.default_rng(2).standard_normal(size=(50, 3)))
print(fresh.data.shape) # (50, 3)

For more substantial examples – parametric DGPs, two-stage composition (hierarchical sampling), cluster-block bootstrap – see the test suite under tests/.
The design is intentionally minimal: data + draw are the
only required members. Composition primitives (TwoStageDGP,
with_data) take DGPs and return DGPs without expanding the
Protocol.
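The "DGPs in, DGPs out" idea can be sketched as follows. This is one possible reading of two-stage composition (resample clusters, then resample units within each drawn cluster), not the package's implementation: Resample and TwoStageSketch are hypothetical names, and seeding the child from the parent's random.Random is only a standard-library stand-in for a spawned NumPy Generator.

```python
import random
from typing import Callable, Optional, Sequence, Tuple


class Resample:
    """Hypothetical toy DGP: i.i.d. resampling of `data` with replacement."""

    def __init__(self, data: Sequence, seed: Optional[int] = None) -> None:
        self._data = tuple(data)
        self._rng = random.Random(seed)

    @property
    def data(self) -> Tuple:
        return self._data

    def draw(self, size: Optional[int] = None) -> Tuple:
        k = len(self._data) if size is None else size
        return tuple(self._rng.choices(self._data, k=k))

    def with_data(self, new_data: Sequence) -> "Resample":
        # Same sampling rule, new realization. The child gets its own RNG,
        # seeded from the parent, so the two do not share generator state.
        return Resample(new_data, seed=self._rng.getrandbits(64))


class TwoStageSketch:
    """Hypothetical composition: draw clusters, then draw units within each."""

    def __init__(self, outer: Resample,
                 inner: Callable[[Sequence], Resample]) -> None:
        self._outer = outer
        self._inner = inner

    @property
    def data(self) -> Tuple:
        return self._outer.data

    def draw(self, size: Optional[int] = None) -> Tuple:
        clusters = self._outer.draw(size)  # stage 1: resample whole clusters
        # Stage 2: turn each drawn cluster into a DGP and resample its units.
        return tuple(self._inner(c).draw() for c in clusters)


clusters = ((1, 2, 3), (10, 20), (100,))
outer = Resample(clusters, seed=0)
two_stage = TwoStageSketch(outer, inner=outer.with_data)

sample = two_stage.draw()
print(len(sample))  # 3 -- one resampled cluster per stage-1 draw
print(sample)
```

TwoStageSketch exposes only data and draw, so it satisfies the same two-member contract as its inputs and could itself be passed to another composition.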
The design note that motivated this package lives in the sibling
ManifoldGMM repo at
docs/design/dgp.org – DGP_Protocol was extracted from that
design conversation. See also AGENTS.md for
the package’s scope discipline and the list of intentionally
deferred features.
If you use DGP_Protocol in academic work, please cite it. The
repository’s CITATION.cff is recognised by GitHub and provides
one-click citation export in APA, BibTeX, and other formats from
the repo’s main page.
A BibTeX entry suitable for paper drafts:
@software{ligon_dgp_protocol_2026,
author = {Ligon, Ethan},
title = {DGP\_Protocol: A Protocol for data-generating processes},
year = {2026},
publisher = {GitHub},
url = {https://github.com/ligon/DGP_Protocol},
version = {0.1.0a0},
license = {BSD-3-Clause},
}

BSD 3-Clause (BSD-3-Clause). See the LICENSE file at the root
of this repository. In short: permissive use including commercial,
modification, and redistribution; preserve the copyright notice and
license text in redistributions; no use of the author’s name to
endorse derived products.
Ethan Ligon, UC Berkeley.