-
Notifications
You must be signed in to change notification settings - Fork 0
Dataset
This page is the specification for the open data shipped in data/. Everything here is licensed CC BY 4.0 (code is MIT — see Home). The data is computed, not asserted: every number comes from the stdlib-only engine documented in Engine and API and can be regenerated by anyone following Reproduce. Where we describe what a number means, we are careful to separate what is proven (with citations) from what is empirical (measured here) — see The Research and Results and Findings.
There are four files in data/:
| File | What it is | Format |
|---|---|---|
census_dataset.json |
The headline open-data contribution: the normalization census, with metadata, densities, Wilson CIs, and the smallest-separator witness. | JSON |
census_dataset.csv |
A flat, spreadsheet-friendly subset of the same per-size rows. | CSV |
bgk_lab_results.json |
The companion BGK lab: 7 experiments (conservation, separator onset, perpetual strategy, ι-translation transfer, reduction geometry, parallel speedup, reduction-graph zoo). | JSON |
crabcc_kernel_index.json |
A bonus: symbol-index/call-graph statistics from running the crabcc engine over the Linux 6.18 LTS kernel. | JSON |
We classify closed untyped λ-terms by natural size in the de Bruijn convention:
Under this convention the count of closed terms of each size matches OEIS A275057 (Lescanne 2016) exactly, validated term-for-term — this is what tells us the enumerator is correct. The sequence (sizes
0, 1, 1, 3, 6, 17, 41, 116, 313, 895, 2550, 7450, 21881, 65168, 195370, 591007, ...
The size convention is not cosmetic — it controls the asymptotic answer. See The Research for why (the BGZL-2017 vs David-et-al-2013 contrast).
Every term is sorted into exactly one of four buckets:
| Class | Meaning | Definition |
|---|---|---|
| SN | strongly normalizing | every reduction path terminates |
| SEP | separator | weakly normalizing but not strongly: |
| NWN | no normal form | not weakly normalizing (no normal form is reachable at all) |
| UND | undecided | the engine could not decide within its bounded caps (node_cap=3000, size_cap=200) |
A term is decided if it is SN, SEP, or NWN; decided = SN + SEP + NWN. The UND class is an honest accounting of the engine's limits, not a fifth mathematical category — as size grows, non-SN mass increasingly hides in this tail (the "decidability horizon"; see Results and Findings). How SN/WN/NWN are actually decided (bounded α-quotiented reduction graph: SN ⟺ finite acyclic, ¬SN ⟺ a β-cycle, WN ⟺ a true normal form is reachable) is documented in Engine and API.
Two methods produce the rows, recorded in the method field:
-
exact— full exhaustive enumeration of every closed term of size$n$ , for$n \le 16$ . These counts are ground truth; theirtotalequals the A275057 value for that size. -
montecarlo— for$n \in {17, 18, 20, 22, 25, 28, 32, 38, 45, 55}$ , a uniform random sample of K = 50 000 terms drawn by exact rank-unranking over the size-$n$ population (so each closed term is equally likely). Thetotalfield is then 50 000, not the astronomically large true population.
The Monte-Carlo run became intractable around
For Monte-Carlo rows only, the JSON carries Wilson score 95% confidence intervals (z = 1.96) for the two density proportions of interest. The Wilson interval is used (rather than the naive normal approximation) because the proportions are extreme — very close to 1 for SN, very close to 0 for SEP — where Wilson stays well-behaved. The implementation:
def wilson(k, n, z=1.96):
if n == 0: return [0, 0]
p = k / n; d = 1 + z*z/n
c = (p + z*z/(2*n)) / d
h = z*((p*(1-p)/n + z*z/(4*n*n))**0.5) / d
return [round(max(0, c-h), 6), round(min(1, c+h), 6)]These CIs are computed against the decided count (the denominator), not the raw total. For exact rows the CI fields are null (there is no sampling uncertainty — the whole population was enumerated).
The dataset's flagship object is smallest_separator, the smallest closed term that is weakly but not strongly normalizing. It is recomputed at build time by enumerating size-13 terms and returning the first one that classifies as SEP:
"smallest_separator": {
"natural_size": 13,
"named": "λv0.λv1.v1 (λv0.v0 v0 λv0.v0 v0)"
}That term is NWN.
Top-level keys:
| Key | Meaning |
|---|---|
title |
Human-readable dataset title. |
model |
The size convention, spelled out. |
oeis_closed_term_counts |
The OEIS sequence the closed-term counts match (A275057), and that it is a verified exact match. |
classes |
The four class codes mapped to their definitions. |
method |
One-line description of how SN/WN are decided and the exact-vs-Monte-Carlo split. |
smallest_separator |
The witness object: {natural_size, named} (see above). |
key_findings |
Short prose list of the main takeaways, honestly framed. |
closed_term_counts_A275057 |
Array of closed-term counts |
rows |
Array of per-size records — the actual census table (see below). |
provenance |
Run metadata: engine, host, node_cap, size_cap, K. |
Each element of rows:
| Field | Type | Meaning |
|---|---|---|
n |
int | Natural size of the terms in this row. |
method |
string |
"exact" (full enumeration, "montecarlo" (K=50 000 uniform samples). |
total |
int | Number of terms considered: the full population for exact, K for montecarlo. |
SN |
int | Count classified strongly normalizing. |
SEP |
int | Count classified separator (WN ∧ ¬SN). |
NWN |
int | Count classified not-weakly-normalizing (no normal form). |
UND |
int | Count undecided within the engine caps. |
decided |
int |
SN + SEP + NWN. |
dens_SN_of_total |
float |
SN / total — SN fraction including the undecided tail. |
dens_SN_of_decided |
float |
SN / decided — SN fraction among decided terms. |
dens_SEP_of_decided |
float |
SEP / decided — the separator-density curve. |
dens_NWN_of_decided |
float |
NWN / decided. |
frac_undecided |
float |
UND / total — climbs with |
SN_of_decided_ci95 |
[lo, hi] or null
|
Wilson 95% CI for dens_SN_of_decided; null for exact rows. |
SEP_of_decided_ci95 |
[lo, hi] or null
|
Wilson 95% CI for dens_SEP_of_decided; null for exact rows. |
The CSV is a flat subset of the JSON rows — convenient for spreadsheets and quick plotting. It carries one header row plus one row per size. Note it omits dens_NWN_of_decided and the two Wilson CI columns (those live only in the JSON). Columns, in order:
n, method, total, SN, SEP, NWN, UND, decided,
dens_SN_of_total, dens_SN_of_decided, dens_SEP_of_decided, frac_undecided
Pure standard library, no dependencies:
import json
ds = json.load(open("data/census_dataset.json"))
print(ds["smallest_separator"]) # {'natural_size': 13, 'named': '...'}
for r in ds["rows"]:
if r["method"] == "montecarlo":
lo, hi = r["SEP_of_decided_ci95"] # Wilson 95% CI
print(r["n"], r["dens_SEP_of_decided"], (lo, hi))import csv
with open("data/census_dataset.csv", newline="") as f:
rows = list(csv.DictReader(f)) # all values are strings — cast as neededA companion run of 7 experiments (the "BGK lab", src/bgk_lab.py) that probe the mechanism behind the conjecture — that erasure is the obstruction to WN⟹SN. None of these resolve the open BGK conjecture; they corroborate the mechanism empirically. Full discussion in Contributing and Experiments and The Research.
Top-level keys: meta plus E1…E7.
-
meta—cores,budget_per_idea_s,pythonversion,total_seconds(ran on a 16-core box).
| Key | Experiment | Headline numbers in the file |
|---|---|---|
E1 |
Conservation (λI exhaustive) | In the erasure-free λI-calculus, WN ⟺ SN (Church 1941; Barendregt Thm 9.1.5). total_separators: 0 across total_terms: 3 128 692, up to max_size: 15. |
E2 |
λK separator census | Once erasure is allowed (λK), separators appear: onset_size: 12 (this is the lab's own ad-hoc enumeration convention — distinct from the census's closed-de-Bruijn natural-size 13). Includes rows and example samples. |
E3 |
Reduction geometry (SN terms) | Longest vs shortest reduction lengths (de Vrijer's $L(\cdot)$), blow-up factors; points (4016 SN terms, cols describing them) and per_size summary stats. |
E4 |
Perpetual-strategy law: diverge ⟺ ¬SN |
checked: 762 446, agreements: 762 446, violations: 0. |
E5 |
ι-translation transfer: SN(M) ⟺ SN(ι(M)) | Klop's ι-translation into the non-erasing λI[,] calculus. checked: 710 758, violations: 0, mismatches: []. |
E6 |
Parallel speedup (analyze workload) |
workload: 40 000, rows of timings by core count. |
E7 |
Reduction-graph zoo | A small gallery of terms and their reduction-graph shapes. |
Each experiment also records its own seconds. The rows/points/samples arrays are the raw per-experiment data behind those headline counts.
A note on the two separator sizes. E2's
onset_size: 12and the census's smallest separator at natural size 13 are both correct — they use different size conventions and different term universes. Don't conflate them. The census number is the rigorously specified, OEIS-validated one.
Loading:
import json
lab = json.load(open("data/bgk_lab_results.json"))
print(lab["E1"]["total_separators"], "/", lab["E1"]["total_terms"]) # 0 / 3128692
print(lab["E4"]["violations"], lab["E5"]["violations"]) # 0 0Statistics from running crabcc v6.3 (a Rust symbol-index/call-graph engine for AI agents) over the Linux 6.18 LTS kernel. This is a side artifact, unrelated to the λ-calculus census, included because it is interesting and reproducible. Selected fields:
| Field | Value / meaning |
|---|---|
crabcc_version |
6.3.0 |
kernel |
6.18.35 (LTS) |
index.files |
68 875 files indexed |
index.symbols |
3 872 317 symbols |
index.edges |
5 324 709 |
index.wall_s |
176.7 s to index |
graph |
call-graph build stats (callees, callers, edges, wall_s) |
by_ext |
file counts by extension (.c, .h, .yaml, …) |
symbols |
sample [name, fan, "file:line"] triples (e.g. kmalloc, vfs_read) |
kmalloc_callers |
4 889 |
sccs |
328 mutual-recursion strongly-connected components |
compare |
before/after index sizes across crabcc versions |
import json
k = json.load(open("data/crabcc_kernel_index.json"))
print(k["index"]["symbols"], "symbols in", k["kernel"]) # 3872317 symbols in 6.18.35 (LTS)All four files are regenerable from src/ with Python 3.9+ and no third-party dependencies (see Reproduce). The census JSON/CSV are assembled by src/build_census_dataset.py, which recomputes the OEIS counts and the smallest-separator witness locally rather than copying them in. This is machine-assisted (AI-orchestrated) research, cross-checked against primary sources. We distinguish throughout between what is proven (Church 1941; Barendregt; BGZL 2017; David et al. 2013 — cited in The Research) and what is empirical (everything we measured), and we state the undecided remainder explicitly (the UND class and the decidability horizon) rather than rounding it away.
Data and docs are CC BY 4.0 (data/LICENSE). Suggested attribution:
"λ-Normalization Census" by Peter Lodri (peterlodri-sec), CC BY 4.0, https://github.com/peterlodri-sec/lambda-normalization-census
Contributions of new sizes, tighter caps, alternative deciders, or independent re-runs are very welcome — see Contributing and Experiments. Questions: cabotage@pm.me · newsletter and cross-project notes at crabcc.app/about · support via GitHub Sponsors. See also Results and Findings, Engine and API, and the FAQ.