Skip to content

Dataset

Peter Lodri edited this page Jun 10, 2026 · 1 revision

Dataset

This page is the specification for the open data shipped in data/. Everything here is licensed CC BY 4.0 (code is MIT — see Home). The data is computed, not asserted: every number comes from the stdlib-only engine documented in Engine and API and can be regenerated by anyone following Reproduce. Where we describe what a number means, we are careful to separate what is proven (with citations) from what is empirical (measured here) — see The Research and Results and Findings.

There are four files in data/:

File What it is Format
census_dataset.json The headline open-data contribution: the normalization census, with metadata, densities, Wilson CIs, and the smallest-separator witness. JSON
census_dataset.csv A flat, spreadsheet-friendly subset of the same per-size rows. CSV
bgk_lab_results.json The companion BGK lab: 7 experiments (conservation, separator onset, perpetual strategy, ι-translation transfer, reduction geometry, parallel speedup, reduction-graph zoo). JSON
crabcc_kernel_index.json A bonus: symbol-index/call-graph statistics from running the crabcc engine over the Linux 6.18 LTS kernel. JSON

1. The census: census_dataset.{json,csv}

What is being counted

We classify closed untyped λ-terms by natural size in the de Bruijn convention:

$$|,\text{index } k,| = k+1,\qquad |\lambda M| = 1 + |M|,\qquad |M,N| = 1 + |M| + |N|.$$

Under this convention the count of closed terms of each size matches OEIS A275057 (Lescanne 2016) exactly, validated term-for-term — this is what tells us the enumerator is correct. The sequence (sizes $n = 1, 2, 3, \dots$) is:

0, 1, 1, 3, 6, 17, 41, 116, 313, 895, 2550, 7450, 21881, 65168, 195370, 591007, ...

The size convention is not cosmetic — it controls the asymptotic answer. See The Research for why (the BGZL-2017 vs David-et-al-2013 contrast).

The four classes

Every term is sorted into exactly one of four buckets:

Class Meaning Definition
SN strongly normalizing every reduction path terminates
SEP separator weakly normalizing but not strongly: $\mathrm{WN} \wedge \neg\mathrm{SN}$ — the "WN∖SN gap" the BGK conjecture is about
NWN no normal form not weakly normalizing (no normal form is reachable at all)
UND undecided the engine could not decide within its bounded caps (node_cap=3000, size_cap=200)

A term is decided if it is SN, SEP, or NWN; decided = SN + SEP + NWN. The UND class is an honest accounting of the engine's limits, not a fifth mathematical category — as size grows, non-SN mass increasingly hides in this tail (the "decidability horizon"; see Results and Findings). How SN/WN/NWN are actually decided (bounded α-quotiented reduction graph: SN ⟺ finite acyclic, ¬SN ⟺ a β-cycle, WN ⟺ a true normal form is reachable) is documented in Engine and API.

Exact vs Monte-Carlo

Two methods produce the rows, recorded in the method field:

  • exact — full exhaustive enumeration of every closed term of size $n$, for $n \le 16$. These counts are ground truth; their total equals the A275057 value for that size.
  • montecarlo — for $n \in {17, 18, 20, 22, 25, 28, 32, 38, 45, 55}$, a uniform random sample of K = 50 000 terms drawn by exact rank-unranking over the size-$n$ population (so each closed term is equally likely). The total field is then 50 000, not the astronomically large true population.

The Monte-Carlo run became intractable around $n = 70$ (the bounded analysis is too slow there) — that horizon is itself a finding, not an omission.

Wilson confidence intervals

For Monte-Carlo rows only, the JSON carries Wilson score 95% confidence intervals (z = 1.96) for the two density proportions of interest. The Wilson interval is used (rather than the naive normal approximation) because the proportions are extreme — very close to 1 for SN, very close to 0 for SEP — where Wilson stays well-behaved. The implementation:

def wilson(k, n, z=1.96):
    if n == 0: return [0, 0]
    p = k / n; d = 1 + z*z/n
    c = (p + z*z/(2*n)) / d
    h = z*((p*(1-p)/n + z*z/(4*n*n))**0.5) / d
    return [round(max(0, c-h), 6), round(min(1, c+h), 6)]

These CIs are computed against the decided count (the denominator), not the raw total. For exact rows the CI fields are null (there is no sampling uncertainty — the whole population was enumerated).

Smallest-separator witness

The dataset's flagship object is smallest_separator, the smallest closed term that is weakly but not strongly normalizing. It is recomputed at build time by enumerating size-13 terms and returning the first one that classifies as SEP:

"smallest_separator": {
  "natural_size": 13,
  "named": "λv0.λv1.v1 (λv0.v0 v0 λv0.v0 v0)"
}

That term is $(\lambda a.\lambda b.,b),\Omega$ — an erasing two-argument constant function applied to the diverging combinator $\Omega = (\lambda x.,x,x)(\lambda x.,x,x)$. It reduces to a normal form (the erasing function throws $\Omega$ away) and has a non-terminating path (reduce inside $\Omega$ forever). As far as the literature shows, this witness — and the WN∖SN separator-density curve built around it — appears previously unrecorded. This is an empirical observation from our enumeration, not a theorem. For context, $\Omega$ itself (the smallest closed non-WN term, at natural size 9) appears at $n=9$ in the rows as the first NWN.

census_dataset.json field reference

Top-level keys:

Key Meaning
title Human-readable dataset title.
model The size convention, spelled out.
oeis_closed_term_counts The OEIS sequence the closed-term counts match (A275057), and that it is a verified exact match.
classes The four class codes mapped to their definitions.
method One-line description of how SN/WN are decided and the exact-vs-Monte-Carlo split.
smallest_separator The witness object: {natural_size, named} (see above).
key_findings Short prose list of the main takeaways, honestly framed.
closed_term_counts_A275057 Array of closed-term counts $L(n,0)$ for $n = 1 \dots 30$ (recomputed locally, equals A275057).
rows Array of per-size records — the actual census table (see below).
provenance Run metadata: engine, host, node_cap, size_cap, K.

Each element of rows:

Field Type Meaning
n int Natural size of the terms in this row.
method string "exact" (full enumeration, $n \le 16$) or "montecarlo" (K=50 000 uniform samples).
total int Number of terms considered: the full population for exact, K for montecarlo.
SN int Count classified strongly normalizing.
SEP int Count classified separator (WN ∧ ¬SN).
NWN int Count classified not-weakly-normalizing (no normal form).
UND int Count undecided within the engine caps.
decided int SN + SEP + NWN.
dens_SN_of_total float SN / total — SN fraction including the undecided tail.
dens_SN_of_decided float SN / decided — SN fraction among decided terms.
dens_SEP_of_decided float SEP / decided — the separator-density curve.
dens_NWN_of_decided float NWN / decided.
frac_undecided float UND / total — climbs with $n$ (0 → ~0.097 at $n=55$); the decidability horizon.
SN_of_decided_ci95 [lo, hi] or null Wilson 95% CI for dens_SN_of_decided; null for exact rows.
SEP_of_decided_ci95 [lo, hi] or null Wilson 95% CI for dens_SEP_of_decided; null for exact rows.

census_dataset.csv

The CSV is a flat subset of the JSON rows — convenient for spreadsheets and quick plotting. It carries one header row plus one row per size. Note it omits dens_NWN_of_decided and the two Wilson CI columns (those live only in the JSON). Columns, in order:

n, method, total, SN, SEP, NWN, UND, decided,
dens_SN_of_total, dens_SN_of_decided, dens_SEP_of_decided, frac_undecided

Loading it

Pure standard library, no dependencies:

import json
ds = json.load(open("data/census_dataset.json"))

print(ds["smallest_separator"])           # {'natural_size': 13, 'named': '...'}
for r in ds["rows"]:
    if r["method"] == "montecarlo":
        lo, hi = r["SEP_of_decided_ci95"]  # Wilson 95% CI
        print(r["n"], r["dens_SEP_of_decided"], (lo, hi))
import csv
with open("data/census_dataset.csv", newline="") as f:
    rows = list(csv.DictReader(f))         # all values are strings — cast as needed

2. The BGK lab: bgk_lab_results.json

A companion run of 7 experiments (the "BGK lab", src/bgk_lab.py) that probe the mechanism behind the conjecture — that erasure is the obstruction to WN⟹SN. None of these resolve the open BGK conjecture; they corroborate the mechanism empirically. Full discussion in Contributing and Experiments and The Research.

Top-level keys: meta plus E1E7.

  • metacores, budget_per_idea_s, python version, total_seconds (ran on a 16-core box).
Key Experiment Headline numbers in the file
E1 Conservation (λI exhaustive) In the erasure-free λI-calculus, WN ⟺ SN (Church 1941; Barendregt Thm 9.1.5). total_separators: 0 across total_terms: 3 128 692, up to max_size: 15.
E2 λK separator census Once erasure is allowed (λK), separators appear: onset_size: 12 (this is the lab's own ad-hoc enumeration convention — distinct from the census's closed-de-Bruijn natural-size 13). Includes rows and example samples.
E3 Reduction geometry (SN terms) Longest vs shortest reduction lengths (de Vrijer's $L(\cdot)$), blow-up factors; points (4016 SN terms, cols describing them) and per_size summary stats.
E4 Perpetual-strategy law: diverge ⟺ ¬SN checked: 762 446, agreements: 762 446, violations: 0.
E5 ι-translation transfer: SN(M) ⟺ SN(ι(M)) Klop's ι-translation into the non-erasing λI[,] calculus. checked: 710 758, violations: 0, mismatches: [].
E6 Parallel speedup (analyze workload) workload: 40 000, rows of timings by core count.
E7 Reduction-graph zoo A small gallery of terms and their reduction-graph shapes.

Each experiment also records its own seconds. The rows/points/samples arrays are the raw per-experiment data behind those headline counts.

A note on the two separator sizes. E2's onset_size: 12 and the census's smallest separator at natural size 13 are both correct — they use different size conventions and different term universes. Don't conflate them. The census number is the rigorously specified, OEIS-validated one.

Loading:

import json
lab = json.load(open("data/bgk_lab_results.json"))
print(lab["E1"]["total_separators"], "/", lab["E1"]["total_terms"])  # 0 / 3128692
print(lab["E4"]["violations"], lab["E5"]["violations"])              # 0 0

3. Bonus: crabcc_kernel_index.json

Statistics from running crabcc v6.3 (a Rust symbol-index/call-graph engine for AI agents) over the Linux 6.18 LTS kernel. This is a side artifact, unrelated to the λ-calculus census, included because it is interesting and reproducible. Selected fields:

Field Value / meaning
crabcc_version 6.3.0
kernel 6.18.35 (LTS)
index.files 68 875 files indexed
index.symbols 3 872 317 symbols
index.edges 5 324 709
index.wall_s 176.7 s to index
graph call-graph build stats (callees, callers, edges, wall_s)
by_ext file counts by extension (.c, .h, .yaml, …)
symbols sample [name, fan, "file:line"] triples (e.g. kmalloc, vfs_read)
kmalloc_callers 4 889
sccs 328 mutual-recursion strongly-connected components
compare before/after index sizes across crabcc versions
import json
k = json.load(open("data/crabcc_kernel_index.json"))
print(k["index"]["symbols"], "symbols in", k["kernel"])  # 3872317 symbols in 6.18.35 (LTS)

Provenance and honesty

All four files are regenerable from src/ with Python 3.9+ and no third-party dependencies (see Reproduce). The census JSON/CSV are assembled by src/build_census_dataset.py, which recomputes the OEIS counts and the smallest-separator witness locally rather than copying them in. This is machine-assisted (AI-orchestrated) research, cross-checked against primary sources. We distinguish throughout between what is proven (Church 1941; Barendregt; BGZL 2017; David et al. 2013 — cited in The Research) and what is empirical (everything we measured), and we state the undecided remainder explicitly (the UND class and the decidability horizon) rather than rounding it away.

License and citation

Data and docs are CC BY 4.0 (data/LICENSE). Suggested attribution:

"λ-Normalization Census" by Peter Lodri (peterlodri-sec), CC BY 4.0, https://github.com/peterlodri-sec/lambda-normalization-census

Contributions of new sizes, tighter caps, alternative deciders, or independent re-runs are very welcome — see Contributing and Experiments. Questions: cabotage@pm.me · newsletter and cross-project notes at crabcc.app/about · support via GitHub Sponsors. See also Results and Findings, Engine and API, and the FAQ.

Clone this wiki locally