An empirical study using CPython 3.14 (no GIL, free-threading build).
The removal of the Global Interpreter Lock (GIL) in CPython 3.14 enables true thread-level parallelism for CPU-bound Python code. Multiple approaches can exploit this: imperative shared-state code (with locks), thread-indexed pre-allocated structures (without locks), and pure functional code (with no shared objects at all).
This study addresses a precise question:
Among the advantages provided by free-threading Python, which ones can only be unlocked through functional programming — and not through imperative code?
The answer is not "parallelism": lock-based imperative code can exploit that too. The answer lies in the structural properties of pure functions and the Writer monad:
- Structural correctness without runtime synchronisation. Pure functions carry no shared state. Correctness is a property of the code structure, not of runtime coordination. Imperative code must add a lock; functional code has nothing to lock.
- O(1) coordination cost regardless of work size. Monoid reduce combines results in the main thread after all workers complete. A coarse lock still requires N_THREADS acquisitions; a fine lock requires N_ITEMS acquisitions. Pure functions require zero acquisitions at any scale.
- Lock-free structured accumulation for arbitrary result types. The Writer monad returns structured, multi-field results as plain values with an audit log attached at zero extra cost. A pre-allocated thread-indexed dict can avoid locks for scalar results but requires knowing the worker count upfront, cannot compose, and provides no log. The Writer monad is general, composable, and log-accumulating by design.
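These structural properties can be made concrete with a minimal sketch. The repository's monads.py defines Writer and merge_writers; the following is a hypothetical stand-in (the dict payload, field names, and tuple log are assumptions, not the repo's actual implementation):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Writer:
    value: dict    # structured, multi-field result
    log: tuple     # immutable log; tuple concatenation is the monoid operation

def merge_writers(a: Writer, b: Writer) -> Writer:
    # Field-wise sum of the payloads, concatenation of the logs.
    # Associative with identity Writer({}, ()) -- a monoid, so the main
    # thread can fold any number of worker results in any grouping.
    keys = a.value.keys() | b.value.keys()
    merged = {k: a.value.get(k, 0) + b.value.get(k, 0) for k in keys}
    return Writer(merged, a.log + b.log)

# Workers return plain values; nothing is shared, so nothing needs a lock.
w1 = Writer({"total": 10, "count": 4}, ("worker-0 done",))
w2 = Writer({"total": 5, "count": 2}, ("worker-1 done",))
combined = merge_writers(w1, w2)
```

Because merge_writers is associative, the fold order does not matter; the main thread can combine worker results in completion order.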
Free-threading Python (CPython 3.14, PYTHONGIL=0) is the most significant change
to the CPython runtime since the introduction of the GIL itself. It enables
CPU-bound parallel computation inside a single process without a multiprocessing
boundary. But it also exposes every shared-state read-modify-write operation to race
conditions that the GIL previously masked.
The functional programming approach — pure functions, immutable data flow, monoid reduction, and the Writer monad — provides a principled response: rather than adding locks to make imperative code correct, the design prevents races from arising in the first place. This study provides empirical evidence that under free-threading the functional approach is not merely equivalent to the lock-based approach: it is strictly better on coordination cost and generality.
Race condition invisibility. Under the standard GIL build, shared += val appears
correct because the GIL serialises bytecode execution. Under free-threading the
LOAD/ADD/STORE sequence is non-atomic; interleaved threads produce data loss that
grows with thread count and is silent (no exception raised, no error flag set).
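The non-atomicity can be observed directly. A minimal sketch (thread and iteration counts here are arbitrary, not the repo's config values): under a GIL build the final count is usually exact, while under free-threading the lost updates grow with thread count and no exception is ever raised:

```python
import threading

N_THREADS, N_ITERS = 8, 50_000
shared = {"n": 0}

def worker():
    for _ in range(N_ITERS):
        shared["n"] += 1   # expands to LOAD / ADD / STORE: three interleavable steps

threads = [threading.Thread(target=worker) for _ in range(N_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

expected = N_THREADS * N_ITERS
print(f"expected {expected}, observed {shared['n']}, lost {expected - shared['n']}")
```

The only symptom is a number smaller than expected, which is exactly what Experiment 1 measures as data-loss percentage.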
Coordination cost at scale. Fine-grained locking does not scale. A lock acquired once per work item becomes a hard serialisation bottleneck at high thread counts: threads spend more time waiting for the lock than computing, and wall time approaches the single-thread baseline regardless of core count.
The thread-indexed dict counter-argument. Pre-allocating results[thread_id]
avoids locks for simple cases. This study directly addresses this pattern as a third
experimental baseline and identifies its limitations: it requires pre-coordination on
worker count, cannot accommodate dynamic dispatch, does not compose with chained
computations, and provides no structured log without additional bookkeeping.
P/E-core heterogeneity. The target hardware has P-cores (5200 MHz) and E-cores (4100 MHz). Threads beyond n=16 spill onto E-cores and may show reduced efficiency. All experiments mark this boundary explicitly.
Question: Under free-threading Python, does structural immutability alone — without any runtime lock — guarantee correctness across all thread counts?
Two approaches are swept over THREAD_COUNTS:
| Approach | Mechanism | Expected outcome |
|---|---|---|
| Imperative | shared_counter += 1 (no lock) | Data loss grows with thread count |
| Functional | Per-thread slot + functools.reduce | 0% loss at every thread count |
The experiment quantifies mean, min, and max data-loss percentage for the imperative
approach across N_RUNS repetitions. The functional approach reports a binary
correctness flag (always correct by construction).
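The functional approach can be sketched as follows (constants are illustrative, not the repo's config values): each thread owns exactly one slot, and the main thread folds the slots with functools.reduce after all workers join:

```python
import threading
from functools import reduce
from operator import add

N_THREADS, N_ITEMS = 8, 10_000

results = [0] * N_THREADS   # one slot per thread: disjoint writes, no lock

def worker(tid: int) -> None:
    # Each thread writes only results[tid]; there is no read-modify-write
    # on shared state, so no interleaving can lose an update.
    acc = 0
    for _ in range(N_ITEMS):
        acc += 1
    results[tid] = acc

threads = [threading.Thread(target=worker, args=(i,)) for i in range(N_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Monoid reduce in the main thread: N_THREADS combines total, zero lock acquisitions.
total = reduce(add, results, 0)
```

Correctness here is structural: it holds at every thread count by construction, which is why the experiment can report it as a binary flag.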
Run:
python -m research_freethreading.experiments.exp_1_race_detection
python -m research_freethreading.experiments.exp_1_race_detection --plot

Question: Does the coordination cost of synchronisation primitives prevent threads from exploiting free-threading's parallelism, and is the functional approach the only one that achieves near-linear scaling while remaining correct?
Four approaches are swept over THREAD_COUNTS, measuring wall time and parallel
efficiency ((t1 / n) / t_n):
| Approach | Coordination cost | Correctness |
|---|---|---|
| FP (monoid reduce) | O(1) | Always correct |
| Coarse lock (1 acq/thread) | O(N_threads) | Correct |
| Fine lock (1 acq/item) | O(N_items) | Correct |
| No lock | O(1) | Incorrect (race) |
Efficiency of 1.00× = perfect linear scaling. The fine lock is expected to collapse toward 0.0× at high thread counts as threads serialise on the lock.
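For reference, the efficiency metric reduces to a one-liner (the timings below are made-up numbers for illustration, not measurements):

```python
def parallel_efficiency(t1: float, tn: float, n: int) -> float:
    # (t1 / n) is the ideal wall time at n threads; dividing by the
    # observed wall time tn gives 1.0 for perfect linear scaling.
    return (t1 / n) / tn

# Hypothetical: a 10 s single-thread job finishing in 1.5 s on 8 threads.
eff = parallel_efficiency(10.0, 1.5, 8)   # (10 / 8) / 1.5 ~= 0.83
```

A fine lock that serialises the workers pushes tn back toward t1, driving this ratio toward 1/n regardless of core count.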
Run:
python -m research_freethreading.experiments.exp_2_scaling
python -m research_freethreading.experiments.exp_2_scaling --plot

Question: Can the Writer monad provide lock-free structured accumulation for arbitrary result types — and what does it offer beyond a thread-indexed dict?
Three approaches collect compute_chunk_stats results (a multi-field dict) and a
log string from each worker:
| Approach | Shared object | Lock | Pre-coordination | Log |
|---|---|---|---|---|
| Lock | shared dict + log list | 1 acq/thread | none | explicit append under lock |
| Indexed | results[tid], logs[tid] | none | worker count known upfront | pre-allocated slot |
| Writer | none | none | none | free (monoid) |
All three are correct. The experiment measures wall time across THREAD_COUNTS and
verifies that aggregate sums match the sequential reference for every approach.
The structural advantages of the Writer monad over the Indexed approach are printed
at the end of each run.
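The Writer pattern of Experiment 3 can be sketched with stdlib tools alone. This compute_chunk_stats is a hypothetical stand-in for the one in workload.py, and the (stats, log) tuple stands in for the repo's Writer type:

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def compute_chunk_stats(chunk):
    # Pure worker: returns a (stats, log) pair as a plain value.
    return ({"total": sum(chunk), "count": len(chunk)},
            (f"chunk of {len(chunk)} items summed",))

def merge(a, b):
    # Monoid combine: field-wise sum of stats, concatenated logs.
    stats = {k: a[0][k] + b[0][k] for k in a[0]}
    return (stats, a[1] + b[1])

chunks = [list(range(i, i + 100)) for i in range(0, 1000, 100)]
with ThreadPoolExecutor(max_workers=4) as pool:
    writers = list(pool.map(compute_chunk_stats, chunks))

# No shared dict, no lock, no pre-allocated slots: the fold happens in
# the main thread after every worker has returned its value.
stats, log = reduce(merge, writers)
```

Note that nothing here depends on knowing the worker count upfront, which is the pre-coordination the Indexed approach cannot avoid.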
Run:
python -m research_freethreading.experiments.exp_3_writer_monad
python -m research_freethreading.experiments.exp_3_writer_monad --plot

research_freethreading/
├── __init__.py
├── config.py # THREAD_COUNTS, N_ITEMS, N_RUNS, core lists
├── workload.py # compute_item, compute_chunk_stats — pure kernels
├── monads.py # Writer, merge_writers, Either, Left, Right
├── experiments/
│ ├── __init__.py
│ ├── exp_1_race_detection.py # Imperative vs functional correctness sweep
│ ├── exp_2_scaling.py # FP / coarse lock / fine lock / no-lock scaling
│ └── exp_3_writer_monad.py # Writer monad vs lock vs indexed accumulation
├── tests/
│ ├── __init__.py
│ ├── test_config.py # Constants: types, values, core-list consistency
│ ├── test_workload.py # compute_item, compute_chunk_stats: purity + math
│ └── test_monads.py # Writer monad laws, merge_writers, Either
└── README.md
With the conda environment active, run:
conda env export > environment.yml

This produces a fully pinned environment.yml listing every package and its exact version. Place the file at the root of CategoryTheory/.
Note: conda env export --from-history produces a leaner file with only explicitly installed packages. The full export above is recommended for exact reproducibility across machines.
conda env create -f environment.yml
conda activate <env-name>

The environment name is embedded in the first line of environment.yml.
You can override it at creation time:
conda env create -f environment.yml --name my-env
conda activate my-env

The experiments require a free-threading CPython 3.14 build (PYTHONGIL=0).
Confirm before running:
python -c "import sys; print(sys._is_gil_enabled())"  # must print False

If it prints True, the environment does not use a free-threading build.
Install CPython 3.14t via conda-forge:
conda install -c conda-forge python=3.14 python-freethreading

From the CategoryTheory/ directory, so package-relative imports resolve:
python -m research_freethreading.experiments.exp_1_race_detection
python -m research_freethreading.experiments.exp_2_scaling
python -m research_freethreading.experiments.exp_3_writer_monad

Pass --plot to any experiment to save a PNG figure alongside the script.
python -m pytest research_freethreading/tests/ -v

Tests are platform-independent and pass on standard CPython with or without the GIL; no free-threading build is required for testing.
All experiments target the following machine:
OS: Ubuntu Linux (single-socket)
CPU: Intel Core i7-13700 (13th Gen Raptor Lake)
| Core type | Logical CPUs | Physical cores | HT layout | Max MHz |
|---|---|---|---|---|
| P-cores | 0–15 | 8 | physical: 0,2,4,6,8,10,12,14 / siblings: 1,3,5,7,9,11,13,15 | 5200 |
| E-cores | 16–23 | 8 | none (1 logical CPU per core) | 4100 |
Cache hierarchy:
| Level | Size | Instances | Notes |
|---|---|---|---|
| L1d | 640 KiB | 16 | one per physical core (8 P-cores + 8 E-cores) |
| L1i | 768 KiB | 16 | one per physical core (8 P-cores + 8 E-cores) |
| L2 | 24 MiB | 10 | one per P-core; shared per cluster of 4 E-cores |
| L3 | 30 MiB | 1 | shared across all cores |
os.sched_setaffinity(0, set(range(24))) is called at startup in each experiment
to pin the process to all 24 cores. The call is silently skipped on macOS and Windows.
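A portable version of that startup call might look like this (the helper name and the boolean return are illustrative; the experiments themselves use the hard-coded 24-core set shown above):

```python
import os

def pin_to_cores(cpu_ids) -> bool:
    # os.sched_setaffinity exists only on Linux; mirror the experiments'
    # silent skip on macOS and Windows.
    if not hasattr(os, "sched_setaffinity"):
        return False
    try:
        os.sched_setaffinity(0, set(cpu_ids))
        return True
    except OSError:
        return False   # e.g. a CPU id not present on this machine

pinned = pin_to_cores(range(os.cpu_count() or 1))
```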
| Paper claim | Evidence |
|---|---|
| Structural immutability prevents races without runtime locks | Experiment 1: 0% loss at every thread count, always |
| FP achieves O(1) coordination; fine lock degrades to serial | Experiment 2: efficiency curves across THREAD_COUNTS |
| Writer monad accumulates structured results without a shared object | Experiment 3: all three approaches correct; Writer requires no pre-coordination |
| Thread-indexed dict requires pre-coordination; Writer does not | Experiment 3: Indexed vs Writer comparison and printed structural analysis |