
micknikolic2/functional-freethreading-python


What Advantages of Free-Threading Python Can Only Be Unlocked Through Functional Programming?

An empirical study using CPython 3.14 (no GIL, free-threading build).


Research Scope

The free-threading build of CPython 3.14 removes the Global Interpreter Lock (GIL), enabling true thread-level parallelism for CPU-bound Python code. Multiple approaches can exploit this parallelism: imperative shared-state code (with locks), thread-indexed pre-allocated structures (without locks), and pure functional code (with no shared objects at all).

This study addresses a precise question:

Among the advantages provided by free-threading Python, which ones can only be unlocked through functional programming — and not through imperative code?

The answer is not "parallelism" — locks can exploit that too. The answer lies in the structural properties of pure functions and the Writer monad:

  1. Structural correctness without runtime synchronisation. Pure functions carry no shared state. Correctness is a property of the code structure, not of runtime coordination. Imperative code must add a lock; functional code has nothing to lock.

  2. O(1) coordination cost regardless of work size. Monoid reduce combines results in the main thread after all workers complete. A coarse lock still requires N_THREADS acquisitions; a fine lock requires N_ITEMS acquisitions. Pure functions require zero acquisitions at any scale.

  3. Lock-free structured accumulation for arbitrary result types. The Writer monad returns structured, multi-field results as plain values with an audit log attached at zero extra cost. A pre-allocated thread-indexed dict can avoid locks for scalar results but requires knowing the worker count upfront, cannot compose, and provides no log. The Writer monad is general, composable, and log-accumulating by design.
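Point 2 can be made concrete with a minimal sketch (illustrative only; names such as `work` and `parallel_sum_of_squares` are not from this repository):

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce
import operator

def work(chunk):
    # Pure function: reads only its argument, writes no shared state.
    return sum(x * x for x in chunk)

def parallel_sum_of_squares(items, n_threads=4):
    # One chunk per thread; each worker returns a plain value.
    chunks = [items[i::n_threads] for i in range(n_threads)]
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        partials = list(pool.map(work, chunks))
    # Monoid reduce in the main thread: n_threads combines,
    # zero lock acquisitions, regardless of the number of items.
    return reduce(operator.add, partials, 0)
```

The combine step costs one addition per thread, so coordination is independent of the work size.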


Importance

Free-threading Python (CPython 3.14, PYTHON_GIL=0) is the most significant change to the CPython runtime since the introduction of the GIL itself. It enables CPU-bound parallel computation inside a single process without a multiprocessing boundary. But it also exposes every shared-state read-modify-write operation to race conditions that the GIL previously masked.

The functional programming approach — pure functions, immutable data flow, monoid reduction, and the Writer monad — provides a principled response: rather than adding locks to correct imperative code, the design prevents races from arising. This study provides empirical evidence that the functional approach is not merely equivalent to the lock-based approach under free-threading: it is strictly better on coordination cost and generality.


Challenges

Race condition invisibility. Under the standard GIL build, shared += val appears correct because the GIL serialises bytecode execution. Under free-threading the LOAD/ADD/STORE sequence is non-atomic; interleaved threads produce data loss that grows with thread count and is silent (no exception raised, no error flag set).
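The non-atomicity can be reproduced with a sketch like the following (on a free-threading build the loss is typically large and grows with thread count; on a GIL build the same race exists between bytecodes but is usually masked):

```python
import threading

N_THREADS = 8
N_INCREMENTS = 100_000
counter = 0  # shared mutable state, no lock

def worker():
    global counter
    for _ in range(N_INCREMENTS):
        # Three steps: LOAD counter, ADD 1, STORE counter.
        # Two threads interleaving here silently lose increments.
        counter += 1

threads = [threading.Thread(target=worker) for _ in range(N_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

expected = N_THREADS * N_INCREMENTS
loss_pct = 100 * (expected - counter) / expected
print(f"expected={expected} got={counter} loss={loss_pct:.2f}%")
```

No exception is raised in either case; the only symptom is a final count below `expected`.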

Coordination cost at scale. Fine-grained locking does not scale. A lock acquired once per work item becomes a hard serialisation bottleneck at high thread counts: threads spend more time waiting for the lock than computing, and wall time approaches the single-thread baseline regardless of core count.

The thread-indexed dict counter-argument. Pre-allocating results[thread_id] avoids locks for simple cases. This study directly addresses this pattern as a third experimental baseline and identifies its limitations: it requires pre-coordination on worker count, cannot accommodate dynamic dispatch, does not compose with chained computations, and provides no structured log without additional bookkeeping.

P/E-core heterogeneity. The target hardware has P-cores (5200 MHz) and E-cores (4100 MHz). Threads beyond n=16 spill onto E-cores and may show reduced efficiency. All experiments mark this boundary explicitly.


Experiments

Experiment 1 — Race Condition Detection (exp_1_race_detection.py)

Question: Under free-threading Python, does structural immutability alone — without any runtime lock — guarantee correctness across all thread counts?

Two approaches are swept over THREAD_COUNTS:

| Approach   | Mechanism                            | Expected outcome                  |
|------------|--------------------------------------|-----------------------------------|
| Imperative | `shared_counter += 1` (no lock)      | Data loss grows with thread count |
| Functional | Per-thread slot + `functools.reduce` | 0% loss at every thread count     |

The experiment quantifies mean, min, and max data-loss percentage for the imperative approach across N_RUNS repetitions. The functional approach reports a binary correctness flag (always correct by construction).
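A hedged reconstruction of the functional approach (the repository's `exp_1_race_detection.py` may differ in detail):

```python
import threading
from functools import reduce
import operator

def count_functional(n_threads, n_per_thread):
    # One slot per thread: each index is written by exactly one thread,
    # so no read-modify-write on shared state ever occurs.
    slots = [0] * n_threads

    def worker(tid):
        local = 0                      # purely thread-local accumulation
        for _ in range(n_per_thread):
            local += 1
        slots[tid] = local             # single write to a distinct index

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # functools.reduce combines the per-thread results after all joins.
    return reduce(operator.add, slots, 0)
```

The result is exact at any thread count because correctness follows from the structure, not from synchronisation.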

Run:

python -m research_freethreading.experiments.exp_1_race_detection
python -m research_freethreading.experiments.exp_1_race_detection --plot

Experiment 2 — Scaling Under Thread-Count Sweep (exp_2_scaling.py)

Question: Does the coordination cost of synchronisation primitives prevent threads from exploiting free-threading's parallelism, and is the functional approach the only one that achieves near-linear scaling while remaining correct?

Four approaches are swept over THREAD_COUNTS, measuring wall time and parallel efficiency ((t1 / n) / t_n):

| Approach                   | Coordination cost | Correctness      |
|----------------------------|-------------------|------------------|
| FP (monoid reduce)         | O(1)              | Always correct   |
| Coarse lock (1 acq/thread) | O(N_threads)      | Correct          |
| Fine lock (1 acq/item)     | O(N_items)        | Correct          |
| No lock                    | O(1)              | Incorrect (race) |

Efficiency of 1.00× = perfect linear scaling. The fine lock is expected to collapse toward 0.0× at high thread counts as threads serialise on the lock.

Run:

python -m research_freethreading.experiments.exp_2_scaling
python -m research_freethreading.experiments.exp_2_scaling --plot

Experiment 3 — Writer Monad as Lock-Free Structured Accumulator (exp_3_writer_monad.py)

Question: Can the Writer monad provide lock-free structured accumulation for arbitrary result types — and what does it offer beyond a thread-indexed dict?

Three approaches collect compute_chunk_stats results (a multi-field dict) and a log string from each worker:

| Approach | Shared object           | Lock         | Pre-coordination           | Log                        |
|----------|-------------------------|--------------|----------------------------|----------------------------|
| Lock     | shared dict + log list  | 1 acq/thread | none                       | explicit append under lock |
| Indexed  | results[tid], logs[tid] | none         | worker count known upfront | pre-allocated slot         |
| Writer   | none                    | none         | none                       | free (monoid)              |

All three are correct. The experiment measures wall time across THREAD_COUNTS and verifies that aggregate sums match the sequential reference for every approach. The structural advantages of the Writer monad over the Indexed approach are printed at the end of each run.
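A minimal Writer sketch conveying the idea (the repository's `monads.py` defines its own `Writer` and `merge_writers`; this illustrative version uses assumed field names):

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from functools import reduce

@dataclass(frozen=True)
class Writer:
    value: dict   # structured, multi-field result
    log: tuple    # log entries; tuple concatenation is the log monoid

def merge(a, b):
    # Monoid combine: field-wise sum of the values, concatenation of the logs.
    keys = a.value.keys() | b.value.keys()
    return Writer({k: a.value.get(k, 0) + b.value.get(k, 0) for k in keys},
                  a.log + b.log)

def chunk_stats(chunk):
    # Pure worker: its result AND its log leave as one plain return value,
    # so nothing is shared and nothing needs a lock.
    return Writer({"n": len(chunk), "total": sum(chunk)},
                  (f"processed {len(chunk)} items",))

data = list(range(1_000))
chunks = [data[i::4] for i in range(4)]
with ThreadPoolExecutor(max_workers=4) as pool:
    final = reduce(merge, pool.map(chunk_stats, chunks))
# final.value == {"n": 1000, "total": 499500}; final.log has 4 entries.
```

Unlike the Indexed approach, nothing here depends on knowing the worker count upfront, and two `Writer` values always compose into a third.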

Run:

python -m research_freethreading.experiments.exp_3_writer_monad
python -m research_freethreading.experiments.exp_3_writer_monad --plot

Repository Structure

research_freethreading/
├── __init__.py
├── config.py                          # THREAD_COUNTS, N_ITEMS, N_RUNS, core lists
├── workload.py                        # compute_item, compute_chunk_stats — pure kernels
├── monads.py                          # Writer, merge_writers, Either, Left, Right
├── experiments/
│   ├── __init__.py
│   ├── exp_1_race_detection.py        # Imperative vs functional correctness sweep
│   ├── exp_2_scaling.py               # FP / coarse lock / fine lock / no-lock scaling
│   └── exp_3_writer_monad.py          # Writer monad vs lock vs indexed accumulation
├── tests/
│   ├── __init__.py
│   ├── test_config.py                 # Constants: types, values, core-list consistency
│   ├── test_workload.py               # compute_item, compute_chunk_stats: purity + math
│   └── test_monads.py                 # Writer monad laws, merge_writers, Either
└── README.md

Setup

1. Export the environment (on the original machine)

With the conda environment active, run:

conda env export > environment.yml

This produces a fully pinned environment.yml listing every package and its exact version. Place the file at the root of CategoryTheory/.

Note: conda env export --from-history produces a leaner file with only explicitly installed packages. The full export above is recommended for exact reproducibility across machines.

2. Recreate the environment (on another machine)

conda env create -f environment.yml
conda activate <env-name>

The environment name is embedded in the first line of environment.yml. You can override it at creation time:

conda env create -f environment.yml --name my-env
conda activate my-env

3. Verify the GIL is disabled

The experiments require a free-threading CPython 3.14 build (PYTHON_GIL=0). Confirm before running:

python -c "import sys; print(sys._is_gil_enabled())"  # must print False

If it prints True, the environment does not use a free-threading build. Install CPython 3.14t via conda-forge:

conda install -c conda-forge python=3.14 python-freethreading

4. Run the experiments

From the CategoryTheory/ directory so package-relative imports resolve:

python -m research_freethreading.experiments.exp_1_race_detection
python -m research_freethreading.experiments.exp_2_scaling
python -m research_freethreading.experiments.exp_3_writer_monad

Pass --plot to any experiment to save a PNG figure alongside the script.

5. Run the tests

python -m pytest research_freethreading/tests/ -v

Tests are platform-independent and pass on standard CPython with or without the GIL — no free-threading build required for testing.


Hardware Target

All experiments target the following machine:

OS: Ubuntu Linux (single-socket)
CPU: Intel Core i7-13700 (13th Gen Raptor Lake)

| Core type | Logical CPUs | Physical cores | HT layout                                                    | Max MHz |
|-----------|--------------|----------------|--------------------------------------------------------------|---------|
| P-cores   | 0–15         | 8              | physical: 0,2,4,6,8,10,12,14 / siblings: 1,3,5,7,9,11,13,15  | 5200    |
| E-cores   | 16–23        | 8              | none (1 logical CPU per core)                                | 4100    |

Cache hierarchy:

| Level | Size    | Instances | Notes                                           |
|-------|---------|-----------|-------------------------------------------------|
| L1d   | 640 KiB | 16        | one per logical P-core thread                   |
| L1i   | 768 KiB | 16        | one per logical P-core thread                   |
| L2    | 24 MiB  | 10        | one per P-core; shared per cluster of 4 E-cores |
| L3    | 30 MiB  | 1         | shared across all cores                         |

os.sched_setaffinity(0, set(range(24))) is called at startup in each experiment to pin the process to all 24 logical CPUs. The call is silently skipped on macOS and Windows, where os.sched_setaffinity is unavailable.
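The platform guard can be written as a small helper (a sketch of the pattern, not the repository's exact code):

```python
import os

def pin_to_all_cpus(n_logical=24):
    # os.sched_setaffinity exists only on Linux; AttributeError covers
    # macOS/Windows, and OSError covers machines with fewer CPUs available.
    try:
        os.sched_setaffinity(0, set(range(n_logical)))
    except (AttributeError, OSError):
        pass

pin_to_all_cpus()
```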


Relation to the Paper

| Paper claim | Evidence |
|-------------|----------|
| Structural immutability prevents races without runtime locks | Experiment 1: 0% loss at every thread count, always |
| FP achieves O(1) coordination; fine lock degrades to serial | Experiment 2: efficiency curves across THREAD_COUNTS |
| Writer monad accumulates structured results without a shared object | Experiment 3: all three approaches correct; Writer requires no pre-coordination |
| Thread-indexed dict requires pre-coordination; Writer does not | Experiment 3: Indexed vs Writer comparison and printed structural analysis |
