Reducing memory footprint - what is the purpose of the 'materialise_blocked_pairs' argument for link prediction? #3066

gringer · 2026-05-11T23:07:24Z

I'm hunting for ways to reduce Splink's memory footprint, because we're running up against machine memory limits. Most recently I've been working on implementing staged blocking while keeping things as Splink-like as possible (see #3044), but even with that we're still running into memory issues of various sorts.

In my hunt, I found an argument to the predict function called materialise_blocked_pairs, and was wondering if anyone could explain what it's meant to do. From my reading of the code, it doesn't seem like it would have a beneficial impact. A table is created here, and then it looks like that same table is deleted here.

What's the purpose of doing that? Is it simply to get timings for blocking / linking? Will setting this argument to False speed linking up?

Answered by RobinL

gringer (Author) · 2026-05-12T00:31:59Z

My own testing with this parameter adjustment in Splink for one of our datasets:

    df_predict_full = linker.inference.predict(
        threshold_match_weight=params.get("inference_cut_off_weight", -8),
    )

Link prediction times:
  Stage 1: 3m 7s
  Stage 2: 40s
  Stage 3: 2m 40s

    df_predict_full = linker.inference.predict(
        threshold_match_weight=params.get("inference_cut_off_weight", -8),
        materialise_after_computing_term_frequencies=False,
        materialise_blocked_pairs=False,
    )

Link prediction times:
  Stage 1: 22s
  Stage 2: 7s
  Stage 3: 52s

The time for blocking and link prediction combined without materialisation is less than the time for blocking alone when materialisation is included.

RobinL (Maintainer) · 2026-05-14T18:00:20Z

Thanks, these are useful findings.

One primary reason for blocking, materialising the blocked pairs, and then predicting from the blocked pairs is that it helps parallelisation. Blocking joins do not parallelise well (or at least, how well they parallelise is unpredictable) because SQL engines don't tend to be well optimised for queries that create lots of new rows (i.e. the mini cartesian products you get 'within' blocking rules).

As a result, on big workflows, you often see a failure to parallelise, e.g. due to straggler tasks or other issues.

If you create the blocked pairs first, then you effectively eliminate skew, and the database engine also 'knows' how many rows it's dealing with, so it can split up the workload more efficiently.
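
To illustrate the skew point: each block produces a small cartesian product, so the number of candidate pairs grows quadratically with block size. The following is a minimal sketch with made-up blocking key counts; the data and the function name `pairs_per_block` are hypothetical and are not Splink internals.

```python
from collections import Counter


def pairs_per_block(block_keys):
    """Count candidate pairs generated inside each block (dedupe case).

    A block of n records yields n * (n - 1) / 2 pairwise comparisons.
    """
    sizes = Counter(block_keys)
    return {key: n * (n - 1) // 2 for key, n in sizes.items()}


# Hypothetical blocking key values: one very common value, several rare ones.
keys = ["smith"] * 1000 + ["jones"] * 100 + ["linacre"] * 10 + ["gringer"] * 2

pairs = pairs_per_block(keys)
total = sum(pairs.values())
for key, count in pairs.items():
    print(f"{key:8s} {count:7d} pairs ({count / total:.1%} of the work)")
```

In this toy data the largest block accounts for roughly 99% of the pairs, so whichever task is handed that block becomes the straggler. Once the pairs exist as rows in a table, they can be divided evenly regardless of which block they came from.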

All of this is quite sensitive to a few points:

- Whether you materialise all predictions, or only predictions above some threshold. If you have a very large number of blocked pairs, but only a few of them cross the threshold, then materialising the blocked pairs can impair performance.
- The backend you're using (Spark or DuckDB).
- How intensive your comparisons are relative to your blocking:
  - If you have computationally intensive comparison levels (e.g. Damerau-Levenshtein, see here), then the blocking phase is very quick relative to the predict phase, so you want to optimise the predict calculation.
  - If your comparisons are computationally cheap (e.g. an ExactMatch rule), then it makes less sense to split the work into blocking + materialise, and then predict.

One other aspect to this is whether you're being bottlenecked by disk I/O. The cost of materialising the blocked pairs to disk matters if, for example, you use an EC2 instance with EBS on default settings (throughput, i.e. disk write speed, is slow on these machines, maybe 500 MiB/s). It matters less if you're on a fast SSD (e.g. 8 GiB/s).
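
The two strategies discussed above can be sketched as follows. This uses Python's built-in sqlite3 purely so that it runs anywhere; it says nothing about performance, and the table names, column names, and SQL are illustrative assumptions, not the SQL that Splink generates. It only shows that blocking and scoring in one query, and blocking into an intermediate table that is scored and then dropped, give the same result.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript(
    """
    CREATE TABLE people (id INTEGER PRIMARY KEY, first_name TEXT, dob TEXT);
    INSERT INTO people VALUES
        (1, 'Ann',  '1990-01-01'),
        (2, 'Anne', '1990-01-01'),
        (3, 'Bob',  '1990-01-01'),
        (4, 'Bob',  '1985-07-15'),
        (5, 'Bob',  '1985-07-15');
    """
)

# Blocking join: pairs of records sharing a date of birth.
BLOCKING_SQL = """
    SELECT l.id AS id_l, r.id AS id_r
    FROM people AS l
    JOIN people AS r ON l.dob = r.dob AND l.id < r.id
"""

# Stand-in for the comparison/scoring step: exact match on first name.
SCORE_SQL = """
    SELECT p.id_l, p.id_r, (l.first_name = r.first_name) AS name_match
    FROM {pairs} AS p
    JOIN people AS l ON l.id = p.id_l
    JOIN people AS r ON r.id = p.id_r
    ORDER BY p.id_l, p.id_r
"""

# Option A: blocking and scoring in a single query.
single_query = con.execute(SCORE_SQL.format(pairs=f"({BLOCKING_SQL})")).fetchall()

# Option B: materialise the blocked pairs first, then score from that table.
con.execute(f"CREATE TABLE blocked_pairs AS {BLOCKING_SQL}")
n_pairs = con.execute("SELECT COUNT(*) FROM blocked_pairs").fetchone()[0]
materialised = con.execute(SCORE_SQL.format(pairs="blocked_pairs")).fetchall()
con.execute("DROP TABLE blocked_pairs")  # intermediate table no longer needed
```

With option B the engine knows the row count (`n_pairs`) before the scoring step starts, which is the property described above. The price is writing and reading the intermediate table.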

RobinL (Maintainer) · 2026-05-14T18:12:33Z

Incidentally, chunking in the upcoming Splink 5 should help with memory. But with the DuckDB backend, setting the memory limit pragma should allow you to limit memory usage (though note you want 1-4 GB per thread, so you may also need to limit threads if you do this). In this case, DuckDB should just spill to disk (make sure you set the temp directory to a location on an SSD).
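
As a rough sketch of what those settings look like: `memory_limit`, `threads`, and `temp_directory` are DuckDB configuration options, while the helper `duckdb_resource_statements`, its 1-4 GB per thread check, and the example path are hypothetical and only illustrate the advice above.

```python
def duckdb_resource_statements(memory_limit_gb, threads, temp_directory):
    """Build DuckDB SET statements for a memory cap, thread count and spill location.

    Hypothetical helper for illustration. It also checks the 1-4 GB per
    thread rule of thumb mentioned in the comment above.
    """
    gb_per_thread = memory_limit_gb / threads
    if not 1 <= gb_per_thread <= 4:
        raise ValueError(
            f"{gb_per_thread:.2f} GB per thread is outside the 1-4 GB guideline; "
            "adjust threads or memory_limit_gb"
        )
    return [
        f"SET memory_limit = '{memory_limit_gb}GB'",
        f"SET threads = {threads}",
        f"SET temp_directory = '{temp_directory}'",
    ]


# Example path only; point this at a directory on a fast local SSD.
statements = duckdb_resource_statements(24, 12, "/mnt/fast_ssd/duckdb_tmp")
for statement in statements:
    print(statement)

# With the duckdb package installed, these would be applied along the lines of:
#   con = duckdb.connect()
#   for statement in statements:
#       con.execute(statement)
# and that connection then passed to Splink's DuckDB backend.
```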

There are some notes on progress on Splink 5 here. They're AI-generated, but I scanned them and didn't see anything wrong.

2 replies

gringer May 16, 2026
Author

> setting the memory limit pragma should allow you to limit memory usage

Oh, thanks. Looks like there's yet another setting that we're using wrong. We had a host / container memory resource limit of 24 GiB, and I think we have been setting the DuckDB memory limit pragma to 24 GiB.

I wasn't aware of the "spill to disk" fallback; if DuckDB is trying and failing to allocate more than the container limit before it spills to disk, that would explain a lot about all the memory issues we're having.

RobinL May 16, 2026
Maintainer

There's another important possible memory issue when running on 'non-vanilla' machines: if you're sharing a machine with other users, the amount of RAM reported by the system and the amount actually available to you may differ. DuckDB then tries to grab all the RAM on the whole machine, and you get a hard out-of-memory error rather than a graceful spill to disk.

gringer (Author) · 2026-05-27T04:26:36Z

I've done a bit more testing with the DuckDB memory limit pragma, and it seems like there were a few other intermittent memory issues due to running Splink on a shared server, but in general, thinking about the DuckDB memory limit as a "spill to disk" threshold helped a lot.

Our system was running with a 48 GB host memory limit on the container. When I set the DuckDB pragma to 48 GB with 24 threads, 42 GB with 24 threads, or even 36 GB with 18 threads, we ran into memory issues like this:

    Error was: Out of Memory Error: Failed to allocate block of 1073741824 bytes (bad allocation)

    Possible solutions:
    * Reducing the number of threads (SET threads=X)
    * Disabling insertion-order preservation (SET preserve_insertion_order=false)
    * Increasing the memory limit (SET memory_limit='...GB')

    See also https://duckdb.org/docs/stable/guides/performance/how_to_tune_workloads

Taking into account Robin's comment - and contrary to what that error message suggests - the situation was helped by lowering the memory limit. In our particular case, 12 threads and 24 GB on the shared system is working well so far.

[but I've got another dataset that is about five times the size coming up for testing, so we'll see how well it does with that]
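
For reference, the arithmetic behind the configurations reported above, worked through as a short script. The figures are the ones quoted in this comment; the column names are just labels chosen for illustration.

```python
CONTAINER_LIMIT_GB = 48

# (memory_limit_gb, threads, outcome) as reported in the comment above.
configs = [
    (48, 24, "out of memory"),
    (42, 24, "out of memory"),
    (36, 18, "out of memory"),
    (24, 12, "working"),
]

rows = []
for memory_gb, threads, outcome in configs:
    gb_per_thread = memory_gb / threads
    share_of_container = memory_gb / CONTAINER_LIMIT_GB
    rows.append((memory_gb, threads, gb_per_thread, share_of_container, outcome))
    print(
        f"{memory_gb} GB / {threads} threads: "
        f"{gb_per_thread:.2f} GB per thread, "
        f"{share_of_container:.1%} of the container limit -> {outcome}"
    )
```

All four configurations sit within the 1-4 GB per thread guideline, so in this one report what separates the working setup from the failing ones is the headroom left below the container limit (together with the lower thread count). That is a single data point, not a general rule.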

gringer (Author) · 2026-06-03T04:07:13Z

I ended up shifting to Splink v5.0.0-dev3, which required some tweaks for registering the input tables, but the workflow was otherwise essentially identical.

v5 has substantially increased the linking speed, but didn't really seem to change the memory consumption aspects [but I haven't tried chunking yet...]. I have been able to get our bigger dataset workflow (12.8M rows x 54.3M rows) running on 12 threads, using 36 GB for training, and 24 GB for linking, separated into 15 stages with input table filtering carried out between stages to exclude already-linked records.

36 GB for linking had memory issues:

    Error was: Out of Memory Error: Failed to allocate block of 1073741824 bytes (bad allocation)

24 GB for training had a slightly different memory issue:

    Error was: Out of Memory Error: could not allocate block of size 256.0 KiB (22.3 GiB/22.3 GiB used)
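
The staged workflow mentioned in this comment (linking in stages, with already-linked records filtered out of the inputs between stages) can be sketched in plain Python. This is a toy illustration only: `run_staged_linking` is a hypothetical function, every candidate pair is treated as a link with no scoring step, and none of it reflects the Splink API or the approach in #3044.

```python
def run_staged_linking(left, right, stages):
    """Link two lists of record dicts stage by stage.

    `stages` is an ordered list of key functions (strictest first). Records
    linked in one stage are filtered out of the inputs to later stages.
    """
    links = []
    for stage_number, key_fn in enumerate(stages, start=1):
        # Index the remaining right-hand records by this stage's key.
        index = {}
        for rec in right:
            index.setdefault(key_fn(rec), []).append(rec)

        stage_links = [
            (l_rec["id"], r_rec["id"], stage_number)
            for l_rec in left
            for r_rec in index.get(key_fn(l_rec), [])
        ]
        links.extend(stage_links)

        # Filter the inputs so later stages only see unlinked records.
        linked_left = {l_id for l_id, _, _ in stage_links}
        linked_right = {r_id for _, r_id, _ in stage_links}
        left = [rec for rec in left if rec["id"] not in linked_left]
        right = [rec for rec in right if rec["id"] not in linked_right]
    return links


left = [
    {"id": "L1", "first": "ann", "last": "lee", "dob": "1990-01-01"},
    {"id": "L2", "first": "bob", "last": "ray", "dob": "1985-07-15"},
]
right = [
    {"id": "R1", "first": "ann", "last": "lee", "dob": "1990-01-01"},
    {"id": "R2", "first": "rob", "last": "ray", "dob": "1985-07-15"},
]
stages = [
    lambda r: (r["first"], r["last"], r["dob"]),  # stage 1: strict
    lambda r: (r["last"], r["dob"]),              # stage 2: looser
]
links = run_staged_linking(left, right, stages)
```

Because L1 and R1 are removed after stage 1, the looser stage 2 only has to consider the remaining records, which is where the saving in each later stage comes from.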

4 replies

RobinL Jun 3, 2026
Maintainer

Thanks, these notes are useful:

> the situation was helped by lowering the memory limit. In our particular case, 12 threads and 24 GB on the shared system is working well so far.

Yes - this is both counterintuitive and also not very surprising to me. We should state this more clearly somewhere!

You may also find some of the information here useful - it's not directly relevant but does demonstrate that performance can be counterintuitive:
https://www.robinlinacre.com/optimising_duckdb_performance_large_ec2_instances/

I think, but am not 100% sure, that chunking should help. It should be straightforward to enable, just by adding the num_chunks_left and num_chunks_right arguments as follows:

    pairwise_predictions = linker.inference.predict(num_chunks_left=5, num_chunks_right=5)
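
How Splink 5 implements chunking internally is not shown in this thread. As a rough mental model only, assuming each input is split into chunks and every left chunk is processed against every right chunk, 5 x 5 chunks means 25 smaller batches that together cover the same comparison space. The function names and the hash-based assignment below are assumptions made for illustration.

```python
import zlib
from itertools import product


def chunk_of(record_id, num_chunks):
    """Assign a record to a chunk using a stable hash of its id."""
    return zlib.crc32(str(record_id).encode()) % num_chunks


def chunked_pairs(left_ids, right_ids, num_chunks_left, num_chunks_right):
    """Yield (chunk_l, chunk_r, pairs) for every combination of chunks."""
    left_chunks = {c: [] for c in range(num_chunks_left)}
    right_chunks = {c: [] for c in range(num_chunks_right)}
    for i in left_ids:
        left_chunks[chunk_of(i, num_chunks_left)].append(i)
    for i in right_ids:
        right_chunks[chunk_of(i, num_chunks_right)].append(i)

    for c_l, c_r in product(range(num_chunks_left), range(num_chunks_right)):
        # Only this slice of the comparison space is in play at one time.
        pairs = [(l, r) for l in left_chunks[c_l] for r in right_chunks[c_r]]
        yield c_l, c_r, pairs


left_ids = range(100)
right_ids = range(1000, 1200)
batches = list(chunked_pairs(left_ids, right_ids, 5, 5))
all_pairs = {pair for _, _, pairs in batches for pair in pairs}
largest_batch = max(len(pairs) for _, _, pairs in batches)
print(len(batches), len(all_pairs), largest_batch)
```

Every pair still appears in exactly one batch, but the largest batch is much smaller than the full set of pairs, which is why peak memory can drop.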

gringer Jun 4, 2026
Author

Yes, chunking definitely helps. I've been trying it out now, and it has allowed me to include all our rule sets within a 36 GB memory envelope (I had previously excluded one rule set to get the linking job to run to completion); it's working great!

RobinL Jun 4, 2026
Maintainer

Excellent news, thanks for the feedback. Really useful to know it helps in the real world.

gringer Jun 7, 2026
Author

I've been thinking again about the chunking, and was wondering if functionality for a non-cartesian chunking would be possible as well. An obvious usage example (which I've mentioned a few times in the past) would be blocking on DOB only.

When blocking only on exact matches by DOB, there's no need to compare, for example, January dates of birth with July dates of birth, so instead of doing 144 chunk comparisons when splitting on month it would only be necessary to do 12 chunk comparisons.

Instead of explicitly specifying the split, I could imagine an implementation where blocking variables are combined then hashed, and the split is carried out on that hash. Where the combined hash is different (and the blocking rules aren't using complex SQL that forces a complete search over all records), there's no chance that two records will end up within the same block, so they can be safely partitioned into separate chunks without having to worry about cross comparisons. I'm imagining this being implemented in tandem with the existing chunking by a call something like this:

    pairwise_predictions = linker.inference.predict(hash_chunks=12)

... and after thinking this through, I now see why doing something like this doesn't make sense with a non-staged workflow, because the rule set combinations make it very unlikely that such an approach would work. For example, if there were the following rule sets:

- Date of birth [only]
- First Name + Last Name + Year of birth

Then splitting by the month of birth would interfere with that second rule set. Doing a non-cartesian chunking only makes sense where the rule sets are considered separately.
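
The reasoning above can be checked with a toy example. This is a sketch under stated assumptions: `find_pairs` and the key functions are hypothetical, the records are made up, and nothing here corresponds to an existing Splink feature.

```python
from itertools import combinations


def find_pairs(records, rule_key, chunk_key=None):
    """Return id pairs that agree on rule_key.

    If chunk_key is given, records are first partitioned by it and only
    records in the same partition are compared (the non-cartesian case).
    """
    partitions = {}
    for rec in records:
        part = chunk_key(rec) if chunk_key else None
        partitions.setdefault(part, []).append(rec)

    found = set()
    for members in partitions.values():
        for a, b in combinations(members, 2):
            if rule_key(a) == rule_key(b):
                found.add((a["id"], b["id"]))
    return found


def dob_rule(rec):
    return rec["dob"]


def name_year_rule(rec):
    return (rec["first"], rec["last"], rec["dob"][:4])


def month_of_birth(rec):
    return rec["dob"][5:7]


records = [
    {"id": 1, "first": "ann", "last": "lee", "dob": "1990-01-05"},
    {"id": 2, "first": "anne", "last": "lee", "dob": "1990-01-05"},
    {"id": 3, "first": "bob", "last": "ray", "dob": "1985-01-20"},
    {"id": 4, "first": "bob", "last": "ray", "dob": "1985-07-15"},
]

# Splitting on month: 12 x 12 chunk comparisons versus 12 on the diagonal.
cartesian_chunk_comparisons = 12 * 12
diagonal_chunk_comparisons = 12

# Rule 1 (DOB only): partitioning by month of birth loses nothing.
dob_full = find_pairs(records, dob_rule)
dob_chunked = find_pairs(records, dob_rule, chunk_key=month_of_birth)

# Rule 2 (name + year of birth): records 3 and 4 share a name and year but
# were born in different months, so the month partition separates them.
name_full = find_pairs(records, name_year_rule)
name_chunked = find_pairs(records, name_year_rule, chunk_key=month_of_birth)
```

The DOB rule finds the same pair with or without partitioning, while the name + year rule loses its only pair once records are split by month, which matches the conclusion that non-cartesian chunking only makes sense when rule sets are handled separately.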
