I'm hunting for ways to reduce Splink's memory footprint, because we're running up against machine memory limits. Most recently I've been implementing staged blocking while keeping things as Splink-like as possible (see #3044), but even with that we're still running into memory issues of various sorts. In my hunt, I found a setting in the predict function called What's the purpose of doing that? Is it simply to get timings for blocking / linking? Will setting this argument to
-
In my own testing with this parameter on one of our datasets, blocking and link prediction combined, without materialisation, took less time than blocking alone with materialisation.
-
Thanks, these are useful findings.

One primary reason for blocking, materialising the blocked pairs, and then predicting from the blocked pairs is that it helps parallelisation. Blocking joins do not parallelise well (or at least do so unpredictably) because SQL engines don't tend to be well optimised for queries that create lots of new rows (i.e. the mini Cartesian products you get 'within' blocking rules).

As a result, on big workflows, you often see a failure to parallelise, e.g. due to straggler tasks or other issues.

If you create the blocked pairs first, then you effectively eliminate skew, and the database engine also 'knows' how many rows it's dealing with, so it can split up the workload more efficiently.

All of this is quite sensitive to a few points:
One other aspect to this is whether you're being bottlenecked by disk I/O. Materialising the blocked pairs to disk matters if, e.g., you use an EC2 instance with EBS on default settings (throughput, i.e. disk write speed, is slow on these machines, maybe 500 MiB/s). It matters less if you're on a fast SSD (e.g. 8 GiB/s).
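To make the skew point concrete, here is a minimal sketch in plain Python (not Splink code) that counts the comparisons generated inside each block for a dedupe-style blocking rule. The surname column and its frequencies are made up for illustration; the only thing taken from the discussion is the idea that each block produces a mini Cartesian product.

```python
from collections import Counter


def pairs_per_block(block_keys):
    """Count the within-block record pairs a dedupe-style blocking rule generates.

    A block holding n records yields n * (n - 1) / 2 unordered pairs.
    """
    sizes = Counter(block_keys)
    return {key: n * (n - 1) // 2 for key, n in sizes.items()}


# Hypothetical blocking column: one very common value and two rare ones.
surnames = ["smith"] * 1000 + ["okafor"] * 10 + ["lindqvist"] * 10

pairs = pairs_per_block(surnames)
total = sum(pairs.values())
largest_share = max(pairs.values()) / total

for key, n_pairs in pairs.items():
    print(f"{key}: {n_pairs} pairs")
print(f"largest block share of all pairs: {largest_share:.4f}")
```

Nearly all of the output rows come from a single block, so whichever task picks up that block becomes the straggler. Once the pairs are materialised, the engine is just scanning a table with a known row count, which is much easier to split evenly.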
-
Incidentally, chunking in the upcoming Splink 5 should help with memory.

But with the DuckDB backend, setting the memory limit pragma should already allow you to limit memory usage (though note you want 1-4 GB per thread, so you may also need to limit threads if you do this). In this case, Splink should just spill to disk (make sure you set the temp directory to a location on an SSD).

There are some notes on progress on Splink 5 here. They're AI-generated, but I scanned them and didn't see anything wrong.
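As a rough sketch of the settings mentioned above: the helper below only builds the DuckDB `SET` statements for `memory_limit`, `threads` and `temp_directory`, and warns if the ratio falls outside the 1-4 GB per thread rule of thumb. The function name, the example numbers and the `/mnt/ssd/duckdb_tmp` path are all illustrative assumptions, not anything Splink provides.

```python
import warnings


def duckdb_limit_statements(memory_limit_gb, threads, temp_directory):
    """Build DuckDB SET statements for a memory cap, a thread cap and a spill directory."""
    if memory_limit_gb <= 0 or threads <= 0:
        raise ValueError("memory_limit_gb and threads must be positive")

    # Rule of thumb from the discussion: roughly 1-4 GB of memory per thread.
    gb_per_thread = memory_limit_gb / threads
    if not 1 <= gb_per_thread <= 4:
        warnings.warn(
            f"{gb_per_thread:.2f} GB per thread is outside the suggested 1-4 GB range"
        )

    # Escape single quotes so the path is a valid SQL string literal.
    safe_dir = temp_directory.replace("'", "''")
    return [
        f"SET memory_limit = '{memory_limit_gb}GB'",
        f"SET threads = {threads}",
        f"SET temp_directory = '{safe_dir}'",
    ]


# Hypothetical values: 24 GB across 12 threads, spilling to an SSD-backed path.
statements = duckdb_limit_statements(24, 12, "/mnt/ssd/duckdb_tmp")
for stmt in statements:
    print(stmt)
```

To apply them, you would run each statement with `con.execute(stmt)` on the DuckDB connection you create yourself and then hand to Splink's DuckDB backend, before any linking starts.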
-
I've done a bit more testing with the DuckDB memory limit pragma. It seems there were a few other intermittent memory issues due to running Splink on a shared server, but in general, thinking about the DuckDB memory limit as a "spill to disk" threshold helped a lot.

Our system was running with a 48 GB host memory limit on the container. When I set the DuckDB pragma to 48 GB with 24 threads, 42 GB with 24 threads, or even 36 GB with 18 threads, we ran into memory issues like this:

Taking into account Robin's comment, and contrary to what that error message suggests, the situation was helped by lowering the memory limit. In our particular case, 12 threads and 24 GB on the shared system is working well so far. [But I've got another dataset that is about five times the size coming up for testing, so we'll see how well it does with that.]
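For anyone comparing their own setup, here is a small sketch that tabulates the four configurations above against the 48 GB container limit. The numbers are the ones from this comment; the helper name and the "headroom" framing are my own, and this describes one workload on one shared server, not a general rule.

```python
HOST_LIMIT_GB = 48  # container memory limit reported above

# (DuckDB memory limit in GB, threads, outcome as reported in this thread)
configs = [
    (48, 24, "memory issues"),
    (42, 24, "memory issues"),
    (36, 18, "memory issues"),
    (24, 12, "working so far"),
]


def summarise(limit_gb, threads, host_gb=HOST_LIMIT_GB):
    """Return per-thread memory, share of host memory and headroom left for everything else."""
    return {
        "gb_per_thread": limit_gb / threads,
        "share_of_host": limit_gb / host_gb,
        "headroom_gb": host_gb - limit_gb,
    }


for limit_gb, threads, outcome in configs:
    s = summarise(limit_gb, threads)
    print(
        f"{limit_gb} GB / {threads} threads: "
        f"{s['gb_per_thread']:.2f} GB per thread, "
        f"{s['share_of_host']:.0%} of host, "
        f"{s['headroom_gb']} GB headroom -> {outcome}"
    )
```

All four settings sit inside the 1-4 GB per thread range, so in this case the thing that differs between the failing and working runs is how much of the container's memory was left outside DuckDB's limit.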
-
I ended up shifting to Splink v5.0.0-dev3, which required some tweaks for registering the input tables, but the workflow was otherwise essentially identical. v5 has substantially increased the linking speed, but didn't really seem to change the memory consumption [though I haven't tried chunking yet...].

I have been able to get our bigger dataset workflow (12.8M rows x 54.3M rows) running on 12 threads, using 36 GB for training and 24 GB for linking, separated into 15 stages, with the input tables filtered between stages to exclude already-linked records.

36 GB for linking had memory issues:

24 GB for training had a slightly different memory issue:
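In case the staging idea is unclear, here is a toy sketch of the control flow only: link, drop the linked records from both inputs, then move on to the next stage. It uses exact key matching in plain Python as a stand-in for a real Splink prediction step, and every name and record in it is hypothetical.

```python
def run_stages(left, right, stage_keys):
    """Link left to right one stage at a time, removing linked records between stages.

    Each stage is a key function; records match when their keys are equal.
    This exact-match rule is a placeholder for a real linkage model.
    """
    links = []
    for key in stage_keys:
        # Index the remaining right-hand records by this stage's key.
        index = {}
        for rec in right:
            index.setdefault(key(rec), []).append(rec)

        stage_links = [
            (l_rec["id"], r_rec["id"])
            for l_rec in left
            for r_rec in index.get(key(l_rec), [])
        ]
        links.extend(stage_links)

        # Filter both inputs so later stages never see already-linked records.
        linked_left = {l_id for l_id, _ in stage_links}
        linked_right = {r_id for _, r_id in stage_links}
        left = [rec for rec in left if rec["id"] not in linked_left]
        right = [rec for rec in right if rec["id"] not in linked_right]
    return links, left, right


left = [
    {"id": "a1", "name": "ann", "dob": "1990-01-01", "city": "leeds"},
    {"id": "a2", "name": "bob", "dob": "1985-05-05", "city": "york"},
]
right = [
    {"id": "b1", "name": "ann", "dob": "1990-01-01", "city": "leeds"},
    {"id": "b2", "name": "robert", "dob": "1985-05-05", "city": "york"},
]
stage_keys = [
    lambda r: (r["name"], r["dob"]),  # stage 1: strict rule
    lambda r: (r["dob"], r["city"]),  # stage 2: looser rule on what is left
]

links, left_remaining, right_remaining = run_stages(left, right, stage_keys)
print(links)
```

The point of the filtering is that each later, looser stage runs on smaller inputs, which is what keeps the number of blocked pairs per stage manageable.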