Merge pull request #1877 from moj-analytical-services/document_duckdb_parallelism

Document duckdb parallelism
RobinL committed Jan 23, 2024
2 parents 8e36034 + f7a2846 commit 2e9393d
Showing 3 changed files with 154 additions and 8 deletions.
131 changes: 131 additions & 0 deletions docs/topic_guides/performance/optimising_duckdb.md
@@ -0,0 +1,131 @@
---
tags:
- Performance
- DuckDB
- Salting
- Parallelism
---

## Optimising DuckDB jobs

This topic guide describes how to configure DuckDB to optimise performance.

It is assumed readers have already read the more general [guide to linking big data](./drivers_of_performance.md), and have chosen appropriate blocking rules.

## Summary

- From `splink==3.9.11` onwards, DuckDB generally parallelises jobs well, so you should see 100% usage of all CPU cores for the main Splink operations (parameter estimation and prediction)
- In some cases `predict()` needs salting on `blocking_rules_to_generate_predictions` to achieve 100% CPU use. You're most likely to need this in the following scenarios:
- Very high core count machines
- Splink models that contain a small number of `blocking_rules_to_generate_predictions`
- Splink models that have a relatively small number of input rows (less than around 500k)
- If you are facing memory issues with DuckDB, you have the option of using an on-disk database.
- Reducing the amount of parallelism by removing salting can also sometimes reduce memory usage

You can find a blog post with formal benchmarks of DuckDB performance on a variety of machine types [here](https://www.robinlinacre.com/fast_deduplication/).

## Configuration

### Ensuring 100% CPU usage across all cores on `predict()`

The aim is for the overall parallelism of the `predict()` step to align closely with the number of threads/vCPU cores you have:
- If parallelism is too low, you won't use all your threads
- If parallelism is too high, runtime will be longer.

The overall parallelism (and hence the number of CPU cores that can be kept busy) is given by the following formulas:

$\text{base parallelism} = \frac{\text{number of input rows}}{122,880}$

$\text{blocking rule parallelism} = \text{count of blocking rules} \times \text{number of salting partitions per blocking rule}$

$\text{overall parallelism} = \text{base parallelism} \times \text{blocking rule parallelism}$

If overall parallelism is less than the total number of threads, then you won't achieve 100% CPU usage.

#### Example

Consider a deduplication job with 1,000,000 input rows on a machine with 32 cores (64 threads).

In our Splink settings, suppose we set:

```python
settings = {
    ...
    "blocking_rules_to_generate_predictions": [
        block_on(["first_name"], salting_partitions=2),
        block_on(["dob"], salting_partitions=2),
        block_on(["surname"], salting_partitions=2),
    ],
    ...
}
```

Then we have:

- Base parallelism of 9 (1,000,000 / 122,880 ≈ 8.1, rounded up)
- 3 blocking rules
- 2 salting partitions per blocking rule

We therefore have parallelism of $9 \times 3 \times 2 = 54$, which is less than the 64 threads, so we won't quite achieve full parallelism.
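
As a rough check of this arithmetic, here is a minimal sketch in Python. The helper function is purely illustrative (it is not part of Splink) and rounds base parallelism up so that the numbers match the worked example:

```python
import math

ROWS_PER_UNIT = 122_880  # the constant from the base parallelism formula above


def estimate_overall_parallelism(num_input_rows, salting_partitions_per_rule):
    # Base parallelism: the input split into chunks of ~122,880 rows, rounded up
    base_parallelism = math.ceil(num_input_rows / ROWS_PER_UNIT)
    # Blocking rule parallelism: one unit of work per salted blocking partition
    blocking_rule_parallelism = sum(salting_partitions_per_rule)
    return base_parallelism * blocking_rule_parallelism


# 1,000,000 input rows; 3 blocking rules, each with 2 salting partitions
print(estimate_overall_parallelism(1_000_000, [2, 2, 2]))  # 54, just under 64 threads
```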

### Generalisation

The above formula for overall parallelism assumes all blocking rules have the same number of salting partitions, which is not necessarily the case. In the more general case of variable numbers of salting partitions, the formula becomes

$$
\text{overall parallelism} =
\text{base parallelism} \times \text{total number of salted blocking partitions across all blocking rules}
$$

So, for example, with two blocking rules, if the first has 2 salting partitions and the second has 10 salting partitions, then we would multiply base parallelism by 12.

This may be useful when one of the blocking rules produces more comparisons than another: the 'bigger' blocking rule can be salted more, as shown in the sketch below.
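
For example, following the same style as the settings snippet above (the column names are purely illustrative), the 'bigger' rule can be given more salting partitions:

```python
settings = {
    ...
    "blocking_rules_to_generate_predictions": [
        # Tighter rule producing relatively few comparisons: light salting
        block_on(["first_name", "surname"], salting_partitions=2),
        # Looser rule producing many more comparisons: salt it more heavily
        block_on(["dob"], salting_partitions=10),
    ],
    ...
}
```

With a base parallelism of 9, this would give overall parallelism of $9 \times (2 + 10) = 108$.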

For further information about how parallelism works in DuckDB, including links to relevant DuckDB documentation and discussions, see [here](https://github.com/moj-analytical-services/splink/discussions/1830).

### Running out of memory

If your job is running out of memory, the first thing to consider is tightening your blocking rules, or running the workload on a larger machine.

If these are not possible, the following config options may help reduce memory usage:

#### Using an on-disk database

DuckDB can spill to disk using several settings:

Use the special `:temporary:` connection built into Splink, which creates a temporary on-disk database:

```python
linker = DuckDBLinker(
    df, settings, connection=":temporary:"
)
```

Use an on-disk database:

```python
import duckdb

con = duckdb.connect(database='my-db.duckdb')
linker = DuckDBLinker(
    df, settings, connection=con
)
```

Use an in-memory database, but ensure it can spill to disk:

```python
import duckdb

con = duckdb.connect(":memory:")

con.execute("SET temp_directory='/path/to/temp';")
linker = DuckDBLinker(
    df, settings, connection=con
)
```

See also [this section](https://duckdb.org/docs/guides/performance/how-to-tune-workloads.html#larger-than-memory-workloads-out-of-core-processing) of the DuckDB docs.
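
In addition to a spill location, DuckDB has a `memory_limit` setting that caps how much memory it will use. As an illustrative sketch (the limit value here is arbitrary), it can be combined with either of the connection styles above:

```python
import duckdb

con = duckdb.connect(database='my-db.duckdb')

# Cap DuckDB's memory usage; with an on-disk database, work can spill once the cap is reached
con.execute("SET memory_limit='8GB';")

linker = DuckDBLinker(
    df, settings, connection=con
)
```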

#### Reducing salting

Empirically, we have noticed that there is a tension between parallelism and total memory usage. If you're running out of memory, you could consider reducing parallelism by lowering or removing the salting on your blocking rules, as sketched below.
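
For instance, reusing the earlier example settings, the salting could simply be dropped (a sketch only; the right trade-off will depend on your data and machine):

```python
settings = {
    ...
    "blocking_rules_to_generate_predictions": [
        # No salting_partitions: less parallelism, but a lower peak memory footprint
        block_on(["first_name"]),
        block_on(["dob"]),
        block_on(["surname"]),
    ],
    ...
}
```
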
29 changes: 21 additions & 8 deletions docs/topic_guides/splink_fundamentals/backends/backends.md
@@ -17,29 +17,44 @@ The Splink code you write is almost identical between backends, so it's straight
## Choosing a backend

### Considerations when choosing a SQL backend for Splink

When choosing which backend to use when getting started with Splink, there are a number of factors to consider:

- the size of the dataset(s)
- the amount of boilerplate code/configuration required
- access to specific (sometimes proprietary) platforms
- the backend-specific features offered by Splink
- the level of support and active development offered by Splink

Below is a short summary of each of the backends available in Splink.

### :simple-duckdb: DuckDB

DuckDB is recommended for smaller datasets (1-2 million records), and would be our primary recommendation for getting started with Splink. It is fast, easy to set up, can be run on any device with python installed and it is installed automatically with Splink via `pip install splink`. DuckDB has complete coverage for the functions in the Splink [comparison libraries](../../../comparison_level_library.md) and, as a mainstay of the Splink development team, is actively maintained with features being added regularly.
DuckDB is recommended for most users. It is the fastest backend, and is capable of linking large datasets, especially if you have access to high-spec machines.

As a rough guide it can:

- Link up to around 5 million records on a modern laptop (4 core/16GB RAM)
- Link tens of millions of records very quickly on high-spec cloud machines

For further details, see the results of formal benchmarking [here](https://www.robinlinacre.com/fast_deduplication/).

DuckDB is also recommended because, for many users, it's the simplest to set up.

Often, it's a good idea to start by working with DuckDB on a sample of data, because it will produce results very quickly. When you're comfortable with your model, you may wish to migrate to a big data backend to estimate/predict on the full dataset.
It can be run on any device with python installed and it is installed automatically with Splink via `pip install splink`. DuckDB has complete coverage for the functions in the Splink [comparison libraries](../../../comparison_level_library.md) and, as a mainstay of the Splink development team, is actively maintained with features being added regularly.

See the DuckDB [deduplication example notebook](../../../demos/examples/deduplicate_50k_synthetic.ipynb) to get a better idea of how Splink works with DuckDB.

### :simple-apachespark: Spark

Spark is a system for distributed computing which is great for large datasets (10-100+ million records). It is more involved in terms of configuration, with more boilerplate code than the likes of DuckDB. Spark has complete coverage for the functions in the Splink [comparison libraries](../../../comparison_level_library.md) and, as a mainstay of the Splink development team, is actively maintained with features being added regularly.
Spark is a system for distributed computing which is great for large datasets. It is more involved in terms of configuration, with more boilerplate code than the likes of DuckDB. Spark has complete coverage for the functions in the Splink [comparison libraries](../../../comparison_level_library.md) and, as a mainstay of the Splink development team, is actively maintained with features being added regularly.

If working with Databricks, the Spark backend is recommended, however as the Splink development team does not have access to a Databricks environment there will be instances where we will be unable to provide support.
Spark is primarily recommended for users who either:

- Need to link enormous datasets (100 million records+), and have experienced out-of-memory/out-of-disk problems with DuckDB
- Or have easier access to a Spark cluster than a single high-spec instance to run DuckDB

If working with Databricks, note that the Splink development team does not have access to a Databricks environment, so there will be instances where we will be unable to provide support.

See the Spark [deduplication example notebook](../../../demos/examples/spark/deduplicate_1k_synthetic.ipynb) to get a better idea of how Splink works with Spark.

@@ -119,8 +134,6 @@ Once you have initialised the `linker` object, there is no difference in the sub

```



## Additional Information for specific backends

### :simple-sqlite: SQLite
@@ -139,4 +152,4 @@ However, there are a couple of points to note:
```py
linker = SQLiteLinker(df, settings, ..., register_udfs=False)
```
* As these functions are implemented in Python they will be considerably slower than any native-SQL comparisons. If you find that your model-training or predictions are taking a long time to run, you may wish to consider instead switching to DuckDB (or some other backend).
2 changes: 2 additions & 0 deletions mkdocs.yml
@@ -165,6 +165,8 @@ nav:
- Spark Performance:
- Optimising Spark performance: "topic_guides/performance/optimising_spark.md"
- Salting blocking rules: "topic_guides/performance/salting.md"
- DuckDB Performance:
- Optimising DuckDB performance: "topic_guides/performance/optimising_duckdb.md"
- Documentation:
- Introduction: "documentation_index.md"
- API:
