Merge pull request #711 from moj-analytical-services/transpilation_guide
[Docs] Dev guide to transpilation
RobinL committed Aug 14, 2022
2 parents bf12f31 + 3db9a40 commit 14c1f1a
Showing 2 changed files with 54 additions and 0 deletions.
52 changes: 52 additions & 0 deletions docs/dev_guides/transpilation.md
@@ -0,0 +1,52 @@
# SQL Transpilation in Splink, and how we support multiple SQL backends

In Splink, all the core data linking algorithms are implemented in SQL. This allows computation to be offloaded to a SQL backend of the user's choice.

One difficulty with this paradigm is that SQL implementations differ - the functions available in (say) the Spark dialect of SQL are not the same as those available in DuckDB SQL. To make matters worse, functions with the same name may behave differently (e.g. different arguments, arguments in different orders, etc.).

Splink therefore needs the ability to translate (transpile) between different SQL dialects. We use `sqlglot` for this purpose.
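
To give a flavour of the kind of conversion `sqlglot` performs, here is a minimal example (the exact output string will depend on the installed version of `sqlglot`):

```
import sqlglot

# EPOCH_MS is DuckDB-specific; sqlglot rewrites it into an equivalent
# expression in the Spark dialect
print(sqlglot.transpile("SELECT EPOCH_MS(1618088028295)", read="duckdb", write="spark")[0])
```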

Details are as follows:

### 1. Core data linking algorithms in Splink

Core data linking algorithms are implemented in 'backend agnostic' SQL - that is, written in the dialect that SQLGlot considers to be 'vanilla' SQL. This means the code can be transpiled for a target backend using:

```
import sqlglot

# read=None treats the input as 'vanilla' SQL; write selects the target dialect
sqlglot.transpile(sql, read=None, write=target_backend)
```

It has been possible to write all of the core Splink logic in SQL that is consistent between dialects.

When this SQL is parsed by `SQLGlot` with `dialect` set to any of the target backends (`DuckDB`, `Spark`, etc.), you get the same result (the abstract syntax tree is identical).
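
For example, the following sketch (with hypothetical table and column names) checks that a simple, dialect-agnostic statement parses to the same tree under both dialects; comparing the `repr` of the parsed expressions is just one convenient way to do this:

```
import sqlglot

sql = "select l.name, r.name from df_l as l inner join df_r as r on l.unique_id = r.unique_id"

ast_duckdb = sqlglot.parse_one(sql, read="duckdb")
ast_spark = sqlglot.parse_one(sql, read="spark")

# For dialect-agnostic SQL like this, both parses produce the same tree
assert repr(ast_duckdb) == repr(ast_spark)
```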

On the face of it, this suggests no transpilation is necessary. Unfortunately, this is not the case, because within the Splink config, the user has the opportunity to specify custom SQL expressions, and these expressions may be backend specific (dialect specific).

### 2. User-provided SQL is interpolated into these dialect-agnostic SQL statements

The user provides custom SQL in two places in Splink:

1. Blocking rules
2. The `sql_condition` (see [here](https://moj-analytical-services.github.io/splink/settings_dict_guide.html#sql_condition)) provided as part of a `Comparison`

The user is free to write this SQL however they want. It's common for users to use functions that are only available in their chosen backend.
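
For instance, a user of the DuckDB backend might write something along these lines (an illustrative settings fragment with hypothetical column names; `jaro_winkler_similarity` is a DuckDB string function that other backends may not provide under that name):

```
settings = {
    "link_type": "dedupe_only",
    # Blocking rules are arbitrary user-written SQL
    "blocking_rules_to_generate_predictions": [
        "l.surname = r.surname",
    ],
    "comparisons": [
        {
            "output_column_name": "first_name",
            "comparison_levels": [
                {
                    "sql_condition": "first_name_l IS NULL OR first_name_r IS NULL",
                    "label_for_charts": "Null",
                    "is_null_level": True,
                },
                {
                    # A backend-specific function in a user-supplied sql_condition
                    "sql_condition": "jaro_winkler_similarity(first_name_l, first_name_r) > 0.9",
                    "label_for_charts": "Close match",
                },
                {"sql_condition": "ELSE", "label_for_charts": "All other comparisons"},
            ],
        }
    ],
}
```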

The custom SQL is interpolated into the SQL statements generated by Splink.

### 3. Each backend implements a SQL transpilation step

Each backend implements its own `_execute_sql_against_backend` method - see, for example, [here](https://github.com/moj-analytical-services/splink/blob/bf12f3159fe9287482f93202b946ea12fb3b0a9b/splink/duckdb/duckdb_linker.py#L178) for DuckDB.

Before executing the SQL, this method typically runs the following transpilation step:

```
# Transpile Splink's dialect-agnostic SQL into DuckDB SQL;
# transpile returns a list of statements, so take the first
sql = sqlglot.transpile(sql, read=None, write="duckdb", pretty=True)[0]
```
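
In context, the overall shape of such a method is roughly as follows (a simplified sketch: the method name comes from the Splink source linked below, but the signature, the connection attribute and the result handling shown here are illustrative assumptions rather than Splink's actual implementation):

```
import sqlglot


def _execute_sql_against_backend(self, sql, templated_name, physical_name):
    # Convert the dialect-agnostic SQL (including any interpolated user SQL)
    # into DuckDB's dialect
    sql = sqlglot.transpile(sql, read=None, write="duckdb", pretty=True)[0]
    # Hand the dialect-specific SQL to the backend for execution
    # (self._con is assumed here to be a DuckDB connection)
    return self._con.execute(sql)
```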

See here for each backend:

- [DuckDB](https://github.com/moj-analytical-services/splink/blob/bf12f3159fe9287482f93202b946ea12fb3b0a9b/splink/duckdb/duckdb_linker.py#L178)
- [Spark](https://github.com/moj-analytical-services/splink/blob/bf12f3159fe9287482f93202b946ea12fb3b0a9b/splink/spark/spark_linker.py#L260)
- [Athena](https://github.com/moj-analytical-services/splink/blob/bf12f3159fe9287482f93202b946ea12fb3b0a9b/splink/athena/athena_linker.py#L322)
2 changes: 2 additions & 0 deletions mkdocs.yml
@@ -94,5 +94,7 @@ nav:
- Caching and pipelining: "dev_guides/caching.md"
- Understanding and debugging Splink: "dev_guides/debug_modes.md"
- Spark caching: "dev_guides/spark_pipelining_and_caching.md"
- Transpilation using sqlglot: "dev_guides/transpilation.md"

extra_css:
- css/custom.css
