Merge pull request #711 from moj-analytical-services/transpilation_guide
[Docs] Dev guide to transpilation
# SQL Transpilation in Splink, and how we support multiple SQL backends

In Splink, all the core data linking algorithms are implemented in SQL. This allows computation to be offloaded to a SQL backend of the user's choice.

One difficulty with this paradigm is that SQL implementations differ - the functions available in (say) the Spark dialect of SQL differ from those available in DuckDB SQL. To make matters worse, functions with the same name may behave differently in different dialects (e.g. taking different arguments, or arguments in a different order).

Splink therefore needs the ability to translate (transpile) between different SQL dialects. We use `sqlglot` for this purpose.
Details are as follows:
### 1. Core data linking algorithms in Splink are written in backend-agnostic SQL
Core data linking algorithms are implemented in 'backend agnostic' SQL - that is, written in the dialect that SQLGlot considers to be 'vanilla SQL'. The code can therefore be transpiled using:

```
sqlglot.transpile(sql, read=None, write=target_backend)
```

It has been possible to write all of the core Splink logic in SQL that is consistent between dialects.

When this SQL is parsed by `SQLGlot` with `dialect` set to any of the target backends (`DuckDB`, `Spark`, etc.), you get the same result: the abstract syntax tree is identical.

On the face of it, this suggests no transpilation is necessary. Unfortunately, this is not the case: within the Splink config, the user has the opportunity to specify custom SQL expressions, and these expressions may be backend specific (dialect specific).
### 2. User-provided SQL is interpolated into these dialect-agnostic SQL statements
The user provides custom SQL in two places in Splink:

1. Blocking rules
2. The `sql_condition` (see [here](https://moj-analytical-services.github.io/splink/settings_dict_guide.html#sql_condition)) provided as part of a `Comparison`

The user is free to write this SQL however they want, and it is common for users to use functions only available in their specific backend.
The custom SQL is interpolated into the SQL statements generated by Splink.
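Schematically, the interpolation looks something like the following. This is a simplified sketch, not Splink's actual templating code, and the column names are made up for illustration:

```python
# A user-supplied blocking rule, written in whatever dialect their backend speaks
blocking_rule = "l.first_name = r.first_name and levenshtein(l.surname, r.surname) <= 2"

# A Splink-generated, backend-agnostic statement with the user's expression interpolated in
sql = f"""
select l.unique_id as unique_id_l, r.unique_id as unique_id_r
from df as l
inner join df as r
on {blocking_rule}
"""
print(sql)
```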
### 3. Each backend implements a SQL transpilation step
Each backend implements its own `_execute_sql_against_backend` method, e.g. see [here](https://github.com/moj-analytical-services/splink/blob/bf12f3159fe9287482f93202b946ea12fb3b0a9b/splink/duckdb/duckdb_linker.py#L178) for DuckDB.
This method typically runs the following transpilation step before executing the SQL:

```
sql = sqlglot.transpile(sql, read=None, write="duckdb", pretty=True)[0]
```
See here for each backend:

- [DuckDB](https://github.com/moj-analytical-services/splink/blob/bf12f3159fe9287482f93202b946ea12fb3b0a9b/splink/duckdb/duckdb_linker.py#L178)
- [Spark](https://github.com/moj-analytical-services/splink/blob/bf12f3159fe9287482f93202b946ea12fb3b0a9b/splink/spark/spark_linker.py#L260)
- [Athena](https://github.com/moj-analytical-services/splink/blob/bf12f3159fe9287482f93202b946ea12fb3b0a9b/splink/athena/athena_linker.py#L322)