Merge pull request #711 from moj-analytical-services/transpilation_guide
[Docs] Dev guide to transpilation
RobinL committed Aug 14, 2022
2 parents bf12f31 + 3db9a40 commit 14c1f1a
Showing 2 changed files with 54 additions and 0 deletions.
52 changes: 52 additions & 0 deletions docs/dev_guides/transpilation.md
@@ -0,0 +1,52 @@
# SQL Transpilation in Splink, and how we support multiple SQL backends

In Splink, all the core data linking algorithms are implemented in SQL. This allows computation to be offloaded to a SQL backend of the user's choice.

One difficulty with this paradigm is that SQL implementations differ - the functions available in (say) the Spark dialect of SQL are not the same as those available in DuckDB SQL. To make matters worse, functions with the same name may behave differently (e.g. different arguments, arguments in different orders, etc.).

Splink therefore needs the ability to translate (transpile) between different SQL dialects. We use `sqlglot` for this purpose.
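
To give a flavour of the kind of conversion `sqlglot` performs, here is a minimal example (the exact output string will depend on the installed version of `sqlglot`):

```
import sqlglot

# EPOCH_MS is DuckDB-specific; sqlglot rewrites it into an equivalent
# expression in the Spark dialect
print(sqlglot.transpile("SELECT EPOCH_MS(1618088028295)", read="duckdb", write="spark")[0])
```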

Details are as follows:

### 1. Core data linking algorithms in Splink

Core data linking algorithms are implemented in 'backend agnostic' SQL - that is, written in the dialect that SQLGlot considers to be 'vanilla' SQL. This means the code can be transpiled for a target backend using:

```
import sqlglot

# read=None treats the input as 'vanilla' SQL; write selects the target dialect
sqlglot.transpile(sql, read=None, write=target_backend)
```

It has been possible to write all of the core Splink logic in SQL that is consistent between dialects.

When this SQL is parsed by `SQLGlot` with `dialect` set to any of the target backends (`DuckDB`, `Spark`, etc.), you get the same result (the abstract syntax tree is identical).
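
For example, the following sketch (with hypothetical table and column names) checks that a simple, dialect-agnostic statement parses to the same tree under both dialects; comparing the `repr` of the parsed expressions is just one convenient way to do this:

```
import sqlglot

sql = "select l.name, r.name from df_l as l inner join df_r as r on l.unique_id = r.unique_id"

ast_duckdb = sqlglot.parse_one(sql, read="duckdb")
ast_spark = sqlglot.parse_one(sql, read="spark")

# For dialect-agnostic SQL like this, both parses produce the same tree
assert repr(ast_duckdb) == repr(ast_spark)
```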

On the face of it, this suggests no transpilation is necessary. Unfortunately, this is not the case, because within the Splink config, the user has the opportunity to specify custom SQL expressions, and these expressions may be backend specific (dialect specific).

### 2. User-provided SQL is interpolated into these dialect-agnostic SQL statements

The user provides custom SQL in two places in Splink:

1. Blocking rules
2. The `sql_condition` (see [here](https://moj-analytical-services.github.io/splink/settings_dict_guide.html#sql_condition)) provided as part of a `Comparison`

The user is free to write this SQL however they want. It's common for users to use functions that are only available in their chosen backend.
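
For instance, a user of the DuckDB backend might write something along these lines (an illustrative settings fragment with hypothetical column names; `jaro_winkler_similarity` is a DuckDB string function that other backends may not provide under that name):

```
settings = {
    "link_type": "dedupe_only",
    # Blocking rules are arbitrary user-written SQL
    "blocking_rules_to_generate_predictions": [
        "l.surname = r.surname",
    ],
    "comparisons": [
        {
            "output_column_name": "first_name",
            "comparison_levels": [
                {
                    "sql_condition": "first_name_l IS NULL OR first_name_r IS NULL",
                    "label_for_charts": "Null",
                    "is_null_level": True,
                },
                {
                    # A backend-specific function in a user-supplied sql_condition
                    "sql_condition": "jaro_winkler_similarity(first_name_l, first_name_r) > 0.9",
                    "label_for_charts": "Close match",
                },
                {"sql_condition": "ELSE", "label_for_charts": "All other comparisons"},
            ],
        }
    ],
}
```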

The custom SQL is interpolated into the SQL statements generated by Splink.

### 3. Each backend implements a SQL transpilation step

Each backend implements its own `_execute_sql_against_backend` method - see, for example, [here](https://github.com/moj-analytical-services/splink/blob/bf12f3159fe9287482f93202b946ea12fb3b0a9b/splink/duckdb/duckdb_linker.py#L178) for DuckDB.

Before executing the SQL, this method typically runs the following transpilation step:

```
# Transpile Splink's dialect-agnostic SQL into DuckDB SQL;
# transpile returns a list of statements, so take the first
sql = sqlglot.transpile(sql, read=None, write="duckdb", pretty=True)[0]
```
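
In context, the overall shape of such a method is roughly as follows (a simplified sketch: the method name comes from the Splink source linked below, but the signature, the connection attribute and the result handling shown here are illustrative assumptions rather than Splink's actual implementation):

```
import sqlglot


def _execute_sql_against_backend(self, sql, templated_name, physical_name):
    # Convert the dialect-agnostic SQL (including any interpolated user SQL)
    # into DuckDB's dialect
    sql = sqlglot.transpile(sql, read=None, write="duckdb", pretty=True)[0]
    # Hand the dialect-specific SQL to the backend for execution
    # (self._con is assumed here to be a DuckDB connection)
    return self._con.execute(sql)
```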

See here for each backend:

- [DuckDB](https://github.com/moj-analytical-services/splink/blob/bf12f3159fe9287482f93202b946ea12fb3b0a9b/splink/duckdb/duckdb_linker.py#L178)
- [Spark](https://github.com/moj-analytical-services/splink/blob/bf12f3159fe9287482f93202b946ea12fb3b0a9b/splink/spark/spark_linker.py#L260)
- [Athena](https://github.com/moj-analytical-services/splink/blob/bf12f3159fe9287482f93202b946ea12fb3b0a9b/splink/athena/athena_linker.py#L322)
2 changes: 2 additions & 0 deletions mkdocs.yml
@@ -94,5 +94,7 @@ nav:
- Caching and pipelining: "dev_guides/caching.md"
- Understanding and debugging Splink: "dev_guides/debug_modes.md"
- Spark caching: "dev_guides/spark_pipelining_and_caching.md"
- Transpilation using sqlglot: "dev_guides/transpilation.md"

extra_css:
- css/custom.css
