Remove `_pipeline` from linker and refactor CTE pipeline #2069
Conversation
I think this is great, looks like a big QoL improvement! Treating input frames in this way also feels a lot clearer - I used to get a bit tripped up by the old way, as it didn't line up between when you need them and when you use them.
Haven't looked at everything in detail, but I'm happy that the shape of this is good, and sure we can pick up any small issues if any arise.
This PR:
- Removes `linker._pipeline`, so that all SQL operations create and use fresh pipeline(s)
- Removes `linker._enqueue_sql`
- Removes `linker._execute_sql_pipeline`
- Removes `_initialise_df_concat` and `_initialise_df_concat_with_tf`, instead relying on more explicit calculations using functions in `vertically_concatenate.py`
- Removes the need to pass `input_dataframes` to `sql_pipeline_to_splink_dataframe`; instead, input dataframes can be added to a `CTEPipeline` with `.append_input_dataframe`. Users can therefore add dataframes in places that make the logic flow more clearly
- Makes each pipeline single-use; the `pipeline.spent` property enforces this
Motivation for this PR
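To make the single-use behaviour concrete, here is a minimal sketch of a pipeline with `append_input_dataframe` and a `spent` guard. Only the names `CTEPipeline`, `.append_input_dataframe`, and `.spent` come from this PR; the class body, the `SpentPipelineError`, and the SQL-assembly logic are hypothetical, not Splink's actual implementation.

```python
class SpentPipelineError(Exception):
    pass


class CTEPipeline:
    """Hypothetical single-use SQL pipeline (sketch, not Splink's code)."""

    def __init__(self):
        self._input_dataframes = []
        self._queued_sql = []
        self._spent = False

    @property
    def spent(self):
        return self._spent

    def _check_not_spent(self):
        if self._spent:
            raise SpentPipelineError(
                "This pipeline has already been executed; create a fresh one"
            )

    def append_input_dataframe(self, df):
        self._check_not_spent()
        self._input_dataframes.append(df)

    def enqueue_sql(self, sql, output_table_name):
        self._check_not_spent()
        self._queued_sql.append((sql, output_table_name))

    def generate_cte_sql(self):
        # Fold the queued steps into one WITH ... chain and mark the
        # pipeline as spent so it cannot be reused afterwards.
        self._check_not_spent()
        self._spent = True
        *ctes, (final_sql, _) = self._queued_sql
        if not ctes:
            return final_sql
        with_clause = ", ".join(f"{name} AS ({sql})" for sql, name in ctes)
        return f"WITH {with_clause} {final_sql}"
```

Because every use-site creates its own pipeline, a stale or half-consumed pipeline can no longer leak between operations; any attempt to reuse one fails loudly.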
Consider the existing `_initialise_df_concat_with_tf`. The return type and the mutations of state it performs are confusing:
- With `materialise=True`, it returns a Splink dataframe
- With `materialise=False`, it enqueues SQL on `linker._pipeline` and returns None

So it relies on:
- mutating the shared `linker._pipeline`
- the caller later executing `linker._pipeline`, and reusing it

This function is not really compatible with the idea of using a fresh SQL pipeline each time we want to queue SQL. You'd need to pass a pipeline in, but it's not clear what comes out.
By allowing input tables to be queued directly onto the `CTEPipeline`, we can write a new function `linker._enqueue_df_concat_with_tf` which:
- With `materialise=True`, runs the SQL and returns a `CTEPipeline` with the result already enqueued
- With `materialise=False`, enqueues the SQL to the pipeline without running it, and returns a `CTEPipeline`

This allows us to replace all uses of `_initialise_df_concat_with_tf`.
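A sketch of the new shape, where both branches hand back a pipeline. The `FakeDBAPI` and the stripped-down `CTEPipeline` are hypothetical stand-ins so the example is self-contained; this is not Splink's actual code.

```python
class CTEPipeline:
    """Stripped-down stand-in pipeline for illustration only."""

    def __init__(self):
        self.input_dataframes = []
        self.queued_sql = []

    def append_input_dataframe(self, df):
        self.input_dataframes.append(df)

    def enqueue_sql(self, sql, name):
        self.queued_sql.append((sql, name))


class FakeDBAPI:
    def sql_pipeline_to_splink_dataframe(self, pipeline):
        # Stand-in for executing the pipeline against a real backend.
        return f"df({len(pipeline.queued_sql)} steps)"


def enqueue_df_concat_with_tf(db_api, pipeline, materialise=True):
    pipeline.enqueue_sql("select ...", "__splink__df_concat_with_tf")
    if not materialise:
        # SQL is queued but not run; the caller gets the pipeline back.
        return pipeline
    # Run the queued SQL, then return a fresh pipeline with the
    # materialised result already attached as an input dataframe.
    result = db_api.sql_pipeline_to_splink_dataframe(pipeline)
    fresh = CTEPipeline()
    fresh.append_input_dataframe(result)
    return fresh
```

Either way the caller holds a `CTEPipeline` it can keep building on, so the return type no longer depends on the `materialise` flag.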
Also closes #1696
Reviewing
The main changes have been on:
All the other changes are just downstream consequences of changing those files.
I was getting weirdness with the tests, so you'll see I had to reset the cache.
Related possible future PRs
Add a `pipeline.to_splink_dataframe(db_api)` method? This feels clearer than having to call `db_api.sql_pipeline_to_splink_dataframe(pipeline)`.