Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove _pipeline from linker and refactor CTE pipeline #2069

Merged
merged 62 commits into from
Mar 27, 2024
Merged

Conversation

RobinL
Copy link
Member

@RobinL RobinL commented Mar 18, 2024

This PR:

  • eliminates the shared linker._pipeline, so that all SQL operations create and use fresh pipeline(s).
  • eliminated associated linker methods linker._enqueue_sql. linker._execute_sql_pipeline
  • all SQL execution now happens via a pipeline that gets passed to database_api.sql_pipeline_to_splink_dataframe
  • eliminates confusing linker methods _initialise_df_concat and _initialise_df_concat_with_tf, instead relying on more explicit calculations using functions in vertically_concatenate.py
  • clarifies the naming of pipeline classes to be more precise (Task is now CTE, and SQLPipeline is now CTEPipeline)
  • removes the option of passing input_dataframes to sql_pipeline_to_splink_dataframe, instead, input dataframes can be added to a CTEPipeline with .append_input_dataframe. Users can therefore add dataframes in places that make the logic flow more clearly
  • Pipelines are now 'single use'. Once executed, they will not be allowed to be re-used; a fresh one has to be created. A pipeline.spent property enforces this.

Motivation for this PR

Consider the existing _initialise_df_concat_with_tf.

The return type and the mutations of state it performs are confusing:

  • If you set materialise=True it returns a Splink dataframe.
  • If you set materialise=False, it enqueues sql on linker._pipeline() and returns None

So it relies on:

  • the ability to mutate linker._pipeline
  • The ability to execute linker._pipeline, and reuse it

This function is not really compatible with the idea of using a fresh SQL pipeline each time we want to queue sql. You'd need to pass a pipeline in, but it's not clear what comes out.

By allowing input tables to be queued directly onto the CTEPipeline, then we can write a new function linker._enqueue_df_concat_with_tf which:

  • If you set materialise=True, runs the SQL, and returns a CTEPipeline with the result already enqueued
  • If you set materialise=False, enqueues the SQL without running it to the pipeline, and returns a CTEPipeine

Thus allowing us to replace all use of _initialise_df_concat_with_tf

Also closes #1696

Reviewing

The main changes have been on:

  • pipeline.py
  • linker.py
  • database_api.py

All the other changes are all just downstream consequences of changing those files

I was getting weirdness with the tests so you'll see i had to reset the cache

Related possible future PRs

  • Should the pipeline have a pipeline.to_splink_dataframe(db_api) method? This feels clearer than having to call db_api.sql_pipeline_to_splink_dataframe(pipeline)

@RobinL RobinL changed the base branch from master to splink4_dev March 18, 2024 10:36
@RobinL RobinL requested a review from ADBond March 26, 2024 10:14
@RobinL RobinL changed the title (WIP) CTE pipeline Remove _pipeline from linker and refactor CTE pipeline Mar 26, 2024
Copy link
Contributor

@ADBond ADBond left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is great, looks like a big QoL improvement! Feels a lot clearer treating input frames in this way also - used to often get a bit tripped up by the old way, as it didn't line up between when you need/use them.

Haven't looked in detail at everything, but happy that the shape of this is good, and sure we can pick up any small issues if any arise

@RobinL RobinL merged commit 834f1e2 into splink4_dev Mar 27, 2024
11 checks passed
@RobinL RobinL deleted the cte_pipeline branch March 27, 2024 09:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants