Replies: 1 comment
-
Without having looked in much detail, I think the issue you are running into here is to do with caching. For performance reasons, Splink keeps a record of tables it has already computed, so that if they are required at other steps of the pipeline they can be read from the database rather than re-computed. So I think the trouble is that you delete all the Splink tables, but the linker does not know that these tables no longer exist, so you hit an error when it tries to read from one. There is a linker method that will handle this for you (see the sketch below).

Also worth mentioning that if you create a new linker object, it will use a separate cache and a different set of tables, which might be useful if you need to do separate linkage runs without deleting everything.
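A minimal sketch of that reset, assuming the method being referred to is `invalidate_cache()` (the Splink 3 `Linker` method that clears its record of previously computed tables):

```python
# Assumption: the linker method referred to above is invalidate_cache().
# It clears Splink's internal record of materialised tables, so later
# pipeline steps recompute them instead of trying to read stale entries
# from the database.
linker.invalidate_cache()
```

After calling it, previously cached tables are recomputed on demand rather than read back, so manually deleting them should no longer produce missing-table errors.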
-
Hi all,
I was able to figure out that Splink creates tables whose names are prefixed with "__splink" and writes them to a default location.
In my case, I use Databricks with a 12.2 LTS ML compute, and I found that these tables are being written to the DBFS location
"dbfs:/user/hive/metastore"
I then proceeded to manually delete these tables (there were hundreds of them from previous runs) using the code below:
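A minimal sketch of that cleanup, assuming the tables sit in the default database and share the `__splink` prefix (both assumptions; adjust to wherever the tables actually live):

```python
# Sketch: drop every table whose name starts with the Splink prefix.
# Assumes a Databricks notebook where `spark` is predefined, and that
# the tables live in the "default" database; adjust as needed.
for t in spark.sql("SHOW TABLES IN default").collect():
    if t.tableName.startswith("__splink"):
        spark.sql(f"DROP TABLE IF EXISTS default.{t.tableName}")
```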
Once these tables were deleted, my previously tested solution failed at the estimate u step.
In that code, I had added a call to the `_set_catalog_and_database_if_not_provided` function and specified where I would like these intermediate tables to be written.
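For what it's worth, Splink 3's `SparkLinker` also exposes this through its `catalog` and `database` constructor arguments; a sketch (the import path matches recent Splink 3 releases, and `df`, `settings`, and the catalog/database names are placeholders):

```python
from splink.spark.linker import SparkLinker

# Sketch: point Splink's intermediate __splink__ tables at an explicit
# catalog and database instead of the default metastore location.
# `df` and `settings` stand in for your input DataFrame and settings
# dict; "my_catalog" and "splink_db" are placeholder names.
linker = SparkLinker(
    df,
    settings,
    catalog="my_catalog",
    database="splink_db",
)
```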
Error: