Replies: 1 comment
-
Without having looked in much detail, I think the issue you are running into here is to do with caching. For performance reasons, Splink keeps a record of tables it has already computed, so that if they are required at other steps of the pipeline they can be read from the database rather than re-computed. So I think the trouble is that you delete all the Splink tables, but the linker does not know that these tables no longer exist, so you hit an error when it tries to read from one. There is a linker method that will handle this for you (see the sketch below).

Also worth mentioning that if you create a new linker object, it will use a separate cache and a different set of tables, which might be useful if you need to do separate linkage runs without deleting everything.
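A minimal sketch of that reset, assuming the method being referred to is `invalidate_cache()` (the Splink 3 `Linker` method that clears its record of previously computed tables):

```python
# Assumption: the linker method referred to above is invalidate_cache().
# It clears Splink's internal record of materialised tables, so later
# pipeline steps recompute them instead of trying to read stale entries
# from the database.
linker.invalidate_cache()
```

After calling it, previously cached tables are recomputed on demand rather than read back, so manually deleting them should no longer produce missing-table errors.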
-
Hi all,
I was able to figure out that Splink creates tables whose names are prefixed with "__splink" and writes them to a default location.
In my case, I use Databricks with a 12.2 LTS ML compute, and I found that these tables are being written to the DBFS location
"dbfs:/user/hive/metastore"
I then proceeded to manually delete these tables (there were hundreds of them from previous runs) using the code below:
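A minimal sketch of that cleanup, assuming the tables sit in the default database and share the `__splink` prefix (both assumptions; adjust to wherever the tables actually live):

```python
# Sketch: drop every table whose name starts with the Splink prefix.
# Assumes a Databricks notebook where `spark` is predefined, and that
# the tables live in the "default" database; adjust as needed.
for t in spark.sql("SHOW TABLES IN default").collect():
    if t.tableName.startswith("__splink"):
        spark.sql(f"DROP TABLE IF EXISTS default.{t.tableName}")
```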
Once these tables were deleted, my previously tested solution failed at the estimate u step.
In that code, I had added a call to the `_set_catalog_and_database_if_not_provided` function and specified where I would like these intermediate tables to be written.
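For what it's worth, Splink 3's `SparkLinker` also exposes this through its `catalog` and `database` constructor arguments; a sketch (the import path matches recent Splink 3 releases, and `df`, `settings`, and the catalog/database names are placeholders):

```python
from splink.spark.linker import SparkLinker

# Sketch: point Splink's intermediate __splink__ tables at an explicit
# catalog and database instead of the default metastore location.
# `df` and `settings` stand in for your input DataFrame and settings
# dict; "my_catalog" and "splink_db" are placeholder names.
linker = SparkLinker(
    df,
    settings,
    catalog="my_catalog",
    database="splink_db",
)
```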
Error: