DM-29849: Make emptyTrash more efficient #517
Conversation
These were effectively methods on the object anyhow and were probably in the wrong place.
Force-pushed from 5692daf to ecaca7f
@TallJimbo I've tried your new approach and the bad news is that the second query takes a long time. Here are some results from a repo with 10k datasets, deleting a 5k collection. The bottom line is that the extra query to find preserved artifacts is taking 14 seconds, but doing a simple "give me all refs associated with these paths" takes a fraction of a second. This may be down entirely to the lack of an index on file_datastore_records. Dumping the sqlite table seems to indicate that there is no index on file_datastore_records at all, and now I'm confused.

Original code:
INFO 2021-05-07T16:10:42.830-0700 lsst.daf.butler.datastores.fileDatastore ()(fileDatastore.py:1650)- Emptying trash
51 seconds

One join in bridge (join with the records table, but still checking a single file per multi-ref):
INFO 2021-05-07T16:13:12.185-0700 lsst.daf.butler.datastores.fileDatastore ()(fileDatastore.py:1609)- Emptying trash
Trash time: 25 seconds

Single join + separate IN (standard join in bridge, then a separate IN query):
INFO 2021-05-07T16:24:07.506-0700 lsst.daf.butler.datastores.fileDatastore ()(fileDatastore.py:1634)- Emptying trash
Trash time: 7 seconds

Double query in bridge (double query in bridge for preserved artifacts):
INFO 2021-05-07T16:17:38.926-0700 lsst.daf.butler.datastores.fileDatastore ()(fileDatastore.py:1635)- Emptying trash
Noting that the query part now takes 14 seconds; the file delete itself is 7 seconds.
Trash time: 21 seconds
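As an aside, a quick way to see what SQLite actually has is to ask it directly. A minimal standalone sketch (not part of this PR), assuming a SQLite registry file named gen3.sqlite3 and the table name discussed above; the literal path value is made up:

```python
# Standalone sketch for checking what indexes SQLite actually has on
# file_datastore_records.  The gen3.sqlite3 filename and the example
# path value are assumptions.
import sqlite3

conn = sqlite3.connect("gen3.sqlite3")

# List every index on the table, including ones created implicitly
# for UNIQUE constraints.
for row in conn.execute("PRAGMA index_list('file_datastore_records')"):
    print(row)

# Check whether a path lookup would use an index or scan the table.
plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT dataset_id FROM file_datastore_records WHERE path = ?",
    ("some/file.fits",),
)
for row in plan:
    print(row)

conn.close()
```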
Now with an index and magically it is fast:
Diff is:

diff --git a/python/lsst/daf/butler/datastores/fileDatastore.py b/python/lsst/daf/butler/datastores/fileDatastore.py
index 7d276f50..a1669e5e 100644
--- a/python/lsst/daf/butler/datastores/fileDatastore.py
+++ b/python/lsst/daf/butler/datastores/fileDatastore.py
@@ -228,6 +228,7 @@ class FileDatastore(GenericBaseDatastore):
ddl.FieldSpec(name="file_size", dtype=BigInteger, nullable=True),
],
unique=frozenset(),
+ indexes=(("dataset_id",), ("path",), ("dataset_id", "path")),
)
def __init__(self, config: Union[DatastoreConfig, str],

I'm not entirely sure what the right answer is, and "component" should be involved since it's dataset_id+component that is unique. I am surprised it didn't automatically add an index for dataset_id. We don't have an index on dataset_locations either, since it's dataset_id+datastoreName that is the unique quantity.
It looks like we could get away with adding the index and unique constraints on this ticket (and patching repos we know about manually) without a formal migration, because those changes won't be noticed by @andy-slac's digest function that checks whether the schema in the database matches what daf_butler would create. But the primary key changes would have to wait, and I'd want Andy to weigh in on whether taking advantage of the digest function gap for the rest is inadvisable for some reason.
Ok, I guess that's also unique, so it shouldn't matter much whether that or
I can't work out how to add indexes to the dataset_location tables. Somehow it's ignoring the indexes attribute. I'm also a bit worried that I messed up my testing, since even if I add indexes manually to both tables I can't get the new double query to go fast. It's possible I was running the test with the
I would expect it to create the indexes specified in the tableSpec only if you're creating a new repo.
I've been creating new repos for testing but for some reason adding |
Yeah, those FKs will create indexes on |
That would be it: datastore_name and dataset_id are both part of the primary key already, hence the indexing request would be ignored. Next week I'll send you my "create a test repo" script so you can try this branch to see what I'm missing in terms of why it's so slow with the double query.
I finally read back through the discussion here and noticed that there was an extra index I had talked about early on, which in subsequent discussions I had ignored. Adding an index for "path" alone was what was missing. Doing that fixes things.
Looks good! I have one suggestion for the mypy situation that might help (but what you have is ok). Also a few other minor comments.
I'm in favor of the solution that requires adding the indexes to avoid a regression, but we do need to be careful with how we stage that. I think everybody maintaining a big repo could very quickly get a few CREATE INDEX commands run, but maybe a community announcement post with the instructions is in order? I can of course take care of the repos at NCSA, but I am worried about DESC@NERSC, IN2P3, IDF, and OODS.
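For reference, the manual patch should amount to a couple of statements along these lines (a sketch, assuming a SQLite repo and the table/column names discussed above; the index names are arbitrary):

```python
# Sketch of the manual patch for an existing SQLite repo; run it
# against a backup first.  Index names are arbitrary; the (path)
# index is the one that matters for trash emptying, per the
# discussion above.
import sqlite3

conn = sqlite3.connect("gen3.sqlite3")
with conn:  # commits on success
    conn.execute(
        "CREATE INDEX IF NOT EXISTS file_datastore_records_path_idx "
        "ON file_datastore_records (path)"
    )
    conn.execute(
        "CREATE INDEX IF NOT EXISTS file_datastore_records_dataset_id_idx "
        "ON file_datastore_records (dataset_id)"
    )
conn.close()
```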
raise ValueError("This implementation requires a records table.") | ||
|
||
if not isinstance(records_table, ByNameOpaqueTableStorage): | ||
raise ValueError(f"Records table must support hidden attributes. Got {type(records_table)}.") |
I wonder if we should just make OpaqueTableStorage._table public, or add a public table property that returns it.

(Wouldn't change the need for this check, but "hidden" here reminded me that we're using an at least sort-of private attribute.)
The Dummy registry table storage class doesn't have a table at all (it's all a Python dict).
Ok. A property that can return None isn't terribly satisfying, but it may still be better than what we've got. And yet maybe not enough better to be worth the trouble.
If the QuantumDirectory stuff all works out, that would require adjustments to the opaque-table interface anyway, so I guess we can come back to this then.
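For the record, the property version would look roughly like this (a sketch only; the Optional return and the SQLAlchemy type are assumptions, not the actual daf_butler interface):

```python
# Hypothetical sketch of a public, possibly-None table property on
# OpaqueTableStorage; in-memory test implementations (the "dummy"
# storage backed by a plain dict) would simply leave it as None.
from typing import Optional

import sqlalchemy


class OpaqueTableStorage:
    _table: Optional[sqlalchemy.schema.Table] = None

    @property
    def table(self) -> Optional[sqlalchemy.schema.Table]:
        """The underlying SQLAlchemy table, or None if this storage
        is not backed by a database table.
        """
        return self._table
```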
Now joins the trash table with the records table and returns all the information up front rather than doing a per-ref query. Still includes a per-artifact lookup to check for the DECam case.
We were logging before and after; now we only log after, and also include a log entry on failure.
Significantly reduces time for trash emptying by doing all database querying up front.
The bridge emptyTrash can now return a set of files that should not be deleted.
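In rough SQL terms that second query looks like this (an illustrative sketch only; the real bridge goes through SQLAlchemy, and the trash table name is an assumption):

```python
# Illustrative sketch of the "preserved artifacts" query: of the paths
# referenced by trashed datasets, which are also referenced by a
# dataset that is NOT in the trash?  Those paths must not be deleted.
# Table and column names are assumptions based on the discussion.
import sqlite3


def preserved_paths(conn: sqlite3.Connection, trashed_paths: set[str]) -> set[str]:
    if not trashed_paths:
        return set()
    placeholders = ", ".join("?" for _ in trashed_paths)
    query = (
        "SELECT DISTINCT path FROM file_datastore_records "
        f"WHERE path IN ({placeholders}) AND dataset_id NOT IN "
        "(SELECT dataset_id FROM dataset_location_trash)"
    )
    return {row[0] for row in conn.execute(query, tuple(trashed_paths))}
```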
We do many searches on the path alone when deleting, so we need an index on it.
As stated in the comments, if a datastore uses user-level permissions it's entirely possible for the deletion to fail.
No longer do an existence check before deleting; now do the delete and catch FileNotFoundError. This can be more efficient, especially with an object store, and emptying the trash does not really care whether the file is missing.
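i.e. roughly this pattern (a sketch, with plain os.remove standing in for the URI-based removal the datastore actually uses):

```python
# Sketch of delete-then-catch: one removal call replaces the
# exists() + remove() pair, saving a round trip per artifact on an
# object store.  os.remove stands in for the real ButlerURI removal.
import logging
import os

log = logging.getLogger(__name__)


def remove_artifact(path: str) -> None:
    try:
        os.remove(path)
    except FileNotFoundError:
        # Emptying the trash does not care if the file is already gone.
        log.debug("Artifact %s already absent during trash emptying.", path)
```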
This can speed up StoredFileInfo.file_location by 25% because it no longer needs to create a ButlerURI twice.