
DM-29849: Make emptyTrash more efficient #517

Merged
merged 13 commits into master from tickets/DM-29849 on May 11, 2021

Conversation

@timj (Member) commented May 6, 2021

No description provided.

These were effectively methods on the object anyhow
and were probably in the wrong place.
@timj timj force-pushed the tickets/DM-29849 branch 2 times, most recently from 5692daf to ecaca7f, on May 7, 2021 20:48
@timj (Member Author) commented May 7, 2021

@TallJimbo I've tried your new approach and the bad news is that the second query takes a long time. Here are some results from a repo with 10k datasets, deleting a 5k-dataset collection.

The bottom line is that the extra query to find preserved artifacts is taking 14 seconds, but a simple "give me all refs associated with these paths" query takes a fraction of a second. This may be down entirely to the lack of an index on file_datastore_records. Dumping the sqlite table seems to indicate that there is no index on file_datastore_records at all, and now I'm confused.
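For reference, one way to check which indexes SQLite actually has on that table (this is a generic sqlite3 snippet, not daf_butler code, and it assumes the default gen3.sqlite3 registry file in the repo root):

import sqlite3

db = sqlite3.connect("gen3.sqlite3")

# Explicit indexes, including those backing UNIQUE constraints, show up here.
for row in db.execute("PRAGMA index_list('file_datastore_records')"):
    print(row)

# The full DDL, including the primary key definition, is visible in sqlite_master.
for (sql,) in db.execute(
    "SELECT sql FROM sqlite_master WHERE tbl_name = 'file_datastore_records'"
):
    print(sql)

db.close()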

Original code:

INFO 2021-05-07T16:10:42.830-0700 lsst.daf.butler.datastores.fileDatastore ()(fileDatastore.py:1650)- Emptying trash
INFO 2021-05-07T16:10:42.884-0700 lsst.daf.butler.datastores.fileDatastore ()(fileDatastore.py:1653)- Requested trashed datasets
INFO 2021-05-07T16:11:34.477-0700 lsst.daf.butler.datastores.fileDatastore ()(fileDatastore.py:1706)- Trash complete

Trash time: 51 seconds

One join in bridge

With a join to the records table, but still doing the per-artifact check for a single file shared by multiple refs

INFO 2021-05-07T16:13:12.185-0700 lsst.daf.butler.datastores.fileDatastore ()(fileDatastore.py:1609)- Emptying trash
INFO 2021-05-07T16:13:12.264-0700 lsst.daf.butler.datastores.fileDatastore ()(fileDatastore.py:1614)- Trash table queried
INFO 2021-05-07T16:13:37.261-0700 lsst.daf.butler.datastores.fileDatastore ()(fileDatastore.py:1649)- Trash complete

Trash time: 25 seconds

Single join + separate IN

With the standard join in the bridge and then a separate IN query

INFO 2021-05-07T16:24:07.506-0700 lsst.daf.butler.datastores.fileDatastore ()(fileDatastore.py:1634)- Emptying trash
INFO 2021-05-07T16:24:07.571-0700 lsst.daf.butler.datastores.fileDatastore ()(fileDatastore.py:1639)- Trash table queried
INFO 2021-05-07T16:24:07.690-0700 lsst.daf.butler.datastores.fileDatastore ()(fileDatastore.py:1654)- Database queries complete
INFO 2021-05-07T16:24:14.527-0700 lsst.daf.butler.datastores.fileDatastore ()(fileDatastore.py:1699)- Trash complete

Trash time: 7 seconds

Double query in bridge

With the double query in the bridge to determine preserved artifacts

INFO 2021-05-07T16:17:38.926-0700 lsst.daf.butler.datastores.fileDatastore ()(fileDatastore.py:1635)- Emptying trash
INFO 2021-05-07T16:17:52.690-0700 lsst.daf.butler.datastores.fileDatastore ()(fileDatastore.py:1649)- Trash table queried
INFO 2021-05-07T16:17:59.273-0700 lsst.daf.butler.datastores.fileDatastore ()(fileDatastore.py:1708)- Trash complete

Note that the query part now takes 14 seconds; the file delete itself is 7 seconds.

Trash time: 21 seconds
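For orientation, a rough sketch of the "single join + separate IN" shape timed above, written with SQLAlchemy Core against hypothetical trash_table and records_table objects; this is illustrative only, not the actual bridge/datastore code.

import sqlalchemy as sa

def fetch_trashed_records(connection, trash_table, records_table):
    # One join up front: every datastore record for every trashed dataset.
    joined = sa.select(records_table).select_from(
        trash_table.join(
            records_table,
            trash_table.c.dataset_id == records_table.c.dataset_id,
        )
    )
    trashed_rows = connection.execute(joined).fetchall()

    # Separate IN query: "give me all refs associated with these paths", used to
    # spot artifacts shared with datasets that are not being trashed.  This was
    # fast even without an extra index; it was the alternative "double query in
    # bridge" formulation that needed the index on path.
    paths = [row.path for row in trashed_rows]
    shared = sa.select(records_table.c.path, records_table.c.dataset_id).where(
        records_table.c.path.in_(paths)
    )
    return trashed_rows, connection.execute(shared).fetchall()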

@timj (Member Author) commented May 8, 2021

Now with an index and magically it is fast:

INFO  2021-05-07T17:05:14.431-0700 lsst.daf.butler.datastores.fileDatastore ()(fileDatastore.py:1636)- Emptying trash
INFO  2021-05-07T17:05:14.591-0700 lsst.daf.butler.datastores.fileDatastore ()(fileDatastore.py:1650)- Trash table queried
INFO  2021-05-07T17:05:21.870-0700 lsst.daf.butler.datastores.fileDatastore ()(fileDatastore.py:1709)- Trash complete

Diff is:

diff --git a/python/lsst/daf/butler/datastores/fileDatastore.py b/python/lsst/daf/butler/datastores/fileDatastore.py
index 7d276f50..a1669e5e 100644
--- a/python/lsst/daf/butler/datastores/fileDatastore.py
+++ b/python/lsst/daf/butler/datastores/fileDatastore.py
@@ -228,6 +228,7 @@ class FileDatastore(GenericBaseDatastore):
                 ddl.FieldSpec(name="file_size", dtype=BigInteger, nullable=True),
             ],
             unique=frozenset(),
+            indexes=(("dataset_id",), ("path",), ("dataset_id", "path")),
         )
 
     def __init__(self, config: Union[DatastoreConfig, str],

I'm not entirely sure what the right answer is; "component" should presumably be involved, since it's dataset_id+component that is unique. I am surprised it didn't automatically add an index for dataset_id. We don't have an index on dataset_locations either, since it's dataset_id+datastoreName that is the unique quantity.

@TallJimbo (Member) commented May 8, 2021

I'd guess that (dataset_id, path) as the only index for the records table would work as well as all three of those, but that's really just a guess; I'm slightly more confident that that removes the need for an index on dataset_id alone. Sounds like we should also make (dataset_id, component) the primary key, especially if the table doesn't have a primary key already.

And yes, dataset_locations and dataset_locations_trash should have a unique constraint on (datastore_name, dataset_id) in that order, which will also create an index for them.

It looks like we could get away with adding the index and unique constraints on this ticket (and patching repos we know about manually) without a formal migration, because those changes won't be noticed by @andy-slac 's digest function that checks whether the schema in the database matches what daf_butler would create. But the primary key changes would have to wait, and I'd want Andy to weigh in on whether taking advantage of the digest function gap for the rest is inadvisable for some reason.

@timj (Member Author) commented May 8, 2021

(dataset_id, component) is already the primary key for file_datastore_records. I'll see about adding the others though.

@TallJimbo (Member) commented:

Ok, I guess that's also unique, so it shouldn't matter much whether that or (dataset_id, component) is the PK vs another unique index.

@timj (Member Author) commented May 8, 2021

I can't work out how to add indexes to the dataset_location tables. Somehow it's ignoring the indexes attribute.

I'm also a bit worried that I messed up my testing, since even if I add indexes manually to both tables I can't get the new double query to go fast. It's possible I was running the test with the IN version (which you can trigger by passing None in as the column name of interest). I'll revisit another day.

@TallJimbo (Member) commented:

I would expect it to create the indexes specified in the tableSpec only if you're creating a new repo.

@timj (Member Author) commented May 8, 2021

I've been creating new repos for testing, but for some reason adding indexes to https://github.com/lsst/daf_butler/blob/master/python/lsst/daf/butler/registry/bridge/monolithic.py#L99 isn't working. Maybe it's something about the addition of the foreign key a few lines later?

@TallJimbo (Member) commented:

Yeah, those FKs will create indexes on dataset_id alone (because the FK is on dataset_id alone), and would block redundant manual indexes on that alone. I would not expect them to block other indexes, like those on a combination of dataset_id and something else.

@timj (Member Author) commented May 8, 2021

That would be it. datastore_name and dataset_id are already both part of the primary key, hence the indexing request would be ignored. Next week I'll send you my "create a test repo" script so you can try this branch and see what I'm missing in terms of why it's so slow with the double query.

@timj (Member Author) commented May 8, 2021

I finally read back through the discussion here and noticed that there was an extra index I had talked about early on but had ignored in subsequent discussions. Adding an index for "path" alone was what was missing; doing that fixes things.

@TallJimbo (Member) left a review comment


Looks good! I have one suggestion for the mypy situation that might help (but what you have is ok). Also a few other minor comments.

I'm in favor of the solution that requires adding the indexes to avoid a regression, but we do need to be careful with how we stage that. I think everybody maintaining a big repo could very quickly get a few CREATE INDEX commands run, but maybe a community announcement post with the instructions is in order? I can of course take care of the repos at NCSA, but I am worried about DESC@NERSC, IN2P3, IDF, and OODS.
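For illustration, the manual patch being described might look roughly like the following; the index names are made up here, the table names are taken from this discussion, and whether a plain unique index is acceptable (versus a proper migration adding a UNIQUE constraint) would need checking per backend.

import sqlalchemy as sa

def patch_existing_repo(engine: sa.engine.Engine) -> None:
    with engine.begin() as connection:
        # Index on path alone, matching what this ticket adds for new repos.
        connection.execute(sa.text(
            "CREATE INDEX IF NOT EXISTS file_datastore_records_path_idx "
            "ON file_datastore_records (path)"
        ))
        # Unique index on the locations tables in (datastore_name, dataset_id)
        # order, as suggested above.
        for table in ("dataset_locations", "dataset_locations_trash"):
            connection.execute(sa.text(
                f"CREATE UNIQUE INDEX IF NOT EXISTS {table}_dsname_dsid_idx "
                f"ON {table} (datastore_name, dataset_id)"
            ))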

Review threads on python/lsst/daf/butler/datastores/fileDatastore.py and python/lsst/daf/butler/registry/interfaces/_bridge.py (outdated, resolved).
raise ValueError("This implementation requires a records table.")

if not isinstance(records_table, ByNameOpaqueTableStorage):
    raise ValueError(f"Records table must support hidden attributes. Got {type(records_table)}.")
@TallJimbo (Member) commented:

I wonder if we should just make OpaqueTableStorage._table public, or add a public table property that returns it.

(Wouldn't change the need for this check, but "hidden" here reminded me that we're using an at least sort-of private attribute).

@timj (Member Author) commented:

The Dummy registry table storage class doesn't have a table at all (it's all a Python dict).

@TallJimbo (Member) commented:

Ok. A property that can return None isn't terribly satisfying, but it may still be better than what we've got. And yet maybe not enough better to be worth the trouble.

If the QuantumDirectory stuff all works out, that would require adjustments to the opaque-table interface anyway, so I guess we can come back to this then.
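A minimal sketch of the property idea floated above, assuming a SQLAlchemy-backed storage class; this is a hypothetical shape, not the actual daf_butler interface.

from typing import Optional

import sqlalchemy as sa

class OpaqueTableStorage:
    """Hypothetical stand-in for the real opaque-table storage interface."""

    _table: Optional[sa.Table] = None

    @property
    def table(self) -> Optional[sa.Table]:
        # None for implementations with no backing table, such as the
        # in-memory dummy registry that keeps records in a Python dict.
        return self._table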

Review thread on python/lsst/daf/butler/datastores/fileDatastore.py (outdated, resolved).
timj and others added 6 commits May 10, 2021 14:06
Now joins the trash table with the records table and returns
all info up front rather than doing a per-ref query.

Still includes a per-artifact lookup to check for the DECam case.
We were logging before and after -- now we only log after, and also include a log entry for failure.
Significantly reduces time for trash emptying by doing all
database querying up front.
The bridge emptyTrash can now return a set of files that should
not be deleted.
We do many searches on the path alone when deleting, so we need an index on that.
timj added 3 commits May 10, 2021 14:29
As stated in the comments, if a datastore uses user-level permissions
it's entirely possible for the deletion to fail.
No longer do an exists check before deleting. Now do the delete and catch FileNotFoundError (sketched after this commit list). This can be more efficient, especially with an object store. Dataset empty trash does not really care if the file is missing.
This can speed up StoredFileInfo.file_location by 25% because
it no longer needs to do the double creation of a ButlerURI.
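The delete-then-catch pattern described in the commit above, sketched generically; the real datastore presumably goes through ButlerURI/Location rather than os.remove.

import os

def remove_artifact(path: str) -> None:
    # Attempt the delete and treat a missing file as already handled, instead
    # of paying for a separate existence check (which is especially costly
    # against an object store).
    try:
        os.remove(path)
    except FileNotFoundError:
        pass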
@timj timj merged commit 9576105 into master May 11, 2021
@timj timj deleted the tickets/DM-29849 branch May 11, 2021 19:20