DM-29849: Make emptyTrash more efficient #517
Conversation
These were effectively methods on the object anyhow and were probably in the wrong place.
Force-pushed from 5692daf to ecaca7f
@TallJimbo I've tried your new approach and the bad news is that the second query takes a long time. Here are some results from a repo with 10k datasets, deleting a 5k collection. The bottom line is that the extra query to find preserved artifacts is taking 14 seconds, but doing a simple "give me all refs associated with these paths" takes a fraction of a second. This may be down entirely to the lack of an index on file_datastore_records. Dumping the sqlite table seems to indicate that there is no index on file_datastore_records at all, and now I'm confused.

Original code:
INFO 2021-05-07T16:10:42.830-0700 lsst.daf.butler.datastores.fileDatastore ()(fileDatastore.py:1650)- Emptying trash
51 seconds

One join in bridge (join with the records table, but still checking a single file per multi-ref):
INFO 2021-05-07T16:13:12.185-0700 lsst.daf.butler.datastores.fileDatastore ()(fileDatastore.py:1609)- Emptying trash
Trash time: 25 seconds

Single join + separate IN (standard join in bridge, then a separate IN query):
INFO 2021-05-07T16:24:07.506-0700 lsst.daf.butler.datastores.fileDatastore ()(fileDatastore.py:1634)- Emptying trash
Trash time: 7 seconds

Double query in bridge (double query in bridge for preserved artifacts):
INFO 2021-05-07T16:17:38.926-0700 lsst.daf.butler.datastores.fileDatastore ()(fileDatastore.py:1635)- Emptying trash
Noting that the query part now takes 14 seconds; the file delete itself is 7 seconds.
Trash time: 21 seconds
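As an aside, a quick way to see what SQLite actually has is to ask it directly. A minimal standalone sketch (not part of this PR), assuming a SQLite registry file named gen3.sqlite3 and the table name discussed above; the literal path value is made up:

```python
# Standalone sketch for checking what indexes SQLite actually has on
# file_datastore_records.  The gen3.sqlite3 filename and the example
# path value are assumptions.
import sqlite3

conn = sqlite3.connect("gen3.sqlite3")

# List every index on the table, including ones created implicitly
# for UNIQUE constraints.
for row in conn.execute("PRAGMA index_list('file_datastore_records')"):
    print(row)

# Check whether a path lookup would use an index or scan the table.
plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT dataset_id FROM file_datastore_records WHERE path = ?",
    ("some/file.fits",),
)
for row in plan:
    print(row)

conn.close()
```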
Now with an index and magically it is fast:
Diff is:

diff --git a/python/lsst/daf/butler/datastores/fileDatastore.py b/python/lsst/daf/butler/datastores/fileDatastore.py
index 7d276f50..a1669e5e 100644
--- a/python/lsst/daf/butler/datastores/fileDatastore.py
+++ b/python/lsst/daf/butler/datastores/fileDatastore.py
@@ -228,6 +228,7 @@ class FileDatastore(GenericBaseDatastore):
ddl.FieldSpec(name="file_size", dtype=BigInteger, nullable=True),
],
unique=frozenset(),
+ indexes=(("dataset_id",), ("path",), ("dataset_id", "path")),
)
def __init__(self, config: Union[DatastoreConfig, str],

I'm not entirely sure what the right answer is, and "component" should be involved since it's dataset_id+component that is unique. I am surprised it didn't automatically add an index for dataset_id. We don't have an index on dataset_locations either, since it's dataset_id+datastoreName that is the unique quantity.
It looks like we could get away with adding the index and unique constraints on this ticket (and patching repos we know about manually) without a formal migration, because those changes won't be noticed by @andy-slac's digest function that checks whether the schema in the database matches what daf_butler would create. But the primary key changes would have to wait, and I'd want Andy to weigh in on whether taking advantage of the digest function gap for the rest is inadvisable for some reason.
Ok, I guess that's also unique, so it shouldn't matter much whether that or
I can't work out how to add indexes to the dataset_location tables. Somehow it's ignoring the indexes attribute. I'm also a bit worried that I messed up my testing, since even if I add indexes manually to both tables I can't get the new double query to go fast. It's possible I was running the test with the
I would expect it to create the indexes specified in the tableSpec only if you're creating a new repo.
I've been creating new repos for testing but for some reason adding |
Yeah, those FKs will create indexes on |
That would be it: datastore_name and dataset_id are both part of the primary key already, hence the indexing request would be ignored. Next week I'll send you my "create a test repo" script so you can try this branch to see what I'm missing in terms of why it's so slow with the double query.
I finally read back through the discussion here and noticed that there was an extra index I had talked about early on, which in subsequent discussions I had ignored. Adding an index for "path" alone was what was missing. Doing that fixes things.
Looks good! I have one suggestion for the mypy situation that might help (but what you have is ok). Also a few other minor comments.
I'm in favor of the solution that requires adding the indexes to avoid a regression, but we do need to be careful with how we stage that. I think everybody maintaining a big repo could very quickly get a few CREATE INDEX commands run, but maybe a community announcement post with the instructions is in order? I can of course take care of the repos at NCSA, but I am worried about DESC@NERSC, IN2P3, IDF, and OODS.
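For reference, the manual patch should amount to a couple of statements along these lines (a sketch, assuming a SQLite repo and the table/column names discussed above; the index names are arbitrary):

```python
# Sketch of the manual patch for an existing SQLite repo; run it
# against a backup first.  Index names are arbitrary; the (path)
# index is the one that matters for trash emptying, per the
# discussion above.
import sqlite3

conn = sqlite3.connect("gen3.sqlite3")
with conn:  # commits on success
    conn.execute(
        "CREATE INDEX IF NOT EXISTS file_datastore_records_path_idx "
        "ON file_datastore_records (path)"
    )
    conn.execute(
        "CREATE INDEX IF NOT EXISTS file_datastore_records_dataset_id_idx "
        "ON file_datastore_records (dataset_id)"
    )
conn.close()
```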
raise ValueError("This implementation requires a records table.") | ||
|
||
if not isinstance(records_table, ByNameOpaqueTableStorage): | ||
raise ValueError(f"Records table must support hidden attributes. Got {type(records_table)}.") |
I wonder if we should just make OpaqueTableStorage._table public, or add a public table property that returns it.

(Wouldn't change the need for this check, but "hidden" here reminded me that we're using an at least sort-of private attribute.)
The Dummy registry table storage class doesn't have a table at all (it's all a Python dict).
Ok. A property that can return None isn't terribly satisfying, but it may still be better than what we've got. And yet maybe not enough better to be worth the trouble.
If the QuantumDirectory stuff all works out, that would require adjustments to the opaque-table interface anyway, so I guess we can come back to this then.
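For the record, the property version would look roughly like this (a sketch only; the Optional return and the SQLAlchemy type are assumptions, not the actual daf_butler interface):

```python
# Hypothetical sketch of a public, possibly-None table property on
# OpaqueTableStorage; in-memory test implementations (the "dummy"
# storage backed by a plain dict) would simply leave it as None.
from typing import Optional

import sqlalchemy


class OpaqueTableStorage:
    _table: Optional[sqlalchemy.schema.Table] = None

    @property
    def table(self) -> Optional[sqlalchemy.schema.Table]:
        """The underlying SQLAlchemy table, or None if this storage
        is not backed by a database table.
        """
        return self._table
```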
Now joins the trash table with the records table and returns all the information up front rather than doing a per-ref query. Still includes a per-artifact lookup to check for the DECam case.
We were logging before and after; now we only log after, and also include a log entry on failure.
Significantly reduces time for trash emptying by doing all database querying up front.
The bridge emptyTrash can now return a set of files that should not be deleted.
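In rough SQL terms that second query looks like this (an illustrative sketch only; the real bridge goes through SQLAlchemy, and the trash table name is an assumption):

```python
# Illustrative sketch of the "preserved artifacts" query: of the paths
# referenced by trashed datasets, which are also referenced by a
# dataset that is NOT in the trash?  Those paths must not be deleted.
# Table and column names are assumptions based on the discussion.
import sqlite3


def preserved_paths(conn: sqlite3.Connection, trashed_paths: set[str]) -> set[str]:
    if not trashed_paths:
        return set()
    placeholders = ", ".join("?" for _ in trashed_paths)
    query = (
        "SELECT DISTINCT path FROM file_datastore_records "
        f"WHERE path IN ({placeholders}) AND dataset_id NOT IN "
        "(SELECT dataset_id FROM dataset_location_trash)"
    )
    return {row[0] for row in conn.execute(query, tuple(trashed_paths))}
```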
We do many searches on the path alone when deleting, so we need an index on it.
As stated in the comments, if a datastore uses user-level permissions it's entirely possible for the deletion to fail.
No longer do an existence check before deleting; now do the delete and catch FileNotFoundError. This can be more efficient, especially with an object store, and emptying the trash does not really care whether the file is missing.
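i.e. roughly this pattern (a sketch, with plain os.remove standing in for the URI-based removal the datastore actually uses):

```python
# Sketch of delete-then-catch: one removal call replaces the
# exists() + remove() pair, saving a round trip per artifact on an
# object store.  os.remove stands in for the real ButlerURI removal.
import logging
import os

log = logging.getLogger(__name__)


def remove_artifact(path: str) -> None:
    try:
        os.remove(path)
    except FileNotFoundError:
        # Emptying the trash does not care if the file is already gone.
        log.debug("Artifact %s already absent during trash emptying.", path)
```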
This can speed up StoredFileInfo.file_location by 25% because it no longer needs to create a ButlerURI twice.