DM-31287: Add protection to datasets manager import method #583

Merged: 1 commit merged into master from tickets/DM-31287 on Oct 8, 2021

Conversation

andy-slac (Contributor)

Dataset storage manager now has an additional protection against
inconsistent dataset definitions when importing datasets that use UUID
for dataset ID.

Checklist

  • ran Jenkins
  • added a release note for user-visible changes to doc/changes
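
For illustration, the protection described above boils down to roughly the following check (a pure-Python sketch; the dict-based storage, field names, and the locally defined exception are stand-ins, not the real daf_butler schema or API):

# Sketch only: the real manager runs SQL queries against the dataset and tags
# tables and raises the registry's own ConflictingDefinitionError.
import uuid

class ConflictingDefinitionError(RuntimeError):
    """Stand-in for the registry exception raised on conflicting imports."""

def validate_import(existing, incoming):
    """Return the incoming records that still need inserting, raising if any
    record reuses an existing UUID with a different definition."""
    to_insert = []
    for record in incoming:
        old = existing.get(record["id"])
        if old is None:
            to_insert.append(record)
        elif old != record:
            raise ConflictingDefinitionError(
                f"Dataset {record['id']} already exists with a different definition."
            )
        # Records identical to existing ones are skipped, keeping imports idempotent.
    return to_insert

# A second import of the same UUID into a different run is rejected:
ds_id = uuid.uuid4()
existing = {ds_id: {"id": ds_id, "dataset_type": "raw", "run": "run/a", "data_id": ("detector", 1)}}
incoming = [{"id": ds_id, "dataset_type": "raw", "run": "run/b", "data_id": ("detector", 1)}]
try:
    validate_import(existing, incoming)
except ConflictingDefinitionError as exc:
    print(exc)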

@TallJimbo (Member) left a comment


Looks good! I hurt my brain thinking about two potential concerns, and I think my conclusions are that no changes are needed:

Could we get away with fewer queries in _validateImport? I think so, maybe, but if so it'd be a lot harder to read and more fragile, depending on the details of various foreign keys, so not worth doing since we don't seem to have a performance problem here.

Is this subject to race conditions with READ COMMITTED isolation? Only in the sense that concurrent writes that involve identical datasets would cause constraint violations in the actual INSERT, instead of being ignored as they would be if not concurrent. Since that should be rare and doesn't result in any broken invariants, I think we're good.

Does that make sense to you?

codecov bot commented Oct 8, 2021

Codecov Report

Merging #583 (9f3863c) into master (27e3c37) will increase coverage by 0.07%.
The diff coverage is 100.00%.

❗ Current head 9f3863c differs from pull request most recent head d27114f. Consider uploading reports for the commit d27114f to get more accurate results

@@            Coverage Diff             @@
##           master     #583      +/-   ##
==========================================
+ Coverage   83.47%   83.55%   +0.07%     
==========================================
  Files         241      241              
  Lines       30136    30253     +117     
  Branches     4497     4512      +15     
==========================================
+ Hits        25156    25277     +121     
+ Misses       3786     3784       -2     
+ Partials     1194     1192       -2     
Impacted Files Coverage Δ
.../butler/registry/datasets/byDimensions/_storage.py 83.55% <100.00%> (+2.39%) ⬆️
...af/butler/registry/datasets/byDimensions/tables.py 95.23% <100.00%> (+0.23%) ⬆️
...n/lsst/daf/butler/registry/interfaces/_database.py 87.37% <100.00%> (+0.16%) ⬆️
python/lsst/daf/butler/registry/tests/_database.py 94.38% <100.00%> (+0.23%) ⬆️
python/lsst/daf/butler/registry/tests/_registry.py 98.93% <100.00%> (+0.06%) ⬆️
...sst/daf/butler/registry/collections/synthIntKey.py 96.72% <0.00%> (+1.63%) ⬆️
...on/lsst/daf/butler/registry/collections/nameKey.py 95.65% <0.00%> (+2.17%) ⬆️


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 27e3c37...d27114f.

@andy-slac (Contributor, Author)

Could we get away with fewer queries in _validateImport?
There is now one query on the dataset table and two queries on the tags table. They should be close to optimal in the sense that they use the primary key or unique indices, so they avoid full table scans. If we trusted our data completely we could drop some of the checks, e.g. if we believe that UUID4s never collide and there are no mistakes on the user side, but I prefer to stay paranoid for now until we learn more about how reliable our data is. There is certainly a concern that the ever-growing size of these tables will make these queries slower over time; I do not know yet what to do about that.
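
To make the shape of those lookups concrete, they are roughly of this form (a SQLAlchemy Core sketch with a heavily simplified schema; the real table definitions in byDimensions/tables.py have more columns and constraints):

# Sketch only: simplified stand-ins for the dataset and tags tables; the point
# is that every lookup is constrained by a primary key or unique index.
import sqlalchemy

metadata = sqlalchemy.MetaData()
dataset = sqlalchemy.Table(
    "dataset", metadata,
    sqlalchemy.Column("id", sqlalchemy.String, primary_key=True),  # UUID stored as text here
    sqlalchemy.Column("dataset_type_id", sqlalchemy.BigInteger),
    sqlalchemy.Column("run_name", sqlalchemy.String),
)
tags = sqlalchemy.Table(
    "tags", metadata,
    sqlalchemy.Column("dataset_id", sqlalchemy.String, primary_key=True),
    sqlalchemy.Column("dataset_type_id", sqlalchemy.BigInteger),
    sqlalchemy.Column("data_id_key", sqlalchemy.String),
    sqlalchemy.UniqueConstraint("dataset_type_id", "data_id_key"),
)

def fetch_existing(connection, ids):
    # Query 1: existing dataset rows for the imported UUIDs (primary-key lookup).
    dataset_rows = connection.execute(
        sqlalchemy.select(dataset).where(dataset.c.id.in_(ids))
    ).all()
    # Query 2: existing tags rows for the same UUIDs (again an indexed lookup).
    tag_rows = connection.execute(
        sqlalchemy.select(tags).where(tags.c.dataset_id.in_(ids))
    ).all()
    # (A further lookup against the tags table's unique data ID constraint can
    # catch a different dataset already occupying the same data ID.)
    return dataset_rows, tag_rows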

Is this subject to race conditions with READ COMMITTED isolation?

Concurrent inserts are indeed a potential issue. I do not think we can solve it with locking (short of locking the whole table, which we do not want); I believe Postgres does not know how to lock non-existent rows. Getting an error on the INSERT in this situation and re-trying the whole import should be acceptable; the only question is whether it will need human intervention until we teach the code to recognize this sort of failure.
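
In other words, the fallback would be something along these lines (a sketch; the retry policy and the import_datasets callable are illustrative, and nothing in this PR does the retry automatically yet):

# Sketch only: re-run the whole import if a concurrent writer wins the race
# and our INSERT trips a unique-constraint violation.
import time
import sqlalchemy.exc

def import_with_retry(import_datasets, attempts=3, delay=1.0):
    for attempt in range(attempts):
        try:
            return import_datasets()   # runs validation + INSERT in one transaction
        except sqlalchemy.exc.IntegrityError:
            if attempt == attempts - 1:
                raise                   # give up; may need human attention
            time.sleep(delay)           # let the concurrent transaction commit first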

(I rebased this against the current master and restarted Jenkins; I will merge when it finishes.)

@andy-slac andy-slac merged commit ac63b18 into master Oct 8, 2021
@andy-slac andy-slac deleted the tickets/DM-31287 branch October 8, 2021 21:25