DM-31287: Add protection to datasets manager import method #583

Merged: 1 commit merged into master from tickets/DM-31287 on Oct 8, 2021

Conversation

andy-slac (Contributor)

Dataset storage manager now has an additional protection against
inconsistent dataset definitions when importing datasets that use UUID
for dataset ID.

Checklist

  • ran Jenkins
  • added a release note for user-visible changes to doc/changes
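
For illustration, the protection described above boils down to roughly the following check (a pure-Python sketch; the dict-based storage, field names, and the locally defined exception are stand-ins, not the real daf_butler schema or API):

# Sketch only: the real manager runs SQL queries against the dataset and tags
# tables and raises the registry's own ConflictingDefinitionError.
import uuid

class ConflictingDefinitionError(RuntimeError):
    """Stand-in for the registry exception raised on conflicting imports."""

def validate_import(existing, incoming):
    """Return the incoming records that still need inserting, raising if any
    record reuses an existing UUID with a different definition."""
    to_insert = []
    for record in incoming:
        old = existing.get(record["id"])
        if old is None:
            to_insert.append(record)
        elif old != record:
            raise ConflictingDefinitionError(
                f"Dataset {record['id']} already exists with a different definition."
            )
        # Records identical to existing ones are skipped, keeping imports idempotent.
    return to_insert

# A second import of the same UUID into a different run is rejected:
ds_id = uuid.uuid4()
existing = {ds_id: {"id": ds_id, "dataset_type": "raw", "run": "run/a", "data_id": ("detector", 1)}}
incoming = [{"id": ds_id, "dataset_type": "raw", "run": "run/b", "data_id": ("detector", 1)}]
try:
    validate_import(existing, incoming)
except ConflictingDefinitionError as exc:
    print(exc)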

@TallJimbo (Member) left a comment


Looks good! I hurt my brain thinking about two potential concerns, and I think my conclusions are that no changes are needed:

Could we get away with fewer queries in _validateImport? I think so, maybe, but if so it'd be a lot harder to read and more fragile, depending on the details of various foreign keys, so not worth doing since we don't seem to have a performance problem here.

Is this subject to race conditions with READ COMMITTED isolation? Only in the sense that concurrent writes that involve identical datasets would cause constraint violations in the actual INSERT, instead of being ignored as they would be if not concurrent. Since that should be rare and doesn't result in any broken invariants, I think we're good.

Does that make sense to you?

codecov bot commented Oct 8, 2021

Codecov Report

Merging #583 (9f3863c) into master (27e3c37) will increase coverage by 0.07%.
The diff coverage is 100.00%.

❗ Current head 9f3863c differs from pull request most recent head d27114f. Consider uploading reports for the commit d27114f to get more accurate results

@@            Coverage Diff             @@
##           master     #583      +/-   ##
==========================================
+ Coverage   83.47%   83.55%   +0.07%     
==========================================
  Files         241      241              
  Lines       30136    30253     +117     
  Branches     4497     4512      +15     
==========================================
+ Hits        25156    25277     +121     
+ Misses       3786     3784       -2     
+ Partials     1194     1192       -2     
Impacted Files Coverage Δ
.../butler/registry/datasets/byDimensions/_storage.py 83.55% <100.00%> (+2.39%) ⬆️
...af/butler/registry/datasets/byDimensions/tables.py 95.23% <100.00%> (+0.23%) ⬆️
...n/lsst/daf/butler/registry/interfaces/_database.py 87.37% <100.00%> (+0.16%) ⬆️
python/lsst/daf/butler/registry/tests/_database.py 94.38% <100.00%> (+0.23%) ⬆️
python/lsst/daf/butler/registry/tests/_registry.py 98.93% <100.00%> (+0.06%) ⬆️
...sst/daf/butler/registry/collections/synthIntKey.py 96.72% <0.00%> (+1.63%) ⬆️
...on/lsst/daf/butler/registry/collections/nameKey.py 95.65% <0.00%> (+2.17%) ⬆️


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 27e3c37...d27114f.

@andy-slac (Contributor, Author)

Could we get away with fewer queries in _validateImport?
There is now one query on the dataset table and two queries on the tags table. They should be close to optimal in the sense that they use the primary key or unique indices, so they avoid full table scans. If we trusted our data completely we could drop some of the checks, e.g. if we believe that UUID4s never collide and there are no mistakes on the user side, but I prefer to stay paranoid for now until we learn more about how reliable our data is. There is certainly a concern that the ever-growing size of these tables will make these queries slower over time; I do not know yet what to do about that.
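
To make the shape of those lookups concrete, they are roughly of this form (a SQLAlchemy Core sketch with a heavily simplified schema; the real table definitions in byDimensions/tables.py have more columns and constraints):

# Sketch only: simplified stand-ins for the dataset and tags tables; the point
# is that every lookup is constrained by a primary key or unique index.
import sqlalchemy

metadata = sqlalchemy.MetaData()
dataset = sqlalchemy.Table(
    "dataset", metadata,
    sqlalchemy.Column("id", sqlalchemy.String, primary_key=True),  # UUID stored as text here
    sqlalchemy.Column("dataset_type_id", sqlalchemy.BigInteger),
    sqlalchemy.Column("run_name", sqlalchemy.String),
)
tags = sqlalchemy.Table(
    "tags", metadata,
    sqlalchemy.Column("dataset_id", sqlalchemy.String, primary_key=True),
    sqlalchemy.Column("dataset_type_id", sqlalchemy.BigInteger),
    sqlalchemy.Column("data_id_key", sqlalchemy.String),
    sqlalchemy.UniqueConstraint("dataset_type_id", "data_id_key"),
)

def fetch_existing(connection, ids):
    # Query 1: existing dataset rows for the imported UUIDs (primary-key lookup).
    dataset_rows = connection.execute(
        sqlalchemy.select(dataset).where(dataset.c.id.in_(ids))
    ).all()
    # Query 2: existing tags rows for the same UUIDs (again an indexed lookup).
    tag_rows = connection.execute(
        sqlalchemy.select(tags).where(tags.c.dataset_id.in_(ids))
    ).all()
    # (A further lookup against the tags table's unique data ID constraint can
    # catch a different dataset already occupying the same data ID.)
    return dataset_rows, tag_rows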

Is this subject to race conditions with READ COMMITTED isolation?

Concurrent inserts are indeed a potential issue. I do not think we can solve it with locking (short of locking the whole table, which we do not want); I believe Postgres does not know how to lock non-existent rows. Getting an error on the INSERT in this situation and re-trying the whole import should be acceptable; the only question is whether it will need human intervention until we teach the code to recognize this sort of failure.
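
In other words, the fallback would be something along these lines (a sketch; the retry policy and the import_datasets callable are illustrative, and nothing in this PR does the retry automatically yet):

# Sketch only: re-run the whole import if a concurrent writer wins the race
# and our INSERT trips a unique-constraint violation.
import time
import sqlalchemy.exc

def import_with_retry(import_datasets, attempts=3, delay=1.0):
    for attempt in range(attempts):
        try:
            return import_datasets()   # runs validation + INSERT in one transaction
        except sqlalchemy.exc.IntegrityError:
            if attempt == attempts - 1:
                raise                   # give up; may need human attention
            time.sleep(delay)           # let the concurrent transaction commit first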

(I rebased this against the current master and restarted Jenkins; I will merge when it finishes.)

@andy-slac andy-slac merged commit ac63b18 into master Oct 8, 2021
@andy-slac andy-slac deleted the tickets/DM-31287 branch October 8, 2021 21:25