New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DM-15212: better transactionality and ingest-with-transfer for Datastores #62
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Took me a while to work out what was going on but it looks good in general.
@@ -76,6 +76,24 @@ def registerUndo(self, name, undoFunc, *args, **kwargs): | |||
""" | |||
self._log.append(self.Event(name, undoFunc, args, kwargs)) | |||
|
|||
@contextlib.contextmanager | |||
def undoAs(self, name, undoFunc, *args, **kwargs): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure about the name here. What does "undo as?" mean?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was trying to indicate that the first argument (name) described the operation you were trying to undo, and that the second was the operation to undo it, but apparently that only made sense in my head.
I was also toying with undoWith
; do either of those make more sense to you?
with transaction.undoWith("do a thing", undoThing):
doThing()
I think part of the problem is that the name of the operation is the first argument, but the real target of the context manager is undoFunc
. I could just swap them; unfortunately we can't really use kwargs on whatever argument isn't first because we need to be able to forward *args
.
Any thoughts on how to make this better?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I prefer undoWith
.
@@ -228,6 +246,45 @@ def put(self, inMemoryDataset, datasetRef): | |||
""" | |||
raise NotImplementedError("Must be implemented by subclass") | |||
|
|||
def ingest(self, path, ref, formatter=None, transfer=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know it is marked optional but we now have two methods in the base class that insist on a path to a file or directory. I am torn about this but I'll let it stand for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I've been thinking about this. From a pure OO perspective, it'd be very natural to have an intermediate (but still abstract) base class FileDatastore
that knows about formatters and files and ingest. All concrete Datastores
that work with formatters and files and can do ingest would then inherit from that instead of directly from Datastore
itself, and that might include, say, an S3 Datastore that doesn't have files internally, but certainly knows how to work with them.
Anyhow, we'd then be able to define a bunch of operations that only work when you have a FileDatastore
. Ingest is the obvious operation that's relevant here, but I wonder if this could even be extended to the problem of whether you can ask for a checksum or size (i.e. "yes, but only if you have a FileDatastore
").
Unfortunately, I don't think the derived-class-interface approach works with composition (a ChainedDatastore
can't inherit from FileDatastore
if and only if it forwards to a FileDatastore
). I think that means we can't use the Python type system to express "is a file datastore", but the "file datastore" interface concept may still be one worth preserving. That's the direction I was trying to go in here, but it's not really fully thought out and I do agree that we should do that thinking before letting paths and formatters spread too much further into generic datastores.
'symlink' indicating how to transfer the file. | ||
Datastores need not support all options, but must raise | ||
NotImplementedError if the passed option is not supported | ||
(including None). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you be more explicit in saying that the default None
indicates that the file must be inside the repository already. Or can it also mean that an absolute path to a file is given and it will be referenced directly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll improve the docs. Right now it does mean that the file must be inside the repository already; trying to support files outside the repository root was a can of worms I did not want to open here.
self.registry.addStorageInfo(compRef, info) | ||
self.addStoredFileInfo(compRef, fileInfo) | ||
with self.transaction() as transaction: | ||
fullPath = os.path.join(self.root, path) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this check be inside the transaction?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think that it matters, because there are no undo-able operations that happen before it. Given that I'm going to refactor these to use @transactional
, it will remain inside the transaction.
@@ -292,67 +293,139 @@ def put(self, inMemoryDataset, ref): | |||
TypeError | |||
Supplied object and storage class are inconsistent. | |||
""" | |||
with self.transaction() as transaction: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
registry uses a @transactional
decorator. Is there a reason we aren't doing something similar here? It would seem to be cleaner and remove lots of extra indenting. On the other hand you'll be using self._transaction
everywhere rather than transaction
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When I first wrote this I didn't realize I could get access to the transaction via self._transaction
. Seeing that that is the pattern used in Registry
is a good enough reason to switch to it here. Will do.
storageClass=storageClass)) | ||
# Write the file | ||
predictedFullPath = os.path.join(self.root, formatter.predictPath(location)) | ||
with transaction.undoAs("write", os.remove, predictedFullPath): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't initially understand why prediction was needed here, but since we aren't guaranteeing a formatter will clean up a partially written file on error then I think I have to agree that we have no choice but to have path prediction.
size = stat.st_size | ||
info = StorageInfo(self.name, checksum, size) | ||
with self.registry.transaction(): | ||
self.registry.addStorageInfo(ref, info) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
addStorageInfo
is already wrapped in a transaction.
super().setUp() | ||
self.root = tempfile.mkdtemp(dir=TESTDIR) | ||
self.config = DatastoreConfig(self.configFile) | ||
self.config["root"] = self.root |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really think these should all be using setConfigRoot
rather than looking inside the config here (it's what setConfigRoot
is for. I'll make the change myself as part of my ticket since I'm rewriting setConfigRoot
and this will be a good test for it (and with chained datastores added to these tests it will be even more important to be consistent).
08e626c
to
2f404d2
Compare
This note for derived classes makes sense for Datastore, but not a concrete implementation like PosixDatastore.
Each test now gets its own temporary root directory (if applicable), and construction of different Datastore types is better abstracted.
2f404d2
to
7dd23be
Compare
Also some includes some minor test refactoring and namespacing fixes to benefit the doc build.