Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DM-15212: better transactionality and ingest-with-transfer for Datastores #62

Merged
merged 8 commits into from Aug 2, 2018

Conversation

TallJimbo
Copy link
Member

Also some includes some minor test refactoring and namespacing fixes to benefit the doc build.

Copy link
Member

@timj timj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Took me a while to work out what was going on but it looks good in general.

@@ -76,6 +76,24 @@ def registerUndo(self, name, undoFunc, *args, **kwargs):
"""
self._log.append(self.Event(name, undoFunc, args, kwargs))

@contextlib.contextmanager
def undoAs(self, name, undoFunc, *args, **kwargs):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure about the name here. What does "undo as?" mean?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was trying to indicate that the first argument (name) described the operation you were trying to undo, and that the second was the operation to undo it, but apparently that only made sense in my head.

I was also toying with undoWith; do either of those make more sense to you?

with transaction.undoWith("do a thing", undoThing):
    doThing()

I think part of the problem is that the name of the operation is the first argument, but the real target of the context manager is undoFunc. I could just swap them; unfortunately we can't really use kwargs on whatever argument isn't first because we need to be able to forward *args.

Any thoughts on how to make this better?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I prefer undoWith.

@@ -228,6 +246,45 @@ def put(self, inMemoryDataset, datasetRef):
"""
raise NotImplementedError("Must be implemented by subclass")

def ingest(self, path, ref, formatter=None, transfer=None):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know it is marked optional but we now have two methods in the base class that insist on a path to a file or directory. I am torn about this but I'll let it stand for now.

Copy link
Member Author

@TallJimbo TallJimbo Jul 31, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I've been thinking about this. From a pure OO perspective, it'd be very natural to have an intermediate (but still abstract) base class FileDatastore that knows about formatters and files and ingest. All concrete Datastores that work with formatters and files and can do ingest would then inherit from that instead of directly from Datastore itself, and that might include, say, an S3 Datastore that doesn't have files internally, but certainly knows how to work with them.

Anyhow, we'd then be able to define a bunch of operations that only work when you have a FileDatastore. Ingest is the obvious operation that's relevant here, but I wonder if this could even be extended to the problem of whether you can ask for a checksum or size (i.e. "yes, but only if you have a FileDatastore").

Unfortunately, I don't think the derived-class-interface approach works with composition (a ChainedDatastore can't inherit from FileDatastore if and only if it forwards to a FileDatastore). I think that means we can't use the Python type system to express "is a file datastore", but the "file datastore" interface concept may still be one worth preserving. That's the direction I was trying to go in here, but it's not really fully thought out and I do agree that we should do that thinking before letting paths and formatters spread too much further into generic datastores.

'symlink' indicating how to transfer the file.
Datastores need not support all options, but must raise
NotImplementedError if the passed option is not supported
(including None).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you be more explicit in saying that the default None indicates that the file must be inside the repository already. Or can it also mean that an absolute path to a file is given and it will be referenced directly?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll improve the docs. Right now it does mean that the file must be inside the repository already; trying to support files outside the repository root was a can of worms I did not want to open here.

self.registry.addStorageInfo(compRef, info)
self.addStoredFileInfo(compRef, fileInfo)
with self.transaction() as transaction:
fullPath = os.path.join(self.root, path)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this check be inside the transaction?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that it matters, because there are no undo-able operations that happen before it. Given that I'm going to refactor these to use @transactional, it will remain inside the transaction.

@@ -292,67 +293,139 @@ def put(self, inMemoryDataset, ref):
TypeError
Supplied object and storage class are inconsistent.
"""
with self.transaction() as transaction:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

registry uses a @transactional decorator. Is there a reason we aren't doing something similar here? It would seem to be cleaner and remove lots of extra indenting. On the other hand you'll be using self._transaction everywhere rather than transaction.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I first wrote this I didn't realize I could get access to the transaction via self._transaction. Seeing that that is the pattern used in Registry is a good enough reason to switch to it here. Will do.

storageClass=storageClass))
# Write the file
predictedFullPath = os.path.join(self.root, formatter.predictPath(location))
with transaction.undoAs("write", os.remove, predictedFullPath):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't initially understand why prediction was needed here, but since we aren't guaranteeing a formatter will clean up a partially written file on error then I think I have to agree that we have no choice but to have path prediction.

size = stat.st_size
info = StorageInfo(self.name, checksum, size)
with self.registry.transaction():
self.registry.addStorageInfo(ref, info)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addStorageInfo is already wrapped in a transaction.

super().setUp()
self.root = tempfile.mkdtemp(dir=TESTDIR)
self.config = DatastoreConfig(self.configFile)
self.config["root"] = self.root
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really think these should all be using setConfigRoot rather than looking inside the config here (it's what setConfigRoot is for. I'll make the change myself as part of my ticket since I'm rewriting setConfigRoot and this will be a good test for it (and with chained datastores added to these tests it will be even more important to be consistent).

@TallJimbo TallJimbo merged commit 7628b5a into master Aug 2, 2018
@ktlim ktlim deleted the tickets/DM-15212 branch August 25, 2018 06:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants