Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

File storage interface improvements #256

Draft
wants to merge 41 commits into
base: master
Choose a base branch
from
Draft

File storage interface improvements #256

wants to merge 41 commits into from

Conversation

alexdutton
Copy link
Member

Currently for preliminary review.

See inveniosoftware/rfc#37 for background.

Not all tests pass yet, as the changes to PyFSFileStorage haven't been reflected in the tests.

@lnielsen lnielsen added this to Backlog in Invenio v3.4 + SemanticUI via automation Aug 19, 2020
@alexdutton
Copy link
Member Author

Things that we think still need to be done to this:

  • tests
  • using Locations to bootstrap where files are stored
  • letting backends provide checksum functionality (but probably still provide the default implementation)

@lnielsen lnielsen moved this from Backlog to Blocked in Invenio v3.4 + SemanticUI Aug 28, 2020
@alexdutton alexdutton force-pushed the rfc-37 branch 3 times, most recently from b9280c6 to 067b71c Compare October 6, 2020 13:14
@alexdutton
Copy link
Member Author

alexdutton commented Oct 6, 2020

I think this is getting there. There's some documentation on what file storage backends now look like at https://github.com/inveniosoftware/invenio-files-rest/blob/9bc8a83163fc50d9b407e29a82d7ef6701770f48/docs/new-storage-backend.rst

I've moved the old implementation into invenio_files_rest.storage.legacy and duplicated the tests as was to ensure no regressions.

The general idea was:

  • Create a new storage factory for marshalling multiple backends
  • To reduce the amount of code that needs to be written for a backend, and the understanding needed to do it. Thus, for simple cases, the implementer gets to ignore progress_callback, the file size limit checking and the checksumming
  • Tidy up the set_uri interface to not use positional arguments, so backends can be more nuanced about what FileInstance attributes they're updating.
  • Maintain backwards compatibility. Current settings and external storage implementations should still work.

Things still to do:

  • The tests pass, but I've not checked coverage to see what new functionality needs additional tests
  • More documentation, including linking in the new page
  • Adding deprecation warnings, and marking which code supporting backwards compatibility can be removed in the future. I was thinking about adding a dependency on the deprecation package to do this.
  • Reworking all those commits to be an a lot tidier, because they're a mess (due to lots of exploratory coding and changing of approach)
  • Updating invenio-s3 to use the new interface

This adds overridable config variables for mapping storage backend names to
classes (by default drawn from declared entry points) and setting a default
backend for new files.

It also adds a pyfs storage backend entry point, and makes it the default.

Storage backends should be `django_files_rest.storage:FileStorage` subclasses,
and declared with a `django_files_rest.storage` entrypoint.
…gress

At the moment, the storage backend is responsible for computing checksums and
reporting the size of files based on the number of bytes read. However, these
are common concerns across all storage backends, and so this functionality can
be pulled the other side of the storage backend interface.

This class is therefore intended to wrap any file-like object, and to checksum
the file as it is read. It also counts bytes and calls the progress callback if
one is provided.

In a subsequent commit we'll integrate it with FileInstance, and remove the
functionality from the PyFS storage backend.
This new storage factory uses a `FileInstance.storage_backend` attribute to
determine which backend to use. It uses the
`FILES_REST_DEFAULT_STORAGE_BACKEND` config value when the file instance
doesn't already have a storage backend.

It can be extended to customize how the backend is chosen, and how it is
initialised.

The file path is chosen based on the `id` of the file instance, but this can be
customised too.

It's not set as the default storage factory to preserve backwards compatibility.
…nit classes

This property can now instantiate class-based storage factories, i.e. the new
factory in `invenio_files_rest.storage.factory`.

It's possible the deprecation should be in the pyfs storage factory itself.
These aren't Py3.6-compatible, so will probably need to be removed later, or
replaced with comment-based annotations.
This will mean we can later remove this functionality from the storage backend.
This is exposed as a class property on FileStorage and can be accessed as
`FileStorage.backend_name`. It returns the associated backend name in the
application config, which can then be used to find the class.
It no longer needs to do checksumming, and the filepath can be moved to the
superclass.
All together because this has been very exploratory.
There's always a better option than metaclasses.

In this case, a classmethod, which also simplifies getting the backend name on
instances (`type(instance).backend_name` → `instance.get_backend_name()`)
@alexdutton alexdutton self-assigned this Oct 14, 2020
@lnielsen lnielsen self-requested a review November 20, 2020 08:09
@lnielsen lnielsen self-assigned this Nov 20, 2020
Copy link
Member

@lnielsen lnielsen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Impressive job! Thanks a lot for working on this and keeping backward compatibility. I've made a couple of comments

I think the main comment is probably the removal of copy(). It might be easiest to discuss over a telecon. Honestly, I think we can go ahead without adding it back in, but perhaps we make a comment in the code, that this could optimized in the code at a later stage.

FILES_REST_DEFAULT_STORAGE_BACKEND = 'pyfs'
"""The default storage backend name."""

FILES_REST_STORAGE_BACKENDS = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: I see some potential troubles here.

How are users expected to provide storage backend?

  1. Only via specifying entry points? In this case, we should probably not make it config, but just a method on the Flask extension.
  2. Allow developers/instances to specify a dict of storage backends? in this case, it makes sense to have a config. However when FILES_REST_STORAGE_BACKENDS is overwritten, then below code will still execute causing all storage backends to be loaded, even if we don't want them.

Thus in both cases, I think it makes sense to move the entry point iteration to the Flask extension to avoid unnecessarily loading storage backends that we don't need.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was supporting both so that there was an obvious default behaviour (entry points), but that they could override this if desired. The "overriding" use case was that they wanted to change the behaviour of a backend they'd already used, but which they don't maintain (e.g. one in invenio-files-rest, or invenio-s3). In this case defining a new entrypoint with the same name wouldn't work, so they'd have to define the name→backend mapping explicitly as config.

However, yes, it's not great to do the entrypoint dereferencing here regardless of whether it's overridden. If you're happy with the rationale above I can move this to the extension loading code.

I should also write documentation for this.

@@ -791,13 +807,45 @@ def verify_checksum(self, progress_callback=None, chunk_size=None,
else (self.checksum == real_checksum))
self.last_check_at = datetime.utcnow()
return self.last_check
return current_files_rest.storage_factory(fileinstance=self, **kwargs)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: Two returns - seems like a merge problem I think?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, it does look that way 😬. Will fix!

preferred_location = ( # type: typing.Optional[Location]
Location(uri=default_location) if default_location else None
)
return self.initialize(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: Previously the init_contents() didn't have a return value. I assume init_contents() is intended to be replaced completely with initialize(), hence would it make sense to keep it like that, or is the return value need somewhere?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right; there's no return from initialize(), so the return in init_contents() is unnecessary. I'll get it removed :-)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And yep, I was intending that init_contents() be deprecated.

self.set_uri(
*self.storage(**kwargs).initialize(size=size),
readable=False, writable=True)
if hasattr(current_files_rest.storage_factory, 'initialize'):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment: Perhaps better with isinstance(current_files_rest.storage_factory, StorageBackend)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep. I think this was a hangover from when I'd considered keeping the same base class between old and new approaches, but then decided against it.

'uri', 'size', 'checksum', 'writable', 'readable', 'storage_class'
}

def update_file_metadata(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: Should set_uri() be deprecated?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes please! Its use of positional arguments didn't seem particularly clear for people using it, and seems inflexible when trying to omit attribute names or add new attribute names later.

)
self._size = result['size']
if not result['checksum']:
result['checksum'] = self.checksum(chunk_size=chunk_size)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: is this for the case where the checksum is not calculated as part of write stream and we want to ask the backend to do it instead?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep. The idea was to make the backend implementation as light-touch as possible and for it to fill in the gaps around what you've provided, unless you want to take control of those things yourself.

In this case, setting the hash name to None will result in checksum not being set from the default write implementation, at which point it'll ask the backend to calculate the checkum explicitly.

'uri': self.uri,
'readable': True,
'writable': False,
'storage_class': 'S',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: Should this be taken from FILES_REST_DEFAULT_STORAGE_CLASS?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep! WIll fix.


def copy(self, src, chunk_size=None, progress_callback=None):
"""Copy data from another file instance.

:param src: Source stream.
:param chunk_size: Chunk size to read from source stream.
"""
fp = src.open(mode='rb')
try:
warnings.warn(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: The reason for having copy on the storage backend is that it could potentially perform the copy using the backend. Say if both files reside in the same storage system, it would be a lot more efficient to do an internal copy or that system, than having to pass all data through Invenio.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's true. It shouldn't be too much effort for me to reinstate it.

"""Storage initialization."""
self.fileurl = fileurl
# if isinstance(args[0], str):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: Leftover?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep; will remove.

@@ -103,6 +103,7 @@ def test_simple_workflow(app, db, tmpdir):
], obj=script_info)
assert 0 == result.exit_code

print(tmpdir.listdir())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment: Left-over?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep; will remove.

@lnielsen lnielsen moved this from In Review to Pending merge in InvenioRDM November (really December now) Board Nov 20, 2020
@alexdutton
Copy link
Member Author

@lnielsen All replies done. Will work on the improvements soon, if you're happy with the suggested approach.

I'll do everything as fixups for easier reviewing, and then squash them once we're happy.

@lnielsen lnielsen force-pushed the master branch 3 times, most recently from abd43b6 to d1fb4d2 Compare October 20, 2021 13:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

None yet

2 participants