Reimplements the custom FileStorage backend #3178
Conversation
|
Hello @dkliban! Thanks for updating the PR. Cheers ! There are no PEP8 issues in this Pull Request. 🍻 Comment last updated on October 23, 2017 at 17:42 Hours UTC |
445dccb
to
aec73ef
Compare
| except OSError as e: | ||
| if e.errno in (errno.EXDEV, errno.ENOENT): | ||
| FileSystem.copy(path, content) | ||
| super()._save(path, content) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Most (perhaps all) of the unwanted behavior is in FileSystemStorage._save().
- directory creation is not thread safe.
- new file-name is created when file already exists.
- uses os.rename() instead of link.
It also tries to create the dir and rename the file which is 2 unnecessary system calls.
However in this case, it's only being called when the directory has already been created and the link() would raise FileExistsError. So it works for now. What if other undesirable behaviors get introduced in FIleSystemStorage._save()? I don't think removing FileSystem.copy() and saving 11 lines of code is worth the risk of using FIleSystemStorage._save().
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the feedback. I read through the _save() method in Django and it looks like it already does what we want it to do for TemporaryUploadedFile objects. When Django sees that the file object has a temporary_file_path method, it attempts to move the file instead of copying it.
Why is the directory creation not thread safe? It looks like it tries to create the directory and if it already exists it keeps going
Instead of implementing the _save() method we can implement save() and get_available_name(). The get_available_name() method should raise an Exception when the file already exists. save() can catch that exception and just keep going without doing any other operations.
I am not sure that I understand why os.rename is not desired for our use case. Could you please elaborate?
|
Thanks for the PR, @dkliban. While you're at it, how about changing |
| str(model.pk), | ||
| name) | ||
|
|
||
| def get_available_name(self, name, max_length=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method is used to derive a new unique file-name for the path in cases where the file already exists. We never want this to happen for artifacts. This is used by Storage.save() and we should override and return name.
| files.append(entry) | ||
| return (directories, files) | ||
|
|
||
| def path(self, name): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The FileSystemStorage.path uses a safe_path() that does 2 things.
-
Ensure that absolute paths are within the MEDIA_ROOT. This is good.
-
For relative paths, it joins the MEDIA_ROOT & path. This is bad. For artifacts, we never want the storage to change the storage location.
To be safe, let's override this and implement the #1 check for absolute paths and raise an exception for relative paths.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need to worry about enforcing the paths. I don't think plugin writers will be providing paths for where to save files. I am not 100% sure on this, so if you can provide an example of when we need to guard against this, that would help me understand this requirement better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think making assumptions about how the class is used is a good practice. This storage class may be used by every model in pulp. Both core and plugin models. Plugin writers could add FileField to any of their models. Passing a relative path could be unintentional by the plugin writer or a coding error.
The path() method goes beyond the Storage API and adds the behavior of joining relative paths to the MEDIA_ROOT. There is no guarantee that any other storage back-end will do this so pulp should not rely on it. Looking at S3 storage back-end, it uses the path as-is for the object key.
We should continue to override this method as suggested (or as it originally was) to ensure consistent, deterministic behavior across storage back-ends.
| return | ||
| else: | ||
| raise | ||
| if self.exists(name): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not thread safe.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but this method is called again later in _save() if a file exists exception is raised.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Storage interface does not document that this method should raise FileExistsError. This behavior is convenient for subverting the logic in _save() but inconsistent with the Storage interface and dangerous. What if the FileSystemStorage._save() logic changes? This could result in unpredictable behavior.
|
|
||
| @staticmethod | ||
| def _save(path, content): | ||
| def save(self, name, content, max_length=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The name is called path in the docstring.
| directories.append(entry) | ||
| try: | ||
| name = self.get_available_name(name, max_length=max_length) | ||
| return self._save(name, content) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you catching exceptions from this line? If not, it can move to an else block.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i am catching exceptions in self._save(). self.get_available_name() can get called again during self._save() if the file already exists.
|
|
||
| @staticmethod | ||
| def _save(path, content): | ||
| def save(self, name, content, max_length=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Storage base class intends subclasses to implement _save().
| directories.append(entry) | ||
| try: | ||
| name = self.get_available_name(name, max_length=max_length) | ||
| return self._save(name, content) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Delegating everything FileSystemStorage._save() has reintroduced a lot of bad and unwanted behavior.
- I thought there was a race condition flaw with directory creation logic but after looking again it does catch EEXIST. So, although the logic is poor, it works.
- Problems with
file_move_safe()- Logic related to allow_overwrite has race condition flaw.
- Uses
os.reneme()which removes the source (storage should not do this) and has the nasty behavior of silently overwriting the target.
- Determines new filename when already exists.
What you had before with the _save() was better. Just needed to deal with path() and get_available_name() overrides.
| try: | ||
| name = self.get_available_name(name, max_length=max_length) | ||
| return self._save(name, content) | ||
| except Exception as e: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be catch IOError or OSError, not Exception.
| from django.core.files import File | ||
|
|
||
|
|
||
| class DownloadedFile(File): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The temporary_file_path() does not seem to be part of the File interface. We should not be doing special things so that a specific Storage impl will work correctly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I borrowed this code from the TemporaryUploadedFile[0]. The FileSystemStorage is designed to work with 2 types of uploaded files: In memory or on disk. Our upload API always writes files to disk. So when we hand an uploaded file to the FileSystemStorage, the file system storage calls file_move_safe(). I would like us to handle temporary downloaded files in the same manner.
[0] https://github.com/django/django/blob/1.11/django/core/files/uploadedfile.py#L59
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The original (code you removed) FileSystem._save() handled both in memory and on disk files with very little complexity and code. It cleanly supported both linking and copying. And, did not need the content to be a special type of File to do so. The FileSystemStorage._save() seems over engineered and does not behave like pulp needs it to. I don't see the point to writing code to coerce FileSystemStorage._save() to behave the way we want instead of just overriding _save(), path() and get_available_name().
After reviewing the Storage and FileSystemStorage code (again), I still think the original FileSystem.path() and FileSystem.get_available_name() overrides that returned name was the correct thing to do.
I'm all for leveraging django code as much as possible so long as it's safe, practical and provides desirable functionality.
| else: | ||
| raise | ||
| if self.exists(name): | ||
| raise OSError(errno.EEXIST, "File Exists") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Side note: better to use FileExistsError instead.
|
@dkliban, I think your original PR was on the right track to making some good improvements. Please consider going back to that approach. |
119466f
to
f0a47e7
Compare
|
ok test |
5e77413
to
f5fca54
Compare
| """ | ||
| tmp_path = '.'.join((path, str(uuid4()), 'tmp')) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This tmp file logic ensures that a file being copied into storage is atomic. This ensures that if the process is killed, it cannot leave a half-baked file in place. If half-baked, the logic will think the file already exists in subsequent save requests. Best I can tell, the FileSystemStorage copy is not atomic. Am I missing something?
|
There are a lot of good changes in the PR:
Summary of concerns:
|
f5fca54
to
7e33216
Compare
|
I think there are some tests in |
platform/pulpcore/app/files.py
Outdated
| A temporary downloaded file. | ||
|
|
||
| The FileSystemStorage backend treats this object the same as a TemporaryUploadedFile. The | ||
| storage backend attempts to link the file to it's final location. If the final location is on a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/it's/its/
same thing on the next line too.
| >>> certificate = 'PEM ENCODED CERTIFICATE HERE' | ||
| >>> person.ssl_certificate = FileContent(certificate) | ||
| >>> | ||
| Modifies how Django saves files |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A better description of how this class differs from the Django one would be helpful.
4bac5b2
to
fb97eed
Compare
|
|
||
| The FileSystemStorage backend provided by Django is designed to work with File, | ||
| TemporaryUploadedFile, and InMemoryUploadedFile objects. The latter two inherit from the first. | ||
| The type of File determines how the storage backend places it in it's final destination. The |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/it's/its/
| try: | ||
| super().save(*args, **kwargs) | ||
| self.file.close() | ||
| except: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bare exception
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't care about what kind of exception occured here. We just want to make sure that we close the file before reraising.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm almost certain you mean except Exception: though. More info:
| No cleanup happens when this occurs. | ||
|
|
||
| Here is how FileSystem saves a file to its destination for the two File types we are | ||
| interested in: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like a repeat of the stuff above.
… conform to Pulp needs. When saving a file, the storage backend does nothing if a file by that name already exists. All TemporaryDownloadedFile and TemporaryUploadedFile objects are moved using os.rename() when possible. The new ArtifactFileField allows plugin writers to instantiate an Artifact with a string path to a downloaded file. This helps the plugin writer avoid having to open and close a downloaded file. The pre_save()[0] method instantiates a TemporaryDownloadedFile object and passes it to the storage backend to save into place. The save()[1] method on the Artifact then closes the file opened by pre_save(). The get_artifact_path() and published_metadata_path() static methods are now helper methods that can be used with any storage backend. In particular, I expect these to work with S3 storage backend without any modification. The signal handler for pre_delete of the Artifact model has also been removed. This handler relied on the custom storage backend to remove empty directories left behind after an Artifact has been removed. This is not necesary because there can only be 255 directories inside /var/lib/pulp/artifact. It is highly unlikely that a directory will remain empty for long. [0] https://docs.djangoproject.com/en/1.11/ref/models/fields/#django.db.models.Field.pre_save [1] https://docs.djangoproject.com/en/1.11/ref/models/instances/#django.db.models.Model.save closes: pulp#2950 https://pulp.plan.io/issues/2950
fb97eed
to
15857fb
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
The FileSystemStorage backend provided by Django has been extended to conform to Pulp needs. When saving a file, the storage backend does nothing if a file by that name already exists. All TemporaryDownloadedFile and TemporaryUploadedFile objects are moved using os.rename() when possible.
The new ArtifactFileField allows plugin writers to instantiate an Artifact with a string path to a downloaded file. This helps the plugin writer avoid having to open and close a downloaded file. The pre_save()[0] method instantiates a TemporaryDownloadedFile object and passes it to the storage backend to save into place. The save()[1] method on the Artifact then closes the file opened by pre_save().
The get_artifact_path() and published_metadata_path() static methods are now helper
methods that can be used with any storage backend. In particular, I expect these to
work with S3 storage backend without any modification.
The signal handler for pre_delete of the Artifact model has also been removed. This handler relied on the custom storage backend to remove empty directories left behind after an Artifact has been removed. This is not necesary because there can only be 255 directories inside /var/lib/pulp/artifact. It is highly unlikely that a directory will remain empty for long.
[0] https://docs.djangoproject.com/en/1.11/ref/models/fields/#django.db.models.Field.pre_save
[1] https://docs.djangoproject.com/en/1.11/ref/models/instances/#django.db.models.Model.save
closes: #2950
https://pulp.plan.io/issues/2950