Skip to content
This repository has been archived by the owner on Dec 7, 2022. It is now read-only.

Reimplements the custom FileStorage backend #3178

Merged
merged 1 commit into from Oct 23, 2017

Conversation

dkliban
Copy link
Member

@dkliban dkliban commented Sep 28, 2017

The FileSystemStorage backend provided by Django has been extended to conform to Pulp needs. When saving a file, the storage backend does nothing if a file by that name already exists. All TemporaryDownloadedFile and TemporaryUploadedFile objects are moved using os.rename() when possible.

The new ArtifactFileField allows plugin writers to instantiate an Artifact with a string path to a downloaded file. This helps the plugin writer avoid having to open and close a downloaded file. The pre_save()[0] method instantiates a TemporaryDownloadedFile object and passes it to the storage backend to save into place. The save()[1] method on the Artifact then closes the file opened by pre_save().

The get_artifact_path() and published_metadata_path() static methods are now helper
methods that can be used with any storage backend. In particular, I expect these to
work with S3 storage backend without any modification.

The signal handler for pre_delete of the Artifact model has also been removed. This handler relied on the custom storage backend to remove empty directories left behind after an Artifact has been removed. This is not necesary because there can only be 255 directories inside /var/lib/pulp/artifact. It is highly unlikely that a directory will remain empty for long.

[0] https://docs.djangoproject.com/en/1.11/ref/models/fields/#django.db.models.Field.pre_save
[1] https://docs.djangoproject.com/en/1.11/ref/models/instances/#django.db.models.Model.save

closes: #2950
https://pulp.plan.io/issues/2950

@pep8speaks
Copy link

pep8speaks commented Sep 28, 2017

Hello @dkliban! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on October 23, 2017 at 17:42 Hours UTC

@dkliban dkliban force-pushed the remove-custom-storage branch 2 times, most recently from 445dccb to aec73ef Compare September 28, 2017 19:39
except OSError as e:
if e.errno in (errno.EXDEV, errno.ENOENT):
FileSystem.copy(path, content)
super()._save(path, content)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most (perhaps all) of the unwanted behavior is in FileSystemStorage._save().

  • directory creation is not thread safe.
  • new file-name is created when file already exists.
  • uses os.rename() instead of link.

It also tries to create the dir and rename the file which is 2 unnecessary system calls.

However in this case, it's only being called when the directory has already been created and the link() would raise FileExistsError. So it works for now. What if other undesirable behaviors get introduced in FIleSystemStorage._save()? I don't think removing FileSystem.copy() and saving 11 lines of code is worth the risk of using FIleSystemStorage._save().

Copy link
Member Author

@dkliban dkliban Sep 29, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the feedback. I read through the _save() method in Django and it looks like it already does what we want it to do for TemporaryUploadedFile objects. When Django sees that the file object has a temporary_file_path method, it attempts to move the file instead of copying it.

Why is the directory creation not thread safe? It looks like it tries to create the directory and if it already exists it keeps going

Instead of implementing the _save() method we can implement save() and get_available_name(). The get_available_name() method should raise an Exception when the file already exists. save() can catch that exception and just keep going without doing any other operations.

I am not sure that I understand why os.rename is not desired for our use case. Could you please elaborate?

@jortel
Copy link
Contributor

jortel commented Sep 28, 2017

Thanks for the PR, @dkliban.

While you're at it, how about changing TLSLocation to module functions as well for consistency. You'll need to update the Importer model.

str(model.pk),
name)

def get_available_name(self, name, max_length=None):
Copy link
Contributor

@jortel jortel Sep 28, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method is used to derive a new unique file-name for the path in cases where the file already exists. We never want this to happen for artifacts. This is used by Storage.save() and we should override and return name.

files.append(entry)
return (directories, files)

def path(self, name):
Copy link
Contributor

@jortel jortel Sep 28, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The FileSystemStorage.path uses a safe_path() that does 2 things.

  1. Ensure that absolute paths are within the MEDIA_ROOT. This is good.

  2. For relative paths, it joins the MEDIA_ROOT & path. This is bad. For artifacts, we never want the storage to change the storage location.

To be safe, let's override this and implement the #1 check for absolute paths and raise an exception for relative paths.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need to worry about enforcing the paths. I don't think plugin writers will be providing paths for where to save files. I am not 100% sure on this, so if you can provide an example of when we need to guard against this, that would help me understand this requirement better.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think making assumptions about how the class is used is a good practice. This storage class may be used by every model in pulp. Both core and plugin models. Plugin writers could add FileField to any of their models. Passing a relative path could be unintentional by the plugin writer or a coding error.

The path() method goes beyond the Storage API and adds the behavior of joining relative paths to the MEDIA_ROOT. There is no guarantee that any other storage back-end will do this so pulp should not rely on it. Looking at S3 storage back-end, it uses the path as-is for the object key.

We should continue to override this method as suggested (or as it originally was) to ensure consistent, deterministic behavior across storage back-ends.

return
else:
raise
if self.exists(name):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not thread safe.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but this method is called again later in _save() if a file exists exception is raised.

Copy link
Contributor

@jortel jortel Sep 29, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Storage interface does not document that this method should raise FileExistsError. This behavior is convenient for subverting the logic in _save() but inconsistent with the Storage interface and dangerous. What if the FileSystemStorage._save() logic changes? This could result in unpredictable behavior.


@staticmethod
def _save(path, content):
def save(self, name, content, max_length=None):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name is called path in the docstring.

directories.append(entry)
try:
name = self.get_available_name(name, max_length=max_length)
return self._save(name, content)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you catching exceptions from this line? If not, it can move to an else block.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i am catching exceptions in self._save(). self.get_available_name() can get called again during self._save() if the file already exists.


@staticmethod
def _save(path, content):
def save(self, name, content, max_length=None):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Storage base class intends subclasses to implement _save().

directories.append(entry)
try:
name = self.get_available_name(name, max_length=max_length)
return self._save(name, content)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Delegating everything FileSystemStorage._save() has reintroduced a lot of bad and unwanted behavior.

  • I thought there was a race condition flaw with directory creation logic but after looking again it does catch EEXIST. So, although the logic is poor, it works.
  • Problems with file_move_safe()
    • Logic related to allow_overwrite has race condition flaw.
    • Uses os.reneme() which removes the source (storage should not do this) and has the nasty behavior of silently overwriting the target.
  • Determines new filename when already exists.

What you had before with the _save() was better. Just needed to deal with path() and get_available_name() overrides.

try:
name = self.get_available_name(name, max_length=max_length)
return self._save(name, content)
except Exception as e:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be catch IOError or OSError, not Exception.

from django.core.files import File


class DownloadedFile(File):
Copy link
Contributor

@jortel jortel Sep 29, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The temporary_file_path() does not seem to be part of the File interface. We should not be doing special things so that a specific Storage impl will work correctly.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I borrowed this code from the TemporaryUploadedFile[0]. The FileSystemStorage is designed to work with 2 types of uploaded files: In memory or on disk. Our upload API always writes files to disk. So when we hand an uploaded file to the FileSystemStorage, the file system storage calls file_move_safe(). I would like us to handle temporary downloaded files in the same manner.

[0] https://github.com/django/django/blob/1.11/django/core/files/uploadedfile.py#L59

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original (code you removed) FileSystem._save() handled both in memory and on disk files with very little complexity and code. It cleanly supported both linking and copying. And, did not need the content to be a special type of File to do so. The FileSystemStorage._save() seems over engineered and does not behave like pulp needs it to. I don't see the point to writing code to coerce FileSystemStorage._save() to behave the way we want instead of just overriding _save(), path() and get_available_name().

After reviewing the Storage and FileSystemStorage code (again), I still think the original FileSystem.path() and FileSystem.get_available_name() overrides that returned name was the correct thing to do.

I'm all for leveraging django code as much as possible so long as it's safe, practical and provides desirable functionality.

else:
raise
if self.exists(name):
raise OSError(errno.EEXIST, "File Exists")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Side note: better to use FileExistsError instead.

@jortel
Copy link
Contributor

jortel commented Sep 29, 2017

@dkliban, I think your original PR was on the right track to making some good improvements. Please consider going back to that approach.

@dkliban dkliban force-pushed the remove-custom-storage branch 2 times, most recently from 119466f to f0a47e7 Compare October 6, 2017 18:15
@dralley dralley added the 3.0 label Oct 10, 2017
@dkliban
Copy link
Member Author

dkliban commented Oct 19, 2017

ok test

@dkliban dkliban changed the title Removes most of the FileStorage implementation Removes custom FileStorage backend Oct 19, 2017
"""
tmp_path = '.'.join((path, str(uuid4()), 'tmp'))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This tmp file logic ensures that a file being copied into storage is atomic. This ensures that if the process is killed, it cannot leave a half-baked file in place. If half-baked, the logic will think the file already exists in subsequent save requests. Best I can tell, the FileSystemStorage copy is not atomic. Am I missing something?

@jortel
Copy link
Contributor

jortel commented Oct 19, 2017

There are a lot of good changes in the PR:

  • Moving the class methods on the storage to module functions for determining path.
  • Subclassing FileSystemStorage to reduce implemented methods. Not a lot of value but good none the less.

Summary of concerns:

  • Overriding get_available_name() to trick the logic in FileSystemStorage._save() to not store the file with a new name when it already exists instead of overriding FileSystemStorage._save() is brittle. It's based on knowledge of the superclass implementation of _save() which is never a good idea. Especially when the logic is not specified in the Storage interface. Any changes to the impl of FileSystemStorage._save() will have unpredictable results. If a method has undesirable behavior, it should be overridden.
  • The FileSystemStorage._save() uses os.rename() which removes the source file and will blindly overwrite the target file if given the chance. The source file should not be removed. There is nothing in the Storage interface that suggests this. And, it's presumptuous not to mention inconsistent with copy.
  • The FileSystemStorage._save() has logic Pulp does not want. Why work so hard to reuse it.
  • The FileSystemStorage._save() copy logic is not atomic. If interrupted the half-baked file will prevent the file from ever being stored in subsequent save attempts.
  • I don't understand the value of the TemporaryDownloadFile.
  • Using the File with FileField is standard django, I don't see the value in adding code to hide this from plugin writers.

@daviddavis
Copy link
Contributor

I think there are some tests in test_content.py that should be removed/updated.

A temporary downloaded file.

The FileSystemStorage backend treats this object the same as a TemporaryUploadedFile. The
storage backend attempts to link the file to it's final location. If the final location is on a
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/it's/its/

same thing on the next line too.

>>> certificate = 'PEM ENCODED CERTIFICATE HERE'
>>> person.ssl_certificate = FileContent(certificate)
>>>
Modifies how Django saves files
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A better description of how this class differs from the Django one would be helpful.

@dkliban dkliban force-pushed the remove-custom-storage branch 2 times, most recently from 4bac5b2 to fb97eed Compare October 23, 2017 15:19

The FileSystemStorage backend provided by Django is designed to work with File,
TemporaryUploadedFile, and InMemoryUploadedFile objects. The latter two inherit from the first.
The type of File determines how the storage backend places it in it's final destination. The
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/it's/its/

try:
super().save(*args, **kwargs)
self.file.close()
except:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bare exception

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't care about what kind of exception occured here. We just want to make sure that we close the file before reraising.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm almost certain you mean except Exception: though. More info:

https://docs.python.org/2/howto/doanddont.html#except

No cleanup happens when this occurs.

Here is how FileSystem saves a file to its destination for the two File types we are
interested in:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a repeat of the stuff above.

… conform to Pulp needs. When

saving a file, the storage backend does nothing if a file by that name already exists. All
TemporaryDownloadedFile and TemporaryUploadedFile objects are moved using os.rename() when possible.

The new ArtifactFileField allows plugin writers to instantiate an Artifact with a string path to a downloaded
file. This helps the plugin writer avoid having to open and close a downloaded file. The pre_save()[0] method
instantiates a TemporaryDownloadedFile object and passes it to the storage backend to save into place. The
save()[1] method on the Artifact then closes the file opened by pre_save().

The get_artifact_path() and published_metadata_path() static methods are now helper methods that can be used
with any storage backend. In particular, I expect these to work with S3 storage backend without any
modification.

The signal handler for pre_delete of the Artifact model has also been removed. This handler relied on the
custom storage backend to remove empty directories left behind after an Artifact has been removed. This is
not necesary because there can only be 255 directories inside /var/lib/pulp/artifact. It is highly unlikely that a
directory will remain empty for long.

[0] https://docs.djangoproject.com/en/1.11/ref/models/fields/#django.db.models.Field.pre_save
[1] https://docs.djangoproject.com/en/1.11/ref/models/instances/#django.db.models.Model.save

closes: pulp#2950
https://pulp.plan.io/issues/2950
Copy link
Contributor

@daviddavis daviddavis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@dkliban dkliban changed the title Removes custom FileStorage backend Reimplements the custom FileStorage backend Oct 23, 2017
@dkliban dkliban merged commit cb32acb into pulp:3.0-dev Oct 23, 2017
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
6 participants