This repository has been archived by the owner on Dec 7, 2022. It is now read-only.

Problem: Artifacts can't be created via REST API #3080

Merged
dkliban merged 1 commit into pulp:3.0-dev from upload-artifact on Jul 25, 2017

Conversation

@dkliban (Member) commented Jul 12, 2017:

Solution: add a viewset for Artifacts

The REST API user can now POST to the /api/v3/artifacts endpoint with a file,
a size, and any number of checksums. During the upload of the file, each
possible checksum is calculated. If the user provides a size or any checksums,
they are verified after the upload finishes. If one of the values does not
match, a ValidationError is raised.

The user can also list all of the Artifacts by performing a GET request
to /api/v3/artifacts. An Artifact can be deleted by making a DELETE
request to /api/v3/artifacts/<pk>.

The user cannot update an Artifact via the REST API after it is created.

The 'django.core.files.uploadhandler.MemoryFileUploadHandler' is disabled so
that all uploads are handled by 'pulpcore.app.upload.PulpFileUploadHandler' and
are written to '/var/lib/pulp/tmp' as the file is being uploaded.
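
For context, this is roughly how such a handler is enabled in Django settings; the exact configuration used by this PR is an assumption:

# settings.py -- illustrative sketch, not necessarily the PR's exact settings
FILE_UPLOAD_HANDLERS = ['pulpcore.app.upload.PulpFileUploadHandler']  # MemoryFileUploadHandler omitted
FILE_UPLOAD_TEMP_DIR = '/var/lib/pulp/tmp'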

closes #2843
https://pulp.plan.io/issues/2843
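
For illustration, a minimal sketch of the three operations using Python's requests library; the host, credentials, file name, digest value, and primary key below are placeholders, not part of this PR:

import requests

BASE = 'http://localhost:8000/api/v3/artifacts/'  # hypothetical host
AUTH = ('admin', 'admin')                         # hypothetical credentials

# POST a file; size and checksums are optional and, if given, verified server-side.
with open('foo-1.0.tar.gz', 'rb') as f:
    response = requests.post(BASE, auth=AUTH, files={'file': f},
                             data={'size': 2048, 'sha256': '...'})  # placeholder digest
response.raise_for_status()

# GET the list of all Artifacts.
artifacts = requests.get(BASE, auth=AUTH).json()

# DELETE one Artifact by primary key.
artifact_pk = 1  # hypothetical pk
requests.delete('{}{}/'.format(BASE, artifact_pk), auth=AUTH)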

"""
def storage_path(self, name):
digest = self.sha256
return "%s/artifacts/%s/%s" % (settings.MEDIA_ROOT, digest[0:2], digest[2:])
Contributor:
Suggest os.path.join() would be better.
Also, directories are implicitly plural, so artifact would be better and more consistent with Linux directory naming conventions.
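
A sketch of what the suggested change might look like, keeping the same two-character sha256 sharding:

import os

def storage_path(self, name):
    digest = self.sha256
    return os.path.join(settings.MEDIA_ROOT, 'artifact', digest[0:2], digest[2:])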

@jortel (Contributor), Jul 12, 2017:

Also, as we discussed, the storage path should be calculated by something coupled to the actual storage mechanism. The current implementation makes an attempt at this by defining ArtifactLocation in the storage module, but falls a little short of the goal. I suggest adding a method to FileStorage that can be used to get a deconstructible object that can calculate all paths appropriately for that storage mechanism. Then, in Artifact.storage_path(), instantiate DefaultStorage, get the path calculator, and use it to calculate and return the storage path. We can do the same for the certificate storage.
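
A rough sketch of that idea; the class and method names here are assumptions, not an agreed design:

import os
from django.core.files.storage import FileSystemStorage, DefaultStorage

class FilePathCalculator:
    # Deconstructible helper that knows the filesystem backend's layout.
    def artifact(self, sha256):
        return os.path.join('artifact', sha256[0:2], sha256[2:])

class FileStorage(FileSystemStorage):
    def path_calculator(self):
        return FilePathCalculator()

# In Artifact:
def storage_path(self, name):
    storage = DefaultStorage()  # resolves to FileStorage when configured as the default
    return storage.path_calculator().artifact(self.sha256)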

Contributor:

Good points. I was generally thinking that this type of logic is a good fit for being handled within the domain of the FileField object, and I think that lines up with your specific suggestions.

due to validation error. This theoretically should never occur since validation is
performed before the task is dispatched.
:param data:
:return:
Contributor:

:param data: and :return: are probably left over from a cut-and-paste.


Args:
data (dict): the id of the importer

@jortel (Contributor), Jul 12, 2017:

Extra blank lines.

self.hashers = {}
for hasher in hashlib.algorithms_guaranteed:
    self.hashers[hasher] = getattr(hashlib, hasher)()
super(PulpTemporaryUploadedFile, self).__init__(name, content_type, size, charset, content_type_extra)
Contributor:

Line length.

Create the file object to append to as data is coming in.
"""
super(PulpFileUploadHandler, self).new_file(file_name, *args, **kwargs)
self.file = PulpTemporaryUploadedFile(self.file_name, self.content_type, 0, self.charset, self.content_type_extra)
Contributor:

Line length.

file = serializers.FileField(
    help_text=_("The stored file."),
    read_only=True,
    required=True
Contributor:

What does this actually mean for a FileField? That the bits must be present? Note that for on-demand content, we will need an Artifact record so that we can publish it and also have Catalog entries, but the file may not yet be present.

@dkliban (Member, author):

Good point. I am making the file not required.

Validate file by size and all checksums which values are provided.

Args:
    data (dict): the id of the importer
Contributor:

copy-and-paste gone wrong?

super(PulpTemporaryUploadedFile, self).__init__(name, content_type, size, charset, content_type_extra)


class PulpFileUploadHandler(TemporaryFileUploadHandler):
Contributor:

To be more descriptive, consider something like HashingFileUploadHandler as the name.

@dkliban (Member, author):

That's a better name for sure. Thanks!

"""
Create the file object to append to as data is coming in.
"""
super(PulpFileUploadHandler, self).new_file(file_name, *args, **kwargs)
Contributor:

In an act of mercy, Python 3 no longer requires us to pass stuff into super(). 😄
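
That is, the call quoted above can simply read:

super().new_file(file_name, *args, **kwargs)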

@pep8speaks commented Jul 17, 2017:

Hello @dkliban! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on July 25, 2017 at 13:16 UTC

@jortel (Contributor) commented Jul 17, 2017:

Generally LGTM so far. Suggest removing Content.natural_key_digest().

@mhrivnak (Contributor) left a review:

Generally LGTM.

There is one interesting effect of this modeling change. If a type of content depends on knowing a particular checksum or the size of any of its artifacts at publish time, that information will need to be duplicated on the Content model subclass. For example, consider a FileContent repo set to on-demand.

Publish requires knowledge of the artifact's size and sha256 checksum. In an on-demand scenario where the file hasn't been downloaded, and thus there is no Artifact, that info would need to be stored on the FileContent model directly so it's available at publish time.

Maybe that's a good thing? Maybe we should declare that publication must be possible based solely on attributes of the Content subclass? Either way, that's an important constraint that we need to think through (maybe you already did and I'm catching up) and be explicit about.

Thoughts?


content = models.ForeignKey(Content, related_name='artifacts', on_delete=models.CASCADE)
file = models.FileField(db_index=True, upload_to=storage_path, max_length=255)
size = models.IntegerField(blank=True, null=False)
Contributor:

If the file itself is required, why not require this as well?

"""
Create the file object to append to as data is coming in.
"""
super().new_file(file_name, *args, **kwargs)
Contributor:

This will create a file whose reference I think will be overwritten by the next line. I also think we don't need or want that file at all. Is there anything that calling the superclass new_file() gets us that we need? Or could that be left out?

https://docs.djangoproject.com/en/1.11/_modules/django/core/files/uploadhandler/#TemporaryFileUploadHandler


@jortel (Contributor) commented Jul 17, 2017:

It seems reasonable to expect that the content (unit) has all the metadata needed to publish. Both size and digest (sha256) are properties obtained from the (remote) manifest, so adding both to FileContent seems reasonable. When publishing with other digests (as in the RPM checksum case), the publisher already requires that the file be downloaded and an Artifact created in the DB so that the additional digest is available.

@jortel (Contributor) commented Jul 17, 2017:

Although, after thinking this through further:

Storing artifact-related fields such as size and digest on the Content (unit) seems wrong. Instead, I think we should explore keeping the artifact required on ContentArtifact and creating Artifacts even when the file is not downloaded. Also, keep Artifact.downloaded. This brings things back to: what is the natural key for an Artifact? The natural key is used to look up an artifact that already exists, to support:

  • de-duplication
  • downloading an artifact (file) via the API

I suggest the digest fields are the natural key. An artifact could be located by any one of its strong digest fields.

Artifacts created by upload (or by sync with the immediate download policy) will always have at least one digest, because the file exists and all digests can be calculated. No problem there.

Artifacts created during sync (when downloading is deferred) of the content types supported today would probably all have a digest, because a digest is included in the metadata supplied by the remote repository.

So, what about edge cases where downloading is deferred and the remote repository does not provide a digest? Perhaps in that case, Artifacts are created without any digest fields populated. Such an artifact cannot be fetched using the natural key, and duplicates could be created. Maybe that is not so bad: as files are downloaded and digests are calculated and set, Pulp could perform lazy de-duplication.

Anyway, I think we should discuss this further.
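
A sketch of what the digest-based natural-key lookup could look like; the helper name and the set of "strong" fields are assumptions:

from django.db.models import Q

STRONG_DIGESTS = ('sha256', 'sha384', 'sha512')  # assumed definition of "strong"

def find_artifact(**digests):
    # Match an existing Artifact on any provided strong digest; None if no usable key.
    query = Q()
    for field, value in digests.items():
        if field in STRONG_DIGESTS and value:
            query |= Q(**{field: value})
    return Artifact.objects.filter(query).first() if query else None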

@dkliban (Member, author) commented Jul 18, 2017:

Artifacts are provided by the core as a way for plugins to store files in Pulp. If a plugin doesn't have a file to store, it doesn't need to create an Artifact.

If a plugin needs to be able to publish content for which there are no artifacts, it is the plugin's responsibility to store all information required for publishing.

@dkliban (Member, author) commented Jul 18, 2017:

@mhrivnak @jortel In the on-demand scenario, where will we store the URL for the remote artifact that the streamer needs to download? The same database table should probably also store the expected size and checksums.

relative_path (models.TextField): The artifact's path relative to the associated
    :class:`Content`. This path is incorporated in the absolute storage path of the file
    and its published path relative to the root publishing directory. At a minimum the
    path will contain the file name but may also include sub-directories.
size (models.IntegerField): The size of the file in bytes.
md5 (models.CharField): The MD5 checksum of the file.
sha1 (models.CharField): The SHA-1 checksum of the file.
sha224 (models.CharField): The SHA-224 checksum of the file.
sha256 (models.CharField): The SHA-256 checksum of the file.
sha384 (models.CharField): The SHA-384 checksum of the file.
sha512 (models.CharField): The SHA-512 checksum of the file.
@jortel (Contributor), Jul 20, 2017:

The digest fields need indexes, with sha256 and stronger being unique.

I'm assuming that (eventually) Artifact and RemoteArtifact will have a common abstract base model, and all (3) models will be defined here, correct? If so, let's refactor that here now, since you know you'll be adding RemoteArtifact eventually. Later, get rid of catalog.py.
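
A sketch of that refactor; the class names, field lengths, and nullability below are assumptions:

from django.db import models

class BaseArtifact(models.Model):
    md5 = models.CharField(max_length=32, null=True, db_index=True)
    sha1 = models.CharField(max_length=40, null=True, db_index=True)
    sha224 = models.CharField(max_length=56, null=True, db_index=True)
    sha256 = models.CharField(max_length=64, null=True, unique=True)  # unique implies an index
    sha384 = models.CharField(max_length=96, null=True, unique=True)
    sha512 = models.CharField(max_length=128, null=True, unique=True)

    class Meta:
        abstract = True

class Artifact(BaseArtifact):
    file = models.FileField(upload_to=storage_path, max_length=255)

class RemoteArtifact(BaseArtifact):
    url = models.TextField()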

Contributor:

We may have different validation rules, and I don't know if that might prevent us from having an ABC. For example, on an Artifact all fields will be required.

@jortel (Contributor) commented Jul 20, 2017:

Adding another LGTM, but leaving approval to someone who knows more about serializers and viewsets.


class ArtifactViewSet(NamedModelViewSet):
    endpoint_name = 'artifacts'
    lookup_field = 'pk'
Member:

This lookup_field value is the default for NamedModelViewSet, so it's not needed. The docstring for NamedModelViewSet could be fixed, though, since it still describes the old, no longer valid, default value. My bad.

@mhrivnak (Contributor) commented:

I didn't re-review, but I'm happy with the changes I previously requested. If there's anything specific you'd like me to look at, let me know, but otherwise I'm happy to leave it to the others.

@dkliban dkliban merged commit d3b6d70 into pulp:3.0-dev Jul 25, 2017
@dkliban dkliban deleted the upload-artifact branch July 25, 2017 13:23
@daviddavis daviddavis mentioned this pull request Jan 2, 2018
daviddavis pushed a commit that referenced this pull request Jan 3, 2018