Added twine upload support #367
Conversation
pulp_python/app/pypi.py
Outdated
```python
session = metadata["session"]
start_time = datetime.now(tz=timezone.utc) + timedelta(seconds=5)
if not session.get("start"):
    create_upload_task(session)
else:
    sq = Session.objects.select_for_update(nowait=False).filter(pk=session.session_key)
    with transaction.atomic():
        try:
            sq.first()
            if datetime.fromisoformat(session['start']) >= datetime.now(tz=timezone.utc):
                session['artifacts'].append((str(artifact.pk), file.name))
                session['start'] = str(start_time)
                session.modified = False
                session.save()
                log.info(f"Updated current task with new package {file.name}")
            else:
                raise DatabaseError
        except DatabaseError:
            session.cycle_key()
            create_upload_task(session)
```
Twine uploads each package in a separate POST request, but they all share the same session. To group the uploaded packages into one task, I pass the Session object to the task and have it wait 5 seconds before starting. If more packages are uploaded in those 5 seconds, the session is updated with them and the task's start time is pushed back. If a giant package is in the list of uploaded packages and the task starts before that package finishes uploading, then packages from that one onward will be added in a new task.
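For illustration only, here is a minimal sketch of how the deferred task side of this could work; the function and helper names (`upload_group_task`, `add_artifacts_to_repository`) are assumptions, not the PR's actual code.

```python
import time
from datetime import datetime, timezone

from django.contrib.sessions.models import Session


def upload_group_task(session_key, repository_pk):
    """Wait out the grace window, then process every upload queued in the session."""
    data = Session.objects.get(pk=session_key).get_decoded()
    # Each new upload pushes session["start"] further into the future, so keep
    # re-reading the session until the start time has actually passed.
    while datetime.now(tz=timezone.utc) < datetime.fromisoformat(data["start"]):
        time.sleep(1)
        data = Session.objects.get(pk=session_key).get_decoded()
    # Assumed helper: create one repository version containing every artifact
    # that was appended to the session while the task was waiting.
    add_artifacts_to_repository(data["artifacts"], repository_pk)
```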
Force-pushed from ee81c10 to a859785
pulp_python/app/pypi.py
Outdated
| """ | ||
| serializer = PyPIPackageUploadSerializer(data=request.data) | ||
| serializer.is_valid(raise_exception=True) | ||
| file, sha256 = serializer.validated_data["content"], serializer.validated_data["sha256_digest"] |
FTR, Warehouse stores multiple hashes, if you want to replicate that: https://pypi.org/project/build/#copy-hash-modal-9cd0d632-b45a-435a-8d43-1b36ecdbd951
For every artifact in Pulp we store multiple hashes (https://github.com/pulp/pulpcore/blob/master/pulpcore/app/models/content.py#L142), but sha256 is the main one we use. We disabled md5 by default for security purposes, and we don't calculate or store blake256 digests. I don't think it's necessary to check all three, since sha256 is pretty secure and all the tools upload the package with the sha256.
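For context, a hedged sketch of how the sha256 check can be enforced when building the artifact, using pulpcore's `Artifact.init_and_validate`; the variable names follow the snippet above, and this is illustrative rather than the PR's exact code.

```python
from pulpcore.plugin.models import Artifact

# init_and_validate computes Pulp's configured digests for the uploaded file
# and raises a validation error if the expected sha256 does not match.
artifact = Artifact.init_and_validate(file, expected_digests={"sha256": sha256})
```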
Force-pushed from 845f2e4 to 78bb9fb
```python
try:
    artifact.save()
except IntegrityError:
    artifact = Artifact.objects.get(sha256=artifact.sha256)
```
This is inconsistent with Warehouse. The standard PyPI returns 403 on duplicate artifacts.
Yes, it's inconsistent with Warehouse, but not so for Pulp. One of Pulp's key features is artifact de-duplication: if an artifact is shared among multiple content units, or a piece of content is contained in multiple repositories, Pulp only stores that artifact or content unit once. For pulp_python, the content unit is the Python package and its artifact is the tarball or wheel itself. Since Pulp can have multiple repositories, if the package is already present in Pulp but not in the repository, the upload should be treated as adding that package to that repository.
What if the artifact exists but the hash is different? It mustn't be possible to mutate the same dist. At least this case should error out.
Artifact.init_and_validate will throw an error if the digest is wrong, so the same dist can't be mutated. However, if the package and its artifact already exist and you try to upload the same package with a different artifact, Pulp will just use the existing artifact instead. This could be considered incorrect behavior.
> you try to upload the same package with a different artifact
With the same hash or a different one?
Either one. Currently, artifacts in Pulp are unique by their hash, but Python packages in pulp_python are only unique by their filename. Since a goal of Pulp is to allow hosting multiple indexes, I think I'm going to need to change the global uniqueness constraint for Python packages to be their hash, but still have the local uniqueness constraint for repositories be their filename.
I've changed the uniqueness constraint for packages to be their sha256. If you upload a package with the same hash as one already in Pulp but not in the repository, you simply add that package to that repository. If you try to upload a package with the same name as another in Pulp (but not in the repository) and with a different hash, then a new package is created and added to the repository. This means there can be two packages with the same name but different digests in Pulp; however, they can't both be in the same repository. The sha256 of the package's metadata and its artifact are now linked and can't be mismatched.
> This means there can be two packages with the same name but different digests in Pulp; however, they can't both be in the same repository.
Awesome, this makes perfect sense 👍🏻 thx!
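As a rough sketch of the uniqueness scheme described above (the field declarations and model body are simplified assumptions, not copied from the PR): globally unique by sha256, unique by filename within a repository version.

```python
from django.db import models
from pulpcore.plugin.models import Content


class PythonPackageContent(Content):
    TYPE = "python"

    filename = models.TextField()
    sha256 = models.CharField(max_length=64)

    # Repository-level uniqueness: two packages with the same filename cannot
    # coexist in one repository version.
    repo_key_fields = ("filename",)

    class Meta:
        # Global uniqueness: a package is identified by its sha256 digest.
        unique_together = ("sha256",)
```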
Force-pushed from 57984de to c7effd4
Force-pushed from 7a1e445 to 5dd5f80
Force-pushed from 363cfe4 to 7a6eaa5
Some things I need to change and areas I would like feedback on.
```python
repository = PythonRepository.objects.get(pk=repository_pk)
with repository.new_version() as new_version:
    new_version.add_content(queryset)


@transaction.atomic()
```
This is already called from an atomic transaction. I wonder if this does anything.
It will create a new "savepoint" within the transaction https://docs.djangoproject.com/en/2.2/topics/db/transactions/#topics-db-transactions-savepoints
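A minimal illustration of that savepoint behaviour (the `do_something_risky` helper is made up for the example): the nested `atomic()` block creates a savepoint, so only its changes are rolled back while the outer transaction continues.

```python
from django.db import DatabaseError, transaction

with transaction.atomic():  # outer transaction
    repository = PythonRepository.objects.get(pk=repository_pk)
    try:
        with transaction.atomic():  # nested block -> savepoint
            do_something_risky(repository)  # assumed helper
    except DatabaseError:
        # Only the changes made inside the inner block are rolled back; the
        # outer transaction (and anything done before the savepoint) survives.
        pass
```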
That functionality is actually a little controversial; SQLAlchemy and a few other ORMs keep different constructs for transactions and savepoints, as the semantics are slightly different.
| "pulp_href": "/pulp/api/v3/distributions/python/pypi/e8438593-fd40-4654-8577-65398833c331/", | ||
| "pulp_created": "2021-06-03T20:04:18.233230Z", | ||
| "base_path": "my-pypi", | ||
| "base_url": "https://pulp3-source-fedora33.localhost.example.com/pypi/foo/", |
👍
```diff
@@ -43,6 +43,8 @@ class PythonDistribution(Distribution):

     TYPE = 'python'

+    allow_uploads = models.BooleanField(default=True)
```
It looks like the container registry plugin uses an entirely separate distribution type for this. I'm not sure what the reasons are, or if they would apply to the Python plugin as well. @ipanova could you weigh in?
The container registry uses the same type of distribution; however, it has two repo types. The push repo type is auto-created during 'podman push'; it is basically a read-only repo which you subsequently cannot use for mirroring.
I see that a Python repo is created in advance, before the twine upload of Python packages, and you have just one repo type.
Is it possible to use the same repo for mirroring and uploading? If there are no constraints in that area, then I probably don't see a reason to apply this limitation, and if you do apply it, I would probably move this flag to the repo model instead.
That's my understanding of the way this PR works currently. Are there security reasons for the way the container plugin does it?
The podman push operation creates a lot of intermediary repo versions, and only the last repo version is content-complete. That means a push-type repo can only serve its last version, and it also means that rollback does not work.
Also, with a single repo type, a sync in 'mirror' mode would just remove everything that had previously been pushed via the push API, and that behavior can be undesirable. tl;dr: because of several subtleties, it was easier and more explicit to have two repo types, one used for sync workflows and the other for push workflows. I think in the registry world one would not use the same repo for push and sync workflows, so that's why we decided to add the safeguards in the form of different repo types.
This, however, brings a certain amount of complexity which we are figuring out how to minimize: https://hackmd.io/z7i694VeTR-p4hgTk2Oocw#Food-for-thoughts
So it sounds like the container plugin has to deal with inter-content dependencies, and therefore intermediate repository versions are broken -- but because Python repositories only have one type of content, and one upload per content unit, we don't need to follow what the container plugin does with regard to separate repository/distribution types.
Thanks @ipanova, that is super helpful feedback.
```python
ContentArtifact.objects.create(
    artifact=artifact, content=content, relative_path=filename
)
return content
```
This should probably be wrapped in a transaction, if it isn't already (and maybe it is)
It is.
```python
repo, distro = self._create_empty_repo_and_distribution()
url = urljoin(PYPI_HOST, distro.base_path + "/legacy/")
username, password = "admin", "password"
subprocess.run(
```
It's been a long, long time since I touched this functionality, but I would check to see if pulp_smash.cli.Client provides any extra functionality that makes it better suited than `subprocess`.
(This isn't a requirement, just a suggestion, I actually don't know if it does. We don't often drive a shell from functional tests)
I tried to use it originally, and it just wasn't well suited. The problem is that it runs the commands in the same container as the Pulp instance, but our CI installs the test_requirements in the VM running that container, so the pulp_smash CLI has no access to twine.
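For reference, a hedged sketch of what the `subprocess`-driven twine call might look like on the test runner; the flags shown are standard twine options, but the wheel path is a placeholder and the exact arguments used in the test are assumptions.

```python
import subprocess

subprocess.run(
    [
        "twine", "upload",
        "--repository-url", url,  # the distribution's /legacy/ endpoint
        "-u", username,
        "-p", password,
        "dist/example_pkg-0.1-py3-none-any.whl",  # placeholder wheel path
    ],
    check=True,  # fail the test if twine exits non-zero
)
```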
+1
Fantastic work @gerrod3! This looks great. And the implementation seems very clean overall
I'll wait on Ina's feedback before approving, just in case there are good design reasons to go with multiple distribution types instead (like RBAC?)
Lots to still work on, but twine support will finally be here.
twine-enable option to distribution (might need a better name for that)
fixes: #342