
Added twine upload support #367

Merged — merged 1 commit into pulp:master on Jun 11, 2021

Conversation

@gerrod3 (Contributor) commented Apr 29, 2021

Lots still to work on, but twine support will finally be here.

  • Make endpoint available at /pulp_python/pypi/ as well
  • Replace constants with appropriate values from settings
  • Add twine-enable option to distribution (might need a better name for that)
  • Add tests
  • Add new workflows
  • Stretch - Add simple live api so you don't need publications.

fixes: #342

Comment on lines 90 to 108
session = metadata["session"]
start_time = datetime.now(tz=timezone.utc) + timedelta(seconds=5)
if not session.get("start"):
    create_upload_task(session)
else:
    sq = Session.objects.select_for_update(nowait=False).filter(pk=session.session_key)
    with transaction.atomic():
        try:
            sq.first()
            if datetime.fromisoformat(session['start']) >= datetime.now(tz=timezone.utc):
                session['artifacts'].append((str(artifact.pk), file.name))
                session['start'] = str(start_time)
                session.modified = False
                session.save()
                log.info(f"Updated current task with new package {file.name}")
            else:
                raise DatabaseError
        except DatabaseError:
            session.cycle_key()
            create_upload_task(session)
gerrod3 (Contributor, Author):

Twine uploads each package in a separate POST request, but they all share the same session. To group the uploaded packages into one task, I pass the Session object to the task and have it wait 5 seconds before starting. If more packages are uploaded within those 5 seconds, the session is updated with them and the task's start time is pushed back. If a giant package is among the uploads and the task starts before that package finishes uploading, then packages from that one onward are added in a new task.
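For context, create_upload_task and the task it dispatches aren't shown in this hunk. A minimal sketch of how the batching could fit together, assuming pulpcore's dispatch() API and a hypothetical upload_group task; the names and details below are assumptions, not the PR's actual code:

import time
from datetime import datetime, timedelta, timezone

from django.contrib.sessions.backends.db import SessionStore
from pulpcore.plugin.tasking import dispatch  # assumed task-dispatch API


def create_upload_task(session):
    # Hypothetical helper: record the batching deadline, then dispatch a
    # task keyed on this session so later uploads can join the batch.
    session["start"] = str(datetime.now(tz=timezone.utc) + timedelta(seconds=5))
    session.setdefault("artifacts", [])
    session.save()
    dispatch(upload_group, kwargs={"session_key": session.session_key})


def upload_group(session_key):
    # Hypothetical task body: wait until the deadline stored in the
    # session, re-reading it because later uploads push the deadline back.
    session = SessionStore(session_key=session_key)
    while datetime.fromisoformat(session["start"]) > datetime.now(tz=timezone.utc):
        time.sleep(1)
        session = SessionStore(session_key=session_key)
    for artifact_pk, filename in session.get("artifacts", []):
        ...  # create one PythonPackageContent per uploaded file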

@gerrod3 force-pushed the twine branch 2 times, most recently from ee81c10 to a859785, on May 4, 2021 22:33
@gerrod3 added this to the 3.3.0 milestone on May 10, 2021
"""
serializer = PyPIPackageUploadSerializer(data=request.data)
serializer.is_valid(raise_exception=True)
file, sha256 = serializer.validated_data["content"], serializer.validated_data["sha256_digest"]

Reviewer:

FTR, Warehouse stores multiple hashes if you want to replicate that: https://pypi.org/project/build/#copy-hash-modal-9cd0d632-b45a-435a-8d43-1b36ecdbd951

gerrod3 (Contributor, Author):

For every artifact in Pulp we store multiple hashes (https://github.com/pulp/pulpcore/blob/master/pulpcore/app/models/content.py#L142), but sha256 is the main one we use. md5 is disabled by default for security reasons, and we don't calculate or store blake2-256 digests. I don't think we need to check all three, since sha256 is pretty secure and all the tools upload packages with a sha256.
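For illustration, the set of digests pulpcore computes per artifact can be inspected directly; a small sketch (the exact set depends on pulpcore's checksum settings):

from pulpcore.plugin.models import Artifact

# Pulp computes several digests for every saved artifact; sha256 is the
# one used for uniqueness, and md5 can be disabled via settings.
print(Artifact.DIGEST_FIELDS)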

try:
    artifact.save()
except IntegrityError:
    artifact = Artifact.objects.get(sha256=artifact.sha256)
Reviewer:

This is inconsistent with Warehouse. The standard PyPI returns 403 on duplicate artifacts.

gerrod3 (Contributor, Author):

Yes, inconsistent with Warehouse, but not for Pulp. One of Pulp's key features is artifact de-duplication: if an artifact is shared among multiple content units, or a content unit is contained in multiple repositories, Pulp stores that artifact or content unit only once. For pulp_python, the content unit is the Python package and its artifact is the tarball or wheel itself. Since Pulp can have multiple repositories, if the package is already present in Pulp but not in the target repository, the upload should be treated as adding that package to that repository.
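In code terms, this ties the hunk above to the repository-version hunk reviewed later in this PR (a sketch; repository and queryset are names taken from those hunks):

from django.db import IntegrityError

try:
    artifact.save()
except IntegrityError:
    # De-duplication: reuse the identical artifact Pulp already stores.
    artifact = Artifact.objects.get(sha256=artifact.sha256)

# Adding an already-known package to another repository is just a new
# repository version; the bytes are never stored twice.
with repository.new_version() as new_version:
    new_version.add_content(queryset)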

Reviewer:

What if the artifact exists but the hash is different? It mustn't be possible to mutate the same dist. At least this case should error out.

gerrod3 (Contributor, Author):

Artifact.init_and_validate will raise an error if the digest is wrong, so the same dist can't be mutated. However, if the package and its artifact already exist and you try to upload the same package with a different artifact, Pulp will just use the existing artifact instead. This could be considered incorrect behavior.
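A sketch of that validation step, assuming pulpcore's Artifact.init_and_validate call pattern (the exact arguments used in the PR may differ):

from pulpcore.plugin.models import Artifact

# init_and_validate computes the uploaded file's digests and raises if
# the client-supplied sha256 doesn't match the bytes actually received,
# so a dist can never be silently replaced with different content.
artifact = Artifact.init_and_validate(file, expected_digests={"sha256": sha256})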

Reviewer:

“you try to upload the same package with a different artifact”

With the same hash or a different one?

gerrod3 (Contributor, Author):

Either one. Currently, artifacts in Pulp are unique by their hash, but Python packages in pulp_python are unique only by their filename. Since a goal of Pulp is to allow hosting multiple indexes, I think I'm going to need to change the global uniqueness constraint for Python packages to be their hash, while keeping the local uniqueness constraint within repositories as their filename.

gerrod3 (Contributor, Author):

I've changed the uniqueness constraint for packages to be their sha256. If you upload a package with the same hash as one already in Pulp but not in the repository, you simply add that package to the repository. If you upload a package with the same name as another in Pulp but a different hash, a new package is created and added to the repository. This means there can be two packages with the same name but different digests in Pulp; however, they can't both be in the same repository. The sha256 of the package's metadata and its artifact are now linked and can't be mismatched.
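A sketch of what that constraint change could look like on the model, assuming pulpcore's repo_key_fields mechanism for per-repository-version uniqueness (field definitions are illustrative, not the PR's exact code):

from django.db import models
from pulpcore.plugin.models import Content


class PythonPackageContent(Content):
    TYPE = "python"

    filename = models.TextField(db_index=True)
    sha256 = models.CharField(max_length=64, db_index=True)

    # Local uniqueness: a repository version may hold only one package
    # per filename, so same-name/different-digest packages can't coexist
    # in one repository.
    repo_key_fields = ("filename",)

    class Meta:
        # Global uniqueness: one content unit per digest across all of Pulp.
        unique_together = ("sha256",)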

Reviewer:

“This means there can be two packages with the same name but different digests in Pulp; however, they can't both be in the same repository.”

Awesome, this makes perfect sense 👍🏻 thx!

@gerrod3 force-pushed the twine branch 9 times, most recently from 57984de to c7effd4, on June 3, 2021 15:06
@gerrod3 force-pushed the twine branch 3 times, most recently from 7a1e445 to 5dd5f80, on June 3, 2021 20:50
Review thread on docs/workflows/publish.rst (outdated, resolved)
@gerrod3 force-pushed the twine branch 2 times, most recently from 363cfe4 to 7a6eaa5, on June 4, 2021 22:46
@gerrod3 marked this pull request as ready for review on June 4, 2021 23:02
gerrod3 (Contributor, Author) left a comment:

Some things I need to change and areas I would like feedback on.

Review threads on pulp_python/app/pypi/serializers.py and pulp_python/app/pypi/views.py (outdated, resolved)
repository = PythonRepository.objects.get(pk=repository_pk)
with repository.new_version() as new_version:
    new_version.add_content(queryset)

@transaction.atomic()
gerrod3 (Contributor, Author):

This is already called from an atomic transaction. I wonder if this does anything.

Contributor:

That functionality is actually a little controversial; SQLAlchemy and a few other ORMs keep separate constructs for transactions and savepoints, since the semantics are slightly different.
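Concretely, in Django a nested atomic() is not a no-op: inside an open transaction it creates a savepoint, which is what lets the IntegrityError pattern from earlier in this PR work on PostgreSQL (without the savepoint, the failed INSERT would abort the whole outer transaction). A minimal sketch, with the sha256 lookup assumed for illustration:

from django.db import IntegrityError, transaction

def get_or_save(artifact):
    # When already in a transaction, this atomic() block issues
    # SAVEPOINT / RELEASE, so a failed save() rolls back to the
    # savepoint instead of poisoning the outer transaction.
    try:
        with transaction.atomic():
            artifact.save()
    except IntegrityError:
        artifact = type(artifact).objects.get(sha256=artifact.sha256)
    return artifact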

Review threads on docs/workflows/pypi.rst (outdated, resolved)
"pulp_href": "/pulp/api/v3/distributions/python/pypi/e8438593-fd40-4654-8577-65398833c331/",
"pulp_created": "2021-06-03T20:04:18.233230Z",
"base_path": "my-pypi",
"base_url": "https://pulp3-source-fedora33.localhost.example.com/pypi/foo/",
Contributor:

👍

@@ -43,6 +43,8 @@ class PythonDistribution(Distribution):

TYPE = 'python'

allow_uploads = models.BooleanField(default=True)
Contributor:

It looks like the container registry plugin uses an entirely separate distribution type for this. I'm not sure what the reasons are, or if they would apply to the Python plugin as well. @ipanova could you weigh in?

ipanova (Member):

The container registry uses the same type of distribution; however, it has two repo types. The push repo type is auto-created during 'podman push'; it is basically a read-only repo which you subsequently cannot use for mirroring.
I see that a Python repo is created in advance, before packages are uploaded with twine, and you have just one repo type.
Is it possible to use the same repo for mirroring and uploading? If there are no constraints in that area, then I probably don't see a reason to apply this limitation; and if you do apply it, I would probably move this flag to the repo model instead.

@dralley (Contributor) commented Jun 11, 2021:

That's my understanding of the way this PR works currently. Are there security reasons for the way the container plugin does it?

ipanova (Member):

The podman push operation creates a lot of intermediary repo versions, and only the last repo version is content-complete, meaning that for the push repo type only the last version can be served; it also means that rollback does not work.
Also, with a single repo type, a sync in 'mirror' mode would just remove everything that was previously pushed through the push API, and that behavior can be undesirable. tl;dr: because of several subtleties it was easier and more explicit to have two repo types, one used for sync workflows and the other for push workflows. I think in the registry world one would not use the same repo for push and sync workflows, so that's why we decided to add safeguards in the form of different repo types.

This however brings a certain amount of complexity, which we are figuring out how to minimize: https://hackmd.io/z7i694VeTR-p4hgTk2Oocw#Food-for-thoughts

Contributor:

So it sounds like the container plugin has to deal with inter-content dependencies, so its intermediate repository versions are broken; but because Python repositories have only one type of content, and one upload per content unit, we don't need to follow what the container plugin does with separate repository/distribution types.

Thanks @ipanova, that is super helpful feedback.

ContentArtifact.objects.create(
    artifact=artifact, content=content, relative_path=filename
)
return content
Contributor:

This should probably be wrapped in a transaction, if it isn't already (and maybe it is)

gerrod3 (Contributor, Author):

It is.

repo, distro = self._create_empty_repo_and_distribution()
url = urljoin(PYPI_HOST, distro.base_path + "/legacy/")
username, password = "admin", "password"
subprocess.run(
@dralley (Contributor) commented Jun 11, 2021:

It's been a long, long time since I touched this functionality, but I would check whether pulp_smash.cli.Client provides any extra functionality that makes it better suited than `subprocess`.

(This isn't a requirement, just a suggestion; I actually don't know if it does. We don't often drive a shell from functional tests.)

gerrod3 (Contributor, Author):

I tried to use it originally, and it just wasn't well suited. The problem is that it runs commands in the same container as the Pulp instance, but our CI installs the test requirements in the VM running that container, so the pulp_smash CLI has no access to twine.
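For reference, the truncated subprocess.run(...) call above presumably drives twine along these lines (a hedged sketch; the flags and the dist_file variable are assumptions, not the PR's exact test code):

import subprocess

dist_file = "dist/example-0.1-py3-none-any.whl"  # hypothetical path
subprocess.run(
    ["twine", "upload", "--repository-url", url,
     "-u", username, "-p", password, dist_file],
    check=True,  # fail the test if twine exits non-zero
)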

Contributor:

+1

@dralley (Contributor) left a comment:

Fantastic work @gerrod3! This looks great, and the implementation seems very clean overall.

I'll wait on Ina's feedback before approving, just in case there are good design reasons to go with multiple distribution types instead (like RBAC?).

@gerrod3 merged commit 8bccde4 into pulp:master on Jun 11, 2021
@gerrod3 deleted the twine branch on June 11, 2021 15:02
Development

Successfully merging this pull request may close these issues.

As a user, I can upload a python package to a repository from twine
5 participants