Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Race condition on multiple nearly-simultaneous uploads of the same collection (different versions) #1571

Closed
webknjaz opened this issue Sep 14, 2023 · 1 comment · Fixed by #1572

Comments

@webknjaz
Copy link

Version

galaxy-ng==4.7.1
pulp-ansible==0.17.3
pulp-container==2.14.7
pulp-glue==0.19.5
pulpcore==3.23.13

Describe the bug

We use galaxy-ng in our tests for ansible-core upstream. The jobs are flaky and get stuck periodically.

We don't always see the helper container logs as they aren't available in the CI but I recently observed something locally that just screams a race condition, which led me to believe that the problem is in this repository, specifically.

There are the corresponding log lines:

ERROR! Galaxy import process failed: duplicate key value violates unique constraint "unique_is_highest"
DETAIL:  Key (collection_id, is_highest)=(7cd3d7cd-be93-4f7e-8ca2-07a73f2a8444, t) already exists. (Code: UNKNOWN)

To Reproduce

  1. Make like a dozen copies of a collection with the same namespace, name and all the metadata but different versions. Maybe, hundreds would work better to be sure.
  2. ansible-galaxy collection publish all of them simultaneously (could use a normal shell loop and add & to the publish command to spawn many processes)
  3. Inspect the logs and the database.

Expected behavior

No errors. The latest version is always updated.

Additional context

To add more context, our tests run a galaxy-ng container and provision a bunch of dummy collections for testing of the dependency resolver used in the ansible-galaxy collection install|download commands. Then, we proceed to test the related scenarios.

Here are the collections provisioned with our helper: https://github.com/ansible/ansible/blob/81c83c6/test/integration/targets/ansible-galaxy-collection/vars/main.yml#L27-L167

The helper itself is https://github.com/ansible/ansible/blob/81c83c6/test/integration/targets/ansible-galaxy-collection/library/setup_collections.py and upon closer inspection, it runs ansible-galaxy collection publish asynchronously in subprocesses but no more than 4 at a time.

Back to the test collections being provisioned. There is one of the collections namespace1.name1 that has 14 different versions. That's the one that is most likely to hit race conditions due to the nature of the piece of logic in pulp_ansible where it is happening.

Our helper records the results associated with each upload and the snippet above is extracted from its log.
Said log has successful uploads for all of the collections in the list (not just this one) except for namespace1.name1:0.0.7 while all the other artifacts of this collection (0.0.1-0.0.6, 0.0.8-0.0.10, 0.1.0, 1.0.0, 1.0.9, 1.1.0-beta.1) are identical and are processed in the same loop.

So that duplicate key value violates unique constraint "unique_is_highest" is a clear indication of a database constraint kicking in and failing the DB entry update. It's not hard to imagine that with the asynchronous tasks being scheduled, multiple uploads would trigger simultaneous attempts to update the database table in a way that would cause a conflict.
I'm convinced that if the code were to use transactions/locks/retries, it could be more robust and process this case more gracefully.

The constraint itself is introduced @ https://github.com/pulp/pulp_ansible/blob/567b0c7/pulp_ansible/app/models.py#L234 / https://github.com/pulp/pulp_ansible/blob/567b0c7/pulp_ansible/app/migrations/0005_collectionversion_is_highest.py#L32.
I've also found this visualization of the DB schema: https://alikins.github.io/schemaspy/tables/ansible_collectionversion.html.

I believe that the buggy piece of code that gets into the race condition directly is https://github.com/pulp/pulp_ansible/blob/567b0c7/pulp_ansible/app/tasks/collections.py#L424-L477.

@mdellweg mdellweg self-assigned this Sep 15, 2023
mdellweg added a commit to mdellweg/pulp_ansible that referenced this issue Sep 15, 2023
This bugfix contains a migration and is therefore not backportable.

fixes pulp#1571
mdellweg added a commit to mdellweg/pulp_ansible that referenced this issue Sep 15, 2023
This bugfix contains a migration and is therefore not backportable.

fixes pulp#1571
mdellweg added a commit to mdellweg/pulp_ansible that referenced this issue Sep 15, 2023
This is the backported version of the fix. A different strategy was
pursued.

fixes pulp#1571
mdellweg added a commit that referenced this issue Sep 19, 2023
This bugfix contains a migration and is therefore not backportable.

fixes #1571
@webknjaz
Copy link
Author

@mdellweg is the bugfix release planned anytime soon?

mdellweg added a commit to mdellweg/pulp_ansible that referenced this issue Sep 26, 2023
mdellweg added a commit that referenced this issue Sep 26, 2023
webknjaz added a commit to webknjaz/ansible that referenced this issue Dec 11, 2023
It updates the underlying `pulp_ansible` to v0.20.2, which fixes a
race condition that is occasiontally observable in our CI when
provisioning the initial dummy collections for testing.

Refs:

  * ansible/galaxy_ng#1950
  * ansible/galaxy#3267
  * pulp/pulp_ansible#1571
  * pulp/pulp_ansible#1572
  * pulp/pulp_ansible#1595
  * pulp/pulp_ansible#1623
  * pulp/pulp_ansible#1624
webknjaz added a commit to webknjaz/ansible that referenced this issue Dec 11, 2023
It updates the underlying `pulp_ansible` to v0.20.2, which fixes a
race condition that is occasiontally observable in our CI when
provisioning the initial dummy collections for testing.

Refs:

  * https://github.com/ansible/galaxy_ng/releases/tag/4.9.0
  * ansible/galaxy_ng#1950
  * ansible/galaxy#3267
  * pulp/pulp_ansible#1571
  * pulp/pulp_ansible#1572
  * pulp/pulp_ansible#1595
  * pulp/pulp_ansible#1623
  * pulp/pulp_ansible#1624
webknjaz added a commit to webknjaz/ansible that referenced this issue Dec 11, 2023
It updates the underlying `pulp_ansible` to v0.20.2, which fixes a
race condition that is occasionally observable in our CI when
provisioning the initial dummy collections for testing.

Refs:

  * https://github.com/ansible/galaxy_ng/releases/tag/4.9.0
  * ansible/galaxy_ng#1950
  * ansible/galaxy#3267
  * pulp/pulp_ansible#1571
  * pulp/pulp_ansible#1572
  * pulp/pulp_ansible#1595
  * pulp/pulp_ansible#1623
  * pulp/pulp_ansible#1624
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

2 participants