Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Overwriting existing packages in backend storage can lead to caching issues #5149

Closed
mbearup opened this issue Mar 22, 2024 · 3 comments · Fixed by #5196
Closed

Overwriting existing packages in backend storage can lead to caching issues #5149

mbearup opened this issue Mar 22, 2024 · 3 comments · Fixed by #5196
Labels

Comments

@mbearup
Copy link

mbearup commented Mar 22, 2024

If an existing package is re-added to pulp, the default behavior will overwrite the existing file in backing storage. This is typically fine.

  • If using Azure Blobstore, the timestamp of the blob is updated (Last-Modified time and ETag).
  • Conversely, some CDN's (notably Azure Front Door) use Last-Modified Time as a signal that a file in origin has updated.
  • This can lead to poor cache behavior, and in some cases, incomplete downloads as the CDN attempts to resolve disparate content.
  • If we set AZURE_OVERWRITE_FILES to false this partially mitigates the issue (Last-Modified/ETag are unmodified). However, this results in duplicate copies written to storage (with a suffix to differentiate from the original).
  • We should have an option that does "nothing" if the uploaded file already exists (don't overwrite, and don't write a new copy).
@daviddavis
Copy link
Contributor

daviddavis commented Mar 22, 2024

I was able to reproduce this with a pulp dev environment using minio. Steps to reproduce:

pulp rpm repository create --name test
pulp rpm distribution create --name test --repository test --base-path test
pulp rpm content upload --file bear-4.1-1.noarch.rpm
pulp rpm repository content add --repository test --package-href /pulp/api/v3/content/rpm/packages/018e6776-e768-7f2e-bff5-8aeb5bf6e43b/  
pulp rpm publication create --repository test

# now check the Last-Modified header
curl -sL --head http://localhost:24816/pulp/content/test/Packages/b/bear-4.1-1.noarch.rpm | grep "Last-Modified"

# upload the package again
pulp rpm content upload --file bear-4.1-1.noarch.rpm > /dev/null

# check the Last-Modified header again
curl -sL --head http://localhost:24816/pulp/content/test/Packages/b/bear-4.1-1.noarch.rpm | grep "Last-Modified"

Here's the output I see:

$ curl -sL --head http://localhost:24816/pulp/content/test/Packages/b/bear-4.1-1.noarch.rpm | grep "Last-Modified"
Last-Modified: Fri, 22 Mar 2024 18:40:29 GMT           

$ pulp rpm content upload --file bear-4.1-1.noarch.rpm > /dev/null
Started background task /pulp/api/v3/tasks/018e6782-e5f5-7b25-9e6f-d1e240170130/
.Done.

$ curl -sL --head http://localhost:24816/pulp/content/test/Packages/b/bear-4.1-1.noarch.rpm | grep "Last-Modified"
Last-Modified: Fri, 22 Mar 2024 18:53:35 GMT                                                               

So the Last Modified header changes when a package is uploaded. I definitely think that reuploading a package shouldn't change its modified timestamp for various reasons.

Like Azure, there is a AWS_S3_FILE_OVERWRITE option (defaults to True) but setting it to False makes django-storages create new files in storage with random chars as a suffix which is problematic.

@daviddavis
Copy link
Contributor

Adding a check here before save to try to fetch any existing artifact seems to fix the problem. I think what's happening is that django-storages updates the file in storage regardless of whether the artifact already exists or not.

@dkliban
Copy link
Member

dkliban commented Mar 26, 2024

@daviddavis do you want to open a PR?

ritotn added a commit to ritotn/pulpcore that referenced this issue Mar 27, 2024
…ngo-storages was being updated on duplicate uploads.

Closes pulp#5149
ritotn added a commit to ritotn/pulpcore that referenced this issue Mar 28, 2024
…ngo-storages was being updated on duplicate uploads.

Closes pulp#5149
ritotn added a commit to ritotn/pulpcore that referenced this issue Apr 11, 2024
Fixes a bug where the Last-Modified header of a package stored in django-storages was being updated on duplicate uploads.

Closes pulp#5149
mdellweg pushed a commit that referenced this issue Apr 18, 2024
Fixes a bug where the Last-Modified header of a package stored in django-storages was being updated on duplicate uploads.

Closes #5149
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

3 participants