
Upload chunks as separate files #914

Merged
merged 1 commit into pulp:master from not-only-s3-chunked-uploads-4498 on Oct 20, 2020

Conversation

@lubosmj (Member) commented Sep 16, 2020

closes #4498

@pulpbot (Member) commented Sep 16, 2020

Attached issue: https://pulp.plan.io/issues/4498

@daviddavis (Contributor)

I think the main problem with this solution is that it's tied to S3, and we support a variety of other storage backends besides S3 (e.g. Azure). I think the option of where to store upload chunks should be either to use the default storage backend or to override that and use the filesystem.


Excerpt from the PR's documentation changes under review:

Because of that, we are not able to dynamically change the storage used for chunked uploads
without applying additional migrations after performing relevant changes to the settings file.
Support for a callable storage is new in Django 3.1.
Review comment from @daviddavis (Contributor):

With this limitation in Django 2, having a separate storage setting for chunks creates a rather complex solution. I'm inclined to ditch the setting altogether (until Django 3.1) and just store chunks in the default storage.
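
For context, here is a minimal sketch of the Django 3.1 feature referenced in the excerpt above; the names are illustrative and not from this PR:

from django.core.files.storage import default_storage
from django.db import models


def select_chunk_storage():
    # Evaluated when the model class is loaded; migrations serialize a
    # reference to the callable rather than the evaluated backend, so
    # swapping the backend in settings does not generate a new migration.
    return default_storage


class ExampleChunk(models.Model):
    # Hypothetical model, for illustration only.
    file = models.FileField(storage=select_chunk_storage, max_length=255)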

@ipanova (Member) commented Sep 17, 2020

I think the main problem with this solution is that it's tied to S3, and we support a variety of other storage backends besides S3 (e.g. Azure). I think the option of where to store upload chunks should be either to use the default storage backend or to override that and use the filesystem.

I agree that we should not be tied to S3 only but use the default storage backend. I am not so sure about the idea of using the local fs, because it will be an issue when, for example, 100 docker push operations run simultaneously; layers are quite big, and it might happen that the local fs won't be enough.

@daviddavis (Contributor)

I agree that we should not be tied to S3 only but use the default storage backend. I am not so sure about the idea of using the local fs, because it will be an issue when, for example, 100 docker push operations run simultaneously; layers are quite big, and it might happen that the local fs won't be enough.

I agree, but if we give users the choice of where to store their upload chunks, I don't think we should prevent them from using a filesystem.

@mdellweg (Member)

I thought it would just use the temporary file storage.

@daviddavis (Contributor)

The problem with just using temporary file storage is that it's horribly inefficient in some cases. It requires that Pulp download the chunks, assemble the file, and then re-upload the result to S3, after having already uploaded the chunks to S3 in the first place. And it also presents the same problem @ipanova outlined, in that users could potentially run out of filesystem space.

@daviddavis (Contributor)

I agree, though, that using temporary storage only for now makes sense (at least until we're on Django 3.1 and can offer some configuration options). I would document the issues I describe above so users are aware.

@lubosmj (Member, Author) commented Sep 22, 2020

@daviddavis, speaking of the option for temporary storage (i.e. FileSystemStorage), users would have to deal with migrations again, as I mentioned in the documentation.

The only thing I can actually do is note that, in Pulp, we support only default_storage and FileSystemStorage for storing chunked uploads; other options are not guaranteed to work properly. Yet, I am not sure if this is what you meant by your last comment. And of course, I will remove the wrapper around the S3 storage.
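
For illustration only, that constraint could be expressed as a small selector like the sketch below; the CHUNKED_UPLOAD_DIR setting name is hypothetical, and this is not the PR's code:

from django.conf import settings
from django.core.files.storage import FileSystemStorage, default_storage


def chunk_storage():
    # If a dedicated directory is configured, chunks go to the local
    # filesystem; otherwise they follow the default storage backend.
    chunk_dir = getattr(settings, "CHUNKED_UPLOAD_DIR", None)
    if chunk_dir:
        return FileSystemStorage(location=chunk_dir)
    return default_storage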

@mdellweg (Member)

Sorry to say it like that, but having the user create site-local migrations is not an option at all. We already struggle a lot not to generate or cherry-pick migrations on release branches.

@daviddavis (Contributor)

@lubosmj my last comment is that I think we should ditch the storage setting altogether for now and just use default_storage for chunk uploads.

@mdellweg I'm not sure what you mean by users creating "site-local migrations"

@mdellweg (Member)

@daviddavis referring to this: https://github.com/pulp/pulpcore/pull/914/files#diff-f47f8ca652dac550c8bc9d449ca0d253R165

@lubosmj force-pushed the not-only-s3-chunked-uploads-4498 branch 3 times, most recently from 22e1a8d to 45ea0ca on September 23, 2020 at 15:44
@daviddavis (Contributor)

What happens to users that have upload chunks in their database and then they upgrade?

@lubosmj (Member, Author) commented Sep 29, 2020

What happens to users that have upload chunks in their database and then they upgrade?

They could probably lose track of uploads, since the file field is deleted by the migration. Will users perform chunked uploading during the upgrade by any chance? I thought that once an upload is finished, users create an artifact from it, so there should not be any uploads left behind.

I may modify the migration to move all uploads to the directory used for chunked uploads. Then, uploaded files will be referenced by UploadChunk models instead of Upload models.
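
A hypothetical sketch of that migration idea, assuming the pre-migration Upload.file is still readable per chunk offset; the app label, dependency, and copy logic are illustrative, not the PR's actual migration:

from django.core.files.base import ContentFile
from django.db import migrations


def move_upload_files(apps, schema_editor):
    UploadChunk = apps.get_model("core", "UploadChunk")
    for chunk in UploadChunk.objects.select_related("upload").iterator():
        old_file = chunk.upload.file
        old_file.open("rb")
        old_file.seek(chunk.offset)
        # Saving with an empty name lets upload_to derive the path from pulp_id.
        chunk.file.save("", ContentFile(old_file.read(chunk.size)), save=True)
        old_file.close()


class Migration(migrations.Migration):
    dependencies = [("core", "0001_initial")]  # placeholder dependency
    operations = [migrations.RunPython(move_upload_files, migrations.RunPython.noop)]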

@daviddavis (Contributor)

@lubosmj I brought this topic up at the pulpcore team meeting. The consensus was to add a release note saying that users should delete all uploads before upgrading. So I think we want an XXXX.removal note saying that upload chunks are being moved from the filesystem to the default storage the user has configured, and that users need to delete any chunked uploads before upgrading.

@mdellweg (Member)

Honestly, I understood that we expect the administrator to clean up the disk space after the migration. But at least we need a note that pending uploads will not survive the upgrade.

@daviddavis (Contributor) commented Sep 30, 2020

Overall, looks good to me. Great job on this. 👍

@mdellweg or @ipanova could you also take a look?

@ipanova (Member) previously approved these changes Sep 30, 2020, leaving a review:

Looks good to me; minor comments, which should not be considered a blocker.

upload_chunk = UploadChunk(
upload=self, offset=offset, size=len(chunk), sha256=current_sha256
)
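# The empty name below is deliberate: the field's upload_to callable derives
# the stored path from the chunk's pulp_id (see the author's reply below).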
upload_chunk.file.save("", ContentFile(chunk_read))
Review comment (Member):

Why do you have an empty string here?

Review comment (Member):

As discussed, we need to add a name.

Reply from @lubosmj (Member, Author):

I used the same technique as we use in PulpTemporaryFile. A chunk's filename does not depend on the provided filename, but rather on pulp_id, which is known when saving the chunk.
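
A rough sketch of that technique: an upload_to callable that ignores the name passed to file.save() and keys the path on pulp_id. The path layout here is made up; the PR's actual helper, storage.get_upload_chunk_file_path, appears in the diff excerpt below.

def storage_path(self, name):
    # ``name`` is the empty string passed to file.save() and is ignored;
    # the chunk's location is derived from its pulp_id instead.
    return f"upload/{self.pulp_id}"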

@ipanova self-requested a review September 30, 2020 21:01.
@ipanova dismissed their stale review September 30, 2020 21:02: "as discussed per IRC, more changes need to be added first."

@lubosmj force-pushed the not-only-s3-chunked-uploads-4498 branch from 19a81e3 to 643b75f on October 9, 2020 at 13:07
New changelog entry under review:

@@ -0,0 +1,2 @@
The local file system directory used for uploaded chunks was changed and moved to the default
storage instead. Users are encouraged to remove all uploaded files before applying this change.
Review comment (Contributor):

Small change:

s/remove all uploaded/remove all uncommitted uploaded/

@daviddavis (Contributor)

I thought we talked about using pulp_id or uuid to store the chunks in storage and not sha256?

@daviddavis (Contributor)

Oh, I see now that you are using pulp_id_str to put it in a nested dir. I think this is probably overkill; I don't think there will be more than 100 chunks at any one time.

Review thread on pulpcore/app/models/upload.py (resolved):
return storage.get_upload_chunk_file_path(self.pulp_id)

file = fields.FileField(null=False, upload_to=storage_path, max_length=255)
sha256 = models.CharField(max_length=64, null=True)
Review comment (Member):

What is the purpose of this field if it is not being used when an UploadChunk is created? Please either remove it completely or use it during the creation of the upload chunk.

@lubosmj force-pushed the not-only-s3-chunked-uploads-4498 branch from 643b75f to 0642584 on October 16, 2020 at 11:10
@lubosmj (Member, Author) commented Oct 16, 2020

@daviddavis, after the latest changes, nested directories are not created. However, chunks' file names are still equal to pulp_id.

@daviddavis (Contributor)

Cool, I would probably get rid of the get_upload_chunk_file_path function too, but if you think it's useful to have, then that's fine.

@daviddavis (Contributor) left a review:

Added one optional comment. Looks good enough to merge to me.

Thanks @lubosmj!

@lubosmj (Member, Author) commented Oct 16, 2020

Yes, I think that it is useful to have that function!

@dkliban merged commit e46c3ff into pulp:master on Oct 20, 2020