Upload chunks as separate files #914
Attached issue: https://pulp.plan.io/issues/4498 |
a004394
to
d3bbe34
Compare
|
I think the main problem with this solution is that it's tied to S3 and we support a variety of other storage backends besides S3 (e.g. Azure). I think the option of where to store upload chunks should be to either use the default storage backend or to override that and use the filesystem. |
pulpcore/app/models/upload.py
Because of that, we are not able to dynamically change the storage used for chunked uploads
without applying additional migrations after performing relevant changes to the settings file.
Support for a callable storage is new in Django 3.1.
With this limitation in Django 2, having a separate storage setting for chunks makes for a rather complex solution. I'm inclined to ditch the setting altogether (until Django 3.1) and just store chunks in the default storage.
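To make the limitation concrete, here is a minimal plain-Python sketch of the idea, assuming (as the quoted docstring describes) that a concrete storage ends up recorded in migrations while a callable is recorded by reference. It is a simplified model and not Django's implementation; `SETTINGS`, `select_storage`, `Field`, and `migration_state` are hypothetical names invented for this illustration.

```python
# Simplified model of the behavior; this is NOT Django's implementation.
SETTINGS = {"CHUNK_STORAGE": "filesystem"}  # hypothetical setting


def select_storage():
    """Hypothetical callable that reads the setting each time it is called."""
    return SETTINGS["CHUNK_STORAGE"]


class Field:
    """Toy field that records what a migration would serialize."""

    def __init__(self, storage):
        if callable(storage):
            # Remember the callable and resolve it to a concrete storage.
            self._storage_callable = storage
            storage = storage()
        else:
            self._storage_callable = None
        self.storage = storage

    def deconstruct(self):
        # A callable is serialized by name, so its result may change freely.
        if self._storage_callable is not None:
            return self._storage_callable.__name__
        # A concrete storage is serialized as-is, tying it to the migration.
        return self.storage


def migration_state(use_callable):
    """Return what the toy field would write into a migration."""
    storage = select_storage if use_callable else SETTINGS["CHUNK_STORAGE"]
    return Field(storage).deconstruct()
```

In this model, changing `SETTINGS["CHUNK_STORAGE"]` alters the serialized state of the fixed-storage field (so a new migration would be needed), while the callable variant keeps serializing the same name.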
I agree that we should not be tied to S3 only and should use the default storage backend. I am less sure about using the local filesystem, because it becomes a problem when, for example, 100 docker push operations run simultaneously: the layers are quite big, and the local filesystem might not have enough space.
I agree but if we give users the choice of where to store their upload chunks, I don't think we should prevent them from using a filesystem. |
|
I thought it would just use the temporary file storage. |
|
The problem with just using temporary file storage is that it's horribly inefficient in some cases. It requires that Pulp download the chunks, assemble the files, and then re-upload the chunks back into S3, after already having uploaded the chunks into S3 in the first place. It also presents the same problem @ipanova outlined: users could potentially run out of filesystem space.
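The download-and-assemble step mentioned above can be sketched roughly as follows. This is an illustration only, not Pulp's code: the `assemble` function, the `(offset, name)` chunk tuples, and the `read_chunk` callback standing in for a remote storage read are all hypothetical.

```python
import hashlib
import io


def assemble(chunks, read_chunk, out):
    """Write chunks to `out` in offset order and return the SHA-256 hex digest.

    `chunks` is an iterable of (offset, name) pairs; `read_chunk(name)` is a
    hypothetical callback that fetches one chunk's bytes from storage.
    """
    digest = hashlib.sha256()
    expected_offset = 0
    for offset, name in sorted(chunks):
        if offset != expected_offset:
            raise ValueError(f"gap or overlap at offset {offset}")
        data = read_chunk(name)  # one round trip to storage per chunk
        out.write(data)
        digest.update(data)
        expected_offset += len(data)
    return digest.hexdigest()


if __name__ == "__main__":
    remote = {"a": b"hello ", "b": b"world"}  # stand-in for remote storage
    buffer = io.BytesIO()
    print(assemble([(6, "b"), (0, "a")], remote.__getitem__, buffer))
```

Every chunk is read back once and the assembled result still has to be written somewhere, which is the extra traffic and local space the comment is concerned about.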
|
I agree though that using temporary storage only for now makes sense (or at least until we're on django 3.1 and can offer some configuration options). I would document the issues I describe above though so users are aware. |
|
@daviddavis, speaking of the option for temporary storage (i.e. FileSystemStorage), users would have to deal with migrations again, as I mentioned in the documentation. The only thing I can actually do, is to note that |
|
Sorry to say it like that, but having the user create site-local migrations is not an option at all. |
22e1a8d
to
45ea0ca
Compare
|
What happens to users that have upload chunks in their database and then they upgrade? |
They could probably lose track of uploads since the file field is deleted by the migration. Will users perform chunked uploading during the upgrade by any chance? I thought that once an upload is finished, all users create an artifact from it; then, there should not be any uploads left behind. I may modify the migration to move all uploads to the directory used for chunked uploads. Then, uploaded files will be referenced by UploadChunk models instead of Upload models. |
|
@lubosmj I brought this topic up at the pulpcore team meeting. The consensus was to add a release note saying that users should delete all uploads before upgrading. So I think we want a XXXX.removal saying that upload chunks are being moved from the filesystem to the default storage the user has configured and users need to delete any chunked uploads before upgrading. |
45ea0ca
to
19a81e3
Compare
|
Honestly, I understood that we expect the administrator to clean up the disk space after the migration.
Looks good to me; the minor comments should not be considered blockers.
upload_chunk = UploadChunk(
    upload=self, offset=offset, size=len(chunk), sha256=current_sha256
)
upload_chunk.file.save("", ContentFile(chunk_read))
Why do you have an empty string here?
As discussed, we need to add a name.
I used the same technique as we use in PulpTemporaryFile. A chunk's filename does not depend on the provided filename, but rather on pulp_id, which is known when saving the chunk.
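As a rough sketch of that technique: Django calls an `upload_to` callable with the model instance and the name passed to `save()`, so the callable is free to ignore the name. The example below is plain Python with no Django dependency; `FakeChunk`, `chunk_path`, and the `upload/` prefix are hypothetical and do not reflect Pulp's actual path layout.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class FakeChunk:
    """Stand-in for a model instance whose pulp_id is set before saving."""

    pulp_id: uuid.UUID = field(default_factory=uuid.uuid4)


def chunk_path(instance, filename):
    """upload_to-style callable: the provided filename is deliberately ignored."""
    return f"upload/{instance.pulp_id}"  # hypothetical prefix


if __name__ == "__main__":
    chunk = FakeChunk()
    # The empty string and any other name yield the same storage path.
    print(chunk_path(chunk, "") == chunk_path(chunk, "layer.tar"))
```

Because the path is derived only from the instance, passing `""` as the name is harmless in this model.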
As discussed on IRC, more changes need to be added first.
19a81e3
to
643b75f
Compare
CHANGES/4498.removal
@@ -0,0 +1,2 @@
The local file system directory used for uploaded chunks was changed and moved to the default
storage instead. Users are encouraged to remove all uploaded files before applying this change.
small change:
s/remove all uploaded/remove all uncommitted uploaded/
|
I thought we talked about using pulp_id or uuid to store the chunks in storage and not sha256? |
|
Oh, I see now you are using |
pulpcore/app/models/upload.py
        return storage.get_upload_chunk_file_path(self.pulp_id)

    file = fields.FileField(null=False, upload_to=storage_path, max_length=255)
    sha256 = models.CharField(max_length=64, null=True)
What is the purpose of this field if it is not used when an UploadChunk is created? Please either remove it completely or use it during the creation of the upload chunk.
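One way such a field could be put to use is to check each chunk against a client-supplied digest before storing it. The sketch below is an assumption about a possible approach, not the PR's code; `verify_chunk` is a hypothetical helper built only on the standard `hashlib` module.

```python
import hashlib
from typing import Optional


def verify_chunk(data: bytes, expected_sha256: Optional[str]) -> str:
    """Return the chunk's SHA-256 hex digest, raising if it does not match.

    `expected_sha256` may be None when the client did not send a digest,
    mirroring a nullable sha256 column.
    """
    actual = hashlib.sha256(data).hexdigest()
    if expected_sha256 is not None and actual != expected_sha256.lower():
        raise ValueError("chunk checksum mismatch")
    return actual


if __name__ == "__main__":
    payload = b"chunk bytes"
    digest = hashlib.sha256(payload).hexdigest()
    print(verify_chunk(payload, digest) == digest)
```

The returned digest could then be passed as the `sha256` value when the chunk record is created.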
closes #4498
643b75f
to
0642584
Compare
|
@daviddavis, after the latest changes, nested directories are not created. However, the chunks' file names are still equal to pulp_id.
|
Cool, I would probably get rid of the |
Added one optional comment. Looks good enough to merge to me.
Thanks @lubosmj!
|
Yes, I think that it is useful to have that function! |
closes #4498