Serious memory leak on multipart file upload #929
Comments
What is the actual memory limit on the pods? @mykola-mokhnach can you share the application code here?
The application code is super straightforward. This method uploads files:
The memory limit on the pod is 512M, the size of the file is 500M.
At max it should use 75*(num_workers) MiB. @mykola-mokhnach do you see more than this?
This is what I would expect as well. I see the default worker count is 3. But the actual allocation value you can see in the log is much bigger than that.
I think that is because the buffer is not re-used - although it is still 225 MiB for 3 workers.
75*3 = 225 MiB would be fine, but it shows 423 MiB :(
It looks like the issue also exists for single-part uploads; see Line 941 in c8aaeba (the body argument).
For now I've rewritten the implementation like this in my client code. It only supports single-part file upload, but this is fine for us, since the max file size is limited to 4 GB (Amazon's hard limit is 5 GB per part). The average memory usage of this code snippet is no more than 5 MB:

import base64
import hashlib
import os
import uuid
from typing import Optional

# MiB, STORAGE_ROOT, _UNSIGNED_PAYLOAD, Progress, FileStat and log
# are defined elsewhere in the client code.


def load_file_hash(hash_func, file_path: str):
    # Feed the file to the hash in 1 MiB chunks so hashing never holds
    # the whole file in memory.
    with open(file_path, 'rb') as f:
        while True:
            chunk = f.read(1 * MiB)
            if not chunk:
                break
            hash_func.update(chunk)
    return hash_func

...

def upload_file(self, src_path: str, progress: Optional[Progress] = None) -> FileStat:
    file_id = str(uuid.uuid4())
    file_name = os.path.basename(src_path)
    log.debug(f'Got request for "{file_name}" file upload, uuid: {file_id}')
    # A workaround for https://github.com/minio/minio-py/issues/929
    total_size = os.stat(src_path).st_size
    headers = {
        'Content-Length': str(total_size),
    }
    md5_base64 = ''
    sha256_hex = _UNSIGNED_PAYLOAD
    if self._config.secure:
        h = load_file_hash(hashlib.md5(), src_path)
        md5_base64 = base64.b64encode(h.digest()).decode('utf-8')
    else:
        h = load_file_hash(hashlib.sha256(), src_path)
        sha256_hex = h.hexdigest()
    if md5_base64:
        headers['Content-Md5'] = md5_base64
    if progress:
        progress.set_meta(total_size, STORAGE_ROOT + file_id)
    # Pass the open file object as the request body so it is streamed
    # instead of being read into memory up front.
    with open(src_path, 'rb') as body:
        # noinspection PyProtectedMember
        self._require_client()._url_open(
            'PUT',
            bucket_name=self._config.bucket,
            object_name=STORAGE_ROOT + file_id,
            headers=headers,
            body=body,
            content_sha256=sha256_hex
        )
    if progress:
        progress.update(total_size)
    log.debug(f'"{file_name}" has been successfully uploaded, uuid: {file_id}')
    return self.stat_file(file_id)
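A minimal usage sketch for the workaround above, assuming upload_file is a method of a storage client wrapper (the class and configuration names here are hypothetical):

# Hypothetical wrapper that holds self._config and the underlying minio client.
storage = S3Storage(config)
stat = storage.upload_file('/data/archive.bin')
print(stat)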
@mykola-mokhnach this is not a memory leak. The maximum allocated memory would be 6 x part_size, but I also sent a PR which reduces the memory usage.
Thanks for your response @vadmeste. Could you please answer some more questions if you have time?
The data source is a stream, and the S3 spec requires that we know the size of the data before uploading it, so we don't have a choice here.
Not safe, but the default part size is 5 MiB anyway, we can't do much AFAIK.
I probably don't know much about the original background there. These are just some ideas on how I would optimize things:
Hello,
it looks like there is a serious memory leak while uploading multipart files to S3 using this client. We use version 5.0.10 and have been observing constant pod crashes due to OutOfMemory errors while uploading bigger files to S3. This is particularly visible if the allocated amount of pod memory resources is low. After some investigation we've figured out that the root cause of the issue might be in
minio-py/minio/api.py
Line 1734 in c8aaeba
In this loop we load the data from the source and put the chunks into the pool. The actual cause is the
part_data = read_full(data, current_part_size)
line. We actually load all the chunks into RAM first (even for very big files) and pass these in-memory chunks to the corresponding upload threads. I assume the expected behaviour is that the data is read into memory from inside each thread, so that only the "active" chunks are loaded into RAM when the pool decides to activate a particular thread (according to its max_workers setting).
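To make the suggestion concrete, below is a minimal sketch of reading each part lazily inside the worker thread that uploads it. This is not minio-py's actual code: it assumes a seekable local file (which minio-py cannot assume for an arbitrary stream, as noted in the comments above), and PART_SIZE and upload_part are illustrative placeholders.

import os
from concurrent.futures import ThreadPoolExecutor

PART_SIZE = 5 * 1024 * 1024  # illustrative part size, in bytes


def upload_part(part_number: int, part_data: bytes) -> None:
    # Placeholder for the real per-part upload request.
    pass


def upload_in_parts(file_path: str, max_workers: int = 3) -> None:
    total_size = os.path.getsize(file_path)
    part_count = (total_size + PART_SIZE - 1) // PART_SIZE

    def read_and_upload(part_number: int) -> None:
        # The read happens inside the worker thread, right before the upload,
        # so at most max_workers parts are resident in RAM at any time.
        offset = (part_number - 1) * PART_SIZE
        with open(file_path, 'rb') as f:
            f.seek(offset)
            part_data = f.read(PART_SIZE)
        upload_part(part_number, part_data)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(read_and_upload, n) for n in range(1, part_count + 1)]
        for future in futures:
            future.result()  # surface any upload error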
Some local debugging to illustrate the issue better:
The callback that displays the memory usage was put into the Progress thread (the set_meta and update methods). The memory usage is measured with the psutil module (the RSS value).
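For reference, a minimal sketch of such an RSS measurement with psutil (the helper name and output format are illustrative, not the exact code behind the numbers quoted above):

import os

import psutil


def log_rss(tag: str) -> None:
    # Resident set size (RSS) of the current process, reported in MiB.
    rss_mib = psutil.Process(os.getpid()).memory_info().rss / (1024 * 1024)
    print(f'[{tag}] RSS: {rss_mib:.1f} MiB')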