Memory Leaks? #60
I assume you have already seen the section of the README about setting the row group/buffer size? https://github.com/ironSource/parquetjs#buffering--row-group-size -- If you keep increasing the batch size, an OOM error will eventually be the expected failure case. The default batch size is 4096; is it possible that your records are on the order of ~500 KB each? Have you tried lowering the parquet.js row group size?
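For reference, lowering it could look roughly like this; a minimal sketch assuming the `ParquetSchema` / `ParquetWriter.openFile` / `setRowGroupSize` API described in the README, with an illustrative schema and file name:

```js
const parquet = require('parquetjs');

const schema = new parquet.ParquetSchema({
  name: { type: 'UTF8' },
  price: { type: 'DOUBLE' },
});

async function openWriter() {
  const writer = await parquet.ParquetWriter.openFile(schema, 'export.parquet');
  // Flush a row group every 1024 rows instead of the default 4096, which
  // caps how many shredded records are buffered in memory at once.
  writer.setRowGroupSize(1024);
  return writer;
}
```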
I saw the section, but even with 1024 as the rowGroupSize it sometimes (very rarely) hits an OOM error.
I am facing a similar issue with my app. It writes incoming data to Parquet, rotates the files every minute, and pushes them to S3. The application starts with ~400 MB of memory and keeps increasing, eventually crossing 10 GB in a few hours. Each record is a small JSON document, and the Parquet file generated each minute is ~40 MB.

Parquet

S3

Changing the group size also does not alter this behaviour.
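For illustration, a hedged sketch of what the Parquet-write and S3-upload steps labelled above might look like; the bucket name, key naming, schema, and rotation wiring are assumptions, not the commenter's actual code (uses the aws-sdk v2 `S3.upload` API):

```js
const fs = require('fs');
const parquet = require('parquetjs');
const AWS = require('aws-sdk');

const s3 = new AWS.S3();
const schema = new parquet.ParquetSchema({ event: { type: 'UTF8' } });

function newPath() {
  return `events-${Date.now()}.parquet`;
}

// Close the current file, upload it to S3, and open a fresh writer;
// called once per minute by the rotation timer.
async function rotate(writer, currentPath) {
  await writer.close();
  await s3
    .upload({
      Bucket: 'my-bucket',
      Key: currentPath,
      Body: fs.createReadStream(currentPath),
    })
    .promise();
  return parquet.ParquetWriter.openFile(schema, newPath());
}
```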
I think I hit this same issue. I will try to post a working sample to demonstrate, but here's what I did:

Debugging through the library, it seems that everything is fine as long as flushing only happens on the final close. But if, due to your row group size, a flush is also triggered from inside appendRow, concurrent calls can interleave around that flush. As a quick and dirty workaround I changed the relevant lines in the writer, and it seems to work.

With this it does seem to keep the count integrity in place. Does this sound right?

Regards,
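For illustration, a hypothetical calling pattern that can trigger the interleaving described above; the schema and record shape are assumptions, not the commenter's actual code:

```js
const parquet = require('parquetjs');

async function writeUnsequenced(records) {
  const schema = new parquet.ParquetSchema({ id: { type: 'UTF8' } });
  const writer = await parquet.ParquetWriter.openFile(schema, 'out.parquet');

  // All appendRow calls start at once. Several of them can pass the
  // row-group-size check before the first flush completes, so the row
  // buffer is flushed and reset while other calls are still using it.
  await Promise.all(records.map((r) => writer.appendRow(r)));

  await writer.close();
}
```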
OK, thanks for figuring this out! I think I see what is going on here now. There is an undocumented assumption that appendRow is not called concurrently, i.e. that, at any given time, there is only one outstanding call to the method.

For now, the following workaround should solve the problem: ensure that calls to appendRow are sequenced, i.e. only start a new call to appendRow once the previous one has returned. This should ideally be done by using the "await" keyword on any call to appendRow.

As for a long-term solution, I am not enough of a JavaScript expert to tell if something needs to be fixed in parquetjs or not. In my eyes, our current behaviour is similar to the behaviour of the "fs" module built into Node.js; e.g. if you call fs.appendFile concurrently, the result would be equally undefined. However, we might at least consider adding a better error message for users that hit this issue. Also see the discussion in #105
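A minimal sketch of the suggested sequencing, assuming the `ParquetSchema` / `ParquetWriter.openFile` / `appendRow` / `close` API from the README; the schema and file name are illustrative:

```js
const parquet = require('parquetjs');

async function exportRecords(records) {
  const schema = new parquet.ParquetSchema({ id: { type: 'UTF8' } });
  const writer = await parquet.ParquetWriter.openFile(schema, 'export.parquet');

  // Only one appendRow call is outstanding at any time: each write is
  // awaited before the next one starts, so a row-group flush can never
  // overlap with another append.
  for (const record of records) {
    await writer.appendRow(record);
  }

  await writer.close();
}
```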
Closing this as resolved. |
Hi,
I'm using elasticsearchJS to export a whole index from ES in batches of 4096.
The whole tool uses about 500 MB of RAM while dumping the ES index to Parquet format.
(Node.js has a 2 GB memory limit set.)
If I lower or increase the batch size (or sometimes seemingly at random), it uses a lot of memory, around 2-3 GB, and gets killed.
The quickest way to reproduce it is to increase the batch size it has to process.
The generated Parquet file is usually ~5.4 GB.
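For context, a hedged sketch of the kind of export loop described; the scroll-based batching, schema, and connection details are assumptions about the setup, not the actual tool (uses the legacy `elasticsearch` client's search/scroll API):

```js
const elasticsearch = require('elasticsearch');
const parquet = require('parquetjs');

async function dumpIndex(index, outPath) {
  const client = new elasticsearch.Client({ host: 'localhost:9200' });
  const schema = new parquet.ParquetSchema({ doc: { type: 'UTF8' } });
  const writer = await parquet.ParquetWriter.openFile(schema, outPath);

  // Pull the index in scroll batches of 4096 and append each hit.
  let resp = await client.search({ index, scroll: '1m', size: 4096 });
  while (resp.hits.hits.length > 0) {
    for (const hit of resp.hits.hits) {
      await writer.appendRow({ doc: JSON.stringify(hit._source) });
    }
    resp = await client.scroll({ scrollId: resp._scroll_id, scroll: '1m' });
  }

  await writer.close();
}
```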
Is there anything I can do to debug this further?
Thanks!
P.S.: I'm using
git+ssh://git@github.com/ironSource/parquetjs.git#1fa58b589d9b6451379f1558214e9ae751909596
as the parquetJS package.