OOM/SegFault issues? #25

Closed
ASharmaML opened this issue Oct 6, 2022 · 3 comments
@ASharmaML

PLTs train extremely quickly using this implementation which is fantastic to see. However, I have run into a few issues when training on larger datasets:

  • There is no batching method by default, so very large matrices must be held in memory in order to train the model. I assume the way to avoid this in-memory issue is FitOnFile.
  • Even if the training data fits comfortably in memory, at larger sizes such as >1 million training examples with >10k labels, the Python kernel crashes, which I assume is due to an OOM or segfault error on the C++ side. It feels like there must be a memory leak somewhere, as the trees themselves never get that large, and I assume internally the model trains in batches as outlined in the paper.
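To make the memory concern concrete, here is a rough sparse-vs-dense comparison using SciPy. The sizes and density are scaled-down illustrative assumptions, not napkinXC measurements; the point is only that dense storage grows with rows × columns while CSR storage grows with the number of non-zeros:

```python
import numpy as np
from scipy.sparse import random as sparse_random

# Scaled-down sketch: real datasets in this thread are ~1M examples.
n_examples, n_features, density = 10_000, 1_000, 0.01
X = sparse_random(n_examples, n_features, density=density,
                  format="csr", dtype=np.float32)

# CSR memory is the sum of its three underlying arrays.
sparse_bytes = X.data.nbytes + X.indices.nbytes + X.indptr.nbytes
# A dense float32 matrix of the same shape.
dense_bytes = n_examples * n_features * np.dtype(np.float32).itemsize

print(f"sparse ≈ {sparse_bytes / 1e6:.1f} MB, dense ≈ {dense_bytes / 1e6:.1f} MB")
```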
@mwydmuch
Owner

Hi @ASharmaML, sorry for the long response time and thank you for opening the issue.

If by batching you mean training on a small subset of training examples at a time, then this is actually more problematic than loading the whole dataset: storing all weights (for all trees and their nodes) in memory during training usually requires much more resources than storing the whole dataset in sparse format. Internally the model is trained node by node; once a node is trained, its weights are stored in a file and training of the next node begins. Later, the weights can be loaded in a sparse format for efficient prediction.
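The node-by-node scheme described above can be sketched roughly as follows. This is a hand-written illustration, not napkinXC internals: `train_node`, the targets dictionary, and the `.npy` file layout are hypothetical stand-ins. The key property is that only one node's dense weight vector is alive in memory at a time, and it is flushed to disk before the next node is trained:

```python
import os
import tempfile
import numpy as np

def train_node(X, y):
    # Hypothetical stand-in for fitting a binary classifier at one tree node;
    # returns a dense weight vector with one weight per feature.
    rng = np.random.default_rng(0)
    return rng.standard_normal(X.shape[1]).astype(np.float32)

def train_tree_nodes(X, node_targets, out_dir):
    paths = []
    for node_id, y in node_targets.items():
        w = train_node(X, y)                       # dense weights for this node only
        p = os.path.join(out_dir, f"node_{node_id}.npy")
        np.save(p, w)                              # persist to disk ...
        del w                                      # ... then free before the next node
        paths.append(p)
    return paths

X = np.zeros((4, 8), dtype=np.float32)
targets = {0: [1, 0, 1, 0], 1: [0, 1, 0, 1]}       # toy per-node binary targets
paths = train_tree_nodes(X, targets, tempfile.gettempdir())
```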

The problem with the Python bindings at the moment is that data in the Python format (scipy matrix/numpy array) needs to be copied into the internal format on the C++ side, which simply doubles the memory requirement when using the Python bindings and can cause an OOM error on large datasets. The workaround for now is to save the data to a file and use the fit_on_file method.
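As a concrete sketch of that workaround, the snippet below writes training data in a libsvm-like multilabel text format and then hands the file path to fit_on_file (shown commented out). The exact file format, header conventions, and model constructor arguments here are assumptions, so check the napkinXC documentation before relying on them:

```python
import os
import tempfile

def write_multilabel_libsvm(path, rows):
    # Assumed libsvm-like multilabel layout: "label,label idx:val idx:val".
    # Each row is (labels, {feature_index: value}).
    with open(path, "w") as f:
        for labels, features in rows:
            label_str = ",".join(str(l) for l in labels)
            feat_str = " ".join(f"{i}:{v}" for i, v in sorted(features.items()))
            f.write(f"{label_str} {feat_str}\n")

rows = [
    ([0, 3], {1: 0.5, 7: 1.2}),
    ([2], {0: 0.9, 5: 0.1}),
]
path = os.path.join(tempfile.gettempdir(), "train.txt")
write_multilabel_libsvm(path, rows)

# Hypothetical usage, based on this thread:
# from napkinxc.models import PLT
# model = PLT("plt-model")
# model.fit_on_file(path)  # streams from disk, no Python-side copy to C++
```

Writing the file once avoids holding both the scipy/numpy copy and the C++ copy of the dataset in memory at the same time.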

I will check next week whether there is a leak somewhere. In the meantime, I'm happy to answer any questions you may have.

@ASharmaML
Author

ASharmaML commented Oct 12, 2022

Thanks so much for responding; all of the above makes sense re: batching, since the method itself goes node by node and doesn't allow for it.

Does the fit_on_file method work with the Python bindings currently? When I try to invoke it, I get the following error:

AttributeError: 'napkinxc._napkinxc.CPPModel' object has no attribute 'fitOnFile'

I'm going to try to fix it in the meantime.

@mwydmuch
Owner

Hi @ASharmaML, indeed, there was a typo causing the error. It's fixed now, the new release should work as expected. Thank you very much for spotting and reporting it.

@mwydmuch mwydmuch added the bug Something isn't working label Oct 17, 2022
@mwydmuch mwydmuch self-assigned this Oct 17, 2022