Make the algorithm less memory intensive #43

Closed
stolam opened this issue Jan 19, 2021 · 4 comments


stolam commented Jan 19, 2021

When working with big data, it becomes infeasible to hold everything in memory at once.
Would it be possible to iterate over the data rather than holding it all in memory?

It might also help to expose the n_jobs parameter for UMAP so that the user has some control over the number of cores and, therefore, the memory consumed.
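For illustration only, a rough sketch of what that control could look like. It assumes BERTopic can be handed a pre-configured UMAP model via a `umap_model` argument, which is an assumption here rather than a documented parameter of the version discussed:

```python
# Hypothetical sketch: capping UMAP's parallelism by configuring UMAP directly
# and handing it to BERTopic. The umap_model argument is an assumption.
from sklearn.datasets import fetch_20newsgroups
from umap import UMAP
from bertopic import BERTopic

docs = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes"))["data"]

umap_model = UMAP(
    n_neighbors=15,
    n_components=5,
    min_dist=0.0,
    metric="cosine",
    n_jobs=4,         # limit the number of cores UMAP may use
    low_memory=True,  # trade speed for a smaller memory footprint
)

topic_model = BERTopic(umap_model=umap_model)  # umap_model: assumed parameter
topics, probs = topic_model.fit_transform(docs)
```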


Wktx commented Jan 19, 2021

I am having issues with big data as well; the model's fit_transform is very resource intensive and takes hours. I am curious:

1. Is there a way to turn on a progress bar (e.g. progress_bar=True) when fit-transforming the model, like when using sentence-transformers' encode to get embeddings? I don't see a way to tell if the model is still running or hanging.

2. Are there plans to add an option to offload the model to the GPU via torch? I have VRAM to spare.

Any suggestions and feedback would be greatly appreciated!

@MaartenGr
Owner

@stolam There is also the option to set calculate_probabilities to False, which definitely helps with resource management and speeds things up. In UMAP there is similarly the option to set low_memory to False, which I have found to help on low-resource machines. I am thinking of replacing the calculate_probabilities parameter with a low_memory parameter in order to change both the calculation of probabilities and the low_memory setting in UMAP.
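For illustration, a minimal sketch of the calculate_probabilities setting mentioned above; the 20 newsgroups dataset from scikit-learn is just an example corpus, not something from this thread:

```python
# Sketch: turning off the document-topic probability matrix, which the
# comment above notes is a big contributor to memory use and runtime.
from sklearn.datasets import fetch_20newsgroups
from bertopic import BERTopic

docs = fetch_20newsgroups(subset="all")["data"]

topic_model = BERTopic(calculate_probabilities=False)
topics, probs = topic_model.fit_transform(docs)
```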

@Wktx

  1. This is something I am considering turning on when verbose is set to True, so I do think you will see this in an update in the near future!

  2. Not yet. This is mainly because the model in its entirety cannot currently be offloaded to the GPU, as not all models in BERTopic support that (UMAP & HDBSCAN), although UMAP is nearing that point. Having said that, I might look into replacing sentence-transformers with Flair to make it easier to offload the creation of embeddings.


stolam commented Jan 20, 2021

@MaartenGr Thank you for your reply. It is good to know about the calculate_probabilities option. In my case, the algorithm crashes during the UMAP phase, so I will use the low_memory option (you meant to set it to True, right?) and try to limit the cores; that should improve things.
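For illustration, one way to limit the cores UMAP uses, since its parallelism runs through numba; the specific thread count and environment-variable approach are assumptions about the setup, not something from this thread:

```python
# Sketch: cap the number of threads numba (and therefore UMAP) may use.
# NUMBA_NUM_THREADS has to be set before numba/umap is imported.
import os
os.environ["NUMBA_NUM_THREADS"] = "4"

from umap import UMAP

# low_memory=True is the option discussed above, set to True as intended
umap_model = UMAP(low_memory=True)
```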

Just to put things into perspective: my dataset is 20 GB and my RAM is 128 GB, so I thought I would be OK. Memory consumption grew slowly from 20 to 40 GB and then exploded quickly during the UMAP phase.

MaartenGr mentioned this issue in a merged pull request on Jan 21, 2021.
@MaartenGr
Owner

@stolam I just released a new version of BERTopic (v0.5) that has a low_memory option built in as a parameter. Set this to True and it should use significantly less memory. Also, calculate_probabilities is now set to False by default to prevent accidental memory issues. Hopefully this helps you out. If not, please let me know!
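For illustration, the new parameter described above would be used along these lines:

```python
from bertopic import BERTopic

# low_memory is the new v0.5 parameter; calculate_probabilities now defaults to False
topic_model = BERTopic(low_memory=True)
```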
