Make the algorithm less memory intensive #43

Closed
stolam opened this issue Jan 19, 2021 · 4 comments


stolam commented Jan 19, 2021

When working with big data, it becomes infeasible to hold everything in memory at once.
Would it be possible to iterate over the data rather than holding it all in memory?

It might also help to expose the n_jobs parameter for UMAP so that the user has some control over the number of cores and, therefore, the memory consumed.
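For illustration only, a rough sketch of what that control could look like. It assumes BERTopic can be handed a pre-configured UMAP model via a `umap_model` argument, which is an assumption here rather than a documented parameter of the version discussed:

```python
# Hypothetical sketch: capping UMAP's parallelism by configuring UMAP directly
# and handing it to BERTopic. The umap_model argument is an assumption.
from sklearn.datasets import fetch_20newsgroups
from umap import UMAP
from bertopic import BERTopic

docs = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes"))["data"]

umap_model = UMAP(
    n_neighbors=15,
    n_components=5,
    min_dist=0.0,
    metric="cosine",
    n_jobs=4,         # limit the number of cores UMAP may use
    low_memory=True,  # trade speed for a smaller memory footprint
)

topic_model = BERTopic(umap_model=umap_model)  # umap_model: assumed parameter
topics, probs = topic_model.fit_transform(docs)
```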


Wktx commented Jan 19, 2021

I am having issues with big data as well; the model's fit_transform is very resource intensive and takes hours. I am curious:

1. Is there a way to turn on a progress bar (e.g. progress_bar=True) when fit-transforming the model, like when using sentence-transformers' encode to get embeddings? I don't see a way to tell if the model is still running or hanging.

2. Are there plans to add an option to offload the model to the GPU via torch? I have VRAM to spare.

Any suggestions and feedback would be greatly appreciated!

@MaartenGr
Owner

@stolam There is also the option to set calculate_probabilities to False, which definitely helps with resource management and speeds things up. In UMAP there is similarly the option to set low_memory to False, which I have found to help on low-resource machines. I am thinking of replacing the calculate_probabilities parameter with a low_memory parameter in order to change both the calculation of probabilities and the low_memory setting in UMAP.
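For illustration, a minimal sketch of the calculate_probabilities setting mentioned above; the 20 newsgroups dataset from scikit-learn is just an example corpus, not something from this thread:

```python
# Sketch: turning off the document-topic probability matrix, which the
# comment above notes is a big contributor to memory use and runtime.
from sklearn.datasets import fetch_20newsgroups
from bertopic import BERTopic

docs = fetch_20newsgroups(subset="all")["data"]

topic_model = BERTopic(calculate_probabilities=False)
topics, probs = topic_model.fit_transform(docs)
```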

@Wktx

  1. This is something I am considering turning on when verbose is set to True, so I do think you will see this in an update in the near future!

  2. Not yet. This is mainly because the model in its entirety cannot currently be offloaded to the GPU, as not all models in BERTopic support that (UMAP & HDBSCAN), although UMAP is nearing that point. Having said that, I might look into replacing sentence-transformers with Flair to make it easier to offload the creation of embeddings.


stolam commented Jan 20, 2021

@MaartenGr Thank you for your reply. It is good to know about the calculate_probabilities option. In my case, the algorithm crashes during the UMAP phase, so I will use the low_memory option (you meant to set it to True, right?) and try to limit the cores; that should improve things.
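For illustration, one way to limit the cores UMAP uses, since its parallelism runs through numba; the specific thread count and environment-variable approach are assumptions about the setup, not something from this thread:

```python
# Sketch: cap the number of threads numba (and therefore UMAP) may use.
# NUMBA_NUM_THREADS has to be set before numba/umap is imported.
import os
os.environ["NUMBA_NUM_THREADS"] = "4"

from umap import UMAP

# low_memory=True is the option discussed above, set to True as intended
umap_model = UMAP(low_memory=True)
```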

Just to put things into perspective: my dataset is 20 GB and my RAM is 128 GB, so I thought I would be OK. Memory consumption grew slowly from 20 to 40 GB and then exploded quickly during the UMAP phase.

MaartenGr mentioned this issue in a merged pull request on Jan 21, 2021.
@MaartenGr
Owner

@stolam I just released a new version of BERTopic (v0.5) that has a low_memory option built in as a parameter. Set this to True and it should use significantly less memory. Also, calculate_probabilities is now set to False by default to prevent accidental memory issues. Hopefully this helps you out. If not, please let me know!
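For illustration, the new parameter described above would be used along these lines:

```python
from bertopic import BERTopic

# low_memory is the new v0.5 parameter; calculate_probabilities now defaults to False
topic_model = BERTopic(low_memory=True)
```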
