
partial_fit and sieve can easily outgrow available memory #33

Open
nucflash opened this issue May 30, 2022 · 0 comments
Thank you for putting together such a great library. It's been extremely helpful.

I was toying with the parameters in the documentation's example on massive datasets. I realized that when using partial_fit (and therefore the sieve optimizer), it is easy to hit a memory error once the feature dimensionality grows slightly or the target sample size is set a bit larger. Here is an example that I tried:

# apricot-massive-dataset-example.py
from apricot import FeatureBasedSelection
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer

train_data = fetch_20newsgroups(subset='train', categories=('sci.med', 'sci.space'))
vectorizer = TfidfVectorizer()

X_train = vectorizer.fit_transform(train_data.data) # This returns a sparse matrix which is supported in apricot
print(X_train.shape)

selector = FeatureBasedSelection(1000, concave_func='sqrt', verbose=False)
selector.partial_fit(X_train)

Running the above, I get:

$ python apricot-massive-dataset-example.py
(1187, 25638)
Traceback (most recent call last):
  File "apricot-example.py", line 12, in <module>
    selector.partial_fit(X_train)
  File "/envs/bla/lib/python3.8/site-packages/apricot/functions/base.py", line 258, in partial_fit
    self.optimizer.select(X, self.n_samples, sample_cost=sample_cost)
  File "/envs/bla/lib/python3.8/site-packages/apricot/optimizers.py", line 1103, in select
    self.function._calculate_sieve_gains(X, thresholds, idxs)
  File "/envs/bla/lib/python3.8/site-packages/apricot/functions/featureBased.py", line 360, in _calculate_sieve_gains
    super(FeatureBasedSelection, self)._calculate_sieve_gains(X,
  File "/envs/bla/lib/python3.8/site-packages/apricot/functions/base.py", line 418, in _calculate_sieve_gains
    self.sieve_subsets_ = numpy.zeros((l, self.n_samples, self._X.shape[1]), dtype='float32')
numpy.core._exceptions.MemoryError: Unable to allocate 117. GiB for an array with shape (1227, 1000, 25638) and data type float32

This behavior doesn't happen when I use fit() with another optimizer, e.g., two-stage.
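
For comparison, here is a minimal sketch of the variant that runs fine for me (same X_train as above; the only change is going through fit() with the two-stage optimizer instead of partial_fit):

selector = FeatureBasedSelection(1000, concave_func='sqrt',
                                 optimizer='two-stage', verbose=False)
# fit() with 'two-stage' never calls _calculate_sieve_gains, so the large
# (thresholds x n_samples x n_features) sieve buffer is never allocated.
selector.fit(X_train)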

Looking into the code, the root cause seems to be the initialization of the sieve_subsets_ array, and a similar allocation happens again later in the same function. In both places, we ask for a dense, zero-initialized float32 matrix of size |thresholds| x |n_samples| x |feature_dims|, which can become quite large and not fit in memory when dealing with massive datasets. I wonder if there is a more memory efficient way of writing it? Thanks!
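
To make the scale concrete, a quick back-of-the-envelope check (plain numpy arithmetic, nothing apricot-specific) reproduces the 117 GiB figure from the traceback:

import numpy as np

# Shape reported in the traceback: (len(thresholds), n_samples, n_features)
thresholds, n_samples, n_features = 1227, 1000, 25638

n_bytes = thresholds * n_samples * n_features * np.dtype('float32').itemsize
print(f"{n_bytes / 2**30:.1f} GiB")  # ~117.2 GiB for a dense float32 buffer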
