Thank you for putting together such a great library. It's been extremely helpful.
I was toying with the parameters in the example in the documentation on massive datasets. I realized that when using partial_fit (and therefore the sieve optimizer), adding slightly more features or setting my target sample size to something larger makes it easy to hit a memory error. Here is an example that I tried:
# apricot-massive-dataset-example.py
from apricot import FeatureBasedSelection
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
train_data = fetch_20newsgroups(subset='train', categories=('sci.med', 'sci.space'))
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_data.data) # This returns a sparse matrix which is supported in apricot
print(X_train.shape)
selector = FeatureBasedSelection(1000, concave_func='sqrt', verbose=False)
selector.partial_fit(X_train)
Running the above, I get:
$ python apricot-massive-dataset-example.py
(1187, 25638)
Traceback (most recent call last):
File "apricot-example.py", line 12, in <module>
selector.partial_fit(X_train)
File "/envs/bla/lib/python3.8/site-packages/apricot/functions/base.py", line 258, in partial_fit
self.optimizer.select(X, self.n_samples, sample_cost=sample_cost)
File "/envs/bla/lib/python3.8/site-packages/apricot/optimizers.py", line 1103, in select
self.function._calculate_sieve_gains(X, thresholds, idxs)
File "/envs/bla/lib/python3.8/site-packages/apricot/functions/featureBased.py", line 360, in _calculate_sieve_gains
super(FeatureBasedSelection, self)._calculate_sieve_gains(X,
File "/envs/bla/lib/python3.8/site-packages/apricot/functions/base.py", line 418, in _calculate_sieve_gains
self.sieve_subsets_ = numpy.zeros((l, self.n_samples, self._X.shape[1]), dtype='float32')
numpy.core._exceptions.MemoryError: Unable to allocate 117. GiB for an array with shape (1227, 1000, 25638) and data type float32
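As a quick sanity check (my own arithmetic, not apricot code), the requested allocation size follows directly from the shape reported in the error message:

# Back-of-the-envelope check of the allocation size reported above.
n_thresholds, n_samples, n_features = 1227, 1000, 25638
bytes_needed = n_thresholds * n_samples * n_features * 4  # float32 = 4 bytes per element
print(bytes_needed / 2**30)  # ~117.2 GiB, matching the MemoryError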
This behavior doesn't happen when I use fit() and another optimizer, e.g., two-stage.
Looking into the code, it seems that the root cause is the array initialization of sieve_subsets_, and a similar allocation can happen again later. In both places, we ask for a zeroed, dense float array (float32 in the traceback above) of size |thresholds| x |n_samples| x |feature_dims|, which can become quite large and not fit in memory when dealing with massive datasets. I wonder if there is a more memory-efficient way of writing it? Thanks!
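One possible direction, strictly as a sketch on my end and not apricot's actual implementation: since the input is already sparse, sieve_subsets_ could perhaps be held as one scipy sparse matrix per threshold rather than a single dense 3-D array, so memory would scale with the non-zeros actually written. The attribute name and dimensions below mirror the traceback; everything else is hypothetical.

import numpy
from scipy.sparse import lil_matrix

# Dimensions taken from the traceback: |thresholds| x |n_samples| x |feature_dims|
n_thresholds, n_samples, n_features = 1227, 1000, 25638

# Current behaviour (per base.py line 418): one dense float32 block, ~117 GiB.
# sieve_subsets_ = numpy.zeros((n_thresholds, n_samples, n_features), dtype='float32')

# Hypothetical alternative: one sparse LIL matrix per threshold, so memory grows
# with the number of non-zero entries stored rather than the full dense size.
sieve_subsets_ = [lil_matrix((n_samples, n_features), dtype='float32')
                  for _ in range(n_thresholds)]

# Selected rows would then be copied in one at a time, e.g.:
# sieve_subsets_[t][i] = X[idx].toarray()  # LIL keeps only the non-zero entries per row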