This repository has been archived by the owner on Sep 11, 2023. It is now read-only.

Loading Objects breaks '_default_chunksize' Attribute for 'get_output' Method #1519

Closed
tim0marshall opened this issue Nov 30, 2021 · 5 comments


@tim0marshall

Hello,

I am trying to save and load pyemma.coordinates objects (for TICA and KmeansClustering) and getting a chunksize error.

For example, while running pyemma.coordinates.tica(), saving the output TICA object as an .h5 file with the 'save' method causes no issues. Upon loading the object and attempting to call the 'get_output()' method on it, I get the following error:
'TICA' object has no attribute '_default_chunksize'

If I generate the TICA object using pyemma.coordinates.tica() and use the 'get_output()' method in one script without saving/loading, there is no issue.

If I call a different method (describe, n_frames_total, get_params), I get the expected output with no issues.

Please let me know how to fix this issue.

Below are the code and the error. I load the TICA object from a directory as tica, print it as a sanity check, and then call the method.

import pyemma

my_dir = f'./testing_2rob/analysisRUN0'    #working dir
tica_import = f'{my_dir}/distances_tica_raw.h5'    #file to load

tica = pyemma.load(tica_import)
print(f'tica loaded:\n {tica}')

tica_getoutput = tica.get_output()

Here's the output.

tica loaded:
TICA(commute_map=False, dim=8, epsilon=1e-06, kinetic_map=True, lag=500,
ncov_max=inf, reversible=True, skip=0, stride=1, var_cutoff=0.95,
weights=None)


AttributeError Traceback (most recent call last)
<ipython-input-17-c66809f45c96> in <module>
7 print(f'tica loaded:\n {tica}')
8
---> 9 tica_getoutput = tica.get_output()

~/packages/anaconda3/lib/python3.6/site-packages/pyemma/coordinates/data/_base/transformer.py in get_output(self, dimensions, stride, skip, chunk)
224 self.estimate(self.data_producer, stride=stride)
225
-> 226 return super(StreamingTransformer, self).get_output(dimensions, stride, skip, chunk)
227
228

~/packages/anaconda3/lib/python3.6/site-packages/pyemma/coordinates/data/_base/datasource.py in get_output(self, dimensions, stride, skip, chunk)
368
369 if chunk is None:
-> 370 chunk = self.chunksize
371
372 # create iterator

~/packages/anaconda3/lib/python3.6/site-packages/pyemma/coordinates/data/_base/transformer.py in chunksize(self)
179 """chunksize defines how much data is being processed at once."""
180 if not self.data_producer:
-> 181 return self.default_chunksize
182 return self.data_producer.chunksize
183

~/packages/anaconda3/lib/python3.6/site-packages/pyemma/coordinates/data/_base/iterable.py in default_chunksize(self)
69 This variable respects your setting for maximum memory in pyemma.config.default_chunksize
70 """
-> 71 if self._default_chunksize is None:
72 try:
73 # TODO: if dimension is not yet fixed (eg tica var cutoff, use dim of data_producer.

AttributeError: 'TICA' object has no attribute '_default_chunksize'

P.S. sorry about the white space in the error log.

@clonker
Member

clonker commented Dec 1, 2021

I'll have a closer look tomorrow, but from a quick glance I think the issue is that the data source is not persisted when saving a model (which is by design). When you try to call get_output, the model looks for this no-longer-existing data source - although I have to admit that the error message could be a bit more descriptive. Possible remedy: setting tica.data_producer = pyemma.coordinates.source(...) with the appropriate files.

@clonker
Copy link
Member

clonker commented Dec 2, 2021

This is indeed the issue. You can do the following:

import pyemma

source = pyemma.coordinates.source(data)
tica = pyemma.coordinates.tica(source)
out = tica.get_output()  # this works

tica.save('tica.h5')
tica_restored = pyemma.load('tica.h5')
# tica_restored.get_output()  does not work because there is no data source configured
tica_restored.data_producer = source  # configure the data source
out = tica_restored.get_output()
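
(Here data stands for whatever the source was originally built from, e.g. a list of trajectory files together with a featurizer, or in-memory numpy arrays.)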

@clonker clonker closed this as completed Dec 10, 2021
@ThiDungNguyen

ThiDungNguyen commented Sep 22, 2022

Thank you @clonker for this the exam code to load saved tica and then get tica.get_output from loaded one. I followed your instruction above for tica and got success. But when I try to load the saved cluster, then do cluster.get_output, I got the same error like I got with tica: 'KmeansClustering' object has no attribute '_default_chunksize". I tried to use cluster_restored.data_producer = source or cluster_restored.data_producer = tica , both ways doesn't work. Could you tell me how to fix this. Thank you.

Here is the code that produces the '_default_chunksize' error:

cluster = coor.cluster_kmeans(tica_output, k=n_clusters, max_iter=100)
cluster.save('clusters_raw.h5', overwrite=True)
cluster_restored = pyemma.load('clusters_raw.h5')
cluster_output = cluster_restored.get_output()

Then the error showed up:

---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Input In [85], in <cell line: 21>()
22 cluster_restored = pyemma.load(f'{tica_dir}/objects/{n_clusters}_clusters_raw.h5')
23 #cluster_restored.data_producer =
---> 24 cluster_output = cluster_restored.get_output()
26 """
27 if do_cluster:
28 def k_means(tica_getoutput, n_cluster, max_iter, save, save_dir, save_name):
(...)
48
49 """

File ~/opt/anaconda3/lib/python3.9/site-packages/pyemma/coordinates/data/_base/transformer.py:226, in StreamingEstimationTransformer.get_output(self, dimensions, stride, skip, chunk)
223 if not self._estimated:
224 self.estimate(self.data_producer, stride=stride)
--> 226 return super(StreamingTransformer, self).get_output(dimensions, stride, skip, chunk)

File ~/opt/anaconda3/lib/python3.9/site-packages/pyemma/coordinates/data/_base/datasource.py:370, in DataSource.get_output(self, dimensions, stride, skip, chunk)
367 assert ndim > 0, "ndim was zero in %s" % self.__class__.__name__
369 if chunk is None:
--> 370 chunk = self.chunksize
372 # create iterator
373 if self.in_memory and not self._mapping_to_mem_active:

File ~/opt/anaconda3/lib/python3.9/site-packages/pyemma/coordinates/data/_base/transformer.py:181, in StreamingTransformer.chunksize(self)
179 """chunksize defines how much data is being processed at once."""
180 if not self.data_producer:
--> 181 return self.default_chunksize
182 return self.data_producer.chunksize

File ~/opt/anaconda3/lib/python3.9/site-packages/pyemma/coordinates/data/_base/iterable.py:71, in Iterable.default_chunksize(self)
63 @property
64 def default_chunksize(self):
65 """ How much data will be processed at once, in case no chunksize has been provided.
66
67 Notes
68 -----
69 This variable respects your setting for maximum memory in pyemma.config.default_chunksize
70 """
---> 71 if self._default_chunksize is None:
72 try:
73 # TODO: if dimension is not yet fixed (eg tica var cutoff, use dim of data_producer.
74 self.dimension()

AttributeError: 'KmeansClustering' object has no attribute '_default_chunksize'

Then I tried to configure the data source by:

cluster_restored = pyemma.load('clusters_raw.h5')
cluster_restored.data_producer = source
cluster_output = cluster_restored.get_output()

The error I got:

---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Input In [86], in <cell line: 21>()
21 if load_cluster:
22 cluster_restored = pyemma.load(f'{tica_dir}/objects/{n_clusters}_clusters_raw.h5')
---> 23 cluster_restored.data_producer = source
24 cluster_output = cluster_restored.get_output()
26 """
27 if do_cluster:
28 def k_means(tica_getoutput, n_cluster, max_iter, save, save_dir, save_name):
(...)
48
49 """

File ~/opt/anaconda3/lib/python3.9/site-packages/pyemma/coordinates/data/_base/transformer.py:135, in StreamingTransformer.data_producer(self, dp)
133 self._data_producer = dp
134 if dp is not None and not isinstance(dp, DataSource):
--> 135 raise ValueError('can not set data_producer to non-iterable class of type {}'.format(type(dp)))
136 # register random access strategies
137 self._set_random_access_strategies()

ValueError: can not set data_producer to non-iterable class of type <class 'function'>

Then I tried to configure the data source a different way:

cluster_restored = pyemma.load('clusters_raw.h5')
cluster_restored.data_producer = tica_output
cluster_output = cluster_restored.get_output()

I still got an error:

---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Input In [87], in <cell line: 21>()
21 if load_cluster:
22 cluster_restored = pyemma.load('clusters_raw.h5')
---> 23 cluster_restored.data_producer = tica_output
24 cluster_output = cluster_restored.get_output()
26 """
27 if do_cluster:
28 def k_means(tica_getoutput, n_cluster, max_iter, save, save_dir, save_name):
(...)
48
49 """

File ~/opt/anaconda3/lib/python3.9/site-packages/pyemma/coordinates/data/_base/transformer.py:135, in StreamingTransformer.data_producer(self, dp)
133 self._data_producer = dp
134 if dp is not None and not isinstance(dp, DataSource):
--> 135 raise ValueError('can not set data_producer to non-iterable class of type {}'.format(type(dp)))
136 # register random access strategies
137 self._set_random_access_strategies()

ValueError: can not set data_producer to non-iterable class of type <class 'list'>

@clonker
Member

clonker commented Sep 23, 2022

Cheers Julie, try this

import pyemma
from numpy.testing import assert_equal

source = pyemma.coordinates.source(data)
tica = pyemma.coordinates.tica(source)
cluster = pyemma.coordinates.cluster_kmeans(tica)
out = cluster.get_output()  # this works

tica.save('tica.h5', overwrite=True)
cluster.save('cluster.h5', overwrite=True)
tica_restored = pyemma.load('tica.h5')
cluster_restored = pyemma.load('cluster.h5')
tica_restored.data_producer = source  # configure the data source
cluster_restored.data_producer = tica_restored  # configure cluster data source as (restored) tica
out_tica = tica_restored.get_output()
out2 = cluster_restored.get_output()

assert_equal(out, out2)
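
If the clustering input only exists in memory as a list of arrays (like tica_output in the earlier comment), wrapping it in a data source first should also work. A minimal sketch, assuming pyemma.coordinates.source is handed the in-memory arrays (which it accepts):

import pyemma

# tica_output: list of numpy arrays (one per trajectory), e.g. from tica.get_output()
tica_source = pyemma.coordinates.source(tica_output)  # wraps the arrays in a DataSource

cluster_restored = pyemma.load('cluster.h5')
cluster_restored.data_producer = tica_source  # the setter requires a DataSource, not a plain list
out = cluster_restored.get_output()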

@mmfarrugia

mmfarrugia commented Jul 26, 2023

Hi there, I have run into this same issue, and I do understand the explanation, but I wanted to ask whether this is also the intended behavior for models generated from data that was not streamed via pyemma.coordinates.source but rather loaded in as an object?

Code/Usage Details:
In my case, I featurized my data by loading it via pyemma.coordinates.load with globbed files and a specified featurizer. I then saved the features to .npy files and loaded them back in as variables via np.load, before passing those variables as arguments to pyemma.coordinates.pca (or tica, or vamp); a rough outline follows below. I did not use the pyemma.coordinates.source method, which the API indicates is streaming. I have relatively large datasets at 20 trajectories of 500 ns each, so perhaps it defaults to streaming(?), but they are reduced to a resolution of 200 ps, so they are not all that long. I can add files if that would be useful, but this code is distributed across multiple files and my own modules, so I thought a thorough explanation would be better; I hope I have done enough digging around to phrase my question/issue well.
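
In outline, with hypothetical file names standing in for my actual paths, the workflow looks roughly like this:

import glob
import numpy as np
import pyemma

# hypothetical stand-ins for my actual files
trajectory_files = sorted(glob.glob('trajs/*.xtc'))
feat = pyemma.coordinates.featurizer('topology.pdb')
feat.add_backbone_torsions()  # example feature choice

# loads everything into memory, as opposed to pyemma.coordinates.source
data = pyemma.coordinates.load(trajectory_files, features=feat)

# save the features per trajectory as .npy, reload them later as plain arrays
for i, traj in enumerate(data):
    np.save(f'features_{i}.npy', traj)
features = [np.load(f'features_{i}.npy') for i in range(len(trajectory_files))]

# pass the reloaded arrays directly to the estimator
pca = pyemma.coordinates.pca(features)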

Please ignore below if that was not the intended behavior.

Code Behavior Question:
If this is the intended behavior in this case as well, does specifying the data_producer only set those values which have not been re-initialized upon loading a model via pyemma.load, or is it also necessary because the model does not retain its transformed variables but rather produces them from the source data each time they are requested via a getter? If so, I will instead save the individual variables I will need later as .npy data (see the sketch below) or pickle them into one file, unless there is an improvement in this functionality in deeptime (I did not see one, and I would prefer not to switch since deeptime is also no longer maintained) or you have another recommendation.
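
That is, rather than saving the estimator itself, I would save only the things I actually need; roughly (a sketch, with attribute names like eigenvectors/eigenvalues as exposed by the PCA object):

import numpy as np

# save the transformed trajectories instead of the model
out = pca.get_output()  # list of arrays, one per trajectory
for i, y in enumerate(out):
    np.save(f'pca_output_{i}.npy', y)

# save the transformation vectors and spectrum needed for later analysis
np.save('pca_eigenvectors.npy', pca.eigenvectors)
np.save('pca_eigenvalues.npy', pca.eigenvalues)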

Use Case Details:
To be clear, my goal in saving the models is less for exact replication of the models (I am saving many, as I am checking the quality of reduced-dimension spaces and the sufficiency of sampling data) and more to minimize computational effort and RAM requirements (when I load data in normally for analysis, as done in the various tutorials, I reach up to 400 GB of RAM usage, and I intend to scale up my sampling to a ceiling of 100 microseconds). Thus, I just want the transformed data in the reduced subspace, as well as the relevant transformation vectors and quality-related data such as the VAMP2 score or implied timescales. If it helps, I notice that the model files are consistently almost a gigabyte in size (0.8 GB), while my original data itself is only about 1.2 GB at most and sometimes as small as 0.3 GB.

I understand that this code is no longer under active maintenance, so thank you for your time and for any remedies you may suggest,

Mikaela
