Skip to content
This repository has been archived by the owner on Sep 11, 2023. It is now read-only.

chunksize is ignored in TICA #846

Closed
fabian-paul opened this issue Jun 29, 2016 · 9 comments
Closed

chunksize is ignored in TICA #846

fabian-paul opened this issue Jun 29, 2016 · 9 comments

Comments

@fabian-paul
Copy link
Member

Apparently setting the chunk size parameter doesn't have any effect in TICA anymore.
I have a long (microsecond) trajectory that I want to use to evaluate a larger set of features. Setting the chunk_size doesn't have any effect on the feature reader as can be seen from the output below.

import pyemma

ft = pyemma.coordinates.featurizer('PMI-amber.pdb')
ft.add_backbone_torsions()
source = pyemma.coordinates.source(['28.xtc'], features=ft, chunk_size=10)
tica = pyemma.coordinates.tica(source, lag=5000)

I added the following print statement here:

class FeatureReaderIterator(DataSourceIterator):
    ...
    def _next_chunk(self):
        ...
        chunk = next(self._mditer)
        shape = chunk.xyz.shape
        print 'chunk shape', chunk.xyz.shape

This is the output I get.

chunk shape (100000, 195, 3)
chunk shape (95000, 195, 3)

I'm not sure how this should be fixed. As a quick fix, I changed https://github.com/markovmodel/PyEMMA/blob/devel/pyemma/coordinates/transform/tica.py#L264
to

it = iterable.iterator(lag=self.lag, return_trajindex=False, chunk=iterable.chunksize)

However this works only, if I instantiate the TICA object directly. When I use the tica() API function, the chunksize is set back to zero.

IMHO this looks very broken to me. There are some lines such as https://github.com/markovmodel/PyEMMA/blob/devel/pyemma/coordinates/api.py#L837 that don't make any sense to me. _param_stage only gets called with its chunk_size parameter left at its default value = 0, so this can only work if the chunk size parameters of the source is also zero.

What's the recommanded way to pass the chunk_size parameter between different transformers? If an iterator is created in _estimate() where should the chunk_size parameter be taken from?

@fabian-paul
Copy link
Member Author

With .get_output() it's the same problem. The ckunksize parameter of the transformer and it's source are not respected.

@marscher
Copy link
Member

the self.chunk_size property is delegated through the pipeline, so it should be taken from there. Your approach is the correct fix, but do not take the property from the iterable but self instead.

I can not see why the chunksize in Iterable.get_output should not be respected, since it is passed correctly:
https://github.com/markovmodel/PyEMMA/blob/devel/pyemma/coordinates/data/_base/iterable.py#L129

@marscher
Copy link
Member

And you are right, the handling in param_stage is wrong, since the set chunksize of the reader/datasource is overridden with 0.

marscher added a commit to marscher/PyEMMA that referenced this issue Jun 29, 2016
Only set it in case the user has overriden it, otherwise take
default chunksize from Iterable.

Fixes markovmodel#846
@marscher
Copy link
Member

@fabian-paul can you please confirm that the behaviour correct with the suggested PR?

@fabian-paul
Copy link
Member Author

@marscher yes, this solved the problems.

@marscher
Copy link
Member

marscher commented Jun 30, 2016 via email

@marscher
Copy link
Member

@clonker with these changes the test_fragmented_reader_xtc fails with wrong coordinates. Do you have any clue what could be broken now?

@marscher
Copy link
Member

I'd guess that the chunked mode is somehow broken, since the tested behaviour used always chunksize=0

@marscher
Copy link
Member

I've found the culprit: in patches.iterload there was no proper setter for chunksize...

cc @clonker

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants