This repository has been archived by the owner on Sep 11, 2023. It is now read-only.

Fix coor api stride #1190

Merged
marscher merged 15 commits into markovmodel:devel from fix_coor_api_stride on Dec 1, 2017

Conversation

marscher
Member

@marscher marscher commented Nov 28, 2017

  • [tica] fix handling of stride in estimate

    Also, do not store the running covar as an instance attribute, because it is
    only needed during estimation.

  • [coor/api]

    • removed _param_stage and _get_input_stage (handled by estimate)
    • all chunk sizes default to None (the implementation chooses a value; e.g. Iterable currently has a global default of 5000). This unifies the mess of different default values in the API.
  • [streaming_estimator] chunksize is a parameter of estimate, which passes it on to the input iterable.
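A minimal sketch of the resulting call pattern (illustrative only; the class and helper logic here are not the actual PyEMMA code, and a running mean stands in for the running covariance):

```python
class StreamingEstimatorSketch:
    """Illustrative sketch: chunksize is an argument of estimate() and is
    forwarded to the input iterable; the running statistic lives only
    inside estimate(), not on the instance."""

    def estimate(self, iterable, chunksize=None):
        if chunksize is not None:
            # forward the chunk size to the input iterable
            iterable.chunksize = chunksize
        running_sum, n_frames = 0.0, 0  # local state, discarded after estimation
        for chunk in iterable.iterator(return_trajindex=False):
            running_sum = running_sum + chunk.sum(axis=0)
            n_frames += len(chunk)
        self.mean_ = running_sum / n_frames
        return self
```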

@codecov

codecov bot commented Nov 29, 2017

Codecov Report

Merging #1190 into devel will decrease coverage by 0.01%.
The diff coverage is 96.69%.

Impacted file tree graph

@@            Coverage Diff             @@
##            devel    #1190      +/-   ##
==========================================
- Coverage   90.77%   90.76%   -0.02%     
==========================================
  Files         201      201              
  Lines       20849    20883      +34     
==========================================
+ Hits        18926    18954      +28     
- Misses       1923     1929       +6
Impacted Files Coverage Δ
pyemma/coordinates/data/data_in_memory.py 94.97% <100%> (ø) ⬆️
pyemma/coordinates/data/_base/datasource.py 92.55% <100%> (ø) ⬆️
pyemma/coordinates/tests/test_koopman_estimator.py 98.75% <100%> (ø) ⬆️
pyemma/coordinates/tests/test_source.py 99.07% <100%> (+0.01%) ⬆️
pyemma/util/tests/test_config.py 99.18% <100%> (+0.05%) ⬆️
pyemma/util/_config.py 81.44% <100%> (+0.59%) ⬆️
pyemma/coordinates/estimation/koopman.py 90.9% <100%> (ø) ⬆️
...emma/coordinates/data/_base/streaming_estimator.py 100% <100%> (+14.81%) ⬆️
pyemma/coordinates/tests/test_tica.py 99.31% <100%> (ø) ⬆️
pyemma/coordinates/estimation/covariance.py 82.97% <100%> (+0.12%) ⬆️
... and 12 more

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 769ffde...9a58a05.

@clonker clonker (Member) left a comment

I like the changes; the code itself becomes a lot cleaner (especially in the API). :)

There is some inconsistency between chunksize and chunk_size; I would probably opt for sticking with one of them and eliminating the other.
The default chunk size should probably come from the config file, as we have already discussed.

from pyemma.util.contexts import attribute
from pyemma.util.types import is_int

# this is used, in case None is passed as input chunk size.
DEFAULT_CHUNKSIZE = 5000

As we've discussed, it might be beneficial to configure this value via the config file.


class Iterable(Loggable, metaclass=ABCMeta):

-    def __init__(self, chunksize=1000):
+    def __init__(self, chunksize=DEFAULT_CHUNKSIZE):

I believe this is inconsistent: elsewhere the default chunk size is set if the argument is None, but here the argument itself defaults to the default chunk size.
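A sketch of the consistent pattern being suggested (illustrative; not the PR's actual code): default the argument to None everywhere and resolve it in exactly one place, which could later read from the config file.

```python
from abc import ABCMeta

DEFAULT_CHUNKSIZE = 5000  # module-level default, as in the hunk above

class IterableSketch(metaclass=ABCMeta):
    def __init__(self, chunksize=None):
        # None means "not set by the user"; keep it that way internally
        self._chunksize = chunksize

    @property
    def chunksize(self):
        # the single place that maps None to the actual default
        # (this lookup could later be replaced by a config-file value)
        return DEFAULT_CHUNKSIZE if self._chunksize is None else self._chunksize
```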

also it is called chunksize instead of chunk_size

@@ -115,7 +115,7 @@ class StreamingTransformer(Transformer, DataSource, NotifyOnChangesMixIn):
the chunksize used to batch process underlying data.

"""
-    def __init__(self, chunksize=1000):
+    def __init__(self, chunksize=None):

similar to the previous remark: here you default the argument to None (which probably is what is desired)

@@ -364,7 +366,7 @@ def source(inp, features=None, top=None, chunk_size=None, **kw):
return reader


-def combine_sources(sources, chunksize=1000):
+def combine_sources(sources, chunksize=None):

here it's called chunksize instead of chunk_size

@@ -628,7 +639,7 @@ def save_traj(traj_inp, indexes, outfile, top=None, stride = 1, chunksize=1000,
reading/featurizing/transforming/discretizing the files contained
in :py:obj:`traj_inp.trajfiles`.

-    chunksize : int. Default 1000.
+    chunksize : int. Default=1000.

here it's called chunksize instead of chunk_size

it = iterable.iterator(lag=self.lag, return_trajindex=False, stride=self.stride, skip=self.skip,
-                       chunk=self.chunksize if not partial_fit else 0)
+                       chunk=chunksize)

chunk_size

# iterator over input weights
-        if hasattr(self.weights, 'iterator'):
+        if hasattr(self.weights, '_transform_array'):
             self.weights.data_producer = iterable
             it_weights = self.weights.iterator(lag=0, return_trajindex=False, stride=self.stride, skip=self.skip,
-                                               chunk=self.chunksize if not partial_fit else 0)
+                                               chunk=chunksize)

chunk_size

-        Kest = _KoopmanEstimator(cls.tau, epsilon=cls.epsilon, chunksize=cls.chunksize)
-        Kest.estimate(cls.source_obj)
+        Kest = _KoopmanEstimator(cls.tau, epsilon=cls.epsilon)
+        Kest.estimate(cls.source_obj, chunksize=cls.chunksize)

chunk_size

-        cls.K_est = _KoopmanEstimator(cls.tau, epsilon=cls.epsilon, chunksize=cls.chunksize)
-        cls.K_est.estimate(cls.source_obj)
+        cls.K_est = _KoopmanEstimator(cls.tau, epsilon=cls.epsilon)
+        cls.K_est.estimate(cls.source_obj, chunksize=cls.chunksize)

chunk_size

@@ -193,7 +193,7 @@ def test_chunksize(self):
reader_xtc = api.source(self.traj_files, top=self.pdb_file)
chunksize = 1001
chain = [reader_xtc, api.tica(), api.cluster_mini_batch_kmeans(batch_size=0.3, k=3)]
-        p = api.pipeline(chain, chunksize=chunksize)
+        p = api.pipeline(chain, chunksize=chunksize, run=False)

chunk_size

@marscher
Member Author

Thanks for the review, @clonker. The chunksize vs. chunk_size issue can be solved without breaking the API or deprecating anything first. I think chunk_size is only used in the API and was never used in any estimator. Actually chunk_size is the more appropriate name, but deprecating it and switching over would break a lot of code, IMO. Agreed about the misleading usage in Iterable.

Putting the default value in the config would be nice as well. We also discussed accepting strings to directly limit the amount of memory a chunk can occupy. Do you suggest using pint to do the conversion, which has the nice property that we could accept SI units for memory, or shall we hack something custom for this?

@clonker
Member

clonker commented Nov 30, 2017

True, I didn't consider the API-breaking aspect.
Regarding units: I think pint is the way to go, especially if one would need other units/conversions in the future. However, it does mean introducing a new dependency. But then again, that dependency is pure Python, so it shouldn't cause much trouble.

The default chunksize is now a human-readable string (e.g. '10m') variable in pyemma.config.

All streamed coordinate classes use None as the default chunksize and then determine the value from the config.

Current default = 2m.

It is implemented by obtaining the output type and the dimension.
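A minimal sketch of how such a memory budget could translate into a number of frames per chunk, assuming the config string has already been converted to bytes; the function name and signature are illustrative, not the actual PyEMMA helpers:

```python
import numpy as np

def chunksize_in_frames(bytes_limit, dimension, dtype=np.float32):
    """Turn a per-chunk memory budget (in bytes) into a frame count,
    using the output dtype and dimension of the data source."""
    bytes_per_frame = dimension * np.dtype(dtype).itemsize
    return max(1, int(bytes_limit // bytes_per_frame))

# e.g. a 2 MB budget for 100-dimensional float32 frames -> 5242 frames
print(chunksize_in_frames(2 * 1024**2, dimension=100))
```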
@marscher
Member Author

Didn't go for pint, but added the reverse function to bytes_to_string ;)
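A sketch of what such a reverse function could look like (the actual helper in pyemma.util may differ in name, accepted suffixes, and base):

```python
import re

_UNITS = {'': 1, 'k': 1024, 'm': 1024 ** 2, 'g': 1024 ** 3, 't': 1024 ** 4}

def string_to_bytes(size):
    """Parse a human-readable size such as '2m', '500k' or '1.5G' into bytes.
    Illustrative sketch only."""
    if isinstance(size, (int, float)):
        return int(size)
    match = re.fullmatch(r'\s*(\d+(?:\.\d+)?)\s*([kKmMgGtT]?)[bB]?\s*', size)
    if match is None:
        raise ValueError('could not parse size string: %r' % size)
    value, unit = match.groups()
    return int(float(value) * _UNITS[unit.lower()])

assert string_to_bytes('2m') == 2 * 1024 ** 2
```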

@marscher
Member Author

marscher commented Dec 1, 2017

thanks!

@marscher marscher merged commit 5c7deb9 into markovmodel:devel Dec 1, 2017
@marscher marscher deleted the fix_coor_api_stride branch December 1, 2017 17:34