Conversation
Also, do not store the running covariance as an instance attribute, because it is only needed during estimation.
Removed implementation details (wiring data producers, setting the chunksize) from the API and moved them into StreamingTransformer.estimate(). Fixed docstrings in the API.
Codecov Report
@@            Coverage Diff             @@
##             devel    #1190      +/-   ##
==========================================
- Coverage    90.77%   90.76%    -0.02%
==========================================
  Files          201      201
  Lines        20849    20883       +34
==========================================
+ Hits         18926    18954       +28
- Misses        1923     1929        +6
Continue to review full report at Codecov.
I like the changes, the code in itself becomes a lot cleaner (especially in the API). :)
There is some inconsistency between chunksize and chunk_size, and I would probably opt for sticking with one of them and eliminating the other.
The default chunk size should probably come from the config file, as we have already discussed.
from pyemma.util.contexts import attribute
from pyemma.util.types import is_int

# this is used, in case None is passed as input chunk size.
DEFAULT_CHUNKSIZE = 5000
as we've discussed, might be beneficial to configure this value via the config file
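A minimal sketch of what a config-backed default could look like; the section/option names and the use of configparser are assumptions for illustration, not PyEMMA's actual config mechanism:

```python
from configparser import ConfigParser

# Hypothetical: read the default chunk size from a config file instead of
# hard-coding it; section and option names here are made up for illustration.
_cfg = ConfigParser()
_cfg.read('pyemma.cfg')  # silently ignored if the file does not exist
DEFAULT_CHUNKSIZE = _cfg.getint('coordinates', 'default_chunksize', fallback=5000)
```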
class Iterable(Loggable, metaclass=ABCMeta):

-    def __init__(self, chunksize=1000):
+    def __init__(self, chunksize=DEFAULT_CHUNKSIZE):
I believe this is inconsistent: the default chunk size is set if the argument is None, however the argument itself defaults to the default chunksize.
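One consistent pattern would be to default the argument to None everywhere and apply the default in exactly one place; a sketch (with the Loggable base dropped so it runs standalone):

```python
from abc import ABCMeta

# stand-in for the config-backed value sketched above
DEFAULT_CHUNKSIZE = 5000


class Iterable(metaclass=ABCMeta):

    def __init__(self, chunksize=None):
        # None means "use the default"; keep the raw value and resolve it
        # in exactly one place, so the signature and the None-check agree.
        self._chunksize = chunksize

    @property
    def chunksize(self):
        # the single spot where the default is applied
        return DEFAULT_CHUNKSIZE if self._chunksize is None else self._chunksize

    @chunksize.setter
    def chunksize(self, value):
        self._chunksize = value
```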
Also it is called chunksize instead of chunk_size.
@@ -115,7 +115,7 @@ class StreamingTransformer(Transformer, DataSource, NotifyOnChangesMixIn):
         the chunksize used to batch process underlying data.

     """
-    def __init__(self, chunksize=1000):
+    def __init__(self, chunksize=None):
similar to the previous remark: here you default the argument to None (which probably is what is desired)
@@ -364,7 +366,7 @@ def source(inp, features=None, top=None, chunk_size=None, **kw):
     return reader


-def combine_sources(sources, chunksize=1000):
+def combine_sources(sources, chunksize=None):
Here it's called chunksize instead of chunk_size.
pyemma/coordinates/api.py
@@ -628,7 +639,7 @@ def save_traj(traj_inp, indexes, outfile, top=None, stride = 1, chunksize=1000,
         reading/featurizing/transforming/discretizing the files contained
         in :py:obj:`traj_inp.trajfiles`.

-    chunksize : int. Default 1000.
+    chunksize : int. Default=1000.
Here it's called chunksize instead of chunk_size.
 it = iterable.iterator(lag=self.lag, return_trajindex=False, stride=self.stride, skip=self.skip,
-                       chunk=self.chunksize if not partial_fit else 0)
+                       chunk=chunksize)
chunk_size
 # iterator over input weights
-if hasattr(self.weights, 'iterator'):
+if hasattr(self.weights, '_transform_array'):
     self.weights.data_producer = iterable
 it_weights = self.weights.iterator(lag=0, return_trajindex=False, stride=self.stride, skip=self.skip,
-                                   chunk=self.chunksize if not partial_fit else 0)
+                                   chunk=chunksize)
chunk_size
-Kest = _KoopmanEstimator(cls.tau, epsilon=cls.epsilon, chunksize=cls.chunksize)
-Kest.estimate(cls.source_obj)
+Kest = _KoopmanEstimator(cls.tau, epsilon=cls.epsilon)
+Kest.estimate(cls.source_obj, chunksize=cls.chunksize)
chunk_size
-cls.K_est = _KoopmanEstimator(cls.tau, epsilon=cls.epsilon, chunksize=cls.chunksize)
-cls.K_est.estimate(cls.source_obj)
+cls.K_est = _KoopmanEstimator(cls.tau, epsilon=cls.epsilon)
+cls.K_est.estimate(cls.source_obj, chunksize=cls.chunksize)
chunk_size
@@ -193,7 +193,7 @@ def test_chunksize(self):
     reader_xtc = api.source(self.traj_files, top=self.pdb_file)
     chunksize = 1001
     chain = [reader_xtc, api.tica(), api.cluster_mini_batch_kmeans(batch_size=0.3, k=3)]
-    p = api.pipeline(chain, chunksize=chunksize)
+    p = api.pipeline(chain, chunksize=chunksize, run=False)
chunk_size
Thanks for the review @clonker. The chunksize vs. chunk_size thing can be solved without breaking the API or deprecating first. I think chunk_size is only used in the API, but was never used in any estimator. Actually chunk_size is more appropriate, but deprecating and switching over would break a lot of code IMO. Agreed about the misleading usage in Iterable. Putting the default value in the config would be nice as well. We also discussed accepting strings to directly limit the amount of memory a chunk can occupy. Do you suggest using pint to do the conversion, which has the nice property that we could accept SI units for memory, or shall we hack something custom for this?
True, I didn't consider the breaking-API thing.
This is now a human-readable string (e.g. '10m') variable in pyemma.config. All coordinate streaming classes use None as the default chunksize and then determine the value from the config; the current default is '2m'. It is implemented by obtaining the output data type and the dimension.
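A minimal sketch of the described computation, assuming a string_to_bytes helper (sketched in the next comment) and that the reader exposes its output dimension and dtype; the function name is illustrative:

```python
import numpy as np

def chunksize_from_config(memory_string, dimension, dtype):
    # One frame occupies dimension * itemsize bytes, so the chunk size in
    # frames is the configured byte budget divided by the frame size.
    n_bytes = string_to_bytes(memory_string)  # e.g. '2m' -> 2 * 1024**2
    bytes_per_frame = dimension * np.dtype(dtype).itemsize
    return max(1, n_bytes // bytes_per_frame)  # never fewer than one frame

# e.g. a '2m' budget with 100-dimensional float32 frames:
# 2 * 1024**2 // (100 * 4) = 2097152 // 400 = 5242 frames per chunk
```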
Didn't go for pint, but added the reverse function to bytes_to_string ;)
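A sketch of what such an inverse could look like; the suffix handling here is an assumption and may differ from the actual helper:

```python
def string_to_bytes(s):
    # Inverse of a bytes_to_string-style formatter: '10m' -> 10485760.
    # Binary (1024-based) prefixes are assumed here; the real helper in
    # PyEMMA may differ in detail.
    suffixes = {'k': 1024, 'm': 1024 ** 2, 'g': 1024 ** 3, 't': 1024 ** 4}
    s = s.strip().lower()
    if s and s[-1] in suffixes:
        return int(float(s[:-1]) * suffixes[s[-1]])
    return int(s)  # plain number of bytes

assert string_to_bytes('10m') == 10 * 1024 ** 2
assert string_to_bytes('2g') == 2 * 1024 ** 3
```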
thanks!
[tica] fix handling of stride in estimate
Also do not store running covar as instance, because it is only needed
during estimation.
[coor/api]
[streaming_estimator] chunksize is a parameter of estimate, which passes it on to the input iterable.
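A schematic of that calling convention; the names mirror the diff hunks above, but the body is an illustrative sketch, not the actual implementation:

```python
class StreamingEstimator:
    # Illustrative sketch of the new calling convention, not PyEMMA's code.

    def estimate(self, iterable, chunksize=None):
        # chunksize travels with the call instead of living on the estimator;
        # it is handed straight to the input iterable, and None lets the
        # iterable fall back to the configured default.
        it = iterable.iterator(return_trajindex=False, chunk=chunksize)
        for chunk in it:
            self._update(chunk)
        return self

    def _update(self, chunk):
        # placeholder for per-chunk accumulation (e.g. the running covariance,
        # which is kept only locally during estimation, not as instance state)
        pass
```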