Conversation
Also, do not store the running covariance as an instance attribute, because it is only needed during estimation.
Removed implementation details (wiring data producers, setting the chunksize) from the API and moved them into StreamingTransformer.estimate(). Fixed docstrings in the API.
Codecov Report
@@            Coverage Diff             @@
##             devel    #1190      +/-   ##
==========================================
- Coverage    90.77%   90.76%    -0.02%
==========================================
  Files          201      201
  Lines        20849    20883       +34
==========================================
+ Hits         18926    18954       +28
- Misses        1923     1929        +6
Continue to review full report at Codecov.
I like the changes, the code in itself becomes a lot cleaner (especially in the API). :)
There is some inconsistency between chunksize and chunk_size, and I would probably opt for sticking with one of them and eliminating the other.
The default chunk size should probably come from the config file, as we have already discussed.
from pyemma.util.contexts import attribute
from pyemma.util.types import is_int

# this is used, in case None is passed as input chunk size.
DEFAULT_CHUNKSIZE = 5000
as we've discussed, might be beneficial to configure this value via the config file
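A minimal sketch of what a config-backed default could look like; the section/option names and the use of configparser are assumptions for illustration, not PyEMMA's actual config mechanism:

```python
from configparser import ConfigParser

# Hypothetical: read the default chunk size from a config file instead of
# hard-coding it; section and option names here are made up for illustration.
_cfg = ConfigParser()
_cfg.read('pyemma.cfg')  # silently ignored if the file does not exist
DEFAULT_CHUNKSIZE = _cfg.getint('coordinates', 'default_chunksize', fallback=5000)
```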
class Iterable(Loggable, metaclass=ABCMeta):

-    def __init__(self, chunksize=1000):
+    def __init__(self, chunksize=DEFAULT_CHUNKSIZE):
I believe this is inconsistent: the default chunk size is set if the argument is None, however the argument itself defaults to the default chunksize.
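One consistent pattern would be to default the argument to None everywhere and apply the default in exactly one place; a sketch (with the Loggable base dropped so it runs standalone):

```python
from abc import ABCMeta

# stand-in for the config-backed value sketched above
DEFAULT_CHUNKSIZE = 5000


class Iterable(metaclass=ABCMeta):

    def __init__(self, chunksize=None):
        # None means "use the default"; keep the raw value and resolve it
        # in exactly one place, so the signature and the None-check agree.
        self._chunksize = chunksize

    @property
    def chunksize(self):
        # the single spot where the default is applied
        return DEFAULT_CHUNKSIZE if self._chunksize is None else self._chunksize

    @chunksize.setter
    def chunksize(self, value):
        self._chunksize = value
```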
Also it is called chunksize instead of chunk_size.
@@ -115,7 +115,7 @@ class StreamingTransformer(Transformer, DataSource, NotifyOnChangesMixIn):
         the chunksize used to batch process underlying data.

     """
-    def __init__(self, chunksize=1000):
+    def __init__(self, chunksize=None):
similar to the previous remark: here you default the argument to None (which probably is what is desired)
@@ -364,7 +366,7 @@ def source(inp, features=None, top=None, chunk_size=None, **kw):
     return reader


-def combine_sources(sources, chunksize=1000):
+def combine_sources(sources, chunksize=None):
Here it's called chunksize instead of chunk_size.
pyemma/coordinates/api.py
@@ -628,7 +639,7 @@ def save_traj(traj_inp, indexes, outfile, top=None, stride = 1, chunksize=1000,
         reading/featurizing/transforming/discretizing the files contained
         in :py:obj:`traj_inp.trajfiles`.

-    chunksize : int. Default 1000.
+    chunksize : int. Default=1000.
Here it's called chunksize instead of chunk_size.
 it = iterable.iterator(lag=self.lag, return_trajindex=False, stride=self.stride, skip=self.skip,
-                       chunk=self.chunksize if not partial_fit else 0)
+                       chunk=chunksize)
chunk_size
 # iterator over input weights
-if hasattr(self.weights, 'iterator'):
+if hasattr(self.weights, '_transform_array'):
     self.weights.data_producer = iterable
 it_weights = self.weights.iterator(lag=0, return_trajindex=False, stride=self.stride, skip=self.skip,
-                                   chunk=self.chunksize if not partial_fit else 0)
+                                   chunk=chunksize)
chunk_size
-Kest = _KoopmanEstimator(cls.tau, epsilon=cls.epsilon, chunksize=cls.chunksize)
-Kest.estimate(cls.source_obj)
+Kest = _KoopmanEstimator(cls.tau, epsilon=cls.epsilon)
+Kest.estimate(cls.source_obj, chunksize=cls.chunksize)
chunk_size
-cls.K_est = _KoopmanEstimator(cls.tau, epsilon=cls.epsilon, chunksize=cls.chunksize)
-cls.K_est.estimate(cls.source_obj)
+cls.K_est = _KoopmanEstimator(cls.tau, epsilon=cls.epsilon)
+cls.K_est.estimate(cls.source_obj, chunksize=cls.chunksize)
chunk_size
@@ -193,7 +193,7 @@ def test_chunksize(self):
     reader_xtc = api.source(self.traj_files, top=self.pdb_file)
     chunksize = 1001
     chain = [reader_xtc, api.tica(), api.cluster_mini_batch_kmeans(batch_size=0.3, k=3)]
-    p = api.pipeline(chain, chunksize=chunksize)
+    p = api.pipeline(chain, chunksize=chunksize, run=False)
chunk_size
Thanks for the review @clonker. The chunksize vs. chunk_size thing can be solved without breaking the API or deprecating first. I think chunk_size is only used in the API, but was never used in any estimator. Actually chunk_size is more appropriate, but deprecating and switching over would break a lot of code IMO. Agreed about the misleading usage in Iterable. Putting the default value in the config would be nice as well. We also discussed accepting strings to directly limit the amount of memory a chunk can occupy. Do you suggest using pint to do the conversion, which has the nice property that we could accept SI units for memory, or shall we hack something custom for this?
True, I didn't consider the breaking-API thing.
This is now a human-readable string (e.g. '10m') variable in pyemma.config. All coordinate streaming classes use None as the default chunksize and then determine the value from the config; the current default is '2m'. It is implemented by obtaining the output data type and the dimension.
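A minimal sketch of the described computation, assuming a string_to_bytes helper (sketched in the next comment) and that the reader exposes its output dimension and dtype; the function name is illustrative:

```python
import numpy as np

def chunksize_from_config(memory_string, dimension, dtype):
    # One frame occupies dimension * itemsize bytes, so the chunk size in
    # frames is the configured byte budget divided by the frame size.
    n_bytes = string_to_bytes(memory_string)  # e.g. '2m' -> 2 * 1024**2
    bytes_per_frame = dimension * np.dtype(dtype).itemsize
    return max(1, n_bytes // bytes_per_frame)  # never fewer than one frame

# e.g. a '2m' budget with 100-dimensional float32 frames:
# 2 * 1024**2 // (100 * 4) = 2097152 // 400 = 5242 frames per chunk
```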
Didn't go for pint, but added the reverse function to bytes_to_string ;)
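A sketch of what such an inverse could look like; the suffix handling here is an assumption and may differ from the actual helper:

```python
def string_to_bytes(s):
    # Inverse of a bytes_to_string-style formatter: '10m' -> 10485760.
    # Binary (1024-based) prefixes are assumed here; the real helper in
    # PyEMMA may differ in detail.
    suffixes = {'k': 1024, 'm': 1024 ** 2, 'g': 1024 ** 3, 't': 1024 ** 4}
    s = s.strip().lower()
    if s and s[-1] in suffixes:
        return int(float(s[:-1]) * suffixes[s[-1]])
    return int(s)  # plain number of bytes

assert string_to_bytes('10m') == 10 * 1024 ** 2
assert string_to_bytes('2g') == 2 * 1024 ** 3
```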
thanks!
[tica] fix handling of stride in estimate
Also do not store running covar as instance, because it is only needed
during estimation.
[coor/api]
[streaming_estimator] chunksize is a parameter of estimate, which passes it on to the input iterable.
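A schematic of that calling convention; the names mirror the diff hunks above, but the body is an illustrative sketch, not the actual implementation:

```python
class StreamingEstimator:
    # Illustrative sketch of the new calling convention, not PyEMMA's code.

    def estimate(self, iterable, chunksize=None):
        # chunksize travels with the call instead of living on the estimator;
        # it is handed straight to the input iterable, and None lets the
        # iterable fall back to the configured default.
        it = iterable.iterator(return_trajindex=False, chunk=chunksize)
        for chunk in it:
            self._update(chunk)
        return self

    def _update(self, chunk):
        # placeholder for per-chunk accumulation (e.g. the running covariance,
        # which is kept only locally during estimation, not as instance state)
        pass
```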