TICA appears 100x slower (on large dataset) in 2.5.1 vs. 2.4 #1284
Comments
Realized I hadn't tried in a fresh conda environment - actually today is the last day of our Torque license on the old cluster I was reporting the above from, and it appears to have just gone off, so I can't try anymore. Transferring the data to our new cluster and will check there on a fresh conda install - will report soon. |
Thanks for testing this!
Does setting a chunksize solve this? 2.5 introduced a method to determine the chunksize automatically, but in the case of TICA this can be tricky, because we do not know the real output dimension during estimation.
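To illustrate the point, here is a hypothetical sketch (not PyEMMA's actual routine) of how an automatic chunk size derived from a memory budget depends on the frame dimension - which, for TICA output, is not known until estimation has finished:

```python
# Hypothetical sketch: pick the largest number of float32 frames of
# dimension `dim` that fit into a byte budget. All names are illustrative.
def frames_per_chunk(max_bytes, dim, itemsize=4):
    """Frames per chunk so that frames * dim * itemsize <= max_bytes."""
    return max(1, max_bytes // (dim * itemsize))

# Under the same 256 MB budget, a 6557-dimensional frame (Rafal's data)
# allows far fewer frames per chunk than a 170-dimensional one (BPTI):
high_dim = frames_per_chunk(256 * 1000**2, 6557)
low_dim = frames_per_chunk(256 * 1000**2, 170)
```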
|
Can you please set the log level to DEBUG for your TICA computation?
$EDITOR ~/.pyemma/logging.yml
change the level to DEBUG and run your script. It should log something about the computed chunksize.
|
Actually I'm certain that this is the fault of the automatic chunk size computation, as the output of the progress bar shows how many work pieces are used (2509 vs. 467871). This adds a tremendous Python overhead. So by passing the chunksize parameter != None, you should regain the old runtime speed. |
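A back-of-the-envelope illustration of why the work-piece count matters: if each piece carries a roughly fixed Python overhead, total overhead scales linearly with the number of pieces. The numbers are taken from the progress-bar output above; the per-piece cost is purely assumed for illustration.

```python
# Work-piece counts reported in this thread:
pieces_24 = 2_509      # 2.4: one piece per trajectory
pieces_251 = 467_871   # 2.5.1: close to one piece per frame

# With any fixed per-piece overhead, the overhead grows by this factor:
overhead_ratio = pieces_251 / pieces_24  # ~186x more per-piece overhead
```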
Well, this needs to be fixed anyway ASAP. Getting such a slowdown is a deal-breaker. If chunksize is the problem, I don't understand how this was not found by running the tutorials. Please make sure these are always run, including for minor releases.
Either we go back and remove the faulty optimization, or we fix it. Again, estimating the dimension cannot be the problem here, as the chunksize limitation must apply to the known input size.
|
I think in 2.5.1 we added Nystroem TICA, and as part of that you made a code reorganization, correct? Is that the point at which this problem was introduced? @marscher, can you confirm Rafal's observation on our tutorial notebooks? |
Indeed, setting the chunksize to 0 or big enough solves it, but the problem persists if I use a pipeline (using here only 1k trajectories rather than the 2509 above):

source = pyemma.coordinates.source(trajs[:1000])
tica_kinetic = pyemma.coordinates.tica(source, lag=10, kinetic_map=True, var_cutoff=1, chunksize=0)

gives normal speed. BUT

source = pyemma.coordinates.source(trajs[:1000])
tica_kinetic = pyemma.coordinates.tica(lag=10, kinetic_map=True, var_cutoff=1, chunksize=0)
stages = [source, tica_kinetic]
pipeline = pyemma.coordinates.pipeline(stages, chunksize=0)

still chunks into single frames. Why? |
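A toy sketch (not PyEMMA's implementation; all class and parameter names are made up) of the failure mode suspected here: a pipeline that does not forward its chunksize to its stages silently leaves each stage on its own default.

```python
# Illustrative only: `Stage` and `Pipeline` are stand-ins, not PyEMMA classes.
class Stage:
    def __init__(self, chunksize=None):
        self.chunksize = chunksize  # None -> fall back to automatic chunking

class Pipeline:
    def __init__(self, stages, chunksize=None, forward_chunksize=True):
        self.stages = stages
        if forward_chunksize and chunksize is not None:
            for stage in stages:
                # expected behavior: an explicit pipeline chunksize wins
                stage.chunksize = chunksize
        # with forward_chunksize=False (the bug), stages keep their defaults
```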
The chunk size was not passed on correctly when using a pipeline. Also, there was a bug in the default chunk size calculation: the number of possible frames was divided by the dimension a second time, yielding a much smaller chunk size than would actually have been possible. The PR should fix this behavior. |
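A minimal sketch of the described bug (illustrative formulas, not PyEMMA's literal code; it assumes float32 frames, a 256 MiB budget, and 174-dimensional data, which is consistent with the BPTI byte counts quoted in this thread):

```python
def chunksize_buggy(max_bytes, dim, itemsize=4):
    # frames that fit the budget...
    frames = max_bytes // (dim * itemsize)
    # ...then a spurious second division by the dimension
    return max(1, frames // dim)

def chunksize_fixed(max_bytes, dim, itemsize=4):
    # single division: frames * dim * itemsize <= max_bytes
    return max(1, max_bytes // (dim * itemsize))
```

With these assumptions the sketch reproduces the before/after chunksizes reported for BPTI (2216 vs. 385683).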
Thanks, Moritz. Can you confirm that TICA behaves poorly before your fix and well after, e.g. when using the BPTI notebook?
Rafal, can you confirm this addresses your issue?
When confirmed, we should do a patch release, and since this error is a showstopper, we should also issue an import warning that is triggered when a user imports 2.5.1, asking them to update to 2.5.2. Is that possible?
Best, Frank.
|
When using the BPTI notebook I can confirm that the chunksize before the fix was off:
before:
chunksize: tica_obj.chunksize = 2216
nbytes: tica_obj.chunksize * inp.dimension() * 4 = 1542336, which is about 1.5 megabytes
after:
chunksize: tica_obj.chunksize = 385683
nbytes: tica_obj.chunksize * inp.dimension() * 4 = 268435368, which is about 270 MB
The 270 megabytes are to be understood in the decimal representation, i.e., 1 MB = 1000 kB = 1000 * 1000 bytes, whereas the config is built on the 2^10 representation, i.e., 1 MiB = 2^10 KiB = 1024 * 1024 bytes. In that representation it is 255 MiB. Since the maximal chunk size in the config file is given as 256m and not 256MiB, I think we should take the decimal representation.
The actual runtime is pretty much the same, as the data is comparatively low-dimensional (6500 dimensions in Rafal's case, 170 in the BPTI case) and there was one division by the number of dimensions too many in the chunksize calculation. I have timed the TICA estimation and output process with the following results:
before the fix: 2.2 s ± 45.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
after the fix: 2.23 s ± 31.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
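The arithmetic behind the MB-vs-MiB distinction, spelled out (pure illustration of the conversion, using the byte count from this thread):

```python
nbytes = 268_435_368          # tica_obj.chunksize * inp.dimension() * 4

mb = nbytes / 1000**2         # decimal megabytes: ~268.4 MB (roughly 270)
mib = nbytes / 1024**2        # binary mebibytes: just under 256, i.e. 255 MiB
                              # when truncated to an integer
```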
|
Perhaps in Rafal's case the chunking led to many more I/O operations. That would explain the large speed difference.
Rafal, can you please try Moritz's fix and confirm that it works for you?
|
When confirmed, we should do a patch release, and since this error is a showstopper, we should also issue an import warning that is triggered, when a user imports 2.5.1, asking to update to 2.5.2. Is that possible?
If the user is online, they will get a warning if the version in use is not the latest available. So I think this would be kind of redundant.
|
It's not redundant if the update is important to do.
|
- Fixed the calculation of the default chunk size in `iterable.py` and pass the chunk size into estimation when using a pipeline.
- DataInMemory now returns the dtype of the first array as output_type.
- output_type now returns an instance of dtype rather than the class definition.
- Fixes #1284
Hi guys,
I'm re-running scoring on my 5 ms dataset, and after updating to 2.5.1 things look weird - we've gone from 3 hours for the TICA covariance calculation to 300 hours:
2.4:
2.5.1:
(it's not just at the beginning of the calculation somehow, I let it run overnight too and progress was consistent with this ETA)
This is 2509 trajectories, 492,961 frames, 6557 dimensions.
Code:
(also run with the data in memory (pyemma.coordinates.load) - same thing)
conda info:
conda list: attached: conda_list.txt
I don't imagine you would have really put out a 100x slower new version, so is it something wrong on my side? I don't have time right now to transfer all this data to a different system to try, so I'm sticking to 2.4 for now and putting this out here for comments.