
TICA appears 100x slower (on large dataset) in 2.5.1 vs. 2.4 #1284

Closed
rafwiewiora opened this issue Mar 29, 2018 · 13 comments

@rafwiewiora

rafwiewiora commented Mar 29, 2018

Hi guys,

I'm re-running scoring on my 5 ms dataset, and after updating to 2.5.1 things look weird - we've gone from 3 hours for the TICA covariance calculation to 300 hours:

2.4:

calculate covariances:   0% (  11/2509) [ ] eta 3:30:14 |

2.5.1:

calculate covariances:   0%| | 8/467871 [00:17<319:26:36,  2.46s/it]

(it's not just slow at the beginning of the calculation; I let it run overnight too and progress was consistent with this ETA)

This is 2509 trajectories, 492,961 frames, 6557 dimensions.

Code:

import pyemma
import glob
import numpy as np

splits_train = np.load('/cbio/jclab/home/rafal.wiewiora/repos/MSM_play/set8_apo_11707_11709_FINAL/feat_choice_scoring_new_scheme_2/splits_train_dist.npy')

split = 0
source = pyemma.coordinates.source(list(np.array(glob.glob('../data_cut_start_noH_stride10_featurized/dist_cross/*/*.npy'))[list(splits_train[split])]))


lag_time = 10
tica_kinetic = pyemma.coordinates.tica(lag=lag_time, kinetic_map=True, var_cutoff=1)
stages = [source, tica_kinetic]
pipeline = pyemma.coordinates.pipeline(stages, chunksize = 1000)

(I also ran this with the data in memory via pyemma.coordinates.load - same thing.)

conda info:

Current conda install:

               platform : linux-64
          conda version : 4.3.34
       conda is private : False
      conda-env version : 4.3.34
    conda-build version : not installed
         python version : 3.5.5.final.0
       requests version : 2.18.3
       root environment : /cbio/jclab/home/rafal.wiewiora/anaconda3  (writable)
    default environment : /cbio/jclab/home/rafal.wiewiora/anaconda3
       envs directories : /cbio/jclab/home/rafal.wiewiora/anaconda3/envs
                          /cbio/jclab/home/rafal.wiewiora/.conda/envs
          package cache : /cbio/jclab/home/rafal.wiewiora/anaconda3/pkgs
                          /cbio/jclab/home/rafal.wiewiora/.conda/pkgs
           channel URLs : https://conda.anaconda.org/conda-forge/linux-64
                          https://conda.anaconda.org/conda-forge/noarch
                          https://conda.anaconda.org/omnia/linux-64
                          https://conda.anaconda.org/omnia/noarch
                          https://conda.anaconda.org/salilab/linux-64
                          https://conda.anaconda.org/salilab/noarch
                          https://repo.continuum.io/pkgs/main/linux-64
                          https://repo.continuum.io/pkgs/main/noarch
                          https://repo.continuum.io/pkgs/free/linux-64
                          https://repo.continuum.io/pkgs/free/noarch
                          https://repo.continuum.io/pkgs/r/linux-64
                          https://repo.continuum.io/pkgs/r/noarch
                          https://repo.continuum.io/pkgs/pro/linux-64
                          https://repo.continuum.io/pkgs/pro/noarch
            config file : /cbio/jclab/home/rafal.wiewiora/.condarc
             netrc file : None
           offline mode : False
             user-agent : conda/4.3.34 requests/2.18.3 CPython/3.5.5 Linux/2.6.32-573.18.1.el6.x86_64 CentOS/6.7 glibc/2.12    
                UID:GID : 20077:3008

conda list attached: conda_list.txt

I don't imagine you would really have put out a 100x slower new version, so presumably something is wrong on my side? I don't have time right now to transfer all this data to a different system to test, so I'm sticking with 2.4 for now and putting this out here for comments.

@rafwiewiora
Author

Realized I hadn't tried this in a fresh conda environment - unfortunately today is the last day of the Torque license on the old cluster I was reporting the above from, and it appears to have just gone off, so I can't try there anymore. Transferring the data to our new cluster and will check there on a fresh conda install - will report soon.

@marscher
Member

marscher commented Mar 30, 2018 via email

@marscher
Member

marscher commented Mar 30, 2018 via email

@marscher
Member

marscher commented Apr 4, 2018

Actually I'm certain that this is the fault of the automatic chunk size computation, as the progress bar output shows how many work pieces are used (2509 vs. 467871). This adds tremendous Python overhead. So by passing a chunksize parameter != None, you should regain the old runtime speed.
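
For concreteness, a minimal sketch of this workaround, following the call pattern from the report above; trajs stands in for the list of feature files, and the chunk size of 10000 frames is only an illustrative value, not a recommended setting:

import pyemma

source = pyemma.coordinates.source(trajs)
# passing an explicit chunksize avoids the automatic chunk size computation
tica_kinetic = pyemma.coordinates.tica(source, lag=10, kinetic_map=True,
                                       var_cutoff=1, chunksize=10000)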

@franknoe
Contributor

franknoe commented Apr 4, 2018 via email

@franknoe
Contributor

franknoe commented Apr 4, 2018

I think in 2.5.1 we added Nystroem TICA, and as part of that you did a code reorganization, correct? Is that the point when this problem was introduced?

@marscher, can you confirm Rafal's observation on our tutorial notebooks?

@rafwiewiora
Author

Indeed, setting the chunksize to 0 or to a large enough value solves it, but the problem still persists if I use the pipeline.

(using only 1k trajectories here rather than the 2509 above)

source = pyemma.coordinates.source(trajs[:1000])
tica_kinetic = pyemma.coordinates.tica(source, lag=10, kinetic_map=True, var_cutoff=1, chunksize=0)

gives normal speed:

calculate covariances
0% 3/1000 [00:07<38:58, 2.35s/it]

BUT

source = pyemma.coordinates.source(trajs[:1000])
tica_kinetic = pyemma.coordinates.tica(lag=10, kinetic_map=True, var_cutoff=1, chunksize=0)
stages = [source, tica_kinetic]
pipeline = pyemma.coordinates.pipeline(stages, chunksize=0)

still chunks into single frames:

calculate covariances
0% 7/186618 [00:12<85:52:42, 1.66s/it]

why?

@clonker
Member

clonker commented Apr 7, 2018

The chunk size was not passed on correctly when using a pipeline. Also, there was a bug in the default chunk size calculation where the number of frames that fit into memory was divided by the dimension a second time, yielding a much smaller chunk size than would actually have been possible.

The PR should fix this behavior.
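
As an illustration (not PyEMMA's actual code), the shape of the bug described above can be sketched as follows; the memory budget and dimension are inferred from the numbers quoted in the next comment:

# hypothetical reconstruction of the default chunk size calculation
max_bytes = 256 * 1024 * 1024      # assumed per-chunk memory budget ("256m" in the config)
dim, itemsize = 174, 4             # dimension and float32 size implied by the BPTI numbers below

frames_per_chunk = max_bytes // (dim * itemsize)   # intended default: ~385k frames
buggy_chunk = frames_per_chunk // dim              # extra division by dim: ~2.2k frames

print(frames_per_chunk, buggy_chunk)               # 385683 2216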

@franknoe
Contributor

franknoe commented Apr 7, 2018 via email

@clonker
Member

clonker commented Apr 9, 2018

When using the BPTI notebook I can confirm that the chunksize before the fix was off:

  • before:
    • chunksize: tica_obj.chunksize = 2216
    • nbytes: tica_obj.chunksize * inp.dimension() * 4 = 1542336, which is about 1.5 MB
  • after:
    • chunksize: tica_obj.chunksize = 385683
    • nbytes: tica_obj.chunksize * inp.dimension() * 4 = 268435368, which is about 270 MB

The 270 megabytes are to be understood in the decimal representation, i.e., 1 MB = 1000 kB = 1000 * 1000 bytes, whereas the config is built on the base-2 representation, i.e., 1 MiB = 1024 KiB = 1024 * 1024 bytes. In that representation it is 255 MiB. Since the maximal chunk size in the config file is given as 256m and not 256MiB, I think we should use the decimal representation (see here).
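
As a quick check of the arithmetic, using the chunk size quoted above and the dimension of 174 implied by the nbytes figures:

nbytes = 385683 * 174 * 4     # chunksize * inp.dimension() * sizeof(float32)
print(nbytes)                 # 268435368
print(nbytes / 1000**2)       # ~268.4 decimal MB (quoted as "about 270 MB")
print(nbytes / 1024**2)       # ~256.0 MiB, i.e. just under 256 (255 MiB when rounded down)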

The actual runtime is pretty much the same, as the BPTI data is comparatively low-dimensional (170 dimensions, vs. roughly 6500 in Rafal's case) and the chunk size calculation contained one division by the number of dimensions too many. I have timed the TICA estimation and output process, with the following results:

  • before the fix: 2.2 s ± 45.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
  • after the fix: 2.23 s ± 31.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
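
A rough sketch of how such a timing could be reproduced; the input reader and feature file list here are placeholders for the BPTI tutorial setup, not the exact code used:

import timeit
import pyemma

# placeholder: a reader over the BPTI feature trajectories
inp = pyemma.coordinates.source(bpti_feature_files)

def estimate_and_output():
    tica_obj = pyemma.coordinates.tica(inp, lag=10)  # estimation
    return tica_obj.get_output()                     # output/transform pass

# 7 repeats of 10 loops each, matching the quoted measurement scheme
times = timeit.repeat(estimate_and_output, repeat=7, number=10)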

@franknoe
Contributor

franknoe commented Apr 9, 2018 via email

@marscher
Member

marscher commented Apr 9, 2018 via email

@franknoe
Contributor

franknoe commented Apr 9, 2018 via email

marscher added a commit that referenced this issue Apr 10, 2018
- Fixed the calculation of the default chunk size in `iterable.py` and pass the chunk size into estimation when using a pipeline.
- DataInMemory now returns the dtype of the first array as output_type
- output_type now returns an instance of dtype rather than the class definition
- Fixes #1284