Strategies for high-dimensional featurization #1497

Closed
palominohernandez opened this issue Mar 26, 2021 · 11 comments

palominohernandez commented Mar 26, 2021

Dear all,

I have been following the tutorial notebooks (https://github.com/markovmodel/pyemma_tutorials/tree/master/notebooks) with my own system, a ~120-residue protein.

When doing the featurization, as in (https://github.com/markovmodel/pyemma_tutorials/blob/master/notebooks/02-dimension-reduction-and-discretization.ipynb), I noticed that all pairs of heavy-atom contacts for my protein are too expensive to hold in memory, so I used source instead of load:

    feat.active_features = []                       # reset any previously added features
    pairs = feat.pairs(feat.select_Heavy())         # all pairs of heavy atoms
    feat.add_contacts(pairs, periodic=False)
    data_contacts = pyemma.coordinates.source(train_files, features=feat)
    data_contacts = data_contacts.get_output(stride=10)

However, when trying to compute the VAMP-2 score, I get the error "Covariance matrix does not fit into memory. Input is too high-dimensional (200661)". I tried different values of chunk size, without success:

    vamp_contacts = pyemma.coordinates.vamp(data_contacts, lag, dim, chunksize=100, ncov_max=30)

As an alternative, I also tried deeptime's TICA, which does not expose a chunk size (and it also failed on this data):

    tica_est = dt.decomposition.TICA(lagtime=5, var_cutoff=0.9)
    tica = tica_est.fit(data_contacts).fetch_model()

Thus, I wanted to ask how to deal with this type of situation. One approach I could think of is to remove zero-variance features so that the TICA and VAMP analyses fit into memory, but I am unsure how valid this is.

Any leads will be appreciated.

Thanks,
Oscar


clonker commented Mar 27, 2021

Hi Oscar,

you had the right idea using source, but instead of using data_contacts.get_output() you can create an iterator over chunks of data and make use of partial_fit:

it = data_contacts.iterator(lag=lag, chunk=128, stride=10, return_trajindex=True)
for itraj, (X, Y) in it:
    # X holds the instantaneous frames of the current chunk, Y the time-lagged ones
    tica_est.partial_fit((X, Y))

(The same works with VAMP.) That being said, it seems one frame of your data has 200661 dimensions. At float32 precision that means that one (!) covariance matrix needs roughly 150 GB of memory, and you need three of them. I suggest you first try to narrow down the input features (i.e., fewer contacts), then go with the iterator.
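For example, something along these lines keeps the dimensionality much smaller (untested sketch; `topology_file` and `train_files` stand in for whatever you already use to build the featurizer and reader):

    import pyemma

    feat = pyemma.coordinates.featurizer(topology_file)

    # Option A: residue-level minimum distances instead of all heavy-atom pairs
    # (~7000 dimensions for a 120-residue protein instead of ~200000)
    feat.add_residue_mindist(scheme='closest-heavy', periodic=False)

    # Option B: contacts between C-alpha atoms only, skipping close sequence neighbors
    # ca_pairs = feat.pairs(feat.select_Ca(), excluded_neighbors=3)
    # feat.add_contacts(ca_pairs, periodic=False)

    print(feat.dimension())  # sanity-check the dimensionality before estimating anything
    data_contacts = pyemma.coordinates.source(train_files, features=feat)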


clonker commented Mar 27, 2021

Actually, I am unsure whether the error message is correct; you can double-check via print(feat.dimension()).

@palominohernandez (Author)

Dear Moritz,

Thanks for the answer! Let me have a look at the iterator.

The high dimensionality of my data comes from choosing all pairs of heavy-atom contacts as features, but I am starting to think that was unwise.

In the meantime, a follow-up on my first question: it should be fine to remove features with zero variance, right? What about features with almost-zero variance?


clonker commented Mar 29, 2021

Dear Oscar,

indeed, if the value does not change (a delta peak from, e.g., constrained distances), then it carries no information that could be used to identify processes.
In any case I recommend using the VAMP-2 score to make a quantitative decision on the feature selection.
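A minimal sketch of what that could look like, assuming `data_contacts` is a list of per-trajectory feature arrays (e.g. from get_output) with at least two trajectories:

    import pyemma

    lag = 5   # in frames; adjust to your saving interval
    dim = 10

    # crude split: hold out the last trajectory as test data
    vamp = pyemma.coordinates.vamp(data_contacts[:-1], lag=lag, dim=dim)
    print('VAMP-2 score:', vamp.score(test_data=data_contacts[-1]))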

Cheers,
Moritz

@palominohernandez (Author)

Thanks Moritz,

I will post a final part on this, as I think it still falls within this topic. To describe a single-domain protein that eventually opens and closes a loop, I tried multiple features on a data set saved every nanosecond; the best-scoring one (by VAMP-2, at fixed lag and dimension) was backbone torsions. I then performed a screening over several lags and dimensions, and contrary to the tutorials, I never observed convergence of the VAMP-2 score (see figure).

[Screenshot: VAMP-2 score vs. number of dimensions at several lag times]

Is this behavior expected for high-dimensional feature sets (never converging to a plateau in the VAMP-2 score)? And if so, do I just take a cutoff, say 60 dimensions, and continue?


clonker commented Mar 30, 2021

Dear Oscar,

you may also try combinations of features and look at their score, and perhaps also try VAMPNets. It seems to me that for longer lagtimes (50 ns) the score does reach some kind of plateau for >= 90 dimensions. In any case you can still carry on with the analysis under the caveat that your featurization is suboptimal, i.e., you are making a larger projection error than perhaps necessary.
Also, just to raise awareness (if you weren't aware already): VAMP scores are not comparable across different lagtimes. That the score at 5 ns is higher than the one at 50 ns does not mean that 5 ns should be preferred over 50 ns. Depending on which process you are interested in (and in particular on which timescale it happens), you might want to go to even longer lagtimes and possibly obtain VAMP score convergence at a lower number of dimensions.

Hope this helps!
Moritz

@palominohernandez (Author)

Thanks for your input Moritz,

It might have seemed like that, but the plot was meant to show something different:
[Screenshot: VAMP-2 score vs. number of dimensions]

I assume it is normal not to see a plateau when computing VAMP scores for high-dimensional data sets; is that correct? If so, can I just decide arbitrarily how many dimensions I wish to retain?

What do you mean by "if the value does not change (a delta peak from, e.g., constrained distances)"? Can you clarify this? I decided to retain features with variance larger than 0.1 units, and the shape of the VAMP curve changes only slightly.

Thanks for your continuous input,
O.


clonker commented Mar 31, 2021

Hi,

if you don't see a plateau, it means that the dynamics that are slow under that lagtime are not fully/optimally captured. You may still continue the analysis with the number of dimensions of your choice, but you have to keep in mind that you are making a projection error. When building an MSM you additionally make a discretization error (which might be enlarged by a higher number of dimensions) through the clustering / tessellation of state space. Furthermore, with a higher number of dimensions you might need more cluster centers, and with more cluster centers you might need more data to adequately parametrize your model. It is all a tradeoff game (which can be scored via the MSM scoring function, though).
Another thing: it might also be that you are overfitting; have you looked at the cross-validated VAMP score?
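Roughly along the lines of the score_cv helper used in the tutorial notebooks (untested sketch; again assumes `data_contacts` is a list of per-trajectory feature arrays):

    import numpy as np
    import pyemma

    def score_cv(data, lag, dim, n_splits=5, validation_fraction=0.5):
        """Average VAMP-2 score over random train/validation splits of the trajectories."""
        n_val = max(1, int(len(data) * validation_fraction))
        scores = []
        for _ in range(n_splits):
            val_idx = set(np.random.choice(len(data), size=n_val, replace=False))
            train = [d for i, d in enumerate(data) if i not in val_idx]
            val = [d for i, d in enumerate(data) if i in val_idx]
            vamp = pyemma.coordinates.vamp(train, lag=lag, dim=dim)
            scores.append(vamp.score(test_data=val))
        return np.mean(scores), np.std(scores)

    print(score_cv(data_contacts, lag=50, dim=20))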

Regarding delta peaks: if the variance is very small (or even 0), that means there is very little information in the feature. This can happen if you are looking at bond lengths and certain bonds have been constrained in the simulation to a fixed length. Also, if the VAMP score changes only very slightly when you drop such features, you can probably ignore them.
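Filtering those out could look like this (sketch; the variance threshold is arbitrary, and `data_contacts` is again a list of per-trajectory arrays):

    import numpy as np

    X = np.concatenate(data_contacts, axis=0)   # stack all trajectories
    keep = X.var(axis=0) > 1e-10                # drop (numerically) constant features
    data_filtered = [traj[:, keep] for traj in data_contacts]
    print('kept', keep.sum(), 'of', keep.size, 'features')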

As a last point: you probably want to try more, and in particular longer, lagtimes, something like [10, 50, 100, 200, 500].

Best,
Moritz

@palominohernandez (Author)

Dear Moritz,

Happy Easter break! xD.

I am starting to properly understand all the nuances in your comments. For example, I would have expected a 50 ns lagtime to be long enough to remove all the fast dynamics. But then again, I am learning. I will try larger lagtimes for VAMP scoring.

I find that when scoring the number of dimensions with VAMP, there are two phases: a rapid increase, where I assume the slowest dynamics get covered, and a second, steadier increase, where other processes get covered, until a plateau is reached. Is this interpretation correct? (Figure below.)

A secondary question on this: with this featurization I have two opposing views: the set of all torsions has a higher VAMP score, and thus should describe the protein dynamics better, but the set of heavy-atom distances appears to plateau at a lower number of dimensions (20-ish vs. 40-ish for the other feature set).

Do I favor fewer dimensions over better VAMP-2 scores? I mean, in the end, more dimensions can be problematic for clustering and for the populations of each cluster when computing the MSM.

[Screenshot: VAMP-2 score vs. number of dimensions for backbone torsions and heavy-atom distances]

Thanks for your input, I really appreciate it.

Best,
O.


clonker commented Apr 6, 2021

Hey,

thanks, happy easter break to you too! :-)

I find that when scoring the number of dimensions with VAMP, there are two phases: a rapid increase, where I assume the slowest dynamics get covered, and a second, steadier increase, where other processes get covered, until a plateau is reached. Is this interpretation correct? (Figure below.)

This is not necessarily true, I would say, as the slow processes might already be covered at the lower values of the score, just under a worse approximation (i.e., information gets lost, they appear faster than they are, ...).

A secondary question on this: with this featurization I have two opposing views: the set of all torsions has a higher VAMP score, and thus should describe the protein dynamics better, but the set of heavy-atom distances appears to plateau at a lower number of dimensions (20-ish vs. 40-ish for the other feature set).

Generally a higher score is better, plateau or not. You might want to try combining torsions and heavy-atom distances in one featurizer and score that, though, i.e., use the same featurizer object and just add both features to it. The estimators will prune away any dimensions with (almost) redundant information anyway.
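Reusing the names from your earlier snippet, that would look roughly like this (sketch; `topology_file` again stands in for your topology):

    feat = pyemma.coordinates.featurizer(topology_file)
    feat.add_backbone_torsions(cossin=True, periodic=False)
    feat.add_contacts(pairs, periodic=False)     # the same contact pairs as before

    data_combined = pyemma.coordinates.source(train_files, features=feat)
    # then score data_combined with VAMP-2 as discussed above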

Do I favor fewer dimensions over better VAMP-2 scores? I mean, in the end, more dimensions can be problematic for clustering and for the populations of each cluster when computing the MSM.

You can also score the MSM object, which is exactly what you should do to quantitatively answer this question. :-) I can't really give you a rule of thumb, as there are multiple factors going into this (at the very least the clustering itself, the number of dimensions, and the amount of data, but probably more that I can't think of right now).
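A very rough sketch of that last step, with placeholder numbers you would of course scan over (lag, TICA dimension, number of clusters):

    import pyemma

    tica_output = pyemma.coordinates.tica(data_combined, lag=50, dim=20).get_output()
    cluster = pyemma.coordinates.cluster_kmeans(tica_output, k=200, max_iter=50, stride=10)

    msm = pyemma.msm.estimate_markov_model(cluster.dtrajs, lag=50)
    scores = msm.score_cv(cluster.dtrajs, n=5, score_method='VAMP2', score_k=5)
    print(scores.mean(), scores.std())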


stale bot commented Oct 11, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the wontfix label Oct 11, 2021
clonker closed this as completed Dec 22, 2021