Create feature vectorizer class / API. #49

Closed
kyleabeauchamp opened this issue May 1, 2013 · 16 comments
Comments

kyleabeauchamp (Contributor) commented May 1, 2013:

The idea is that we want a robust framework for featurizing trajectories. Key abilities (see the sketch just after this list):

  1. Mix and match functions from the geometry library
  2. Possibly combine multiple vectorizers
  3. Possibly memory-aware: split trajectories into chunks to avoid trying to featurize everything at once
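
A minimal sketch of what that could look like. All class and function names here are hypothetical (this is not an existing API); it assumes MDTraj's md.compute_distances, md.compute_dihedrals, and md.iterload:

    import numpy as np
    import mdtraj as md

    class FunctionFeaturizer(object):
        """Wrap any geometry function f(traj, **kwargs) -> (n_frames, n_features) array."""
        def __init__(self, function, **kwargs):
            self.function = function
            self.kwargs = kwargs

        def featurize(self, traj):
            return np.asarray(self.function(traj, **self.kwargs))

    class UnionFeaturizer(object):
        """Combine several vectorizers by stacking their outputs column-wise."""
        def __init__(self, featurizers):
            self.featurizers = featurizers

        def featurize(self, traj):
            return np.hstack([f.featurize(traj) for f in self.featurizers])

    def featurize_file(filename, featurizer, chunk=1000):
        """Memory-aware featurization: load and featurize the trajectory in chunks."""
        return np.vstack([featurizer.featurize(piece)
                          for piece in md.iterload(filename, chunk=chunk)])

    # e.g. mixing two geometry functions:
    pairs = [[0, 10], [2, 20]]
    combined = UnionFeaturizer([
        FunctionFeaturizer(md.compute_distances, atom_pairs=pairs),
        FunctionFeaturizer(md.compute_dihedrals, indices=[[0, 1, 2, 3]]),
    ])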
rmcgibbo (Member) commented May 1, 2013:

kyleabeauchamp (Contributor, Author) commented:

Yes, I was explicitly thinking about sklearn as a model.

kyleabeauchamp (Contributor, Author) commented:

We should also think about how to balance distance metrics versus feature representations.

tjlane (Contributor) commented May 3, 2013:

Tagging this issue.

Kyle, is your vision for this a kind of map function, where you can efficiently apply a function that maps each member of a set of snapshots to a vector? Maybe you could write a little more about the intended use cases, so I (and maybe others) can get an idea of what you're thinking.

rmcgibbo (Member) commented May 3, 2013:

I think it's basically just re-imagining metric.prepare_trajectory as a separate class.
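
Roughly, the prepare_trajectory step would become the featurizer's one job. A sketch with a hypothetical class name, using MDTraj's md.compute_dihedrals:

    import mdtraj as md

    class DihedralFeaturizer(object):
        """Stands in for what metric.prepare_trajectory(traj) does today."""
        def __init__(self, indices):
            self.indices = indices  # (n_dihedrals, 4) atom indices

        def featurize(self, traj):
            return md.compute_dihedrals(traj, self.indices)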

kyleabeauchamp (Contributor, Author) commented:

As an example of how we might use pandas here, think about the case of calculating dihedral angles.

Right now, the output is this:

Returns
-------
rid : np.ndarray, shape=(n_chi, 4)
    The indices of the atoms involved in each of the
    chi dihedral angles
angles : np.ndarray, shape=(n_frames, n_chi)
    The value of the dihedral angle for each of the angles in each of
    the frames.

The way pandas could be useful is by providing a natural way to merge the metadata (rid) and the values (angles). We could also have a way to switch between a string index and a MultiIndex, where the MultiIndex would contain the following:

  - Type of calculation (chi torsion)
  - Residue ID
  - Atom ID

I'm not trying to claim that this is the best way to do this, but it is one way to help streamline this stuff...
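
Concretely, something along these lines. This is a sketch only; it assumes a compute_chi function returning (rid, angles) with the shapes documented above, and an MDTraj Trajectory named traj:

    import pandas as pd

    rid, angles = compute_chi(traj)  # (n_chi, 4) indices and (n_frames, n_chi) values

    # A MultiIndex carries the metadata for every column:
    # type of calculation, residue ID, and the atom IDs involved.
    columns = pd.MultiIndex.from_tuples(
        [("chi", traj.topology.atom(r[0]).residue.index, tuple(r)) for r in rid],
        names=["calculation", "residue", "atom_ids"])
    df = pd.DataFrame(angles, columns=columns)

    df["chi"]                           # all chi torsions
    df.xs(10, level="residue", axis=1)  # torsions belonging to residue 10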

kyleabeauchamp (Contributor, Author) commented:

Then the "job" of the vectorizer would be to do two things:

  1. Calculate the quantity
  2. Give it a self-consistent label

rmcgibbo (Member) commented Aug 6, 2013:

I was just talking to @msultan about this yesterday afternoon. A useful part of the (feature/vector)izer API would be a minimal operator-type logic. For algorithms like ktICA, where you're in some sense throwing the kitchen sink at the problem in terms of very large feature spaces (sure, they might be implicit, but it's still in that spirit), you might want to use a sort of operator logic on featurizers to build up a complex "compound" feature space.

i.e.

[...]
>>> traj.n_frames == 100
>>> # applying two featurizers like normal
>>> dihedral_featurizer(traj).shape == (100, 5)
>>> contact_featurizer(traj).shape == (100, 5)

>>> joint_featurizer = dihedral_featurizer + contact_featurizer
>>> # adding two operators together yields a new compound operator
>>> # when you apply it, you get something that does both
>>> joint_featurizer(traj).shape == (100, 10)

I'm not sure what operations really make sense. There's adding two featurizers. Multiplying by a scalar makes sense. Also, perhaps a kind of generalized outer product makes sense: for example, if you have two binary featurizers that each induce a 10-dimensional space and you take their outer product under the 'logical and' operator, you'd get a 100-dimensional feature space with all of the pairwise logical ands, of the form space1[i] && space2[j].

Maybe I'm overthinking this. I'm not really sure what the use cases are except for some kind of very exhaustive enumeration.

cc: @schwancr
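
One possible shape for that operator logic, just to make it concrete. The base class and names are hypothetical, and only + and scalar * are sketched:

    import numpy as np

    class Featurizer(object):
        def featurize(self, traj):
            raise NotImplementedError

        def __call__(self, traj):
            return self.featurize(traj)

        def __add__(self, other):
            return SumFeaturizer([self, other])

        def __mul__(self, scalar):
            return ScaledFeaturizer(self, scalar)

        __rmul__ = __mul__

    class SumFeaturizer(Featurizer):
        """Compound featurizer: concatenates the members' feature columns."""
        def __init__(self, members):
            self.members = members

        def featurize(self, traj):
            return np.hstack([m(traj) for m in self.members])

    class ScaledFeaturizer(Featurizer):
        """Scalar multiple of another featurizer's output."""
        def __init__(self, base, scalar):
            self.base, self.scalar = base, scalar

        def featurize(self, traj):
            return self.scalar * self.base(traj)

Under this sketch, dihedral_featurizer + contact_featurizer returns a SumFeaturizer, and applying it stacks the two members' outputs side by side.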

kyleabeauchamp (Contributor, Author) commented:

I think Christian should give his thoughts on the desired properties.

schwancr (Contributor) commented Aug 6, 2013:

It sounds like a decent idea to provide some operators for the featurizer objects, but I'm worried it could be confusing, and I bet someone will end up adding the results of two featurizer calls (the feature arrays themselves) as opposed to adding two featurizers.

For instance, you could add operators for building a Hybrid metric:

>>> rmsd = RMSD()
>>> dihedral = Dihedral()
>>> hybrid = 0.1 * rmsd + 0.9 * dihedral
...
>>> hybrid = Hybrid([rmsd, dihedral], [0.1, 0.9])

Those two methods would be the same, but the first one could be confusing to me. I think I'd prefer to have a hybrid featurizer just like we have a hybrid metric. In fact, even in the __add__ case we would (I think) need to write this class as well, meaning the operator would just be a different way of initializing it.
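
For concreteness, a hybrid featurizer written that way might look roughly like this (hypothetical name, following the members-and-weights form above):

    import numpy as np

    class HybridFeaturizer(object):
        """Weighted combination of featurizers, analogous to the hybrid metric."""
        def __init__(self, featurizers, weights):
            self.featurizers = featurizers
            self.weights = weights

        def featurize(self, traj):
            return np.hstack([w * f.featurize(traj)
                              for f, w in zip(self.featurizers, self.weights)])

An __add__ or __mul__ on a featurizer base class would then only collect terms and call this constructor, so the class is needed either way.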

kyleabeauchamp (Contributor, Author) commented:

I agree that some form of addition operation is critical, as we don't want to manually keep track of calculating each feature.

kyleabeauchamp (Contributor, Author) commented:

I'm not as fond of the outer product.

This does bring up the related point of how to design the MSMB3 hooks etc.

schwancr (Contributor) commented Aug 6, 2013:

The outer product is a lot like using the kernel trick with second-degree polynomials. So we could have a Polynomial featurizer if we wanted to. But again, I think it's clearer to have the featurizers be initialized by calling __init__.
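
As a point of comparison, the outer-product idea written as an explicit degree-2 featurizer might look like this (hypothetical class; scikit-learn's PolynomialFeatures does something similar for full polynomial expansions):

    import numpy as np

    class PairwiseProductFeaturizer(object):
        """All pairwise products a[i] * b[j] of two base feature sets.
        For binary (0/1) features this is exactly the pairwise logical AND."""
        def __init__(self, first, second):
            self.first = first
            self.second = second

        def featurize(self, traj):
            a = self.first.featurize(traj)   # (n_frames, n)
            b = self.second.featurize(traj)  # (n_frames, m)
            outer = a[:, :, None] * b[:, None, :]  # (n_frames, n, m)
            return outer.reshape(len(a), -1)       # (n_frames, n * m)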

kyleabeauchamp (Contributor, Author) commented:

OK, I'm fine with creating features from lists, rather than explicitly adding them.

schwancr (Contributor) commented Aug 6, 2013:

By the way, are we set on the "featurizer" name?

rmcgibbo (Member) commented Aug 6, 2013:

I don't think it was quite clear to me last night that this operator stuff is just an alternative interface to the constructors (the __init__ methods) of classes like SumFeaturizer, ScalarMultipleFeaturizer, etc.

I'm not set on the name featurizer though. 

