I/O for time gen class #2189

Closed
choldgraf opened this issue Jun 5, 2015 · 22 comments

@choldgraf
Contributor

I've been running some time gen models, and a tricky problem I've come across is how to deal with I/O. The model fitting (well, the predictions actually) can take a really long time, so I'd prefer to have one script do all the model fitting, and another script that looks at the results.

However, right now I don't have a good system for reading / writing from disk. I am wondering if it'd be worth implementing something similar to AverageTFR, which basically just takes attributes, writes them in a dictionary to an h5 file, and then reads them back into memory in a similar manner.

The main challenge I see to this would be that people may have arbitrary sklearn objects in the attribute for models. I think the h5 code will serialize this, which is sub-optimal but at least shouldn't break. What do you guys think?
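
For concreteness, a minimal sketch of the AverageTFR-style round trip I have in mind, using h5io (the filename and the choice of attributes are just placeholders, not a settled API):

```python
from h5io import write_hdf5, read_hdf5

# Write the plain-data attributes out as a dict, then read them back,
# much like AverageTFR's save / read round trip.
write_hdf5('gat.h5', dict(scores_=gat.scores_, y_pred_=gat.y_pred_),
           overwrite=True)
data = read_hdf5('gat.h5')  # dict of attributes, restored from disk
```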

@kingjr
Member

kingjr commented Jun 5, 2015

So far I've just been using pickle and saving the whole object, but it's true that

  1. if people use home-made classes for the classifier and/or scorer, it can become a problem. I'm not sure whether h5 deals with this?
  2. it serializes everything too.

So yes, an ENH would be welcome in my opinion.

@choldgraf
Contributor Author

Yeah - I've been doing pickling too, but I'm always worried about an API change making it hard to unpickle. At least by storing components as H5, it's easier to change code and still make the I/O work. Also, pickling for me is really slow, especially for time-gen stuff with decently high sampling rates. Don't know if it'd make a difference, though, if the sklearn object would still be serialized...

@choldgraf
Contributor Author

So I was thinking about this a little bit more, and I think this would be straightforward for any timegen-specific attributes, but potentially very convoluted for the models, cv objects, etc. One option would be to use something like tables (maybe h5io?) which I think automatically pickles things that it can't easily handle.

However, doing this makes me wonder if it makes more sense to have the I/O just be something the user handles themselves for their use case. It doesn't seem like sklearn will improve their own I/O for the same reason of complexity. What do you think?

@kingjr
Member

kingjr commented Dec 22, 2015

I don't know much about I/O, so I don't have any strong opinion.

The large part of the gat object is 1) y_pred_, 2) scores_, and 3) the fitted parameters of each estimator (typically clf.coef_, but that varies greatly depending on the clf). The rest should be pretty light, so I guess pickling it should be fine (?).

ATM I personally pickle everything, but it's true that it's pretty slow...

@choldgraf
Contributor Author

Yep - I think all that is correct. As you mention, the challenge is that it's really hard to support arbitrary potential models that people will fit. (e.g., if they use logistic regression vs. a random forest, the attributes of interest change). In our case, we'd either have to hard-code ways to handle those specific use cases, or just pickle the whole thing anyway. And if we're pickling the bigger objects anyway, then it shouldn't be a big deal for the user to do the I/O with h5io on their own, no?
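
(For reference, the whole-object version really is just a couple of lines on the user side - a hedged sketch, assuming `gat` is any fitted time-gen object:)

```python
import pickle

# Dump the fitted object wholesale, then load it back later.
with open('gat.pkl', 'wb') as f:
    pickle.dump(gat, f)
with open('gat.pkl', 'rb') as f:
    gat = pickle.load(f)
```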

@kingjr
Member

kingjr commented Dec 22, 2015

If we can incorporate the I/O, we might as well do it for them. It would definitely be used.

@choldgraf
Contributor Author

Hmm ok - let me cook something up then. The basic flow as I see it would be:

  1. Define parts of the time gen class that are easily I/O'ed (e.g., scores_ and the parameters that are set)
  2. Do one of:
    1. Detect the CV and/or model used and extract coefficients from it in an intelligent manner.
    2. Pickle all of the CV / clf stuff and unpickle it for the read.

WDYT?
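
For the "pickle the CV / clf stuff" branch, roughly this (a hedged sketch - the function names and the native/pickled split are illustrative only):

```python
import pickle
import numpy as np
from h5io import write_hdf5, read_hdf5

def save_gat(fname, gat):
    # Keep easily handled attributes native; stuff the cv / clf
    # objects into a pickled byte array that HDF5 can store.
    blob = np.frombuffer(pickle.dumps((gat.clf, gat.cv)), dtype=np.uint8)
    write_hdf5(fname, dict(scores_=gat.scores_, pickled=blob),
               overwrite=True)

def load_gat(fname, gat):
    data = read_hdf5(fname)
    gat.scores_ = data['scores_']
    gat.clf, gat.cv = pickle.loads(data['pickled'].tobytes())
    return gat
```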

@dengemann
Member

Instead of pickling, I would save the attributes of the sklearn objects and their parameters, and then reconstruct them after reading. Most of these attributes should be standard types h5py can deal with.
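
Something along these lines, maybe (a rough sketch of the idea; the helper names are made up):

```python
import importlib

def clf_to_dict(clf):
    # Record the class location, constructor params, and the fitted
    # attributes (sklearn convention: trailing underscore).
    fitted = {key: val for key, val in vars(clf).items()
              if key.endswith('_')}
    return dict(module=clf.__class__.__module__,
                name=clf.__class__.__name__,
                params=clf.get_params(deep=False),
                fitted=fitted)

def clf_from_dict(d):
    # Re-instantiate the class, then restore the fitted attributes.
    cls = getattr(importlib.import_module(d['module']), d['name'])
    clf = cls(**d['params'])
    for key, val in d['fitted'].items():
        setattr(clf, key, val)
    return clf
```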

@choldgraf
Contributor Author

OK, so you vote for 'i' then? The only challenge is that it requires hard-coding for each type of sklearn object (e.g., it's coef_ if we're dealing with an SVM, but feature_importances_ if it's a RandomForest).

@jona-sassenhagen
Contributor

If we had a mechanism for figuring out the field holding the feature importance, we could also use that to plot model topomaps.
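
Something like this, maybe (a hedged sketch; it just probes the common sklearn conventions):

```python
def get_model_weights(clf):
    # Linear models expose coef_; tree ensembles feature_importances_.
    for attr in ('coef_', 'feature_importances_'):
        if hasattr(clf, attr):
            return getattr(clf, attr)
    raise ValueError('No known weight attribute on %s'
                     % type(clf).__name__)
```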

@choldgraf
Contributor Author

Yep, agreed - this is why I think it's important to keep the model parameter information instead of just keeping the scores etc. I'm +1 for doing this, just need to decide whether to explicitly handle those parameters or just blindly pickle it.

@kingjr
Member

kingjr commented Dec 22, 2015

I guess we can start easy with blind pickling and make subsequent PRs to improve it later. Or would this create incompatibility issues?

@agramfort
Member

agramfort commented Dec 22, 2015 via email

@jona-sassenhagen
Contributor

95% of use cases would probably be covered by just thinking about SVM and LR, right?

(Jean-Rémi, you're still mainly using LR and SVMs for classification GAT too, right?)

@kingjr
Member

kingjr commented Dec 22, 2015

I also regularly use ridge, lasso, multi-task lasso, and LDA.

@jona-sassenhagen
Contributor

Ah okay. I've never seen Ridge or LDA give better results than LR for my use cases...

@kingjr
Member

kingjr commented Dec 22, 2015

Ridge has been a bit better on some 32-electrode EEG datasets. And there's a RidgeCV that simplifies the grid search. But yeah, all of these are close.

@choldgraf
Contributor Author

So... choose 3-4 to support and then add support for new ones on an as-requested-enough basis? Linear SVM, LR, LDA, and Ridge? Or is there an Elastic Net classifier? Then we could just keep the tradeoff parameter in there to support both ridge/lasso in one go?

@jona-sassenhagen
Contributor

Interesting @kingjr, I use low-density EEG a lot and I always see LR and SVM strongly dominate Ridge.

@choldgraf Maybe even just start with the default (LR) to get the API going?
(Yes, there is an Elastic Net classifier.)

@kingjr
Member

kingjr commented Dec 23, 2015

@jona-sassenhagen in my case there are about 5000 epochs.

@choldgraf
Contributor Author

ok - though IIRC sklearn may behave differently for different kinds of coefficients. I think I remember an issue about this a while back where some estimators would let you set things like coef_, while others would raise an error. @agramfort correct me if I'm wrong?

One option, then, would be to assume that there will be a set of linear weights, one per input feature, belonging to the GAT object. Then we could store those weights as an array and do I/O accordingly, though that would lose some of the provenance for the full details of the model / CV / etc.
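
e.g., something like this (a hedged sketch; it assumes each fitted estimator in gat.estimators_ is linear and exposes coef_ with a consistent shape):

```python
import numpy as np

# Stack the linear weights into one (n_train_times, n_folds, n_features)
# array, which HDF5 handles natively.
coefs = np.array([[est.coef_.ravel() for est in fold_ests]
                  for fold_ests in gat.estimators_])
```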

@agramfort
Member

Yes, use pickle - and we removed a lot of the old classes. Closing.
