I/O for time gen class #2189

Closed
choldgraf opened this issue Jun 5, 2015 · 22 comments

@choldgraf
Contributor

I've been running some time gen models, and a tricky problem I've come across is how to deal with I/O. The model fitting (well, the predictions actually) can take a really long time, so I'd prefer to have one script do all the model fitting, and another script that looks at the results.

However, right now I don't have a good system for reading / writing from disk. I am wondering if it'd be worth implementing something similar to AverageTFR, which basically just takes attributes, writes them in a dictionary to an h5 file, and then reads them back into memory in a similar manner.

The main challenge I see to this would be that people may have arbitrary sklearn objects in the attribute for models. I think the h5 code will serialize this, which is sub-optimal but at least shouldn't break. What do you guys think?
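
For concreteness, a minimal sketch of the AverageTFR-style round trip I have in mind, using h5io (the filename and the choice of attributes are just placeholders, not a settled API):

```python
from h5io import write_hdf5, read_hdf5

# Write the plain-data attributes out as a dict, then read them back,
# much like AverageTFR's save / read round trip.
write_hdf5('gat.h5', dict(scores_=gat.scores_, y_pred_=gat.y_pred_),
           overwrite=True)
data = read_hdf5('gat.h5')  # dict of attributes, restored from disk
```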

@kingjr
Member

kingjr commented Jun 5, 2015

So far I've just been using pickle and saving the whole object, but it's true that

  1. if people use home-made classes for the classifier and/or scorer, it can become a problem. I'm not sure whether h5 deals with this?
  2. it serializes everything too.

So yes, an ENH would be welcome in my opinion.

@choldgraf
Contributor Author

Yeah - I've been doing pickling too, but I'm always worried about an API change making it hard to unpickle. At least by storing components as H5, it's easier to change code and still make the I/O work. Also, pickling for me is really slow, especially for time-gen stuff with decently high sampling rates. Don't know if it'd make a difference, though, if the sklearn object would still be serialized...

@choldgraf
Contributor Author

So I was thinking about this a little bit more, and I think this would be straightforward for any timegen-specific attributes, but potentially very convoluted for the models, cv objects, etc. One option would be to use something like tables (maybe h5io?) which I think automatically pickles things that it can't easily handle.

However, doing this makes me wonder if it makes more sense to have the I/O just be something the user handles themselves for their use case. It doesn't seem like sklearn will improve their own I/O for the same reason of complexity. What do you think?

@kingjr
Member

kingjr commented Dec 22, 2015

I don't know much about I/O, so I don't have any strong opinion.

The large part of the gat object is 1) y_pred_, 2) scores_, and 3) the fitted parameters of each estimator (typically clf.coef_, but that varies greatly depending on the clf). The rest should be pretty light, so I guess pickling it should be fine (?).

ATM I personally pickle everything, but it's true that it's pretty slow...

@choldgraf
Contributor Author

Yep - I think all that is correct. As you mention, the challenge is that it's really hard to support arbitrary potential models that people will fit. (e.g., if they use logistic regression vs. a random forest, the attributes of interest change). In our case, we'd either have to hard-code ways to handle those specific use cases, or just pickle the whole thing anyway. And if we're pickling the bigger objects anyway, then it shouldn't be a big deal for the user to do the I/O with h5io on their own, no?
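
(For reference, the whole-object version really is just a couple of lines on the user side - a hedged sketch, assuming `gat` is any fitted time-gen object:)

```python
import pickle

# Dump the fitted object wholesale, then load it back later.
with open('gat.pkl', 'wb') as f:
    pickle.dump(gat, f)
with open('gat.pkl', 'rb') as f:
    gat = pickle.load(f)
```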

@kingjr
Member

kingjr commented Dec 22, 2015

If we can incorporate the I/O, we might as well do it for them. It would definitely be used.

@choldgraf
Contributor Author

Hmm ok - let me cook something up then. The basic flow as I see it would be:

  1. Define parts of the time gen class that are easily I/O'ed (e.g., scores_ and the parameters that are set)
  2. Do one of:
    1. Detect the CV and/or model used and extract coefficients from it in an intelligent manner.
    2. Pickle all of the CV / clf stuff and unpickle it for the read.

WDYT?
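
For the "pickle the CV / clf stuff" branch, roughly this (a hedged sketch - the function names and the native/pickled split are illustrative only):

```python
import pickle
import numpy as np
from h5io import write_hdf5, read_hdf5

def save_gat(fname, gat):
    # Keep easily handled attributes native; stuff the cv / clf
    # objects into a pickled byte array that HDF5 can store.
    blob = np.frombuffer(pickle.dumps((gat.clf, gat.cv)), dtype=np.uint8)
    write_hdf5(fname, dict(scores_=gat.scores_, pickled=blob),
               overwrite=True)

def load_gat(fname, gat):
    data = read_hdf5(fname)
    gat.scores_ = data['scores_']
    gat.clf, gat.cv = pickle.loads(data['pickled'].tobytes())
    return gat
```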

@dengemann
Member

Instead of pickling, I would save the attributes of the sklearn objects and their parameters, and then reconstruct them after reading. Most of these attributes should be standard types h5py can deal with.
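
Something along these lines, maybe (a rough sketch of the idea; the helper names are made up):

```python
import importlib

def clf_to_dict(clf):
    # Record the class location, constructor params, and the fitted
    # attributes (sklearn convention: trailing underscore).
    fitted = {key: val for key, val in vars(clf).items()
              if key.endswith('_')}
    return dict(module=clf.__class__.__module__,
                name=clf.__class__.__name__,
                params=clf.get_params(deep=False),
                fitted=fitted)

def clf_from_dict(d):
    # Re-instantiate the class, then restore the fitted attributes.
    cls = getattr(importlib.import_module(d['module']), d['name'])
    clf = cls(**d['params'])
    for key, val in d['fitted'].items():
        setattr(clf, key, val)
    return clf
```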

@choldgraf
Contributor Author

OK, so you vote for 'i' then? The only challenge is that it requires hard-coding for each type of sklearn object (e.g., it's coef_ if we're dealing with an SVM, but feature_importances_ if it's a RandomForest).

@jona-sassenhagen
Contributor

If we had a mechanism for figuring out the field holding the feature importance, we could also use that to plot model topomaps.
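
Something like this, maybe (a hedged sketch; it just probes the common sklearn conventions):

```python
def get_model_weights(clf):
    # Linear models expose coef_; tree ensembles feature_importances_.
    for attr in ('coef_', 'feature_importances_'):
        if hasattr(clf, attr):
            return getattr(clf, attr)
    raise ValueError('No known weight attribute on %s'
                     % type(clf).__name__)
```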

@choldgraf
Contributor Author

Yep, agreed - this is why I think it's important to keep the model parameter information instead of just keeping the scores etc. I'm +1 for doing this, just need to decide whether to explicitly handle those parameters or just blindly pickle it.

@kingjr
Member

kingjr commented Dec 22, 2015

I guess we can start easy with blind pickling and make subsequent PRs to improve it later. Or would this create incompatibility issues?

@agramfort
Member

agramfort commented Dec 22, 2015 via email

@jona-sassenhagen
Contributor

95% of use cases would probably be covered by just thinking about SVM and LR, right?

(Jean-Rémi, you're still mainly using LR and SVMs for classification GAT too, right?)

@kingjr
Member

kingjr commented Dec 22, 2015

I also regularly use ridge, lasso, multi-task lasso, and LDA.

@jona-sassenhagen
Contributor

Ah okay. I've never seen Ridge or LDA give better results than LR for my use cases...

@kingjr
Member

kingjr commented Dec 22, 2015

Ridge has been a bit better on some 32-electrode EEG datasets. And there's a RidgeCV that simplifies the grid search. But yeah, all of these are close.

@choldgraf
Contributor Author

So... choose 3-4 to support and then add support for new ones on an as-requested-enough basis? Linear SVM, LR, LDA, and Ridge? Or is there an Elastic Net classifier? Then we could just keep the tradeoff parameter in there to support both ridge/lasso in one go?

@jona-sassenhagen
Contributor

Interesting @kingjr, I use low-density EEG a lot and I always see LR and SVM strongly dominate Ridge.

@choldgraf Maybe even just start with the default (LR) to get the API going?
(Yes, there is an Elastic Net classifier.)

@kingjr
Member

kingjr commented Dec 23, 2015

@jona-sassenhagen in my case there are about 5000 epochs.

@choldgraf
Contributor Author

ok - though IIRC sklearn may behave differently for different kinds of coefficients. I think I remember an issue about this a while back where some estimators would let you set things like coef_, while others would raise an error. @agramfort correct me if I'm wrong?

One option, then, would be to assume that there will be a set of linear weights, one per input feature, belonging to the GAT object. Then we could store those weights as an array and do I/O accordingly, though that would lose some of the provenance for the full details of the model / CV / etc.
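
e.g., something like this (a hedged sketch; it assumes each fitted estimator in gat.estimators_ is linear and exposes coef_ with a consistent shape):

```python
import numpy as np

# Stack the linear weights into one (n_train_times, n_folds, n_features)
# array, which HDF5 handles natively.
coefs = np.array([[est.coef_.ravel() for est in fold_ests]
                  for fold_ests in gat.estimators_])
```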

@agramfort
Member

Yes, use pickle - and we removed a lot of the old classes. Closing.
