How to serialize models #11

Closed · Tracked by #14
joaquinvanschoren opened this issue Dec 23, 2015 · 39 comments
Labels
wontfix This will not be worked on

Comments

@joaquinvanschoren

More of a developer-to-developer question: we are working on exporting scikit-learn runs, but we are unsure what the best way to share learned models is. At first sight, creating a pickle seems the best and most general way to go. Matthias confirms that this works with scikit-learn SVMs, even though the files can get large for large datasets.

However, scikit-learn recommends using joblib because it is more efficient: http://scikit-learn.org/stable/modules/model_persistence.html

The problem here is that joblib creates a bunch of files in a folder. This is much harder to share, and sending many files to the OpenML server for every single run seems unwieldy and error-prone.

Would creating a single pickle file still be the best way forward, or is there a better solution?

@zardaloop

I guess in joblib you can use the compress option to produce a single file: https://pythonhosted.org/joblib/persistence.html
Would that answer your question?

@zardaloop

However, reading this article (https://pythonhosted.org/joblib/generated/joblib.dump.html), I don't think the parameter is a boolean; it is an integer between 0 and 9.
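
For illustration, a minimal sketch of that option (the estimator and file name are just placeholders; exact single-file behaviour depends on the joblib version):

import joblib
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
clf = SVC().fit(X, y)

# compress takes an integer from 0 (no compression) to 9 (maximum);
# with compression enabled, joblib writes one compressed file instead of
# a folder with several .npy files.
joblib.dump(clf, "model.joblib", compress=3)

clf_restored = joblib.load("model.joblib")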

@joaquinvanschoren

Ah, that looks really useful.

I did notice that joblib pickles are not supported across Python versions. Does that mean that if someone builds a scikit-learn model with Python 2, it cannot be loaded by someone running Python 3? Should we worry about that, or can it be easily solved?

@zardaloop

Where did you read that?

@joaquinvanschoren

@zardaloop On the bottom of the link you posted :)
https://pythonhosted.org/joblib/persistence.html

@zardaloop

Well, I guess you really need to rethink this, because joblib is only meant for local storage and that's about it. Even scikit-learn, to be able to rebuild a model with a future version, needs additional metadata along with the pickled model, containing:

  • The training data, e.g. a reference to an immutable snapshot
  • The Python source code used to generate the model
  • The versions of scikit-learn and its dependencies
  • The cross-validation score obtained on the training data

http://scikit-learn.org/stable/modules/model_persistence.html

@zardaloop

Therefore, as Matthias recommended, I also think pickle is your best bet. But you need to make sure to include the metadata along with the pickled model so it can work with future versions of scikit-learn 😊

@mfeurer

mfeurer commented Dec 24, 2015

I'm not sure if it's possible to easily read pickles created with Python 2 in Python 3 and vice versa. Given that Python 2 is used less and less, one might consider not supporting it at all.

Besides that, @zardaloop has a valid point that storing sklearn models is not that easy and I don't think sklearn has a common way to solve this issue except storing all metadata as @zardaloop suggested. We should have a look at this in the new year.

@amueller

I think joblib will do single-file exports soon. Maybe for the moment pickle is enough. Be sure to use the latest protocol of pickle, because the default results in much larger files (at least in python2, not sure about python3).
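
As a minimal sketch of that advice (the classifier and file name are just placeholders):

import pickle
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier().fit(X, y)

# The default protocol (especially on Python 2) is far less compact than the
# newest one, so request the highest available protocol explicitly.
with open("model.pkl", "wb") as f:
    pickle.dump(clf, f, protocol=pickle.HIGHEST_PROTOCOL)

with open("model.pkl", "rb") as f:
    clf_restored = pickle.load(f)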

Both joblib and pickle have the issue that they serialize an object without the corresponding class definition. So a model is only guaranteed to work and give the same result when used with the exact same code it was created with.
We try to keep loading conflicts to a minimum, but the trees frequently change their internal representation.

To make sure a result is entirely reproducible, the "easiest" way is to use docker containers or similar virtual environments (conda envs might be enough) with the exact same version of everything.

What is your exact use case?
A big question is whether you want the model to "work" or want the exact same results. Changing the numpy or scipy version, or changing the BLAS, might give different results. So if you want the exact same results, that's hard to achieve without a very controlled environment.

If you want to load a model that "works", having the same scikit-learn version is sufficient.

Even if the learning of a model, and therefore its serialization, didn't change between versions, it could be that a bug in the prediction code was fixed. So even if you can load a model from an older version, you are not guaranteed to get the same predictions.

Hope that helps. This is a tricky issue. Feel free to ping me on these discussions, I don't generally follow the tracker atm, but I'm happy to give input.

@joaquinvanschoren

Thanks, Andreas, for your valuable input. When it comes down to sharing the model itself, it is sufficient that it just works (will be able to give the same predictions given the same instances). It seems then that storing the exact scikit-learn version in the run, together with the pickle, is the most workable solution.

The reproducibility discussion is equally important though, and we should look into this when sharing flows. We are currently thinking of storing just the Python script that creates the model given a task, with meta-information such as the scikit-learn version, but a docker container would be a better solution (and we are exploring the same thing for R right now). We could generate those for each major scikit-learn version? Do you have experience with this in the scikit-learn team?


@mikecroucher

The best I can do at the moment is to offer advice on what not to do. Don't use pickle!

Here's a summary as to why

http://eev.ee/blog/2015/10/15/dont-use-pickle-use-camel/

I'm not sure what one should use instead though...still trying to figure that out myself.

@joaquinvanschoren

Interesting as that blog post is, do we really have an alternative right now? A library like scikit-learn could likely come up with something better, but expecting this from everyone running ML experiments in Python seems a tall order?

Incidentally, what causes pickles to break? Will they still break if one also provides a docker container with an environment in which they work?

Practically speaking, for the experiments that I want to run now, is it OK to use pickle until something better comes along?


@amueller

amueller commented Jan 4, 2016

A library like scikit-learn could likely come up with something better

If you think that, you overestimate our resources by a lot. We haven't been able to provide better backward compatibility, even with pickle.

When it comes down to sharing the model itself, it is sufficient that it just works (will be able to give the same predictions given the same instances)

Well, "given the same predictions given the same instances" can really only be guaranteed with a full container (because of BLAS issues etc.). If your system is reasonably static, storing the scikit-learn version will work as an intermediate solution. But once your hosting provider upgrades their distribution, you might be in trouble.
A conda environment is reasonably safe, I think.

We haven't done docker containers for reproducibility. We use travis and circleci and appveyor for continuous integration. But we don't really have a need to create highly reproducible environments.

@amueller

amueller commented Jan 4, 2016

I think (pickle or joblib or dill) + conda is the best solution for now, with (pickle or joblib or dill) + conda + docker as the optimal upgrade.

@drj11

drj11 commented Jan 5, 2016

@mikecroucher asked me to comment. I'm a Python old-hand, but know nothing of scikit-learn, so what I have to say is slanted more towards generic Python advice.

To be able to answer a question like "is pickle adequate" we have to be able to pin down some requirements. For example, is it required that:

  • basic persistence: I can persist a model and load it into a later session of the same software configuration;
  • forward persistence: I can persist a model and load it into a later session that uses newer versions of the software;
  • backward persistence: I can persist a model and load it into a later session that uses older versions of the software;
  • portable persistence: I can persist a model and transfer that persisted model for someone else to load into a later session.

I would guess that various people would want all of these in some combination, so the real issue is how much do you want to pay (in money, time, and tears) for each of these things.

Additionally, there are various semantic issues. For example: I might be able to load the model, but it gives different predictions, but the predictions are different only in ways that are unimportant (for example, a few ULP). @amueller seems to be aware of these.

With that in mind, pickle is terrible for all of those requirements except basic persistence. Loading a pickle runs arbitrary code, so you should never download and open a pickle. Pickles are extremely brittle (many reasons, but for example, they refer to classes by their module location, so if you reorganise your files for an internal class, everything breaks), so are next to useless for providing forwards or backwards compatibility.
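
A deliberately harmless sketch of the "arbitrary code" point: pickle will happily call whatever a payload's __reduce__ asks for on load.

import os
import pickle


class Malicious:
    # __reduce__ tells pickle how to rebuild the object; here it instructs
    # the loader to run a shell command, which is why unpickling untrusted
    # data amounts to executing untrusted code.
    def __reduce__(self):
        return (os.system, ("echo arbitrary code ran during unpickling",))


payload = pickle.dumps(Malicious())
pickle.loads(payload)  # runs the shell command instead of returning a Malicious object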

@zardaloop

@amueller and @drj11 Many thanks for the great input on this issue.
So I guess Dill + Conda seems to be the best option available, and I personally really like that approach, if I am understanding it correctly. So Andreas, just to be clear about what you are suggesting here regarding Dill + Conda: do you mean serialising the scikit-learn result object into a file using dill, and then making a Conda package that includes the scikit-learn metadata, the serialised file, and any other files needed to rebuild the model?

@mfeurer

mfeurer commented Jan 5, 2016

Conda seems to be a good idea to persist an environment. I'm not sure about Dill though. From the github website it seems like there is only a single developer/maintainer. We should keep that in mind if we want to base the python interface on that package.

@joaquinvanschoren

I think @amueller meant (pickle or joblib or dill) + conda, so instead of dill, pickle or joblib could also be used. I think that they all have the same problem that @drj11 mentions, though? Does joblib also execute arbitrary code?

@drj11, do you think that Conda mitigates the other problems that you mentioned (about software versions)?

@amueller

amueller commented Jan 5, 2016

yes, dill and joblib also execute arbitrary code. Though I don't think that there are security concerns here, as we/you will be creating the pickles, right? People won't be able to upload their own, right?

joblib and dill build on pickle, btw.

And for conda I mean create a conda virtual environment, build a model, store the model using scikit-learn, and also store the complete conda config (all versions, which are binary versions!). Then, you can recreate the exact same conda environment later using the conda config file, and load the model (using pickle or joblib or dill).
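
A sketch of that workflow under those assumptions (the environment name, file names, and the shell commands in the comments are illustrative):

# In a dedicated environment, created e.g. with
#   conda create -n openml-run python scikit-learn numpy scipy
import pickle
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
model = RandomForestClassifier().fit(X, y)
with open("model.pkl", "wb") as f:
    pickle.dump(model, f, protocol=pickle.HIGHEST_PROTOCOL)

# Record the exact (binary) package versions alongside the model:
#   conda env export -n openml-run > model_env.yml
# Later, recreate the environment and load the model inside it:
#   conda env create -f model_env.yml
#   (then, in Python)  model = pickle.load(open("model.pkl", "rb"))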

@amueller

amueller commented Jan 5, 2016

also thanks @drj11. Scikit-learn doesn't support anything but basic persistence currently.

The issue is that "version of the software" is a combination of scikit-learn, atlas, numpy, scipy, python, the bit-ness and the operating system. And it is hard to say which changes in the non-scikit-learn parts will lead only to ULP issues vs qualitative differences. Numeric computations, wohoo!

Using conda gives you at least fixed binaries for the libraries, and if we only share between OpenML servers, the OS will be pretty fixed, too.

@joaquinvanschoren

Thanks @amueller. Note that OpenML allows you to run your algorithms locally (or using any remote hardware you like), and then submit your results through the API. Otherwise it would not scale. Hence the OS can differ for different users. Does this complicate things for conda?

The pickles/joblibs/dills would indeed be created by the openml module (code that we provide and that does the interfacing with the OpenML API). In theory you could overwrite the module and, in a contrived way, link bona fide predictions to malicious code (in clear violation of the terms of use). To check that, we could test the models on the server, e.g. in a sandboxed environment. However, I don't think that this kind of attack is very likely, as OpenML is a collaboration tool: I will typically only reuse models of people that I am collaborating with, or that I trust as a researcher in good standing.

I like the pickle/joblib/dill + conda approach, and it is likely the best thing to do right now. Some other ML libs have their own model format, e.g. Caffe (http://caffe.berkeleyvision.org/model_zoo.html), which is safer, but as a general approach I think it will work fine.

@drj11

drj11 commented Jan 6, 2016

Just FTR since I was asked: I don't know enough about conda to have a reliable opinion, but if it can be used to record all versions of all software in use (as @amueller suggests), then that's a good start.

@amueller

amueller commented Jan 6, 2016

@joaquinvanschoren Ok, if people can submit their models, then you would need them to use conda and submit their conda environment config with the model. That is not terribly hard and probably the most feasible way.

There might still be minor differences due to OS, but the only way to avoid those is to have every user work in a virtual machine (or docker container) and provide the virtual machine along with the model. That is way more complicated, and probably not worth the effort.

@drj11 conda is basically a cross-platform package manager that ships binaries (unlike pip), mostly for python and related scientific software.

@amueller

amueller commented Jan 6, 2016

btw, you might be interested in reprozip and hyperos, which are two approaches to creating reproducible environments (but they are kinda alpha-stage iirc). Conda or docker seem the better choices for now.
One downside of conda is that it does not necessarily capture all dependencies.

If someone wrote a custom transformer (which probably most interesting models have), you have some code part that is not a standard package. So in addition to the environment config you get from conda, and the state you get from pickle, you also need to have access to the source of the custom part.
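
One hedged option for that custom-code part (it also comes up later in this thread) is cloudpickle, which serializes classes defined in a notebook or __main__ by value, so their source travels with the payload. A sketch:

import cloudpickle  # third-party package, not part of scikit-learn
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class AddOne(BaseEstimator, TransformerMixin):
    # A toy custom transformer whose code is not in any installed package.
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return np.asarray(X) + 1


payload = cloudpickle.dumps(AddOne())
# A fresh process that never saw this class can still rebuild it, because the
# class definition (for classes defined in __main__) is stored by value.
transformer = cloudpickle.loads(payload)
print(transformer.transform([[1, 2], [3, 4]]))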

@asmeurer

@zardaloop has asked me to comment here. I am not very familiar with the situation, so my comment will be generic. I don't have much experience with serialization, so I can't comment on that. As for creating a conda package, I can tell you that it is a good fit if the packaged files are read-only and can be installed to a location in the library prefix (the conda environment). If this is not the case, then conda packages are not a good fit.

@joaquinvanschoren

It would be great to rekindle this discussion, because it looks like it was converging towards a good solution, and storing models in OpenML would be very useful.

Would a conda+joblib/dill/pickle approach work? Even if it covers a large percentage of use cases it would make many people happy :)
@amueller What do you think of reprozip and hyperos two years later?

@mfeurer

mfeurer commented Mar 19, 2018

Another thing I'd like to mention: security. Pickle is a bit insecure, and I am very hesitant to put a solution based on pickle in the Python package. See here.

@rizplate

+1

@janvanrijn

and storing models in OpenML would be very useful.

Is there any (scientific or practical) use case in which storing models becomes relevant? The only thing I can think of is that when a new test set of data becomes available, the model can be re-evaluated on it. However, this unfortunately rarely happens.

@joaquinvanschoren

  • Generally looking into what your model looks like (visualize a tree, count support vectors, looking at learned weights,...)
  • Transfer learning, few-shot learning
  • Streaming data (where new data actually comes in)
  • Making predictions for new test instances (granted, this does not happen for older public datasets, but it does happen all the time in real applications)

I agree it is challenging, but I would really love to track the models I'm building. Maybe not during a large-scale benchmark, but there are plenty of other cases where I either want to look at the models to better understand what they are doing or share them so that other people may learn from them and reuse them.

@joaquinvanschoren

I recently talked to Matei (MLflow). They use a simple format which is just a file containing the model (could be a pickle) and some meta-data on how to read it in.

It is probably best to leave this to the user. The Python API should just retrieve the file and the meta-data that tells the user what to do with it. Reading in models will probably happen only occasionally.
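
A minimal sketch of that kind of layout, assuming a pickled model plus a small JSON sidecar (the field and file names are illustrative, not an MLflow or OpenML specification):

import json
import pickle
import sys

import sklearn
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=200).fit(X, y)

with open("model.pkl", "wb") as f:
    pickle.dump(model, f, protocol=pickle.HIGHEST_PROTOCOL)

# The sidecar tells a consumer how the artifact was produced and how to read it.
metadata = {
    "serialization": "pickle",
    "python_version": sys.version.split()[0],
    "sklearn_version": sklearn.__version__,
    "model_class": type(model).__name__,
}
with open("model.json", "w") as f:
    json.dump(metadata, f, indent=2)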

@mfeurer

mfeurer commented Jun 18, 2018

One more thing to keep in mind is file size. Running the default scikit-learn random forest on the popular EEG-Eye-State dataset (1471) results in a 7.5 MB pickle:

In [1]: %paste
import openml
import sklearn.ensemble
import pickle
data = openml.datasets.get_dataset(1471)
X, y = data.get_data(target=data.default_target_attribute)
rf = sklearn.ensemble.RandomForestClassifier()
rf.fit(X, y)
string = pickle.dumps(rf)
len(string) / 1024. / 1024.

Out[1]: 7.461672782897949

The most popular task on that dataset has ~85k runs. Assuming that only 1 percent of these are random forests, that would already require at least 6.3GB. If you increased the forest from the default 10 trees to something reasonable, this space requirement would grow drastically.

@rquintino

Hi everyone! I have been thinking a lot about this issue these past days, slightly more related to operationalization, pipeline reuse (e.g. evaluation), retraining, and complete reproducibility. I remembered from Joaquin that this was a hot question for OpenML, and this thread was a great read/help!

I'm perfectly aware of the security implications and the overall versioning issues of loaded resources, but even so, pipelines really solve so many of the issues that were bothering me... (if only they could be slightly easier to work with :) )

Adding one additional problem, mentioned above by @amueller: custom transformers. If we have to track the actual code for these, it is hard to see how this could be properly operationalized (and it is very error-prone).

I did some tests with cloudpickle (dill will probably do something similar?), and it seems to persist everything that is needed. No need to save/track any custom transformer code. It can load multiple pipelines with no problem. Everything seems really straightforward: save the pipeline, (in a new kernel) load, predict, refit, and it just works. Huge flexibility, e.g. evaluation on new refits.

I also did some experiments with mixing a sequential preparation flow, but in a fit/transform-compatible way...
(sample below, or you can test it in Binder here: https://mybinder.org/v2/gh/DevScope/ai-lab/master?filepath=notebooks%2Fdeconstructing%20-pipelines )

(Seems too good to be true... what do you think?
ps: does anyone know if the actual code is "recoverable" from the saved cloudpickle?)
Thanks!

# Assumes pandas, scikit-learn, and a Titanic-style DataFrame `df_full` are available.
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class FitState():
    """Container for state learned during fit()."""
    def __init__(self):
        pass


class PrepPipeline(BaseEstimator, TransformerMixin):
    
    def __init__(self,impute_age=True,impute_cabin=True,
                 add_missing_indicators=True,
                 train_filter="",copy=True,notes=None):
        self.impute_age=impute_age
        self.notes=notes
        self.copy=copy
        self.train_filter=train_filter
        self.impute_cabin=impute_cabin
        self.add_missing_indicators=add_missing_indicators
    
    def fit(self, X, y=None):
        print("Fitting...")
        self.fit_state=FitState()
        self.prepare(X=X,y=y,fit=True)
        return self

    def transform(self, X,y=None):
        assert isinstance(X, pd.DataFrame)
        print("Transforming...")
        return self.prepare(X=X,y=y,fit=False)
    
    def show_params(self):
        print("fit_state",vars(self.fit_state))
        print("params",self.get_params())
        
    # Experiment is reduce class overhead, bring related fit & transform closer, no models without pipelines
    def prepare(self,X,y=None,fit=False):
        print(f"Notes: {self.notes}")
        
        fit_state=self.fit_state
        if (self.copy):
            X=X.copy()
        
        # Fit only steps, ex: filtering, drop cols 
        if (fit):
            # Probably a very bad idea... thinking on it...
            if self.train_filter:
                X.query(self.train_filter,inplace=True)
        
        if (self.add_missing_indicators):
            if fit:
                fit_state.cols_with_nas=X.columns[X.isna().any()].tolist()
            X=pd.concat([X,X[fit_state.cols_with_nas].isnull().astype(int).add_suffix('_missing')],axis=1)
            
        # A typical titanic prep step (grabbed few ones from kaggle kernels)   
        if (self.impute_age):
            if fit:
                fit_state.impute_age=X.Age.median()

            X.Age.fillna(fit_state.impute_age,inplace=True)

        # Another one
        if (self.impute_cabin):
            if fit:
                fit_state.impute_cabin=X.Cabin.mode()[0]
            X.Cabin.fillna(fit_state.impute_cabin,inplace=True)
            
        return X


prep_pipeline=PrepPipeline(impute_age=True,impute_cabin=True, copy=False,train_filter="Sex=='female'",notes="test1")
X=prep_pipeline.fit_transform(df_full.copy())
prep_pipeline.show_params()
print(X.info())
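
To illustrate the save/load-in-a-new-kernel step described above, a minimal cloudpickle round-trip (assuming the fitted prep_pipeline object from the snippet above):

import pickle
import cloudpickle

# Persist the fitted pipeline; because PrepPipeline is defined in the notebook,
# cloudpickle stores its class definition by value along with the fitted state.
with open("prep_pipeline.pkl", "wb") as f:
    cloudpickle.dump(prep_pipeline, f)

# In a new kernel/process, plain pickle can read it back; there is no need to
# re-import the custom transformer code because it travels with the payload.
with open("prep_pipeline.pkl", "rb") as f:
    restored = pickle.load(f)
# restored.transform(...) / restored.fit(...) then work as before.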

@rquintino

ps: like mentioned above, the size and number of runs will probably be a challenge for OpenML.
Nevertheless, it is really interesting that, when saving the full pipelines (the complete flow with all prep/model steps), we can refit/predict with new train/test folds at any time, e.g. to refresh a leaderboard.

Note that if the pipeline was really a grid-search fit, then refitting would be rather expensive. :)

@rth

rth commented Sep 17, 2018

For serialization, the ONNX format might also be relevant (cf. https://github.com/onnx/onnxmltools)
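
A hedged sketch of what such an export could look like with the scikit-learn ONNX converter (here via skl2onnx, a sibling of onnxmltools; the input signature below is an assumption for a 4-feature model):

from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
model = RandomForestClassifier().fit(X, y)

# ONNX requires an explicit input type and shape; None allows any batch size.
onnx_model = convert_sklearn(
    model, initial_types=[("float_input", FloatTensorType([None, X.shape[1]]))]
)
with open("model.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())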

@mfeurer added the wontfix label Nov 11, 2019
@PGijsbers

@mfeurer I suggest we archive this in the broader OpenML discussion board

@mfeurer

mfeurer commented Feb 24, 2023

I fully agree on this, will you do so?

@PGijsbers

Please leave any further comments on this issue in the related OpenML Discussion thread.

@PGijsbers transferred this issue from openml/openml-python Feb 27, 2023
@openml locked and limited conversation to collaborators Feb 27, 2023
@PGijsbers converted this issue into discussion #12 Feb 27, 2023
