How to serialize models #11

Closed · Tracked by #14
joaquinvanschoren opened this issue Dec 23, 2015 · 39 comments
Labels
wontfix This will not be worked on

Comments

@joaquinvanschoren

More of a developer-to-developer question: we are working on exporting scikit-learn runs, but we are unsure what the best way to share learned models is. At first sight, creating a pickle seems the best and most general way to go. Matthias confirms that this works with scikit-learn SVMs, even though the files can get large for large datasets.

However, scikit-learn recommends using joblib because it is more efficient: http://scikit-learn.org/stable/modules/model_persistence.html

The problem here is that joblib creates a bunch of files in a folder. This is much harder to share, and sending many files to the OpenML server for every single run seems unwieldy and error-prone.

Would creating a single pickle file still be the best way forward, or is there a better solution?

@zardaloop

I guess in joblib you can use the compress option to produce a single file: https://pythonhosted.org/joblib/persistence.html
Would that answer your question?

@zardaloop

However, reading this article (https://pythonhosted.org/joblib/generated/joblib.dump.html), I don't think the parameter is a boolean; it is an integer between 0 and 9.
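
For illustration, a minimal sketch of that option (the estimator and file name are just placeholders; exact single-file behaviour depends on the joblib version):

import joblib
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
clf = SVC().fit(X, y)

# compress takes an integer from 0 (no compression) to 9 (maximum);
# with compression enabled, joblib writes one compressed file instead of
# a folder with several .npy files.
joblib.dump(clf, "model.joblib", compress=3)

clf_restored = joblib.load("model.joblib")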

@joaquinvanschoren

Ah, that looks really useful.

I did notice that joblib pickles are not supported across Python versions. Does that mean that if someone builds a scikit-learn model with Python 2, it cannot be loaded by someone running Python 3? Should we worry about that, or can it be easily solved?

@zardaloop

Where did you read that?

@joaquinvanschoren

@zardaloop On the bottom of the link you posted :)
https://pythonhosted.org/joblib/persistence.html

@zardaloop

Well, I guess you really need to rethink this, because joblib is only meant for local storage and that's about it. Even scikit-learn, to be able to rebuild a model with a future version, needs additional metadata along with the pickled model, containing:

  • The training data, e.g. a reference to an immutable snapshot
  • The Python source code used to generate the model
  • The versions of scikit-learn and its dependencies
  • The cross-validation score obtained on the training data

http://scikit-learn.org/stable/modules/model_persistence.html

@zardaloop

Therefore, as Matthias recommended, I also think pickle is your best bet. But you need to make sure to include the metadata along with the pickled model so it can work with future versions of scikit-learn 😊

@mfeurer

mfeurer commented Dec 24, 2015

I'm not sure if it's possible to easily read pickles created with Python 2 in Python 3 and vice versa. Given that Python 2 is used less and less, one might consider not supporting it at all.

Besides that, @zardaloop has a valid point that storing sklearn models is not that easy and I don't think sklearn has a common way to solve this issue except storing all metadata as @zardaloop suggested. We should have a look at this in the new year.

@amueller

I think joblib will do single-file exports soon. Maybe for the moment pickle is enough. Be sure to use the latest protocol of pickle, because the default results in much larger files (at least in python2, not sure about python3).
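
As a minimal sketch of that advice (the classifier and file name are just placeholders):

import pickle
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier().fit(X, y)

# The default protocol (especially on Python 2) is far less compact than the
# newest one, so request the highest available protocol explicitly.
with open("model.pkl", "wb") as f:
    pickle.dump(clf, f, protocol=pickle.HIGHEST_PROTOCOL)

with open("model.pkl", "rb") as f:
    clf_restored = pickle.load(f)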

Both joblib and pickle have the issue that they serialize an object without the corresponding class definition. So a model is only guaranteed to work and give the same result when used with the exact same code it was created with.
We try to keep loading conflicts to a minimum, but the trees frequently change their internal representation.

To make sure a result is entirely reproducible, the "easiest" way is to use docker containers or similar virtual environments (conda envs might be enough) with the exact same version of everything.

What is your exact use case?
A big question is whether you want the model to "work" or want the exact same results. Changing the numpy or scipy version, or changing the BLAS, might give different results. So if you want the exact same results, that's hard to achieve without a very controlled environment.

If you want to load a model that "works", having the same scikit-learn version is sufficient.

Even if the learning of a model, and therefore its serialization, didn't change between versions, it could be that a bug in the prediction code was fixed. So even if you can load a model from an older version, you are not guaranteed to get the same predictions.

Hope that helps. This is a tricky issue. Feel free to ping me on these discussions, I don't generally follow the tracker atm, but I'm happy to give input.

@joaquinvanschoren

Thanks, Andreas, for your valuable input. When it comes down to sharing the model itself, it is sufficient that it just works (will be able to give the same predictions given the same instances). It seems then that storing the exact scikit-learn version in the run, together with the pickle, is the most workable solution.

The reproducibility discussion is equally important though, and we should look into this when sharing flows. We are currently thinking of storing just the Python script that creates the model given a task, with meta-information such as the scikit-learn version, but a docker container would be a better solution (and we are exploring the same thing for R right now). We could generate those for each major scikit-learn version? Do you have experience with this in the scikit-learn team?


@mikecroucher

The best I can do at the moment is to offer advice on what not to do. Don't use pickle!

Here's a summary as to why

http://eev.ee/blog/2015/10/15/dont-use-pickle-use-camel/

I'm not sure what one should use instead though...still trying to figure that out myself.

@joaquinvanschoren

Interesting as that blog post is, do we really have an alternative right now? A library like scikit-learn could likely come up with something better, but expecting this from everyone running ML experiments in Python seems a tall order?

Incidentally, what causes pickles to break? Will they still break if one also provides a docker container with an environment in which they work?

Practically speaking, for the experiments that I want to run now, is it OK to use pickle until something better comes along?


@amueller

amueller commented Jan 4, 2016

A library like scikit-learn could likely come up with something better

If you think that, you overestimate our resources by a lot. We haven't been able to provide better backward compatibility, even with pickle.

When it comes down to sharing the model itself, it is sufficient that it just works (will be able to give the same predictions given the same instances)

Well, "given the same predictions given the same instances" can really only be guaranteed with a full container (because of BLAS issues etc.). If your system is reasonably static, storing the scikit-learn version will work as an intermediate solution. But once your hosting provider upgrades their distribution, you might be in trouble.
A conda environment is reasonably safe, I think.

We haven't done docker containers for reproducibility. We use travis and circleci and appveyor for continuous integration. But we don't really have a need to create highly reproducible environments.

@amueller

amueller commented Jan 4, 2016

I think (pickle or joblib or dill) + conda is the best solution for now, with (pickle or joblib or dill) + conda + docker as the optimal upgrade.

@drj11

drj11 commented Jan 5, 2016

@mikecroucher asked me to comment. I'm a Python old-hand, but know nothing of scikit-learn, so what I have to say is slanted more towards generic Python advice.

To be able to answer a question like "is pickle adequate" we have to be able to pin down some requirements. For example, is it required that:

  • basic persistence: I can persist a model and load it into a later session of the same software configuration;
  • forward persistence: I can persist a model and load it into a later session that uses newer versions of the software;
  • backward persistence: I can persist a model and load it into a later session that uses older versions of the software;
  • portable persistence: I can persist a model and transfer that persisted model for someone else to load into a later session.

I would guess that various people would want all of these in some combination, so the real issue is how much do you want to pay (in money, time, and tears) for each of these things.

Additionally, there are various semantic issues. For example: I might be able to load the model, but it gives different predictions, but the predictions are different only in ways that are unimportant (for example, a few ULP). @amueller seems to be aware of these.

With that in mind, pickle is terrible for all of those requirements except basic persistence. Loading a pickle runs arbitrary code, so you should never download and open a pickle. Pickles are extremely brittle (many reasons, but for example, they refer to classes by their module location, so if you reorganise your files for an internal class, everything breaks), so are next to useless for providing forwards or backwards compatibility.
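
A deliberately harmless sketch of the "arbitrary code" point: pickle will happily call whatever a payload's __reduce__ asks for on load.

import os
import pickle


class Malicious:
    # __reduce__ tells pickle how to rebuild the object; here it instructs
    # the loader to run a shell command, which is why unpickling untrusted
    # data amounts to executing untrusted code.
    def __reduce__(self):
        return (os.system, ("echo arbitrary code ran during unpickling",))


payload = pickle.dumps(Malicious())
pickle.loads(payload)  # runs the shell command instead of returning a Malicious object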

@zardaloop

@amueller and @drj11 Many thanks for the great input on this issue.
So I guess Dill + Conda seems to be the best option available, and I personally really like that approach, if I am understanding it correctly. So Andreas, just to be clear about what you are suggesting here regarding Dill + Conda: do you mean serialising the scikit-learn result object into a file using dill, and then making a Conda package that includes the scikit-learn metadata, the serialised file, and any other files needed to rebuild the model?

@mfeurer

mfeurer commented Jan 5, 2016

Conda seems to be a good idea to persist an environment. I'm not sure about Dill though. From the github website it seems like there is only a single developer/maintainer. We should keep that in mind if we want to base the python interface on that package.

@joaquinvanschoren

I think @amueller meant (pickle or joblib or dill) + conda, so instead of dill, pickle or joblib could also be used. I think that they all have the same problem that @drj11 mentions, though? Does joblib also execute arbitrary code?

@drj11, do you think that Conda mitigates the other problems that you mentioned (about software versions)?

@amueller

amueller commented Jan 5, 2016

yes, dill and joblib also execute arbitrary code. Though I don't think that there are security concerns here, as we/you will be creating the pickles, right? People won't be able to upload their own, right?

joblib and dill build on pickle, btw.

And for conda I mean create a conda virtual environment, build a model, store the model using scikit-learn, and also store the complete conda config (all versions, which are binary versions!). Then, you can recreate the exact same conda environment later using the conda config file, and load the model (using pickle or joblib or dill).
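
A sketch of that workflow under those assumptions (the environment name, file names, and the shell commands in the comments are illustrative):

# In a dedicated environment, created e.g. with
#   conda create -n openml-run python scikit-learn numpy scipy
import pickle
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
model = RandomForestClassifier().fit(X, y)
with open("model.pkl", "wb") as f:
    pickle.dump(model, f, protocol=pickle.HIGHEST_PROTOCOL)

# Record the exact (binary) package versions alongside the model:
#   conda env export -n openml-run > model_env.yml
# Later, recreate the environment and load the model inside it:
#   conda env create -f model_env.yml
#   (then, in Python)  model = pickle.load(open("model.pkl", "rb"))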

@amueller

amueller commented Jan 5, 2016

also thanks @drj11. Scikit-learn doesn't support anything but basic persistence currently.

The issue is that "version of the software" is a combination of scikit-learn, atlas, numpy, scipy, python, the bit-ness and the operating system. And it is hard to say which changes in the non-scikit-learn parts will lead only to ULP issues vs qualitative differences. Numeric computations, wohoo!

Using conda gives you at least fixed binaries for the libraries, and if we only share between OpenML servers, the OS will be pretty fixed, too.

@joaquinvanschoren

Thanks @amueller. Note that OpenML allows you to run your algorithms locally (or using any remote hardware you like), and then submit your results through the API. Otherwise it would not scale. Hence the OS can differ for different users. Does this complicate things for conda?

The pickles/joblibs/dills would indeed be created by the openml module (code that we provide and that does the interfacing with the OpenML API). In theory you could overwrite the module and, in a contrived way, link bona fide predictions to malicious code (in clear violation of the terms of use). To check that, we could test the models on the server, e.g. in a sandboxed environment. However, I don't think that this kind of attack is very likely, as OpenML is a collaboration tool: I will typically only reuse models of people that I am collaborating with, or that I trust as a researcher in good standing.

I like the pickle/joblib/dill + conda approach, and it is likely the best thing to do right now. Some other ML libs have their own model format, e.g. Caffe (http://caffe.berkeleyvision.org/model_zoo.html), which is safer, but as a general approach I think it will work fine.

@drj11

drj11 commented Jan 6, 2016

Just FTR since I was asked: I don't know enough about conda to have a reliable opinion, but if it can be used to record all versions of all software in use (as @amueller suggests), then that's a good start.

@amueller

amueller commented Jan 6, 2016

@joaquinvanschoren Ok, if people can submit their models, then you would need them to use conda and submit their conda environment config with the model. That is not terribly hard and probably the most feasible way.

There might still be minor differences due to OS, but the only way to avoid those is to have every user work in a virtual machine (or docker container) and provide the virtual machine along with the model. That is way more complicated, and probably not worth the effort.

@drj11 conda is basically a cross-platform package manager that ships binaries (unlike pip), mostly for python and related scientific software.

@amueller

amueller commented Jan 6, 2016

btw, you might be interested in reprozip and hyperos, which are two approaches to creating reproducible environments (but they are kinda alpha-stage iirc). Conda or docker seem the better choices for now.
One downside of conda is that it does not necessarily capture all dependencies.

If someone wrote a custom transformer (which probably most interesting models have), you have some code part that is not a standard package. So in addition to the environment config you get from conda, and the state you get from pickle, you also need to have access to the source of the custom part.
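
One hedged option for that custom-code part (it also comes up later in this thread) is cloudpickle, which serializes classes defined in a notebook or __main__ by value, so their source travels with the payload. A sketch:

import cloudpickle  # third-party package, not part of scikit-learn
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class AddOne(BaseEstimator, TransformerMixin):
    # A toy custom transformer whose code is not in any installed package.
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return np.asarray(X) + 1


payload = cloudpickle.dumps(AddOne())
# A fresh process that never saw this class can still rebuild it, because the
# class definition (for classes defined in __main__) is stored by value.
transformer = cloudpickle.loads(payload)
print(transformer.transform([[1, 2], [3, 4]]))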

@asmeurer

@zardaloop has asked me to comment here. I am not very familiar with the situation, so my comment will be generic. I don't have much experience with serialization, so I can't comment on that. As for creating a conda package, I can tell you that it is a good fit if the packaged files are read-only and can be installed to a location in the library prefix (the conda environment). If this is not the case, then conda packages are not a good fit.

@joaquinvanschoren

It would be great to rekindle this discussion, because it looks like it was converging towards a good solution, and storing models in OpenML would be very useful.

Would a conda+joblib/dill/pickle approach work? Even if it covers a large percentage of use cases it would make many people happy :)
@amueller What do you think of reprozip and hyperos two years later?

@mfeurer

mfeurer commented Mar 19, 2018

Another thing I'd like to mention: security. Pickle is a bit insecure, and I am very hesitant to put a solution based on pickle in the Python package. See here.

@rizplate

+1

@janvanrijn

and storing models in OpenML would be very useful.

Is there any (scientific or practical) use case in which storing models becomes relevant? The only thing I can think of is that when a new test set of data becomes available, the model can be re-evaluated on it. However, this unfortunately rarely happens.

@joaquinvanschoren

  • Generally looking into what your model looks like (visualize a tree, count support vectors, looking at learned weights,...)
  • Transfer learning, few-shot learning
  • Streaming data (where new data actually comes in)
  • Making predictions for new test instances (granted, this does not happen for older public datasets, but it does happen all the time in real applications)

I agree it is challenging, but I would really love to track the models I'm building. Maybe not during a large-scale benchmark, but there are plenty of other cases where I either want to look at the models to better understand what they are doing or share them so that other people may learn from them and reuse them.

@joaquinvanschoren

I recently talked to Matei (MLflow). They use a simple format which is just a file containing the model (could be a pickle) and some meta-data on how to read it in.

It is probably best to leave this to the user. The Python API should just retrieve the file and the meta-data that tells the user what to do with it. Reading in models will probably happen only occasionally.
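
A minimal sketch of that kind of layout, assuming a pickled model plus a small JSON sidecar (the field and file names are illustrative, not an MLflow or OpenML specification):

import json
import pickle
import sys

import sklearn
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=200).fit(X, y)

with open("model.pkl", "wb") as f:
    pickle.dump(model, f, protocol=pickle.HIGHEST_PROTOCOL)

# The sidecar tells a consumer how the artifact was produced and how to read it.
metadata = {
    "serialization": "pickle",
    "python_version": sys.version.split()[0],
    "sklearn_version": sklearn.__version__,
    "model_class": type(model).__name__,
}
with open("model.json", "w") as f:
    json.dump(metadata, f, indent=2)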

@mfeurer

mfeurer commented Jun 18, 2018

One more thing to keep in mind is file size. Running the default scikit-learn random forest on the popular EEG-Eye-State dataset (1471) results in a 7.5 MB pickle:

In [1]: %paste
import openml
import sklearn.ensemble
import pickle
data = openml.datasets.get_dataset(1471)
X, y = data.get_data(target=data.default_target_attribute)
rf = sklearn.ensemble.RandomForestClassifier()
rf.fit(X, y)
string = pickle.dumps(rf)
len(string) / 1024. / 1024.

Out[1]: 7.461672782897949

The most popular task on that dataset has ~85k runs. Assuming that only 1 percent of these are random forests, that would already require at least 6.3GB. If you increased the forest from the default 10 trees to something reasonable, this space requirement would grow drastically.

@rquintino

Hi everyone! I have been thinking a lot about this issue these past days, slightly more related to operationalization, pipeline reuse (e.g. evaluation), retraining, and complete reproducibility. I remembered from Joaquin that this was a hot question for OpenML, and this thread was a great read/help!

I'm perfectly aware of the security implications and the overall versioning issues of loaded resources, but even so, pipelines really solve so many of the issues that were bothering me... (if only they could be slightly easier to work with :) )

Adding one additional problem, mentioned above by @amueller: custom transformers. If we have to track the actual code for these, it is hard to see how this could be properly operationalized (and it is very error-prone).

I did some tests with cloudpickle (dill will probably do something similar?), and it seems to persist everything that is needed. No need to save/track any custom transformer code. It can load multiple pipelines with no problem. Everything seems really straightforward: save the pipeline, (in a new kernel) load, predict, refit, and it just works. Huge flexibility, e.g. evaluation on new refits.

I also did some experiments with mixing a sequential preparation flow, but in a fit/transform-compatible way...
(sample below, or you can test it in Binder here: https://mybinder.org/v2/gh/DevScope/ai-lab/master?filepath=notebooks%2Fdeconstructing%20-pipelines )

(Seems too good to be true... what do you think?
ps: does anyone know if the actual code is "recoverable" from the saved cloudpickle?)
Thanks!

# Assumes pandas, scikit-learn, and a Titanic-style DataFrame `df_full` are available.
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class FitState():
    """Container for state learned during fit()."""
    def __init__(self):
        pass


class PrepPipeline(BaseEstimator, TransformerMixin):
    
    def __init__(self,impute_age=True,impute_cabin=True,
                 add_missing_indicators=True,
                 train_filter="",copy=True,notes=None):
        self.impute_age=impute_age
        self.notes=notes
        self.copy=copy
        self.train_filter=train_filter
        self.impute_cabin=impute_cabin
        self.add_missing_indicators=add_missing_indicators
    
    def fit(self, X, y=None):
        print("Fitting...")
        self.fit_state=FitState()
        self.prepare(X=X,y=y,fit=True)
        return self

    def transform(self, X,y=None):
        assert isinstance(X, pd.DataFrame)
        print("Transforming...")
        return self.prepare(X=X,y=y,fit=False)
    
    def show_params(self):
        print("fit_state",vars(self.fit_state))
        print("params",self.get_params())
        
    # Experiment is reduce class overhead, bring related fit & transform closer, no models without pipelines
    def prepare(self,X,y=None,fit=False):
        print(f"Notes: {self.notes}")
        
        fit_state=self.fit_state
        if (self.copy):
            X=X.copy()
        
        # Fit only steps, ex: filtering, drop cols 
        if (fit):
            # Probably a very bad idea... thinking on it...
            if self.train_filter:
                X.query(self.train_filter,inplace=True)
        
        if (self.add_missing_indicators):
            if fit:
                fit_state.cols_with_nas=X.columns[X.isna().any()].tolist()
            X=pd.concat([X,X[fit_state.cols_with_nas].isnull().astype(int).add_suffix('_missing')],axis=1)
            
        # A typical titanic prep step (grabbed few ones from kaggle kernels)   
        if (self.impute_age):
            if fit:
                fit_state.impute_age=X.Age.median()

            X.Age.fillna(fit_state.impute_age,inplace=True)

        # Another one
        if (self.impute_cabin):
            if fit:
                fit_state.impute_cabin=X.Cabin.mode()[0]
            X.Cabin.fillna(fit_state.impute_cabin,inplace=True)
            
        return X


prep_pipeline=PrepPipeline(impute_age=True,impute_cabin=True, copy=False,train_filter="Sex=='female'",notes="test1")
X=prep_pipeline.fit_transform(df_full.copy())
prep_pipeline.show_params()
print(X.info())
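
To illustrate the save/load-in-a-new-kernel step described above, a minimal cloudpickle round-trip (assuming the fitted prep_pipeline object from the snippet above):

import pickle
import cloudpickle

# Persist the fitted pipeline; because PrepPipeline is defined in the notebook,
# cloudpickle stores its class definition by value along with the fitted state.
with open("prep_pipeline.pkl", "wb") as f:
    cloudpickle.dump(prep_pipeline, f)

# In a new kernel/process, plain pickle can read it back; there is no need to
# re-import the custom transformer code because it travels with the payload.
with open("prep_pipeline.pkl", "rb") as f:
    restored = pickle.load(f)
# restored.transform(...) / restored.fit(...) then work as before.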

@rquintino

ps: like mentioned above, the size and number of runs will probably be a challenge for OpenML.
Nevertheless, it is really interesting that, when saving the full pipelines (the complete flow with all prep/model steps), we can refit/predict with new train/test folds at any time, e.g. to refresh a leaderboard.

Note that if the pipeline was really a grid-search fit, then refitting would be rather expensive. :)

@rth

rth commented Sep 17, 2018

For serialization, the ONNX format might also be relevant (cf. https://github.com/onnx/onnxmltools)
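
A hedged sketch of what such an export could look like with the scikit-learn ONNX converter (here via skl2onnx, a sibling of onnxmltools; the input signature below is an assumption for a 4-feature model):

from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
model = RandomForestClassifier().fit(X, y)

# ONNX requires an explicit input type and shape; None allows any batch size.
onnx_model = convert_sklearn(
    model, initial_types=[("float_input", FloatTensorType([None, X.shape[1]]))]
)
with open("model.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())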

@mfeurer added the wontfix label Nov 11, 2019
@PGijsbers

@mfeurer I suggest we archive this in the broader OpenML discussion board

@mfeurer

mfeurer commented Feb 24, 2023

I fully agree on this, will you do so?

@PGijsbers

Please leave any further comments on this issue in the related OpenML Discussion thread.

@PGijsbers transferred this issue from openml/openml-python Feb 27, 2023
@openml locked and limited conversation to collaborators Feb 27, 2023
@PGijsbers converted this issue into discussion #12 Feb 27, 2023
