
Inconsistent prediction results after dumping and loading lightgbm.LGBMClassifier via pickle #2449

Closed
slam3085 opened this issue Sep 25, 2019 · 10 comments

Comments

@slam3085

Basically, title.

  1. train an LGBMClassifier model
  2. get predictions for the test set
  3. dump the model via pickle (I also tried joblib - same issue)
  4. load the model
  5. get different predictions for the test set

Both models have the same parameters and metrics in their properties.
I'm using Python 3.7.3 and LightGBM 2.2.3.
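For illustration, a minimal self-contained sketch of these steps, using synthetic data as a stand-in for the original (non-shared) dataset and parameters that are assumptions rather than the reporter's actual setup:

import pickle

import numpy as np
import lightgbm as lgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; the original report uses private data
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 1. train the model
clf = lgb.LGBMClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# 2. get predictions for the test set
pred_before = clf.predict_proba(X_test)[:, 1]

# 3./4. dump and reload the model via pickle
with open("model.pkl", "wb") as f:
    pickle.dump(clf, f)
with open("model.pkl", "rb") as f:
    clf_loaded = pickle.load(f)

# 5. compare predictions; in the same environment these should match exactly
pred_after = clf_loaded.predict_proba(X_test)[:, 1]
np.testing.assert_allclose(pred_before, pred_after)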

@StrikerRUS
Collaborator

@slam3085 Please post the whole snippet. Also, attach the data you use, if it's not sensitive. Or can you reproduce this issue with random data?

We do have a test that checks predictions stay the same after dumping and loading:

def test_joblib(self):
    X, y = load_boston(True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)
    gbm = lgb.LGBMRegressor(n_estimators=10, objective=custom_asymmetric_obj,
                            silent=True, importance_type='split')
    gbm.fit(X_train, y_train, eval_set=[(X_train, y_train), (X_test, y_test)],
            eval_metric=mse, early_stopping_rounds=5, verbose=False,
            callbacks=[lgb.reset_parameter(learning_rate=list(np.arange(1, 0, -0.1)))])
    joblib.dump(gbm, 'lgb.pkl')  # test model with custom functions
    gbm_pickle = joblib.load('lgb.pkl')
    self.assertIsInstance(gbm_pickle.booster_, lgb.Booster)
    self.assertDictEqual(gbm.get_params(), gbm_pickle.get_params())
    np.testing.assert_array_equal(gbm.feature_importances_, gbm_pickle.feature_importances_)
    self.assertAlmostEqual(gbm_pickle.learning_rate, 0.1)
    self.assertTrue(callable(gbm_pickle.objective))
    for eval_set in gbm.evals_result_:
        for metric in gbm.evals_result_[eval_set]:
            np.testing.assert_allclose(gbm.evals_result_[eval_set][metric],
                                       gbm_pickle.evals_result_[eval_set][metric])
    pred_origin = gbm.predict(X_test)
    pred_pickle = gbm_pickle.predict(X_test)
    np.testing.assert_allclose(pred_origin, pred_pickle)
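Note that this is an excerpt from LightGBM's test suite; to run it standalone you would also need roughly the following imports and helpers (the bodies of custom_asymmetric_obj and mse below are illustrative stand-ins, not copied from the test file):

import joblib
import numpy as np
import lightgbm as lgb
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split

def custom_asymmetric_obj(y_true, y_pred):
    # Illustrative asymmetric squared loss: returns gradient and hessian
    residual = (y_true - y_pred).astype(np.float64)
    grad = np.where(residual < 0, -2 * 10.0 * residual, -2 * residual)
    hess = np.where(residual < 0, 2 * 10.0, 2.0)
    return grad, hess

def mse(y_true, y_pred):
    # Illustrative custom eval metric: (name, value, is_higher_better)
    return 'custom_mse', np.mean((y_true - y_pred) ** 2), False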

@StrikerRUS StrikerRUS added the bug label Sep 26, 2019
@slam3085
Author

Hi!

I can't share the data and the snippet looks like this:

try:
    # Load a previously trained model from disk if it exists
    with open(path, "rb") as f:
        mdl = pickle.load(f)
    print(f'lightgbm_fitted_model_{today} loaded')
except:
    # Otherwise fit a new model and persist it via pickle
    print(f'Unable to open lightgbm_fitted_model_{today}; fitting...')
    X_actual_train, X_eval, y_actual_train, y_eval = train_test_split(X_train, y_train, test_size=0.1,
                                                                      random_state=42, stratify=y_train)

    hyperparams = {'n_estimators': 200, 'class_weight': 'balanced', 'random_state': 42}
    mdl = lightgbm.LGBMClassifier(**hyperparams)
    mdl.fit(X_actual_train, y_actual_train, eval_set=(X_eval, y_eval), eval_metric='roc_auc')
    with open(path, "wb") as f:
        pickle.dump(mdl, f)
# Score the test set with whichever model we ended up with
pred_proba = mdl.predict_proba(X_test)[:, 1]
n_pred_proba_95 = sum(pred_proba >= 0.95)
print(f'stats: {n_pred_proba_95} out of {len(pred_proba)} have predicted probability >= 0.95 for class 1')
return pred_proba

Basically, I either fit a new model and dump it via pickle, or I use an already trained one.

I only tried to reproduce this in my local environment, and I couldn't; now I understand why - my local environment runs Python 3.7.3 while production runs Python 3.7.0. So the actual steps to reproduce are:

  1. train the model on Python 3.7.3
  2. load it on Python 3.7.0
  3. prediction results differ

I guess it's not really a bug then, or a very minor one, and the issue can be closed.
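One way to catch this kind of silent environment mismatch early is to store the interpreter and library versions next to the pickled model and compare them at load time. A minimal sketch (the helper functions are hypothetical, not part of the snippet above):

import sys
import pickle
import lightgbm

def dump_with_env(model, path):
    # Store interpreter and library versions alongside the model
    payload = {
        "model": model,
        "python": sys.version,
        "lightgbm": lightgbm.__version__,
    }
    with open(path, "wb") as f:
        pickle.dump(payload, f)

def load_with_env(path):
    with open(path, "rb") as f:
        payload = pickle.load(f)
    # Warn if the loading environment differs from the training one
    if payload["python"] != sys.version or payload["lightgbm"] != lightgbm.__version__:
        print(f'Warning: trained on Python {payload["python"]} / LightGBM {payload["lightgbm"]}, '
              f'loading on Python {sys.version} / LightGBM {lightgbm.__version__}')
    return payload["model"]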

@guolinke guolinke removed the bug label Sep 26, 2019
@StrikerRUS
Collaborator

@slam3085

I only tried to reproduce this in my local environment, and I couldn't; now I understand why - my local environment runs Python 3.7.3 while production runs Python 3.7.0.

Makes sense! Thanks a lot for your investigation.

@dcanones

dcanones commented Dec 11, 2019

We are having the same problem. It happens when we try to deploy our models to production on a different system/environment than the one the model was trained on, and it is not easy to reproduce. We have observed this issue twice:

  1. Deploying a model trained on one system (a node of a Hadoop cluster running CentOS, with 512 GB RAM and a 32-core Xeon) and loading the pickle/text model on the rest of the nodes. Predictions are correct only on the system where the model was trained and inconsistent on the rest (for example, a continuous variable with values like [123.5, 143.2, 456.4] got predictions like [-4.0, 5.0, -9.0]). The data is of course the same.

  2. Today, while deploying with MLflow and setting up an endpoint (Ubuntu 16.04 LTS), exactly the same problem.

This happens with both the scikit-learn API and the native Python API. If we swap the model in our pipeline for a typical scikit-learn RandomForestRegressor, everything is fine.

Any idea what is happening? It feels like we are missing something important, and this is driving us crazy.
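One way to narrow this down (a suggested diagnostic, not something from the thread) is to confirm that the model loaded on the other nodes is byte-for-byte the same as on the training node; if it is, the divergence has to come from the prediction-time environment rather than from serialization. A sketch, with a placeholder model path:

import hashlib
import lightgbm as lgb

# Load the text dump produced on the training node ("model.txt" is a placeholder)
booster = lgb.Booster(model_file="model.txt")

# Structural checks: these should match what the training node reports
print("num trees:", booster.num_trees())
print("num features:", booster.num_feature())

# A hash of the serialized model makes cross-machine comparison easy
model_str = booster.model_to_string()
print("model hash:", hashlib.sha256(model_str.encode("utf-8")).hexdigest())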

@guolinke
Collaborator

What is the difference in the prediction results?

@dcanones

For example, predictions on the same machine and environment the model was trained in:

  • [123.5, 143.2, 456.4] (reasonable predictions, close to the original values we try to predict)

Predictions when loading the model (joblib or txt) on another machine/environment, with LightGBM always installed via conda or pip:

  • [-4.0, 5.0, -9.0] (weird predictions, negative and integer-like with the .0)
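For gaps like this it can help to quantify the difference rather than eyeballing it; a minimal sketch, assuming the predictions from each environment have been exported to .npy files (the file names are hypothetical):

import numpy as np

pred_train_env = np.load("pred_training_machine.npy")
pred_deploy_env = np.load("pred_deployment_machine.npy")

diff = np.abs(pred_train_env - pred_deploy_env)
print("max abs diff:", diff.max())
print("mean abs diff:", diff.mean())
print("fraction of rows differing by > 1e-6:", np.mean(diff > 1e-6))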

@guolinke
Collaborator

With this large a gap, I don't think it is the same problem as in this issue.
Did you try to save and then load on the same machine/env?

@dcanones

Yes, on the same machine and in the same environment there is no problem. On the same machine but in a different environment (for example, deploying the model using MLflow) the results are different, and the same happens when saving the model and loading it on other machines...

The thing is, the models are supposed to be portable, at least when saved as a .txt file.
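For reference, the text-file round trip would look roughly like this (paths and variable names such as mdl and X_test are placeholders); the .txt model is plain text and, in principle, should load identically anywhere a compatible LightGBM version is installed:

import lightgbm as lgb

# On the training machine: export the underlying booster of the sklearn wrapper
mdl.booster_.save_model("model.txt")

# On the deployment machine: reload the booster from the text file and predict
booster = lgb.Booster(model_file="model.txt")
pred = booster.predict(X_test)  # probabilities for binary models, raw values for regression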

@guolinke
Collaborator

@dcanones could you try the CLI version on the same machine? I think it could be a bug related to the environment.
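For anyone trying that: the CLI is driven by a small config file. A minimal sketch of a prediction run (file names are placeholders, and the lightgbm binary is assumed to be on PATH), written here as a Python wrapper for convenience:

import subprocess

# Hypothetical paths; the data file must use the same feature columns/order as training
config = '''task = predict
data = test_data.csv
input_model = model.txt
output_result = predictions.txt
'''
with open("predict.conf", "w") as f:
    f.write(config)

# Equivalent to running: lightgbm config=predict.conf
subprocess.run(["lightgbm", "config=predict.conf"], check=True)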

@lock

lock bot commented Feb 28, 2020

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators May 5, 2020