
A couple of errors when building the ensemble #47

Closed
flippercy opened this issue Mar 19, 2021 · 6 comments

Comments
@flippercy

flippercy commented Mar 19, 2021

Hi:

Our team has explored the ensemble option in the fit function of AutoML and ran into a couple of errors:

  1. There is an error when the ensemble uses both the GLM learners (lrl1/lrl2) and other ML learners (e.g., lgbm/xgboost). For example:

from flaml import AutoML
from flaml.data import load_openml_dataset

X_train, X_test, y_train, y_test = load_openml_dataset(dataset_id=1169, data_dir='./')

automl = AutoML()
settings = {
    "time_budget": 40,
    "metric": 'roc_auc',
    "task": 'classification',
    "estimator_list": ['lrl1', 'lrl2', 'lgbm', 'xgboost'],
    "log_file_name": 'airlines_experiment.log',
}

automl.fit(X_train=X_train, y_train=y_train, ensemble=True, **settings)

[flaml.automl: 03-18 17:34:40] {1157} INFO - [('xgboost', <flaml.model.XGBoostSklearnEstimator object at 0x7f61f8659ed0>), ('lgbm', <flaml.model.LGBMEstimator object at 0x7f61f8687350>), ('lrl2', <flaml.model.LRL2Classifier object at 0x7f61f8687090>), ('lrl1', <flaml.model.LRL1Classifier object at 0x7f61f8654150>)]

RuntimeError: Cannot clone object <flaml.model.LRL2Classifier object at 0x7f84877a1c10>, as the constructor either does not set or modifies parameter penalty.

This is similar to the error we've discussed before.
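
For context, the RuntimeError above is scikit-learn's clone sanity check: building the stacked ensemble clones each estimator, and clone requires that __init__ store every constructor parameter exactly as passed. A minimal, FLAML-independent sketch of that convention (the class names are made up for illustration):

from sklearn.base import BaseEstimator as SkBaseEstimator, clone

class CloneSafe(SkBaseEstimator):
    def __init__(self, penalty='l2'):
        self.penalty = penalty              # stored exactly as passed: clone-safe

class CloneUnsafe(SkBaseEstimator):
    def __init__(self, penalty='l2'):
        self.penalty = {'value': penalty}   # re-wrapped in __init__: a different object

clone(CloneSafe())    # fine
clone(CloneUnsafe())  # RuntimeError: Cannot clone object ..., as the constructor
                      # either does not set or modifies parameter penalty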

  2. The other error is more confusing. We've created a few customized learners with monotone constraints and used them with AutoML. For example, below is the code for a monotonic XGBoost classifier and a monotonic LightGBM classifier, both using GBDT as the booster:

# num_cores, randomseed and monotone are defined earlier in our notebook.
from flaml import tune
from flaml.model import BaseEstimator
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier


class MyMonotonicXGBGBTreeClassifier(BaseEstimator):

    def __init__(self, task='binary:logistic', n_jobs=num_cores, **params):
        super().__init__(task, **params)
        self.estimator_class = XGBClassifier
        # convert to int for integer hyperparameters
        self.params = {
            'n_jobs': params.get('n_jobs', num_cores),
            'booster': params.get('booster', 'gbtree'),
            'learning_rate': params['learning_rate'],
            'gamma': params['gamma'],
            'max_depth': int(params['max_depth']),
            'min_child_weight': int(params['min_child_weight']),
            'subsample': params['subsample'],
            'colsample_bylevel': params['colsample_bylevel'],
            'n_estimators': int(params['n_estimators']),
            'reg_lambda': params['reg_lambda'],
            'reg_alpha': params['reg_alpha'],
            'random_state': params.get('random_state', randomseed),
            'monotone_constraints': params.get('monotone_constraints', monotone),
        }

    @classmethod
    def search_space(cls, data_size, task):
        space = {
            'max_depth': {'domain': tune.uniform(lower=4, upper=15), 'init_value': 8},
            'n_estimators': {'domain': tune.uniform(lower=50, upper=800), 'init_value': 200},
            'min_child_weight': {'domain': tune.uniform(lower=1, upper=1000), 'init_value': 100},
            'subsample': {'domain': tune.uniform(lower=0.7, upper=1), 'init_value': 0.7},
            'colsample_bylevel': {'domain': tune.uniform(lower=0.6, upper=1), 'init_value': 0.8},
            'learning_rate': {'domain': tune.loguniform(lower=0.001, upper=1), 'init_value': 0.1},
            'gamma': {'domain': tune.loguniform(lower=1e-12, upper=0.001), 'init_value': 1e-5},
            'reg_lambda': {'domain': tune.loguniform(lower=1e-12, upper=1), 'init_value': 1},
            'reg_alpha': {'domain': tune.loguniform(lower=1e-12, upper=1), 'init_value': 1e-12},
        }
        return space

class MyMonotonicLightGBMGBDTClassifier(BaseEstimator):

    def __init__(self, task='binary:logistic', n_jobs=num_cores, **params):
        super().__init__(task, **params)
        self.estimator_class = LGBMClassifier
        # convert to int for integer hyperparameters
        self.params = {
            'n_jobs': params.get('n_jobs', num_cores),
            'boosting_type': params.get('boosting_type', 'gbdt'),
            'learning_rate': params['learning_rate'],
            'min_split_gain': params['min_split_gain'],
            'max_depth': int(params['max_depth']),
            'min_data_in_leaf': int(params['min_data_in_leaf']),
            'min_sum_hessian_in_leaf': params['min_sum_hessian_in_leaf'],
            'subsample': params['subsample'],
            'colsample_bytree': params['colsample_bytree'],
            'n_estimators': int(params['n_estimators']),
            'subsample_freq': int(params['subsample_freq']),
            'reg_lambda': params['reg_lambda'],
            'reg_alpha': params['reg_alpha'],
            'random_state': params.get('random_state', randomseed),
            'monotone_constraints': params.get('monotone_constraints', monotone),
        }

    @classmethod
    def search_space(cls, data_size, task):
        space = {
            'max_depth': {'domain': tune.uniform(lower=4, upper=15), 'init_value': 8},
            'subsample_freq': {'domain': tune.uniform(lower=1, upper=10), 'init_value': 5},
            'n_estimators': {'domain': tune.uniform(lower=50, upper=800), 'init_value': 200},
            'min_data_in_leaf': {'domain': tune.uniform(lower=1, upper=1000), 'init_value': 100},
            'min_sum_hessian_in_leaf': {'domain': tune.loguniform(lower=1e-6, upper=0.1), 'init_value': 0.001},
            'subsample': {'domain': tune.uniform(lower=0.5, upper=1), 'init_value': 0.67},
            'colsample_bytree': {'domain': tune.uniform(lower=0.5, upper=1), 'init_value': 0.9},
            'learning_rate': {'domain': tune.loguniform(lower=0.001, upper=1), 'init_value': 0.1},
            'min_split_gain': {'domain': tune.loguniform(lower=1e-12, upper=0.001), 'init_value': 1e-5},
            'reg_lambda': {'domain': tune.loguniform(lower=1e-12, upper=1), 'init_value': 1},
            'reg_alpha': {'domain': tune.loguniform(lower=1e-12, upper=1), 'init_value': 1e-12},
        }
        return space

Without the ensemble, both worked well as individual learners. However, with ensemble=True the monotonic XGBoost learner still worked, but the process always crashed whenever the monotonic LightGBM learner was included in the estimator list. The Jupyter kernel simply died without any error message; the .out file generated at the backend contains this error:

[LightGBM] [Fatal] Check failed: static_cast<size_t>(num_total_features_) == io_config.monotone_constraints.size() at /__w/1/s/python-package/compile/src/io/dataset.cpp, line 314

What does this mean? Something seems to be wrong with monotone_constraints, yet the length of the constraint vector matches the number of variables.

This error can be replicated using the airlines data; to make it easier just let monotone=(0, 0, 0, 0, 0, 0, 0).

Appreciate your help.

@flippercy flippercy changed the title A couple of errors during building the ensemble A couple of errors when building the ensemble Mar 19, 2021
@sonichi sonichi linked a pull request Mar 19, 2021 that will close this issue
@sonichi
Copy link
Collaborator

sonichi commented Mar 19, 2021

Problem 1 will be fixed in #45
Problem 2 arises because, when training the final estimator in the stacked ensemble, the predictions from the base learners are added as new features. The number of monotone constraints then no longer equals the number of features seen by the final estimator. When the selected final estimator has no monotone constraints, this error won't appear. To make the monotone constraints work in the final estimator, the constraint vector needs to be resized based on X_train, e.g., in fit().
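
A rough illustration of the mismatch (this assumes the stacked ensemble is built on scikit-learn's StackingClassifier; the base learners here are placeholders, not the tuned ones):

from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from lightgbm import LGBMClassifier

n_original_features = X_train.shape[1]
stack = StackingClassifier(
    estimators=[('lgbm', LGBMClassifier()), ('lr', LogisticRegression())],
    # constraint vector sized to the original features only
    final_estimator=LGBMClassifier(monotone_constraints=[0] * n_original_features),
    passthrough=True,
)
# With passthrough=True the final estimator is trained on the original columns
# plus the base learners' prediction columns, so LightGBM's check
# num_total_features == monotone_constraints.size() fails.
stack.fit(X_train, y_train)  # should reproduce the size-mismatch error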

@flippercy
Author

flippercy commented Mar 19, 2021

Thank you, Chi. I see what you mean for problem 2. After checking the code, it turns out that the ensemble uses the estimator of the best base model as the final estimator, with passthrough=True, which causes the problem.

A few questions and thoughts:

  1. Is it possible to let users specify the final_estimator and passthrough for the ensemble, please? In practice, sometimes the only meta learner the business can accept is a GLM. Single boosting models are OK, but a boosting model of boosting models is just too complicated for the legal team and regulators. Regarding passthrough, there is no guarantee that one way will be better than the other, so perhaps it is better to let the users decide.

  2. Why does only the customized LightGBM cause this error? We've also built monotonic XGBoost, RF and CatBoost learners; none of them triggered the error when used as the final_estimator. Does only LightGBM check the number of features against the length of monotone_constraints?

  3. With the current setting, is there a way to solve this issue? I'm not sure how to specify new monotone_constraints for the LightGBM learner only when it is used as the final_estimator.

Appreciate your help!

@sonichi
Collaborator

sonichi commented Mar 19, 2021

  1. passthrough controls the original features. If you set passthrough=False, the original features are not passed, and only the base learners' predictions are used as features for the final estimator. That won't solve the constraint-mismatch problem.
  2. Possibly.
  3. Override the fit() function of the customized lgbm estimator by changing the constraints before calling self._fit(). Like:
def fit(self, X_train, y_train, budget=None, **kwargs):
    # self.params['monotone_constraints'] = ...
    self._fit(X_train, y_train, **kwargs)
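
For example, one way to resize the constraints (a sketch along those lines, not tested; it assumes monotone is the original constraint tuple defined in your notebook and pads with 0, i.e. unconstrained, for the extra prediction columns the stacker appends):

def fit(self, X_train, y_train, budget=None, **kwargs):
    n_features = X_train.shape[1]
    constraints = list(self.params.get('monotone_constraints', monotone))
    # pad with 0 (unconstrained) for the prediction columns added by stacking,
    # and truncate in case there are more constraints than columns
    constraints = (constraints + [0] * n_features)[:n_features]
    self.params['monotone_constraints'] = tuple(constraints)
    self._fit(X_train, y_train, **kwargs)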

@sonichi sonichi removed a link to a pull request Mar 19, 2021
@flippercy
Author

flippercy commented Mar 19, 2021

I understand. Passthrough is unrelated to this issue; just a side thought.

@sonichi
Collaborator

sonichi commented Mar 19, 2021

I understand. Passthrough is unrelated to this issue; just a side thought.

I created a separate issue #48 for this.

@sonichi
Collaborator

sonichi commented Apr 2, 2021

@flippercy I'm closing this issue now. If your problem is not solved feel free to reopen it.

@sonichi sonichi closed this as completed Apr 2, 2021