
A couple of errors when building the ensemble #47

Closed
flippercy opened this issue Mar 19, 2021 · 6 comments

Comments
@flippercy

flippercy commented Mar 19, 2021

Hi:

Our team has explored the ensemble option in the fit function of AutoML and ran into a couple of errors:

  1. There is an error when the ensemble uses both the GLM learners (lrl1/lrl2) and other ML learners (e.g., lgbm/xgboost). For example:

from flaml import AutoML
from flaml.data import load_openml_dataset

X_train, X_test, y_train, y_test = load_openml_dataset(dataset_id=1169, data_dir='./')

automl = AutoML()
settings = {
    "time_budget": 40,
    "metric": 'roc_auc',
    "task": 'classification',
    "estimator_list": ['lrl1', 'lrl2', 'lgbm', 'xgboost'],
    "log_file_name": 'airlines_experiment.log',
}

automl.fit(X_train=X_train, y_train=y_train, ensemble=True, **settings)

[flaml.automl: 03-18 17:34:40] {1157} INFO - [('xgboost', <flaml.model.XGBoostSklearnEstimator object at 0x7f61f8659ed0>), ('lgbm', <flaml.model.LGBMEstimator object at 0x7f61f8687350>), ('lrl2', <flaml.model.LRL2Classifier object at 0x7f61f8687090>), ('lrl1', <flaml.model.LRL1Classifier object at 0x7f61f8654150>)]

RuntimeError: Cannot clone object <flaml.model.LRL2Classifier object at 0x7f84877a1c10>, as the constructor either does not set or modifies parameter penalty.

This is similar to the error we've discussed before.
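
For context, the RuntimeError above is scikit-learn's clone sanity check: building the stacked ensemble clones each estimator, and clone requires that __init__ store every constructor parameter exactly as passed. A minimal, FLAML-independent sketch of that convention (the class names are made up for illustration):

from sklearn.base import BaseEstimator as SkBaseEstimator, clone

class CloneSafe(SkBaseEstimator):
    def __init__(self, penalty='l2'):
        self.penalty = penalty              # stored exactly as passed: clone-safe

class CloneUnsafe(SkBaseEstimator):
    def __init__(self, penalty='l2'):
        self.penalty = {'value': penalty}   # re-wrapped in __init__: a different object

clone(CloneSafe())    # fine
clone(CloneUnsafe())  # RuntimeError: Cannot clone object ..., as the constructor
                      # either does not set or modifies parameter penalty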

  2. The other error is more confusing. We've created a few customized learners with monotone constraints and used them with AutoML. For example, below is the code for a monotonic XGBoost classifier and a monotonic LightGBM classifier, both using GBDT as the booster:

# num_cores, randomseed and monotone are defined earlier in our notebook.
from flaml import tune
from flaml.model import BaseEstimator
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier


class MyMonotonicXGBGBTreeClassifier(BaseEstimator):

    def __init__(self, task='binary:logistic', n_jobs=num_cores, **params):
        super().__init__(task, **params)
        self.estimator_class = XGBClassifier
        # convert to int for integer hyperparameters
        self.params = {
            'n_jobs': params.get('n_jobs', num_cores),
            'booster': params.get('booster', 'gbtree'),
            'learning_rate': params['learning_rate'],
            'gamma': params['gamma'],
            'max_depth': int(params['max_depth']),
            'min_child_weight': int(params['min_child_weight']),
            'subsample': params['subsample'],
            'colsample_bylevel': params['colsample_bylevel'],
            'n_estimators': int(params['n_estimators']),
            'reg_lambda': params['reg_lambda'],
            'reg_alpha': params['reg_alpha'],
            'random_state': params.get('random_state', randomseed),
            'monotone_constraints': params.get('monotone_constraints', monotone),
        }

    @classmethod
    def search_space(cls, data_size, task):
        space = {
            'max_depth': {'domain': tune.uniform(lower=4, upper=15), 'init_value': 8},
            'n_estimators': {'domain': tune.uniform(lower=50, upper=800), 'init_value': 200},
            'min_child_weight': {'domain': tune.uniform(lower=1, upper=1000), 'init_value': 100},
            'subsample': {'domain': tune.uniform(lower=0.7, upper=1), 'init_value': 0.7},
            'colsample_bylevel': {'domain': tune.uniform(lower=0.6, upper=1), 'init_value': 0.8},
            'learning_rate': {'domain': tune.loguniform(lower=0.001, upper=1), 'init_value': 0.1},
            'gamma': {'domain': tune.loguniform(lower=1e-12, upper=0.001), 'init_value': 1e-5},
            'reg_lambda': {'domain': tune.loguniform(lower=1e-12, upper=1), 'init_value': 1},
            'reg_alpha': {'domain': tune.loguniform(lower=1e-12, upper=1), 'init_value': 1e-12},
        }
        return space

class MyMonotonicLightGBMGBDTClassifier(BaseEstimator):

    def __init__(self, task='binary:logistic', n_jobs=num_cores, **params):
        super().__init__(task, **params)
        self.estimator_class = LGBMClassifier
        # convert to int for integer hyperparameters
        self.params = {
            'n_jobs': params.get('n_jobs', num_cores),
            'boosting_type': params.get('boosting_type', 'gbdt'),
            'learning_rate': params['learning_rate'],
            'min_split_gain': params['min_split_gain'],
            'max_depth': int(params['max_depth']),
            'min_data_in_leaf': int(params['min_data_in_leaf']),
            'min_sum_hessian_in_leaf': params['min_sum_hessian_in_leaf'],
            'subsample': params['subsample'],
            'colsample_bytree': params['colsample_bytree'],
            'n_estimators': int(params['n_estimators']),
            'subsample_freq': int(params['subsample_freq']),
            'reg_lambda': params['reg_lambda'],
            'reg_alpha': params['reg_alpha'],
            'random_state': params.get('random_state', randomseed),
            'monotone_constraints': params.get('monotone_constraints', monotone),
        }

    @classmethod
    def search_space(cls, data_size, task):
        space = {
            'max_depth': {'domain': tune.uniform(lower=4, upper=15), 'init_value': 8},
            'subsample_freq': {'domain': tune.uniform(lower=1, upper=10), 'init_value': 5},
            'n_estimators': {'domain': tune.uniform(lower=50, upper=800), 'init_value': 200},
            'min_data_in_leaf': {'domain': tune.uniform(lower=1, upper=1000), 'init_value': 100},
            'min_sum_hessian_in_leaf': {'domain': tune.loguniform(lower=1e-6, upper=0.1), 'init_value': 0.001},
            'subsample': {'domain': tune.uniform(lower=0.5, upper=1), 'init_value': 0.67},
            'colsample_bytree': {'domain': tune.uniform(lower=0.5, upper=1), 'init_value': 0.9},
            'learning_rate': {'domain': tune.loguniform(lower=0.001, upper=1), 'init_value': 0.1},
            'min_split_gain': {'domain': tune.loguniform(lower=1e-12, upper=0.001), 'init_value': 1e-5},
            'reg_lambda': {'domain': tune.loguniform(lower=1e-12, upper=1), 'init_value': 1},
            'reg_alpha': {'domain': tune.loguniform(lower=1e-12, upper=1), 'init_value': 1e-12},
        }
        return space

Without the ensemble, both worked well as individual learners. However, with ensemble=True the monotonic XGBoost learner still worked, but the process always crashed whenever the monotonic LightGBM learner was included in the estimator list. The Jupyter kernel simply died without any error message; the .out file generated at the backend contains this error:

[LightGBM] [Fatal] Check failed: static_cast<size_t>(num_total_features_) == io_config.monotone_constraints.size() at /__w/1/s/python-package/compile/src/io/dataset.cpp, line 314

What does this mean? Something seems to be wrong with monotone_constraints, yet the length of the constraint vector matches the number of variables.

This error can be replicated using the airlines data; to make it easier just let monotone=(0, 0, 0, 0, 0, 0, 0).

Appreciate your help.

@flippercy flippercy changed the title A couple of errors during building the ensemble A couple of errors when building the ensemble Mar 19, 2021
@sonichi sonichi linked a pull request Mar 19, 2021 that will close this issue
@sonichi
Copy link
Collaborator

sonichi commented Mar 19, 2021

Problem 1 will be fixed in #45
Problem 2 arises because, when training the final estimator in the stacked ensemble, the predictions from the base learners are added as new features. The number of monotone constraints then no longer equals the number of features seen by the final estimator. When the selected final estimator has no monotone constraints, this error won't appear. To make the monotone constraints work in the final estimator, the constraint vector needs to be resized based on X_train, e.g., in fit().
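
A rough illustration of the mismatch (this assumes the stacked ensemble is built on scikit-learn's StackingClassifier; the base learners here are placeholders, not the tuned ones):

from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from lightgbm import LGBMClassifier

n_original_features = X_train.shape[1]
stack = StackingClassifier(
    estimators=[('lgbm', LGBMClassifier()), ('lr', LogisticRegression())],
    # constraint vector sized to the original features only
    final_estimator=LGBMClassifier(monotone_constraints=[0] * n_original_features),
    passthrough=True,
)
# With passthrough=True the final estimator is trained on the original columns
# plus the base learners' prediction columns, so LightGBM's check
# num_total_features == monotone_constraints.size() fails.
stack.fit(X_train, y_train)  # should reproduce the size-mismatch error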

@flippercy
Author

flippercy commented Mar 19, 2021

Thank you, Chi. I see what you mean for problem 2. After checking the code, it turns out that the ensemble uses the estimator of the best base model as the final estimator, with passthrough=True, which causes the problem.

A few questions and thoughts:

  1. Is it possible to let users specify the final_estimator and passthrough for the ensemble, please? In practice, sometimes the only meta learner the business can accept is a GLM. Single boosting models are OK, but a boosting model of boosting models is just too complicated for the legal team and regulators. Regarding passthrough, there is no guarantee that one way will be better than the other, so perhaps it is better to let the users decide.

  2. Why does only the customized LightGBM cause this error? We've also built monotonic XGBoost, RF and CatBoost learners; none of them triggered the error when used as the final_estimator. Does only LightGBM check the number of features against the length of monotone_constraints?

  3. With the current setting, is there a way to solve this issue? I'm not sure how to specify new monotone_constraints for the LightGBM learner only when it is used as the final_estimator.

Appreciate your help!

@sonichi
Collaborator

sonichi commented Mar 19, 2021

  1. passthrough controls the original features. If you set passthrough=False, the original features are not passed, and only the base learners' predictions are used as features for the final estimator. That won't solve the constraint-mismatch problem.
  2. Possibly.
  3. Override the fit() function of the customized lgbm estimator by changing the constraints before calling self._fit(). Like:
def fit(self, X_train, y_train, budget=None, **kwargs):
    # self.params['monotone_constraints'] = ...
    self._fit(X_train, y_train, **kwargs)
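
For example, one way to resize the constraints (a sketch along those lines, not tested; it assumes monotone is the original constraint tuple defined in your notebook and pads with 0, i.e. unconstrained, for the extra prediction columns the stacker appends):

def fit(self, X_train, y_train, budget=None, **kwargs):
    n_features = X_train.shape[1]
    constraints = list(self.params.get('monotone_constraints', monotone))
    # pad with 0 (unconstrained) for the prediction columns added by stacking,
    # and truncate in case there are more constraints than columns
    constraints = (constraints + [0] * n_features)[:n_features]
    self.params['monotone_constraints'] = tuple(constraints)
    self._fit(X_train, y_train, **kwargs)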

@sonichi sonichi removed a link to a pull request Mar 19, 2021
@flippercy
Author

flippercy commented Mar 19, 2021

I understand. Passthrough is unrelated to this issue; just a side thought.

@sonichi
Collaborator

sonichi commented Mar 19, 2021

I understand. Passthrough is unrelated to this issue; just a side thought.

I created a separate issue #48 for this.

@sonichi
Collaborator

sonichi commented Apr 2, 2021

@flippercy I'm closing this issue now. If your problem is not solved feel free to reopen it.

@sonichi sonichi closed this as completed Apr 2, 2021