Bug fix: small values of max_bin cause program to crash #2299

Merged
guolinke merged 6 commits into microsoft:master on Aug 16, 2019


btrotta
Collaborator

@btrotta btrotta commented Jul 31, 2019

Fixes a minor bug where setting max_bin < 4 can cause LightGBM to crash. The example below reproduces the bug. I think the problem is that the method FindMaxBinWithZeroAsOneBin in bin.cpp sometimes attempts to create more bins than max_bin allows.

import numpy as np
import lightgbm as lgb

np.random.seed(0)
y = np.random.choice([0, 1], 100)
X = np.zeros((100, 1))
X[:30, 0] = -1
X[30:60, 0] = 1
X[60:, 0] = 2
# X contains no nan in this first example; nan is introduced in the second example
params = {'objective': 'binary', 'seed': 0, 'min_data_in_leaf': 1, 'max_bin': 2}  # causes crash in training
#params = {'objective': 'binary', 'seed': 0, 'min_data_in_leaf': 1, 'max_bin': 3}  # works normally
lgb_x = lgb.Dataset(X, label=y)
est = lgb.train(params, lgb_x)

# another example, max_bin is 3 and X contains nan
X[0, 0] = np.nan
params = {'objective': 'binary', 'seed': 0, 'min_data_in_leaf': 1, 'max_bin': 3}  # causes crash
#params = {'objective': 'binary', 'seed': 0, 'min_data_in_leaf': 1, 'max_bin': 4}  # works
lgb_x = lgb.Dataset(X, label=y)
est = lgb.train(params, lgb_x)

@StrikerRUS StrikerRUS requested a review from guolinke July 31, 2019 13:40
@btrotta
Collaborator Author

btrotta commented Aug 1, 2019

I have changed an existing test to account for the new binning behavior. The method FindMaxBinWithZeroAsOneBin will now always create a zero bin if max_bin >= 3. (I assume this is the intended behavior, based on the method name; let me know if not.) This means the bin upper bounds must be -kZeroThreshold, kZeroThreshold, inf even if there are no negative values. As a result, the test I modified now produces fewer distinct predicted values.
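
To illustrate why an always-present zero bin can lower the number of distinct predictions, here is a rough sketch of how values fall into bins. It assumes a value goes into the first bin whose upper bound is at least that value, and it uses 1e-35 as a stand-in for kZeroThreshold. The helper name assign_bins and the sample values are hypothetical and are not taken from the modified test.

```python
from bisect import bisect_left
import math

K_ZERO_THRESHOLD = 1e-35  # assumed stand-in for kZeroThreshold


def assign_bins(values, upper_bounds):
    """Map each value to the index of the first bin whose upper bound is >= the value."""
    return [bisect_left(upper_bounds, v) for v in values]


# Hypothetical feature with no negative values.
values = [0.0, 1.0, 2.0]

# Bounds with a dedicated zero bin, as described for max_bin >= 3.
bounds = [-K_ZERO_THRESHOLD, K_ZERO_THRESHOLD, math.inf]

bins = assign_bins(values, bounds)
occupied = sorted(set(bins))
print(bins)      # bin index for each value
print(occupied)  # bins that actually receive data
```

In this sketch the negative bin stays empty, so only two of the three bins hold data, and the values 1.0 and 2.0 share a bin and cannot be separated by a split.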

Here is a description of the current behavior of FindMaxBinWithZeroAsOneBin when max_bin <= 3:

  • If max_bin = 1, returns [inf]
  • If max_bin = 2 and there are no negative data values, returns [kZeroThreshold, inf]
  • If max_bin = 2 and there is at least one negative data value, returns [-kZeroThreshold, inf]
  • If max_bin = 3, returns [-kZeroThreshold, kZeroThreshold, inf]
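
The cases listed above can be written as a small Python sketch. This is only an illustration of the described return values, not the C++ code in bin.cpp: the function name small_max_bin_bounds is hypothetical, 1e-35 is an assumed stand-in for kZeroThreshold, and "negative" is taken literally as value < 0.

```python
import math

K_ZERO_THRESHOLD = 1e-35  # assumed stand-in for kZeroThreshold


def small_max_bin_bounds(values, max_bin):
    """Return the bin upper bounds described above for max_bin in {1, 2, 3}."""
    if max_bin == 1:
        return [math.inf]
    if max_bin == 2:
        has_negative = any(v < 0 for v in values)
        first = -K_ZERO_THRESHOLD if has_negative else K_ZERO_THRESHOLD
        return [first, math.inf]
    if max_bin == 3:
        return [-K_ZERO_THRESHOLD, K_ZERO_THRESHOLD, math.inf]
    raise ValueError("this sketch only covers max_bin in {1, 2, 3}")


data = [-1.0, 0.0, 1.0, 2.0]
for max_bin in (1, 2, 3):
    bounds = small_max_bin_bounds(data, max_bin)
    # The property the bug report is about: never more bins than max_bin.
    assert len(bounds) <= max_bin
    print(max_bin, bounds)
```

The length check in the loop mirrors the failure described at the top of this pull request, where more bins were created than max_bin allows.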

@guolinke guolinke merged commit c421f89 into microsoft:master Aug 16, 2019
@lock lock bot locked as resolved and limited conversation to collaborators Mar 10, 2020