
lgb.cv is constantly crashing in R #2357

Closed
deepdalytics opened this issue Aug 26, 2019 · 8 comments · Fixed by #2400


@deepdalytics

Unfortunately, my R session always aborts after a few seconds when running lgb.cv. I call lgb.cv several times since it's part of a scoring function I use for Bayesian Optimization to find the optimal hyperparameters. However, the error does seem to come from lightgbm itself.

In RStudio, I get the following error:

[LightGBM] [Fatal] Check failed: cache_size >= 2 at /private/var/folders/6q/.../T/RtmpGrFClE/R.INSTALL10ba5306a71e/lightgbm/src/src/treelearner/feature_histogram.hpp, line 676.

If you have any idea what this is supposed to mean and how I could solve the issue, I would highly appreciate your help.

@guolinke
Collaborator

It seems there is not enough memory.

@jameslamb
Collaborator

@guolinke could you elaborate? I looked at the code around https://github.com/microsoft/LightGBM/blob/master/src/treelearner/feature_histogram.hpp#L676 but can't figure out how it's possible to violate that check.

As a reminder (it's been almost 3 years since you wrote this code in #86, according to the blame 😂), it has this comment

in class HistogramPool

  /*!
  * \brief Reset pool size
  * \param cache_size Max cache size
  * \param total_size Total size will be used
  */
  void Reset(int cache_size, int total_size) {
    cache_size_ = cache_size;
    // at least need 2 bucket to store smaller leaf and larger leaf
    CHECK(cache_size_ >= 2);

@guolinke
Collaborator

Refer to:

void SerialTreeLearner::Init(const Dataset* train_data, bool is_constant_hessian) {
  train_data_ = train_data;
  num_data_ = train_data_->num_data();
  num_features_ = train_data_->num_features();
  is_constant_hessian_ = is_constant_hessian;
  int max_cache_size = 0;
  // Get the max size of pool
  if (config_->histogram_pool_size <= 0) {
    max_cache_size = config_->num_leaves;
  } else {
    size_t total_histogram_size = 0;
    for (int i = 0; i < train_data_->num_features(); ++i) {
      total_histogram_size += sizeof(HistogramBinEntry) * train_data_->FeatureNumBin(i);
    }
    max_cache_size = static_cast<int>(config_->histogram_pool_size * 1024 * 1024 / total_histogram_size);
  }
  // at least need 2 leaves
  max_cache_size = std::max(2, max_cache_size);
  max_cache_size = std::min(max_cache_size, config_->num_leaves);
  histogram_pool_.DynamicChangeSize(train_data_, config_, max_cache_size, config_->num_leaves);
  // push split information for all leaves
  best_split_per_leaf_.resize(config_->num_leaves);
  splits_per_leaf_.resize(config_->num_leaves * train_data_->num_features());
I think it should not fail (unless the number of leaves = 1)

@guolinke
Collaborator

but num_leaves should be > 1:

GetInt(params, "num_leaves", &num_leaves);
CHECK(num_leaves > 1);

@chris-smith-zocdoc
Contributor

I was able to reproduce this. The bug is triggered by the max_depth parameter at certain values, and they follow a predictable pattern: failures occur in pairs at the end of every 32-value block, indicating a bit-alignment issue?

Failed on 30 (0000001e)
Failed on 31 (0000001f)
Failed on 62 (0000003e)
Failed on 63 (0000003f)
Failed on 94 (0000005e)
Failed on 95 (0000005f)
Failed on 126 (0000007e)
Failed on 127 (0000007f)
Failed on 158 (0000009e)
Failed on 159 (0000009f)
Failed on 190 (000000be)
Failed on 191 (000000bf)
Failed on 222 (000000de)
Failed on 223 (000000df)
Failed on 254 (000000fe)
Failed on 255 (000000ff)
Failed on 286 (0000011e)
Failed on 287 (0000011f)
Failed on 318 (0000013e)
Failed on 319 (0000013f)
Failed on 350 (0000015e)
Failed on 351 (0000015f)
...
Failed on 927 (0000039f)
Failed on 958 (000003be)
Failed on 959 (000003bf)
Failed on 990 (000003de)
Failed on 991 (000003df)

Here is the Java program I used to reproduce it:

import com.microsoft.ml.lightgbm.*;


public class issue_2357 {

    private static void validate(int result, String component) throws RuntimeException {
        if (result == -1) {
            throw new RuntimeException(component + " call failed in LightGBM with error: " + lightgbmlib.LGBM_GetLastError());
        }
    }

    public static void main(String[] args) throws Exception {

        try {
            String osName = System.getProperty("os.name").toLowerCase();
            if (osName.startsWith("mac os x")) {

                String prefix = System.getProperty("user.home") + "/src/LightGBM/";
                System.load(prefix + "lib_lightgbm.dylib");
                System.load(prefix + "lib_lightgbm_swig.jnilib");
            } else {
                System.load("/src/LightGBM/lib_lightgbm.so");
                System.load("/src/LightGBM/lib_lightgbm_swig.so");

            }
        } catch (UnsatisfiedLinkError e) {
            System.err.println(e.getMessage());
            e.printStackTrace();
            return;
        }

        int numRow = 1000;
        int numCols = 79;

        System.out.println("allocating data");
        double[][] data = new double[numRow][numCols];
        for (int i = 0; i < data.length; i++) {
            for (int j = 0; j < data[i].length; j++) {
                if (Math.random() < .2) {
                    data[i][j] = Math.random();
                }
            }
        }

        System.out.println("generating dataset");
        SWIGTYPE_p_p_void dataset = generateDenseDataset(numRow, data);
        SWIGTYPE_p_void dataset_handle = lightgbmlib.voidpp_value(dataset);


        for (int i = 0; i < 1000; i++) {
            try {
                System.out.println("CREATE BOOSTER");
                SWIGTYPE_p_p_void boosterOutPtr = lightgbmlib.voidpp_handle();

                validate(lightgbmlib.LGBM_BoosterCreate(
                        dataset_handle,
                        "max_depth=" + i,
                        boosterOutPtr),
                        "Booster LGBM_BoosterCreate");

            } catch (Exception e) {
                System.err.println("Failed on " + i + " (" + String.format("%08x", i) + ")");
            }
        }


        System.out.println("Done");
    }

    private static SWIGTYPE_p_double generateData(int numRows, double[][] rowsAsDoubleArray) {
        int numCols = rowsAsDoubleArray[0].length;
        SWIGTYPE_p_double data = lightgbmlib.new_doubleArray(numCols * numRows);

        for (int i = 0; i < numRows; i++) {
            for (int j = 0; j < rowsAsDoubleArray[i].length; j++) {
                lightgbmlib.doubleArray_setitem(data, i * numCols + j, rowsAsDoubleArray[i][j]);
            }
        }

        return data;
    }

    private static SWIGTYPE_p_p_void generateDenseDataset(int numRows, double[][] rowsAsDoubleArray) throws RuntimeException {
        int numCols = 79;
        int isRowMajor = 1;

        SWIGTYPE_p_p_void datasetOutPtr = lightgbmlib.voidpp_handle();
        String datasetParams = "max_bin=255";
        int data64bitType = lightgbmlibConstants.C_API_DTYPE_FLOAT64;

        SWIGTYPE_p_double data = generateData(numRows, rowsAsDoubleArray);

        validate(lightgbmlib.LGBM_DatasetCreateFromMat(
                lightgbmlib.double_to_voidp_ptr(data),
                data64bitType,
                numRows,
                numCols,
                isRowMajor, datasetParams, null, datasetOutPtr),
                "Dataset create");

        lightgbmlib.delete_doubleArray(data);

        return datasetOutPtr;
    }
}

@chris-smith-zocdoc
Contributor

There are two integer overflows in config.cpp

The static_cast is overflowing because it's casting a double whose value is > __INT32_MAX__

int full_num_leaves = static_cast<int>(std::pow(2, max_depth));

And the bitshift is wrapping around the integer multiple times for max_depth > 29

num_leaves = std::min(num_leaves, 2 << max_depth);

For the first one, we can leave full_num_leaves as a double.

For the second one, we might need to check if an overflow would be caused

if (max_depth <= 29) {
   num_leaves = std::min(num_leaves, 2 << max_depth);
}

There is also an inconsistency in the num_leaves calculation. I'm not sure if this is intentional or not

// 2 ^ max_depth
std::pow(2, max_depth) 

// 2 ^ (max_depth + 1)
2 << max_depth

@guolinke
Collaborator

Thanks @chris-smith-zocdoc.
I think an extremely large num_leaves/max_depth is not reasonable.
How about giving them upper bounds?
I think 2 << 16 is enough for a single tree.
ping @StrikerRUS

@chris-smith-zocdoc
Contributor

@guolinke Where do you think we should add validation? Should I add it to this function?

const int max_leaves =  2 << 16;
if (num_leaves > max_leaves) {
  Log::Fatal("num_leaves (%d) is larger than maximum allowed leaves (%d)", num_leaves, max_leaves);
}

The regression was introduced in #2216 to fix #2215

I can submit a PR if you would like.

chris-smith-zocdoc added a commit to Zocdoc/LightGBM that referenced this issue Sep 11, 2019
guolinke pushed a commit that referenced this issue Sep 27, 2019
* Fix integer overflow #2357

* Use 2 spaces not 4

* Move constant the config.h
Move check outside the max_depth check

* Move the max leaves check to config.h

* Remove unnecessary check
@lock lock bot locked as resolved and limited conversation to collaborators Mar 10, 2020