remove unnecessary omp single that cause deadlock (fixes #6273) #6394

morokosi · 2024-03-30T14:45:07Z

The omp single construct ends with an implicit barrier, which waits for all threads in the team, including threads that did not execute the omp single block.
Therefore, if omp single is reached by only some of the threads inside omp parallel (for example, behind a condition), those threads wait at the barrier forever and the program deadlocks (explanation below).

I think it is safe to remove omp single from around omp_get_max_threads() (introduced in #6226).

omp_get_max_threads() is thread-safe: https://www.openmp.org/spec-html/5.1/openmpse6.html#x27-260001.6
omp single does not change the value that omp_get_max_threads() returns, which is an upper bound on the number of threads that could form a new team (the threads participating in the execution of a parallel region), so wrapping the call in omp single gains nothing.

background

I dug into #6273.

import lightgbm as lgb
import numpy as np
import pandas as pd

X = np.random.randint(0, 5000, 10000)
X = pd.DataFrame({"x1": X})
y = np.random.rand(10000)

full_data = lgb.Dataset(X, y, categorical_feature=["x1"])
full_data.construct() # <-- hangs

From the backtrace, I suspect a deadlock near OMP_NUM_THREADS().

  thread #15
    frame #0: 0x000000018820106c libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x000000018823e5fc libsystem_pthread.dylib`_pthread_cond_wait + 1228
    frame #2: 0x000000016bd97ca0 libomp.dylib`void __kmp_suspend_64<false, true>(int, kmp_flag_64<false, true>*) + 264
    frame #3: 0x000000016bd818c0 libomp.dylib`kmp_flag_64<false, true>::wait(kmp_info*, int, void*) + 2012
    frame #4: 0x000000016bd7dcc8 libomp.dylib`__kmp_hyper_barrier_release(barrier_type, kmp_info*, int, int, int, void*) + 152
    frame #5: 0x000000016bd7c8c8 libomp.dylib`__kmp_barrier + 1276
    frame #6: 0x000000016bd52dd0 libomp.dylib`__kmpc_barrier + 340
    frame #7: 0x000000016c3a4db8 lib_lightgbm.so`OMP_NUM_THREADS + 112
    frame #8: 0x000000016c11a8f0 lib_lightgbm.so`LightGBM::ArrayArgs<int>::ArgMaxMT(std::__1::vector<int, std::__1::allocator<int>> const&) + 44
    frame #9: 0x000000016c151cdc lib_lightgbm.so`LightGBM::BinMapper::FindBin(double*, int, unsigned long, int, int, int, bool, LightGBM::BinType, bool, bool, std::__1::vector<double, std::__1::allocator<double>> const&) + 8412
    frame #10: 0x000000016c1cd4cc lib_lightgbm.so`.omp_outlined. + 792
    frame #11: 0x000000016bdaee5c libomp.dylib`__kmp_invoke_microtask + 156
    frame #12: 0x000000016bd6235c libomp.dylib`__kmp_invoke_task_func + 312
    frame #13: 0x000000016bd61500 libomp.dylib`__kmp_launch_thread + 400
    frame #14: 0x000000016bd9669c libomp.dylib`__kmp_launch_worker(void*) + 280
    frame #15: 0x000000018823e034 libsystem_pthread.dylib`_pthread_start + 136

In this case, LightGBM::DatasetLoader::ConstructFromSampleData contains omp parallel, and OMP_NUM_THREADS(), which is reached from inside that parallel region (FindBin => ArgMax => ArgMaxMT => OMP_NUM_THREADS()), contains omp single.

deadlock explanation code

#include <omp.h>
#include <stdio.h>

void fn() {
    #pragma omp single
    {
        puts("single");
        printf("omp_get_max_threads: %d\n", omp_get_max_threads());
        printf("omp_get_num_threads: %d\n", omp_get_num_threads());
    }
    // <-- implicit barrier (waits for all threads, including threads that did not execute omp single block)
    // deadlock here because other threads will not reach the barrier
}

int main() {
    printf("omp_get_max_threads: %d\n", omp_get_max_threads());
    printf("omp_get_num_threads: %d\n", omp_get_num_threads());

    #pragma omp parallel for schedule(static)
    for (int i = 0; i < 10; i++) {
        if (i == 1) {
            fn();
        }
    }
    puts("finish");
    return 0;
}
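The same failure mode can be shown without OpenMP. The following is only an analogy, sketched in Python with threading.Barrier standing in for the implicit barrier of omp single; NUM_THREADS and run_team are hypothetical names, and the timeout exists only so the sketch terminates instead of hanging the way the real program does.

```python
import threading

NUM_THREADS = 4  # hypothetical team size, stands in for the OpenMP team


def run_team(callers, timeout=0.5):
    """Start NUM_THREADS workers; only the first `callers` reach the barrier.

    Returns how many workers gave up waiting at the barrier.
    """
    # Like the barrier at the end of omp single, this one is released
    # only when every thread in the team has arrived.
    barrier = threading.Barrier(NUM_THREADS)
    failures = []
    lock = threading.Lock()

    def worker(i):
        if i < callers:  # conditional call, like `if (i == 1) fn();`
            try:
                barrier.wait(timeout=timeout)
            except threading.BrokenBarrierError:
                # Without the timeout this thread would wait forever.
                with lock:
                    failures.append(i)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(NUM_THREADS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(failures)


if __name__ == "__main__":
    print("all threads reach the barrier:", run_team(NUM_THREADS), "stuck")
    print("one thread reaches the barrier:", run_team(1), "stuck")
```

When every worker reaches the barrier, nobody is stuck. When only one does, it can never be released, which is what happens to the thread that enters fn() above.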

jameslamb

Thank you very much for the EXCELLENT investigation and write-up! This fix makes sense to me.

To ensure we don't introduce a deadlock like this on this codepath again, could you please add the example Python code as a test here?

Right after this test:

LightGBM/tests/python_package_test/test_basic.py

Line 507 in 28536a0

def test_dataset_construction_overwrites_user_provided_metadata_fields():

Something like this:

def test_dataset_construction_with_high_cardinality_categorical_succeeds():
    pd = pytest.importorskip("pandas")
    X = pd.DataFrame({"x1": np.random.randint(0, 5_000, 10_000)})
    y = np.random.rand(10_000)
    dtrain = lgb.Dataset(X, y, categorical_feature=["x1"])
    dtrain.construct()
    assert dtrain.num_data() == 10_000
    assert dtrain.num_feature() == 1
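One caveat with a test like this: if the deadlock comes back, the test hangs instead of failing. A possible way around that, sketched here as an assumption and not something this PR does, is to run the reproducer in a child process with a timeout. The helper name finishes_within is hypothetical; subprocess.run and its timeout argument are from the Python standard library.

```python
import subprocess
import sys


def finishes_within(code, timeout_s):
    """Run `code` in a fresh Python interpreter.

    Returns True if it exits with status 0 before `timeout_s` seconds,
    and False if it fails or is still running when the timeout expires.
    """
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            timeout=timeout_s,  # the child is killed when this expires
            capture_output=True,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0


if __name__ == "__main__":
    # A snippet that sleeps stands in for a deadlocked Dataset.construct().
    print(finishes_within("print('ok')", 30))
    print(finishes_within("import time; time.sleep(60)", 1))
```

The reproducer from the PR description could be passed as the code string, so that a deadlock shows up as a False return value rather than a stuck CI job.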

morokosi · 2024-03-31T05:21:15Z

@microsoft-github-policy-service agree

jameslamb · 2024-04-11T14:38:31Z

Thanks for the update! We'll review them soon.

In the future when you contribute to LightGBM (and we hope you will!), don't force-push here. It's not necessary, as we squash all commits into 1 when merging to master: https://github.com/microsoft/LightGBM/commits/master/.

borchero

Nice, excellent fix!

jameslamb

Thanks very much! I'd still like the opportunity to test this on a large multi-core machine to be sure it's working as expected (similar to the "How I tested this" steps in #6226). Will try to do that soon.

jameslamb

I tested this tonight using an approach similar to #6226, with both the R and Python packages. Confirmed that the processing time scales as expected with higher values of environment variable OMP_NUM_THREADS. Also confirmed that the unit test you've added here gets deadlocked on master but runs quickly and successfully with the changes here 🎉

Thank you SO MUCH for taking the time to investigate and fix this, and for the great explanation in the PR description.

morokosi requested review from guolinke, jameslamb, shiyu1994, jmoralez and borchero as code owners March 30, 2024 14:45

jameslamb added the fix label Mar 31, 2024

jameslamb mentioned this pull request Mar 31, 2024

[ci] Azure Mariner CI jobs regularly failing: "File not found: 'docker'" #6316

Closed

jameslamb requested changes Mar 31, 2024

morokosi force-pushed the remove-omp-single branch 2 times, most recently from 75ba6eb to 112f757 Compare March 31, 2024 05:23

morokosi requested a review from jameslamb March 31, 2024 07:46

This was referenced Apr 1, 2024

[python-package] LGBM hangs with high number of categories #6400

Closed

[python-package] Dataset construction hangs with high cardinality categorical features under 4.2.0/Pandas. #6273

Closed

morokosi added 2 commits April 11, 2024 23:12

remove unnecessary omp single

b714ddb

add regression test for microsoft#6273

0749516

morokosi force-pushed the remove-omp-single branch from 112f757 to 0749516 Compare April 11, 2024 14:12

jameslamb added the awaiting review label Apr 11, 2024

Merge branch 'master' into remove-omp-single

ce6a077

borchero approved these changes Apr 22, 2024

jameslamb requested changes Apr 22, 2024

jameslamb self-requested a review April 23, 2024 02:42

jameslamb approved these changes Apr 23, 2024

jameslamb removed the awaiting review label Apr 23, 2024

jameslamb merged commit 1871350 into microsoft:master Apr 23, 2024
42 checks passed
