Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

remove unnecessary omp single that cause deadlock (fixes #6273) #6394

Merged
merged 3 commits into from
Apr 23, 2024

Conversation

morokosi
Copy link
Contributor

@morokosi morokosi commented Mar 30, 2024

fixes #6273

omp single clause has implicit barrier, which waits for all threads, including threads that did not execute omp single block.
Therefore, if omp single clause is used conditionally inside omp parallel, it will cause a deadlock (explanation below).

I think it is safe to remove omp single clause from omp_get_max_threads() (introuduced in #6226).

background

I digged into #6273.

import lightgbm as lgb
import numpy as np
import pandas as pd

X = np.random.randint(0, 5000, 10000)
X = pd.DataFrame({"x1": X})
y = np.random.rand(10000)

full_data = lgb.Dataset(X, y, categorical_feature=["x1"])
full_data.construct() # <-- hangs

I suspect a deadlock near OMP_NUM_THREADS() from the backtrace.

  thread #15
    frame #0: 0x000000018820106c libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x000000018823e5fc libsystem_pthread.dylib`_pthread_cond_wait + 1228
    frame #2: 0x000000016bd97ca0 libomp.dylib`void __kmp_suspend_64<false, true>(int, kmp_flag_64<false, true>*) + 264
    frame #3: 0x000000016bd818c0 libomp.dylib`kmp_flag_64<false, true>::wait(kmp_info*, int, void*) + 2012
    frame #4: 0x000000016bd7dcc8 libomp.dylib`__kmp_hyper_barrier_release(barrier_type, kmp_info*, int, int, int, void*) + 152
    frame #5: 0x000000016bd7c8c8 libomp.dylib`__kmp_barrier + 1276
    frame #6: 0x000000016bd52dd0 libomp.dylib`__kmpc_barrier + 340
    frame #7: 0x000000016c3a4db8 lib_lightgbm.so`OMP_NUM_THREADS + 112
    frame #8: 0x000000016c11a8f0 lib_lightgbm.so`LightGBM::ArrayArgs<int>::ArgMaxMT(std::__1::vector<int, std::__1::allocator<int>> const&) + 44
    frame #9: 0x000000016c151cdc lib_lightgbm.so`LightGBM::BinMapper::FindBin(double*, int, unsigned long, int, int, int, bool, LightGBM::BinType, bool, bool, std::__1::vector<double, std::__1::allocator<double>> const&) + 8412
    frame #10: 0x000000016c1cd4cc lib_lightgbm.so`.omp_outlined. + 792
    frame #11: 0x000000016bdaee5c libomp.dylib`__kmp_invoke_microtask + 156
    frame #12: 0x000000016bd6235c libomp.dylib`__kmp_invoke_task_func + 312
    frame #13: 0x000000016bd61500 libomp.dylib`__kmp_launch_thread + 400
    frame #14: 0x000000016bd9669c libomp.dylib`__kmp_launch_worker(void*) + 280
    frame #15: 0x000000018823e034 libsystem_pthread.dylib`_pthread_start + 136

In this case, LightGBM::DatasetLoader::ConstructFromSampleData has omp parallel and OMP_NUM_THREADS() (FindBin => ArgMax => ArgMaxMT => OMP_NUM_THREADS()) has omp single.

deadlock explanation code

#include <omp.h>
#include <stdio.h>

void fn() {
    #pragma omp single
    {
        puts("single");
        printf("omp_get_max_threads: %d\n", omp_get_max_threads());
        printf("omp_get_num_threads: %d\n", omp_get_num_threads());
    }
    // <-- implicit barrier (waits for all threads, including threads that did not execute omp single block)
    // deadlock here because other threads will not reach the barrier
}

int main() {
    printf("omp_get_max_threads: %d\n", omp_get_max_threads());
    printf("omp_get_num_threads: %d\n", omp_get_num_threads());

    #pragma omp parallel for schedule(static)
    for (int i = 0; i < 10; i++) {
        if (i == 1) {
            fn();
        }
    }
    puts("finish");
    return 0;
}

Copy link
Collaborator

@jameslamb jameslamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much for the EXCELLENT investigation and write-up! This fix makes sense to me.

To ensure we don't introduce a deadlock like this on this codepath again, could you please add the example Python code as a test here?

Right after this test:

def test_dataset_construction_overwrites_user_provided_metadata_fields():

Something like this:

def test_dataset_construction_with_high_cardinality_categorical_succeeds():
    pd = pytest.importorskip("pandas")
    X = pd.DataFrame({"x1": np.random.randint(0, 5_000, 10_000)})
    y = np.random.rand(10_000)
    dtrain = lgb.Dataset(X, y, categorical_feature=["x1"])
    dtrain.construct()
    assert ds.num_data() == 10_000
    assert ds.num_feature() == 1

@morokosi
Copy link
Contributor Author

@microsoft-github-policy-service agree

@jameslamb
Copy link
Collaborator

Thanks for the update! We'll review them soon.

In the future when you contribute to LightGBM (and we hope you will!), don't force-push here. It's not necessary, as we squash all commits into 1 when merging to master: https://github.com/microsoft/LightGBM/commits/master/.

Copy link
Collaborator

@borchero borchero left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, excellent fix!

Copy link
Collaborator

@jameslamb jameslamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks very much! I'd still like the opportunity to test this on a large multi-core machine to be sure it's working as expected (similar to the "How I tested this" steps in #6226). Will try to do that soon.

@jameslamb jameslamb self-requested a review April 23, 2024 02:42
Copy link
Collaborator

@jameslamb jameslamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested this tonight using an approach similar to #6226, with both the R and Python packages. Confirmed that the processing time scales as expected with higher values of environment variable OMP_NUM_THREADS. Also confirmed that the unit test you've added here gets deadlocked on master but runs quickly and successfully with the changes here 🎉

Thank you SO MUCH for taking the time to investigate and fix this, and for the great explanation in the PR description.

@jameslamb jameslamb merged commit 1871350 into microsoft:master Apr 23, 2024
42 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[python-package] Dataset construction hangs with high cardinality categorical features under 4.2.0/Pandas.
3 participants