CUDA kernel failed : an illegal memory access was encountered #21

Open

HelinXu opened this issue Feb 27, 2022 · 7 comments

@HelinXu
HelinXu commented Feb 27, 2022

Hi! Thanks for the great work.
I'm running HAIS on my own dataset and have modified class_numpoint_mean_dict and class_radius_mean accordingly. The problem is that I get CUDA kernel failed : an illegal memory access was encountered when testing.
I traced the problem by inserting cudaGetLastError() calls in hierarchical_aggregation.cu and found that the code breaks around the following lines:

    assert(primary_num <= MAX_PRIMARY_NUM);
    concat_fragments_<<<primary_num, (int)1>>>(
        cuda_fragment_idxs, cuda_fragment_offsets,
        cuda_primary_idxs, cuda_primary_offsets,
        cuda_primary_absorb_fragment_idx, cuda_primary_absorb_fragment_cnt,
        cuda_concat_idxs, cuda_concat_point_num,
        primary_num);
    cudaDeviceSynchronize();

// got CUDA kernel failed : an illegal memory access was encountered here.
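
For completeness, this is the kind of checking I wrapped around each launch to localize the failure. It is only a minimal sketch; the CHECK_KERNEL name is my own helper and not something from the HAIS sources:

    // Minimal launch-checking helper (my own sketch, not part of hierarchical_aggregation.cu).
    // It reports both the launch-configuration error and the asynchronous execution error
    // that only surfaces after a device synchronization.
    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    #define CHECK_KERNEL()                                                           \
        do {                                                                         \
            cudaError_t e = cudaGetLastError();              /* launch-time error */ \
            if (cudaSuccess != e) {                                                  \
                fprintf(stderr, "launch failed : %s\n", cudaGetErrorString(e));      \
                exit(-1);                                                            \
            }                                                                        \
            e = cudaDeviceSynchronize();                     /* execution error   */ \
            if (cudaSuccess != e) {                                                  \
                fprintf(stderr, "CUDA kernel failed : %s\n", cudaGetErrorString(e)); \
                exit(-1);                                                            \
            }                                                                        \
        } while (0)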

Do you have any suggestions?

@outsidercsy
Member

It seems to be a memory overflow problem.
If the point clouds in your dataset are much denser or larger than ScanNet's, I suggest increasing the values of these macros.

#define MAX_PRIMARY_NUM 1024
#define MAX_PER_PRIMARY_ABSORB_FRAGMENT_NUM 1024
#define INFINITY_DIS_SQUARE 10000
#define MAX_PER_PRIMARY_ABSORB_POINT_NUM 8192
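
These macros bound the per-primary scratch buffers allocated in hierarchical_aggregation.cu. Roughly (a simplified sketch using the names from the snippets in this thread, not the exact code), the sizing and the overflow condition look like this:

    // Simplified sketch of the buffer sizing (see hierarchical_aggregation.cu for the real code).
    // Each primary instance owns a slice of MAX_PER_PRIMARY_ABSORB_FRAGMENT_NUM ints; a primary
    // that absorbs more fragments than that writes past its slice, which later surfaces as an
    // illegal memory access.
    int *cuda_primary_absorb_fragment_idx;
    cudaMalloc((void**)&cuda_primary_absorb_fragment_idx,
               primary_num * MAX_PER_PRIMARY_ABSORB_FRAGMENT_NUM * sizeof(int) + sizeof(int));

    // Conceptually, a per-primary write of the form
    //   cuda_primary_absorb_fragment_idx[p * MAX_PER_PRIMARY_ABSORB_FRAGMENT_NUM + cnt] = f;
    // goes out of bounds as soon as cnt >= MAX_PER_PRIMARY_ABSORB_FRAGMENT_NUM, so denser or
    // larger scenes need larger macro values.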

@HelinXu
Author

HelinXu commented Mar 20, 2022

Hi, sorry to bother you again.

Each point cloud in my dataset always consists of 20,000 points. I've tried your suggestion and both increased and decreased those macros. I found that with different macro values, the number of samples that pass before the failure actually changes! For example, with the original settings I can get 8 samples through the test script, while after increasing the values by 16 times I can only get 3 samples through. It breaks in either one of the two places below (both after the sync):

    // for each fragment, find its primary
    int *cuda_primary_absorb_fragment_idx; // array for saving the fragment idxs
    int *cuda_primary_absorb_fragment_cnt; // array for saving the fragment nums
    cudaMalloc((void**)&cuda_primary_absorb_fragment_idx, primary_num * MAX_PER_PRIMARY_ABSORB_FRAGMENT_NUM * sizeof(int) + sizeof(int));
    cudaMalloc((void**)&cuda_primary_absorb_fragment_cnt, primary_num * sizeof(int) + sizeof(int));
    printf("fragment_num = %d\n", fragment_num); // I've printed out fragment number, which is mostly 10~40, never 0 on my dataset.
    if (fragment_num != 0) {
        fragment_find_primary_<<<int(DIVUP(fragment_num, MAX_THREADS_PER_BLOCK)), (int)MAX_THREADS_PER_BLOCK>>>(
            primary_num, cuda_primary_offsets, cuda_primary_centers,
            fragment_num, cuda_fragment_offsets, cuda_fragment_centers,
            cuda_primary_absorb_fragment_idx, cuda_primary_absorb_fragment_cnt);
    }
    err  = cudaGetLastError();
    if (cudaSuccess != err) {
        fprintf(stderr, "before sync CUDA kernel failed : %s\n", cudaGetErrorString(err)); // This never fails in my case.
        exit(-1);
    }
    cudaDeviceSynchronize();
    err  = cudaGetLastError();
    if (cudaSuccess != err) {
        fprintf(stderr, " after sync CUDA kernel failed : %s\n", cudaGetErrorString(err)); // However, this fails!
        exit(-1);
    }
    // concatenate fragments belonging to the same primary
    int *cuda_concat_idxs;
    int *cuda_concat_point_num;
    cudaMalloc((void**)&cuda_concat_idxs, primary_num * MAX_PER_PRIMARY_ABSORB_POINT_NUM * 2 * sizeof(int) + sizeof(int));
    cudaMalloc((void**)&cuda_concat_point_num, primary_num *  sizeof(int) + sizeof(int));
    assert(primary_num <= MAX_PRIMARY_NUM);
    concat_fragments_<<<primary_num, (int)1>>>(
        cuda_fragment_idxs, cuda_fragment_offsets,
        cuda_primary_idxs, cuda_primary_offsets,
        cuda_primary_absorb_fragment_idx, cuda_primary_absorb_fragment_cnt,
        cuda_concat_idxs, cuda_concat_point_num,
        primary_num);
    err  = cudaGetLastError();
    if (cudaSuccess != err) {
        fprintf(stderr, "before sync 2 CUDA kernel failed : %s\n", cudaGetErrorString(err));
        exit(-1);
    }
    cudaDeviceSynchronize();
    err  = cudaGetLastError();
    if (cudaSuccess != err) {
        fprintf(stderr, "after sync 2 CUDA kernel failed : %s\n", cudaGetErrorString(err));
        exit(-1);
    }
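
One more observation: the check before the sync never fails, so the illegal access happens while the kernel is executing rather than at launch time. As an experiment, a device-side bound on the writes could at least turn the overflow into a visible, non-corrupting condition. This is a sketch only; the kernel body and variable names are my guesses, not the actual HAIS code:

    // Sketch of a bounded write for a fragment_find_primary_-style kernel
    // (variable names are guesses, not copied from HAIS). atomicAdd hands out one
    // slot per absorbed fragment; slots beyond the macro limit are dropped instead
    // of written out of bounds, so an overflow no longer corrupts adjacent buffers.
    int slot = atomicAdd(&cuda_primary_absorb_fragment_cnt[p], 1);
    if (slot < MAX_PER_PRIMARY_ABSORB_FRAGMENT_NUM) {
        cuda_primary_absorb_fragment_idx[p * MAX_PER_PRIMARY_ABSORB_FRAGMENT_NUM + slot] = fragment_id;
    }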

@Dragon-Vae

@HelinXu Hi, I have the same problem as you. How did you solve it?

@HelinXu
Author

HelinXu commented May 19, 2022

@Dragon-Vae Hi, I have moved on to a different project and have not fully understood this behavior yet. I'll let you know if I figure it out :)

@Dragon-Vae

@HelinXu OK, thank you for your reply.

@eamonn-zh
Contributor

@outsidercsy Hi, I also encountered an illegal memory access issue. Would #32 do the trick?

@hmax233

hmax233 commented Oct 10, 2022

@eamonn-zh I encountered the same problem, and I found that when I change the batch size to 4 instead of 1, there is no problem. So maybe it's not caused by memory overflow after all?
Have you figured this out yet?
