Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Kokkos shared memory on Cuda uses a lot of registers #31

Closed
mhoemmen opened this issue Jun 23, 2015 · 10 comments
Closed

Kokkos shared memory on Cuda uses a lot of registers #31

mhoemmen opened this issue Jun 23, 2015 · 10 comments

Comments

@mhoemmen
Copy link
Contributor

Our intern Ryan Eberhardt has been experimenting with shared memory on CUDA. He found out that using Kokkos to access shared memory uses a lot more registers than not. I suspect that this has to do with the error checking and printf error message that I added to Kokkos::ScratchMemorySpace (see kokkos/core/src/Kokkos_ScratchSpace.hpp) a while back.

The first example code uses Kokkos to access shared memory. CUDA says: ptxas info : Used 30 registers, 336 bytes cmem[0]

#include <Kokkos_Core.hpp>

using Kokkos::TeamPolicy;
using Kokkos::parallel_for;
typedef TeamPolicy<>::member_type member_type;

struct shared_mem_kernel {
    size_t team_shmem_size(int team_size) const {
        return team_size*sizeof(double);
    }
    KOKKOS_INLINE_FUNCTION
    void operator() (member_type team_member) const {
        int alloc_size = team_member.team_size()*sizeof(double);
        double *shared = (double*) team_member.team_shmem().get_shmem(alloc_size);
        shared[team_member.team_rank()] = 0;
    }
};

int main() {
    Kokkos::initialize();
    shared_mem_kernel kernel;
    int team_size = Kokkos::TeamPolicy<>::team_size_recommended(kernel);
    Kokkos::TeamPolicy<> policy(1, team_size);
    Kokkos::parallel_for(policy, kernel);
    Kokkos::fence();
    Kokkos::finalize();
}

The second example code does NOT use shared memory, but still uses Kokkos. CUDA says: ptxas info : Used 4 registers, 368 bytes cmem[0]

#include <Kokkos_Core.hpp>

using Kokkos::TeamPolicy;
using Kokkos::parallel_for;
typedef Kokkos::View<double*> vec_view_t;
typedef vec_view_t::HostMirror host_vec_view_t;
typedef TeamPolicy<>::member_type member_type;

struct global_mem_kernel {
    vec_view_t vec;
    global_mem_kernel(vec_view_t vec_): vec(vec_) {}
    KOKKOS_INLINE_FUNCTION
    void operator() (member_type team_member) const {
        int alloc_size = team_member.team_size()*sizeof(double);
        vec(team_member.team_rank()) = 0;
    }
};

int main() {
    Kokkos::initialize();
    vec_view_t vec("vec", 256);
    global_mem_kernel kernel(vec);
    int team_size = Kokkos::TeamPolicy<>::team_size_recommended(kernel);
    Kokkos::TeamPolicy<> policy(1, team_size);
    Kokkos::parallel_for(policy, kernel);
    Kokkos::fence();
    Kokkos::finalize();
}

The third example code uses raw CUDA -- no Kokkos -- to access shared memory. CUDA says: ptxas info : Used 2 registers, 32 bytes cmem[0]

__global__ void shared_mem_kernel() {
    int alloc_size = blockDim.x*sizeof(double);
    // Do something with alloc_size so that the compiler includes it
    for(; alloc_size>0; alloc_size--);

    extern __shared__ double shared[];
    shared[threadIdx.x] = 0;
}

int main() {
    shared_mem_kernel<<<1, 256, 256*sizeof(double)>>>();
    cudaDeviceSynchronize();
}

Ryan verified that this actually uses shared memory (the compiler doesn't optimize it away).

@mhoemmen
Copy link
Contributor Author

Ryan is working now to see whether commenting out the printf that I added to scratch space allocation makes it use fewer registers.

The one reason I added that printf was because some of the Kokkos examples were failing to allocate shared memory in the Kokkos::Serial case. That was making the examples fail without any obvious reason why. It would make sense to rewrite those examples so that they check whether scratch allocation returned a NULL pointer, and fail out safely in that case.

@crtrott
Copy link
Member

crtrott commented Jun 23, 2015

Shared memory should work in Serial. If it doesn't file a bug.

@mhoemmen
Copy link
Contributor Author

Ryan says: "With the printf string removed (printf("");), it drops to 23 registers. With the printf call removed completely, it drops to 5 registers."

@mhoemmen
Copy link
Contributor Author

Hi Christian -- my issue with the examples was more that they don't check whether scratch allocations succeeded (returned non-NULL). I put the printf there a while back to help diagnose that. I realize I should have looked carefully at performance before stuffing a printf into the loop!

@crtrott
Copy link
Member

crtrott commented Jun 23, 2015

?Wow wouldn't have expected that.


From: Mark Hoemmen notifications@github.com
Sent: Tuesday, June 23, 2015 10:55 AM
To: kokkos/kokkos
Cc: Trott, Christian Robert (-EXP)
Subject: [EXTERNAL] Re: [kokkos] Kokkos shared memory on Cuda uses a lot of registers (#31)

Ryan says: "With the printf string removed (printf("");), it drops to 23 registers. With the printf call removed completely, it drops to 5 registers."

Reply to this email directly or view it on GitHubhttps://github.com//issues/31#issuecomment-114571783.

@crtrott
Copy link
Member

crtrott commented Jun 23, 2015

How about generating a pull request which lets the printf statement in but protects it with the macro KOKKOS_HAVE_DEBUG

@mhoemmen
Copy link
Contributor Author

Sure, will do!

@mhoemmen
Copy link
Contributor Author

btw there is only one place in the whole Kokkos package that uses KOKKOS_HAVE_DEBUG: Kokkos_Parallel.hpp, lines 61-63. It doesn't even have an effect there, other than to include <iostream>.

@crtrott
Copy link
Member

crtrott commented Jun 23, 2015

Ah hm. I wanted to rename it anyway to KOKKOS_ENABLE_DEBUG
One reason for so little occurrences is that the other stuff is done via their own macros such as
KOKKOS_ENABLE_DEBUG_BOUNDS_CHECK

KOKKOS_ENABLE_DEBUG should probably be used more. For example when checking whether lengths given to View allocators are negative etc.

@crtrott
Copy link
Member

crtrott commented Jun 23, 2015

Merged into develop.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants