Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MD iteration policy ignores lower bound on GPUs #1041

Closed
azrael417 opened this issue Aug 15, 2017 · 2 comments
Closed

MD iteration policy ignores lower bound on GPUs #1041

azrael417 opened this issue Aug 15, 2017 · 2 comments
Assignees
Labels
Bug Broken / incorrect code; it could be Kokkos' responsibility, or others’ (e.g., Trilinos)
Milestone

Comments

@azrael417
Copy link

azrael417 commented Aug 15, 2017

hello, it seems that the md iteration policy ignore lower loop bounds in the GPU code. It computes the total range correctly:

core/src/Cuda/Kokkos_Cuda_Parallle.hpp:434

```const dim3 block( m_rp.m_tile[0] , m_rp.m_tile[1] , m_rp.m_tile[2] );
  const dim3 grid(
      std::min( ( m_rp.m_upper[0] - m_rp.m_lower[0] + block.x - 1 ) / block.x , maxblocks )
    , std::min( ( m_rp.m_upper[1] - m_rp.m_lower[1] + block.y - 1 ) / block.y , maxblocks )
    , std::min( ( m_rp.m_upper[2] - m_rp.m_lower[2] + block.z - 1 ) / block.z , maxblocks )
    );```

but then forgets to add the offset back. For example:

template< typename RP , typename Functor >
struct DeviceIterateTile<3,RP,Functor,void >
{
  using index_type = typename RP::index_type;

  __device__
  DeviceIterateTile( const RP & rp_ , const Functor & f_ )
  : m_rp(rp_)
  , m_func(f_)
  {}

  inline __device__
  void exec_range() const
  {
    // LL                                                                                                                      
    if (RP::inner_direction == RP::Left) {
      for ( index_type tile_id2 = (index_type)blockIdx.z; tile_id2 < m_rp.m_tile_end[2]; tile_id2 += gridDim.z ) {
        const index_type offset_2 = tile_id2*m_rp.m_tile[2] + (index_type)threadIdx.z;
        if ( offset_2 < m_rp.m_upper[2] && (index_type)threadIdx.z < m_rp.m_tile[2] ) {

          for ( index_type tile_id1 = (index_type)blockIdx.y; tile_id1 < m_rp.m_tile_end[1]; tile_id1 += gridDim.y ) {
            const index_type offset_1 = tile_id1*m_rp.m_tile[1] + (index_type)threadIdx.y;
            if ( offset_1 < m_rp.m_upper[1] && (index_type)threadIdx.y < m_rp.m_tile[1] ) {

              for ( index_type tile_id0 = (index_type)blockIdx.x; tile_id0 < m_rp.m_tile_end[0]; tile_id0 += gridDim.x ) {
                const index_type offset_0 = tile_id0*m_rp.m_tile[0] + (index_type)threadIdx.x;
                if ( offset_0 < m_rp.m_upper[0] && (index_type)threadIdx.x < m_rp.m_tile[0] ) {
                  m_func(offset_0 , offset_1 , offset_2);
                }
              }
            }
          }
        }
      }```

Some of the specializations look OK though:

```// Specializations for void tag type                                                                                           
template< typename RP , typename Functor , typename Tag, typename ValueType >
struct DeviceIterateTile<4,RP,Functor,Tag,ValueType, typename std::enable_if< !is_array_type<ValueType>::value && !is_void< Ta\
g >::value >::type >```

In here we have:

``` // LL                                                                                                                  
        if (RP::inner_direction == RP::Left) {
          for (int i=0; i<RP::rank; ++i) {
            m_offset[i] = (tile_idx % m_rp.m_tile_end[i]) * m_rp.m_tile[i] + m_rp.m_lower[i] ;
            tile_idx /= m_rp.m_tile_end[i];```

And we have more of these in the following, starting after declaring functions namespace Reduce.

Best Regards
Thorsten




@ibaned ibaned added this to the 2017-September (mid) milestone Aug 15, 2017
@ibaned ibaned added the Question For Kokkos internal and external contributors and users label Aug 15, 2017
@ndellingwood
Copy link
Contributor

Fixed and testing now.

ndellingwood added a commit to ndellingwood/kokkos that referenced this issue Aug 15, 2017
This commit addresses issue kokkos#1041
The lower range offset was missing from the multi-dim index calculations
in the Cuda_iterateTile ParallelFor impls.
Added the missing offset.
Enhanced unit tests to properly exercise this.
@ndellingwood ndellingwood added the Bug Broken / incorrect code; it could be Kokkos' responsibility, or others’ (e.g., Trilinos) label Aug 15, 2017
@ndellingwood
Copy link
Contributor

PR #1042 issued with fix for this issue.

@crtrott crtrott added InDevelop and removed Question For Kokkos internal and external contributors and users labels Aug 17, 2017
@crtrott crtrott closed this as completed Sep 11, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Broken / incorrect code; it could be Kokkos' responsibility, or others’ (e.g., Trilinos)
Projects
None yet
Development

No branches or pull requests

4 participants