Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

get_work_partition casts int64_t to int, causing a seg fault #1481

Closed
bfzinser opened this issue Mar 21, 2018 · 4 comments
Closed

get_work_partition casts int64_t to int, causing a seg fault #1481

bfzinser opened this issue Mar 21, 2018 · 4 comments
Assignees
Labels
Bug Broken / incorrect code; it could be Kokkos' responsibility, or others’ (e.g., Trilinos)
Milestone

Comments

@bfzinser
Copy link

The line in Gemma that leads to this problem is (after removing typedefs)

Kokkos::View< Kokkos::complex<double> **, Kokkos::HostSpace > d_systemMatrix ("systemMatrix" , 56925, 56925);

This constructor requests a View with 3240455625 entries. This number fits in int64_t, but not int.

At some point, the view constructor calls get_work_partition, a function in Kokkos_HostThreadTeam.hpp:

  std::pair<int64_t,int64_t> get_work_partition() noexcept
    {
      return std::pair<int64_t,int64_t>
        ( m_work_range.first * m_work_chunk
        , m_work_range.second * m_work_chunk < m_work_end
        ? m_work_range.second * m_work_chunk : m_work_end );
    }

In get_work_partition, m_work_range.second is type int64_t but m_work_chunk is type int. Multiplying them together casts the result to type int. In our specific case, m_work_range.second * m_work_chunk is -1040187392, which is always less than m_work_end. Therefore, the first return part of the ternary expression is always used and get_work_partition returns a pair whose second value is negative.

Later, in Kokkos_OpenMP_Parallel.hpp, the negative value is cast to an unsigned long (a number larger than the number of entries in the View) and the constructor for Kokkos::complex attempts to initialize memory to double(0.0) that was not allocated by the View constructor. In our specific case, Gemma crashes at this point.

Changing get_work_partition to cast m_work_range.second and m_work_chunk to type int64_t in the following way and recompiling appears to fix this problem for Gemma.

  std::pair<int64_t,int64_t> get_work_partition() noexcept
    {
      return std::pair<int64_t,int64_t>
        ( m_work_range.first * m_work_chunk
        , ( int64_t ) m_work_range.second * ( int64_t ) m_work_chunk < m_work_end
        ? m_work_range.second * m_work_chunk : m_work_end );
    }

However, there are other lines of code in Kokkos_HostThreadTeam.hpp that have the same ternary if statement. I expect that the similar casting to int64_t should be done to other parts of get_work_partition.

Can you confirm that I understand what is happening in the View construction correctly? If so, for the sake of a short-term solution, it would be nice to know which other statements should be modified in our local copy of Kokkos so that we can run problems of this size.

As always, thank you for your time.

@ibaned
Copy link
Contributor

ibaned commented Mar 22, 2018

@bfzinser I remember fixing one integer shortening issue in that function, but I guess I didn't fix it fully. We'll get this fixed and also add a unit test to confirm its really fixed.

@ibaned ibaned added Bug Broken / incorrect code; it could be Kokkos' responsibility, or others’ (e.g., Trilinos) help wanted labels Mar 22, 2018
@ibaned ibaned added this to the 2018 April milestone Mar 22, 2018
@ibaned ibaned self-assigned this Mar 22, 2018
@bfzinser
Copy link
Author

For what it's worth, it appears that my naive fix works for at least --kokkos-threads=2, but it does not work for --kokkos-threads=16.

@crtrott crtrott assigned crtrott and unassigned ibaned Apr 6, 2018
@crtrott
Copy link
Member

crtrott commented Apr 6, 2018

Tests are running, gonna have a pull request in a bit.

@crtrott
Copy link
Member

crtrott commented Apr 7, 2018

Should be fixed. Unit Tests was added and runs nightly. The unit test is not on by default unless run on Sandia machines where I know I got the 50GB required to run the test.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Broken / incorrect code; it could be Kokkos' responsibility, or others’ (e.g., Trilinos)
Projects
None yet
Development

No branches or pull requests

4 participants