get_work_partition casts int64_t to int, causing a seg fault #1481
Labels: Bug
The line in Gemma that leads to this problem is (after removing typedefs)
This constructor requests a View with 3240455625 entries. This number fits in int64_t, but not int.
At some point, the view constructor calls `get_work_partition`, a function in Kokkos_HostThreadTeam.hpp:

In `get_work_partition`, `m_work_range.second` is type int64_t but `m_work_chunk` is type int. Multiplying them together casts the result to type int. In our specific case, `m_work_range.second * m_work_chunk` is -1040187392, which is always less than `m_work_end`. Therefore, the first return part of the ternary expression is always used and `get_work_partition` returns a pair whose second value is negative.

Later, in Kokkos_OpenMP_Parallel.hpp, the negative value is cast to an unsigned long (a number larger than the number of entries in the View) and the constructor for Kokkos::complex attempts to initialize memory to double(0.0) that was not allocated by the View constructor. In our specific case, Gemma crashes at this point.
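The arithmetic described above can be reproduced in isolation. The following is a minimal sketch, not Kokkos code: the helper `wrap_to_int32` and the factorization into 3104 chunks of 2^20 entries are hypothetical, chosen only because that product wraps to the reported -1040187392. Note that under the usual arithmetic conversions an int64_t times int product is itself 64-bit, so the wraparound presumably comes from a step where the operands or the stored result are 32-bit; the sketch does not try to identify that step. It also assumes a 64-bit `unsigned long`.

```cpp
#include <cstdint>

// Reduce a 64-bit value to what a 32-bit signed int would hold after
// two's-complement wraparound. Unsigned arithmetic is used so the sketch
// itself never relies on signed-overflow undefined behavior.
inline std::int64_t wrap_to_int32(std::int64_t v) {
  const std::uint32_t low = static_cast<std::uint32_t>(v);  // keep low 32 bits
  return low >= 0x80000000u
             ? static_cast<std::int64_t>(low) - (std::int64_t{1} << 32)
             : static_cast<std::int64_t>(low);
}

// Hypothetical values: 3104 chunks of 2^20 entries is one factorization
// that reproduces the product reported in the issue.
constexpr std::int64_t kChunks = 3104;
constexpr std::int64_t kChunkSize = 1048576;
constexpr std::int64_t kEntries = 3240455625;  // View size from the issue

// The product as it would be evaluated in 64 bits.
inline std::int64_t wide_product() { return kChunks * kChunkSize; }

// The same product after being narrowed to 32 bits.
inline std::int64_t narrow_product() { return wrap_to_int32(wide_product()); }

// Converting the negative result to an unsigned 64-bit index
// (assumption: unsigned long is 64 bits wide) yields a huge value.
inline std::uint64_t as_unsigned_index() {
  return static_cast<std::uint64_t>(narrow_product());
}
```

With these values, `narrow_product()` is negative and therefore always compares less than the work end, while `as_unsigned_index()` is far larger than `kEntries`, which is consistent with writes landing outside the allocation.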
Changing `get_work_partition` to cast `m_work_range.second` and `m_work_chunk` to type int64_t in the following way and recompiling appears to fix this problem for Gemma.

However, there are other lines of code in Kokkos_HostThreadTeam.hpp that have the same ternary if statement. I expect that the similar casting to int64_t should be done to other parts of `get_work_partition`.

Can you confirm that I understand what is happening in the View construction correctly? If so, for the sake of a short-term solution, it would be nice to know which other statements should be modified in our local copy of Kokkos so that we can run problems of this size.
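For reference, the general shape of such a change can be sketched as follows. This is a hypothetical stand-in, not the actual Kokkos source or the exact patch applied locally: the function `clamped_partition` and its parameter names only echo the members mentioned above, and the range and chunk are taken as `int` purely so that the widening cast is visible. The actual member types in Kokkos may differ.

```cpp
#include <cstdint>
#include <utility>

// Hypothetical stand-in for the partition computation (not Kokkos code).
// Both ends of the range are widened to 64 bits before multiplying,
// then clamped to the total amount of work.
inline std::pair<std::int64_t, std::int64_t> clamped_partition(
    int range_first, int range_second, int work_chunk, std::int64_t work_end) {
  // Widen one operand first so the product is evaluated in 64 bits.
  const std::int64_t first =
      static_cast<std::int64_t>(range_first) * work_chunk;
  const std::int64_t second =
      static_cast<std::int64_t>(range_second) * work_chunk;
  // Apply the same ternary clamp to both ends of the pair.
  return {first < work_end ? first : work_end,
          second < work_end ? second : work_end};
}
```

Calling `clamped_partition(0, 3104, 1048576, 3240455625)` with the hypothetical chunk values returns a pair whose second entry is clamped to 3240455625 rather than wrapping negative. Applying the clamp to both ends is a design choice of this sketch, so that neither end can exceed the work end.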
As always, thank you for your time.