
Update to the latest Legion #1007

Merged
marcinz merged 12 commits into nv-legate:branch-23.09 from marcinz:update_legion_version
Aug 9, 2023

Conversation

@marcinz
Collaborator

@marcinz marcinz commented Jul 26, 2023

For testing, this PR points to the PR version of legate.core. Before merge, the version of legate.core should be changed to the commit on the legate.core dev branch.

@marcinz marcinz added the category:task PR is a simple task and will not be included in release notes label Jul 26, 2023
@marcinz marcinz requested a review from manopapad July 26, 2023 21:56
@marcinz
Collaborator Author

marcinz commented Jul 27, 2023

@manopapad Tests seem to be failing with the latest Legion.

@manopapad
Contributor

I am seeing the following:

  • test_put_along_axis.py hangs on 2 CPUs
  • SIGKILL was sent to the 1-GPU run
  • 11 tests failed on the 2-GPU run with exit code 247 (SIGKILL)
  • test_reshape.py failed on 2-GPU run because a NaN was produced (by both NumPy and cuNumeric in the same spot in the compared arrays), and NaNs don't compare equal. Probably something that should be changed in the test.
  • 3 tests failed on the 2-GPU run due to heap corruption inside Legion
  • 2 tests segfaulted inside Legion on the 2-OMPs run
  • 2 tests failed on the 2-OMPs run due to heap corruption inside Legion
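For the test_reshape.py item above, here is a minimal sketch of a NaN-tolerant comparison. It assumes the test compares the NumPy and cuNumeric results with a plain equality check; the actual test code is not shown in this thread. `arrays_match` is a hypothetical helper name, while `equal_nan` is an existing parameter of `numpy.array_equal` (NumPy 1.19 and later).

```python
import numpy as np


def arrays_match(expected, actual):
    """Compare two arrays, treating NaNs in the same position as equal.

    Hypothetical helper, not taken from the cuNumeric test suite.
    """
    return np.array_equal(expected, actual, equal_nan=True)


# Two results that contain a NaN in the same spot.
reference = np.array([1.0, np.nan, 3.0])
candidate = np.array([1.0, np.nan, 3.0])

# Default comparison: NaN != NaN, so the arrays are reported as different.
plain_equal = np.array_equal(reference, candidate)

# NaN-aware comparison: matching NaN positions count as equal.
nan_aware_equal = arrays_match(reference, candidate)
```

With this kind of check, a NaN produced by both libraries in the same position no longer fails the comparison, while a NaN on only one side still does.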

The most problematic tests appear to be:

  • indexing_routines.py
  • test_put_along_axis.py
  • test_take_along_axis.py
  • test_put.py

I was able to reproduce some of these heap corruptions on computelab; here are the backtraces I got:

Heap corruption backtraces
tests/integration/test_put.py ............................................................................................free(): invalid next size (fast)
Thread 10 "legion_python" received signal SIGSEGV, Segmentation fault.
(gdb) bt
#0  0x000015554f46cd8b in _int_malloc (av=av@entry=0x154cb8000020, bytes=bytes@entry=8) at malloc.c:3608
#1  0x000015554f46f299 in __GI___libc_malloc (bytes=8) at malloc.c:3066
#2  0x000015554f6b9a40 in operator new (sz=8) at ../../../../libstdc++-v3/libsupc++/new_op.cc:50
#3  0x000015555385ee18 in __gnu_cxx::new_allocator<unsigned long>::allocate (this=0x155541d315b0, __n=1) at /usr/include/c++/9/ext/new_allocator.h:114
#4  0x00001555538542a3 in std::allocator_traits<std::allocator<unsigned long> >::allocate (__a=..., __n=1) at /usr/include/c++/9/bits/alloc_traits.h:443
#5  0x000015555385ae0c in std::_Vector_base<unsigned long, std::allocator<unsigned long> >::_M_allocate (this=0x155541d315b0, __n=1) at /usr/include/c++/9/bits/stl_vector.h:343
#6  0x00001555538c05d7 in std::_Vector_base<unsigned long, std::allocator<unsigned long> >::_M_create_storage (this=0x155541d315b0, __n=1) at /usr/include/c++/9/bits/stl_vector.h:358
#7  0x00001555538be695 in std::_Vector_base<unsigned long, std::allocator<unsigned long> >::_Vector_base (this=0x155541d315b0, __n=1, __a=...) at /usr/include/c++/9/bits/stl_vector.h:302
#8  0x00001555538bd17d in std::vector<unsigned long, std::allocator<unsigned long> >::vector (this=0x155541d315b0, __n=1, __value=@0x155541d314f8: 4096, __a=...) at /usr/include/c++/9/bits/stl_vector.h:521
#9  0x00001555540bdb4c in Legion::Internal::MemoryManager::create_future_instance (this=0x154bd40523b0, op=0x5555568b7ab0, creator_uid=2, ready_event=..., size=4096, eager=true) at /home/scratch.mpapadakis_sw/computelab/a/legion/runtime/legion/runtime.cc:10499
#10 0x00001555540957c3 in Legion::Internal::FutureImpl::find_or_create_instance (this=0x154cb964bef0, memory=..., op=0x5555568b7ab0, creator_uid=2, eager=true, need_lock=false, ready_event=..., existing=0x0) at /home/scratch.mpapadakis_sw/computelab/a/legion/runtime/legion/runtime.cc:1906
#11 0x00001555540955f7 in Legion::Internal::FutureImpl::find_or_create_instance (this=0x154cb964bef0, memory=..., op=0x5555568b7ab0, creator_uid=2, eager=true, need_lock=true, ready_event=..., existing=0x0) at /home/scratch.mpapadakis_sw/computelab/a/legion/runtime/legion/runtime.cc:1876
#12 0x0000155554091262 in Legion::Internal::FutureImpl::get_buffer (this=0x154cb964bef0, memory=..., extent_in_bytes=0x155541d35aa0, check_extent=false, silence_warnings=false, warning_string=0x0) at /home/scratch.mpapadakis_sw/computelab/a/legion/runtime/legion/runtime.cc:1034
#13 0x0000155554091093 in Legion::Internal::FutureImpl::get_buffer (this=0x154cb964bef0, proc=..., memkind=Realm::Memory::SYSTEM_MEM, extent_in_bytes=0x155541d35aa0, check_extent=false, silence_warnings=false, warning_string=0x0) at /home/scratch.mpapadakis_sw/computelab/a/legion/runtime/legion/runtime.cc:1022
#14 0x0000155553c00898 in Legion::Future::get_buffer (this=0x154cb96bb090, memory=Realm::Memory::SYSTEM_MEM, extent_in_bytes=0x155541d35aa0, check_size=false, silence_warnings=false, warning_string=0x0) at /home/scratch.mpapadakis_sw/computelab/a/legion/runtime/legion/legion.cc:2433
#15 0x0000155553a72a3c in Legion::Future::get_untyped_pointer (this=0x154cb96bb090, silence_warnings=false, warning_string=0x0) at /home/scratch.mpapadakis_sw/computelab/a/legion/runtime/legion/legion.inl:19401
#16 0x0000155553a57205 in legion_future_get_untyped_pointer (handle_=...) at /home/scratch.mpapadakis_sw/computelab/a/legion/runtime/legion/legion_c.cc:3227
#17 0x0000155540724a4a in ffi_call_unix64 () from /home/scratch.mpapadakis_sw/computelab/a/env/lib/python3.10/site-packages/../../libffi.so.8
#18 0x0000155540723fea in ffi_call_int () from /home/scratch.mpapadakis_sw/computelab/a/env/lib/python3.10/site-packages/../../libffi.so.8
#19 0x00001555407474a4 in cdata_call () from /home/scratch.mpapadakis_sw/computelab/a/env/lib/python3.10/site-packages/_cffi_backend.cpython-310-x86_64-linux-gnu.so
#20 0x00001555414d3f0b in _PyObject_MakeTpCall (tstate=0x154cb801ed70, callable=<_cffi_backend._CDataBase at remote 0x155510757180>, args=<optimized out>, nargs=1, keywords=0x0) at /usr/local/src/conda/python-3.10.12/Objects/call.c:215
tests/integration/test_take_along_axis.py ...........xx..free(): invalid next size (fast)
Thread 793 "legion_python" received signal SIGABRT, Aborted.
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x000015554f3f7859 in __GI_abort () at abort.c:79
#2  0x000015554f46226e in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x15554f58c298 "%s\n") at ../sysdeps/posix/libc_fatal.c:155
#3  0x000015554f46a2fc in malloc_printerr (str=str@entry=0x15554f58e600 "free(): invalid next size (fast)") at malloc.c:5347
#4  0x000015554f46bbac in _int_free (av=0x154cb8000020, p=0x154cb9652c20, have_lock=0) at malloc.c:4249
#5  0x0000155553cb8a6d in Legion::Internal::AllReduceOp::deactivate (this=0x154cb92fce90, freeop=true) at /home/scratch.mpapadakis_sw/computelab/a/legion/runtime/legion/legion_ops.cc:23225
#6  0x0000155553c5b480 in Legion::Internal::Operation::commit_operation (this=0x154cb92fce90, do_deactivate=true, wait_on=...) at /home/scratch.mpapadakis_sw/computelab/a/legion/runtime/legion/legion_ops.cc:2226
#7  0x0000155553c59721 in Legion::Internal::Operation::trigger_commit (this=0x154cb92fce90) at /home/scratch.mpapadakis_sw/computelab/a/legion/runtime/legion/legion_ops.cc:1714
#8  0x0000155553c5ac8a in Legion::Internal::Operation::complete_operation (this=0x154cb92fce90, wait_on=...) at /home/scratch.mpapadakis_sw/computelab/a/legion/runtime/legion/legion_ops.cc:2109
#9  0x0000155553c596f2 in Legion::Internal::Operation::trigger_complete (this=0x154cb92fce90) at /home/scratch.mpapadakis_sw/computelab/a/legion/runtime/legion/legion_ops.cc:1707
#10 0x0000155553c5a741 in Legion::Internal::Operation::complete_execution (this=0x154cb92fce90, wait_on=...) at /home/scratch.mpapadakis_sw/computelab/a/legion/runtime/legion/legion_ops.cc:1996
#11 0x0000155553cb9ab5 in Legion::Internal::AllReduceOp::trigger_execution (this=0x154cb92fce90) at /home/scratch.mpapadakis_sw/computelab/a/legion/runtime/legion/legion_ops.cc:23453
#12 0x0000155553b0bea2 in Legion::Internal::InnerContext::process_trigger_execution_queue (this=0x154c98000d00) at /home/scratch.mpapadakis_sw/computelab/a/legion/runtime/legion/legion_context.cc:8223
#13 0x0000155553b1a9e4 in Legion::Internal::InnerContext::handle_trigger_execution_queue (args=0x153fd7e80420) at /home/scratch.mpapadakis_sw/computelab/a/legion/runtime/legion/legion_context.cc:11715
#14 0x000015555411132b in Legion::Internal::Runtime::legion_runtime_task (args=0x153fd7e80420, arglen=12, userdata=0x555556726510, userlen=8, p=...) at /home/scratch.mpapadakis_sw/computelab/a/legion/runtime/legion/runtime.cc:32007
#15 0x00001555504e37c8 in Realm::LocalTaskProcessor::execute_task (this=0x5555566705b0, func_id=4, task_args=...) at /home/scratch.mpapadakis_sw/computelab/a/legion/runtime/realm/proc_impl.cc:1175
#16 0x000015555055b39c in Realm::Task::execute_on_processor (this=0x153fd7e802a0, p=...) at /home/scratch.mpapadakis_sw/computelab/a/legion/runtime/realm/tasks.cc:326
#17 0x000015555055f76a in Realm::KernelThreadTaskScheduler::execute_task (this=0x555556603d60, task=0x153fd7e802a0) at /home/scratch.mpapadakis_sw/computelab/a/legion/runtime/realm/tasks.cc:1421
#18 0x000015555055e4b1 in Realm::ThreadedTaskScheduler::scheduler_loop (this=0x555556603d60) at /home/scratch.mpapadakis_sw/computelab/a/legion/runtime/realm/tasks.cc:1160
#19 0x000015555055eaff in Realm::ThreadedTaskScheduler::scheduler_loop_wlock (this=0x555556603d60) at /home/scratch.mpapadakis_sw/computelab/a/legion/runtime/realm/tasks.cc:1272
#20 0x000015555056719e in Realm::Thread::thread_entry_wrapper<Realm::ThreadedTaskScheduler, &Realm::ThreadedTaskScheduler::scheduler_loop_wlock> (obj=0x555556603d60) at /home/scratch.mpapadakis_sw/computelab/a/legion/runtime/realm/threads.inl:97
#21 0x00001555505743b8 in Realm::KernelThread::pthread_entry (data=0x154bdc0ab710) at /home/scratch.mpapadakis_sw/computelab/a/legion/runtime/realm/threads.cc:781
#22 0x000015554d41a609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#23 0x000015554f4f4133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
tests/integration/test_put.py ............................................................................................free(): invalid next size (fast)
Thread 675 "legion_python" received signal SIGABRT, Aborted.
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x000015554f3f7859 in __GI_abort () at abort.c:79
#2  0x000015554f46226e in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x15554f58c298 "%s\n") at ../sysdeps/posix/libc_fatal.c:155
#3  0x000015554f46a2fc in malloc_printerr (str=str@entry=0x15554f58e600 "free(): invalid next size (fast)") at malloc.c:5347
#4  0x000015554f46bbac in _int_free (av=0x155510000020, p=0x1555114e3890, have_lock=0) at malloc.c:4249
#5  0x0000155553e9bb32 in __gnu_cxx::new_allocator<Legion::Internal::ApUserEvent>::deallocate (this=0x155439af9e90, __p=0x1555114e38a0) at /usr/include/c++/9/ext/new_allocator.h:128
#6  0x0000155553e917ce in std::allocator_traits<std::allocator<Legion::Internal::ApUserEvent> >::deallocate (__a=..., __p=0x1555114e38a0, __n=2) at /usr/include/c++/9/bits/alloc_traits.h:469
#7  0x0000155553e877f4 in std::_Vector_base<Legion::Internal::ApUserEvent, std::allocator<Legion::Internal::ApUserEvent> >::_M_deallocate (this=0x155439af9e90, __p=0x1555114e38a0, __n=2) at /usr/include/c++/9/bits/stl_vector.h:351
#8  0x0000155553e7fc40 in std::_Vector_base<Legion::Internal::ApUserEvent, std::allocator<Legion::Internal::ApUserEvent> >::~_Vector_base (this=0x155439af9e90, __in_chrg=<optimized out>) at /usr/include/c++/9/bits/stl_vector.h:332
#9  0x0000155553e7fc95 in std::vector<Legion::Internal::ApUserEvent, std::allocator<Legion::Internal::ApUserEvent> >::~vector (this=0x155439af9e90, __in_chrg=<optimized out>) at /usr/include/c++/9/bits/stl_vector.h:680
#10 0x0000155553e5c052 in Legion::Internal::SingleTask::launch_task (this=0x1550e001a040, inline_task=false) at /home/scratch.mpapadakis_sw/computelab/a/legion/runtime/legion/legion_tasks.cc:4512
#11 0x0000155553b0ba41 in Legion::Internal::InnerContext::process_launch_task_queue (this=0x1554e8000cd0) at /home/scratch.mpapadakis_sw/computelab/a/legion/runtime/legion/legion_context.cc:8163
#12 0x0000155553b1a8d0 in Legion::Internal::InnerContext::handle_launch_task_queue (args=0x154ad2cce420) at /home/scratch.mpapadakis_sw/computelab/a/legion/runtime/legion/legion_context.cc:11694
#13 0x00001555541119a3 in Legion::Internal::Runtime::legion_runtime_task (args=0x154ad2cce420, arglen=12, userdata=0x5555556f6640, userlen=8, p=...) at /home/scratch.mpapadakis_sw/computelab/a/legion/runtime/legion/runtime.cc:32255
#14 0x00001555504e37c8 in Realm::LocalTaskProcessor::execute_task (this=0x555555723a30, func_id=4, task_args=...) at /home/scratch.mpapadakis_sw/computelab/a/legion/runtime/realm/proc_impl.cc:1175
#15 0x000015555055b39c in Realm::Task::execute_on_processor (this=0x154ad2cce2a0, p=...) at /home/scratch.mpapadakis_sw/computelab/a/legion/runtime/realm/tasks.cc:326
#16 0x000015555055f76a in Realm::KernelThreadTaskScheduler::execute_task (this=0x555555723cf0, task=0x154ad2cce2a0) at /home/scratch.mpapadakis_sw/computelab/a/legion/runtime/realm/tasks.cc:1421
#17 0x000015555055e4b1 in Realm::ThreadedTaskScheduler::scheduler_loop (this=0x555555723cf0) at /home/scratch.mpapadakis_sw/computelab/a/legion/runtime/realm/tasks.cc:1160
#18 0x000015555055eaff in Realm::ThreadedTaskScheduler::scheduler_loop_wlock (this=0x555555723cf0) at /home/scratch.mpapadakis_sw/computelab/a/legion/runtime/realm/tasks.cc:1272
#19 0x000015555056719e in Realm::Thread::thread_entry_wrapper<Realm::ThreadedTaskScheduler, &Realm::ThreadedTaskScheduler::scheduler_loop_wlock> (obj=0x555555723cf0) at /home/scratch.mpapadakis_sw/computelab/a/legion/runtime/realm/threads.inl:97
#20 0x00001555505743b8 in Realm::KernelThread::pthread_entry (data=0x1552e0190030) at /home/scratch.mpapadakis_sw/computelab/a/legion/runtime/realm/threads.cc:781
#21 0x000015554d41a609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#22 0x000015554f4f4133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

@lightsighter before we go digging any further, are you aware of any change in Legion between commits a2ec81dde867e542b335ea98826475f8d601c2ad and 70f9fbbce07ec4772696c57036d5fc7f84ea264d that could be causing this heap corruption?

@lightsighter
Contributor

Have you tried turning off the caching allocator?
StanfordLegion/legion#1513 (comment)

@manopapad
Contributor

> Have you tried turning off the caching allocator?

That didn't fix it

@manopapad
Contributor

git bisect identified this control_replication commit as the culprit:

commit 6e31c84148103ace33b7812f267bc92f33b7aada
Merge: ea62f9acf 73fbb52c5
Author: Steven Gurfinkel <sgurfinkel@nvidia.com>
Date:   Fri Jul 21 14:56:44 2023 +0000

    Merge branch 'sgurfinkel/reduce_future' into 'control_replication'

    [Legion] add initial value parameter to reduce_future_map()

    See merge request StanfordLegion/legion!846
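A bisection like this can be automated with `git bisect run`, which marks each commit from the exit code of a script: 0 means good, 125 means skip, and any other code from 1 to 127 means bad. The sketch below is only an illustration of that convention. It is not the procedure used here, and it assumes Legion has already been rebuilt at the commit under test. `bisect_exit_code` and `run_check` are hypothetical names.

```python
import subprocess

# Exit codes understood by `git bisect run`.
GOOD, BAD, SKIP = 0, 1, 125


def bisect_exit_code(returncode):
    """Map a test process return code to a `git bisect run` verdict.

    subprocess reports death by signal (SIGSEGV, SIGABRT, SIGKILL) as a
    negative return code, so anything nonzero is treated as a bad commit.
    """
    return GOOD if returncode == 0 else BAD


def run_check(test_command):
    """Run the test command and return the verdict for the current commit."""
    try:
        result = subprocess.run(test_command, check=False)
    except FileNotFoundError:
        # The test could not even be launched; let bisect skip this commit.
        return SKIP
    return bisect_exit_code(result.returncode)


# Hypothetical usage from a wrapper script passed to `git bisect run`:
#   sys.exit(run_check(["legate", "tests/integration/test_put.py"]))
```

Because heap corruption tends to fail intermittently, a real check would probably need to repeat the test several times before declaring a commit good.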

@lightsighter
Contributor

Here is a backtrace for some invalid reads that are occurring:

==3189703== Invalid read of size 8
==3189703==    at 0x4848B5F: memmove (vg_replace_strmem.c:1398)
==3189703==  Address 0xf23d0a8 is 7 bytes after a block of size 1 alloc'd
==3189703==    at 0x483D635: malloc (vg_replace_malloc.c:392)
==3189703==    by 0x18A3D4479: legate::pack_returned_exception(legate::ReturnedException const&, void*&, unsigned long&) (return.cc:104)
==3189703==    by 0x18A3D44CB: legate::returned_exception_init(Realm::ReductionOpUntyped const*, void*&, unsigned long&) (return.cc:113)
==3189703==    by 0x6B6F67C: Legion::Internal::AllReduceOp::initialize(Legion::Internal::InnerContext*, Legion::FutureMap const&, int, bool, unsigned int, unsigned long, Legion::Internal::Provenance*, Legion::Future) (legion_ops.cc:23169)
==3189703==    by 0x69B9DE2: Legion::Internal::InnerContext::reduce_future_map(Legion::FutureMap const&, int, bool, unsigned int, unsigned long, Legion::Internal::Provenance*, Legion::Future) (legion_context.cc:6473)
==3189703==    by 0x6ACAB44: Legion::Runtime::reduce_future_map(Legion::Internal::TaskContext*, Legion::FutureMap const&, int, bool, unsigned int, unsigned long, char const*, Legion::Future) (legion.cc:6341)
==3189703==    by 0x690E681: legion_future_map_reduce_with_initial_value (legion_c.cc:3333)
==3189703==    by 0x690E558: legion_future_map_reduce (legion_c.cc:3312)
==3189703==    by 0x21FEFA49: ffi_call_unix64 (in /home/mebauer/miniconda3/envs/legate/lib/libffi.so.8.1.0)
==3189703==    by 0x21FEEFE9: ffi_call_int (in /home/mebauer/miniconda3/envs/legate/lib/libffi.so.8.1.0)
==3189703==    by 0x2877C493: cdata_call (in /home/mebauer/miniconda3/envs/legate/lib/python3.11/site-packages/_cffi_backend.cpython-311-x86_64-linux-gnu.so)
==3189703==    by 0x1762E5AA3: _PyObject_MakeTpCall (call.c:214)

I can't explain why they are happening though since I don't know how to pass --db-attach=yes to valgrind through the legate launcher at the moment.

@sgurfinkel for visibility.

@marcinz
Collaborator Author

marcinz commented Aug 1, 2023

I updated to the latest Legion, and fewer tests seem to fail.

@lightsighter
Contributor

> I updated to the latest Legion, and fewer tests seem to fail.

But there are still failing tests, right? I think that the fundamental problem is still there (whatever it is).

@manopapad
Contributor

I'm going to take this opportunity to test a change to the test configuration, whereby we tell pytest not to install Python's signal handler, which tends to shadow Realm's printing of C++ backtraces. Hopefully this results in more informative backtraces when the C++ side crashes.
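One way to get this effect is pytest's `-p no:faulthandler` option, which disables the built-in plugin that enables Python's `faulthandler` module for signals such as SIGSEGV and SIGABRT. The sketch below assumes that this is the mechanism meant here; the actual configuration change is not shown in this thread. `build_pytest_command` is a hypothetical helper, and invoking pytest directly rather than through the legate launcher is a simplification.

```python
import sys


def build_pytest_command(test_paths):
    """Return a pytest command line with the faulthandler plugin disabled.

    Hypothetical helper: `-p no:faulthandler` is a real pytest option, but
    the way the project wires it into its test runner is an assumption.
    """
    return [sys.executable, "-m", "pytest", "-p", "no:faulthandler", *test_paths]


# Build the command for one of the tests mentioned in this thread.
command = build_pytest_command(["tests/integration/test_put.py"])

# To execute it: subprocess.run(command, check=False)
```

With the plugin disabled, Python does not register its own handlers for these fatal signals, so handlers installed by the native runtime are not replaced by the Python-level traceback dump.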

@marcinz
Collaborator Author

marcinz commented Aug 4, 2023

/ok to test

@manopapad
Contributor

@marcinz it looks like tests passed, so I think we can merge this and nv-legate/legate#803

@marcinz marcinz merged commit 67cd9f9 into nv-legate:branch-23.09 Aug 9, 2023