Undeterministically task==NULL at runtime when finetuning GPT-2 #12529

Open
wangkuiyi opened this issue Mar 6, 2023 · 12 comments

@wangkuiyi
Contributor

What happened?

After we fixed #12369, I can make GPT-2 generate text well, so I'm moving on to fine-tuning GPT-2.

In iree-org/iree-jax#58, I added a loss function to the file iree-jax/models/gpt2/model.py. In JAX-Python, the fine-tuning works well.

Then, in iree-org/iree-jax#59, I added the fine-tuning feature as an MLIR function. The compilation went well, and I got the file /tmp/gpt2.vmfb.

I can run the module using iree-run-module:

15:09 $ iree-run-module --module=/tmp/gpt2.vmfb --device=local-task --function=finetune --input="1x64xi32=13" --input="1x64xi32=13" --input="1xi32=10"
EXEC @finetune

Because the finetune function only updates the parameters and does not return anything, the above run prints only EXEC @finetune.

To check if the finetuning really works on macOS, I wrote a C++ program to run this vmfb file. Sometimes it works well, but sometimes it crashes with Bus error: 10.

(base) ✔ ~/w/iree-ios/iree-jax/models/gpt2/finetune [export_finetune|●4✚ 3…6]
14:51 $ ./build.sh && ./finetune /tmp/gpt2.vmfb ~/w/iree-ios/IREESampleApp/IREESampleApp
clang: warning: treating 'c' input as 'c++' when in C++ mode, this behavior is deprecated [-Wdeprecated]
Got id = 679
Yi Wang has two dogs. He's a good dog, but he's not a good dog.
Got id = 1881
Yi Wang has two dogs. One is the other is the other is the other is the other is the other is
Got id = 1881
Yi Wang has two dogs. One of the other dogs is in the other two.
Got id = 1881
Yi Wang has two dogs. One is the other is the other is the other is the other is the other is
Got id = 1881
Yi Wang has two dogs. One is Joy.
Got id = 1881
Yi Wang has two dogs. One is Joy, the other is Joy.
Got id = 1881
Yi Wang has two dogs. One is Relaxie.
Got id = 1881
Yi Wang has two dogs. One is Relaxie.
Got id = 1881
Yi Wang has two dogs. One is Joy, the other is Relaxie.
Got id = 1881
Yi Wang has two dogs. One is Joy, the other is Relaxie.
Got id = 1881
Yi Wang has two dogs. One is Joy, the other is Relaxie.
(base) ✔ ~/w/iree-ios/iree-jax/models/gpt2/finetune [export_finetune|●4✚ 3…5]
14:51 $ ./build.sh && ./finetune /tmp/gpt2.vmfb ~/w/iree-ios/IREESampleApp/IREESampleApp
clang: warning: treating 'c' input as 'c++' when in C++ mode, this behavior is deprecated [-Wdeprecated]
Got id = 679
Yi Wang has two dogs. He's a good dog, but he's not a good dog.
Got id = 1881
Yi Wang has two dogs. One is the other is the other is the other is the other is the other is
Got id = 1881
Yi Wang has two dogs. One of the other dogs is in the other two.
Got id = 1881
Yi Wang has two dogs. One is the other is the other is the other is the other is the other is
Got id = 1881
Yi Wang has two dogs. One is Joy.
Bus error: 10

By putting the C++ program into an iOS app written in Objective-C, I can run the app on my iPhone 13 or the iOS Simulator. On these two platforms, the program crashes with EXC_BAD_ACCESS almost every time. I am attaching a stack trace from Xcode.
[Screenshot: Xcode stack trace, 2023-03-06]

Steps to reproduce your issue

  1. Build a very recent version of IREE that includes the fix for #12369 (EXC_BAD_ACCESS signal received executing GPT2 llvm-cpu on iOS Simulator).
  2. Use the branch of IREE-JAX in iree-org/iree-jax#59 (Add MLIR function finetune to GPT-2 export.py) to generate gpt2.vmfb.
  3. Build the sample C++ program that executes gpt2.vmfb on macOS/M1.
  4. Build the sample iOS app that executes gpt2.vmfb on the iOS Simulator or an iPhone.

What component(s) does this issue relate to?

Runtime

Version information

IREE da22c84

Additional context

macOS
M1 Max

wangkuiyi added the bug 🐞 (Something isn't working) and awaiting-triage labels on Mar 6, 2023
@bjacob
Contributor

bjacob commented Mar 7, 2023

To triage a nondeterministic issue like this, it would be very helpful to be able to run the reproduction steps with sanitizers: AddressSanitizer and, separately, ThreadSanitizer. This page says:

> You can’t use Thread Sanitizer to diagnose iOS, tvOS, and watchOS apps running on a device. Use Thread Sanitizer only on your 64-bit macOS app, or to diagnose your 64-bit iOS, tvOS, or watchOS app running in Simulator.

Since you write above that this reproduces in Simulator, let's then focus on that.

In particular, task==NULL sounds like the kind of thing that could be associated with issues that ThreadSanitizer would diagnose.

Even a negative outcome (the sanitizer doesn't see anything) would be useful information in itself, as that would help rule out classes of issues.
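
To illustrate the class of bug meant here (a sketch only, not a diagnosis of IREE's task executor): when one thread hands a task pointer to another, the hand-off needs synchronization. The hypothetical `TaskSlot` below publishes the pointer through a `std::atomic` with release/acquire ordering. If `slot_` were instead a plain `Task*` that both threads touched without synchronization, that would be a data race, the consumer could observe a null or stale pointer, and ThreadSanitizer would be expected to report it. All names are made up for this example.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

struct Task {
  int payload;
};

// Hypothetical single-slot mailbox, unrelated to IREE's real data structures.
class TaskSlot {
 public:
  // Publishes a task; the release store makes earlier writes to *task visible
  // to the thread that later acquires the pointer.
  void Publish(Task* task) {
    assert(task != nullptr);
    slot_.store(task, std::memory_order_release);
  }

  // Spins until a task has been published, then takes ownership of it.
  Task* WaitAndTake() {
    Task* task = nullptr;
    while ((task = slot_.exchange(nullptr, std::memory_order_acquire)) == nullptr) {
      std::this_thread::yield();
    }
    return task;
  }

 private:
  std::atomic<Task*> slot_{nullptr};
};

// Hands one task from the calling thread to a consumer thread and returns the
// payload the consumer observed.
int RunOnce() {
  TaskSlot slot;
  Task task{0};
  int seen = -1;
  std::thread consumer([&] { seen = slot.WaitAndTake()->payload; });
  task.payload = 42;  // Written before publication, so the consumer sees it.
  slot.Publish(&task);
  consumer.join();
  return seen;
}
```

Building a variant of this with a non-atomic `slot_` and `-fsanitize=thread` is one way to see what a TSan race report looks like before running it on the real reproducer.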

We have sanitizers docs here,
https://github.com/openxla/iree/blob/main/docs/developers/developing_iree/sanitizers.md

But I wrote that a while ago and it's not optimal. Here are the important steps:

  1. For both sanitizers, select the RelWithDebInfo build type.
cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo .
  2. For ThreadSanitizer, first re-compile your .vmfb module by adding these flags to your iree-compile command line:
iree-compile ...  --iree-llvm-sanitize=thread --iree-llvm-link-embedded=false

Then re-build the IREE runtime (iree-run-module or anything else you're using to load the compiled module) with the IREE_ENABLE_TSAN CMake option:

cmake -DIREE_ENABLE_TSAN=ON .
cmake --build .

If the reproducing program is your own (finetune, if I read the issue description correctly) then rebuild that with this C/C++ compiler flag: -fsanitize=thread. That is all that IREE_ENABLE_TSAN does to IREE binaries. But you still need to rebuild the IREE runtime (that it links to) with IREE_ENABLE_TSAN.

Then re-run your iree-run-module command line reproducing this issue, using both the TSan-enabled iree-run-module and the TSan-enabled compiled .vmfb module.

  3. For AddressSanitizer, it's easier as you don't need to re-compile the .vmfb. Just re-compile the IREE runtime with the CMake option IREE_ENABLE_ASAN=ON.
cmake -DIREE_ENABLE_ASAN=ON .
cmake --build .

If the reproducing program is your own (finetune, if I read the issue description correctly) then rebuild that with this C/C++ compiler flag: -fsanitize=address. That is all that IREE_ENABLE_ASAN does to IREE binaries. But you still need to rebuild the IREE runtime (that it links to) with IREE_ENABLE_ASAN.

@wangkuiyi
Contributor Author

wangkuiyi commented Mar 7, 2023

Thanks @bjacob! I rebuilt the IREE compiler and runtime for macOS/M1 with the following additional CMake flags:

-DCMAKE_BUILD_TYPE=RelWithDebInfo
-DIREE_ENABLE_ASAN=ON 
-DIREE_ENABLE_TSAN=ON 
-DIREE_BYTECODE_MODULE_ENABLE_TSAN=ON 
-DIREE_BYTECODE_MODULE_FORCE_LLVM_SYSTEM_LINKER=ON 
-DIREE_ENABLE_MSAN=ON

The build went fine, except that I had to patch libyaml a little (yaml/libyaml#267).

Then, I compiled the gpt2.mlir with the following command:

 iree-compile /tmp/gpt2.mlir \
   --iree-input-type=mhlo \
   --iree-hal-target-backends=llvm-cpu  \
   -o /tmp/gpt2-san.vmfb \
   --iree-llvm-sanitize=thread --iree-llvm-link-embedded=false 2>&1 | tee /tmp/log

It gave me errors like the following. (The more complete error message is at https://gist.github.com/wangkuiyi/b4ef1a867e6f129fe3287a0ef0e1d600. The complete one is too big to upload to GitHub.)

Undefined symbols for architecture arm64:
 "___tsan_func_entry", referenced from:
     _encode_dispatch_0_generic_8 in gpt2_module_linked_llvm_cpu-dff9a6.o
     _encode_dispatch_1_generic_8 in gpt2_module_linked_llvm_cpu-dff9a6.o
     _encode_dispatch_2_generic_8x768 in gpt2_module_linked_llvm_cpu-dff9a6.o
     _encode_dispatch_3_generic_8x768 in gpt2_module_linked_llvm_cpu-dff9a6.o
     _encode_dispatch_4_generic_8x768 in gpt2_module_linked_llvm_cpu-dff9a6.o
     _encode_dispatch_5_generic_768x8 in gpt2_module_linked_llvm_cpu-dff9a6.o
     _encode_dispatch_6_matmul_2304x8x768 in gpt2_module_linked_llvm_cpu-dff9a6.o
     ...
 "___tsan_func_exit", referenced from:
     _encode_dispatch_0_generic_8 in gpt2_module_linked_llvm_cpu-dff9a6.o
     _encode_dispatch_1_generic_8 in gpt2_module_linked_llvm_cpu-dff9a6.o
     _encode_dispatch_2_generic_8x768 in gpt2_module_linked_llvm_cpu-dff9a6.o
     _encode_dispatch_3_generic_8x768 in gpt2_module_linked_llvm_cpu-dff9a6.o
     _encode_dispatch_4_generic_8x768 in gpt2_module_linked_llvm_cpu-dff9a6.o
     _encode_dispatch_5_generic_768x8 in gpt2_module_linked_llvm_cpu-dff9a6.o
     _encode_dispatch_6_matmul_2304x8x768 in gpt2_module_linked_llvm_cpu-dff9a6.o
     ...
ld: symbol(s) not found for architecture arm64
Linking failed; escaped command line returned exit code 256:

It works if I remove --iree-llvm-sanitize=thread --iree-llvm-link-embedded=false.

@bjacob
Contributor

bjacob commented Mar 7, 2023

I don't know the fix for these linking errors, but, FYI:

-DCMAKE_BUILD_TYPE=RelWithDebInfo
-DIREE_ENABLE_ASAN=ON
-DIREE_ENABLE_TSAN=ON
-DIREE_BYTECODE_MODULE_ENABLE_TSAN=ON
-DIREE_BYTECODE_MODULE_FORCE_LLVM_SYSTEM_LINKER=ON
-DIREE_ENABLE_MSAN=ON

The IREE_ENABLE_*SAN options should be regarded as mutually exclusive. In effect, they are probably overriding each other, passing -fsanitize={address,thread,memory} where the one passed last overrides others. So here, drop -DIREE_ENABLE_ASAN=ON and -DIREE_ENABLE_MSAN=ON.
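
Since the flags may silently override each other, one way to confirm which sanitizer a binary actually ended up with is to ask the compiler from inside the code. The sketch below is an assumption-laden illustration, not part of IREE: it relies on Clang's `__has_feature` checks and on GCC's `__SANITIZE_ADDRESS__` / `__SANITIZE_THREAD__` macros, and the helper names `AppendName` and `ActiveSanitizers` are hypothetical.

```cpp
#include <cassert>
#include <string>

// Clang provides __has_feature; define a fallback for compilers that lack it.
#ifndef __has_feature
#define __has_feature(x) 0
#endif

// Appends `name` to a comma-separated list.
void AppendName(std::string* names, const char* name) {
  assert(names != nullptr && name != nullptr);
  if (!names->empty()) *names += ",";
  *names += name;
}

// Returns the sanitizers this translation unit was compiled with, as a
// comma-separated list, or "none" if no known sanitizer is active.
std::string ActiveSanitizers() {
  std::string names;
#if __has_feature(address_sanitizer) || defined(__SANITIZE_ADDRESS__)
  AppendName(&names, "address");
#endif
#if __has_feature(thread_sanitizer) || defined(__SANITIZE_THREAD__)
  AppendName(&names, "thread");
#endif
#if __has_feature(memory_sanitizer)
  AppendName(&names, "memory");
#endif
  return names.empty() ? "none" : names;
}
```

Printing `ActiveSanitizers()` at startup of the reproducer (for example, the finetune program) would show whether the build really is a TSan build before spending time on a run.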

@bjacob
Contributor

bjacob commented Mar 7, 2023

Interesting! The linker command line from your gist is

/usr/bin/ld -o /var/folders/hd/6q8jftdn7b1fygsrzdkp5ww40000gn/T/gpt2_module_linked_llvm_cpu-dff9a6.so -static -dylib -flat_namespace -L /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/lib -lSystem /var/folders/hd/6q8jftdn7b1fygsrzdkp5ww40000gn/T/gpt2_module_linked_llvm_cpu-dff9a6.o

and it is itself generated by this code: https://github.com/openxla/iree/blob/1148f720be7e267f248e034b3cfb488633884980/compiler/src/iree/compiler/Dialect/HAL/Target/LLVM/internal/UnixLinkerTool.cpp#L82-L92

It looks as if, on Apple platforms, the TSan runtime library needs to be linked in explicitly (?). We need someone with Apple experience here... maybe @powderluv?

@bjacob
Contributor

bjacob commented Mar 7, 2023

Maybe try adding "-fsanitize=thread" to the linker flags (code linked in previous comment). It's suggested in various places, including google/sanitizers#701.

That is, at UnixLinkerTool.cpp:90 (above linked code), add unconditionally

flags.push_back("-fsanitize=thread"); 

If that works, we'll figure out how to do that conditionally.

@wangkuiyi
Contributor Author

clang -v -fsanitize=thread helped me. The following command

clang -fsanitize=thread /tmp/a.c -o /tmp/a

is equivalent to the following two:

clang /tmp/a.c -c -o /tmp/a.o

and

ld /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/lib/clang/14.0.0/lib/darwin/libclang_rt.tsan_osx_dynamic.dylib \
  -rpath @executable_path \
  -rpath /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/lib/clang/14.0.0/lib/darwin \
  /tmp/a.o -o /tmp/a \
  -lSystem -syslibroot /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk
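
Taken together, the two comments above suggest two ways a linker tool could add TSan support conditionally: pass `-fsanitize=thread` when the link goes through the clang driver, or name the TSan runtime dylib and its rpath explicitly when `ld` is invoked directly, as in the `clang -v` output. The following is a minimal sketch under those assumptions. `LinkerDriver` and `AppendTsanLinkFlags` are hypothetical names, this is not IREE's actual UnixLinkerTool code, and it mirrors only the runtime-directory rpath. Whether either variant links successfully alongside the flags IREE already passes (such as `-static -dylib -flat_namespace`) is untested here.

```cpp
#include <cassert>
#include <string>
#include <vector>

// How the final link step is invoked. Hypothetical enum for this sketch.
enum class LinkerDriver { kClang, kLdDirect };

// Appends the extra arguments needed to link TSan-instrumented objects.
// `tsan_runtime_dir` is the toolchain-specific directory that contains
// libclang_rt.tsan_osx_dynamic.dylib; it is only used for kLdDirect.
void AppendTsanLinkFlags(LinkerDriver driver,
                         const std::string& tsan_runtime_dir,
                         std::vector<std::string>* flags) {
  assert(flags != nullptr);
  if (driver == LinkerDriver::kClang) {
    // The clang driver expands this into the runtime library and rpaths.
    flags->push_back("-fsanitize=thread");
    return;
  }
  // Invoking ld directly: spell out what the clang driver would have added.
  assert(!tsan_runtime_dir.empty());
  flags->push_back(tsan_runtime_dir + "/libclang_rt.tsan_osx_dynamic.dylib");
  flags->push_back("-rpath");
  flags->push_back(tsan_runtime_dir);
}
```

A caller would invoke this only when the module is being compiled with `--iree-llvm-sanitize=thread`, leaving the default link line unchanged otherwise.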

@stellaraccident
Collaborator

I suspect that this issue is not coming from generated code. In that case, you may be able to get away with just building the iree runtime with sanitizers and not fiddling with the iree-llvm-sanitize=thread compiler flags.

(It would obviously be good if this all worked better on Apple platforms, so I'm just offering an option that might lead through the maze faster. It is still useful to figure out how to fully enable sanitizers.)

@stellaraccident
Collaborator

Other things that can be done to bisect the area that is having the problem:

  • compile with vmvx (slow but unlikely to crash on generated code)
  • use the dylib-sync vs dylib-task runtime option (uses single threaded mode)

I suggest the last one because that error makes me think there is something going on with the task scheduler in threaded mode. Narrowing down which piece is crashing can help scope debugging activity.

@bjacob
Contributor

bjacob commented Mar 7, 2023

> I suspect that this issue is not coming from generated code. In that case, you may be able to get away with just building the iree runtime with sanitizers and not fiddling with the iree-llvm-sanitize=thread compiler flags.

Agree that this issue does not look like it comes from the generated code.... but TSan specifically (as opposed to other sanitizers) does not allow taking advantage of that in that way, because a TSan-enabled IREE runtime can only call TSan-enabled module code (TSan is an ABI break). Well, it will run, but it will crash.

> compile with vmvx (slow but unlikely to crash on generated code)

Ah good idea, that does enable running a TSan-enabled IREE-runtime without having to get TSan to work in module code. My above objection is specific to llvm-cpu target backend.

> I suggest the last one because that error makes me think there is something going on with the task scheduler in threaded mode. Narrowing down which piece is crashing can help scope debugging activity.

+1

@allieculp

@bjacob @wangkuiyi Looks like this went a bit stale, any further update?

@bjacob
Contributor

bjacob commented Apr 13, 2023

Deferring to @wangkuiyi .

@wangkuiyi
Contributor Author

@allieculp and @bjacob - I got GPT-2 fine-tuning working a month ago, but via @antiagainst's Metal GPU backend. This issue occurs with the CPU backend, but not with the Metal GPU one.
