[torch-mlir] bump to llvm/llvm-project@9b78ddf3b2abfb3e #3491

Merged (7 commits) on Jun 28, 2024

Contributor

@aartbik aartbik commented Jun 22, 2024

This bump triggered an upstream assert. Includes a workaround (WAR) for #3506.

Also includes several things I needed to do to repro:

  • When TORCH_MLIR_TEST_CONCURRENCY=1, test runs will be printed.
  • Added TORCH_MLIR_TEST_VERBOSE=1 handling to enable verbose mode (useful on CI).
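For readers who want to picture how the two environment variables above might be consumed, here is a minimal sketch. It is not the code from this PR: `read_test_env` and the returned keys are hypothetical, and the exact semantics (for example, that only the value `1` enables verbose mode) are assumptions.

```python
import os


def read_test_env(environ=None):
    """Hypothetical helper: derive test-runner settings from the environment."""
    env = os.environ if environ is None else environ

    # Assumption: an unset or empty value means "let the runner pick a default".
    raw = env.get("TORCH_MLIR_TEST_CONCURRENCY")
    concurrency = int(raw) if raw else None

    # Assumption: verbose mode is switched on only by the exact value "1".
    verbose = env.get("TORCH_MLIR_TEST_VERBOSE") == "1"

    return {
        "concurrency": concurrency,
        # With a concurrency of 1, each test run is printed.
        "print_test_runs": concurrency == 1,
        "verbose": verbose,
    }
```

Passing a plain dict instead of `os.environ` keeps the helper easy to exercise without touching the real environment.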

@aartbik aartbik requested a review from PeimingLiu June 22, 2024 01:49
@Max191
Contributor

Max191 commented Jun 27, 2024

@aartbik @PeimingLiu is there any progress on this bump? Do you need someone to pick it up?

@PeimingLiu
Member

> @aartbik @PeimingLiu is there any progress on this bump? Do you need someone to pick it up?

Yes, please! I cannot reproduce the error locally.

@stellaraccident stellaraccident self-requested a review June 27, 2024 18:09
Collaborator

@stellaraccident stellaraccident left a comment

lgtm. retriggered CI. Looked like maybe infra flake.

@stellaraccident
Collaborator

Not an infra issue but some crash in the TOSA test suite. Darn - will require triage.

@stellaraccident
Collaborator

Narrowed down to ViewDynamicExpandCollapseWithParallelUnknownDimModule_basic emitting an error. A new assert was added in LLVM to ensure all errors were handled.
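To illustrate the kind of check described here, the following is a pure-Python analogue of a scoped error capture that asserts on exit if any captured error was never consumed. It is only a sketch: the real check lives in the C++ Python bindings, and the `ErrorCapture` class, `report`, and `take` below are illustrative stand-ins rather than the MLIR API.

```python
class ErrorCapture:
    """Illustrative stand-in: collects errors reported within a scope."""

    def __init__(self):
        self.errors = []

    def __enter__(self):
        return self

    def report(self, message):
        # Record an error emitted while the capture is active.
        self.errors.append(message)

    def take(self):
        # Hand the captured errors to the caller and mark them as handled.
        taken, self.errors = self.errors, []
        return taken

    def __exit__(self, exc_type, exc, tb):
        # Strict mode: leaving the scope with unconsumed errors is a bug.
        assert not self.errors, "unhandled captured errors"
        return False
```

A scope that calls `take()` before exiting leaves cleanly. A scope that reports an error and never consumes it trips the assertion, which is the situation an emitting test can run into when the surrounding code does not collect the error.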

@stellaraccident
Collaborator

Debugging instructions:

$ sudo apt install python3-dbg
$ gdb --args python -m e2e_testing.main --config tosa --filter ViewDynamicExpandCollapseWithParallelUnknownDimModule_basic

... assert ...
(gdb) bt
(gdb) py-bt
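The same steps could also be scripted for a non-interactive run. The sketch below only builds the command line. `build_gdb_command` is a hypothetical helper, and it assumes gdb's `--batch` and `-ex` options are used to run the program and then issue `bt` and `py-bt` once it stops on the assert. `py-bt` is only available when the CPython gdb extensions are loaded.

```python
import shlex


def build_gdb_command(test_filter, config="tosa"):
    """Hypothetical helper: non-interactive gdb invocation for one e2e test."""
    return [
        "gdb", "--batch",
        "-ex", "run",    # start the test under gdb
        "-ex", "bt",     # native backtrace once the process stops
        "-ex", "py-bt",  # Python-level backtrace (needs the CPython gdb hooks)
        "--args", "python", "-m", "e2e_testing.main",
        "--config", config,
        "--filter", test_filter,
    ]


if __name__ == "__main__":
    cmd = build_gdb_command(
        "ViewDynamicExpandCollapseWithParallelUnknownDimModule_basic"
    )
    # Print a copy-pasteable shell command; hand `cmd` to subprocess.run to execute it.
    print(shlex.join(cmd))
```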

Native and Python stack:

#5  0x00007ffff7c2881b in __assert_fail_base (fmt=0x7ffff7dd01e8 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n",
    assertion=assertion@entry=0x7fff3386b029 "errors.empty() && \"unhandled captured errors\"",
    file=file@entry=0x7fff3386cbea "/home/stella/src/torch-mlir/externals/llvm-project/mlir/lib/Bindings/Python/IRModule.h", line=line@entry=434,
    function=function@entry=0x7fff3386155e "mlir::python::PyMlirContext::ErrorCapture::~ErrorCapture()") at ./assert/assert.c:94
#6  0x00007ffff7c3b507 in __assert_fail (assertion=0x7fff3386b029 "errors.empty() && \"unhandled captured errors\"",
    file=0x7fff3386cbea "/home/stella/src/torch-mlir/externals/llvm-project/mlir/lib/Bindings/Python/IRModule.h", line=434,
    function=0x7fff3386155e "mlir::python::PyMlirContext::ErrorCapture::~ErrorCapture()") at ./assert/assert.c:103

Traceback (most recent call first):
  <built-in method run of PyCapsule object at remote 0x7fff9c77e3a0>
  File "/home/stella/src/torch-mlir/build/tools/torch-mlir/python_packages/torch_mlir/torch_mlir/compiler_utils.py", line 47, in run_pipeline_with_repro_report
    pm.run(module.operation)
  File "/home/stella/src/torch-mlir/build/tools/torch-mlir/python_packages/torch_mlir/torch_mlir_e2e_test/linalg_on_tensors_backends/refbackend.py", line 227, in compile
    run_pipeline_with_repro_report(
  File "/home/stella/src/torch-mlir/build/tools/torch-mlir/python_packages/torch_mlir/torch_mlir_e2e_test/tosa_backends/linalg_on_tensors.py", line 70, in compile
    return self.refbackend.compile(imported_module)
  File "/home/stella/src/torch-mlir/build/tools/torch-mlir/python_packages/torch_mlir/torch_mlir_e2e_test/configs/tosa_backend.py", line 42, in compile
    return self.backend.compile(module)
  File "/home/stella/src/torch-mlir/build/tools/torch-mlir/python_packages/torch_mlir/torch_mlir_e2e_test/framework.py", line 313, in compile_and_run_test
    compiled = config.compile(test.program_factory(), verbose=verbose)
  File "/home/stella/src/torch-mlir/build/tools/torch-mlir/python_packages/torch_mlir/torch_mlir_e2e_test/framework.py", line 390, in run_tests
    compile_and_run_test(test, config, verbose)
  File "/home/stella/src/torch-mlir/projects/pt1/e2e_testing/main.py", line 231, in main
    results = run_tests(tests, config, args.sequential, args.verbose)
  File "/home/stella/src/torch-mlir/projects/pt1/e2e_testing/main.py", line 258, in <module>
    main()

Isolated to some faulty pass error-handling code fouling things up. I think this is masking a legitimate bug, but it was causing a test that was supposed to just be XFAIL to crash the test framework on a native assert. Added a Python-level WAR until I can land a proper fix upstream (which will just issue a warning when diagnostics were dropped).
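As a rough picture of the "warn instead of assert" behavior mentioned for the upstream fix, here is a small Python sketch. It is not the actual fix or the workaround from this PR. `WarnOnDroppedDiagnostics` and its methods are hypothetical names, and the real change would live in the C++ bindings.

```python
import warnings


class WarnOnDroppedDiagnostics:
    """Hypothetical scope that warns, rather than aborts, on dropped diagnostics."""

    def __init__(self):
        self.diagnostics = []

    def __enter__(self):
        return self

    def report(self, message):
        # Record a diagnostic emitted while the scope is active.
        self.diagnostics.append(message)

    def take(self):
        # Hand the diagnostics to the caller and mark them as handled.
        taken, self.diagnostics = self.diagnostics, []
        return taken

    def __exit__(self, exc_type, exc, tb):
        if self.diagnostics:
            # Surface the problem without taking down the whole process.
            warnings.warn(
                f"{len(self.diagnostics)} diagnostic(s) were dropped: "
                + "; ".join(self.diagnostics),
                RuntimeWarning,
            )
            self.diagnostics.clear()
        return False
```

With this shape, a test that is expected to fail can still be reported as XFAIL by the framework, while the dropped diagnostics remain visible as a warning.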

@stellaraccident stellaraccident merged commit 1f73895 into llvm:main Jun 28, 2024
3 checks passed
@aartbik
Contributor Author

aartbik commented Jun 28, 2024

Thanks for your help with this!

@aartbik aartbik deleted the bik branch June 28, 2024 17:23