Assertion `!(srcMmaLayout && dstMmaLayout) && "Unexpected mma -> mma layout conversion"' failed #1298

Open
bertmaher opened this issue Mar 8, 2023 · 11 comments

@bertmaher
Collaborator

Running https://gist.github.com/bertmaher/93302c4f40728d8481873850e84cf47a#file-mm_plus_mm_mlir_assert-py (a kernel generated by inductor in max-autotune mode) fails with:

Assertion `!(srcMmaLayout && dstMmaLayout) && "Unexpected mma -> mma layout conversion"' failed
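
The kernel fuses two independent matmuls into a single inner loop. A rough sketch of that mm_plus_mm pattern (the actual repro is in the gist; the names, strides, and block sizes below are illustrative, and masks are omitted by assuming the sizes divide the blocks):

```
import triton
import triton.language as tl

@triton.jit
def mm_plus_mm_fused(A, B, C, D, Out, K,
                     stride_am, stride_ak, stride_bk, stride_bn,
                     stride_cm, stride_ck, stride_dk, stride_dn,
                     stride_om, stride_on,
                     BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr,
                     BLOCK_K: tl.constexpr):
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    rk = tl.arange(0, BLOCK_K)
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for k in range(0, K, BLOCK_K):
        a = tl.load(A + rm[:, None] * stride_am + (k + rk)[None, :] * stride_ak)
        b = tl.load(B + (k + rk)[:, None] * stride_bk + rn[None, :] * stride_bn)
        c = tl.load(C + rm[:, None] * stride_cm + (k + rk)[None, :] * stride_ck)
        d = tl.load(D + (k + rk)[:, None] * stride_dk + rn[None, :] * stride_dn)
        # Two independent dots feeding one accumulator tile in the same loop:
        # the "fused" form discussed below.
        acc += tl.dot(a, b)
        acc += tl.dot(c, d)
    tl.store(Out + rm[:, None] * stride_om + rn[None, :] * stride_on, acc)
```
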
@ptillet ptillet added the bug label Mar 8, 2023
@ptillet
Collaborator

ptillet commented Mar 8, 2023

We have a couple of these errors. Basically, they happen when the optimizer doesn't do its job well, so the backend hits a slow codepath that hasn't been implemented yet 😅

@ngimel
Contributor

ngimel commented Mar 8, 2023

Is there anything we can fix in the Triton kernel itself to avoid hitting the slow path? Or should we avoid some of the autotuning configs?

@ptillet
Collaborator

ptillet commented Mar 8, 2023

Hmm, I gotta say I don't really see the benefit of having the two matmuls in the same inner loop, since they're independent. This will likely increase register pressure a lot without increasing arithmetic intensity.

@ngimel
Contributor

ngimel commented Mar 8, 2023

We see speed-ups for some small-ish sizes we have, IIRC.

@bertmaher
Collaborator Author

Also, based on the comment in that kernel, the original intent was to have them in two separate inner loops (but sharing the accumulator). I think these are bandwidth-bound because of the small K, so avoiding the read/write round trip for the fused addition ends up being a win.

pytorchmergebot pushed a commit to pytorch/pytorch that referenced this issue Mar 9, 2023
Trim number of tested mm_plus_mm configs to work around triton-lang/triton#1298

Pull Request resolved: #96385
Approved by: https://github.com/bertmaher, https://github.com/jansel
shunting314 added a commit to pytorch/pytorch that referenced this issue Mar 10, 2023
…or max autotuning"


This PR implements a prototype to benchmark max-autotune choices in subprocesses. This way, a crash like triton-lang/triton#1298 only aborts the autotuning child process; the parent process can continue.

There are a few things to note:
- The CUDA runtime does not work with fork, so we have to use spawn to create child processes (a minimal sketch follows this list). See the best practices in the PyTorch multiprocessing notes: https://pytorch.org/docs/stable/notes/multiprocessing.html
- To run a job in a child process, the multiprocessing module needs to pickle both the target function and its arguments and pass them to the child process. This is the major complexity of this prototype, since quite a lot of corner cases make pickle fail.
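
A minimal sketch of the spawn requirement (none of this is inductor's actual code; `benchmark_choice` is a placeholder for compiling and timing one choice):

```
import multiprocessing as mp

def benchmark_choice(choice_id: int) -> float:
    # Placeholder: compile and time one autotune choice, return its latency.
    return 0.001 * choice_id

if __name__ == "__main__":
    # "spawn", not "fork": forking after the CUDA runtime is initialized
    # leaves the child process with broken CUDA state.
    ctx = mp.get_context("spawn")
    with ctx.Pool(processes=1) as pool:
        latency = pool.apply(benchmark_choice, (3,))
    print(latency)
```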

Here are the pickle-related issues I encountered:
- Pickling a StorageBox causes infinite recursion. Error: https://gist.github.com/171e5ab404b7855dee2dfa1d9f093442 . Worked around by pickling the inner buffer.
- IRNode stores fx.Nodes in its origins field, but an fx.Node cannot be pickled; pickling fx.Node.graph fails with the following error: https://gist.github.com/9c289e895d7091d7ec787c67bc3c0d70 . Worked around by skipping origins when pickling an IRNode.
- The jinja Template in TritonTemplateKernel cannot be pickled: `TypeError: Template.__new__() missing 1 required positional argument: 'source'`. Worked around by pickling the source rather than the jinja Template and rebuilding the template during unpickling (see the sketch after this list).
- Due to how select_algorithm.template_kernels is populated, it is empty in the child process. Worked around by passing select_algorithm.template_kernels from the parent process to the child process directly.
  - There are some changes in TritonTemplate.generate to make a TritonTemplateKernel pickle'able. A TritonTemplate is referred to in the closure for a TritonTemplateKernel object.
- We cannot pass the choice to the child process directly because pickling fails on the lambda/local function being used. However, cloudpickle can handle lambdas. Worked around by passing the cloudpickle'd choice object to the child process; the child process needs to unpickle it explicitly (also sketched below).
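
Minimal sketches of two of the workarounds above (class and variable names are illustrative, not inductor's actual code):

```
import pickle

import cloudpickle
from jinja2 import Template

class PickleableTemplate:
    """Pickles the template source string instead of the compiled Template."""

    def __init__(self, source):
        self.source = source
        self.template = Template(source)

    def __getstate__(self):
        return {"source": self.source}  # drop the unpicklable Template

    def __setstate__(self, state):
        self.source = state["source"]
        self.template = Template(self.source)  # rebuild on unpickle

t = pickle.loads(pickle.dumps(PickleableTemplate("hi {{ name }}")))
assert t.template.render(name="triton") == "hi triton"

# cloudpickle handles the lambdas that plain pickle rejects; the child
# process unpickles the payload explicitly.
payload = cloudpickle.dumps(lambda: "time this choice")
assert cloudpickle.loads(payload)() == "time this choice"
```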

Test:
```
python test/inductor/test_max_autotune.py -k test_max_autotune_mm_plus_mm
```
This is basically the repro I got from Bert Maher.


cc soumith voznesenskym penguinwu anijain2305 EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng Xia-Weiwen wenzhe-nrv jiayisunx peterbell10 desertfire

ydwu4 pushed a commit to ydwu4/pytorch that referenced this issue Mar 10, 2023
shunting314 added a commit to pytorch/pytorch that referenced this issue Mar 10, 2023
…or max autotuning"

Benchmarking in a subprocess is about 4x slower than benchmarking in the same process. Without doing any profiling, I suspect the time goes to starting a new process and doing initialization. Some ~~thread~~ process pool may help.

```
AUTOTUNE ref_mm_plus_mm(2048x64, 64x1536, 2048x64, 64x1536)
  triton_mm_plus_mm_0 0.0276s 100.0%
  triton_mm_plus_mm_6 0.0287s 96.4%
  triton_mm_plus_mm_5 0.0317s 87.1%
  triton_mm_plus_mm_1 0.0328s 84.4%
  ref_mm_plus_mm 0.0379s 73.0%
  triton_mm_plus_mm_7 0.0379s 73.0%
  triton_mm_plus_mm_2 0.0399s 69.2%
  triton_mm_plus_mm_3 0.0410s 67.5%
  triton_mm_plus_mm_4 0.0410s 67.5%
AUTOTUNE takes 12.001659393310547 seconds

AUTOTUNE ref_mm_plus_mm(2048x64, 64x1536, 2048x64, 64x1536)
  triton_mm_plus_mm_0 0.0276s 100.0%
  triton_mm_plus_mm_6 0.0287s 96.4%
  triton_mm_plus_mm_1 0.0317s 87.1%
  triton_mm_plus_mm_5 0.0317s 87.1%
  ref_mm_plus_mm 0.0379s 73.0%
  triton_mm_plus_mm_7 0.0389s 71.1%
  triton_mm_plus_mm_2 0.0399s 69.2%
  triton_mm_plus_mm_3 0.0410s 67.5%
  triton_mm_plus_mm_4 0.0410s 67.5%
AUTOTUNE takes 51.39659810066223 seconds
``` 

The feature is disabled by default and can be enabled by setting the following config or environment variable:
```
autotune_in_subproc = os.environ.get("TORCHINDUCTOR_AUTOTUNE_IN_SUBPROC") == "1"
```


cc soumith voznesenskym penguinwu anijain2305 EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng Xia-Weiwen wenzhe-nrv jiayisunx peterbell10 desertfire

Differential Revision: [D43996048](https://our.internmc.facebook.com/intern/diff/D43996048)

@ptillet
Collaborator

ptillet commented Mar 11, 2023

I don't really see what the drawback of doing loop fission would be here 🤔 Even in bandwidth-bound regimes, spilling can hurt a ton, and it's much more likely to happen with the fused loop, since it's harder for ptxas to optimize.

So the workaround I propose to avoid the slow path is to have two loops that share the same accumulator.
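
A sketch of that two-loop shape, in the same illustrative terms as the kernel sketched in the issue description (again, names, strides, and block sizes are assumptions, masks omitted):

```
import triton
import triton.language as tl

@triton.jit
def mm_plus_mm_fissioned(A, B, C, D, Out, K,
                         stride_am, stride_ak, stride_bk, stride_bn,
                         stride_cm, stride_ck, stride_dk, stride_dn,
                         stride_om, stride_on,
                         BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr,
                         BLOCK_K: tl.constexpr):
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    rk = tl.arange(0, BLOCK_K)
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    # First loop: accumulate A @ B.
    for k in range(0, K, BLOCK_K):
        a = tl.load(A + rm[:, None] * stride_am + (k + rk)[None, :] * stride_ak)
        b = tl.load(B + (k + rk)[:, None] * stride_bk + rn[None, :] * stride_bn)
        acc += tl.dot(a, b)
    # Second loop: accumulate C @ D into the same accumulator tile.
    for k in range(0, K, BLOCK_K):
        c = tl.load(C + rm[:, None] * stride_cm + (k + rk)[None, :] * stride_ck)
        d = tl.load(D + (k + rk)[:, None] * stride_dk + rn[None, :] * stride_dn)
        acc += tl.dot(c, d)
    tl.store(Out + rm[:, None] * stride_om + rn[None, :] * stride_on, acc)
```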

@bertmaher
Collaborator Author

We took your advice and separated the loops, which is better and doesn’t crash. But the “fused” loops probably still shouldn’t crash, right?

@ngimel
Contributor

ngimel commented Mar 11, 2023

Two loops still error out in the same way when BLOCK_K is equal to K and they effectively become one loop.
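
To make the degenerate case concrete with assumed numbers:

```
K, BLOCK_K = 64, 64  # illustrative sizes
# Each K-loop runs exactly once, so the two dots become adjacent again,
# just as in the fused single-loop form.
assert list(range(0, K, BLOCK_K)) == [0]
```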

@ptillet
Collaborator

ptillet commented Mar 11, 2023

Yeah, it definitely shouldn't crash :) Just wanted to unblock you, since we probably won't have time to look at this issue for a few weeks.

@ngimel
Contributor

ngimel commented Mar 11, 2023

Yeah, we unblocked ourselves in pytorch/pytorch#96385, thanks!

cyyever pushed a commit to cyyever/pytorch_private that referenced this issue Mar 12, 2023
ydwu4 added a commit to ydwu4/pytorch that referenced this issue Mar 13, 2023
Trim number of tested mm_plus_mm configs to work around triton-lang/triton#1298

Pull Request resolved: pytorch#96385
Approved by: https://github.com/bertmaher, https://github.com/jansel
shunting314 added a commit to pytorch/pytorch that referenced this issue Mar 13, 2023
…or max autotuning"


This PR implements a prototype to benchmark max-autotune choices in subprocesses. This way crash like triton-lang/triton#1298 will only abort the autotuning child process but the parent process can continue.

There are a few things to note:
- cuda runtime does not work with fork. So we have to use spawn to create child processes. Check the best practice from pytorch multithreading module: https://pytorch.org/docs/stable/notes/multiprocessing.html
- to run a job in a child process, the multiprocessing module needs to pickle both the target function and arguments and pass them to child process. This is the major complexity of this prototype since there are quite a lot of corner cases making pickle fail.

Here I list the pickle related issues I encountered:
- pickle a StorageBox cause infinite recursion. Error: https://gist.github.com/171e5ab404b7855dee2dfa1d9f093442 . Work around by pickle the inner buffer.
- IRNode store fx.Node's in its origin fields. However, we can not pickle a fx.Node. It fails when with the following error when picking the fx.Node.graph: https://gist.github.com/9c289e895d7091d7ec787c67bc3c0d70. Work around by skip origins when pickling a IRNode.
- jinja Template in TritonTemplateKernel can not be pickled: `TypeError: Template.__new__() missing 1 required positional argument: 'source' `. Workaround by pickle the source rather than jinjia Template. During unpickling, rebuild the jinja template.
- due to how select_algorithm.template_kernels is populated, in child process, it's empty. Work around by passing select_algorithm.template_kernels from parent process to child process directly.
  - There is some change in TritonTemplate.generate to make a TritonTemplateKernel pickle'able. A TritonTemplate is refered to in the closure for a TritonTemplateKernel object.
- We can not pass choice to child process directly because of pickle failure for lambda/local function being used. However cloudpickle can handle lambda. Work around by passing the cloudpickle'd choice object to child process. The child project need to unpickle it explictly.

Test:
```
python test/inductor/test_max_autotune.py -k test_max_autotune_mm_plus_mm
```
This is basically the repro I get from Bert Maher.


Benchmark in sub process is about 4x slower than benchmark in the same process. Without doing any profiling, I feel the time may be cost by starting a new process and doing initialization. Some ~thread~ process pool may help.

```
AUTOTUNE ref_mm_plus_mm(2048x64, 64x1536, 2048x64, 64x1536)
  triton_mm_plus_mm_0 0.0276s 100.0%
  triton_mm_plus_mm_6 0.0287s 96.4%
  triton_mm_plus_mm_5 0.0317s 87.1%
  triton_mm_plus_mm_1 0.0328s 84.4%
  ref_mm_plus_mm 0.0379s 73.0%
  triton_mm_plus_mm_7 0.0379s 73.0%
  triton_mm_plus_mm_2 0.0399s 69.2%
  triton_mm_plus_mm_3 0.0410s 67.5%
  triton_mm_plus_mm_4 0.0410s 67.5%
AUTOTUNE takes 12.001659393310547 seconds

AUTOTUNE ref_mm_plus_mm(2048x64, 64x1536, 2048x64, 64x1536)
  triton_mm_plus_mm_0 0.0276s 100.0%
  triton_mm_plus_mm_6 0.0287s 96.4%
  triton_mm_plus_mm_1 0.0317s 87.1%
  triton_mm_plus_mm_5 0.0317s 87.1%
  ref_mm_plus_mm 0.0379s 73.0%
  triton_mm_plus_mm_7 0.0389s 71.1%
  triton_mm_plus_mm_2 0.0399s 69.2%
  triton_mm_plus_mm_3 0.0410s 67.5%
  triton_mm_plus_mm_4 0.0410s 67.5%
AUTOTUNE takes 51.39659810066223 seconds
``` 

The feature is disabled by default and can be enabled by setting the following config or envvar:
```
autotune_in_subproc = os.environ.get("TORCHINDUCTOR_AUTOTUNE_IN_SUBPROC") == "1"
```


cc soumith voznesenskym penguinwu anijain2305 EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng Xia-Weiwen wenzhe-nrv jiayisunx peterbell10 desertfire

Differential Revision: [D43996048](https://our.internmc.facebook.com/intern/diff/D43996048)

[ghstack-poisoned]
shunting314 added a commit to pytorch/pytorch that referenced this issue Mar 13, 2023
This PR implements a prototype to benchmark max-autotune choices in subprocesses. This way crash like triton-lang/triton#1298 will only abort the autotuning child process but the parent process can continue.

There are a few things to note:
- cuda runtime does not work with fork. So we have to use spawn to create child processes. Check the best practice from pytorch multithreading module: https://pytorch.org/docs/stable/notes/multiprocessing.html
- to run a job in a child process, the multiprocessing module needs to pickle both the target function and arguments and pass them to child process. This is the major complexity of this prototype since there are quite a lot of corner cases making pickle fail.

Here I list the pickle related issues I encountered:
- pickle a StorageBox cause infinite recursion. Error: https://gist.github.com/171e5ab404b7855dee2dfa1d9f093442 . Work around by pickle the inner buffer.
- IRNode store fx.Node's in its origin fields. However, we can not pickle a fx.Node. It fails when with the following error when picking the fx.Node.graph: https://gist.github.com/9c289e895d7091d7ec787c67bc3c0d70. Work around by skip origins when pickling a IRNode.
- jinja Template in TritonTemplateKernel can not be pickled: `TypeError: Template.__new__() missing 1 required positional argument: 'source' `. Workaround by pickle the source rather than jinjia Template. During unpickling, rebuild the jinja template.
- due to how select_algorithm.template_kernels is populated, in child process, it's empty. Work around by passing select_algorithm.template_kernels from parent process to child process directly.
  - There is some change in TritonTemplate.generate to make a TritonTemplateKernel pickle'able. A TritonTemplate is refered to in the closure for a TritonTemplateKernel object.
- We can not pass choice to child process directly because of pickle failure for lambda/local function being used. However cloudpickle can handle lambda. Work around by passing the cloudpickle'd choice object to child process. The child project need to unpickle it explictly.

Test:
```
python test/inductor/test_max_autotune.py -k test_max_autotune_mm_plus_mm
```
This is basically the repro I get from Bert Maher.


Benchmark in sub process is about 4x slower than benchmark in the same process. Without doing any profiling, I feel the time may be cost by starting a new process and doing initialization. Some ~thread~ process pool may help.

```
AUTOTUNE ref_mm_plus_mm(2048x64, 64x1536, 2048x64, 64x1536)
  triton_mm_plus_mm_0 0.0276s 100.0%
  triton_mm_plus_mm_6 0.0287s 96.4%
  triton_mm_plus_mm_5 0.0317s 87.1%
  triton_mm_plus_mm_1 0.0328s 84.4%
  ref_mm_plus_mm 0.0379s 73.0%
  triton_mm_plus_mm_7 0.0379s 73.0%
  triton_mm_plus_mm_2 0.0399s 69.2%
  triton_mm_plus_mm_3 0.0410s 67.5%
  triton_mm_plus_mm_4 0.0410s 67.5%
AUTOTUNE takes 12.001659393310547 seconds

AUTOTUNE ref_mm_plus_mm(2048x64, 64x1536, 2048x64, 64x1536)
  triton_mm_plus_mm_0 0.0276s 100.0%
  triton_mm_plus_mm_6 0.0287s 96.4%
  triton_mm_plus_mm_1 0.0317s 87.1%
  triton_mm_plus_mm_5 0.0317s 87.1%
  ref_mm_plus_mm 0.0379s 73.0%
  triton_mm_plus_mm_7 0.0389s 71.1%
  triton_mm_plus_mm_2 0.0399s 69.2%
  triton_mm_plus_mm_3 0.0410s 67.5%
  triton_mm_plus_mm_4 0.0410s 67.5%
AUTOTUNE takes 51.39659810066223 seconds
``` 

The feature is disabled by default and can be enabled by setting the following config or envvar:
```
autotune_in_subproc = os.environ.get("TORCHINDUCTOR_AUTOTUNE_IN_SUBPROC") == "1"
```


cc soumith voznesenskym penguinwu anijain2305 EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng Xia-Weiwen wenzhe-nrv jiayisunx peterbell10 desertfire

Differential Revision: [D43996048](https://our.internmc.facebook.com/intern/diff/D43996048)

[ghstack-poisoned]
pytorchmergebot pushed a commit to pytorch/pytorch that referenced this issue Mar 18, 2023
This PR implements support for benchmarking max-autotune choices in subprocesses, so that a crash like triton-lang/triton#1298 only aborts the autotuning child process while the parent process continues.


Differential Revision: [D43996048](https://our.internmc.facebook.com/intern/diff/D43996048)
Pull Request resolved: #96410
Approved by: https://github.com/jansel
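The isolation mechanism can be sketched with plain multiprocessing (a toy illustration only, not the Inductor implementation; `bench_in_subproc` and `_bench` are made-up names, and the real feature ships the pickled kernel/choice to the child):
```
import multiprocessing as mp

def _bench(choice, out_q):
    # A real worker would rebuild the pickled kernel/choice here and time it;
    # this stand-in just returns a fake timing.
    out_q.put((choice, 0.0276))

def bench_in_subproc(choice):
    ctx = mp.get_context("spawn")  # fork is unsafe once CUDA is initialized
    out_q = ctx.Queue()
    p = ctx.Process(target=_bench, args=(choice, out_q))
    p.start()
    p.join()
    if p.exitcode != 0:
        # The child died hard (e.g. a native assertion in the Triton
        # backend); treat this choice as failed and keep going.
        return None
    return out_q.get()

if __name__ == "__main__":
    print(bench_in_subproc("triton_mm_plus_mm_0"))
```
Checking `exitcode` rather than catching an exception is what lets a hard native crash, which no Python `try` block can intercept, be treated as just another failed choice.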
@vedantroy
Contributor

@bertmaher I'm working on fixing this issue; do you have any reproductions that crash on HEAD?
