Assertion `!(srcMmaLayout && dstMmaLayout) && "Unexpected mma -> mma layout conversion"' failed #1298

Open
bertmaher opened this issue Mar 8, 2023 · 11 comments

@bertmaher
Collaborator

Running https://gist.github.com/bertmaher/93302c4f40728d8481873850e84cf47a#file-mm_plus_mm_mlir_assert-py (a kernel generated by inductor in max-autotune mode) fails with:

Assertion `!(srcMmaLayout && dstMmaLayout) && "Unexpected mma -> mma layout conversion"' failed
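
The kernel fuses two independent matmuls into a single inner loop. A rough sketch of that mm_plus_mm pattern (the actual repro is in the gist; the names, strides, and block sizes below are illustrative, and masks are omitted by assuming the sizes divide the blocks):

```
import triton
import triton.language as tl

@triton.jit
def mm_plus_mm_fused(A, B, C, D, Out, K,
                     stride_am, stride_ak, stride_bk, stride_bn,
                     stride_cm, stride_ck, stride_dk, stride_dn,
                     stride_om, stride_on,
                     BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr,
                     BLOCK_K: tl.constexpr):
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    rk = tl.arange(0, BLOCK_K)
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for k in range(0, K, BLOCK_K):
        a = tl.load(A + rm[:, None] * stride_am + (k + rk)[None, :] * stride_ak)
        b = tl.load(B + (k + rk)[:, None] * stride_bk + rn[None, :] * stride_bn)
        c = tl.load(C + rm[:, None] * stride_cm + (k + rk)[None, :] * stride_ck)
        d = tl.load(D + (k + rk)[:, None] * stride_dk + rn[None, :] * stride_dn)
        # Two independent dots feeding one accumulator tile in the same loop:
        # the "fused" form discussed below.
        acc += tl.dot(a, b)
        acc += tl.dot(c, d)
    tl.store(Out + rm[:, None] * stride_om + rn[None, :] * stride_on, acc)
```
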
@ptillet ptillet added the bug label Mar 8, 2023
@ptillet
Collaborator

ptillet commented Mar 8, 2023

We have a couple of these errors. Basically, they happen when the optimizer doesn't do its job well, so the backend hits a slow codepath that hasn't been implemented yet 😅

@ngimel
Contributor

ngimel commented Mar 8, 2023

Is there anything we can fix in the Triton kernel itself to avoid hitting the slow path? Or should we avoid some of the autotuning configs?

@ptillet
Collaborator

ptillet commented Mar 8, 2023

Hmm, I gotta say I don't really see the benefit of having the two matmuls in the same inner loop, since they're independent. This will likely increase register pressure a lot without increasing arithmetic intensity.

@ngimel
Contributor

ngimel commented Mar 8, 2023

We see speed-ups for some small-ish sizes we have, IIRC.

@bertmaher
Collaborator Author

Also, based on the comment in that kernel, the original intent was to have them in two separate inner loops (but sharing the accumulator). I think these are bandwidth-bound because of the small K, so avoiding the read/write round trip for the fused addition ends up being a win.

pytorchmergebot pushed a commit to pytorch/pytorch that referenced this issue Mar 9, 2023
Trim number of tested mm_plus_mm configs to work around triton-lang/triton#1298

Pull Request resolved: #96385
Approved by: https://github.com/bertmaher, https://github.com/jansel
shunting314 added a commit to pytorch/pytorch that referenced this issue Mar 10, 2023
…or max autotuning"


This PR implements a prototype to benchmark max-autotune choices in subprocesses. This way, a crash like triton-lang/triton#1298 only aborts the autotuning child process; the parent process can continue.

There are a few things to note:
- The CUDA runtime does not work with fork, so we have to use spawn to create child processes (a minimal sketch follows this list). See the best practices in the PyTorch multiprocessing notes: https://pytorch.org/docs/stable/notes/multiprocessing.html
- To run a job in a child process, the multiprocessing module needs to pickle both the target function and its arguments and pass them to the child process. This is the major complexity of this prototype, since quite a lot of corner cases make pickle fail.
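
A minimal sketch of the spawn requirement (none of this is inductor's actual code; `benchmark_choice` is a placeholder for compiling and timing one choice):

```
import multiprocessing as mp

def benchmark_choice(choice_id: int) -> float:
    # Placeholder: compile and time one autotune choice, return its latency.
    return 0.001 * choice_id

if __name__ == "__main__":
    # "spawn", not "fork": forking after the CUDA runtime is initialized
    # leaves the child process with broken CUDA state.
    ctx = mp.get_context("spawn")
    with ctx.Pool(processes=1) as pool:
        latency = pool.apply(benchmark_choice, (3,))
    print(latency)
```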

Here are the pickle-related issues I encountered:
- Pickling a StorageBox causes infinite recursion. Error: https://gist.github.com/171e5ab404b7855dee2dfa1d9f093442 . Worked around by pickling the inner buffer.
- IRNode stores fx.Nodes in its origins field, but an fx.Node cannot be pickled; pickling fx.Node.graph fails with the following error: https://gist.github.com/9c289e895d7091d7ec787c67bc3c0d70 . Worked around by skipping origins when pickling an IRNode.
- The jinja Template in TritonTemplateKernel cannot be pickled: `TypeError: Template.__new__() missing 1 required positional argument: 'source'`. Worked around by pickling the source rather than the jinja Template and rebuilding the template during unpickling (see the sketch after this list).
- Due to how select_algorithm.template_kernels is populated, it is empty in the child process. Worked around by passing select_algorithm.template_kernels from the parent process to the child process directly.
  - There are some changes in TritonTemplate.generate to make a TritonTemplateKernel pickle'able. A TritonTemplate is referred to in the closure for a TritonTemplateKernel object.
- We cannot pass the choice to the child process directly because pickling fails on the lambda/local function being used. However, cloudpickle can handle lambdas. Worked around by passing the cloudpickle'd choice object to the child process; the child process needs to unpickle it explicitly (also sketched below).
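
Minimal sketches of two of the workarounds above (class and variable names are illustrative, not inductor's actual code):

```
import pickle

import cloudpickle
from jinja2 import Template

class PickleableTemplate:
    """Pickles the template source string instead of the compiled Template."""

    def __init__(self, source):
        self.source = source
        self.template = Template(source)

    def __getstate__(self):
        return {"source": self.source}  # drop the unpicklable Template

    def __setstate__(self, state):
        self.source = state["source"]
        self.template = Template(self.source)  # rebuild on unpickle

t = pickle.loads(pickle.dumps(PickleableTemplate("hi {{ name }}")))
assert t.template.render(name="triton") == "hi triton"

# cloudpickle handles the lambdas that plain pickle rejects; the child
# process unpickles the payload explicitly.
payload = cloudpickle.dumps(lambda: "time this choice")
assert cloudpickle.loads(payload)() == "time this choice"
```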

Test:
```
python test/inductor/test_max_autotune.py -k test_max_autotune_mm_plus_mm
```
This is basically the repro I got from Bert Maher.


cc soumith voznesenskym penguinwu anijain2305 EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng Xia-Weiwen wenzhe-nrv jiayisunx peterbell10 desertfire

ydwu4 pushed a commit to ydwu4/pytorch that referenced this issue Mar 10, 2023
shunting314 added a commit to pytorch/pytorch that referenced this issue Mar 10, 2023
…or max autotuning"

Benchmarking in a subprocess is about 4x slower than benchmarking in the same process. Without doing any profiling, I suspect the time goes to starting a new process and doing initialization. Some ~~thread~~ process pool may help.

```
AUTOTUNE ref_mm_plus_mm(2048x64, 64x1536, 2048x64, 64x1536)
  triton_mm_plus_mm_0 0.0276s 100.0%
  triton_mm_plus_mm_6 0.0287s 96.4%
  triton_mm_plus_mm_5 0.0317s 87.1%
  triton_mm_plus_mm_1 0.0328s 84.4%
  ref_mm_plus_mm 0.0379s 73.0%
  triton_mm_plus_mm_7 0.0379s 73.0%
  triton_mm_plus_mm_2 0.0399s 69.2%
  triton_mm_plus_mm_3 0.0410s 67.5%
  triton_mm_plus_mm_4 0.0410s 67.5%
AUTOTUNE takes 12.001659393310547 seconds

AUTOTUNE ref_mm_plus_mm(2048x64, 64x1536, 2048x64, 64x1536)
  triton_mm_plus_mm_0 0.0276s 100.0%
  triton_mm_plus_mm_6 0.0287s 96.4%
  triton_mm_plus_mm_1 0.0317s 87.1%
  triton_mm_plus_mm_5 0.0317s 87.1%
  ref_mm_plus_mm 0.0379s 73.0%
  triton_mm_plus_mm_7 0.0389s 71.1%
  triton_mm_plus_mm_2 0.0399s 69.2%
  triton_mm_plus_mm_3 0.0410s 67.5%
  triton_mm_plus_mm_4 0.0410s 67.5%
AUTOTUNE takes 51.39659810066223 seconds
``` 

The feature is disabled by default and can be enabled by setting the following config or environment variable:
```
autotune_in_subproc = os.environ.get("TORCHINDUCTOR_AUTOTUNE_IN_SUBPROC") == "1"
```


cc soumith voznesenskym penguinwu anijain2305 EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng Xia-Weiwen wenzhe-nrv jiayisunx peterbell10 desertfire

Differential Revision: [D43996048](https://our.internmc.facebook.com/intern/diff/D43996048)

@ptillet
Collaborator

ptillet commented Mar 11, 2023

I don't really see what the drawback of doing loop fission would be here 🤔 Even in bandwidth-bound regimes, spilling can hurt a ton, and it's much more likely to happen with the fused loop, since it's harder for ptxas to optimize.

So the workaround I propose to avoid the slow path is to have two loops that share the same accumulator.
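
A sketch of that two-loop shape, in the same illustrative terms as the kernel sketched in the issue description (again, names, strides, and block sizes are assumptions, masks omitted):

```
import triton
import triton.language as tl

@triton.jit
def mm_plus_mm_fissioned(A, B, C, D, Out, K,
                         stride_am, stride_ak, stride_bk, stride_bn,
                         stride_cm, stride_ck, stride_dk, stride_dn,
                         stride_om, stride_on,
                         BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr,
                         BLOCK_K: tl.constexpr):
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    rk = tl.arange(0, BLOCK_K)
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    # First loop: accumulate A @ B.
    for k in range(0, K, BLOCK_K):
        a = tl.load(A + rm[:, None] * stride_am + (k + rk)[None, :] * stride_ak)
        b = tl.load(B + (k + rk)[:, None] * stride_bk + rn[None, :] * stride_bn)
        acc += tl.dot(a, b)
    # Second loop: accumulate C @ D into the same accumulator tile.
    for k in range(0, K, BLOCK_K):
        c = tl.load(C + rm[:, None] * stride_cm + (k + rk)[None, :] * stride_ck)
        d = tl.load(D + (k + rk)[:, None] * stride_dk + rn[None, :] * stride_dn)
        acc += tl.dot(c, d)
    tl.store(Out + rm[:, None] * stride_om + rn[None, :] * stride_on, acc)
```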

@bertmaher
Collaborator Author

We took your advice and separated the loops, which is better and doesn’t crash. But the “fused” loops probably still shouldn’t crash, right?

@ngimel
Contributor

ngimel commented Mar 11, 2023

Two loops still error out in the same way when BLOCK_K is equal to K and they effectively become one loop.
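
To make the degenerate case concrete with assumed numbers:

```
K, BLOCK_K = 64, 64  # illustrative sizes
# Each K-loop runs exactly once, so the two dots become adjacent again,
# just as in the fused single-loop form.
assert list(range(0, K, BLOCK_K)) == [0]
```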

@ptillet
Collaborator

ptillet commented Mar 11, 2023

Yeah, it definitely shouldn't crash :) Just wanted to unblock you, since we probably won't have time to look at this issue for a few weeks.

@ngimel
Contributor

ngimel commented Mar 11, 2023

Yeah, we unblocked ourselves in pytorch/pytorch#96385, thanks!

cyyever pushed a commit to cyyever/pytorch_private that referenced this issue Mar 12, 2023
ydwu4 added a commit to ydwu4/pytorch that referenced this issue Mar 13, 2023
Trim number of tested mm_plus_mm configs to work around triton-lang/triton#1298

Pull Request resolved: pytorch#96385
Approved by: https://github.com/bertmaher, https://github.com/jansel
shunting314 added a commit to pytorch/pytorch that referenced this issue Mar 13, 2023
…or max autotuning"


This PR implements a prototype to benchmark max-autotune choices in subprocesses. This way crash like triton-lang/triton#1298 will only abort the autotuning child process but the parent process can continue.

There are a few things to note:
- cuda runtime does not work with fork. So we have to use spawn to create child processes. Check the best practice from pytorch multithreading module: https://pytorch.org/docs/stable/notes/multiprocessing.html
- to run a job in a child process, the multiprocessing module needs to pickle both the target function and arguments and pass them to child process. This is the major complexity of this prototype since there are quite a lot of corner cases making pickle fail.

Here I list the pickle related issues I encountered:
- pickle a StorageBox cause infinite recursion. Error: https://gist.github.com/171e5ab404b7855dee2dfa1d9f093442 . Work around by pickle the inner buffer.
- IRNode store fx.Node's in its origin fields. However, we can not pickle a fx.Node. It fails when with the following error when picking the fx.Node.graph: https://gist.github.com/9c289e895d7091d7ec787c67bc3c0d70. Work around by skip origins when pickling a IRNode.
- jinja Template in TritonTemplateKernel can not be pickled: `TypeError: Template.__new__() missing 1 required positional argument: 'source' `. Workaround by pickle the source rather than jinjia Template. During unpickling, rebuild the jinja template.
- due to how select_algorithm.template_kernels is populated, in child process, it's empty. Work around by passing select_algorithm.template_kernels from parent process to child process directly.
  - There is some change in TritonTemplate.generate to make a TritonTemplateKernel pickle'able. A TritonTemplate is refered to in the closure for a TritonTemplateKernel object.
- We can not pass choice to child process directly because of pickle failure for lambda/local function being used. However cloudpickle can handle lambda. Work around by passing the cloudpickle'd choice object to child process. The child project need to unpickle it explictly.

Test:
```
python test/inductor/test_max_autotune.py -k test_max_autotune_mm_plus_mm
```
This is basically the repro I get from Bert Maher.


Benchmark in sub process is about 4x slower than benchmark in the same process. Without doing any profiling, I feel the time may be cost by starting a new process and doing initialization. Some ~thread~ process pool may help.

```
AUTOTUNE ref_mm_plus_mm(2048x64, 64x1536, 2048x64, 64x1536)
  triton_mm_plus_mm_0 0.0276s 100.0%
  triton_mm_plus_mm_6 0.0287s 96.4%
  triton_mm_plus_mm_5 0.0317s 87.1%
  triton_mm_plus_mm_1 0.0328s 84.4%
  ref_mm_plus_mm 0.0379s 73.0%
  triton_mm_plus_mm_7 0.0379s 73.0%
  triton_mm_plus_mm_2 0.0399s 69.2%
  triton_mm_plus_mm_3 0.0410s 67.5%
  triton_mm_plus_mm_4 0.0410s 67.5%
AUTOTUNE takes 12.001659393310547 seconds

AUTOTUNE ref_mm_plus_mm(2048x64, 64x1536, 2048x64, 64x1536)
  triton_mm_plus_mm_0 0.0276s 100.0%
  triton_mm_plus_mm_6 0.0287s 96.4%
  triton_mm_plus_mm_1 0.0317s 87.1%
  triton_mm_plus_mm_5 0.0317s 87.1%
  ref_mm_plus_mm 0.0379s 73.0%
  triton_mm_plus_mm_7 0.0389s 71.1%
  triton_mm_plus_mm_2 0.0399s 69.2%
  triton_mm_plus_mm_3 0.0410s 67.5%
  triton_mm_plus_mm_4 0.0410s 67.5%
AUTOTUNE takes 51.39659810066223 seconds
``` 

The feature is disabled by default and can be enabled by setting the following config or envvar:
```
autotune_in_subproc = os.environ.get("TORCHINDUCTOR_AUTOTUNE_IN_SUBPROC") == "1"
```


cc soumith voznesenskym penguinwu anijain2305 EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng Xia-Weiwen wenzhe-nrv jiayisunx peterbell10 desertfire

Differential Revision: [D43996048](https://our.internmc.facebook.com/intern/diff/D43996048)

[ghstack-poisoned]
shunting314 added a commit to pytorch/pytorch that referenced this issue Mar 13, 2023
This PR implements a prototype to benchmark max-autotune choices in subprocesses. This way crash like triton-lang/triton#1298 will only abort the autotuning child process but the parent process can continue.

There are a few things to note:
- cuda runtime does not work with fork. So we have to use spawn to create child processes. Check the best practice from pytorch multithreading module: https://pytorch.org/docs/stable/notes/multiprocessing.html
- to run a job in a child process, the multiprocessing module needs to pickle both the target function and arguments and pass them to child process. This is the major complexity of this prototype since there are quite a lot of corner cases making pickle fail.

Here I list the pickle related issues I encountered:
- pickle a StorageBox cause infinite recursion. Error: https://gist.github.com/171e5ab404b7855dee2dfa1d9f093442 . Work around by pickle the inner buffer.
- IRNode store fx.Node's in its origin fields. However, we can not pickle a fx.Node. It fails when with the following error when picking the fx.Node.graph: https://gist.github.com/9c289e895d7091d7ec787c67bc3c0d70. Work around by skip origins when pickling a IRNode.
- jinja Template in TritonTemplateKernel can not be pickled: `TypeError: Template.__new__() missing 1 required positional argument: 'source' `. Workaround by pickle the source rather than jinjia Template. During unpickling, rebuild the jinja template.
- due to how select_algorithm.template_kernels is populated, in child process, it's empty. Work around by passing select_algorithm.template_kernels from parent process to child process directly.
  - There is some change in TritonTemplate.generate to make a TritonTemplateKernel pickle'able. A TritonTemplate is refered to in the closure for a TritonTemplateKernel object.
- We can not pass choice to child process directly because of pickle failure for lambda/local function being used. However cloudpickle can handle lambda. Work around by passing the cloudpickle'd choice object to child process. The child project need to unpickle it explictly.

Test:
```
python test/inductor/test_max_autotune.py -k test_max_autotune_mm_plus_mm
```
This is basically the repro I get from Bert Maher.


Benchmark in sub process is about 4x slower than benchmark in the same process. Without doing any profiling, I feel the time may be cost by starting a new process and doing initialization. Some ~thread~ process pool may help.

```
AUTOTUNE ref_mm_plus_mm(2048x64, 64x1536, 2048x64, 64x1536)
  triton_mm_plus_mm_0 0.0276s 100.0%
  triton_mm_plus_mm_6 0.0287s 96.4%
  triton_mm_plus_mm_5 0.0317s 87.1%
  triton_mm_plus_mm_1 0.0328s 84.4%
  ref_mm_plus_mm 0.0379s 73.0%
  triton_mm_plus_mm_7 0.0379s 73.0%
  triton_mm_plus_mm_2 0.0399s 69.2%
  triton_mm_plus_mm_3 0.0410s 67.5%
  triton_mm_plus_mm_4 0.0410s 67.5%
AUTOTUNE takes 12.001659393310547 seconds

AUTOTUNE ref_mm_plus_mm(2048x64, 64x1536, 2048x64, 64x1536)
  triton_mm_plus_mm_0 0.0276s 100.0%
  triton_mm_plus_mm_6 0.0287s 96.4%
  triton_mm_plus_mm_1 0.0317s 87.1%
  triton_mm_plus_mm_5 0.0317s 87.1%
  ref_mm_plus_mm 0.0379s 73.0%
  triton_mm_plus_mm_7 0.0389s 71.1%
  triton_mm_plus_mm_2 0.0399s 69.2%
  triton_mm_plus_mm_3 0.0410s 67.5%
  triton_mm_plus_mm_4 0.0410s 67.5%
AUTOTUNE takes 51.39659810066223 seconds
``` 

The feature is disabled by default and can be enabled by setting the following config or envvar:
```
autotune_in_subproc = os.environ.get("TORCHINDUCTOR_AUTOTUNE_IN_SUBPROC") == "1"
```


cc soumith voznesenskym penguinwu anijain2305 EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng Xia-Weiwen wenzhe-nrv jiayisunx peterbell10 desertfire

Differential Revision: [D43996048](https://our.internmc.facebook.com/intern/diff/D43996048)

[ghstack-poisoned]
pytorchmergebot pushed a commit to pytorch/pytorch that referenced this issue Mar 18, 2023
This PR implements support for benchmarking max-autotune choices in subprocesses, so that a crash like triton-lang/triton#1298 only aborts the autotuning child process while the parent process continues.


Differential Revision: [D43996048](https://our.internmc.facebook.com/intern/diff/D43996048)
Pull Request resolved: #96410
Approved by: https://github.com/jansel
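The isolation mechanism can be sketched with plain multiprocessing (a toy illustration only, not the Inductor implementation; `bench_in_subproc` and `_bench` are made-up names, and the real feature ships the pickled kernel/choice to the child):
```
import multiprocessing as mp

def _bench(choice, out_q):
    # A real worker would rebuild the pickled kernel/choice here and time it;
    # this stand-in just returns a fake timing.
    out_q.put((choice, 0.0276))

def bench_in_subproc(choice):
    ctx = mp.get_context("spawn")  # fork is unsafe once CUDA is initialized
    out_q = ctx.Queue()
    p = ctx.Process(target=_bench, args=(choice, out_q))
    p.start()
    p.join()
    if p.exitcode != 0:
        # The child died hard (e.g. a native assertion in the Triton
        # backend); treat this choice as failed and keep going.
        return None
    return out_q.get()

if __name__ == "__main__":
    print(bench_in_subproc("triton_mm_plus_mm_0"))
```
Checking `exitcode` rather than catching an exception is what lets a hard native crash, which no Python `try` block can intercept, be treated as just another failed choice.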
@vedantroy
Contributor

@bertmaher I'm working on fixing this issue; do you have any reproductions that crash on HEAD?
