Assertion `!(srcMmaLayout && dstMmaLayout) && "Unexpected mma -> mma layout conversion"' failed #1298
Comments
We have a couple of these errors. Basically, they happen when the optimizer doesn't do its job well, so the backend hits a slow codepath that hasn't been implemented yet 😅
Is there anything we can fix in the Triton kernel itself to avoid hitting the slow path? Or should we avoid some of the autotuning configs?
Hmm, I have to say I don't really see the benefit of having the two matmuls in the same inner loop, since they're independent. This will likely increase register pressure a lot without increasing arithmetic intensity.
We see speed-ups for some small-ish sizes we have, if I recall correctly.
Also, based on the comment in that kernel, the original intent was to have them in two separate inner loops (but sharing the accumulator). I think these are bandwidth-bound because of the small K, so avoiding the read/write round trip for the fused addition ends up being a win.
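For context, the semantics of the fused op under discussion can be written as two independent matmuls summed into one output. A minimal reference sketch in NumPy (hypothetical, not the actual Triton kernel):

```python
import numpy as np

def mm_plus_mm_reference(a, b, c, d):
    """Reference semantics of the mm_plus_mm pattern: out = a @ b + c @ d.

    Unfused, the addition costs an extra read/write round trip over an
    output-sized buffer; with small K both matmuls are bandwidth-bound,
    which is why sharing the accumulator can be a win.
    """
    assert a.shape[1] == b.shape[0] and c.shape[1] == d.shape[0]
    return a @ b + c @ d
```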
Trim the number of tested mm_plus_mm configs to work around triton-lang/triton#1298. Pull Request resolved: #96385. Approved by: https://github.com/bertmaher, https://github.com/jansel
This PR implements a prototype that benchmarks max-autotune choices in subprocesses. This way, a crash like triton-lang/triton#1298 only aborts the autotuning child process, and the parent process can continue. A few things to note:
- The CUDA runtime does not work with fork, so we have to use spawn to create child processes. See the best practices for the PyTorch multiprocessing module: https://pytorch.org/docs/stable/notes/multiprocessing.html
- To run a job in a child process, the multiprocessing module needs to pickle both the target function and its arguments and pass them to the child process. This is the major complexity of this prototype, since there are quite a lot of corner cases that make pickling fail. The pickle-related issues I encountered:
  - Pickling a StorageBox causes infinite recursion. Error: https://gist.github.com/171e5ab404b7855dee2dfa1d9f093442 . Worked around by pickling the inner buffer.
  - An IRNode stores fx.Node's in its origin fields, but an fx.Node cannot be pickled. Pickling fx.Node.graph fails with the following error: https://gist.github.com/9c289e895d7091d7ec787c67bc3c0d70 . Worked around by skipping origins when pickling an IRNode.
  - The jinja Template in TritonTemplateKernel cannot be pickled: `TypeError: Template.__new__() missing 1 required positional argument: 'source'`. Worked around by pickling the source rather than the jinja Template, and rebuilding the template during unpickling.
  - Because of how select_algorithm.template_kernels is populated, it is empty in the child process. Worked around by passing select_algorithm.template_kernels from the parent process to the child process directly.
  - There is some change in TritonTemplate.generate to make a TritonTemplateKernel picklable. A TritonTemplate is referred to in the closure for a TritonTemplateKernel object.
  - We cannot pass the choice to the child process directly because pickling fails for lambdas/local functions. However, cloudpickle can handle lambdas. Worked around by passing the cloudpickle'd choice object to the child process; the child process needs to unpickle it explicitly.

Test:
```
python test/inductor/test_max_autotune.py -k test_max_autotune_mm_plus_mm
```
This is basically the repro I got from Bert Maher.

cc soumith voznesenskym penguinwu anijain2305 EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng Xia-Weiwen wenzhe-nrv jiayisunx peterbell10 desertfire
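The pickle-the-source workaround mentioned for the jinja Template can be sketched generically with `__reduce__`: serialize only the source text and recompile on unpickle. The class below is a hypothetical stand-in, not inductor's actual TritonTemplateKernel:

```python
import pickle

class SourceBackedTemplate:
    """Stand-in for an object whose compiled form cannot be pickled
    (analogous to the jinja Template inside TritonTemplateKernel)."""

    def __init__(self, source):
        self.source = source
        # Placeholder for an unpicklable compiled artifact.
        self.compiled = self._compile(source)

    @staticmethod
    def _compile(source):
        # Stands in for jinja2.Template(source) in the real case.
        return source.upper()

    def __reduce__(self):
        # Serialize only the source text; __init__ recompiles on unpickle.
        return (self.__class__, (self.source,))

# Round-trips cleanly even though `compiled` itself is never pickled.
roundtripped = pickle.loads(pickle.dumps(SourceBackedTemplate("tl.dot(a, b)")))
```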
This PR implements a prototype to benchmark max-autotune choices in subprocesses. This way crash like triton-lang/triton#1298 will only abort the autotuning child process but the parent process can continue. There are a few things to note: - cuda runtime does not work with fork. So we have to use spawn to create child processes. Check the best practice from pytorch multithreading module: https://pytorch.org/docs/stable/notes/multiprocessing.html - to run a job in a child process, the multiprocessing module needs to pickle both the target function and arguments and pass them to child process. This is the major complexity of this prototype since there are quite a lot of corner cases making pickle fail. Here I list the pickle related issues I encountered: - pickle a StorageBox cause infinite recursion. Error: https://gist.github.com/171e5ab404b7855dee2dfa1d9f093442 . Work around by pickle the inner buffer. - IRNode store fx.Node's in its origin fields. However, we can not pickle a fx.Node. It fails when with the following error when picking the fx.Node.graph: https://gist.github.com/9c289e895d7091d7ec787c67bc3c0d70. Work around by skip origins when pickling a IRNode. - jinja Template in TritonTemplateKernel can not be pickled: `TypeError: Template.__new__() missing 1 required positional argument: 'source' `. Workaround by pickle the source rather than jinjia Template. During unpickling, rebuild the jinja template. - due to how select_algorithm.template_kernels is populated, in child process, it's empty. Work around by passing select_algorithm.template_kernels from parent process to child process directly. - There is some change in TritonTemplate.generate to make a TritonTemplateKernel pickle'able. A TritonTemplate is refered to in the closure for a TritonTemplateKernel object. - We can not pass choice to child process directly because of pickle failure for lambda/local function being used. However cloudpickle can handle lambda. 
Work around by passing the cloudpickle'd choice object to child process. The child project need to unpickle it explictly. Test: ``` python test/inductor/test_max_autotune.py -k test_max_autotune_mm_plus_mm ``` This is basically the repro I get from Bert Maher. cc soumith voznesenskym penguinwu anijain2305 EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng Xia-Weiwen wenzhe-nrv jiayisunx peterbell10 desertfire [ghstack-poisoned]
…or max autotuning" This PR implements a prototype to benchmark max-autotune choices in subprocesses. This way crash like triton-lang/triton#1298 will only abort the autotuning child process but the parent process can continue. There are a few things to note: - cuda runtime does not work with fork. So we have to use spawn to create child processes. Check the best practice from pytorch multithreading module: https://pytorch.org/docs/stable/notes/multiprocessing.html - to run a job in a child process, the multiprocessing module needs to pickle both the target function and arguments and pass them to child process. This is the major complexity of this prototype since there are quite a lot of corner cases making pickle fail. Here I list the pickle related issues I encountered: - pickle a StorageBox cause infinite recursion. Error: https://gist.github.com/171e5ab404b7855dee2dfa1d9f093442 . Work around by pickle the inner buffer. - IRNode store fx.Node's in its origin fields. However, we can not pickle a fx.Node. It fails when with the following error when picking the fx.Node.graph: https://gist.github.com/9c289e895d7091d7ec787c67bc3c0d70. Work around by skip origins when pickling a IRNode. - jinja Template in TritonTemplateKernel can not be pickled: `TypeError: Template.__new__() missing 1 required positional argument: 'source' `. Workaround by pickle the source rather than jinjia Template. During unpickling, rebuild the jinja template. - due to how select_algorithm.template_kernels is populated, in child process, it's empty. Work around by passing select_algorithm.template_kernels from parent process to child process directly. - There is some change in TritonTemplate.generate to make a TritonTemplateKernel pickle'able. A TritonTemplate is refered to in the closure for a TritonTemplateKernel object. - We can not pass choice to child process directly because of pickle failure for lambda/local function being used. However cloudpickle can handle lambda. 
Work around by passing the cloudpickle'd choice object to child process. The child project need to unpickle it explictly. Test: ``` python test/inductor/test_max_autotune.py -k test_max_autotune_mm_plus_mm ``` This is basically the repro I get from Bert Maher. cc soumith voznesenskym penguinwu anijain2305 EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng Xia-Weiwen wenzhe-nrv jiayisunx peterbell10 desertfire [ghstack-poisoned]
This PR implements a prototype to benchmark max-autotune choices in subprocesses. This way crash like triton-lang/triton#1298 will only abort the autotuning child process but the parent process can continue. There are a few things to note: - cuda runtime does not work with fork. So we have to use spawn to create child processes. Check the best practice from pytorch multithreading module: https://pytorch.org/docs/stable/notes/multiprocessing.html - to run a job in a child process, the multiprocessing module needs to pickle both the target function and arguments and pass them to child process. This is the major complexity of this prototype since there are quite a lot of corner cases making pickle fail. Here I list the pickle related issues I encountered: - pickle a StorageBox cause infinite recursion. Error: https://gist.github.com/171e5ab404b7855dee2dfa1d9f093442 . Work around by pickle the inner buffer. - IRNode store fx.Node's in its origin fields. However, we can not pickle a fx.Node. It fails when with the following error when picking the fx.Node.graph: https://gist.github.com/9c289e895d7091d7ec787c67bc3c0d70. Work around by skip origins when pickling a IRNode. - jinja Template in TritonTemplateKernel can not be pickled: `TypeError: Template.__new__() missing 1 required positional argument: 'source' `. Workaround by pickle the source rather than jinjia Template. During unpickling, rebuild the jinja template. - due to how select_algorithm.template_kernels is populated, in child process, it's empty. Work around by passing select_algorithm.template_kernels from parent process to child process directly. - There is some change in TritonTemplate.generate to make a TritonTemplateKernel pickle'able. A TritonTemplate is refered to in the closure for a TritonTemplateKernel object. - We can not pass choice to child process directly because of pickle failure for lambda/local function being used. However cloudpickle can handle lambda. 
Work around by passing the cloudpickle'd choice object to child process. The child project need to unpickle it explictly. Test: ``` python test/inductor/test_max_autotune.py -k test_max_autotune_mm_plus_mm ``` This is basically the repro I get from Bert Maher. cc soumith voznesenskym penguinwu anijain2305 EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng Xia-Weiwen wenzhe-nrv jiayisunx peterbell10 desertfire [ghstack-poisoned]
…or max autotuning" This PR implements a prototype to benchmark max-autotune choices in subprocesses. This way crash like triton-lang/triton#1298 will only abort the autotuning child process but the parent process can continue. There are a few things to note: - cuda runtime does not work with fork. So we have to use spawn to create child processes. Check the best practice from pytorch multithreading module: https://pytorch.org/docs/stable/notes/multiprocessing.html - to run a job in a child process, the multiprocessing module needs to pickle both the target function and arguments and pass them to child process. This is the major complexity of this prototype since there are quite a lot of corner cases making pickle fail. Here I list the pickle related issues I encountered: - pickle a StorageBox cause infinite recursion. Error: https://gist.github.com/171e5ab404b7855dee2dfa1d9f093442 . Work around by pickle the inner buffer. - IRNode store fx.Node's in its origin fields. However, we can not pickle a fx.Node. It fails when with the following error when picking the fx.Node.graph: https://gist.github.com/9c289e895d7091d7ec787c67bc3c0d70. Work around by skip origins when pickling a IRNode. - jinja Template in TritonTemplateKernel can not be pickled: `TypeError: Template.__new__() missing 1 required positional argument: 'source' `. Workaround by pickle the source rather than jinjia Template. During unpickling, rebuild the jinja template. - due to how select_algorithm.template_kernels is populated, in child process, it's empty. Work around by passing select_algorithm.template_kernels from parent process to child process directly. - There is some change in TritonTemplate.generate to make a TritonTemplateKernel pickle'able. A TritonTemplate is refered to in the closure for a TritonTemplateKernel object. - We can not pass choice to child process directly because of pickle failure for lambda/local function being used. However cloudpickle can handle lambda. 
Work around by passing the cloudpickle'd choice object to child process. The child project need to unpickle it explictly. Test: ``` python test/inductor/test_max_autotune.py -k test_max_autotune_mm_plus_mm ``` This is basically the repro I get from Bert Maher. cc soumith voznesenskym penguinwu anijain2305 EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng Xia-Weiwen wenzhe-nrv jiayisunx peterbell10 desertfire [ghstack-poisoned]
This PR implements a prototype to benchmark max-autotune choices in subprocesses. This way crash like triton-lang/triton#1298 will only abort the autotuning child process but the parent process can continue. There are a few things to note: - cuda runtime does not work with fork. So we have to use spawn to create child processes. Check the best practice from pytorch multithreading module: https://pytorch.org/docs/stable/notes/multiprocessing.html - to run a job in a child process, the multiprocessing module needs to pickle both the target function and arguments and pass them to child process. This is the major complexity of this prototype since there are quite a lot of corner cases making pickle fail. Here I list the pickle related issues I encountered: - pickle a StorageBox cause infinite recursion. Error: https://gist.github.com/171e5ab404b7855dee2dfa1d9f093442 . Work around by pickle the inner buffer. - IRNode store fx.Node's in its origin fields. However, we can not pickle a fx.Node. It fails when with the following error when picking the fx.Node.graph: https://gist.github.com/9c289e895d7091d7ec787c67bc3c0d70. Work around by skip origins when pickling a IRNode. - jinja Template in TritonTemplateKernel can not be pickled: `TypeError: Template.__new__() missing 1 required positional argument: 'source' `. Workaround by pickle the source rather than jinjia Template. During unpickling, rebuild the jinja template. - due to how select_algorithm.template_kernels is populated, in child process, it's empty. Work around by passing select_algorithm.template_kernels from parent process to child process directly. - There is some change in TritonTemplate.generate to make a TritonTemplateKernel pickle'able. A TritonTemplate is refered to in the closure for a TritonTemplateKernel object. - We can not pass choice to child process directly because of pickle failure for lambda/local function being used. However cloudpickle can handle lambda. 
Work around by passing the cloudpickle'd choice object to child process. The child project need to unpickle it explictly. Test: ``` python test/inductor/test_max_autotune.py -k test_max_autotune_mm_plus_mm ``` This is basically the repro I get from Bert Maher. cc soumith voznesenskym penguinwu anijain2305 EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng Xia-Weiwen wenzhe-nrv jiayisunx peterbell10 desertfire [ghstack-poisoned]
…or max autotuning" This PR implements a prototype to benchmark max-autotune choices in subprocesses. This way crash like triton-lang/triton#1298 will only abort the autotuning child process but the parent process can continue. There are a few things to note: - cuda runtime does not work with fork. So we have to use spawn to create child processes. Check the best practice from pytorch multithreading module: https://pytorch.org/docs/stable/notes/multiprocessing.html - to run a job in a child process, the multiprocessing module needs to pickle both the target function and arguments and pass them to child process. This is the major complexity of this prototype since there are quite a lot of corner cases making pickle fail. Here I list the pickle related issues I encountered: - pickle a StorageBox cause infinite recursion. Error: https://gist.github.com/171e5ab404b7855dee2dfa1d9f093442 . Work around by pickle the inner buffer. - IRNode store fx.Node's in its origin fields. However, we can not pickle a fx.Node. It fails when with the following error when picking the fx.Node.graph: https://gist.github.com/9c289e895d7091d7ec787c67bc3c0d70. Work around by skip origins when pickling a IRNode. - jinja Template in TritonTemplateKernel can not be pickled: `TypeError: Template.__new__() missing 1 required positional argument: 'source' `. Workaround by pickle the source rather than jinjia Template. During unpickling, rebuild the jinja template. - due to how select_algorithm.template_kernels is populated, in child process, it's empty. Work around by passing select_algorithm.template_kernels from parent process to child process directly. - There is some change in TritonTemplate.generate to make a TritonTemplateKernel pickle'able. A TritonTemplate is refered to in the closure for a TritonTemplateKernel object. - We can not pass choice to child process directly because of pickle failure for lambda/local function being used. However cloudpickle can handle lambda. 
Work around by passing the cloudpickle'd choice object to child process. The child project need to unpickle it explictly. Test: ``` python test/inductor/test_max_autotune.py -k test_max_autotune_mm_plus_mm ``` This is basically the repro I get from Bert Maher. cc soumith voznesenskym penguinwu anijain2305 EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng Xia-Weiwen wenzhe-nrv jiayisunx peterbell10 desertfire [ghstack-poisoned]
This PR implements a prototype to benchmark max-autotune choices in subprocesses. This way crash like triton-lang/triton#1298 will only abort the autotuning child process but the parent process can continue. There are a few things to note: - cuda runtime does not work with fork. So we have to use spawn to create child processes. Check the best practice from pytorch multithreading module: https://pytorch.org/docs/stable/notes/multiprocessing.html - to run a job in a child process, the multiprocessing module needs to pickle both the target function and arguments and pass them to child process. This is the major complexity of this prototype since there are quite a lot of corner cases making pickle fail. Here I list the pickle related issues I encountered: - pickle a StorageBox cause infinite recursion. Error: https://gist.github.com/171e5ab404b7855dee2dfa1d9f093442 . Work around by pickle the inner buffer. - IRNode store fx.Node's in its origin fields. However, we can not pickle a fx.Node. It fails when with the following error when picking the fx.Node.graph: https://gist.github.com/9c289e895d7091d7ec787c67bc3c0d70. Work around by skip origins when pickling a IRNode. - jinja Template in TritonTemplateKernel can not be pickled: `TypeError: Template.__new__() missing 1 required positional argument: 'source' `. Workaround by pickle the source rather than jinjia Template. During unpickling, rebuild the jinja template. - due to how select_algorithm.template_kernels is populated, in child process, it's empty. Work around by passing select_algorithm.template_kernels from parent process to child process directly. - There is some change in TritonTemplate.generate to make a TritonTemplateKernel pickle'able. A TritonTemplate is refered to in the closure for a TritonTemplateKernel object. - We can not pass choice to child process directly because of pickle failure for lambda/local function being used. However cloudpickle can handle lambda. 
Work around by passing the cloudpickle'd choice object to child process. The child project need to unpickle it explictly. Test: ``` python test/inductor/test_max_autotune.py -k test_max_autotune_mm_plus_mm ``` This is basically the repro I get from Bert Maher. cc soumith voznesenskym penguinwu anijain2305 EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng Xia-Weiwen wenzhe-nrv jiayisunx peterbell10 desertfire [ghstack-poisoned]
…or max autotuning" This PR implements a prototype to benchmark max-autotune choices in subprocesses. This way crash like triton-lang/triton#1298 will only abort the autotuning child process but the parent process can continue. There are a few things to note: - cuda runtime does not work with fork. So we have to use spawn to create child processes. Check the best practice from pytorch multithreading module: https://pytorch.org/docs/stable/notes/multiprocessing.html - to run a job in a child process, the multiprocessing module needs to pickle both the target function and arguments and pass them to child process. This is the major complexity of this prototype since there are quite a lot of corner cases making pickle fail. Here I list the pickle related issues I encountered: - pickle a StorageBox cause infinite recursion. Error: https://gist.github.com/171e5ab404b7855dee2dfa1d9f093442 . Work around by pickle the inner buffer. - IRNode store fx.Node's in its origin fields. However, we can not pickle a fx.Node. It fails when with the following error when picking the fx.Node.graph: https://gist.github.com/9c289e895d7091d7ec787c67bc3c0d70. Work around by skip origins when pickling a IRNode. - jinja Template in TritonTemplateKernel can not be pickled: `TypeError: Template.__new__() missing 1 required positional argument: 'source' `. Workaround by pickle the source rather than jinjia Template. During unpickling, rebuild the jinja template. - due to how select_algorithm.template_kernels is populated, in child process, it's empty. Work around by passing select_algorithm.template_kernels from parent process to child process directly. - There is some change in TritonTemplate.generate to make a TritonTemplateKernel pickle'able. A TritonTemplate is refered to in the closure for a TritonTemplateKernel object. - We can not pass choice to child process directly because of pickle failure for lambda/local function being used. However cloudpickle can handle lambda. 
Work around by passing the cloudpickle'd choice object to child process. The child project need to unpickle it explictly. Test: ``` python test/inductor/test_max_autotune.py -k test_max_autotune_mm_plus_mm ``` This is basically the repro I get from Bert Maher. cc soumith voznesenskym penguinwu anijain2305 EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng Xia-Weiwen wenzhe-nrv jiayisunx peterbell10 desertfire [ghstack-poisoned]
This PR implements a prototype to benchmark max-autotune choices in subprocesses. This way crash like triton-lang/triton#1298 will only abort the autotuning child process but the parent process can continue. There are a few things to note: - cuda runtime does not work with fork. So we have to use spawn to create child processes. Check the best practice from pytorch multithreading module: https://pytorch.org/docs/stable/notes/multiprocessing.html - to run a job in a child process, the multiprocessing module needs to pickle both the target function and arguments and pass them to child process. This is the major complexity of this prototype since there are quite a lot of corner cases making pickle fail. Here I list the pickle related issues I encountered: - pickle a StorageBox cause infinite recursion. Error: https://gist.github.com/171e5ab404b7855dee2dfa1d9f093442 . Work around by pickle the inner buffer. - IRNode store fx.Node's in its origin fields. However, we can not pickle a fx.Node. It fails when with the following error when picking the fx.Node.graph: https://gist.github.com/9c289e895d7091d7ec787c67bc3c0d70. Work around by skip origins when pickling a IRNode. - jinja Template in TritonTemplateKernel can not be pickled: `TypeError: Template.__new__() missing 1 required positional argument: 'source' `. Workaround by pickle the source rather than jinjia Template. During unpickling, rebuild the jinja template. - due to how select_algorithm.template_kernels is populated, in child process, it's empty. Work around by passing select_algorithm.template_kernels from parent process to child process directly. - There is some change in TritonTemplate.generate to make a TritonTemplateKernel pickle'able. A TritonTemplate is refered to in the closure for a TritonTemplateKernel object. - We can not pass choice to child process directly because of pickle failure for lambda/local function being used. However cloudpickle can handle lambda. 
Work around by passing the cloudpickle'd choice object to child process. The child project need to unpickle it explictly. Test: ``` python test/inductor/test_max_autotune.py -k test_max_autotune_mm_plus_mm ``` This is basically the repro I get from Bert Maher. cc soumith voznesenskym penguinwu anijain2305 EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng Xia-Weiwen wenzhe-nrv jiayisunx peterbell10 desertfire [ghstack-poisoned]
…or max autotuning" This PR implements a prototype to benchmark max-autotune choices in subprocesses. This way crash like triton-lang/triton#1298 will only abort the autotuning child process but the parent process can continue. There are a few things to note: - cuda runtime does not work with fork. So we have to use spawn to create child processes. Check the best practice from pytorch multithreading module: https://pytorch.org/docs/stable/notes/multiprocessing.html - to run a job in a child process, the multiprocessing module needs to pickle both the target function and arguments and pass them to child process. This is the major complexity of this prototype since there are quite a lot of corner cases making pickle fail. Here I list the pickle related issues I encountered: - pickle a StorageBox cause infinite recursion. Error: https://gist.github.com/171e5ab404b7855dee2dfa1d9f093442 . Work around by pickle the inner buffer. - IRNode store fx.Node's in its origin fields. However, we can not pickle a fx.Node. It fails when with the following error when picking the fx.Node.graph: https://gist.github.com/9c289e895d7091d7ec787c67bc3c0d70. Work around by skip origins when pickling a IRNode. - jinja Template in TritonTemplateKernel can not be pickled: `TypeError: Template.__new__() missing 1 required positional argument: 'source' `. Workaround by pickle the source rather than jinjia Template. During unpickling, rebuild the jinja template. - due to how select_algorithm.template_kernels is populated, in child process, it's empty. Work around by passing select_algorithm.template_kernels from parent process to child process directly. - There is some change in TritonTemplate.generate to make a TritonTemplateKernel pickle'able. A TritonTemplate is refered to in the closure for a TritonTemplateKernel object. - We can not pass choice to child process directly because of pickle failure for lambda/local function being used. However cloudpickle can handle lambda. 
Work around by passing the cloudpickle'd choice object to child process. The child project need to unpickle it explictly. Test: ``` python test/inductor/test_max_autotune.py -k test_max_autotune_mm_plus_mm ``` This is basically the repro I get from Bert Maher. cc soumith voznesenskym penguinwu anijain2305 EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng Xia-Weiwen wenzhe-nrv jiayisunx peterbell10 desertfire [ghstack-poisoned]
This PR implements a prototype to benchmark max-autotune choices in subprocesses. This way crash like triton-lang/triton#1298 will only abort the autotuning child process but the parent process can continue. There are a few things to note: - cuda runtime does not work with fork. So we have to use spawn to create child processes. Check the best practice from pytorch multithreading module: https://pytorch.org/docs/stable/notes/multiprocessing.html - to run a job in a child process, the multiprocessing module needs to pickle both the target function and arguments and pass them to child process. This is the major complexity of this prototype since there are quite a lot of corner cases making pickle fail. Here I list the pickle related issues I encountered: - pickle a StorageBox cause infinite recursion. Error: https://gist.github.com/171e5ab404b7855dee2dfa1d9f093442 . Work around by pickle the inner buffer. - IRNode store fx.Node's in its origin fields. However, we can not pickle a fx.Node. It fails when with the following error when picking the fx.Node.graph: https://gist.github.com/9c289e895d7091d7ec787c67bc3c0d70. Work around by skip origins when pickling a IRNode. - jinja Template in TritonTemplateKernel can not be pickled: `TypeError: Template.__new__() missing 1 required positional argument: 'source' `. Workaround by pickle the source rather than jinjia Template. During unpickling, rebuild the jinja template. - due to how select_algorithm.template_kernels is populated, in child process, it's empty. Work around by passing select_algorithm.template_kernels from parent process to child process directly. - There is some change in TritonTemplate.generate to make a TritonTemplateKernel pickle'able. A TritonTemplate is refered to in the closure for a TritonTemplateKernel object. - We can not pass choice to child process directly because of pickle failure for lambda/local function being used. However cloudpickle can handle lambda. 
Work around by passing the cloudpickle'd choice object to child process. The child project need to unpickle it explictly. Test: ``` python test/inductor/test_max_autotune.py -k test_max_autotune_mm_plus_mm ``` This is basically the repro I get from Bert Maher. cc soumith voznesenskym penguinwu anijain2305 EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng Xia-Weiwen wenzhe-nrv jiayisunx peterbell10 desertfire [ghstack-poisoned]
Benchmarking in a subprocess is about 4x slower than benchmarking in the same process. Without doing any profiling, I suspect the time goes to starting a new process and doing initialization; some process pool may help.

```
AUTOTUNE ref_mm_plus_mm(2048x64, 64x1536, 2048x64, 64x1536)
  triton_mm_plus_mm_0 0.0276s 100.0%
  triton_mm_plus_mm_6 0.0287s 96.4%
  triton_mm_plus_mm_5 0.0317s 87.1%
  triton_mm_plus_mm_1 0.0328s 84.4%
  ref_mm_plus_mm      0.0379s 73.0%
  triton_mm_plus_mm_7 0.0379s 73.0%
  triton_mm_plus_mm_2 0.0399s 69.2%
  triton_mm_plus_mm_3 0.0410s 67.5%
  triton_mm_plus_mm_4 0.0410s 67.5%
AUTOTUNE takes 12.001659393310547 seconds

AUTOTUNE ref_mm_plus_mm(2048x64, 64x1536, 2048x64, 64x1536)
  triton_mm_plus_mm_0 0.0276s 100.0%
  triton_mm_plus_mm_6 0.0287s 96.4%
  triton_mm_plus_mm_1 0.0317s 87.1%
  triton_mm_plus_mm_5 0.0317s 87.1%
  ref_mm_plus_mm      0.0379s 73.0%
  triton_mm_plus_mm_7 0.0389s 71.1%
  triton_mm_plus_mm_2 0.0399s 69.2%
  triton_mm_plus_mm_3 0.0410s 67.5%
  triton_mm_plus_mm_4 0.0410s 67.5%
AUTOTUNE takes 51.39659810066223 seconds
```

The feature is disabled by default and can be enabled by setting the following config or envvar: ``` autotune_in_subproc = os.environ.get("TORCHINDUCTOR_AUTOTUNE_IN_SUBPROC") == "1" ``` cc soumith voznesenskym penguinwu anijain2305 EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng Xia-Weiwen wenzhe-nrv jiayisunx peterbell10 desertfire Differential Revision: [D43996048](https://our.internmc.facebook.com/intern/diff/D43996048) [ghstack-poisoned]
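The crash-isolation idea in this PR can be sketched with a minimal stdlib-only example: benchmark each candidate in a spawned child process, so a hard crash in the backend only kills the child and the parent keeps autotuning. All names here (`bench_candidate`, `bench_in_subproc`, the fake latencies) are illustrative, not PyTorch's actual implementation.

```python
import multiprocessing as mp


def bench_candidate(idx, q):
    # Stand-in for compiling and timing one autotune choice. A real
    # candidate can abort the whole process (e.g. a backend assertion),
    # which we simulate here for one particular choice.
    if idx == 2:
        raise SystemExit(1)  # simulated hard crash in the child
    q.put((idx, 0.01 * (idx + 1)))  # (choice index, fake latency in seconds)


def bench_in_subproc(idx):
    # CUDA does not survive fork, hence the spawn start method.
    ctx = mp.get_context("spawn")
    q = ctx.Queue()
    p = ctx.Process(target=bench_candidate, args=(idx, q))
    p.start()
    p.join()
    if p.exitcode != 0:
        return None  # candidate crashed; the parent continues unharmed
    return q.get(timeout=5)[1]


if __name__ == "__main__":
    # A crashing candidate only kills its child process.
    print(bench_in_subproc(0), bench_in_subproc(2))  # prints "0.01 None"
```

The 4x slowdown reported above is consistent with this structure: each candidate pays a full interpreter start plus imports, which a persistent worker pool would amortize.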
I don't really see the drawback of doing loop fission here 🤔 Even in bandwidth-bound regimes, spilling can hurt a ton, and it's much more likely to happen with the fused loop since it's harder to optimize for. So the workaround I propose to avoid the slow path is to have two loops that share the same accumulator |
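The proposed two-loop structure can be sketched with a NumPy stand-in for the blocked Triton kernel (the function name and block size are illustrative): each matmul gets its own inner K-loop, but both accumulate into one shared buffer, so the fused addition still avoids a memory round trip.

```python
import numpy as np


def mm_plus_mm_fissioned(a, b, c, d, block_k=16):
    # Computes a @ b + c @ d with two separate inner K-loops that share
    # a single accumulator, instead of fusing both dot products into one
    # loop iteration (the pattern that triggered the mma->mma assertion).
    m, k1 = a.shape
    k2 = c.shape[1]
    n = b.shape[1]
    acc = np.zeros((m, n), dtype=np.float32)
    # First loop: accumulate a @ b block by block along K.
    for k in range(0, k1, block_k):
        acc += a[:, k:k + block_k] @ b[k:k + block_k, :]
    # Second loop: keep accumulating c @ d into the same buffer, so the
    # "+" never materializes an intermediate result in global memory.
    for k in range(0, k2, block_k):
        acc += c[:, k:k + block_k] @ d[k:k + block_k, :]
    return acc
```

Register pressure drops because each loop body holds live tiles of only one matmul at a time, while the arithmetic and memory traffic are unchanged.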
We took your advice and separated the loops, which is better and doesn’t crash. But the “fused” loops probably still shouldn’t crash, right? |
2 loops still error out in the same way when |
Yeah, it definitely shouldn't crash :) Just wanted to unblock you, since we probably won't have time to look at this issue for a few weeks. |
Yeah we unblocked in pytorch/pytorch#96385, thanks! |
Trim number of tested mm_plus_mm configs to work around triton-lang/triton#1298 Pull Request resolved: pytorch/pytorch#96385 Approved by: https://github.com/bertmaher, https://github.com/jansel
Trim number of tested mm_plus_mm configs to work around triton-lang/triton#1298 Pull Request resolved: pytorch#96385 Approved by: https://github.com/bertmaher, https://github.com/jansel
…or max autotuning" This PR implements a prototype to benchmark max-autotune choices in subprocesses. This way crash like triton-lang/triton#1298 will only abort the autotuning child process but the parent process can continue. There are a few things to note: - cuda runtime does not work with fork. So we have to use spawn to create child processes. Check the best practice from pytorch multithreading module: https://pytorch.org/docs/stable/notes/multiprocessing.html - to run a job in a child process, the multiprocessing module needs to pickle both the target function and arguments and pass them to child process. This is the major complexity of this prototype since there are quite a lot of corner cases making pickle fail. Here I list the pickle related issues I encountered: - pickle a StorageBox cause infinite recursion. Error: https://gist.github.com/171e5ab404b7855dee2dfa1d9f093442 . Work around by pickle the inner buffer. - IRNode store fx.Node's in its origin fields. However, we can not pickle a fx.Node. It fails when with the following error when picking the fx.Node.graph: https://gist.github.com/9c289e895d7091d7ec787c67bc3c0d70. Work around by skip origins when pickling a IRNode. - jinja Template in TritonTemplateKernel can not be pickled: `TypeError: Template.__new__() missing 1 required positional argument: 'source' `. Workaround by pickle the source rather than jinjia Template. During unpickling, rebuild the jinja template. - due to how select_algorithm.template_kernels is populated, in child process, it's empty. Work around by passing select_algorithm.template_kernels from parent process to child process directly. - There is some change in TritonTemplate.generate to make a TritonTemplateKernel pickle'able. A TritonTemplate is refered to in the closure for a TritonTemplateKernel object. - We can not pass choice to child process directly because of pickle failure for lambda/local function being used. However cloudpickle can handle lambda. 
Work around by passing the cloudpickle'd choice object to child process. The child project need to unpickle it explictly. Test: ``` python test/inductor/test_max_autotune.py -k test_max_autotune_mm_plus_mm ``` This is basically the repro I get from Bert Maher. Benchmark in sub process is about 4x slower than benchmark in the same process. Without doing any profiling, I feel the time may be cost by starting a new process and doing initialization. Some ~thread~ process pool may help. ``` AUTOTUNE ref_mm_plus_mm(2048x64, 64x1536, 2048x64, 64x1536) triton_mm_plus_mm_0 0.0276s 100.0% triton_mm_plus_mm_6 0.0287s 96.4% triton_mm_plus_mm_5 0.0317s 87.1% triton_mm_plus_mm_1 0.0328s 84.4% ref_mm_plus_mm 0.0379s 73.0% triton_mm_plus_mm_7 0.0379s 73.0% triton_mm_plus_mm_2 0.0399s 69.2% triton_mm_plus_mm_3 0.0410s 67.5% triton_mm_plus_mm_4 0.0410s 67.5% AUTOTUNE takes 12.001659393310547 seconds AUTOTUNE ref_mm_plus_mm(2048x64, 64x1536, 2048x64, 64x1536) triton_mm_plus_mm_0 0.0276s 100.0% triton_mm_plus_mm_6 0.0287s 96.4% triton_mm_plus_mm_1 0.0317s 87.1% triton_mm_plus_mm_5 0.0317s 87.1% ref_mm_plus_mm 0.0379s 73.0% triton_mm_plus_mm_7 0.0389s 71.1% triton_mm_plus_mm_2 0.0399s 69.2% triton_mm_plus_mm_3 0.0410s 67.5% triton_mm_plus_mm_4 0.0410s 67.5% AUTOTUNE takes 51.39659810066223 seconds ``` The feature is disabled by default and can be enabled by setting the following config or envvar: ``` autotune_in_subproc = os.environ.get("TORCHINDUCTOR_AUTOTUNE_IN_SUBPROC") == "1" ``` cc soumith voznesenskym penguinwu anijain2305 EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng Xia-Weiwen wenzhe-nrv jiayisunx peterbell10 desertfire Differential Revision: [D43996048](https://our.internmc.facebook.com/intern/diff/D43996048) [ghstack-poisoned]
This PR implements a prototype to benchmark max-autotune choices in subprocesses. This way crash like triton-lang/triton#1298 will only abort the autotuning child process but the parent process can continue. There are a few things to note: - cuda runtime does not work with fork. So we have to use spawn to create child processes. Check the best practice from pytorch multithreading module: https://pytorch.org/docs/stable/notes/multiprocessing.html - to run a job in a child process, the multiprocessing module needs to pickle both the target function and arguments and pass them to child process. This is the major complexity of this prototype since there are quite a lot of corner cases making pickle fail. Here I list the pickle related issues I encountered: - pickle a StorageBox cause infinite recursion. Error: https://gist.github.com/171e5ab404b7855dee2dfa1d9f093442 . Work around by pickle the inner buffer. - IRNode store fx.Node's in its origin fields. However, we can not pickle a fx.Node. It fails when with the following error when picking the fx.Node.graph: https://gist.github.com/9c289e895d7091d7ec787c67bc3c0d70. Work around by skip origins when pickling a IRNode. - jinja Template in TritonTemplateKernel can not be pickled: `TypeError: Template.__new__() missing 1 required positional argument: 'source' `. Workaround by pickle the source rather than jinjia Template. During unpickling, rebuild the jinja template. - due to how select_algorithm.template_kernels is populated, in child process, it's empty. Work around by passing select_algorithm.template_kernels from parent process to child process directly. - There is some change in TritonTemplate.generate to make a TritonTemplateKernel pickle'able. A TritonTemplate is refered to in the closure for a TritonTemplateKernel object. - We can not pass choice to child process directly because of pickle failure for lambda/local function being used. However cloudpickle can handle lambda. 
Work around by passing the cloudpickle'd choice object to child process. The child project need to unpickle it explictly. Test: ``` python test/inductor/test_max_autotune.py -k test_max_autotune_mm_plus_mm ``` This is basically the repro I get from Bert Maher. Benchmark in sub process is about 4x slower than benchmark in the same process. Without doing any profiling, I feel the time may be cost by starting a new process and doing initialization. Some ~thread~ process pool may help. ``` AUTOTUNE ref_mm_plus_mm(2048x64, 64x1536, 2048x64, 64x1536) triton_mm_plus_mm_0 0.0276s 100.0% triton_mm_plus_mm_6 0.0287s 96.4% triton_mm_plus_mm_5 0.0317s 87.1% triton_mm_plus_mm_1 0.0328s 84.4% ref_mm_plus_mm 0.0379s 73.0% triton_mm_plus_mm_7 0.0379s 73.0% triton_mm_plus_mm_2 0.0399s 69.2% triton_mm_plus_mm_3 0.0410s 67.5% triton_mm_plus_mm_4 0.0410s 67.5% AUTOTUNE takes 12.001659393310547 seconds AUTOTUNE ref_mm_plus_mm(2048x64, 64x1536, 2048x64, 64x1536) triton_mm_plus_mm_0 0.0276s 100.0% triton_mm_plus_mm_6 0.0287s 96.4% triton_mm_plus_mm_1 0.0317s 87.1% triton_mm_plus_mm_5 0.0317s 87.1% ref_mm_plus_mm 0.0379s 73.0% triton_mm_plus_mm_7 0.0389s 71.1% triton_mm_plus_mm_2 0.0399s 69.2% triton_mm_plus_mm_3 0.0410s 67.5% triton_mm_plus_mm_4 0.0410s 67.5% AUTOTUNE takes 51.39659810066223 seconds ``` The feature is disabled by default and can be enabled by setting the following config or envvar: ``` autotune_in_subproc = os.environ.get("TORCHINDUCTOR_AUTOTUNE_IN_SUBPROC") == "1" ``` cc soumith voznesenskym penguinwu anijain2305 EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng Xia-Weiwen wenzhe-nrv jiayisunx peterbell10 desertfire Differential Revision: [D43996048](https://our.internmc.facebook.com/intern/diff/D43996048) [ghstack-poisoned]
…or max autotuning" This PR implements a prototype to benchmark max-autotune choices in subprocesses. This way crash like triton-lang/triton#1298 will only abort the autotuning child process but the parent process can continue. There are a few things to note: - cuda runtime does not work with fork. So we have to use spawn to create child processes. Check the best practice from pytorch multithreading module: https://pytorch.org/docs/stable/notes/multiprocessing.html - to run a job in a child process, the multiprocessing module needs to pickle both the target function and arguments and pass them to child process. This is the major complexity of this prototype since there are quite a lot of corner cases making pickle fail. Here I list the pickle related issues I encountered: - pickle a StorageBox cause infinite recursion. Error: https://gist.github.com/171e5ab404b7855dee2dfa1d9f093442 . Work around by pickle the inner buffer. - IRNode store fx.Node's in its origin fields. However, we can not pickle a fx.Node. It fails when with the following error when picking the fx.Node.graph: https://gist.github.com/9c289e895d7091d7ec787c67bc3c0d70. Work around by skip origins when pickling a IRNode. - jinja Template in TritonTemplateKernel can not be pickled: `TypeError: Template.__new__() missing 1 required positional argument: 'source' `. Workaround by pickle the source rather than jinjia Template. During unpickling, rebuild the jinja template. - due to how select_algorithm.template_kernels is populated, in child process, it's empty. Work around by passing select_algorithm.template_kernels from parent process to child process directly. - There is some change in TritonTemplate.generate to make a TritonTemplateKernel pickle'able. A TritonTemplate is refered to in the closure for a TritonTemplateKernel object. - We can not pass choice to child process directly because of pickle failure for lambda/local function being used. However cloudpickle can handle lambda. 
Work around by passing the cloudpickle'd choice object to child process. The child project need to unpickle it explictly. Test: ``` python test/inductor/test_max_autotune.py -k test_max_autotune_mm_plus_mm ``` This is basically the repro I get from Bert Maher. Benchmark in sub process is about 4x slower than benchmark in the same process. Without doing any profiling, I feel the time may be cost by starting a new process and doing initialization. Some ~thread~ process pool may help. ``` AUTOTUNE ref_mm_plus_mm(2048x64, 64x1536, 2048x64, 64x1536) triton_mm_plus_mm_0 0.0276s 100.0% triton_mm_plus_mm_6 0.0287s 96.4% triton_mm_plus_mm_5 0.0317s 87.1% triton_mm_plus_mm_1 0.0328s 84.4% ref_mm_plus_mm 0.0379s 73.0% triton_mm_plus_mm_7 0.0379s 73.0% triton_mm_plus_mm_2 0.0399s 69.2% triton_mm_plus_mm_3 0.0410s 67.5% triton_mm_plus_mm_4 0.0410s 67.5% AUTOTUNE takes 12.001659393310547 seconds AUTOTUNE ref_mm_plus_mm(2048x64, 64x1536, 2048x64, 64x1536) triton_mm_plus_mm_0 0.0276s 100.0% triton_mm_plus_mm_6 0.0287s 96.4% triton_mm_plus_mm_1 0.0317s 87.1% triton_mm_plus_mm_5 0.0317s 87.1% ref_mm_plus_mm 0.0379s 73.0% triton_mm_plus_mm_7 0.0389s 71.1% triton_mm_plus_mm_2 0.0399s 69.2% triton_mm_plus_mm_3 0.0410s 67.5% triton_mm_plus_mm_4 0.0410s 67.5% AUTOTUNE takes 51.39659810066223 seconds ``` The feature is disabled by default and can be enabled by setting the following config or envvar: ``` autotune_in_subproc = os.environ.get("TORCHINDUCTOR_AUTOTUNE_IN_SUBPROC") == "1" ``` cc soumith voznesenskym penguinwu anijain2305 EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng Xia-Weiwen wenzhe-nrv jiayisunx peterbell10 desertfire Differential Revision: [D43996048](https://our.internmc.facebook.com/intern/diff/D43996048) [ghstack-poisoned]
This PR implements a prototype to benchmark max-autotune choices in subprocesses. This way crash like triton-lang/triton#1298 will only abort the autotuning child process but the parent process can continue. There are a few things to note: - cuda runtime does not work with fork. So we have to use spawn to create child processes. Check the best practice from pytorch multithreading module: https://pytorch.org/docs/stable/notes/multiprocessing.html - to run a job in a child process, the multiprocessing module needs to pickle both the target function and arguments and pass them to child process. This is the major complexity of this prototype since there are quite a lot of corner cases making pickle fail. Here I list the pickle related issues I encountered: - pickle a StorageBox cause infinite recursion. Error: https://gist.github.com/171e5ab404b7855dee2dfa1d9f093442 . Work around by pickle the inner buffer. - IRNode store fx.Node's in its origin fields. However, we can not pickle a fx.Node. It fails when with the following error when picking the fx.Node.graph: https://gist.github.com/9c289e895d7091d7ec787c67bc3c0d70. Work around by skip origins when pickling a IRNode. - jinja Template in TritonTemplateKernel can not be pickled: `TypeError: Template.__new__() missing 1 required positional argument: 'source' `. Workaround by pickle the source rather than jinjia Template. During unpickling, rebuild the jinja template. - due to how select_algorithm.template_kernels is populated, in child process, it's empty. Work around by passing select_algorithm.template_kernels from parent process to child process directly. - There is some change in TritonTemplate.generate to make a TritonTemplateKernel pickle'able. A TritonTemplate is refered to in the closure for a TritonTemplateKernel object. - We can not pass choice to child process directly because of pickle failure for lambda/local function being used. However cloudpickle can handle lambda. 
Work around by passing the cloudpickle'd choice object to child process. The child project need to unpickle it explictly. Test: ``` python test/inductor/test_max_autotune.py -k test_max_autotune_mm_plus_mm ``` This is basically the repro I get from Bert Maher. Benchmark in sub process is about 4x slower than benchmark in the same process. Without doing any profiling, I feel the time may be cost by starting a new process and doing initialization. Some ~thread~ process pool may help. ``` AUTOTUNE ref_mm_plus_mm(2048x64, 64x1536, 2048x64, 64x1536) triton_mm_plus_mm_0 0.0276s 100.0% triton_mm_plus_mm_6 0.0287s 96.4% triton_mm_plus_mm_5 0.0317s 87.1% triton_mm_plus_mm_1 0.0328s 84.4% ref_mm_plus_mm 0.0379s 73.0% triton_mm_plus_mm_7 0.0379s 73.0% triton_mm_plus_mm_2 0.0399s 69.2% triton_mm_plus_mm_3 0.0410s 67.5% triton_mm_plus_mm_4 0.0410s 67.5% AUTOTUNE takes 12.001659393310547 seconds AUTOTUNE ref_mm_plus_mm(2048x64, 64x1536, 2048x64, 64x1536) triton_mm_plus_mm_0 0.0276s 100.0% triton_mm_plus_mm_6 0.0287s 96.4% triton_mm_plus_mm_1 0.0317s 87.1% triton_mm_plus_mm_5 0.0317s 87.1% ref_mm_plus_mm 0.0379s 73.0% triton_mm_plus_mm_7 0.0389s 71.1% triton_mm_plus_mm_2 0.0399s 69.2% triton_mm_plus_mm_3 0.0410s 67.5% triton_mm_plus_mm_4 0.0410s 67.5% AUTOTUNE takes 51.39659810066223 seconds ``` The feature is disabled by default and can be enabled by setting the following config or envvar: ``` autotune_in_subproc = os.environ.get("TORCHINDUCTOR_AUTOTUNE_IN_SUBPROC") == "1" ``` cc soumith voznesenskym penguinwu anijain2305 EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng Xia-Weiwen wenzhe-nrv jiayisunx peterbell10 desertfire Differential Revision: [D43996048](https://our.internmc.facebook.com/intern/diff/D43996048) [ghstack-poisoned]
…or max autotuning" This PR implements a prototype to benchmark max-autotune choices in subprocesses. This way crash like triton-lang/triton#1298 will only abort the autotuning child process but the parent process can continue. There are a few things to note: - cuda runtime does not work with fork. So we have to use spawn to create child processes. Check the best practice from pytorch multithreading module: https://pytorch.org/docs/stable/notes/multiprocessing.html - to run a job in a child process, the multiprocessing module needs to pickle both the target function and arguments and pass them to child process. This is the major complexity of this prototype since there are quite a lot of corner cases making pickle fail. Here I list the pickle related issues I encountered: - pickle a StorageBox cause infinite recursion. Error: https://gist.github.com/171e5ab404b7855dee2dfa1d9f093442 . Work around by pickle the inner buffer. - IRNode store fx.Node's in its origin fields. However, we can not pickle a fx.Node. It fails when with the following error when picking the fx.Node.graph: https://gist.github.com/9c289e895d7091d7ec787c67bc3c0d70. Work around by skip origins when pickling a IRNode. - jinja Template in TritonTemplateKernel can not be pickled: `TypeError: Template.__new__() missing 1 required positional argument: 'source' `. Workaround by pickle the source rather than jinjia Template. During unpickling, rebuild the jinja template. - due to how select_algorithm.template_kernels is populated, in child process, it's empty. Work around by passing select_algorithm.template_kernels from parent process to child process directly. - There is some change in TritonTemplate.generate to make a TritonTemplateKernel pickle'able. A TritonTemplate is refered to in the closure for a TritonTemplateKernel object. - We can not pass choice to child process directly because of pickle failure for lambda/local function being used. However cloudpickle can handle lambda. 
Work around by passing the cloudpickle'd choice object to child process. The child project need to unpickle it explictly. Test: ``` python test/inductor/test_max_autotune.py -k test_max_autotune_mm_plus_mm ``` This is basically the repro I get from Bert Maher. Benchmark in sub process is about 4x slower than benchmark in the same process. Without doing any profiling, I feel the time may be cost by starting a new process and doing initialization. Some ~thread~ process pool may help. ``` AUTOTUNE ref_mm_plus_mm(2048x64, 64x1536, 2048x64, 64x1536) triton_mm_plus_mm_0 0.0276s 100.0% triton_mm_plus_mm_6 0.0287s 96.4% triton_mm_plus_mm_5 0.0317s 87.1% triton_mm_plus_mm_1 0.0328s 84.4% ref_mm_plus_mm 0.0379s 73.0% triton_mm_plus_mm_7 0.0379s 73.0% triton_mm_plus_mm_2 0.0399s 69.2% triton_mm_plus_mm_3 0.0410s 67.5% triton_mm_plus_mm_4 0.0410s 67.5% AUTOTUNE takes 12.001659393310547 seconds AUTOTUNE ref_mm_plus_mm(2048x64, 64x1536, 2048x64, 64x1536) triton_mm_plus_mm_0 0.0276s 100.0% triton_mm_plus_mm_6 0.0287s 96.4% triton_mm_plus_mm_1 0.0317s 87.1% triton_mm_plus_mm_5 0.0317s 87.1% ref_mm_plus_mm 0.0379s 73.0% triton_mm_plus_mm_7 0.0389s 71.1% triton_mm_plus_mm_2 0.0399s 69.2% triton_mm_plus_mm_3 0.0410s 67.5% triton_mm_plus_mm_4 0.0410s 67.5% AUTOTUNE takes 51.39659810066223 seconds ``` The feature is disabled by default and can be enabled by setting the following config or envvar: ``` autotune_in_subproc = os.environ.get("TORCHINDUCTOR_AUTOTUNE_IN_SUBPROC") == "1" ``` cc soumith voznesenskym penguinwu anijain2305 EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng Xia-Weiwen wenzhe-nrv jiayisunx peterbell10 desertfire Differential Revision: [D43996048](https://our.internmc.facebook.com/intern/diff/D43996048) [ghstack-poisoned]
Pull Request resolved: #96410 Approved by: https://github.com/jansel
This PR implements the support to benchmark max-autotune choices in subprocesses. This way crash like triton-lang/triton#1298 will only abort the autotuning child process but the parent process can continue. There are a few things to note: - cuda runtime does not work with fork. So we have to use spawn to create child processes. Check the best practice from pytorch multithreading module: https://pytorch.org/docs/stable/notes/multiprocessing.html - to run a job in a child process, the multiprocessing module needs to pickle both the target function and arguments and pass them to child process. This is the major complexity of this prototype since there are quite a lot of corner cases making pickle fail. Here I list the pickle related issues I encountered: - pickle a StorageBox cause infinite recursion. Error: https://gist.github.com/171e5ab404b7855dee2dfa1d9f093442 . Work around by pickle the inner buffer. - IRNode store fx.Node's in its origin fields. However, we can not pickle a fx.Node. It fails when with the following error when picking the fx.Node.graph: https://gist.github.com/9c289e895d7091d7ec787c67bc3c0d70. Work around by skip origins when pickling a IRNode. - jinja Template in TritonTemplateKernel can not be pickled: `TypeError: Template.__new__() missing 1 required positional argument: 'source' `. Workaround by pickle the source rather than jinjia Template. During unpickling, rebuild the jinja template. - due to how select_algorithm.template_kernels is populated, in child process, it's empty. Work around by passing select_algorithm.template_kernels from parent process to child process directly. - There is some change in TritonTemplate.generate to make a TritonTemplateKernel pickle'able. A TritonTemplate is refered to in the closure for a TritonTemplateKernel object. - We can not pass choice to child process directly because of pickle failure for lambda/local function being used. However cloudpickle can handle lambda. 
Work around by passing the cloudpickle'd choice object to child process. The child project need to unpickle it explictly. Test: ``` python test/inductor/test_max_autotune.py -k test_max_autotune_mm_plus_mm ``` This is basically the repro I get from Bert Maher. Benchmark in sub process is about 4x slower than benchmark in the same process. Without doing any profiling, I feel the time may be cost by starting a new process and doing initialization. Some ~thread~ process pool may help. ``` AUTOTUNE ref_mm_plus_mm(2048x64, 64x1536, 2048x64, 64x1536) triton_mm_plus_mm_0 0.0276s 100.0% triton_mm_plus_mm_6 0.0287s 96.4% triton_mm_plus_mm_5 0.0317s 87.1% triton_mm_plus_mm_1 0.0328s 84.4% ref_mm_plus_mm 0.0379s 73.0% triton_mm_plus_mm_7 0.0379s 73.0% triton_mm_plus_mm_2 0.0399s 69.2% triton_mm_plus_mm_3 0.0410s 67.5% triton_mm_plus_mm_4 0.0410s 67.5% AUTOTUNE takes 12.001659393310547 seconds AUTOTUNE ref_mm_plus_mm(2048x64, 64x1536, 2048x64, 64x1536) triton_mm_plus_mm_0 0.0276s 100.0% triton_mm_plus_mm_6 0.0287s 96.4% triton_mm_plus_mm_1 0.0317s 87.1% triton_mm_plus_mm_5 0.0317s 87.1% ref_mm_plus_mm 0.0379s 73.0% triton_mm_plus_mm_7 0.0389s 71.1% triton_mm_plus_mm_2 0.0399s 69.2% triton_mm_plus_mm_3 0.0410s 67.5% triton_mm_plus_mm_4 0.0410s 67.5% AUTOTUNE takes 51.39659810066223 seconds ``` The feature is disabled by default and can be enabled by setting the following config or envvar: ``` autotune_in_subproc = os.environ.get("TORCHINDUCTOR_AUTOTUNE_IN_SUBPROC") == "1" ``` Differential Revision: [D43996048](https://our.internmc.facebook.com/intern/diff/D43996048) Pull Request resolved: pytorch/pytorch#96410 Approved by: https://github.com/jansel
This PR implements the support to benchmark max-autotune choices in subprocesses. This way crash like triton-lang/triton#1298 will only abort the autotuning child process but the parent process can continue. There are a few things to note: - cuda runtime does not work with fork. So we have to use spawn to create child processes. Check the best practice from pytorch multithreading module: https://pytorch.org/docs/stable/notes/multiprocessing.html - to run a job in a child process, the multiprocessing module needs to pickle both the target function and arguments and pass them to child process. This is the major complexity of this prototype since there are quite a lot of corner cases making pickle fail. Here I list the pickle related issues I encountered: - pickle a StorageBox cause infinite recursion. Error: https://gist.github.com/171e5ab404b7855dee2dfa1d9f093442 . Work around by pickle the inner buffer. - IRNode store fx.Node's in its origin fields. However, we can not pickle a fx.Node. It fails when with the following error when picking the fx.Node.graph: https://gist.github.com/9c289e895d7091d7ec787c67bc3c0d70. Work around by skip origins when pickling a IRNode. - jinja Template in TritonTemplateKernel can not be pickled: `TypeError: Template.__new__() missing 1 required positional argument: 'source' `. Workaround by pickle the source rather than jinjia Template. During unpickling, rebuild the jinja template. - due to how select_algorithm.template_kernels is populated, in child process, it's empty. Work around by passing select_algorithm.template_kernels from parent process to child process directly. - There is some change in TritonTemplate.generate to make a TritonTemplateKernel pickle'able. A TritonTemplate is refered to in the closure for a TritonTemplateKernel object. - We can not pass choice to child process directly because of pickle failure for lambda/local function being used. However cloudpickle can handle lambda. 
Work around by passing the cloudpickle'd choice object to child process. The child project need to unpickle it explictly. Test: ``` python test/inductor/test_max_autotune.py -k test_max_autotune_mm_plus_mm ``` This is basically the repro I get from Bert Maher. Benchmark in sub process is about 4x slower than benchmark in the same process. Without doing any profiling, I feel the time may be cost by starting a new process and doing initialization. Some ~thread~ process pool may help. ``` AUTOTUNE ref_mm_plus_mm(2048x64, 64x1536, 2048x64, 64x1536) triton_mm_plus_mm_0 0.0276s 100.0% triton_mm_plus_mm_6 0.0287s 96.4% triton_mm_plus_mm_5 0.0317s 87.1% triton_mm_plus_mm_1 0.0328s 84.4% ref_mm_plus_mm 0.0379s 73.0% triton_mm_plus_mm_7 0.0379s 73.0% triton_mm_plus_mm_2 0.0399s 69.2% triton_mm_plus_mm_3 0.0410s 67.5% triton_mm_plus_mm_4 0.0410s 67.5% AUTOTUNE takes 12.001659393310547 seconds AUTOTUNE ref_mm_plus_mm(2048x64, 64x1536, 2048x64, 64x1536) triton_mm_plus_mm_0 0.0276s 100.0% triton_mm_plus_mm_6 0.0287s 96.4% triton_mm_plus_mm_1 0.0317s 87.1% triton_mm_plus_mm_5 0.0317s 87.1% ref_mm_plus_mm 0.0379s 73.0% triton_mm_plus_mm_7 0.0389s 71.1% triton_mm_plus_mm_2 0.0399s 69.2% triton_mm_plus_mm_3 0.0410s 67.5% triton_mm_plus_mm_4 0.0410s 67.5% AUTOTUNE takes 51.39659810066223 seconds ``` The feature is disabled by default and can be enabled by setting the following config or envvar: ``` autotune_in_subproc = os.environ.get("TORCHINDUCTOR_AUTOTUNE_IN_SUBPROC") == "1" ``` Differential Revision: [D43996048](https://our.internmc.facebook.com/intern/diff/D43996048) Pull Request resolved: pytorch/pytorch#96410 Approved by: https://github.com/jansel
@bertmaher I'm working on fixing this issue, do you have any reproductions that crash on HEAD? |
Running https://gist.github.com/bertmaher/93302c4f40728d8481873850e84cf47a#file-mm_plus_mm_mlir_assert-py (this is generated by inductor in max-autotune mode) fails with
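For context, the `mm_plus_mm` kernels in that gist all compute the same fused operation — two independent matmuls sharing one accumulator. A reference implementation (NumPy here, purely for illustration; the gist's problem size is `2048x64 @ 64x1536`, twice):

```python
import numpy as np


def mm_plus_mm_ref(a, b, c, d):
    # What the fused Triton kernels compute: a @ b + c @ d. The shared
    # K dimension is small (64), which is why the fusion is discussed
    # above as bandwidth-bound rather than compute-bound.
    return a @ b + c @ d


# Tiny stand-in for the gist's (2048, 64) x (64, 1536) shapes.
a, c = np.ones((4, 2)), np.ones((4, 2))
b, d = np.ones((2, 3)), np.ones((2, 3))
out = mm_plus_mm_ref(a, b, c, d)  # every entry is 2 + 2 = 4
```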