AOT compilation #935

david-macleod · 2022-12-01T13:28:37Z

Hi, I was just wondering if there had been any more thoughts on supporting AOT kernel compilation to allow execution outside of Python? Referencing #175

gaxler · 2022-12-02T08:17:09Z

We are waiting on a rewrite to be done

See: #490 (comment)

yufenglee · 2022-12-02T18:25:06Z

Nice! Do you have a rough estimation when it will be done?

ptillet · 2022-12-02T19:56:59Z

The rewrite will be done this month. There is some very basic AOT that we made for unit testing purposes right now, but efforts on a more complex one will be able to resume after that.

yufenglee · 2022-12-07T04:49:42Z

And what will AoT compilation generate, a C/C++ API plus source/.so?

david-macleod · 2022-12-11T20:18:55Z

Great news, is there some branch/PR we can track the progress of this?

david-macleod · 2022-12-19T23:18:51Z

@ptillet I am very keen to have a go at using this feature whatever state the code currently is in, even if it is only the unit test you mentioned previously (have a time sensitive project which could benefit from AOT functionality)

gaxler · 2022-12-20T00:38:44Z

We have a prototype that works with an old version of Triton. You might be able to hack it for your needs?
#490

gaxler · 2022-12-20T00:44:16Z

> And what will AoT compilation generate, a C/C++ API plus source/.so?

For previous iterations we started with C code that holds the kernels in source.
The thinking is to give users something very general.

david-macleod · 2022-12-20T17:00:06Z

> We have a prototype that works with an old version of Triton. You might be able to hack it for your needs? #490

Great thanks @gaxler, will give it a go! For the main feature is there any WIP branch that can be tracked or is it separate from the main repo?

david-macleod · 2023-01-03T15:38:34Z

@gaxler should there be a correlation between the triton BLOCK_SIZE defined in the kernel definition, and the gX, gY, gZ defined in GridWarps when calling the kernel?

gaxler · 2023-01-04T05:37:19Z

> @gaxler should there be a correlation between the triton BLOCK_SIZE defined in the kernel definition, and the gX, gY, gZ defined in GridWarps when calling the kernel?

You mean add grid size constraints at compile-time?

In general I avoided dealing with anything related to kernel launches in the draft PR, it's all just placeholders to make it run
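As a rough illustration of the relationship being asked about: in the usual JIT pattern for a 1D kernel, each program instance handles BLOCK_SIZE elements, so the launch grid needs the ceiling of n / BLOCK_SIZE programs. The sketch below is only that arithmetic; `grid_for` is a hypothetical helper, and whether gX, gY, gZ in the prototype's GridWarps follow the same convention is an assumption, since the draft treats launch parameters as placeholders.

```python
def cdiv(n: int, block: int) -> int:
    """Ceiling division: number of blocks needed to cover n elements."""
    return (n + block - 1) // block


def grid_for(n_elements: int, block_size: int) -> tuple:
    """Hypothetical helper: 1D launch grid for a kernel where each
    program instance processes `block_size` elements."""
    return (cdiv(n_elements, block_size), 1, 1)


# BLOCK_SIZE must match the value the kernel was compiled with;
# otherwise the grid covers too few or too many elements.
BLOCK_SIZE = 256
gX, gY, gZ = grid_for(1000, BLOCK_SIZE)
```

If the grid is smaller than this, the tail of the input is never processed; if it is larger, the kernel relies on its own bounds mask to skip the extra programs.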

david-macleod · 2023-01-05T19:08:29Z

Great thanks! I now have it working but have noticed the performance is much worse than the JIT triton equivalent. From the profile trace I see large gaps between the triton kernel and the preceding/succeeding kernels.

I am aware you are not actively maintaining this but was just wondering if this is expected or if you have any hints? I am not that familiar with PTX but understand it is JIT compiled so was wondering if it was not being cached correctly or something like that.

gaxler · 2023-01-05T20:12:50Z

sorry that you have to bump into all those things. this is just a POC and in no way optimized.
thanks for profiling the generated code!!

probably the worst thing for the C code performance is the PTX. it gets compiled to binary every time you call a kernel. this will be replaced by a cubin.

another overhead might be the dispatch for different input sizes. not sure how significant it is for overall performance.

perhaps you can use several cuda streams to bypass those issues?

david-macleod · 2023-01-05T22:55:03Z

If I know my target hardware a priori, are there any downsides/gotchas to me dumping the PTX code to a file, compiling it down to a cubin and loading that instead? Could that potentially help with the overheads?
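For reference, an offline PTX-to-cubin step can be sketched as a single `ptxas` invocation from the CUDA toolkit. The file names and the `sm_80` target below are placeholders, and the helper names are hypothetical, not part of the prototype. One gotcha worth keeping in mind is that a cubin is tied to the architecture it was built for, whereas PTX can be JIT compiled for newer GPUs.

```python
import subprocess


def ptxas_command(ptx_path: str, cubin_path: str, arch: str) -> list:
    """Build a ptxas command line that compiles PTX to a cubin for one GPU arch."""
    return ["ptxas", f"-arch={arch}", "-o", cubin_path, ptx_path]


def compile_ptx(ptx_path: str, cubin_path: str, arch: str) -> None:
    """Run ptxas; requires the CUDA toolkit to be on PATH."""
    subprocess.run(ptxas_command(ptx_path, cubin_path, arch), check=True)


# Placeholder paths and target architecture, for illustration only.
cmd = ptxas_command("kernel.ptx", "kernel.cubin", "sm_80")
print(" ".join(cmd))
```

The resulting file would then be loaded in place of the PTX string, so the driver no longer has to compile anything at load time.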

david-macleod · 2023-01-06T20:39:33Z

Converting to cubin has helped a lot! (in the trace the triton kernel is the one that sits between the orange and green)

[Profiler trace screenshots: JIT, AOT - PTX, AOT - cubin]

Whilst the overhead is now much smaller, there is still a gap in utilization before and after the AOT triton kernel is run (perhaps there is some implicit synchronisation happening).

Regarding your suggestion about the dispatch time, I am guessing that could result in a delay on the host thread, but as long as the kernel is launched sufficiently before the device is ready to execute it (which we are pretty sure is the case here), that cost should be hidden?

EDIT: I now think the overheads might be related to the module loading, need to confirm

gaxler · 2023-01-06T23:25:32Z

Assuming the _tr... is a triton JITFunction for JIT and the launch function from the generated C code for AOT.

I think you are correct.
The JITFunction does the module and function loading before it calls the launch code. For the generated C code each call loads the module and the CUFunction.

Thanks for doing this, this will be helpful when thinking about optimizing the generated code!

david-macleod · 2023-01-07T21:11:28Z

Tried caching the loaded CUFunction and things are now looking very close to JIT performance (only 5-10% slower now) 🙂

gaxler · 2023-01-14T07:21:48Z

Got a new prototype together, maybe this can help in some way: #1056

david-macleod · 2023-01-24T22:04:49Z

Thanks, will check it out

david-macleod · 2023-04-03T09:41:31Z

Do you know how close it is to being merged? (just trying to gauge whether I should wait or work from the branch)

gaxler · 2023-04-05T23:26:39Z

It's pretty close but there are other things that have priority over merging it. So working from the branch will be better. I'm happy to help, it will be great to get user feedback

david-macleod · 2023-04-21T20:37:18Z

@gaxler what is the relationship between this branch and aot.py on master? Will they both continue to exist after this branch is complete?

Jokeren added the enhancement label Dec 4, 2022
