This release consists mostly of kernel improvements, plus a few features and
fixes. Namely:
- We have a new super fast causal-linear kernel written by NVIDIA's Julien
Demouth
- We have faster clustered broadcast and clustered aggregate kernels written by
Apoorv Vyas
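
For readers unfamiliar with what the causal-linear kernel computes, here is a minimal pure-Python sketch of causal linear attention written as a running prefix sum. It is a reference for the math only, not the CUDA kernel from this release. The function names `feature_map` and `causal_linear_attention` are hypothetical, and the `elu(x) + 1` feature map is an assumption.

```python
import math


def feature_map(x):
    # Assumed positive feature map: elu(x) + 1
    # (x + 1 for x > 0, exp(x) otherwise).
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


def causal_linear_attention(queries, keys, values):
    """Reference causal linear attention for a single sequence.

    queries, keys: lists of N vectors of dimension D
    values:        list of N vectors of dimension M
    Position i attends only to positions j <= i.
    """
    d = len(keys[0])
    m = len(values[0])
    # Running sum of outer(feature_map(k_j), v_j) over the positions seen so far.
    s = [[0.0] * m for _ in range(d)]
    # Running sum of feature_map(k_j), used for normalization.
    z = [0.0] * d
    out = []
    for q, k, v in zip(queries, keys, values):
        fq = feature_map(q)
        fk = feature_map(k)
        # Fold the current key/value pair into the running state first,
        # so that position i also attends to itself.
        for a in range(d):
            z[a] += fk[a]
            for b in range(m):
                s[a][b] += fk[a] * v[b]
        denom = sum(fq[a] * z[a] for a in range(d))
        out.append(
            [sum(fq[a] * s[a][b] for a in range(d)) / denom for b in range(m)]
        )
    return out
```

The running state means the cost grows linearly with sequence length instead of quadratically. An optimized kernel would parallelize this same computation on the GPU.
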
What should have been in this release but isn't, because I didn't have time to
work on it :-) :
- Fancier masking that allows for different masks per sample while maintaining
backwards compatibility
- Checkpointing for training huge models on single-GPU machines
- 16-bit kernels for linear, local and clustering