Merged
4 changes: 3 additions & 1 deletion README.md
@@ -24,7 +24,9 @@ __This branch corresponds to the ongoing 2025 course. If you want to see full ma
- Seminar: In-depth overview of FSDP2
- [__Week 7:__](./week07_application_deployment) __Python web application deployment__
- Lecture/Seminar: Building and deployment of production-ready web services. App & web servers, Docker, Prometheus, API via HTTP and gRPC.
- __Week 8:__ __LLM inference optimizations and software__
- [__Week 8:__](./week08_inference_software) __LLM inference optimizations and software__
- Lecture: Inference speed metrics. KV caching, batch inference, continuous batching. FlashAttention with its modifications and PagedAttention. Overview of popular LLM serving frameworks.
- Seminar: Implementation of KV caching. Basics of the Triton language. Layer fusion in PyTorch and Triton. Liger Kernels. FlashAttention and FlexAttention in practice.
- [__Week 9:__](./week09_inference_algorithms) __Efficient model inference__
- Lecture: Speculative decoding, architecture optimizations, quantization, knowledge distillation
- Seminar: Introduction to speculative decoding. Matrix multiplication in Triton for different scenarios.
19 changes: 19 additions & 0 deletions week08_inference_software/README.md
@@ -0,0 +1,19 @@
# Week 8: LLM inference optimizations and software

* Lecture: [link](./lecture.pdf)
* Seminar: [link](./seminar/seminar.ipynb)

## Further reading
* [What is the KV cache?](https://mett29.github.io/posts/kv-cache/)
* [Overview of torch.compiler](https://pytorch.org/docs/stable/torch.compiler.html#torch-compiler-overview)
* [Torch Dynamo Overview](https://pytorch.org/docs/stable/torch.compiler_dynamo_overview.html)
* [Torch Dynamo Deep-Dive](https://pytorch.org/docs/stable/torch.compiler_dynamo_deepdive.html)
* [Torch Compiler Troubleshooting](https://github.com/pytorch/pytorch/blob/main/docs/source/torch.compiler_troubleshooting.rst)
* [Deep Dive into Triton Internals (3 Parts)](https://www.kapilsharma.dev/posts/deep-dive-into-triton-internals/)
* [HF Ultra-Scale Playbook (fused kernels section)](https://huggingface.co/spaces/nanotron/ultrascale-playbook?section=fused_kernels)
* [Liger Kernel repo](https://github.com/linkedin/Liger-Kernel/tree/main/src/liger_kernel/ops)
* [Liger Kernel paper](https://arxiv.org/pdf/2410.10989)
* [FlashAttention](https://arxiv.org/pdf/2205.14135)
* [FlashAttention-2](https://arxiv.org/pdf/2307.08691)
* [FlashAttention-3](https://arxiv.org/pdf/2407.08608)
* [Flex Attention Tutorial](https://pytorch.org/blog/flexattention/)
Binary file added week08_inference_software/lecture.pdf
Binary file not shown.
1,632 changes: 1,632 additions & 0 deletions week08_inference_software/seminar/seminar.ipynb

Large diffs are not rendered by default.