mryab · Mar 16, 2025 · Mar 17, 2025
diff --git a/README.md b/README.md
@@ -24,7 +24,9 @@ __This branch corresponds to the ongoing 2025 course. If you want to see full ma
   - Seminar: In-depth overview of FSDP2
 - [__Week 7:__](./week07_application_deployment) __Python web application deployment__
   - Lecture/Seminar: Building and deployment of production-ready web services. App & web servers, Docker, Prometheus, API via HTTP and gRPC.
-- __Week 8:__ __LLM inference optimizations and software__
+- [__Week 8:__](./week08_inference_software) __LLM inference optimizations and software__
+  - Lecture: Inference speed metrics. KV caching, batch inference, continuous batching. FlashAttention with its modifications and PagedAttention. Overview of popular LLM serving frameworks.
+  - Seminar: Implementation of KV caching. Basics of the Triton language. Layer fusion in PyTorch and Triton. Liger Kernels. FlashAttention and FlexAttention in practice.
 - [__Week 9:__](./week09_inference_algorithms) __Efficient model inference__
   - Lecture: Speculative decoding, architecture optimizations, quantization, knowledge distillation
   - Seminar: Introduction to speculative decoding. Matrix multiplication in Triton for different scenarios.

diff --git a/week08_inference_software/README.md b/week08_inference_software/README.md
@@ -0,0 +1,19 @@
+# Week 8: LLM inference optimizations and software
+
+* Lecture: [link](./lecture.pdf)
+* Seminar: [link](./seminar/seminar.ipynb)
+
+## Further reading
+* [What is the KV cache?](https://mett29.github.io/posts/kv-cache/)
+* [Overview of torch.compiler](https://pytorch.org/docs/stable/torch.compiler.html#torch-compiler-overview)
+* [Torch Dynamo Overview](https://pytorch.org/docs/stable/torch.compiler_dynamo_overview.html)
+* [Torch Dynamo Deep-Dive](https://pytorch.org/docs/stable/torch.compiler_dynamo_deepdive.html)
+* [Torch Compiler Troubleshooting](https://github.com/pytorch/pytorch/blob/main/docs/source/torch.compiler_troubleshooting.rst)
+* [Deep Dive into Triton Internals (3 Parts)](https://www.kapilsharma.dev/posts/deep-dive-into-triton-internals/)
+* [HF Ultra-Scale Playbook (fused kernels section)](https://huggingface.co/spaces/nanotron/ultrascale-playbook?section=fused_kernels)
+* [Liger Kernels repo](https://github.com/linkedin/Liger-Kernel/tree/main/src/liger_kernel/ops)
+* [Liger Kernels paper](https://arxiv.org/pdf/2410.10989)
+* [FlashAttention](https://arxiv.org/pdf/2205.14135)
+* [FlashAttention-2](https://arxiv.org/pdf/2307.08691)
+* [FlashAttention-3](https://arxiv.org/pdf/2407.08608)
+* [FlexAttention Tutorial](https://pytorch.org/blog/flexattention/)
diff --git a/week08_inference_software/lecture.pdf b/week08_inference_software/lecture.pdf
diff --git a/week08_inference_software/seminar/images/fused_kernels1.png b/week08_inference_software/seminar/images/fused_kernels1.png
diff --git a/week08_inference_software/seminar/images/fused_kernels2.png b/week08_inference_software/seminar/images/fused_kernels2.png
diff --git a/week08_inference_software/seminar/images/prefixLM.png b/week08_inference_software/seminar/images/prefixLM.png
diff --git a/week08_inference_software/seminar/images/rowcolumnarrays.webp b/week08_inference_software/seminar/images/rowcolumnarrays.webp
diff --git a/week08_inference_software/seminar/seminar.ipynb b/week08_inference_software/seminar/seminar.ipynb