|
321 | 321 |
322 | 322 | | Meta | Title | Cover | Publish | Code | Note |
323 | 323 | |:-----|:------|:------|:--------|:-----|:-----|
324 | | -| [nmSPARSE](./meta/2023/nmSPARSE.prototxt) | [Efficient GPU Kernels for N:M-Sparse Weights in Deep Learning](https://proceedings.mlsys.org/paper_files/paper/2023/file/4552cedd396a308320209f75f56a5ad5-Paper-mlsys2023.pdf) | |  | [](https://github.com/microsoft/SparTA) | | |
| 324 | +| [nmSPARSE](./meta/2023/nmSPARSE.prototxt) | [Efficient GPU Kernels for N:M-Sparse Weights in Deep Learning](https://proceedings.mlsys.org/paper_files/paper/2023/file/a10deb4d5227a8ea307ea8ff3cb712f4-Paper-mlsys2023.pdf) | |  | [](https://github.com/microsoft/SparTA) | |
325 | 325 | | [AWQ](./meta/2024/awq.prototxt) | [AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration](https://arxiv.org/abs/2306.00978) | |  | [](https://github.com/mit-han-lab/llm-awq) | |
326 | 326 | | [Vidur](./meta/2024/Vidur.prototxt) | [Vidur: A Large-Scale Simulation Framework For LLM Inference](http://arxiv.org/abs/2405.05465v2) |  |  | [](https://github.com/microsoft/vidur) | [note](./notes/2024/Vidur/note.md) |
327 | 327 | | [0VRXJQ3F](./meta/2025/0VRXJQ3F.prototxt) | [Rethinking Key-Value Cache Compression Techniques for Large Language Model Serving](http://arxiv.org/abs/2503.24000v1) |  |  | [](https://github.com/LLMkvsys/rethink-kv-compression) | [note](./notes/2025/0VRXJQ3F/note.md) |
|
|
350 | 350 | | [SlimGPT](./meta/2024/SlimGPT.prototxt) | [SlimGPT: Layer-wise Structured Pruning for Large Language Models](http://arxiv.org/abs/2412.18110v1) |  |  | | [note](./notes/2024/SlimGPT/note.md) |
351 | 351 | | [SparseLLM](./meta/2024/SparseLLM.prototxt) | [SparseLLM: Towards Global Pruning for Pre-trained Language Models](http://arxiv.org/abs/2402.17946v3) | |  | [](https://github.com/BaiTheBest/SparseLLM) | [note](./notes/2024/SparseLLM/note.md) |
352 | 352 | | [ZipCache](./meta/2024/ZipCache.prototxt) | [ZipCache: Accurate and Efficient KV Cache Quantization with Salient Token Identification](http://arxiv.org/abs/2405.14256v1) | |  | | [note](./notes/2024/ZipCache/note.md) |
| 353 | +| [DeltaAttention](./meta/2025/DeltaAttention.prototxt) | [Delta Attention: Fast and Accurate Sparse Attention Inference by Delta Correction](http://arxiv.org/abs/2505.11254v1) |  |  | | [note](./notes/2025/DeltaAttention/note.md) |
| 354 | +| [MoBA](./meta/2025/MoBA.prototxt) | [MoBA: Mixture of Block Attention for Long-Context LLMs](http://arxiv.org/abs/2502.13189v1) |  |  | [](https://github.com/MoonshotAI/MoBA) | [note](./notes/2025/MoBA/note.md) |
| 355 | +| [SageAttention3](./meta/2025/SageAttention3.prototxt) | [SageAttention3: Microscaling FP4 Attention for Inference and An Exploration of 8-Bit Training](http://arxiv.org/abs/2505.11594v1) | |  | [](https://github.com/thu-ml/SageAttention) | [note](./notes/2025/SageAttention3/note.md) |
353 | 356 | | [Týr-the-Pruner](./meta/2025/Týr-the-Pruner.prototxt) | [Týr-the-Pruner: Unlocking Accurate 50% Structural Pruning for LLMs via Global Sparsity Distribution Optimization](http://arxiv.org/abs/2503.09657v2) |  |  | | [note](./notes/2025/Týr-the-Pruner/note.md) |
354 | 357 | </p>
355 | 358 | </details>
|
|
578 | 581 | | [DBudgetKV](./meta/2025/DBudgetKV.prototxt) | [DBudgetKV: Dynamic Budget in KV Cache Compression for Ensuring Optimal Performance](http://arxiv.org/abs/2502.16886v1) |  |  | | [note](./notes/2025/DBudgetKV/note.md) |
579 | 582 | | [DReSS](./meta/2025/DReSS.prototxt) | [DReSS: Data-driven Regularized Structured Streamlining for Large Language Models](http://arxiv.org/abs/2501.17905v3) |  |  | | [note](./notes/2025/DReSS/note.md) |
580 | 583 | | [DeepSeek-R1](./meta/2025/DeepSeek-R1.prototxt) | [DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning](http://arxiv.org/abs/2501.12948v1) |  |  | [](https://github.com/deepseek-ai/DeepSeek-R1) | [note](./notes/2025/DeepSeek-R1/note.md) |
581 | | -| [DeltaAttention](./meta/2025/DeltaAttention.prototxt) | [Delta Attention: Fast and Accurate Sparse Attention Inference by Delta Correction](http://arxiv.org/abs/2505.11254v1) |  |  | | [note](./notes/2025/DeltaAttention/note.md) | |
582 | 584 | | [DeltaLLM](./meta/2025/DeltaLLM.prototxt) | [DeltaLLM: A Training-Free Framework Exploiting Temporal Sparsity for Efficient Edge LLM Inference](http://arxiv.org/abs/2507.19608v1) |  |  | | [note](./notes/2025/DeltaLLM/note.md) |
583 | 585 | | [LIMINAL](./meta/2025/LIMINAL.prototxt) | [Efficient LLM Inference: Bandwidth, Compute, Synchronization, and Capacity are all you need](http://arxiv.org/abs/2507.14397v1) | |  | | [note](./notes/2025/LIMINAL/note.md) |
584 | 586 | | [RaaS](./meta/2025/RaaS.prototxt) | [Efficient Long-Decoding Inference with Reasoning-Aware Attention Sparsity](http://arxiv.org/abs/2502.11147v1) |  |  | | [note](./notes/2025/RaaS/note.md) |
|
|
612 | 614 | | [52A7RO95](./meta/2025/52A7RO95.prototxt) | [Mixture of Experts in Large Language Models](http://arxiv.org/abs/2507.11181v1) |  |  | | [note](./notes/2025/52A7RO95/note.md) |
613 | 615 | | [MoSA](./meta/2025/MoSA.prototxt) | [Mixture of Sparse Attention: Content-Based Learnable Sparse Attention via Expert-Choice Routing](http://arxiv.org/abs/2505.00315v1) |  |  | [](https://github.com/piotrpiekos/MoSA) | [note](./notes/2025/MoSA/note.md) |
614 | 616 | | [MoR](./meta/2025/MoR.prototxt) | [Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation](http://arxiv.org/abs/2507.10524v1) |  |  | [](https://github.com/raymin0223/mixture_of_recursions) | [note](./notes/2025/MoR/note.md) |
615 | | -| [MoBA](./meta/2025/MoBA.prototxt) | [MoBA: Mixture of Block Attention for Long-Context LLMs](http://arxiv.org/abs/2502.13189v1) |  |  | [](https://github.com/MoonshotAI/MoBA) | [note](./notes/2025/MoBA/note.md) | |
616 | 617 | | [MoPEQ](./meta/2025/MoPEQ.prototxt) | [MoPEQ: Mixture of Mixed Precision Quantized Experts](http://arxiv.org/abs/2509.02512v1) |  |  | [](https://github.com/krishnateja95/MoE-Mixed-Prec) | [note](./notes/2025/MoPEQ/note.md) |
617 | 618 | | [Mosaic](./meta/2025/Mosaic.prototxt) | [Mosaic: Composite Projection Pruning for Resource-efficient LLMs](http://arxiv.org/abs/2504.06323v1) |  |  | | [note](./notes/2025/Mosaic/note.md) |
618 | 619 | | [PagedEviction](./meta/2025/PagedEviction.prototxt) | [PagedEviction: Structured Block-wise KV Cache Pruning for Efficient Large Language Model Inference](http://arxiv.org/abs/2509.04377v1) |  |  | | [note](./notes/2025/PagedEviction/note.md) |
|
|
629 | 630 | | [RotateKV](./meta/2025/RotateKV.prototxt) | [RotateKV: Accurate and Robust 2-Bit KV Cache Quantization for LLMs via Outlier-Aware Adaptive Rotations](http://arxiv.org/abs/2501.16383v2) |  |  | | [note](./notes/2025/RotateKV/note.md) |
630 | 631 | | [SALE](./meta/2025/SALE.prototxt) | [SALE : Low-bit Estimation for Efficient Sparse Attention in Long-context LLM Prefilling](http://arxiv.org/abs/2505.24179v1) |  |  | [](https://github.com/BirdChristopher/SALE) | [note](./notes/2025/SALE/note.md) |
631 | 632 | | [SEAP](./meta/2025/SEAP.prototxt) | [SEAP: Training-free Sparse Expert Activation Pruning Unlock the Brainpower of Large Language Models](http://arxiv.org/abs/2503.07605v1) |  |  | [](https://github.com/IAAR-Shanghai/SEAP) | [note](./notes/2025/SEAP/note.md) |
632 | | -| [SageAttention3](./meta/2025/SageAttention3.prototxt) | [SageAttention3: Microscaling FP4 Attention for Inference and An Exploration of 8-Bit Training](http://arxiv.org/abs/2505.11594v1) | |  | [](https://github.com/thu-ml/SageAttention) | [note](./notes/2025/SageAttention3/note.md) | |
633 | 633 | | [SeerAttention-R](./meta/2025/SeerAttention-R.prototxt) | [SeerAttention-R: Sparse Attention Adaptation for Long Reasoning](http://arxiv.org/abs/2506.08889v1) |  |  | [](https://github.com/microsoft/SeerAttention) | [note](./notes/2025/SeerAttention-R/note.md) |
634 | 634 | | [Seesaw](./meta/2025/Seesaw.prototxt) | [Seesaw: High-throughput LLM Inference via Model Re-sharding](http://arxiv.org/abs/2503.06433v1) |  |  | | [note](./notes/2025/Seesaw/note.md) |
635 | 635 | | [SlimInfer](./meta/2025/SlimInfer.prototxt) | [SlimInfer: Accelerating Long-Context LLM Inference via Dynamic Token Pruning](http://arxiv.org/abs/2508.06447v1) |  |  | | [note](./notes/2025/SlimInfer/note.md) |
|
|