- LightSeq: A High Performance Inference Library for Transformers [NAACL 2021] Xiaohui Wang, Ying Xiong, Yang Wei, Mingxuan Wang, Lei Li.
- Uses rewritten custom kernels together with cuBLAS GEMM; this accounts for most of the time saved.
- Proposes a hierarchical beam search method that uses a two-stage retrieve-and-rerank strategy to reduce the softmax computation.
- Dynamic GPU memory reuse.
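
The retrieve-and-rerank idea can be illustrated with a minimal CPU sketch in plain Python. This is not LightSeq's CUDA implementation; the function name `two_stage_top_k` and the strided grouping are assumptions made for illustration. The sketch splits the logits into `k` groups, uses the smallest group maximum as a rough threshold, retrieves every logit at or above it, and then ranks only those candidates. Because each of the `k` groups contributes at least one value at or above the threshold, the true top-`k` entries are always among the candidates.

```python
import heapq


def two_stage_top_k(logits, k):
    """Return (top-k indices, candidate count) via retrieve-then-rerank.

    Hypothetical helper for illustration; not part of the LightSeq API.
    """
    n = len(logits)
    if not 0 < k <= n:
        raise ValueError("k must satisfy 0 < k <= len(logits)")

    # Stage 1 (retrieve): split the vocabulary into k strided groups and
    # take each group's maximum. The smallest of these maxima is a lower
    # bound on the k-th largest logit overall.
    group_max = [max(logits[g::k]) for g in range(k)]
    threshold = min(group_max)
    candidates = [i for i, x in enumerate(logits) if x >= threshold]

    # Stage 2 (rerank): exact top-k, but only over the retrieved candidates.
    top = heapq.nlargest(k, candidates, key=lambda i: logits[i])
    return top, len(candidates)


if __name__ == "__main__":
    logits = [0.1, 2.5, -1.0, 3.7, 0.9, 1.2, 4.4, -0.3]
    top, n_candidates = two_stage_top_k(logits, k=3)
    print(top, n_candidates)
```

Since softmax is monotonic, ranking by raw logits gives the same order as ranking by probabilities, so the expensive part of the ranking only has to touch the candidate set rather than the full vocabulary.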