Commit
commit 4647ec28c8450ee96f4709626617763712efd77e (binmakeswell, Thu May 23 17:44:06 2024 +0800)
    [inference] release (#5747)
commit df6747603f11e2a1929db193ceb014799e02e2c1 (merge 22ce873c + 498f42c4; Yuanheng Zhao, Wed May 22 14:31:09 2024 +0800)
    [Colossal-Inference] (v0.1.0) Merge pull request #5739 from hpcaitech/feature/colossal-infer
commit 498f42c45b256b5cfc32d74b552e1e306f317a42 (Yuanheng Zhao, Wed May 22 12:08:49 2024 +0800)
    [NFC] fix requirements (#5744)
commit bd38fe6b912379080673a43d77fd3bdf0e5c852e (Yuanheng Zhao, Tue May 21 22:12:15 2024 +0800)
    [NFC] Fix code factors on inference triton kernels (#5743)
commit c2c8c9cf17d67000df8a5b75ae9dbecee0e1c00a (Yuanheng Zhao, Tue May 21 18:20:57 2024 +0800)
    [ci] Temporary fix for build on pr (#5741)
    * temporary fix for CI; timeout to 90
commit c06208e72c35d74e150b6a83e72375f5021d10b1 (merge d8b1ea4a + 8633c15d; Yuanheng Zhao, Tue May 21 11:26:37 2024 +0800)
    Merge pull request #5737 from yuanheng-zhao/inference/sync/main ([sync] Sync feature/colossal-infer with main)
commit 22ce873c3f26fd7f4217cdf19071c173683c2b47 (Haze188, Tue May 21 11:07:13 2024 +0800)
    [Shardformer] Add parallel output for shardformer models (bloom, falcon) (#5702)
    * add parallel cross entropy output for the falcon model and fix some typos in bloom.py; fix module name error (self.model -> self.transformers) in the bloom and falcon models; fix the overflow bug of the distributed cross entropy loss function when training with fp16; add dtype to the parallel cross entropy loss function; fix dtype-related typos and prettify loss.py; fix grad dtype and update the dtype mismatch error
commit 8633c15da9b82c675c59ad292e7f0d77f092653c (merge d8b1ea4a + 9d83c6d7; Yuanheng Zhao, Mon May 20 15:50:53 2024 +0000)
    [sync] Sync feature/colossal-infer with main
commit d8b1ea4ac90317ad6126acbd854e66583a8f9c8f (Yuanheng Zhao, Mon May 20 22:50:04 2024 +0800)
    [doc] Update Inference Readme (#5736)
    * update inference readme; add contents
commit bdf9a001d61cfad4bb68752c4a808295165307a0 (Yuanheng Zhao, Mon May 20 22:49:18 2024 +0800)
    [Fix/Inference] Add unsupported auto-policy error message (#5730)
commit 283c407a19002118bda7edd1b8a3acf099843205 (Yuanheng Zhao, Sun May 19 15:08:42 2024 +0800)
    [Inference] Fix Inference Generation Config and Sampling (#5710)
    * config default values; fix gen config passing; fix rpc generation config
commit 9d83c6d715e8cdb802f82335e651923baab5cfc6 (flybird11111, Fri May 17 18:18:59 2024 +0800)
    [lazy] fix lazy cls init (#5720)
    * remove kernel install; rebase revert fix
commit 8bcfe360fdae7ccec7051aaced48497519afc2f2 (Yuanheng Zhao, Fri May 17 11:28:53 2024 +0800)
    [example] Update Inference Example (#5725)
commit a8d459f99a1d415fc843327e4dafce19ecee1f3e (傅剑寒, Thu May 16 10:49:03 2024 +0800)
    [Inference] Delete duplicated package (#5723)
commit f47f2fbb2467df15548d2c663b119f4ae0103890 (Jianghai, Wed May 15 15:47:31 2024 +0800)
    [Inference] Fix API server, test and example (#5712)
    * fix api server; fix generation config; fix infer hanging bug; resolve comments, change backend to free port
commit 74c47921facd26dbd93172bf887abcad4eab2d5c (Runyu Lu, Tue May 14 20:17:43 2024 +0800)
    [Fix] Llama3 Load/Omit CheckpointIO Temporarily (#5717)
    * fix Llama3 load error; omit checkpoint IO temporarily
commit 5bbab1533ae7672ab37e91b7bc9e584b3a4e9cc1 (Yuanheng Zhao, Tue May 14 16:08:51 2024 +0800)
    [ci] Fix example tests (#5714)
    * revise timeout value on example CI
commit 121d7ad629c746e52a96ec53d6e26c0194016a03 (傅剑寒, Tue May 14 14:35:33 2024 +0800)
    [Inference] Delete duplicated copy_vector (#5716)
commit 7806842f2dbb4b6d6e74014efc7db5be8ccf0bbd (Steve Luo, Tue May 14 12:46:54 2024 +0800)
    add paged-attention v2: support seq length split across thread block (#5707)
commit 18d67d0e8e79c22bded0745c7d3daf8ca40d445c (Runyu Lu, Tue May 14 10:00:55 2024 +0800)
    [Feat] Inference RPC Server Support (#5705)
    * rpc support source; kv cache logical/physical disaggregation; sampler refactor; colossalai launch built in; unit test; RPyC support
commit de4bf3dedf2c7cb7ba6c3044745bab3c3ef6352d (yuehuayingxueluo, Sat May 11 15:13:25 2024 +0800)
    [Inference] Adapt repetition_penalty and no_repeat_ngram_size (#5708)
    * adapt repetition_penalty and no_repeat_ngram_size; fix no_repeat_ngram_size_logit_process; remove batch_updated; rm get_batch_token_ids
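The repetition_penalty and no_repeat_ngram_size commit above (#5708) adds two standard logit processors. As a rough illustration of what such processors do, here is a minimal plain-PyTorch sketch of the conventional HF-style semantics; the function and variable names are hypothetical, not the colossalai implementation:

    import torch

    def apply_repetition_penalty(logits, generated_ids, penalty):
        # logits: [vocab] scores for one sequence's next token; generated_ids: [n] prior tokens.
        scores = logits.gather(0, generated_ids)
        # Divide positive scores, multiply negative ones: penalty > 1 always discourages repeats.
        scores = torch.where(scores > 0, scores / penalty, scores * penalty)
        return logits.scatter(0, generated_ids, scores)

    def ban_repeated_ngrams(logits, generated_ids, ngram_size):
        # Ban any token that would complete an n-gram already present in generated_ids.
        n = ngram_size
        prefix = tuple(generated_ids[-(n - 1):]) if n > 1 else ()
        for i in range(len(generated_ids) - n + 1):
            if tuple(generated_ids[i : i + n - 1]) == prefix:
                logits[generated_ids[i + n - 1]] = float("-inf")
        return logits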
commit 50104ab340e6c7067fbaaf9b47c608eb828aa95b (傅剑寒, Fri May 10 18:39:54 2024 +0800)
    [Inference/Feat] Add convert_fp8 op for fp8 test in the future (#5706)
commit bfad39357b0fe31ecf6f7639e2c4056165078a3f (傅剑寒, Thu May 9 18:03:24 2024 +0800)
    [Inference/Feat] Add quant kvcache interface (#5700)
    * add quant kvcache interface; delete unused output; complete args comments
commit 492520dbdb962d207ac40d216e0414807f73eb19 (merge d4829220 + 5d9a4948; Jianghai, Thu May 9 17:19:45 2024 +0800)
    Merge pull request #5588 from hpcaitech/feat/online-serving ([Feature] Online Serving)
commit 5d9a49483d98ccd4bebebbfd039162caceefe6bd (CjhHa1, Thu May 9 05:44:05 2024 +0000)
    [Inference] Add example test_ci script
commit bc9063adf1598c3be32fc2d12577d76b9daa79bf (CjhHa1, Wed May 8 10:36:42 2024 +0000)
    resolve rebase conflicts on branch feat/online-serving
commit 61a1b2e798edcbf91ac35966a4047407ad6aa62d (Jianghai, Wed May 8 15:14:06 2024 +0800)
    [Inference] Fix bugs and docs for feat/online-server (#5598)
    * fix test bugs; add do sample test; delete version tag; del test server; Revert "add" (reverts commit b9305fb02440d5cd566d32b508bee9f9c13dda15)
commit 7bbb28e48bdb5849d9dfb118d7bf2959d79bbe02 (CjhHa1, Thu Apr 11 10:12:31 2024 +0800)
    [Inference] resolve rebase conflicts fix
commit c06403286567f62cb0a6dfc5e075cf60e291cea9 (Jianghai, Sun Apr 7 14:45:43 2024 +0800)
    [Online Server] Chat Api for streaming and not streaming response (#5470)
    * fix api server; add chat api and test; del request.n
commit de378cd2abd77b464786dc5f8298c9edbf023fbc (Jianghai, Mon Mar 18 17:06:05 2024 +0800)
    [Inference] Finish Online Serving Test, add streaming output api, continuous batching test and example (#5432)
    * finish online test and add examples; fix test_continuous_batching; fix typos
commit 69cd7e069d5705c7e431b301ac14924711c74e41 (Jianghai, Fri Mar 1 14:47:36 2024 +0800)
    [Inference] ADD async and sync Api server using FastAPI (#5396)
    * add api server; add completion service; add generation config; revise shardformer; add docstrings; add choices for prompt template
commit d482922035ff7b6fe7ced8e6c4028faa2d68197f (yuehuayingxueluo, Wed May 8 19:59:10 2024 +0800)
    [Inference] Support the logic related to ignoring EOS token (#5693)
    * support ignore EOS token; add ValueError for top_p and top_k; add GQA test; change variable's name
commit 9c2fe7935ff5aaec4f174cfba6f324df623c7447 (yuehuayingxueluo, Wed May 8 17:58:29 2024 +0800)
    [Inference] Adapt temperature processing logic (#5689)
    * adapt temperature processing logic; add ValueError for top_p and top_k; add GQA test; fix except_msg
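The two sampling commits above (#5689, #5693) adjust temperature handling and validate top_p/top_k. A minimal sketch of the standard processing chain those bullets refer to; the exact bounds and names below are illustrative assumptions, not ColossalAI's API:

    import torch

    def process_logits(logits, temperature=1.0, top_k=0, top_p=1.0):
        # logits: [vocab] for one sequence.
        if top_p <= 0.0 or top_p > 1.0:
            raise ValueError(f"top_p must be in (0, 1], got {top_p}")
        if top_k < 0:
            raise ValueError(f"top_k must be non-negative, got {top_k}")
        logits = logits / max(temperature, 1e-5)        # temperature scaling
        if top_k > 0:                                   # keep only the k highest logits
            kth = torch.topk(logits, top_k).values[-1]
            logits = logits.masked_fill(logits < kth, float("-inf"))
        if top_p < 1.0:                                 # nucleus (top-p) filtering
            sorted_logits, sorted_idx = torch.sort(logits, descending=True)
            probs = torch.softmax(sorted_logits, dim=-1)
            mass_before = probs.cumsum(-1) - probs      # probability mass ranked above each token
            logits[sorted_idx[mass_before > top_p]] = float("-inf")
        return logits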
commit 12e7c28d5e8f219480d1dbc682fd225dc76fcc2b (Yuanheng Zhao, Wed May 8 15:48:47 2024 +0800)
    [hotfix] fix OpenMOE example import path (#5697)
commit 55cc7f3df7c600deae2f344ee162abae5a5c63e1 (Yuanheng Zhao, Wed May 8 11:30:15 2024 +0800)
    [Fix] Fix Inference Example, Tests, and Requirements (#5688)
    * clean requirements; modify example inference struct; add test ci scripts; mark test_infer as submodule; rm deprecated cls & deps; import of HAS_FLASH_ATTN; prune inference and triton kernel tests to be run; increment pytest timeout mins; revert import path in openmoe
commit f9afe0addd89303de4819debd93efe97d5618238 (Yuanheng Zhao, Tue May 7 23:13:14 2024 +0800)
    [hotfix] Fix KV Heads Number Assignment in KVCacheManager (#5695)
    * fix key value number assignment in KVCacheManager, as well as method of accessing
commit 1ace1065e6bff175a0af88cae86d272acef29c9f (傅剑寒, Mon May 6 15:35:13 2024 +0800)
    [Inference/Feat] Add quant kvcache support for decode_kv_cache_memcpy (#5686)
commit db7b3051f4379862f88790bf1653ddb6443c002e (merge 725fbd2e + 8754abae; Yuanheng Zhao, Mon May 6 14:43:38 2024 +0800)
    [Sync] Update from main to feature/colossal-infer (Merge pull request #5685 from yuanheng-zhao/inference/merge/main)
commit 725fbd2ed067f9c58ac04670377d3e6f2a96fe00 (Steve Luo, Mon May 6 10:55:34 2024 +0800)
    [Inference] Remove unnecessary float4_ and rename float8_ to float8 (#5679)
commit 8754abae24dbcc492d2992d1091428592b615285 (Yuanheng Zhao, Sun May 5 16:28:56 2024 +0000)
    [Fix] Fix & Update Inference Tests (compatibility w/ main)
commit 56ed09aba5e017fc0c211dac70215c2f83815919 (merge 537a3cbc + d3f34ee8; Yuanheng Zhao, Sun May 5 05:14:00 2024 +0000)
    [sync] resolve conflicts of merging main
commit 537a3cbc4df445786c8ecf2af0a2998e2fd881b6 (Yuanheng Zhao, Fri May 3 17:20:45 2024 +0800)
    [kernel] Support New KCache Layout - Triton Kernel (#5677)
    * kvmemcpy triton for new kcache layout; revise tests for new layout; naive triton flash decoding and rotary triton kernel for new layout; remove redundancy in triton decoding and kvcache copy
commit 9df016fc4520a5a5c95a11ed04a8ac62bde039c4 (傅剑寒, Tue Apr 30 19:38:00 2024 +0800)
    [Inference] Fix quant bits order (#5681)
commit f79963199cd30c5e917d430aedd79113d06d608c (yuehuayingxueluo, Tue Apr 30 19:35:05 2024 +0800)
    [inference] Add alibi to flash attn function (#5678)
commit ef8e4ffe310bfe21f83feb965d962d816d75bc88 (傅剑寒, Tue Apr 30 18:33:53 2024 +0800)
    [Inference/Feat] Add kvcache quant support for fused_rotary_embedding_cache_copy (#5680)
commit 5cd75ce4c7edc95bacd8ec5fc04b8add339e8331 (Steve Luo, Tue Apr 30 15:52:23 2024 +0800)
    [Inference/Kernel] refactor kvcache manager and rotary_embedding and kvcache_memcpy oper… (#5663)
    * refactor decode_kv_cache_memcpy; enable alibi in pagedattention
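Several commits in this span (quant kvcache interface #5700, quant support for decode_kv_cache_memcpy #5686 and fused_rotary_embedding_cache_copy #5680, convert_fp8 #5706) build out a quantized KV cache path. As a rough sketch of the underlying idea only: the real kernels work in CUDA on fp8 types, so the int8-with-scale stand-in and all names below are purely illustrative:

    import torch

    def quantize_kv(kv):
        # Symmetric per-tensor 8-bit quantization of a K or V tile.
        scale = kv.float().abs().amax().clamp(min=1e-6) / 127.0
        q = torch.round(kv.float() / scale).clamp(-127, 127).to(torch.int8)
        return q, scale

    def dequantize_kv(q, scale, dtype=torch.float16):
        # Attention kernels dequantize on load, before the dot products.
        return (q.float() * scale).to(dtype)

    k = torch.randn(16, 8, 128, dtype=torch.float16)   # [block_size, kv_heads, head_dim]
    q8, s = quantize_kv(k)
    err = (dequantize_kv(q8, s) - k).abs().max()       # quantization error bound check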
commit 5f00002e43bd738a99fea250306e54c8c908f05a (yuehuayingxueluo, Tue Apr 30 15:47:07 2024 +0800)
    [Inference] Adapt Baichuan2-13B TP (#5659)
    * add baichuan2 13B TP; update baichuan tp logic; fix alibi slopes tp logic; rm nn.Module; change BAICHUAN_MODEL_NAME_OR_PATH; modify the logic for loading Baichuan weights
commit 808ee6e4addccb51990398434547fa5df3c255b0 (傅剑寒, Tue Apr 30 11:26:36 2024 +0800)
    [Inference/Feat] Feat quant kvcache step2 (#5674)
commit 8ccb6714e79137c8e6e50d9a585eadbf70ae6fc0 (傅剑寒, Fri Apr 26 19:40:37 2024 +0800)
    [Inference/Feat] Add kvcache quantization support for FlashDecoding (#5656)
commit 5be590b99eb6c58c3aa809d453680139fdd2b9f7 (Yuanheng Zhao, Fri Apr 26 17:51:49 2024 +0800)
    [kernel] Support new KCache Layout - Context Attention Triton Kernel (#5658)
    * add context attn triton kernel for the new kcache layout; add triton benchmark
commit 3c91e3f1763d2a30a85187a3a606dbe4d1b9454d (yuehuayingxueluo, Thu Apr 25 23:11:30 2024 +0800)
    [Inference] Adapt to baichuan2 13B (#5614)
    * change BAICHUAN_MODEL_NAME_OR_PATH; fix test_decoding_attn.py; mv attn mask processing to test flash decoding; mv get_alibi_slopes to baichuan modeling; fix bugs in test_baichuan.py
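Baichuan2-13B uses ALiBi rather than rotary embeddings, which is why these commits thread alibi slopes through flash attention, paged attention, and tensor parallelism (each TP rank only needs the slopes for its own heads). The get_alibi_slopes name appears in the log above; the body below is the usual reference computation from the ALiBi paper, not necessarily the repo's exact code:

    import math
    import torch

    def get_alibi_slopes(num_heads):
        # Per-head slopes: a geometric sequence starting at 2^(-8/n) for n heads.
        closest_pow2 = 2 ** math.floor(math.log2(num_heads))
        base = 2.0 ** (-8.0 / closest_pow2)
        slopes = [base ** (i + 1) for i in range(closest_pow2)]
        if closest_pow2 != num_heads:  # interleave extra slopes for non-power-of-2 head counts
            extra_base = 2.0 ** (-4.0 / closest_pow2)
            slopes += [extra_base ** (2 * i + 1) for i in range(num_heads - closest_pow2)]
        return torch.tensor(slopes)

    # Attention bias for head h, query position i, key position j: -slopes[h] * (i - j).
    slopes = get_alibi_slopes(40)            # Baichuan2-13B has 40 attention heads
    tp_rank, tp_size = 1, 2
    local = slopes.chunk(tp_size)[tp_rank]   # each TP rank keeps only its own heads' slopes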
commit f342a9387168cedc2e5cc33155939c6d0c4e99a0 (Yuanheng Zhao, Thu Apr 25 22:04:59 2024 +0800)
    [Fix] Remove obsolete files - inference (#5650)
commit a8fd3b034235e1fa987a1ae85a9a2b465ee6128f (Steve Luo, Thu Apr 25 14:24:02 2024 +0800)
    [Inference/Kernel] Optimize paged attention: Refactor key cache layout (#5643)
    * refactor FlashDecodingAttention with a different key cache layout: from [num_blocks, num_kv_heads, block_size, head_size] to [num_blocks, num_kv_heads, head_size/x, block_size, x]
commit 90cd5227a348dfe506e95b2e49f2a8dcd34fdbca (yuehuayingxueluo, Wed Apr 24 14:51:36 2024 +0800)
    [Fix/Inference] Fix vllm benchmark (#5630)
    * fix OOM when running vllm-0.4.0; rm unused params; change generation_config; change benchmark log file name
commit 279300dc5f34db219c90a297c0996d00221eae96 (傅剑寒, Wed Apr 24 14:17:54 2024 +0800)
    [Inference/Refactor] Refactor compilation mechanism and unified multi hw (#5613)
    * refactor compilation mechanism and unify multi-hardware support; fix file path bug; add init.py to make pybind a module and avoid relative path errors caused by softlinks; delete duplicated macros; fix macro bug in gcc
commit 04863a9b144fc7dd46a57d2c7b0cf2f4b351ffb6 (Yuanheng Zhao, Tue Apr 23 22:23:07 2024 +0800)
    [example] Update Llama Inference example (#5629)
    * add inference benchmark for llama3; revise inference config args; add llama generation demo script; fix init rope in llama policy
commit 12f10d5b0b49a180bc162e166337942e0bbfb96b (yuehuayingxueluo, Tue Apr 23 13:44:49 2024 +0800)
    [Fix/Inference] Fix CUDA Rotary Embedding GQA (#5623)
    * fix rotary embedding GQA; change test_rotary_embdding_unpad.py
commit 5d4c1fe8f5f7019284f6cbc0ed29506748f63bf1 (Yuanheng Zhao, Tue Apr 23 13:09:55 2024 +0800)
    [Fix/Inference] Fix GQA Triton and Support Llama3 (#5624)
    * fix GQA calling of flash decoding triton; fix kv cache alloc shape; fix rotary triton for GQA; fix sequence max length assigning and logic; fix scheduling and spec-dec; skip pytest without ImportError
commit ccf72797e3bfafcbfc42870ce24ee484858d4852 (Steve Luo, Fri Apr 19 15:34:53 2024 +0800)
    feat baichuan2 rmsnorm whose hidden size equals 5120 (#5611)
commit e37ee2fb65fc77c275b816968d91776322fd7695 (Runyu Lu, Thu Apr 18 16:56:46 2024 +0800)
    [Feat] Tensor Model Parallel Support For Inference (#5563)
    * naive tensor parallel support; fix precision and model load, refactor the framework; add tp unit test; fix do_sample
commit be396ad6cc102fa610731291bf28e531a5641c7a (Steve Luo, Thu Apr 18 16:45:07 2024 +0800)
    [Inference/Kernel] Add Paged Decoding kernel, sequence split within the same thread block (#5531)
    * feat flash decoding for paged attention; refactor FlashDecodingAttention
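The key cache refactor in #5643 is explicit about the layout change: [num_blocks, num_kv_heads, block_size, head_size] becomes [num_blocks, num_kv_heads, head_size/x, block_size, x], where x is a small vector width chosen so each thread can issue contiguous 16-byte loads (similar to vLLM's key cache layout). A sketch of the reshape, assuming x = 8 for fp16:

    import torch

    num_blocks, num_kv_heads, block_size, head_size = 128, 8, 16, 128
    dtype = torch.float16
    x = 16 // torch.tensor([], dtype=dtype).element_size()  # 8 fp16 values per 16-byte vector

    k_cache = torch.randn(num_blocks, num_kv_heads, block_size, head_size, dtype=dtype)

    # Splitting head_size into (head_size//x, x) and moving block_size inside makes the x
    # consecutive elements of one head-dim chunk contiguous for every position in a block,
    # so the decoding kernel can fetch keys with vectorized loads.
    k_cache_new = (
        k_cache.view(num_blocks, num_kv_heads, block_size, head_size // x, x)
               .permute(0, 1, 3, 2, 4)
               .contiguous()
    )
    assert k_cache_new.shape == (num_blocks, num_kv_heads, head_size // x, block_size, x)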
commit 56b222eff8c996a4677a158d4b5d4834a1bc0cfc (yuehuayingxueluo, Mon Apr 15 16:53:02 2024 +0800)
    [inference/model] Adapted to the baichuan2-7B model (#5591)
    * adapt to the baichuan2-7B model; modify the method of obtaining random weights; change mlp layer NOTE
commit d4cb023b62ea8e092783be437cb16d74a1afc6a7 (傅剑寒, Mon Apr 15 10:57:51 2024 +0800)
    [Inference/Refactor] Delete Duplicated code and refactor vec_copy utils and reduce utils (#5593)
    * delete duplicated code and refactor vec_copy and reduce utils; delete unused header file
commit a21912339a2c41627b43fd00e6adba38308a2ea0 (傅剑寒, Thu Apr 11 15:41:36 2024 +0800)
    refactor csrc (#5582)
commit 25928d84961b60264a6dabbddeae32af04a43fa2 (merge d56c9633 + f8598e3e; Yuanheng Zhao, Wed Apr 10 18:39:27 2024 +0800)
    [Inference/Spec-Dec] Merge pull request #5565 from hpcaitech/feat/speculative-decoding (Add Speculative Decoding and GLIDE Spec-Dec)
commit f8598e3ec56bbe6bc6dd9fd84a1e0543adbd3073 (Yuanheng, Wed Apr 10 11:14:04 2024 +0800)
    [Fix] Llama Modeling Control with Spec-Dec (#5580)
    * fix reference before assignment; fall back to triton kernels when using spec-dec
commit e60d430cf53c9009af4682908d01742147654429 (Yuanheng Zhao, Sun Apr 7 14:53:30 2024 +0800)
    [Fix] resolve conflicts of rebasing feat/speculative-decoding (#5557)
commit e1acb58423c53ece50b72db3bf9b91475d5d3d64 (Yuanheng Zhao, Wed Apr 3 18:06:23 2024 +0800)
    [doc] Add inference/speculative-decoding README (#5552)
    * add README for spec-dec; update roadmap
commit d85d91435ae25d875bfeb012b1e66cbfce6f6525 (Yuanheng Zhao, Mon Apr 1 21:54:24 2024 +0800)
    [Inference/SpecDec] Support GLIDE Drafter Model (#5455)
    * add glide-llama policy and modeling, compatible with transformers 4.36.2; fix issues of glimpsing large kv; revise the way of re-loading params for the glide drafter; enable convert-to-glide strict=False; revise vicuna prompt template; apply usage of glide model in engine
commit 912e24b2aaf4acda0e2b9a45a7d4327fbfc8bd39 (Yuanheng Zhao, Tue Mar 12 17:57:01 2024 +0800)
    [SpecDec] Fix inputs for speculation and revise past KV trimming (#5449)
    * fix drafter past kv and usage of batch bucket
commit a37f82629d7b9e3c3a0f430b8dd3ff6f38ddf1d4 (Yuanheng Zhao, Mon Mar 11 09:51:42 2024 +0800)
    [Inference/SpecDec] Add Speculative Decoding Implementation (#5423)
    * fix flash decoding mask during verification; add spec-dec and tests; revise drafter init; remove drafter sampling; retire past kv in drafter; revise how we enable/disable spec-dec
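The spec-dec commits above implement the usual draft-then-verify loop: a small drafter proposes a few tokens, the target model scores all of them in one forward pass, and the longest prefix the target agrees with is accepted. A greedy-verification sketch under an assumed model interface; ColossalAI's engine and the GLIDE drafter add batching and KV-cache management on top of this idea:

    import torch

    @torch.no_grad()
    def speculative_step(target, drafter, input_ids, n_spec=5):
        # target/drafter: callables mapping [1, seq] token ids to [1, seq, vocab] logits.
        draft = input_ids
        for _ in range(n_spec):  # drafter proposes n_spec tokens autoregressively
            next_tok = drafter(draft)[:, -1].argmax(-1, keepdim=True)
            draft = torch.cat([draft, next_tok], dim=-1)

        logits = target(draft)   # one target forward scores every drafted position
        preds = logits.argmax(-1)  # target's own choice after each prefix

        # Accept drafted tokens while they match the target's predictions.
        start = input_ids.shape[1]
        accepted = 0
        for i in range(n_spec):
            if draft[0, start + i] != preds[0, start + i - 1]:
                break
            accepted += 1
        # The first disagreeing position is replaced by the target's token.
        bonus = preds[:, start + accepted - 1 : start + accepted]
        return torch.cat([draft[:, : start + accepted], bonus], dim=-1)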
commit 5a9b05f7b297bc9ce3479990aeee94891c7f5edf (Yuanheng Zhao, Wed Feb 28 13:48:17 2024 +0800)
    [Inference/SpecDec] Add Basic Drafter Model Container (#5405)
    * add drafter model container (basic ver)
commit d63c469f45bc20115aaf5ba01e62dc67ab47953f (Yuanheng Zhao, Wed Feb 28 13:47:00 2024 +0800)
    [Infer] Revise and Adapt Triton Kernels for Spec-Dec (#5401)
    * resolve conflicts for revising flash-attn; adapt kv cache copy kernel for spec-dec; fix seqlen-n kvcache copy kernel/tests; test kvcache copy with torch.equal; add assertions
commit d56c96334e8a0626696609c3803ba5c73798f073 (merge 7ebdf48a + 7ca1d1c5; Yuanheng Zhao, Tue Apr 9 10:09:34 2024 +0800)
    Sync main to feature/colossal-infer
commit 7ca1d1c5453de3e726bca6334c360045050f94c4 (Yuanheng, Mon Apr 8 17:00:55 2024 +0800)
    remove outdated triton test
commit d78817539ea03b7b4bc79e0ef50db33d3e347f24 (pre-commit-ci[bot], Mon Apr 8 08:41:07 2024 +0000)
    [pre-commit.ci] auto fixes from pre-commit.com hooks
commit ce9401ad52b870012846abcde120f1e87d5da7fe (Yuanheng, Mon Apr 8 16:25:12 2024 +0800)
    remove unused triton kernels
commit ed5ebd1735db4541709eebdd37839ad161f542e8 (merge 7ebdf48a + 641b1ee7; Yuanheng, Mon Apr 8 16:21:47 2024 +0800)
    [Fix] resolve conflicts of merging main
commit 7ebdf48ac50ca7bab827ef611551c6c48113b684 (傅剑寒, Mon Apr 8 11:38:05 2024 +0800)
    add cast and op_functor for cuda built-in types (#5546)
commit 4bb5d8923a6e85a0f89a483f15933698635a9f9c (Yuanheng Zhao, Tue Apr 2 14:16:59 2024 +0800)
    [Fix/Inference] Remove unused and non-functional functions (#5543)
commit a2878e39f42f509f237f3d3fd0741f53e3feff0e (傅剑寒, Mon Apr 1 15:34:25 2024 +0800)
    [Inference] Add Reduce Utils (#5537)
    * add reduce utils; add using to delete namespace prefix
commit 04aca9e55bd91ea4dd8d1231aa66df7848b08f03 (yuehuayingxueluo, Mon Apr 1 13:47:14 2024 +0800)
    [Inference/Kernel] Add get_cos_and_sin Kernel (#5528)
    * add get_cos_and_sin kernel; merge common codes; change 'assert allclose' to 'assert equal' in tests
commit 934e31afb22d2a281464aebde074eb2f238fb812 (yuehuayingxueluo, Thu Mar 28 10:42:51 2024 +0800)
    Optimize the writing style of tail processing and the logic related to macro definitions (#5519)
commit e6496dd37144202c8602dfdd66bb83f297eb5805 (傅剑寒, Tue Mar 26 16:37:14 2024 +0800)
    [Inference] Optimize request handler of llama (#5512)
    * optimize request_handler; fix ways of writing
commit 6251d68dc9f92c333a8f07ddf94e80ff7462726e (Runyu Lu, Mon Mar 25 15:24:17 2024 +0800)
    [fix] PR #5354 (#5501)
    * update config.py docstring; align docstrings
commit 1d626233ce8dbf35405cb7d92a5638ee1d830e8f (merge 87079cff + 68e9396b; Runyu Lu, Mon Mar 25 14:55:59 2024 +0800)
    Merge pull request #5434 from LRY89757/colossal-infer-cuda-graph ([feat] cuda graph support and refactor non-functional api)
commit 68e9396bc084f03fe9315e9fed93292c0efc7a48 (merge ff4998c6 + 87079cff; Runyu Lu, Mon Mar 25 14:48:28 2024 +0800)
    [fix] merge conflicts
commit 87079cffe8e006d4949aa7ca7cb60e6b813ff701 (yuehuayingxueluo, Mon Mar 25 13:40:34 2024 +0800)
    [Inference] Support FP16/BF16 Flash Attention 2 And Add high_precision Flag To Rotary Embedding (#5461)
    * support FP16/BF16 Flash Attention 2; add context_kv_cache_memcpy_kernel.cu; add tail process; add high_precision flag to config.py; update test_rotary_embdding_unpad.py; fix vector_copy_utils.h; comment self.high_precision usage with float32
commit ff4998c6f39cbfd6d3d11f038c55cca3c9d3abd0 (Runyu Lu, Mon Mar 25 12:00:57 2024 +0800)
    [fix] remove unused comment
commit 9fe61b44753083c89a50540daa1e9a3daedeb335 (Runyu Lu, Mon Mar 25 11:37:58 2024 +0800)
    [fix]
commit 5b017d6324c9881e02a5440e0b1a3156612a8044 (Runyu Lu, Thu Mar 21 15:55:25 2024 +0800)
    [fix]
commit 606603bb8805c39f6ee01029337ddc614c8d46ef (merge 4eafe0c8 + 7ff42cc0; Runyu Lu, Thu Mar 21 14:25:22 2024 +0800)
    Merge branch 'feature/colossal-infer' of https://github.com/hpcaitech/ColossalAI into colossal-infer-cuda-graph
commit 4eafe0c8141c120229be3ddce9c5591c1535348a (Runyu Lu, Thu Mar 21 11:28:42 2024 +0800)
    [fix] unused option
commit 7ff42cc06d007ae78fe091da65cb89c4bb62bc38 (傅剑寒, Tue Mar 19 18:36:40 2024 +0800)
    add vec_type_trait implementation (#5473)
commit b96557b5e15dbb521bf0f77b6b1f24dcbd9464d6 (merge b6e97858 + 48c4f29b; 傅剑寒, Tue Mar 19 13:53:26 2024 +0800)
    Merge pull request #5469 from Courtesy-Xs/add_vec_traits (Refactor vector utils)
commit aabc9fb6aada9e7feb2ff8cf1f34e6ac37ade2e7 (Runyu Lu, Tue Mar 19 13:24:25 2024 +0800)
    [feat] add use_cuda_kernel option
commit 48c4f29b275e2d8105842913cd84f5d66c378b36 (xs_courtesy, Tue Mar 19 11:32:01 2024 +0800)
    refactor vector utils
commit b6e97858856ee8637216c51f14ac544b1bc0f872 (merge f366a5ea + 5724b9e3; 傅剑寒, Fri Mar 15 11:23:44 2024 +0800)
    Merge pull request #5457 from Courtesy-Xs/ly_add_implementation_for_launch_config (add implementation for GetGPULaunchConfig1D)
commit 5724b9e31e13e07d8ade0444c3e2f3e6894d13b1 (xs_courtesy, Fri Mar 15 11:18:57 2024 +0800)
    add some comments
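The colossal-infer-cuda-graph PR (#5434) merged above adds CUDA graph support: decode iterations have fixed shapes, so the whole kernel sequence can be captured once and replayed per step, removing per-kernel launch overhead. A minimal capture/replay sketch in plain PyTorch; the static buffer names are illustrative, and the engine presumably keeps one graph per batch-size bucket, which is what the "multi graphs capture" fixes further down refer to:

    import torch

    model = torch.nn.Linear(4096, 4096).cuda().half()
    static_in = torch.zeros(8, 4096, device="cuda", dtype=torch.half)

    # Warm up on a side stream, then capture one decode step into a graph.
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        model(static_in)
    torch.cuda.current_stream().wait_stream(s)

    graph = torch.cuda.CUDAGraph()
    with torch.cuda.graph(graph):
        static_out = model(static_in)

    # Per decode step: copy new activations into the static buffer and replay.
    new_batch = torch.randn(8, 4096, device="cuda", dtype=torch.half)
    static_in.copy_(new_batch)
    graph.replay()   # static_out now holds the results for new_batch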
commit 6e30248683c0e4ccc63d15f39f8149875cba1263 (Runyu Lu, Thu Mar 14 16:13:00 2024 +0800)
    [fix] tmp for test
commit 388e0439301834a1ad0d11da26b23f4cdc6c82d7 (xs_courtesy, Thu Mar 14 11:13:40 2024 +0800)
    add implementation for GetGPULaunchConfig1D
commit d02e257abd778812d64491dde893c0d691ed4328 (merge ae24b4f0 + f366a5ea; Runyu Lu, Thu Mar 14 10:37:05 2024 +0800)
    Merge branch 'feature/colossal-infer' into colossal-infer-cuda-graph
commit ae24b4f025285949253a21c41bee4b80679a0bfe (Runyu Lu, Thu Mar 14 10:35:08 2024 +0800)
    diverse tests
commit 1821a6dab0ad6ad24ae25216e56268c4b0c0d365 (Runyu Lu, Wed Mar 13 17:28:32 2024 +0800)
    [fix] pytest and fix dyn grid bug
commit f366a5ea1f2626a7870acaf8866f21d5fb49c388 (yuehuayingxueluo, Wed Mar 13 17:20:03 2024 +0800)
    [Inference/kernel] Add Fused Rotary Embedding and KVCache Memcopy CUDA Kernel (#5418)
    * add fused rotary_emb and kvcache memcopy kernel (fused_rotary_emb_and_cache_kernel.cu); use vectorized memcopy and optimize global memory access; fix test_rotary_embdding_unpad.py; fix include paths; rm inline
commit ed431de4e4f73584e6b9c11ab041ef54a8e83de6 (Steve Luo, Wed Mar 13 16:00:55 2024 +0800)
    fix rmsnorm template function invocation problem (template function partial specialization is not allowed in C++) and pass e2e precision test (#5454)
commit 6fd355a5a6bb46bfee41d2bc75578e8fba001144 (merge b699f540 + c1c45e9d; 傅剑寒, Wed Mar 13 11:26:41 2024 +0800)
    Merge pull request #5452 from Courtesy-Xs/fix_include_path
commit c1c45e9d8ecb6743e88e63dd151c617c0014e7c1 (xs_courtesy, Wed Mar 13 11:21:06 2024 +0800)
    fix include path
commit b699f54007c52b2f4ec56326a495b06858cf8856 (Steve Luo, Tue Mar 12 17:48:02 2024 +0800)
    optimize rmsnorm: add vectorized elementwise op, feat loop unrolling (#5441)
commit 368a2aa5433d127adaa3674c6d00bb9dc3e0729c (merge 21e1e364 + 095c070a; 傅剑寒, Tue Mar 12 14:14:37 2024 +0800)
    Merge pull request #5445 from Courtesy-Xs/refactor_infer_compilation (Refactor colossal-infer code arch)
commit 095c070a6eefe1a76fe3483b21986826114d6d17 (xs_courtesy, Mon Mar 11 17:06:57 2024 +0800)
    refactor code
commit 21e1e3645c8f2e0d4e556f3e13d0d2aa5053911b (merge f7aecc0c + 5eb5ff14; 傅剑寒, Mon Mar 11 11:15:29 2024 +0800)
    Merge pull request #5435 from Courtesy-Xs/add_gpu_launch_config (Add query and other components)
commit 633e95b301336c4c237537f584882b3d8e5f4145 (Runyu Lu, Mon Mar 11 10:56:51 2024 +0800)
    [doc] add doc
commit 9dec66fad6c2f85166903aa80d0c077e37512fce (Runyu Lu, Mon Mar 11 10:51:16 2024 +0800)
    [fix] multi graphs capture error
commit b2c0d9ff2b4e4015660f2967837688cf7293b21e (Runyu Lu, Mon Mar 11 10:49:31 2024 +0800)
    [fix] multi graphs capture error
commit f7aecc0c6bac001d10c1dd00274e0152e4c86df6 (Steve Luo, Fri Mar 8 16:21:12 2024 +0800)
    feat rmsnorm cuda kernel and add unittest, benchmark script (#5417)
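The fused rotary embedding + KV-cache memcopy kernel (#5418) combines two per-step operations: rotating q/k by position-dependent cos/sin, and writing the rotated k (plus v) into the blocked cache in one pass. The rotation itself, in the common half-split form, looks roughly like this; it is a sketch only, since the repo's CUDA kernel fuses the cache copy and may use a different channel-pairing convention:

    import torch

    def apply_rotary(x, cos, sin):
        # x: [num_tokens, num_heads, head_dim]; cos/sin: [num_tokens, head_dim // 2]
        x1, x2 = x.chunk(2, dim=-1)   # split head_dim into two halves
        cos = cos.unsqueeze(1)        # broadcast over heads
        sin = sin.unsqueeze(1)
        return torch.cat([x1 * cos - x2 * sin, x2 * cos + x1 * sin], dim=-1)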
commit 5eb5ff1464311ac16c29307d03a3c076aced7e03 (xs_courtesy, Fri Mar 8 15:41:14 2024 +0800)
    refactor code
commit 01d289d8e51384131d536b1c223c473aeea463e9 (merge a46598ac + 2b28b54a; xs_courtesy, Fri Mar 8 15:04:55 2024 +0800)
    Merge branch 'feature/colossal-infer' of https://github.com/hpcaitech/ColossalAI into add_gpu_launch_config
commit a46598ac5984c7dc5804d0cf8621698f1a6a8720 (xs_courtesy, Fri Mar 8 14:53:29 2024 +0800)
    add reusable utils for cuda
commit 2b28b54ac6d19d33079d9117b9717fd2779f2b08 (merge 593a72e4 + 95c21498; 傅剑寒, Fri Mar 8 14:44:37 2024 +0800)
    Merge pull request #5433 from Courtesy-Xs/add_silu_and_mul ([Inference] Add silu_and_mul for infer)
commit cefaeb5fdd551c8b95837a475cb810f4991cf674 (Runyu Lu, Fri Mar 8 14:19:35 2024 +0800)
    [feat] cuda graph support and refactor non-functional api
commit 95c21498d4f6e640e218f4b00349020f4ae7c69a (xs_courtesy, Thu Mar 7 16:57:49 2024 +0800)
    add silu_and_mul for infer
commit 593a72e4d58b8c3feebde2d19c78d44f702f7b06 (merge 0aa27f19 + 0310b76e; Frank Lee, Mon Mar 4 10:13:59 2024 +0800)
    Merge pull request #5424 from FrankLeeeee/sync/main (Sync/main)
commit 0310b76e9d485703d5afc128b8d97d01b00f3317 (merge 0aa27f19 + 4b8312c0; FrankLeeeee, Mon Mar 4 10:09:36 2024 +0800)
    Merge branch 'main' into sync/main
commit 0aa27f196109bfb4ce6171d7ce921052b9eee969 (yuehuayingxueluo, Wed Feb 28 16:46:03 2024 +0800)
    [Inference] Move benchmark-related code to the example directory (#5408)
    * move benchmark-related code to the example directory; fix bugs in test_fused_rotary_embedding.py
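The silu_and_mul op added above (#5433) is the elementwise half of a SwiGLU MLP: given the concatenated outputs of the gate and up projections, it applies SiLU to the gate half and multiplies by the up half. The PyTorch equivalent is one line; the CUDA version exists to avoid materializing the two intermediate halves (a sketch, names assumed):

    import torch
    import torch.nn.functional as F

    def silu_and_mul(x):
        # x: [..., 2 * d] -> [..., d], computing silu(x[..., :d]) * x[..., d:]
        gate, up = x.chunk(2, dim=-1)
        return F.silu(gate) * up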
commit 600881a8ea9b17c436ded922a9d4e3d5969acd87 (yuehuayingxueluo, Wed Feb 28 14:36:50 2024 +0800)
    [Inference] Add CUDA KVCache Kernel (#5406)
    * add cuda KVCache kernel; annotate benchmark_kvcache_copy; move benchmark scripts to example/; rm benchmark codes in test_kv_cache_memcpy.py
commit 19061188c396d851ef17bc34b526e2f2b4fc1479 (Yuanheng Zhao, Mon Feb 26 16:17:47 2024 +0800)
    [Infer/Fix] Fix Dependency in test - RMSNorm kernel (#5399)
    * fix dependency in pytest
commit bc1da87366d81e144f1f133801d5f20520433c52 (yuehuayingxueluo, Fri Feb 23 10:51:35 2024 +0800)
    [Fix/Inference] Fix format of input prompts and input model in inference engine (#5395)
    * fix bugs in inference_engine and engine.py; rm CUDA_VISIBLE_DEVICES; add request_ids in generate; add logger.debug for BatchBucket
commit 2a718c8be89918ec70b88f1f059148a7294dbccb (yuehuayingxueluo, Wed Feb 21 13:23:57 2024 +0800)
    Optimize the execution interval time between cuda kernels caused by view and memcopy (#5390)
    * opt_view_and_memcopy; update benchmark scripts
commit 730103819dc0636c85af1af80cc17914dcf196c1 (Jianghai, Wed Feb 21 11:31:48 2024 +0800)
    [Inference] Fused kv copy into rotary calculation (#5383)
    * revise rotary embedding; fused kv copy in colossalai/kernel/triton/no_pad_rotary_embedding.py; del padding llama
commit b21aac5baeddf7ea19615fae454e6f78f7469cd2 (Yuanheng Zhao, Mon Feb 19 17:18:20 2024 +0800)
    [Inference] Optimize and Refactor Inference Batching/Scheduling (#5367)
    * add kvcache manager funcs and batch bucket for batching; revise RunningList struct in handler; use new batching methods; revise abort logic; use cpu seq lengths/block tables; rm unused attr in Sequence; use dict instead of odict in batch struct; fix pop_n_seqs to pop the first n seqs; add check in request handler; fix pop method in batch bucket; fix prefill adding
commit 8c69debdc7128e1b8839f12aa3f19ad327569017 (yuehuayingxueluo, Thu Feb 8 15:27:26 2024 +0800)
    [Inference] Support vllm testing in benchmark scripts (#5379)
    * add vllm benchmark scripts; update run_benchmark.sh
commit 9afa52061f89dde87a73e36f740f62781d658a01 (Frank Lee, Thu Feb 8 14:04:14 2024 +0800)
    [inference] refactored config (#5376)
commit 1f8c7e70469191610d9536029f624b4f30db8caf (Jianghai, Wed Feb 7 17:55:48 2024 +0800)
    [Inference] User Experience: update the logic of default tokenizer and generation config (#5337)
    * fix pytest; fix readme; remove tokenizer config
commit 6fb4bcbb2420b9f977ab74de60c6d311b6c9ed9a (yuehuayingxueluo, Wed Feb 7 17:15:42 2024 +0800)
    [Inference/opt] Fused KVCache Memcopy (#5374)
    * fused kv memcopy; add TODO in test_kvcache_copy.py
commit 58740b5f6872bc5a26dbf7c3112b86a1b66c083a (Frank Lee, Wed Feb 7 17:11:43 2024 +0800)
    [inference] added inference template (#5375)
commit 8106ede07fae7e239203feb815162efdf46975ec (Frank Lee, Wed Feb 7 14:27:04 2024 +0800)
    Revert "[Inference] Adapt to Fused rotary (#5348)" (#5373)
    * reverts commit 9f4ab2eb924b938348df2c713bb4580972f18eb1
commit 9f4ab2eb924b938348df2c713bb4580972f18eb1 (Jianghai, Wed Feb 7 11:36:04 2024 +0800)
    [Inference] Adapt to Fused rotary (#5348)
    * revise rotary embedding; adapt modeling
commit 35382a7fbf96c731ba1ed76cf5529ea3220a5b66 (yuehuayingxueluo, Tue Feb 6 19:38:25 2024 +0800)
    [Inference] Fused the gate and up proj in mlp, and optimized the autograd process (#5365)
    * fuse the gate and up proj in mlp; opt auto_grad; fix bugs in flash attn; change reshape to view; fix test_rmsnorm_triton.py
commit 1dedb57747270f32be5d0e67abc1ad2fff658f8f (Yuanheng Zhao, Tue Feb 6 17:27:45 2024 +0800)
    [Fix/Infer] Remove unused deps and revise requirements (#5341)
    * remove flash-attn dep; rm padding llama; revise infer requirements; move requirements out of module
commit 631862f3390f874db118a25c0137f86630e9b167 (yuehuayingxueluo, Fri Feb 2 15:38:21 2024 +0800)
    [Inference] Optimize generation process of inference engine (#5356)
    * opt inference engine; fix run_benchmark.sh; fix generate in engine.py
commit 21ad4a27f91659220bec6c4d4f2d0f62f7093a45 (yuehuayingxueluo, Fri Feb 2 15:06:01 2024 +0800)
    [Inference/opt] Optimize the mid tensor of RMS Norm (#5350)
    * opt rms_norm; fix bugs in rms_layernorm
commit 027aa1043f1c7b3668d5ca9b91d35c846736e9c4 (Frank Lee, Fri Feb 2 14:31:10 2024 +0800)
    [doc] updated inference readme (#5343)
commit e76acbb076582e0aade1ee8a5fa7696d95c1bef5 (Frank Lee, Fri Feb 2 13:51:22 2024 +0800)
    [inference] moved ops tests to test_infer (#5354)
commit db1a763307a54ca262751ebebd5f1c503d9bca74 (Frank Lee, Fri Feb 2 11:44:15 2024 +0800)
    [inference] removed redundancy init_batch (#5353)
commit 249644c23b0402ccf9d0908f13ed15b41b95145f (yuehuayingxueluo, Thu Feb 1 15:49:39 2024 +0800)
    [Inference] Replace Attention layer and MLP layer by shardformer to optimize the weight transpose operation, add fused_qkv and fused linear_add (#5340)
    * add fused qkv; replace attn and mlp by shardformer; add optimize unbind; add fused_addmm; rename ShardFormerLlamaMLP and ShardFormerLlamaAttention; remove the dependency on LlamaFlashAttention2
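Fusing the gate and up projections (#5365) and fusing QKV (#5340) are the same trick: linear layers that consume the same input are stacked into one weight matrix so the step runs as a single GEMM. A sketch for the MLP case, with hypothetical module names, which pairs with the silu_and_mul op shown earlier:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class FusedSwiGLUMLP(nn.Module):
        def __init__(self, hidden, intermediate):
            super().__init__()
            # One GEMM produces [gate | up] instead of two separate projections.
            self.gate_up_proj = nn.Linear(hidden, 2 * intermediate, bias=False)
            self.down_proj = nn.Linear(intermediate, hidden, bias=False)

        @classmethod
        def from_split(cls, gate_proj, up_proj, down_proj):
            m = cls(gate_proj.in_features, gate_proj.out_features)
            # Stack the two weights along the output dim; the GEMM output is their concat.
            m.gate_up_proj.weight.data = torch.cat([gate_proj.weight, up_proj.weight], dim=0)
            m.down_proj = down_proj
            return m

        def forward(self, x):
            gate, up = self.gate_up_proj(x).chunk(2, dim=-1)
            return self.down_proj(F.silu(gate) * up)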
commit f8e456d20295af52665ca06a21f9fd8b468204d7 (Frank Lee, Thu Feb 1 15:31:01 2024 +0800)
    [inference] simplified config verification (#5346)
commit df0aa49585d2dd19d7397dfbd3b5f136abac609b (Jianghai, Wed Jan 31 16:31:29 2024 +0800)
    [Inference] Kernel Fusion, fused copy kv cache into rotary embedding (#5336)
    * revise rotary embedding; adapt
commit 1336838a9149fb210a956b0ad338197c4ae77821 (merge 5f98a9d6 + c5655199; Frank Lee, Wed Jan 31 16:29:26 2024 +0800)
    Merge pull request #5339 from FrankLeeeee/sync/merge-main (Sync/merge main)
commit c56551991379a457fc34df699710ab94132779fc (merge 5f98a9d6 + 71321a07; FrankLeeeee, Wed Jan 31 10:41:47 2024 +0800)
    merge commit
commit 5f98a9d68a0a35031e1c740c19e33b32f4fa8d9c (Yuanheng Zhao, Tue Jan 30 16:06:09 2024 +0800)
    [Infer] Optimize Blocked KVCache And Kernels Using It (#5325)
    * revise shape of kvcache for the context attn, flash decoding, and kvcache copy kernels; init kvcache in kvcache manager; revise llama modeling; revise block size retrieval; use torch for rms_norm benchmarking
commit e8f0642f2841f6aeb6ed0e6695ff9d9ef14f198b (yuehuayingxueluo, Tue Jan 30 10:31:46 2024 +0800)
    [Inference] Add Nopadding Llama Modeling (#5327)
    * add nopadding llama modeling (nopadding_llama.py); rm unused codes; fix bugs in test_xine_copy.py
commit c7c104cb7ccc353faa10667853ed210e042f1be8 (Jianghai, Mon Jan 29 16:21:06 2024 +0800)
    [DOC] Update inference readme (#5280)
    * update engine; finish readme
commit 1f8a75d470d548bfd4db877e73102b8fad5cdfa9 (Jianghai, Mon Jan 29 10:22:33 2024 +0800)
    [Inference] Update rms norm kernel, benchmark with vLLM (#5315)
commit 7ddd8b37f0f1160e28a2919a2e37f8e8ad199773 (Jianghai, Fri Jan 26 15:02:12 2024 +0800)
    fix (#5311)
commit 4f28cb43c0c2afbc970b9f0f300e7aa28e39bd2e (yuehuayingxueluo, Fri Jan 26 14:00:10 2024 +0800)
    [inference] Optimize the usage of the mid tensors space in flash attn (#5304)
    * opt flash attn tmp tensors; fix benchmark_llama; fix None logic for output tensor; adapt to get_xine_cache; add _get_dtype in config.py
commit af8359c430ce3fabb22748870b67b0c6c33f610c (Yuanheng Zhao, Thu Jan 25 10:23:12 2024 +0800)
    [hotfix] fix boundary check in batch (#5306)
commit c647e00e3c092d3d6219f7686f260f2932a0c27d (Jianghai, Wed Jan 24 16:20:42 2024 +0800)
    [Inference] Add fused rotary kernel and get cos cache kernel (#5302)
    * add fused rotary and get cos cache func
commit 3da9993b0d03923755c1fcd6279cc4c7b8d00d1e (Yuanheng Zhao, Tue Jan 23 17:16:02 2024 +0800)
    [Kernel/Fix] Revise flash attention triton kernel API and add benchmark (#5301)
    * fix decoding kernel pytest; revise and add triton context attn benchmark
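Several entries in this stretch touch the RMS norm kernel (#5315, #5350, #5441, and #5262 below). The reference computation those kernels must match is short: RMSNorm is LayerNorm without the mean subtraction, normalizing by the root-mean-square of the activations. A reference sketch:

    import torch

    def rms_norm(x, weight, eps=1e-6):
        # Normalize over the last dim: x / sqrt(mean(x^2) + eps) * weight (no centering).
        variance = x.float().pow(2).mean(dim=-1, keepdim=True)
        return (x.float() * torch.rsqrt(variance + eps)).to(x.dtype) * weight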
commit 8e606ecc7e89ffed80537e89a27bb1eb6759f4bc (Jianghai, Tue Jan 23 12:11:53 2024 +0800)
    [Inference] Benchmarking rotary embedding and add a fetch function (#5277)
    * fix bugs and add a cos/sin cache fetch func; add docstring
commit b7853196a0a46558d7c0cac7deac9a36c7a5ba38 (merge bfff9254 + cea9c86e; yuehuayingxueluo, Mon Jan 22 17:07:14 2024 +0800)
    Merge pull request #5297 from yuehuayingxueluo/fix_rotary_embedding ([Inference/fix] Add utils.py for Rotary Embedding)
commit cea9c86e453e36b4848064312c9a4f0d2de6ea98 (yuehuayingxueluo, Mon Jan 22 16:06:27 2024 +0800)
    add utils.py
commit bfff9254ac8ca866673746ec47cfd2f87aab2b66 (yuehuayingxueluo, Mon Jan 22 10:55:34 2024 +0800)
    [inference] Adapted to Rotary Embedding and RMS Norm (#5283)
    * adapt to rotary_embedding and nopad rms norm; fix bugs in benchmark; fix flash_decoding.py
commit 6e487e7d3cf5295ca908fa69c8e03af8980391bf (Yuanheng Zhao, Fri Jan 19 15:47:16 2024 +0800)
    [kernel/fix] Performance Optimization for Decoding Kernel and Benchmarking (#5274)
    * prevent re-creating intermediate tensors; add singleton class holding intermediate values; revise flash decoding triton kernel in/out shapes; add benchmark in pytest
commit 9e2342bde2c0ffe1a8cdd2fe8917254ef0a06e7f (Jianghai, Thu Jan 18 16:31:14 2024 +0800)
    [Hotfix] Fix bugs in testing continuous batching (#5270)
    * fix bugs and add padding; add funcs
commit 5ae9099f9203a4f8350f383b838e8f2ad15d6fdd (Yaozheng Fang, Thu Jan 18 10:21:03 2024 +0800)
    [kernel] Add RMSLayerNorm triton kernel (#5262)
    * add rmsnorm triton kernel; modify the atol and rtol in the test file; remove the mean computation and update the kernel function and file names; add benchmark of rms norm
commit 86b63f720cf60deefe40874517b3d8e1dccb7af3 (yuehuayingxueluo, Wed Jan 17 16:03:10 2024 +0800)
    [Inference] Adapted to the triton attn kernels (#5264)
    * adapt to the triton attn kernels and copy_kv_to_blocked_cache; fix pad input; update kv memcpy
commit 0f2b46a41c2c308cc6fbeaf0e86d0e0b93435b77 (Yuanheng Zhao, Tue Jan 16 14:41:02 2024 +0800)
    [kernel] Revise KVCache copy triton kernel API (#5273)
    * revise kvcache copy kernel api; fix benchmark
commit d8db500efc0e67dea995c2124d20aadd07afb6f0 (Jianghai, Mon Jan 15 17:50:46 2024 +0800)
    [Inference] Fix request handler and add recycle logic (#5260)
commit c597678da475abd4ecc075c0b80996989f1bcdc0 (Frank Lee, Mon Jan 15 17:37:41 2024 +0800)
    [doc] updated inference readme (#5269)
commit fa85e02b3b1b316009c4557482f998b903730ec3 (Yuanheng Zhao, Mon Jan 15 17:37:20 2024 +0800)
    [kernel] Add KV cache copy kernel during decoding (#5261)
    * add kv copy triton kernel during decoding stage; add pytest and fix kernel; revise kernel config; add benchmark for kvcache copy
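The KV cache copy kernels above (#5261, and the copy_kv_to_blocked_cache referenced in #5264) implement the decode-time write path of paged attention: each sequence's new key/value vector lands in a physical block found through its block table. The indexing the Triton kernel parallelizes looks roughly like this (a sketch; tensor names and shapes are assumptions):

    import torch

    def copy_kv_to_blocked_cache(
        new_k,         # [bsz, num_kv_heads, head_dim] - this step's keys
        k_cache,       # [num_blocks, num_kv_heads, block_size, head_dim]
        seq_lens,      # [bsz] current length of each sequence (incl. this token)
        block_tables,  # [bsz, max_blocks_per_seq] logical -> physical block ids
    ):
        block_size = k_cache.shape[2]
        for i in range(new_k.shape[0]):            # the kernel runs these in parallel
            pos = seq_lens[i].item() - 1           # slot index of the new token
            block_id = block_tables[i, pos // block_size]
            k_cache[block_id, :, pos % block_size, :] = new_k[i]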
commit 1ded7e81ef08d574798dd98d1f4d33da07b7f4c9 (FrankLeeeee, Thu Jan 11 13:50:45 2024 +0000)
    [git] fixed rebased files
commit 1513f20f4d80f782fab381996368ff2c2f3c95c3 (Yuanheng Zhao, Thu Jan 11 18:06:39 2024 +0800)
    [kernel] Add flash decoding triton kernel for blocked kv cache (#5249)
    * add flash decoding unpad triton kernel; add kernel testing; support kv group (GQA); refactor pytest for attention
commit fded91d049997ed87dee965fc42c35a239e3ec03 (Jianghai, Thu Jan 11 16:24:54 2024 +0800)
    [Inference] Kernel: no pad rotary embedding (#5252)
    * fix bugs; use more accurate atol
commit d40eb26029e8c61fc2b8ef3a1b8126a229e48047 (yuehuayingxueluo, Wed Jan 10 10:38:53 2024 +0800)
    fix bugs in request_handler.py and engine.py
commit 10e3c9f923caf4fb68ab61e96c244bd5cca9b9da (yuehuayingxueluo, Tue Jan 9 15:53:04 2024 +0800)
    rm torch.cuda.synchronize
commit fab294c7f4a5db0a4e19109ac5656492ff3ca08b (yuehuayingxueluo, Tue Jan 9 15:18:28 2024 +0800)
    fix CI bugs
commit 2a73e828eba565017d19eaf70a304e1b1eddba1f (yuehuayingxueluo, Tue Jan 9 14:29:45 2024 +0800)
    fix bugs related to processing padding mask
commit e545a871b8a89093f5d01e3fea1fe873ef52d51a (Jianghai, Mon Jan 8 15:56:00 2024 +0800)
    [Hotfix] Fix accuracy and align attention method api with Triton kernel (#5229)
    * fix accuracy; alignment in attention; fix attention
commit fa4fbdbffb6996e8aa1f65bddce5844f2bbbfdf1 (yuehuayingxueluo, Tue Jan 9 13:52:53 2024 +0800)
    adapted to pad_context_forward
commit 47e53eaa1ca08fd55b657b53b75d13cc72f9cd05 (yuehuayingxueluo, Mon Jan 8 12:35:06 2024 +0800)
    fix bugs in attention.py and request_handler.py
commit bfd9b1b494b4414835b22cbba52005921127e4f6 (Jianghai, Thu Jan 4 16:39:00 2024 +0800)
    [Inference] Pytorch Attention func, pad&nopad input support (#5219)
    * add attn and attention test; fix attn forward; fix decoding
commit 3ad1f3b78b830c90079ed9f1e0b5cd26601194fa (yuehuayingxueluo, Thu Jan 4 16:48:53 2024 +0800)
    fix beam_width
commit b2eb9cd18665317ec7900364ef21a38c3edb9e3f (yuehuayingxueluo, Thu Jan 4 15:09:06 2024 +0800)
    Fixed a typo
commit bbfebfb9fc5250c1e4d3a6f008af652f7a0a9ca0 (yuehuayingxueluo, Thu Jan 4 15:03:18 2024 +0800)
    fix bugs in sampler
commit 02c1bf8b2abef137a653b86b733d66b6dfbcc022 (yuehuayingxueluo, Wed Jan 3 18:50:26 2024 +0800)
    add context_attention_unpadded
commit 07b5283b6a3899ebe84cbe8c7902d142ffbc4b9c (Yuanheng Zhao, Wed Jan 3 14:41:35 2024 +0800)
    [kernel] Add triton kernel for context attention (FAv2) without padding (#5192)
    * add context attn unpadded triton kernel; fix k/v cache copy and test; fix boundary of block ptrs; add support for GQA/MQA and testing
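The GQA/MQA support added in the kernels above (#5249's "support kv group", #5192's GQA/MQA bullet) boils down to an index mapping: with fewer kv heads than query heads, each group of num_q_heads // num_kv_heads query heads attends to the same kv head. A tiny reference sketch; a real kernel indexes into the kv head directly rather than materializing the expansion:

    import torch

    num_q_heads, num_kv_heads = 32, 8
    group = num_q_heads // num_kv_heads
    kv_head_for_q = torch.arange(num_q_heads) // group  # query head h reads kv head h // group

    # Naive reference: expand k/v to the query-head count and run standard attention.
    k = torch.randn(1, num_kv_heads, 128, 64)           # [bsz, kv_heads, seq, head_dim]
    k_expanded = k.repeat_interleave(group, dim=1)      # [bsz, q_heads, seq, head_dim]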
commit 4df8876fcad799ace567b2458df5feb3109ee917 (yuehuayingxueluo, Tue Jan 2 18:34:19 2024 +0800)
    Fixed a writing error
commit 9489dc64d8e01b04c9033c3dcaee83e25afebe42 (yuehuayingxueluo, Tue Jan 2 18:30:11 2024 +0800)
    precision alignment
commit 62968588d195126adc9b1bdb3adc02f199303ddf (yuehuayingxueluo, Tue Jan 2 13:02:20 2024 +0800)
    fix bugs in request_handler
commit 62fd08ee4425e031f8f1c43b25bf1ba5e7e33e8d (yuehuayingxueluo, Tue Dec 26 21:34:27 2023 +0800)
    Fixed a bug in the inference frame
commit 86853a37d5243b40d4b229d163494624b8027cd0 (yuehuayingxueluo, Mon Dec 25 14:07:43 2023 +0800)
    Add padding llama model
commit 0e616462a7f9e8faaa33d1700a2020ceb03ccd34 (Jianghai, Mon Dec 25 12:15:15 2023 +0800)
    [Inference] add logit processor and request handler (#5166)
    * add logit processor and request handler; add search tokens and update func; add running list test; add copy fun; del useless attn; fix request status
commit 8daee26989adad5ae5b152b24d3344db727986fe (yuehuayingxueluo, Mon Dec 18 10:40:47 2023 +0800)
    [Inference] Add the logic of the inference engine (#5173)
    * add infer_struct and infer_config; add hf_model_config to the engine; add ci test for config and struct; add the logic of the inference engine; update engine and test; recover cache_manager.py; update model and tokenizer; add the logic about shardformer; change kvcache_manager docstring; add policy; fix ci bug in test_kvcache_manager.py; remove tokenizer-related codes and move model_policy; add ordered_set to requirements-infer.txt and requirements-test.txt
commit 93aeacca342ab03732362dbb9096ab1265f4a8b3 (Jianghai, Tue Dec 12 17:22:41 2023 +0800)
    [Inference] Update inference config and fix test (#5178)
    * unify the config setting; fix import; add logger; revise log info
commit 3de2e622995321b042d4a8cffcd61686cda4a58e (Yuanheng Zhao, Mon Dec 11 10:56:18 2023 +0800)
    [Inference] Add CacheBlock and KV-Cache Manager (#5156)
    * add KVCache Manager and tests; add attr beam width; revise alloc func in CacheManager; add tp slicing for head number; optimize shapes of tensors used as physical cache; apply InferenceConfig on KVCacheManager; rm duplicate config file; optimize cache allocation: use contiguous cache
commit fab9b931d9e24c6e8ada8025cf8cf12719c3d2af (yuehuayingxueluo, Thu Dec 7 14:34:01 2023 +0800)
    [Inference] Add BatchInferState, Sequence and InferConfig (#5149)
    * add infer_struct and infer_config; change InferConfig; add hf_model_config to the engine; add ci test for config and struct
commit 2bb92243d4151873d75a9d6d9c2275b390e1716a (Yuanheng Zhao, Tue Dec 5 15:12:57 2023 +0800)
    [Inference/NFC] Clean outdated inference tests and deprecated kernels (#5159)
    * remove outdated inference and kernel tests; remove deprecated triton kernels and their imports
commit 56e75eeb063279fbc0fc84e25f267f1ca208e784 (Jianghai, Fri Dec 1 17:31:31 2023 +0800)
    [Inference] Add readme (roadmap) and fulfill request handler (#5147)
    * request handler; add readme
commit 4cf4682e70f70dea8e0510705d3383de0bf1a4a8 (Jianghai, Fri Dec 1 17:02:44 2023 +0800)
    [Inference] First PR for rebuild colossal-infer (#5143)
    * add engine and scheduler; add dirs
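The CacheBlock/KVCacheManager commit (#5156) is where the paged layout the kernels above rely on gets managed: the physical cache is carved into fixed-size blocks, and each sequence receives blocks on demand, recorded in its block table. A minimal allocator sketch of that bookkeeping; the class and method names are hypothetical, not the repo's API:

    class BlockAllocator:
        """Fixed-size block pool backing a paged KV cache."""

        def __init__(self, num_blocks, block_size):
            self.block_size = block_size
            self.free_blocks = list(range(num_blocks))
            self.block_tables = {}  # seq_id -> list of physical block ids

        def allocate(self, seq_id, seq_len):
            # Ensure seq_id owns enough blocks for seq_len tokens.
            table = self.block_tables.setdefault(seq_id, [])
            needed = -(-seq_len // self.block_size) - len(table)  # ceil division
            if needed > len(self.free_blocks):
                raise RuntimeError("KV cache out of blocks; request must wait")
            for _ in range(needed):
                table.append(self.free_blocks.pop())
            return table

        def free(self, seq_id):
            # Return a finished sequence's blocks to the pool.
            self.free_blocks.extend(self.block_tables.pop(seq_id, []))

    alloc = BlockAllocator(num_blocks=1024, block_size=16)
    table = alloc.allocate(seq_id=0, seq_len=37)  # 3 blocks cover 37 tokens
    alloc.free(seq_id=0)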