[Inference]Support vllm testing in benchmark scripts #5379

isky-cd · 2024-02-08T06:47:10Z

colossalai/inference/config.py

examples/inference/benchmark_llama.py

colossalai/inference/core/engine.py

FrankLeeeee · 2024-02-08T07:27:15Z

lgtm.

* [Inference] First PR for rebuild colossal-infer (#5143) * add engine and scheduler * add dirs --------- Co-authored-by: CjhHa1 <cjh18671720497outlook.com> * [Inference] Add readme (roadmap) and fulfill request handler (#5147) * request handler * add readme --------- Co-authored-by: CjhHa1 <cjh18671720497outlook.com> * [Inference/NFC] Clean outdated inference tests and deprecated kernels (#5159) * [inference/nfc] remove outdated inference tests * remove outdated kernel tests * remove deprecated triton kernels * remove imports from deprecated kernels * [Inference]Add BatchInferState, Sequence and InferConfig (#5149) * add infer_struct and infer_config * update codes * change InferConfig * Add hf_model_config to the engine * rm _get_hf_model_config * update codes * made adjustments according to the feedback from the reviewer. * update codes * add ci test for config and struct * [Inference] Add CacheBlock and KV-Cache Manager (#5156) * [Inference] Add KVCache Manager * function refactored * add test for KVCache Manager * add attr beam width * Revise alloc func in CacheManager * Fix docs and pytests * add tp slicing for head number * optimize shapes of tensors used as physical cache * Apply using InferenceConfig on KVCacheManager * rm duplicate config file * Optimize cache allocation: use contiguous cache * Fix config in pytest (and config) * [Inference]Update inference config and fix test (#5178) * unify the config setting * fix test * fix import * fix test * fix * fix * add logger * revise log info --------- Co-authored-by: CjhHa1 <cjh18671720497outlook.com> * [Inference] Add the logic of the inference engine (#5173) * add infer_struct and infer_config * update codes * change InferConfig * Add hf_model_config to the engine * rm _get_hf_model_config * update codes * made adjustments according to the feedback from the reviewer. 
* update codes * add ci test for config and struct * Add the logic of the inference engine * update engine and test * Recover cache_manager.py * add logger * fix conflict * update codes * update codes * update model and tokenizer * fix add the logic about shardformer * change kvcache_manager docstring * add policy * fix ci bug in test_kvcache_manager.py * remove codes related o tokenizer and move model_policy * fix code style * add ordered_set to requirements-infer.txt * Delete extra empty lines * add ordered_set to requirements-test.txt * [Inference] add logit processor and request handler (#5166) * add logit processor and request handler * add * add * add * fix * add search tokens and update func * finish request handler * add running list test * fix test * fix some bug * add * add * fix bugs * fix some bugs * fix bug * fix * fix * add copy fun * del useless attn * fix request status --------- Co-authored-by: CjhHa1 <cjh18671720497outlook.com> * Add padding llama model * Fixed a bug in the inference frame * fix bugs in request_handler * precision alignment * Fixed a writing error * [kernel] Add triton kernel for context attention (FAv2) without padding (#5192) * add context attn unpadded triton kernel * test compatibility * kv cache copy (testing) * fix k/v cache copy * fix kv cache copy and test * fix boundary of block ptrs * add support for GQA/MQA and testing * fix import statement --------- Co-authored-by: Round Heng <yuanhengzhao@Rounds-MacBook-Pro.local> * add context_attention_unpadded * fix bugs in sampler * Fixed a typo * fix beam_width * [Inference] Pytorch Attention func, pad&nopad input support (#5219) * add attn * add attention test * fix attn forward * fix decoding * fix bugs in attention.py and request_handler.py * adapted to pad_context_forward * [Hotfix] Fix accuracy and align attention method api with Triton kernel (#5229) * fix accuracy * alignment in attention * fix attention * fix * fix bugs * fix bugs * fix bugs * fix bugs related to 
processing padding mask * fix CI bugs * rm torch.cuda.synchronize * fix bugs in request_handler.py and engine.py * [Inference] Kernel: no pad rotary embedding (#5252) * fix bugs * comment * use more accurate atol * fix * [kernel] Add flash decoding triton kernel for blocked kv cache (#5249) * add flash decoding unpad triton kernel * rename flash decoding kernel * add kernel testing (draft) * revise pytest * support kv group (GQA) * (trivial) fix api and pytest * (trivial) func renaming * (trivial) func/file renaming * refactor pytest for attention * (trivial) format and consistent vars of context/decode attn * (trivial) remove test redundancy * [git] fixed rebased files * [kernel] Add KV cache copy kernel during decoding (#5261) * add kv copy triton kernel during decoding stage * add pytest and fix kernel * fix test utilities * revise kernel config * add benchmark for kvcache copy * [doc] updated inference readme (#5269) * [Inference] Fix request handler and add recycle logic (#5260) * fix request handler * fix comment * [kernel] Revise KVCache copy triton kernel API (#5273) * [kernel/fix] revise kvcache copy kernel api * fix benchmark * [Inference]Adapted to the triton attn kernels (#5264) * adapted to the triton attn kernels * fix pad input * adapted to copy_kv_to_blocked_cache * fix ci test * update kv memcpy * remove print * [kernel] Add RMSLayerNorm triton kernel (#5262) * add layerrmsnorm triton kernel * add layerrmsnorm kernel * modify the atol and rtol in test file * Remove the logics of mean computations, and update the name of ther kernel functions and files * add benchmark of rms norm * [Hotfix] Fix bugs in testing continuous batching (#5270) * fix bug * fix bugs * fix bugs * fix bugs and add padding * add funcs and fix bugs * fix typos * fix bugs * add func * [kernel/fix] Performance Optimization for Decoding Kernel and Benchmarking (#5274) * prevent re-creating intermediate tensors * add singleton class holding intermediate values * fix triton kernel 
api * add benchmark in pytest * fix kernel api and add benchmark * revise flash decoding triton kernel in/out shapes * fix calling of triton kernel in modeling * fix pytest: extract to util functions * [inference] Adapted to Rotary Embedding and RMS Norm (#5283) * adapted to rotary_embedding * adapted to nopad rms norm * fix bugs in benchmark * fix flash_decoding.py * add utils.py * [Inference] Benchmarking rotary embedding and add a fetch function (#5277) * fix bugs and add a cos/sin cache fetch func * add docstring * fix bug * fix * [Kernel/Fix] Revise flash attention triton kernel API and add benchmark (#5301) * fix decoding kernel pytest * revise and add triton context attn benchmark * [Inference]Add fused rotary kernel and get cos cache kernel (#5302) * add fused rotary and get cos cache func * staged * fix bugs * fix bugs * [hotfix] fix boundary check in batch (#5306) * [inference]Optimize the usage of the mid tensors space in flash attn (#5304) * opt flash attn * opt tmp tensor * fix benchmark_llama * fix code style * fix None logic for output tensor * fix adapted to get_xine_cache * add comment * fix ci bugs * fix some codes * rm duplicated codes * rm duplicated codes * fix code style * add _get_dtype in config.py * fix (#5311) * [Inference] Update rms norm kernel, benchmark with vLLM (#5315) * add * xi * del * del * fix * [DOC] Update inference readme (#5280) * add readme * add readme * 1 * update engine * finish readme * add readme * [Inference]Add Nopadding Llama Modeling (#5327) * add nopadding llama modeling * add nopadding_llama.py * rm unused codes * fix bugs in test_xine_copy.py * fix code style * [Infer] Optimize Blocked KVCache And Kernels Using It (#5325) * revise shape of kvcache (context attn kernel) * revise shape of kvcache (flash decoding kernel) * revise shape of kvcache (kvcache copy) and attn func * init of kvcache in kvcache manager * revise llama modeling * revise block size retrieval * use torch for rms_norm benchmarking * revise block 
size retrieval * [Inference] Kernel Fusion, fused copy kv cache into rotary embedding (#5336) * revise rotary embedding * remove useless print * adapt * [inference] simplified config verification (#5346) * [inference] simplified config verification * polish * polish * [Inference]Repalce Attention layer and MLP layer by shardformer to optimize the weight transpose operation，add fused_qkv and fused linear_add (#5340) * add fused qkv * replace attn and mlp by shardformer * fix bugs in mlp * add docstrings * fix test_inference_engine.py * add optimize unbind * add fused_addmm * rm squeeze(1) * refactor codes * fix ci bugs * rename ShardFormerLlamaMLP and ShardFormerLlamaAttention * Removed the dependency on LlamaFlashAttention2 * rollback test_inference_engine.py * [inference] removed redundancy init_batch (#5353) * [inference] moved ops tests to test_infer (#5354) * [doc] updated inference readme (#5343) * [Inference/opt]Optimize the mid tensor of RMS Norm (#5350) * opt rms_norm * fix bugs in rms_layernorm * [Inference]Optimize generation process of inference engine (#5356) * opt inference engine * fix run_benchmark.sh * fix generate in engine.py * rollback tesh_inference_engine.py * [Fix/Infer] Remove unused deps and revise requirements (#5341) * remove flash-attn dep * rm padding llama * revise infer requirements * move requirements out of module * [Inference]Fused the gate and up proj in mlp，and optimized the autograd process. (#5365) * fused the gate and up proj in mlp * fix code styles * opt auto_grad * rollback test_inference_engine.py * modifications based on the review feedback. * fix bugs in flash attn * Change reshape to view * fix test_rmsnorm_triton.py * [Inference] Adapt to Fused rotary (#5348) * revise rotary embedding * remove useless print * adapt * fix * add * fix * modeling * fix * fix * fix * Revert "[Inference] Adapt to Fused rotary (#5348)" (#5373) This reverts commit 9f4ab2e. 
* [inference] added inference template (#5375) * [Inference/opt] Fused KVCahce Memcopy (#5374) * fused kv memcopy * add TODO in test_kvcache_copy.py * [Inference] User Experience: update the logic of default tokenizer and generation config. (#5337) * add * fix * fix * pause * fix * fix pytest * align * fix * license * fix * fix * fix readme * fix some bugs * remove tokenizer config * [inference] refactored config (#5376) * [Inference]Support vllm testing in benchmark scripts (#5379) * add vllm benchmark scripts * fix code style * update run_benchmark.sh * fix code style * [Inference] Optimize and Refactor Inference Batching/Scheduling (#5367) * add kvcache manager funcs for batching * add batch bucket for batching * revise RunningList struct in handler * add kvcache/batch funcs for compatibility * use new batching methods * fix indexing bugs * revise abort logic * use cpu seq lengths/block tables * rm unused attr in Sequence * fix type conversion/default arg * add and revise pytests * revise pytests, rm unused tests * rm unused statements * fix pop finished indexing issue * fix: use index in batch when retrieving inputs/update seqs * use dict instead of odict in batch struct * arg type hinting * fix make compress * refine comments * fix: pop_n_seqs to pop the first n seqs * add check in request handler * remove redundant conversion * fix test for request handler * fix pop method in batch bucket * fix prefill adding * [Inference]Fused kv copy into rotary calculation (#5383) * revise rotary embedding * remove useless print * adapt * fix * add * fix * modeling * fix * fix * fix * fused kv copy * fused copy * colossalai/kernel/triton/no_pad_rotary_embedding.py * del padding llama * del * Optimized the execution interval time between cuda kernels caused by view and memcopy (#5390) * opt_view_and_memcopy * fix bugs in ci * fix ci bugs * update benchmark scripts * fix ci bugs * [Fix/Inference] Fix format of input prompts and input model in inference engine (#5395) * Fix 
bugs in inference_engine * fix bugs in engine.py * rm CUDA_VISIBLE_DEVICES * add request_ids in generate * fix bug in engine.py * add logger.debug for BatchBucket * [Infer/Fix] Fix Dependency in test - RMSNorm kernel (#5399) fix dependency in pytest * [Inference]Add CUDA KVCache Kernel (#5406) * add cuda KVCache kernel * annotation benchmark_kvcache_copy * add use cuda * fix import path * move benchmark scripts to example/ * rm benchmark codes in test_kv_cache_memcpy.py * rm redundancy codes * rm redundancy codes * pr was modified according to the review * [Inference]Move benchmark-related code to the example directory. (#5408) * move benchmark-related code to the example directory. * fix bugs in test_fused_rotary_embedding.py * add silu_and_mul for infer * [feat] cuda graph support and refactor non-functional api * add reusable utils for cuda * refactor code * feat rmsnorm cuda kernel and add unittest, benchmark script (#5417) * [fix] multi graphs capture error * [fix] multi graphs capture error * [doc] add doc * refactor code * optimize rmsnorm: add vectorized elementwise op, feat loop unrolling (#5441) * fix include path * fix rmsnorm template function invocation problem(template function partial specialization is not allowed in Cpp) and luckily pass e2e precision test (#5454) * [Inference/kernel]Add Fused Rotary Embedding and KVCache Memcopy CUDA Kernel (#5418) * add rotary embedding kernel * add rotary_embedding_kernel * add fused rotary_emb and kvcache memcopy * add fused_rotary_emb_and_cache_kernel.cu * add fused_rotary_emb_and_memcopy * fix bugs in fused_rotary_emb_and_cache_kernel.cu * fix ci bugs * use vec memcopy and opt the gloabl memory access * fix code style * fix test_rotary_embdding_unpad.py * codes revised based on the review comments * fix bugs about include path * rm inline * [fix] pytest and fix dyn grid bug * diverse tests * add implementatino for GetGPULaunchConfig1D * [fix] tmp for test * add some comments * refactor vector utils * [feat] 
add use_cuda_kernel option * add vec_type_trait implementation (#5473) * [fix] unused option * [fix] * [fix] * [fix] remove unused comment * [Inference]Support FP16/BF16 Flash Attention 2 And Add high_precision Flag To Rotary Embedding (#5461) * Support FP16/BF16 Flash Attention 2 * fix bugs in test_kv_cache_memcpy.py * add context_kv_cache_memcpy_kernel.cu * rm typename MT * add tail process * add high_precision * add high_precision to config.py * rm unused code * change the comment for the high_precision parameter * update test_rotary_embdding_unpad.py * fix vector_copy_utils.h * add comment for self.high_precision when using float32 * [fix] PR #5354 (#5501) * [fix] * [fix] * Update config.py docstring * [fix] docstring align * [fix] docstring align * [fix] docstring align * [Inference] Optimize request handler of llama (#5512) * optimize request_handler * fix ways of writing * The writing style of tail processing and the logic related to macro definitions have been optimized. (#5519) * [Inference/Kernel]Add get_cos_and_sin Kernel (#5528) * Add get_cos_and_sin kernel * fix code comments * fix code typos * merge common codes of get_cos_and_sin kernel. * Fixed a typo * Changed 'asset allclose' to 'assert equal'. 
* [Inference] Add Reduce Utils (#5537) * add reduce utils * add using to delele namespace prefix * [Fix/Inference] Remove unused and non-functional functions (#5543) * [fix] remove unused func * rm non-functional partial * add cast and op_functor for cuda build-in types (#5546) * remove unused triton kernels * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * remove outdated triton test * [Infer] Revise and Adapt Triton Kernels for Spec-Dec (#5401) * [Infer/Fix] Fix Dependency in test - RMSNorm kernel (#5399) fix dependency in pytest * resolve conflicts for revising flash-attn * adapt kv cache copy kernel for spec-dec * fix seqlen-n kvcache copy kernel/tests * test kvcache copy - use torch.equal * add assertions * (trivial) comment out * [Inference/SpecDec] Add Basic Drafter Model Container (#5405) * [Infer/Fix] Fix Dependency in test - RMSNorm kernel (#5399) fix dependency in pytest * add drafter model container (basic ver) * [Inference/SpecDec] Add Speculative Decoding Implementation (#5423) * fix flash decoding mask during verification * add spec-dec * add test for spec-dec * revise drafter init * remove drafter sampling * retire past kv in drafter * (trivial) rename attrs * (trivial) rename arg * revise how we enable/disable spec-dec * [SpecDec] Fix inputs for speculation and revise past KV trimming (#5449) * fix drafter pastkv and usage of batch bucket * [Inference/SpecDec] Support GLIDE Drafter Model (#5455) * add glide-llama policy and modeling * update glide modeling, compitable with transformers 4.36.2 * revise glide llama modeling/usage * fix issues of glimpsing large kv * revise the way re-loading params for glide drafter * fix drafter and engine tests * enable convert to glide strict=False * revise glide llama modeling * revise vicuna prompt template * revise drafter and tests * apply usage of glide model in engine * [doc] Add inference/speculative-decoding README (#5552) * add README for spec-dec * 
update roadmap * [Fix] resolve conflicts of rebasing feat/speculative-decoding (#5557) - resolve conflicts of rebasing feat/speculative-decoding * [Fix] Llama Modeling Control with Spec-Dec (#5580) - fix ref before asgmt - fall back to use triton kernels when using spec-dec * refactor csrc (#5582) * [Inference/Refactor] Delete Duplicated code and refactor vec_copy utils and reduce utils (#5593) * delete duplicated code and refactor vec_copy utils and reduce utils * delete unused header file * [inference/model]Adapted to the baichuan2-7B model (#5591) * Adapted to the baichuan2-7B model * modified according to the review comments. * Modified the method of obtaining random weights. * modified according to the review comments. * change mlp layewr 'NOTE' * [Inference/Kernel] Add Paged Decoding kernel, sequence split within the same thread block (#5531) * feat flash decoding for paged attention * refactor flashdecodingattention * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * [Feat]Tensor Model Parallel Support For Inference (#5563) * tensor parallel support naive source * [fix]precision, model load and refactor the framework * add tp unit test * docstring * fix do_sample * feat baichuan2 rmsnorm whose hidden size equals to 5120 (#5611) * [Fix/Inference] Fix GQA Triton and Support Llama3 (#5624) * [fix] GQA calling of flash decoding triton * fix kv cache alloc shape * fix rotary triton - GQA * fix sequence max length assigning * Sequence max length logic * fix scheduling and spec-dec * skip without import error * fix pytest - skip without ImportError --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * [Fix/Inference]Fix CUDA Rotary Rmbedding GQA (#5623) * fix rotary embedding GQA * change test_rotary_embdding_unpad.py KH * [example] Update Llama Inference example 
(#5629) * [example] add infernece benchmark llama3 * revise inference config - arg * remove unused args * add llama generation demo script * fix init rope in llama policy * add benchmark-llama3 - cleanup * [Inference/Refactor] Refactor compilation mechanism and unified multi hw (#5613) * refactor compilation mechanism and unified multi hw * fix file path bug * add init.py to make pybind a module to avoid relative path error caused by softlink * delete duplicated micros * fix micros bug in gcc * [Fix/Inference]Fix vllm benchmark (#5630) * Fix bugs about OOM when running vllm-0.4.0 * rm used params * change generation_config * change benchmark log file name * [Inference/Kernel] Optimize paged attention: Refactor key cache layout (#5643) * optimize flashdecodingattention: refactor code with different key cache layout(from [num_blocks, num_kv_heads, block_size, head_size] to [num_blocks, num_kv_heads, head_size/x, block_size, x]) * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * [Fix] Remove obsolete files - inference (#5650) * [Inference]Adapt to baichuan2 13B (#5614) * adapt to baichuan2 13B * adapt to baichuan2 13B * change BAICHUAN_MODEL_NAME_OR_PATH * fix test_decoding_attn.py * Modifications based on review comments. 
* change BAICHUAN_MODEL_NAME_OR_PATH * mv attn mask processes to test flash decoding * mv get_alibi_slopes baichuan modeling * fix bugs in test_baichuan.py * [kernel] Support new KCache Layout - Context Attention Triton Kernel (#5658) * add context attn triton kernel - new kcache layout * add benchmark triton * tiny revise * trivial - code style, comment * [Inference/Feat] Add kvcache quantization support for FlashDecoding (#5656) * [Inference/Feat] Feat quant kvcache step2 (#5674) * [Inference] Adapt Baichuan2-13B TP (#5659) * adapt to baichuan2 13B * add baichuan2 13B TP * update baichuan tp logic * rm unused code * Fix TP logic * fix alibi slopes tp logic * rm nn.Module * Polished the code. * change BAICHUAN_MODEL_NAME_OR_PATH * Modified the logic for loading Baichuan weights. * fix typos * [Inference/Kernel] refactor kvcache manager and rotary_embedding and kvcache_memcpy oper… (#5663) * refactor kvcache manager and rotary_embedding and kvcache_memcpy operator * refactor decode_kv_cache_memcpy * enable alibi in pagedattention * [Inference/Feat] Add kvcache quant support for fused_rotary_embedding_cache_copy (#5680) * [inference]Add alibi to flash attn function (#5678) * add alibi to flash attn function * rm redundant modifications * [Inference] Fix quant bits order (#5681) * [kernel] Support New KCache Layout - Triton Kernel (#5677) * kvmemcpy triton for new kcache layout * revise tests for new kcache layout * naive triton flash decoding - new kcache layout * rotary triton kernel - new kcache layout * remove redundancy - triton decoding * remove redundancy - triton kvcache copy * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * [Fix] Fix & Update Inference Tests (compatibility w/ main) * [Inference] Remove unnecessary float4_ and rename float8_ to float8 (#5679) * [Inference/Feat] Add quant kvcache support for 
decode_kv_cache_memcpy (#5686) * [hotfix] Fix KV Heads Number Assignment in KVCacheManager (#5695) - Fix key value number assignment in KVCacheManager, as well as method of accessing * [Fix] Fix Inference Example, Tests, and Requirements (#5688) * clean requirements * modify example inference struct * add test ci scripts * mark test_infer as submodule * rm deprecated cls & deps * import of HAS_FLASH_ATTN * prune inference tests to be run * prune triton kernel tests * increment pytest timeout mins * revert import path in openmoe * [hotfix] fix OpenMOE example import path (#5697) * [Inference]Adapt temperature processing logic (#5689) * Adapt temperature processing logic * add ValueError for top_p and top_k * add GQA Test * fix except_msg * [Inference] Support the logic related to ignoring EOS token (#5693) * Adapt temperature processing logic * add ValueError for top_p and top_k * add GQA Test * fix except_msg * support ignore EOS token * change variable's name * fix annotation * [Inference] ADD async and sync Api server using FastAPI (#5396) * add api server * fix * add * add completion service and fix bug * add generation config * revise shardformer * fix bugs * add docstrings and fix some bugs * fix bugs and add choices for prompt template * [Inference] Finish Online Serving Test, add streaming output api, continuous batching test and example (#5432) * finish online test and add examples * fix test_contionus_batching * fix some bugs * fix bash * fix * fix inference * finish revision * fix typos * revision * [Online Server] Chat Api for streaming and not streaming response (#5470) * fix bugs * fix bugs * fix api server * fix api server * add chat api and test * del request.n * [Inference] resolve rebase conflicts fix * [Inference] Fix bugs and docs for feat/online-server (#5598) * fix test bugs * add do sample test * del useless lines * fix comments * fix tests * delete version tag * delete version tag * add * del test sever * fix test * fix * Revert "add" This 
reverts commit b9305fb. * resolve rebase conflicts on Branch feat/online-serving * [Inference] Add example test_ci script * [Inference/Feat] Add quant kvcache interface (#5700) * add quant kvcache interface * delete unused output * complete args comments * [Inference/Feat] Add convert_fp8 op for fp8 test in the future (#5706) * add convert_fp8 op for fp8 test in the future * rerun ci * [Inference]Adapt repetition_penalty and no_repeat_ngram_size (#5708) * Adapt repetition_penalty and no_repeat_ngram_size * fix no_repeat_ngram_size_logit_process * remove batch_updated * fix annotation * modified codes based on the review feedback. * rm get_batch_token_ids * [Feat]Inference RPC Server Support (#5705) * rpc support source * kv cache logical/physical disaggregation * sampler refactor * colossalai launch built in * Unitest * Rpyc support --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * add paged-attetionv2: support seq length split across thread block (#5707) * [Inference] Delete duplicated copy_vector (#5716) * [ci] Fix example tests (#5714) * [fix] revise timeout value on example CI * trivial * [Fix] Llama3 Load/Omit CheckpointIO Temporarily (#5717) * Fix Llama3 Load error * Omit Checkpoint IO Temporarily * [Inference] Fix API server, test and example (#5712) * fix api server * fix generation config * fix api server * fix comments * fix infer hanging bug * resolve comments, change backend to free port * 【Inference] Delete duplicated package (#5723) * [example] Update Inference Example (#5725) * [example] update inference example * [lazy] fix lazy cls init (#5720) * fix * fix * fix * fix * fix * remove kernel intall * rebase revert fix * fix * fix * [Inference] Fix Inference Generation Config and Sampling (#5710) * refactor and add * config default values * fix gen config passing * fix rpc generation config * [Fix/Inference] Add unsupported auto-policy error message (#5730) * [fix] auto policy error message * trivial 
* [doc] Update Inference Readme (#5736) * [doc] update inference readme * add contents * trivial * [Shardformer] Add parallel output for shardformer models(bloom, falcon) (#5702) * [pre-commit.ci] auto fixes from pre-commit.com hooks * add parallel cross entropy output for falcon model & fix some typos in bloom.py * fix module name error, self.model -> self.transformers in bloom, falcon model * Fix the overflow bug of distributed cross entropy loss function when training with fp16 * add dtype to parallel cross entropy loss function * fix dtype related typos adn prettify the loss.py * fix grad dtype and update dtype mismatch error * fix typo bugs * [bug] fix silly bug * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * [chore] add test for prefetch * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * [ci] Temporary fix for build on pr (#5741) * temporary fix for CI * timeout to 90 * [NFC] Fix code factors on inference triton kernels (#5743) * [NFC] fix requirements (#5744) * [inference] release (#5747) * [inference] release * [inference] release * [inference] release * [inference] release * [inference] release * [inference] release * [inference] release --------- Co-authored-by: Jianghai <72591262+CjhHa1@users.noreply.github.com> Co-authored-by: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com> Co-authored-by: yuehuayingxueluo <867460659@qq.com> Co-authored-by: Round Heng <yuanhengzhao@Rounds-MacBook-Pro.local> Co-authored-by: FrankLeeeee <somerlee.9@gmail.com> Co-authored-by: Yaozheng Fang <62918515+nkfyz@users.noreply.github.com> Co-authored-by: xs_courtesy <xs1580802568@gmail.com> Co-authored-by: Runyu Lu <runyulu@umich.edu> Co-authored-by: Steve Luo <36296769+SunflowerAries@users.noreply.github.com> Co-authored-by: Runyu Lu 
<77330637+LRY89757@users.noreply.github.com> Co-authored-by: Yuanheng <jonathan.zhaoyh@gmail.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: CjhHa1 <cjh18671720497@outlook.com> Co-authored-by: flybird11111 <1829166702@qq.com> Co-authored-by: Haze188 <haze188@qq.com> Co-authored-by: binmakeswell <binmakeswell@gmail.com>
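The commit list above mentions a key cache layout refactor (#5643), from `[num_blocks, num_kv_heads, block_size, head_size]` to `[num_blocks, num_kv_heads, head_size/x, block_size, x]`. The index arithmetic behind such a change can be sketched in plain Python as below. This is only an illustration, not the kernel code from that PR: the function names are hypothetical, the commit message does not define `x` (it is assumed here to be the number of contiguous head-dimension elements kept together), and the sizes used are arbitrary.

```python
"""Illustrative index math for the key cache layout change named in #5643.

All names here are hypothetical; this is not code from the repository.
"""


def old_offset(block, head, token, dim, num_kv_heads, block_size, head_size):
    # Row-major offset in [num_blocks, num_kv_heads, block_size, head_size].
    return ((block * num_kv_heads + head) * block_size + token) * head_size + dim


def new_offset(block, head, token, dim, num_kv_heads, block_size, head_size, x):
    # Row-major offset in [num_blocks, num_kv_heads, head_size/x, block_size, x].
    assert head_size % x == 0, "head_size must be divisible by x"
    chunk, lane = divmod(dim, x)  # which group of x elements, and position inside it
    base = (block * num_kv_heads + head) * (head_size // x) + chunk
    return (base * block_size + token) * x + lane


def convert_layout(flat_old, num_blocks, num_kv_heads, block_size, head_size, x):
    # Copy every element from its old flat position to its new flat position.
    flat_new = [None] * len(flat_old)
    for b in range(num_blocks):
        for h in range(num_kv_heads):
            for t in range(block_size):
                for d in range(head_size):
                    src = old_offset(b, h, t, d, num_kv_heads, block_size, head_size)
                    dst = new_offset(b, h, t, d, num_kv_heads, block_size, head_size, x)
                    flat_new[dst] = flat_old[src]
    return flat_new
```

In the new layout, the same group of `x` head-dimension elements for consecutive tokens of a block sits `x` elements apart in memory, whereas in the old layout consecutive tokens are `head_size` elements apart.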

commit 4647ec28c8450ee96f4709626617763712efd77e Author: binmakeswell <binmakeswell@gmail.com> Date: Thu May 23 17:44:06 2024 +0800 [inference] release (#5747) * [inference] release * [inference] release * [inference] release * [inference] release * [inference] release * [inference] release * [inference] release commit df6747603f11e2a1929db193ceb014799e02e2c1 Merge: 22ce873c 498f42c4 Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com> Date: Wed May 22 14:31:09 2024 +0800 [Colossal-Inference] (v0.1.0) Merge pull request #5739 from hpcaitech/feature/colossal-infer [Inference] Merge feature/colossal-infer commit 498f42c45b256b5cfc32d74b552e1e306f317a42 Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com> Date: Wed May 22 12:08:49 2024 +0800 [NFC] fix requirements (#5744) commit bd38fe6b912379080673a43d77fd3bdf0e5c852e Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com> Date: Tue May 21 22:12:15 2024 +0800 [NFC] Fix code factors on inference triton kernels (#5743) commit c2c8c9cf17d67000df8a5b75ae9dbecee0e1c00a Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com> Date: Tue May 21 18:20:57 2024 +0800 [ci] Temporary fix for build on pr (#5741) * temporary fix for CI * timeout to 90 commit c06208e72c35d74e150b6a83e72375f5021d10b1 Merge: d8b1ea4a 8633c15d Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com> Date: Tue May 21 11:26:37 2024 +0800 Merge pull request #5737 from yuanheng-zhao/inference/sync/main [sync] Sync feature/colossal-infer with main commit 22ce873c3f26fd7f4217cdf19071c173683c2b47 Author: Haze188 <haze188@qq.com> Date: Tue May 21 11:07:13 2024 +0800 [Shardformer] Add parallel output for shardformer models(bloom, falcon) (#5702) * [pre-commit.ci] auto fixes from pre-commit.com hooks * add parallel cross entropy output for falcon model & fix some typos in bloom.py * fix module name error, self.model -> self.transformers in bloom, falcon model * Fix the overflow 
bug of distributed cross entropy loss function when training with fp16 * add dtype to parallel cross entropy loss function * fix dtype related typos adn prettify the loss.py * fix grad dtype and update dtype mismatch error * fix typo bugs commit 8633c15da9b82c675c59ad292e7f0d77f092653c Merge: d8b1ea4a 9d83c6d7 Author: Yuanheng Zhao <jonathan.zhaoyh@gmail.com> Date: Mon May 20 15:50:53 2024 +0000 [sync] Sync feature/colossal-infer with main commit d8b1ea4ac90317ad6126acbd854e66583a8f9c8f Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com> Date: Mon May 20 22:50:04 2024 +0800 [doc] Update Inference Readme (#5736) * [doc] update inference readme * add contents * trivial commit bdf9a001d61cfad4bb68752c4a808295165307a0 Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com> Date: Mon May 20 22:49:18 2024 +0800 [Fix/Inference] Add unsupported auto-policy error message (#5730) * [fix] auto policy error message * trivial commit 283c407a19002118bda7edd1b8a3acf099843205 Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com> Date: Sun May 19 15:08:42 2024 +0800 [Inference] Fix Inference Generation Config and Sampling (#5710) * refactor and add * config default values * fix gen config passing * fix rpc generation config commit 9d83c6d715e8cdb802f82335e651923baab5cfc6 Author: flybird11111 <1829166702@qq.com> Date: Fri May 17 18:18:59 2024 +0800 [lazy] fix lazy cls init (#5720) * fix * fix * fix * fix * fix * remove kernel intall * rebase revert fix * fix * fix commit 8bcfe360fdae7ccec7051aaced48497519afc2f2 Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com> Date: Fri May 17 11:28:53 2024 +0800 [example] Update Inference Example (#5725) * [example] update inference example commit a8d459f99a1d415fc843327e4dafce19ecee1f3e Author: 傅剑寒 <Xs1580802568@gmail.com> Date: Thu May 16 10:49:03 2024 +0800 【Inference] Delete duplicated package (#5723) commit f47f2fbb2467df15548d2c663b119f4ae0103890 Author: Jianghai 
<72591262+CjhHa1@users.noreply.github.com> Date: Wed May 15 15:47:31 2024 +0800 [Inference] Fix API server, test and example (#5712) * fix api server * fix generation config * fix api server * fix comments * fix infer hanging bug * resolve comments, change backend to free port commit 74c47921facd26dbd93172bf887abcad4eab2d5c Author: Runyu Lu <77330637+LRY89757@users.noreply.github.com> Date: Tue May 14 20:17:43 2024 +0800 [Fix] Llama3 Load/Omit CheckpointIO Temporarily (#5717) * Fix Llama3 Load error * Omit Checkpoint IO Temporarily commit 5bbab1533ae7672ab37e91b7bc9e584b3a4e9cc1 Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com> Date: Tue May 14 16:08:51 2024 +0800 [ci] Fix example tests (#5714) * [fix] revise timeout value on example CI * trivial commit 121d7ad629c746e52a96ec53d6e26c0194016a03 Author: 傅剑寒 <Xs1580802568@gmail.com> Date: Tue May 14 14:35:33 2024 +0800 [Inference] Delete duplicated copy_vector (#5716) commit 7806842f2dbb4b6d6e74014efc7db5be8ccf0bbd Author: Steve Luo <36296769+SunflowerAries@users.noreply.github.com> Date: Tue May 14 12:46:54 2024 +0800 add paged-attetionv2: support seq length split across thread block (#5707) commit 18d67d0e8e79c22bded0745c7d3daf8ca40d445c Author: Runyu Lu <77330637+LRY89757@users.noreply.github.com> Date: Tue May 14 10:00:55 2024 +0800 [Feat]Inference RPC Server Support (#5705) * rpc support source * kv cache logical/physical disaggregation * sampler refactor * colossalai launch built in * Unitest * Rpyc support --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> commit de4bf3dedf2c7cb7ba6c3044745bab3c3ef6352d Author: yuehuayingxueluo <867460659@qq.com> Date: Sat May 11 15:13:25 2024 +0800 [Inference]Adapt repetition_penalty and no_repeat_ngram_size (#5708) * Adapt repetition_penalty and no_repeat_ngram_size * fix no_repeat_ngram_size_logit_process * remove batch_updated * fix annotation * modified codes based on the review feedback. 
* rm get_batch_token_ids commit 50104ab340e6c7067fbaaf9b47c608eb828aa95b Author: 傅剑寒 <Xs1580802568@gmail.com> Date: Fri May 10 18:39:54 2024 +0800 [Inference/Feat] Add convert_fp8 op for fp8 test in the future (#5706) * add convert_fp8 op for fp8 test in the future * rerun ci commit bfad39357b0fe31ecf6f7639e2c4056165078a3f Author: 傅剑寒 <Xs1580802568@gmail.com> Date: Thu May 9 18:03:24 2024 +0800 [Inference/Feat] Add quant kvcache interface (#5700) * add quant kvcache interface * delete unused output * complete args comments commit 492520dbdb962d207ac40d216e0414807f73eb19 Merge: d4829220 5d9a4948 Author: Jianghai <72591262+CjhHa1@users.noreply.github.com> Date: Thu May 9 17:19:45 2024 +0800 Merge pull request #5588 from hpcaitech/feat/online-serving [Feature]Online Serving commit 5d9a49483d98ccd4bebebbfd039162caceefe6bd Author: CjhHa1 <cjh18671720497@outlook.com> Date: Thu May 9 05:44:05 2024 +0000 [Inference] Add example test_ci script commit bc9063adf1598c3be32fc2d12577d76b9daa79bf Author: CjhHa1 <cjh18671720497@outlook.com> Date: Wed May 8 10:36:42 2024 +0000 resolve rebase conflicts on Branch feat/online-serving commit 61a1b2e798edcbf91ac35966a4047407ad6aa62d Author: Jianghai <72591262+CjhHa1@users.noreply.github.com> Date: Wed May 8 15:14:06 2024 +0800 [Inference] Fix bugs and docs for feat/online-server (#5598) * fix test bugs * add do sample test * del useless lines * fix comments * fix tests * delete version tag * delete version tag * add * del test sever * fix test * fix * Revert "add" This reverts commit b9305fb02440d5cd566d32b508bee9f9c13dda15. 
commit 7bbb28e48bdb5849d9dfb118d7bf2959d79bbe02 Author: CjhHa1 <cjh18671720497@outlook.com> Date: Thu Apr 11 10:12:31 2024 +0800 [Inference] resolve rebase conflicts fix commit c06403286567f62cb0a6dfc5e075cf60e291cea9 Author: Jianghai <72591262+CjhHa1@users.noreply.github.com> Date: Sun Apr 7 14:45:43 2024 +0800 [Online Server] Chat Api for streaming and not streaming response (#5470) * fix bugs * fix bugs * fix api server * fix api server * add chat api and test * del request.n commit de378cd2abd77b464786dc5f8298c9edbf023fbc Author: Jianghai <72591262+CjhHa1@users.noreply.github.com> Date: Mon Mar 18 17:06:05 2024 +0800 [Inference] Finish Online Serving Test, add streaming output api, continuous batching test and example (#5432) * finish online test and add examples * fix test_contionus_batching * fix some bugs * fix bash * fix * fix inference * finish revision * fix typos * revision commit 69cd7e069d5705c7e431b301ac14924711c74e41 Author: Jianghai <72591262+CjhHa1@users.noreply.github.com> Date: Fri Mar 1 14:47:36 2024 +0800 [Inference] ADD async and sync Api server using FastAPI (#5396) * add api server * fix * add * add completion service and fix bug * add generation config * revise shardformer * fix bugs * add docstrings and fix some bugs * fix bugs and add choices for prompt template commit d482922035ff7b6fe7ced8e6c4028faa2d68197f Author: yuehuayingxueluo <867460659@qq.com> Date: Wed May 8 19:59:10 2024 +0800 [Inference] Support the logic related to ignoring EOS token (#5693) * Adapt temperature processing logic * add ValueError for top_p and top_k * add GQA Test * fix except_msg * support ignore EOS token * change variable's name * fix annotation commit 9c2fe7935ff5aaec4f174cfba6f324df623c7447 Author: yuehuayingxueluo <867460659@qq.com> Date: Wed May 8 17:58:29 2024 +0800 [Inference]Adapt temperature processing logic (#5689) * Adapt temperature processing logic * add ValueError for top_p and top_k * add GQA Test * fix except_msg commit 
12e7c28d5e8f219480d1dbc682fd225dc76fcc2b Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com> Date: Wed May 8 15:48:47 2024 +0800 [hotfix] fix OpenMOE example import path (#5697) commit 55cc7f3df7c600deae2f344ee162abae5a5c63e1 Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com> Date: Wed May 8 11:30:15 2024 +0800 [Fix] Fix Inference Example, Tests, and Requirements (#5688) * clean requirements * modify example inference struct * add test ci scripts * mark test_infer as submodule * rm deprecated cls & deps * import of HAS_FLASH_ATTN * prune inference tests to be run * prune triton kernel tests * increment pytest timeout mins * revert import path in openmoe commit f9afe0addd89303de4819debd93efe97d5618238 Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com> Date: Tue May 7 23:13:14 2024 +0800 [hotfix] Fix KV Heads Number Assignment in KVCacheManager (#5695) - Fix key value number assignment in KVCacheManager, as well as method of accessing commit 1ace1065e6bff175a0af88cae86d272acef29c9f Author: 傅剑寒 <Xs1580802568@gmail.com> Date: Mon May 6 15:35:13 2024 +0800 [Inference/Feat] Add quant kvcache support for decode_kv_cache_memcpy (#5686) commit db7b3051f4379862f88790bf1653ddb6443c002e Merge: 725fbd2e 8754abae Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com> Date: Mon May 6 14:43:38 2024 +0800 [Sync] Update from main to feature/colossal-infer (Merge pull request #5685) [Sync] Update from main to feature/colossal-infer - Merge pull request #5685 from yuanheng-zhao/inference/merge/main commit 725fbd2ed067f9c58ac04670377d3e6f2a96fe00 Author: Steve Luo <36296769+SunflowerAries@users.noreply.github.com> Date: Mon May 6 10:55:34 2024 +0800 [Inference] Remove unnecessary float4_ and rename float8_ to float8 (#5679) commit 8754abae24dbcc492d2992d1091428592b615285 Author: Yuanheng Zhao <jonathan.zhaoyh@gmail.com> Date: Sun May 5 16:28:56 2024 +0000 [Fix] Fix & Update Inference Tests (compatibility w/ 
main) commit 56ed09aba5e017fc0c211dac70215c2f83815919 Merge: 537a3cbc d3f34ee8 Author: Yuanheng Zhao <jonathan.zhaoyh@gmail.com> Date: Sun May 5 05:14:00 2024 +0000 [sync] resolve conflicts of merging main commit 537a3cbc4df445786c8ecf2af0a2998e2fd881b6 Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com> Date: Fri May 3 17:20:45 2024 +0800 [kernel] Support New KCache Layout - Triton Kernel (#5677) * kvmemcpy triton for new kcache layout * revise tests for new kcache layout * naive triton flash decoding - new kcache layout * rotary triton kernel - new kcache layout * remove redundancy - triton decoding * remove redundancy - triton kvcache copy * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> commit 9df016fc4520a5a5c95a11ed04a8ac62bde039c4 Author: 傅剑寒 <Xs1580802568@gmail.com> Date: Tue Apr 30 19:38:00 2024 +0800 [Inference] Fix quant bits order (#5681) commit f79963199cd30c5e917d430aedd79113d06d608c Author: yuehuayingxueluo <867460659@qq.com> Date: Tue Apr 30 19:35:05 2024 +0800 [inference]Add alibi to flash attn function (#5678) * add alibi to flash attn function * rm redundant modifications commit ef8e4ffe310bfe21f83feb965d962d816d75bc88 Author: 傅剑寒 <Xs1580802568@gmail.com> Date: Tue Apr 30 18:33:53 2024 +0800 [Inference/Feat] Add kvcache quant support for fused_rotary_embedding_cache_copy (#5680) commit 5cd75ce4c7edc95bacd8ec5fc04b8add339e8331 Author: Steve Luo <36296769+SunflowerAries@users.noreply.github.com> Date: Tue Apr 30 15:52:23 2024 +0800 [Inference/Kernel] refactor kvcache manager and rotary_embedding and kvcache_memcpy oper… (#5663) * refactor kvcache manager and rotary_embedding and kvcache_memcpy operator * refactor decode_kv_cache_memcpy * enable alibi in pagedattention commit 5f00002e43bd738a99fea250306e54c8c908f05a Author: yuehuayingxueluo <867460659@qq.com> Date: Tue Apr 30 
15:47:07 2024 +0800 [Inference] Adapt Baichuan2-13B TP (#5659) * adapt to baichuan2 13B * add baichuan2 13B TP * update baichuan tp logic * rm unused code * Fix TP logic * fix alibi slopes tp logic * rm nn.Module * Polished the code. * change BAICHUAN_MODEL_NAME_OR_PATH * Modified the logic for loading Baichuan weights. * fix typos commit 808ee6e4addccb51990398434547fa5df3c255b0 Author: 傅剑寒 <Xs1580802568@gmail.com> Date: Tue Apr 30 11:26:36 2024 +0800 [Inference/Feat] Feat quant kvcache step2 (#5674) commit 8ccb6714e79137c8e6e50d9a585eadbf70ae6fc0 Author: 傅剑寒 <Xs1580802568@gmail.com> Date: Fri Apr 26 19:40:37 2024 +0800 [Inference/Feat] Add kvcache quantization support for FlashDecoding (#5656) commit 5be590b99eb6c58c3aa809d453680139fdd2b9f7 Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com> Date: Fri Apr 26 17:51:49 2024 +0800 [kernel] Support new KCache Layout - Context Attention Triton Kernel (#5658) * add context attn triton kernel - new kcache layout * add benchmark triton * tiny revise * trivial - code style, comment commit 3c91e3f1763d2a30a85187a3a606dbe4d1b9454d Author: yuehuayingxueluo <867460659@qq.com> Date: Thu Apr 25 23:11:30 2024 +0800 [Inference]Adapt to baichuan2 13B (#5614) * adapt to baichuan2 13B * adapt to baichuan2 13B * change BAICHUAN_MODEL_NAME_OR_PATH * fix test_decoding_attn.py * Modifications based on review comments. 
* change BAICHUAN_MODEL_NAME_OR_PATH * mv attn mask processes to test flash decoding * mv get_alibi_slopes baichuan modeling * fix bugs in test_baichuan.py commit f342a9387168cedc2e5cc33155939c6d0c4e99a0 Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com> Date: Thu Apr 25 22:04:59 2024 +0800 [Fix] Remove obsolete files - inference (#5650) commit a8fd3b034235e1fa987a1ae85a9a2b465ee6128f Author: Steve Luo <36296769+SunflowerAries@users.noreply.github.com> Date: Thu Apr 25 14:24:02 2024 +0800 [Inference/Kernel] Optimize paged attention: Refactor key cache layout (#5643) * optimize flashdecodingattention: refactor code with different key cache layout(from [num_blocks, num_kv_heads, block_size, head_size] to [num_blocks, num_kv_heads, head_size/x, block_size, x]) * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> commit 90cd5227a348dfe506e95b2e49f2a8dcd34fdbca Author: yuehuayingxueluo <867460659@qq.com> Date: Wed Apr 24 14:51:36 2024 +0800 [Fix/Inference]Fix vllm benchmark (#5630) * Fix bugs about OOM when running vllm-0.4.0 * rm used params * change generation_config * change benchmark log file name commit 279300dc5f34db219c90a297c0996d00221eae96 Author: 傅剑寒 <Xs1580802568@gmail.com> Date: Wed Apr 24 14:17:54 2024 +0800 [Inference/Refactor] Refactor compilation mechanism and unified multi hw (#5613) * refactor compilation mechanism and unified multi hw * fix file path bug * add init.py to make pybind a module to avoid relative path error caused by softlink * delete duplicated micros * fix micros bug in gcc commit 04863a9b144fc7dd46a57d2c7b0cf2f4b351ffb6 Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com> Date: Tue Apr 23 22:23:07 2024 +0800 [example] Update Llama Inference example (#5629) * [example] add infernece benchmark llama3 * revise inference config - arg * remove unused 
args * add llama generation demo script * fix init rope in llama policy * add benchmark-llama3 - cleanup commit 12f10d5b0b49a180bc162e166337942e0bbfb96b Author: yuehuayingxueluo <867460659@qq.com> Date: Tue Apr 23 13:44:49 2024 +0800 [Fix/Inference]Fix CUDA Rotary Rmbedding GQA (#5623) * fix rotary embedding GQA * change test_rotary_embdding_unpad.py KH commit 5d4c1fe8f5f7019284f6cbc0ed29506748f63bf1 Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com> Date: Tue Apr 23 13:09:55 2024 +0800 [Fix/Inference] Fix GQA Triton and Support Llama3 (#5624) * [fix] GQA calling of flash decoding triton * fix kv cache alloc shape * fix rotary triton - GQA * fix sequence max length assigning * Sequence max length logic * fix scheduling and spec-dec * skip without import error * fix pytest - skip without ImportError --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> commit ccf72797e3bfafcbfc42870ce24ee484858d4852 Author: Steve Luo <36296769+SunflowerAries@users.noreply.github.com> Date: Fri Apr 19 15:34:53 2024 +0800 feat baichuan2 rmsnorm whose hidden size equals to 5120 (#5611) commit e37ee2fb65fc77c275b816968d91776322fd7695 Author: Runyu Lu <77330637+LRY89757@users.noreply.github.com> Date: Thu Apr 18 16:56:46 2024 +0800 [Feat]Tensor Model Parallel Support For Inference (#5563) * tensor parallel support naive source * [fix]precision, model load and refactor the framework * add tp unit test * docstring * fix do_sample commit be396ad6cc102fa610731291bf28e531a5641c7a Author: Steve Luo <36296769+SunflowerAries@users.noreply.github.com> Date: Thu Apr 18 16:45:07 2024 +0800 [Inference/Kernel] Add Paged Decoding kernel, sequence split within the same thread block (#5531) * feat flash decoding for paged attention * refactor flashdecodingattention * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Co-authored-by: pre-commit-ci[bot] 
<66853113+pre-commit-ci[bot]@users.noreply.github.com> commit 56b222eff8c996a4677a158d4b5d4834a1bc0cfc Author: yuehuayingxueluo <867460659@qq.com> Date: Mon Apr 15 16:53:02 2024 +0800 [inference/model]Adapted to the baichuan2-7B model (#5591) * Adapted to the baichuan2-7B model * modified according to the review comments. * Modified the method of obtaining random weights. * modified according to the review comments. * change mlp layewr 'NOTE' commit d4cb023b62ea8e092783be437cb16d74a1afc6a7 Author: 傅剑寒 <Xs1580802568@gmail.com> Date: Mon Apr 15 10:57:51 2024 +0800 [Inference/Refactor] Delete Duplicated code and refactor vec_copy utils and reduce utils (#5593) * delete duplicated code and refactor vec_copy utils and reduce utils * delete unused header file commit a21912339a2c41627b43fd00e6adba38308a2ea0 Author: 傅剑寒 <Xs1580802568@gmail.com> Date: Thu Apr 11 15:41:36 2024 +0800 refactor csrc (#5582) commit 25928d84961b60264a6dabbddeae32af04a43fa2 Merge: d56c9633 f8598e3e Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com> Date: Wed Apr 10 18:39:27 2024 +0800 [Inference/Spec-Dec] Merge pull request #5565 from hpcaitech/feat/speculative-decoding Add Speculative Decoding and GLIDE Spec-Dec commit f8598e3ec56bbe6bc6dd9fd84a1e0543adbd3073 Author: Yuanheng <jonathan.zhaoyh@gmail.com> Date: Wed Apr 10 11:14:04 2024 +0800 [Fix] Llama Modeling Control with Spec-Dec (#5580) - fix ref before asgmt - fall back to use triton kernels when using spec-dec commit e60d430cf53c9009af4682908d01742147654429 Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com> Date: Sun Apr 7 14:53:30 2024 +0800 [Fix] resolve conflicts of rebasing feat/speculative-decoding (#5557) - resolve conflicts of rebasing feat/speculative-decoding commit e1acb58423c53ece50b72db3bf9b91475d5d3d64 Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com> Date: Wed Apr 3 18:06:23 2024 +0800 [doc] Add inference/speculative-decoding README (#5552) * add README for 
spec-dec * update roadmap commit d85d91435ae25d875bfeb012b1e66cbfce6f6525 Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com> Date: Mon Apr 1 21:54:24 2024 +0800 [Inference/SpecDec] Support GLIDE Drafter Model (#5455) * add glide-llama policy and modeling * update glide modeling, compitable with transformers 4.36.2 * revise glide llama modeling/usage * fix issues of glimpsing large kv * revise the way re-loading params for glide drafter * fix drafter and engine tests * enable convert to glide strict=False * revise glide llama modeling * revise vicuna prompt template * revise drafter and tests * apply usage of glide model in engine commit 912e24b2aaf4acda0e2b9a45a7d4327fbfc8bd39 Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com> Date: Tue Mar 12 17:57:01 2024 +0800 [SpecDec] Fix inputs for speculation and revise past KV trimming (#5449) * fix drafter pastkv and usage of batch bucket commit a37f82629d7b9e3c3a0f430b8dd3ff6f38ddf1d4 Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com> Date: Mon Mar 11 09:51:42 2024 +0800 [Inference/SpecDec] Add Speculative Decoding Implementation (#5423) * fix flash decoding mask during verification * add spec-dec * add test for spec-dec * revise drafter init * remove drafter sampling * retire past kv in drafter * (trivial) rename attrs * (trivial) rename arg * revise how we enable/disable spec-dec commit 5a9b05f7b297bc9ce3479990aeee94891c7f5edf Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com> Date: Wed Feb 28 13:48:17 2024 +0800 [Inference/SpecDec] Add Basic Drafter Model Container (#5405) * [Infer/Fix] Fix Dependency in test - RMSNorm kernel (#5399) fix dependency in pytest * add drafter model container (basic ver) commit d63c469f45bc20115aaf5ba01e62dc67ab47953f Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com> Date: Wed Feb 28 13:47:00 2024 +0800 [Infer] Revise and Adapt Triton Kernels for Spec-Dec (#5401) * [Infer/Fix] Fix 
Dependency in test - RMSNorm kernel (#5399) fix dependency in pytest * resolve conflicts for revising flash-attn * adapt kv cache copy kernel for spec-dec * fix seqlen-n kvcache copy kernel/tests * test kvcache copy - use torch.equal * add assertions * (trivial) comment out commit d56c96334e8a0626696609c3803ba5c73798f073 Merge: 7ebdf48a 7ca1d1c5 Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com> Date: Tue Apr 9 10:09:34 2024 +0800 Sync main to feature/colossal-infer [Sync] Merge feature/colossal-infer with main commit 7ca1d1c5453de3e726bca6334c360045050f94c4 Author: Yuanheng <jonathan.zhaoyh@gmail.com> Date: Mon Apr 8 17:00:55 2024 +0800 remove outdated triton test commit d78817539ea03b7b4bc79e0ef50db33d3e347f24 Author: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Date: Mon Apr 8 08:41:07 2024 +0000 [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci commit ce9401ad52b870012846abcde120f1e87d5da7fe Author: Yuanheng <jonathan.zhaoyh@gmail.com> Date: Mon Apr 8 16:25:12 2024 +0800 remove unused triton kernels commit ed5ebd1735db4541709eebdd37839ad161f542e8 Merge: 7ebdf48a 641b1ee7 Author: Yuanheng <jonathan.zhaoyh@gmail.com> Date: Mon Apr 8 16:21:47 2024 +0800 [Fix] resolve conflicts of merging main commit 7ebdf48ac50ca7bab827ef611551c6c48113b684 Author: 傅剑寒 <Xs1580802568@gmail.com> Date: Mon Apr 8 11:38:05 2024 +0800 add cast and op_functor for cuda build-in types (#5546) commit 4bb5d8923a6e85a0f89a483f15933698635a9f9c Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com> Date: Tue Apr 2 14:16:59 2024 +0800 [Fix/Inference] Remove unused and non-functional functions (#5543) * [fix] remove unused func * rm non-functional partial commit a2878e39f42f509f237f3d3fd0741f53e3feff0e Author: 傅剑寒 <Xs1580802568@gmail.com> Date: Mon Apr 1 15:34:25 2024 +0800 [Inference] Add Reduce Utils (#5537) * add reduce utils * add using to delele namespace prefix commit 
04aca9e55bd91ea4dd8d1231aa66df7848b08f03 Author: yuehuayingxueluo <867460659@qq.com> Date: Mon Apr 1 13:47:14 2024 +0800 [Inference/Kernel]Add get_cos_and_sin Kernel (#5528) * Add get_cos_and_sin kernel * fix code comments * fix code typos * merge common codes of get_cos_and_sin kernel. * Fixed a typo * Changed 'asset allclose' to 'assert equal'. commit 934e31afb22d2a281464aebde074eb2f238fb812 Author: yuehuayingxueluo <867460659@qq.com> Date: Thu Mar 28 10:42:51 2024 +0800 The writing style of tail processing and the logic related to macro definitions have been optimized. (#5519) commit e6496dd37144202c8602dfdd66bb83f297eb5805 Author: 傅剑寒 <Xs1580802568@gmail.com> Date: Tue Mar 26 16:37:14 2024 +0800 [Inference] Optimize request handler of llama (#5512) * optimize request_handler * fix ways of writing commit 6251d68dc9f92c333a8f07ddf94e80ff7462726e Author: Runyu Lu <77330637+LRY89757@users.noreply.github.com> Date: Mon Mar 25 15:24:17 2024 +0800 [fix] PR #5354 (#5501) * [fix] * [fix] * Update config.py docstring * [fix] docstring align * [fix] docstring align * [fix] docstring align commit 1d626233ce8dbf35405cb7d92a5638ee1d830e8f Merge: 87079cff 68e9396b Author: Runyu Lu <77330637+LRY89757@users.noreply.github.com> Date: Mon Mar 25 14:55:59 2024 +0800 Merge pull request #5434 from LRY89757/colossal-infer-cuda-graph [feat] cuda graph support and refactor non-functional api commit 68e9396bc084f03fe9315e9fed93292c0efc7a48 Merge: ff4998c6 87079cff Author: Runyu Lu <runyulu@umich.edu> Date: Mon Mar 25 14:48:28 2024 +0800 [fix] merge conflicts commit 87079cffe8e006d4949aa7ca7cb60e6b813ff701 Author: yuehuayingxueluo <867460659@qq.com> Date: Mon Mar 25 13:40:34 2024 +0800 [Inference]Support FP16/BF16 Flash Attention 2 And Add high_precision Flag To Rotary Embedding (#5461) * Support FP16/BF16 Flash Attention 2 * fix bugs in test_kv_cache_memcpy.py * add context_kv_cache_memcpy_kernel.cu * rm typename MT * add tail process * add high_precision * add high_precision to 
config.py * rm unused code * change the comment for the high_precision parameter * update test_rotary_embdding_unpad.py * fix vector_copy_utils.h * add comment for self.high_precision when using float32 commit ff4998c6f39cbfd6d3d11f038c55cca3c9d3abd0 Author: Runyu Lu <runyulu@umich.edu> Date: Mon Mar 25 12:00:57 2024 +0800 [fix] remove unused comment commit 9fe61b44753083c89a50540daa1e9a3daedeb335 Author: Runyu Lu <runyulu@umich.edu> Date: Mon Mar 25 11:37:58 2024 +0800 [fix] commit 5b017d6324c9881e02a5440e0b1a3156612a8044 Author: Runyu Lu <runyulu@umich.edu> Date: Thu Mar 21 15:55:25 2024 +0800 [fix] commit 606603bb8805c39f6ee01029337ddc614c8d46ef Merge: 4eafe0c8 7ff42cc0 Author: Runyu Lu <runyulu@umich.edu> Date: Thu Mar 21 14:25:22 2024 +0800 Merge branch 'feature/colossal-infer' of https://github.com/hpcaitech/ColossalAI into colossal-infer-cuda-graph commit 4eafe0c8141c120229be3ddce9c5591c1535348a Author: Runyu Lu <runyulu@umich.edu> Date: Thu Mar 21 11:28:42 2024 +0800 [fix] unused option commit 7ff42cc06d007ae78fe091da65cb89c4bb62bc38 Author: 傅剑寒 <Xs1580802568@gmail.com> Date: Tue Mar 19 18:36:40 2024 +0800 add vec_type_trait implementation (#5473) commit b96557b5e15dbb521bf0f77b6b1f24dcbd9464d6 Merge: b6e97858 48c4f29b Author: 傅剑寒 <Xs1580802568@gmail.com> Date: Tue Mar 19 13:53:26 2024 +0800 Merge pull request #5469 from Courtesy-Xs/add_vec_traits Refactor vector utils commit aabc9fb6aada9e7feb2ff8cf1f34e6ac37ade2e7 Author: Runyu Lu <runyulu@umich.edu> Date: Tue Mar 19 13:24:25 2024 +0800 [feat] add use_cuda_kernel option commit 48c4f29b275e2d8105842913cd84f5d66c378b36 Author: xs_courtesy <xs1580802568@gmail.com> Date: Tue Mar 19 11:32:01 2024 +0800 refactor vector utils commit b6e97858856ee8637216c51f14ac544b1bc0f872 Merge: f366a5ea 5724b9e3 Author: 傅剑寒 <Xs1580802568@gmail.com> Date: Fri Mar 15 11:23:44 2024 +0800 Merge pull request #5457 from Courtesy-Xs/ly_add_implementation_for_launch_config add implementatino for GetGPULaunchConfig1D commit 
5724b9e31e13e07d8ade0444c3e2f3e6894d13b1 Author: xs_courtesy <xs1580802568@gmail.com> Date: Fri Mar 15 11:18:57 2024 +0800 add some comments commit 6e30248683c0e4ccc63d15f39f8149875cba1263 Author: Runyu Lu <runyulu@umich.edu> Date: Thu Mar 14 16:13:00 2024 +0800 [fix] tmp for test commit 388e0439301834a1ad0d11da26b23f4cdc6c82d7 Author: xs_courtesy <xs1580802568@gmail.com> Date: Thu Mar 14 11:13:40 2024 +0800 add implementatino for GetGPULaunchConfig1D commit d02e257abd778812d64491dde893c0d691ed4328 Merge: ae24b4f0 f366a5ea Author: Runyu Lu <77330637+LRY89757@users.noreply.github.com> Date: Thu Mar 14 10:37:05 2024 +0800 Merge branch 'feature/colossal-infer' into colossal-infer-cuda-graph commit ae24b4f025285949253a21c41bee4b80679a0bfe Author: Runyu Lu <runyulu@umich.edu> Date: Thu Mar 14 10:35:08 2024 +0800 diverse tests commit 1821a6dab0ad6ad24ae25216e56268c4b0c0d365 Author: Runyu Lu <runyulu@umich.edu> Date: Wed Mar 13 17:28:32 2024 +0800 [fix] pytest and fix dyn grid bug commit f366a5ea1f2626a7870acaf8866f21d5fb49c388 Author: yuehuayingxueluo <867460659@qq.com> Date: Wed Mar 13 17:20:03 2024 +0800 [Inference/kernel]Add Fused Rotary Embedding and KVCache Memcopy CUDA Kernel (#5418) * add rotary embedding kernel * add rotary_embedding_kernel * add fused rotary_emb and kvcache memcopy * add fused_rotary_emb_and_cache_kernel.cu * add fused_rotary_emb_and_memcopy * fix bugs in fused_rotary_emb_and_cache_kernel.cu * fix ci bugs * use vec memcopy and opt the gloabl memory access * fix code style * fix test_rotary_embdding_unpad.py * codes revised based on the review comments * fix bugs about include path * rm inline commit ed431de4e4f73584e6b9c11ab041ef54a8e83de6 Author: Steve Luo <36296769+SunflowerAries@users.noreply.github.com> Date: Wed Mar 13 16:00:55 2024 +0800 fix rmsnorm template function invocation problem(template function partial specialization is not allowed in Cpp) and luckily pass e2e precision test (#5454) commit 6fd355a5a6bb46bfee41d2bc75578e8fba001144 
Merge: b699f540 c1c45e9d Author: 傅剑寒 <Xs1580802568@gmail.com> Date: Wed Mar 13 11:26:41 2024 +0800 Merge pull request #5452 from Courtesy-Xs/fix_include_path fix include path commit c1c45e9d8ecb6743e88e63dd151c617c0014e7c1 Author: xs_courtesy <xs1580802568@gmail.com> Date: Wed Mar 13 11:21:06 2024 +0800 fix include path commit b699f54007c52b2f4ec56326a495b06858cf8856 Author: Steve Luo <36296769+SunflowerAries@users.noreply.github.com> Date: Tue Mar 12 17:48:02 2024 +0800 optimize rmsnorm: add vectorized elementwise op, feat loop unrolling (#5441) commit 368a2aa5433d127adaa3674c6d00bb9dc3e0729c Merge: 21e1e364 095c070a Author: 傅剑寒 <Xs1580802568@gmail.com> Date: Tue Mar 12 14:14:37 2024 +0800 Merge pull request #5445 from Courtesy-Xs/refactor_infer_compilation Refactor colossal-infer code arch commit 095c070a6eefe1a76fe3483b21986826114d6d17 Author: xs_courtesy <xs1580802568@gmail.com> Date: Mon Mar 11 17:06:57 2024 +0800 refactor code commit 21e1e3645c8f2e0d4e556f3e13d0d2aa5053911b Merge: f7aecc0c 5eb5ff14 Author: 傅剑寒 <Xs1580802568@gmail.com> Date: Mon Mar 11 11:15:29 2024 +0800 Merge pull request #5435 from Courtesy-Xs/add_gpu_launch_config Add query and other components commit 633e95b301336c4c237537f584882b3d8e5f4145 Author: Runyu Lu <runyulu@umich.edu> Date: Mon Mar 11 10:56:51 2024 +0800 [doc] add doc commit 9dec66fad6c2f85166903aa80d0c077e37512fce Author: Runyu Lu <runyulu@umich.edu> Date: Mon Mar 11 10:51:16 2024 +0800 [fix] multi graphs capture error commit b2c0d9ff2b4e4015660f2967837688cf7293b21e Author: Runyu Lu <runyulu@umich.edu> Date: Mon Mar 11 10:49:31 2024 +0800 [fix] multi graphs capture error commit f7aecc0c6bac001d10c1dd00274e0152e4c86df6 Author: Steve Luo <36296769+SunflowerAries@users.noreply.github.com> Date: Fri Mar 8 16:21:12 2024 +0800 feat rmsnorm cuda kernel and add unittest, benchmark script (#5417) commit 5eb5ff1464311ac16c29307d03a3c076aced7e03 Author: xs_courtesy <xs1580802568@gmail.com> Date: Fri Mar 8 15:41:14 2024 +0800 refactor code 
commit 01d289d8e51384131d536b1c223c473aeea463e9 Merge: a46598ac 2b28b54a Author: xs_courtesy <xs1580802568@gmail.com> Date: Fri Mar 8 15:04:55 2024 +0800 Merge branch 'feature/colossal-infer' of https://github.com/hpcaitech/ColossalAI into add_gpu_launch_config commit a46598ac5984c7dc5804d0cf8621698f1a6a8720 Author: xs_courtesy <xs1580802568@gmail.com> Date: Fri Mar 8 14:53:29 2024 +0800 add reusable utils for cuda commit 2b28b54ac6d19d33079d9117b9717fd2779f2b08 Merge: 593a72e4 95c21498 Author: 傅剑寒 <Xs1580802568@gmail.com> Date: Fri Mar 8 14:44:37 2024 +0800 Merge pull request #5433 from Courtesy-Xs/add_silu_and_mul 【Inference】Add silu_and_mul for infer commit cefaeb5fdd551c8b95837a475cb810f4991cf674 Author: Runyu Lu <runyulu@umich.edu> Date: Fri Mar 8 14:19:35 2024 +0800 [feat] cuda graph support and refactor non-functional api commit 95c21498d4f6e640e218f4b00349020f4ae7c69a Author: xs_courtesy <xs1580802568@gmail.com> Date: Thu Mar 7 16:57:49 2024 +0800 add silu_and_mul for infer commit 593a72e4d58b8c3feebde2d19c78d44f702f7b06 Merge: 0aa27f19 0310b76e Author: Frank Lee <somerlee.9@gmail.com> Date: Mon Mar 4 10:13:59 2024 +0800 Merge pull request #5424 from FrankLeeeee/sync/main Sync/main commit 0310b76e9d485703d5afc128b8d97d01b00f3317 Merge: 0aa27f19 4b8312c0 Author: FrankLeeeee <somerlee.9@gmail.com> Date: Mon Mar 4 10:09:36 2024 +0800 Merge branch 'main' into sync/main commit 0aa27f196109bfb4ce6171d7ce921052b9eee969 Author: yuehuayingxueluo <867460659@qq.com> Date: Wed Feb 28 16:46:03 2024 +0800 [Inference]Move benchmark-related code to the example directory. (#5408) * move benchmark-related code to the example directory. 
* fix bugs in test_fused_rotary_embedding.py

commit 600881a8ea9b17c436ded922a9d4e3d5969acd87
Author: yuehuayingxueluo <867460659@qq.com>
Date: Wed Feb 28 14:36:50 2024 +0800

[Inference]Add CUDA KVCache Kernel (#5406)

* add cuda KVCache kernel
* annotation benchmark_kvcache_copy
* add use cuda
* fix import path
* move benchmark scripts to example/
* rm benchmark codes in test_kv_cache_memcpy.py
* rm redundancy codes
* rm redundancy codes
* pr was modified according to the review

commit 19061188c396d851ef17bc34b526e2f2b4fc1479
Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com>
Date: Mon Feb 26 16:17:47 2024 +0800

[Infer/Fix] Fix Dependency in test - RMSNorm kernel (#5399)

fix dependency in pytest

commit bc1da87366d81e144f1f133801d5f20520433c52
Author: yuehuayingxueluo <867460659@qq.com>
Date: Fri Feb 23 10:51:35 2024 +0800

[Fix/Inference] Fix format of input prompts and input model in inference engine (#5395)

* Fix bugs in inference_engine
* fix bugs in engine.py
* rm CUDA_VISIBLE_DEVICES
* add request_ids in generate
* fix bug in engine.py
* add logger.debug for BatchBucket

commit 2a718c8be89918ec70b88f1f059148a7294dbccb
Author: yuehuayingxueluo <867460659@qq.com>
Date: Wed Feb 21 13:23:57 2024 +0800

Optimized the execution interval time between cuda kernels caused by view and memcopy (#5390)

* opt_view_and_memcopy
* fix bugs in ci
* fix ci bugs
* update benchmark scripts
* fix ci bugs

commit 730103819dc0636c85af1af80cc17914dcf196c1
Author: Jianghai <72591262+CjhHa1@users.noreply.github.com>
Date: Wed Feb 21 11:31:48 2024 +0800

[Inference]Fused kv copy into rotary calculation (#5383)

* revise rotary embedding
* remove useless print
* adapt
* fix
* add
* fix
* modeling
* fix
* fix
* fix
* fused kv copy
* fused copy
* colossalai/kernel/triton/no_pad_rotary_embedding.py
* del padding llama
* del

commit b21aac5baeddf7ea19615fae454e6f78f7469cd2
Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com>
Date: Mon Feb 19 17:18:20 2024 +0800

[Inference] Optimize and Refactor Inference Batching/Scheduling (#5367)

* add kvcache manager funcs for batching
* add batch bucket for batching
* revise RunningList struct in handler
* add kvcache/batch funcs for compatibility
* use new batching methods
* fix indexing bugs
* revise abort logic
* use cpu seq lengths/block tables
* rm unused attr in Sequence
* fix type conversion/default arg
* add and revise pytests
* revise pytests, rm unused tests
* rm unused statements
* fix pop finished indexing issue
* fix: use index in batch when retrieving inputs/update seqs
* use dict instead of odict in batch struct
* arg type hinting
* fix make compress
* refine comments
* fix: pop_n_seqs to pop the first n seqs
* add check in request handler
* remove redundant conversion
* fix test for request handler
* fix pop method in batch bucket
* fix prefill adding

commit 8c69debdc7128e1b8839f12aa3f19ad327569017
Author: yuehuayingxueluo <867460659@qq.com>
Date: Thu Feb 8 15:27:26 2024 +0800

[Inference]Support vllm testing in benchmark scripts (#5379)

* add vllm benchmark scripts
* fix code style
* update run_benchmark.sh
* fix code style

commit 9afa52061f89dde87a73e36f740f62781d658a01
Author: Frank Lee <somerlee.9@gmail.com>
Date: Thu Feb 8 14:04:14 2024 +0800

[inference] refactored config (#5376)

commit 1f8c7e70469191610d9536029f624b4f30db8caf
Author: Jianghai <72591262+CjhHa1@users.noreply.github.com>
Date: Wed Feb 7 17:55:48 2024 +0800

[Inference] User Experience: update the logic of default tokenizer and generation config. (#5337)

* add
* fix
* fix
* pause
* fix
* fix pytest
* align
* fix
* license
* fix
* fix
* fix readme
* fix some bugs
* remove tokenizer config

commit 6fb4bcbb2420b9f977ab74de60c6d311b6c9ed9a
Author: yuehuayingxueluo <867460659@qq.com>
Date: Wed Feb 7 17:15:42 2024 +0800

[Inference/opt] Fused KVCahce Memcopy (#5374)

* fused kv memcopy
* add TODO in test_kvcache_copy.py

commit 58740b5f6872bc5a26dbf7c3112b86a1b66c083a
Author: Frank Lee <somerlee.9@gmail.com>
Date: Wed Feb 7 17:11:43 2024 +0800

[inference] added inference template (#5375)

commit 8106ede07fae7e239203feb815162efdf46975ec
Author: Frank Lee <somerlee.9@gmail.com>
Date: Wed Feb 7 14:27:04 2024 +0800

Revert "[Inference] Adapt to Fused rotary (#5348)" (#5373)

This reverts commit 9f4ab2eb924b938348df2c713bb4580972f18eb1.

commit 9f4ab2eb924b938348df2c713bb4580972f18eb1
Author: Jianghai <72591262+CjhHa1@users.noreply.github.com>
Date: Wed Feb 7 11:36:04 2024 +0800

[Inference] Adapt to Fused rotary (#5348)

* revise rotary embedding
* remove useless print
* adapt
* fix
* add
* fix
* modeling
* fix
* fix
* fix

commit 35382a7fbf96c731ba1ed76cf5529ea3220a5b66
Author: yuehuayingxueluo <867460659@qq.com>
Date: Tue Feb 6 19:38:25 2024 +0800

[Inference]Fused the gate and up proj in mlp，and optimized the autograd process. (#5365)

* fused the gate and up proj in mlp
* fix code styles
* opt auto_grad
* rollback test_inference_engine.py
* modifications based on the review feedback.
* fix bugs in flash attn
* Change reshape to view
* fix test_rmsnorm_triton.py

commit 1dedb57747270f32be5d0e67abc1ad2fff658f8f
Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com>
Date: Tue Feb 6 17:27:45 2024 +0800

[Fix/Infer] Remove unused deps and revise requirements (#5341)

* remove flash-attn dep
* rm padding llama
* revise infer requirements
* move requirements out of module

commit 631862f3390f874db118a25c0137f86630e9b167
Author: yuehuayingxueluo <867460659@qq.com>
Date: Fri Feb 2 15:38:21 2024 +0800

[Inference]Optimize generation process of inference engine (#5356)

* opt inference engine
* fix run_benchmark.sh
* fix generate in engine.py
* rollback tesh_inference_engine.py

commit 21ad4a27f91659220bec6c4d4f2d0f62f7093a45
Author: yuehuayingxueluo <867460659@qq.com>
Date: Fri Feb 2 15:06:01 2024 +0800

[Inference/opt]Optimize the mid tensor of RMS Norm (#5350)

* opt rms_norm
* fix bugs in rms_layernorm

commit 027aa1043f1c7b3668d5ca9b91d35c846736e9c4
Author: Frank Lee <somerlee.9@gmail.com>
Date: Fri Feb 2 14:31:10 2024 +0800

[doc] updated inference readme (#5343)

commit e76acbb076582e0aade1ee8a5fa7696d95c1bef5
Author: Frank Lee <somerlee.9@gmail.com>
Date: Fri Feb 2 13:51:22 2024 +0800

[inference] moved ops tests to test_infer (#5354)

commit db1a763307a54ca262751ebebd5f1c503d9bca74
Author: Frank Lee <somerlee.9@gmail.com>
Date: Fri Feb 2 11:44:15 2024 +0800

[inference] removed redundancy init_batch (#5353)

commit 249644c23b0402ccf9d0908f13ed15b41b95145f
Author: yuehuayingxueluo <867460659@qq.com>
Date: Thu Feb 1 15:49:39 2024 +0800

[Inference]Repalce Attention layer and MLP layer by shardformer to optimize the weight transpose operation，add fused_qkv and fused linear_add (#5340)

* add fused qkv
* replace attn and mlp by shardformer
* fix bugs in mlp
* add docstrings
* fix test_inference_engine.py
* add optimize unbind
* add fused_addmm
* rm squeeze(1)
* refactor codes
* fix ci bugs
* rename ShardFormerLlamaMLP and ShardFormerLlamaAttention
* Removed the dependency on LlamaFlashAttention2
* rollback test_inference_engine.py

commit f8e456d20295af52665ca06a21f9fd8b468204d7
Author: Frank Lee <somerlee.9@gmail.com>
Date: Thu Feb 1 15:31:01 2024 +0800

[inference] simplified config verification (#5346)

* [inference] simplified config verification
* polish
* polish

commit df0aa49585d2dd19d7397dfbd3b5f136abac609b
Author: Jianghai <72591262+CjhHa1@users.noreply.github.com>
Date: Wed Jan 31 16:31:29 2024 +0800

[Inference] Kernel Fusion, fused copy kv cache into rotary embedding (#5336)

* revise rotary embedding
* remove useless print
* adapt

commit 1336838a9149fb210a956b0ad338197c4ae77821
Merge: 5f98a9d6 c5655199
Author: Frank Lee <somerlee.9@gmail.com>
Date: Wed Jan 31 16:29:26 2024 +0800

Merge pull request #5339 from FrankLeeeee/sync/merge-main

Sync/merge main

commit c56551991379a457fc34df699710ab94132779fc
Merge: 5f98a9d6 71321a07
Author: FrankLeeeee <somerlee.9@gmail.com>
Date: Wed Jan 31 10:41:47 2024 +0800

merge commit

commit 5f98a9d68a0a35031e1c740c19e33b32f4fa8d9c
Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com>
Date: Tue Jan 30 16:06:09 2024 +0800

[Infer] Optimize Blocked KVCache And Kernels Using It (#5325)

* revise shape of kvcache (context attn kernel)
* revise shape of kvcache (flash decoding kernel)
* revise shape of kvcache (kvcache copy) and attn func
* init of kvcache in kvcache manager
* revise llama modeling
* revise block size retrieval
* use torch for rms_norm benchmarking
* revise block size retrieval

commit e8f0642f2841f6aeb6ed0e6695ff9d9ef14f198b
Author: yuehuayingxueluo <867460659@qq.com>
Date: Tue Jan 30 10:31:46 2024 +0800

[Inference]Add Nopadding Llama Modeling (#5327)

* add nopadding llama modeling
* add nopadding_llama.py
* rm unused codes
* fix bugs in test_xine_copy.py
* fix code style

commit c7c104cb7ccc353faa10667853ed210e042f1be8
Author: Jianghai <72591262+CjhHa1@users.noreply.github.com>
Date: Mon Jan 29 16:21:06 2024 +0800

[DOC] Update inference readme (#5280)

* add readme
* add readme
* 1
* update engine
* finish readme
* add readme

commit 1f8a75d470d548bfd4db877e73102b8fad5cdfa9
Author: Jianghai <72591262+CjhHa1@users.noreply.github.com>
Date: Mon Jan 29 10:22:33 2024 +0800

[Inference] Update rms norm kernel, benchmark with vLLM (#5315)

* add
* xi
* del
* del
* fix

commit 7ddd8b37f0f1160e28a2919a2e37f8e8ad199773
Author: Jianghai <72591262+CjhHa1@users.noreply.github.com>
Date: Fri Jan 26 15:02:12 2024 +0800

fix (#5311)

commit 4f28cb43c0c2afbc970b9f0f300e7aa28e39bd2e
Author: yuehuayingxueluo <867460659@qq.com>
Date: Fri Jan 26 14:00:10 2024 +0800

[inference]Optimize the usage of the mid tensors space in flash attn (#5304)

* opt flash attn
* opt tmp tensor
* fix benchmark_llama
* fix code style
* fix None logic for output tensor
* fix adapted to get_xine_cache
* add comment
* fix ci bugs
* fix some codes
* rm duplicated codes
* rm duplicated codes
* fix code style
* add _get_dtype in config.py

commit af8359c430ce3fabb22748870b67b0c6c33f610c
Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com>
Date: Thu Jan 25 10:23:12 2024 +0800

[hotfix] fix boundary check in batch (#5306)

commit c647e00e3c092d3d6219f7686f260f2932a0c27d
Author: Jianghai <72591262+CjhHa1@users.noreply.github.com>
Date: Wed Jan 24 16:20:42 2024 +0800

[Inference]Add fused rotary kernel and get cos cache kernel (#5302)

* add fused rotary and get cos cache func
* staged
* fix bugs
* fix bugs

commit 3da9993b0d03923755c1fcd6279cc4c7b8d00d1e
Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com>
Date: Tue Jan 23 17:16:02 2024 +0800

[Kernel/Fix] Revise flash attention triton kernel API and add benchmark (#5301)

* fix decoding kernel pytest
* revise and add triton context attn benchmark

commit 8e606ecc7e89ffed80537e89a27bb1eb6759f4bc
Author: Jianghai <72591262+CjhHa1@users.noreply.github.com>
Date: Tue Jan 23 12:11:53 2024 +0800

[Inference] Benchmarking rotary embedding and add a fetch function (#5277)

* fix bugs and add a cos/sin cache fetch func
* add docstring
* fix bug
* fix

commit b7853196a0a46558d7c0cac7deac9a36c7a5ba38
Merge: bfff9254 cea9c86e
Author: yuehuayingxueluo <867460659@qq.com>
Date: Mon Jan 22 17:07:14 2024 +0800

Merge pull request #5297 from yuehuayingxueluo/fix_rotary_embedding

[Inference/fix]Add utils.py for Rotary Embedding

commit cea9c86e453e36b4848064312c9a4f0d2de6ea98
Author: yuehuayingxueluo <867460659@qq.com>
Date: Mon Jan 22 16:06:27 2024 +0800

add utils.py

commit bfff9254ac8ca866673746ec47cfd2f87aab2b66
Author: yuehuayingxueluo <867460659@qq.com>
Date: Mon Jan 22 10:55:34 2024 +0800

[inference] Adapted to Rotary Embedding and RMS Norm (#5283)

* adapted to rotary_embedding
* adapted to nopad rms norm
* fix bugs in benchmark
* fix flash_decoding.py

commit 6e487e7d3cf5295ca908fa69c8e03af8980391bf
Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com>
Date: Fri Jan 19 15:47:16 2024 +0800

[kernel/fix] Performance Optimization for Decoding Kernel and Benchmarking (#5274)

* prevent re-creating intermediate tensors
* add singleton class holding intermediate values
* fix triton kernel api
* add benchmark in pytest
* fix kernel api and add benchmark
* revise flash decoding triton kernel in/out shapes
* fix calling of triton kernel in modeling
* fix pytest: extract to util functions

commit 9e2342bde2c0ffe1a8cdd2fe8917254ef0a06e7f
Author: Jianghai <72591262+CjhHa1@users.noreply.github.com>
Date: Thu Jan 18 16:31:14 2024 +0800

[Hotfix] Fix bugs in testing continuous batching (#5270)

* fix bug
* fix bugs
* fix bugs
* fix bugs and add padding
* add funcs and fix bugs
* fix typos
* fix bugs
* add func

commit 5ae9099f9203a4f8350f383b838e8f2ad15d6fdd
Author: Yaozheng Fang <62918515+nkfyz@users.noreply.github.com>
Date: Thu Jan 18 10:21:03 2024 +0800

[kernel] Add RMSLayerNorm triton kernel (#5262)

* add layerrmsnorm triton kernel
* add layerrmsnorm kernel
* modify the atol and rtol in test file
* Remove the logics of mean computations, and update the name of ther kernel functions and files
* add benchmark of rms norm

commit 86b63f720cf60deefe40874517b3d8e1dccb7af3
Author: yuehuayingxueluo <867460659@qq.com>
Date: Wed Jan 17 16:03:10 2024 +0800

[Inference]Adapted to the triton attn kernels (#5264)

* adapted to the triton attn kernels
* fix pad input
* adapted to copy_kv_to_blocked_cache
* fix ci test
* update kv memcpy
* remove print

commit 0f2b46a41c2c308cc6fbeaf0e86d0e0b93435b77
Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com>
Date: Tue Jan 16 14:41:02 2024 +0800

[kernel] Revise KVCache copy triton kernel API (#5273)

* [kernel/fix] revise kvcache copy kernel api
* fix benchmark

commit d8db500efc0e67dea995c2124d20aadd07afb6f0
Author: Jianghai <72591262+CjhHa1@users.noreply.github.com>
Date: Mon Jan 15 17:50:46 2024 +0800

[Inference] Fix request handler and add recycle logic (#5260)

* fix request handler
* fix comment

commit c597678da475abd4ecc075c0b80996989f1bcdc0
Author: Frank Lee <somerlee.9@gmail.com>
Date: Mon Jan 15 17:37:41 2024 +0800

[doc] updated inference readme (#5269)

commit fa85e02b3b1b316009c4557482f998b903730ec3
Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com>
Date: Mon Jan 15 17:37:20 2024 +0800

[kernel] Add KV cache copy kernel during decoding (#5261)

* add kv copy triton kernel during decoding stage
* add pytest and fix kernel
* fix test utilities
* revise kernel config
* add benchmark for kvcache copy

commit 1ded7e81ef08d574798dd98d1f4d33da07b7f4c9
Author: FrankLeeeee <somerlee.9@gmail.com>
Date: Thu Jan 11 13:50:45 2024 +0000

[git] fixed rebased files

commit 1513f20f4d80f782fab381996368ff2c2f3c95c3
Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com>
Date: Thu Jan 11 18:06:39 2024 +0800

[kernel] Add flash decoding triton kernel for blocked kv cache (#5249)

* add flash decoding unpad triton kernel
* rename flash decoding kernel
* add kernel testing (draft)
* revise pytest
* support kv group (GQA)
* (trivial) fix api and pytest
* (trivial) func renaming
* (trivial) func/file renaming
* refactor pytest for attention
* (trivial) format and consistent vars of context/decode attn
* (trivial) remove test redundancy

commit fded91d049997ed87dee965fc42c35a239e3ec03
Author: Jianghai <72591262+CjhHa1@users.noreply.github.com>
Date: Thu Jan 11 16:24:54 2024 +0800

[Inference] Kernel: no pad rotary embedding (#5252)

* fix bugs
* comment
* use more accurate atol
* fix

commit d40eb26029e8c61fc2b8ef3a1b8126a229e48047
Author: yuehuayingxueluo <867460659@qq.com>
Date: Wed Jan 10 10:38:53 2024 +0800

fix bugs in request_handler.py and engine.py

commit 10e3c9f923caf4fb68ab61e96c244bd5cca9b9da
Author: yuehuayingxueluo <867460659@qq.com>
Date: Tue Jan 9 15:53:04 2024 +0800

rm torch.cuda.synchronize

commit fab294c7f4a5db0a4e19109ac5656492ff3ca08b
Author: yuehuayingxueluo <867460659@qq.com>
Date: Tue Jan 9 15:18:28 2024 +0800

fix CI bugs

commit 2a73e828eba565017d19eaf70a304e1b1eddba1f
Author: yuehuayingxueluo <867460659@qq.com>
Date: Tue Jan 9 14:29:45 2024 +0800

fix bugs related to processing padding mask

commit e545a871b8a89093f5d01e3fea1fe873ef52d51a
Author: Jianghai <72591262+CjhHa1@users.noreply.github.com>
Date: Mon Jan 8 15:56:00 2024 +0800

[Hotfix] Fix accuracy and align attention method api with Triton kernel (#5229)

* fix accuracy
* alignment in attention
* fix attention
* fix
* fix bugs
* fix bugs
* fix bugs

commit fa4fbdbffb6996e8aa1f65bddce5844f2bbbfdf1
Author: yuehuayingxueluo <867460659@qq.com>
Date: Tue Jan 9 13:52:53 2024 +0800

adapted to pad_context_forward

commit 47e53eaa1ca08fd55b657b53b75d13cc72f9cd05
Author: yuehuayingxueluo <867460659@qq.com>
Date: Mon Jan 8 12:35:06 2024 +0800

fix bugs in attention.py and request_handler.py

commit bfd9b1b494b4414835b22cbba52005921127e4f6
Author: Jianghai <72591262+CjhHa1@users.noreply.github.com>
Date: Thu Jan 4 16:39:00 2024 +0800

[Inference] Pytorch Attention func, pad&nopad input support (#5219)

* add attn
* add attention test
* fix attn forward
* fix decoding

commit 3ad1f3b78b830c90079ed9f1e0b5cd26601194fa
Author: yuehuayingxueluo <867460659@qq.com>
Date: Thu Jan 4 16:48:53 2024 +0800

fix beam_width

commit b2eb9cd18665317ec7900364ef21a38c3edb9e3f
Author: yuehuayingxueluo <867460659@qq.com>
Date: Thu Jan 4 15:09:06 2024 +0800

Fixed a typo

commit bbfebfb9fc5250c1e4d3a6f008af652f7a0a9ca0
Author: yuehuayingxueluo <867460659@qq.com>
Date: Thu Jan 4 15:03:18 2024 +0800

fix bugs in sampler

commit 02c1bf8b2abef137a653b86b733d66b6dfbcc022
Author: yuehuayingxueluo <867460659@qq.com>
Date: Wed Jan 3 18:50:26 2024 +0800

add context_attention_unpadded

commit 07b5283b6a3899ebe84cbe8c7902d142ffbc4b9c
Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com>
Date: Wed Jan 3 14:41:35 2024 +0800

[kernel] Add triton kernel for context attention (FAv2) without padding (#5192)

* add context attn unpadded triton kernel
* test compatibility
* kv cache copy (testing)
* fix k/v cache copy
* fix kv cache copy and test
* fix boundary of block ptrs
* add support for GQA/MQA and testing
* fix import statement

---------

Co-authored-by: Round Heng <yuanhengzhao@Rounds-MacBook-Pro.local>

commit 4df8876fcad799ace567b2458df5feb3109ee917
Author: yuehuayingxueluo <867460659@qq.com>
Date: Tue Jan 2 18:34:19 2024 +0800

Fixed a writing error

commit 9489dc64d8e01b04c9033c3dcaee83e25afebe42
Author: yuehuayingxueluo <867460659@qq.com>
Date: Tue Jan 2 18:30:11 2024 +0800

precision alignment

commit 62968588d195126adc9b1bdb3adc02f199303ddf
Author: yuehuayingxueluo <867460659@qq.com>
Date: Tue Jan 2 13:02:20 2024 +0800

fix bugs in request_handler

commit 62fd08ee4425e031f8f1c43b25bf1ba5e7e33e8d
Author: yuehuayingxueluo <867460659@qq.com>
Date: Tue Dec 26 21:34:27 2023 +0800

Fixed a bug in the inference frame

commit 86853a37d5243b40d4b229d163494624b8027cd0
Author: yuehuayingxueluo <867460659@qq.com>
Date: Mon Dec 25 14:07:43 2023 +0800

Add padding llama model

commit 0e616462a7f9e8faaa33d1700a2020ceb03ccd34
Author: Jianghai <72591262+CjhHa1@users.noreply.github.com>
Date: Mon Dec 25 12:15:15 2023 +0800

[Inference] add logit processor and request handler (#5166)

* add logit processor and request handler
* add
* add
* add
* fix
* add search tokens and update func
* finish request handler
* add running list test
* fix test
* fix some bug
* add
* add
* fix bugs
* fix some bugs
* fix bug
* fix
* fix
* add copy fun
* del useless attn
* fix request status

---------

Co-authored-by: CjhHa1 <cjh18671720497outlook.com>

commit 8daee26989adad5ae5b152b24d3344db727986fe
Author: yuehuayingxueluo <867460659@qq.com>
Date: Mon Dec 18 10:40:47 2023 +0800

[Inference] Add the logic of the inference engine (#5173)

* add infer_struct and infer_config
* update codes
* change InferConfig
* Add hf_model_config to the engine
* rm _get_hf_model_config
* update codes
* made adjustments according to the feedback from the reviewer.
* update codes
* add ci test for config and struct
* Add the logic of the inference engine
* update engine and test
* Recover cache_manager.py
* add logger
* fix conflict
* update codes
* update codes
* update model and tokenizer
* fix add the logic about shardformer
* change kvcache_manager docstring
* add policy
* fix ci bug in test_kvcache_manager.py
* remove codes related o tokenizer and move model_policy
* fix code style
* add ordered_set to requirements-infer.txt
* Delete extra empty lines
* add ordered_set to requirements-test.txt

commit 93aeacca342ab03732362dbb9096ab1265f4a8b3
Author: Jianghai <72591262+CjhHa1@users.noreply.github.com>
Date: Tue Dec 12 17:22:41 2023 +0800

[Inference]Update inference config and fix test (#5178)

* unify the config setting
* fix test
* fix import
* fix test
* fix
* fix
* add logger
* revise log info

---------

Co-authored-by: CjhHa1 <cjh18671720497outlook.com>

commit 3de2e622995321b042d4a8cffcd61686cda4a58e
Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com>
Date: Mon Dec 11 10:56:18 2023 +0800

[Inference] Add CacheBlock and KV-Cache Manager (#5156)

* [Inference] Add KVCache Manager
* function refactored
* add test for KVCache Manager
* add attr beam width
* Revise alloc func in CacheManager
* Fix docs and pytests
* add tp slicing for head number
* optimize shapes of tensors used as physical cache
* Apply using InferenceConfig on KVCacheManager
* rm duplicate config file
* Optimize cache allocation: use contiguous cache
* Fix config in pytest (and config)

commit fab9b931d9e24c6e8ada8025cf8cf12719c3d2af
Author: yuehuayingxueluo <867460659@qq.com>
Date: Thu Dec 7 14:34:01 2023 +0800

[Inference]Add BatchInferState, Sequence and InferConfig (#5149)

* add infer_struct and infer_config
* update codes
* change InferConfig
* Add hf_model_config to the engine
* rm _get_hf_model_config
* update codes
* made adjustments according to the feedback from the reviewer.
* update codes
* add ci test for config and struct

commit 2bb92243d4151873d75a9d6d9c2275b390e1716a
Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com>
Date: Tue Dec 5 15:12:57 2023 +0800

[Inference/NFC] Clean outdated inference tests and deprecated kernels (#5159)

* [inference/nfc] remove outdated inference tests
* remove outdated kernel tests
* remove deprecated triton kernels
* remove imports from deprecated kernels

commit 56e75eeb063279fbc0fc84e25f267f1ca208e784
Author: Jianghai <72591262+CjhHa1@users.noreply.github.com>
Date: Fri Dec 1 17:31:31 2023 +0800

[Inference] Add readme (roadmap) and fulfill request handler (#5147)

* request handler
* add readme

---------

Co-authored-by: CjhHa1 <cjh18671720497outlook.com>

commit 4cf4682e70f70dea8e0510705d3383de0bf1a4a8
Author: Jianghai <72591262+CjhHa1@users.noreply.github.com>
Date: Fri Dec 1 17:02:44 2023 +0800

[Inference] First PR for rebuild colossal-infer (#5143)

* add engine and scheduler
* add dirs

---------

Co-authored-by: CjhHa1 <cjh18671720497outlook.com>

isky-cd added 2 commits February 8, 2024 14:38

add vllm benchmark scripts (0314c7d)

fix conflict (80c9e7d)

isky-cd marked this pull request as ready for review February 8, 2024 06:55

isky-cd requested a review from a team as a code owner February 8, 2024 06:55