
[Inference] Clean duplicated vector utils #5715

Open · wants to merge 182 commits into base: main

Commits on Jan 11, 2024

  1. [Inference] First PR for rebuild colossal-infer (hpcaitech#5143)

    * add engine and scheduler
    
    * add dirs
    
    ---------
    
    Co-authored-by: CjhHa1 <cjh18671720497@outlook.com>
    CjhHa1 authored and FrankLeeeee committed Jan 11, 2024
    Commit: 4cf4682
  2. [Inference] Add readme (roadmap) and fulfill request handler (hpcaitech#5147)
    
    * request handler
    
    * add readme
    
    ---------
    
    Co-authored-by: CjhHa1 <cjh18671720497@outlook.com>
    CjhHa1 authored and FrankLeeeee committed Jan 11, 2024
    Commit: 56e75ee
  3. [Inference/NFC] Clean outdated inference tests and deprecated kernels (hpcaitech#5159)
    
    * [inference/nfc] remove outdated inference tests
    
    * remove outdated kernel tests
    
    * remove deprecated triton kernels
    
    * remove imports from deprecated kernels
    yuanheng-zhao authored and FrankLeeeee committed Jan 11, 2024
    Commit: 2bb9224
  4. [Inference]Add BatchInferState, Sequence and InferConfig (hpcaitech#5149)
    
    * add infer_struct and infer_config
    
    * update codes
    
    * change InferConfig
    
    * Add hf_model_config to the engine
    
    * rm _get_hf_model_config
    
    * update codes
    
    * made adjustments according to the feedback from the reviewer.
    
    * update codes
    
    * add ci test for config and struct
    yuehuayingxueluo authored and FrankLeeeee committed Jan 11, 2024
    Commit: fab9b93
  5. [Inference] Add CacheBlock and KV-Cache Manager (hpcaitech#5156)

    * [Inference] Add KVCache Manager
    
    * function refactored
    
    * add test for KVCache Manager
    
    * add attr beam width
    
    * Revise alloc func in CacheManager
    
    * Fix docs and pytests
    
    * add tp slicing for head number
    
    * optimize shapes of tensors used as physical cache
    
    * Apply using InferenceConfig on KVCacheManager
    
    * rm duplicate config file
    
    * Optimize cache allocation: use contiguous cache
    
    * Fix config in pytest (and config)
    yuanheng-zhao authored and FrankLeeeee committed Jan 11, 2024
    Commit: 3de2e62
  6. [Inference]Update inference config and fix test (hpcaitech#5178)

    * unify the config setting
    
    * fix test
    
    * fix import
    
    * fix test
    
    * fix
    
    * fix
    
    * add logger
    
    * revise log info
    
    ---------
    
    Co-authored-by: CjhHa1 <cjh18671720497@outlook.com>
    CjhHa1 authored and FrankLeeeee committed Jan 11, 2024
    Commit: 93aeacc
  7. [Inference] Add the logic of the inference engine (hpcaitech#5173)

    * add infer_struct and infer_config
    
    * update codes
    
    * change InferConfig
    
    * Add hf_model_config to the engine
    
    * rm _get_hf_model_config
    
    * update codes
    
    * made adjustments according to the feedback from the reviewer.
    
    * update codes
    
    * add ci test for config and struct
    
    * Add the logic of the inference engine
    
    * update engine and test
    
    * Recover cache_manager.py
    
    * add logger
    
    * fix conflict
    
    * update codes
    
    * update codes
    
    * update model and tokenizer
    
    * fix add the logic about shardformer
    
    * change kvcache_manager docstring
    
    * add policy
    
    * fix ci bug in test_kvcache_manager.py
    
    * remove codes related to tokenizer and move model_policy
    
    * fix code style
    
    * add ordered_set to requirements-infer.txt
    
    * Delete extra empty lines
    
    * add ordered_set to requirements-test.txt
    yuehuayingxueluo authored and FrankLeeeee committed Jan 11, 2024
    Commit: 8daee26
  8. [Inference] add logit processor and request handler (hpcaitech#5166)

    * add logit processor and request handler
    
    * add
    
    * add
    
    * add
    
    * fix
    
    * add search tokens and update func
    
    * finish request handler
    
    * add running list test
    
    * fix test
    
    * fix some bug
    
    * add
    
    * add
    
    * fix bugs
    
    * fix some bugs
    
    * fix bug
    
    * fix
    
    * fix
    
    * add copy fun
    
    * del useless attn
    
    * fix request status
    
    ---------
    
    Co-authored-by: CjhHa1 <cjh18671720497@outlook.com>
    CjhHa1 authored and FrankLeeeee committed Jan 11, 2024
    Commit: 0e61646
  9. Add padding llama model

    yuehuayingxueluo authored and FrankLeeeee committed Jan 11, 2024
    Commit: 86853a3
  10. Commit: 62fd08e
  11. fix bugs in request_handler

    yuehuayingxueluo authored and FrankLeeeee committed Jan 11, 2024
    Commit: 6296858
  12. precision alignment

    yuehuayingxueluo authored and FrankLeeeee committed Jan 11, 2024
    Commit: 9489dc6
  13. Fixed a writing error

    yuehuayingxueluo authored and FrankLeeeee committed Jan 11, 2024
    Commit: 4df8876
  14. [kernel] Add triton kernel for context attention (FAv2) without padding (hpcaitech#5192)
    
    * add context attn unpadded triton kernel
    
    * test compatibility
    
    * kv cache copy (testing)
    
    * fix k/v cache copy
    
    * fix kv cache copy and test
    
    * fix boundary of block ptrs
    
    * add support for GQA/MQA and testing
    
    * fix import statement
    
    ---------
    
    Co-authored-by: Round Heng <yuanhengzhao@Rounds-MacBook-Pro.local>
    2 people authored and FrankLeeeee committed Jan 11, 2024
    Commit: 07b5283
  15. Commit: 02c1bf8
  16. fix bugs in sampler

    yuehuayingxueluo authored and FrankLeeeee committed Jan 11, 2024
    Commit: bbfebfb
  17. Fixed a typo

    yuehuayingxueluo authored and FrankLeeeee committed Jan 11, 2024
    Commit: b2eb9cd
  18. fix beam_width

    yuehuayingxueluo authored and FrankLeeeee committed Jan 11, 2024
    Commit: 3ad1f3b
  19. [Inference] Pytorch Attention func, pad&nopad input support (hpcaitech#5219)
    
    * add attn
    
    * add attention test
    
    * fix attn forward
    
    * fix decoding
    CjhHa1 authored and FrankLeeeee committed Jan 11, 2024
    Commit: bfd9b1b
  20. Commit: 47e53ea
  21. Commit: fa4fbdb
  22. [Hotfix] Fix accuracy and align attention method api with Triton kernel (hpcaitech#5229)
    
    * fix accuracy
    
    * alignment in attention
    
    * fix attention
    
    * fix
    
    * fix bugs
    
    * fix bugs
    
    * fix bugs
    CjhHa1 authored and FrankLeeeee committed Jan 11, 2024
    Commit: e545a87
  23. Commit: 2a73e82
  24. fix CI bugs

    yuehuayingxueluo authored and FrankLeeeee committed Jan 11, 2024
    Commit: fab294c
  25. rm torch.cuda.synchronize

    yuehuayingxueluo authored and FrankLeeeee committed Jan 11, 2024
    Commit: 10e3c9f
  26. Commit: d40eb26
  27. [Inference] Kernel: no pad rotary embedding (hpcaitech#5252)

    * fix bugs
    
    * comment
    
    * use more accurate atol
    
    * fix
    CjhHa1 authored and FrankLeeeee committed Jan 11, 2024
    Commit: fded91d
  28. [kernel] Add flash decoding triton kernel for blocked kv cache (hpcaitech#5249)
    
    * add flash decoding unpad triton kernel
    
    * rename flash decoding kernel
    
    * add kernel testing (draft)
    
    * revise pytest
    
    * support kv group (GQA)
    
    * (trivial) fix api and pytest
    
    * (trivial) func renaming
    
    * (trivial) func/file renaming
    
    * refactor pytest for attention
    
    * (trivial) format and consistent vars of context/decode attn
    
    * (trivial) remove test redundancy
    yuanheng-zhao authored and FrankLeeeee committed Jan 11, 2024
    Commit: 1513f20
  29. [git] fixed rebased files

    FrankLeeeee committed Jan 11, 2024
    Commit: 1ded7e8

Commits on Jan 15, 2024

  1. [kernel] Add KV cache copy kernel during decoding (hpcaitech#5261)

    * add kv copy triton kernel during decoding stage
    
    * add pytest and fix kernel
    
    * fix test utilities
    
    * revise kernel config
    
    * add benchmark for kvcache copy
    yuanheng-zhao committed Jan 15, 2024
    Commit: fa85e02
  2. Commit: c597678
  3. [Inference] Fix request handler and add recycle logic (hpcaitech#5260)

    * fix request handler
    
    * fix comment
    CjhHa1 committed Jan 15, 2024
    Commit: d8db500

Commits on Jan 16, 2024

  1. [kernel] Revise KVCache copy triton kernel API (hpcaitech#5273)

    * [kernel/fix] revise kvcache copy kernel api
    
    * fix benchmark
    yuanheng-zhao committed Jan 16, 2024
    Commit: 0f2b46a

Commits on Jan 17, 2024

  1. [Inference]Adapted to the triton attn kernels (hpcaitech#5264)

    * adapted to the triton attn kernels
    
    * fix pad input
    
    * adapted to copy_kv_to_blocked_cache
    
    * fix ci test
    
    * update kv memcpy
    
    * remove print
    yuehuayingxueluo committed Jan 17, 2024
    Commit: 86b63f7

Commits on Jan 18, 2024

  1. [kernel] Add RMSLayerNorm triton kernel (hpcaitech#5262)

    * add layerrmsnorm triton kernel
    
    * add layerrmsnorm kernel
    
    * modify the atol and rtol in test file
    
    * Remove the logic of mean computations, and update the names of the kernel functions and files
    
    * add benchmark of rms norm
    nkfyz committed Jan 18, 2024
    Commit: 5ae9099
  2. [Hotfix] Fix bugs in testing continuous batching (hpcaitech#5270)

    * fix bug
    
    * fix bugs
    
    * fix bugs
    
    * fix bugs and add padding
    
    * add funcs and fix bugs
    
    * fix typos
    
    * fix bugs
    
    * add func
    CjhHa1 committed Jan 18, 2024
    Commit: 9e2342b

Commits on Jan 19, 2024

  1. [kernel/fix] Performance Optimization for Decoding Kernel and Benchmarking (hpcaitech#5274)
    
    * prevent re-creating intermediate tensors
    
    * add singleton class holding intermediate values
    
    * fix triton kernel api
    
    * add benchmark in pytest
    
    * fix kernel api and add benchmark
    
    * revise flash decoding triton kernel in/out shapes
    
    * fix calling of triton kernel in modeling
    
    * fix pytest: extract to util functions
    yuanheng-zhao committed Jan 19, 2024
    Commit: 6e487e7

Commits on Jan 22, 2024

  1. [inference] Adapted to Rotary Embedding and RMS Norm (hpcaitech#5283)

    * adapted to rotary_embedding
    
    * adapted to nopad rms norm
    
    * fix bugs in benchmark
    
    * fix flash_decoding.py
    yuehuayingxueluo committed Jan 22, 2024
    Commit: bfff925
  2. add utils.py

    yuehuayingxueluo committed Jan 22, 2024
    Commit: cea9c86
  3. Merge pull request hpcaitech#5297 from yuehuayingxueluo/fix_rotary_embedding
    
    [Inference/fix]Add utils.py for Rotary Embedding
    yuehuayingxueluo committed Jan 22, 2024
    Commit: b785319

Commits on Jan 23, 2024

  1. [Inference] Benchmarking rotary embedding and add a fetch function (hpcaitech#5277)
    
    * fix bugs and add a cos/sin cache fetch func
    
    * add docstring
    
    * fix bug
    
    * fix
    CjhHa1 committed Jan 23, 2024
    Commit: 8e606ec
  2. [Kernel/Fix] Revise flash attention triton kernel API and add benchmark (hpcaitech#5301)
    
    * fix decoding kernel pytest
    
    * revise and add triton context attn benchmark
    yuanheng-zhao committed Jan 23, 2024
    Commit: 3da9993

Commits on Jan 24, 2024

  1. [Inference]Add fused rotary kernel and get cos cache kernel (hpcaitech#5302)
    
    * add fused rotary and get cos cache func
    
    * staged
    
    * fix bugs
    
    * fix bugs
    CjhHa1 committed Jan 24, 2024
    Commit: c647e00

Commits on Jan 25, 2024

  1. Commit: af8359c

Commits on Jan 26, 2024

  1. [inference]Optimize the usage of the mid tensors space in flash attn (hpcaitech#5304)
    
    * opt flash attn
    
    * opt tmp tensor
    
    * fix benchmark_llama
    
    * fix code style
    
    * fix None logic for output tensor
    
    * fix adapted to get_xine_cache
    
    * add comment
    
    * fix ci bugs
    
    * fix some codes
    
    * rm duplicated codes
    
    * rm duplicated codes
    
    * fix code style
    
    * add _get_dtype in config.py
    yuehuayingxueluo committed Jan 26, 2024
    Commit: 4f28cb4
  2. fix (hpcaitech#5311)

    CjhHa1 committed Jan 26, 2024
    Commit: 7ddd8b3

Commits on Jan 29, 2024

  1. [Inference] Update rms norm kernel, benchmark with vLLM (hpcaitech#5315)

    * add
    
    * xi
    
    * del
    
    * del
    
    * fix
    CjhHa1 committed Jan 29, 2024
    Commit: 1f8a75d
  2. [DOC] Update inference readme (hpcaitech#5280)

    * add readme
    
    * add readme
    
    * 1
    
    * update engine
    
    * finish readme
    
    * add readme
    CjhHa1 committed Jan 29, 2024
    Commit: c7c104c

Commits on Jan 30, 2024

  1. [Inference]Add Nopadding Llama Modeling (hpcaitech#5327)

    * add nopadding llama modeling
    
    * add nopadding_llama.py
    
    * rm unused codes
    
    * fix bugs in test_xine_copy.py
    
    * fix code style
    yuehuayingxueluo committed Jan 30, 2024
    Commit: e8f0642
  2. [Infer] Optimize Blocked KVCache And Kernels Using It (hpcaitech#5325)

    * revise shape of kvcache (context attn kernel)
    
    * revise shape of kvcache (flash decoding kernel)
    
    * revise shape of kvcache (kvcache copy) and attn func
    
    * init of kvcache in kvcache manager
    
    * revise llama modeling
    
    * revise block size retrieval
    
    * use torch for rms_norm benchmarking
    
    * revise block size retrieval
    yuanheng-zhao committed Jan 30, 2024
    Commit: 5f98a9d

Commits on Jan 31, 2024

  1. merge commit

    FrankLeeeee committed Jan 31, 2024
    Commit: c565519
  2. Commit: 1336838
  3. [Inference] Kernel Fusion, fused copy kv cache into rotary embedding (hpcaitech#5336)
    
    * revise rotary embedding
    
    * remove useless print
    
    * adapt
    CjhHa1 committed Jan 31, 2024
    Commit: df0aa49

Commits on Feb 1, 2024

  1. [inference] simplified config verification (hpcaitech#5346)

    * [inference] simplified config verification
    
    * polish
    
    * polish
    FrankLeeeee committed Feb 1, 2024
    Commit: f8e456d
  2. [Inference]Replace Attention layer and MLP layer by shardformer to optimize the weight transpose operation, add fused_qkv and fused linear_add (hpcaitech#5340)
    
    * add fused qkv
    
    * replace attn and mlp by shardformer
    
    * fix bugs in mlp
    
    * add docstrings
    
    * fix test_inference_engine.py
    
    * add optimize unbind
    
    * add fused_addmm
    
    * rm squeeze(1)
    
    * refactor codes
    
    * fix ci bugs
    
    * rename ShardFormerLlamaMLP and ShardFormerLlamaAttention
    
    * Removed the dependency on LlamaFlashAttention2
    
    * rollback test_inference_engine.py
    yuehuayingxueluo committed Feb 1, 2024
    Commit: 249644c

Commits on Feb 2, 2024

  1. Commit: db1a763
  2. Commit: e76acbb
  3. Commit: 027aa10
  4. [Inference/opt]Optimize the mid tensor of RMS Norm (hpcaitech#5350)

    * opt rms_norm
    
    * fix bugs in rms_layernorm
    yuehuayingxueluo committed Feb 2, 2024
    Commit: 21ad4a2
  5. [Inference]Optimize generation process of inference engine (hpcaitech#5356)
    
    * opt inference engine
    
    * fix run_benchmark.sh
    
    * fix generate in engine.py
    
    * rollback test_inference_engine.py
    yuehuayingxueluo committed Feb 2, 2024
    Commit: 631862f

Commits on Feb 6, 2024

  1. [Fix/Infer] Remove unused deps and revise requirements (hpcaitech#5341)

    * remove flash-attn dep
    
    * rm padding llama
    
    * revise infer requirements
    
    * move requirements out of module
    yuanheng-zhao committed Feb 6, 2024
    Commit: 1dedb57
  2. [Inference]Fused the gate and up proj in mlp, and optimized the autograd process. (hpcaitech#5365)
    
    * fused the gate and up proj in mlp
    
    * fix code styles
    
    * opt auto_grad
    
    * rollback test_inference_engine.py
    
    * modifications based on the review feedback.
    
    * fix bugs in flash attn
    
    * Change reshape to view
    
    * fix test_rmsnorm_triton.py
    yuehuayingxueluo committed Feb 6, 2024
    Commit: 35382a7

Commits on Feb 7, 2024

  1. [Inference] Adapt to Fused rotary (hpcaitech#5348)

    * revise rotary embedding
    
    * remove useless print
    
    * adapt
    
    * fix
    
    * add
    
    * fix
    
    * modeling
    
    * fix
    
    * fix
    
    * fix
    CjhHa1 committed Feb 7, 2024
    Commit: 9f4ab2e
  2. Commit: 8106ede
  3. Commit: 58740b5
  4. [Inference/opt] Fused KVCahce Memcopy (hpcaitech#5374)

    * fused kv memcopy
    
    * add TODO in test_kvcache_copy.py
    yuehuayingxueluo committed Feb 7, 2024
    Commit: 6fb4bcb
  5. [Inference] User Experience: update the logic of default tokenizer and generation config. (hpcaitech#5337)
    
    * add
    
    * fix
    
    * fix
    
    * pause
    
    * fix
    
    * fix pytest
    
    * align
    
    * fix
    
    * license
    
    * fix
    
    * fix
    
    * fix readme
    
    * fix some bugs
    
    * remove tokenizer config
    CjhHa1 committed Feb 7, 2024
    Commit: 1f8c7e7

Commits on Feb 8, 2024

  1. Commit: 9afa520
  2. [Inference]Support vllm testing in benchmark scripts (hpcaitech#5379)

    * add vllm benchmark scripts
    
    * fix code style
    
    * update run_benchmark.sh
    
    * fix code style
    yuehuayingxueluo committed Feb 8, 2024
    Commit: 8c69deb

Commits on Feb 19, 2024

  1. [Inference] Optimize and Refactor Inference Batching/Scheduling (hpcaitech#5367)
    
    * add kvcache manager funcs for batching
    
    * add batch bucket for batching
    
    * revise RunningList struct in handler
    
    * add kvcache/batch funcs for compatibility
    
    * use new batching methods
    
    * fix indexing bugs
    
    * revise abort logic
    
    * use cpu seq lengths/block tables
    
    * rm unused attr in Sequence
    
    * fix type conversion/default arg
    
    * add and revise pytests
    
    * revise pytests, rm unused tests
    
    * rm unused statements
    
    * fix pop finished indexing issue
    
    * fix: use index in batch when retrieving inputs/update seqs
    
    * use dict instead of odict in batch struct
    
    * arg type hinting
    
    * fix make compress
    
    * refine comments
    
    * fix: pop_n_seqs to pop the first n seqs
    
    * add check in request handler
    
    * remove redundant conversion
    
    * fix test for request handler
    
    * fix pop method in batch bucket
    
    * fix prefill adding
    yuanheng-zhao committed Feb 19, 2024
    Commit: b21aac5

Commits on Feb 21, 2024

  1. [Inference]Fused kv copy into rotary calculation (hpcaitech#5383)

    * revise rotary embedding
    
    * remove useless print
    
    * adapt
    
    * fix
    
    * add
    
    * fix
    
    * modeling
    
    * fix
    
    * fix
    
    * fix
    
    * fused kv copy
    
    * fused copy
    
    * colossalai/kernel/triton/no_pad_rotary_embedding.py
    
    * del padding llama
    
    * del
    CjhHa1 committed Feb 21, 2024
    Commit: 7301038
  2. Optimized the execution interval time between cuda kernels caused by view and memcopy (hpcaitech#5390)
    
    * opt_view_and_memcopy
    
    * fix bugs in ci
    
    * fix ci bugs
    
    * update benchmark scripts
    
    * fix ci bugs
    yuehuayingxueluo committed Feb 21, 2024
    Commit: 2a718c8

Commits on Feb 23, 2024

  1. [Fix/Inference] Fix format of input prompts and input model in inference engine (hpcaitech#5395)
    
    * Fix bugs in inference_engine
    
    * fix bugs in engine.py
    
    * rm  CUDA_VISIBLE_DEVICES
    
    * add request_ids in generate
    
    * fix bug in engine.py
    
    * add logger.debug for BatchBucket
    yuehuayingxueluo committed Feb 23, 2024
    Commit: bc1da87

Commits on Feb 26, 2024

  1. Commit: 1906118

Commits on Feb 28, 2024

  1. [Inference]Add CUDA KVCache Kernel (hpcaitech#5406)

    * add cuda KVCache kernel
    
    * annotation benchmark_kvcache_copy
    
    * add use cuda
    
    * fix import path
    
    * move benchmark scripts to example/
    
    * rm benchmark codes in test_kv_cache_memcpy.py
    
    * rm redundancy codes
    
    * rm redundancy codes
    
    * pr was modified according to the review
    yuehuayingxueluo committed Feb 28, 2024
    Commit: 600881a
  2. [Inference]Move benchmark-related code to the example directory. (hpcaitech#5408)
    
    * move benchmark-related code to the example directory.
    
    * fix bugs in test_fused_rotary_embedding.py
    yuehuayingxueluo committed Feb 28, 2024
    Commit: 0aa27f1

Commits on Mar 4, 2024

  1. Commit: 0310b76
  2. Commit: 593a72e

Commits on Mar 7, 2024

  1. Commit: 95c2149

Commits on Mar 8, 2024

  1. Commit: cefaeb5
  2. Merge pull request hpcaitech#5433 from Courtesy-Xs/add_silu_and_mul

    【Inference】Add silu_and_mul for infer
    Courtesy-Xs committed Mar 8, 2024
    Commit: 2b28b54
  3. Commit: a46598a
  4. Commit: 01d289d
  5. refactor code

    Courtesy-Xs committed Mar 8, 2024
    Commit: 5eb5ff1
  6. Commit: f7aecc0

Commits on Mar 11, 2024

  1. Commit: b2c0d9f
  2. Commit: 9dec66f
  3. [doc] add doc

    LRY89757 committed Mar 11, 2024
    Commit: 633e95b
  4. Merge pull request hpcaitech#5435 from Courtesy-Xs/add_gpu_launch_config

    Add query and other components
    Courtesy-Xs committed Mar 11, 2024
    Commit: 21e1e36
  5. refactor code

    Courtesy-Xs committed Mar 11, 2024 (commit 095c070)

Commits on Mar 12, 2024

  1. Merge pull request hpcaitech#5445 from Courtesy-Xs/refactor_infer_compilation

    Refactor colossal-infer code arch
    Courtesy-Xs committed Mar 12, 2024 (commit 368a2aa)
  2. Commit b699f54

Commits on Mar 13, 2024

  1. fix include path

    Courtesy-Xs committed Mar 13, 2024 (commit c1c45e9)
  2. Commit 6fd355a
  3. fix rmsnorm template function invocation problem (template function partial specialization is not allowed in C++) and luckily pass e2e precision test (hpcaitech#5454)

    SunflowerAries committed Mar 13, 2024 (commit ed431de)
  4. [Inference/kernel]Add Fused Rotary Embedding and KVCache Memcopy CUDA Kernel (hpcaitech#5418)
    
    * add rotary embedding kernel
    
    * add rotary_embedding_kernel
    
    * add fused rotary_emb and kvcache memcopy
    
    * add fused_rotary_emb_and_cache_kernel.cu
    
    * add fused_rotary_emb_and_memcopy
    
    * fix bugs in fused_rotary_emb_and_cache_kernel.cu
    
    * fix ci bugs
    
    * use vec memcopy and opt the global memory access
    
    * fix code style
    
    * fix test_rotary_embdding_unpad.py
    
    * codes revised based on the review comments
    
    * fix bugs about include path
    
    * rm inline
    yuehuayingxueluo committed Mar 13, 2024 (commit f366a5e)
  5. Commit 1821a6d

Commits on Mar 14, 2024

  1. diverse tests

    LRY89757 committed Mar 14, 2024 (commit ae24b4f)
  2. Commit d02e257
  3. Commit 388e043
  4. [fix] tmp for test

    LRY89757 committed Mar 14, 2024 (commit 6e30248)

Commits on Mar 15, 2024

  1. add some comments

    Courtesy-Xs committed Mar 15, 2024 (commit 5724b9e)
  2. Merge pull request hpcaitech#5457 from Courtesy-Xs/ly_add_implementation_for_launch_config

    add implementation for GetGPULaunchConfig1D
    Courtesy-Xs committed Mar 15, 2024 (commit b6e9785)
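GetGPULaunchConfig1D presumably maps an element count to a 1-D (grid, block) launch pair via ceiling division. A hedged Python sketch of that standard logic (the function name, default block size, and grid clamp are assumptions, not the actual C++ helper's interface):

```python
def get_gpu_launch_config_1d(numel, threads_per_block=256, max_grid_size=65535):
    """Pick (grid_size, block_size) so grid * block covers `numel` elements.

    Classic ceiling division; the grid is clamped so oversized inputs are
    expected to fall back to a grid-stride loop inside the kernel.
    """
    assert numel > 0 and threads_per_block > 0
    grid = (numel + threads_per_block - 1) // threads_per_block
    return min(grid, max_grid_size), threads_per_block

grid, block = get_gpu_launch_config_1d(1000)  # -> (4, 256)
```

The clamp matters because CUDA limits grid dimensions; kernels written with a grid-stride loop stay correct regardless of how the config is capped.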

Commits on Mar 19, 2024

  1. refactor vector utils

    Courtesy-Xs committed Mar 19, 2024 (commit 48c4f29)
  2. Commit aabc9fb
  3. Merge pull request hpcaitech#5469 from Courtesy-Xs/add_vec_traits

    Refactor vector utils
    Courtesy-Xs committed Mar 19, 2024 (commit b96557b)
  4. Commit 7ff42cc

Commits on Mar 21, 2024

  1. [fix] unused option

    LRY89757 committed Mar 21, 2024 (commit 4eafe0c)
  2. Commit 606603b
  3. [fix]

    LRY89757 committed Mar 21, 2024 (commit 5b017d6)

Commits on Mar 25, 2024

  1. [fix]

    LRY89757 committed Mar 25, 2024 (commit 9fe61b4)
  2. [fix] remove unused comment

    LRY89757 committed Mar 25, 2024 (commit ff4998c)
  3. [Inference]Support FP16/BF16 Flash Attention 2 And Add high_precision Flag To Rotary Embedding (hpcaitech#5461)
    
    * Support FP16/BF16 Flash Attention 2
    
    * fix bugs in test_kv_cache_memcpy.py
    
    * add context_kv_cache_memcpy_kernel.cu
    
    * rm typename MT
    
    * add tail process
    
    * add high_precision
    
    * add high_precision to config.py
    
    * rm unused code
    
    * change the comment for the high_precision parameter
    
    * update test_rotary_embdding_unpad.py
    
    * fix vector_copy_utils.h
    
    * add comment for self.high_precision when using float32
    yuehuayingxueluo committed Mar 25, 2024 (commit 87079cf)
  4. [fix] merge conflicts

    LRY89757 committed Mar 25, 2024 (commit 68e9396)
  5. Merge pull request hpcaitech#5434 from LRY89757/colossal-infer-cuda-graph

    [feat] cuda graph support and refactor non-functional api
    LRY89757 committed Mar 25, 2024 (commit 1d62623)
  6. [fix] PR hpcaitech#5354 (hpcaitech#5501)

    * [fix]
    
    * [fix]
    
    * Update config.py docstring
    
    * [fix] docstring align
    
    * [fix] docstring align
    
    * [fix] docstring align
    LRY89757 committed Mar 25, 2024 (commit 6251d68)

Commits on Mar 26, 2024

  1. [Inference] Optimize request handler of llama (hpcaitech#5512)

    * optimize request_handler
    
    * fix ways of writing
    Courtesy-Xs committed Mar 26, 2024 (commit e6496dd)

Commits on Mar 28, 2024

  1. Commit 934e31a

Commits on Apr 1, 2024

  1. [Inference/Kernel]Add get_cos_and_sin Kernel (hpcaitech#5528)

    * Add get_cos_and_sin kernel
    
    * fix code comments
    
    * fix code typos
    
    * merge common codes of get_cos_and_sin kernel.
    
    * Fixed a typo
    
    * Changed 'asset allclose' to 'assert equal'.
    yuehuayingxueluo committed Apr 1, 2024 (commit 04aca9e)
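A get_cos_and_sin kernel typically gathers per-token cos/sin rotary values for a batch of positions out of a precomputed RoPE cache. A hedged NumPy sketch of that standard scheme (function names and shapes are illustrative assumptions, not the kernel's actual interface):

```python
import numpy as np

def build_rope_cache(max_pos, head_dim, base=10000.0):
    # Standard RoPE frequencies: theta_i = base^(-2i / head_dim)
    inv_freq = base ** (-np.arange(0, head_dim, 2) / head_dim)
    angles = np.outer(np.arange(max_pos), inv_freq)  # (max_pos, head_dim/2)
    return np.cos(angles), np.sin(angles)

def get_cos_and_sin(cos_cache, sin_cache, positions):
    # Gather the cached rows for each token's position in the batch.
    return cos_cache[positions], sin_cache[positions]

cos_c, sin_c = build_rope_cache(max_pos=16, head_dim=8)
cos, sin = get_cos_and_sin(cos_c, sin_c, np.array([0, 3, 7]))
```

Doing the gather in one fused kernel avoids launching a separate index-select per sequence in the batch.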
  2. [Inference] Add Reduce Utils (hpcaitech#5537)

    * add reduce utils
    
    * add using to delete namespace prefix
    Courtesy-Xs committed Apr 1, 2024 (commit a2878e3)

Commits on Apr 2, 2024

  1. [Fix/Inference] Remove unused and non-functional functions (hpcaitech#5543)
    
    * [fix] remove unused func
    
    * rm non-functional partial
    yuanheng-zhao committed Apr 2, 2024 (commit 4bb5d89)

Commits on Apr 8, 2024

  1. Commit 7ebdf48
  2. Commit ed5ebd1
  3. Commit ce9401a
  4. Commit d788175
  5. Commit 7ca1d1c

Commits on Apr 9, 2024

  1. Sync main to feature/colossal-infer

    [Sync] Merge feature/colossal-infer with main
    yuanheng-zhao committed Apr 9, 2024 (commit d56c963)

Commits on Apr 10, 2024

  1. [Infer] Revise and Adapt Triton Kernels for Spec-Dec (hpcaitech#5401)

    * [Infer/Fix] Fix Dependency in test - RMSNorm kernel (hpcaitech#5399)
    
    fix dependency in pytest
    
    * resolve conflicts for revising flash-attn
    
    * adapt kv cache copy kernel for spec-dec
    
    * fix seqlen-n kvcache copy kernel/tests
    
    * test kvcache copy - use torch.equal
    
    * add assertions
    
    * (trivial) comment out
    yuanheng-zhao committed Apr 10, 2024 (commit d63c469)
  2. [Inference/SpecDec] Add Basic Drafter Model Container (hpcaitech#5405)

    * [Infer/Fix] Fix Dependency in test - RMSNorm kernel (hpcaitech#5399)
    
    fix dependency in pytest
    
    * add drafter model container (basic ver)
    yuanheng-zhao committed Apr 10, 2024 (commit 5a9b05f)
  3. [Inference/SpecDec] Add Speculative Decoding Implementation (hpcaitech#5423)
    
    * fix flash decoding mask during verification
    
    * add spec-dec
    
    * add test for spec-dec
    
    * revise drafter init
    
    * remove drafter sampling
    
    * retire past kv in drafter
    
    * (trivial) rename attrs
    
    * (trivial) rename arg
    
    * revise how we enable/disable spec-dec
    yuanheng-zhao committed Apr 10, 2024 (commit a37f826)
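Speculative decoding drafts a few tokens with a cheap model and lets the target model verify them in a single pass, accepting the longest matching prefix. A toy greedy-verification sketch (the real implementation samples from distributions; the callables here are stand-ins for model forward passes, and all names are illustrative):

```python
def speculative_step(draft_next, target_next, prefix, k=4):
    """One spec-dec step with greedy acceptance.

    draft_next / target_next: callables mapping a token sequence to the
    next token id. Returns the tokens accepted this step; at least one
    target token is always emitted, so decoding always makes progress.
    """
    # 1) Draft k tokens autoregressively with the cheap model.
    drafted, seq = [], list(prefix)
    for _ in range(k):
        t = draft_next(seq)
        drafted.append(t)
        seq.append(t)

    # 2) Verify: accept drafted tokens while the target agrees.
    #    (A real engine scores all k positions in one batched pass.)
    accepted, seq = [], list(prefix)
    for t in drafted:
        if target_next(seq) == t:
            accepted.append(t)
            seq.append(t)
        else:
            break

    # 3) On mismatch (or full acceptance) emit one token from the target.
    accepted.append(target_next(seq))
    return accepted

# Toy models: target emits (last + 1) mod 5; draft agrees except after token 2.
target = lambda s: (s[-1] + 1) % 5
draft = lambda s: (s[-1] + 1) % 5 if s[-1] != 2 else 0
out = speculative_step(draft, target, [0], k=4)  # -> [1, 2, 3]
```

Here the draft diverges at the third token, so two drafted tokens are accepted and the target supplies the third, which is exactly the progress a single unbatched target step would have made, amortized over one verification pass.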
  4. [SpecDec] Fix inputs for speculation and revise past KV trimming (hpcaitech#5449)
    
    * fix drafter pastkv and usage of batch bucket
    yuanheng-zhao committed Apr 10, 2024 (commit 912e24b)
  5. [Inference/SpecDec] Support GLIDE Drafter Model (hpcaitech#5455)

    * add glide-llama policy and modeling
    
    * update glide modeling, compatible with transformers 4.36.2
    
    * revise glide llama modeling/usage
    
    * fix issues of glimpsing large kv
    
    * revise the way re-loading params for glide drafter
    
    * fix drafter and engine tests
    
    * enable convert to glide strict=False
    
    * revise glide llama modeling
    
    * revise vicuna prompt template
    
    * revise drafter and tests
    
    * apply usage of glide model in engine
    yuanheng-zhao committed Apr 10, 2024 (commit d85d914)
  6. [doc] Add inference/speculative-decoding README (hpcaitech#5552)

    * add README for spec-dec
    
    * update roadmap
    yuanheng-zhao committed Apr 10, 2024 (commit e1acb58)
  7. [Fix] resolve conflicts of rebasing feat/speculative-decoding (hpcaitech#5557)
    
    - resolve conflicts of rebasing feat/speculative-decoding
    yuanheng-zhao committed Apr 10, 2024 (commit e60d430)
  8. [Fix] Llama Modeling Control with Spec-Dec (hpcaitech#5580)

    - fix ref before asgmt
    - fall back to use triton kernels when using spec-dec
    yuanheng-zhao committed Apr 10, 2024 (commit f8598e3)
  9. [Inference/Spec-Dec] Merge pull request hpcaitech#5565 from hpcaitech/feat/speculative-decoding
    
    Add Speculative Decoding and GLIDE Spec-Dec
    yuanheng-zhao committed Apr 10, 2024 (commit 25928d8)

Commits on Apr 11, 2024

  1. Commit a219123

Commits on Apr 15, 2024

  1. [Inference/Refactor] Delete Duplicated code and refactor vec_copy utils and reduce utils (hpcaitech#5593)
    
    * delete duplicated code and refactor vec_copy utils and reduce utils
    
    * delete unused header file
    Courtesy-Xs committed Apr 15, 2024 (commit d4cb023)
  2. [inference/model]Adapted to the baichuan2-7B model (hpcaitech#5591)

    * Adapted to the baichuan2-7B model
    
    * modified according to the review comments.
    
    * Modified the method of obtaining random weights.
    
    * modified according to the review comments.
    
    * change mlp layer 'NOTE'
    yuehuayingxueluo committed Apr 15, 2024 (commit 56b222e)

Commits on Apr 18, 2024

  1. [Inference/Kernel] Add Paged Decoding kernel, sequence split within the same thread block (hpcaitech#5531)
    
    * feat flash decoding for paged attention
    
    * refactor flashdecodingattention
    
    * [pre-commit.ci] auto fixes from pre-commit.com hooks
    
    for more information, see https://pre-commit.ci
    
    ---------
    
    Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
    SunflowerAries and pre-commit-ci[bot] committed Apr 18, 2024 (commit be396ad)
  2. [Feat]Tensor Model Parallel Support For Inference (hpcaitech#5563)

    * tensor parallel support naive source
    
    * [fix]precision, model load and refactor the framework
    
    * add tp unit test
    
    * docstring
    
    * fix do_sample
    LRY89757 committed Apr 18, 2024 (commit e37ee2f)

Commits on Apr 19, 2024

  1. Commit ccf7279

Commits on Apr 23, 2024

  1. [Fix/Inference] Fix GQA Triton and Support Llama3 (hpcaitech#5624)

    * [fix] GQA calling of flash decoding triton
    
    * fix kv cache alloc shape
    
    * fix rotary triton - GQA
    
    * fix sequence max length assigning
    
    * Sequence max length logic
    
    * fix scheduling and spec-dec
    
    * skip without import error
    
    * fix pytest - skip without ImportError
    
    ---------
    
    Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
    yuanheng-zhao and pre-commit-ci[bot] committed Apr 23, 2024 (commit 5d4c1fe)
  2. [Fix/Inference]Fix CUDA Rotary Embedding GQA (hpcaitech#5623)

    * fix rotary embedding GQA
    
    * change test_rotary_embdding_unpad.py KH
    yuehuayingxueluo committed Apr 23, 2024 (commit 12f10d5)
  3. [example] Update Llama Inference example (hpcaitech#5629)

    * [example] add inference benchmark llama3
    
    * revise inference config - arg
    
    * remove unused args
    
    * add llama generation demo script
    
    * fix init rope in llama policy
    
    * add benchmark-llama3 - cleanup
    yuanheng-zhao committed Apr 23, 2024 (commit 04863a9)

Commits on Apr 24, 2024

  1. [Inference/Refactor] Refactor compilation mechanism and unified multi hw (hpcaitech#5613)
    
    * refactor compilation mechanism and unified multi hw
    
    * fix file path bug
    
    * add init.py to make pybind a module to avoid relative path error caused by softlink
    
    * delete duplicated macros
    
    * fix macros bug in gcc
    Courtesy-Xs committed Apr 24, 2024 (commit 279300d)
  2. [Fix/Inference]Fix vllm benchmark (hpcaitech#5630)

    * Fix bugs about OOM when running vllm-0.4.0
    
    * rm used params
    
    * change generation_config
    
    * change benchmark log file name
    yuehuayingxueluo committed Apr 24, 2024 (commit 90cd522)

Commits on Apr 25, 2024

  1. [Inference/Kernel] Optimize paged attention: Refactor key cache layout (hpcaitech#5643)
    
    * optimize flashdecodingattention: refactor code with different key cache layout(from [num_blocks, num_kv_heads, block_size, head_size] to [num_blocks, num_kv_heads, head_size/x, block_size, x])
    
    * [pre-commit.ci] auto fixes from pre-commit.com hooks
    
    for more information, see https://pre-commit.ci
    
    ---------
    
    Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
    SunflowerAries and pre-commit-ci[bot] committed Apr 25, 2024 (commit a8fd3b0)
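The commit message states the key cache moves from [num_blocks, num_kv_heads, block_size, head_size] to [num_blocks, num_kv_heads, head_size/x, block_size, x], the "x-split" layout that keeps the x values a thread loads contiguous in memory. A NumPy sketch of the permutation and the resulting index mapping (the dimension sizes below are arbitrary example values):

```python
import numpy as np

num_blocks, num_kv_heads, block_size, head_size, x = 2, 4, 16, 64, 8

# Old layout: [num_blocks, num_kv_heads, block_size, head_size]
old = np.arange(num_blocks * num_kv_heads * block_size * head_size,
                dtype=np.float32).reshape(num_blocks, num_kv_heads,
                                          block_size, head_size)

# New layout: [num_blocks, num_kv_heads, head_size // x, block_size, x]
new = (old.reshape(num_blocks, num_kv_heads, block_size, head_size // x, x)
          .transpose(0, 1, 3, 2, 4)
          .copy())

# Same element, addressed two ways: head index d maps to (d // x, d % x).
b, h, s, d = 1, 2, 5, 19
assert old[b, h, s, d] == new[b, h, d // x, s, d % x]
```

With this layout, consecutive tokens of a block sit next to each other for a fixed x-chunk of the head dimension, so the per-thread vectorized loads in the decoding kernel are coalesced.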
  2. Commit f342a93
  3. [Inference]Adapt to baichuan2 13B (hpcaitech#5614)

    * adapt to baichuan2 13B
    
    * adapt to baichuan2 13B
    
    * change BAICHUAN_MODEL_NAME_OR_PATH
    
    * fix test_decoding_attn.py
    
    * Modifications based on review comments.
    
    * change BAICHUAN_MODEL_NAME_OR_PATH
    
    * mv attn mask processes to test flash decoding
    
    * mv get_alibi_slopes baichuan modeling
    
    * fix bugs in test_baichuan.py
    yuehuayingxueluo committed Apr 25, 2024 (commit 3c91e3f)

Commits on Apr 26, 2024

  1. [kernel] Support new KCache Layout - Context Attention Triton Kernel (hpcaitech#5658)
    
    * add context attn triton kernel - new kcache layout
    
    * add benchmark triton
    
    * tiny revise
    
    * trivial - code style, comment
    yuanheng-zhao committed Apr 26, 2024 (commit 5be590b)
  2. Commit 8ccb671

Commits on Apr 30, 2024

  1. Commit 808ee6e
  2. [Inference] Adapt Baichuan2-13B TP (hpcaitech#5659)

    * adapt to baichuan2 13B
    
    * add baichuan2 13B TP
    
    * update baichuan tp logic
    
    * rm unused code
    
    * Fix TP logic
    
    * fix alibi slopes tp logic
    
    * rm nn.Module
    
    * Polished the code.
    
    * change BAICHUAN_MODEL_NAME_OR_PATH
    
    * Modified the logic for loading Baichuan weights.
    
    * fix typos
    yuehuayingxueluo committed Apr 30, 2024 (commit 5f00002)
  3. [Inference/Kernel] refactor kvcache manager and rotary_embedding and kvcache_memcpy operator (hpcaitech#5663)
    
    * refactor kvcache manager and rotary_embedding and kvcache_memcpy operator
    
    * refactor decode_kv_cache_memcpy
    
    * enable alibi in pagedattention
    SunflowerAries committed Apr 30, 2024 (commit 5cd75ce)
  4. Commit ef8e4ff
  5. [inference]Add alibi to flash attn function (hpcaitech#5678)

    * add alibi to flash attn function
    
    * rm redundant modifications
    yuehuayingxueluo committed Apr 30, 2024 (commit f799631)
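ALiBi biases attention scores by slope × distance, with per-head slopes forming a geometric sequence. For a power-of-two head count the usual formula is slope_i = 2^(-8(i+1)/n). A hedged sketch of that standard formula (power-of-two head counts only; the real get_alibi_slopes helper in the Baichuan modeling code also handles other counts, and this name/signature is only illustrative):

```python
def get_alibi_slopes(num_heads: int):
    """ALiBi slopes for a power-of-two head count: a geometric series
    starting at 2^(-8/n), each subsequent head multiplied by the same ratio."""
    assert num_heads & (num_heads - 1) == 0, "sketch assumes power of two"
    start = 2.0 ** (-8.0 / num_heads)
    return [start ** (i + 1) for i in range(num_heads)]

slopes = get_alibi_slopes(8)  # 0.5, 0.25, ..., 2^-8
```

Because the bias is a fixed linear function of token distance, it can be folded into flash-attention kernels as an additive term on the score matrix rather than materialized as a mask.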
  6. Commit 9df016f

Commits on May 3, 2024

  1. [kernel] Support New KCache Layout - Triton Kernel (hpcaitech#5677)

    * kvmemcpy triton for new kcache layout
    
    * revise tests for new kcache layout
    
    * naive triton flash decoding - new kcache layout
    
    * rotary triton kernel - new kcache layout
    
    * remove redundancy - triton decoding
    
    * remove redundancy - triton kvcache copy
    
    * [pre-commit.ci] auto fixes from pre-commit.com hooks
    
    for more information, see https://pre-commit.ci
    
    ---------
    
    Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
    yuanheng-zhao and pre-commit-ci[bot] committed May 3, 2024 (commit 537a3cb)

Commits on May 5, 2024

  1. Commit 56ed09a
  2. Commit 8754aba

Commits on May 6, 2024

  1. Commit 725fbd2
  2. [Sync] Update from main to feature/colossal-infer (Merge pull request hpcaitech#5685)
    
    [Sync] Update from main to feature/colossal-infer
    
    - Merge pull request hpcaitech#5685 from yuanheng-zhao/inference/merge/main
    yuanheng-zhao committed May 6, 2024 (commit db7b305)
  3. Commit 1ace106

Commits on May 7, 2024

  1. [hotfix] Fix KV Heads Number Assignment in KVCacheManager (hpcaitech#5695)
    
    - Fix key value number assignment in KVCacheManager, as well as method of accessing
    yuanheng-zhao committed May 7, 2024 (commit f9afe0a)

Commits on May 8, 2024

  1. [Fix] Fix Inference Example, Tests, and Requirements (hpcaitech#5688)

    * clean requirements
    
    * modify example inference struct
    
    * add test ci scripts
    
    * mark test_infer as submodule
    
    * rm deprecated cls & deps
    
    * import of HAS_FLASH_ATTN
    
    * prune inference tests to be run
    
    * prune triton kernel tests
    
    * increment pytest timeout mins
    
    * revert import path in openmoe
    yuanheng-zhao committed May 8, 2024 (commit 55cc7f3)
  2. Commit 12e7c28
  3. [Inference]Adapt temperature processing logic (hpcaitech#5689)

    * Adapt temperature processing logic
    
    * add ValueError for top_p and top_k
    
    * add GQA Test
    
    * fix except_msg
    yuehuayingxueluo committed May 8, 2024 (commit 9c2fe79)
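Temperature processing scales logits before softmax, and top-k keeps only the k largest logits (the commit also adds ValueError checks for out-of-range sampling parameters). A hedged NumPy sketch of this standard logit pipeline (the function name and exact validation rules are assumptions, not the engine's actual API):

```python
import numpy as np

def process_logits(logits, temperature=1.0, top_k=0):
    """Scale by temperature, then mask everything below the top-k logits."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = logits / temperature
    if top_k > 0:
        kth = np.sort(scaled)[-top_k]  # k-th largest value
        # Values tied with the k-th largest also survive the mask.
        scaled = np.where(scaled < kth, -np.inf, scaled)
    return scaled

out = process_logits(np.array([2.0, 1.0, 0.5, -1.0]), temperature=0.5, top_k=2)
# only the two largest scaled logits remain finite
```

Masking with -inf (rather than deleting entries) keeps the vocabulary axis intact, so a subsequent softmax assigns exactly zero probability to the filtered tokens.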
  4. [Inference] Support the logic related to ignoring EOS token (hpcaitech#5693)
    
    * Adapt temperature processing logic
    
    * add ValueError for top_p and top_k
    
    * add GQA Test
    
    * fix except_msg
    
    * support ignore EOS token
    
    * change variable's name
    
    * fix annotation
    yuehuayingxueluo committed May 8, 2024 (commit d482922)
  5. [Inference] ADD async and sync Api server using FastAPI (hpcaitech#5396)

    * add api server
    
    * fix
    
    * add
    
    * add completion service and fix bug
    
    * add generation config
    
    * revise shardformer
    
    * fix bugs
    
    * add docstrings and fix some bugs
    
    * fix bugs and add choices for prompt template
    CjhHa1 committed May 8, 2024 (commit 69cd7e0)
  6. [Inference] Finish Online Serving Test, add streaming output api, continuous batching test and example (hpcaitech#5432)
    
    * finish online test and add examples
    
    * fix test_contionus_batching
    
    * fix some bugs
    
    * fix bash
    
    * fix
    
    * fix inference
    
    * finish revision
    
    * fix typos
    
    * revision
    CjhHa1 committed May 8, 2024 (commit de378cd)
  7. [Online Server] Chat Api for streaming and not streaming response (hpcaitech#5470)
    
    * fix bugs
    
    * fix bugs
    
    * fix api server
    
    * fix api server
    
    * add chat api and test
    
    * del request.n
    CjhHa1 committed May 8, 2024 (commit c064032)
  8. [Inference] resolve rebase conflicts

    fix
    CjhHa1 committed May 8, 2024 (commit 7bbb28e)
  9. [Inference] Fix bugs and docs for feat/online-server (hpcaitech#5598)

    * fix test bugs
    
    * add do sample test
    
    * del useless lines
    
    * fix comments
    
    * fix tests
    
    * delete version tag
    
    * delete version tag
    
    * add
    
    * del test sever
    
    * fix test
    
    * fix
    
    * Revert "add"
    
    This reverts commit b9305fb.
    CjhHa1 committed May 8, 2024 (commit 61a1b2e)
  10. Commit bc9063a

Commits on May 9, 2024

  1. Commit 5d9a494
  2. Merge pull request hpcaitech#5588 from hpcaitech/feat/online-serving

    [Feature]Online Serving
    CjhHa1 committed May 9, 2024 (commit 492520d)
  3. [Inference/Feat] Add quant kvcache interface (hpcaitech#5700)

    * add quant kvcache interface
    
    * delete unused output
    
    * complete args comments
    Courtesy-Xs committed May 9, 2024 (commit bfad393)

Commits on May 10, 2024

  1. [Inference/Feat] Add convert_fp8 op for fp8 test in the future (hpcaitech#5706)
    
    * add convert_fp8 op for fp8 test in the future
    
    * rerun ci
    Courtesy-Xs committed May 10, 2024 (commit 50104ab)

Commits on May 11, 2024

  1. [Inference]Adapt repetition_penalty and no_repeat_ngram_size (hpcaitech#5708)
    
    * Adapt repetition_penalty and no_repeat_ngram_size
    
    * fix no_repeat_ngram_size_logit_process
    
    * remove batch_updated
    
    * fix annotation
    
    * modified codes based on the review feedback.
    
    * rm get_batch_token_ids
    yuehuayingxueluo committed May 11, 2024 (commit de4bf3d)
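The repetition_penalty being adapted here is the Hugging Face-style rule: for every token already generated, divide its logit by the penalty if positive, multiply if negative, so seen tokens always become less likely. A hedged sketch (helper name and calling convention are assumptions; the no_repeat_ngram_size processor from the same commit is not shown):

```python
import numpy as np

def apply_repetition_penalty(logits, prev_token_ids, penalty=1.2):
    """HF-style repetition penalty: previously seen tokens get their logit
    divided by `penalty` when positive, multiplied by it when negative."""
    out = logits.copy()
    for tok in set(prev_token_ids):
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out

logits = np.array([2.0, -1.0, 0.5])
out = apply_repetition_penalty(logits, [0, 1], penalty=2.0)
# token 0: 2.0 -> 1.0 ; token 1: -1.0 -> -2.0 ; token 2 untouched
```

The divide/multiply asymmetry is what makes the rule sign-safe: a plain division would make negative logits *more* probable.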

Commits on May 14, 2024

  1. [Feat]Inference RPC Server Support (hpcaitech#5705)

    * rpc support source
    * kv cache logical/physical disaggregation
    * sampler refactor
    * colossalai launch built in
    * Unitest
    * Rpyc support
    
    ---------
    
    Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
    LRY89757 and pre-commit-ci[bot] committed May 14, 2024 (commit 18d67d0)
  2. delete copy_vector

    Courtesy-Xs committed May 14, 2024 (commit 30ea54f)