[Layer] Better method to reinterpret KV cache (#397)
* [Common] Add sequenceMeta, sequenceGroup and sequencePool. (#343)

* merge batchSize and seqLen into one in TokenEmbedding

* merge batchSize and seqLen into one in TokenEmbedding (#350)

* [Common] Move Matrix into xft namespace. (#351)

* remove unused function in DecoderLayer

* [Layer] Remove unused functions in Decoder layer (#353)

* fix compile error of embeddingForward

* [Model] Fix compile error of embeddingForward in YaRNLlama (#358)

* [Common] Add sampling params into group seq. (#356)

* remove DecoderContext in computeSoftmax

* [Util] Remove DecoderContext in computeSoftmax (#362)

* [Common] Refactor sequence.h. (#363)

* [kernels] refactor flash attention for continuous batching (#361)

* [models] Add attnMeta for continuous batching (#364)

* [Layers] fix build error (#365)

* [Model] add interface for seq meta. (#366)

* refactor resize function in DecoderContext to support CB, and qkScores member removed

* [Common] Modify resize() in DecoderContext to support CB (#367)

* add some code to CommonDecoder::forward()

* SequenceMeta refactor

* [Model] New CommonDecoder::forward impl. skeleton (#369)

* new KVCacheMgr supporting CB

* fix typo & set default prefixId to -1 in addSequence()

* [Common] New KVCacheMgr to support CB (#371)

* [Sampling] Add repetition penalty for new seq type. (#373)

* New forward to support CB (CommonDecoder->DecoderBlock->DecoderLayer->Attention/MLP)

* add todo

* [Sampling] Add greedy search for cb path. (#376)

* logic issue fix

* code fix to make new forward work

* add maxSeqLen limitation

* cross attention impl. for CB

* DecoderContext::resize fix

* correct the output of the new forward

* add cb_check

* fix incorrect buffer size calculation

* 2 sequences -> 3 sequences

* better method to prepare KV cache

---------

Co-authored-by: Changqing Li <changqing.li@intel.com>
Co-authored-by: Duyi-Wang <duyi.wang@intel.com>
Co-authored-by: Meng,Chen <chen.meng@intel.com>
4 people committed May 13, 2024
1 parent 999cf3b commit 9cb8ec4
Showing 1 changed file with 4 additions and 13 deletions.
17 changes: 4 additions & 13 deletions src/layers/decoder_block.h
@@ -91,19 +91,10 @@ class DecoderBlock {
             std::vector<void *> keyCaches = kvCacheMgr.getKey(i);
             std::vector<void *> valueCaches = kvCacheMgr.getValue(i);
 
-            std::vector<KVCacheTensor<KVCacheT> *> keyCachesVec(keyCaches.size());
-            std::vector<KVCacheTensor<KVCacheT> *> valueCachesVec(valueCaches.size());
-
-            // TODO: better method?
-            for (int j = 0; j < keyCaches.size(); ++j) {
-                keyCachesVec[j] = static_cast<KVCacheTensor<KVCacheT> *>(keyCaches[j]);
-            }
-
-            for (int j = 0; j < valueCaches.size(); ++j) {
-                valueCachesVec[j] = static_cast<KVCacheTensor<KVCacheT> *>(valueCaches[j]);
-            }
-
-            this->decoders[i]->forwardAttention(ctx, seqs, input, attnOut, totInSeqLen, keyCachesVec, valueCachesVec);
+            // Reinterpret the keyCaches and valueCaches to the correct type
+            this->decoders[i]->forwardAttention(ctx, seqs, input, attnOut, totInSeqLen,
+                    *reinterpret_cast<std::vector<KVCacheTensor<KVCacheT> *> *>(&keyCaches),
+                    *reinterpret_cast<std::vector<KVCacheTensor<KVCacheT> *> *>(&valueCaches));
 
             // Merge the result of attention
             // When attention and FFN/MLP are in parallel, do not need to reduce after attention
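For readers skimming the diff: the old code built a fresh std::vector<KVCacheTensor<KVCacheT> *> for every layer and cast each void * element one by one, while the new code reinterprets the existing vectors in place. Below is a minimal, self-contained sketch of that trade-off; Tensor, useCaches and the erased vector are hypothetical stand-ins for illustration, not names from the repository.

#include <cstdio>
#include <vector>

struct Tensor {
    int id;
};

// Consumer that wants typed pointers, loosely mirroring forwardAttention's parameters.
static void useCaches(const std::vector<Tensor *> &caches) {
    for (const Tensor *t : caches)
        std::printf("tensor %d\n", t->id);
}

int main() {
    Tensor a {0}, b {1};
    // The cache manager hands out type-erased pointers, like kvCacheMgr.getKey(i).
    std::vector<void *> erased {&a, &b};

    // Old method: allocate a second vector and cast element by element.
    std::vector<Tensor *> typed(erased.size());
    for (size_t j = 0; j < erased.size(); ++j)
        typed[j] = static_cast<Tensor *>(erased[j]);
    useCaches(typed);

    // New method: reinterpret the existing vector in place, skipping the extra
    // allocation and copy. This relies on std::vector<void *> and
    // std::vector<Tensor *> having the same layout (both hold plain object
    // pointers); the standard does not guarantee it, but it holds on the
    // implementations this kind of code targets.
    useCaches(*reinterpret_cast<std::vector<Tensor *> *>(&erased));
    return 0;
}

The point of the change is the second call: the per-layer vector allocation and cast loop disappear from the hot path, at the cost of leaning on the two vector specializations sharing a representation.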
