[Layer] Better method to reinterpret KV cache (#397)
* [Common] Add sequenceMeta, sequenceGroup and sequencePool. (#343)

* merge batchSize and seqLen into one in TokenEmbedding

* merge batchSize and seqLen into one in TokenEmbedding (#350)

* [Common] Move Matrix into xft namespace. (#351)

* remove unused function in DecoderLayer

* [Layer] Remove unused functions in Decoder layer (#353)

* fix compile error of embeddingForward

* [Model] Fix compile error of embeddingForward in YaRNLlama (#358)

* [Common] Add sampling params into group seq. (#356)

* remove DecoderContext in computeSoftmax

* [Util] Remove DecoderContext in computeSoftmax (#362)

* [Common] Refactor sequence.h. (#363)

* [kernels] refactor flash attention for continuous batching (#361)

* [models] Add attnMeta for continuous batching (#364)

* [Layers] fix build error (#365)

* [Model] add interface for seq meta. (#366)

* refactor resize function in DecoderContext to support CB, and qkScores member removed

* [Common] Modify resize() in DecoderContext to support CB (#367)

* add some code to CommonDecoder::forward()

* SequenceMeta refactor

* [Model] New CommonDecoder::forward impl. skeleton (#369)

* new KVCacheMgr supporting CB

* fix typo & set default prefixId to -1 in addSequence()

* [Common] New KVCacheMgr to support CB (#371)

* [Sampling] Add repetition penalty for new seq type. (#373)

* New forward to support CB (CommonDecoder->DecoderBlock->DecoderLayer->Attention/MLP)

* add todo

* [Sampling] Add greedy search for cb path. (#376)

* logic issue fix

* code fix to make new forward work

* add maxSeqLen limitation

* cross attention impl. for CB

* DecoderContext::resize fix

* correct the output of the new forward

* add cb_check

* fix incorrect buffer size calculation

* 2 sequences -> 3 sequences

* better method to prepare KV cache

---------

Co-authored-by: Changqing Li <changqing.li@intel.com>
Co-authored-by: Duyi-Wang <duyi.wang@intel.com>
Co-authored-by: Meng,Chen <chen.meng@intel.com>
4 people committed May 13, 2024
1 parent 999cf3b commit 9cb8ec4
Showing 1 changed file with 4 additions and 13 deletions.
17 changes: 4 additions & 13 deletions src/layers/decoder_block.h
@@ -91,19 +91,10 @@ class DecoderBlock {
             std::vector<void *> keyCaches = kvCacheMgr.getKey(i);
             std::vector<void *> valueCaches = kvCacheMgr.getValue(i);
 
-            std::vector<KVCacheTensor<KVCacheT> *> keyCachesVec(keyCaches.size());
-            std::vector<KVCacheTensor<KVCacheT> *> valueCachesVec(valueCaches.size());
-
-            // TODO: better method?
-            for (int j = 0; j < keyCaches.size(); ++j) {
-                keyCachesVec[j] = static_cast<KVCacheTensor<KVCacheT> *>(keyCaches[j]);
-            }
-
-            for (int j = 0; j < valueCaches.size(); ++j) {
-                valueCachesVec[j] = static_cast<KVCacheTensor<KVCacheT> *>(valueCaches[j]);
-            }
-
-            this->decoders[i]->forwardAttention(ctx, seqs, input, attnOut, totInSeqLen, keyCachesVec, valueCachesVec);
+            // Reinterpret the keyCaches and valueCaches to the correct type
+            this->decoders[i]->forwardAttention(ctx, seqs, input, attnOut, totInSeqLen,
+                    *reinterpret_cast<std::vector<KVCacheTensor<KVCacheT> *> *>(&keyCaches),
+                    *reinterpret_cast<std::vector<KVCacheTensor<KVCacheT> *> *>(&valueCaches));
 
             // Merge the result of attention
             // When attention and FFN/MLP are in parallel, do not need to reduce after attention
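For readers skimming the diff: the old code built a fresh std::vector<KVCacheTensor<KVCacheT> *> for every layer and cast each void * element one by one, while the new code reinterprets the existing vectors in place. Below is a minimal, self-contained sketch of that trade-off; Tensor, useCaches and the erased vector are hypothetical stand-ins for illustration, not names from the repository.

#include <cstdio>
#include <vector>

struct Tensor {
    int id;
};

// Consumer that wants typed pointers, loosely mirroring forwardAttention's parameters.
static void useCaches(const std::vector<Tensor *> &caches) {
    for (const Tensor *t : caches)
        std::printf("tensor %d\n", t->id);
}

int main() {
    Tensor a {0}, b {1};
    // The cache manager hands out type-erased pointers, like kvCacheMgr.getKey(i).
    std::vector<void *> erased {&a, &b};

    // Old method: allocate a second vector and cast element by element.
    std::vector<Tensor *> typed(erased.size());
    for (size_t j = 0; j < erased.size(); ++j)
        typed[j] = static_cast<Tensor *>(erased[j]);
    useCaches(typed);

    // New method: reinterpret the existing vector in place, skipping the extra
    // allocation and copy. This relies on std::vector<void *> and
    // std::vector<Tensor *> having the same layout (both hold plain object
    // pointers); the standard does not guarantee it, but it holds on the
    // implementations this kind of code targets.
    useCaches(*reinterpret_cast<std::vector<Tensor *> *>(&erased));
    return 0;
}

The point of the change is the second call: the per-layer vector allocation and cast loop disappear from the hot path, at the cost of leaning on the two vector specializations sharing a representation.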
