After reading the benchmark code above, I have two questions:
Why does the benchmark not perform write operations for newly generated tokens?
Why does the benchmark simulate reading the KV cache only once when generating multiple tokens in a batch?
Or am I overthinking it? Is it simply that no actual model works this way, and this implementation just keeps the code as simple as possible so that it can put maximum pressure on storage and test its performance?