# Fixed-Size KV Cache Example with LLaMA 3.2 1B Instruct

This notebook demonstrates how to use the Fixed-Size KV Cache with a LLaMA model to process long contexts efficiently.

In [1]:
import warnings
import torch
import os

from transformers import LlamaForCausalLM, AutoTokenizer, TextStreamer
from transformers.models.llama.modeling_llama import LlamaAttention
import transformers
from fixed_size_kv_cache import FixedSizeDynamicCache, CacheConfig
from fixed_size_kv_cache.llama_attention import forward
import logging

# Setup logging
logging.basicConfig(level=logging.INFO)



## Configure Fixed-Size KV Cache

Set up configuration parameters for the cache. You can configure via environment variables or directly with the config object.

In [2]:
# Configure Fixed-Size KV Cache with environment variables
# Basic configuration
os.environ["FSDC_KV_CACHE_SIZE"] = "1024"  # Maximum cache size
os.environ["FSDC_INCLUDE_SKIPPED"] = "false"  # Do not include a summary of skipped tokens
os.environ["FSDC_FREE_MEMORY"] = "false"  # Don't free memory after truncation

# Advanced features
os.environ["FSDC_TRUNCATION_STRATEGY"] = "attention"  # Truncation strategy: "attention" or "hybrid" (attention + recency)
os.environ["FSDC_HYBRID_SPLIT_RATIO"] = "0.9"  # Split for the hybrid strategy: 90% attention-based, 10% recency-based
os.environ["FSDC_OFFLOAD_TO_CPU"] = "true"  # Enable CPU offloading
os.environ["FSDC_OFFLOAD_SIZE"] = "4096"  # Maximum size of CPU cache
os.environ["FSDC_AUTO_RESTORE"] = "true"  # Enable auto-restore of leftover tokens
os.environ["FSDC_TOKEN_IMPORTANCE_WINDOW"] = "150"  # Window size for token importance
os.environ["FSDC_IMPORTANCE_THRESHOLD"] = "0.15"  # Threshold for token importance

# Performance optimizations
os.environ["FSDC_QUANTIZE"] = "true"  # Enable quantization for CPU offloading
os.environ["FSDC_QUANTIZATION_BITS"] = "8"  # Use 8-bit quantization
os.environ["FSDC_USE_MMAP"] = "true"  # Use memory mapping for CPU offloading
os.environ["FSDC_PARALLEL_PROCESSING"] = "true"  # Enable parallel processing
os.environ["FSDC_ADAPTIVE_PRECISION"] = "true"  # Use adaptive precision

# Set up LLaMA model to use FixedSizeDynamicCache
LlamaAttention.forward = forward
transformers.models.llama.modeling_llama.DynamicCache = FixedSizeDynamicCache
LlamaForCausalLM.__init__.__globals__['DynamicCache'] = FixedSizeDynamicCache

# Suppress warnings
warnings.filterwarnings("ignore")
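
Environment variables are always strings, so values such as `"1024"`, `"0.9"`, or `"true"` have to be converted to typed settings before they can be used. The sketch below shows one way such parsing could be done. The helpers `read_bool`, `read_int`, and `read_float`, and the case-insensitive treatment of boolean values, are assumptions made for illustration; they are not the `fixed_size_kv_cache` implementation.

```python
import os
from typing import Mapping


def read_bool(name: str, default: bool = False, env: Mapping[str, str] = os.environ) -> bool:
    """Interpret a variable as a boolean, ignoring case (hypothetical helper)."""
    raw = env.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def read_int(name: str, default: int, env: Mapping[str, str] = os.environ) -> int:
    """Interpret a variable as an integer (hypothetical helper)."""
    raw = env.get(name)
    return default if raw is None else int(raw)


def read_float(name: str, default: float, env: Mapping[str, str] = os.environ) -> float:
    """Interpret a variable as a float (hypothetical helper)."""
    raw = env.get(name)
    return default if raw is None else float(raw)


# Example with an explicit mapping instead of the real process environment
example_env = {
    "FSDC_KV_CACHE_SIZE": "1024",
    "FSDC_AUTO_RESTORE": "True",
    "FSDC_HYBRID_SPLIT_RATIO": "0.9",
}
print(read_int("FSDC_KV_CACHE_SIZE", 512, example_env))         # 1024
print(read_bool("FSDC_AUTO_RESTORE", False, example_env))       # True
print(read_float("FSDC_HYBRID_SPLIT_RATIO", 0.5, example_env))  # 0.9
```

Passing the mapping explicitly keeps the helpers easy to test; by default they read from `os.environ`.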

## Alternative: Configure with Config Object

You can also configure the Fixed-Size KV Cache using a config object directly. This is useful when you want to set the configuration in code rather than through environment variables.

In [3]:
# # Alternative: Configure with config object
# config = CacheConfig(
#     kv_cache_size=1024,
#     include_skipped=True,
#     free_memory=False,
#     truncation_strategy="hybrid",
#     hybrid_split_ratio=0.7,
#     offload_to_cpu=True,
#     offload_size=4096,
#     quantize=True,
#     quantization_bits=8,
#     use_mmap=True,
#     parallel_processing=True,
#     adaptive_precision=True
# )

# # You can create a custom cache instance with this config and inject it
# # However, in this example, we rely on environment variables
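
To make the `hybrid` strategy and its split ratio more concrete, here is a minimal sketch of how a cache budget could be divided between high-attention tokens and the most recent tokens. The function `select_tokens_to_keep` and its exact selection rule are assumptions for illustration only; the library's own truncation logic may differ.

```python
from typing import List, Sequence


def select_tokens_to_keep(
    importance: Sequence[float],
    cache_size: int,
    hybrid_split_ratio: float,
) -> List[int]:
    """Return the sorted indices of tokens that stay in the cache (hypothetical)."""
    n = len(importance)
    if n <= cache_size:
        return list(range(n))  # Everything fits, nothing to evict

    # Split the budget: one share for high-attention tokens, the rest for recency
    attention_budget = int(cache_size * hybrid_split_ratio)
    recency_budget = cache_size - attention_budget

    # The most recent tokens are always kept
    recent = set(range(n - recency_budget, n))

    # Among the older tokens, keep those with the highest importance score
    older = [i for i in range(n) if i not in recent]
    older.sort(key=lambda i: importance[i], reverse=True)
    top_attention = older[:attention_budget]

    return sorted(recent.union(top_attention))


scores = [0.9, 0.1, 0.8, 0.2, 0.05, 0.3, 0.4, 0.6]
print(select_tokens_to_keep(scores, cache_size=4, hybrid_split_ratio=0.5))
# [0, 2, 6, 7]
```

With a ratio of 0.5 and a cache of 4, two slots go to the highest-scoring older tokens (0 and 2) and two to the newest tokens (6 and 7). A ratio of 1.0 in this sketch corresponds to purely attention-based selection.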

## Load Model

Load the LLaMA model and tokenizer. For this example, we use LLaMA 3.2 1B Instruct.

In [4]:
# Use any LLaMA model you have access to
model_name = 'meditsolutions/Llama-3.2-SUN-HDIC-1B-Instruct'  # Replace with your model

# Load model and tokenizer.
# Eager attention is required because other attention implementations
# do not return attention weights.
model = LlamaForCausalLM.from_pretrained(
    model_name,
    cache_dir=os.getenv('HF_CACHE', None),
    attn_implementation="eager",
)
tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    cache_dir=os.getenv('HF_CACHE', None)
)
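
Attention weights matter here because attention-based truncation needs a per-token importance signal. The pure-Python sketch below shows one plausible way to derive such a signal: average the attention each key token receives from the last `window` queries. The function names, the averaging rule, and the link to `FSDC_TOKEN_IMPORTANCE_WINDOW` are assumptions for illustration, and causal masking is left out for brevity.

```python
import math
from typing import List


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of attention scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def token_importance(scores: List[List[float]], window: int) -> List[float]:
    """Average attention each key receives from the last `window` queries.

    `scores[q][k]` is the raw (pre-softmax) score of query q against key k.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    weights = [softmax(row) for row in scores]
    recent = weights[-window:]  # Only the most recent queries vote
    num_keys = len(scores[0])
    return [sum(row[k] for row in recent) / len(recent) for k in range(num_keys)]


# Two queries over three keys; both queries favor the last key
raw_scores = [[0.0, 0.0, 2.0], [0.0, 1.0, 3.0]]
importance = token_importance(raw_scores, window=2)
print(max(range(len(importance)), key=lambda k: importance[k]))  # 2
```

Because every softmax row sums to 1, the importance values also sum to 1, which makes them easy to compare against a fixed threshold.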

## Create a long input

We'll create a long input to demonstrate the Fixed-Size KV Cache in action. This is a single-turn chat message that asks the model to find and compute a short arithmetic expression (`2+6=?`) hidden inside a long article on attention mechanisms.

In [5]:
test = [{"role": "user", "content": """Calculate the math formula hidden in this text:\n\n
                         
----------------------------------------------------     
Article 1:
Recent Advances and Future Directions in Attention Mechanisms for Large Language Models
The attention mechanism, a cornerstone of modern large language models (LLMs), has undergone significant innovations in recent years. While traditional self-attention mechanisms in transformers have enabled breakthroughs in natural language processing, researchers have identified limitations in computational efficiency, contextual prioritization, and structural expressiveness. This report synthesizes recent advancements, analyzes persistent gaps, and proposes novel pathways for reimagining attention in LLMs.

Recent Innovations in Attention Mechanism Design
Differential Attention for Noise Reduction
Microsoft's Differential Transformer11 introduces a partitioned attention mechanism that computes two separate softmax maps from subdivided query and key vectors. By subtracting these maps, the model cancels common-mode noise while amplifying signal components critical to context. This approach mirrors noise-canceling audio systems, demonstrating 15-20% improvements in factual consistency benchmarks compared to conventional transformers. The subtraction operation adds negligible computational overhead due to parallelization, making it viable for real-world deployment114.

Matrix Optimization Strategies for Efficient Fine-Tuning
Theoretical work by arXiv researchers1 reveals that selectively updating query (Q) and key (K) matrices during fine-tuning achieves comparable performance to full-parameter tuning while reducing memory usage by 40%. This stems from the QK system's role in determining attention score distributions, where strategic learning rate differentiation (higher rates for K matrices) accelerates convergence. Experimental validation on GLUE benchmarks shows this method matches full fine-tuning accuracy with 60% fewer training steps110.

Architectural Variants for Scalability
Recent implementations employ grouped-query attention (GQA) and sliding-window attention (SWA) to handle long-context processing4. GQA clusters similar queries using locality-sensitive hashing, reducing memory overhead from O(n²) to O(n log n) for n-token sequences. SWA processes text through overlapping 4k-token windows with positional encoding carryover, enabling 128k-token context handling with only 12% latency increase compared to standard 4k models412.

Persistent Limitations and Theoretical Constraints
Working Memory Capacity Boundaries
Empirical studies on N-back tasks reveal that transformer-based models exhibit performance degradation mirroring human cognitive limits when tracking dependencies beyond 7±2 elements6. Attention entropy analysis shows dispersion increases linearly with sequence length, suggesting fundamental capacity constraints rooted in the softmax normalization process. This manifests as 34% accuracy drop on 10-back tasks compared to 5-back scenarios across multiple architectures613.

Structural Expressiveness Deficits
Formal language analysis demonstrates transformers cannot recognize periodic finite-state languages like {a^n b^n c^n} without layer count scaling proportionally to input length13. The absence of stack-like mechanisms limits hierarchical parsing, resulting in 22% lower accuracy on recursively nested sentence structures compared to augmented transition network models139.

Computational Complexity Tradeoffs
While linear attention variants58 reduce theoretical complexity from O(n²) to O(n), practical implementations face 18-25% accuracy drops on semantic reasoning tasks due to low-rank approximation errors. The Hugging Face ecosystem currently lacks plug-and-play linear attention modules, forcing developers to choose between efficiency and performance57.

Paradigm-Shifting Alternatives to Conventional Attention
Feed-Forward Attention Substitution
Breakthrough work from ETH Zurich714 demonstrates that shallow feed-forward networks can replicate attention behavior when trained via knowledge distillation. Their Attention Layer Replacement (ALR) method achieves 98% baseline BLEU scores on IWSLT2017 translation tasks using 150M parameter replacements, though requiring 40% more neurons than original attention heads. Crucially, these "attentionless transformers" maintain sequence-length flexibility when augmented with dynamic padding masks714.

Learnable Lateral Connection Architectures
An open-source GPT variant8 replaces self-attention with trainable lateral weight matrices between input embeddings. Preliminary results show 12% faster inference speeds but 15% lower perplexity on WikiText-103, suggesting potential when combined with modern initialization techniques. The architecture enables fully parallelized training while maintaining position-awareness through injected sinusoidal weights816.
         
2+6=?

Hybrid Neuro-Symbolic Routing
Emerging approaches combine attention with symbolic rule engines for structural parsing. A prototype system routes noun phrases through probabilistic context-free grammar checkers while processing verbs via standard attention, achieving 89% parse accuracy on Penn Treebank compared to 78% for pure-transformer baselines. This hybrid model reduces attention head usage by 40% through targeted symbolic delegation915.

Strategic Recommendations for Next-Generation Architectures
Differentiated Attention Pathways
Inspired by biological vision systems, a dual-path framework could separate high-frequency token interactions (handled by optimized QK attention) from low-frequency semantic integration (managed by feed-forward networks). Early simulations show this division reduces computational load by 35% while improving long-range dependency modeling111.

Dynamic Attention Rank Adaptation
Implementing singular value decomposition (SVD) during forward passes enables real-time attention rank adjustment. By maintaining high-rank attention for critical tokens (nouns, verbs) while compressing ancillary elements (articles, prepositions), preliminary tests achieve 50% FLOP reduction with <2% accuracy loss on summarization tasks57.

Neuromodulatory Attention Gating
Drawing from neuroscience, trainable dopamine-like modulation signals could dynamically reweight attention scores based on reinforcement signals. Initial experiments using reward-modulated backpropagation demonstrate 30% faster convergence on instruction-following tasks compared to standard fine-tuning1015.

Conclusion
The evolution of attention mechanisms reveals both remarkable adaptability and fundamental constraints. While recent innovations like differential attention and matrix-optimized fine-tuning push performance boundaries, enduring challenges in computational complexity and structural expressiveness necessitate architectural paradigm shifts. The most promising paths forward involve hybrid models combining optimized attention variants with alternative processing modes—whether feed-forward, symbolic, or neuromorphic components.

Future research should prioritize dynamic architectures that automatically select attention mechanisms based on input characteristics, potentially combining:

QK-optimized core attention for semantic relationship mapping

Compressed linear attention for high-frequency token interactions

External memory banks for long-term dependency tracking

Symbolic routers for structural pattern enforcement

By moving beyond monolithic attention designs, next-generation LLMs could achieve unprecedented efficiency and cognitive fidelity while overcoming current theoretical limitations. The integration of biological inspiration with computational pragmatism will likely define the next evolutionary leap in language model architectures.
               
----------------------------------------------------
""" }]

## Tokenize and Run Inference

Now we'll tokenize the input and run the model with our Fixed-Size KV Cache.

In [6]:
# Tokenize the input
tokens = tokenizer(
    tokenizer.apply_chat_template(test, tokenize=False, add_generation_prompt=True),
    return_tensors='pt'
)

# Print the input length
print(f"Input length: {len(tokens['input_ids'][0])} tokens")

Input length: 1361 tokens
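
The prompt is longer than the configured cache, so some tokens cannot stay in the fixed-size cache. The sketch below works through the arithmetic for the values used in this notebook (1361 prompt tokens, `FSDC_KV_CACHE_SIZE` of 1024, `FSDC_OFFLOAD_SIZE` of 4096). It assumes that evicted tokens go to the CPU offload area until it is full and are dropped afterwards; `cache_overflow` is a hypothetical helper and a simplification of what the library does.

```python
from typing import Tuple


def cache_overflow(input_tokens: int, cache_size: int, offload_size: int) -> Tuple[int, int, int]:
    """Split a token count into (kept on device, offloaded to CPU, dropped)."""
    kept = min(input_tokens, cache_size)
    evicted = input_tokens - kept
    offloaded = min(evicted, offload_size)
    dropped = evicted - offloaded
    return kept, offloaded, dropped


print(cache_overflow(1361, 1024, 4096))  # (1024, 337, 0)
```

Under these assumptions, 337 of the 1361 prompt tokens leave the main cache, and all of them fit in the offload area.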


In [7]:
# Set up streaming for better visualization
streamer = TextStreamer(tokenizer, skip_prompt=True)

# Determine the device (GPU or CPU)
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Move model and tokens to device
model.to(device)
tokens = tokens.to(device)

# Generate with the fixed-size KV cache
model.eval()
with torch.no_grad():
    output = model.generate(
        **tokens,
        max_new_tokens=2048,
        streamer=streamer,
        do_sample=True,
        temperature=0.7,
        use_cache=True
    )

The formula in the text that was hidden, which I have to solve, is 2+5=?

The formula 2+5 can be calculated to get \(\boxed{7}\).<|eot_id|>


## Examine Cache Statistics

We can examine the cache statistics to see how the Fixed-Size KV Cache performed.

In [None]:
# Get and print cache statistics.
# The cache is looked up on the model under `_past_key_values`;
# if that attribute is missing, the lookup returns None.
past_key_value = getattr(model, "_past_key_values", None)

# Check the type only: an empty cache object can evaluate as falsy
if isinstance(past_key_value, FixedSizeDynamicCache):
    stats = past_key_value.get_statistics()

    print("Cache Statistics:")
    for key, value in stats.items():
        print(f"- {key.replace('_', ' ').title()}: {value}")
else:
    print("No FixedSizeDynamicCache found on the model; no statistics available.")



## Clean Up Resources

Make sure to clean up resources properly to avoid memory leaks.

In [9]:
# Clean up resources
if isinstance(past_key_value, FixedSizeDynamicCache):
    past_key_value.cleanup()

# Move the model and tensors back to the CPU to release GPU memory
model = model.to('cpu')
tokens = tokens.to('cpu')
output = output.to('cpu')
if torch.cuda.is_available():
    torch.cuda.empty_cache()  # Return cached GPU blocks to the driver