# 🚀 Inference Optimization: Server-Side and Client-Side Caching

In previous notebooks, we explored **model compression** techniques (like quantization and pruning) and **Parameter-Efficient Fine-Tuning (PEFT)** methods such as **LoRA**.
These approaches helped us **reduce training cost and memory footprint** while keeping model performance strong.

Now we shift our focus from **training efficiency** to **inference efficiency**: optimizing how the model serves predictions once deployed.

### 🧠 Why Optimize Inference?

Even after training, large language models remain computationally heavy:
- Each token generation step can require billions of FLOPs.
- Repeated computations across requests waste valuable GPU cycles.
- Memory bandwidth and latency can quickly become bottlenecks.

Inference optimizations allow us to:
- ⚡ Reduce response latency (faster generations)
- 💸 Lower compute cost per request
- 🧩 Serve more users concurrently with the same hardware


### 🧩 What We’ll Cover

We’ll study **two complementary caching strategies**, one on the **server side** and one on the **client side**, that make inference significantly faster and more efficient.

#### 🔹 1. Server-Side Optimization: *KV Caching (via vLLM)*  
> Used during token generation to reuse attention key/value pairs, avoiding redundant computation.

We’ll see how **vLLM** implements KV caching with its **PagedAttention** mechanism, achieving high throughput and efficient GPU memory usage for large-scale serving.
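
To make the key/value reuse concrete, here is a minimal pure-Python sketch of single-head attention decoding with and without a cache. It is a toy under stated assumptions, not how vLLM or Transformers implement it: the 2-dimensional token vectors and the `W_K` / `W_V` matrices are made up, and each token is used directly as its own query.

```python
import math

# Hypothetical toy projection matrices and token embeddings
W_K = [[0.5, -0.2], [0.1, 0.3]]
W_V = [[0.2, 0.4], [-0.3, 0.6]]
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]

def project(token, weight):
    """Multiply a token vector by a projection matrix."""
    return [sum(t * w for t, w in zip(token, row)) for row in weight]

def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(dim)]

def decode_without_cache(tokens):
    """Recompute every key and value for the whole prefix at each step."""
    outputs, projections = [], 0
    for step in range(1, len(tokens) + 1):
        prefix = tokens[:step]
        keys = [project(t, W_K) for t in prefix]
        values = [project(t, W_V) for t in prefix]
        projections += 2 * step
        outputs.append(attend(prefix[-1], keys, values))
    return outputs, projections

def decode_with_cache(tokens):
    """Project only the newest token and append it to the cache."""
    outputs, projections = [], 0
    key_cache, value_cache = [], []
    for token in tokens:
        key_cache.append(project(token, W_K))
        value_cache.append(project(token, W_V))
        projections += 2
        outputs.append(attend(token, key_cache, value_cache))
    return outputs, projections

uncached_out, uncached_cost = decode_without_cache(tokens)
cached_out, cached_cost = decode_with_cache(tokens)
print(uncached_cost, cached_cost)  # 20 8
```

Both loops produce the same attention outputs; the cached version simply does a constant number of projections per step, while the uncached one grows with the length of the prefix.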

#### 🔹 2. Client-Side Optimization: *CacheSaver*  
> Used to locally store and reuse model responses or intermediate computations on the client side.

We’ll explore how **CacheSaver** helps avoid repeated inference calls for identical inputs, reducing latency and API costs.


### 🎯 Learning Goals
By the end of this notebook, you’ll:
- Understand how **KV caching** speeds up sequential generation.
- Learn how **vLLM** leverages **PagedAttention** for scalable inference.
- Implement a **CacheSaver** for efficient client-side reuse.
- Measure real-world performance gains from both techniques.


Let’s dive in and make inference *blazingly fast*! ⚡


```python
import time
from transformers import AutoModelForCausalLM, AutoTokenizer
from vllm import LLM, SamplingParams
import matplotlib.pyplot as plt
import numpy as np
```

## 🧪 Comparing Inference Performance: Transformers vs vLLM

In this experiment, we’ll benchmark the inference speed of **Transformers** and **vLLM** using the `facebook/opt-125m` model.  
The model will be asked to **explain in detail how the internet works**, from connecting to Wi-Fi to rendering a website in a browser.  

We’ll:
- Run **10 generations** for each framework  
- Measure and record inference times  
- Visualize per-run and average performance  

Set `verbose = True` if you’d like to print model outputs during testing.


```python
model_name = "facebook/opt-125m"
prompt = "Explain in detail how the internet works, starting from how a computer connects to a Wi-Fi network all the way to how a website appears in a browser. Include examples and describe each step carefully."
verbose = False # Switch to True if you want the model's outputs printed
```

### Transformers

Let's first run the experiment using Transformers.

**`TODO:`**
1. Initialize the model using Transformers for Causal Language Modeling.
2. Use the model to generate 10 output samples. All generations should have the following decoding parameters:
    - `max_new_tokens`=1024
    - `do_sample`=True
    - `temperature`=0.7
    - `top_p`=0.9
3. Measure the time needed for each generation and then compute their average.
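
For step 3, one possible timing pattern is sketched below. It is framework-agnostic: `benchmark` and `fake_generate` are hypothetical names, and `fake_generate` is only a stand-in that sleeps briefly, so you would swap in your own generation call.

```python
import time
from statistics import mean

def benchmark(generate_fn, runs=10):
    """Call generate_fn() `runs` times and return the wall-clock time of each call."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        generate_fn()
        durations.append(time.perf_counter() - start)
    return durations

def fake_generate():
    # Hypothetical stand-in for a real generation call
    time.sleep(0.01)

times = benchmark(fake_generate, runs=10)
print("per-run:", [round(t, 3) for t in times])
print(f"average: {mean(times):.3f} s")
```

`time.perf_counter()` is used because it is a monotonic, high-resolution clock. When timing on a GPU, the first run often includes one-off warm-up cost, so it may be worth reporting it separately from the rest.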

### vLLM
- **vLLM** is an open-source **high-performance inference and serving engine** for large language models (LLMs).  
- It’s designed to make model serving **much faster and more memory-efficient** through **KV caching** and an advanced mechanism called **PagedAttention**.  
- Traditional inference engines store each request's key–value (KV) cache contiguously in GPU memory. Space must be reserved up front and fragments easily, which limits how many requests fit at once. vLLM’s **PagedAttention** treats attention cache memory like a **virtual memory system**, breaking it into small “pages” that can be dynamically allocated, reused, and evicted.  
- This enables **many concurrent requests**, **lower latency**, and **higher GPU utilization**, which makes it well suited to production-scale LLM serving.  
- vLLM can be used both **locally and in the cloud**, providing fast and cost-efficient inference for chatbots, code assistants, and other AI applications.
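
The paging idea in the bullets above can be illustrated with a toy block allocator. This is a simplified sketch, not vLLM's actual implementation: the class `PagedKVCache`, its methods, and the block sizes are all hypothetical, and it only tracks which blocks belong to which request rather than storing real tensors.

```python
class PagedKVCache:
    """Toy allocator: each sequence owns a list of fixed-size blocks ("pages")."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # sequence id -> list of physical block ids
        self.lengths = {}       # sequence id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve room for one more token, allocating a block only when needed."""
        length = self.lengths.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if length % self.block_size == 0:  # no block yet, or the last one is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop(0))
        self.lengths[seq_id] = length + 1

    def release(self, seq_id):
        """Return a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

cache = PagedKVCache(num_blocks=4, block_size=2)
for _ in range(3):
    cache.append_token("request-a")  # 3 tokens need 2 blocks
cache.append_token("request-b")      # 1 token needs 1 block
tables_before = dict(cache.block_tables)
print(tables_before)      # {'request-a': [0, 1], 'request-b': [2]}

cache.release("request-a")
print(cache.free_blocks)  # [3, 0, 1]
```

Because blocks are handed out on demand, a short request never reserves memory for tokens it will not generate, and freed blocks are immediately available to other requests.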

Check out vLLM’s GitHub: [https://github.com/vllm-project/vllm](https://github.com/vllm-project/vllm) 🚀


Now let's repeat the experiment using vLLM. Initializing a model and generating with vLLM is similar to Transformers, but not identical. The example below shows everything you need for this exercise.

```python
llm = LLM(model=model_name)
sampling_params = SamplingParams(max_tokens=1024, temperature=0.7, top_p=0.9)
outputs = llm.generate([prompt], sampling_params)
```

**`TODO:`**
1. Initialize the model using vLLM.
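2. Use the model to generate 10 output samples. All generations should use the following decoding parameters (vLLM's `SamplingParams` takes `max_tokens` in place of `max_new_tokens` and has no `do_sample` flag; sampling is active whenever `temperature` is above 0):
    - `max_tokens`=1024
    - `temperature`=0.7
    - `top_p`=0.9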
3. Measure the time needed for each generation and then compute their average.
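
Once both timing lists exist, a short summary makes the comparison easier to read. The sketch below assumes the timings are plain lists of seconds; the helper `summarize` is hypothetical, and the numbers are placeholders for illustration only, not measurements.

```python
from statistics import mean, stdev

def summarize(times):
    """Return (mean, standard deviation) of a list of run times in seconds."""
    return mean(times), stdev(times)

# Placeholder values for illustration only, NOT real measurements.
transformers_times = [4.0, 5.0, 3.0]
vllm_times = [1.0, 1.5, 0.5]

hf_mean, hf_std = summarize(transformers_times)
vllm_mean, vllm_std = summarize(vllm_times)
speedup = hf_mean / vllm_mean

print(f"Transformers: {hf_mean:.2f} s (std {hf_std:.2f})")
print(f"vLLM:         {vllm_mean:.2f} s (std {vllm_std:.2f})")
print(f"Speedup:      {speedup:.1f}x")
```

Reporting the spread alongside the mean helps show whether a difference between the two frameworks is larger than the run-to-run noise.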

**`Discussion:`** Investigate the data you've gathered to draw conclusions.

\[Your Answer\]

## 📦 Client-Side Caching: CacheSaver

To complement our server-side optimization, we’ll now explore a **client-side optimization** layer: **CacheSaver**.

### What is CacheSaver?
CacheSaver is a plug-and-play client framework that sits **outside the model internals** and the server setup, adding intelligent caching to inference calls.  
It works with *any* model or API (open-source or closed-source) and introduces:
- Transparent reuse of past responses instead of new inference calls.  
- Namespace-aware caching to ensure **deterministic reproducibility** while preserving model randomness.  
- Low overhead: one line of code to integrate, no heavy dependencies.

Check out CacheSaver's GitHub: [https://github.com/au-clan/cachesaver/tree/main](https://github.com/au-clan/cachesaver/tree/main) 🚀

### Why use CacheSaver?
- 🧠 **Reduces inference cost & latency** by avoiding redundant calls when the same prompt appears again.  
- 🔁 **Improves reproducibility**: identical prompts yield identical cached responses (useful for benchmarking or reasoning chains).  
- 🔄 **Enables reuse in multi-step workflows** (e.g., agent reasoning, chains of prompts) by detecting overlapping sub-tasks and skipping repeated computation.


Let’s set it up and see the impact!  
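
The general idea of skipping redundant calls can be sketched with a plain exact-match cache. This is a generic illustration, not CacheSaver's implementation or API: `request_key`, `ResponseCache`, and `fake_backend` are hypothetical names, and the backend is a stub that makes no network call.

```python
import hashlib
import json

def request_key(model, messages, **params):
    """Build a deterministic key from everything that affects the response."""
    payload = json.dumps(
        {"model": model, "messages": messages, "params": params},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

class ResponseCache:
    """Return a stored response when the exact same request is seen again."""

    def __init__(self, backend):
        self.backend = backend
        self.store = {}
        self.backend_calls = 0

    def complete(self, model, messages, **params):
        key = request_key(model, messages, **params)
        if key not in self.store:  # cache miss: pay for one real call
            self.backend_calls += 1
            self.store[key] = self.backend(model, messages, **params)
        return self.store[key]

def fake_backend(model, messages, **params):
    # Hypothetical stand-in for a real inference call
    return f"echo: {messages[-1]['content']}"

cache = ResponseCache(fake_backend)
messages = [{"role": "user", "content": "hi"}]
first = cache.complete("toy-model", messages, temperature=0.7)
second = cache.complete("toy-model", messages, temperature=0.7)  # served from cache
third = cache.complete("toy-model", messages, temperature=0.2)   # different params: miss
print(cache.backend_calls)  # 2
```

Hashing the model name, messages, and decoding parameters together means a change to any of them produces a new key, so a cached answer is never returned for a request that differs.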


```python
prompt = "Suggest a startup idea and its one-sentence elevator pitch. Keep the textual style format of the following examples : 'EcoFleet: AI for optimizing delivery routes', 'LinguaLoop: Personalized language tutoring via voice AI', 'TaskHaven: Calm productivity app for remote workers'"
```

### 💡 Why Randomness Matters Before Caching

Before introducing CacheSaver, it’s worth seeing how *vanilla* OpenAI inference behaves when you call the same prompt multiple times.  

Even though the prompt and model are identical, each new request is treated as an independent sampling process, meaning the model will likely generate **different startup ideas** each time.  

In the example below, we:  
1. Ask for **3 startup ideas** first.  
2. Then immediately ask for **2 more startup ideas** using the same prompt.  

You’ll notice that the “follow-up” ideas don’t necessarily overlap with the first batch: each call is random and stateless.


```python
from openai import AsyncOpenAI

client = AsyncOpenAI()

# First request: three independent samples for the prompt
response = await client.chat.completions.create(
    model="gpt-4.1-nano",
    messages=[
        {"role": "user", "content": prompt}
    ],
    n=3
)
print("Initial ideas:\n\n", "\n".join(["\t -"+choice.message.content.strip() for choice in response.choices]))

# Second request: two more samples for the same prompt
response = await client.chat.completions.create(
    model="gpt-4.1-nano",
    messages=[
        {"role": "user", "content": prompt}
    ],
    n=2
)
print("Follow-up ideas:\n\n", "\n".join(["\t -"+choice.message.content.strip() for choice in response.choices]))
```

### 🧠 When You Need to Regenerate Lost Ideas

Imagine you ran the earlier experiment, but forgot to save or record the ideas that were generated.  
Now you want to re-run the same prompt, this time asking for **five startup ideas** to make up for the ones you lost.

In a normal (non-cached) setup, this new call will **resample all five ideas from scratch**, producing an entirely new set of outputs.  
Even though it’s the same model and prompt, the randomness of sampling means you can’t easily recover or reproduce the original results.


**`TODO:`** Re-initialise the `AsyncOpenAI` client and generate 5 more ideas in the same manner as before.

### 💾 Imagine Running the Same Experiment with CacheSaver

Now, let’s see how this scenario changes when we introduce **CacheSaver**.  

CacheSaver acts as a reproducible “memory” for your LLM calls. It remembers what the model has already generated for a given prompt and namespace.  
So if you run the same experiment again (even in a new session or notebook cell), CacheSaver will:

1. **Reproduce** the exact same ideas you saw earlier, so no results are lost.  
2. **Reuse** cached completions instead of re-sampling them from scratch.  
3. **Only generate missing completions** when you increase `n` (for example, going from 3 → 5 ideas).  
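
The "only generate what is missing" behaviour in point 3 can be sketched with a toy cache that keeps an ordered list of samples per prompt. This is a deliberate simplification and not CacheSaver's actual algorithm: namespaces are left out, and `SampleCache` and `fake_sampler` are hypothetical names with a random stub in place of a model.

```python
import random

class SampleCache:
    """Toy cache that stores an ordered list of samples for each prompt."""

    def __init__(self, sampler):
        self.sampler = sampler
        self.samples = {}     # prompt -> list of samples generated so far
        self.fresh_calls = 0  # how many new samples were actually drawn

    def get(self, prompt, n):
        cached = self.samples.setdefault(prompt, [])
        while len(cached) < n:  # top up only the missing samples
            cached.append(self.sampler(prompt))
            self.fresh_calls += 1
        return cached[:n]

rng = random.Random()

def fake_sampler(prompt):
    # Hypothetical stand-in for one random model completion
    return f"idea-{rng.randint(0, 10**9)}"

cache = SampleCache(fake_sampler)
first = cache.get("startup idea", n=3)  # draws 3 fresh samples
again = cache.get("startup idea", n=2)  # fully served from the cache
more = cache.get("startup idea", n=5)   # reuses 3, draws 2 fresh samples
print(cache.fresh_calls)  # 5
```

Asking for five samples after three have been stored costs only two new draws, and the first three entries stay identical across calls.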

```python
# Same interface as before, but imported from CacheSaver
from cachesaver.models.openai import AsyncOpenAI

client = AsyncOpenAI()

# First request: three ideas
response = await client.chat.completions.create(
    model="gpt-4.1-nano",
    messages=[
        {"role": "user", "content": prompt}
    ],
    n=3
)
print("Initial ideas:\n\n", "\n".join(["\t -"+choice.message.content.strip() for choice in response.choices]))

# Second request: two more ideas for the same prompt
response = await client.chat.completions.create(
    model="gpt-4.1-nano",
    messages=[
        {"role": "user", "content": prompt}
    ],
    n=2
)
print("Follow-up ideas:\n\n", "\n".join(["\t -"+choice.message.content.strip() for choice in response.choices]))
```

### ⚡ Regenerating with CacheSaver: Consistency and Reuse in Action

Now, let’s re-run the same request, this time with **CacheSaver** enabled.  

Because CacheSaver tracks results within a deterministic namespace, it recognizes that we’ve already generated some ideas for this exact prompt.  
When we now ask for **five startup ideas**, it will:

- Instantly **reuse** the cached ideas from earlier runs.  
- Return a **stable, reproducible list** that remains the same across sessions and environments.  

This is where CacheSaver’s real value becomes obvious: instead of unpredictable, random generations, you get consistent, memory-backed outputs.

```python
# AsyncOpenAI here is still the CacheSaver import from the previous cell
client = AsyncOpenAI()

response = await client.chat.completions.create(
    model="gpt-4.1-nano",
    messages=[
        {"role": "user", "content": prompt}
    ],
    n=5
)
print("Follow-up ideas:\n\n", "\n".join(["\t -"+choice.message.content.strip() for choice in response.choices]))
```

**`Discussion:`**  You’ve now measured the inference times for both runs, with and without CacheSaver. What do you notice about the difference between the first and second execution? Why does CacheSaver behave this way, and how might this affect real-world workloads?
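
If you have not recorded timings yet, one way to capture them for an awaited call is sketched below. The helper `timed` and the stub `fake_request` are hypothetical; the stub just sleeps to imitate a slow cache miss and a fast cache hit, so no API call is made.

```python
import asyncio
import time

async def timed(awaitable):
    """Await something and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = await awaitable
    return result, time.perf_counter() - start

async def fake_request(delay):
    # Hypothetical stand-in for `client.chat.completions.create(...)`
    await asyncio.sleep(delay)
    return "response"

async def main():
    _, slow = await timed(fake_request(0.05))  # imitates a cache miss
    _, fast = await timed(fake_request(0.0))   # imitates a cache hit
    return slow, fast

slow, fast = asyncio.run(main())
print(f"first call: {slow:.3f} s, repeated call: {fast:.3f} s")
```

Inside a notebook, where an event loop is already running, you would typically write `await timed(...)` directly in a cell in place of `asyncio.run(main())`.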