## Caching LLM Calls

*[Coding along with the Udemy Course [Advanced Retrieval Augmented Generation](https://www.udemy.com/course/advanced-retrieval-augmented-generation/) by Rémi Connesson]*

In [50]:
import pandas as pd
from openai import AsyncOpenAI

In [51]:
api_key = pd.read_csv("~/tmp/chat_gpt/agentic-design-1.txt", sep=" ", header=None)[0][0]
print("Don't be a fool and send your API key to GitHub")

Don't be a fool and send your API key to GitHub


In [52]:
client = AsyncOpenAI(api_key=api_key)

In [53]:
def _msg(role, content):
    return {'role': role, 'content': content}

def system(content):
    return _msg('system', content)

def user(content):
    return _msg('user', content)

def assistant(content):
    return _msg('assistant', content)

In [54]:
model = "gpt-4o-mini"

In [55]:
# sanity check 1, is this thing on?
completion = await client.chat.completions.create(
    model = model,
    messages = [user("What is caching in software engineering?")],
    max_tokens = 100 # limit the output to save costs; answer might get cut
)
completion.choices[0].message.content

'Caching in software engineering is a technique used to temporarily store copies of frequently accessed data or computations in order to improve performance and efficiency. By keeping this data close to where it is needed, such as in memory, subsequent requests for the same data can be served much faster than if they had to be recalculated or fetched from a slower data source (like a database or an API).\n\n### Key Concepts of Caching:\n\n1. **Types of Cache**:\n   - **Memory Cache**: Uses RAM'

In [56]:
completion.json() # json string of the completion

'{"id":"chatcmpl-ANd0gyK9hQqePVcG5gz9tAAMAj27V","choices":[{"finish_reason":"length","index":0,"logprobs":null,"message":{"content":"Caching in software engineering is a technique used to temporarily store copies of frequently accessed data or computations in order to improve performance and efficiency. By keeping this data close to where it is needed, such as in memory, subsequent requests for the same data can be served much faster than if they had to be recalculated or fetched from a slower data source (like a database or an API).\\n\\n### Key Concepts of Caching:\\n\\n1. **Types of Cache**:\\n   - **Memory Cache**: Uses RAM","refusal":null,"role":"assistant","audio":null,"function_call":null,"tool_calls":null}}],"created":1730194862,"model":"gpt-4o-mini-2024-07-18","object":"chat.completion","service_tier":null,"system_fingerprint":"fp_f59a81427f","usage":{"completion_tokens":100,"prompt_tokens":14,"total_tokens":114,"completion_tokens_details":{"audio_tokens":null,"reasoning_token

### Introducing Caching

In [57]:
# https://pypi.org/project/diskcache/
# Disk Cache -- Disk and file backed persistent cache
from diskcache import Cache

In [58]:
cache = Cache(directory=".cache")

### Making Caching Asynchronous

In [59]:
import asyncio

In [60]:
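# creating a wrapper around the cache
# so the synchronous disk access runs in a worker thread and doesn't block the event loop
async def set_async(key, val, **kwargs): # **kwargs collects any extra keyword arguments (e.g. expire) and forwards them to cache.set
    # await the cache.set operation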
    return await asyncio.to_thread(cache.set, key, val, **kwargs)

async def get_async(key, default=None, **kwargs):
    return await asyncio.to_thread(cache.get, key, default, **kwargs)
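
To see what `asyncio.to_thread` buys us, here is a minimal sketch that doesn't need diskcache at all. `blocking_lookup` and `heartbeat` are hypothetical stand-ins (not part of the course code): one simulates a slow synchronous disk read, the other shows that the event loop keeps running while the read sits in a worker thread.

```python
import asyncio
import time

def blocking_lookup(key):
    # Hypothetical stand-in for a slow, synchronous cache read.
    time.sleep(0.2)
    return f"value-for-{key}"

async def heartbeat(ticks):
    # Appends a timestamp every 50 ms, three times in total.
    # This only makes progress if the event loop is not blocked.
    for _ in range(3):
        await asyncio.sleep(0.05)
        ticks.append(time.perf_counter())

async def main():
    ticks = []
    # The blocking call runs in a worker thread, the heartbeat on the event loop.
    value, _ = await asyncio.gather(
        asyncio.to_thread(blocking_lookup, "a"),
        heartbeat(ticks),
    )
    return value, len(ticks)

# In a notebook cell, use `await main()` instead of asyncio.run(main()).
value, n_ticks = asyncio.run(main())
print(value, n_ticks)
```

Calling `blocking_lookup("a")` directly inside a coroutine would freeze every other task for the duration of the read, which is exactly what the wrappers above are meant to avoid.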

### Caching LLM Calls

In [61]:
import json
from hashlib import md5

In [62]:
# when calling a function in Python, keyword arguments can be passed in any order
# so we have to put the arguments into a fixed order (sort_keys=True) before we create a hash key out of them
def make_cache_key(key_name, **kwargs):
    kwargs_string = json.dumps(kwargs, sort_keys=True)
    kwargs_hash = md5(kwargs_string.encode('utf-8')).hexdigest()
    cache_key = f"{key_name}__{kwargs_hash}"
    return cache_key

In [63]:
make_cache_key("demo_cache", a=1, b=2, c=4)

'demo_cache__ac6b59f8b9221cc50603ef2f4fcbf866'

In [64]:
make_cache_key("demo_cache", a=1, c=4, b=2)

'demo_cache__ac6b59f8b9221cc50603ef2f4fcbf866'
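
A small sketch of what `sort_keys=True` does and doesn't normalize, assuming the same `make_cache_key` as above (repeated here so the snippet runs on its own). The message dicts are made-up examples. Note that MD5 only serves as a fingerprint here, not as a security measure.

```python
import json
from hashlib import md5

def make_cache_key(key_name, **kwargs):
    # Same approach as above: canonical JSON string, then an MD5 fingerprint.
    kwargs_string = json.dumps(kwargs, sort_keys=True)
    kwargs_hash = md5(kwargs_string.encode("utf-8")).hexdigest()
    return f"{key_name}__{kwargs_hash}"

# Keys inside nested dicts are sorted as well ...
key_a = make_cache_key("demo", messages=[{"role": "user", "content": "hi"}])
key_b = make_cache_key("demo", messages=[{"content": "hi", "role": "user"}])

# ... but list order is preserved, so reordering the messages changes the key.
key_c = make_cache_key("demo", messages=[{"role": "user", "content": "hi"},
                                          {"role": "user", "content": "there"}])
key_d = make_cache_key("demo", messages=[{"role": "user", "content": "there"},
                                          {"role": "user", "content": "hi"}])

print(key_a == key_b)  # True
print(key_c == key_d)  # False
```

That is the behavior we want for chat completions: the order of dict keys is irrelevant, but the order of the messages in a conversation is not.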

#### __Caching an LLM Call__

In [65]:
# caching a chat completion
# the * at the position of the first parameter forces us to explicitly pass parameters with the variable name
# Keyword-Only Arguments: When you use a single * by itself in the function signature, it indicates that all parameters
# following the * must be passed as keyword arguments. This is useful for enforcing readability, as it makes certain
# arguments require a name when the function is called.
def _make_cache_key_for_chat_completion(
    *,
    model,
    messages,
    **kwargs
):
    return make_cache_key(
        "openai_chat_completion",
        model=model,
        messages=messages,
        **kwargs
    )
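
A quick illustration of the keyword-only rule. `describe_call` is a hypothetical function with the same signature shape as `_make_cache_key_for_chat_completion`, used only to show what happens when the arguments are passed positionally.

```python
def describe_call(*, model, messages, **kwargs):
    # Everything after the bare * must be passed by name.
    return {"model": model, "n_messages": len(messages), "extra": sorted(kwargs)}

# Passing by name works, and extra keyword arguments land in **kwargs.
ok = describe_call(model="gpt-4o-mini", messages=[], max_tokens=100)
print(ok)

# Passing positionally raises a TypeError.
try:
    describe_call("gpt-4o-mini", [])
    positional_error = None
except TypeError as exc:
    positional_error = str(exc)
print(positional_error)
```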

In [66]:
# what I want to return is the chat completion
# https://platform.openai.com/docs/guides/text-generation
from openai.types.chat import ChatCompletion

In [67]:
# A sentinel value is often used to indicate that no value was found or provided.
# a sentinel is a value you can't accidentally create
# =>>> a trick to do this in Python is creating a fresh object(): it has its own identity, so only this exact object passes an `is` check
# A common sentinel for this purpose is None, but None could itself be a legitimate cached value, so we use a dedicated object.
CACHE_MISS_SENTINEL = object() # creating a sentinel
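
Why not simply use `None` as the default? A minimal sketch with a plain dict standing in for the cache (the `store`, `MISSING`, and function names are made up for this illustration):

```python
MISSING = object()  # unique object, can only match itself in an `is` check

store = {"answer": None}  # None is a legitimately stored value here

def lookup_with_none(key):
    # Cannot tell "stored None" apart from "not stored at all".
    return "miss" if store.get(key) is None else "hit"

def lookup_with_sentinel(key):
    # The sentinel is only returned when the key is really absent.
    return "miss" if store.get(key, MISSING) is MISSING else "hit"

print(lookup_with_none("answer"))      # miss (wrong: the key exists)
print(lookup_with_sentinel("answer"))  # hit
print(lookup_with_sentinel("other"))   # miss
```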

In [68]:
async def cached_chat_completion(
    *,
    model,
    messages,
    **kwargs
) -> ChatCompletion:
    # 1) CREATE CACHE KEY
    cache_key = _make_cache_key_for_chat_completion(
        model=model,
        messages=messages,
        **kwargs
    )

    cached_value = await get_async(cache_key, default=CACHE_MISS_SENTINEL)

    # the cached path should return the same kind of value as the uncached call to client.chat.completions.create,
    # which is an object of type ChatCompletion

    # 2) CACHE MISS
    if cached_value is CACHE_MISS_SENTINEL:
        # no cached value, so we make the API call and cache its result
        # the API call returns a ChatCompletion
        completion = await client.chat.completions.create(
            model=model,
            messages=messages,
            **kwargs
        )
        # we're caching/storing the json string of the completion
        await set_async(cache_key, completion.json())
        return completion
    # 3) CACHE HIT
    else:
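        # returning cached_value directly would hand back a JSON string,
        # so we rebuild an object of type ChatCompletion from it, the same type the uncached call returns
        # (note: on Pydantic v2, .json() and .validate() are deprecated in favor of model_dump_json() and model_validate())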
        return ChatCompletion.validate(json.loads(cached_value))
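
The three steps (create key, miss, hit) can be tried out without spending any tokens. The following is only a sketch under stated assumptions: `fake_completion` is a hypothetical stand-in for the OpenAI call, and a plain dict replaces diskcache. The call counter makes it visible when the "API" is really hit.

```python
import asyncio
import json

_MISS = object()
_memory_cache = {}  # stand-in for the disk cache
api_calls = 0       # counts how often the fake API is really called

async def fake_completion(*, model, messages, **kwargs):
    # Hypothetical replacement for client.chat.completions.create.
    global api_calls
    api_calls += 1
    await asyncio.sleep(0)
    return {"id": f"fake-{api_calls}", "model": model, "echo": messages[-1]["content"]}

async def cached_fake_completion(*, model, messages, **kwargs):
    # 1) create the cache key from all arguments in a fixed order
    key = json.dumps({"model": model, "messages": messages, **kwargs}, sort_keys=True)
    cached = _memory_cache.get(key, _MISS)
    # 2) cache miss: call the API and store the result as a JSON string
    if cached is _MISS:
        result = await fake_completion(model=model, messages=messages, **kwargs)
        _memory_cache[key] = json.dumps(result)
        return result
    # 3) cache hit: rebuild the same structure from the stored string
    return json.loads(cached)

async def main():
    msgs = [{"role": "user", "content": "What is caching?"}]
    first = await cached_fake_completion(model="demo-model", messages=msgs, max_tokens=100)
    second = await cached_fake_completion(model="demo-model", messages=msgs, max_tokens=100)
    third = await cached_fake_completion(model="demo-model", messages=msgs, max_tokens=50)
    return first, second, third

# In a notebook cell, use `await main()` instead of asyncio.run(main()).
first, second, third = asyncio.run(main())
print(first["id"], second["id"], third["id"], api_calls)
```

The second call returns the id of the first, and changing any argument (here `max_tokens`) produces a new key and therefore a new call.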


In [69]:
# just for demonstration purposes:
# ChatCompletion.validate(json.loads(completion.json()))

In [70]:
completion = await cached_chat_completion(
    model = model,
    messages = [user("What is caching in software engineering?")],
    max_tokens = 100 # limit the output to save costs; answer might get cut
)
completion

ChatCompletion(id='chatcmpl-ANcqmVfYibrE7NzBAVJcQUnWcWxZk', choices=[Choice(finish_reason='length', index=0, logprobs=None, message=ChatCompletionMessage(content='Caching in software engineering refers to the practice of storing copies of frequently accessed data or computations in a temporary storage location, known as a cache, to improve system performance and efficiency. The basic idea is to reduce the time it takes to access data by keeping a version of that data closer to the place where it is needed, thereby minimizing the need to repeatedly retrieve it from a slower source, such as a database, disk, or external API.\n\n### Key Concepts of Caching:\n\n1. **Cache Types', refusal=None, role='assistant', audio=None, function_call=None, tool_calls=None))], created=1730194248, model='gpt-4o-mini-2024-07-18', object='chat.completion', service_tier=None, system_fingerprint='fp_f59a81427f', usage=CompletionUsage(completion_tokens=100, prompt_tokens=14, total_tokens=114, completion_tokens

In [71]:
# calling it once more
completion = await cached_chat_completion(
    model = model,
    messages = [user("What is caching in software engineering?")],
    max_tokens = 100 # limit the output to save costs; answer might get cut
)
# if the completion comes from the cache, its id should match the one from the first cached call
completion

ChatCompletion(id='chatcmpl-ANcqmVfYibrE7NzBAVJcQUnWcWxZk', choices=[Choice(finish_reason='length', index=0, logprobs=None, message=ChatCompletionMessage(content='Caching in software engineering refers to the practice of storing copies of frequently accessed data or computations in a temporary storage location, known as a cache, to improve system performance and efficiency. The basic idea is to reduce the time it takes to access data by keeping a version of that data closer to the place where it is needed, thereby minimizing the need to repeatedly retrieve it from a slower source, such as a database, disk, or external API.\n\n### Key Concepts of Caching:\n\n1. **Cache Types', refusal=None, role='assistant', audio=None, function_call=None, tool_calls=None))], created=1730194248, model='gpt-4o-mini-2024-07-18', object='chat.completion', service_tier=None, system_fingerprint='fp_f59a81427f', usage=CompletionUsage(completion_tokens=100, prompt_tokens=14, total_tokens=114, completion_tokens

In [72]:
# direct call without cache
completion = await client.chat.completions.create(
    model = model,
    messages = [user("What is caching in software engineering?")],
    max_tokens = 100 # limit the output to save costs; answer might get cut
)
completion # different completion id

ChatCompletion(id='chatcmpl-ANd0hNO9CuzSpXrh5r8fPZyVcf6Bc', choices=[Choice(finish_reason='length', index=0, logprobs=None, message=ChatCompletionMessage(content='Caching in software engineering refers to the technique of storing copies of frequently accessed data or computational results in a temporary storage area, known as a cache, to improve the performance and efficiency of data retrieval operations. By storing this data closer to the location where it will be used (for instance, in memory rather than on disk), caching reduces the time it takes to access the data and can significantly speed up application performance.\n\nHere are some key aspects of caching:\n\n1. **Types of Caches**:\n  ', refusal=None, role='assistant', audio=None, function_call=None, tool_calls=None))], created=1730194863, model='gpt-4o-mini-2024-07-18', object='chat.completion', service_tier=None, system_fingerprint='fp_f59a81427f', usage=CompletionUsage(completion_tokens=100, prompt_tokens=14, total_tokens=11

In [73]:
# let's loop just to make sure
for _ in range(5):
    completion = await cached_chat_completion(
        model = model,
        messages = [user("What is caching in software engineering?")],
        max_tokens = 100 # limit the output to save costs; answer might get cut
    )
    # if the completion comes from the cache, its id should match the one from the first cached call
    print(completion) # same id every time

ChatCompletion(id='chatcmpl-ANcqmVfYibrE7NzBAVJcQUnWcWxZk', choices=[Choice(finish_reason='length', index=0, logprobs=None, message=ChatCompletionMessage(content='Caching in software engineering refers to the practice of storing copies of frequently accessed data or computations in a temporary storage location, known as a cache, to improve system performance and efficiency. The basic idea is to reduce the time it takes to access data by keeping a version of that data closer to the place where it is needed, thereby minimizing the need to repeatedly retrieve it from a slower source, such as a database, disk, or external API.\n\n### Key Concepts of Caching:\n\n1. **Cache Types', refusal=None, role='assistant', audio=None, function_call=None, tool_calls=None))], created=1730194248, model='gpt-4o-mini-2024-07-18', object='chat.completion', service_tier=None, system_fingerprint='fp_f59a81427f', usage=CompletionUsage(completion_tokens=100, prompt_tokens=14, total_tokens=114, completion_tokens