To reduce costs, we can cache the results of LLM calls.
LangChain supports a global cache for these calls.
Before enabling it, verify that personalized answers, which should not be shared between users, do not end up in the cache.
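
One way to keep personalized answers out of a shared cache is to scope the cache key by user. The sketch below only illustrates the idea: `make_cache_key` and its `user_id` parameter are hypothetical names and are not part of LangChain or GPTCache.

```python
import hashlib
from typing import Optional


def make_cache_key(prompt: str, user_id: Optional[str] = None) -> str:
    """Build a cache key that is shared by default and per-user on request.

    Hypothetical helper: pass a user_id whenever the answer may depend on
    who is asking, so that it is never served to a different user.
    """
    scope = f"user:{user_id}" if user_id is not None else "shared"
    # The NUL byte separates scope and prompt so the two cannot run together.
    return hashlib.sha256(f"{scope}\x00{prompt}".encode()).hexdigest()


shared_key = make_cache_key("Hello world")
alice_key = make_cache_key("Hello world", user_id="alice")
bob_key = make_cache_key("Hello world", user_id="bob")
```

The three keys differ, so an answer stored for one user would not be returned to another, while non-personalized prompts still share a single entry.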

In [8]:
%pip install langchain langchain-openai gptcache



The first strategy enables caching at a global level.
A database stores each input prompt together with the generated output.
When an identical request arrives again, the answer is returned from the cache.

In [9]:
from gptcache import Cache
from gptcache.manager.factory import manager_factory
from gptcache.processor.pre import get_prompt
from gptcache.adapter.api import init_similar_cache

from langchain.cache import GPTCache
import langchain
import hashlib


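def get_hashed_name(name):
    """Return a SHA-256 hex digest, used to build a per-model cache directory name."""
    return hashlib.sha256(name.encode()).hexdigest()


def init_gptcache_exact_match(cache_obj: Cache, llm: str):
    """Initialize an exact-match cache: a hit requires the identical prompt."""
    hashed_llm = get_hashed_name(llm)
    cache_obj.init(
        pre_embedding_func=get_prompt,
        data_manager=manager_factory(manager="map", data_dir=f"map_cache_{hashed_llm}"),
    )


def init_gptcache_embeddings_match(cache_obj: Cache, llm: str):
    """Initialize a similarity cache: prompts are matched by embedding similarity."""
    hashed_llm = get_hashed_name(llm)
    init_similar_cache(cache_obj=cache_obj, data_dir=f"similar_cache_{hashed_llm}")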
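
Conceptually, an exact-match cache is a lookup keyed on the model configuration and the prompt. The following is a minimal sketch of that idea in plain Python, not how GPTCache is implemented; `ExactMatchCache` and `fake_llm` are hypothetical names used only for illustration.

```python
import hashlib
from typing import Callable, Dict


class ExactMatchCache:
    """Toy in-memory cache keyed on (llm_string, prompt)."""

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(llm_string: str, prompt: str) -> str:
        # Hash model configuration and prompt together so that different
        # models never share an entry.
        return hashlib.sha256(f"{llm_string}\x00{prompt}".encode()).hexdigest()

    def get_or_call(self, llm_string: str, prompt: str,
                    call_llm: Callable[[str], str]) -> str:
        key = self._key(llm_string, prompt)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        # Cache miss: call the model and remember the answer.
        self.misses += 1
        answer = call_llm(prompt)
        self._store[key] = answer
        return answer


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call."""
    return f"echo: {prompt}"


cache = ExactMatchCache()
first = cache.get_or_call("model-a", "Hello world", fake_llm)   # miss
second = cache.get_or_call("model-a", "Hello world", fake_llm)  # hit
other = cache.get_or_call("model-a", "Hello world :)", fake_llm)  # miss
```

Only the byte-for-byte identical prompt produces a hit; even a small change such as an added smiley is treated as a new request.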



Without a cache, running the same prompt 10 times is slow, because every call goes to the API.

In [10]:

from langchain_openai import OpenAI

llm = OpenAI(temperature=0)
prompt = "Hello world"

# Disable the cache so that every call reaches the API
langchain.llm_cache = None
for i in range(10):
    result = llm.invoke(prompt)
    # print(result)


With caching enabled, the same calls are a lot faster once the cache is warmed up.

In [11]:
langchain.llm_cache = GPTCache(init_gptcache_exact_match)
# Run it once to warm up the cache
result = llm.invoke("Hello world")


In [12]:

# Now run it 10 times; these calls are served from the cache
for i in range(10):
    result = llm.invoke(prompt)
    # print(result)
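
To put a number on the speed-up, the two loops can be timed. The sketch below avoids real API calls by using a hypothetical `slow_llm` stand-in that sleeps briefly, with `functools.lru_cache` playing the role of the cache; the helper `time_calls` is also an assumption made for this example.

```python
import time
from functools import lru_cache


def slow_llm(prompt: str) -> str:
    """Stand-in for a model call; the sleep simulates network latency."""
    time.sleep(0.02)
    return prompt.upper()


# Memoized variant: only the first call for a given prompt pays the latency.
cached_llm = lru_cache(maxsize=None)(slow_llm)


def time_calls(fn, prompt: str, n: int = 10) -> float:
    """Return the wall-clock seconds needed to call fn(prompt) n times."""
    start = time.perf_counter()
    for _ in range(n):
        fn(prompt)
    return time.perf_counter() - start


uncached_seconds = time_calls(slow_llm, "Hello world", n=5)
cached_seconds = time_calls(cached_llm, "Hello world", n=5)
print(f"uncached: {uncached_seconds:.3f}s, cached: {cached_seconds:.3f}s")
```

In a notebook, the `%%time` cell magic at the top of each loop cell gives a comparable reading for the real calls.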

With embeddings we can make this a bit more clever: the cache can return an answer not only for an exact match, but also for a prompt that is similar to one it has seen before.

In [13]:
# Set the caching to use embeddings
langchain.llm_cache = GPTCache(init_gptcache_embeddings_match)

# Now run it once to warm up the cache
result = llm.invoke("Hello world")


In [14]:
# Now run a similar request
similar_prompt = "Hello world :)"
for i in range(10):
    result = llm.invoke(similar_prompt)
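
The similarity lookup can be sketched without any external library. This is a simplified illustration under stated assumptions, not GPTCache's actual implementation: `toy_embed` uses character bigram counts instead of a real embedding model, and `SimilarityCache` and the 0.8 threshold are hypothetical choices.

```python
import math
from collections import Counter
from typing import List, Optional, Tuple


def toy_embed(text: str) -> Counter:
    """Toy 'embedding': counts of lowercase character bigrams."""
    normalized = text.lower()
    return Counter(normalized[i:i + 2] for i in range(len(normalized) - 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class SimilarityCache:
    """Return a stored answer when a prompt is close enough to a cached one."""

    def __init__(self, threshold: float = 0.8) -> None:
        self.threshold = threshold
        self._entries: List[Tuple[Counter, str]] = []

    def store(self, prompt: str, answer: str) -> None:
        self._entries.append((toy_embed(prompt), answer))

    def lookup(self, prompt: str) -> Optional[str]:
        query = toy_embed(prompt)
        best_score, best_answer = 0.0, None
        for embedding, answer in self._entries:
            score = cosine(query, embedding)
            if score > best_score:
                best_score, best_answer = score, answer
        # Only accept the best candidate if it clears the threshold.
        return best_answer if best_score >= self.threshold else None


similar_cache = SimilarityCache()
similar_cache.store("Hello world", "Hi there!")
hit = similar_cache.lookup("Hello world :)")
miss = similar_cache.lookup("Goodbye")
```

The threshold is the main design choice: a lower value saves more calls but raises the risk of returning an answer to a question that only looks similar, which matters most for personalized prompts.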