To reduce costs, we can cache the results of LLM calls.
LangChain lets us enable caching globally, for every call.
Verify that the cache never stores personalized answers, because a cached response may later be served to a different user.
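One way to act on that warning is to screen prompts before they reach the cache. The following is a minimal sketch, not part of LangChain or GPTCache: the patterns and the `is_cacheable` helper are hypothetical and would need tuning for a real application.

```python
import re

# Hypothetical markers of personalized content; tune these for your own application.
PERSONAL_PATTERNS = [
    re.compile(r"\bmy (account|order|address|password)\b", re.IGNORECASE),
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),  # looks like an email address
]


def is_cacheable(prompt: str) -> bool:
    """Return False when the prompt looks user-specific and should bypass the cache."""
    return not any(pattern.search(prompt) for pattern in PERSONAL_PATTERNS)


print(is_cacheable("What is the capital of France?"))
print(is_cacheable("Where is my order?"))
```

Prompts that fail the check would then be sent to the model without going through the shared cache.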

In [1]:
%pip install langchain langchain-openai gptcache

Successfully installed cachetools-5.3.1 gptcache-0.1.40
Note: you may need to restart the kernel to use updated packages.

In the first strategy, we enable caching at a global level.
A database stores each input prompt together with the model's response.
When an identical request arrives later, the response is returned from the cache.
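The idea can be sketched without any library. The toy class below is an illustration only, not how GPTCache is implemented: `ExactMatchCache` and `fake_llm` are hypothetical names, and the store is a plain in-memory dictionary rather than a database.

```python
import hashlib
from typing import Callable, Dict


class ExactMatchCache:
    """Toy in-memory cache keyed by a hash of (model description, prompt)."""

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(llm_string: str, prompt: str) -> str:
        # The NUL separator keeps ("ab", "c") and ("a", "bc") from colliding.
        return hashlib.sha256(f"{llm_string}\x00{prompt}".encode()).hexdigest()

    def get_or_call(self, llm_string: str, prompt: str, call: Callable[[str], str]) -> str:
        key = self._key(llm_string, prompt)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        response = call(prompt)  # only reached on a cache miss
        self._store[key] = response
        return response


def fake_llm(prompt: str) -> str:
    # Hypothetical stand-in for a real (slow, paid) model call.
    return prompt.upper()


cache = ExactMatchCache()
for _ in range(3):
    answer = cache.get_or_call("fake-model", "Hello world", fake_llm)
print(answer, cache.hits, cache.misses)
```

Including the model description in the key means that two differently configured models never share cached answers.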

In [2]:
from gptcache import Cache
from gptcache.manager.factory import manager_factory
from gptcache.processor.pre import get_prompt
from gptcache.adapter.api import init_similar_cache

from langchain.cache import GPTCache
import langchain
import hashlib


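def get_hashed_name(name):
    # Hash the LLM description so each model configuration gets its own cache directory
    return hashlib.sha256(name.encode()).hexdigest()


def init_gptcache_exact_match(cache_obj: Cache, llm: str):
    # Exact-match cache: a prompt is a hit only if it is identical to a stored one
    hashed_llm = get_hashed_name(llm)
    cache_obj.init(
        pre_embedding_func=get_prompt,
        data_manager=manager_factory(manager="map", data_dir=f"map_cache_{hashed_llm}"),
    )


def init_gptcache_embeddings_match(cache_obj: Cache, llm: str):
    # Similarity cache: prompts are compared by embedding, so near-duplicates can also hit
    hashed_llm = get_hashed_name(llm)
    init_similar_cache(cache_obj=cache_obj, data_dir=f"similar_cache_{hashed_llm}")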



start to install package: redis
successfully installed package: redis
start to install package: redis_om
successfully installed package: redis_om


Without a cache, running the same prompt 10 times is slow, because every call goes to the API.

In [3]:

from langchain_openai import OpenAI

llm = OpenAI(temperature=0)
prompt = "Hello world"

# Disable the global cache so that every call hits the API
langchain.llm_cache = None
for i in range(10):
    result = llm.invoke(prompt)
    # print(result)
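To put a number on "slow", the loop can be wrapped in a small timing helper. This is a sketch under assumptions: `time_calls` and `slow_fake_llm` are hypothetical names, and the fake model merely sleeps to simulate network latency.

```python
import time
from typing import Callable


def time_calls(call: Callable[[str], str], prompt: str, n: int = 10) -> float:
    """Invoke call(prompt) n times and return the elapsed wall-clock seconds."""
    start = time.perf_counter()
    for _ in range(n):
        call(prompt)
    return time.perf_counter() - start


def slow_fake_llm(prompt: str) -> str:
    # Hypothetical stand-in that simulates network latency.
    time.sleep(0.01)
    return "Hi!"


elapsed = time_calls(slow_fake_llm, "Hello world")
print(f"10 calls took {elapsed:.3f}s")
```

In the notebook, the same helper could be called as `time_calls(llm.invoke, prompt)` with and without the cache to compare the two timings.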


With caching enabled, repeated calls are much faster once the cache is warmed up.

In [4]:
langchain.llm_cache = GPTCache(init_gptcache_exact_match)
# Run the prompt once to warm up the cache
result = llm.invoke("Hello world")


In [5]:

# Now run it 10 times; identical prompts are served from the cache
for i in range(10):
    result = llm.invoke(prompt)
    # print(result)

With embeddings we can make this a bit more clever: the cache can answer not only exact matches but also prompts that are similar to one it has already seen.
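The principle behind similarity matching can be sketched with a toy embedding. Everything here is an assumption made for illustration: character trigram counts stand in for a real embedding model, a linear scan stands in for a vector index, and the 0.75 threshold is arbitrary.

```python
import math
from collections import Counter
from typing import Dict, List, Optional, Tuple


def embed(text: str) -> Dict[str, int]:
    """Toy embedding: character-trigram counts (a real cache would use a model)."""
    padded = f"  {text.lower()}  "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a: Dict[str, int], b: Dict[str, int]) -> float:
    dot = sum(count * b.get(gram, 0) for gram, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class SimilarityCache:
    def __init__(self, threshold: float = 0.75) -> None:
        self.threshold = threshold  # hypothetical cut-off; tune per use case
        self._entries: List[Tuple[Dict[str, int], str]] = []

    def lookup(self, prompt: str) -> Optional[str]:
        query = embed(prompt)
        # Linear scan for the closest stored prompt
        best = max(self._entries, key=lambda entry: cosine(query, entry[0]), default=None)
        if best is not None and cosine(query, best[0]) >= self.threshold:
            return best[1]
        return None

    def store(self, prompt: str, response: str) -> None:
        self._entries.append((embed(prompt), response))


cache = SimilarityCache()
cache.store("Hello world", "Hi there!")
print(cache.lookup("Hello world :)"))
print(cache.lookup("What is the capital of France?"))
```

The threshold is the main design choice: set too low, unrelated questions receive each other's answers; set too high, the cache behaves like an exact-match cache.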

In [6]:
# Set the caching to use embeddings
langchain.llm_cache = GPTCache(init_gptcache_embeddings_match)

# Now run it once to warm up the cache
result = llm.invoke("Hello world")


start to install package: transformers




successfully installed package: transformers




start to install package: faiss-cpu
successfully installed package: faiss-cpu


In [7]:
# Now run a similar (but not identical) request
similar_prompt = "Hello world :)"
for i in range(10):
    result = llm.invoke(similar_prompt)