# How to cache LLM responses

LangChain provides an optional caching layer for LLMs. This is useful for two reasons:

- It can save you money by reducing the number of API calls you make to the LLM provider, if you often request the same completion multiple times.
- It can speed up your application, because a cached completion is returned without another round trip to the LLM provider.
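To make both benefits concrete, here is a minimal, library-free sketch of what a response cache does. It assumes entries are keyed on the prompt plus a string describing the model settings, which mirrors the `lookup`/`update` shape of LangChain's cache interface as far as we understand it. `TinyCache`, `fake_llm_call`, and `cached_invoke` are hypothetical names used only for illustration and are not part of LangChain.

```python
from typing import Callable, Dict, Optional, Tuple


class TinyCache:
    """Hypothetical in-memory cache keyed on (prompt, model settings)."""

    def __init__(self) -> None:
        self._store: Dict[Tuple[str, str], str] = {}

    def lookup(self, prompt: str, llm_string: str) -> Optional[str]:
        # Return the cached response, or None on a cache miss.
        return self._store.get((prompt, llm_string))

    def update(self, prompt: str, llm_string: str, response: str) -> None:
        # Store the response so later identical requests can reuse it.
        self._store[(prompt, llm_string)] = response


# Counts how many times the (simulated) provider is actually called.
call_counter = {"n": 0}


def fake_llm_call(prompt: str) -> str:
    """Stand-in for a billable, slow API call (an assumption for this demo)."""
    call_counter["n"] += 1
    return f"response to: {prompt}"


def cached_invoke(
    cache: TinyCache,
    prompt: str,
    llm_string: str,
    call: Callable[[str], str] = fake_llm_call,
) -> str:
    hit = cache.lookup(prompt, llm_string)
    if hit is not None:
        return hit  # served from the cache, no API call
    response = call(prompt)
    cache.update(prompt, llm_string, response)
    return response


cache = TinyCache()
first = cached_invoke(cache, "Tell me a joke", "model=demo")
second = cached_invoke(cache, "Tell me a joke", "model=demo")  # cache hit
# Different model settings produce a different key, so this is a miss.
other = cached_invoke(cache, "Tell me a joke", "model=demo,temperature=1")

print(call_counter["n"])
```

Only two of the three requests reach the simulated provider: the repeated request is answered from the dictionary, which is where the cost and latency savings come from.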


In [1]:
%pip install -qU langchain_openai langchain_community

import os
from getpass import getpass

os.environ["OPENAI_API_KEY"] = getpass()
# Please manually enter OpenAI Key

Note: you may need to restart the kernel to use updated packages.



In [2]:
from langchain_core.globals import set_llm_cache
from langchain_openai import OpenAI

# To make the caching really obvious, let's use a slower and older model.
# Caching also works with newer chat models.
llm = OpenAI(model="gpt-3.5-turbo-instruct", n=2, best_of=2)

In [3]:
%%time
from langchain_core.caches import InMemoryCache

set_llm_cache(InMemoryCache())

# The first call is not in the cache yet, so it should take longer
llm.invoke("Tell me a joke")

CPU times: user 23.3 ms, sys: 23 ms, total: 46.3 ms
Wall time: 722 ms


"\n\nWhy don't scientists trust atoms? \n\nBecause they make up everything!"

In [4]:
%%time
# The second call is served from the cache, so it returns much faster
llm.invoke("Tell me a joke")

CPU times: user 220 µs, sys: 94 µs, total: 314 µs
Wall time: 317 µs


"\n\nWhy don't scientists trust atoms? \n\nBecause they make up everything!"
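The `%%time` magic used above only works in IPython and Jupyter. Outside a notebook, the same comparison can be sketched with `time.perf_counter`. The example below is an illustration under stated assumptions: it does not call LangChain or any provider, `slow_completion` is a hypothetical stand-in that sleeps to simulate network latency, and `functools.lru_cache` is swapped in as the cache.

```python
import time
from functools import lru_cache
from typing import Tuple


@lru_cache(maxsize=None)
def slow_completion(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call; the sleep simulates latency."""
    time.sleep(0.05)
    return f"response to: {prompt}"


def timed(prompt: str) -> Tuple[str, float]:
    """Return the result and the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    result = slow_completion(prompt)
    return result, time.perf_counter() - start


first_result, first_s = timed("Tell me a joke")    # cache miss
second_result, second_s = timed("Tell me a joke")  # cache hit

print(f"first call:  {first_s * 1000:.2f} ms")
print(f"second call: {second_s * 1000:.2f} ms")
```

The exact numbers depend on the machine, but the second call should be far faster because it never reaches the simulated provider.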

## SQLite Cache

In [5]:
# Remove any existing cache file; -f avoids an error if it does not exist yet
!rm -f .langchain.db

In [6]:
# We can do the same thing with a SQLite cache
from langchain_community.cache import SQLiteCache

set_llm_cache(SQLiteCache(database_path=".langchain.db"))

In [7]:
%%time
# The first call is not in the cache yet, so it should take longer
llm.invoke("Tell me a joke")

CPU times: user 463 ms, sys: 808 ms, total: 1.27 s
Wall time: 531 ms


"\nWhy couldn't the bicycle stand up by itself? Because it was two-tired."

In [8]:
%%time
# The second call is served from the cache, so it returns faster
llm.invoke("Tell me a joke")

CPU times: user 29.4 ms, sys: 47.1 ms, total: 76.6 ms
Wall time: 76.1 ms


"\nWhy couldn't the bicycle stand up by itself? Because it was two-tired."
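Unlike the in-memory cache, a SQLite-backed cache lives in a file, so cached responses can outlive the Python process. The sketch below illustrates that idea with the standard-library `sqlite3` module only. It is not how `SQLiteCache` is implemented: the `demo_cache` table, its columns, and the `open_cache`, `update`, and `lookup` helpers are assumptions made for this example.

```python
import os
import sqlite3
import tempfile
from typing import Optional


def open_cache(path: str) -> sqlite3.Connection:
    """Open (or create) a hypothetical on-disk cache table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS demo_cache ("
        "prompt TEXT, llm_string TEXT, response TEXT, "
        "PRIMARY KEY (prompt, llm_string))"
    )
    return conn


def update(conn: sqlite3.Connection, prompt: str, llm_string: str, response: str) -> None:
    # The connection context manager commits the write on success.
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO demo_cache VALUES (?, ?, ?)",
            (prompt, llm_string, response),
        )


def lookup(conn: sqlite3.Connection, prompt: str, llm_string: str) -> Optional[str]:
    row = conn.execute(
        "SELECT response FROM demo_cache WHERE prompt = ? AND llm_string = ?",
        (prompt, llm_string),
    ).fetchone()
    return row[0] if row else None


# Use a temporary directory so the demo does not touch .langchain.db.
db_path = os.path.join(tempfile.mkdtemp(), "demo_cache.db")

# "Session" one: store a response, then close the connection.
conn = open_cache(db_path)
update(conn, "Tell me a joke", "model=demo", "a cached joke")
conn.close()

# "Session" two: a fresh connection still finds the stored response.
conn = open_cache(db_path)
restored = lookup(conn, "Tell me a joke", "model=demo")
missing = lookup(conn, "Another prompt", "model=demo")
conn.close()

print(restored, missing)
```

Reading from disk is slower than reading from a dictionary, which is consistent with the cached SQLite call above taking milliseconds rather than microseconds, but it is still much cheaper than a new API call.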