# Leveraging LLMs with Semantic Caching using Redis

This notebook demonstrates how to use caching to avoid repeated calls to Large Language Model (LLM) APIs for queries that have already been answered. We will implement the cache with a Redis Cloud database.

In this demonstration, we will explore two types of caching: **standard caching** and **semantic caching**.

### Standard Caching
In standard caching, a repeated query is served from the cache, which avoids another call to the LLM API. However, two queries that are phrased differently but mean the same thing, such as "What is the capital of France?" and "Which city is the capital of France?", are treated as distinct queries. The LLM is therefore called once for each variation.
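
To make the exact-match behaviour concrete, here is a minimal in-memory sketch. It is not how `RedisCache` is implemented; `ExactMatchCache`, `fake_llm`, and `answer` are hypothetical names used only for illustration, and the "LLM" is a stub that always returns the same string.

```python
import hashlib


class ExactMatchCache:
    """Toy in-memory cache keyed on a hash of the exact prompt text."""

    def __init__(self):
        self._store = {}

    @staticmethod
    def _key(prompt):
        # Any difference in the prompt text produces a different key.
        return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

    def lookup(self, prompt):
        return self._store.get(self._key(prompt))

    def update(self, prompt, response):
        self._store[self._key(prompt)] = response


def fake_llm(prompt):
    # Stand-in for a real (slow, billed) LLM API call.
    return "Paris"


cache = ExactMatchCache()
stats = {"llm_calls": 0}


def answer(prompt):
    cached = cache.lookup(prompt)
    if cached is not None:
        return cached  # cache hit: no LLM call
    stats["llm_calls"] += 1
    response = fake_llm(prompt)
    cache.update(prompt, response)
    return response


answer("What is the capital of France?")        # miss: calls the stub LLM
answer("What is the capital of France?")        # hit: identical text
answer("Which city is the capital of France?")  # miss: same meaning, different text
print(stats["llm_calls"])  # 2
```

The third call misses even though it asks the same question, because the key is derived from the literal text of the prompt.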

### Semantic Caching
Semantic caching, on the other hand, is more sophisticated. It recognizes that queries with similar meanings, even if expressed in different wording, refer to the same request. In the example above, both queries would be understood as asking for the capital of France, and the response would be fetched from the cache rather than querying the LLM again.
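
The idea can be sketched with a toy example, assuming each query has already been turned into a vector. The three-dimensional vectors below are made up for illustration (a real setup would get them from an embedding model), and `ToySemanticCache` is a hypothetical class, not the `RedisSemanticCache` implementation. The default threshold of 0.1 is only an illustrative value.

```python
import math

# Made-up "embeddings": a real system would call an embedding model instead.
TOY_EMBEDDINGS = {
    "What is the capital of France?": [0.90, 0.10, 0.05],
    "Which city is the capital of France?": [0.88, 0.14, 0.06],
    "How tall is Mount Everest?": [0.05, 0.20, 0.95],
}


def cosine_distance(a, b):
    """Return 1 - cosine similarity; 0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


class ToySemanticCache:
    """Toy cache that matches prompts by vector distance, not exact text."""

    def __init__(self, distance_threshold=0.1):
        self.distance_threshold = distance_threshold
        self._entries = []  # list of (embedding, response) pairs

    def lookup(self, prompt):
        query = TOY_EMBEDDINGS[prompt]
        best = None
        # Brute-force scan for the closest stored entry within the threshold.
        for embedding, response in self._entries:
            distance = cosine_distance(query, embedding)
            if distance <= self.distance_threshold:
                if best is None or distance < best[0]:
                    best = (distance, response)
        return None if best is None else best[1]

    def update(self, prompt, response):
        self._entries.append((TOY_EMBEDDINGS[prompt], response))


semantic_cache_demo = ToySemanticCache()
semantic_cache_demo.update("What is the capital of France?", "Paris")

# A differently worded query with a nearby vector is a cache hit.
print(semantic_cache_demo.lookup("Which city is the capital of France?"))
# An unrelated query is far away in vector space, so it is a miss.
print(semantic_cache_demo.lookup("How tall is Mount Everest?"))
```

A looser threshold produces more cache hits but also raises the risk of returning an answer to a question that only looks similar.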

### Prerequisites
To follow along, you will need to create a [free Redis account](https://redis.io/try-free/). Additionally, an OpenAI API key is required for accessing their embeddings model. Note that OpenAI's API is not free, but you can create an account with as little as $5. This project should incur only minimal costs, leaving the remainder of your credit for future use.


In [1]:
# install libraries
!pip install -q langchain-core==1.1.1 langchain-openai==1.1.1 langchain-redis==0.2.5 redis==6.4.0

In [2]:
# import libraries
import os
import time
import redis

from langchain_core.globals import set_llm_cache
from langchain_openai import OpenAI, OpenAIEmbeddings
from langchain_redis import RedisCache, RedisSemanticCache


In [3]:
# Set the OpenAI API key and the Redis credentials.
# This cell works as written only on Google Colab: first add
# REDIS_PASSWORD and OPENAI_API_KEY to Colab's Secrets.
# On any other notebook platform, use the commented-out
# block at the end of this cell instead.
# Replace REDIS_URL with the hostname of your own Redis database.

# if you use Google Colab
from google.colab import userdata
REDIS_URL = "redis-19585.c90.us-east-1-3.ec2.cloud.redislabs.com"
REDIS_PASSWORD = userdata.get("REDIS_PASSWORD")
os.environ["OPENAI_API_KEY"] = userdata.get('OPENAI_API_KEY')

# if you don't use Google Colab
# REDIS_URL = ""
# REDIS_PASSWORD = ""
# os.environ["OPENAI_API_KEY"] = ""
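
Outside Colab, one option is to read the same values from environment variables rather than pasting secrets into the notebook. The sketch below is only a suggestion: `load_settings` is a hypothetical helper, `REDIS_PORT` is an assumed variable name that does not appear elsewhere in this notebook, and 6379 is Redis's default port (a Redis Cloud database will usually use a different one).

```python
import os


def load_settings(env=None):
    """Read Redis settings from environment variables (hypothetical helper)."""
    env = os.environ if env is None else env
    required = ("REDIS_URL", "REDIS_PASSWORD", "OPENAI_API_KEY")
    missing = [name for name in required if not env.get(name)]
    if missing:
        # Fail early with a clear message instead of a connection error later.
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {
        "host": env["REDIS_URL"],
        "port": int(env.get("REDIS_PORT", "6379")),  # assumed variable name
        "password": env["REDIS_PASSWORD"],
    }
```

The returned values can then be passed to `redis.Redis(host=..., port=..., password=...)` in the next cell. `OPENAI_API_KEY` is only checked here, since the OpenAI client reads it from the environment directly.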

### Standard Caching

In [4]:

redis_client = redis.Redis(
    host=REDIS_URL,
    port=19585,
    decode_responses=True,
    username="default",
    password=REDIS_PASSWORD,
)

response = redis_client.ping()
print(response) # Prints True if everything is configured correctly.

True


In [5]:
# set standard cache for the LLM
redis_cache = RedisCache(redis_client=redis_client)
set_llm_cache(redis_cache)
llm = OpenAI(temperature=0.0)

In [6]:
# custom decorator for measuring the execution time
import functools

def timeit(func):
    @functools.wraps(func)  # keep the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        start_time = time.perf_counter()  # monotonic clock intended for timing
        result = func(*args, **kwargs)
        end_time = time.perf_counter()
        print(f"Running function {func.__name__}.")
        print(f"It took {end_time - start_time:.2f} seconds to run.")
        return result
    return wrapper

# utility function to run query & calculate execution time
@timeit
def run_query(query):
    result = llm.invoke(query)
    return result

In [7]:
query1 = "Which city is the capital of France?"
print(run_query(query1))

Running function run_query.
It took 0.53 seconds to run.


Paris


In [8]:
# repeat with same query
query2 = "Which city is the capital of France?"
print(run_query(query2))

Running function run_query.
It took 0.03 seconds to run.


Paris


In [9]:
# repeat with semantically similar query
query3 = "Which is the capital of France?"
print(run_query(query3))

Running function run_query.
It took 0.84 seconds to run.


Paris


When query1 is served, the response is stored in the standard cache. Since query1 and query2 are identical, the cache is used to answer query2, and the response is served faster.

query3 is phrased differently, so the standard cache treats it as different from query1 and query2. A new LLM call is made, and the response takes longer to serve.

Next, we will see how semantic caching treats two differently phrased queries with the same meaning as the same query.

### Semantic Caching

In [10]:
embeddings = OpenAIEmbeddings()
semantic_cache = RedisSemanticCache(
    redis_client=redis_client,
    embeddings=embeddings,
    # distance_threshold=0.1,
)


set_llm_cache(semantic_cache)

In [11]:
# run a new query
query4 = "Who is the only woman to hold the office of the Chancellor of Germany"
print(run_query(query4))

Running function run_query.
It took 1.27 seconds to run.


Angela Merkel is the only woman to hold the office of the Chancellor of Germany. She has been in office since 2005 and is currently serving her fourth term.


In [12]:
# repeat with a semantically similar query
query5 = "What is the name of Germany's only female Chancellor?"
print(run_query(query5))

Running function run_query.
It took 0.23 seconds to run.


Angela Merkel is the only woman to hold the office of the Chancellor of Germany. She has been in office since 2005 and is currently serving her fourth term.


With semantic caching, the meanings of the queries are compared, so query4 and query5 are treated as the same query. The meanings are compared using a similarity search over embeddings, so the response takes longer than with the standard cache.

Note that this demo uses Redis Cloud. Even so, a cache hit is faster than making an API call to the LLM. If Redis were set up locally (on the same premises as the notebook), the response would be even faster.