<a href="https://colab.research.google.com/github/promptmule4real/demo/blob/main/InMemory_Cache_Demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Prompt Cache Techniques Part 1 - InMemory, August 2023
Welcome to the PromptMule CacheCast series of Generative AI cache demos! The series explores caching techniques for Generative AI models that make inference on large language models faster and more efficient. Part 1 covers the in-memory cache: it shows how keeping responses in memory lets a repeated prompt be answered without another call to the model, which reduces response times and resource use and makes interaction with the model smoother.

---

# InMemory Cache Benchmarking Demo

This repository contains a Python script for benchmarking the performance of a cache using a sample function (`llm_langchain`) as a test prompt. The script is designed to be executed on Google Colab notebooks.

## Introduction

The Cache Benchmarking Demo is a Python script that measures how long a specific function (`llm_langchain`) takes to process a given text, with and without cache hits. It shows the time difference between the first execution (a cache miss, which calls the model) and subsequent executions that hit the cache.

The demo utilizes the `timeit` module to calculate the execution time for each iteration. It provides valuable insights into how caching can significantly improve the performance of repetitive operations.
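
As a minimal sketch of how `timeit` is used for this kind of measurement, the snippet below times a stand-in function. `slow_call` is a hypothetical name that is not part of the demo; it sleeps briefly to imitate a slow request. `timeit.timeit` accepts a callable and returns the total seconds taken by `number` calls.

```python
import time
import timeit

def slow_call():
    """Hypothetical stand-in for a slow operation (not part of the demo)."""
    time.sleep(0.05)

# Total seconds for a single call
single = timeit.timeit(slow_call, number=1)

# Total seconds for three calls; divide by the count for a per-call average
runs = 3
average = timeit.timeit(slow_call, number=runs) / runs

print(f"single call: {single:.6f} s")
print(f"average of {runs} calls: {average:.6f} s")
```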

**Note:** Before running the demo, ensure you have an OpenAI API key. Replace `YOUR_OPENAI_API_KEY` in the script with your actual API key.
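
One way to avoid pasting the key into the script is to read it from an environment variable. The helper below is a sketch under that assumption: `load_api_key` is a hypothetical name, and `OPENAI_API_KEY` is the variable name the notebook sets through `os.environ`.

```python
import os

def load_api_key(env=None, name="OPENAI_API_KEY"):
    """Return the API key stored under `name`, or raise if it is missing.

    Hypothetical helper for illustration; `env` defaults to os.environ.
    """
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key or key == "YOUR_OPENAI_API_KEY":
        raise ValueError(f"Set {name} to a real key before running the demo")
    return key

# Usage with an explicit mapping, so the sketch runs without a real key
print(load_api_key({"OPENAI_API_KEY": "sk-example"}))
```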

## Usage on Google Colab

1. Open Google Colab in your web browser: [https://colab.research.google.com/](https://colab.research.google.com/).

2. In Google Colab, click on "File" > "New Notebook" to create a new notebook.

3. In the code cell, paste the following to clone the repository and run the benchmarking demo:

```python
!git clone https://github.com/mrodgers/demo-testing.git
%cd demo-testing
!pip install openai  # Install OpenAI library
!python benchmark.py --api_key=YOUR_OPENAI_API_KEY
```

4. Replace `YOUR_OPENAI_API_KEY` with your OpenAI API key.

5. Click on the "Run" button to execute the code. The benchmarking demo will run, and you will see the output showing the execution times.

## Functionality

### `llm_langchain(text)`

The `llm_langchain` function represents a sample prompt processing function: it takes the given text, processes it, and returns the result. In the notebook cells below, `llm_langchain` is a LangChain `OpenAI` object that sends the text to the model and returns its completion. Replace it with your actual prompt processing logic for accurate benchmarking.
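
For a dry run that spends no API credits, a stand-in with the same call shape can take the place of `llm_langchain`. The sketch below is an assumption for illustration only: `fake_llm` and its `latency` parameter are hypothetical names, and the sleep merely imitates network and model time.

```python
import time

def fake_llm(text, latency=0.05):
    """Hypothetical stand-in for llm_langchain: wait, then return a canned reply."""
    time.sleep(latency)  # imitate the round trip to a hosted model
    return f"(simulated response to: {text})"

print(fake_llm("Which is the best technical skill to learn in 2023?"))
```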

### Benchmarking

The script performs the following steps for benchmarking the cache:

1. First Execution (Cache Miss):
   The script runs the `llm_langchain` function once on a prompt that is not in the cache yet, so the request goes to the model. This time is the baseline without cache hits.

2. Cache Hits (Subsequent Executions):
   After the first execution, the script runs the `llm_langchain` function with the same prompt multiple times (`num_iterations`) to produce cache hits. The executions are timed, and the average execution time is calculated for cache hits.
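
The two steps above can be sketched without any API calls. This is a simplified illustration, not the demo's script: `slow_answer` is a hypothetical function whose sleep imitates model latency, and Python's `functools.lru_cache` plays the role of the in-memory cache.

```python
import functools
import time
import timeit

@functools.lru_cache(maxsize=None)
def slow_answer(prompt):
    """Hypothetical stand-in for an LLM call; results are memoized in memory."""
    time.sleep(0.1)  # imitate model latency on a cache miss
    return f"answer to: {prompt}"

prompt = "Which is the best technical skill to learn in 2023?"
num_iterations = 3

# Step 1: first execution, the prompt is not cached yet
first_run = timeit.timeit(lambda: slow_answer(prompt), number=1)

# Step 2: repeated executions are served from the cache; average per call
cache_time = timeit.timeit(lambda: slow_answer(prompt), number=num_iterations) / num_iterations

print(f"First execution: {first_run:.6f} seconds")
print(f"Cache hit (average): {cache_time:.6f} seconds")
```

Only the first call pays the simulated latency; the later calls return the stored result, so their average time should be far smaller.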

## Contributing

Contributions to this cache benchmarking demo are welcome! If you find any issues, have suggestions, or want to extend the functionality, feel free to create a pull request.

## License

The Cache Benchmarking Demo is open-source software licensed under the [MIT License](LICENSE).

---

# Step 1 - Input your OpenAI API key here

In [None]:
# @title Input OpenAI API Key { run: "auto", vertical-output: true, display-mode: "both" }
#@markdown Input your OpenAI API key here. To obtain a key, sign up on the OpenAI website, provide the necessary information, and create a key at https://platform.openai.com/account/api-keys. The key authenticates your requests to the API.

OPENAI_API_KEY = "sk-YOUR_OPENAI_API_KEY_HERE" #@param {type:"string"}
#@markdown ---


# Let's test our OpenAI setup first...
- First, let's install the OpenAI and LangChain libraries

In [None]:
!pip install openai
!pip install langchain

- Next let's test a basic prompt/response via LangChain...

In [None]:
import os
from langchain.llms import OpenAI
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY
llm_langchain = OpenAI(model_name="text-davinci-003")
text_to_predict = "Which is the best technical skill to learn in 2023?"
print(llm_langchain(text_to_predict))

---
# Let's see what an in-memory cache does...

In [None]:
import langchain
from langchain.llms import OpenAI
from langchain.cache import InMemoryCache

# To make the caching really obvious, let's use a slower model.
llm = OpenAI(model_name="text-davinci-002", n=2, best_of=2)
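# Now, let's initialize the in-memory cache. It is set globally on the langchain
# module, so it also applies to llm_langchain from the previous cell.
langchain.llm_cache = InMemoryCache()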
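
Conceptually, an in-memory cache of this kind is a dictionary that maps a prompt, together with the model settings that produced the answer, to the stored response. The class below is a simplified sketch of that idea and not LangChain's implementation; `TinyPromptCache`, `cached_call`, `model_fn`, and `settings` are hypothetical names.

```python
class TinyPromptCache:
    """Hypothetical dictionary-backed prompt cache (illustration only)."""

    def __init__(self):
        self._store = {}

    def lookup(self, prompt, settings):
        # Returns None on a miss
        return self._store.get((prompt, settings))

    def update(self, prompt, settings, response):
        self._store[(prompt, settings)] = response


def cached_call(cache, model_fn, prompt, settings):
    """Return the cached response if present; otherwise call model_fn and store it."""
    hit = cache.lookup(prompt, settings)
    if hit is not None:
        return hit
    response = model_fn(prompt)
    cache.update(prompt, settings, response)
    return response


calls = []  # records every prompt that actually reaches the "model"

def model_fn(prompt):
    calls.append(prompt)
    return prompt.upper()

cache = TinyPromptCache()
settings = "model=example;n=2"  # the same prompt under other settings is a separate entry
cached_call(cache, model_fn, "hello", settings)
cached_call(cache, model_fn, "hello", settings)
print(len(calls))  # 1: the second call was served from the cache
```

Because the dictionary lives in the process's memory, its contents are lost when the notebook runtime restarts.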

## Generative AI Cache Wall Time Clock (WTC) Benchmark

The benchmark code aims to test the performance of the Generative AI cache technique using a sample function called `llm_langchain`. The code utilizes the `timeit` module to measure the execution time of the `llm_langchain` function with and without cache hits.

### Code Overview

```python
import timeit

# Define the function instruction
def instruction():
    for i in range(num_iterations):
        result = llm_langchain(text_to_predict)  # this is our test prompt function to the llm
        print(f"Iteration #{i + 1}: {result}")

# Perform the first run: one call on an uncached prompt, so it should take the longest time
num_iterations = 1  # must be set before instruction() is first called
first_run = timeit.timeit(instruction, number=1)

# Perform multiple cache hits with the same prompt and average the time per call
num_iterations = 3
cache_time = timeit.timeit(instruction, number=1) / num_iterations

print("--- WTC Benchmark ---")
print(f"Time taken for 1st execution: {first_run:.6f} seconds")
print(f"Time taken for Cache Hit execution (average): {cache_time:.6f} seconds")
```

The `instruction` function calls `llm_langchain` in a loop for a specified number of iterations (`num_iterations`). The first execution runs it once on a prompt that is not cached yet, which measures the initial response time. The code then repeats the same prompt several times so that each call is a cache hit, and averages the execution time.

The output reports the first-execution time and the average cache-hit time side by side, each with a clear label, so the time saved by the cache is easy to read off and compare.

In [None]:
import timeit

text_to_predict = "Which is the best technical skill to learn in 2024?"

# Define the function instruction
def instruction():
    for i in range(num_iterations):
        result = llm_langchain(text_to_predict)  # this is our test prompt function to the llm
        print(f"Iteration #{i + 1}: {result}")

# Perform the first run: one call on an uncached prompt, so it should take the longest time
num_iterations = 1  # with 0 the loop would never call the model
first_run = timeit.timeit(instruction, number=1)

# Perform multiple cache hits with the same prompt and average the time per call
num_iterations = 3
cache_time = timeit.timeit(instruction, number=1) / num_iterations

print("--- WTC Benchmark ---")
print(f"Time taken for 1st execution: {first_run:.6f} seconds")
print(f"Time taken for Cache Hit execution (average): {cache_time:.6f} seconds")
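
To condense the two timings into one figure, the ratio of first-run time to average cache-hit time gives a speedup factor. The snippet is a small sketch: `speedup` is a hypothetical helper, and the numbers passed to it are placeholders rather than measured results.

```python
def speedup(first_run, cache_time):
    """Return how many times faster the cached call was than the first run."""
    if cache_time <= 0:
        raise ValueError("cache_time must be positive")
    return first_run / cache_time

# Placeholder timings in seconds, not measurements from the demo
print(f"Speedup: {speedup(2.0, 0.5):.1f}x")
```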

# Testing of the Cache is complete
Look for additional parts on our GitHub @promptmule or @promptmule4real, or on our YouTube channel @cachecast.

Check out PromptMule's cloud semantic cache at www.promptmule.com