<a href="https://colab.research.google.com/github/jayozer/advanced_llm/blob/main/Module_1a_Advanced_LLMs_semantic_cache_from_scratch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**If you use our code, please cite:**

@misc{2024<br>
  title = {Semantic Cache from Scratch},<br>
  author = {Hamza Farooq, Darshil Modi, Kanwal Mehreen, Nazila Shafiei},<br>
  keywords = {Semantic Cache},<br>
  year = {2024},<br>
  copyright = {APACHE 2.0 license}<br>
}

In [None]:
!pip install -U faiss-cpu sentence_transformers transformers

Collecting faiss-cpu
  Downloading faiss_cpu-1.8.0.post1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (27.0 MB)
Collecting sentence_transformers
  Downloading sentence_transformers-3.0.1-py3-none-any.whl (227 kB)
Collecting transformers
  Downloading transformers-4.42.3-py3-none-any.whl (9.3 MB)
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch>=1.11.0->sentence_transformers)
  Using cached nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (23.7 MB)
Collecting nvidia-cuda-runtime-cu12==12.1.105 (from torch>=1.11.0->sentence_transformers)
  Using cached nvidia_cuda_runtime_cu12-12.1.105-py3-n

In [None]:
import faiss
import sqlite3
from sentence_transformers import SentenceTransformer
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import numpy as np
from pprint import pprint





# Traversaal Ares API Overview

Traversaal Ares API provides real-time search results generated from user queries. Using Large Language Models (LLMs), Ares connects to the internet to return factual information along with relevant URLs for reference. It is built for speed, returning search results within 3-4 seconds. It is currently free during the beta phase, with priced plans coming soon.

## Key Features:
- **Real-time Search Results:** Ares API offers unparalleled speed in generating search results.
- **Internet Connectivity:** Connects to the internet to fetch the latest and most accurate information.
- **Lightning-Fast Response:** Delivers search results with URLs in 3-4 seconds.
- **Free Beta Access:** Free for the first 100 calls during the beta.
- **Factual and Accurate:** Aims to provide accurate information supported by relevant references, though it can still make mistakes.

## Getting Started:
To access the Ares API, sign up at [api.traversaal.ai](https://api.traversaal.ai) and refer to the usage documentation at [docs.traversaal.ai](https://docs.traversaal.ai/docs/intro).

Experience the future of AI-driven search with Traversaal Ares API!


In [None]:
from google.colab import userdata
import requests

def make_prediction(data):
    url = "https://api-ares.traversaal.ai/live/predict"
    headers = {
        "x-api-key": userdata.get('ARES_KEY'),
        "content-type": "application/json"
    }

    payload = {"query": data}

    try:
        # A timeout keeps the call from hanging indefinitely on a stalled request
        response = requests.post(url, json=payload, headers=headers, timeout=30)

        if response.status_code == 200:
            # The request was successful
            print("Request was successful.")
            # If the response contains JSON data, you can parse it using response.json()
            try:
                json_data = response.json()
                #print("Parsed JSON data:", json_data)
                return json_data
            except ValueError:
                print("No JSON data in the response.")
                return None
        else:
            # The request was not successful, handle the error
            print(f"Request failed with status code {response.status_code}.")
            return None
    except requests.exceptions.RequestException as e:
        print(f"Error during request: {e}")
        return None

# Example usage



In [None]:
response=make_prediction(['Events happening in London this week. '])

Request was successful.


In [None]:
response

{'data': {'response_text': 'Here are some events happening in London this week:\n\n1. UEFA Euros: Enjoy the excitement of the UEFA Euros football tournament taking place in London this week. Check out the schedule for matches and find a nearby venue to watch the games.\n\n2. Wimbledon: Experience the world-renowned Wimbledon tennis tournament, happening this week in London. Watch top tennis players compete on the grass courts and soak in the thrilling atmosphere.\n\n3. Barbie at the Design Museum: Visit the Design Museum to explore the exhibition dedicated to the iconic Barbie doll. Discover the history and cultural impact of Barbie through various displays and interactive experiences.\n\n4. The Constituent: Attend the play "The Constituent" at a theater in London this week. Immerse yourself in this thought-provoking production that explores political themes and societal issues.\n\n5. London Film & Comic Con 2024: If you\'re a fan of films, comics, and pop culture, don\'t miss the Lond

In [None]:
pprint(response['data']['response_text'])

('Here are some events happening in London this week:\n'
 '\n'
 '1. UEFA Euros: Enjoy the excitement of the UEFA Euros football tournament '
 'taking place in London this week. Check out the schedule for matches and '
 'find a nearby venue to watch the games.\n'
 '\n'
 '2. Wimbledon: Experience the world-renowned Wimbledon tennis tournament, '
 'happening this week in London. Watch top tennis players compete on the grass '
 'courts and soak in the thrilling atmosphere.\n'
 '\n'
 '3. Barbie at the Design Museum: Visit the Design Museum to explore the '
 'exhibition dedicated to the iconic Barbie doll. Discover the history and '
 'cultural impact of Barbie through various displays and interactive '
 'experiences.\n'
 '\n'
 '4. The Constituent: Attend the play "The Constituent" at a theater in London '
 'this week. Immerse yourself in this thought-provoking production that '
 'explores political themes and societal issues.\n'
 '\n'
 "5. London Film & Comic Con 2024: If you're a fan of fil

In [None]:
response['data']['web_url']

['https://www.timeout.com/london/things-to-do/things-to-do-in-london-this-week',
 'https://www.timeout.com/london',
 'https://www.eventbrite.com/d/united-kingdom--london/events--this-weekend/',
 'https://www.visitlondon.com/things-to-do/whats-on/special-events/london-events-calendar',
 'https://londonist.com/things-to-do-in-london-this-week',
 'https://www.eventbrite.com/d/united-kingdom--london/events--next-week/',
 'https://londontheinside.com/whats-on-this-week/',
 'https://www.londontourism.ca/events',
 'https://www.designmynight.com/london/whats-on/whats-on-this-week-in-london',
 'https://www.londontourism.ca/events/events-this-week']

Instead of using an LLM endpoint, we will use the Ares API for retrieval and generation. You can replace it with your own RAG function in the `generate_answer` method.
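As a rough sketch of that swap: the `generate_answer` method in the class below returns a pair, the raw result and the response text, so a replacement only needs to keep that shape. The function name `my_rag_answer` and the toy in-memory `DOCS` lookup are hypothetical stand-ins for a real retriever and LLM call. They are not part of the Ares API.

```python
# Hypothetical replacement for the Ares call. A real version would query
# your own retriever and LLM instead of this toy keyword lookup.
DOCS = {
    "france": "Paris is the capital of France.",
    "india": "New Delhi is the capital of India.",
}

def my_rag_answer(question: str) -> tuple[dict, str]:
    """Return (raw_result, response_text), the same shape generate_answer returns."""
    q = question.lower()
    hits = [text for key, text in DOCS.items() if key in q]
    response_text = " ".join(hits) if hits else "No answer found."
    # Mirror the Ares payload layout so code reading result['data'] keeps working
    result = {"data": {"response_text": response_text, "web_url": []}}
    return result, response_text

result, text = my_rag_answer("What is the capital of France?")
print(text)
```

Inside the class, `generate_answer` would then call `my_rag_answer(question)` in place of `make_prediction([question])`.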

In [None]:
import faiss # library for efficient similarity search over embeddings; many vector stores are inspired by it or wrap it
import json
import numpy as np
from sentence_transformers import SentenceTransformer
from transformers import AutoTokenizer, AutoModelForCausalLM
import time

class SemanticCaching:
    def __init__(self, json_file='cache.json'):
        # Initialize a flat (exact, brute-force) Faiss index; IndexFlatL2 reports squared L2 distances
        # 768 is the embedding size of all-mpnet-base-v2; for large collections an approximate index such as HNSW is an option
        self.index = faiss.IndexFlatL2(768)
        if self.index.is_trained:
            print('Index trained')

        # Initialize Sentence Transformer model
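        # Once you pick an encoder you must stick with it: cached embeddings are only comparable
        # to embeddings from the same model. Any embedding model works (e.g. OpenAI embeddings)
        # as long as the index dimension matches its output size.
        self.encoder = SentenceTransformer('all-mpnet-base-v2')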


        # Uncomment the following lines to use DialoGPT for answer generation, or plug in your own LLM.
        # self.tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-large")
        # self.model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-large")

        # Set the Euclidean distance threshold: the smaller the distance, the closer the two questions
        self.euclidean_threshold = 0.3
        self.json_file = json_file
        self.load_cache()

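    def load_cache(self):
        # Load cache from JSON file, creating an empty cache if the file is not found
        try:
            with open(self.json_file, 'r') as file:
                self.cache = json.load(file)
        except FileNotFoundError:
            self.cache = {'questions': [], 'embeddings': [], 'answers': [], 'response_text': []}
        # Rebuild the Faiss index from the stored embeddings so that index row ids
        # stay aligned with the cache lists after a restart
        if self.cache['embeddings']:
            self.index.add(np.array(self.cache['embeddings'], dtype='float32'))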

    def save_cache(self):
        # Save the cache to the JSON file
        with open(self.json_file, 'w') as file:
            json.dump(self.cache, file)

    def ask(self, question: str) -> str:
        # Method to retrieve an answer from the cache or generate a new one
        start_time = time.time()
        try:
            embedding = self.encoder.encode([question])  # embed the question; shape (1, 768)

            # Search the Faiss index for the single nearest cached question
            D, I = self.index.search(embedding, 1)

            # I is -1 when the index is empty; D holds the distance (smaller means closer)
            if I[0][0] != -1 and D[0][0] <= self.euclidean_threshold:
                row_id = int(I[0][0])
                # Report 1 - distance as a rough similarity score (higher means more similar)
                print(f'Found cache in row: {row_id} with score {1 - D[0][0]}')
                elapsed_time = time.time() - start_time
                print(f"Time taken: {elapsed_time} seconds")
                return self.cache['response_text'][row_id]

            # Handle the case when there are not enough results or Euclidean distance is not met
            answer, response_text = self.generate_answer(question)

            self.cache['questions'].append(question)
            self.cache['embeddings'].append(embedding[0].tolist())
            self.cache['answers'].append(answer)
            self.cache['response_text'].append(response_text)

            self.index.add(embedding)  # add the question embedding to the Faiss index
            self.save_cache()  # persist questions, embeddings and answers to the JSON file
            end_time = time.time()
            elapsed_time = end_time - start_time
            print(f"Time taken: {elapsed_time} seconds")

            return response_text
        except Exception as e:
            raise RuntimeError(f"Error during 'ask' method: {e}")

    def generate_answer(self, question: str) -> tuple[dict, str]:
        # Generate an answer with a separate function (make_prediction in this case);
        # returns the raw API result together with its response text
        try:
            result = make_prediction([question])
            response_text = result['data']['response_text']

            return result, response_text
        except Exception as e:
            raise RuntimeError(f"Error during 'generate_answer' method: {e}")


In [None]:
cache = SemanticCaching() # initiate a cache class



Index trained



The design keeps two stores: a vector index holding the question embeddings, and a JSON store holding the questions and their answers. The vector index acts as the cache layer. Only the embeddings of the questions are indexed, not the answers. An incoming question is compared against the cached questions, and on a match the answer stored for that question is returned, skipping answer generation entirely.
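The lookup described above can be sketched without Faiss or an encoder. This is a minimal brute-force stand-in, assuming unit-length embeddings and squared L2 distances (which is what Faiss documents for `IndexFlatL2`). The class `ToyQuestionCache`, the helper `unit`, and the 3-dimensional vectors are hypothetical and only for illustration.

```python
import numpy as np

def unit(v):
    """Scale a vector to length 1."""
    v = np.asarray(v, dtype="float32")
    return v / np.linalg.norm(v)

class ToyQuestionCache:
    """Brute-force stand-in for the Faiss index plus the JSON answer store."""

    def __init__(self, threshold=0.3):
        self.threshold = threshold  # maximum squared L2 distance for a hit
        self.vectors = []           # question embeddings (the "index")
        self.answers = []           # answers, aligned with vectors by row id

    def add(self, vector, answer):
        self.vectors.append(unit(vector))
        self.answers.append(answer)

    def lookup(self, vector):
        """Return (row_id, squared_distance, answer), or None on a miss."""
        if not self.vectors:
            return None
        q = unit(vector)
        d2 = np.sum((np.stack(self.vectors) - q) ** 2, axis=1)
        row = int(np.argmin(d2))
        if d2[row] <= self.threshold:
            return row, float(d2[row]), self.answers[row]
        return None

cache_demo = ToyQuestionCache()
cache_demo.add([1.0, 0.0, 0.0], "answer A")
cache_demo.add([0.0, 1.0, 0.0], "answer B")

hit = cache_demo.lookup([0.9, 0.1, 0.0])   # close to the first stored question
miss = cache_demo.lookup([0.0, 0.0, 1.0])  # orthogonal to both stored questions
print(hit[0], hit[2], miss)

# For unit vectors, squared L2 distance equals 2 - 2 * cosine similarity,
# so a threshold of 0.3 corresponds to a cosine similarity of at least 0.85.
cos = float(np.dot(unit([0.9, 0.1, 0.0]), unit([1.0, 0.0, 0.0])))
print(abs(hit[1] - (2 - 2 * cos)) < 1e-4)
```

Under those assumptions, the printed score `1 - distance` is a convenient closeness indicator rather than a cosine similarity. The cosine would be `1 - distance / 2`.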

In [None]:
question1 = "What is the capital of France?"
answer1 = cache.ask(question1)
print(answer1)

# question1 has not been seen before, so the answer is generated by the API and cached

question2 = "Who is the CEO of Apple?"
answer2 = cache.ask(question2)
print(answer2)

# question2 is also new: the question, its embedding and the answer are stored in the cache

question3 = "Who is the CEO of Facebook?"
answer3 = cache.ask(question3)
print(answer3)

# question3 reuses a cached answer only if it falls within the distance threshold of a stored question.
# In this run it did not match question2, so a new answer was generated and cached as row 2.

Request was successful.
Time taken: 8.191238164901733 seconds
The capital of France is Paris. It has been the capital since its liberation in 1944. Paris is also the largest city in France, with an estimated population of 2,102,650 residents as of January 1, 2023. It is located in the north-central part of the country and is situated on the Seine River. Paris is known for its rich history, cultural significance, and as a center for art, fashion, and romance.
Request was successful.
Time taken: 3.7132303714752197 seconds
The CEO of Apple is Tim Cook. He has been serving as the CEO since 2011 and also serves on the company's board of directors. You can find more information about Tim Cook and his role as CEO on the Apple Leadership page [^1^]. Tim Cook has been instrumental in doubling Apple's revenue since taking over as CEO [^2^]. He is widely recognized as one of the world's most valuable company executives [^3^]. Tim Cook's official Twitter account is @tim_cook [^4^]. For more detail

In [None]:
answer4 = cache.ask("What is the Capital of India")
print(answer4)

Request was successful.
Time taken: 5.951639413833618 seconds
The capital of India is New Delhi. It is located in the north-central part of the country, on the west bank of the Yamuna River. New Delhi is a part of the National Capital Territory of Delhi (NCT) and serves as the seat of all three branches of the Government. It is adjacent to and just south of the city of Delhi. New Delhi was built as the eighth city in a series of cities by successive lines of rulers. For more information, you can visit the Wikipedia page on New Delhi [^1^].


In [None]:
answer4 = cache.ask("Can you tell me what is the Capital of India")
print(answer4)

Found cache in row: 3 with score 0.8059848546981812
Time taken: 0.018686294555664062 seconds
The capital of India is New Delhi. It is located in the north-central part of the country, on the west bank of the Yamuna River. New Delhi is a part of the National Capital Territory of Delhi (NCT) and serves as the seat of all three branches of the Government. It is adjacent to and just south of the city of Delhi. New Delhi was built as the eighth city in a series of cities by successive lines of rulers. For more information, you can visit the Wikipedia page on New Delhi [^1^].


In [None]:
print(cache.ask('Who is the CEO of Facebook?'))

Found cache in row: 2 with score 1.0
Time taken: 0.019966602325439453 seconds
The CEO of Facebook is Mark Elliot Zuckerberg. He is an American businessman who co-founded the social media service Facebook and its parent company Meta Platforms (formerly Facebook, Inc.). Mark Zuckerberg currently serves as the chairman, chief executive officer, and controlling shareholder of Meta Platforms. You can find more information about Mark Zuckerberg on his Wikipedia page [^1^]. Additionally, you can also visit his Facebook profile [^2^], Meta Investor Relations [^3^], LinkedIn profile [^4^], Instagram profile [^5^], and Britannica Money page [^6^] for further details.

[1]: https://en.wikipedia.org/wiki/Mark_Zuckerberg
[2]: https://www.facebook.com/zuck/
[3]: https://investor.fb.com/leadership-and-governance/person-details/
[4]: https://www.linkedin.com/in/mark-zuckerberg-618bba58
[5]: https://www.instagram.com/zuck/?hl=en
[6]: https://www.britannica.com/money/Mark-Zuckerberg


In [None]:
print(cache.ask('Who is the current CEO of Google?'))

Request was successful.
Time taken: 5.135895013809204 seconds
The current CEO of Google is Sundar Pichai. He has been the CEO of Google since 2015 and also serves as the CEO of Alphabet Inc., Google's parent company, since 2019. Sundar Pichai has been with Google since 2004 and has played a pivotal role in the development of the company. He is known for his leadership and has been recognized as one of the world's highest-paid executives. You can find more information about Sundar Pichai's career and background on his Wikipedia page [^1^]. He is also active on Instagram [@sundarpichai] and Twitter [@sundarpichai].


In [None]:
print(cache.ask('Is Sundar Pichai the CEO of Google?')) # On the first run this was not judged semantically similar to the previous question, so it went to the API and took longer; this rerun hits the row cached then.

Found cache in row: 5 with score 1.0
Time taken: 0.020261764526367188 seconds
Yes, Sundar Pichai is the CEO of Google. According to Wikipedia, Sundar Pichai, whose full name is Pichai Sundararajan, is an Indian-born American business executive. He has been serving as the chief executive officer (CEO) of Alphabet Inc. and its subsidiary Google. You can find more information about Sundar Pichai on his Wikipedia page at the following link: [Sundar Pichai - Wikipedia](https://en.wikipedia.org/wiki/Sundar_Pichai).

Additionally, Sundar Pichai's LinkedIn profile confirms that he is the CEO of Google and Alphabet. His focus is on organizing the world's information and making it universally accessible and useful, as well as building great products. You can find his LinkedIn profile at the following link: [Sundar Pichai - CEO - Google - LinkedIn](https://www.linkedin.com/in/sundarpichai).

Sundar Pichai is also active on Instagram, where he shares updates and insights. His Instagram handle is @

In [None]:
print(cache.ask('Best local food spots in Edinburgh for a couple?'))

Request was successful.
Time taken: 8.15780258178711 seconds
Here are some of the best local food spots in Edinburgh for a couple:

1. Dine: This European and British restaurant offers a romantic dining experience. (TripAdvisor)

2. The Tollhouse: Another European and British restaurant known for its romantic ambiance. (TripAdvisor)

3. Dine Murrayfield: Located in Murrayfield, this restaurant offers a delightful European and British cuisine. (TripAdvisor)

4. Harajuku Kitchen: A top restaurant in Edinburgh, known for its high-end awards and delicious dining experience. (SquareMeal)

5. The Outsider: Located on George IVth Eleanor, this restaurant offers a great dining experience. (Reddit)

6. Scran and Scallie: Situated in Stockbridge, this restaurant is known for its Scottish cuisine. (Reddit)

7. Papillio: Located in Bruntsfield Place, this restaurant offers a delightful dining experience. (Reddit)

8. Scotsman Hotel & Grand Cafe: Perfect for date night, this restaurant in Edinburgh

In [None]:
print(cache.ask('Best local food spots in Edinburgh?'))

Found cache in row: 6 with score 0.85077765583992
Time taken: 0.01564168930053711 seconds
Here are some of the best local food spots in Edinburgh for a couple:

1. Dine: This European and British restaurant offers a romantic dining experience. (TripAdvisor)

2. The Tollhouse: Another European and British restaurant known for its romantic ambiance. (TripAdvisor)

3. Dine Murrayfield: Located in Murrayfield, this restaurant offers a delightful European and British cuisine. (TripAdvisor)

4. Harajuku Kitchen: A top restaurant in Edinburgh, known for its high-end awards and delicious dining experience. (SquareMeal)

5. The Outsider: Located on George IVth Eleanor, this restaurant offers a great dining experience. (Reddit)

6. Scran and Scallie: Situated in Stockbridge, this restaurant is known for its Scottish cuisine. (Reddit)

7. Papillio: Located in Bruntsfield Place, this restaurant offers a delightful dining experience. (Reddit)

8. Scotsman Hotel & Grand Cafe: Perfect for date night,

In [None]:
print(cache.ask('Best local food spots in London?'))

Request was successful.
Time taken: 5.295650243759155 seconds
Based on the information provided, here are some of the best local food spots in London:

1. Casa Pastor in King's Cross
2. Tayyabs in Whitechapel
3. Oklava in Shoreditch
4. Bright in Hackney
5. Fish, Wings & Tings in Brixton
6. Kudu in Peckham
7. Petersham Nurseries Café and Tearoom in Richmond
8. Maggie Jones's in Kensington

You can find more information about these restaurants and other local food spots in London on the [Visit London website](https://www.visitlondon.com/things-to-do/food-and-drink/restaurant/local-restaurants).

Please note that this information is based on the available sources and may be subject to change.


In [None]:
print(cache.ask('Best local food spots in London?')) # The printed output doubles as basic observability: cache row, score and time taken.

Found cache in row: 7 with score 1.0
Time taken: 0.01632237434387207 seconds
Based on the information provided, here are some of the best local food spots in London:

1. Casa Pastor in King's Cross
2. Tayyabs in Whitechapel
3. Oklava in Shoreditch
4. Bright in Hackney
5. Fish, Wings & Tings in Brixton
6. Kudu in Peckham
7. Petersham Nurseries Café and Tearoom in Richmond
8. Maggie Jones's in Kensington

You can find more information about these restaurants and other local food spots in London on the [Visit London website](https://www.visitlondon.com/things-to-do/food-and-drink/restaurant/local-restaurants).

Please note that this information is based on the available sources and may be subject to change.


One thing to build on top of this is a critic LLM that checks an answer before it is cached, so that only correct answers are stored. The policy for what to keep and when to delete cached entries is up to you.
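One possible shape for such a policy is a critic gate on writes plus time-based eviction on reads. This sketch uses exact-match keys rather than embeddings to stay short. The names `GuardedCache` and `non_empty_critic` are hypothetical, and a real critic would prompt an LLM rather than check for empty text.

```python
import time
from typing import Callable, Optional

class GuardedCache:
    """Exact-match toy cache with a critic gate and time-based eviction."""

    def __init__(self, critic: Callable[[str, str], bool], ttl_seconds: float):
        self.critic = critic            # returns True if the answer looks acceptable
        self.ttl_seconds = ttl_seconds  # entries older than this are dropped
        self.entries = {}               # question -> (answer, stored_at)

    def put(self, question: str, answer: str, now: Optional[float] = None) -> bool:
        """Store the answer only if the critic approves it."""
        if not self.critic(question, answer):
            return False
        self.entries[question] = (answer, time.time() if now is None else now)
        return True

    def get(self, question: str, now: Optional[float] = None) -> Optional[str]:
        """Return a fresh cached answer, evicting it if it has expired."""
        now = time.time() if now is None else now
        entry = self.entries.get(question)
        if entry is None:
            return None
        answer, stored_at = entry
        if now - stored_at > self.ttl_seconds:
            del self.entries[question]
            return None
        return answer

# Hypothetical critic: a real one would ask an LLM to judge the answer.
def non_empty_critic(question: str, answer: str) -> bool:
    return bool(answer.strip())

guarded = GuardedCache(critic=non_empty_critic, ttl_seconds=3600)
print(guarded.put("q1", "a useful answer", now=0))  # stored
print(guarded.put("q2", "   ", now=0))              # rejected by the critic
print(guarded.get("q1", now=10))                    # still fresh
print(guarded.get("q1", now=7200))                  # expired and evicted
```

A time-to-live matters most for questions whose answers go stale, such as "Events happening in London this week".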

Where does this live? In the storage layer of an enterprise RAG system: add the semantic cache right in storage. When a question reaches the encoder, pass the encoded question to the semantic cache and check whether a similar question is already there.
If the cache is hit, return the cached response.


Semantic caching: there is a Hugging Face article on this, written by his student. It is at most three months old as of 2024-07-06.
