<a href="https://colab.research.google.com/github/hamzafarooq/multi-agent-course/blob/main/Module_3_Agentic_RAG/Agentic_RAG_with_Semantic_Cache.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Agentic RAG with Semantic Cache

This notebook combines two techniques:

- **Semantic Cache**: A FAISS-backed cache that stores the embeddings of previous queries alongside their answers. When a new query is semantically similar to a cached one, the stored answer is returned immediately, with no LLM or API call.
- **Agentic RAG**: A retrieval system that routes each query to the appropriate knowledge source: OpenAI documentation (via Qdrant), 10-K financial filings (via Qdrant), or live internet search (via SerpApi).

## Architecture

```
User Query
    │
    ▼
┌─────────────────────────────┐
│  Is query time-sensitive?   │  ──YES──▶  Agentic RAG (no caching)
│  (current events, "today",  │
│   live data, etc.)          │
└──────────────┬──────────────┘
               │ NO
               ▼
┌─────────────────────────────┐
│   Semantic Cache Lookup     │  ──HIT──▶  Return cached answer ⚡
│   (FAISS similarity search) │
└──────────────┬──────────────┘
               │ MISS
               ▼
┌─────────────────────────────┐
│      Agentic RAG Router     │
│   (GPT-4o classifies query) │
└──────┬──────────┬───────────┘
       │          │           │
  OPENAI      10K_DOC    INTERNET
  QUERY       QUERY       QUERY
    │            │            │
  Qdrant      Qdrant      SerpApi
  (RAG)       (RAG)      (live web)
       │          │           │
       └──────────┴───────────┘
               │
               ▼
    Store answer in cache 💾
               │
               ▼
          Return answer
```
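
The flow in the diagram can be sketched in a few lines of Python. This is a minimal illustration, not the code in `rag_helpers.py`: `ToyCache`, `answer_with_cache`, and `fake_pipeline` are hypothetical names, the keyword list is assumed, and exact string matching stands in for the FAISS similarity search.

```python
from typing import Callable, Optional


class ToyCache:
    """Hypothetical stand-in for the semantic cache (exact-match only)."""

    TEMPORAL_WORDS = ("today", "current", "latest", "this week")  # assumed keywords

    def __init__(self) -> None:
        self._store: dict[str, str] = {}

    def is_time_sensitive(self, query: str) -> bool:
        q = query.lower()
        return any(word in q for word in self.TEMPORAL_WORDS)

    def lookup(self, query: str) -> Optional[str]:
        return self._store.get(query.lower().strip())

    def add(self, query: str, answer: str) -> None:
        self._store[query.lower().strip()] = answer


def answer_with_cache(query: str, cache: ToyCache,
                      run_pipeline: Callable[[str], str]) -> tuple[str, str]:
    """Return (path, answer), where path is 'bypass', 'hit', or 'miss'."""
    if cache.is_time_sensitive(query):
        return "bypass", run_pipeline(query)   # never read or write the cache
    cached = cache.lookup(query)
    if cached is not None:
        return "hit", cached                   # no pipeline call
    answer = run_pipeline(query)
    cache.add(query, answer)                   # store for future queries
    return "miss", answer


calls = []  # records every query that reaches the (fake) pipeline


def fake_pipeline(q: str) -> str:
    calls.append(q)
    return f"answer to: {q}"


cache = ToyCache()
print(answer_with_cache("What was Uber's revenue in 2021?", cache, fake_pipeline)[0])    # miss
print(answer_with_cache("What was Uber's revenue in 2021?", cache, fake_pipeline)[0])    # hit
print(answer_with_cache("What is the current price of AAPL?", cache, fake_pipeline)[0])  # bypass
```

Only the miss and the bypass reach the pipeline, so `calls` ends up with two entries; the hit is served from the dictionary alone.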

## Why combine them?

- **Speed**: Cached answers return in milliseconds vs. 2–5 seconds for full RAG.
- **Cost**: Fewer LLM and API calls for repeated or similar questions.
- **Correctness**: Time-sensitive queries (e.g., *"What happened today?"*) always bypass the cache to ensure fresh answers.

## 1. Setup

Install dependencies, clone the course repository (which contains `rag_helpers.py` and the pre-built Qdrant vector database), and import the helper module.

In [1]:
!pip install -U faiss-cpu sentence_transformers transformers openai qdrant_client python-dotenv nest_asyncio -q

import os, sys, subprocess, nest_asyncio

# ── Colab: clone the course repo if not already present ──────────────────
try:
    import google.colab
    _REPO = "/content/multi-agent-course"
    if not os.path.exists(_REPO):
        # Clone into _REPO explicitly so the path check above stays valid
        # regardless of the current working directory.
        subprocess.run(
            ["git", "clone", "https://github.com/hamzafarooq/multi-agent-course.git", _REPO],
            check=True,
        )
        print("Repository cloned ✅")
    else:
        print("Repository already present ✅")
    _MODULE_DIR = f"{_REPO}/Module_3_Agentic_RAG"
except ImportError:
    # Running locally: rag_helpers.py lives in Module_3_Agentic_RAG/, so start
    # Jupyter from that directory (notebooks do not define __file__).
    _MODULE_DIR = os.getcwd()
    print(f"Running locally - helpers path: {_MODULE_DIR}")

sys.path.insert(0, _MODULE_DIR)
nest_asyncio.apply()  # Required for asyncio.run() inside Jupyter/Colab

from rag_helpers import init_rag, SemanticCaching, agentic_rag_with_cache
print("✅ Helpers imported from rag_helpers.py")

Repository cloned ✅
✅ Helpers imported from rag_helpers.py


## 2. API Keys

**On Google Colab**, store the keys in the Secrets panel (`🔑` icon, left sidebar):

| Secret name | Where to get it |
|---|---|
| `SERP_API_KEY` | [serpapi.com](https://serpapi.com) |
| `OPENAI_API_KEY` | [platform.openai.com](https://platform.openai.com) |

**When running locally**, add the keys to `Module_3_Agentic_RAG/.env`:
```
serp_api_key=<your_key>
openai_api_key=<your_key>
```
The cell below detects the environment automatically and loads from the right source.
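
A missing key otherwise surfaces only later, as an API error inside the pipeline. One way to fail fast is a small lookup helper that accepts several variable names, mirroring the lowercase/uppercase fallback used for the local `.env` case. The sketch below is illustrative only: `require_key` is a hypothetical helper, not part of `rag_helpers.py`, and the demo value is a placeholder rather than a real key.

```python
import os


def require_key(*names: str) -> str:
    """Return the first non-empty environment variable among `names`.

    Raises RuntimeError if none is set, so a missing key fails at load
    time instead of surfacing later as an opaque API error.
    """
    for name in names:
        value = os.getenv(name)
        if value:
            return value
    raise RuntimeError(f"Missing API key: set one of {', '.join(names)}")


# Demo with a placeholder value (not a real key)
os.environ["SERP_API_KEY"] = "demo-value"
print(require_key("serp_api_key", "SERP_API_KEY"))
```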

In [2]:
# ── Load API keys ─────────────────────────────────────────────────────────
try:
    from google.colab import userdata
    serp_api_key   = userdata.get('SERP_API_KEY')
    openai_api_key = userdata.get('OPENAI_API_KEY')
    QDRANT_PATH    = f"{_REPO}/Module_3_Agentic_RAG/Agentic_RAG/qdrant_data"
    print("Colab: credentials loaded from Secrets.")
except ImportError:
    from dotenv import load_dotenv, find_dotenv
    load_dotenv(find_dotenv())
    serp_api_key   = os.getenv("serp_api_key")   or os.getenv("SERP_API_KEY")
    openai_api_key = os.getenv("openai_api_key") or os.getenv("OPENAI_API_KEY")
    QDRANT_PATH    = os.path.join(_MODULE_DIR, "Agentic_RAG", "qdrant_data")
    print("Local: credentials loaded from .env.")

print(f"SerpApi key:    {'✅' if serp_api_key else '❌ MISSING'}")
print(f"OpenAI API key: {'✅' if openai_api_key else '❌ MISSING'}")

# ── Initialise the RAG pipeline (loads models + connects to Qdrant) ───────
init_rag(openai_api_key=openai_api_key, serp_api_key=serp_api_key, qdrant_path=QDRANT_PATH)

Colab: credentials loaded from Secrets.
SerpApi key:    ✅
OpenAI API key: ✅
Loading Nomic text model for Qdrant retrieval embeddings...

A new version of the following files was downloaded from https://huggingface.co/nomic-ai/nomic-bert-2048:
- configuration_hf_nomic_bert.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.

A new version of the following files was downloaded from https://huggingface.co/nomic-ai/nomic-bert-2048:
- modeling_hf_nomic_bert.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.

✅ RAG pipeline ready.


## 3. Create the Semantic Cache

`SemanticCaching` is defined in `rag_helpers.py`. It provides:

| Method | Purpose |
|---|---|
| `is_time_sensitive(q)` | Returns `True` for questions with temporal keywords; these always bypass the cache |
| `check_cache(q)` | Embeds the query and runs a FAISS nearest-neighbour search; returns hit/miss + pre-computed embedding |
| `add_to_cache(q, answer, embedding)` | Persists a new entry to FAISS + JSON after a RAG call |
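
The keyword list that `is_time_sensitive` actually uses lives in `rag_helpers.py` and is not reproduced here. As a rough sketch of the idea, assuming a hand-picked set of patterns (`TEMPORAL_PATTERNS` is hypothetical), a whole-word keyword check could look like this:

```python
import re

# Hypothetical keyword list; the real one in rag_helpers.py may differ.
TEMPORAL_PATTERNS = [
    r"\btoday\b", r"\btonight\b", r"\bright now\b", r"\bcurrent(ly)?\b",
    r"\blatest\b", r"\bthis (week|month|year)\b", r"\bbreaking\b",
]
_TEMPORAL_RE = re.compile("|".join(TEMPORAL_PATTERNS), re.IGNORECASE)


def is_time_sensitive(query: str) -> bool:
    """True if the query contains a temporal keyword as a whole word."""
    return _TEMPORAL_RE.search(query) is not None


for q in [
    "What are the best AI tools this week?",
    "What is the current stock price of Apple?",
    "What was Uber's revenue in 2021?",
]:
    print(f"{is_time_sensitive(q)!s:5}  {q}")
```

Word boundaries (`\b`) keep a word such as "concurrent" from matching the "current" pattern. A keyword check is cheap but coarse: a question about a past year is not flagged, while any phrasing outside the list slips through to the cache.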

**Similarity threshold** (`threshold=0.2`): a cached entry counts as a hit when its Euclidean distance to the query embedding is ≤ 0.2. Lower values mean stricter matching. Try `0.1` for near-exact matches or `0.35` for a higher hit rate.
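
To make the threshold concrete, the sketch below applies the same hit rule with a brute-force nearest-neighbour search in plain Python instead of FAISS. The three-dimensional vectors are made up for illustration; real embeddings come from the Nomic model and have far more dimensions. `nearest` and `lookup` are hypothetical names, not functions from `rag_helpers.py`.

```python
import math
from typing import Optional

# Made-up "embeddings" of questions that are already cached.
cached = {
    "What was Uber's revenue in 2021?": (1.0, 0.0, 0.0),
    "How do I build an agent?": (0.0, 1.0, 0.0),
}


def nearest(query_vec: tuple[float, ...]) -> tuple[str, float]:
    """Brute-force nearest neighbour by Euclidean (L2) distance."""
    return min(
        ((q, math.dist(query_vec, v)) for q, v in cached.items()),
        key=lambda pair: pair[1],
    )


def lookup(query_vec: tuple[float, ...], threshold: float = 0.2) -> Optional[str]:
    """Return the matched cached question on a hit, or None on a miss."""
    question, distance = nearest(query_vec)
    return question if distance <= threshold else None


paraphrase = (0.99, 0.14, 0.0)             # distance to the Uber entry is about 0.14
print(lookup(paraphrase, threshold=0.2))   # hit
print(lookup(paraphrase, threshold=0.1))   # miss: stricter threshold
```

One detail worth checking when tuning: `faiss.IndexFlatL2` reports squared L2 distances. If the cache uses that index type, confirm in `rag_helpers.py` whether the threshold is compared against the squared or the unsquared value.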

In [6]:
# Instantiate the semantic cache
# Set clear_on_init=True to wipe any previously stored entries
cache = SemanticCaching(json_file='rag_cache.json', threshold=0.2, clear_on_init=True)

Loading Nomic embedding model for semantic cache...




Cache embedding model ready.
Semantic cache cleared.


## 4. Agentic RAG Pipeline (from `rag_helpers.py`)

The pipeline components are all defined in `rag_helpers.py`; see the file for the full implementations.

| Function | What it does |
|---|---|
| `get_internet_content(query)` | Live Google search via SerpApi |
| `route_query(query)` | GPT-4o classifies into `OPENAI_QUERY`, `10K_DOCUMENT_QUERY`, or `INTERNET_QUERY` |
| `_retrieve_and_respond(query, action)` | Embeds query → searches the right Qdrant collection → generates a cited RAG answer |
| `_run_rag_pipeline(query)` | Orchestrates routing + handler dispatch; returns the answer string |
| **`agentic_rag_with_cache(query, cache)`** | **Main entry point**: applies the cache layer on top of the full pipeline |

**Qdrant collections loaded by `init_rag()`:**
- `opnai_data`: OpenAI Agents documentation
- `10k_data`: Uber 2021 & Lyft 2024 10-K filings
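
The routing and dispatch step can be pictured as a lookup from route label to handler. The sketch below is an illustration under stated assumptions, not the `rag_helpers.py` implementation: a keyword heuristic (`classify`, hypothetical) replaces the GPT-4o classifier, and the handlers are stubs that only report which backend they would call.

```python
from typing import Callable

# Route labels as listed in the function table above.
OPENAI_QUERY = "OPENAI_QUERY"
TENK_QUERY = "10K_DOCUMENT_QUERY"
INTERNET_QUERY = "INTERNET_QUERY"


def classify(query: str) -> str:
    """Hypothetical keyword stand-in for the GPT-4o router."""
    q = query.lower()
    if "openai" in q or "agents sdk" in q:
        return OPENAI_QUERY
    if any(word in q for word in ("10-k", "revenue", "uber", "lyft")):
        return TENK_QUERY
    return INTERNET_QUERY  # fallback for everything else


# Stub handlers; the real ones query Qdrant or SerpApi.
HANDLERS: dict[str, Callable[[str], str]] = {
    OPENAI_QUERY: lambda q: f"[Qdrant: opnai_data] {q}",
    TENK_QUERY: lambda q: f"[Qdrant: 10k_data] {q}",
    INTERNET_QUERY: lambda q: f"[SerpApi] {q}",
}


def run_pipeline(query: str) -> str:
    """Classify the query, then dispatch to the matching handler."""
    return HANDLERS[classify(query)](query)


print(run_pipeline("What was Uber's revenue in 2021?"))
print(run_pipeline("How do I build an agent with the OpenAI Agents SDK?"))
print(run_pipeline("What are the most popular open-source LLMs?"))
```

Keeping the handlers in a dictionary means a new knowledge source needs only a new label and one new entry, while the dispatch function stays unchanged.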

## 5. Demo: Semantic Cache + Agentic RAG in Action

`agentic_rag_with_cache(query, cache)` is the single function to call. It handles all routing, retrieval, caching, and display automatically.

### Test queries: three cache paths

| Query | Expected path |
|---|---|
| *"What was Uber's revenue in 2021?"* | Cache MISS → 10K RAG → stored |
| *"How much did Uber earn in fiscal year 2021?"* | Cache HIT (semantically similar) |
| *"How do I build an agent with the OpenAI Agents SDK?"* | Cache MISS → OpenAI RAG → stored |
| *"What are the best AI tools this week?"* | Time-sensitive → bypass cache → SerpApi |
| *"What is the current stock price of Apple?"* | Time-sensitive → bypass cache → SerpApi |
| *"What are the most popular open-source LLMs?"* | Cache MISS → INTERNET → SerpApi → stored |
| *"Which open-source large language models are most widely used?"* | Cache HIT expected (paraphrase of the previous query); the recorded run was a MISS at `threshold=0.2` |

In [8]:
# Test 1: Cache MISS - routes to 10K_DOCUMENT_QUERY and stores the result
result = agentic_rag_with_cache("What was Uber's revenue in 2021?", cache)

👤 Query: What was Uber's revenue in 2021?

❌ Cache MISS - running Agentic RAG pipeline...

📍 Route: 10K_DOCUMENT_QUERY  |  Query about company's financial data

💾 Cached for future similar queries.

🤖 Response:
Uber's revenue in 2021 was $3,208,323,000, or approximately $3.2 billion [1][2].



In [10]:
# Test 2: Cache HIT - semantically similar to Test 1, returns instantly from cache
result = agentic_rag_with_cache("How much did Uber earn in fiscal year 2021?", cache)

👤 Query: How much did Uber earn in fiscal year 2021?

✅ Cache HIT (row 0, similarity: 0.838, 0.124s)

🤖 Response (cached):
Uber's revenue in 2021 was $3,208,323,000, or approximately $3.2 billion [1][2].



In [11]:
# Test 3: Cache MISS - routes to OPENAI_QUERY and stores result
result = agentic_rag_with_cache("How do I build an agent with the OpenAI Agents SDK?", cache)

👤 Query: How do I build an agent with the OpenAI Agents SDK?

❌ Cache MISS - running Agentic RAG pipeline...

📍 Route: OPENAI_QUERY  |  The query is about building an agent using OpenAI's SDK, which relates to OpenAI documentation.

💾 Cached for future similar queries.

🤖 Response:
To build an agent with the OpenAI Agents SDK, follow these steps using the fundamental components of an agent: 

1. **Model**: Choose a large language model (LLM) that will handle reasoning and decision-making for the agent.

2. **Tools**: Define external functions or APIs that the agent can use to take action. Code example: 
   ```python
   weather_agent = Agent(
       name="Weather agent",
       instructions="You are a helpful agent who can talk to users about the weather.",
       tools=[get_weather],
   )
   ```

3. **Instructions**: Provide explicit guidelines and guardrails for how the agent should behave.

Initially, you might cons

In [12]:
# Test 4: Time-sensitive query - BYPASSES cache, calls SerpApi for a live answer
result = agentic_rag_with_cache("What are the best AI tools this week?", cache)

👤 Query: What are the best AI tools this week?

⏰ Time-sensitive - bypassing cache for a fresh answer.

📍 Route: INTERNET_QUERY  |  This asks for current AI trends.
Getting your response from the internet 🌐 ...

🤖 Response (live):
[1] I tried 70+ best AI tools in 2026
    I went deep into each tool, from image generation to email automation, chatbot building to scheduling assistants.
    Source: https://www.techradar.com/best/best-ai-tools

[2] The best AI productivity tools in 2026
    The list you're about to see contains a collection of great AI productivity tools tested by Zapier's app review team, myself included.
    Source: https://zapier.com/blog/best-ai-productivity-tools/

[3] The 12 Best AI Tools for 2026 (That People Actually Use)
    The 12 Best AI Tools for 2026 (That People Actually Use) · 1. ChatGPT · 2. Gemini · 3. Veo · 4. Claude · 5. Grok · 6. NotebookLM · 7. Lovable.
    Source: https://www.synthesia

In [13]:
# Test 5: Time-sensitive query - stock price, always fetched live
result = agentic_rag_with_cache("What is the current stock price of Apple?", cache)

👤 Query: What is the current stock price of Apple?

⏰ Time-sensitive - bypassing cache for a fresh answer.

📍 Route: INTERNET_QUERY  |  Real-time data request
Getting your response from the internet 🌐 ...

🤖 Response (live):
[1] Stock Price - Apple - Investor Relations
    Stock Quote: NASDAQ: AAPL ; Day's Open262.60 ; Closing Price260.58 ; Volume30.8 ; Intraday High264.48 ; Intraday Low260.05.
    Source: https://investor.apple.com/stock-price/default.aspx

[2] AAPL: Apple Inc - Stock Price, Quote and News
    Apple Inc AAPL:NASDAQ ; Close. 264.58 quote price arrow up +4.00 (+1.54%) ; Volume. 36,424,718 ; 52 week range. 169.21 - 288.62.
    Source: https://www.cnbc.com/quotes/AAPL

[3] Apple Inc. Stock Quote (U.S.: Nasdaq) - AAPL
    264.49 ; Volume: 42.07M · 65 Day Avg: 48.26M ; 258.16 Day Range 264.75 ; 169.21 52 Week Range 288.62 ...
    Source: https://www.marketwatch.com/investing/stock/aapl?gaa_at=eafs&gaa_n=AWEtsqeMzP

In [14]:
# Test 6: Cache MISS - INTERNET_QUERY, stored after the SerpApi call
result = agentic_rag_with_cache("What are the most popular open-source LLMs?", cache)

👤 Query: What are the most popular open-source LLMs?

❌ Cache MISS - running Agentic RAG pipeline...

📍 Route: INTERNET_QUERY  |  Query not specific to OpenAI.
Getting your response from the internet 🌐 ...

💾 Cached for future similar queries.

🤖 Response:
[1] Open LLM Leaderboard 2025
    This LLM leaderboard displays the latest public benchmark performance for SOTA open-sourced model versions released after April 2024.
    Source: https://www.vellum.ai/open-llm-leaderboard

[2] Top 10 open source LLMs for 2025
    Top open source LLMs in 2024 · 1. LLaMA 3 · 2. Google Gemma 2 · 3. Command R+ · 4. Mistral-8x22b · 5. Falcon 2 · 6. Grok 1.5 · 7. Qwen1.5 · 8. BLOOM.
    Source: https://www.instaclustr.com/education/open-source-ai/top-10-open-source-llms-for-2025/

[3] Best Open-source AI models? : r/LocalLLM
    I think Deepseek and Qwen are the way to go for most of them, Janus 7b or stable diffusion or Lumin

In [15]:
# Test 7: Paraphrase of Test 6 - a Cache HIT is expected, but the recorded run below was a MISS at threshold=0.2
result = agentic_rag_with_cache("Which open-source large language models are most widely used?", cache)

👤 Query: Which open-source large language models are most widely used?

❌ Cache MISS - running Agentic RAG pipeline...

📍 Route: INTERNET_QUERY  |  Query about general LLMs, not specific to OpenAI.
Getting your response from the internet 🌐 ...

💾 Cached for future similar queries.

🤖 Response:
[1] Top 10 open source LLMs for 2025
    Unlike proprietary models developed by companies like OpenAI and Google, open source LLMs are licensed to be freely used, modified, and distributed by anyone.
    Source: https://www.instaclustr.com/education/open-source-ai/top-10-open-source-llms-for-2025/

[2] The best open source large language model
    The largest open-source models, DeepSeek-V3 and DeepSeek-R1, match GPT-4o and o1-pro, respectively. Newer open source LLMs like Nemotron Llama ...
    Source: https://www.baseten.co/blog/the-best-open-source-large-language-model/

[3] 9 Top Open-Source LLMs for 2026 and Their Uses
 

## 6. Inspect the Cache

View all entries currently stored in the semantic cache.

In [16]:
print(f"Total cached entries: {len(cache.cache['questions'])}")
print(f"FAISS index size: {cache.index.ntotal}\n")

for i, (q, a) in enumerate(zip(cache.cache['questions'], cache.cache['response_text'])):
    print(f"[{i}] Q: {q}")
    print(f"    A: {a[:120]}...\n" if len(a) > 120 else f"    A: {a}\n")

Total cached entries: 4
FAISS index size: 4

[0] Q: What was Uber's revenue in 2021?
    A: Uber's revenue in 2021 was $3,208,323,000, or approximately $3.2 billion [1][2].

[1] Q: How do I build an agent with the OpenAI Agents SDK?
    A: To build an agent with the OpenAI Agents SDK, follow these steps using the fundamental components of an agent: 

1. **Mo...

[2] Q: What are the most popular open-source LLMs?
    A: [1] Open LLM Leaderboard 2025
    This LLM leaderboard displays the latest public benchmark performance for SOTA open-so...

[3] Q: Which open-source large language models are most widely used?
    A: [1] Top 10 open source LLMs for 2025
    Unlike proprietary models developed by companies like OpenAI and Google, open s...



## Assignment: Extend the System

Try one or more of these extensions:

1. **Adjustable similarity threshold**: Experiment with `threshold=0.1` (stricter) vs. `threshold=0.35` (looser). How does it affect hit rate and answer quality?

2. **Cache TTL (Time-To-Live)**: Add an expiry timestamp to each cache entry. Stale entries (e.g., older than 7 days) should be evicted and re-fetched.

3. **Sub-query division**: Before checking the cache, use a GPT call to split compound questions (e.g., *"What was Uber and Lyft revenue in 2021?"*) into sub-queries. Check and populate the cache per sub-query.

4. **Cache analytics**: Track and display cache hit rate, average latency for hits vs. misses, and the most-queried topics over a session.
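
As a possible starting point for extension 2, the sketch below stores a timestamp with each entry and treats entries older than the TTL as misses. `TTLStore` is a hypothetical exact-match dictionary wrapper, not part of `rag_helpers.py`; wiring the same idea into `SemanticCaching` would also require removing the expired vector from the FAISS index, which is not shown.

```python
import time
from typing import Callable, Optional

SEVEN_DAYS = 7 * 24 * 60 * 60  # TTL in seconds


class TTLStore:
    """Hypothetical exact-match store whose entries expire after `ttl` seconds."""

    def __init__(self, ttl: float = SEVEN_DAYS,
                 clock: Callable[[], float] = time.time) -> None:
        self.ttl = ttl
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self._entries: dict[str, tuple[str, float]] = {}

    def add(self, question: str, answer: str) -> None:
        self._entries[question] = (answer, self.clock())

    def get(self, question: str) -> Optional[str]:
        entry = self._entries.get(question)
        if entry is None:
            return None
        answer, stored_at = entry
        if self.clock() - stored_at > self.ttl:
            del self._entries[question]  # evict the stale entry
            return None                  # caller should re-fetch and re-add
        return answer


now = [0.0]                               # fake clock for the demo
store = TTLStore(ttl=SEVEN_DAYS, clock=lambda: now[0])
store.add("q", "a")
now[0] = 6 * 24 * 60 * 60                 # six days later
print(store.get("q"))                     # still fresh
now[0] = 8 * 24 * 60 * 60                 # eight days later
print(store.get("q"))                     # expired and evicted
```

Evicting lazily on read keeps the sketch short; a periodic sweep would be needed if stale entries must not occupy memory between reads.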