# Hybrid RAG for Large JSON (Hotel Availability Use Case)

This notebook demonstrates how a **RAG system can handle very large structured JSON data** using:

- Metadata-first filtering
- Semantic (vector) retrieval
- Causal language models from Hugging Face
- Mistral-7B-Instruct and Llama 3.2 3B for final natural language generation

**Scenario:**
We simulate **50,000 hotels** and answer user queries like:
> "I need a hotel available at night with a budget around $100–$200"

This is a **production-style hybrid RAG pipeline**, not naive document RAG.

## 1️⃣ Install Dependencies

In [1]:
!pip install transformers sentence-transformers numpy



## 2️⃣ Import Libraries

In [2]:
import random
import numpy as np
from typing import List, Dict
from sentence_transformers import SentenceTransformer
from transformers import pipeline

## 3️⃣ Generate a Very Large JSON Dataset (50,000 Hotels)

This simulates a **huge API response** that is far too large to fit in an LLM's context window.

In [3]:
def generate_hotels(n: int = 50_000) -> List[Dict]:
    cities = ["Toronto", "Vancouver", "Montreal", "Calgary", "Ottawa"]
    hotels = []

    for i in range(n):
        price = random.randint(60, 400)
        hotels.append({
            "hotel_id": i,
            "name": f"Hotel_{i}",
            "city": random.choice(cities),
            "price_per_night": price,
            "available_night": random.choice(["night", "day"]),
            "rating": round(random.uniform(2.5, 5.0), 1),
            "description": (
                f"A {'luxury' if price > 250 else 'budget' if price < 120 else 'mid-range'} "
                f"hotel with {'excellent' if price > 200 else 'basic'} amenities."
            )
        })
    return hotels

hotels = generate_hotels()
len(hotels)

50000

## 4️⃣ Hybrid Query Understanding (Metadata Extraction)

Instead of embedding everything, we **extract structured filters first**.

In production, this step is often:
- LLM function calling
- Small intent model
- Rules + heuristics

In [8]:
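def extract_filters(user_query: str) -> Dict:
    """Toy rule-based extractor: recognizes only 'night' and the literal 100-200 range."""
    filters = {}

    # Availability: plain substring match on the lowercased query
    if "night" in user_query.lower():
        filters["available_night"] = "night"

    # Budget: hardcoded check for the two numbers used in the demo query
    if "100" in user_query and "200" in user_query:
        filters["min_price"] = 100
        filters["max_price"] = 200

    return filters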

user_query = "I need a hotel available at night with a budget around 100 to 200 dollars"
extract_filters(user_query)

{'available_night': 'night', 'min_price': 100, 'max_price': 200}
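
The keyword checks above only recognize the literal numbers 100 and 200. As a minimal sketch of the "rules + heuristics" option, a regular expression can pull out an arbitrary range. This assumes the budget is written as two integers joined by "to", "-", or "–"; the helper name `extract_filters_regex` is hypothetical and not part of the notebook.

```python
import re
from typing import Dict

# Two integers (optionally prefixed with "$") joined by "to", "-", or an en dash.
_PRICE_RANGE = re.compile(r"\$?(\d+)\s*(?:to|-|–)\s*\$?(\d+)")

def extract_filters_regex(user_query: str) -> Dict:
    """Hypothetical rule-based extractor; illustrative only."""
    filters: Dict = {}
    text = user_query.lower()

    # Word-boundary match so that e.g. "nightclub" does not trigger the filter.
    if re.search(r"\bnight\b", text):
        filters["available_night"] = "night"

    match = _PRICE_RANGE.search(text)
    if match:
        # Sort so that "200 to 100" still yields min <= max.
        low, high = sorted(int(g) for g in match.groups())
        filters["min_price"] = low
        filters["max_price"] = high

    return filters

print(extract_filters_regex("A hotel at night for $80–$150"))
```

A real system would likely need to handle phrasings such as "under 150" or "at least 80", which this pattern ignores.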

## 5️⃣ Metadata Filtering (Critical for Scale)

This step **shrinks the search space** (in this run, from 50,000 hotels to 7,372) before any embeddings are computed.

In [9]:
def apply_metadata_filter(hotels: List[Dict], filters: Dict) -> List[Dict]:
    results = []

    for h in hotels:
        if "available_night" in filters:
            if h["available_night"] != filters["available_night"]:
                continue

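        # Treat each price bound as optional so a single bound does not raise KeyError
        price = h["price_per_night"]
        if "min_price" in filters and price < filters["min_price"]:
            continue
        if "max_price" in filters and price > filters["max_price"]:
            continue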

        results.append(h)
    return results

filtered_hotels = apply_metadata_filter(hotels, extract_filters(user_query))
len(filtered_hotels)

7372
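
For much larger datasets, the per-record Python loop can become the bottleneck. One possible alternative, sketched here under the assumption that every record carries the same keys as the generated hotels, is to build column arrays once and combine boolean masks. The name `filter_with_masks` is hypothetical.

```python
import numpy as np
from typing import Dict, List

def filter_with_masks(hotels: List[Dict], filters: Dict) -> List[Dict]:
    """Hypothetical vectorized variant of the metadata filter."""
    if not hotels:
        return []

    # Column arrays: one entry per hotel.
    prices = np.array([h["price_per_night"] for h in hotels])
    slots = np.array([h["available_night"] for h in hotels])

    # Start with everything selected, then narrow with each active filter.
    mask = np.ones(len(hotels), dtype=bool)
    if "available_night" in filters:
        mask &= slots == filters["available_night"]
    if "min_price" in filters:
        mask &= prices >= filters["min_price"]
    if "max_price" in filters:
        mask &= prices <= filters["max_price"]

    return [hotels[i] for i in np.flatnonzero(mask)]

sample = [
    {"hotel_id": 0, "price_per_night": 90, "available_night": "night"},
    {"hotel_id": 1, "price_per_night": 150, "available_night": "night"},
    {"hotel_id": 2, "price_per_night": 150, "available_night": "day"},
]
kept = filter_with_masks(
    sample, {"available_night": "night", "min_price": 100, "max_price": 200}
)
print([h["hotel_id"] for h in kept])
```

In a real deployment this role is usually played by a database index or a vector store's metadata filter rather than in-memory arrays.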

## 6️⃣ Semantic Embeddings (Vector RAG Layer)

We now embed **only the relevant subset**, not the full JSON.

In [10]:
embedder = SentenceTransformer('all-MiniLM-L6-v2')
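# Include the name and price in each chunk so the retrieved text is self-describing
texts = [f"{h['name']} costs {h['price_per_night']} dollars. {h['description']}" for h in filtered_hotels]

# 1. Start a pool of worker processes (Colab free tier has 2 CPU cores)
pool = embedder.start_multi_process_pool()

# 2. Encode using the pool
# batch_size=128 is a reasonable starting point on CPU; tune it for your hardware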
embeddings = embedder.encode_multi_process(
    texts,
    pool,
    batch_size=128,
    chunk_size=500
)

# 3. Always stop the pool when done to free up RAM
embedder.stop_multi_process_pool(pool)

print(f"Encoded {len(embeddings)} hotels. Shape: {embeddings.shape}")


Encoded 7372 hotels. Shape: (7372, 384)
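
To make the saving concrete, the memory footprint of the embedding matrix can be estimated from its shape. This sketch assumes float32 values (4 bytes each) and the 384-dimensional output shown above; `embedding_megabytes` is a hypothetical helper, not part of the notebook.

```python
def embedding_megabytes(n_vectors: int, dim: int = 384, bytes_per_value: int = 4) -> float:
    """Approximate size of a dense embedding matrix in megabytes (10^6 bytes)."""
    return n_vectors * dim * bytes_per_value / 1e6

# Filtered subset versus the full dataset.
print(f"Filtered subset: {embedding_megabytes(7_372):.1f} MB")
print(f"All hotels:      {embedding_megabytes(50_000):.1f} MB")
```

Storage is modest either way at this scale. The larger saving from filtering first is encoding time, since roughly 85% of the records are never embedded.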


## 7️⃣ Hybrid Semantic Retrieval

In [21]:
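def semantic_search(query, embeddings, texts, top_k=5):
    # all-MiniLM-L6-v2 returns unit-length vectors, so the dot product is cosine similarity
    q_emb = embedder.encode([query])[0]
    scores = np.dot(embeddings, q_emb)
    # Indices of the top_k highest scores, best first
    idx = scores.argsort()[-top_k:][::-1]
    return [texts[i] for i in idx]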

top_chunks = semantic_search(user_query, embeddings, texts)
top_chunks

['Hotel_28154 costs 100 dollars. A budget hotel with basic amenities.',
 'Hotel_40018 costs 113 dollars. A budget hotel with basic amenities.',
 'Hotel_37189 costs 100 dollars. A budget hotel with basic amenities.',
 'Hotel_48094 costs 160 dollars. A mid-range hotel with basic amenities.',
 'Hotel_44281 costs 100 dollars. A budget hotel with basic amenities.']
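
The dot product in `semantic_search` equals cosine similarity only when both vectors are unit length. As a sketch for embedding models that do not normalize their output, the version below normalizes explicitly and uses `np.argpartition` so that only the top `k` scores are sorted. It assumes no all-zero vectors; `cosine_top_k` is a hypothetical name.

```python
import numpy as np

def cosine_top_k(query_vec: np.ndarray, doc_matrix: np.ndarray, top_k: int = 5):
    """Return (indices of the top_k most similar rows, all cosine scores)."""
    # Scale every row and the query to unit length.
    doc_norm = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    q_norm = query_vec / np.linalg.norm(query_vec)
    scores = doc_norm @ q_norm

    k = min(top_k, len(scores))
    # argpartition places the k largest scores in the last k slots (unordered)...
    top = np.argpartition(scores, -k)[-k:]
    # ...so only those k entries need a full sort, best first.
    ordered = top[np.argsort(scores[top])[::-1]]
    return ordered, scores

docs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
idx, scores = cosine_top_k(np.array([2.0, 0.0]), docs, top_k=2)
print(idx, scores[idx])
```

With a few thousand candidates the difference from a full `argsort` is negligible; partial selection matters more as the candidate set grows.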

## 8️⃣ Mistral-7B-Instruct for Casual Answer Generation

The LLM is used **only for natural language generation**, not for reasoning or filtering.

In [24]:
import torch
import gc

# 1. Clear variables from memory if you're done with them
if 'llm' in globals():
    del llm

# 2. Force garbage collection (removes reference cycles)
gc.collect()

# 3. Empty the CUDA cache (releases unused memory to the OS)
if torch.cuda.is_available():
    torch.cuda.empty_cache()
    print("GPU cache cleared.")

GPU cache cleared.


In [25]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

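# Mistral-7B-Instruct-v0.3 may be gated on the Hugging Face Hub. If loading fails with an
# authorization error, accept the model terms on its Hub page and provide an HF token.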
model_id = "mistralai/Mistral-7B-Instruct-v0.3"

# 1. Load Tokenizer and Model
# Using bfloat16 to save memory while maintaining accuracy
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    dtype=torch.bfloat16,
    device_map="auto"
)

# 2. Structure the prompt using Mistral's Chat Template
# This wraps the context and query in [INST] tags automatically
messages = [
    {
        "role": "user",
        "content": (
            "You are a friendly hotel assistant. Use the following hotel data to help the user. "
            "Answer casually and helpfully.\n\n"
            f"User question: {user_query}\n\n"
            "Relevant hotels:\n" + "\n".join(top_chunks)
        )
    }
]

# Apply the template (Mistral v3 specific)
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# 3. Tokenize and Generate
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=150,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    pad_token_id=tokenizer.eos_token_id
)

# 4. Decode ONLY the new response (skipping the prompt)
response = tokenizer.decode(outputs[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)

print("--- AI RESPONSE ---")
print(response)


--- AI RESPONSE ---
Hello there! I'm happy to help you find a hotel within your budget. Here are a few options that I have for you:

1. Hotel_28154 and Hotel_37189 both cost 100 dollars and offer basic amenities.
2. Hotel_40018 is another budget hotel that costs 113 dollars and also offers basic amenities.
3. Hotel_44281 is another 100-dollar option with basic amenities.

All of these hotels are available for night stays. If you're interested in a mid-range option, Hotel_48094 costs 160 dollars


In [28]:
import torch
import gc

# 1. Clear variables from memory if you're done with them
if 'model' in globals():
    del model

# 2. Force garbage collection (removes reference cycles)
gc.collect()

# 3. Empty the CUDA cache (releases unused memory to the OS)
if torch.cuda.is_available():
    torch.cuda.empty_cache()
    print("GPU cache cleared.")

GPU cache cleared.


## 9️⃣ Testing llama.cpp (llama-cpp-python) in the RAG Pipeline

In [13]:
# Specifically for Colab's CUDA 12.2 environment
!pip install llama-cpp-python \
  --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu122



In [38]:
import os
import torch
import psutil
from llama_cpp import Llama
from sentence_transformers import SentenceTransformer

# --- STEP 1: PRE-CHECK VRAM ---
def check_vram():
    if torch.cuda.is_available():
        allocated = torch.cuda.memory_allocated(0) / (1024**3)
        reserved = torch.cuda.memory_reserved(0) / (1024**3)
        print(f"GPU Memory: {allocated:.2f}GB allocated, {reserved:.2f}GB reserved")
    else:
        print("No GPU detected. Proceeding with CPU.")

check_vram()

# --- STEP 2: FAST EMBEDDINGS (filtered hotels only) ---
# all-MiniLM-L6-v2 with a multi-process pool to shorten CPU encoding time
embedder = SentenceTransformer("all-MiniLM-L6-v2")
texts = [f"{h['name']} costs {h['price_per_night']} dollars. {h['description']}" for h in filtered_hotels]

print("Encoding filtered hotels (multi-process)...")
pool = embedder.start_multi_process_pool()
embeddings = embedder.encode_multi_process(texts, pool, batch_size=128)
embedder.stop_multi_process_pool(pool)

# --- STEP 3: MODEL LOADING (Llama 3.2 3B, Q4_K_M GGUF) ---
# The quantized file is about 2 GB, so every layer fits on a T4 GPU
print("Loading Llama-3.2-3B-Instruct...")
llm = Llama.from_pretrained(
    repo_id="bartowski/Llama-3.2-3B-Instruct-GGUF",
    filename="Llama-3.2-3B-Instruct-Q4_K_M.gguf",
    n_ctx=4096,           # Larger context window for more hotel data
    n_gpu_layers=-1,      # Offload every layer to the GPU
    n_batch=512,
    verbose=False,        # `stream` is a generation-time argument, so it is not passed here
)


GPU Memory: 0.01GB allocated, 0.10GB reserved
Encoding filtered hotels (multi-process)...
Loading Llama-3.2-3B-Instruct...

llama_context: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized


In [48]:
import os
import torch
from llama_cpp import Llama

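# 1. Release PyTorch's cached GPU memory to reduce the risk of 'cudaMalloc' errors
#    (this does not free memory held by a previously loaded llama.cpp model)
if torch.cuda.is_available():
    torch.cuda.empty_cache()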

# 2. Load Llama 3.2 3B (Fast & Accurate)
print("Loading Llama-3.2-3B-Instruct...")
llm = Llama.from_pretrained(
    repo_id="bartowski/Llama-3.2-3B-Instruct-GGUF",
    filename="Llama-3.2-3B-Instruct-Q4_K_M.gguf",
    n_ctx=2048,           # Ample space for hotel context
    n_gpu_layers=-1,      # Offload ALL 28 layers to GPU
    n_batch=512,          # Processes the hotel list faster
    verbose=False
)



Loading Llama-3.2-3B-Instruct...


llama_context: n_ctx_per_seq (2048) < n_ctx_train (131072) -- the full capacity of the model will not be utilized


In [54]:
# 3. Build the context, then the Llama 3.2 prompt (using header IDs)
# `context_text` was not defined in the exported cells; joining `top_chunks`
# (as the Mistral cell does) is an assumption.
context_text = "\n".join(top_chunks)

prompt = f"""<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are a helpful hotel assistant. Use the provided hotel data to suggest options within the user's budget.
Always state the hotel name and the exact price from the context.<|eot_id|>
<|start_header_id|>user<|end_header_id|>
{user_query}

Relevant Hotels:
{context_text}<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>"""

# Updated Generation with Precision Settings
print("--- AI RESPONSE (PRECISION MODE) ---")
stream = llm(
    prompt,
    max_tokens=300,
    stop=["<|eot_id|>"],
    stream=True,

    # PRECISION SETTINGS:
    temperature=0.1,      # Low randomness for factual accuracy
    top_p=0.9,            # Nucleus sampling: smallest token set with cumulative probability >= 0.9
    repeat_penalty=1.1,   # Discourages repeated phrases such as "budget hotel"
    top_k=40              # Sample only from the 40 most likely tokens
)

for output in stream:
    token = output["choices"][0]["text"]
    print(token, end="", flush=True)

--- AI RESPONSE (PRECISION MODE) ---


Based on your budget of $100 to $200, I would recommend the following options:

1. **Hotel_28154**: This is a budget hotel with basic amenities and costs exactly $100.
2. **Hotel_37189**: Another budget hotel with basic amenities, also costing $100.
3. **Hotel_44281**: A budget hotel with basic amenities, priced at $100.

All three options are within your budget and offer the same level of comfort and amenities. If you'd like to consider a slightly upgraded option, I can suggest:

4. **Hotel_40018**: This mid-range hotel costs $113, which is still within your budget. It offers basic amenities as well, but with a slight upgrade in quality.

Let me know if you have any other preferences or specific requirements!
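
Because the system prompt asks for exact prices from the context, a lightweight post-check can compare what the model quoted against what was retrieved. The sketch below assumes context lines follow the `Hotel_<id> costs <price> dollars` pattern used in this notebook and that the response quotes prices as `$<number>` on the same line as the hotel name. `find_price_mismatches` is a hypothetical helper.

```python
import re
from typing import Dict, List, Optional, Tuple

def find_price_mismatches(
    response: str, context_lines: List[str]
) -> Dict[str, Tuple[Optional[int], int]]:
    """Map hotel name -> (price in context or None, price quoted) for each disagreement."""
    # Prices as stated in the retrieved context.
    known = {
        name: int(price)
        for line in context_lines
        for name, price in re.findall(r"(Hotel_\d+) costs (\d+) dollars", line)
    }

    mismatches = {}
    for line in response.splitlines():
        # Pair a hotel name with the first dollar amount that follows it on the line.
        m = re.search(r"(Hotel_\d+)\b.*?\$(\d+)", line)
        if not m:
            continue
        name, quoted = m.group(1), int(m.group(2))
        if known.get(name) != quoted:
            mismatches[name] = (known.get(name), quoted)
    return mismatches

context = [
    "Hotel_28154 costs 100 dollars. A budget hotel with basic amenities.",
    "Hotel_40018 costs 113 dollars. A budget hotel with basic amenities.",
]
answer = (
    "1. **Hotel_28154**: A budget hotel that costs exactly $100.\n"
    "2. **Hotel_40018**: This hotel costs $120."
)
print(find_price_mismatches(answer, context))
```

A check like this covers prices only. It would not catch a wrong descriptive label, such as a budget hotel being described as mid-range.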

## ✅ Final Takeaways

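- Large JSON payloads should be filtered and searched, not sent wholesale to an LLM
- Metadata filtering comes **before embeddings** (here it reduced 50,000 hotels to 7,372)
- Semantic RAG is applied only to the narrowed data
- A small instruct model (Llama 3.2 3B or Mistral-7B-Instruct) is sufficient for casual response generation

**This is the hybrid pattern (structured filters first, vector search second, LLM last) that production RAG systems commonly follow.**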