# LLM Deployment for Shoplite (Colab-ready)

This notebook is self-contained. It embeds the Shoplite knowledge base, builds a FAISS index with `sentence-transformers`, loads an open-source LLM (Llama 3.1 8B by default), exposes a Flask API (`/chat`, `/ping`, `/health`) and uses `pyngrok` to create a public tunnel.

**IMPORTANT**: Instructors must supply their Hugging Face token and ngrok authtoken at runtime when prompted. Do NOT hardcode tokens, and clear any cell output that echoes a token before sharing the notebook. If the model cannot be loaded in Colab due to size or access restrictions, follow the troubleshooting notes in the last cell to switch to a smaller model or adjust the quantized loading settings.
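As a rough guide to whether a model will fit, weight memory is approximately the parameter count times the bytes stored per parameter. The sketch below is back-of-the-envelope arithmetic only: `8e9` is a round figure assumed for an "8B" model, and the estimate ignores activations, the KV cache, and quantization overhead, so treat the results as lower bounds.

```python
# Rough weight-memory estimate; ignores activations, KV cache, and overhead.
# The precision names are labels for this sketch, not library constants.
BYTES_PER_PARAM = {"float16": 2.0, "int8": 1.0, "4bit": 0.5}

def weight_memory_gb(num_params: float, precision: str) -> float:
    """Return approximate weight memory in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9

if __name__ == "__main__":
    for precision in BYTES_PER_PARAM:
        # 8e9 is an assumed round parameter count for an "8B" model
        print(f"{precision}: ~{weight_memory_gb(8e9, precision):.0f} GB")
```

Compare the result against the memory of the GPU that Colab assigned to the session before choosing between float16, 8-bit, and 4-bit loading.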

In [1]:
# Cell 1: Install dependencies
# Run this cell in Colab (ensure GPU runtime selected: Runtime -> Change runtime type -> GPU)
!pip install --quiet --upgrade pip
!pip install --quiet transformers accelerate bitsandbytes safetensors sentence-transformers faiss-cpu flask pyngrok uvicorn gunicorn -U
# bitsandbytes is used for 8-bit loading/quantization; accelerate helps with device mapping



In [2]:
# Cell 2: Imports and utilities
import os, time, json, threading
from typing import List, Dict, Any
from flask import Flask, request, jsonify
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from pyngrok import ngrok

# Helper to safely serialize numpy arrays to JSON
class NpEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, np.ndarray):
            return obj.tolist()
        return super().default(obj)


In [3]:
# Cell 3: Knowledge Base (embedded)
KNOWLEDGE_BASE = [
    {"id": "doc1", "title": "Shoplite User Registration and Account Management", "content": "To create a Shoplite account, users must visit the registration page and provide a valid email address, password, and basic profile information. Email verification is required within 24 hours. Users can choose between: - Buyer accounts (free) - Seller accounts (requires business verification and tax information). Account Management Features: Update personal information, Change passwords, Set security questions, Manage notification preferences, Deactivate accounts (requires email confirmation; may affect active orders/subscriptions). Buyer Access: product browsing, purchasing, order tracking, reviews. Seller Access: seller dashboard, inventory management, order processing, analytics. Security Measures: two-factor authentication recommended, password recovery via email and phone verification."},
    {"id": "doc2", "title": "Shoplite Product Search and Filtering Features", "content": "Shoplite provides a powerful search engine: Search Capabilities: keyword queries, category selection, brand filters. Filtering Options: price range, rating, availability, seller location, shipping speed, promotions, eco-friendly options. Features: Autocomplete suggestions, Spelling correction, Save searches & alerts, Faceted navigation, Optimized for large catalogs with real-time indexing, Mobile responsive interface."},
    {"id": "doc3", "title": "Shoplite Shopping Cart and Checkout Process", "content": "Add multiple items from different sellers; Review quantities, apply promo codes/gift cards; Cart preserved across sessions for logged-in users. Checkout Steps: 1. Shipping selection (standard, expedited, same-day) 2. Payment selection (credit/debit cards, digital wallets, cash-on-delivery) 3. Order confirmation. Security & Processing: PCI-DSS compliant payment gateways, Real-time stock updates, Order confirmation emails with tracking, Seller notifications for new orders, Integrated returns and refunds system."},
    {"id": "doc4", "title": "Shoplite Payment Methods and Security", "content": "Accepted Payment Methods: credit/debit cards, PayPal, Apple Pay, Google Pay, local solutions. Security Measures: SSL encryption, PCI-DSS compliance, fraud detection, two-factor authentication, sensitive info encrypted in transit and at rest. Other Features: digital wallet integration, structured dispute/chargeback process, seller payments after order confirmation."},
    {"id": "doc5", "title": "Shoplite Order Tracking and Delivery", "content": "Real-time tracking with confirmation emails and unique tracking number. Stages: confirmed -> processing -> shipped -> in transit -> delivered. Delivery modification requests (seller approval required). International shipments display customs/import duties. Optimized logistics with estimated arrival and delay notifications. Support assistance for lost/delayed packages."},
    {"id": "doc6", "title": "Shoplite Return and Refund Policies", "content": "Return Period: typically 30 days from delivery. Process: select order/item, specify reason, use prepaid label if eligible. Refunds: processed in 5–7 business days to original payment method. Digital/personalized items may have exceptions. Automated order status updates. Sellers must comply with policies to maintain ratings. Dispute resolution available."},
    {"id": "doc7", "title": "Shoplite Product Reviews and Ratings", "content": "Buyers rate products on a five-star scale and leave comments. Reviews moderated for compliance. Sellers can respond to reviews. Ratings influence search ranking. Verified purchase badges for authenticity. Aggregate ratings provided. Review analytics available for sellers."},
    {"id": "doc8", "title": "Shoplite Seller Account Setup and Management", "content": "Create seller account with business documents and tax verification. Seller Dashboard: inventory management, order processing, sales analytics. Product listing via individual or bulk upload (CSV/API). Profile customization: branding, policies, shipping, returns. Notifications: new orders, low stock, inquiries. Pricing, promotions, and shipping fee management. Performance metrics tracked; third-party integrations supported."},
    {"id": "doc9", "title": "Shoplite Inventory Management for Sellers", "content": "Track stock levels, reorder thresholds, and availability in real-time. Low-stock alerts. Bulk imports supported. Variants (size, color, bundle) supported. Inventory reports for trends and seasonal demand. Manage warehouses and shipping locations."},
    {"id": "doc10", "title": "Shoplite Commission and Fee Structure", "content": "Commission fees per product category. Additional fees: premium listings, promotions, special services. Transparent notifications in dashboard. Payments made after commission deduction (weekly/bi-weekly). Transaction reports available. Pricing guidance provided."},
    {"id": "doc11", "title": "Shoplite Customer Support Procedures", "content": "Support via live chat, email, phone, and AI chatbot (24/7). Ticket categorization: orders, payments, returns, technical, account management. Unique tracking IDs. Backend integration for order/payment info. Dedicated seller support channel. Help center with guides, FAQs, videos. Fast, transparent, fair resolution."},
    {"id": "doc12", "title": "Shoplite Mobile App Features", "content": "iOS & Android support. Browse, filter, add to cart, purchase. Push notifications for promotions/order updates. Barcode scanning and QR code payments. Mobile wallets, fingerprint, Face ID login. Seller management on-the-go. Offline caching for previously loaded content. Intuitive, responsive, accessible interface."},
    {"id": "doc13", "title": "Shoplite API Documentation for Developers", "content": "RESTful API endpoints: product catalog, orders, accounts, inventory. OAuth 2.0 authentication. Rate limiting (higher for verified partners). Detailed docs: request/response, parameters, error codes. Webhooks for real-time events. Sandbox environment for testing. Versioned API with backward-compatible updates."},
    {"id": "doc14", "title": "Shoplite Security and Privacy Policies", "content": "Data Protection: TLS encryption, AES-256 at rest, authorized access. Two-factor authentication & strong passwords. GDPR & CCPA compliance. Security monitoring for suspicious activity. Clear privacy policies: data collection, usage, third-party sharing. Policy change notifications; user control over data."},
    {"id": "doc15", "title": "Shoplite Promotional Codes and Discounts", "content": "Sellers create promotions: discount codes, seasonal sales, bundle offers. Code types: percentage, fixed, conditional. Start/end dates, usage limits, minimum purchase configurable. Automatic verification at checkout. Analytics: redemption, revenue, engagement. User notifications for active promotions. Special events highlighted on homepage/app. Compliance with platform policies."}
]


In [4]:
# Cell 4: Prompts embedded as Python dict (converted from assistant-prompts.yml)
PROMPTS = {
    "version": "1.0",
    "created": "2025-09-23",
    "author": "Joseph Chamoun",
    "base_retrieval_prompt": {
        "role": "You are a helpful Shoplite customer service assistant.",
        "goal": "Provide accurate answers using only the provided Shoplite documentation.",
        "context_guidelines": ["Use only information from the provided document snippets", "Cite specific documents when possible"],
        "response_format": "Answer: [Your response based on context]\nSources: [List document titles referenced]\n",
    },
    "multi_doc_synthesis": {
        "role": "You are an expert Shoplite support agent who synthesizes multiple documents.",
        "goal": "Combine information from multiple retrieved documents to create a concise, accurate answer.",
        "context_guidelines": ["State which documents you used", "When information conflicts, show both options and recommend the safer/default one"],
        "response_format": "Answer: [Synthesis]\nSources: [Doc titles]\nConfidence: [High|Medium|Low]\n",
    },
    "refusal_when_no_context": {
        "role": "You are a safety-conscious assistant.",
        "goal": "Refuse to answer if no relevant context is found in the knowledge base and ask for clarification or external data.",
        "context_guidelines": ["If top retrieved documents have low similarity (< threshold), return a refusal", "Suggest the user provide more details or check the Shoplite help center"],
        "response_format": "Answer: I don't have enough information in the Shoplite docs to answer that. Please provide more details or check [Help Center].\n",
    }
}


In [5]:
# Cell 5: Build embeddings and FAISS index
EMBED_MODEL_NAME = 'sentence-transformers/all-MiniLM-L6-v2'
print('Loading embedding model:', EMBED_MODEL_NAME)
embed_model = SentenceTransformer(EMBED_MODEL_NAME)

DOCUMENT_TEXTS = [d['title'] + '\n\n' + d['content'] for d in KNOWLEDGE_BASE]
DOC_IDS = [d['id'] for d in KNOWLEDGE_BASE]

print('Encoding documents...')
doc_embeddings = embed_model.encode(DOCUMENT_TEXTS, convert_to_numpy=True, show_progress_bar=True)

def normalize_embeddings(embs):
    norms = np.linalg.norm(embs, axis=1, keepdims=True)
    return embs / np.clip(norms, a_min=1e-10, a_max=None)

doc_embeddings = normalize_embeddings(doc_embeddings)
d = doc_embeddings.shape[1]
index = faiss.IndexFlatIP(d)
index.add(doc_embeddings)
print(f'FAISS index created with {index.ntotal} vectors (dim={d})')


Loading embedding model: sentence-transformers/all-MiniLM-L6-v2


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Encoding documents...


FAISS index created with 15 vectors (dim=384)
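`IndexFlatIP` ranks by inner product, so the scores behave as cosine similarities only because both the document and query vectors are L2-normalized first. A pure-Python sketch of that equivalence, using made-up 3-dimensional vectors rather than real embeddings (the helper names are illustrative, not part of FAISS):

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(v):
    """Scale a vector to unit length (same idea as normalize_embeddings)."""
    norm = math.sqrt(dot(v, v))
    return [x / max(norm, 1e-10) for x in v]

def cosine(a, b):
    """Cosine similarity computed directly from the definition."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Toy vectors (illustrative values, not real embeddings)
docs = [[1.0, 2.0, 2.0], [3.0, 0.0, 4.0]]
query = [2.0, 1.0, 2.0]

# Inner product of unit vectors: the quantity an inner-product index ranks by
q_unit = l2_normalize(query)
ip_scores = [dot(q_unit, l2_normalize(d)) for d in docs]

# Cosine similarity on the raw vectors gives the same numbers
cos_scores = [cosine(query, d) for d in docs]
```

This is also why `retrieve_docs` must normalize the query embedding before calling `index.search`; skipping that step would scale every score by the query's length.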


In [6]:
# Cell 6: Retrieval functions
def retrieve_docs(query: str, top_k: int = 3):
    q_emb = embed_model.encode([query], convert_to_numpy=True)
    q_emb = normalize_embeddings(q_emb)
    scores, indices = index.search(q_emb, top_k)
    results = []
    for score, idx in zip(scores[0], indices[0]):
        if idx < 0 or idx >= len(KNOWLEDGE_BASE):
            continue
        doc = KNOWLEDGE_BASE[idx]
        results.append({
            'id': doc['id'],
            'title': doc['title'],
            'content': doc['content'],
            'score': float(score)
        })
    return results

# Quick test (optional)
print(retrieve_docs('How do I create a seller account on Shoplite?', top_k=3))


[{'id': 'doc1', 'title': 'Shoplite User Registration and Account Management', 'content': 'To create a Shoplite account, users must visit the registration page and provide a valid email address, password, and basic profile information. Email verification is required within 24 hours. Users can choose between: - Buyer accounts (free) - Seller accounts (requires business verification and tax information). Account Management Features: Update personal information, Change passwords, Set security questions, Manage notification preferences, Deactivate accounts (requires email confirmation; may affect active orders/subscriptions). Buyer Access: product browsing, purchasing, order tracking, reviews. Seller Access: seller dashboard, inventory management, order processing, analytics. Security Measures: two-factor authentication recommended, password recovery via email and phone verification.', 'score': 0.7689905166625977}, {'id': 'doc8', 'title': 'Shoplite Seller Account Setup and Management', 'con

In [7]:
# Cell 6.5: Hugging Face login (runtime prompt; no hardcoded token)
# NOTE: Do NOT hardcode tokens in the notebook. Enter at runtime when prompted.
from getpass import getpass
from huggingface_hub import login

# getpass hides the typed token, so it is not echoed into the saved cell output
hf_token = getpass("🔑 Enter your Hugging Face token (paste it here; will not be saved in the notebook): ").strip()
if not hf_token:
    print("No token provided. Attempts to load private models may fail. Proceeding without login.")
else:
    try:
        login(hf_token)
        print("Hugging Face login successful.")
    except Exception as e:
        print("Hugging Face login failed:", e)
        print("If you can't load protected models, try switching to a publicly available model.")


🔑 Enter your Hugging Face token (paste it here; will not be saved in the notebook): hf_example
Hugging Face login successful.


In [8]:
# Cell 7: Model loading with modern quantization config + safe fallbacks
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch

MODEL_NAME = "meta-llama/Llama-3.1-8B-Instruct"  # preferred (may require HF access)
FALLBACK_MODEL = "tiiuae/falcon-7b-instruct"     # public (ungated) instruct model used as fallback

use_cuda = torch.cuda.is_available()
print("Attempting to load model:", MODEL_NAME)
print("CUDA available:", use_cuda)

model = None
tokenizer = None

def try_load_model(model_name, quant_config=None, dtype=torch.float16):
    """Attempt to load the model with optional quantization config."""
    try:
        tok = AutoTokenizer.from_pretrained(model_name, use_fast=True)
        if quant_config is not None:
            mdl = AutoModelForCausalLM.from_pretrained(
                model_name,
                device_map="auto" if use_cuda else None,
                quantization_config=quant_config,
                torch_dtype=dtype,
                low_cpu_mem_usage=True,
            )
        else:
            mdl = AutoModelForCausalLM.from_pretrained(
                model_name,
                device_map="auto" if use_cuda else None,
                torch_dtype=dtype,
                low_cpu_mem_usage=True,
            )
        mdl.eval()
        return tok, mdl
    except Exception as e:
        print(f"Load failed for {model_name}: {repr(e)}")
        return None, None

# Prepare a BitsAndBytesConfig for 8-bit (modern API)
try:
    bnb_config = BitsAndBytesConfig(
        load_in_8bit=True,          # prefer 8-bit quantization when available
        llm_int8_threshold=6.0      # heuristic threshold (tweak if needed)
    )
except Exception:
    # In case BitsAndBytesConfig import signature differs on some versions
    bnb_config = None

# 1) Try to load preferred model with quantization
if bnb_config is not None:
    print("Trying quantized load with BitsAndBytesConfig...")
    tokenizer, model = try_load_model(MODEL_NAME, quant_config=bnb_config, dtype=torch.float16)
else:
    print("BitsAndBytesConfig not available; attempting standard/float16 load...")
    tokenizer, model = try_load_model(MODEL_NAME, quant_config=None, dtype=torch.float16)

# 2) Fallback: if the preferred model failed, try the public fallback model
if model is None:
    print("Primary model load failed. Trying fallback public model:", FALLBACK_MODEL)
    tokenizer, model = try_load_model(FALLBACK_MODEL, quant_config=bnb_config, dtype=torch.float16)
    if model:
        print("Fallback model loaded:", FALLBACK_MODEL)
    else:
        print("Fallback model also failed. The notebook will still run retrieval, but LLM responses will be disabled.")
        model = None
        tokenizer = None

# Final status
if model is not None and tokenizer is not None:
    print("Model and tokenizer loaded successfully.")
else:
    print("No model loaded. Generation endpoints will return retrieval-only responses.")


Attempting to load model: meta-llama/Llama-3.1-8B-Instruct
CUDA available: True
Trying quantized load with BitsAndBytesConfig...



`torch_dtype` is deprecated! Use `dtype` instead!



Model and tokenizer loaded successfully.


In [62]:
# Cell 8: Final optimized version with proper text extraction
import torch

TEMPERATURE = 0.7
MAX_TOKENS = 120
SIMILARITY_THRESHOLD = 0.25

def build_prompt_from_retrieval(query: str, retrieved_docs: List[Dict[str, Any]]):
    """Builds prompt using top 5 documents."""
    docs_text = ""
    for i, doc in enumerate(retrieved_docs[:5], 1):
        content = doc['content'][:200] + "..." if len(doc['content']) > 200 else doc['content']
        docs_text += f"\n[Document {i}: {doc['title']}]\n{content}\n"

    prompt = (
        f"You are a helpful Shoplite customer service assistant.\n\n"
        f"Use these documents to answer:{docs_text}\n\n"
        f"Question: {query}\n"
        f"Answer in 2-3 sentences:"
    )
    return prompt


def generate_response(query: str, top_k: int = 5, temperature: float = TEMPERATURE, max_tokens: int = MAX_TOKENS, debug: bool = False):
    """Generate response with proper text extraction."""

    query_lower = query.lower().strip()

    # Greetings: match whole words so that e.g. "shipping" or "this" does not trigger on "hi"
    words = [w.strip(".,!?") for w in query_lower.split()]
    if any(g in words for g in ("hi", "hello", "hey")) or "good morning" in query_lower:
        if "how are you" in query_lower:
            return {"answer": "I'm here and ready to help! How can I assist you with Shoplite today?", "sources": [], "confidence": "High"}
        return {"answer": "Hello! I'm here to help you with Shoplite. What would you like to know?", "sources": [], "confidence": "High"}

    # Help requests
    if query_lower in ["help", "can you help", "can u help me", "help me"]:
        return {"answer": "Of course! I can help you with Shoplite registration, orders, payments, returns, seller accounts, and more. What specific information do you need?", "sources": [], "confidence": "High"}

    # Retrieve documents
    retrieved = retrieve_docs(query, top_k=top_k)

    if not retrieved:
        return {
            "answer": "I don't have information about that. I can only answer questions about Shoplite's platform, features, and services.",
            "sources": [],
            "confidence": "Low"
        }

    top_score = max(d['score'] for d in retrieved)

    if debug:
        print(f"Top similarity score: {top_score:.3f}")

    # Reject only truly irrelevant queries
    if top_score < SIMILARITY_THRESHOLD:
        return {
            "answer": "I'm sorry, that question appears to be outside my knowledge base. I can help with questions about Shoplite's registration, orders, payments, returns, seller accounts, product search, and customer support.",
            "sources": [],
            "confidence": "Low"
        }

    # Adjusted confidence thresholds
    if top_score >= 0.6:
        confidence = "High"
    elif top_score >= 0.4:
        confidence = "Medium"
    else:
        confidence = "Low"

    prompt = build_prompt_from_retrieval(query, retrieved)

    if model is None:
        answer_parts = []
        for doc in retrieved[:3]:
            sentences = [s.strip() + '.' for s in doc['content'].split('.') if s.strip()][:1]
            answer_parts.extend(sentences)
        answer = ' '.join(answer_parts[:3])
        return {"answer": answer, "sources": [d['title'] for d in retrieved[:3]], "confidence": confidence}

    inputs = tokenizer(prompt, return_tensors='pt', truncation=True, max_length=1536)
    if torch.cuda.is_available():
        inputs = {k: v.cuda() for k, v in inputs.items()}

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            do_sample=True,
            temperature=temperature,
            max_new_tokens=max_tokens,
            top_p=0.9,
            repetition_penalty=1.3,
            eos_token_id=tokenizer.eos_token_id,
            pad_token_id=tokenizer.eos_token_id,
        )

    generated_ids = outputs[0][inputs['input_ids'].shape[1]:]
    raw_text = tokenizer.decode(generated_ids, skip_special_tokens=True).strip()

    if debug:
        print(f"Raw generated text: {raw_text[:200]}")

    # Extract meaningful content - take first 3 sentences
    sentences = []
    current = ""

    for char in raw_text:
        current += char
        # End of sentence markers
        if char in '.!?' and len(current.strip()) > 15:
            sentences.append(current.strip())
            current = ""
            if len(sentences) >= 3:
                break

    # Join first 2-3 sentences
    text = ' '.join(sentences[:3])

    # Remove trailing incomplete sentence
    if text and not any(text.endswith(p) for p in ['.', '!', '?']):
        last_period = text.rfind('.')
        if last_period > 0:
            text = text[:last_period + 1]

    # If still empty or too short, use extractive approach
    if len(text) < 20:
        top_doc = retrieved[0]
        content_sentences = [s.strip() + '.' for s in top_doc['content'].split('.') if len(s.strip()) > 20][:2]
        text = ' '.join(content_sentences)

    return {
        "answer": text,
        "sources": [d['title'] for d in retrieved[:3]],
        "confidence": confidence
    }

In [47]:
# Cell 9: Flask API (/health, /ping, /chat) served from a background thread
from flask import Flask, request, jsonify
import threading

app = Flask(__name__)

@app.route('/health', methods=['GET'])
def health():
    return jsonify({'status': 'ok', 'model_loaded': model is not None, 'num_docs': len(KNOWLEDGE_BASE)})

@app.route('/ping', methods=['POST'])
def ping():
    # silent=True returns None instead of raising on a missing or non-JSON body
    data = request.get_json(silent=True) or {}
    text = data.get('text', 'Hello')
    return jsonify({'reply': f'Pong: {text}'})

@app.route('/chat', methods=['POST'])
def chat():
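    # silent=True returns None instead of raising on a missing or non-JSON body
    data = request.get_json(silent=True) or {}
    query = data.get('query')
    debug = bool(data.get('debug', False))

    if not query:
        return jsonify({'error': 'missing query'}), 400

    # Reject a non-numeric top_k with a 400, then clamp it to a valid range
    try:
        top_k = int(data.get('top_k', 3))
    except (TypeError, ValueError):
        return jsonify({'error': 'top_k must be an integer'}), 400
    top_k = max(1, min(top_k, len(KNOWLEDGE_BASE)))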

    try:
        result = generate_response(query, top_k=top_k, debug=debug)
        return jsonify(result)
    except Exception as e:
        return jsonify({'error': str(e)}), 500

# Run Flask in background thread
def run_flask():
    app.run(host='0.0.0.0', port=5000, threaded=True)

thread = threading.Thread(target=run_flask, daemon=True)
thread.start()
print('Flask server started (background thread).')

Flask server started (background thread).
 * Serving Flask app '__main__'
 * Debug mode: off


Address already in use
Port 5000 is in use by another program. Either identify and stop that program, or start the server with a different port.
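The `Address already in use` message above typically appears when the cell is re-run while the earlier background thread still holds port 5000. One way to guard against it is to probe the port before starting Flask. A minimal sketch using only the standard library; `port_is_free` is a hypothetical helper, not part of Flask:

```python
import socket

def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if nothing is currently accepting connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        probe.settimeout(0.5)
        # connect_ex returns 0 when the connection succeeds, i.e. a server is listening
        return probe.connect_ex((host, port)) != 0
```

For example, start the server thread only when `port_is_free(5000)` is true, and otherwise keep using the server that is already running. Note that route changes made after the first run will not reach that older server until the runtime is restarted.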


In [11]:
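# Cell 10: ngrok tunnel setup (prompt for authtoken)
from getpass import getpass

print('\n=== NGROK TUNNEL SETUP ===')
# getpass hides the typed token, so it is not echoed into the saved cell output
ngrok_token = getpass('Enter your ngrok authtoken (paste it here): ').strip()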
if ngrok_token:
    try:
        ngrok.set_auth_token(ngrok_token)
        public_url = ngrok.connect(5000)
        print('ngrok tunnel created at:', public_url)
    except Exception as e:
        print('Failed to create ngrok tunnel:', e)
        public_url = None
else:
    print('No ngrok token provided; remember to set up a tunnel separately.')
    public_url = None

print('If public_url is not None, use it to call the /chat endpoint from outside Colab.')



=== NGROK TUNNEL SETUP ===
 * Serving Flask app '__main__'
 * Debug mode: off


 * Running on all addresses (0.0.0.0)
 * Running on http://127.0.0.1:5000
 * Running on http://172.28.0.12:5000
INFO:werkzeug:Press CTRL+C to quit


Enter your ngrok authtoken (paste it here): <redacted>
ngrok tunnel created at: NgrokTunnel: "https://f3fceec26a40.ngrok-free.app" -> "http://localhost:5000"
If public_url is not None, use it to call the /chat endpoint from outside Colab.


In [12]:
# Cell 11: Quick test examples (run after ngrok created and model loaded)
print('Local health check:')
try:
    import requests
    r = requests.get('http://127.0.0.1:5000/health', timeout=5)
    print('Health:', r.json())
except Exception as e:
    print('Local health check failed:', e)

if 'public_url' in globals() and public_url:
    # ngrok.connect() returns an NgrokTunnel object; its .public_url attribute holds the URL string
    base_url = getattr(public_url, 'public_url', public_url)
    print(f'Call this externally: POST {base_url}/chat with JSON {{"query": "How do I create a seller account on Shoplite?"}}')


INFO:werkzeug:127.0.0.1 - - [29/Sep/2025 17:27:53] "GET /health HTTP/1.1" 200 -


Local health check:
Health: {'model_loaded': True, 'num_docs': 15, 'status': 'ok'}
Call this externally: POST https://f3fceec26a40.ngrok-free.app/chat with JSON {"query": "How do I create a seller account on Shoplite?"}
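To call the endpoint from outside Colab, a client only needs to POST JSON to `/chat` on the tunnel URL. The sketch below uses the standard library so it runs without extra installs; `build_chat_request` and `ask` are hypothetical helper names, and the base URL passed in is whatever ngrok printed for your session.

```python
import json
import urllib.request

def build_chat_request(base_url: str, query: str, top_k: int = 3) -> urllib.request.Request:
    """Build a POST request for the notebook's /chat endpoint."""
    payload = json.dumps({"query": query, "top_k": top_k}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def ask(base_url: str, query: str, timeout: float = 60.0) -> dict:
    """Send the request and decode the JSON reply (answer, sources, confidence)."""
    request = build_chat_request(base_url, query)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `ask(base_url, "How do I create a seller account on Shoplite?")` should return a dict with the same `answer`, `sources`, and `confidence` keys that `generate_response` produces.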


In [41]:
import requests

r = requests.post(
    "http://127.0.0.1:5000/chat",
    json={"query": "can u help me, how to register?"}
)

print(r.json())


INFO:werkzeug:127.0.0.1 - - [29/Sep/2025 18:39:20] "POST /chat HTTP/1.1" 200 -


{'answer': '``', 'confidence': 'Low', 'sources': ['Shoplite User Registration and Account Management', 'Shoplite Seller Account Setup and Management', 'Shoplite Mobile App Features']}


In [67]:
# Cell 13: Test with faster generation
import requests
import time

time.sleep(2)

print("Testing with optimized generation...")

# Test a few queries
test_queries = [
    "what is the company name",
    "hi how are you can u help me?",
    "your website name?"
]

for query in test_queries:
    print(f"\n{'='*60}")
    print(f"Query: {query}")
    print('='*60)

    try:
        r = requests.post(
            "http://127.0.0.1:5000/chat",
            json={"query": query},
            timeout=60  # Reduced timeout
        )
        result = r.json()
        print(f"Answer: {result.get('answer')}")
        print(f"Sources: {', '.join(result.get('sources', []))}")
        print(f"Confidence: {result.get('confidence')}")
    except requests.exceptions.Timeout:
        print("ERROR: Request timed out")
    except Exception as e:
        print(f"ERROR: {e}")

INFO:werkzeug:127.0.0.1 - - [29/Sep/2025 19:12:04] "POST /chat HTTP/1.1" 200 -
INFO:werkzeug:127.0.0.1 - - [29/Sep/2025 19:12:04] "POST /chat HTTP/1.1" 200 -
INFO:werkzeug:127.0.0.1 - - [29/Sep/2025 19:12:04] "POST /chat HTTP/1.1" 200 -


Testing with optimized generation...

Query: what is the company name
Answer: I'm sorry, that question appears to be outside my knowledge base. I can help with questions about Shoplite's registration, orders, payments, returns, seller accounts, product search, and customer support.
Sources: 
Confidence: Low

Query: hi how are you can u help me?
Answer: I'm here and ready to help! How can I assist you with Shoplite today?
Sources: 
Confidence: High

Query: your website name?
Answer: I'm sorry, that question appears to be outside my knowledge base. I can help with questions about Shoplite's registration, orders, payments, returns, seller accounts, product search, and customer support.
Sources: 
Confidence: Low
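The refusals in the output above come from the similarity gate in Cell 8: a top cosine score below `SIMILARITY_THRESHOLD` (0.25) is refused, and otherwise 0.4 and 0.6 separate Low, Medium, and High. The standalone sketch below restates that mapping so it can be unit-tested without loading a model; the function name and the `"Refuse"` label are illustrative assumptions, not part of the notebook.

```python
def score_to_confidence(top_score: float,
                        refuse_below: float = 0.25,
                        medium_from: float = 0.4,
                        high_from: float = 0.6) -> str:
    """Map the best cosine score to 'Refuse', 'Low', 'Medium', or 'High'."""
    if top_score < refuse_below:
        return "Refuse"   # below the similarity gate: treated as out of scope
    if top_score >= high_from:
        return "High"
    if top_score >= medium_from:
        return "Medium"
    return "Low"
```

Running short queries with `debug=True` and comparing the printed top score against these cut-offs is a quick way to decide whether the 0.25 gate is too strict for questions such as "your website name?".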


# Notes & Troubleshooting

- **Model load failures**: Llama 3.1 8B is a gated model, so loading fails without Hugging Face access, and it may not fit on free Colab GPUs. Cell 7 already requests 8-bit loading; if that still fails, switch `MODEL_NAME` to a smaller model or try 4-bit loading (`load_in_4bit=True` in `BitsAndBytesConfig`).
- **Quantization**: 8-bit loading relies on `bitsandbytes` + `accelerate` and a GPU runtime. Gated models also need valid Hugging Face credentials, which `login()` caches under `~/.cache/huggingface`.
- **Security**: Never commit your ngrok authtoken. Enter it at runtime when prompted, and clear any cell output that echoes it before sharing the notebook.
- **Persistence**: Colab runtimes are ephemeral; to reuse the embeddings and knowledge base in a later session, save `doc_embeddings.npy` and `kb.json` to persistent storage such as Google Drive.
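The persistence note can be sketched as a pair of helpers that write and read the two files it names. This is a minimal sketch, not part of the notebook: `save_artifacts`, `load_artifacts`, and the `out_dir` argument are assumptions, and `out_dir` would typically point at a mounted Google Drive folder or another location that survives the runtime.

```python
import json
from pathlib import Path

import numpy as np

def save_artifacts(out_dir, embeddings, knowledge_base):
    """Write doc_embeddings.npy and kb.json into out_dir (created if missing)."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    np.save(out / "doc_embeddings.npy", embeddings)
    (out / "kb.json").write_text(
        json.dumps(knowledge_base, ensure_ascii=False), encoding="utf-8"
    )

def load_artifacts(out_dir):
    """Read back the embeddings array and the knowledge-base list."""
    out = Path(out_dir)
    embeddings = np.load(out / "doc_embeddings.npy")
    knowledge_base = json.loads((out / "kb.json").read_text(encoding="utf-8"))
    return embeddings, knowledge_base
```

After loading, the FAISS index can be rebuilt with `index.add(embeddings)` instead of re-encoding the documents, provided the same embedding model is used for queries.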
