<a href="https://colab.research.google.com/github/mamathasara/mamathasara-business-faq-rag-bot/blob/main/business_faq_rag_bot.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

✅ STEP 1: Install Required Libraries

In [1]:
!pip install -U pinecone-client sentence-transformers



❓ Why I Shifted from OpenAI Embeddings to Hugging Face Sentence Transformers
Initially, I planned to use OpenAI’s text-embedding-3-small model for generating embeddings in my Retrieval-Augmented Generation (RAG) system. However, during implementation, I encountered the following issue:

❌ RateLimitError: “You exceeded your current quota...”

This error occurred because I had used up my free OpenAI API quota and had not set up paid billing. Since the assignment required a working RAG model and I wanted to avoid costs, I switched to the free Hugging Face sentence-transformers model all-MiniLM-L6-v2, which runs locally and produces 384-dimensional embeddings.

In [3]:
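# The Pinecone SDK is now published as `pinecone`; remove the older `pinecone-client` package first
!pip uninstall -y pinecone-client
!pip install pinecone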


Found existing installation: pinecone-client 6.0.0
Uninstalling pinecone-client-6.0.0:
  Successfully uninstalled pinecone-client-6.0.0
Collecting pinecone
  Downloading pinecone-7.3.0-py3-none-any.whl.metadata (9.5 kB)
Collecting pinecone-plugin-assistant<2.0.0,>=1.6.0 (from pinecone)
  Downloading pinecone_plugin_assistant-1.7.0-py3-none-any.whl.metadata (28 kB)
Downloading pinecone-7.3.0-py3-none-any.whl (587 kB)
Downloading pinecone_plugin_assistant-1.7.0-py3-none-any.whl (239 kB)
Installing collected packages: pinecone-plugin-assistant, pinecone
Successfully installed pinecone-7.3.0 pinecone-plugin-assistant-1.7.0


✅ STEP 2: Import Libraries and Set API Keys

✅ STEP 3: Create or Connect to Pinecone Index


In [5]:
import os

from pinecone import Pinecone, ServerlessSpec

# Create the Pinecone client; read the key from the environment instead of hardcoding it in the notebook
pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])

# Create index if it doesn’t exist
if "rag-qa-bot" not in pc.list_indexes().names():
    pc.create_index(
        name="rag-qa-bot",
        dimension=384,  # For sentence-transformers model
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1")
    )

# Connect to the index
index = pc.Index("rag-qa-bot")


✅ STEP 4: Prepare Business FAQs

In [6]:
docs = [
    "Our business hours are 9 AM to 5 PM, Monday through Friday.",
    "You can contact our customer support by emailing support@ourcompany.com.",
    "We offer a 30-day return policy for all products purchased through our website.",
    "Shipping typically takes 5–7 business days for standard delivery.",
    "Our company specializes in custom software solutions for e-commerce businesses.",
    "You can reset your account password using the 'Forgot Password' link on the login page.",
    "We provide onboarding sessions for new clients every Monday at 11 AM via Zoom.",
    "Invoices are generated automatically at the end of each billing cycle and can be downloaded from your dashboard.",
    "Employees are eligible for health insurance benefits after 3 months of full-time employment.",
    "To cancel your subscription, go to the billing settings in your account dashboard and click 'Cancel Plan'.",
    "We provide 24/7 live chat support for premium customers.",
    "Our data privacy policy complies with GDPR and CCPA regulations.",
    "Bulk orders above 100 units are eligible for custom discounts. Contact sales@ourcompany.com for more info.",
    "You can track your order status using the tracking link sent to your registered email.",
    "Our mobile app is available on both iOS and Android platforms.",
    "We host monthly webinars on topics like business automation and AI in CRM systems.",
    "All employees are required to complete annual security awareness training.",
    "Technical documentation for our APIs is available on our developer portal.",
    "Refunds are processed within 5–10 business days after receiving returned items.",
    "We partner with Stripe for secure online payments and subscription management."
]


✅ STEP 5: Generate Embeddings & Upload to Pinecone (Free + Fast)

In [9]:
if "rag-qa-bot" in pc.list_indexes().names():
    pc.delete_index("rag-qa-bot")


I delete the existing Pinecone index because it was created for OpenAI embeddings with a dimension of 1536. The Hugging Face model all-MiniLM-L6-v2 produces 384-dimensional vectors, and an index's dimension is fixed when it is created, so I need a new index with dimension 384.
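A dimension mismatch like the one described here (1536 vs. 384) can be caught before anything is sent to Pinecone. The sketch below is only an illustration: `check_embedding_dimension` is a hypothetical helper, not part of the Pinecone SDK, and the vectors are toy stand-ins for real embeddings.

```python
def check_embedding_dimension(vectors, expected_dim):
    """Raise ValueError if any vector's length differs from expected_dim.

    Hypothetical helper for illustration; returns the number of vectors checked.
    """
    for position, vector in enumerate(vectors):
        if len(vector) != expected_dim:
            raise ValueError(
                f"vector {position} has dimension {len(vector)}, "
                f"but the index expects {expected_dim}"
            )
    return len(vectors)


# Toy vectors standing in for real 384-dimensional embeddings
toy_vectors = [[0.0] * 384, [0.1] * 384]
checked = check_embedding_dimension(toy_vectors, expected_dim=384)
print(checked, "vectors match the index dimension")
```

Calling the helper on the output of `model.encode(docs).tolist()` before upserting would surface the problem as a clear local error.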

In [10]:
from pinecone import ServerlessSpec

pc.create_index(
    name="rag-qa-bot",
    dimension=384,  # ✅ Matches Hugging Face embedding size
    metric="cosine",
    spec=ServerlessSpec(
        cloud="aws",
        region="us-east-1"
    )
)


{
    "name": "rag-qa-bot",
    "metric": "cosine",
    "host": "rag-qa-bot-28mdvhu.svc.aped-4627-b74a.pinecone.io",
    "spec": {
        "serverless": {
            "cloud": "aws",
            "region": "us-east-1"
        }
    },
    "status": {
        "ready": true,
        "state": "Ready"
    },
    "vector_type": "dense",
    "dimension": 384,
    "deletion_protection": "disabled",
    "tags": null
}

In [11]:
index = pc.Index("rag-qa-bot")


In [12]:
# Load Hugging Face model
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')

# Generate 384-dim embeddings
vectors = model.encode(docs).tolist()

# Upload to Pinecone using the new SDK format, in one upsert call instead of one per document
records = [
    {"id": str(i), "values": vector, "metadata": {"text": text}}
    for i, (text, vector) in enumerate(zip(docs, vectors))
]
index.upsert(vectors=records)
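For a larger FAQ set, the records would probably need to be sent in batches rather than all at once. The following is a minimal sketch under that assumption: `build_records` and `batched` are hypothetical helpers, the batch size of 100 is an arbitrary choice, and the vectors are toy values. The actual `index.upsert` call is left as a comment so the block runs without a Pinecone connection.

```python
def build_records(texts, vectors):
    """Pair each text with its vector in the dict format used for upserts."""
    return [
        {"id": str(i), "values": list(vector), "metadata": {"text": text}}
        for i, (text, vector) in enumerate(zip(texts, vectors))
    ]


def batched(records, batch_size=100):
    """Yield consecutive slices of at most batch_size records."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]


# Toy data standing in for docs and their embeddings
toy_texts = ["first faq", "second faq", "third faq"]
toy_vectors = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
toy_records = build_records(toy_texts, toy_vectors)

for batch in batched(toy_records, batch_size=2):
    # With a live index this would be: index.upsert(vectors=batch)
    print("would upsert", len(batch), "records")
```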


✅ STEP 6: Answer Generation using Flan-T5 (Free Hugging Face Model)


In [14]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch


In [15]:
# Load FLAN-T5 base (small enough to run on Colab)
model_name = "google/flan-t5-base"

tokenizer = AutoTokenizer.from_pretrained(model_name)
hf_model = AutoModelForSeq2SeqLM.from_pretrained(model_name)



In [16]:
def generate_answer_flant5(question, context):
    # Create the prompt with context
    prompt = f"Context: {context}\nQuestion: {question}\nAnswer:"

    # Tokenize the prompt
    inputs = tokenizer(prompt, return_tensors="pt", max_length=512, truncation=True)

    # Generate answer
    outputs = hf_model.generate(**inputs, max_new_tokens=100)

    # Decode the output
    answer = tokenizer.decode(outputs[0], skip_special_tokens=True)

    return answer
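One thing to watch in the function above: the context comes first in the prompt, and tokenizer truncation drops tokens from the end by default, so a very long context could cut off the question and the "Answer:" cue. A rough way around this is to trim the context before building the prompt. The sketch below assumes a hypothetical `build_prompt` helper and uses a word count as a crude stand-in for the token count; the 300-word budget is an arbitrary illustration, not a tuned value.

```python
def build_prompt(question, context_docs, max_context_words=300):
    """Keep whole documents, in order, until the word budget is used up.

    Word count is only a rough proxy for tokens (assumption for illustration).
    """
    kept = []
    used = 0
    for doc in context_docs:
        words = len(doc.split())
        if used + words > max_context_words:
            break
        kept.append(doc)
        used += words
    context = "\n".join(kept)
    return f"Context: {context}\nQuestion: {question}\nAnswer:"


toy_docs = ["one two three", "four five six", "seven eight nine"]
prompt = build_prompt("What comes first?", toy_docs, max_context_words=6)
print(prompt)
```

Because the retrieved documents arrive ranked by similarity, dropping from the end of the list keeps the most relevant ones.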


In [17]:
def retrieve_relevant_docs(question, top_k=3):
    question_vector = model.encode(question).tolist()
    results = index.query(vector=question_vector, top_k=top_k, include_metadata=True)
    return [match['metadata']['text'] for match in results['matches']]
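To make the retrieval step less of a black box, here is a small local sketch of what a cosine-metric top-k lookup computes. It is a plain-Python stand-in with toy two-dimensional vectors, not Pinecone's implementation; `cosine_similarity` and `top_k_by_cosine` are hypothetical names.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_by_cosine(query, candidates, top_k=3):
    """Return the top_k (score, text) pairs, highest similarity first."""
    scored = [(cosine_similarity(query, vector), text) for text, vector in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Toy 2-D vectors standing in for 384-dimensional embeddings
toy_candidates = [
    ("password reset", [1.0, 0.0]),
    ("shipping times", [0.0, 1.0]),
    ("account login", [1.0, 1.0]),
]
for score, text in top_k_by_cosine([1.0, 0.0], toy_candidates, top_k=2):
    print(round(score, 3), text)
```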


In [18]:
def rag_pipeline_free(question):
    # Retrieve similar docs
    retrieved_docs = retrieve_relevant_docs(question)

    # Combine as single context block
    context = "\n".join(retrieved_docs)

    # Generate answer using Flan-T5
    answer = generate_answer_flant5(question, context)

    return answer
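The pipeline above depends on a live Pinecone index and a loaded model, which makes it awkward to test. One possible variation, sketched below, passes the retriever and generator in as arguments so they can be replaced by stubs. `make_rag_pipeline`, the fallback message, and the stub functions are all hypothetical and only illustrate the idea.

```python
def make_rag_pipeline(retrieve, generate, fallback="I could not find that in the FAQ."):
    """Build a question -> answer function from a retriever and a generator."""
    def pipeline(question):
        retrieved_docs = retrieve(question)
        if not retrieved_docs:
            # Nothing retrieved, so skip generation instead of answering from an empty context
            return fallback
        context = "\n".join(retrieved_docs)
        return generate(question, context)
    return pipeline


# Stubs standing in for retrieve_relevant_docs and generate_answer_flant5
def stub_retrieve(question):
    return ["Use the 'Forgot Password' link."] if "password" in question else []


def stub_generate(question, context):
    return f"Based on the FAQ: {context}"


toy_pipeline = make_rag_pipeline(stub_retrieve, stub_generate)
print(toy_pipeline("How can I reset my password?"))
print(toy_pipeline("What is the weather?"))
```

In the notebook, the same factory could be called with `retrieve_relevant_docs` and `generate_answer_flant5` as the two arguments.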


In [25]:
question = "How can I reset my password?"
answer = rag_pipeline_free(question)

print("Q:", question)
print("A:", answer)


Q: How can I reset my password?
A: Forgot Password


In [26]:
!pip install gradio

import gradio as gr

def rag_ui(question):
    return rag_pipeline_free(question)

# Build the Gradio interface
interface = gr.Interface(
    fn=rag_ui,
    inputs=gr.Textbox(lines=2, placeholder="Ask a business-related question..."),
    outputs="text",
    title="📚 Business FAQ RAG Bot (Free)",
    description="Ask your business-related questions — powered by Hugging Face + Pinecone."
)

# Launch and get link
interface.launch(share=True)


Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://f7b0f0671192d40755.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)
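Since the share link is public, the function behind the interface may receive empty input or hit a runtime error. A minimal sketch of a defensive wrapper follows; `make_safe_handler` and its messages are hypothetical, and the stub pipelines stand in for `rag_pipeline_free`.

```python
def make_safe_handler(pipeline):
    """Wrap a question -> answer function with input cleanup and error handling."""
    def handler(question):
        cleaned = (question or "").strip()
        if not cleaned:
            return "Please enter a question."
        try:
            return pipeline(cleaned)
        except Exception as exc:
            # Show only the error type to the user, not internal details
            return f"Sorry, something went wrong ({type(exc).__name__})."
    return handler


# Stub pipelines standing in for rag_pipeline_free
def echo_pipeline(question):
    return f"Answer to: {question}"


def failing_pipeline(question):
    raise RuntimeError("index unavailable")


safe_echo = make_safe_handler(echo_pipeline)
safe_failing = make_safe_handler(failing_pipeline)
print(safe_echo("  What are your hours?  "))
print(safe_echo(""))
print(safe_failing("Anything"))
```

The wrapped function could then be passed as `fn` to `gr.Interface` in place of `rag_ui`.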


