<a href="https://colab.research.google.com/github/mr-cri-spy/SLM/blob/main/multilingual_FAISS_bot_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install transformers faiss-cpu indic-nlp-library




Prepare FAQ Dataset

In [7]:
import json

faq_pairs = [
    {"question": "What is Crisbee?", "answer": "Crisbee is an AI startup focusing on chatbot automation."},
    {"question": "What services do you offer?", "answer": "We offer AI chatbots, voice bots, and NLP solutions."},
    {"question": "Where are you located?", "answer": "We are based in Mysore, India."},
    {"question": "Do you support regional languages?", "answer": "Yes, we support Hindi, Kannada, and Tamil."}
]

with open("multilang_faq.json", "w") as f:
    json.dump(faq_pairs, f, indent=4)
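
Because later cells query the bot in Hindi and Kannada, the FAQ file may eventually hold non-ASCII text. The following is a minimal sketch, not part of the original notebook, of writing and reloading such a file with an explicit UTF-8 encoding and a basic shape check. The helper name `validate_faq`, the sample entries, and the temporary file path are assumptions made for illustration.

```python
import json
import os
import tempfile


def validate_faq(pairs):
    """Raise ValueError unless every entry has non-empty 'question' and 'answer' strings."""
    for i, item in enumerate(pairs):
        for key in ("question", "answer"):
            value = item.get(key) if isinstance(item, dict) else None
            if not isinstance(value, str) or not value.strip():
                raise ValueError(f"entry {i} has a missing or empty {key!r}")
    return pairs


sample_pairs = [
    {"question": "Where are you located?", "answer": "We are based in Mysore, India."},
    # Hypothetical non-English entry, added only to exercise the UTF-8 round trip
    {"question": "आप कहाँ स्थित हैं?", "answer": "We are based in Mysore, India."},
]

# Write to a temporary path so the notebook's multilang_faq.json is left untouched
path = os.path.join(tempfile.mkdtemp(), "faq_sample.json")
with open(path, "w", encoding="utf-8") as f:
    # ensure_ascii=False keeps Devanagari readable in the file instead of \u escapes
    json.dump(validate_faq(sample_pairs), f, indent=4, ensure_ascii=False)

with open(path, "r", encoding="utf-8") as f:
    reloaded = validate_faq(json.load(f))
```

Validating on both write and read is a design choice for this sketch: it catches a malformed entry before it reaches the embedding step.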


Load IndicBERT + Tokenizer

In [8]:
from transformers import AutoTokenizer, AutoModel
import torch
import numpy as np

model_name = "ai4bharat/indic-bert"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

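def get_embedding(text):
    """Return a mean-pooled float32 sentence vector for a single text."""
    # Without max_length, truncation=True has no effect here: the tokenizer
    # reports no predefined maximum length and falls back to no truncation.
    tokens = tokenizer(text, return_tensors='pt', truncation=True, padding=True)
    with torch.no_grad():  # inference only, no gradients needed
        output = model(**tokens)
    # Average the token vectors into one vector of shape (hidden_size,)
    return output.last_hidden_state.mean(dim=1).squeeze().numpy().astype("float32")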
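
`get_embedding` averages `last_hidden_state` over every token position. That is fine for one text at a time, but if several texts were encoded together in one padded batch, the padding positions would be averaged in as well. A minimal NumPy sketch of mask-aware mean pooling follows. The function name `masked_mean` and the toy arrays are assumptions for illustration; the same arithmetic would apply to the model's `last_hidden_state` and the tokenizer's `attention_mask`.

```python
import numpy as np


def masked_mean(hidden, mask):
    """Average token vectors over real tokens only.

    hidden: (batch, seq_len, dim) array of token vectors
    mask:   (batch, seq_len) array, 1 for real tokens and 0 for padding
    """
    mask = mask[:, :, None].astype(hidden.dtype)   # (batch, seq_len, 1)
    summed = (hidden * mask).sum(axis=1)           # padding contributes zero
    counts = np.clip(mask.sum(axis=1), 1.0, None)  # avoid division by zero
    return (summed / counts).astype("float32")


# Toy batch: the second sequence has one padded position holding junk values
hidden = np.array(
    [
        [[1.0, 1.0], [3.0, 3.0]],
        [[2.0, 4.0], [100.0, 100.0]],
    ]
)
mask = np.array([[1, 1], [1, 0]])

pooled = masked_mean(hidden, mask)
```

In this toy batch the junk values at the padded position do not affect the second row of `pooled`.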


Prepare FAISS Index from English Questions

In [9]:
import faiss

# Load data
with open("multilang_faq.json", "r") as f:
    faq_pairs = json.load(f)

questions = [item["question"] for item in faq_pairs]
answers = [item["answer"] for item in faq_pairs]

# Encode and index
question_embeddings = np.array([get_embedding(q) for q in questions]).astype("float32")

index = faiss.IndexFlatL2(question_embeddings.shape[1])
index.add(question_embeddings)
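
`faiss.IndexFlatL2` performs an exact, exhaustive search and reports squared Euclidean distances. For intuition, the NumPy sketch below computes the same kind of ranking by brute force. It is an illustration rather than a replacement for FAISS, and the name `l2_search` and the toy vectors are assumptions.

```python
import numpy as np


def l2_search(database, queries, k):
    """Exact k-nearest-neighbour search under squared Euclidean distance.

    Returns (distances, indices), each of shape (n_queries, k), nearest first.
    """
    diff = queries[:, None, :] - database[None, :, :]   # (n_queries, n_db, dim)
    sq_dist = (diff ** 2).sum(axis=2)                   # (n_queries, n_db)
    order = np.argsort(sq_dist, axis=1, kind="stable")[:, :k]
    return np.take_along_axis(sq_dist, order, axis=1), order


database = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 3.0]], dtype="float32")
queries = np.array([[0.9, 0.1]], dtype="float32")

distances, indices = l2_search(database, queries, k=2)
```

With only a handful of FAQ questions, an exhaustive scan like this is cheap, which is presumably why a flat index is enough for this notebook.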


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Build ask_bot() for Multilingual Queries


In [10]:
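def ask_bot(query, k=1):
    """Return the answer (or the k answers) whose question is nearest to the query."""
    # FAISS expects a 2-D array of shape (n_queries, dim)
    query_vector = get_embedding(query).reshape(1, -1)
    distances, indices = index.search(query_vector, k)

    results = []
    for idx in indices[0]:
        if idx == -1:  # FAISS returns -1 when fewer than k vectors are indexed
            continue
        results.append(answers[idx])
    return results[0] if k == 1 else results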
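
`ask_bot` always returns the nearest answer, however far away it is. One possible refinement, sketched below under assumptions, is to keep the distance next to each answer and fall back to a fixed message when the best match is too far. The helper `pick_answers`, the fallback text, and the threshold value are hypothetical; a sensible threshold would have to be tuned on real queries. The inputs have the `(n_queries, k)` shape that `index.search` returns.

```python
FALLBACK = "Sorry, I could not find a matching answer."  # hypothetical message


def pick_answers(distances, indices, answers, max_distance):
    """Pair each hit with its distance, dropping hits that are missing or too far.

    distances, indices: nested sequences shaped (1, k), as for a single query
    max_distance: hypothetical cutoff on the (squared L2) distance
    """
    hits = []
    for dist, idx in zip(distances[0], indices[0]):
        if idx == -1:  # FAISS pads with -1 when fewer than k results exist
            continue
        if dist <= max_distance:
            hits.append((answers[idx], float(dist)))
    return hits or [(FALLBACK, None)]


answers = ["answer A", "answer B", "answer C"]
inf = float("inf")

# Toy search results for one query with k=3; the last slot is an empty result
close = pick_answers([[0.5, 4.0, inf]], [[2, 0, -1]], answers, max_distance=1.0)
far = pick_answers([[9.0, 12.0, inf]], [[1, 2, -1]], answers, max_distance=1.0)
```

Here `close` keeps only the hit within the cutoff, while `far` collapses to the fallback entry.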


Test the Multilingual Bot

In [11]:
# English
print("EN:", ask_bot("Where is your company located?"))

# Hindi
print("HI:", ask_bot("क्या आप हिंदी में सहायता प्रदान करते हैं?"))

# Kannada
print("KA:", ask_bot("ನೀವು ಕನ್ನಡ ಬೆಂಬಲಿಸುತ್ತೀರಾ?"))


EN: We are based in Mysore, India.
HI: Crisbee is an AI startup focusing on chatbot automation.
KA: Crisbee is an AI startup focusing on chatbot automation.
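
In the output above, the Hindi and Kannada queries both ask about language support, yet both return the "What is Crisbee?" answer. A small check against expected answers makes such misses visible at a glance. The sketch below assumes a hypothetical helper `evaluate` and uses a stand-in `fake_bot` so that it runs without the model; in the notebook, `ask_bot` would be passed instead.

```python
def evaluate(bot, cases):
    """Run `bot` on each (query, expected_answer) pair.

    Returns (accuracy, misses), where misses lists (query, expected, got).
    """
    misses = []
    for query, expected in cases:
        got = bot(query)
        if got != expected:
            misses.append((query, expected, got))
    accuracy = 1.0 - len(misses) / len(cases) if cases else 0.0
    return accuracy, misses


def fake_bot(query):
    """Stand-in for ask_bot: it always returns the same answer."""
    return "Crisbee is an AI startup focusing on chatbot automation."


cases = [
    ("What is Crisbee?", "Crisbee is an AI startup focusing on chatbot automation."),
    ("ನೀವು ಕನ್ನಡ ಬೆಂಬಲಿಸುತ್ತೀರಾ?", "Yes, we support Hindi, Kannada, and Tamil."),
]

accuracy, misses = evaluate(fake_bot, cases)
for query, expected, got in misses:
    print(f"MISS {query!r}: expected {expected!r}, got {got!r}")
```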


In [19]:
print("EN:", ask_bot("where are you located"))

EN: We are based in Mysore, India.
