# 🧠 Q&A Bot - Computer Programming
This notebook loads a Q&A dataset, fine-tunes a small SentenceTransformer model on the question-answer pairs, retrieves answers by cosine similarity between sentence embeddings, and lets you ask questions.

### Step 1: Load the Q&A Dataset

In this step, we import the pandas library and load the Q&A dataset (`combined_knowledge_base.csv`) into a DataFrame called `df`.
We then display the first 5 rows to understand the structure of the data.


In [1]:
import pandas as pd

# Load the Q&A dataset
df = pd.read_csv('combined_knowledge_base.csv')
df.head()

Unnamed: 0,Question,Answer
0,What should I do if I have a stomach ache?,"Drink plenty of water, rest well, and avoid he..."
1,How can I reduce a fever?,"Rest well, drink plenty of water, and you can ..."
2,What prevents a baby from sleeping at night?,"A baby may not sleep due to hunger, stomach ac..."
3,Should I vaccinate my baby?,"Yes, it is important to vaccinate your baby ac..."
4,What prevents a woman from getting pregnant?,"Causes include hormonal problems, uterine prob..."


### Step 2: Split the Dataset into Training and Testing Sets

We first call `df.info()` to confirm the column names and check for missing values: the dataset has 216 rows, and neither `Question` nor `Answer` contains nulls.

We then use `train_test_split` from `sklearn.model_selection` to split the dataset:
- 90% of the data is used for training (`train_df`)
- 10% of the data is used for testing (`test_df`)

Setting `random_state=42` ensures that the split is reproducible.
Finally, we check the shapes of the resulting training and testing sets.
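
The split sizes can be worked out by hand as a sanity check on the reported shapes. This is a minimal sketch that assumes a fractional `test_size` is rounded up to a whole number of test rows, with the remainder going to training; the helper name `split_sizes` is hypothetical and not part of scikit-learn.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size.

    Assumption: the test count is rounded up and the rest is used for training.
    """
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test

# 216 rows with a 10% test split
print(split_sizes(216, 0.1))  # (194, 22)
```

This agrees with the `(194, 2)` and `(22, 2)` shapes printed after the split.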


In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 216 entries, 0 to 215
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Question  216 non-null    object
 1   Answer    216 non-null    object
dtypes: object(2)
memory usage: 3.5+ KB


In [3]:
from sklearn.model_selection import train_test_split

# 90% Train / 10% Test
train_df, test_df = train_test_split(df, test_size=0.1, random_state=42)

train_df.shape, test_df.shape


((194, 2), (22, 2))

### Step 3: Load the Pre-trained SentenceTransformer Model

We load the `paraphrase-MiniLM-L3-v2` model from the `sentence_transformers` library.
- This model is small, which keeps embedding generation and fine-tuning fast.
- It converts questions and answers into vector representations.


In [4]:
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# Load small but powerful model
model = SentenceTransformer('paraphrase-MiniLM-L3-v2')


### Step 4: Prepare Training Data and Loss Function

- We create `InputExample` objects for each question-answer pair from the training dataset.
- The `label=0.9` means we expect a high similarity between the question and its correct answer.
- We use a `DataLoader` to efficiently load the data during training, shuffling it and using a batch size of 32.
- We define the **Cosine Similarity Loss**, which will train the model to bring similar question-answer pairs closer in vector space.
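
To make the objective concrete, the sketch below reproduces in plain Python what a cosine similarity loss measures: the squared difference between the cosine similarity of a pair of embeddings and its label, averaged over the batch. It assumes the mean-squared-error form that `CosineSimilarityLoss` uses by default. The 2-D vectors are made up for illustration, and the helper names are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_similarity_loss(pairs, labels):
    """Mean squared error between each pair's cosine similarity and its label."""
    errors = [(cosine_similarity(u, v) - y) ** 2 for (u, v), y in zip(pairs, labels)]
    return sum(errors) / len(errors)

# Toy "embeddings" (illustrative values, not model output)
question_vec = [1.0, 0.0]
answer_vec = [1.0, 1.0]  # cosine similarity with question_vec is about 0.707

loss = cosine_similarity_loss([(question_vec, answer_vec)], [0.9])
print(round(loss, 4))  # 0.0372
```

Because every training pair here carries the same label of 0.9, this objective only pulls matched questions and answers together; it never sees an explicitly dissimilar pair.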


In [5]:
# Create training examples
train_examples = [
    InputExample(texts=[row['Question'], row['Answer']], label=0.9)
    for _, row in train_df.iterrows()
]

# Create DataLoader
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)

# Define Cosine Similarity Loss
train_loss = losses.CosineSimilarityLoss(model)


### Step 5: Fine-tuning the Model

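- We fine-tune the pre-trained SentenceTransformer model on our Q&A training set.
- `train_objectives` specifies the training data and loss function.
- `epochs=1` means the model will see the entire training dataset once (can be increased for better accuracy).
- `warmup_steps=100` asks for the learning rate to ramp up gradually over the first 100 training steps. With 194 training examples and a batch size of 32, one epoch is only 7 steps, so the warmup does not finish in this run; a smaller value may suit a dataset of this size.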
- `show_progress_bar=True` displays a progress bar to monitor training.
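
The number of optimizer steps in this run follows from the training set size and the batch size, which makes it easy to compare against `warmup_steps`. The sketch below is plain arithmetic; the 10% warmup fraction is a hypothetical rule of thumb, not something the notebook or the library prescribes.

```python
import math

n_train = 194      # rows in train_df
batch_size = 32
epochs = 1

# The last, partially filled batch still counts as a step
steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)  # 7 7

# Hypothetical rule of thumb: warm up over roughly 10% of all steps
suggested_warmup = max(1, math.ceil(0.1 * total_steps))
print(suggested_warmup)  # 1
```

Raising `epochs` scales `total_steps` proportionally, so the warmup length is worth recomputing whenever the epoch count changes.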


In [6]:
# Fine-tuning
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,  # You can increase to 2–5 epochs for better results
    warmup_steps=100,
    show_progress_bar=True
)




### Step 6: Encode the Questions into Embeddings

- We use the fine-tuned model to **convert** all training and testing questions into **vector embeddings**.
- `convert_to_tensor=True` ensures that the embeddings are stored as PyTorch tensors, which are efficient for similarity computation later.
- These embeddings will help the model find the most relevant answers during prediction.


In [7]:
# Encode questions into embeddings
train_embeddings = model.encode(train_df['Question'].tolist(), convert_to_tensor=True)
test_embeddings = model.encode(test_df['Question'].tolist(), convert_to_tensor=True)


### Step 7: Evaluate the Model Performance

- We define an `evaluate` function to check how well the model retrieves the correct answers.
- **Process**:
  - For each question in the test set:
    - Encode the question.
    - Compute **cosine similarity** with all training questions.
    - Find the top 3 most similar questions.
    - Check if the expected answer is among the top retrieved results.
- **Metrics**:
  - **Top-1 Accuracy**: Correct answer is the first most similar prediction.
  - **Top-3 Accuracy**: Correct answer is among the top 3 predictions.
- Finally, we print the evaluation results.
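
The metric itself does not depend on the embedding model, so it can be illustrated with a few hand-made vectors. The sketch below is a simplified pure-Python version of top-k accuracy; the 2-D vectors, the answer labels `"A"`, `"B"`, `"C"`, and the helper names are all made up for illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def top_k_accuracy(queries, corpus, k):
    """Fraction of queries whose expected answer is among the k nearest corpus entries.

    queries and corpus are lists of (embedding, answer) pairs.
    """
    hits = 0
    for query_vec, expected in queries:
        ranked = sorted(
            corpus,
            key=lambda item: cosine_similarity(query_vec, item[0]),
            reverse=True,
        )
        if any(answer == expected for _, answer in ranked[:k]):
            hits += 1
    return hits / len(queries)

corpus = [([1.0, 0.0], "A"), ([0.0, 1.0], "B"), ([1.0, 1.0], "C")]
queries = [([0.9, 0.1], "A"), ([0.6, 0.5], "A"), ([0.0, 1.0], "B")]

print(round(top_k_accuracy(queries, corpus, k=1), 2))  # 0.67
print(round(top_k_accuracy(queries, corpus, k=2), 2))  # 1.0
```

The second query is closest to the `"C"` entry, so it counts as a miss at k=1 but as a hit at k=2, which is why top-k accuracy can only stay the same or rise as k grows.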


In [8]:
from sentence_transformers.util import cos_sim
import torch

def evaluate(test_df, train_df, train_embeddings, model, top_k=3):
    correct_top1 = 0
    correct_top3 = 0

    for idx, row in test_df.iterrows():
        query = row['Question']
        expected_answer = row['Answer']
        
        query_embedding = model.encode(query, convert_to_tensor=True)
        scores = cos_sim(query_embedding, train_embeddings)[0]
        top_results = torch.topk(scores, k=top_k)
        
        found = False
        for score_idx in top_results.indices:
            candidate_answer = train_df.iloc[score_idx.item()]['Answer']
            if candidate_answer == expected_answer:
                found = True
                break
        
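        # Top-1 is correct when the best match carries the expected answer
        best_idx = top_results.indices[0].item()
        if train_df.iloc[best_idx]['Answer'] == expected_answer:
            correct_top1 += 1
        if found:
            correct_top3 += 1

    top1_acc = correct_top1 / len(test_df)
    top3_acc = correct_top3 / len(test_df)
    
    return top1_acc, top3_acc

# Run evaluation
top1_acc, top3_acc = evaluate(test_df, train_df, train_embeddings, model)
print(f"Top-1 Accuracy: {top1_acc:.2f}")
print(f"Top-3 Accuracy: {top3_acc:.2f}")


Top-1 Accuracy: 0.00
Top-3 Accuracy: 0.86

The recorded Top-1 value of 0.00 came from comparing the best match's row position in `train_df` with the test row's index label, which does not indicate whether the answers agree. Rerun the cell to obtain the Top-1 accuracy under the same answer comparison used for Top-3.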


### Step 9: Define Chatbot Response Function

- In this step, we define a function `chatbot_response()` that answers a user query by retrieval.
- The function:
  - Takes a user query as input and converts it into an embedding.
  - Uses `semantic_search()` to find the `top_n` most similar questions in the training data.
  - Prints each matched question, its answer, and the similarity score.


In [9]:
# Import library if not already imported
from sentence_transformers.util import semantic_search

# Define function to test the model
def chatbot_response(user_query, top_n=3):
    user_embedding = model.encode(user_query, convert_to_tensor=True)
    hits = semantic_search(user_embedding, train_embeddings, top_k=top_n)[0]
    
    for hit in hits:
        idx = hit['corpus_id']
        matched_question = train_df.iloc[idx]['Question']
        answer = train_df.iloc[idx]['Answer']
        score = hit['score']
        
        print(f"Matched Question: {matched_question}")
        print(f"Answer: {answer}")
        print(f"Similarity Score: {score:.2f}")
        print("-" * 50)


### Step 10: Test the Chatbot

- Here we test the chatbot with the query `"hi"`.


In [10]:
chatbot_response("hi")

Matched Question: Hello
Answer: Hello! How can I assist you today?
Similarity Score: 0.71
--------------------------------------------------
Matched Question: Hey
Answer: Hey there! How can I help you today?
Similarity Score: 0.70
--------------------------------------------------
Matched Question: What is the benefit of taking a warm bath after childbirth?
Answer: A warm bath relieves pain, improves blood circulation, and helps the body recover after childbirth.
Similarity Score: 0.20
--------------------------------------------------
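
The third match above scores only 0.20 and is unrelated to the greeting. One possible refinement, sketched here, is to drop hits below a minimum score before showing them. The `filter_hits` helper and the `min_score=0.5` cutoff are hypothetical and not part of the notebook; the hit dictionaries mirror the `corpus_id` and `score` keys that `chatbot_response()` reads.

```python
def filter_hits(hits, min_score=0.5):
    """Keep only hits whose similarity score reaches min_score."""
    return [hit for hit in hits if hit["score"] >= min_score]

# Scores copied from the output above; the corpus_id values are placeholders
hits = [
    {"corpus_id": 0, "score": 0.71},
    {"corpus_id": 1, "score": 0.70},
    {"corpus_id": 2, "score": 0.20},
]

kept = filter_hits(hits)
print([hit["score"] for hit in kept])  # [0.71, 0.7]
```

A suitable cutoff depends on the model and data, so it would need tuning against held-out queries rather than being fixed at 0.5.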


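### Step 11: Save the Model

- First, we save the fine-tuned model to the `momcare_model` directory with `save_pretrained()`.
- We also serialize the entire model object to `momcare.pkl` using **Pickle**.
- Either copy allows us to load and use the model later in other projects or environments. The directory format is generally the more portable of the two, since a pickle depends on the library versions installed when it was written.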


In [11]:
model.save_pretrained("momcare_model")


In [12]:
import pickle

# Save the whole model
with open('momcare.pkl', 'wb') as f:
    pickle.dump(model, f)

print("Model saved as momcare.pkl")

Model saved as momcare.pkl
