# Oral QA dataset

**British University in Egypt** \\
*ICS*

---

### STUDENT DETAIL:
- **Omar Islam**
  - Email: omar219127@bue.edu.eg
  - Student ID: 219127


---




## Contribution

### **Embedding-Based Retrieval Chatbot (Sentence Transformers + Cosine Similarity)**:

  * Implements a simple but effective retrieval-based chatbot that converts user queries and stored questions into vector embeddings using a pretrained SentenceTransformer (all-MiniLM-L6-v2).

  * Enables fast semantic similarity search using cosine similarity to find the best matching question in the dataset.

  * Provides domain-specific Q&A by returning the stored answer corresponding to the closest matched question.

  * Simple architecture with minimal overhead, suitable for small to medium-sized datasets.
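
As a rough illustration of the matching step described above, the sketch below scores a query vector against stored question vectors with cosine similarity and returns the answer of the best match. The three-dimensional vectors and the `best_match` helper are hypothetical stand-ins; real all-MiniLM-L6-v2 embeddings have 384 dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def best_match(query_vec, question_vecs, answers):
    """Return (answer, score) for the stored question closest to the query."""
    scores = [cosine_similarity(query_vec, q) for q in question_vecs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return answers[best], scores[best]

# Hypothetical toy "embeddings", one per stored question.
toy_question_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.6, 0.8, 0.0]]
toy_answers = ["answer A", "answer B", "answer C"]

answer, score = best_match([0.5, 0.9, 0.0], toy_question_vecs, toy_answers)
print(answer, round(score, 3))
```

Because the best match is always returned, a query unrelated to every stored question still gets an answer; inspecting the returned score is one way to notice such cases.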

### Fine Tuning

1. Data Preparation and Formatting:

  * Extracts question-answer pairs from the raw data and formats them for training a sequence-to-sequence model (T5): each input carries an explicit "question:" prefix, and each target is the plain-text answer.

2. Fine-tuning a Pretrained T5 Model for QA

  * Uses Hugging Face’s Seq2SeqTrainer to fine-tune the t5-small model on the oral disease QA dataset.

  * Applies tokenization and padding with max length, optimizing for question-answer pairs to generate relevant answers in natural language.

3. Corpus Construction for Retrieval

  * Converts the dataset’s answers into a retrieval corpus, each entry containing a title (the question) and a corresponding text (the answer/context), formatted as JSON.

  * Prepares the corpus to be indexed by a vector similarity search library (FAISS).

4. Vector Embedding with Sentence Transformers

  * Encodes corpus texts into dense vector embeddings using a fast and efficient transformer model (all-MiniLM-L6-v2) suitable for semantic search.

  * Generates embeddings that capture semantic similarity between queries and corpus entries.

5. FAISS Index Creation and Persistence

  * Builds a flat (exhaustive, L2-distance) FAISS index over the embedded corpus vectors for nearest-neighbor retrieval.

  * Saves the FAISS index and corpus texts to disk for reuse in retrieval.

6. Custom Retriever Implementation

  * Defines a CustomRetriever class that:

    * Loads the FAISS index and corpus texts from disk.

    * Encodes user queries into embeddings using the same sentence transformer.

    * Searches the FAISS index for the top-k semantically similar corpus entries.

    * Returns retrieved contexts for use as input to the QA model.

7. Retrieval-Augmented Generation (RAG) Inference Pipeline

  * Implements a RAG-style QA function that:

    * Takes a user question as input.

    * Retrieves relevant contexts from the corpus via the retriever.

    * Concatenates these contexts and appends them to the question to form a single input prompt for the fine-tuned T5 model.

    * Generates a natural language answer based on both the question and the retrieved context.

    * Grounds generation in retrieved knowledge, which is intended to make answers more informed and accurate.

8. End-to-End Demonstration and Testing

  * Showcases the complete pipeline from question input to answer output, demonstrating how fine-tuning, retrieval, and generation integrate to provide context-aware answers for oral disease queries.
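
The steps above can be sketched end to end without any model, to show how the pieces connect. This is a minimal sketch under stated assumptions: `fake_retrieve` and `fake_generate` are hypothetical placeholders for the FAISS retriever and the fine-tuned T5 model, and the passages are dummy strings.

```python
def format_training_pair(query, response):
    """Step 1: prefix the input with "question: " and keep the answer as plain text."""
    return {"input": "question: " + query, "target": response}

def build_rag_prompt(question, contexts):
    """Step 7: join the retrieved contexts and attach them to the question."""
    joined = " ".join(contexts)
    return f"question: {question} context: {joined}"

def rag_pipeline(question, retrieve, generate, top_k=5):
    """Retrieve contexts, build one prompt, and hand it to the generator."""
    contexts = retrieve(question, top_k)
    return generate(build_rag_prompt(question, contexts))

# Hypothetical stand-ins for the FAISS retriever and the fine-tuned T5 model.
def fake_retrieve(question, top_k):
    return ["first passage.", "second passage."][:top_k]

def fake_generate(prompt):
    return prompt  # echo the prompt so the assembled input is visible

pair = format_training_pair("example question?", "example answer.")
result = rag_pipeline("example question?", fake_retrieve, fake_generate)
print(pair["input"])
print(result)
```

Passing the retriever and generator as arguments keeps the prompt-building logic testable on its own, independent of the model files.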

## Brief summary:

My code builds a **domain-specific, retrieval-augmented QA** system by combining:

  * Fine-tuned generative modeling (T5) for natural language answer generation,

  * Semantic retrieval (SentenceTransformer + FAISS) for contextual grounding,

  * A clean integration layer (CustomRetriever + RAG inference) to combine retrieved evidence with generative answer synthesis.

#### Downloading the dataset from GitHub

# 2. Load the dataset

In [None]:
# 2. Load the dataset
import pandas as pd
from sentence_transformers import SentenceTransformer, util

# Load the dataset directly from GitHub (JSON Lines: one QA pair per line)
url = "https://raw.githubusercontent.com/LinesHogan/Open-Domain-Oral-Disease-QA-Dataset/main/ODOD-SFT.jsonl"
df = pd.read_json(url, lines=True)

# View the first few rows
df.head()


|   | query | response |
|---|-------|----------|
| 0 | I feel pain and sound when I open or close my ... | I'm an AI language model and can't diagnose me... |
| 1 | My gums are a little swollen and sore, and my ... | While I can't diagnose medical conditions as a... |
| 2 | The gum near my last tooth is a little swollen... | As an AI, I can't diagnose medical conditions,... |
| 3 | There are some brown and black spots on the su... | While I cannot make a diagnosis as an AI, the ... |
| 4 | My upper and lower teeth don't align when I cl... | If your upper and lower teeth don't align prop... |


## Create documents from your QA pairs

## Initialize sentence embedding model

In [None]:
# Initialize sentence embedding model
model = SentenceTransformer("all-MiniLM-L6-v2")
questions = df['query'].tolist()
answers = df['response'].tolist()

## Embed all questions

In [None]:
# Embed all questions
question_embeddings = model.encode(questions, convert_to_tensor=True)

# Chatbot function

In [None]:
# Define a chatbot function
def chatbot(user_input):
    input_embedding = model.encode(user_input, convert_to_tensor=True)
    scores = util.cos_sim(input_embedding, question_embeddings)[0]
    best_match = scores.argmax().item()
    print("🤖:", answers[best_match])

In [None]:
# Example
chatbot("My gums are swollen and red, what could it be?")

🤖: While I can't diagnose your condition as an AI developed by OpenAI, the symptoms you describe could suggest a few dental conditions.
1. Pericoronitis: This is a dental disorder in which the gum tissue around the wisdom teeth becomes swollen and infected. It's most common around partially erupted wisdom teeth, where the tooth has not surfaced fully and the gum has created a flap or pocket around the tooth where food and bacteria can get trapped. This can cause symptoms like swollen gums, pain, and bad breath.
2. Gingivitis or Periodontitis: These are stages of gum disease. Gingivitis is the milder, reversible form, while periodontitis is more serious and can lead to tooth loss. Both can cause swollen, tender gums and bad breath.
3. Tooth Decay or Abscess: A cavity or an abscessed tooth can also cause bad breath and swollen gums, especially if the decay or infection is near the gum line.
4. Poor Oral Hygiene: Not brushing and flossing regularly or correctly can lead to plaque buildup,

# Streamlit UI

In [None]:
!pip install streamlit


In [None]:
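%%writefile app.py
import streamlit as st
from sentence_transformers import SentenceTransformer, util
import pandas as pd

DATA_URL = "https://raw.githubusercontent.com/LinesHogan/Open-Domain-Oral-Disease-QA-Dataset/main/ODOD-SFT.jsonl"

@st.cache_data
def load_data():
    # Download the QA pairs once; Streamlit reruns this script on every interaction
    df = pd.read_json(DATA_URL, lines=True)
    return df['query'].tolist(), df['response'].tolist()

@st.cache_resource
def load_model():
    # Cache the encoder so it is not reloaded on every rerun
    return SentenceTransformer("all-MiniLM-L6-v2")

@st.cache_resource
def embed_questions():
    # Cache the question embeddings so they are computed only once
    questions, _ = load_data()
    return load_model().encode(questions, convert_to_tensor=True)

questions, answers = load_data()
model = load_model()
question_embeddings = embed_questions()

st.title("🦷 Oral Disease QA Chatbot")
user_input = st.text_input("Ask your dental question:")

if user_input:
    # Embed the query and return the answer of the most similar stored question
    input_embedding = model.encode(user_input, convert_to_tensor=True)
    scores = util.cos_sim(input_embedding, question_embeddings)[0]
    best_match = scores.argmax().item()
    st.markdown(f"**🤖 Answer:** {answers[best_match]}")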


Writing app.py


In [None]:
!pip install pyngrok





In [None]:
# Replace the placeholder with your own ngrok authtoken; do not commit real tokens.
!ngrok config add-authtoken YOUR_NGROK_AUTHTOKEN


Authtoken saved to configuration file: /root/.config/ngrok/ngrok.yml


In [None]:
from pyngrok import ngrok
import threading

# Start streamlit in background
def run():
    !streamlit run app.py --server.port 8501

thread = threading.Thread(target=run)
thread.start()

# Wait a bit for the server to start
import time
time.sleep(5)

# Expose the Streamlit app
public_url = ngrok.connect(8501)
print(f"Public URL: {public_url}")



Collecting usage statistics. To deactivate, set browser.gatherUsageStats to false.

  You can now view your Streamlit app in your browser.

  Local URL: http://localhost:8501
  Network URL: http://172.28.0.12:8501
  External URL: http://34.143.228.49:8501

Public URL: NgrokTunnel: "https://db9a-34-143-228-49.ngrok-free.app" -> "http://localhost:8501"


# Fine-Tuning Script (Using Hugging Face Transformers)

## 1. Preprocess Data to Hugging Face Format

In [None]:
from datasets import Dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
    DataCollatorForSeq2Seq
)

# Only keep question-answer pairs
# Rename columns to match T5 input expectations
hf_data = Dataset.from_pandas(df[["query", "response"]].rename(columns={"query": "question", "response": "answer"}))
hf_data = hf_data.train_test_split(test_size=0.1)


## 2. Fine-Tune a Simple Seq2Seq Model (T5 or BART)

In [None]:
import transformers
print(transformers.__version__)
print(transformers.__file__)


4.52.3
/usr/local/lib/python3.11/dist-packages/transformers/__init__.py


In [None]:
!pip show transformers
!pip list | grep transformers


Name: transformers
Version: 4.52.3
Summary: State-of-the-art Machine Learning for JAX, PyTorch and TensorFlow
Home-page: https://github.com/huggingface/transformers
Author: The Hugging Face team (past and future) with the help of all our contributors (https://github.com/huggingface/transformers/graphs/contributors)
Author-email: transformers@huggingface.co
License: Apache 2.0 License
Location: /usr/local/lib/python3.11/dist-packages
Requires: filelock, huggingface-hub, numpy, packaging, pyyaml, regex, requests, safetensors, tokenizers, tqdm
Required-by: peft, sentence-transformers
sentence-transformers                 4.1.0
transformers                          4.52.3


In [None]:
# Load T5 tokenizer and model
model_name = "t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Tokenize dataset: prefix each input with "question: " and use the answer as the target
def preprocess(batch):
    inputs = ["question: " + q for q in batch["question"]]
    outputs = batch["answer"]
    # Note: padding="max_length" also pads the labels with the pad token id (0 for T5),
    # so padded positions count toward the loss unless they are replaced with -100.
    return tokenizer(inputs, text_target=outputs, truncation=True, padding="max_length", max_length=512)

tokenized_data = hf_data.map(preprocess, batched=True)

training_args = Seq2SeqTrainingArguments(
    output_dir="./t5_qa_model",
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    learning_rate=2e-5,
    weight_decay=0.01,
    num_train_epochs=10,
    logging_dir="./logs",
    fp16=True  # mixed precision; requires a CUDA GPU
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_data["train"],
    eval_dataset=tokenized_data["test"],
    tokenizer=tokenizer,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)

# Start training
trainer.train()



Step,Training Loss
500,3.0607


TrainOutput(global_step=540, training_loss=3.012318957293475, metrics={'train_runtime': 314.8597, 'train_samples_per_second': 13.689, 'train_steps_per_second': 1.715, 'total_flos': 583323164344320.0, 'train_loss': 3.012318957293475, 'epoch': 10.0})
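
The preprocessing cell pads both inputs and targets to a fixed length, so the label sequences end in pad token ids. A common convention with Hugging Face models is to replace those positions with -100, the value that the cross-entropy loss ignores. The sketch below shows that replacement on plain lists. The helper name `mask_padded_labels` and the token ids are hypothetical, and whether this changes the loss reported above has not been tested here.

```python
T5_PAD_TOKEN_ID = 0   # pad token id of the T5 tokenizer
IGNORE_INDEX = -100   # label value skipped by the cross-entropy loss

def mask_padded_labels(label_ids, pad_token_id=T5_PAD_TOKEN_ID):
    """Replace pad ids in each label sequence so they do not contribute to the loss."""
    return [
        [IGNORE_INDEX if tok == pad_token_id else tok for tok in seq]
        for seq in label_ids
    ]

# Hypothetical token ids; 1 is the T5 end-of-sequence id, 0 is padding.
labels = [[71, 1782, 1, 0, 0], [27, 1, 0, 0, 0]]
masked = mask_padded_labels(labels)
print(masked)
```

Inside a `preprocess`-style function this could be applied to the `"labels"` field of the tokenizer output before returning it.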

In [None]:
model.save_pretrained("./t5_qa_model")
tokenizer.save_pretrained("./t5_qa_model")


('./t5_qa_model/tokenizer_config.json',
 './t5_qa_model/special_tokens_map.json',
 './t5_qa_model/spiece.model',
 './t5_qa_model/added_tokens.json',
 './t5_qa_model/tokenizer.json')

## Test Your Fine-Tuned Model

In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the fine-tuned model and tokenizer saved to ./t5_qa_model in the previous cell
model_name_or_path = "./t5_qa_model"
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name_or_path)

def answer_question(question):
    # Use the same "question: " prefix that was used during fine-tuning
    input_text = "question: " + question
    inputs = tokenizer(input_text, return_tensors="pt", truncation=True, padding=True)
    # Beam search, capped at 64 generated tokens
    outputs = model.generate(**inputs, max_length=64, num_beams=5, early_stopping=True)
    answer = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return answer

# Test example
test_question = "My gums are swollen and red, what could it be?"
print("Question:", test_question)
print("Answer:", answer_question(test_question))



Question: My gums are swollen and red, what could it be?
Answer: gums are swollen and red


# RAG

In [None]:
import json
from sentence_transformers import SentenceTransformer, util
import faiss
import numpy as np
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

## 1. Prepare the Corpus File

Transform your dataset into a RAG-compatible format

In [None]:
# Load fine-tuned T5 model and tokenizer
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_path = "./t5_qa_model"  # local directory containing your saved model

tokenizer = AutoTokenizer.from_pretrained(model_path, local_files_only=True)
model = AutoModelForSeq2SeqLM.from_pretrained(model_path, local_files_only=True)



In [None]:
# Rename and format to fit RAG's passage format
corpus = []
for _, row in df.iterrows():
    corpus.append({
        "title": row["query"],  # dummy title (RAG expects one)
        "text": row["response"]  # actual context to retrieve
    })

# Save to disk
with open("corpus.json", "w") as f:
    json.dump(corpus, f)

## 2. Vectorize with SentenceTransformer + FAISS (Highly Efficient)

In [None]:
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np
import json

# Load the response texts
with open("corpus.json") as f:
    corpus = json.load(f)

texts = [doc["text"] for doc in corpus]

# Use a fast transformer model for vectorization
encoder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = encoder.encode(texts, convert_to_numpy=True, show_progress_bar=True)

# Create FAISS index
dimension = embeddings.shape[1]
faiss_index = faiss.IndexFlatL2(dimension)
faiss_index.add(embeddings)

# Save for later use
faiss.write_index(faiss_index, "faiss_index.idx")
with open("corpus_texts.json", "w") as f:
    json.dump(texts, f)


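
The first chatbot ranks by cosine similarity, while `IndexFlatL2` ranks by squared L2 distance. For unit-length vectors the two are tied together by ||a - b||^2 = 2 - 2 cos(a, b), so both give the same ordering. The sketch below checks that identity on two hypothetical toy vectors. Whether the stored embeddings are unit-length is an assumption to verify, not something this notebook checks.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def squared_l2(a, b):
    """Squared Euclidean distance, the quantity a flat L2 index ranks by."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def dot(a, b):
    """Dot product; equals cosine similarity when both vectors are unit-length."""
    return sum(x * y for x, y in zip(a, b))

# Hypothetical toy vectors, normalized to unit length.
a = normalize([3.0, 4.0])
b = normalize([4.0, 3.0])

distance = squared_l2(a, b)
from_cosine = 2 - 2 * dot(a, b)
print(round(distance, 6), round(from_cosine, 6))
```

If cosine ranking is wanted explicitly, one option is to pass `normalize_embeddings=True` to `encode` and index with `faiss.IndexFlatIP`, which ranks by inner product.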

## 3. Create a Retriever for Use in RAG

In [None]:
class CustomRetriever:
    """Retrieve corpus texts by nearest-neighbor search over a saved FAISS index."""

    def __init__(self, encoder_model_name, index_path, corpus_path):
        self.encoder = SentenceTransformer(encoder_model_name)
        self.index = faiss.read_index(index_path)

        with open(corpus_path, "r") as f:
            self.corpus_texts = json.load(f)

        # The corpus may be a dict (id -> text) or a list of texts;
        # corpus_ids maps FAISS row positions back to lookup keys either way.
        if isinstance(self.corpus_texts, dict):
            self.corpus_ids = list(self.corpus_texts.keys())
        else:
            self.corpus_ids = list(range(len(self.corpus_texts)))

    def retrieve(self, query, top_k=5):
        # Encode as a batch of one so the array has shape (1, dim), as FAISS expects
        query_embedding = self.encoder.encode([query], convert_to_numpy=True)
        _, indices = self.index.search(query_embedding, top_k)
        # FAISS pads results with -1 when the index holds fewer than top_k vectors
        return [self.corpus_texts[self.corpus_ids[i]] for i in indices[0] if i != -1]
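
To make the search step less opaque, here is a small pure-Python sketch of what an exhaustive L2 search computes: the squared distance from the query to every stored vector, sorted ascending, with the first `top_k` kept. The function name `flat_l2_search` and the two-dimensional vectors are hypothetical; FAISS does the same work in optimized native code.

```python
def flat_l2_search(corpus_vecs, query_vec, top_k):
    """Exhaustive search: squared L2 distance to every vector, smallest first."""
    scored = [
        (sum((c - q) ** 2 for c, q in zip(vec, query_vec)), i)
        for i, vec in enumerate(corpus_vecs)
    ]
    scored.sort()  # ties are broken by the lower row index
    top = scored[:top_k]
    distances = [d for d, _ in top]
    indices = [i for _, i in top]
    return distances, indices

# Hypothetical toy corpus of four 2-dimensional vectors.
toy_corpus = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
distances, indices = flat_l2_search(toy_corpus, [1.0, 1.0], top_k=2)
print(distances, indices)
```

The returned indices are row positions in the order the vectors were added, which is why the retriever can map them straight back to `corpus_texts`.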


## 4. Generate Answers Using T5

In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# --- Init your retriever ---
retriever = CustomRetriever(
    encoder_model_name="sentence-transformers/all-MiniLM-L6-v2",  # or your encoder
    index_path="faiss_index.idx",
    corpus_path="corpus_texts.json"
)

# --- RAG-like QA inference ---
def rag_answer(query):
    retrieved_contexts = retriever.retrieve(query)
    context = " ".join(retrieved_contexts)

    input_text = f"question: {query} context: {context}"
    inputs = tokenizer(input_text, return_tensors="pt", truncation=True, padding=True).to(model.device)

    with torch.no_grad():
        outputs = model.generate(**inputs, max_length=128, num_beams=4, early_stopping=True)

    return tokenizer.decode(outputs[0], skip_special_tokens=True)


In [None]:
# Example test
query = "My gums are swollen and red, what could it be?"
print("Q:", query)
print("A:", rag_answer(query))

Q: My gums are swollen and red, what could it be?
A: swollen, red, or bleeding gums, loose teeth, and gaps between the teeth
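
Since `rag_answer` joins all retrieved contexts and relies on tokenizer truncation, later contexts may be cut off mid-sentence or dropped once the tokenizer's maximum length is reached. One possible refinement, sketched below, is to keep only whole contexts that fit a budget before building the prompt. The helper `select_contexts` and the `max_words` budget are hypothetical, and word counts only approximate token counts.

```python
def select_contexts(contexts, max_words=300):
    """Keep whole contexts, in retrieval order, until the word budget is used up."""
    selected, used = [], 0
    for text in contexts:
        n_words = len(text.split())
        if used + n_words > max_words:
            break
        selected.append(text)
        used += n_words
    return selected

# Hypothetical retrieved contexts, ordered from most to least similar.
retrieved = ["a b c", "d e", "f g h i"]
kept = select_contexts(retrieved, max_words=5)
print(kept)
```

In the pipeline this would sit between `retriever.retrieve(query)` and the `" ".join(...)` call.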
