
In [10]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Load XML file

In [11]:
import xml.etree.ElementTree as ET

# Path to the XML export on the mounted Google Drive
xml_path = "/content/drive/MyDrive/mplus_topics_2025-11-20.xml"

tree = ET.parse(xml_path)
root = tree.getroot()

Extract text content from XML


In [12]:
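texts = []

def extract_text(node):
    """Recursively append the stripped, non-empty .text of node and its descendants to texts."""
    # Only node.text is read; text that follows a child element (node.tail) is not collected.
    if node.text and node.text.strip():
        texts.append(node.text.strip())
    for child in node:
        extract_text(child)

extract_text(root)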
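
The traversal above can be checked on a small in-memory document before running it on the full file. The sketch below uses a made-up XML string (not the real schema of the Drive file) and a hypothetical helper, `collect_text`, that mirrors `extract_text` but returns its own list. It also shows `Element.itertext()`, which additionally yields tail text; whether that matters depends on whether the real file contains mixed content.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample for illustration only; the real file has its own schema.
sample = (
    "<topics><topic><title>Asthma</title>"
    "<summary>Asthma is <b>chronic</b> and affects airways.</summary>"
    "</topic></topics>"
)

def collect_text(node, out):
    """Same traversal as extract_text, but appends to a list passed in."""
    if node.text and node.text.strip():
        out.append(node.text.strip())
    for child in node:
        collect_text(child, out)
    return out

sample_root = ET.fromstring(sample)

# Reads .text only, so the words after </b> are skipped.
text_only = collect_text(sample_root, [])

# itertext() walks .text and .tail in document order.
with_tails = [s.strip() for s in sample_root.itertext() if s.strip()]

print(text_only)
print(with_tails)
```

If the two lists differ on the real file, switching to `itertext()` may recover text that the recursive version drops.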


Clean & filter text

In [13]:
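cleaned = []
for t in texts:
    # Replace newlines and tabs with spaces, then trim both ends
    t = t.replace("\n", " ").replace("\t", " ").strip()
    if len(t) > 20:     # keep only strings longer than 20 characters; shorter fragments are unlikely to be full sentences
        cleaned.append(t)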
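
A possible variant of the cleaning step, shown as a sketch: `" ".join(t.split())` collapses any run of whitespace into a single space, so the length check is not inflated by repeated spaces. The function name `clean_texts` and the `min_chars` parameter are hypothetical and not part of the notebook.

```python
def clean_texts(raw_texts, min_chars=20):
    """Normalize whitespace and keep strings longer than min_chars characters."""
    kept = []
    for t in raw_texts:
        # split() with no arguments splits on any whitespace run and drops empty pieces
        t = " ".join(t.split())
        if len(t) > min_chars:
            kept.append(t)
    return kept

# Made-up inputs: a short date-like fragment and a sentence with messy whitespace
demo = clean_texts([
    "2025-11-20",
    "Asthma  is a\n\tchronic   disease of the airways.",
])
print(demo)
```

The threshold stays at 20 characters by default to match the cell above.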

Chunk the text

In [14]:
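chunks = []
chunk_size = 300   # maximum number of words per chunk
chunk_id = 0       # running counter, so IDs stay unique across all texts

for text in cleaned:
    words = text.split()
    # Slice each text into consecutive, non-overlapping windows of chunk_size words
    for i in range(0, len(words), chunk_size):
        chunk = " ".join(words[i:i+chunk_size])
        chunks.append({"id": f"chunk_{chunk_id}", "text": chunk})
        chunk_id += 1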
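
The windowing logic can be tried on a short string with a small `chunk_size`. The sketch below wraps it in a hypothetical helper, `chunk_words`. With `overlap=0` it should produce the same windows as the loop above; the `overlap` parameter is an optional variation that the notebook does not use, in which consecutive chunks share some words.

```python
def chunk_words(text, chunk_size=300, overlap=0):
    """Split text into windows of up to chunk_size words, optionally overlapping."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    windows = []
    for i in range(0, len(words), step):
        windows.append(" ".join(words[i:i + chunk_size]))
        # Stop once a window has reached the end, so no trailing duplicate is emitted
        if i + chunk_size >= len(words):
            break
    return windows

print(chunk_words("a b c d e f g", chunk_size=3))
print(chunk_words("a b c d e f g", chunk_size=3, overlap=1))
```

Overlap can help when a sentence relevant to a query straddles a chunk boundary, at the cost of a larger index.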

Build a FAISS index from SentenceTransformer embeddings


In [15]:
!pip install -q faiss-cpu sentence-transformers

In [16]:
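from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

# By default this model truncates input beyond 256 word pieces,
# so the tail of a long chunk may not be reflected in its embedding
model = SentenceTransformer("all-MiniLM-L6-v2")

corpus = [c["text"] for c in chunks]
embeddings = model.encode(corpus, convert_to_numpy=True)   # shape: (number of chunks, embedding dimension)

d = embeddings.shape[1]          # embedding dimension
index = faiss.IndexFlatL2(d)     # exact (brute-force) search on L2 distance
index.add(embeddings)            # row i of the index corresponds to chunks[i]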
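
To make clear what a flat L2 index computes, here is a plain-Python sketch of exhaustive nearest-neighbour search by squared L2 distance. It illustrates the idea only and is not FAISS code; the function name `flat_l2_search` and the toy vectors are made up for the example.

```python
def flat_l2_search(vectors, query, k):
    """Return the k (squared_distance, position) pairs closest to query."""
    scored = []
    for pos, vec in enumerate(vectors):
        # Squared Euclidean distance between the stored vector and the query
        dist = sum((a - b) ** 2 for a, b in zip(vec, query))
        scored.append((dist, pos))
    scored.sort()  # smallest distance first
    return scored[:k]

# Toy 2-dimensional "embeddings"; real ones have d dimensions
toy_vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
hits = flat_l2_search(toy_vectors, [0.9, 0.1], k=2)
print([pos for _, pos in hits])
```

Every stored vector is compared with the query, so results are exact but search time grows linearly with the number of chunks.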

Save the FAISS index and chunk metadata

In [17]:
import json

faiss.write_index(index, "/content/drive/MyDrive/medline_faiss.index")

# Save the chunks in index order, so search result position i maps back to chunks[i]
with open("/content/drive/MyDrive/medline_chunks.json", "w", encoding="utf-8") as f:
    json.dump(chunks, f, indent=2)

print("FAISS index and metadata saved.")

FAISS index and metadata saved.
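
The saved index stores only vectors, so the JSON file is what turns a search hit back into text. The sketch below shows one way the two files might be used together later. The helper names `load_chunks` and `retrieve` are hypothetical. `model` is assumed to be the same SentenceTransformer used above, and `index` an index loaded with `faiss.read_index`.

```python
import json

def load_chunks(path):
    """Read the chunk metadata written by the save cell."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

def retrieve(query, model, index, chunks, k=3):
    """Embed the query, search the index, and map hit positions back to chunks."""
    query_vec = model.encode([query], convert_to_numpy=True)
    distances, positions = index.search(query_vec, k)
    results = []
    for dist, pos in zip(distances[0], positions[0]):
        if pos < 0:  # FAISS fills missing results with -1 when fewer than k exist
            continue
        results.append({
            "id": chunks[pos]["id"],
            "text": chunks[pos]["text"],
            "distance": float(dist),
        })
    return results

# Possible usage in a later notebook session (paths as in the save cell):
# index = faiss.read_index("/content/drive/MyDrive/medline_faiss.index")
# chunks = load_chunks("/content/drive/MyDrive/medline_chunks.json")
# retrieve("What are the symptoms of asthma?", model, index, chunks)
```

This only works if the JSON list is in the same order as the vectors added to the index, which is the case for the cells above.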
