<a href="https://colab.research.google.com/github/madhupathy/PGP-GABA-UTA-JUL25/blob/main/Medical_Assistant_RAG_FullCode_Starter.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Medical Assistant (RAG) — Full‑Code Starter

**Goal:** Answer clinical questions using a Retrieval‑Augmented Generation (RAG) pipeline over the Merck Manual PDF. This notebook covers installs, GPU check, data ingestion, embeddings + FAISS, prompt‑sweeps, RAG QA, and evaluation.

**Submission:** Export to HTML at the end.

> ⚠️ Educational use only. Not a substitute for professional medical advice.


## 0) Runtime & Setup (Colab)
- *Runtime → Change runtime type → GPU (T4 recommended)*
- If CUDA errors occur, re‑select GPU and **Restart runtime**.

In [2]:
# Check GPU
!nvidia-smi

# Ensure CUDA wheels (Colab often has them, this makes it explicit)
!pip -q install --upgrade "torch==2.3.1+cu121" "torchvision==0.18.1+cu121" --index-url https://download.pytorch.org/whl/cu121

import torch; print("CUDA available:", torch.cuda.is_available())

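# Load the embedding model on the GPU when available, otherwise fall back to the CPU
from sentence_transformers import SentenceTransformer
device = 'cuda' if torch.cuda.is_available() else 'cpu'
embedder = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2', device=device)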


Fri Sep 26 02:40:00 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   41C    P0             27W /   70W |     206MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

## 1) Installs

In [3]:
#@title Install libraries (may take a few minutes)
!pip -q install --upgrade pip
!pip -q install faiss-cpu sentence-transformers pypdf langchain langchain-community langchain-text-splitters
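!pip -q install llama-cpp-python==0.2.90  # builds from source; adjust the version if needed
# Note: a plain pip install typically produces a CPU-only build, in which case n_gpu_layers has no effect.
# For GPU offload, build with CUDA enabled instead (slower to compile), e.g.:
#   !CMAKE_ARGS="-DGGML_CUDA=on" pip -q install llama-cpp-python==0.2.90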
# Optional: reranker
!pip -q install FlagEmbedding                    # BGE reranker


ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
google-colab 1.0.0 requires requests==2.32.4, but you have requests 2.32.5 which is incompatible.
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done
  Building wheel for llama-cpp-python (pyproject.toml) ... done
  Preparing metadata (setup.py) ...

## 2) Configuration

In [4]:
#@title Configuration
from pathlib import Path
PDF_PATH = Path('/content/medical_diagnosis_manual.pdf')
EMBED_MODEL_NAME = 'sentence-transformers/all-MiniLM-L6-v2'
FAISS_DIR = Path('/content/faiss_index')
FAISS_DIR.mkdir(exist_ok=True, parents=True)

# LLM config (llama-cpp); alternatively, plug in OpenAI etc.
LLAMA_MODEL_PATH = '/content/Llama-3.2-3B-Instruct-Q4_K_M.gguf'  # must match the file downloaded below (paths are case-sensitive)
TEMPERATURE = 0.2
MAX_TOKENS = 1024
TOP_P = 0.95
TOP_K = 40

!pip -q install -U huggingface_hub
from huggingface_hub import hf_hub_download

MODEL_REPO = "bartowski/Llama-3.2-3B-Instruct-GGUF"
FILENAME   = "Llama-3.2-3B-Instruct-Q4_K_M.gguf"

# local_dir_use_symlinks is deprecated in recent huggingface_hub releases; local_dir alone is enough
local_path = hf_hub_download(repo_id=MODEL_REPO, filename=FILENAME, local_dir="/content")
print("Saved to:", local_path)


For more details, check out https://huggingface.co/docs/huggingface_hub/main/en/guides/download#download-files-to-local-folder.


Llama-3.2-3B-Instruct-Q4_K_M.gguf:   0%|          | 0.00/2.02G [00:00<?, ?B/s]

Saved to: /content/Llama-3.2-3B-Instruct-Q4_K_M.gguf


## 3) Load & Chunk PDF

In [None]:
#@title Load PDF and chunk
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

assert PDF_PATH.exists(), f"Upload your PDF to {PDF_PATH}"
loader = PyPDFLoader(str(PDF_PATH))
docs = loader.load()
print(f"Pages loaded: {len(docs)}")

splitter = RecursiveCharacterTextSplitter(chunk_size=1200, chunk_overlap=150)
chunks = splitter.split_documents(docs)
print(f"Chunks: {len(chunks)} (size≈1200, overlap=150)")
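
To build intuition for `chunk_size=1200` and `chunk_overlap=150`, the sketch below applies a plain fixed-stride sliding window to a string. This is a simplification: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraph and line boundaries, so real chunk boundaries will differ. `sliding_chunks` is a hypothetical helper, not part of LangChain.

```python
def sliding_chunks(text, chunk_size=1200, overlap=150):
    """Split text into fixed-size windows; consecutive windows share `overlap` chars."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap  # how far each new window advances
    pieces = []
    for start in range(0, len(text), step):
        pieces.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return pieces

demo = "x" * 3000
parts = sliding_chunks(demo)
print([len(p) for p in parts])  # [1200, 1200, 900]
```

With a 1,050-character stride, a 3,000-character text yields three windows, and each adjacent pair shares 150 characters.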


## 4) Embeddings & FAISS

In [None]:
#@title Build / Load FAISS index
from sentence_transformers import SentenceTransformer
import faiss, numpy as np

# Minimal document store (texts + metadata); must be defined before encoding
texts = [c.page_content for c in chunks]
metas = [c.metadata for c in chunks]

embedder = SentenceTransformer(EMBED_MODEL_NAME)
embs = embedder.encode(
    texts,
    batch_size=256,          # adjust down if OOM
    convert_to_numpy=True,
    normalize_embeddings=True,   # unit vectors, so inner product = cosine similarity
    show_progress_bar=True
)

# Exact inner-product index; FAISS expects float32
index = faiss.IndexFlatIP(embs.shape[1])
index.add(embs.astype(np.float32))
print('FAISS index size:', index.ntotal)
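
`IndexFlatIP` ranks by inner product, which is why the embeddings are encoded with `normalize_embeddings=True`: for unit-length vectors the inner product equals cosine similarity. A small stdlib-only check with made-up 3-dimensional vectors (real embeddings from this model have 384 dimensions); the helper names are illustrative only:

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

vec_a, vec_b = [3.0, 4.0, 0.0], [1.0, 2.0, 2.0]
ip_normalized = dot(l2_normalize(vec_a), l2_normalize(vec_b))
print(round(ip_normalized, 6), round(cosine_similarity(vec_a, vec_b), 6))
```

Both printed values agree (11/15 for these vectors), so an inner-product index over normalized embeddings behaves like a cosine-similarity index.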


## 5) Retriever

In [None]:
#@title Define retriever (similarity search)
import numpy as np

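def retrieve(query, k=6):
    """Return the top-k (text, metadata) pairs for a query."""
    q = embedder.encode([query], convert_to_numpy=True, normalize_embeddings=True)
    D, I = index.search(q, k)
    # FAISS pads results with -1 when fewer than k vectors are available
    ctxs = [(texts[i], metas[i]) for i in I[0] if i != -1]
    return ctxs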

# quick test
print(len(retrieve("sepsis hour-1 bundle", k=3)))
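
`retrieve` returns `(text, metadata)` pairs. `PyPDFLoader` stores a zero-based `page` and a `source` path in each chunk's metadata, so the results can be labelled for quick inspection. `format_sources` is a hypothetical helper, and the sample tuples below are placeholders rather than real retrieval output.

```python
def format_sources(ctxs, preview_chars=60):
    """Return one label per retrieved chunk: '[C1] p.<page> <preview>'."""
    lines = []
    for i, (text, meta) in enumerate(ctxs, start=1):
        page = meta.get("page")
        # PyPDFLoader pages are zero-based; show them one-based for readers
        page_label = f"p.{page + 1}" if isinstance(page, int) else "p.?"
        preview = " ".join(text.split())[:preview_chars]  # collapse whitespace
        lines.append(f"[C{i}] {page_label} {preview}")
    return lines

sample = [
    ("placeholder chunk text\nabout topic A", {"source": "manual.pdf", "page": 11}),
    ("placeholder chunk text about topic B", {"source": "manual.pdf"}),
]
for line in format_sources(sample):
    print(line)
```

In the notebook this could be called as `format_sources(retrieve("sepsis hour-1 bundle", k=3))` to see which pages the retriever is drawing from.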


## 6) LLM (llama-cpp) — local

In [None]:
#@title Load llama-cpp (provide GGUF path above)
from llama_cpp import Llama
llm = Llama(model_path=LLAMA_MODEL_PATH, n_gpu_layers=50, n_ctx=4096, n_threads=4, verbose=False)
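
With `n_ctx=4096`, the prompt (system text, retrieved context, question) and the generated tokens must fit in the same window. The sketch below estimates this with a rough 4-characters-per-token ratio, which is an assumption and not a property of the model; `llm.tokenize(...)` gives exact counts once the model is loaded. `fits_context` is a hypothetical helper.

```python
def fits_context(prompt_chars, max_new_tokens, n_ctx=4096, chars_per_token=4.0):
    """Estimate whether prompt + generation fits in the context window."""
    est_prompt_tokens = int(prompt_chars / chars_per_token) + 1
    return est_prompt_tokens + max_new_tokens <= n_ctx, est_prompt_tokens

# Six retrieved chunks of up to 1200 characters each, plus ~500 characters of instructions.
ok, est = fits_context(prompt_chars=6 * 1200 + 500, max_new_tokens=1024)
print(ok, est)  # True 1926
```

Under this estimate, `k=6` chunks with `max_tokens=1024` leaves some headroom; raising `k` or the chunk size substantially would call for a larger `n_ctx` or a smaller `max_tokens`.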



## 7) Prompt Templates

In [None]:
#@title Prompts (baseline, engineered, RAG)
BASELINE_SYS = """You are a careful clinical assistant. If unsure, say so and ask for clarification. Do not fabricate.
Educational use only — not medical advice.
"""

ENGINEERED_SYS = """You are an ICU attending. Answer concisely in bullets and include a short ‘Rationale’. If evidence is insufficient, state that explicitly.
Return JSON with fields: answer, rationale, red_flags.
"""

RAG_SYS = """You are a clinical assistant. Answer ONLY using the provided CONTEXT.
- If the context is insufficient, say "Insufficient context" and request what is missing.
- Cite chunk IDs like [C1], [C2] from the given CONTEXT.
Educational use only — not medical advice.
"""

USER_QS = [
    "What is the protocol for managing sepsis in a critical care unit?",
    "What are the common symptoms of appendicitis, and can it be cured via medicine? If not, what surgical procedure should be followed to treat it?",
    "What are treatments and possible causes for sudden patchy hair loss (localized bald spots on scalp)?",
    "What treatments are recommended for a person with brain tissue injury (TBI) causing temporary or permanent impairment?"
]
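
`ENGINEERED_SYS` asks for JSON with the fields `answer`, `rationale`, and `red_flags`, but small models often wrap JSON in extra prose. A minimal sketch of a tolerant parser follows; `parse_json_answer` is a hypothetical helper and the sample string is a placeholder, not real model output.

```python
import json
import re

def parse_json_answer(raw):
    """Parse the span from the first '{' to the last '}' as a JSON object; None on failure."""
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if not match:
        return None
    try:
        obj = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    return obj if isinstance(obj, dict) else None

raw_output = 'Here is the result:\n{"answer": "placeholder", "rationale": "placeholder", "red_flags": []}'
parsed = parse_json_answer(raw_output)
print(sorted(parsed))  # ['answer', 'rationale', 'red_flags']
```

Returning `None` rather than raising makes it easy to count how many engineered answers actually followed the requested format.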


## 8) Generation utilities

In [None]:
#@title Generation helpers
from textwrap import shorten

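def chat(llm, system, user, temperature=0.2, top_p=0.95, max_tokens=1024, top_k=40):
    # Llama 3.2 Instruct does not use the Llama-2 "[INST]<<SYS>>" format;
    # create_chat_completion applies the chat template stored in the GGUF file.
    out = llm.create_chat_completion(
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
        temperature=temperature, top_p=top_p, top_k=top_k, max_tokens=max_tokens,
    )
    return out["choices"][0]["message"]["content"].strip()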

# RAG
def answer_with_rag(question, k=6, temperature=0.2):
    ctxs = retrieve(question, k=k)
    ctx_text = "\n\n".join([f"[C{i+1}] {shorten(t[:1200], width=1200)}" for i,(t,_) in enumerate(ctxs)])
    user = f"CONTEXT:\n{ctx_text}\n\nQUESTION: {question}\n\nAnswer with citations [C#]."
    return chat(llm, RAG_SYS, user, temperature=temperature)
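
`RAG_SYS` asks the model to cite chunk IDs such as `[C1]`, but nothing guarantees that the cited IDs exist. A minimal sketch of a check, assuming the `[C#]` labelling used in `answer_with_rag`; `check_citations` is a hypothetical helper and the example answer is a placeholder.

```python
import re

def check_citations(answer, n_contexts):
    """Return (cited_ids, invalid_ids) for [C#] markers found in `answer`."""
    cited = sorted({int(m) for m in re.findall(r"\[C(\d+)\]", answer)})
    invalid = [c for c in cited if not 1 <= c <= n_contexts]
    return cited, invalid

example_answer = "Placeholder claim [C1]. Another placeholder claim [C3][C9]."
cited, invalid = check_citations(example_answer, n_contexts=6)
print(cited, invalid)  # [1, 3, 9] [9]
```

An answer with no citations, or with IDs outside `1..k`, is a cheap signal that the response may not be grounded in the retrieved context.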


## 9) Baseline & Prompt‑Engineered QA (5+ variants)

In [None]:
#@title Fast + Optimized Baselines & Prompt Sweeps (with final pass)
import itertools, time

# ---------- TUNABLES ----------
# Fast "sweep" on just 1 question to pick params quickly
QUESTIONS_FOR_SWEEP = [USER_QS[0]]   # only sepsis; add USER_QS[1] if you want 2
# Keep combos small for speed (3 combos is enough to pick a winner)
PARAM_GRID = {
    'temperature': [0.0, 0.2],
    'top_p': [0.9, 1.0],
    'top_k': [20, 40]
}
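# itertools.product varies the last parameter fastest, so a plain [:3] slice would only test temperature=0.0;
# taking every third combo (3 of 8) covers both values of each parameter.
COMBOS = list(itertools.product(PARAM_GRID['temperature'], PARAM_GRID['top_p'], PARAM_GRID['top_k']))[::3]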

# Shorter output during sweep; long output only in final pass
SWEEP_MAX_TOKENS = 200
FINAL_MAX_TOKENS = 700   # reduce from 1024 to speed up; increase if you truly need longer answers

# Add stop tokens to end early when the answer is complete
STOP_TOKENS = ["</s>", "###", "\n\n\n"]

# ---------- HELPER ----------
def chat_fast(llm, system, user, temperature=0.2, top_p=0.95, top_k=40, max_tokens=256):
    # Use the chat template stored in the GGUF file rather than the Llama-2 "[INST]" format
    out = llm.create_chat_completion(
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
        temperature=temperature,
        top_p=top_p,
        top_k=top_k,
        max_tokens=max_tokens,
        stop=STOP_TOKENS
    )
    return out["choices"][0]["message"]["content"].strip()

runs = []

# ---------- 1) FAST SWEEP (small outputs, 1 question) ----------
print("=== FAST SWEEP START ===")
best_combo = None
best_score = -1.0

for q in QUESTIONS_FOR_SWEEP:
    # Baseline (short)
    t0 = time.time()
    out = chat_fast(llm, BASELINE_SYS, q, temperature=0.2, max_tokens=SWEEP_MAX_TOKENS)
    dt = time.time() - t0
    print(f"[baseline] '{q[:28]}...' took {dt:.1f}s")
    runs.append({"mode":"baseline_sweep","q":q,"params":{"temperature":0.2,"max_tokens":SWEEP_MAX_TOKENS},"answer":out})

    # Prompt-engineered short sweeps
    for (t,tp,tk) in COMBOS:
        t0 = time.time()
        out = chat_fast(llm, ENGINEERED_SYS, q, temperature=t, top_p=tp, top_k=tk, max_tokens=SWEEP_MAX_TOKENS)
        dt = time.time() - t0
        print(f"[engineered] t={t}, tp={tp}, tk={tk} took {dt:.1f}s")

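        # Simple heuristic score: 0.5 for bullet lines or a "rationale" mention, plus up to 0.5 for length.
        # len(out) counts characters, so compare against a character cap (~4 chars per token, a rough assumption).
        # (You'll replace this with your real evaluation later)
        has_structure = any(line.lstrip().startswith(("•", "-", "*")) for line in out.splitlines()) or "rationale" in out.lower()
        char_cap = SWEEP_MAX_TOKENS * 4
        score = (0.5 if has_structure else 0.0) + min(len(out), char_cap) / char_cap * 0.5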
        if score > best_score:
            best_score = score
            best_combo = {"temperature": t, "top_p": tp, "top_k": tk}
        runs.append({"mode":"engineered_sweep","q":q,"params":{"temperature":t,"top_p":tp,"top_k":tk,"max_tokens":SWEEP_MAX_TOKENS},"answer":out})

print("Best combo from sweep:", best_combo, "score:", round(best_score,3))
print("=== FAST SWEEP END ===\n")

# ---------- 2) FINAL PASS (all 4 questions, chosen params, longer but capped outputs) ----------
print("=== FINAL PASS START ===")
final_params = best_combo if best_combo else {"temperature":0.2, "top_p":0.95, "top_k":40}

# Baseline (once for all questions)
for q in USER_QS:
    t0 = time.time()
    out = chat_fast(llm, BASELINE_SYS, q, temperature=0.2, top_p=0.95, top_k=40, max_tokens=FINAL_MAX_TOKENS)
    print(f"[final-baseline] '{q[:28]}...' took {time.time()-t0:.1f}s")
    runs.append({"mode":"baseline_final","q":q,"params":{"temperature":0.2,"top_p":0.95,"top_k":40,"max_tokens":FINAL_MAX_TOKENS},"answer":out})

# Engineered (best combo)
for q in USER_QS:
    t0 = time.time()
    out = chat_fast(llm, ENGINEERED_SYS, q, **final_params, max_tokens=FINAL_MAX_TOKENS)
    print(f"[final-engineered] '{q[:28]}...' took {time.time()-t0:.1f}s")
    runs.append({"mode":"engineered_final","q":q,"params":{**final_params,"max_tokens":FINAL_MAX_TOKENS},"answer":out})

print("=== FINAL PASS END ===")

len(runs)


## 10) RAG QA

In [None]:
#@title Run RAG answers
for q in USER_QS:
    ans = answer_with_rag(q, k=6, temperature=0.2)
    runs.append({"mode":"rag","q":q,"params":{"k":6,"temperature":0.2},"answer":ans})

len(runs)
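
Generation is the slow part of this notebook, so it can help to persist `runs` before evaluating them. A minimal sketch using JSON Lines follows; `save_runs` and `load_runs` are hypothetical helpers, the record is a placeholder with the same keys the notebook appends to `runs`, and the temp-directory path is only for illustration (in Colab a path under `/content` would work as well).

```python
import json
import os
import tempfile

def save_runs(runs, path):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for r in runs:
            f.write(json.dumps(r, ensure_ascii=False) + "\n")

def load_runs(path):
    """Read the records back, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

demo_runs = [{"mode": "rag", "q": "placeholder question",
              "params": {"k": 6, "temperature": 0.2}, "answer": "placeholder"}]
demo_path = os.path.join(tempfile.gettempdir(), "runs_demo.jsonl")
save_runs(demo_runs, demo_path)
print(load_runs(demo_path) == demo_runs)  # True
```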


## 11) Evaluation — Groundedness, Relevance, Context Precision

In [None]:
#@title Simple automatic rubric (embedding similarity)
import numpy as np

def cosine(a,b):
    return float(np.dot(a,b) / (np.linalg.norm(a)*np.linalg.norm(b) + 1e-8))

# Groundedness: max similarity between answer and its retrieved context chunks (proxy)
# Relevance: similarity between answer and question
# Context precision: average similarity of retrieved chunks to the question (for RAG)

def embed_text(t: str):
    return embedder.encode([t], convert_to_numpy=True, normalize_embeddings=True)[0]

question_vecs = {q: embed_text(q) for q in USER_QS}

evals = []
for r in runs:
    q = r['q']
    ans = r['answer']
    ans_vec = embed_text(ans)
    rel = cosine(ans_vec, question_vecs[q])
    grounded = None
    ctx_prec = None
    if r['mode'] == 'rag':
        ctxs = retrieve(q, k=6)
        ctx_vecs = [embed_text(t) for t,_ in ctxs]
        grounded = max(cosine(ans_vec, v) for v in ctx_vecs)
        ctx_prec = float(np.mean([cosine(question_vecs[q], v) for v in ctx_vecs]))
    evals.append({**r, 'metrics': {'relevance': rel, 'groundedness': grounded, 'context_precision': ctx_prec}})

len(evals)
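
Each entry in `evals` carries a `metrics` dict in which `groundedness` and `context_precision` are `None` for non-RAG modes. To compare modes, the per-mode means can be computed while skipping those `None` values. `summarize` is a hypothetical helper, and the numbers below are made-up placeholders, not results from this notebook.

```python
from collections import defaultdict
from statistics import mean

def summarize(evals):
    """Mean of each metric per mode, skipping None values."""
    buckets = defaultdict(lambda: defaultdict(list))
    for e in evals:
        for name, value in e["metrics"].items():
            if value is not None:
                buckets[e["mode"]][name].append(value)
    return {mode: {name: round(mean(vals), 3) for name, vals in metrics.items()}
            for mode, metrics in buckets.items()}

demo_evals = [
    {"mode": "rag", "metrics": {"relevance": 0.5, "groundedness": 0.75, "context_precision": 0.25}},
    {"mode": "rag", "metrics": {"relevance": 0.75, "groundedness": 0.25, "context_precision": 0.5}},
    {"mode": "baseline_final", "metrics": {"relevance": 0.5, "groundedness": None, "context_precision": None}},
]
print(summarize(demo_evals))
```

Because these scores are embedding-similarity proxies, they are best read as relative comparisons between modes rather than as absolute quality measures.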


## 12) Display & Export

In [None]:
#@title 🧼➡️📄➡️⬇️ Clean notebook (widget-safe) + Export HTML (lab) + Download
SRC_PATH = "/content/Medical_Assistant_RAG_FullCode_Starter.ipynb"  # change if your filename differs
OUT_HTML = "/content/Medical_Assistant_RAG_Submission.html"

import nbformat as nbf
from pathlib import Path
from google.colab import files
import subprocess, shlex

src = Path(SRC_PATH)
assert src.exists(), f"Notebook not found at {src}. Use File > Download .ipynb, then upload to /content."

clean = src.with_name(src.stem + "_CLEAN.ipynb")
nb = nbf.read(str(src), as_version=4)

# remove all widget traces
nb.metadata.pop("widgets", None)
WIDGET_VIEW  = "application/vnd.jupyter.widget-view+json"
WIDGET_STATE = "application/vnd.jupyter.widget-state+json"
for c in nb.cells:
    c.metadata.pop("widgets", None)
    if "outputs" in c:
        for o in c["outputs"]:
            if isinstance(o, dict) and "data" in o:
                o["data"].pop(WIDGET_VIEW, None)
                o["data"].pop(WIDGET_STATE, None)

nbf.write(nb, str(clean))
print("Cleaned →", clean)

# Try lab template (nicer); if it fails, fallback to basic
cmd = f'jupyter nbconvert --to html --template=lab --no-input "{clean}" --output "{OUT_HTML}"'
print("[RUN]", cmd)
rc = subprocess.call(cmd, shell=True)
if rc != 0:
    cmd2 = f'jupyter nbconvert --to html --template=basic --no-input "{clean}" --output "{OUT_HTML}"'
    print("[FALLBACK]", cmd2)
    rc2 = subprocess.call(cmd2, shell=True)
    if rc2 != 0:
        raise RuntimeError("nbconvert failed. Use Colab menu: File > Download > .html")

print("HTML ready at:", OUT_HTML)
files.download(OUT_HTML)


In [None]:
from google.colab import drive
drive.mount('/content/drive')

## Notes & Citations
- Merck Manual (Professional Edition): Sepsis & Septic Shock; Appendicitis; Alopecia Areata; Traumatic Brain Injury.
- Surviving Sepsis Campaign (SCCM) Adult Guidelines (Hour‑1 Bundle), 2021.
