# Step 1: SNOMED Ground Truth Extraction (via BioPortal API)

This notebook:
1. Validates each concept term against the BioPortal SNOMED CT API
2. Replaces NOT FOUND concepts with concepts drawn from a shuffled backup pool of known-good SNOMED CT terms
3. Extracts ground truth relationships (parents, grandparents, children, siblings)
4. Saves `validated_concepts.csv` and `ground_truth.csv`

**Run this notebook ONCE** from the `ground_truth/` folder.
The output is shared by all LLM testing folders (testing_gpt, testing_claude, etc.).

After this, run `step2_llm_queries.ipynb` in each testing folder.
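
The four relationship types in item 3 all derive from parent and child lookups. As a minimal sketch, assuming a toy child-to-parents map with illustrative labels (not data fetched from BioPortal), grandparents are the parents of parents and siblings are the other children of the same parents:

```python
# Toy hierarchy for illustration only: child label -> list of parent labels.
PARENTS = {
    "Acute myocardial infarction": ["Myocardial infarction"],
    "Old myocardial infarction": ["Myocardial infarction"],
    "Myocardial infarction": ["Ischemic heart disease"],
    "Ischemic heart disease": ["Heart disease"],
}


def children_of(concept):
    """Concepts that list `concept` among their parents."""
    return sorted(c for c, ps in PARENTS.items() if concept in ps)


def grandparents_of(concept):
    """Parents of the concept's parents, de-duplicated."""
    return sorted({gp for p in PARENTS.get(concept, []) for gp in PARENTS.get(p, [])})


def siblings_of(concept):
    """Other children of the concept's parents, excluding the concept itself."""
    return sorted(
        {s for p in PARENTS.get(concept, []) for s in children_of(p) if s != concept}
    )


print(grandparents_of("Acute myocardial infarction"))  # ['Ischemic heart disease']
print(siblings_of("Acute myocardial infarction"))      # ['Old myocardial infarction']
```

The notebook applies the same idea through API calls, so a concept with several parents contributes the union of their children and parents.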

## Configuration

In [6]:
import os
import re
import random
import requests
import time
from datetime import datetime
from pathlib import Path
from urllib.parse import quote

import pandas as pd

# ============================================================
# Path Configuration
# ============================================================

_cwd = Path(".").resolve()
# This notebook lives in ground_truth/ at repo root
if _cwd.name == "ground_truth":
    REPO_ROOT = _cwd.parent
else:
    REPO_ROOT = _cwd  # running from repo root

OUTPUT_ROOT = (REPO_ROOT / "output").resolve()
GT_ROOT = OUTPUT_ROOT / "ground_truth"
OUTPUT_ROOT.mkdir(parents=True, exist_ok=True)
GT_ROOT.mkdir(parents=True, exist_ok=True)

# Output files (shared across all LLMs)
GT_OUT = GT_ROOT / "ground_truth.csv"
VALIDATED_CONCEPTS_OUT = GT_ROOT / "validated_concepts.csv"
LOG_DIR = GT_ROOT

# Set to True to run on 5 random concepts only (for testing)
TEST_MODE = False

print("Output directory:", GT_ROOT)


Output directory: /Users/narenkhatwani/Documents/GitHub/llm-as-ontology-server/output/ground_truth


## Concept Terms

In [7]:
# Full list (same as in step1 / step2 / step3)
CONCEPT_TERMS = [
    "Acute myocardial infarction", "Atrial fibrillation", "Hypertensive disorder, systemic arterial", "Atherosclerosis",
    "Embolus", "Ventricular tachycardia", "Apnea", "Cyanosis", "Bronchitis", "Hyperglycemia", "Blood in urine", "Oliguria", "Polyuria", "Necrosis", "Ischemia", "Sepsis", "Malignant", "Benign", "Abscess", "Edema", "Intubation", "Biopsy", "Venipuncture", "Laparotomy", "Opening of chest", "Cholecystectomy", "Cardiopulmonary resuscitation", "Triage", "Proximal", "Distal", "Chronic", "Determination of prognosis", "Anaphylaxis", "Structure of ischiogluteal bursa of right hip", "Does not cut own nails", "Laceration of right brachial artery", "Date sample received in laboratory", "Autosomal recessive severe congenital neutropenia due to jagunal homolog 1 deficiency", "Product containing only ezetimibe and rosuvastatin", "Alpha>1< anti-trypsin isotype MZ", "2-dehydro-3-deoxy-L-pentonate aldolase", "Thiourea", "Protriptyline", "Use of sling suspension", "Insulin receptor defect", "Akinesis of basal anterior segment of left cardiac ventricle", "Bioterrorist attack",
    "Rickettsia australis", "Computed tomography 3 dimensional reconstruction", "Open reduction of fracture of tarsal bone without internal fixation", "Identification of antibody to red blood cell antigens from Colton system (International Society of Blood Transfusion 015)", "Spirometra houghtonii", "Difficulty preparing feed", "Public school", "Assistive electrical shaver adaptor", "Bathing of skin of free upper limb", "Primary hyperparathyroidism", "Ultrasonography of soft tissue of wrist region", "Standing tolerance", "Product containing precisely alverine citrate 120 milligram/1 each conventional release oral capsule", "Structure of temporomandibular articulation vein", "Closed comminuted fracture of proximal phalanx of finger", "Choroidal retinal neovascularization", "Microunit/milliliter", "Unable to use a non-speech system for communication", "Chronic infective pericarditis", "Accidental poisoning caused by salicylates", "Product containing only nicotinic acid", "Dimethyl-ether propane", "Nipple sharing technique", "Accidental clorazepate poisoning", "Fetal disorder due to persistent occipito-posterior malposition during labor", "Carnitine acylcarnitine translocase deficiency", "Revision of gastrojejunal anastomosis with reconstruction with partial gastrectomy", "Abnormal radial pulse", "Accident due to heat in drying room", "Disease caused by Trichinelloidea", "Cervical anterior longitudinal ligament sprain", "Entire costal groove of eighth rib", "Myxopapillary ependymoma of brain",
    "Repositioning of cardioverter/defibrillator leads", "Buttiauxella brennerae", "Position of testicle", "Accidental lomustine poisoning", "Hemangioma of retina of bilateral eyes", "Mandelate racemase", "Structure of rhomboideus thoracis muscle", "Unable to move from lying to sitting", "Contact lens inserter/remover", "Cytosine deaminase", "Atrolysin F", "Entire pupillary margin of iris", "Compression hosiery class II below knee stocking flatbed knit made to measure", "Entire small subcutaneous vein", "Operation on pharynx", "Fluoroscopic angiography of dialysis fistula using contrast with insertion of stent graft", "Product containing only methotrexate", "Wart clinic", "Hemoglobin Dagestan", "Substance", "Family Chlorobiaceae", "Structure of pedicle of axis", "Bone structure of L4", "Lasègue's arm sign", "Neuroendocrine tumor of middle ear", "Extension Namespace {1000194}", "Fracture of third rib", "Overriding tricuspid valve", "Cerebellar laceration with open intracranial wound", "Sacral spinal cord injury without bone injury", "Dehydrated hereditary stomatocytosis", "Product containing precisely tinzaparin sodium 20000 unit/1 milliliter conventional release solution for injection", "History of mood disorder", "Dislocation of prosthetic joint of elbow", "Product containing only brompheniramine and guaifenesin and hydrocodone in oral dose form", "Unable to wash laundry", "Entire corticothalamic fibers of posterior limb of internal capsule", "Product containing precisely lidocaine hydrochloride 1 milligram/1 milliliter conventional release solution for injection",
    "Congenital absence of uvula", "Streptococcus pyogenes type emm35", "Fracture of mandible involving dental socket", "Foreign body in skin of finger with infection", "Product containing precisely naloxone hydrochloride 400 microgram/1 milliliter conventional release solution for injection", "Excisional biopsy of lacrimal sac", "Catatonic posturing", "Structure of palmaris brevis muscle", "Butteroil", "Entire ventral posterolateral nucleus of thalamus", "Anomalous pulmonary venous drainage to right atrium", "Caries of cervical margin of tooth", "Bursitis of shoulder", "Laboratory oven", "Assessment using physical function intensive care unit test", "Oral dyskinesia", "Family Ranidae", "Acute postthoracotomy pain syndrome", "Trapezius flap", "Fetal movements palpated by healthcare professional", "Product containing only sodium thiosulfate in parenteral dose form", "Zamia", "Provision of removable artificial eye", "Hypokinesis of myocardium of anterolateral region of left ventricle", "CV20", "Primary malignant neoplasm of intrathoracic organs", "Salmonella IIIb 38:(k):z54", "Purpura simplex", "Arsine poisoning", "Continuous flow apneic ventilation", "Extension Namespace {1000164}", "Ear and auditory finding", "Antibody to myelin associated glycoprotein", "Genus Cercocebus", "Urge incontinence due to prolapse of female genital organ", "Streptococcus pyogenes type emm111", "Partial glossectomy with radical dissection of left half of neck", "Paenibacillus doosanensis", "Diapterus auratus", "Ribonucleic acid of Parainfluenza virus 4", "Cryptocotyle concava", "Calicivirus gastroenteritis", "Acute sensory polyneuropathy", "Sabouraud dextrose agar", "Urrets-Zavalia syndrome of eye due to and following ocular surgery",
    "Product containing only flortaucipir (18-F) in parenteral dose form", "Structure of skin of left knee", "Grafting of auricle of ear", "Closed fracture of fifth metatarsal bone of right foot", "Ultrasonic wave", "Accessory mobilization of the cervical spine", "Scilla nonscripta", "Product containing precisely mepacrine hydrochloride 100 milligram/1 each conventional release oral tablet", "Antibiotic sensitivity, fungus", "Alcohol, methyl measurement", "Structure of bursa of thumb", "History and physical examination, sports participation", "Haemophilus felis", "Cholylglycine measurement", "Radiation half-value thickness", "Entire embryo at stage 15", "Product containing only mecobalamin", "Genus Lophodytes", "Mycoplasma hyopneumoniae", "Fetal chromosomal abnormality", "History of cancer metastatic to bone", "Orthopedic fixation plate, non-bioabsorbable, non-sterile", "Neoplasm of isthmus of uterus", "Urethrotome", "Death by immolation", "Skin of part of back", "Croton oil", "Associated_finding_filler", "Lyssavirus aravan", "Main spoken language Esperanto", "Unable to prepare food hygienically", "Proximal radioulnar joint structure", "Early onset cerebellar ataxia", "Cresylglycidyl ether", "Etheostoma zoniferum", "Plasma thawing unit", "Fracture of coccyx", "3 o'clock position on mammogram", "Old posterior cruciate ligament disruption", "Anxious avoidant attachment", "Inflammation of left iliohypogastric nerve", "Re-excision of breast for clearance of tumor margins", "Diamond dental laboratory bur", "Family Fuselloviridae", "Occlusion of vein of corpus cavernosum of penis", "Autosomal dominant intermediate Charcot-Marie-Tooth disease type C", "Hemoglobin Volga", "Punctate outer retinal toxoplasmosis", "Congenital talipes calcaneovalgus", "Recurrent acute streptococcal tonsillitis",
    "Jaeger type 10", "Myofascial pain syndrome of lower back", "Spine dislocation due to birth trauma", "Netta rufina", "Congenital abnormality of systemic vein", "Stachybotrys chartarum", "Fracture of phalanx of finger of left hand", "Astroscopus y-graecum", "Vaccine product containing only Clostridium tetani and Human poliovirus antigens", "Single photon emission computed tomography system", "Measurement of Influenza A virus antibody and Influenza B virus antibody", "Chronic radiation nephritis", "Giant retinal tear", "Structure of lateral surface of upper arm", "Family Echinorhynchidae", "Embolectomy with catheter of celiac artery by abdominal incision", "Entire zygomatic arch", "Application of hot water bottle", "Tilmicosin", "Hypothalamic releasing factor", "Perioral numbness", "Electron microscopy study, transmission, examination and report", "Structure of hypothalamicohypophyseal tract", "Incision and exploration of tunica vaginalis", "Injury of nasopharynx", "Grafting of fascia to tarsal cartilage", "Accidental ingestion of spurge olive berries", "Stimulus response inventory", "Fall risk assessment declined", "Vegetable suet", "Flexion deformity of joint of right ankle", "Delayed allogeneic transplantation, living donor", "Finding of knee joint color", "Entire sinus venosus of fetus", "Blood (white blood cell) screen for GM1 gangliosidosis", "Cystic testicular dysplasia", "No active range of shoulder circumduction", "Product containing precisely promethazine hydrochloride 25 milligram/1 each conventional release oral tablet", "Spotted creeper", "Chilomastix intestinalis", "Extraprostatic extension of tumor present, non-focal", "Arteriovenous graft rupture", "Structure of superficial anterior cervical lymph node", "Myelodysplastic syndrome with excess blasts-1", "Administration of substance via buccal route",
    "Product containing only bendroflumethiazide and potassium chloride", "Transmission-based precautions", "Tooth disorder", "Physical", "Dislocation of ear ossicles", "Fracture dislocation of symphysis pubis", "Tache noire of sclera of left eye", "Volatile inhalant dependence, episodic", "Decreased N-3 fatty acid diet", "Carbapenemase-producing Lelliottia amnigena", "Polychromatophilic erythroblast", "Fuller's earth dust", "Stress fracture of sacrum", "Poor self-esteem", "Bathrocephaly", "Snuff user", "Generalized onset motor epileptic seizure", "Allergy treatment changed", "Erosion of pacemaker pocket due to and following implantation of cardiac pacemaker", "Head lifting exercise", "Entire corpus penis", "Oligopus diagrammus", "Deficiency of ribose-5-phosphate isomerase", "Progestogen overdose", "Cartilage graft - prominent ear", "Borrelia theileri", "Mood disorder with mixed depressive and manic symptoms caused by volatile inhalant", "Degloving injury buttock", "Attachment of bone anchored hearing prosthesis", "Air traffic controller", "Entire intervertebral symphysis between T2 and T3", "Protostrongylus brevispiculum", "Vasomotor rhinitis", "Able to open and close mouth", "Pseudoxanthomonas helianthi", "Repair of ventral hernia using graft", "Assessment using Short Form of the Informant Questionnaire on Cognitive Decline in the Elderly", "Acute inflammation", "Autosomal recessive hereditary spastic paraplegia", "Octomitus", "Toxoplasma tubulointerstitial nephropathy", "Notropis harperi", "History of toxoplasmosis", "Blood group antigen Sk^a^", "Wedge excision of skin of nail fold", "Finding of difference in location compared to previous radiologic examination", "Christo Inventory for Substance-misuse Services total score", "Aspidites melanocephalus", "General medical self-referral", "Artery of extremity transposition", "Occlusion of left central retinal artery",
    "Nutrient absorption inhibitor", "Laceration of thigh without foreign body", "Eimeria magna", "Enterocolic fistula", "Rehabilitation care plan", "Blood group antigen Naz", "Bilastine", "Appearance of oral mucosa", "Superior closed dislocation", "Khoshamian", "Urine codeine measurement", "Computed tomography of thigh for radiotherapy planning", "Application of long leg cast", "Acute duodenitis", "Nuclear medicine system table, powered", "Local tumor spread", "Family Kogiidae", "Able to transfer from wheelchair to chair", "Selectron therapy", "Structure of superficial part of superior levator palpebrae muscle", "Entire left wall of urinary bladder", "Urine specimen care assessment", "Structure of intervertebral disc of sacral vertebra", "Amino acids peritoneal dialysis solution", "Family Caryophyllaceae", "Umbilical catheter submitted as specimen", "Product containing precisely desmopressin acetate 100 microgram/1 milliliter conventional release nasal solution", "Syncytial cell", "Extension Namespace {1000085}", "Congenital absence of frontal bone", "Incision and exploration of rectum", "Needle catheter", "Percutaneous transluminal embolization of femoral artery", "Stung by cone shell", "Head artery injection", "True insight, function", "Disorder due to and following injury of muscle and tendon of upper limb", "Entire left foot", "Recurrent erosion of cornea of left eye", "Excision of salivary gland", "Adverse effect of prosthetic device", "Total excision of meningioma", "Cote d'Ivoire", "Fluoroscopic venography of hepatic vein with contrast and insertion of stent graft", "Amyloid deposition", "Measurement of serum thyroid stimulating hormone", "Excessive secretin secretion", "Cataract lens fragments in vitreous of left eye due to and following cataract surgery", "Toe joint crepitus", "Gynecological operating table", "Product containing hydrolase inhibitor", "Edoxaban tosylate", "Deep dog bite", "Family Kinosternidae",
    "Substance with sodium/potassium adenosine triphosphatase enzyme inhibitor mechanism of action", "Delayed suture of tendon", "Specimen source",
    "Microinvasive carcinoma", "Abdominal reflex delayed", "Product containing only phenazopyridine", "Incision and drainage of hematoma", "Skin structure of infratemporal region", "Soft tissue lesion of foot region", "Arthroscopic release of capsule of hip joint", "Traumatic joint hemarthrosis", "Long QT syndrome caused by drug", "Neocoger mucronatus", "Product containing bismuth subsalicylate and metronidazole and tetracycline", "Pulmonary artery pressure", "Arthroscopy of shoulder with complete synovectomy", "beta-Aminoisobutyrate measurement", "Injury of thymus gland", "Finding related to ability to balance", "Porphyrins, quantitation and fractionation, urine", "Trichostrongylus retortaeformis", "Extrinsic motivation", "Duodenal candidosis", "Take impression for lower removable orthodontic appliance", "Primary basal cell carcinoma of skin of left foot", "Rigid ankle-foot orthosis", "Closed fracture thumb proximal phalanx, head", "Acute suppuration of maxillary sinus", "Dodecenoyl-coenzyme A delta-isomerase", "Entire lymphatic vessel of orbit", "Blood group antigen Wallin", "Corynebacterium species not Corynebacterium jeikeium", "Nocardioides terrae", "Lactobacillus rhamnosus GG", "Pain of breast", "Structure of left cochlear", "Decreased active range of hip flexion"
]

print(f"Full list: {len(CONCEPT_TERMS)} concepts")


Full list: 400 concepts


In [8]:
if TEST_MODE:
    import random as _rng
    _rng.seed(42)
    CONCEPT_TERMS = _rng.sample(CONCEPT_TERMS, min(5, len(CONCEPT_TERMS)))
    print(f"TEST MODE: using {len(CONCEPT_TERMS)} random concepts")
print(f"Concepts to process: {len(CONCEPT_TERMS)}")


Concepts to process: 400


## Backup Concepts

Known-good SNOMED CT concepts used as replacements when a primary concept is NOT FOUND.

In [9]:
BACKUP_CONCEPTS = [
    "Diabetes mellitus",
    "Asthma",
    "Pneumonia",
    "Fracture of femur",
    "Appendectomy",
    "Aspirin allergy",
    "Migraine",
    "Osteoarthritis",
    "Chronic kidney disease",
    "Hypothyroidism",
    "Rheumatoid arthritis",
    "Epilepsy",
    "Pulmonary embolism",
    "Cellulitis",
    "Otitis media",
    "Cirrhosis of liver",
    "Gout",
    "Psoriasis",
    "Endoscopy",
    "Electrocardiogram",
    "Colonoscopy",
    "Tonsillectomy",
    "Blood glucose measurement",
    "Platelet count",
    "Hemoglobin A1c measurement",
    "Cerebrovascular accident",
    "Deep vein thrombosis",
    "Congestive heart failure",
    "Chronic obstructive lung disease",
    "Peptic ulcer",
]

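# Remove any backup concepts that are already in the primary list
_primary_terms = set(CONCEPT_TERMS)
BACKUP_CONCEPTS = [c for c in BACKUP_CONCEPTS if c not in _primary_terms]
# Shuffle with a fixed seed so a resumed run sees the same backup order;
# the resume logic skips consumed backups by position (backup_idx).
random.Random(42).shuffle(BACKUP_CONCEPTS)
print(f"Primary concepts: {len(CONCEPT_TERMS)}, Backup pool: {len(BACKUP_CONCEPTS)}")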


Primary concepts: 400, Backup pool: 30


## BioPortal API Setup

Configure API access and define helper functions for SNOMED CT queries.
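
The request helper in the next cell retries failed calls, waiting 3 seconds before the first retry and 6 seconds before the second. A minimal sketch of that linear backoff follows, isolated from any network code. The names `call_with_retries` and `flaky` are hypothetical, and the `sleep` parameter is injected so the example runs instantly (pass `time.sleep` for real delays):

```python
def call_with_retries(fn, max_retries=2, base_wait=3, sleep=lambda seconds: None):
    """Call fn(); on failure wait base_wait * (attempt + 1) seconds, then retry.

    Re-raises the last exception once max_retries retries are used up.
    """
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception:
            if attempt >= max_retries:
                raise
            sleep(base_wait * (attempt + 1))


attempts = []  # one entry per call to flaky()
waits = []     # records the delay requested before each retry


def flaky():
    """Simulated request that fails twice, then succeeds."""
    attempts.append(1)
    if len(attempts) < 3:
        raise TimeoutError("simulated timeout")
    return {"ok": True}


result = call_with_retries(flaky, sleep=waits.append)
print(result, len(attempts), waits)  # {'ok': True} 3 [3, 6]
```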

In [10]:
LOG_PATH = LOG_DIR / "logs.txt"

# ============================================================
# BioPortal API configuration
# ============================================================
BIOPORTAL_BASE = "https://data.bioontology.org"
ONTOLOGY = "SNOMEDCT"

# Option 1: Set via environment variable (recommended)
# Option 2: Paste your key directly below if env var is not picked up
BIOPORTAL_API_KEY = os.environ.get("BIOPORTAL_API_KEY", "")

# Uncomment and paste your key here if the environment variable is not detected.
# Keep real keys out of version control.
# BIOPORTAL_API_KEY = "your-key"

if not BIOPORTAL_API_KEY:
    raise EnvironmentError(
        "BIOPORTAL_API_KEY is not set. Either:\n"
        "  1. export BIOPORTAL_API_KEY='your-key' in your shell and restart the kernel, or\n"
        "  2. Paste your key directly in the cell above (uncomment the line)."
    )

API_DELAY = 0.5   # seconds between API calls
API_TIMEOUT = 30   # seconds per request
MAX_RETRIES = 2    # retry on transient failures

_api_cache = {}
_api_call_count = 0


def _class_uri(concept_id: str) -> str:
    """Build the BioPortal class URI for a SNOMED CT concept."""
    return f"http://purl.bioontology.org/ontology/SNOMEDCT/{concept_id}"


def _encode_uri(uri: str) -> str:
    """URL-encode a class URI for use in BioPortal path segments."""
    return quote(uri, safe="")


def _extract_concept_id(class_id: str) -> str:
    """Extract the SNOMED concept ID from a BioPortal class @id URI."""
    if "/" in class_id:
        return class_id.rsplit("/", 1)[-1]
    return class_id


def _api_get(url, params=None, timeout=None):
    """GET with caching, rate-limiting, and retry logic."""
    global _api_call_count
    cache_key = url + str(sorted((params or {}).items()))
    if cache_key in _api_cache:
        return _api_cache[cache_key]

    timeout = timeout or API_TIMEOUT

    if params is None:
        params = {}
    params.setdefault("apikey", BIOPORTAL_API_KEY)
    params.setdefault("display_links", "false")
    params.setdefault("display_context", "false")

    for attempt in range(MAX_RETRIES + 1):
        try:
            time.sleep(API_DELAY)
            resp = requests.get(url, params=params, timeout=timeout)
            resp.raise_for_status()
            data = resp.json()
            _api_cache[cache_key] = data
            _api_call_count += 1
            if _api_call_count % 20 == 0:
                print(f"  [{_api_call_count} API calls so far]")
            return data
        except requests.exceptions.RequestException as e:
            # Timeout is a RequestException subclass, so one handler covers both.
            if attempt >= MAX_RETRIES:
                raise
            wait = (attempt + 1) * 3  # linear backoff: 3s, then 6s
            label = "Timeout" if isinstance(e, requests.exceptions.Timeout) else f"Error: {e}"
            print(f"    {label}, retry {attempt+1}/{MAX_RETRIES} in {wait}s...")
            time.sleep(wait)


def _csv_safe(x):
    """Return x as a single-line string for CSV output; None becomes ''."""
    if x is None:
        return ""
    return str(x).replace("\r", " ").replace("\n", " ").strip()


def list_to_pipe(items):
    """Convert a list of strings to pipe-separated string. No truncation."""
    items = [str(i).replace("|", " ").strip() for i in (items or []) if str(i).strip()]
    return "|".join(items)


def semantic_tag_from_fsn(fsn: str) -> str:
    """Return the trailing parenthesised semantic tag of an FSN, or 'UNKNOWN'."""
    m = re.search(r"\(([^()]*)\)\s*$", fsn or "")
    return m.group(1).strip() if m else "UNKNOWN"


# ============================================================
# SNOMED CT query functions
# ============================================================

def search_concept(term: str, exact=True):
    """Search for a concept by term.
    Returns (concept_id, prefLabel) or None.
    If exact=False, uses partial matching."""
    url = f"{BIOPORTAL_BASE}/search"
    params = {
        "q": term,
        "ontologies": ONTOLOGY,
        "pagesize": "10",
    }
    if exact:
        params["require_exact_match"] = "true"
    data = _api_get(url, params=params)
    for item in data.get("collection", []):
        if item.get("obsolete"):
            continue
        class_id = item.get("@id", "")
        pref_label = item.get("prefLabel", "")
        cid = _extract_concept_id(class_id)
        if cid:
            return cid, pref_label
    return None


def get_class_info(cid: str):
    """Get class info including FSN and definition status.
    Returns (fsn, definition_status).
    The FSN (Fully Specified Name) includes the semantic tag, e.g.
    'Selectron therapy (procedure)'. BioPortal's prefLabel returns
    the Preferred Term (no semantic tag), so we look through synonyms
    for the FSN which ends with '(tag)'."""
    try:
        encoded = _encode_uri(_class_uri(cid))
        url = f"{BIOPORTAL_BASE}/ontologies/{ONTOLOGY}/classes/{encoded}"
        data = _api_get(url, params={"include": "prefLabel,synonym,properties"})
        pref_label = data.get("prefLabel", "")
        synonyms = data.get("synonym", []) or []

        # --- Find the FSN (the synonym ending with a semantic tag in parens) ---
        fsn = pref_label  # fallback
        # Check if prefLabel itself already has the semantic tag
        if re.search(r"\([^()]+\)\s*$", pref_label):
            fsn = pref_label
        else:
            # Search through synonyms for one that ends with (semantic_tag)
            for syn in synonyms:
                if isinstance(syn, str) and re.search(r"\([^()]+\)\s*$", syn):
                    fsn = syn
                    break

        # --- Extract definition status ---
        props = data.get("properties", {})
        def_status_id = ""
        # Search through all property keys for definition status
        # BioPortal may use different key formats:
        #   - 'definitionStatusId'
        #   - 'DEFINITION_STATUS_ID'
        #   - URI like 'http://snomed.info/.../definitionStatusId'
        for key, val in props.items():
            key_lower = key.lower()
            if "definitionstatus" in key_lower or "definition_status" in key_lower:
                # Normalize to a string so the substring checks below are safe
                if isinstance(val, list):
                    def_status_id = str(val[0]) if val else ""
                else:
                    def_status_id = str(val)
                break
        # Debug: log property keys for the first few concepts
        if _api_call_count <= 5:
            prop_keys = [k.rsplit('/', 1)[-1] if '/' in k else k for k in props.keys()]
            print(f"    [debug] properties keys for {cid}: {prop_keys}")
            if def_status_id:
                print(f"    [debug] definition_status raw value: {def_status_id}")
        if "900000000000073002" in def_status_id:
            def_status = "Fully defined"
        elif "900000000000074008" in def_status_id:
            def_status = "Primitive"
        else:
            # Try to interpret the value itself. Check the primitive forms first:
            # "Not sufficiently defined ..." also contains the word "defined".
            val_lower = def_status_id.lower()
            if ("primitive" in val_lower
                    or "not sufficiently" in val_lower
                    or "not_sufficiently" in val_lower):
                def_status = "Primitive"
            elif "defined" in val_lower:
                def_status = "Fully defined"
            else:
                def_status = "UNKNOWN"
        return fsn, def_status
    except Exception as e:
        print(f"    WARNING: get_class_info({cid}) failed: {e}")
        return "", "UNKNOWN"


def get_parents(cid: str):
    """Get parents. Returns list of (concept_id, prefLabel) tuples."""
    try:
        encoded = _encode_uri(_class_uri(cid))
        url = f"{BIOPORTAL_BASE}/ontologies/{ONTOLOGY}/classes/{encoded}/parents"
        data = _api_get(url)
        results = []
        if isinstance(data, list):
            items = data
        else:
            items = data.get("collection", data.get("results", []))
        for item in items:
            pid = _extract_concept_id(item.get("@id", ""))
            pref = item.get("prefLabel", "")
            if pid:
                results.append((pid, pref))
        return results
    except Exception as e:
        print(f"    WARNING: get_parents({cid}) failed: {e}")
        return []


def get_children(cid: str):
    """Get children. Returns list of (concept_id, prefLabel) tuples.
    Only the first page (up to 100 children) is requested."""
    try:
        encoded = _encode_uri(_class_uri(cid))
        url = f"{BIOPORTAL_BASE}/ontologies/{ONTOLOGY}/classes/{encoded}/children"
        data = _api_get(url, params={"pagesize": "100"}, timeout=20)
        results = []
        items = data.get("collection", []) if isinstance(data, dict) else data
        for item in items:
            child_id = _extract_concept_id(item.get("@id", ""))
            pref = item.get("prefLabel", "")
            if child_id:
                results.append((child_id, pref))
        return results
    except requests.exceptions.Timeout:
        # A timeout does not prove the concept is a leaf; the result is simply unknown.
        print(f"    (children timeout for {cid} - returning no children)")
        return []
    except Exception as e:
        print(f"    WARNING: get_children({cid}) failed: {e}")
        return []


def get_siblings(cid: str):
    """Get siblings (share a parent). Returns list of prefLabel strings."""
    sibs = set()
    for pid, _ in get_parents(cid):
        for child_id, child_pref in get_children(pid):
            if child_id != cid:
                sibs.add(child_pref)
    return list(sibs)


def get_grandparents(cid: str):
    """Get grandparents (parents of parents, depth -2). Returns list of prefLabel strings."""
    gps = set()
    for parent_id, _ in get_parents(cid):
        for _, gp_pref in get_parents(parent_id):
            gps.add(gp_pref)
    return list(gps)


print("BioPortal API configured.")
print(f"API base: {BIOPORTAL_BASE}")
print(f"Ontology: {ONTOLOGY}")


BioPortal API configured.
API base: https://data.bioontology.org
Ontology: SNOMEDCT
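
Two of the helpers above are pure string functions and can be checked without an API key. The sketch below reproduces their logic under hypothetical names (`class_path_segment`, `semantic_tag`) to show what the encoded path segment and the parsed semantic tag look like. The concept ID is only an example value:

```python
import re
from urllib.parse import quote


def class_path_segment(concept_id):
    """Build the class URI and percent-encode it for use as one URL path segment."""
    uri = f"http://purl.bioontology.org/ontology/SNOMEDCT/{concept_id}"
    # safe="" also encodes "/" and ":", which quote() would otherwise leave as-is.
    return quote(uri, safe="")


def semantic_tag(fsn):
    """Return the trailing parenthesised tag of an FSN, or "UNKNOWN" if absent."""
    m = re.search(r"\(([^()]*)\)\s*$", fsn or "")
    return m.group(1).strip() if m else "UNKNOWN"


print(class_path_segment("22298006"))
# http%3A%2F%2Fpurl.bioontology.org%2Fontology%2FSNOMEDCT%2F22298006
print(semantic_tag("Selectron therapy (procedure)"))  # procedure
print(semantic_tag("Selectron therapy"))              # UNKNOWN
```

Encoding the whole URI matters because the class URI itself contains slashes, which would otherwise be read as path separators in the request URL.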


## Concept Validation & Ground Truth Extraction

For each concept:
1. Try exact match in BioPortal
2. If not found, try partial/fuzzy match
3. If still not found, replace with a concept from the backup pool
4. Extract ground truth relationships

**Resume:** If you interrupt and re-run this cell, it loads existing
`ground_truth.csv` and `validated_concepts.csv`, skips concepts already done,
and continues from the next concept. Progress is saved every 10 concepts.
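
The lookup order in steps 1 to 3 can be sketched as a small function. This is an illustration under stated assumptions, not the notebook's code: `resolve` and `fake_search` are hypothetical names, the IDs are made up, and the stub index stands in for the BioPortal search. The status strings match those written to `validated_concepts.csv`.

```python
def resolve(term, search, backups, backup_idx=0):
    """Return (resolved_term, hit, status, next_backup_idx).

    `search(term, exact)` returns a (concept_id, label) tuple or None.
    """
    hit = search(term, exact=True)
    if hit is not None:
        return term, hit, "found", backup_idx
    hit = search(term, exact=False)
    if hit is not None:
        return term, hit, "partial_match", backup_idx
    # Consume backups in order until one resolves; each is tried at most once.
    while backup_idx < len(backups):
        candidate = backups[backup_idx]
        backup_idx += 1
        hit = search(candidate, exact=True) or search(candidate, exact=False)
        if hit is not None:
            return candidate, hit, "replaced", backup_idx
    return term, None, "not_found", backup_idx


# Stub index with made-up IDs, standing in for the remote search.
EXACT = {"Asthma": ("A1", "Asthma")}
PARTIAL = {"Bronchit": ("B2", "Bronchitis")}


def fake_search(term, exact=True):
    if exact:
        return EXACT.get(term)
    return EXACT.get(term) or PARTIAL.get(term)


print(resolve("Asthma", fake_search, []))
print(resolve("Bronchit", fake_search, []))
print(resolve("No such term", fake_search, ["Missing backup", "Asthma"]))
print(resolve("No such term", fake_search, []))
```

Returning the advanced `backup_idx` lets the caller carry the position forward, so no backup concept is used twice across the run.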

In [11]:
# ============================================================
# Validate concepts and extract ground truth
# ============================================================

backup_idx = 0
gt_rows = []
validation_rows = []
not_found_terms = []
replaced_terms = []

# ---- Resume: load existing progress if present ----
# Read every column as text. With default dtypes, the blank snomed_id of a
# not_found row turns the whole column into floats and corrupts the IDs.
already_processed = set()
if VALIDATED_CONCEPTS_OUT.exists():
    _val = pd.read_csv(VALIDATED_CONCEPTS_OUT, dtype=str, keep_default_na=False)
    already_processed = set(_val["original_term"].dropna().astype(str).str.strip())
    validation_rows = [
        {
            "concept_term": r.get("concept_term", ""),
            "original_term": r.get("original_term", ""),
            "snomed_id": str(r.get("snomed_id", "")),
            "status": r.get("status", ""),
        }
        for r in _val.to_dict("records")
    ]
    backup_idx = sum(1 for r in validation_rows if r.get("status") == "replaced")
if GT_OUT.exists():
    _gt = pd.read_csv(GT_OUT, dtype=str, keep_default_na=False)
    gt_rows = [
        {k: ("" if pd.isna(v) else str(v)) for k, v in row.items()}
        for row in _gt.to_dict("records")
    ]

concepts_to_process = [c for c in CONCEPT_TERMS if c not in already_processed]

if already_processed:
    print(f"Resuming: {len(already_processed)} concepts already done, {len(concepts_to_process)} remaining.")
print(f"Validating and extracting GT for {len(concepts_to_process)} concepts...")
print(f"Backup pool: {len(BACKUP_CONCEPTS)} concepts available (consumed so far: {backup_idx})\n")

for i, concept_term in enumerate(concepts_to_process):
    original_term = concept_term
    status = "found"

    try:
        # Step 1: Try exact match
        result = search_concept(concept_term, exact=True)

        # Step 2: Try partial match if exact fails
        if result is None:
            result = search_concept(concept_term, exact=False)
            if result:
                status = "partial_match"
                print(f"  [{i+1}/{len(concepts_to_process)}] {concept_term} -> partial match: {result[1]}")

        # Step 3: Replace with backup concept if still not found
        if result is None:
            not_found_terms.append(concept_term)
            # Try backup concepts until we find one that works
            replacement_found = False
            while backup_idx < len(BACKUP_CONCEPTS):
                backup_term = BACKUP_CONCEPTS[backup_idx]
                backup_idx += 1
                backup_result = search_concept(backup_term, exact=True)
                if backup_result is None:
                    backup_result = search_concept(backup_term, exact=False)
                if backup_result:
                    result = backup_result
                    concept_term = backup_term
                    status = "replaced"
                    replaced_terms.append((original_term, backup_term))
                    print(f"  [{i+1}/{len(concepts_to_process)}] {original_term} -> NOT FOUND, replaced with: {backup_term} ({result[0]})")
                    replacement_found = True
                    break
            if not replacement_found:
                print(f"  [{i+1}/{len(concepts_to_process)}] {original_term} -> NOT FOUND (no backup available)")
                validation_rows.append({
                    "concept_term": original_term,
                    "original_term": original_term,
                    "snomed_id": "",
                    "status": "not_found",
                })
                with LOG_PATH.open("a") as f:
                    f.write(f"{datetime.now().isoformat()}\t{original_term}\tNOT_FOUND\n")
                continue

        cid, pref_label = result

        # Extract ground truth
        fsn, def_status = get_class_info(cid)
        if not fsn:
            fsn = pref_label  # fallback to search prefLabel
        sem_tag = semantic_tag_from_fsn(fsn)

        parent_tuples = get_parents(cid)
        child_tuples = get_children(cid)
        parent_labels = [lbl for _, lbl in parent_tuples]
        child_labels = [lbl for _, lbl in child_tuples]
        sib_labels = get_siblings(cid)
        gp_labels = get_grandparents(cid)

        gt_rows.append({
            "timestamp": datetime.now().isoformat(),
            "concept_term": concept_term,
            "snomed_id": cid,
            "fsn": _csv_safe(fsn),
            "semantic_tag": _csv_safe(sem_tag),
            "definition_status": _csv_safe(def_status),
            "parents": list_to_pipe(parent_labels),
            "grandparents": list_to_pipe(gp_labels),
            "children": list_to_pipe(child_labels),
            "siblings": list_to_pipe(sib_labels),
        })

        validation_rows.append({
            "concept_term": concept_term,
            "original_term": original_term,
            "snomed_id": cid,
            "status": status,
        })

        if status == "found":
            print(f"  [{i+1}/{len(concepts_to_process)}] {concept_term} -> {cid} "
                  f"({len(parent_labels)}P, {len(gp_labels)}GP, {len(child_labels)}C, {len(sib_labels)}S)")

        with LOG_PATH.open("a", encoding="utf-8") as f:
            f.write(
                "\n" + "=" * 80 + "\n" + datetime.now().isoformat() + "\n"
                + f"CONCEPT: {concept_term} (original: {original_term})\n"
                + f"STATUS: {status}\n"
                + f"SNOMED_ID: {cid}\nFSN: {fsn}\n"
                + f"TAG: {sem_tag}\nDEF_STATUS: {def_status}\n"
                + f"PARENTS: {parent_labels}\nGRANDPARENTS: {gp_labels}\n"
                + f"CHILDREN: {child_labels}\nSIBLINGS ({len(sib_labels)}): {sib_labels[:10]}...\n"
            )

    except Exception as e:
        with LOG_PATH.open("a", encoding="utf-8") as f:
            f.write(f"{datetime.now().isoformat()}\t{original_term}\tERROR\t{e}\n")
        print(f"  [{i+1}/{len(concepts_to_process)}] {original_term} -> ERROR: {e}")

    # --- Incremental save every 10 concepts ---
    if gt_rows and (i + 1) % 10 == 0:
        _tmp_gt = pd.DataFrame(gt_rows)
        _tmp_gt = _tmp_gt[[
            "timestamp", "concept_term", "snomed_id",
            "fsn", "semantic_tag", "definition_status",
            "parents", "grandparents", "children", "siblings",
        ]]
        _tmp_gt.to_csv(GT_OUT, index=False)
        _tmp_val = pd.DataFrame(validation_rows)
        if not _tmp_val.empty:
            _tmp_val[["concept_term", "original_term", "snomed_id", "status"]].to_csv(VALIDATED_CONCEPTS_OUT, index=False)
        print(f"    [checkpoint] Saved {len(gt_rows)} concepts to CSV")

print(f"\nDone. Total API calls: {_api_call_count}")
print(f"Found: {sum(1 for r in validation_rows if r['status'] == 'found')}")
print(f"Partial match: {sum(1 for r in validation_rows if r['status'] == 'partial_match')}")
print(f"Replaced: {sum(1 for r in validation_rows if r['status'] == 'replaced')}")
print(f"Not found (no replacement): {sum(1 for r in validation_rows if r['status'] == 'not_found')}")


Validating and extracting GT for 400 concepts...
Backup pool: 30 concepts available (consumed so far: 0)

    [debug] properties keys for 57054005: ['TYPE_ID', 'CASE_SIGNIFICANCE_ID', 'core#notation', 'core#altLabel', 'EFFECTIVE_TIME', 'ACTIVE', 'has_finding_site', 'has_clinical_course', 'core#prefLabel', 'rdf-schema#subClassOf', 'hasSTY', 'cause_of', '22-rdf-syntax-ns#type', 'SUBSET_MEMBER', 'DEFINITION_STATUS_ID', 'tui', 'CTV3ID', 'cui', 'occurs_before', 'has_associated_morphology']
    [debug] definition_status raw value: 900000000000073002
  [1/400] Acute myocardial infarction -> 57054005 (2P, 3GP, 19C, 18S)
  [2/400] Atrial fibrillation -> 49436004 (3P, 4GP, 12C, 13S)
  [20 API calls so far]
  [3/400] Hypertensive disorder, systemic arterial -> 38341003 (2P, 2GP, 27C, 66S)
  [4/400] Atherosclerosis -> 38716007 (1P, 1GP, 0C, 0S)
  [5/400] Embolus -> 55584005 (1P, 2GP, 23C, 51S)
  [40 API calls so far]
  [6/400] Ventricular tachycardia -> 25569003 (2P, 4GP, 24C, 38S)
  [7/400] Apnea
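
The checkpoint step in the loop above rewrites `ground_truth.csv` in place every 10 concepts, so an interruption in the middle of a write could leave a truncated file. One possible hardening, sketched below, is to write to a temporary file in the same directory and then swap it in with `os.replace`. The helper name `atomic_write_csv` and the `GT_COLUMNS` constant are hypothetical and not part of this notebook. The sketch uses the standard-library `csv` module instead of pandas so that it stays self-contained.

```python
import csv
import os
import tempfile
from pathlib import Path

# Same column order the notebook uses for ground_truth.csv
GT_COLUMNS = [
    "timestamp", "concept_term", "snomed_id",
    "fsn", "semantic_tag", "definition_status",
    "parents", "grandparents", "children", "siblings",
]


def atomic_write_csv(rows, path, columns):
    """Write a list of dict rows to `path` via a temp file plus os.replace.

    Hypothetical helper: a reader of `path` sees either the previous
    checkpoint or the new one, never a half-written file.
    """
    path = Path(path)
    # The temp file must live in the same directory so the rename stays
    # on one filesystem.
    fd, tmp_name = tempfile.mkstemp(
        dir=path.parent, prefix=path.name + ".", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w", newline="", encoding="utf-8") as f:
            # Missing keys are written as empty strings; extra keys are ignored
            writer = csv.DictWriter(f, fieldnames=columns, extrasaction="ignore")
            writer.writeheader()
            writer.writerows(rows)
        os.replace(tmp_name, path)
    except BaseException:
        # Remove the partial temp file; the previous checkpoint is untouched
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

Under these assumptions, the checkpoint could call `atomic_write_csv(gt_rows, GT_OUT, GT_COLUMNS)` in place of the direct `to_csv` call, since `gt_rows` is already a list of dicts with these keys.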

## Save Results

In [12]:
# ============================================================
# Save ground truth CSV
# ============================================================
if gt_rows:
    gt_df = pd.DataFrame(gt_rows)
    gt_df = gt_df[[
        "timestamp", "concept_term", "snomed_id",
        "fsn", "semantic_tag", "definition_status",
        "parents", "grandparents", "children", "siblings",
    ]]
    gt_df.to_csv(GT_OUT, index=False)
    print(f"Ground truth saved: {GT_OUT} ({len(gt_df)} concepts)")
else:
    print("WARNING: No ground truth rows to save!")

# ============================================================
# Save validated concepts CSV
# ============================================================
if validation_rows:
    val_df = pd.DataFrame(validation_rows)
    val_df = val_df[["concept_term", "original_term", "snomed_id", "status"]]
    val_df.to_csv(VALIDATED_CONCEPTS_OUT, index=False)
    print(f"Validated concepts saved: {VALIDATED_CONCEPTS_OUT} ({len(val_df)} concepts)")
else:
    print("WARNING: No validated concepts to save!")

# ============================================================
# Summary (from full validation_rows so resume runs are included)
# ============================================================
_replaced = [(r["original_term"], r["concept_term"]) for r in validation_rows if r.get("status") == "replaced"]
_not_found = [r["original_term"] for r in validation_rows if r.get("status") == "not_found"]
if _replaced:
    print("\nReplaced concepts:")
    for orig, repl in _replaced:
        print(f"  {orig} -> {repl}")

if _not_found:
    print("\nCould not find or replace:")
    for t in _not_found:
        print(f"  {t}")

print("\n" + "=" * 80)
print("Step 1 complete. Ground truth is shared by all LLM folders.")
print("Now run step2_llm_queries.ipynb in each testing_* folder.")
print("=" * 80)


Ground truth saved: /Users/narenkhatwani/Documents/GitHub/llm-as-ontology-server/output/ground_truth/ground_truth.csv (400 concepts)
Validated concepts saved: /Users/narenkhatwani/Documents/GitHub/llm-as-ontology-server/output/ground_truth/validated_concepts.csv (400 concepts)

Step 1 complete. Ground truth is shared by all LLM folders.
Now run step2_llm_queries.ipynb in each testing_* folder.
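
Downstream notebooks need to turn the relationship columns of `ground_truth.csv` back into lists. The sketch below shows one way to do that, assuming `list_to_pipe` joins labels with a `|` character; check this against its definition earlier in the notebook. The names `pipe_to_list` and `load_ground_truth` are hypothetical helpers, not functions defined in this project.

```python
import csv
from pathlib import Path

# Columns written through list_to_pipe in the ground truth rows
RELATION_COLUMNS = ("parents", "grandparents", "children", "siblings")


def pipe_to_list(cell, sep="|"):
    """Split a delimited cell into a list of stripped, non-empty labels."""
    if not cell:
        return []
    return [part.strip() for part in cell.split(sep) if part.strip()]


def load_ground_truth(path):
    """Return {concept_term: row} with relationship columns parsed to lists.

    If a concept_term appears more than once, the last row wins.
    """
    records = {}
    with Path(path).open(newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            for col in RELATION_COLUMNS:
                row[col] = pipe_to_list(row.get(col, ""))
            records[row["concept_term"]] = row
    return records
```

With the paths from the Configuration cell, `load_ground_truth(GT_OUT)["Embolus"]["parents"]` would return the parent labels for that concept as a Python list.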
