# Installing Necessary Packages

In [2]:
!pip -q install zai-sdk
!pip -q install openai
!pip -q install google-genai

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m125.6/125.6 kB[0m [31m3.7 MB/s[0m eta [36m0:00:00[0m
[?25h

# Import Packages

In [3]:
import os
import sys
import re
import time
import json
import random
import logging
import warnings
import requests

import numpy as np
import pandas as pd

from google import genai
from google.genai.types import GenerateContentConfig

from abc import ABC, abstractmethod

from zai import ZaiClient
from openai import OpenAI
from tqdm import tqdm
from kaggle_secrets import UserSecretsClient

pd.set_option('display.max_colwidth', None)
warnings.filterwarnings('ignore')

# Logging

In [4]:
logger = logging.getLogger()      # root logger
logger.setLevel(logging.WARNING)

if not logger.handlers:
    handler = logging.StreamHandler(sys.stdout)
    formatter = logging.Formatter("%(asctime)s %(levelname)s: %(message)s")
    handler.setFormatter(formatter)
    logger.addHandler(handler)

# System Prompt

In [5]:
PROMPT_TEMPLATE = """
You are an expert clinical NLP annotator. Your task is to extract structured entities from CHEST CT radiology reports.

CRITICAL: You must output ONLY valid JSON. No markdown code blocks, no explanatory text, no comments.

OUTPUT SCHEMA:
[
  {
    "general_finding": "string or None",
    "specific_finding": "string or None",
    "finding_presence": "present OR absent OR uncertain OR None",
    "location": ["list of anatomical sites"],
    "degree": ["list of qualifiers like mild, moderate, severe, small, large"],
    "measurement": "exact numeric value from text with unit OR None",
    "comparison": "stable OR improved OR worsened OR None"
  }
]

EXTRACTION WORKFLOW (FOLLOW EXACTLY):

STEP 1: READ THE ENTIRE REPORT
Read from beginning to end without extracting yet. Identify all sections (FINDINGS, IMPRESSION, etc.).

STEP 2: IDENTIFY ALL FINDINGS
Go through sentence by sentence and identify:
- What is being described (the finding)
- Whether it's present, absent, or uncertain
- Where it's located
- Any size/severity descriptors
- Any measurements
- Any comparisons to prior imaging

STEP 2.5: DEDUPLICATE ACROSS SECTIONS
After identifying all findings, check if IMPRESSION repeats findings from FINDINGS:
- If the IMPRESSION restates a finding from FINDINGS (same entity, same measurement):
  → Extract it ONLY ONCE
  → Prefer the FINDINGS version unless IMPRESSION adds new clinical interpretation
- If IMPRESSION adds NEW information (e.g., clinical significance, recommendations):
  → Extract as additional context or merge into existing finding

Example:
FINDINGS: "4 cm mass in right lung"
IMPRESSION: "4 CM RIGHT LUNG MASS CONCERNING FOR MALIGNANCY"
→ ONE entity with degree: ["concerning for malignancy"]

VERSUS STATEMENT CLARIFICATION:
"atelectasis and/or scarring" = "atelectasis versus scarring"
→ specific_finding: "atelectasis versus scarring"
→ finding_presence: "uncertain"

GROUPED NEGATIVES:
When a sentence negates multiple findings together:
"No consolidation, effusion or edema"
→ Create ONE entity:
   general_finding: "parenchymal abnormality"
   specific_finding: "consolidation, effusion, or edema"
   finding_presence: "absent"
   location: ["lungs"]

STEP 3: MERGE RELATED INFORMATION
BEFORE creating final output, merge information that belongs together:

RULE 3A: MERGE DEVICE POSITIONS
Bad: Entity 1: "catheter present in right IJ" + Entity 2: "catheter tip in right atrium"
Good: ONE entity with location: ["right internal jugular", "right atrium"]

RULE 3B: MERGE COREFERENCES
When you see "this", "that", "these", "it", "again", "unchanged", "as above":
- These refer back to a previous finding
- Add the new information to that finding
- Do NOT create a separate entity

Example:
"Ground glass opacities are seen. These involve all lobes."
→ ONE entity: ground glass opacity in all lobes

RULE 3C: HANDLE "VERSUS" AND "AND/OR" STATEMENTS
Both "versus" and "and/or" indicate uncertainty between options.

Examples that mean UNCERTAIN:
- "consolidation versus atelectasis"
- "atelectasis and/or scarring"
- "atelectasis and / or scarring"

Create ONE entity:
- specific_finding: "consolidation versus atelectasis" (keep exact wording)
- finding_presence: "uncertain"
DO NOT create two separate entities.

Example:
"bibasilar atelectasis and/or scarring"
→ ONE entity:
   specific_finding: "atelectasis versus scarring"
   finding_presence: "uncertain"
   location: ["bilateral lung bases"]

RULE 3D: MERGE MEASUREMENTS WITH FINDINGS
"Nodule in right upper lobe measuring 8 mm"
→ ONE entity with location AND measurement together

RULE 3E: GROUPED NEGATIVES
When a sentence negates multiple findings together, merge them into ONE entity.

Example:
"No focal consolidation, effusion or edema is present"
→ Create ONE entity:
   general_finding: "parenchymal abnormality"
   specific_finding: "consolidation, effusion, or edema"
   finding_presence: "absent"
   location: ["lungs"]

DO NOT create three separate entities for consolidation, effusion, and edema.

Only split if they have different locations or contexts:
"No pleural effusion. No pericardial effusion."
→ TWO entities (different locations: pleura vs pericardium)

STEP 4: HANDLE SPECIAL CASES

CASE A: DIFFERENTIAL DIAGNOSES
"opacity concerning for aspiration versus infection"
→ specific_finding: "opacity concerning for aspiration versus infection"
→ finding_presence: "uncertain"

CASE B: SEQUELAE AND CONTEXT
"calcified granulomas, likely sequela of prior infection"
→ Extract as TWO entities:
  1. calcified granulomas (present)
  2. sequela of prior granulomatous process (present)

CASE C: NORMAL/UNREMARKABLE FINDINGS
DO extract these - they provide clinical value:
"thyroid gland is unremarkable" → extract with finding_presence: "present", degree: ["unremarkable"]

CASE D: DEVICES AND TUBES
Include final position in location array AND distance measurements in measurement field:
"ET tube tip 2.5 cm above carina"
→ location: ["trachea"]
→ measurement: "2.5 cm above the carina"

PRESENCE DETERMINATION:

Present: "is seen", "demonstrates", "present", "identified", "noted", "appreciated", "there is/are"
Absent: "no", "without", "absent", "negative for", "no evidence of", "free of"
Uncertain: "versus", "possible", "suspicious for", "cannot exclude", "may represent", "likely"

COMPARISON DETERMINATION:
ONLY set if report explicitly compares to prior imaging:
- stable: "stable", "unchanged", "no change", "similar", "again seen"
- improved: "improved", "decreased", "smaller", "resolved", "less"
- worsened: "increased", "worsened", "new", "larger", "more", "progressed"

If no comparison to prior imaging is mentioned → comparison: "None"

ANATOMICAL LOCATION RULES:
- Use exact anatomical terms from the report
- Include laterality: "right lower lobe" not just "lower lobe"
- For bilateral findings: include "bilateral" OR list both sides
- For devices: include insertion site AND final position

DEGREE/QUALIFIER RULES:
Extract descriptors like: mild, moderate, severe, small, large, minimal, extensive, focal, diffuse, patchy, multifocal
These go in the degree array as strings.

MEASUREMENT RULES:
- Copy EXACTLY as written with units: "2.5 cm", "8 mm", "up to 1 cm"
- Include positional measurements: "3 cm above the carina"
- If no measurement → "None"

LYMPH NODE SPECIAL RULES:
- "no lymphadenopathy" → finding_presence: "absent"
- "subcentimeter nodes" or "nodes within normal limits" → DO NOT extract lymphadenopathy as present
- ONLY extract lymphadenopathy as present if: "enlarged", "prominent", "pathologic", "suspicious"

COMMON ERRORS TO AVOID:
❌ Creating separate entities for device + device tip
❌ Creating separate entities for "consolidation" and "atelectasis" in "versus" statements
❌ Missing coreferences (this, that, these, unchanged)
❌ Forgetting to merge multi-sentence descriptions
❌ Hallucinating measurements not in text
❌ Including findings from INDICATION/HISTORY that aren't confirmed in FINDINGS/IMPRESSION
❌ Setting comparison when no prior imaging is mentioned
❌ Extracting the same finding twice from FINDINGS and IMPRESSION
❌ Treating "and/or" as "present" instead of "uncertain"
❌ Splitting grouped negatives ("no X, Y, or Z") into multiple entities
❌ Using vague locations when specific ones are stated in text

VALIDATION BEFORE OUTPUT:
✓ Is it valid JSON (no markdown backticks)?
✓ Does every object have all 7 keys?
✓ Are location and degree arrays (even if empty [])?
✓ Is finding_presence one of: present, absent, uncertain, None?
✓ Is comparison one of: stable, improved, worsened, None?
✓ Did I merge related information correctly?
✓ Did I handle ALL "versus" AND "and/or" statements as uncertain?
✓ Did I check if IMPRESSION duplicates FINDINGS? (extract only once if duplicate)
✓ Did I merge grouped negatives ("no X, Y, or Z") into ONE entity?
✓ Are measurements EXACTLY as written in text?
✓ Are locations anatomically precise (not vague)?
✓ Did I avoid creating multiple entities for the same physical finding?

NOW EXTRACT FROM THIS REPORT:

<<<REPORT_TEXT>>>

Remember: Output ONLY the JSON array. No other text.

"""
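
The "VALIDATION BEFORE OUTPUT" checklist in the prompt can also be checked in code once the model responds. The following is a minimal sketch, not part of the original pipeline: `validate_entities` is a hypothetical helper, and it assumes the model may emit either the string "None" (as the prompt requests) or a JSON `null` for empty fields.

```python
# Hypothetical post-hoc check of the 7-key schema described in PROMPT_TEMPLATE.
REQUIRED_KEYS = {
    "general_finding", "specific_finding", "finding_presence",
    "location", "degree", "measurement", "comparison",
}
# Assumption: both the string "None" and JSON null (Python None) are accepted.
PRESENCE_VALUES = ("present", "absent", "uncertain", "None", None)
COMPARISON_VALUES = ("stable", "improved", "worsened", "None", None)


def validate_entities(entities):
    """Return a list of problems; an empty list means the output passed."""
    if not isinstance(entities, list):
        return ["top-level value is not a JSON array"]
    problems = []
    for i, ent in enumerate(entities):
        if not isinstance(ent, dict):
            problems.append(f"entity {i}: not an object")
            continue
        missing = REQUIRED_KEYS - set(ent)
        if missing:
            problems.append(f"entity {i}: missing keys {sorted(missing)}")
        for key in ("location", "degree"):
            if key in ent and not isinstance(ent[key], list):
                problems.append(f"entity {i}: {key} is not a list")
        if ent.get("finding_presence") not in PRESENCE_VALUES:
            problems.append(f"entity {i}: unexpected finding_presence")
        if ent.get("comparison") not in COMPARISON_VALUES:
            problems.append(f"entity {i}: unexpected comparison")
    return problems
```

An empty return value means every object has all seven keys, list-typed `location` and `degree`, and allowed values for `finding_presence` and `comparison`.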

# API Keys

In [6]:
user_secrets = UserSecretsClient()

API_KEYS = {
    "gemini": user_secrets.get_secret("gemini_api_key_0"),
    "gemma": user_secrets.get_secret("gemini_api_key_0"),
    #"glm": user_secrets.get_secret("glm_api_key"),
    #"deepseek": user_secrets.get_secret("deepseek_api_key"),
}

# LLM Classes

In [7]:
class AIBaseModel(ABC):
    def __init__(self, api_key: str, model_name: str):
        self.api_key = api_key
        self.model_name = model_name
    
    @abstractmethod
    def invoke(self, prompt: str, **kwargs):
        raise NotImplementedError

In [8]:
class GeminiModel(AIBaseModel):
    def __init__(self, api_key: str, model_name: str = "gemini-2.5-flash"):
        # Let the base class store api_key and model_name, as the other models do.
        super().__init__(api_key, model_name)
        self.client = genai.Client(api_key=api_key)
        self.sleep_time = self._get_time_to_sleep()
        
    def _get_time_to_sleep(self):
        requests_per_minute = 15  # default
        
        if self.model_name == "gemini-2.5-flash":
            requests_per_minute = 5
        elif self.model_name == "gemini-3-flash-preview":
            requests_per_minute = 5
        elif self.model_name == "gemini-2.5-flash-lite":
            requests_per_minute = 10
        elif self.model_name == "gemini-1.5-flash":
            requests_per_minute = 15
        elif "gemma" in self.model_name:
            requests_per_minute = 30
            
        return 60 / requests_per_minute
    
    def invoke(
        self, 
        prompt: str,
        system_prompt: str | None = None,
        temperature: float = 0, 
        top_p: float = 1, 
        max_tokens: int = 8192,
    ):
        try:
            response = self.client.models.generate_content(
                model=self.model_name,
                contents=prompt,
                config=GenerateContentConfig(
                    system_instruction=system_prompt,
                    temperature=temperature,
                    top_p=top_p,
                    max_output_tokens=max_tokens,
                ),
            )

            if hasattr(response, "candidates"):
                texts = []
                for c in response.candidates:
                    for p in getattr(c.content, "parts", []):
                        if getattr(p, "text", None):
                            texts.append(p.text)
                return "\n".join(texts) if texts else None
            return None
        except Exception as e:
            logger.error(f"Gemini API error: {e}")
            return None
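
The sleep interval in `_get_time_to_sleep` is simply 60 seconds divided by the per-minute request limit. The same mapping could be written as a lookup table, sketched below. The limits are copied from the class above; `RPM_BY_MODEL` and `seconds_between_requests` are hypothetical names used only for illustration.

```python
# Requests-per-minute limits copied from GeminiModel._get_time_to_sleep.
RPM_BY_MODEL = {
    "gemini-2.5-flash": 5,
    "gemini-3-flash-preview": 5,
    "gemini-2.5-flash-lite": 10,
    "gemini-1.5-flash": 15,
}
DEFAULT_RPM = 15
GEMMA_RPM = 30


def seconds_between_requests(model_name: str) -> float:
    """Return the delay in seconds that keeps calls under the per-minute limit."""
    if model_name in RPM_BY_MODEL:
        rpm = RPM_BY_MODEL[model_name]
    elif "gemma" in model_name:
        rpm = GEMMA_RPM
    else:
        rpm = DEFAULT_RPM
    return 60 / rpm


print(seconds_between_requests("gemma-3-27b-it"))
```

For the `gemma-3-27b-it` model used later in this notebook, this gives a 2-second pause between calls, to which the inference loop adds 0.2 seconds.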

In [9]:
class GLMModel(AIBaseModel):
    def __init__(self, api_key: str, model_name: str = "glm-4.5-flash"):
        super().__init__(api_key, model_name)
        self.client = ZaiClient(api_key=api_key)
        self.sleep_time = 6
        
    def invoke(
        self,
        prompt: str,
        system_prompt: str | None = None,
        temperature: float = 0.0,
        top_p: float = 1.0,
        max_tokens: int = 8192,
    ):
        try:
            messages = [
                {
                    "role": "system", 
                    "content": system_prompt or "You are a medical NLP system specialized in medical entity extraction from a given radiology report."
                },
                {
                    "role": "user",
                    "content": prompt
                }
            ]

            response = self.client.chat.completions.create(
                model=self.model_name,
                messages=messages,
                temperature=temperature,
                top_p=top_p,
                max_tokens=max_tokens,
                stream=False,
            )

            if response.choices:
                return response.choices[0].message.content.strip()

            return None
        except Exception as e:
            logger.error(f"GLM API error: {e}")
            if hasattr(e, "status_code"):
                logger.error(f"Status code: {e.status_code}")
            if hasattr(e, "body"):
                logger.error(f"Error body: {e.body}")
            return None

In [10]:
class DeepSeekModel(AIBaseModel):
    def __init__(self, api_key: str, model_name: str = "deepseek-chat"):
        super().__init__(api_key, model_name)
        self.client = OpenAI(
            api_key=api_key,
            base_url="https://api.deepseek.com"
        )
        self.sleep_time = 3
        
    def invoke(
        self,
        prompt: str,
        system_prompt: str | None = None,
        temperature: float = 0.0,
        top_p: float = 1.0,
        max_tokens: int = 8192,
    ):
        try:
            messages = [
                {
                    "role": "system", 
                    "content": system_prompt or "You are a medical NLP system specialized in medical entity extraction from a given radiology report."
                },
                {
                    "role": "user",
                    "content": prompt
                }
            ]

            response = self.client.chat.completions.create(
                model=self.model_name,
                messages=messages,
                temperature=temperature,
                top_p=top_p,
                max_tokens=max_tokens,
                stream=False,
            )

            return response.choices[0].message.content.strip()
        except Exception as e:
            logger.error(f"DeepSeek API error: {e}")
            return None


## Get Model

In [11]:
def get_ai_model(model: str, model_name: str):
    model_map = {
        "gemini": GeminiModel,
        "glm": GLMModel,
        "deepseek": DeepSeekModel
    }
    
    if model not in model_map:
        raise ValueError(f"Invalid model: {model}. Choices: {list(model_map.keys())}")
    
    return model_map[model](API_KEYS[model], model_name)

In [12]:
def load_jsonl(path: str):
    with open(path, encoding="utf-8") as file:
        return [json.loads(line) for line in file if line.strip()]

def load_json(path: str):
    with open(path, encoding="utf-8") as file:
        return json.load(file)

In [13]:
def build_prompt(report: str) -> str:
    return PROMPT_TEMPLATE.replace("<<<REPORT_TEXT>>>", report)
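
A likely reason for using `str.replace` with a `<<<REPORT_TEXT>>>` marker instead of `str.format` is that the template contains literal JSON braces. The small sketch below illustrates the difference on a shortened, made-up template; `TEMPLATE` and `fill` are hypothetical stand-ins for `PROMPT_TEMPLATE` and `build_prompt`.

```python
# Shortened, made-up template that contains literal JSON braces like the real one.
TEMPLATE = 'Return [{"finding": "..."}] for this report:\n<<<REPORT_TEXT>>>'


def fill(template: str, report: str) -> str:
    # Plain substring replacement leaves the braces in the JSON example untouched.
    return template.replace("<<<REPORT_TEXT>>>", report)


prompt = fill(TEMPLATE, "No pneumothorax.")

# str.format treats {"finding": ...} as a replacement field and raises an error.
try:
    TEMPLATE.format(report="No pneumothorax.")
    format_failed = False
except (KeyError, ValueError, IndexError):
    format_failed = True

print(prompt)
print("str.format failed:", format_failed)
```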


In [14]:
def safe_parse_json(text: str):
    if not text:
        return None

    text = re.sub(r"```json|```", "", text, flags=re.IGNORECASE).strip()

    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass

    # First, try to capture a JSON array
    m = re.search(r"\[.*\]", text, re.S)
    if m:
        try:
            return json.loads(m.group())
        except Exception:
            pass

    # Then fall back to a JSON object
    m = re.search(r"\{.*\}", text, re.S)
    if m:
        try:
            return json.loads(m.group())
        except Exception:
            return None

    return None
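
To see how the three fallbacks behave, the sketch below runs a condensed stand-in for `safe_parse_json` on a fenced response, a response wrapped in extra text, and a truncated response. `parse_model_output` is a hypothetical name, and the sample strings are invented for illustration.

```python
import json
import re

FENCE = "`" * 3  # Markdown code-fence marker


def parse_model_output(text):
    """Try a direct parse, then the outermost [...] span, then the outermost {...} span."""
    if not text:
        return None
    # Drop code-fence markers, with or without a "json" language tag.
    text = re.sub(re.escape(FENCE) + r"(?:json)?", "", text, flags=re.IGNORECASE).strip()
    for pattern in (None, r"\[.*\]", r"\{.*\}"):
        candidate = text
        if pattern is not None:
            match = re.search(pattern, text, re.S)
            if match is None:
                continue
            candidate = match.group()
        try:
            return json.loads(candidate)
        except json.JSONDecodeError:
            continue
    return None


fenced = FENCE + 'json\n[{"finding_presence": "absent"}]\n' + FENCE
chatty = 'Here is the result: [{"finding_presence": "present"}] Hope this helps.'
truncated = '[{"finding_presence": "present"'

print(parse_model_output(fenced))
print(parse_model_output(chatty))
print(parse_model_output(truncated))
```

The truncated case returns `None`, so a response cut off mid-array would be stored with `"entities": null` rather than raising an error.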


In [15]:
def run_inference_radgraph(
    dataset,
    output_path: str,
    model_id: str,
    model_name: str,
):
    model = get_ai_model(model_id, model_name)

    results = []
    for idx, sample in tqdm(enumerate(dataset), total=len(dataset), desc="Processing Samples"):
        
        prompt = build_prompt(sample["report"])

        raw_output = model.invoke(
            prompt=prompt,
            system_prompt=None
        )

        parsed = safe_parse_json(raw_output)

        print(f"Report: {sample['report']}")
        print(f"Parsed Output: {parsed}")
        
        result = {
            "dataset": sample["dataset"],
            "doc_key": sample["doc_key"],
            "report": sample["report"],
            "model": model_name,
            "entities": parsed,
        }
        
        results.append(result)

        if (idx + 1) % 5 == 0:
            temp_path = output_path.replace(".json", "_temp.json")
            with open(temp_path, "w") as f:
                f.write(json.dumps(results))
        

        time.sleep(model.sleep_time + 0.2)

    with open(output_path, "w") as f:
        for r in results:
            f.write(json.dumps(r) + "\n")

    return results

def run_inference_ratener(
    dataset,
    output_path: str,
    model_id: str,
    model_name: str,
):
    model = get_ai_model(model_id, model_name)

    results = []
    for idx, sample in tqdm(enumerate(dataset), total=len(dataset), desc="Processing Samples"):
        
        prompt = build_prompt(sample["report"])

        raw_output = model.invoke(
            prompt=prompt,
            system_prompt=None
        )

        parsed = safe_parse_json(raw_output)

        print(f"Report: {sample['report']}")
        print(f"Raw Output: {raw_output}")
        
        result = {
            "note_id": sample["note_id"],
            "report": sample["report"],
            "model": model_name,
            "entities": parsed,
        }
        
        results.append(result)

        if (idx + 1) % 5 == 0:
            temp_path = output_path.replace(".json", "_temp.json")
            with open(temp_path, "w") as f:
                f.write(json.dumps(results))
        

        time.sleep(model.sleep_time + 0.2)

    with open(output_path, "w") as f:
        for r in results:
            f.write(json.dumps(r) + "\n")

    return results
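
Both functions write a `_temp.json` checkpoint (a single JSON array) every 5 samples, but nothing reads it back. One way an interrupted run could be resumed is sketched below. `load_checkpoint` and `pending_samples` are hypothetical helpers, and the sketch assumes each sample carries a unique `doc_key` (use `note_id` for the RaTE-NER variant).

```python
import json
import os
import tempfile


def load_checkpoint(temp_path: str) -> list:
    """Return the results saved so far, or an empty list if no checkpoint exists."""
    if not os.path.exists(temp_path):
        return []
    with open(temp_path, encoding="utf-8") as f:
        return json.load(f)


def pending_samples(dataset: list, done: list, key: str = "doc_key") -> list:
    """Keep only the samples whose key is not already in the checkpoint."""
    finished = {r[key] for r in done}
    return [s for s in dataset if s[key] not in finished]


# Demonstration with a throwaway checkpoint file and made-up keys.
with tempfile.TemporaryDirectory() as tmp:
    temp_path = os.path.join(tmp, "chest-ct-schema_temp.json")
    with open(temp_path, "w", encoding="utf-8") as f:
        json.dump([{"doc_key": "a", "entities": []}], f)
    done = load_checkpoint(temp_path)
    todo = pending_samples([{"doc_key": "a"}, {"doc_key": "b"}], done)

print(todo)
```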

In [16]:
INPUT_PATH = "/kaggle/input/chest-ct2/radgraphxl-chest-ct-reports.json"
OUTPUT_PATH = "/kaggle/working/chest-ct-schema.json"

dataset = load_json(INPUT_PATH)[2:]

results = run_inference_radgraph(
    dataset=dataset,
    output_path=OUTPUT_PATH,
    model_id="gemini",
    model_name="gemma-3-27b-it",
)

Processing Samples:   0%|          | 0/3 [00:00<?, ?it/s]

Report: FINDINGS: Evaluation of the pulmonary vasculature demonstrates no evidence of filling defects to suggest pulmonary emboli. Main pulmonary artery is at the upper limits of normal size. Right internal jugular dual-lumen catheter is in place with the tip in the proximal right atrium. Left internal jugular catheter is in place with the tip in the left innominate vein. Small subcutaneous hematoma is appreciated at the left internal jugular skin entry site. Thoracic aorta demonstrates normal contour and caliber with moderate atherosclerotic plaque. Four-vessel aortic arch is incidentally noted with the left vertebral artery arising directly from the aorta. Mild cardiomegaly is appreciated without evidence of pericardial effusion. Coronary artery calcifications are appreciated involving the left anterior descending and circumflex coronary arteries. Endotracheal tube is in place with the tip approximately 2.5 cm above the carina. Trachea and central bronchi otherwise patent. Multifocal

Processing Samples:  33%|███▎      | 1/3 [00:42<01:25, 42.53s/it]

Report: FINDINGS: Visualized thyroid gland is unremarkable. The heart is normal in size with a trace physiologic pericardial effusion. No coronary artery calcification is seen. No mediastinal, axillary, or hilar lymphadenopathy is present. The ascending aorta is normal in caliber. Bovine aortic arch anatomy is present. No surface irregularity is seen along thoracic aorta to suggest intimal flap, dissection, or atheromatous ulcer. Lack of noncontrast images limits ability to assess for intramural hematoma. The main pulmonary arteries normal in caliber. While not a dedicated evaluation for pulmonary embolus, no filling defects are seen within the pulmonary arterial vasculature. The trachea and central airways are patent. No focal consolidation, effusion or edema is present. Visualized portions of the upper abdomen demonstrate a 8-mm hypodense lesion at the hepatic dome, which is too small to characterize. Fatty liver is present. Cholelithiasis is present without evidence of cholecystitis

Processing Samples:  67%|██████▋   | 2/3 [01:37<00:50, 50.04s/it]

Report: FINDINGS: The thyroid gland is heterogeneous with multiple small nodules seen in the right lobe. The heart size is normal without a pericardial effusion. Although this was a non-gated examination, moderate to severe coronary arterial calcification is noted of the LAD and proximal circumflex vessels. The aorta and great vessels are normal in course and caliber. The main pulmonary artery is normal in course and caliber. While not a dedicated pulmonary embolism study, no filling defects are seen in the main or lobar pulmonary arteries to suggest pulmonary embolism. The lungs demonstrate bibasilar atelectasis and / or scarring. No focal consolidations or pleural effusions. No pneumothorax. An 8mm subpleural cyst is present in the right lower lobe (Series 3, Image 132) A 3 mm pulmonary nodule is seen in the right upper lobe (Series 3, Image 117). The airways are patent and of normal course and caliber. No mediastinal, hilar or axillary lymphadenopathy. A bilobed soft tissue lesion o

Processing Samples: 100%|██████████| 3/3 [02:08<00:00, 42.94s/it]


## Saving as Pretty JSON 

In [17]:
import json

INPUT_PATH = "/kaggle/working/chest-ct-schema.json"
OUTPUT_PATH = "/kaggle/working/chest-ct-schema.pretty.jsonl"

with open(INPUT_PATH, "r", encoding="utf-8") as fin, open(OUTPUT_PATH, "w", encoding="utf-8") as fout:
    for line in fin:
        line = line.strip()
        if not line:
            continue

        obj = json.loads(line)

        # Write each record with indentation
        fout.write(json.dumps(obj, ensure_ascii=False, indent=2))
        fout.write("\n\n")  # blank line between records

print("Saved ->", OUTPUT_PATH)


Saved -> /kaggle/working/chest-ct-schema.pretty.jsonl
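
Because each record in the pretty file spans several lines, it can no longer be read back with one `json.loads` call per line, despite the `.jsonl` extension. A minimal sketch of one way to read such a file, using the standard library's `json.JSONDecoder.raw_decode`, is shown below. `iter_concatenated_json` is a hypothetical helper, and the sample records are made up.

```python
import json


def iter_concatenated_json(text: str):
    """Yield each JSON value from a string of whitespace-separated JSON documents."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # Skip the blank lines and indentation between records.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        # raw_decode returns the parsed value and the index where it ended.
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


sample = (
    json.dumps({"doc_key": "a", "entities": []}, indent=2)
    + "\n\n"
    + json.dumps({"doc_key": "b", "entities": None}, indent=2)
    + "\n\n"
)
records = list(iter_concatenated_json(sample))
print(len(records))
```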


In [18]:
# INPUT_PATH = "/kaggle/input/radgraph/stanford-radgraph-XL-sentence.jsonl"
# OUTPUT_PATH = "/kaggle/working/stanford-radgraph-XL-mapped.jsonl"

# dataset = load_jsonl(INPUT_PATH)[1:]

# results = run_inference_radgraph(
#     dataset=dataset,
#     output_path=OUTPUT_PATH,
#     model_id="deepseek",
#     model_name="deepseek-chat",
# )