<a href="https://colab.research.google.com/github/herndoch/dermopath-ai-hub/blob/main/Knowledge_Pipeline_v4_0.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Read Me
<details>
<summary><strong>ü§ñ AI-to-AI Handover Protocol (Click to Expand)</strong></summary>

# ü§ñ AI-to-AI Handover Protocol (Read First)

**‚ö†Ô∏è CRITICAL SYSTEM INVARIANTS**
*Do not modify these settings without explicit human authorization. These constraints exist to prevent known failure modes.*

### 1. Model Hierarchy & Reasoning
*   **The "Monolith" Rule:** Architect Blocks (PDF Block 2 & Video Block 2) **MUST** utilize `gemini-1.5-pro-002` (or `gemini-3-pro-preview` if available).
    *   *Why:* We tested Flash; it hallucinates summaries and omits specific IHC stains (e.g., summarizing "CD45+, S100-" as just "ruled out melanoma"). Only Pro models maintain the high fidelity required for medical RAG.
*   **The "Flash" Rule:** Extraction and Consolidation (Block 1 & 3) **MUST** use `gemini-3-flash-preview`.
    *   *Why:* Pro models have strict Rate Limits (RPM). Using Pro for simple text merging or per-page extraction causes 429 loops and crashes.

### 2. Data Integrity Constraints
*   **The "Menu" Method (PDFs):** Never allow the AI to hallucinate image paths.
    *   *Invariant:* PDF Block 2 prompts **MUST** utilize the pre-validated "Figure Menu" generated by Block 1. If an image isn't in the menu, it does not exist.
*   **The "Chain of Custody" (Videos):**
    *   *Invariant:* Video Block 1 generates the `gs://` link. Video Block 2 must be instructed to copy that specific link field, not invent a filename.
*   **Zero-Loss Merging:** Block 3 (Consolidator) is purely additive.
    *   *Invariant:* When merging fragmented entities (e.g., "Lichen Planus" from Page 40 and Page 400), the AI must **concatenate** facts, never overwrite or summarize them away.

### 3. Operational Limits (The "Sweet Spots")
*   **Textbook Chunk Size:** `40 pages` (with `2 page overlap`).
    *   *Why:* >50 pages triggers HTTP timeouts. <30 pages loses context. 40 is the empirically tested maximum for stability.
*   **Pro Concurrency:** `Limit = 2`.
    *   *Why:* `gemini-1.5-pro` allows fewer concurrent requests than Flash. Increasing this >2 results in immediate 429 throttling.
*   **Flash Concurrency:** `Limit = 15-20`.
    *   *Why:* Safe zone for high-throughput image extraction.

### 4. File Format Logic
*   **Textbooks = PNG:** Lossless quality is required for OCR to correctly read tiny font sizes in medical diagrams.
*   **Lectures = JPG:** Compression is required to handle the volume (100+ slides per hour) without exploding storage costs.

---

### üîÑ Recovery Playbook (If Execution Fails)
*   **IF `429 Resource Exhausted`:** Do not restart. The script has auto-resume logic. Wait 60s and re-run.
*   **IF `Content-Generation Timeout`:** The Chunk Size is too large for the current model latency. Reduce `PAGES_PER_CHUNK` from 40 to 30.
*   **IF `KeyError: 'gcs_content_textbooks'`:** The environment is fresh. Run **Block 0** to re-initialize the `PATHS` map.

</details>

<details>
<summary><strong>üß¨ Pathology Knowledge Base Pipeline (SOP) (Click to Expand)</strong></summary>

**System Version:** v5.0 (High-Fidelity / Monolithic Architecture)
**Engine:** Google Gemini (1.5 Pro / 3 Flash)
**Infrastructure:** Google Colab $\leftrightarrow$ Google Cloud Storage (GCS)

## üìã Overview
This pipeline converts unstructured medical data (Textbooks and Video Lectures) into a strictly standardized, ontology-tagged JSON Knowledge Base. It uses a **"Monolith"** approach for reasoning (processing large contexts at once) and a **"Map-Reduce"** approach for consolidation.

---

## üõ†Ô∏è Block 0: Universal Setup
**Status:** ‚úÖ Mandatory (Run once per session)

This block installs dependencies, authenticates with Google Cloud, and establishes the global directory map (`PATHS`) to prevent file-not-found errors.

*   **Inputs:** None (requires Google Drive mount).
*   **Actions:**
    *   Installs `PyMuPDF` (PDFs), `openai-whisper` (Audio), `opencv` (Video), `aiohttp` (Async API).
    *   Authenticates via Colab Secrets (`GEMINI_API_KEY`).
    *   Sets up GCS Bucket paths.
*   **Key Variable:** `PATHS` dictionary (Routes data for both Textbooks and Lectures).

---

## üìö Workflow A: Textbooks (PDF)

### Block 1: The Extractor
**Goal:** Raw Data Acquisition & Normalization.
*   **Model:** `gemini-3-flash-preview` (Speed & Cost).
*   **Inputs:** Raw PDF files from Google Drive.
*   **Logic:**
    1.  **Text:** Cleans OCR errors page-by-page.
    2.  **Images:** Extracts images >5KB (saved as **PNG**).
    3.  **Panel-Aware Vision:** Detects if an image is "Figure 2.1 (A)" vs "(B)" and splits captions accordingly.
    4.  **Golden Links:** Generates permanent `gs://` links for every image.
*   **Outputs:** `_CONTENT.json` (Text), `_FIGURES.json` (Image Metadata).

### Block 2: The Architect (High-Fidelity)
**Goal:** Medical Reasoning & Schema Enforcement.
*   **Model:** `gemini-1.5-pro-002` or `gemini-3-pro-preview` (Deep Reasoning).
*   **Inputs:** `_CONTENT.json` + `_FIGURES.json` + `_Tags.txt`.
*   **Logic:**
    1.  **The Monolith:** Processes **40 pages** in a single context window to capture full disease descriptions.
    2.  **The Menu:** Forces AI to pick images from a pre-validated list of `gs://` links (prevents broken links).
    3.  **Strict Extraction:** Explicitly instructed to list **stains (CD45+)** and **genetics** without summarizing.
    4.  **Safety:** Auto-saves every 5 chunks; resumes if interrupted.
*   **Outputs:** `_MASTER.json` (High quality, but potentially fragmented entities).

### Block 3: The Consolidator
**Goal:** Map-Reduce / De-fragmentation.
*   **Model:** `gemini-3-flash-preview` (Logistics & Merging).
*   **Inputs:** `_MASTER.json`.
*   **Logic:**
    1.  **Map:** Groups entries by Tag (e.g., finds 3 separate entries for "Lichen Planus" from different chapters).
    2.  **Reduce:** Merges text, combines figure lists, and deduplicates data into one Super-Entry.
*   **Outputs:** `_CONSOLIDATED.json` (Final Database-Ready File).

---

## üé• Workflow B: Lectures (Video)

### Block 1: The Extractor
**Goal:** Audio Transcription & Slide Extraction.
*   **Model:** `whisper` (Audio) + `gemini-3-flash-preview` (Visuals).
*   **Inputs:** MP4/MOV files from Google Drive.
*   **Logic:**
    1.  **Audio:** Generates timestamped transcript.
    2.  **Visuals:** Extracts frames using **SSIM (Structural Similarity)** to deduplicate static slides (only 1 image per slide change). Saved as **JPG**.
    3.  **Analysis:** Vision model extracts text/titles visible on the slide.
*   **Outputs:** `_RAW.json` (List of slides with transcripts and GCS paths).

### Block 2: The Architect (The Monolith)
**Goal:** Synthesis & SOP Compliance.
*   **Model:** `gemini-1.5-pro-002` or `gemini-3-pro-preview`.
*   **Inputs:** `_RAW.json` + `_Tags.txt`.
*   **Logic:**
    1.  **Single Shot:** Feeds the **Entire Lecture** (Transcript + All Slide Images) in one massive request.
    2.  **Visual Fidelity:** Prompt forces extraction of text labels seen on slides (e.g., "TTF-1+", "CK20+") rather than just summarizing the diagnosis.
    3.  **Schema:** Maps the spoken lecture into the strict 18-field SOP (Clinical, Microscopic, etc.).
*   **Outputs:** `_MASTER.json`. *(Note: Lectures rarely need Block 3 consolidation as they usually discuss a topic linearly).*

---

## üìÇ Data Structure (Google Cloud)

```text
gs://pathology-hub-0/
‚îú‚îÄ‚îÄ Tags/                        # Source of Truth (Ontology)
‚îú‚îÄ‚îÄ _asset_library/
‚îÇ   ‚îú‚îÄ‚îÄ textbooks/
‚îÇ   ‚îÇ   ‚îî‚îÄ‚îÄ [Book_Name]/
‚îÇ   ‚îÇ       ‚îî‚îÄ‚îÄ figure_images/   # Saved PNGs (Lossless)
‚îÇ   ‚îî‚îÄ‚îÄ lectures/
‚îÇ       ‚îî‚îÄ‚îÄ [Video_Name]/        # Saved JPGs (Compressed)
‚îî‚îÄ‚îÄ _content_library/
    ‚îú‚îÄ‚îÄ textbooks/
    ‚îÇ   ‚îú‚îÄ‚îÄ [Book]_CONTENT.json
    ‚îÇ   ‚îú‚îÄ‚îÄ [Book]_FIGURES.json
    ‚îÇ   ‚îú‚îÄ‚îÄ [Book]_MASTER.json
    ‚îÇ   ‚îî‚îÄ‚îÄ [Book]_CONSOLIDATED.json    # <--- FINAL PDF RESULT
    ‚îî‚îÄ‚îÄ lectures/
        ‚îú‚îÄ‚îÄ [Video]_RAW.json
        ‚îî‚îÄ‚îÄ [Video]_MASTER.json         # <--- FINAL VIDEO RESULT</details>

# Block 0

In [1]:
# @title {display-mode: "form"}
# ==============================================================================
# BLOCK 0: UNIVERSAL SETUP (Textbooks + Lectures)
# ==============================================================================
import os
import shutil
from google.colab import drive, userdata, auth
from google.cloud import storage
import google.generativeai as genai

print("--- STEP 0: INITIALIZATION ---")

# 1. Install & Configure System (Textbooks + Whisper/Video tools)
print("üì¶ Installing dependencies (PDF, Video, AI)...")
!sudo apt-get update -qq && sudo apt-get install -y ffmpeg > /dev/null 2>&1
!pip install -q -U google-generativeai PyMuPDF scikit-image aiohttp tqdm openai-whisper opencv-python-headless

# 2. Authentication
print("üîë Authenticating with Google Cloud...")
try:
    auth.authenticate_user()
    GEMINI_API_KEY = userdata.get('GEMINI_API_KEY')
    genai.configure(api_key=GEMINI_API_KEY)
except Exception as e:
    raise SystemExit(f"‚ùå Authentication Failed: {e}")

# 3. Mount Drive (Source Storage)
try:
    drive.mount('/content/drive', force_remount=True)
except:
    pass

# 4. Universal Configuration
GCS_BUCKET_NAME = 'pathology-hub-0'
DRIVE_ROOT = '/content/drive/MyDrive/1-Projects/Knowledge_Pipeline'

# Initialize GCS Client
storage_client = storage.Client()
bucket = storage_client.bucket(GCS_BUCKET_NAME)

# --- THE MASTER PATH MAP ---
# This dictionary handles routing for BOTH workflows.
PATHS = {
    # --- SOURCES (Local Google Drive) ---
    "source_pdfs":      os.path.join(DRIVE_ROOT, '_source_materials', 'pdfs'),
    "source_videos":    os.path.join(DRIVE_ROOT, '_source_materials', 'videos'),

    # --- DESTINATIONS (GCS Bucket Paths) ---
    "gcs_bucket":       GCS_BUCKET_NAME,
    "gcs_tags":         "Tags",  # Where your _Tags.txt files live

    # Textbook Pipeline
    "gcs_asset_textbooks":   "_asset_library/textbooks",
    "gcs_content_textbooks": "_content_library/textbooks",

    # Lecture Pipeline
    "gcs_asset_lectures":    "_asset_library/lectures",
    "gcs_content_lectures":  "_content_library/lectures"
}

# 5. Verification
print(f"\n‚úÖ Connected to Bucket: gs://{GCS_BUCKET_NAME}")
print(f"‚úÖ Source PDFs:   {PATHS['source_pdfs']}")
print(f"‚úÖ Source Videos: {PATHS['source_videos']}")
print("\nüöÄ SYSTEM READY. You can now run Block 1 (Textbook) or Block 1 (Lecture).")

--- STEP 0: INITIALIZATION ---
üì¶ Installing dependencies (PDF, Video, AI)...
W: Skipping acquire of configured file 'main/source/Sources' as repository 'https://r2u.stat.illinois.edu/ubuntu jammy InRelease' does not seem to provide it (sources.list entry misspelt?)
[2K     [90m‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ[0m [32m803.2/803.2 kB[0m [31m14.0 MB/s[0m eta [36m0:00:00[0m
[?25h  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Preparing metadata (pyproject.toml) ... [?25l[?25hdone
[2K   [90m‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ[0m [32m155.1/155.1 kB[0m [31m15.7 MB/s[0m eta [36m0:00:00[0m
[2K   [90m‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ[0m [32m24.1/24.1 MB[

# PDF BLOCK 1: TEXTBOOK EXTRACTOR (Text + Figures)

In [None]:
# @title {display-mode: "form"}
# ==============================================================================
# BLOCK 1: TEXTBOOK EXTRACTOR (Gemini 3 Flash - Panel Aware)
# ==============================================================================
import base64
import fitz  # PyMuPDF
import json
import asyncio
import aiohttp
import re
import os
from tqdm.asyncio import tqdm_asyncio
from google.cloud import storage

# --- CONFIGURATION ---
TEXT_CONCURRENCY = 20
VISION_CONCURRENCY = 20
TEXT_MODEL_URL = f"https://generativelanguage.googleapis.com/v1beta/models/gemini-3-flash-preview:generateContent?key={GEMINI_API_KEY}"
VISION_MODEL_NAME = "gemini-3-flash-preview"

# --- HELPER: GCS UTILS ---
def gcs_exists(blob_path):
    return bucket.blob(blob_path).exists()

def gcs_upload_bytes(data, blob_path, content_type):
    blob = bucket.blob(blob_path)
    blob.upload_from_string(data, content_type=content_type)

def gcs_upload_json(data, blob_path):
    blob = bucket.blob(blob_path)
    blob.upload_from_string(json.dumps(data, indent=2), content_type='application/json')

def gcs_load_json(blob_path):
    blob = bucket.blob(blob_path)
    if blob.exists():
        return json.loads(blob.download_as_string())
    return []

# --- AI HELPERS ---
async def clean_text_async(session, text, page_num, sem):
    async with sem:
        if not text.strip(): return page_num, ""
        prompt = (
            "Clean this medical text. Fix OCR errors. Keep structure. "
            "Preserve Figure Captions exactly. Return JSON: {\"markdown\": \"...\"}"
            f"\n\nRAW TEXT:\n{text}"
        )
        payload = {"contents": [{"parts": [{"text": prompt}]}]}
        try:
            async with session.post(TEXT_MODEL_URL, json=payload) as res:
                if res.status == 200:
                    dat = await res.json()
                    raw = dat['candidates'][0]['content']['parts'][0]['text']
                    match = re.search(r'\{.*\}', raw, re.DOTALL)
                    if match: return page_num, json.loads(match.group(0)).get("markdown", text)
                return page_num, text
        except: return page_num, text

async def analyze_figure_async(session, b64_img, context, sem):
    """
    Panel-Aware Vision Analysis.
    Tries to distinguish if the image is just Panel A or Panel B of a multipart figure.
    """
    async with sem:
        url = f"https://generativelanguage.googleapis.com/v1beta/models/{VISION_MODEL_NAME}:generateContent?key={GEMINI_API_KEY}"

        prompt = f"""
        PAGE CONTEXT:
        {context}

        TASK: Analyze the image below.
        1. Identify the Figure ID (e.g. "Fig 2.1") from the context that matches this image.
        2. **MULTI-PANEL CHECK:**
           - Does the caption describe multiple parts (e.g. "(A) ... (B) ...")?
           - If yes, determine if THIS specific image is Panel A, Panel B, etc.
           - If this image is ONLY Panel A, try to extract ONLY the caption text for (A).
           - If you cannot split the text, return the full caption but add "(Panel A)" to the ID.

        Return JSON: {{"figure_id": "Fig X.X (Panel A)", "matched_caption": "Specific caption..."}} or null.
        """

        parts = [
            {"text": prompt},
            {"inline_data": {"mime_type": "image/png", "data": b64_img}}
        ]

        try:
            async with session.post(url, json={"contents": [{"parts": parts}]}) as res:
                if res.status == 200:
                    dat = await res.json()
                    raw = dat['candidates'][0]['content']['parts'][0]['text']
                    match = re.search(r'\{.*\}', raw, re.DOTALL)
                    if match: return json.loads(match.group(0))
        except: return None
        return None

# --- MAIN PROCESSOR ---
async def process_textbook(pdf_path, start_p=1, end_p=None):
    fname = os.path.basename(pdf_path)
    book_name = os.path.splitext(fname)[0].replace(' ', '_')

    base_asset = f"{PATHS['gcs_asset_textbooks']}/{book_name}"
    path_fig_imgs = f"{base_asset}/figure_images"
    path_content = f"{PATHS['gcs_content_textbooks']}/{book_name}_CONTENT.json"
    path_figures = f"{PATHS['gcs_content_textbooks']}/{book_name}_FIGURES.json"

    print(f"\n{'='*60}\nüìò PROCESSING: {book_name}\n{'='*60}")

    doc = fitz.open(pdf_path)
    total = len(doc)
    final_p = min(end_p or total, total)

    # 1. TEXT (Skip if done)
    existing_content = gcs_load_json(path_content)
    if not existing_content:
        print(f"üìù Phase 1: Cleaning Text...")
        sem = asyncio.Semaphore(TEXT_CONCURRENCY)
        async with aiohttp.ClientSession() as sess:
            tasks = [clean_text_async(sess, doc.load_page(p).get_text("text"), p+1, sem) for p in range(start_p-1, final_p)]
            results = await tqdm_asyncio.gather(*tasks)
        content_data = sorted([{"source": fname, "page_number": p, "content": t} for p, t in results], key=lambda x: x['page_number'])
        gcs_upload_json(content_data, path_content)
    else:
        content_data = existing_content

    content_map = {c['page_number']: c['content'] for c in content_data}

    # 2. FIGURES
    print("üñºÔ∏è Phase 2: Figures & Vision (Panel-Aware)...")
    existing_figs = gcs_load_json(path_figures)
    processed_pages = {f['source_page'] for f in existing_figs}
    vision_tasks = []
    new_figures = []
    sem_vis = asyncio.Semaphore(VISION_CONCURRENCY)

    for p_idx in range(start_p-1, final_p):
        p_num = p_idx + 1
        if p_num in processed_pages: continue

        page = doc.load_page(p_idx)
        images = page.get_images(full=True)
        if not images: continue

        md_ctx = content_map.get(p_num, "")

        for i, img in enumerate(images):
            try:
                xref = img[0]
                base = doc.extract_image(xref)
                if len(base["image"]) < 5000: continue

                img_name = f"{book_name}_page_{p_num}_img_{i+1}.{base['ext']}"
                blob_path = f"{path_fig_imgs}/{img_name}"
                full_uri = f"gs://{GCS_BUCKET_NAME}/{blob_path}"

                if not gcs_exists(blob_path):
                    gcs_upload_bytes(base["image"], blob_path, f"image/{base['ext']}")

                b64 = base64.b64encode(base["image"]).decode('utf-8')
                vision_tasks.append({
                    "b64": b64, "ctx": md_ctx,
                    "meta": {"source_page": p_num, "gcs_path": full_uri}
                })
            except: pass

    if vision_tasks:
        print(f"   -> Analyzing {len(vision_tasks)} figures...")
        async with aiohttp.ClientSession() as sess:
            tasks = [analyze_figure_async(sess, t['b64'], t['ctx'], sem_vis) for t in vision_tasks]
            results = await tqdm_asyncio.gather(*tasks)

            for i, res in enumerate(results):
                if res and res.get('figure_id'):
                    meta = vision_tasks[i]['meta']
                    new_figures.append({
                        "source_document": fname,
                        "source_page": meta['source_page'],
                        "figure_id": res['figure_id'],
                        "description": res['matched_caption'],
                        "gcs_path": meta['gcs_path']
                    })

        final_list = existing_figs + new_figures
        final_list.sort(key=lambda x: x['source_page'])
        gcs_upload_json(final_list, path_figures)
        print(f"   -> Added {len(new_figures)} figures.")

# --- RUNNER ---
async def main():
    pdfs = sorted([f for f in os.listdir(PATHS['source_pdfs']) if f.endswith('.pdf')])
    if not pdfs: print("‚ùå No PDFs found."); return

    print("\n--- AVAILABLE TEXTBOOKS ---")
    for i, f in enumerate(pdfs): print(f"[{i+1}] {f}")

    sel = input("\nSelect book(s) (e.g. 1, 3): ")
    indices = [int(x)-1 for x in sel.split(',') if x.strip().isdigit()]

    for idx in indices:
        if 0 <= idx < len(pdfs):
            await process_textbook(os.path.join(PATHS['source_pdfs'], pdfs[idx]))

await main()

üîë Configured Google AI Studio API with Key.
üìÇ Reading local files...
üöÄ Processing 210 entities using gemini-3-flash-preview (API Key)...
   [Squamous cell carcinoma with s] -> Skin::Neoplastic::Epidermal::Malignant::Squamous_Cell_Carcinoma_Spindle_Cell
   [Spindle cell squamous cell car] -> Skin::Neoplastic::Epidermal::Malignant::Squamous_Cell_Carcinoma_Spindle_Cell
   [Clear cell squamous cell carci] -> Skin::Neoplastic::Epidermal::Malignant::Squamous_Cell_Carcinoma_Invasive_NOS
   [Lymphoepithelial carcinoma    ] -> Skin::Neoplastic::Epidermal::Malignant::Lymphoepithelioma_Like_Carcinoma
   [Acantholytic squamous cell car] -> Skin::Neoplastic::Epidermal::Malignant::Squamous_Cell_Carcinoma_Acantholytic
   [Verrucous squamous cell carcin] -> Skin::Neoplastic::Epidermal::Malignant::Squamous_Cell_Carcinoma_Verrucous
   [Squamous cell carcinomas      ] -> Skin::Neoplastic::Epidermal::Malignant::Squamous_Cell_Carcinoma_Invasive_NOS
   [Fibroepithelial basal cell car] -> Skin::Neop

# PDF BLOCK 2: TEXTBOOK ARCHITECT (High-Fidelity Monolith)


In [None]:
# @title {display-mode: "form"}
# ==============================================================================
# BLOCK 2: TEXTBOOK ARCHITECT (FINAL ROBUST VERSION)
# ==============================================================================
import json
import asyncio
import aiohttp
import re
import difflib
import random
import os
from typing import List, Dict, Set, Any
from google.cloud import storage
from tqdm.asyncio import tqdm_asyncio

# --- CONFIGURATION ---
# We use 1.5 Pro because it is currently the most stable "Deep Reasoning" model for large context.
# You can swap this to 'gemini-3-pro-preview' if you have specific access.
MODEL_NAME = "gemini-3-pro-preview"
CONCURRENCY_LIMIT = 2           # Low concurrency is required for Pro models (Rate Limits)
PAGES_PER_CHUNK = 50            # Process 50 pages at once (The "Super-Chunk")
PAGE_OVERLAP = 2                # Context overlap
MAX_RETRIES = 10                # Aggressive retries for 429 errors
BATCH_SIZE = 5                  # Save to disk every 5 chunks

# --- HELPERS ---
def gcs_read_text(blob_path: str) -> str:
    blob = bucket.blob(blob_path)
    return blob.download_as_string().decode('utf-8') if blob.exists() else ""

def gcs_load_json(blob_path: str) -> List:
    blob = bucket.blob(blob_path)
    return json.loads(blob.download_as_string()) if blob.exists() else []

def gcs_upload_json(data: Any, blob_path: str):
    blob = bucket.blob(blob_path)
    blob.upload_from_string(json.dumps(data, indent=2), content_type='application/json')

def validate_tag(tag_input: Any, valid_set: Set[str]) -> str:
    """Safely validates tags, handling None/Empty/Lists."""
    # 1. Handle non-string inputs (e.g. None or list)
    if not tag_input:
        return "Skin::Unclassified"

    if isinstance(tag_input, list):
        tag_str = tag_input[0] if len(tag_input) > 0 else "Skin::Unclassified"
    else:
        tag_str = str(tag_input)

    clean = tag_str.strip()

    # 2. Exact Match
    if clean in valid_set:
        return clean

    # 3. Fuzzy Match
    matches = difflib.get_close_matches(clean, list(valid_set), n=1, cutoff=0.7)
    return matches[0] if matches else clean

# ------------------------------------------------------------------------------
# 2. PROMPT ENGINEERING
# ------------------------------------------------------------------------------
def construct_textbook_prompt(text_content, figure_context, valid_tags_list):
    return f"""
Role: You are a Senior Dermatopathologist and Data Engineer.
Objective: Convert the provided TEXTBOOK CONTENT (Text + Figures) into a standardized Knowledge Base.

INPUT CONTEXT:
This input represents a ~{PAGES_PER_CHUNK} page section of a medical textbook.

INSTRUCTIONS:
1. **Identify Entities:** Extract every distinct disease/pathology entity discussed.
   - Ignore "Preface", "Contributors", "Index", or generic "Introduction" unless they contain specific disease data.
2. **High-Fidelity Extraction (CRITICAL):**
   - **Microscopic:** Do not summarize. Extract specific architectural features (e.g., "saw-toothing", "max-joseph spaces") and cytological features.
   - **Ancillary Studies:** You MUST list every specific stain mentioned (e.g., "CD45+", "S100-", "CK20 dot-like"). If genetics are mentioned (e.g., "t(11;22)"), include them.
3. **Figure Linking:**
   - Link relevant figures from the provided list to the correct entity.
   - Use the `gcs_path` provided.

REQUIRED JSON SCHEMA (List of Objects):
[
  {{
    "entity_name": "Disease Name",
    "definition": "...",
    "tags": ["Exact_Tag_From_List"],
    "html_gcs_path": null,

    "localization": "...",
    "clinical": "...",
    "pathogenesis": "...",
    "macroscopic": "...",
    "microscopic": "Detailed histology...",
    "cytology": "...",
    "ancillary_studies": "List specific IHC stains and molecular findings.",
    "differential_diagnosis": "...",
    "staging": "...",
    "prognosis_and_prediction": "...",

    "related_figures": [
        {{
            "id": "Use the ID from input",
            "src": "gs://...",
            "gcs_path": "gs://...",
            "diagnosis": "Disease Name",
            "legend": "Full caption from text + visual description."
        }}
    ]
  }}
]

REFERENCE TAGS:
{valid_tags_list}

--- FIGURES AVAILABLE IN THIS SECTION ---
{figure_context}

--- TEXTBOOK CONTENT ---
{text_content}
"""

# ------------------------------------------------------------------------------
# 3. CHUNK PROCESSOR
# ------------------------------------------------------------------------------
async def process_textbook_chunk(session, chunk_data, figures_in_chunk, valid_tags_text, valid_tags_set, sem):
    async with sem:
        full_text = "\n\n".join([f"--- Page {p['page_number']} ---\n{p['content']}" for p in chunk_data])

        fig_desc_list = []
        for f in figures_in_chunk:
            fig_desc_list.append(
                f"ID: {f.get('figure_id', 'Unknown')}\n"
                f"Page: {f['source_page']}\n"
                f"GCS Path: {f['gcs_path']}\n"
                f"Caption: {f.get('description', '')}\n"
            )
        fig_context = "\n".join(fig_desc_list)

        url = f"https://generativelanguage.googleapis.com/v1beta/models/{MODEL_NAME}:generateContent?key={GEMINI_API_KEY}"
        payload = {"contents": [{"parts": [{"text": construct_textbook_prompt(full_text, fig_context, valid_tags_text)}]}]}

        for attempt in range(MAX_RETRIES):
            try:
                # Long timeout for Pro models reading large chunks
                async with session.post(url, json=payload, timeout=600) as response:
                    if response.status == 200:
                        data = await response.json()
                        raw_txt = data['candidates'][0]['content']['parts'][0]['text']
                        raw_txt = raw_txt.replace("```json", "").replace("```", "")

                        match = re.search(r'\[.*\]', raw_txt, re.DOTALL)
                        if match:
                            entities = json.loads(match.group(0))
                            valid_entities = []
                            for ent in entities:
                                if not ent.get('entity_name'): continue

                                # Robust Tag Validation
                                raw_tags = ent.get('tags', [])
                                if isinstance(raw_tags, str): raw_tags = [raw_tags] # Handle AI returning string instead of list
                                ent['tags'] = [validate_tag(t, valid_tags_set) for t in raw_tags]

                                # Enforce Nulls for missing keys
                                required_keys = [
                                    "clinical", "microscopic", "ancillary_studies",
                                    "differential_diagnosis", "pathogenesis", "staging",
                                    "prognosis_and_prediction", "cytology", "macroscopic"
                                ]
                                for k in required_keys:
                                    if k not in ent: ent[k] = None
                                ent['html_gcs_path'] = None

                                valid_entities.append(ent)
                            return valid_entities
                        return []
                    elif response.status == 429:
                        wait = (2 ** attempt) + random.uniform(5, 15)
                        print(f"  ‚è≥ Rate Limit (Chunk {chunk_data[0]['page_number']})... Waiting {wait:.1f}s")
                        await asyncio.sleep(wait)
                        continue
                    else:
                        print(f"‚ùå Error {response.status} on chunk starting page {chunk_data[0]['page_number']}")
                        return []
            except Exception as e:
                # print(f"‚ùå Exception: {e}")
                await asyncio.sleep(15)
        return []

# ------------------------------------------------------------------------------
# 4. MAIN WORKFLOW
# ------------------------------------------------------------------------------
async def main_textbook_robust():
    # 1. Select Tags
    tag_files = [b.name for b in bucket.list_blobs(prefix="Tags/") if b.name.endswith('.txt')]
    print("\n--- SELECT TAG LIST ---")
    for i, f in enumerate(tag_files): print(f"[{i+1}] {f.split('/')[-1]}")
    t_idx = int(input("Choice: ")) - 1
    tags_text = gcs_read_text(tag_files[t_idx])
    tags_set = set(l.strip() for l in tags_text.splitlines() if l.strip())

    # 2. Select Textbook
    content_files = [b.name for b in bucket.list_blobs(prefix=PATHS['gcs_content_textbooks']) if "_CONTENT.json" in b.name]
    if not content_files: print("‚ùå No CONTENT files found."); return

    print("\n--- SELECT TEXTBOOK ---")
    for i, f in enumerate(content_files): print(f"[{i+1}] {f.split('/')[-1]}")
    c_idx = int(input("Choice: ")) - 1

    content_path = content_files[c_idx]
    book_base = content_path.split('/')[-1].replace("_CONTENT.json", "")
    fig_path = content_path.replace("_CONTENT.json", "_FIGURES.json")
    final_path = f"{PATHS['gcs_content_textbooks']}/{book_base}_MASTER.json"

    print(f"\nüöÄ Processing: {book_base}")
    print(f"   Model: {MODEL_NAME} (High Fidelity)")

    # 3. Resume vs Restart Logic
    master_kb = []
    if bucket.blob(final_path).exists():
        print(f"\n‚ö†Ô∏è Existing MASTER file found for {book_base}.")
        choice = input("Type 'RESUME' to continue or 'RESTART' to overwrite (Recommended if 0 entities): ").strip().upper()
        if choice == 'RESUME':
            master_kb = gcs_load_json(final_path)
            print(f"   -> Resuming with {len(master_kb)} existing entities.")
        else:
            print("   -> Starting fresh. Overwriting previous file.")
            master_kb = []

    # 4. Load Data & Chunk
    raw_content = gcs_load_json(content_path)
    raw_figures = gcs_load_json(fig_path)
    raw_content.sort(key=lambda x: x['page_number'])

    chunks = []
    total_pages = len(raw_content)
    for i in range(0, total_pages, PAGES_PER_CHUNK):
        end_idx = min(i + PAGES_PER_CHUNK + PAGE_OVERLAP, total_pages)
        chunks.append(raw_content[i : end_idx])

    print(f"üì¶ Total Chunks: {len(chunks)} (~{PAGES_PER_CHUNK} pages each)")

    # 5. Process Batches
    # Logic: If resuming, calculate where we left off based on entity count (approx)
    # Actually, simpler to just re-process if we suspect issues, OR skip chunks based on index.
    # For now, let's just process. The Dedupe handles overlaps.

    sem = asyncio.Semaphore(CONCURRENCY_LIMIT)

    for i in range(0, len(chunks), BATCH_SIZE):
        batch_chunks = chunks[i : i + BATCH_SIZE]
        print(f"\n--- Processing Batch {i//BATCH_SIZE + 1}/{(len(chunks)//BATCH_SIZE) + 1} ---")

        async with aiohttp.ClientSession() as session:
            tasks = []
            for chunk in batch_chunks:
                page_nums = {p['page_number'] for p in chunk}
                chunk_figs = [f for f in raw_figures if f['source_page'] in page_nums]
                tasks.append(process_textbook_chunk(session, chunk, chunk_figs, tags_text, tags_set, sem))

            results = await tqdm_asyncio.gather(*tasks)

            new_in_batch = 0
            for res_list in results:
                master_kb.extend(res_list)
                new_in_batch += len(res_list)

            # Incremental Save
            if new_in_batch > 0:
                gcs_upload_json(master_kb, final_path)
                print(f"üíæ Saved progress... (+{new_in_batch} entities)")
            else:
                print("‚ö†Ô∏è No entities found in this batch (or API error). Continuing...")

    # 6. Final Deduplication (Crash-Proof)
    print("\nüßπ Final Deduplication...")
    unique_kb = []
    seen_keys = set()
    for ent in master_kb:
        # Robust Key Gen: Handle None definition
        def_text = ent.get('definition')
        if def_text is None: def_text = ""

        key = (ent.get('entity_name'), def_text[:50])

        if key not in seen_keys:
            unique_kb.append(ent)
            seen_keys.add(key)

    gcs_upload_json(unique_kb, final_path)
    print(f"\n‚úÖ COMPLETE: gs://{GCS_BUCKET_NAME}/{final_path}")
    print(f"üìä Final Count: {len(unique_kb)} Entities")

await main_textbook_robust()


--- SELECT TAG LIST ---
[1] BST_Tags.txt
[2] Breast_Tags.txt
[3] Endo_Tags.txt
[4] GI_Tags.txt
[5] GYN_Tags.txt
[6] Skin_Tags.txt
Choice: 6

--- SELECT TEXTBOOK ---
[1] BST_Horvai_CONTENT.json
[2] Bone_Atlas_CONTENT.json
[3] Bone_Dorfman_CONTENT.json
[4] Bone_Pattern_CONTENT.json
[5] Breast_Atlas_CONTENT.json
[6] Breast_Biopsy_CONTENT.json
[7] Breast_FAQ_CONTENT.json
[8] Breast_Pattern_CONTENT.json
[9] Cyto_Breast_Yokohama_CONTENT.json
[10] Cyto_Cibas_CONTENT.json
[11] Cyto_Comprehensive_Part_One_CONTENT.json
[12] Cyto_Comprehensive_Part_Two_CONTENT.json
[13] Cyto_GU_Paris_CONTENT.json
[14] Cyto_Gyn_Bethesda_CONTENT.json
[15] Cyto_Milan_CONTENT.json
[16] Cyto_PSC_Lung_CONTENT.json
[17] Cyto_Pattern_CONTENT.json
[18] Cyto_Serous_Fluids_CONTENT.json
[19] Cyto_Thyroid_Bethesda_CONTENT.json
[20] Derm_Barnhill_CONTENT.json
[21] Derm_Elston_CONTENT.json
[22] Derm_Levers_CONTENT.json
[23] Derm_McKeeHY_CONTENT.json
[24] Derm_McKee_CONTENT.json
[25] Derm_McKee_High_Yield_CONTENT.json
[26] Derm

  0%|          | 0/5 [00:00<?, ?it/s]

In [None]:
# @title {display-mode: "form"}
# ==============================================================================
# BLOCK 2.5: THE TAG REFINER (Crash-Proof Version)
# ==============================================================================
import json
import asyncio
import aiohttp
import difflib
from google.cloud import storage
from tqdm.asyncio import tqdm_asyncio

# --- CONFIGURATION ---
MODEL_NAME = "gemini-3-flash-preview"
CONCURRENCY_LIMIT = 30

# --- HELPERS ---
def gcs_read_text(blob_path):
    blob = bucket.blob(blob_path)
    return blob.download_as_string().decode('utf-8') if blob.exists() else ""

def gcs_load_json(blob_path):
    blob = bucket.blob(blob_path)
    return json.loads(blob.download_as_string()) if blob.exists() else []

def gcs_upload_json(data, blob_path):
    blob = bucket.blob(blob_path)
    blob.upload_from_string(json.dumps(data, indent=2), content_type='application/json')

# --- AI WORKER ---
async def retag_entity(session, entity, valid_tags_text, valid_tags_set, sem):
    # 1. Safe Tag Retrieval (The Fix)
    current_tags = entity.get('tags')
    if current_tags and isinstance(current_tags, list) and len(current_tags) > 0:
        current_tag = current_tags[0]
    else:
        current_tag = "Unknown/Unclassified"

    # 2. Quick Check: Is it already valid?
    if current_tag in valid_tags_set:
        return entity

    # 3. AI Fix
    async with sem:
        url = f"https://generativelanguage.googleapis.com/v1beta/models/{MODEL_NAME}:generateContent?key={GEMINI_API_KEY}"

        prompt = f"""
        Role: Pathology Taxonomist.
        Task: Map this entity to the BEST matching tag from the Authorized List.

        ENTITY: {entity.get('entity_name', 'Unknown')}
        DEFINITION: {str(entity.get('definition', ''))[:300]}
        CURRENT TAG: {current_tag}

        AUTHORIZED LIST:
        {valid_tags_text}

        INSTRUCTION: Return ONLY the exact string from the Authorized List.
        """

        payload = {"contents": [{"parts": [{"text": prompt}]}]}

        try:
            async with session.post(url, json=payload) as response:
                if response.status == 200:
                    data = await response.json()
                    new_tag = data['candidates'][0]['content']['parts'][0]['text'].strip()

                    # Fuzzy validation
                    if new_tag in valid_tags_set:
                        entity['tags'] = [new_tag]
                    else:
                        # Try fuzzy match if AI made a typo
                        matches = difflib.get_close_matches(new_tag, list(valid_tags_set), n=1, cutoff=0.7)
                        if matches:
                            entity['tags'] = [matches[0]]
                        # If still no match, keep original (or Unknown)
        except:
            pass

        return entity

# --- MAIN ---
async def main_retagger():
    # 1. Select Tags
    tag_files = [b.name for b in bucket.list_blobs(prefix="Tags/") if b.name.endswith('.txt')]
    print("\n--- SELECT NEW TAG LIST ---")
    for i, f in enumerate(tag_files): print(f"[{i+1}] {f.split('/')[-1]}")
    t_idx = int(input("Choice: ")) - 1
    tags_text = gcs_read_text(tag_files[t_idx])
    tags_set = set(l.strip() for l in tags_text.splitlines() if l.strip())

    # 2. Select Master File
    content_files = [b.name for b in bucket.list_blobs(prefix=PATHS['gcs_content_textbooks']) if "_MASTER.json" in b.name]
    print("\n--- SELECT MASTER FILE TO UPDATE ---")
    for i, f in enumerate(content_files): print(f"[{i+1}] {f.split('/')[-1]}")
    c_idx = int(input("Choice: ")) - 1

    master_path = content_files[c_idx]
    entities = gcs_load_json(master_path)
    print(f"\nüöÄ Refining Tags for {len(entities)} entities...")

    # 3. Process
    sem = asyncio.Semaphore(CONCURRENCY_LIMIT)
    updated_entities = []

    async with aiohttp.ClientSession() as session:
        tasks = [retag_entity(session, ent, tags_text, tags_set, sem) for ent in entities]
        updated_entities = await tqdm_asyncio.gather(*tasks)

    # 4. Save
    gcs_upload_json(updated_entities, master_path)
    print(f"\n‚úÖ UPDATED: {master_path}")
    print("   All entities have been re-aligned to the tag list.")

await main_retagger()


--- SELECT NEW TAG LIST ---
[1] BST_Tags.txt
[2] Breast_Tags.txt
[3] Endo_Tags.txt
[4] GI_Tags.txt
[5] GYN_Tags.txt
[6] Skin_Tags.txt
Choice: 6

--- SELECT MASTER FILE TO UPDATE ---
[1] Derm_Elston_MASTER.json
[2] Derm_McKee_MASTER.json
[3] Derm_Patterson_MASTER.json
[4] Skin_Elston_MASTER.json
Choice: 1

üöÄ Refining Tags for 1078 entities...


100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1078/1078 [01:31<00:00, 11.75it/s]


‚úÖ UPDATED: _content_library/textbooks/Derm_Elston_MASTER.json
   All entities have been re-aligned to the tag list.





# PDF BLOCK 2.1: THE CONSOLIDATOR (Map-Reduce Merge)

In [None]:
# @title {display-mode: "form"}
# ==============================================================================
# BLOCK 3: THE CONSOLIDATOR (Map-Reduce Merge)
# ==============================================================================
import json
import asyncio
import aiohttp
import re
from collections import defaultdict
from typing import List, Dict, Any  # <--- Added missing imports
from google.cloud import storage
from tqdm.asyncio import tqdm_asyncio

# --- CONFIGURATION ---
# Flash is perfect for merging text. It is fast and respects the data.
MODEL_NAME = "gemini-3-flash-preview"
CONCURRENCY_LIMIT = 15

# --- HELPERS ---
def gcs_load_json(blob_path: str) -> List:
    blob = bucket.blob(blob_path)
    return json.loads(blob.download_as_string()) if blob.exists() else []

def gcs_upload_json(data: Any, blob_path: str):
    blob = bucket.blob(blob_path)
    blob.upload_from_string(json.dumps(data, indent=2), content_type='application/json')

# --- PROMPT ---
def construct_merge_prompt(entity_name, fragments):
    return f"""
Role: Medical Data Editor.
Task: Merge these fragmented records for "{entity_name}" into ONE comprehensive entry.

INPUT FRAGMENTS (from different chapters):
{json.dumps(fragments, indent=2)}

INSTRUCTIONS:
1. **Consolidate Text:** Combine 'clinical', 'microscopic', 'definition', etc. Do not lose details.
   - If Fragment A has "Clinical: Itchy" and Fragment B has "Clinical: Purple papules", the result must be "Itchy, purple papules."
   - **Crucial:** Preserve all specific stains (CD45+, S100) and genetic findings.
2. **Merge Figures:** Combine all 'related_figures' into one list. Remove duplicates if exact same ID.
3. **Preserve Tags:** Use the most specific tag available.
4. **Output:** A single JSON object.

REQUIRED SCHEMA:
{{
    "entity_name": "{entity_name}",
    "definition": "Merged...",
    "tags": ["..."],
    "html_gcs_path": null,
    "clinical": "...",
    "pathogenesis": "...",
    "macroscopic": "...",
    "microscopic": "...",
    "cytology": "...",
    "ancillary_studies": "...",
    "differential_diagnosis": "...",
    "staging": "...",
    "prognosis_and_prediction": "...",
    "related_figures": [...]
}}
"""

# --- AI WORKER ---
async def merge_entity_group(session, tag, group, sem):
    async with sem:
        # If only 1 entry, no merge needed
        if len(group) == 1:
            return group[0]

        # Construct Prompt
        entity_name = group[0]['entity_name']
        prompt = construct_merge_prompt(entity_name, group)

        url = f"https://generativelanguage.googleapis.com/v1beta/models/{MODEL_NAME}:generateContent?key={GEMINI_API_KEY}"
        payload = {"contents": [{"parts": [{"text": prompt}]}]}

        try:
            async with session.post(url, json=payload) as response:
                if response.status == 200:
                    data = await response.json()
                    raw = data['candidates'][0]['content']['parts'][0]['text']
                    match = re.search(r'\{.*\}', raw, re.DOTALL)
                    if match:
                        return json.loads(match.group(0))
        except:
            pass

        # Fallback: Just return the first one if AI fails (prevents data loss)
        print(f"‚ö†Ô∏è Merge failed for {entity_name}, keeping fragments.")
        return group[0]

# --- MAIN ---
async def main_consolidator():
    # 1. Select Content
    content_files = [b.name for b in bucket.list_blobs(prefix=PATHS['gcs_content_textbooks']) if "_MASTER.json" in b.name and "_CONSOLIDATED" not in b.name]
    if not content_files: print("‚ùå No MASTER files found. Run Block 2 first."); return

    print("\n--- SELECT MASTER FILE TO CONSOLIDATE ---")
    for i, f in enumerate(content_files): print(f"[{i+1}] {f.split('/')[-1]}")
    c_idx = int(input("Choice: ")) - 1

    master_path = content_files[c_idx]
    raw_entities = gcs_load_json(master_path)

    print(f"\nüöÄ Consolidating {len(raw_entities)} entities...")

    # 2. Group by Tag (or Name if Tag is generic)
    groups = defaultdict(list)
    for ent in raw_entities:
        # Key strategy: Use the first Tag as the primary key.
        # If tag is missing/generic, fallback to Entity Name.
        tag_list = ent.get('tags', [])
        key = tag_list[0] if tag_list else ent.get('entity_name', 'Unknown')
        groups[key].append(ent)

    print(f"   -> Found {len(groups)} unique topics (Tags/Names).")

    # 3. Process Groups
    sem = asyncio.Semaphore(CONCURRENCY_LIMIT)
    final_kb = []

    async with aiohttp.ClientSession() as session:
        tasks = []
        for key, group in groups.items():
            tasks.append(merge_entity_group(session, key, group, sem))

        results = await tqdm_asyncio.gather(*tasks)
        final_kb = results

    # 4. Save
    out_path = master_path.replace("_MASTER.json", "_CONSOLIDATED.json")
    gcs_upload_json(final_kb, out_path)
    print(f"\n‚úÖ CONSOLIDATED MASTER SAVED: gs://{GCS_BUCKET_NAME}/{out_path}")
    print(f"üìä Reduced {len(raw_entities)} fragments -> {len(final_kb)} unique entities.")

await main_consolidator()


--- SELECT MASTER FILE TO CONSOLIDATE ---
[1] Derm_Elston_MASTER.json
[2] Derm_Levers_MASTER.json
[3] Derm_McKee_MASTER.json
[4] Derm_Patterson_MASTER.json
[5] Skin_Elston_MASTER.json
Choice: 2

üöÄ Consolidating 494 entities...
   -> Found 349 unique topics (Tags/Names).


100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 349/349 [02:06<00:00,  2.76it/s]


‚úÖ CONSOLIDATED MASTER SAVED: gs://pathology-hub-0/_content_library/textbooks/Derm_Levers_CONSOLIDATED.json
üìä Reduced 494 fragments -> 349 unique entities.





# VIDEO

## Block 1: LECTURE EXTRACTOR (Whisper + Gemini 3 Flash)

In [None]:
# @title {display-mode: "form"}
# ==============================================================================
# BLOCK 1: LECTURE EXTRACTOR (Whisper + Gemini 3 Flash)
# ==============================================================================
import shutil, cv2, whisper, json, os, io, base64, re, asyncio, aiohttp
import logging
from skimage.metrics import structural_similarity as ssim
from PIL import Image
from tqdm.notebook import tqdm
from tqdm.asyncio import tqdm_asyncio
from google.cloud import storage

# --- CONFIGURATION ---
logging.getLogger("urllib3").setLevel(logging.ERROR)
API_CONCURRENCY_LIMIT = 20
VISION_MODEL = "gemini-3-pro-preview" # Fast & Cheap for per-slide analysis

# --- HELPERS ---
def gcs_upload_file(local_path, blob_path):
    blob = bucket.blob(blob_path)
    blob.upload_from_filename(local_path)

def gcs_upload_json(data, blob_path):
    blob = bucket.blob(blob_path)
    blob.upload_from_string(json.dumps(data, indent=2), content_type='application/json')

def gcs_exists(blob_path):
    return bucket.blob(blob_path).exists()

def get_comparison_frame(frame):
    h, w = frame.shape[:2]
    new_w = 200
    new_h = int(h * (new_w / w))
    small = cv2.resize(frame, (new_w, new_h), interpolation=cv2.INTER_AREA)
    gray = cv2.cvtColor(small, cv2.COLOR_BGR2GRAY)
    return cv2.GaussianBlur(gray, (5, 5), 0)

# --- AI ANALYST ---
async def analyze_slide_async(session, slide_data, local_img_path, sem):
    async with sem:
        if not os.path.exists(local_img_path): return slide_data

        try:
            # Prepare Image
            with Image.open(local_img_path) as img:
                buf = io.BytesIO()
                img.convert("RGB").save(buf, format="JPEG")
                b64_img = base64.b64encode(buf.getvalue()).decode("utf-8")

            url = f"https://generativelanguage.googleapis.com/v1beta/models/{VISION_MODEL}:generateContent?key={GEMINI_API_KEY}"

            # Prompt: Extract raw visual data. We don't need deep reasoning yet, just "What is on this slide?"
            prompt = (
                f"Transcript Context: \"{slide_data['raw_transcript'][:1000]}...\"\n\n"
                "TASK: Analyze this slide image. \n"
                "1. Extract the Title.\n"
                "2. Extract text labels verbatim (e.g. 'CD45', 'H&E', '40x').\n"
                "3. Summarize the visual content (e.g., 'Histology showing...').\n"
                "Return JSON: {\"slide_title\": \"...\", \"key_points\": [\"...\"], \"visual_desc\": \"...\"}"
            )

            payload = {"contents": [{"parts": [{"text": prompt}, {"inline_data": {"mime_type": "image/jpeg", "data": b64_img}}]}]}

            async with session.post(url, json=payload) as res:
                if res.status == 200:
                    dat = await res.json()
                    txt = dat['candidates'][0]['content']['parts'][0]['text']
                    match = re.search(r'\{.*\}', txt, re.DOTALL)
                    if match:
                        slide_data.update(json.loads(match.group(0)))
        except Exception as e:
            pass # Skip frame if AI fails

        return slide_data

# --- PIPELINE ---
async def process_video(video_path, counter, total):
    fname = os.path.basename(video_path)
    lecture_name = os.path.splitext(fname)[0].replace(" ", "_")

    # GCS Paths
    asset_base = f"{PATHS['gcs_asset_lectures']}/{lecture_name}"
    raw_json_path = f"{PATHS['gcs_content_lectures']}/{lecture_name}_RAW.json"

    print(f"\n{'='*60}\nüé• PROCESSING {counter}/{total}: {lecture_name}\n{'='*60}")

    if gcs_exists(raw_json_path):
        print("‚úÖ Already processed in GCS. Skipping.")
        return

    # 1. WHISPER TRANSCRIPTION
    print("üéôÔ∏è Step 1: Whisper Transcription...")
    model = whisper.load_model("base") # Use 'small' if you have GPU RAM, 'base' is fast
    result = model.transcribe(video_path, fp16=False)

    # 2. FRAME EXTRACTION & MERGING
    print("üéûÔ∏è Step 2: Extracting Slides...")
    cap = cv2.VideoCapture(video_path)
    slides = []
    curr_slide = None
    prev_cmp = None

    # We use TQDM to track progress through the audio segments
    for seg in tqdm(result['segments'], desc="Scanning", unit="seg"):
        cap.set(cv2.CAP_PROP_POS_MSEC, seg['start'] * 1000)
        ret, frame = cap.read()
        if not ret: continue

        curr_cmp = get_comparison_frame(frame)

        if curr_slide is None:
            curr_slide = {**seg, 'frame': frame}
            prev_cmp = curr_cmp
            continue

        # SSIM Check (Merge if > 85% similar)
        if ssim(prev_cmp, curr_cmp, data_range=255) >= 0.85:
            curr_slide['text'] += " " + seg['text']
            curr_slide['end'] = seg['end']
        else:
            slides.append(curr_slide)
            curr_slide = {**seg, 'frame': frame}
            prev_cmp = curr_cmp

    if curr_slide: slides.append(curr_slide)
    cap.release()
    print(f"   -> Consolidated into {len(slides)} unique slides.")

    # 3. UPLOAD & PREPARE
    print("‚òÅÔ∏è Step 3: Uploading Images...")
    final_data = []
    local_imgs = {} # Map id -> local path for AI step

    for i, slide in enumerate(slides):
        img_name = f"{lecture_name}_slide_{i+1:04d}.jpg"
        local_p = f"/tmp/{img_name}"
        gcs_p = f"{asset_base}/{img_name}"
        full_uri = f"gs://{GCS_BUCKET_NAME}/{gcs_p}"

        cv2.imwrite(local_p, slide['frame'])

        if not gcs_exists(gcs_p):
            gcs_upload_file(local_p, gcs_p)

        local_imgs[i] = local_p

        final_data.append({
            "id": i,
            "timestamp_start": slide['start'],
            "timestamp_end": slide['end'],
            "raw_transcript": slide['text'].strip(),
            "image_path": full_uri,
            "gcs_path": full_uri, # Important for Block 2
            "slide_title": "",
            "key_points": [],
            "visual_desc": ""
        })

    # 4. GEMINI ENHANCEMENT
    print("üß† Step 4: Gemini Vision Analysis...")
    sem = asyncio.Semaphore(API_CONCURRENCY_LIMIT)
    async with aiohttp.ClientSession() as sess:
        tasks = [analyze_slide_async(sess, d, local_imgs[d['id']], sem) for d in final_data]
        enhanced_data = await tqdm_asyncio.gather(*tasks)

    # 5. SAVE RAW JSON
    gcs_upload_json(enhanced_data, raw_json_path)
    print(f"‚úÖ Saved RAW data: {raw_json_path}")

    # Cleanup
    for p in local_imgs.values():
        if os.path.exists(p): os.remove(p)

# --- RUNNER ---
async def main_lectures():
    vid_files = sorted([f for f in os.listdir(PATHS['source_videos']) if f.lower().endswith(('.mp4', '.mov'))])
    if not vid_files: print("‚ùå No videos found."); return

    print("\n--- AVAILABLE LECTURES ---")
    for i, v in enumerate(vid_files): print(f"[{i+1}] {v}")

    sel = input("\nSelect (e.g. 1, 3-5, or 'all'): ")
    indices = set()
    if sel == 'all': indices = range(len(vid_files))
    else:
        for part in sel.split(','):
            if '-' in part:
                s, e = map(int, part.split('-'))
                indices.update(range(s-1, e))
            elif part.strip().isdigit():
                indices.add(int(part)-1)

    for idx in sorted(list(indices)):
        if 0 <= idx < len(vid_files):
            await process_video(os.path.join(PATHS['source_videos'], vid_files[idx]), idx+1, len(indices))

await main_lectures()


--- AVAILABLE LECTURES ---
[1] BST_Lecture_1_Grossing.mp4
[2] BST_Lecture_2_SoftTissue1.mp4
[3] BST_Lecture_3_SofTissue2.mp4
[4] BST_Lecture_4_SoftTissue3.mp4
[5] BST_Lecture_5_Bone1.mp4
[6] BST_Lecture_6_Bone2.mp4
[7] Breast_Lecture_Epithelial Part 1_Chen.mp4
[8] Breast_Lecture_Fibroepithelial.mp4
[9] Breast_Lecture_Grossing.mp4
[10] Breast_Lecture_IHC.mp4
[11] Breast_Lecture_Invasive.mp4
[12] Breast_Lecture_Lobular.mp4
[13] Breast_Lecture_Normal.mp4
[14] Breast_Lecture_Papillary.mp4
[15] Breast_Lecture_Prognostics.mp4
[16] Breast_Lecture_Rad-Path.mp4
[17] Breast_Lecture_SpindleCell.mp4
[18] Breast_Lecture_Treated.mp4
[19] Derm_Lecture_10_Folliculitis_SLIDE_SESSION.mp4
[20] Derm_Lecture_11_Granulomatous_dermatitis with Jeff North.mp4
[21] Derm_Lecture_12_Histiocytoses.mp4
[22] Derm_Lecture_13_pigment_disorders.mp4
[23] Derm_Lecture_14_inpatient_SLIDE_SESSION.mp4
[24] Derm_Lecture_15_interface_dermatitis.mp4
[25] Derm_Lecture_16_intraepidermal_vesicular_dermatitis.mp4
[26] Derm_Lectur

100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 139M/139M [00:00<00:00, 163MiB/s]


üéûÔ∏è Step 2: Extracting Slides...


Scanning:   0%|          | 0/226 [00:00<?, ?seg/s]

   -> Consolidated into 118 unique slides.
‚òÅÔ∏è Step 3: Uploading Images...
üß† Step 4: Gemini Vision Analysis...


  0%|          | 0/118 [00:01<?, ?it/s]


CancelledError: 

## Block 2: BATCH LECTURE ARCHITECT (Flexible Input Support)

In [None]:
# @title {display-mode: "form"}
# ==============================================================================
# BLOCK 2: BATCH LECTURE ARCHITECT (Flexible Input Support)
# ==============================================================================
import json
import asyncio
import aiohttp
import re
import difflib
import random
from typing import List, Dict, Set, Any
from google.cloud import storage
from tqdm.asyncio import tqdm_asyncio

# --- CONFIGURATION ---
MODEL_NAME = "gemini-3-pro-preview"
MAX_RETRIES = 5

# --- SETUP ---
if 'PATHS' not in globals():
    raise NameError("‚ùå PATHS not found. Please run Block 0.")
bucket = storage.Client().bucket(PATHS['gcs_bucket'])

# --- HELPERS ---
def gcs_read_text(blob_path):
    blob = bucket.blob(blob_path)
    return blob.download_as_string().decode('utf-8') if blob.exists() else ""

def gcs_load_json(blob_path):
    blob = bucket.blob(blob_path)
    return json.loads(blob.download_as_string()) if blob.exists() else []

def gcs_upload_json(data, blob_path):
    blob = bucket.blob(blob_path)
    blob.upload_from_string(json.dumps(data, indent=2), content_type='application/json')

def validate_tag(tag_input, valid_set):
    if not tag_input: return "Skin::Unclassified"
    clean = str(tag_input[0]) if isinstance(tag_input, list) and tag_input else str(tag_input)
    clean = clean.strip()
    if clean in valid_set: return clean
    matches = difflib.get_close_matches(clean, list(valid_set), n=1, cutoff=0.7)
    return matches[0] if matches else "Skin::Unclassified"

# --- LOGIC: THE ID SWAPPER ---
def inject_real_paths(entities, slide_lookup_map):
    for ent in entities:
        if 'related_figures' in ent:
            for fig in ent['related_figures']:
                # Handle ID variations (int vs string)
                slide_id = str(fig.get('id', '')).replace("Slide_", "")

                # Look up real data
                real_data = None
                # Try direct match or string match
                if slide_id in slide_lookup_map:
                    real_data = slide_lookup_map[slide_id]
                else:
                    # Fallback loop for mismatched types
                    for k, v in slide_lookup_map.items():
                        if str(k) == slide_id:
                            real_data = v
                            break

                if real_data:
                    fig['src'] = real_data.get('gcs_path') or real_data.get('image_path')
                    fig['gcs_path'] = fig['src']

                    # Add timestamp to legend
                    ts = real_data.get('timestamp_start') or real_data.get('start_time')
                    if ts is not None:
                        time_str = f"(Time: {float(ts):.0f}s)"
                        if time_str not in fig.get('legend', ''):
                            fig['legend'] = f"{fig.get('legend', '')} {time_str}".strip()
                else:
                    fig['src'] = None
                    fig['gcs_path'] = None
    return entities

# ------------------------------------------------------------------------------
# 2. PROMPT
# ------------------------------------------------------------------------------
def construct_lecture_prompt(transcript_data, valid_tags_list):
    return f"""
Role: Senior Dermatopathologist.
Objective: Convert the ENTIRE LECTURE provided below into a standardized Knowledge Base.

INPUT DATA:
- Chronological sequence of slides (ID, Visual Description, Transcript).

INSTRUCTIONS:
1. **Consolidate:** Merge discussion across multiple slides into single Disease Entities.
2. **Detail Extraction (CRITICAL):**
   - **Stains (IHC):** List every specific stain mentioned (e.g., "CK20+", "TTF-1 negative").
   - **Genetics:** List specific mutations/translocations.
3. **Tagging:** Select exactly ONE tag from the reference list.
4. **Figure Linking (ID SWAP):**
   - Select the BEST slides for 'related_figures'.
   - **IMPORTANT:** Use the exact ID provided (e.g., "Slide_5"). Leave `src` and `gcs_path` as "PLACEHOLDER".

REQUIRED JSON SCHEMA:
[
  {{
    "entity_name": "Disease Name",
    "definition": "...",
    "tags": ["Single_Exact_Tag"],
    "html_gcs_path": null,

    "clinical": "...",
    "pathogenesis": "...",
    "macroscopic": "...",
    "microscopic": "...",
    "ancillary_studies": "List ALL stains/molecular details.",
    "differential_diagnosis": "...",
    "staging": "...",
    "prognosis_and_prediction": "...",

    "related_figures": [
        {{
            "id": "Slide_X",
            "src": "PLACEHOLDER",
            "gcs_path": "PLACEHOLDER",
            "diagnosis": "Disease Name",
            "legend": "Specific description of this slide."
        }}
    ]
  }}
]

REFERENCE TAGS:
{valid_tags_list}

LECTURE CONTENT:
{transcript_data}
"""

# ------------------------------------------------------------------------------
# 3. SINGLE LECTURE PROCESSOR
# ------------------------------------------------------------------------------
async def process_lecture_async(session, raw_path, tags_text, tags_set):
    fname = raw_path.split('/')[-1]
    # Clean name: remove _RAW and extension
    lecture_name = fname.replace("_RAW.json", "").replace(".json", "")

    print(f"\nüöÄ Processing: {lecture_name}")

    # 1. Load Data
    slides = gcs_load_json(raw_path)
    if not slides:
        print("   ‚ö†Ô∏è Empty file. Skipping.")
        return

    # 2. Build Monolith Input
    formatted_input = ""
    slide_map = {}

    for s in slides:
        # Handle variations in key names (Block 1 versions differ)
        sid = s.get('id') if 'id' in s else s.get('segment_id')
        transcript = s.get('raw_transcript', '')
        visual = s.get('visual_desc') or s.get('slide_title', '')
        ts = s.get('timestamp_start') or s.get('start_time', 0)

        slide_map[str(sid)] = s

        formatted_input += f"\n--- ID: Slide_{sid} (Time: {float(ts):.0f}s) ---\n"
        formatted_input += f"VISUAL: {visual}\n"
        formatted_input += f"TRANSCRIPT: {transcript}\n"

    # 3. Call AI
    url = f"https://generativelanguage.googleapis.com/v1beta/models/{MODEL_NAME}:generateContent?key={GEMINI_API_KEY}"
    payload = {"contents": [{"parts": [{"text": construct_lecture_prompt(formatted_input, tags_text)}]}]}

    for attempt in range(MAX_RETRIES):
        try:
            async with session.post(url, json=payload, timeout=600) as response:
                if response.status == 200:
                    data = await response.json()
                    raw_txt = data['candidates'][0]['content']['parts'][0]['text']
                    match = re.search(r'\[.*\]', raw_txt.replace("```json", "").replace("```", ""), re.DOTALL)

                    if match:
                        entities = json.loads(match.group(0))

                        # ID Swap & Validation
                        entities = inject_real_paths(entities, slide_map)

                        valid_entities = []
                        for ent in entities:
                            ent['tags'] = [validate_tag(ent.get('tags'), tags_set)]
                            ent['html_gcs_path'] = None
                            for k in ["microscopic", "ancillary_studies", "differential_diagnosis", "clinical", "pathogenesis"]:
                                if k not in ent: ent[k] = None
                            valid_entities.append(ent)

                        # Save
                        out_path = f"{PATHS['gcs_content_lectures']}/{lecture_name}_MASTER.json"
                        gcs_upload_json(valid_entities, out_path)
                        print(f"   ‚úÖ Saved: {out_path} ({len(valid_entities)} entities)")
                        return

                elif response.status == 429:
                    print(f"   ‚è≥ Rate Limit... Waiting 30s")
                    await asyncio.sleep(30)
                    continue
                else:
                    print(f"   ‚ùå API Error: {response.status}")
                    break
        except Exception as e:
            print(f"   ‚ùå Exception: {e}")
            await asyncio.sleep(5)

# ------------------------------------------------------------------------------
# 4. MAIN BATCH RUNNER
# ------------------------------------------------------------------------------
async def main_batch_flexible():
    # 1. Tags
    tag_files = [b.name for b in bucket.list_blobs(prefix="Tags/") if b.name.endswith('.txt')]
    print("\n--- SELECT TAG LIST ---")
    for i, f in enumerate(tag_files): print(f"[{i+1}] {f.split('/')[-1]}")
    t_idx = int(input("Choice: ")) - 1
    tags_text = gcs_read_text(tag_files[t_idx])
    tags_set = set(l.strip() for l in tags_text.splitlines() if l.strip())

    # 2. Select Files (Flexible)
    blobs = list(bucket.list_blobs(prefix=PATHS['gcs_content_lectures']))
    candidates = []

    for b in blobs:
        if not b.name.endswith(".json"): continue
        # EXCLUDE Master/Consolidated files
        if "_MASTER" in b.name or "_CONSOLIDATED" in b.name or "_APP_READY" in b.name:
            continue
        candidates.append(b.name)

    if not candidates: print("‚ùå No input JSONs found."); return

    print("\n--- AVAILABLE LECTURES ---")
    for i, f in enumerate(candidates): print(f"[{i+1}] {f.split('/')[-1]}")

    sel = input("\nSelect (e.g. 1, 3-5, all): ")
    indices = set()
    if sel == 'all': indices = range(len(candidates))
    else:
        for part in sel.split(','):
            if '-' in part:
                s, e = map(int, part.split('-'))
                indices.update(range(s-1, e))
            elif part.strip().isdigit():
                indices.add(int(part)-1)

    print(f"\nüöÄ Queued {len(indices)} lectures...")

    async with aiohttp.ClientSession() as session:
        for idx in sorted(list(indices)):
            if 0 <= idx < len(candidates):
                await process_lecture_async(session, candidates[idx], tags_text, tags_set)

    print("\nüéâ Batch Complete.")

await main_batch_flexible()


--- SELECT TAG LIST ---
[1] BST_Tags.txt
[2] Breast_Tags.txt
[3] Endo_Tags.txt
[4] GI_Tags.txt
[5] GYN_Tags.txt
[6] Skin_Tags.txt
Choice: 6

--- AVAILABLE LECTURES ---
[1] BST_Lecture_1_Grossing.json
[2] final_ENHANCED_data.json
[3] final_structured_data.json
[4] BST_Lecture_2_SoftTissue1.json
[5] final_ENHANCED_data.json
[6] BST_Lecture_3_SofTissue2.json
[7] final_ENHANCED_data.json
[8] BST_Lecture_4_SoftTissue3.json
[9] final_ENHANCED_data.json
[10] BST_Lecture_5_Bone1.json
[11] final_ENHANCED_data.json
[12] BST_Lecture_6_Bone2.json
[13] final_ENHANCED_data.json
[14] Breast_Lecture_Epithelial.json
[15] final_ENHANCED_data.json
[16] Breast_Lecture_Fibroepithelial.json
[17] final_ENHANCED_data.json
[18] Breast_Lecture_Grossing.json
[19] final_ENHANCED_data.json
[20] Breast_Lecture_IHC.json
[21] final_ENHANCED_data.json
[22] Breast_Lecture_Invasive.json
[23] final_ENHANCED_data.json
[24] Breast_Lecture_Lobular.json
[25] final_ENHANCED_data.json
[26] Breast_Lecture_Normal.json
[27] fina

# BLOCK 3: THE GRAND UNIFIER (Textbooks + Lectures + WSI)

In [None]:
# @title {display-mode: "form"}
# ==============================================================================
# BLOCK 3: THE GRAND UNIFIER (Textbooks + Lectures + WSI)
# ==============================================================================
import json
import os
from google.cloud import storage
from tqdm.notebook import tqdm

# --- CONFIGURATION ---
OUTPUT_FILENAME = "GLOBAL_KNOWLEDGE_BASE.json"
WSI_FOLDER_PREFIX = "WSI_JSON/"
BLACKLIST = ["Copyright", "Preface", "Contents", "Index", "Contributors", "Dedication", "Title Page"]

# --- SETUP ---
if 'PATHS' not in globals():
    raise NameError("‚ùå PATHS not found. Please run Block 0.")
bucket = storage.Client().bucket(PATHS['gcs_bucket'])

# ------------------------------------------------------------------------------
# 1. TRANSFORMATION LOGIC (For Textbooks/Lectures)
# ------------------------------------------------------------------------------
def transform_raw_to_app_schema(entry, source_name, source_type):
    """
    Converts raw pipeline output (related_figures) into App Schema (media).
    """
    # 1. Filter Noise
    if any(x in entry.get('entity_name', '') for x in BLACKLIST): return None

    new_entry = entry.copy()

    # 2. Build Media Array
    media_list = []
    if 'related_figures' in entry:
        for fig in entry['related_figures']:
            diag = fig.get('diagnosis', '').strip()
            leg = fig.get('legend', '').strip()

            # Combine for rich legend
            if diag and leg and diag not in leg:
                final_legend = f"{diag}. {leg}"
            else:
                final_legend = leg or diag

            media_item = {
                "type": "figure",
                "path": fig.get('gcs_path') or fig.get('src'),
                "legend": final_legend
            }
            media_list.append(media_item)
        del new_entry['related_figures']

    new_entry['media'] = media_list

    # 3. Inject Metadata & Defaults
    new_entry['_meta_source'] = source_name
    new_entry['_meta_type'] = source_type

    for k in ["video", "html", "wsi", "definition", "clinical", "microscopic"]:
        if k not in new_entry: new_entry[k] = None

    # 4. Clean Internal Keys
    for k in ['html_gcs_path', 'gcs_origin', 'best_slide_id', 'source_document', 'source_type']:
        if k in new_entry: del new_entry[k]

    return new_entry

# ------------------------------------------------------------------------------
# 2. WSI LOADER (Already Formatted)
# ------------------------------------------------------------------------------
def load_app_ready_wsi(entry, source_name):
    """
    Loads WSI entries that are ALREADY in App Schema.
    Just injects metadata.
    """
    # Just ensure metadata exists
    entry['_meta_source'] = source_name
    entry['_meta_type'] = "WSI_Collection"
    return entry

# ------------------------------------------------------------------------------
# 3. FILE FINDERS
# ------------------------------------------------------------------------------
def get_best_pipeline_files(prefix):
    """Finds _CONSOLIDATED (preferred) or _MASTER files."""
    blobs = list(bucket.list_blobs(prefix=prefix))
    files_map = {}

    for b in blobs:
        if not b.name.endswith(".json"): continue
        fname = b.name.split('/')[-1]

        if "_MASTER" not in fname and "_CONSOLIDATED" not in fname: continue

        base = fname.replace("_MASTER.json", "").replace("_CONSOLIDATED.json", "")

        # Priority: Consolidated > Master
        if "_CONSOLIDATED" in fname:
            files_map[base] = b.name
        elif base not in files_map:
            files_map[base] = b.name
    return files_map

def get_wsi_files():
    """Finds _APP_READY files in the WSI folder."""
    blobs = list(bucket.list_blobs(prefix=WSI_FOLDER_PREFIX))
    return [b.name for b in blobs if b.name.endswith("_APP_READY.json")]

# ------------------------------------------------------------------------------
# 4. MAIN EXECUTION
# ------------------------------------------------------------------------------
def main_unifier():
    print(f"üöÄ Starting Grand Unification...")
    global_kb = []

    # --- A. TEXTBOOKS ---
    print("üìö Scanning Textbooks...")
    books = get_best_pipeline_files(PATHS['gcs_content_textbooks'])

    for name, path in tqdm(books.items(), desc="Textbooks"):
        try:
            data = json.loads(bucket.blob(path).download_as_string())
            for item in data:
                res = transform_raw_to_app_schema(item, name, "Textbook")
                if res: global_kb.append(res)
        except Exception as e:
            print(f"‚ö†Ô∏è Error in {name}: {e}")

    # --- B. LECTURES ---
    print("üé• Scanning Lectures...")
    lectures = get_best_pipeline_files(PATHS['gcs_content_lectures'])

    for name, path in tqdm(lectures.items(), desc="Lectures"):
        try:
            data = json.loads(bucket.blob(path).download_as_string())
            for item in data:
                res = transform_raw_to_app_schema(item, name, "Lecture")
                if res: global_kb.append(res)
        except: pass

    # --- C. WSI COLLECTIONS ---
    print("üî¨ Scanning WSI Collections...")
    wsi_files = get_wsi_files()

    for path in tqdm(wsi_files, desc="WSI"):
        try:
            data = json.loads(bucket.blob(path).download_as_string())
            name = path.split('/')[-1].replace("_APP_READY.json", "")
            for item in data:
                # WSI files are already formatted, just add metadata
                res = load_app_ready_wsi(item, name)
                if res: global_kb.append(res)
        except Exception as e:
            print(f"‚ö†Ô∏è Error in {path}: {e}")

    # --- SAVE ---
    print(f"\nüìä Total Entities: {len(global_kb)}")
    print(f"üíæ Saving to gs://{PATHS['gcs_bucket']}/{OUTPUT_FILENAME}...")

    bucket.blob(OUTPUT_FILENAME).upload_from_string(
        json.dumps(global_kb, indent=2),
        content_type='application/json'
    )
    print("‚úÖ SUCCESS!")

main_unifier()

# Temp / Utility

## BLOCK 1: ROBUST WSI IMPORTER (Case-Insensitive & Spelling Aware)

In [None]:
# @title {display-mode: "form"}
# ==============================================================================
# BLOCK 1: ROBUST WSI IMPORTER (Case-Insensitive & Spelling Aware)
# ==============================================================================
import json
import asyncio
import aiohttp
import re
import difflib
from google.cloud import storage
from tqdm.asyncio import tqdm_asyncio

# --- CONFIGURATION ---
MODEL_NAME = "gemini-3-flash-preview"
CONCURRENCY_LIMIT = 20
SOURCE_FOLDER_PREFIX = "WSI_JSON/"

# --- SETUP ---
if 'PATHS' not in globals():
    raise NameError("‚ùå PATHS not found. Please run Block 0.")
bucket = storage.Client().bucket(PATHS['gcs_bucket'])

# --- HELPERS ---
def gcs_load_json(blob_path):
    blob = bucket.blob(blob_path)
    return json.loads(blob.download_as_string()) if blob.exists() else []

def gcs_upload_json(data, blob_path):
    blob = bucket.blob(blob_path)
    blob.upload_from_string(json.dumps(data, indent=2), content_type='application/json')

def gcs_read_text(blob_path):
    blob = bucket.blob(blob_path)
    return blob.download_as_string().decode('utf-8') if blob.exists() else ""

# --- KEY NORMALIZER (The Fix) ---
def get_val_case_insensitive(data_dict, target_key):
    """Finds a key in a dict regardless of casing."""
    # Direct match
    if target_key in data_dict: return data_dict[target_key]

    # Lowercase match
    target_lower = target_key.lower().replace(" ", "_")
    for k, v in data_dict.items():
        k_norm = k.lower().replace(" ", "_")
        if k_norm == target_lower:
            return v
    return None

# ------------------------------------------------------------------------------
# 1. NORMALIZERS
# ------------------------------------------------------------------------------
def normalize_who_chapter(entry):
    """Handles WHO structure with case-insensitive key lookup."""

    # Map text fields
    new_entry = {}
    fields = [
        "entity_name", "definition", "clinical", "pathogenesis",
        "macroscopic", "microscopic", "ancillary_studies",
        "differential_diagnosis", "staging", "prognosis_and_prediction",
        "cytology", "diagnostic_molecular_pathology",
        "related_terminology", "subtypes"
    ]

    for f in fields:
        new_entry[f] = get_val_case_insensitive(entry, f)

    # Process Figures
    media_list = []
    raw_figs = get_val_case_insensitive(entry, "related_figures") or []

    for fig in raw_figs:
        legend = get_val_case_insensitive(fig, "legend") or ""
        diag = get_val_case_insensitive(fig, "diagnosis") or ""

        if diag and diag not in legend:
            legend = f"{diag}. {legend}"

        media_item = {"legend": legend}

        is_wsi = get_val_case_insensitive(fig, "isWSI")
        fig_id = str(get_val_case_insensitive(fig, "id"))

        if is_wsi is True:
            media_item["type"] = "wsi"
            media_item["path"] = f"https://tumourclassification.iarc.who.int/static/dzi/{fig_id}_files/10/0_0.jpeg"
            media_item["url"] = f"https://tumourclassification.iarc.who.int/Viewer/Index2?fid={fig_id}"
        else:
            media_item["type"] = "figure"
            src = get_val_case_insensitive(fig, "src")
            gcs = get_val_case_insensitive(fig, "gcs_path")
            media_item["path"] = gcs or src
            if not media_item["path"]: continue

        media_list.append(media_item)

    new_entry['media'] = media_list
    return new_entry

def normalize_simple(entry, fmt):
    """Handles simple formats (Leeds, PP, etc)."""
    # Try common keys for diagnosis
    diag = get_val_case_insensitive(entry, "Diagnosis") or get_val_case_insensitive(entry, "Title")

    # Try common keys for URL
    url = get_val_case_insensitive(entry, "URL") or get_val_case_insensitive(entry, "Link")

    # Try common keys for Thumbnail
    thumb = get_val_case_insensitive(entry, "Thumbnail")
    if fmt == "PathPresenter": thumb = None # Force null

    return {
        "entity_name": diag,
        "media": [{
            "type": "wsi",
            "path": thumb,
            "url": url,
            "legend": diag
        }]
    }

def detect_format(data_entry):
    keys = [k.lower() for k in data_entry.keys()]
    if "related_figures" in keys or "microscopic" in keys: return "WHO"
    if "diagnosis" in keys or "url" in keys: return "Simple"
    return "Unknown"

# ------------------------------------------------------------------------------
# 2. AI TAGGING
# ------------------------------------------------------------------------------
async def assign_tag_async(session, entity, valid_tags_text, valid_tags_set, sem):
    async with sem:
        diag = entity.get('entity_name')
        if not diag:
            entity['tags'] = ["Skin::Unclassified"]
            return entity

        # Check exact match first
        if diag in valid_tags_set:
            entity['tags'] = [diag]
            return entity

        url = f"https://generativelanguage.googleapis.com/v1beta/models/{MODEL_NAME}:generateContent?key={GEMINI_API_KEY}"

        prompt = f"""
        Role: Pathology Taxonomist.
        Task: Map the diagnosis "{diag}" to the best tag from the list.

        RULES:
        1. "Naevus" = "Nevus".
        2. Ignore "(HE)", "(Actin)", or case numbers.
        3. Return ONLY the tag string from the list.

        VALID TAGS:
        {valid_tags_text}
        """

        for _ in range(3): # Retry logic
            try:
                async with session.post(url, json={"contents": [{"parts": [{"text": prompt}]}]}) as res:
                    if res.status == 200:
                        dat = await res.json()
                        tag = dat['candidates'][0]['content']['parts'][0]['text'].strip()

                        if tag in valid_tags_set:
                            entity['tags'] = [tag]
                        else:
                            # Fallback fuzzy match
                            matches = difflib.get_close_matches(tag, list(valid_tags_set), n=1, cutoff=0.7)
                            entity['tags'] = [matches[0]] if matches else ["Skin::Unclassified"]
                        return entity
                    elif res.status == 429:
                        await asyncio.sleep(2)
            except: pass

        entity['tags'] = ["Skin::Unclassified"]
        return entity

# ------------------------------------------------------------------------------
# 3. EXECUTION
# ------------------------------------------------------------------------------
async def main_wsi_robust():
    # 1. Tags
    tag_files = [b.name for b in bucket.list_blobs(prefix="Tags/") if b.name.endswith('.txt')]
    print("\n--- SELECT TAG LIST ---")
    for i, f in enumerate(tag_files): print(f"[{i+1}] {f.split('/')[-1]}")
    t_idx = int(input("Choice: ")) - 1
    tags_text = gcs_read_text(tag_files[t_idx])
    tags_set = set(l.strip() for l in tags_text.splitlines() if l.strip())

    # 2. Files
    print(f"\nüîç Scanning {SOURCE_FOLDER_PREFIX} ...")
    blobs = list(bucket.list_blobs(prefix=SOURCE_FOLDER_PREFIX))
    candidates = [b.name for b in blobs if b.name.endswith(".json") and "_APP_READY1" not in b.name]

    if not candidates: print("‚ùå No files found."); return

    print("\n--- AVAILABLE FILES ---")
    for i, f in enumerate(candidates): print(f"[{i+1}] {f}")

    sel = input("\nSelect files (e.g. 1, 3-5, all): ").strip().lower()
    indices = set(range(len(candidates))) if sel == 'all' else {int(x)-1 for x in re.findall(r'\d+', sel)}

    # 3. Process
    for idx in sorted(list(indices)):
        if idx < 0 or idx >= len(candidates): continue
        src_blob = candidates[idx]

        print(f"\n{'='*50}\nüöÄ Processing: {src_blob}")
        raw_data = gcs_load_json(src_blob)

        if not raw_data: print("‚ö†Ô∏è Empty file."); continue

        # Debug: Print keys of first entry
        print(f"   Sample Keys: {list(raw_data[0].keys())}")

        # Detect Format
        fmt = detect_format(raw_data[0])
        print(f"   Detected Format: {fmt}")

        normalized_data = []
        for item in raw_data:
            if fmt == "WHO": norm = normalize_who_chapter(item)
            else: norm = normalize_simple(item, fmt)

            # Fill schema nulls
            for k in ["video", "html", "wsi", "definition", "clinical", "microscopic"]:
                if k not in norm: norm[k] = None
            normalized_data.append(norm)

        # Tagging
        print(f"üß† Tagging {len(normalized_data)} entities...")
        sem = asyncio.Semaphore(CONCURRENCY_LIMIT)
        final_data = []
        async with aiohttp.ClientSession() as session:
            tasks = [assign_tag_async(session, ent, tags_text, tags_set, sem) for ent in normalized_data]
            final_data = await tqdm_asyncio.gather(*tasks)

        # Save
        out_name = f"{src_blob.replace('.json', '_APP_READY.json')}"
        gcs_upload_json(final_data, out_name)

        valid = sum(1 for e in final_data if "Unclassified" not in e['tags'][0])
        print(f"‚úÖ Saved: {out_name} (Tagged: {valid}/{len(final_data)})")

await main_wsi_robust()


--- SELECT TAG LIST ---
[1] BST_Tags.txt
[2] Breast_Tags.txt
[3] Endo_Tags.txt
[4] GI_Tags.txt
[5] GYN_Tags.txt
[6] Skin_Tags.txt
Choice: 6

üîç Scanning WSI_JSON/ ...

--- AVAILABLE FILES ---
[1] WSI_JSON/Leeds_WSI_Skin.json
[2] WSI_JSON/PP_Skin_1-500.json
[3] WSI_JSON/PP_Skin_1-500_APP_READY.json
[4] WSI_JSON/PP_Skin_1001-1500.json
[5] WSI_JSON/PP_Skin_1001-1500_APP_READY.json
[6] WSI_JSON/PP_Skin_1501-2000.json
[7] WSI_JSON/PP_Skin_1501-2000_APP_READY.json
[8] WSI_JSON/PP_Skin_2001-2500.json
[9] WSI_JSON/PP_Skin_2001-2500_APP_READY.json
[10] WSI_JSON/PP_Skin_2501-3000.json
[11] WSI_JSON/PP_Skin_2501-3000_APP_READY.json
[12] WSI_JSON/PP_Skin_3001-3500.json
[13] WSI_JSON/PP_Skin_3001-3500_APP_READY.json
[14] WSI_JSON/PP_Skin_3501-4000.json
[15] WSI_JSON/PP_Skin_3501-4000_APP_READY.json
[16] WSI_JSON/PP_Skin_4001-4500.json
[17] WSI_JSON/PP_Skin_4001-4500_APP_READY.json
[18] WSI_JSON/PP_Skin_4501-5000.json
[19] WSI_JSON/PP_Skin_4501-5000_APP_READY.json
[20] WSI_JSON/PP_Skin_5001-6000.

100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 224/224 [01:44<00:00,  2.13it/s]


‚úÖ Saved: WSI_JSON/SKIN_APP_READY.json (Tagged: 74/224)


In [None]:
# @title üõ†Ô∏è Block 1: Single Tag Enforcer (Local Input: Skin_Tags.txt)
import json
import re
import difflib
import os
from typing import List, Any, Set
from google.colab import auth
from google.cloud import storage

# ==============================================================================
# CONFIGURATION
# ==============================================================================
BUCKET_NAME = "pathology-hub-0"

# 1. Source of VALID TAGS (Cloud)
TAGS_GCS_PATH = "Tags/Skin_Tags.txt"

# 2. Source of ENTITIES (Local File uploaded to Colab)
INPUT_FILENAME = "Skin_Tags.txt"

# 3. Output Destination (Cloud)
GCS_OUTPUT_PATH = "WHO/WHO_JSON_PROCESSED/SKIN.json"

# ==============================================================================
# 1. DOWNLOAD VALID TAGS FROM GCS
# ==============================================================================
def get_tags_from_gcs(bucket_name: str, blob_path: str) -> Set[str]:
    print(f"‚òÅÔ∏è Downloading Master Tag List from: gs://{bucket_name}/{blob_path} ...")
    try:
        client = storage.Client()
        bucket = client.bucket(bucket_name)
        blob = bucket.blob(blob_path)
        content = blob.download_as_string().decode('utf-8')
        # Build set of valid tags
        valid_tags = set(line.strip() for line in content.splitlines() if line.strip())
        return valid_tags
    except Exception as e:
        print(f"‚ùå Error downloading tags: {e}")
        return set()

# ==============================================================================
# 2. PARSE LOCAL ENTITIES (JSON)
# ==============================================================================
def parse_local_entities(filepath: str) -> List[dict]:
    print(f"üìÇ Reading local file: {filepath} ...")
    with open(filepath, 'r', encoding='utf-8') as f:
        content = f.read()

    # Attempt to find JSON block if file has mixed content
    json_match = re.search(
        r'--- START OF FILE application/json ---\n(.*)',
        content,
        re.DOTALL
    )

    if json_match:
        json_text = json_match.group(1)
        json_text = re.sub(r'\n---.*', '', json_text) # Remove footer if present
    else:
        # Assume whole file is JSON
        json_text = content

    try:
        data = json.loads(json_text)
        return data
    except json.JSONDecodeError as e:
        print(f"‚ùå JSON Decode Error: {e}")
        # Soft fix for cut-off files
        try:
            print("   Attempting auto-fix...")
            data = json.loads(json_text + "]")
            return data
        except:
            return []

# ==============================================================================
# 3. TAGGING LOGIC (Force 1 Tag)
# ==============================================================================
def resolve_single_tag(tag_input: Any, valid_set: Set[str], entity_name: str) -> List[str]:
    """
    Returns exactly ONE tag: ["Category::Subcategory::Specific"]
    """
    candidates = []

    # Normalize input to list of strings
    if isinstance(tag_input, str):
        candidates = [tag_input]
    elif isinstance(tag_input, list):
        candidates = [str(t) for t in tag_input]

    # Strategy 1: Look for Exact Match in existing tags
    for tag in candidates:
        clean = tag.strip()
        if clean in valid_set:
            return [clean]

    # Strategy 2: Look for Fuzzy Match in existing tags
    for tag in candidates:
        clean = tag.strip()
        matches = difflib.get_close_matches(clean, list(valid_set), n=1, cutoff=0.85)
        if matches:
            return [matches[0]]

    # Strategy 3: Match Entity Name to Tag List
    # e.g., Entity: "Basal cell carcinoma" -> Tag: "Skin::...::Basal_Cell_Carcinoma"
    if entity_name:
        name_clean = entity_name.replace(" ", "_").replace("-", "_")

        # A. Containment match (Case insensitive)
        # Find tags that contain the entity name
        possible_matches = [t for t in valid_set if name_clean.lower() in t.lower()]
        if possible_matches:
            # Sort by length; usually the shortest containing tag is the specific category header
            # or the longest might be the specific disease.
            # We pick the one with best text overlap.
            possible_matches.sort(key=lambda x: difflib.SequenceMatcher(None, name_clean.lower(), x.lower()).ratio(), reverse=True)
            return [possible_matches[0]]

        # B. Fuzzy match against all tags
        matches = difflib.get_close_matches(name_clean, list(valid_set), n=1, cutoff=0.6)
        if matches:
            return [matches[0]]

    return ["Skin::Unclassified"]

# ==============================================================================
# 4. MAIN
# ==============================================================================
def main():
    # 1. Auth
    print("üîë Authenticating...")
    try:
        auth.authenticate_user()
    except:
        print("‚ö†Ô∏è Skipped auth")

    # 2. Get Tags from GCS
    valid_tags = get_tags_from_gcs(BUCKET_NAME, TAGS_GCS_PATH)
    if not valid_tags:
        print("‚ùå ABORT: No tags found in GCS.")
        return
    print(f"   ‚úÖ Loaded {len(valid_tags)} valid tags.")

    # 3. Read Local JSON
    if not os.path.exists(INPUT_FILENAME):
        print(f"‚ùå ABORT: Local file '{INPUT_FILENAME}' not found. Please upload it.")
        return

    entities = parse_local_entities(INPUT_FILENAME)
    print(f"   ‚úÖ Loaded {len(entities)} entities.")

    # 4. Clean & Tag
    cleaned_kb = []
    keys_to_keep = [
        "entity_name", "definition", "localization", "clinical",
        "pathogenesis", "macroscopic", "microscopic", "cytology",
        "ancillary_studies", "differential_diagnosis", "staging",
        "prognosis_and_prediction", "diagnostic_molecular_pathology",
        "essential_and_desirable_diagnostic_criteria", "related_figures"
    ]

    for ent in entities:
        if not ent.get('entity_name'): continue

        # Fix Tag
        ent['tags'] = resolve_single_tag(
            ent.get('tags', []),
            valid_tags,
            ent.get('entity_name', '')
        )

        # Enforce Keys
        for k in keys_to_keep:
            if k not in ent: ent[k] = None
        ent['html_gcs_path'] = ent.get('html_gcs_path', None)

        cleaned_kb.append(ent)

    # 5. Upload Result
    print(f"\nüöÄ Uploading {len(cleaned_kb)} cleaned entities to GCS...")
    try:
        client = storage.Client()
        bucket = client.bucket(BUCKET_NAME)
        blob = bucket.blob(GCS_OUTPUT_PATH)
        blob.upload_from_string(
            json.dumps(cleaned_kb, indent=2, ensure_ascii=False),
            content_type='application/json'
        )
        print(f"‚úÖ DONE: gs://{BUCKET_NAME}/{GCS_OUTPUT_PATH}")
    except Exception as e:
        print(f"‚ùå Upload Error: {e}")

if __name__ == "__main__":
    main()

üîë Authenticating...
‚òÅÔ∏è Downloading Master Tag List from: gs://pathology-hub-0/Tags/Skin_Tags.txt ...
   ‚úÖ Loaded 723 valid tags.
‚ùå ABORT: Local file 'Skin_Tags.txt' not found. Please upload it.


In [None]:
# @title üöÄ Block 1: Local File Processor (Gemini 3 Flash Preview)
import json
import time
import os
from typing import List
from google.colab import auth
import vertexai
from vertexai.generative_models import GenerativeModel, GenerationConfig, HarmCategory, HarmBlockThreshold

# ==============================================================================
# CONFIGURATION
# ==============================================================================
# ‚ö†Ô∏è Your Project ID
PROJECT_ID = "pathology-annotation-project"
LOCATION = "us-central1"

# ‚úÖ CORRECT MODEL ID (Released Dec 17, 2025)
MODEL_ID = "gemini-3-flash-preview"

# Local Files (Upload these to the "Files" folder on the left)
LOCAL_TAGS_FILE = "Skin_Tags.txt"
LOCAL_ENTITIES_FILE = "SKIN.json"
LOCAL_OUTPUT_FILE = "SKIN_Cleaned_Gemini3.json"

# ==============================================================================
# AI PROCESSING LOGIC
# ==============================================================================
def map_tags_with_gemini_3(entities: List[dict], valid_tags_text: str):
    print(f"üîå Connecting to Vertex AI...")
    print(f"   ‚Ä¢ Project: {PROJECT_ID}")
    print(f"   ‚Ä¢ Region:  {LOCATION}")
    print(f"   ‚Ä¢ Model:   {MODEL_ID}")

    try:
        vertexai.init(project=PROJECT_ID, location=LOCATION)
        # Initialize Gemini 3
        model = GenerativeModel(MODEL_ID)
    except Exception as e:
        print(f"\n‚ùå FATAL ERROR connecting to Vertex AI: {e}")
        return []

    # Safety settings: BLOCK_NONE is critical for medical pathology terms
    safety_settings = {
        HarmCategory.HARM_CATEGORY_HATE_SPEECH: HarmBlockThreshold.BLOCK_NONE,
        HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT: HarmBlockThreshold.BLOCK_NONE,
        HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT: HarmBlockThreshold.BLOCK_NONE,
        HarmCategory.HARM_CATEGORY_HARASSMENT: HarmBlockThreshold.BLOCK_NONE,
    }

    # Gemini 3 Flash has a massive context window (1M tokens).
    # We can process large batches (e.g., 50 entities) safely.
    BATCH_SIZE = 50
    total = len(entities)
    updated_entities = []

    print(f"üöÄ Processing {total} entities using {MODEL_ID}...")

    for i in range(0, total, BATCH_SIZE):
        batch = entities[i : i + BATCH_SIZE]

        # Build prompt context
        entities_str = ""
        for idx, ent in enumerate(batch):
            name = ent.get('entity_name', 'Unknown')
            # Truncate definition slightly to save tokens, though Gemini 3 handles it easily
            defin = ent.get('definition', '')[:500]
            entities_str += f"ITEM_{idx}: {name}\nDEFINITION: {defin}\n\n"

        prompt = f"""
        Role: Senior Dermatopathologist & Data Steward.
        Objective: Map specific skin disease entities to the standard WHO Classification Tags.

        --- MASTER TAG LIST START ---
        {valid_tags_text}
        --- MASTER TAG LIST END ---

        --- ENTITIES TO MAP ---
        {entities_str}

        INSTRUCTIONS:
        1. For each entity, select the ONE most accurate tag from the Master List.
        2. If the entity name implies a specific subtype (e.g., "Sarcomatoid SCC") and a specific tag exists ("...Squamous_Cell_Carcinoma_Sarcomatoid"), USE IT.
        3. If no specific tag exists, find the closest parent tag.
        4. Return a JSON list of strings (tags) in exactly the same order as the input items.

        OUTPUT FORMAT:
        ["Tag_1", "Tag_2", "Tag_3", ...]
        """

        # Retry logic
        max_retries = 3
        for attempt in range(max_retries):
            try:
                response = model.generate_content(
                    prompt,
                    generation_config=GenerationConfig(
                        temperature=0.0, # Deterministic for data cleaning
                        response_mime_type="application/json"
                    ),
                    safety_settings=safety_settings
                )

                tags_list = json.loads(response.text)

                if len(tags_list) != len(batch):
                    print(f"‚ö†Ô∏è Batch {i}: Count mismatch ({len(tags_list)} tags vs {len(batch)} items). Retrying...")
                    raise ValueError("Count mismatch")

                # Apply tags
                for ent, new_tag in zip(batch, tags_list):
                    ent['tags'] = [new_tag]
                    print(f"   [{ent['entity_name'][:30]:<30}] -> {new_tag}")

                updated_entities.extend(batch)
                break

            except Exception as e:
                print(f"   ‚ùå Error Batch {i} (Attempt {attempt+1}): {e}")

                # Check for the specific 403 that caused the previous issue
                if "403" in str(e) or "PERMISSION_DENIED" in str(e):
                    print(f"\nüõë PERMISSION ERROR DETECTED for {MODEL_ID}")
                    print("   1. Go to Google Cloud Console > Vertex AI > Model Garden")
                    print("   2. Search for 'Gemini 3 Flash'")
                    print("   3. Click 'ENABLE' to accept the Preview terms.")
                    # Stop script here so you can go fix it
                    return updated_entities

                time.sleep(2)

    return updated_entities

# ==============================================================================
# MAIN EXECUTION
# ==============================================================================
def main():
    # 1. Auth
    print("üîë Authenticating...")
    try:
        auth.authenticate_user()
    except:
        print("‚ö†Ô∏è Authentication skipped (Local runtime?).")

    # 2. Check Files
    if not os.path.exists(LOCAL_TAGS_FILE):
        print(f"‚ùå Error: '{LOCAL_TAGS_FILE}' not found. Please upload it.")
        return
    if not os.path.exists(LOCAL_ENTITIES_FILE):
        print(f"‚ùå Error: '{LOCAL_ENTITIES_FILE}' not found. Please upload it.")
        return

    # 3. Read Data
    print("üìÇ Reading local input files...")
    with open(LOCAL_TAGS_FILE, 'r', encoding='utf-8') as f:
        tags_text = f.read()

    with open(LOCAL_ENTITIES_FILE, 'r', encoding='utf-8') as f:
        # Robust load: handle if file has weird headers or just raw JSON
        content = f.read()
        try:
            entities = json.loads(content)
        except json.JSONDecodeError:
            # Fallback: try to extract JSON if it was pasted with headers
            import re
            match = re.search(r'--- START OF FILE application/json ---\n(.*)', content, re.DOTALL)
            if match:
                clean = re.sub(r'\n---.*', '', match.group(1))
                entities = json.loads(clean)
            else:
                print("‚ùå Failed to parse JSON file.")
                return

    # 4. Run AI
    final_data = map_tags_with_gemini_3(entities, tags_text)

    # 5. Save Output
    if final_data:
        with open(LOCAL_OUTPUT_FILE, 'w', encoding='utf-8') as f:
            json.dump(final_data, f, indent=2)
        print(f"\nüíæ Success! Saved to: {LOCAL_OUTPUT_FILE}")
        print("   (Download this file from the left sidebar)")

if __name__ == "__main__":
    main()

üîë Authenticating...
üìÇ Reading local input files...
üîå Connecting to Vertex AI...
   ‚Ä¢ Project: pathology-annotation-project
   ‚Ä¢ Region:  us-central1
   ‚Ä¢ Model:   gemini-3-flash-preview
üöÄ Processing 210 entities using gemini-3-flash-preview...




   ‚ùå Error Batch 0 (Attempt 1): 403 Permission 'aiplatform.endpoints.predict' denied on resource '//aiplatform.googleapis.com/projects/pathology-annotation-project/locations/us-central1/publishers/google/models/gemini-3-flash-preview' (or it may not exist). [reason: "IAM_PERMISSION_DENIED"
domain: "aiplatform.googleapis.com"
metadata {
  key: "resource"
  value: "projects/pathology-annotation-project/locations/us-central1/publishers/google/models/gemini-3-flash-preview"
}
metadata {
  key: "permission"
  value: "aiplatform.endpoints.predict"
}
]

üõë PERMISSION ERROR DETECTED for gemini-3-flash-preview
   1. Go to Google Cloud Console > Vertex AI > Model Garden
   2. Search for 'Gemini 3 Flash'
   3. Click 'ENABLE' to accept the Preview terms.
