<a href="https://colab.research.google.com/github/herndoch/dermopath-ai-hub/blob/main/Knowledge_Pipeline_v4_0.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Read Me
<details>
<summary><strong>ü§ñ AI-to-AI Handover Protocol (Click to Expand)</strong></summary>

# ü§ñ AI-to-AI Handover Protocol (Read First)

**‚ö†Ô∏è CRITICAL SYSTEM INVARIANTS**
*Do not modify these settings without explicit human authorization. These constraints exist to prevent known failure modes.*

### 1. Model Hierarchy & Reasoning
*   **The "Monolith" Rule:** Architect Blocks (PDF Block 2 & Video Block 2) **MUST** utilize `gemini-1.5-pro-002` (or `gemini-3-pro-preview` if available).
    *   *Why:* We tested Flash; it hallucinates summaries and omits specific IHC stains (e.g., summarizing "CD45+, S100-" as just "ruled out melanoma"). Only Pro models maintain the high fidelity required for medical RAG.
*   **The "Flash" Rule:** Extraction and Consolidation (Block 1 & 3) **MUST** use `gemini-3-flash-preview`.
    *   *Why:* Pro models have strict Rate Limits (RPM). Using Pro for simple text merging or per-page extraction causes 429 loops and crashes.

### 2. Data Integrity Constraints
*   **The "Menu" Method (PDFs):** Never allow the AI to hallucinate image paths.
    *   *Invariant:* PDF Block 2 prompts **MUST** utilize the pre-validated "Figure Menu" generated by Block 1. If an image isn't in the menu, it does not exist.
*   **The "Chain of Custody" (Videos):**
    *   *Invariant:* Video Block 1 generates the `gs://` link. Video Block 2 must be instructed to copy that specific link field, not invent a filename.
*   **Zero-Loss Merging:** Block 3 (Consolidator) is purely additive.
    *   *Invariant:* When merging fragmented entities (e.g., "Lichen Planus" from Page 40 and Page 400), the AI must **concatenate** facts, never overwrite or summarize them away.

### 3. Operational Limits (The "Sweet Spots")
*   **Textbook Chunk Size:** `40 pages` (with `2 page overlap`).
    *   *Why:* >50 pages triggers HTTP timeouts. <30 pages loses context. 40 is the empirically tested maximum for stability.
*   **Pro Concurrency:** `Limit = 2`.
    *   *Why:* `gemini-1.5-pro` allows fewer concurrent requests than Flash. Increasing this >2 results in immediate 429 throttling.
*   **Flash Concurrency:** `Limit = 15-20`.
    *   *Why:* Safe zone for high-throughput image extraction.

### 4. File Format Logic
*   **Textbooks = PNG:** Lossless quality is required for OCR to correctly read tiny font sizes in medical diagrams.
*   **Lectures = JPG:** Compression is required to handle the volume (100+ slides per hour) without exploding storage costs.

---

### üîÑ Recovery Playbook (If Execution Fails)
*   **IF `429 Resource Exhausted`:** Do not restart. The script has auto-resume logic. Wait 60s and re-run.
*   **IF `Content-Generation Timeout`:** The Chunk Size is too large for the current model latency. Reduce `PAGES_PER_CHUNK` from 40 to 30.
*   **IF `KeyError: 'gcs_content_textbooks'`:** The environment is fresh. Run **Block 0** to re-initialize the `PATHS` map.

</details>

<details>
<summary><strong>üß¨ Pathology Knowledge Base Pipeline (SOP) (Click to Expand)</strong></summary>

**System Version:** v5.0 (High-Fidelity / Monolithic Architecture)
**Engine:** Google Gemini (1.5 Pro / 3 Flash)
**Infrastructure:** Google Colab $\leftrightarrow$ Google Cloud Storage (GCS)

## üìã Overview
This pipeline converts unstructured medical data (Textbooks and Video Lectures) into a strictly standardized, ontology-tagged JSON Knowledge Base. It uses a **"Monolith"** approach for reasoning (processing large contexts at once) and a **"Map-Reduce"** approach for consolidation.

---

## üõ†Ô∏è Block 0: Universal Setup
**Status:** ‚úÖ Mandatory (Run once per session)

This block installs dependencies, authenticates with Google Cloud, and establishes the global directory map (`PATHS`) to prevent file-not-found errors.

*   **Inputs:** None (requires Google Drive mount).
*   **Actions:**
    *   Installs `PyMuPDF` (PDFs), `openai-whisper` (Audio), `opencv` (Video), `aiohttp` (Async API).
    *   Authenticates via Colab Secrets (`GEMINI_API_KEY`).
    *   Sets up GCS Bucket paths.
*   **Key Variable:** `PATHS` dictionary (Routes data for both Textbooks and Lectures).

---

## üìö Workflow A: Textbooks (PDF)

### Block 1: The Extractor
**Goal:** Raw Data Acquisition & Normalization.
*   **Model:** `gemini-3-flash-preview` (Speed & Cost).
*   **Inputs:** Raw PDF files from Google Drive.
*   **Logic:**
    1.  **Text:** Cleans OCR errors page-by-page.
    2.  **Images:** Extracts images >5KB (saved as **PNG**).
    3.  **Panel-Aware Vision:** Detects if an image is "Figure 2.1 (A)" vs "(B)" and splits captions accordingly.
    4.  **Golden Links:** Generates permanent `gs://` links for every image.
*   **Outputs:** `_CONTENT.json` (Text), `_FIGURES.json` (Image Metadata).

### Block 2: The Architect (High-Fidelity)
**Goal:** Medical Reasoning & Schema Enforcement.
*   **Model:** `gemini-1.5-pro-002` or `gemini-3-pro-preview` (Deep Reasoning).
*   **Inputs:** `_CONTENT.json` + `_FIGURES.json` + `_Tags.txt`.
*   **Logic:**
    1.  **The Monolith:** Processes **40 pages** in a single context window to capture full disease descriptions.
    2.  **The Menu:** Forces AI to pick images from a pre-validated list of `gs://` links (prevents broken links).
    3.  **Strict Extraction:** Explicitly instructed to list **stains (CD45+)** and **genetics** without summarizing.
    4.  **Safety:** Auto-saves every 5 chunks; resumes if interrupted.
*   **Outputs:** `_MASTER.json` (High quality, but potentially fragmented entities).

### Block 3: The Consolidator
**Goal:** Map-Reduce / De-fragmentation.
*   **Model:** `gemini-3-flash-preview` (Logistics & Merging).
*   **Inputs:** `_MASTER.json`.
*   **Logic:**
    1.  **Map:** Groups entries by Tag (e.g., finds 3 separate entries for "Lichen Planus" from different chapters).
    2.  **Reduce:** Merges text, combines figure lists, and deduplicates data into one Super-Entry.
*   **Outputs:** `_CONSOLIDATED.json` (Final Database-Ready File).

---

## üé• Workflow B: Lectures (Video)

### Block 1: The Extractor
**Goal:** Audio Transcription & Slide Extraction.
*   **Model:** `whisper` (Audio) + `gemini-3-flash-preview` (Visuals).
*   **Inputs:** MP4/MOV files from Google Drive.
*   **Logic:**
    1.  **Audio:** Generates timestamped transcript.
    2.  **Visuals:** Extracts frames using **SSIM (Structural Similarity)** to deduplicate static slides (only 1 image per slide change). Saved as **JPG**.
    3.  **Analysis:** Vision model extracts text/titles visible on the slide.
*   **Outputs:** `_RAW.json` (List of slides with transcripts and GCS paths).

### Block 2: The Architect (The Monolith)
**Goal:** Synthesis & SOP Compliance.
*   **Model:** `gemini-1.5-pro-002` or `gemini-3-pro-preview`.
*   **Inputs:** `_RAW.json` + `_Tags.txt`.
*   **Logic:**
    1.  **Single Shot:** Feeds the **Entire Lecture** (Transcript + All Slide Images) in one massive request.
    2.  **Visual Fidelity:** Prompt forces extraction of text labels seen on slides (e.g., "TTF-1+", "CK20+") rather than just summarizing the diagnosis.
    3.  **Schema:** Maps the spoken lecture into the strict 18-field SOP (Clinical, Microscopic, etc.).
*   **Outputs:** `_MASTER.json`. *(Note: Lectures rarely need Block 3 consolidation as they usually discuss a topic linearly).*

---

## üìÇ Data Structure (Google Cloud)

```text
gs://pathology-hub-0/
‚îú‚îÄ‚îÄ Tags/                        # Source of Truth (Ontology)
‚îú‚îÄ‚îÄ _asset_library/
‚îÇ   ‚îú‚îÄ‚îÄ textbooks/
‚îÇ   ‚îÇ   ‚îî‚îÄ‚îÄ [Book_Name]/
‚îÇ   ‚îÇ       ‚îî‚îÄ‚îÄ figure_images/   # Saved PNGs (Lossless)
‚îÇ   ‚îî‚îÄ‚îÄ lectures/
‚îÇ       ‚îî‚îÄ‚îÄ [Video_Name]/        # Saved JPGs (Compressed)
‚îî‚îÄ‚îÄ _content_library/
    ‚îú‚îÄ‚îÄ textbooks/
    ‚îÇ   ‚îú‚îÄ‚îÄ [Book]_CONTENT.json
    ‚îÇ   ‚îú‚îÄ‚îÄ [Book]_FIGURES.json
    ‚îÇ   ‚îú‚îÄ‚îÄ [Book]_MASTER.json
    ‚îÇ   ‚îî‚îÄ‚îÄ [Book]_CONSOLIDATED.json    # <--- FINAL PDF RESULT
    ‚îî‚îÄ‚îÄ lectures/
        ‚îú‚îÄ‚îÄ [Video]_RAW.json
        ‚îî‚îÄ‚îÄ [Video]_MASTER.json         # <--- FINAL VIDEO RESULT</details>

# Block 0

In [2]:
# @title {display-mode: "form"}
# ==============================================================================
# BLOCK 0: UNIVERSAL SETUP (Textbooks + Lectures)
# ==============================================================================
import os
import shutil
from google.colab import drive, userdata, auth
from google.cloud import storage
import google.generativeai as genai

print("--- STEP 0: INITIALIZATION ---")

# 1. Install & Configure System (Textbooks + Whisper/Video tools)
print("üì¶ Installing dependencies (PDF, Video, AI)...")
!sudo apt-get update -qq && sudo apt-get install -y ffmpeg > /dev/null 2>&1
!pip install -q -U google-generativeai PyMuPDF scikit-image aiohttp tqdm openai-whisper opencv-python-headless

# 2. Authentication
print("üîë Authenticating with Google Cloud...")
try:
    auth.authenticate_user()
    GEMINI_API_KEY = userdata.get('GEMINI_API_KEY')
    genai.configure(api_key=GEMINI_API_KEY)
except Exception as e:
    raise SystemExit(f"‚ùå Authentication Failed: {e}")

# 3. Mount Drive (Source Storage)
try:
    drive.mount('/content/drive', force_remount=True)
except:
    pass

# 4. Universal Configuration
GCS_BUCKET_NAME = 'pathology-hub-0'
DRIVE_ROOT = '/content/drive/MyDrive/1-Projects/Knowledge_Pipeline'

# Initialize GCS Client
storage_client = storage.Client()
bucket = storage_client.bucket(GCS_BUCKET_NAME)

# --- THE MASTER PATH MAP ---
# This dictionary handles routing for BOTH workflows.
PATHS = {
    # --- SOURCES (Local Google Drive) ---
    "source_pdfs":      os.path.join(DRIVE_ROOT, '_source_materials', 'pdfs'),
    "source_videos":    os.path.join(DRIVE_ROOT, '_source_materials', 'videos'),

    # --- DESTINATIONS (GCS Bucket Paths) ---
    "gcs_bucket":       GCS_BUCKET_NAME,
    "gcs_tags":         "Tags",  # Where your _Tags.txt files live

    # Textbook Pipeline
    "gcs_asset_textbooks":   "_asset_library/textbooks",
    "gcs_content_textbooks": "_content_library/textbooks",

    # Lecture Pipeline
    "gcs_asset_lectures":    "_asset_library/lectures",
    "gcs_content_lectures":  "_content_library/lectures"
}

# 5. Verification
print(f"\n‚úÖ Connected to Bucket: gs://{GCS_BUCKET_NAME}")
print(f"‚úÖ Source PDFs:   {PATHS['source_pdfs']}")
print(f"‚úÖ Source Videos: {PATHS['source_videos']}")
print("\nüöÄ SYSTEM READY. You can now run Block 1 (Textbook) or Block 1 (Lecture).")


All support for the `google.generativeai` package has ended. It will no longer be receiving 
updates or bug fixes. Please switch to the `google.genai` package as soon as possible.
See README for more details:

https://github.com/google-gemini/deprecated-generative-ai-python/blob/main/README.md

  loader.exec_module(module)


--- STEP 0: INITIALIZATION ---
üì¶ Installing dependencies (PDF, Video, AI)...
W: Skipping acquire of configured file 'main/source/Sources' as repository 'https://r2u.stat.illinois.edu/ubuntu jammy InRelease' does not seem to provide it (sources.list entry misspelt?)
üîë Authenticating with Google Cloud...
Mounted at /content/drive

‚úÖ Connected to Bucket: gs://pathology-hub-0
‚úÖ Source PDFs:   /content/drive/MyDrive/1-Projects/Knowledge_Pipeline/_source_materials/pdfs
‚úÖ Source Videos: /content/drive/MyDrive/1-Projects/Knowledge_Pipeline/_source_materials/videos

üöÄ SYSTEM READY. You can now run Block 1 (Textbook) or Block 1 (Lecture).


# PDF BLOCK 1: TEXTBOOK EXTRACTOR (Text + Figures)

In [1]:
# @title {display-mode: "form"}
# ==============================================================================
# BLOCK 1: TEXTBOOK EXTRACTOR (Gemini 3 Flash - Panel Aware)
# ==============================================================================
import base64
import fitz  # PyMuPDF
import json
import asyncio
import aiohttp
import re
import os
from tqdm.asyncio import tqdm_asyncio
from google.cloud import storage

# --- CONFIGURATION ---
TEXT_CONCURRENCY = 20
VISION_CONCURRENCY = 20
TEXT_MODEL_URL = f"https://generativelanguage.googleapis.com/v1beta/models/gemini-3-flash-preview:generateContent?key={GEMINI_API_KEY}"
VISION_MODEL_NAME = "gemini-3-flash-preview"

# --- HELPER: GCS UTILS ---
def gcs_exists(blob_path):
    return bucket.blob(blob_path).exists()

def gcs_upload_bytes(data, blob_path, content_type):
    blob = bucket.blob(blob_path)
    blob.upload_from_string(data, content_type=content_type)

def gcs_upload_json(data, blob_path):
    blob = bucket.blob(blob_path)
    blob.upload_from_string(json.dumps(data, indent=2), content_type='application/json')

def gcs_load_json(blob_path):
    blob = bucket.blob(blob_path)
    if blob.exists():
        return json.loads(blob.download_as_string())
    return []

# --- AI HELPERS ---
async def clean_text_async(session, text, page_num, sem):
    async with sem:
        if not text.strip(): return page_num, ""
        prompt = (
            "Clean this medical text. Fix OCR errors. Keep structure. "
            "Preserve Figure Captions exactly. Return JSON: {\"markdown\": \"...\"}"
            f"\n\nRAW TEXT:\n{text}"
        )
        payload = {"contents": [{"parts": [{"text": prompt}]}]}
        try:
            async with session.post(TEXT_MODEL_URL, json=payload) as res:
                if res.status == 200:
                    dat = await res.json()
                    raw = dat['candidates'][0]['content']['parts'][0]['text']
                    match = re.search(r'\{.*\}', raw, re.DOTALL)
                    if match: return page_num, json.loads(match.group(0)).get("markdown", text)
                return page_num, text
        except: return page_num, text

async def analyze_figure_async(session, b64_img, context, sem):
    """
    Panel-Aware Vision Analysis.
    Tries to distinguish if the image is just Panel A or Panel B of a multipart figure.
    """
    async with sem:
        url = f"https://generativelanguage.googleapis.com/v1beta/models/{VISION_MODEL_NAME}:generateContent?key={GEMINI_API_KEY}"

        prompt = f"""
        PAGE CONTEXT:
        {context}

        TASK: Analyze the image below.
        1. Identify the Figure ID (e.g. "Fig 2.1") from the context that matches this image.
        2. **MULTI-PANEL CHECK:**
           - Does the caption describe multiple parts (e.g. "(A) ... (B) ...")?
           - If yes, determine if THIS specific image is Panel A, Panel B, etc.
           - If this image is ONLY Panel A, try to extract ONLY the caption text for (A).
           - If you cannot split the text, return the full caption but add "(Panel A)" to the ID.

        Return JSON: {{"figure_id": "Fig X.X (Panel A)", "matched_caption": "Specific caption..."}} or null.
        """

        parts = [
            {"text": prompt},
            {"inline_data": {"mime_type": "image/png", "data": b64_img}}
        ]

        try:
            async with session.post(url, json={"contents": [{"parts": parts}]}) as res:
                if res.status == 200:
                    dat = await res.json()
                    raw = dat['candidates'][0]['content']['parts'][0]['text']
                    match = re.search(r'\{.*\}', raw, re.DOTALL)
                    if match: return json.loads(match.group(0))
        except: return None
        return None

# --- MAIN PROCESSOR ---
async def process_textbook(pdf_path, start_p=1, end_p=None):
    fname = os.path.basename(pdf_path)
    book_name = os.path.splitext(fname)[0].replace(' ', '_')

    base_asset = f"{PATHS['gcs_asset_textbooks']}/{book_name}"
    path_fig_imgs = f"{base_asset}/figure_images"
    path_content = f"{PATHS['gcs_content_textbooks']}/{book_name}_CONTENT.json"
    path_figures = f"{PATHS['gcs_content_textbooks']}/{book_name}_FIGURES.json"

    print(f"\n{'='*60}\nüìò PROCESSING: {book_name}\n{'='*60}")

    doc = fitz.open(pdf_path)
    total = len(doc)
    final_p = min(end_p or total, total)

    # 1. TEXT (Skip if done)
    existing_content = gcs_load_json(path_content)
    if not existing_content:
        print(f"üìù Phase 1: Cleaning Text...")
        sem = asyncio.Semaphore(TEXT_CONCURRENCY)
        async with aiohttp.ClientSession() as sess:
            tasks = [clean_text_async(sess, doc.load_page(p).get_text("text"), p+1, sem) for p in range(start_p-1, final_p)]
            results = await tqdm_asyncio.gather(*tasks)
        content_data = sorted([{"source": fname, "page_number": p, "content": t} for p, t in results], key=lambda x: x['page_number'])
        gcs_upload_json(content_data, path_content)
    else:
        content_data = existing_content

    content_map = {c['page_number']: c['content'] for c in content_data}

    # 2. FIGURES
    print("üñºÔ∏è Phase 2: Figures & Vision (Panel-Aware)...")
    existing_figs = gcs_load_json(path_figures)
    processed_pages = {f['source_page'] for f in existing_figs}
    vision_tasks = []
    new_figures = []
    sem_vis = asyncio.Semaphore(VISION_CONCURRENCY)

    for p_idx in range(start_p-1, final_p):
        p_num = p_idx + 1
        if p_num in processed_pages: continue

        page = doc.load_page(p_idx)
        images = page.get_images(full=True)
        if not images: continue

        md_ctx = content_map.get(p_num, "")

        for i, img in enumerate(images):
            try:
                xref = img[0]
                base = doc.extract_image(xref)
                if len(base["image"]) < 5000: continue

                img_name = f"{book_name}_page_{p_num}_img_{i+1}.{base['ext']}"
                blob_path = f"{path_fig_imgs}/{img_name}"
                full_uri = f"gs://{GCS_BUCKET_NAME}/{blob_path}"

                if not gcs_exists(blob_path):
                    gcs_upload_bytes(base["image"], blob_path, f"image/{base['ext']}")

                b64 = base64.b64encode(base["image"]).decode('utf-8')
                vision_tasks.append({
                    "b64": b64, "ctx": md_ctx,
                    "meta": {"source_page": p_num, "gcs_path": full_uri}
                })
            except: pass

    if vision_tasks:
        print(f"   -> Analyzing {len(vision_tasks)} figures...")
        async with aiohttp.ClientSession() as sess:
            tasks = [analyze_figure_async(sess, t['b64'], t['ctx'], sem_vis) for t in vision_tasks]
            results = await tqdm_asyncio.gather(*tasks)

            for i, res in enumerate(results):
                if res and res.get('figure_id'):
                    meta = vision_tasks[i]['meta']
                    new_figures.append({
                        "source_document": fname,
                        "source_page": meta['source_page'],
                        "figure_id": res['figure_id'],
                        "description": res['matched_caption'],
                        "gcs_path": meta['gcs_path']
                    })

        final_list = existing_figs + new_figures
        final_list.sort(key=lambda x: x['source_page'])
        gcs_upload_json(final_list, path_figures)
        print(f"   -> Added {len(new_figures)} figures.")

# --- RUNNER ---
async def main():
    pdfs = sorted([f for f in os.listdir(PATHS['source_pdfs']) if f.endswith('.pdf')])
    if not pdfs: print("‚ùå No PDFs found."); return

    print("\n--- AVAILABLE TEXTBOOKS ---")
    for i, f in enumerate(pdfs): print(f"[{i+1}] {f}")

    sel = input("\nSelect book(s) (e.g. 1, 3): ")
    indices = [int(x)-1 for x in sel.split(',') if x.strip().isdigit()]

    for idx in indices:
        if 0 <= idx < len(pdfs):
            await process_textbook(os.path.join(PATHS['source_pdfs'], pdfs[idx]))

await main()

ModuleNotFoundError: No module named 'fitz'

# PDF BLOCK 2: TEXTBOOK ARCHITECT (High-Fidelity Monolith)


In [8]:
# @title {display-mode: "form"}
# ==============================================================================
# BLOCK 2: TEXTBOOK ARCHITECT (Robust + ID Swap + Batching)
# ==============================================================================
import json
import asyncio
import aiohttp
import re
import difflib
import random
import os
from typing import List, Dict, Set, Any
from google.cloud import storage
from tqdm.asyncio import tqdm_asyncio

# --- CONFIGURATION ---
MODEL_NAME = "gemini-3-pro-preview"
CONCURRENCY_LIMIT = 2
PAGES_PER_CHUNK = 40
PAGE_OVERLAP = 2
MAX_RETRIES = 10
BATCH_SIZE = 5

# --- HELPERS ---
def gcs_read_text(blob_path: str) -> str:
    blob = bucket.blob(blob_path)
    return blob.download_as_string().decode('utf-8') if blob.exists() else ""

def gcs_load_json(blob_path: str) -> List:
    blob = bucket.blob(blob_path)
    return json.loads(blob.download_as_string()) if blob.exists() else []

def gcs_upload_json(data: Any, blob_path: str):
    blob = bucket.blob(blob_path)
    blob.upload_from_string(json.dumps(data, indent=2), content_type='application/json')

def validate_tag(tag_input: Any, valid_set: Set[str]) -> str:
    """Safely validates tags, handling None/Empty/Lists."""
    if not tag_input: return "Skin::Unclassified"
    if isinstance(tag_input, list):
        tag_str = tag_input[0] if len(tag_input) > 0 else "Skin::Unclassified"
    else:
        tag_str = str(tag_input)
    clean = tag_str.strip()
    if clean in valid_set: return clean
    matches = difflib.get_close_matches(clean, list(valid_set), n=1, cutoff=0.7)
    return matches[0] if matches else clean

# --- LOGIC: THE ID SWAPPER ---
def inject_real_paths(entities, figure_lookup_map):
    """Replaces AI placeholders with REAL GCS paths."""
    for ent in entities:
        if 'related_figures' in ent:
            for fig in ent['related_figures']:
                fig_id = fig.get('id')
                if fig_id in figure_lookup_map:
                    real_data = figure_lookup_map[fig_id]
                    fig['src'] = real_data['gcs_path']
                    fig['gcs_path'] = real_data['gcs_path']
                else:
                    fig['src'] = None
                    fig['gcs_path'] = None
    return entities

# ------------------------------------------------------------------------------
# 2. PROMPT ENGINEERING
# ------------------------------------------------------------------------------
def construct_textbook_prompt(text_content, figure_list_simple, valid_tags_list):
    # Only show ID and Caption to AI
    fig_context = "\n".join([f"ID: {f.get('figure_id', 'Unknown')} | Caption: {f.get('description','')}" for f in figure_list_simple])

    return f"""
Role: Senior Dermatopathologist.
Task: Extract disease entities from this textbook section.

INSTRUCTIONS:
1. **Extraction:** Extract Definition, Clinical, Microscopic, etc.
2. **Tagging:** Use the exact tag from the list.
3. **Figure Linking (CRITICAL):**
   - I have provided a list of Figures with IDs.
   - If a figure is relevant, add it to `related_figures`.
   - **IMPORTANT:** In the JSON, put the ID in the `id` field. Leave `src` and `gcs_path` as "PLACEHOLDER".

REQUIRED JSON SCHEMA:
[
  {{
    "entity_name": "Disease Name",
    "definition": "...",
    "tags": ["Tag"],
    "html_gcs_path": null,
    "clinical": "...",
    "microscopic": "...",
    "ancillary_studies": "...",
    "related_figures": [
        {{
            "id": "COPY_EXACT_ID_FROM_LIST",
            "src": "PLACEHOLDER",
            "gcs_path": "PLACEHOLDER",
            "diagnosis": "Disease Name",
            "legend": "Full caption."
        }}
    ]
  }}
]

REFERENCE TAGS:
{valid_tags_list}

AVAILABLE FIGURES:
{fig_context}

TEXT CONTENT:
{text_content}
"""

# ------------------------------------------------------------------------------
# 3. CHUNK PROCESSOR
# ------------------------------------------------------------------------------
async def process_textbook_chunk(session, chunk_data, figures_in_chunk, valid_tags_text, valid_tags_set, sem):
    async with sem:
        full_text = "\n\n".join([f"--- Page {p['page_number']} ---\n{p['content']}" for p in chunk_data])

        url = f"https://generativelanguage.googleapis.com/v1beta/models/{MODEL_NAME}:generateContent?key={GEMINI_API_KEY}"
        payload = {"contents": [{"parts": [{"text": construct_textbook_prompt(full_text, figures_in_chunk, valid_tags_text)}]}]}

        for attempt in range(MAX_RETRIES):
            try:
                async with session.post(url, json=payload, timeout=600) as response:
                    if response.status == 200:
                        data = await response.json()
                        raw_txt = data['candidates'][0]['content']['parts'][0]['text']
                        match = re.search(r'\[.*\]', raw_txt.replace("```json", "").replace("```", ""), re.DOTALL)
                        if match:
                            entities = json.loads(match.group(0))
                            valid_entities = []

                            # ID Swap Logic
                            chunk_map = {f.get('figure_id'): f for f in figures_in_chunk}
                            entities = inject_real_paths(entities, chunk_map)

                            for ent in entities:
                                if not ent.get('entity_name'): continue

                                # Tag & Null Validation
                                ent['tags'] = [validate_tag(ent.get('tags', []), valid_tags_set)]
                                for k in ["clinical", "microscopic", "ancillary_studies", "differential_diagnosis", "pathogenesis", "staging", "cytology"]:
                                    if k not in ent: ent[k] = None
                                ent['html_gcs_path'] = None

                                valid_entities.append(ent)
                            return valid_entities
                        return []
                    elif response.status == 429:
                        wait = (2 ** attempt) + random.uniform(5, 15)
                        await asyncio.sleep(wait)
                        continue
            except:
                await asyncio.sleep(15)
        return []

# ------------------------------------------------------------------------------
# 4. MAIN WORKFLOW
# ------------------------------------------------------------------------------
async def main_definitive():
    # 1. Select Tags
    tag_files = [b.name for b in bucket.list_blobs(prefix="Tags/") if b.name.endswith('.txt')]
    print("\n--- SELECT TAG LIST ---")
    for i, f in enumerate(tag_files): print(f"[{i+1}] {f.split('/')[-1]}")
    t_idx = int(input("Choice: ")) - 1
    tags_text = gcs_read_text(tag_files[t_idx])
    tags_set = set(l.strip() for l in tags_text.splitlines() if l.strip())

    # 2. Select Textbook
    content_files = [b.name for b in bucket.list_blobs(prefix=PATHS['gcs_content_textbooks']) if "_CONTENT.json" in b.name]
    if not content_files: print("‚ùå No CONTENT files found."); return

    print("\n--- SELECT TEXTBOOK ---")
    for i, f in enumerate(content_files): print(f"[{i+1}] {f.split('/')[-1]}")
    c_idx = int(input("Choice: ")) - 1

    content_path = content_files[c_idx]
    book_base = content_path.split('/')[-1].replace("_CONTENT.json", "")
    fig_path = content_path.replace("_CONTENT.json", "_FIGURES.json")
    final_path = f"{PATHS['gcs_content_textbooks']}/{book_base}_MASTER.json"

    print(f"\nüöÄ Processing: {book_base}")
    raw_content = gcs_load_json(content_path)
    raw_figures = gcs_load_json(fig_path)
    raw_content.sort(key=lambda x: x['page_number'])

    # 3. Resume Check
    master_kb = []
    if bucket.blob(final_path).exists():
        print(f"\n‚ö†Ô∏è Existing MASTER file found.")
        choice = input("Type 'RESUME' to continue or 'RESTART' to overwrite: ").strip().upper()
        if choice == 'RESUME':
            master_kb = gcs_load_json(final_path)
            print(f"   -> Resuming with {len(master_kb)} existing entities.")
        else:
            master_kb = []

    # 4. Chunking
    chunks = []
    total_pages = len(raw_content)
    for i in range(0, total_pages, PAGES_PER_CHUNK):
        end_idx = min(i + PAGES_PER_CHUNK + PAGE_OVERLAP, total_pages)
        chunks.append(raw_content[i : end_idx])
    print(f"üì¶ Total Chunks: {len(chunks)}")

    # 5. Batched Execution
    sem = asyncio.Semaphore(CONCURRENCY_LIMIT)

    for i in range(0, len(chunks), BATCH_SIZE):
        batch = chunks[i : i + BATCH_SIZE]
        print(f"\n--- Batch {i//BATCH_SIZE + 1} ---")

        async with aiohttp.ClientSession() as session:
            tasks = []
            for chunk in batch:
                page_nums = {p['page_number'] for p in chunk}
                chunk_figs = [f for f in raw_figures if f['source_page'] in page_nums]
                tasks.append(process_textbook_chunk(session, chunk, chunk_figs, tags_text, tags_set, sem))

            results = await tqdm_asyncio.gather(*tasks)

            new_count = 0
            for res_list in results:
                master_kb.extend(res_list)
                new_count += len(res_list)

            if new_count > 0:
                gcs_upload_json(master_kb, final_path)
                print(f"üíæ Saved (+{new_count})")

    # 6. Final Dedupe
    print("\nüßπ Final Deduplication...")
    unique_kb = []
    seen = set()
    for ent in master_kb:
        def_text = ent.get('definition') or ""
        key = (ent.get('entity_name'), def_text[:50])
        if key not in seen:
            unique_kb.append(ent)
            seen.add(key)

    gcs_upload_json(unique_kb, final_path)
    print(f"\n‚úÖ DONE: {final_path}")
    print(f"üìä Final Entities: {len(unique_kb)}")

await main_definitive()


--- SELECT TAG LIST ---
[1] BST_Tags.txt
[2] Breast_Tags.txt
[3] Endo_Tags.txt
[4] GI_Tags.txt
[5] GYN_Tags.txt
[6] Skin_Tags.txt
Choice: 6

--- SELECT TEXTBOOK ---
[1] BST_Horvai_CONTENT.json
[2] Bone_Atlas_CONTENT.json
[3] Bone_Dorfman_CONTENT.json
[4] Bone_Pattern_CONTENT.json
[5] Breast_Atlas_CONTENT.json
[6] Breast_Biopsy_CONTENT.json
[7] Breast_FAQ_CONTENT.json
[8] Breast_Pattern_CONTENT.json
[9] Cyto_Breast_Yokohama_CONTENT.json
[10] Cyto_Cibas_CONTENT.json
[11] Cyto_Comprehensive_Part_One_CONTENT.json
[12] Cyto_Comprehensive_Part_Two_CONTENT.json
[13] Cyto_GU_Paris_CONTENT.json
[14] Cyto_Gyn_Bethesda_CONTENT.json
[15] Cyto_Milan_CONTENT.json
[16] Cyto_PSC_Lung_CONTENT.json
[17] Cyto_Pattern_CONTENT.json
[18] Cyto_Serous_Fluids_CONTENT.json
[19] Cyto_Thyroid_Bethesda_CONTENT.json
[20] Derm_Barnhill_CONTENT.json
[21] Derm_Elston_CONTENT.json
[22] Derm_Levers_CONTENT.json
[23] Derm_McKeeHY_CONTENT.json
[24] Derm_McKee_CONTENT.json
[25] Derm_McKee_High_Yield_CONTENT.json
[26] Derm

100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 5/5 [08:41<00:00, 104.24s/it]


üíæ Saved progress... (+197 entities)

--- Processing Batch 2/3 ---


100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 5/5 [09:10<00:00, 110.18s/it]


üíæ Saved progress... (+224 entities)

üßπ Final Deduplication...

‚úÖ COMPLETE: gs://pathology-hub-0/_content_library/textbooks/Derm_Barnhill_MASTER.json
üìä Final Count: 421 Entities


In [5]:

--- SELECT TAG LIST ---
[1] BST_Tags.txt
[2] Breast_Tags.txt
[3] Endo_Tags.txt
[4] GI_Tags.txt
[5] GYN_Tags.txt
[6] Skin_Tags.txt
Choice: 6

--- SELECT TEXTBOOK ---
[1] BST_Horvai_CONTENT.json
[2] Bone_Atlas_CONTENT.json
[3] Bone_Dorfman_CONTENT.json
[4] Bone_Pattern_CONTENT.json
[5] Breast_Atlas_CONTENT.json
[6] Breast_Biopsy_CONTENT.json
[7] Breast_FAQ_CONTENT.json
[8] Breast_Pattern_CONTENT.json
[9] Cyto_Breast_Yokohama_CONTENT.json
[10] Cyto_Cibas_CONTENT.json
[11] Cyto_Comprehensive_Part_One_CONTENT.json
[12] Cyto_Comprehensive_Part_Two_CONTENT.json
[13] Cyto_GU_Paris_CONTENT.json
[14] Cyto_Gyn_Bethesda_CONTENT.json
[15] Cyto_Milan_CONTENT.json
[16] Cyto_PSC_Lung_CONTENT.json
[17] Cyto_Pattern_CONTENT.json
[18] Cyto_Serous_Fluids_CONTENT.json
[19] Cyto_Thyroid_Bethesda_CONTENT.json
[20] Derm_Elston_CONTENT.json
[21] Derm_Levers_CONTENT.json
[22] Derm_McKee_CONTENT.json
[23] Derm_McKee_High_Yield_CONTENT.json
[24] Derm_Patterson_CONTENT.json
[25] Derm_Weedon_CONTENT.json
[26] Endo_Atlas_CONTENT.json
[27] GI_Atlas_CONTENT.json
[28] GI_Biopsy_Interpretation_(Neoplastic)_CONTENT.json
[29] GI_Biopsy_Interpretation_(Non_Neoplastic)_CONTENT.json
[30] GI_Intestinal_Atlas1_CONTENT.json
[31] GI_Liver_Macsween_CONTENT.json
[32] GI_Non-Neoplastic_Zhang_CONTENT.json
[33] GU_Biopsy_Interpretation_(Prostate)_CONTENT.json
[34] Gyn_Atlas_Part_One_CONTENT.json
[35] Gyn_Atlas_Part_Two_CONTENT.json
[36] Gyn_Essentials_CONTENT.json
[37] HN_Thompson_CONTENT.json
[38] Peds_Course_review_CONTENT.json
[39] Skin_Elston_CONTENT.json
[40] Skin_Levers_CONTENT.json
[41] SoftTissue_Enzinger_CONTENT.json
[42] SoftTissue_Pattern_CONTENT.json
Choice: 20

üöÄ Processing: Derm_Elston
üîÑ Found existing MASTER file. Resuming...
üì¶ Total Chunks: 12 (~40 pages each)

--- Processing Batch 1/3 ---
100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 5/5 [00:04<00:00,  1.19it/s]
üíæ Saving progress... (+0 entities)

--- Processing Batch 2/3 ---
100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 5/5 [00:03<00:00,  1.30it/s]
üíæ Saving progress... (+0 entities)

--- Processing Batch 3/3 ---
100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 2/2 [00:01<00:00,  1.25it/s]
üíæ Saving progress... (+0 entities)

üßπ Final Deduplication...

‚úÖ COMPLETE: gs://pathology-hub-0/_content_library/textbooks/Derm_Elston_MASTER.json
üìä Final Count: 1078 Entities


--- SELECT TAG LIST ---
[1] BST_Tags.txt
[2] Breast_Tags.txt
[3] Endo_Tags.txt
[4] GI_Tags.txt
[5] GYN_Tags.txt
[6] Skin_Tags.txt
Choice: 6

--- SELECT TEXTBOOK ---
[1] BST_Horvai_CONTENT.json
[2] Bone_Atlas_CONTENT.json
[3] Bone_Dorfman_CONTENT.json
[4] Bone_Pattern_CONTENT.json
[5] Breast_Atlas_CONTENT.json
[6] Breast_Biopsy_CONTENT.json
[7] Breast_FAQ_CONTENT.json
[8] Breast_Pattern_CONTENT.json
[9] Cyto_Breast_Yokohama_CONTENT.json
[10] Cyto_Cibas_CONTENT.json
[11] Cyto_Comprehensive_Part_One_CONTENT.json
[12] Cyto_Comprehensive_Part_Two_CONTENT.json
[13] Cyto_GU_Paris_CONTENT.json
[14] Cyto_Gyn_Bethesda_CONTENT.json
[15] Cyto_Milan_CONTENT.json
[16] Cyto_PSC_Lung_CONTENT.json
[17] Cyto_Pattern_CONTENT.json
[18] Cyto_Serous_Fluids_CONTENT.json
[19] Cyto_Thyroid_Bethesda_CONTENT.json
[20] Derm_Elston_CONTENT.json
[21] Derm_Levers_CONTENT.json
[22] Derm_McKee_CONTENT.json
[23] Derm_McKee_High_Yield_CONTENT.json
[24] Derm_Patterson_CONTENT.json
[25] Derm_Weedon_CONTENT.json
[26] Endo

100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 5/5 [10:43<00:00, 128.77s/it]


üíæ Saving progress... (+105 entities)

--- Processing Batch 2/10 ---


100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 5/5 [12:57<00:00, 155.55s/it]


üíæ Saving progress... (+151 entities)

--- Processing Batch 3/10 ---


100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 5/5 [13:39<00:00, 163.98s/it]


üíæ Saving progress... (+167 entities)

--- Processing Batch 4/10 ---


100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 5/5 [11:33<00:00, 138.77s/it]


üíæ Saving progress... (+112 entities)

--- Processing Batch 5/10 ---


100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 5/5 [10:22<00:00, 124.46s/it]


üíæ Saving progress... (+144 entities)

--- Processing Batch 6/10 ---


100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 5/5 [11:54<00:00, 142.95s/it]


üíæ Saving progress... (+118 entities)

--- Processing Batch 7/10 ---


100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 5/5 [10:06<00:00, 121.32s/it]


üíæ Saving progress... (+105 entities)

--- Processing Batch 8/10 ---


100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 5/5 [12:03<00:00, 144.78s/it]


üíæ Saving progress... (+126 entities)

--- Processing Batch 9/10 ---


100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 5/5 [15:03<00:00, 180.60s/it]


üíæ Saving progress... (+156 entities)

--- Processing Batch 10/10 ---


100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1/1 [01:50<00:00, 110.33s/it]


üíæ Saving progress... (+31 entities)

üßπ Final Deduplication...

‚úÖ DONE: gs://pathology-hub-0/_content_library/textbooks/Derm_McKee_MASTER.json
üìä Final Count: 1207 Entities


# PDF BLOCK 2.1: THE CONSOLIDATOR (Map-Reduce Merge)

In [7]:
# @title {display-mode: "form"}
# ==============================================================================
# BLOCK 3: THE CONSOLIDATOR (Logic-Based Figure Merging)
# ==============================================================================
import json
import asyncio
import aiohttp
import re
from collections import defaultdict
from typing import List, Any
from google.cloud import storage
from tqdm.asyncio import tqdm_asyncio

# --- CONFIGURATION ---
MODEL_NAME = "gemini-3-flash-preview"
CONCURRENCY_LIMIT = 20  # Flash is fast, we can push this

# --- HELPERS ---
def gcs_load_json(blob_path: str) -> List:
    blob = bucket.blob(blob_path)
    return json.loads(blob.download_as_string()) if blob.exists() else []

def gcs_upload_json(data: Any, blob_path: str):
    blob = bucket.blob(blob_path)
    blob.upload_from_string(json.dumps(data, indent=2), content_type='application/json')

# --- PROMPT ---
def construct_merge_prompt(entity_name, fragments_text_only):
    """
    Note: We only send TEXT to the AI. We handle figures in Python.
    """
    return f"""
Role: Medical Data Editor.
Task: Merge these text fragments for "{entity_name}" into ONE comprehensive entry.

INPUT FRAGMENTS:
{json.dumps(fragments_text_only, indent=2)}

INSTRUCTIONS:
1. **Consolidate Text:** Combine 'clinical', 'microscopic', 'definition', etc.
   - DO NOT SUMMARIZE. Preserve all distinct details (e.g., if one fragment lists a stain and another lists a gene, keep BOTH).
2. **Preserve Tags:** Use the most specific tag available.
3. **Output:** A single JSON object.

REQUIRED SCHEMA:
{{
    "entity_name": "{entity_name}",
    "definition": "Merged...",
    "tags": ["..."],
    "html_gcs_path": null,
    "localization": "...",
    "clinical": "...",
    "pathogenesis": "...",
    "macroscopic": "...",
    "microscopic": "...",
    "cytology": "...",
    "ancillary_studies": "...",
    "diagnostic_molecular_pathology": "...",
    "differential_diagnosis": "...",
    "staging": "...",
    "prognosis_and_prediction": "...",
    "subtypes": "...",
    "related_terminology": "...",
    "essential_and_desirable_diagnostic_criteria": "..."
}}
"""

# --- LOGIC WORKER ---
async def merge_entity_group(session, tag, group, sem):
    # 1. HARVEST & DEDUPLICATE FIGURES (Python Logic)
    seen_urls = set()
    all_unique_figures = []

    # We also prepare a "Text Only" version for the AI to save tokens/confusion
    text_only_group = []

    for ent in group:
        # Collect Figures
        if ent.get('related_figures'):
            for fig in ent['related_figures']:
                url = fig.get('gcs_path')
                if url and url not in seen_urls:
                    seen_urls.add(url)
                    all_unique_figures.append(fig)

        # Prepare Text Payload (Strip figures to focus AI on text)
        clean_ent = ent.copy()
        clean_ent.pop('related_figures', None)
        text_only_group.append(clean_ent)

    # 2. AI TEXT MERGE
    async with sem:
        # If only 1 entry, just return it (but ensure figures are clean)
        if len(group) == 1:
            result = group[0]
            result['related_figures'] = all_unique_figures
            return result

        entity_name = group[0].get('entity_name', 'Unknown')
        prompt = construct_merge_prompt(entity_name, text_only_group)

        url = f"https://generativelanguage.googleapis.com/v1beta/models/{MODEL_NAME}:generateContent?key={GEMINI_API_KEY}"
        payload = {"contents": [{"parts": [{"text": prompt}]}]}

        try:
            async with session.post(url, json=payload) as response:
                if response.status == 200:
                    data = await response.json()
                    raw = data['candidates'][0]['content']['parts'][0]['text']
                    match = re.search(r'\{.*\}', raw, re.DOTALL)
                    if match:
                        merged_ent = json.loads(match.group(0))

                        # 3. INJECT FIGURES BACK
                        merged_ent['related_figures'] = all_unique_figures
                        return merged_ent
        except:
            pass

        # Fallback: Return first entry with ALL figures appended
        print(f"‚ö†Ô∏è Merge failed for {entity_name}, stacking figures on first entry.")
        fallback = group[0]
        fallback['related_figures'] = all_unique_figures
        return fallback

# --- MAIN ---
async def main_consolidator():
    # 1. Select Content
    content_files = [b.name for b in bucket.list_blobs(prefix=PATHS['gcs_content_textbooks']) if "_MASTER.json" in b.name and "_CONSOLIDATED" not in b.name]
    if not content_files: print("‚ùå No MASTER files found. Run Block 2 first."); return

    print("\n--- SELECT MASTER FILE TO CONSOLIDATE ---")
    for i, f in enumerate(content_files): print(f"[{i+1}] {f.split('/')[-1]}")
    c_idx = int(input("Choice: ")) - 1

    master_path = content_files[c_idx]
    raw_entities = gcs_load_json(master_path)

    print(f"\nüöÄ Consolidating {len(raw_entities)} entities...")

    # 2. Group by Tag OR Entity Name
    groups = defaultdict(list)
    for ent in raw_entities:
        # Priority: Tag > Entity Name > "Unclassified"
        if ent.get('tags') and len(ent['tags']) > 0:
            key = ent['tags'][0]
        elif ent.get('entity_name'):
            key = ent['entity_name']
        else:
            key = "Unclassified"

        groups[key].append(ent)

    print(f"   -> Found {len(groups)} unique topics.")

    # 3. Process Groups
    sem = asyncio.Semaphore(CONCURRENCY_LIMIT)
    final_kb = []

    async with aiohttp.ClientSession() as session:
        tasks = []
        for key, group in groups.items():
            tasks.append(merge_entity_group(session, key, group, sem))

        results = await tqdm_asyncio.gather(*tasks)
        final_kb = results

    # 4. Save
    out_path = master_path.replace("_MASTER.json", "_CONSOLIDATED.json")
    gcs_upload_json(final_kb, out_path)

    # Count total figures to verify
    total_figs = sum(len(e.get('related_figures', [])) for e in final_kb)

    print(f"\n‚úÖ CONSOLIDATED SAVED: gs://{GCS_BUCKET_NAME}/{out_path}")
    print(f"üìä Entities: {len(raw_entities)} -> {len(final_kb)}")
    print(f"üñºÔ∏è Total Preserved Figures: {total_figs}")

await main_consolidator()


--- SELECT MASTER FILE TO CONSOLIDATE ---
[1] Derm_Barnhill_MASTER.json
[2] Derm_Elston_MASTER.json
[3] Derm_Levers_MASTER.json
[4] Derm_McKeeHY_MASTER.json
[5] Derm_McKee_MASTER.json
[6] Derm_Patterson_MASTER.json
[7] Skin_Elston_MASTER.json
Choice: 1

üöÄ Consolidating 421 entities...
   -> Found 334 unique topics.


100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 334/334 [00:39<00:00,  8.41it/s]



‚úÖ CONSOLIDATED SAVED: gs://pathology-hub-0/_content_library/textbooks/Derm_Barnhill_CONSOLIDATED.json
üìä Entities: 421 -> 334
üñºÔ∏è Total Preserved Figures: 352


# BLOCK 3: SINGLE FILE TRANSFORMER (Target: WHO Schema)

In [8]:
# @title {display-mode: "form"}
# ==============================================================================
# BLOCK 3: SINGLE FILE TRANSFORMER (Target: WHO Schema)
# ==============================================================================
import json
import os
from google.cloud import storage

# --- SETUP ---
if 'PATHS' not in globals():
    raise NameError("‚ùå PATHS not found. Please run Block 0.")

bucket = storage.Client().bucket(PATHS['gcs_bucket'])

# --- TRANSFORMATION LOGIC ---
def transform_to_app_schema(entry, filename):
    """
    Reshapes a single entity into the WHO App format.
    """
    # 1. Clean Metadata / Noise
    if any(x in entry.get('entity_name', '') for x in ["Copyright", "Preface", "Index", "Contributors"]):
        return None

    new_entry = entry.copy()

    # 2. TRANSFORM MEDIA (The Core Requirement)
    # Move 'related_figures' -> 'media' array
    media_list = []

    if 'related_figures' in entry:
        for fig in entry['related_figures']:
            # Construct Legend (Diagnosis + Description)
            diag = fig.get('diagnosis', '').strip()
            desc = fig.get('legend', '').strip()

            # Avoid "Lichen Planus. Lichen Planus." redundancy
            if diag and desc and diag not in desc:
                final_legend = f"{diag}. {desc}"
            elif desc:
                final_legend = desc
            else:
                final_legend = diag

            # Build Object
            media_item = {
                "type": "figure", # Default for extracted images
                "path": fig.get('gcs_path') or fig.get('src'),
                "legend": final_legend
            }

            # Handle WSI (if present from other sources)
            if fig.get('type') == 'wsi' or fig.get('isWSI'):
                media_item['type'] = 'wsi'
                media_item['url'] = fig.get('url') or fig.get('wsi_link')

            media_list.append(media_item)

        # Remove old key
        del new_entry['related_figures']

    new_entry['media'] = media_list

    # 3. ENSURE REQUIRED KEYS (Even if null)
    # This matches the WHO JSON structure you provided
    required_keys = ["video", "html", "wsi"]
    for k in required_keys:
        if k not in new_entry:
            new_entry[k] = None

    # 4. MAP SPECIFIC FIELDS
    # Map internal html_gcs_path to the 'html' field if it exists
    if new_entry.get('html_gcs_path'):
        new_entry['html'] = new_entry['html_gcs_path']
    elif 'html' not in new_entry or new_entry['html'] is None:
        # Optional: Generate a placeholder path if you need one, or keep null
        new_entry['html'] = None

    # 5. CLEANUP INTERNAL KEYS
    keys_to_purge = ['html_gcs_path', 'gcs_origin', 'best_slide_id', 'source_type']
    for k in keys_to_purge:
        if k in new_entry: del new_entry[k]

    return new_entry

# --- MAIN EXECUTION ---
def main_single_transformer():
    # 1. Gather Available Files
    print("üîç Scanning for Master/Consolidated files...")
    all_files = []

    # Check Textbooks
    blobs_t = list(bucket.list_blobs(prefix=PATHS['gcs_content_textbooks']))
    for b in blobs_t:
        if "_MASTER.json" in b.name or "_CONSOLIDATED.json" in b.name:
            all_files.append(b.name)

    # Check Lectures
    blobs_l = list(bucket.list_blobs(prefix=PATHS['gcs_content_lectures']))
    for b in blobs_l:
        if "_MASTER.json" in b.name or "_CONSOLIDATED.json" in b.name:
            all_files.append(b.name)

    if not all_files:
        print("‚ùå No processed files found. Run Block 2 or 3 first.")
        return

    # 2. User Selection
    print("\n--- SELECT FILE TO TRANSFORM ---")
    for i, f in enumerate(all_files):
        print(f"[{i+1}] {f}")

    try:
        idx = int(input("\nEnter number: ")) - 1
        selected_path = all_files[idx]
    except:
        print("‚ùå Invalid selection.")
        return

    # 3. Load & Process
    print(f"\nüöÄ Loading: {selected_path}")
    blob = bucket.blob(selected_path)
    data = json.loads(blob.download_as_string())

    transformed_data = []
    for item in data:
        res = transform_to_app_schema(item, selected_path)
        if res:
            transformed_data.append(res)

    # 4. Save
    # Naming convention: replace _MASTER or _CONSOLIDATED with _APP_READY
    new_filename = selected_path.replace("_MASTER.json", "_APP_READY.json").replace("_CONSOLIDATED.json", "_APP_READY.json")

    print(f"üíæ Saving {len(transformed_data)} entries...")
    out_blob = bucket.blob(new_filename)
    out_blob.upload_from_string(json.dumps(transformed_data, indent=2), content_type='application/json')

    print(f"\n‚úÖ DONE. Output saved to:")
    print(f"   gs://{PATHS['gcs_bucket']}/{new_filename}")

main_single_transformer()

üîç Scanning for Master/Consolidated files...

--- SELECT FILE TO TRANSFORM ---
[1] _content_library/textbooks/Derm_Barnhill_CONSOLIDATED.json
[2] _content_library/textbooks/Derm_Barnhill_MASTER.json
[3] _content_library/textbooks/Derm_Elston_CONSOLIDATED.json
[4] _content_library/textbooks/Derm_Elston_MASTER.json
[5] _content_library/textbooks/Derm_Levers_CONSOLIDATED.json
[6] _content_library/textbooks/Derm_Levers_MASTER.json
[7] _content_library/textbooks/Derm_McKeeHY_MASTER.json
[8] _content_library/textbooks/Derm_McKee_CONSOLIDATED.json
[9] _content_library/textbooks/Derm_McKee_MASTER.json
[10] _content_library/textbooks/Derm_Patterson_CONSOLIDATED.json
[11] _content_library/textbooks/Derm_Patterson_MASTER.json
[12] _content_library/textbooks/Skin_Elston_MASTER.json
[13] _content_library/lectures/Derm_Lecture_30_miscellaneous_MASTER.json

Enter number: 1

üöÄ Loading: _content_library/textbooks/Derm_Barnhill_CONSOLIDATED.json
üíæ Saving 334 entries...

‚úÖ DONE. Output saved to

# VIDEO BLOCK 1

In [None]:
# @title {display-mode: "form"}
# ==============================================================================
# BLOCK 1: LECTURE EXTRACTOR (Whisper + Gemini 3 Flash)
# ==============================================================================
import shutil, cv2, whisper, json, os, io, base64, re, asyncio, aiohttp
import logging
from skimage.metrics import structural_similarity as ssim
from PIL import Image
from tqdm.notebook import tqdm
from tqdm.asyncio import tqdm_asyncio
from google.cloud import storage

# --- CONFIGURATION ---
logging.getLogger("urllib3").setLevel(logging.ERROR)
API_CONCURRENCY_LIMIT = 20
VISION_MODEL = "gemini-3-pro-preview" # Fast & Cheap for per-slide analysis

# --- HELPERS ---
def gcs_upload_file(local_path, blob_path):
    blob = bucket.blob(blob_path)
    blob.upload_from_filename(local_path)

def gcs_upload_json(data, blob_path):
    blob = bucket.blob(blob_path)
    blob.upload_from_string(json.dumps(data, indent=2), content_type='application/json')

def gcs_exists(blob_path):
    return bucket.blob(blob_path).exists()

def get_comparison_frame(frame):
    h, w = frame.shape[:2]
    new_w = 200
    new_h = int(h * (new_w / w))
    small = cv2.resize(frame, (new_w, new_h), interpolation=cv2.INTER_AREA)
    gray = cv2.cvtColor(small, cv2.COLOR_BGR2GRAY)
    return cv2.GaussianBlur(gray, (5, 5), 0)

# --- AI ANALYST ---
async def analyze_slide_async(session, slide_data, local_img_path, sem):
    async with sem:
        if not os.path.exists(local_img_path): return slide_data

        try:
            # Prepare Image
            with Image.open(local_img_path) as img:
                buf = io.BytesIO()
                img.convert("RGB").save(buf, format="JPEG")
                b64_img = base64.b64encode(buf.getvalue()).decode("utf-8")

            url = f"https://generativelanguage.googleapis.com/v1beta/models/{VISION_MODEL}:generateContent?key={GEMINI_API_KEY}"

            # Prompt: Extract raw visual data. We don't need deep reasoning yet, just "What is on this slide?"
            prompt = (
                f"Transcript Context: \"{slide_data['raw_transcript'][:1000]}...\"\n\n"
                "TASK: Analyze this slide image. \n"
                "1. Extract the Title.\n"
                "2. Extract text labels verbatim (e.g. 'CD45', 'H&E', '40x').\n"
                "3. Summarize the visual content (e.g., 'Histology showing...').\n"
                "Return JSON: {\"slide_title\": \"...\", \"key_points\": [\"...\"], \"visual_desc\": \"...\"}"
            )

            payload = {"contents": [{"parts": [{"text": prompt}, {"inline_data": {"mime_type": "image/jpeg", "data": b64_img}}]}]}

            async with session.post(url, json=payload) as res:
                if res.status == 200:
                    dat = await res.json()
                    txt = dat['candidates'][0]['content']['parts'][0]['text']
                    match = re.search(r'\{.*\}', txt, re.DOTALL)
                    if match:
                        slide_data.update(json.loads(match.group(0)))
        except Exception as e:
            pass # Skip frame if AI fails

        return slide_data

# --- PIPELINE ---
async def process_video(video_path, counter, total):
    fname = os.path.basename(video_path)
    lecture_name = os.path.splitext(fname)[0].replace(" ", "_")

    # GCS Paths
    asset_base = f"{PATHS['gcs_asset_lectures']}/{lecture_name}"
    raw_json_path = f"{PATHS['gcs_content_lectures']}/{lecture_name}_RAW.json"

    print(f"\n{'='*60}\nüé• PROCESSING {counter}/{total}: {lecture_name}\n{'='*60}")

    if gcs_exists(raw_json_path):
        print("‚úÖ Already processed in GCS. Skipping.")
        return

    # 1. WHISPER TRANSCRIPTION
    print("üéôÔ∏è Step 1: Whisper Transcription...")
    model = whisper.load_model("base") # Use 'small' if you have GPU RAM, 'base' is fast
    result = model.transcribe(video_path, fp16=False)

    # 2. FRAME EXTRACTION & MERGING
    print("üéûÔ∏è Step 2: Extracting Slides...")
    cap = cv2.VideoCapture(video_path)
    slides = []
    curr_slide = None
    prev_cmp = None

    # We use TQDM to track progress through the audio segments
    for seg in tqdm(result['segments'], desc="Scanning", unit="seg"):
        cap.set(cv2.CAP_PROP_POS_MSEC, seg['start'] * 1000)
        ret, frame = cap.read()
        if not ret: continue

        curr_cmp = get_comparison_frame(frame)

        if curr_slide is None:
            curr_slide = {**seg, 'frame': frame}
            prev_cmp = curr_cmp
            continue

        # SSIM Check (Merge if > 85% similar)
        if ssim(prev_cmp, curr_cmp, data_range=255) >= 0.85:
            curr_slide['text'] += " " + seg['text']
            curr_slide['end'] = seg['end']
        else:
            slides.append(curr_slide)
            curr_slide = {**seg, 'frame': frame}
            prev_cmp = curr_cmp

    if curr_slide: slides.append(curr_slide)
    cap.release()
    print(f"   -> Consolidated into {len(slides)} unique slides.")

    # 3. UPLOAD & PREPARE
    print("‚òÅÔ∏è Step 3: Uploading Images...")
    final_data = []
    local_imgs = {} # Map id -> local path for AI step

    for i, slide in enumerate(slides):
        img_name = f"{lecture_name}_slide_{i+1:04d}.jpg"
        local_p = f"/tmp/{img_name}"
        gcs_p = f"{asset_base}/{img_name}"
        full_uri = f"gs://{GCS_BUCKET_NAME}/{gcs_p}"

        cv2.imwrite(local_p, slide['frame'])

        if not gcs_exists(gcs_p):
            gcs_upload_file(local_p, gcs_p)

        local_imgs[i] = local_p

        final_data.append({
            "id": i,
            "timestamp_start": slide['start'],
            "timestamp_end": slide['end'],
            "raw_transcript": slide['text'].strip(),
            "image_path": full_uri,
            "gcs_path": full_uri, # Important for Block 2
            "slide_title": "",
            "key_points": [],
            "visual_desc": ""
        })

    # 4. GEMINI ENHANCEMENT
    print("üß† Step 4: Gemini Vision Analysis...")
    sem = asyncio.Semaphore(API_CONCURRENCY_LIMIT)
    async with aiohttp.ClientSession() as sess:
        tasks = [analyze_slide_async(sess, d, local_imgs[d['id']], sem) for d in final_data]
        enhanced_data = await tqdm_asyncio.gather(*tasks)

    # 5. SAVE RAW JSON
    gcs_upload_json(enhanced_data, raw_json_path)
    print(f"‚úÖ Saved RAW data: {raw_json_path}")

    # Cleanup
    for p in local_imgs.values():
        if os.path.exists(p): os.remove(p)

# --- RUNNER ---
async def main_lectures():
    vid_files = sorted([f for f in os.listdir(PATHS['source_videos']) if f.lower().endswith(('.mp4', '.mov'))])
    if not vid_files: print("‚ùå No videos found."); return

    print("\n--- AVAILABLE LECTURES ---")
    for i, v in enumerate(vid_files): print(f"[{i+1}] {v}")

    sel = input("\nSelect (e.g. 1, 3-5, or 'all'): ")
    indices = set()
    if sel == 'all': indices = range(len(vid_files))
    else:
        for part in sel.split(','):
            if '-' in part:
                s, e = map(int, part.split('-'))
                indices.update(range(s-1, e))
            elif part.strip().isdigit():
                indices.add(int(part)-1)

    for idx in sorted(list(indices)):
        if 0 <= idx < len(vid_files):
            await process_video(os.path.join(PATHS['source_videos'], vid_files[idx]), idx+1, len(indices))

await main_lectures()

# VIDEO BLOCK 2

In [None]:
# @title {display-mode: "form"}
# ==============================================================================
# BLOCK 2: LECTURE ARCHITECT (Monolith + ID Swap)
# ==============================================================================
import json
import asyncio
import aiohttp
import re
import difflib
import random
from typing import List, Dict, Set, Any
from google.cloud import storage
from tqdm.asyncio import tqdm_asyncio

# --- CONFIGURATION ---
# Using 1.5 Pro for deep reasoning on the whole transcript
MODEL_NAME = "gemini-3-pro-preview"
MAX_RETRIES = 5

# --- HELPERS ---
def gcs_read_text(blob_path: str) -> str:
    blob = bucket.blob(blob_path)
    return blob.download_as_string().decode('utf-8') if blob.exists() else ""

def gcs_load_json(blob_path: str) -> List:
    blob = bucket.blob(blob_path)
    return json.loads(blob.download_as_string()) if blob.exists() else []

def gcs_upload_json(data: Any, blob_path: str):
    blob = bucket.blob(blob_path)
    blob.upload_from_string(json.dumps(data, indent=2), content_type='application/json')

def validate_tag(tag_input: Any, valid_set: Set[str]) -> str:
    if not tag_input: return "Skin::Unclassified"
    tag_str = tag_input[0] if isinstance(tag_input, list) and tag_input else str(tag_input)
    clean = tag_str.strip()
    if clean in valid_set: return clean
    matches = difflib.get_close_matches(clean, list(valid_set), n=1, cutoff=0.7)
    return matches[0] if matches else clean

# --- LOGIC: THE ID SWAPPER ---
def inject_real_paths(entities, slide_lookup_map):
    """
    Replaces AI 'PLACEHOLDER' with real GCS paths based on Slide ID.
    """
    for ent in entities:
        if 'related_figures' in ent:
            for fig in ent['related_figures']:
                # The AI returns "Slide_1", we look up the real object
                slide_id = fig.get('id')

                if slide_id in slide_lookup_map:
                    real_data = slide_lookup_map[slide_id]
                    fig['src'] = real_data['gcs_path']
                    fig['gcs_path'] = real_data['gcs_path']
                    # Append timestamp to legend if missing
                    time_str = f"(Time: {real_data['timestamp_start']:.0f}s)"
                    if time_str not in fig.get('legend', ''):
                        fig['legend'] = f"{fig.get('legend', '')} {time_str}".strip()
                else:
                    fig['src'] = None
                    fig['gcs_path'] = None
    return entities

# ------------------------------------------------------------------------------
# 2. PROMPT ENGINEERING
# ------------------------------------------------------------------------------
def construct_lecture_prompt(transcript_data, valid_tags_list):
    return f"""
Role: You are a Senior Dermatopathologist and Data Engineer.
Objective: Convert the ENTIRE LECTURE provided below into a standardized Knowledge Base.

INPUT DATA:
- Chronological sequence of slides (ID, Visual Description, Transcript).

INSTRUCTIONS:
1. **Consolidate:** Merge discussion across multiple slides into single Disease Entities.
2. **Detail Extraction (CRITICAL):**
   - **Stains (IHC):** List every specific stain mentioned (e.g., "CK20+", "TTF-1 negative").
   - **Genetics:** List specific mutations/translocations.
3. **Figure Linking (ID SWAP):**
   - Select the BEST slides for 'related_figures'.
   - **IMPORTANT:** Use the exact ID provided (e.g., "Slide_5"). Leave `src` and `gcs_path` as "PLACEHOLDER".

REQUIRED JSON SCHEMA:
[
  {{
    "entity_name": "Disease Name",
    "definition": "...",
    "tags": ["Exact_Tag"],
    "html_gcs_path": null,

    "clinical": "...",
    "pathogenesis": "...",
    "macroscopic": "...",
    "microscopic": "...",
    "ancillary_studies": "List ALL stains/molecular details.",
    "differential_diagnosis": "...",
    "staging": "...",
    "prognosis_and_prediction": "...",

    "related_figures": [
        {{
            "id": "Slide_X",
            "src": "PLACEHOLDER",
            "gcs_path": "PLACEHOLDER",
            "diagnosis": "Disease Name",
            "legend": "Specific description of this slide (e.g. 'CK20 dot-like positivity')."
        }}
    ]
  }}
]

REFERENCE TAGS:
{valid_tags_list}

LECTURE CONTENT:
{transcript_data}
"""

# ------------------------------------------------------------------------------
# 3. AI PROCESSING (SINGLE SHOT)
# ------------------------------------------------------------------------------
async def process_full_lecture(session, slides, valid_tags_text, valid_tags_set):
    # 1. Build Monolith Input (Hiding URLs)
    formatted_input = ""
    slide_map = {} # ID -> Real Data

    for s in slides:
        slide_id = f"Slide_{s['id']}"
        slide_map[slide_id] = s

        formatted_input += f"\n--- ID: {slide_id} (Time: {s['timestamp_start']:.0f}s) ---\n"
        formatted_input += f"VISUAL: {s.get('visual_desc', '')}\n"
        formatted_input += f"KEY POINTS: {s.get('key_points', [])}\n"
        formatted_input += f"TRANSCRIPT: {s['raw_transcript']}\n"

    print(f"üì¶ Payload: {len(formatted_input)} chars. Sending to Gemini...")

    url = f"https://generativelanguage.googleapis.com/v1beta/models/{MODEL_NAME}:generateContent?key={GEMINI_API_KEY}"
    payload = {"contents": [{"parts": [{"text": construct_lecture_prompt(formatted_input, valid_tags_text)}]}]}

    for attempt in range(MAX_RETRIES):
        try:
            # 10 min timeout for full lecture processing
            async with session.post(url, json=payload, timeout=600) as response:
                if response.status == 200:
                    data = await response.json()
                    raw_txt = data['candidates'][0]['content']['parts'][0]['text']
                    raw_txt = raw_txt.replace("```json", "").replace("```", "")

                    match = re.search(r'\[.*\]', raw_txt, re.DOTALL)
                    if match:
                        entities = json.loads(match.group(0))

                        # 2. INJECT REAL PATHS
                        entities = inject_real_paths(entities, slide_map)

                        # 3. Validation
                        valid_entities = []
                        for ent in entities:
                            ent['tags'] = [validate_tag(ent.get('tags', []), valid_tags_set)]
                            ent['html_gcs_path'] = None

                            # Null Filling
                            req_keys = ["microscopic", "ancillary_studies", "differential_diagnosis"]
                            for k in req_keys:
                                if k not in ent: ent[k] = None

                            valid_entities.append(ent)

                        return valid_entities
                    return []
                else:
                    print(f"‚ùå API Error {response.status}: {await response.text()}")
                    await asyncio.sleep(5)
        except Exception as e:
            print(f"‚ùå Exception: {e}")
            await asyncio.sleep(5)

    return []

# ------------------------------------------------------------------------------
# 4. MAIN WORKFLOW
# ------------------------------------------------------------------------------
async def main_lecture_definitive():
    # 1. Tags
    tag_files = [b.name for b in bucket.list_blobs(prefix="Tags/") if b.name.endswith('.txt')]
    print("\n--- SELECT TAG LIST ---")
    for i, f in enumerate(tag_files): print(f"[{i+1}] {f.split('/')[-1]}")
    t_idx = int(input("Choice: ")) - 1
    tags_text = gcs_read_text(tag_files[t_idx])
    tags_set = set(l.strip() for l in tags_text.splitlines() if l.strip())

    # 2. Lecture
    raw_files = [b.name for b in bucket.list_blobs(prefix=PATHS['gcs_content_lectures']) if "_RAW.json" in b.name]
    if not raw_files: print("‚ùå No RAW files."); return

    print("\n--- SELECT LECTURE ---")
    for i, f in enumerate(raw_files): print(f"[{i+1}] {f.split('/')[-1]}")
    c_idx = int(input("Choice: ")) - 1

    raw_path = raw_files[c_idx]
    lecture_name = raw_path.split('/')[-1].replace("_RAW.json", "")
    slides_data = gcs_load_json(raw_path)

    print(f"\nüöÄ Processing: {lecture_name}")
    print(f"   Mode: MONOLITH + ID SWAP")

    async with aiohttp.ClientSession() as session:
        master_kb = await process_full_lecture(session, slides_data, tags_text, tags_set)

    if master_kb:
        final_path = f"{PATHS['gcs_content_lectures']}/{lecture_name}_MASTER.json"
        gcs_upload_json(master_kb, final_path)
        print(f"\n‚úÖ MASTER SAVED: gs://{GCS_BUCKET_NAME}/{final_path}")
        print(f"üìä Extracted {len(master_kb)} High-Quality Entities")
    else:
        print("‚ùå Architecture failed.")

await main_lecture_definitive()

# BLOCK 3: THE GRAND UNIFIER (Concatenate All Knowledge)

In [3]:
# @title {display-mode: "form"}
# ==============================================================================
# BLOCK 3: THE GRAND UNIFIER (Concatenate All Knowledge)
# ==============================================================================
import json
import os
from google.cloud import storage
from tqdm.notebook import tqdm

# --- CONFIGURATION ---
OUTPUT_FILENAME = "GLOBAL_KNOWLEDGE_BASE.json"
# Terms to filter out (Final Polish)
BLACKLIST = ["Copyright", "Preface", "Contents", "Index", "Contributors", "Dedication", "Title Page"]

# --- SETUP ---
# Ensure PATHS exists (Run Block 0 if this fails)
if 'PATHS' not in globals():
    raise NameError("‚ùå PATHS not found. Please run Block 0.")

bucket = storage.Client().bucket(PATHS['gcs_bucket'])

# --- HELPERS ---
def get_best_files(prefix):
    """
    Scans a directory and picks the best version of each book/lecture.
    Priority: _CONSOLIDATED.json > _MASTER.json
    """
    blobs = list(bucket.list_blobs(prefix=prefix))
    files_map = {}

    for b in blobs:
        if not b.name.endswith(".json"): continue

        # Parse filename
        fname = b.name.split('/')[-1]
        if "_MASTER" not in fname and "_CONSOLIDATED" not in fname: continue

        # Get the base name (e.g., "Derm_Weedon")
        base_name = fname.replace("_MASTER.json", "").replace("_CONSOLIDATED.json", "")

        # Logic: If we already have a CONSOLIDATED version, keep it.
        # If we see a CONSOLIDATED version now, overwrite whatever we had.
        # Otherwise, take MASTER.
        if "_CONSOLIDATED" in fname:
            files_map[base_name] = b.name
        elif base_name not in files_map:
            files_map[base_name] = b.name

    return files_map

def load_and_tag(blob_name, source_type):
    """Loads JSON and injects source metadata."""
    blob = bucket.blob(blob_name)
    try:
        data = json.loads(blob.download_as_string())

        valid_entries = []
        filename = blob_name.split('/')[-1]
        clean_name = filename.replace("_MASTER.json", "").replace("_CONSOLIDATED.json", "")

        for ent in data:
            # 1. Clean Noise
            if any(x in ent.get('entity_name', '') for x in BLACKLIST):
                continue

            # 2. Inject Metadata (Critical for RAG citations)
            ent['source_document'] = clean_name
            ent['source_type'] = source_type
            ent['gcs_origin'] = f"gs://{PATHS['gcs_bucket']}/{blob_name}"

            valid_entries.append(ent)

        return valid_entries
    except Exception as e:
        print(f"‚ö†Ô∏è Error loading {blob_name}: {e}")
        return []

# --- MAIN EXECUTION ---
def main_unifier():
    print(f"üöÄ Starting Grand Unification...")
    global_kb = []

    # 1. Scan Textbooks
    print("   Scanning Textbooks...")
    textbooks = get_best_files(PATHS['gcs_content_textbooks'])
    print(f"   -> Found {len(textbooks)} unique textbooks.")

    for base, path in tqdm(textbooks.items(), desc="Textbooks"):
        entries = load_and_tag(path, "Textbook")
        global_kb.extend(entries)

    # 2. Scan Lectures
    print("   Scanning Lectures...")
    lectures = get_best_files(PATHS['gcs_content_lectures'])
    print(f"   -> Found {len(lectures)} unique lectures.")

    for base, path in tqdm(lectures.items(), desc="Lectures"):
        entries = load_and_tag(path, "Lecture")
        global_kb.extend(entries)

    # 3. Final Stats
    print(f"\nüìä Total Entities Collected: {len(global_kb)}")

    # 4. Upload
    print(f"üíæ Saving Global Database...")
    final_blob = bucket.blob(OUTPUT_FILENAME)
    final_blob.upload_from_string(json.dumps(global_kb, indent=2), content_type='application/json')

    print(f"\n‚úÖ SUCCESS! The Master of Masters is ready.")
    print(f"   gs://{PATHS['gcs_bucket']}/{OUTPUT_FILENAME}")

main_unifier()

üöÄ Starting Grand Unification...
   Scanning Textbooks...
   -> Found 6 unique textbooks.


Textbooks:   0%|          | 0/6 [00:00<?, ?it/s]

‚ö†Ô∏è Error loading _content_library/textbooks/Skin_Elston_MASTER.json: 'str' object has no attribute 'get'
   Scanning Lectures...
   -> Found 1 unique lectures.


Lectures:   0%|          | 0/1 [00:00<?, ?it/s]


üìä Total Entities Collected: 2153
üíæ Saving Global Database...

‚úÖ SUCCESS! The Master of Masters is ready.
   gs://pathology-hub-0/GLOBAL_KNOWLEDGE_BASE.json


# Utilities

# BLOCK 6: WSI & FIGURE IMPORTER (Fixed Selector + WHO Logic)

In [None]:
# @title {display-mode: "form"}
# ==============================================================================
# BLOCK 6: SMART BATCH IMPORTER (Save to Input Folder)
# ==============================================================================
import json
import asyncio
import aiohttp
import re
import difflib
import random
from google.cloud import storage
from tqdm.asyncio import tqdm_asyncio

# --- CONFIGURATION ---
MODEL_NAME = "gemini-3-flash-preview"
CONCURRENCY_LIMIT = 10  # Moderate concurrency
SOURCE_FOLDER_PREFIX = "WSI_JSON/"

# --- SETUP ---
if 'PATHS' not in globals():
    raise NameError("‚ùå PATHS not found. Please run Block 0.")
bucket = storage.Client().bucket(PATHS['gcs_bucket'])

# --- HELPERS ---
def gcs_read_text(blob_path):
    blob = bucket.blob(blob_path)
    return blob.download_as_string().decode('utf-8') if blob.exists() else ""

def gcs_load_json(blob_path):
    blob = bucket.blob(blob_path)
    return json.loads(blob.download_as_string()) if blob.exists() else []

def gcs_upload_json(data, blob_path):
    blob = bucket.blob(blob_path)
    blob.upload_from_string(json.dumps(data, indent=2), content_type='application/json')

# ------------------------------------------------------------------------------
# 1. NORMALIZERS
# ------------------------------------------------------------------------------
def normalize_who_chapter(entry):
    new_entry = {k: entry.get(k) for k in [
        "entity_name", "definition", "clinical", "pathogenesis",
        "macroscopic", "microscopic", "ancillary_studies",
        "differential_diagnosis", "staging", "prognosis_and_prediction",
        "cytology", "diagnostic_molecular_pathology",
        "related_terminology", "subtypes"
    ]}

    media_list = []
    for fig in entry.get('related_figures', []):
        legend = fig.get('legend', '')
        diag = fig.get('diagnosis', '')
        if diag and diag not in legend:
            legend = f"{diag}. {legend}"
        media_item = {"legend": legend}

        if fig.get('isWSI') is True:
            wsi_id = str(fig.get('id'))
            media_item["type"] = "wsi"
            media_item["path"] = f"https://tumourclassification.iarc.who.int/static/dzi/{wsi_id}_files/10/0_0.jpeg"
            media_item["url"] = f"https://tumourclassification.iarc.who.int/Viewer/Index2?fid={wsi_id}"
        else:
            media_item["type"] = "figure"
            media_item["path"] = fig.get('src') or fig.get('gcs_path')
            if not media_item["path"]: continue
        media_list.append(media_item)

    new_entry['media'] = media_list
    return new_entry

def normalize_simple(entry, source_type):
    thumb = entry.get('Thumbnail')
    if source_type == "PathPresenter": thumb = None
    elif not thumb: thumb = None

    return {
        "entity_name": entry.get('Diagnosis'),
        "media": [{
            "type": "wsi",
            "path": thumb,
            "url": entry.get('URL'),
            "legend": entry.get('Diagnosis')
        }]
    }

def detect_format(filename):
    f = filename.lower()
    if "who" in f: return "WHO"
    if "pp_" in f or "pathpresenter" in f: return "PathPresenter"
    if "leeds" in f: return "Leeds"
    if "mgh" in f: return "MGH"
    if "rosai" in f: return "Rosai"
    return "Unknown"

# ------------------------------------------------------------------------------
# 2. ROBUST AI TAGGING
# ------------------------------------------------------------------------------
async def assign_tag_async(session, entity, valid_tags_text, valid_tags_set, sem):
    async with sem:
        diag = entity.get('entity_name', '')
        if not diag:
            entity['tags'] = ["Skin::Unclassified"]
            return entity

        # Check existing tags first (Save API calls if already tagged properly)
        if entity.get('tags') and isinstance(entity['tags'], list) and entity['tags'][0] in valid_tags_set:
            return entity

        url = f"https://generativelanguage.googleapis.com/v1beta/models/{MODEL_NAME}:generateContent?key={GEMINI_API_KEY}"

        prompt = f"""
        Role: Pathology Taxonomist.
        Task: Map this diagnosis to the EXACT tag from the list.

        DIAGNOSIS: "{diag}"

        INSTRUCTIONS:
        1. Ignore suffixes like "(HE)", "(Actin)", or case numbers.
        2. Find the best match in the VALID TAGS list.
        3. Return ONLY the tag string.

        VALID TAGS:
        {valid_tags_text}
        """

        for attempt in range(3):
            try:
                async with session.post(url, json={"contents": [{"parts": [{"text": prompt}]}]}) as response:
                    if response.status == 200:
                        data = await response.json()
                        tag = data['candidates'][0]['content']['parts'][0]['text'].strip()

                        if tag in valid_tags_set:
                            entity['tags'] = [tag]
                        else:
                            matches = difflib.get_close_matches(tag, list(valid_tags_set), n=1, cutoff=0.7)
                            if matches:
                                entity['tags'] = [matches[0]]
                            elif "::" in tag:
                                entity['tags'] = [tag] # Accept if it looks like a valid hierarchy
                            else:
                                entity['tags'] = ["Skin::Unclassified"]
                        return entity

                    elif response.status == 429:
                        await asyncio.sleep(2 * (attempt + 1))
                        continue
                    else:
                        print(f"‚ùå API Error {response.status}: {await response.text()}")
                        return entity

            except Exception as e:
                await asyncio.sleep(1)

        entity['tags'] = ["Skin::Unclassified"]
        return entity

# ------------------------------------------------------------------------------
# 3. MAIN EXECUTION
# ------------------------------------------------------------------------------
async def main_smart_batch():
    # 1. Tags
    tag_files = [b.name for b in bucket.list_blobs(prefix="Tags/") if b.name.endswith('.txt')]
    print("\n--- SELECT TAG LIST ---")
    for i, f in enumerate(tag_files): print(f"[{i+1}] {f.split('/')[-1]}")
    t_idx = int(input("Choice: ")) - 1
    tags_text = gcs_read_text(tag_files[t_idx])
    tags_set = set(l.strip() for l in tags_text.splitlines() if l.strip())

    # 2. File Scanner
    print(f"\nüîç Scanning {SOURCE_FOLDER_PREFIX} ...")
    all_blobs = list(bucket.list_blobs(prefix=SOURCE_FOLDER_PREFIX))
    candidates = []
    ignore = ["_APP_READY", "_MASTER", "_CONSOLIDATED"]

    for b in all_blobs:
        if not b.name.endswith(".json"): continue
        if any(x in b.name for x in ignore): continue
        candidates.append(b.name)

    if not candidates: print("‚ùå No raw files found."); return

    print("\n--- AVAILABLE FILES ---")
    for i, f in enumerate(candidates): print(f"[{i+1}] {f}")

    sel_input = input("\nSelect files (e.g. 1, 3-5, all): ").strip().lower()

    selected_indices = set()
    if sel_input == 'all':
        selected_indices = set(range(len(candidates)))
    else:
        for p in sel_input.split(','):
            if '-' in p:
                start, end = map(int, p.split('-'))
                selected_indices.update(range(start-1, end))
            elif p.strip().isdigit():
                selected_indices.add(int(p)-1)

    # 3. Process Loop
    for idx in sorted(list(selected_indices)):
        if idx < 0 or idx >= len(candidates): continue

        src_blob = candidates[idx]
        fmt = detect_format(src_blob)

        print(f"\n{'='*60}")
        print(f"üöÄ Processing: {src_blob}")
        print(f"   Detected Format: {fmt}")
        print(f"{'='*60}")

        if fmt == "Unknown":
            print(f"‚ö†Ô∏è Skipping (Unknown Format)")
            continue

        raw_data = gcs_load_json(src_blob)
        normalized_data = []
        for item in raw_data:
            norm = normalize_who_chapter(item) if fmt == "WHO" else normalize_simple(item, fmt)
            if norm:
                for k in ["video", "html", "wsi", "definition", "clinical", "microscopic"]:
                    if k not in norm: norm[k] = None
                normalized_data.append(norm)

        print(f"üß† Tagging {len(normalized_data)} entities...")
        sem = asyncio.Semaphore(CONCURRENCY_LIMIT)
        final_data = []
        async with aiohttp.ClientSession() as session:
            tasks = [assign_tag_async(session, ent, tags_text, tags_set, sem) for ent in normalized_data]
            final_data = await tqdm_asyncio.gather(*tasks)

        # SAVE TO SAME FOLDER
        # Replaces .json with _APP_READY.json, keeping full path
        out_name = src_blob.replace(".json", "_APP_READY.json")
        gcs_upload_json(final_data, out_name)

        valid_count = sum(1 for e in final_data if "Unclassified" not in e['tags'][0])
        print(f"‚úÖ Saved: {out_name}")
        print(f"   Tagged: {valid_count}/{len(final_data)}")

    print("\nüéâ Batch Complete.")

await main_smart_batch()


--- SELECT TAG LIST ---
[1] BST_Tags.txt
[2] Breast_Tags.txt
[3] Endo_Tags.txt
[4] GI_Tags.txt
[5] GYN_Tags.txt
[6] Skin_Tags.txt
Choice: 6

üîç Scanning WSI_JSON/ ...

--- AVAILABLE FILES ---
[1] WSI_JSON/Leeds_WSI_Skin.json
[2] WSI_JSON/PP_Skin_1-500.json
[3] WSI_JSON/PP_Skin_1001-1500.json
[4] WSI_JSON/PP_Skin_1501-2000.json
[5] WSI_JSON/PP_Skin_2001-2500.json
[6] WSI_JSON/PP_Skin_2501-3000.json
[7] WSI_JSON/PP_Skin_3001-3500.json
[8] WSI_JSON/PP_Skin_3501-4000.json
[9] WSI_JSON/PP_Skin_4001-4500.json
[10] WSI_JSON/PP_Skin_4501-5000.json
[11] WSI_JSON/PP_Skin_5001-6000.json
[12] WSI_JSON/PP_Skin_501-1000.json
[13] WSI_JSON/Rosai_Skin_Links.json
[14] WSI_JSON/Skin_MGH_Links.json

Select files (e.g. 1, 3-5, all): 2

üöÄ Processing: WSI_JSON/PP_Skin_1-500.json
   Detected Format: PathPresenter
üß† Tagging 501 entities...


 22%|‚ñà‚ñà‚ñè       | 112/501 [02:31<27:54,  4.31s/it]