<a href="https://colab.research.google.com/github/herndoch/pathology_hub/blob/main/Knowledge_Pipeline_v3_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Block 0: Project Initialization & Setup**
<!--This is the foundational setup block for the entire pipeline. It must be run once at the beginning of every session.

**Core Purpose:**
*   To create a stable, predictable, and maintainable environment for all subsequent processing blocks.

**Key Architectural Decisions & Learnings:**
1.  **Centralized `PATHS` Dictionary:** Early development was plagued by inconsistent, hard-coded file paths, which made the code brittle and difficult to maintain. The solution was to create a single, definitive "address book" called `PATHS`. This global dictionary holds the absolute path for every important folder in the project. **All subsequent blocks MUST reference this `PATHS` dictionary** and should never define their own paths. This was a critical step in professionalizing the pipeline.
2.  **Robust Secret Management:** The script now uses Colab's integrated `userdata` to securely fetch the `GEMINI_API_KEY`. This is superior to pasting the key directly into the code and includes clear error messaging to guide the user if the key is not found.
3.  **Unified Installation:** All necessary libraries for every part of the pipeline (PDF processing, video transcription, AI analysis, etc.) are installed in this single block. This ensures the environment is always complete and avoids dependency issues later on.
4.  **Automatic Directory Creation:** The script verifies that the required Google Drive folder structure exists. If any folders are missing, it creates them automatically, preventing errors caused by a disorganized directory. -->
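
The centralized `PATHS` idea and the automatic directory check (points 1 and 4) can be sketched in isolation. This is a minimal sketch rather than the pipeline's own code: `build_paths` and `ensure_directories` are hypothetical helper names, only a few of the real folders are included, and a temporary folder stands in for the Google Drive project root.

```python
import os
import tempfile


def build_paths(root: str) -> dict:
    """Return one dictionary holding every project folder (hypothetical helper)."""
    return {
        "root": root,
        "source_pdfs": os.path.join(root, "_source_materials", "pdfs"),
        "content_textbooks": os.path.join(root, "_content_library", "textbooks"),
        "outputs": os.path.join(root, "_outputs"),
    }


def ensure_directories(paths: dict) -> list:
    """Create any missing folders and return the keys that had to be created."""
    created = []
    for key, dir_path in paths.items():
        if not os.path.isdir(dir_path):
            os.makedirs(dir_path, exist_ok=True)
            created.append(key)
    return created


# Assumption: a throwaway temp directory replaces the Drive project root.
demo_root = os.path.join(tempfile.mkdtemp(), "Knowledge_Pipeline")
DEMO_PATHS = build_paths(demo_root)

first_run = ensure_directories(DEMO_PATHS)   # creates everything
second_run = ensure_directories(DEMO_PATHS)  # nothing left to create
print("Created on first run:", first_run)
print("Created on second run:", second_run)
```

Because later code only ever looks folders up by key (for example `DEMO_PATHS["outputs"]`), moving the project means changing the root in one place.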

In [None]:
# @title {display-mode: "form"}
# ==============================================================================
# BLOCK 0: PROJECT INITIALIZATION & SETUP (v1.3 - CENTRALIZED PATHS)
# ==============================================================================
#
# PURPOSE:
# This definitive version combines stable secret fetching with a centralized PATHS
# dictionary, creating a robust and easily maintainable "address book" for the
# entire project. All subsequent blocks will reference this PATHS dictionary.
# ==============================================================================

# ------------------------------------------------------------------------------
# STEP 0.1: MOUNT GOOGLE DRIVE & LOAD API KEY
# ------------------------------------------------------------------------------
print("--- STEP 0.1: MOUNTING GOOGLE DRIVE & LOADING API KEY ---")
from google.colab import drive, userdata
import os
import time

try:
    drive.mount('/content/drive', force_remount=True)
    print("✅ Google Drive mounted successfully.")
except Exception as e:
    print(f"❌ ERROR: Could not mount Google Drive. {e}")
    raise SystemExit()

try:
    GEMINI_API_KEY = userdata.get('GEMINI_API_KEY')
    if not GEMINI_API_KEY:
        raise ValueError("API Key is empty or not found in Colab Secrets.")
    print("✅ Successfully fetched Gemini API Key.")
except Exception as e:
    print(f"\n❌ CRITICAL ERROR: Requesting secret GEMINI_API_KEY failed. {e}")
    print("    Please ensure the secret has the EXACT name 'GEMINI_API_KEY' and 'Notebook access' is enabled.")
    raise SystemExit()

# ------------------------------------------------------------------------------
# STEP 0.2: AUTHENTICATE USER & INSTALL LIBRARIES
# ------------------------------------------------------------------------------
print("\n--- STEP 0.2: AUTHENTICATING USER & INSTALLING LIBRARIES ---")
# Note: google_auth is not strictly necessary if using direct API calls with a key,
# but it's good practice to keep for potential future use with other Google Cloud services.
from google.colab import auth as google_auth
import google.auth

google_auth.authenticate_user()

print("\n--> Installing all required libraries...")
!pip install -q -U google-generativeai PyMuPDF git+https://github.com/openai/whisper.git scikit-image sentence-transformers faiss-cpu thefuzz[speedup] requests > /dev/null 2>&1
!sudo apt-get install -y ffmpeg > /dev/null 2>&1
print("✅ All necessary libraries are installed.")

# ------------------------------------------------------------------------------
# STEP 0.3: CONFIGURE GEMINI CLIENT
# ------------------------------------------------------------------------------
print("\n--- STEP 0.3: CONFIGURING GEMINI CLIENT ---")
import google.generativeai as genai

try:
    genai.configure(api_key=GEMINI_API_KEY)
    print("✅ Gemini client configured successfully.")
except Exception as e:
    print(f"\n❌ CRITICAL ERROR: Could not configure Gemini client. Error: {e}")

# ------------------------------------------------------------------------------
# STEP 0.4: DEFINE & VERIFY DIRECTORY STRUCTURE
# ------------------------------------------------------------------------------
print("\n--- STEP 0.4: DEFINING AND VERIFYING PROJECT DIRECTORIES ---")

# Define the single master project root path
KNOWLEDGE_PIPELINE_ROOT = '/content/drive/MyDrive/1-Projects/Knowledge_Pipeline'
print(f"--> Project Root set to: {KNOWLEDGE_PIPELINE_ROOT}")

# The centralized "address book" for the entire project
PATHS = {
    "root":             KNOWLEDGE_PIPELINE_ROOT,
    "source_materials": os.path.join(KNOWLEDGE_PIPELINE_ROOT, '_source_materials'),
    "source_pdfs":      os.path.join(KNOWLEDGE_PIPELINE_ROOT, '_source_materials', 'pdfs'),
    "source_videos":    os.path.join(KNOWLEDGE_PIPELINE_ROOT, '_source_materials', 'videos'),
    "content_library":  os.path.join(KNOWLEDGE_PIPELINE_ROOT, '_content_library'),
    "content_textbooks":os.path.join(KNOWLEDGE_PIPELINE_ROOT, '_content_library', 'textbooks'),
    "content_lectures": os.path.join(KNOWLEDGE_PIPELINE_ROOT, '_content_library', 'lectures'),
    "asset_library":    os.path.join(KNOWLEDGE_PIPELINE_ROOT, '_asset_library'),
    "asset_textbooks":  os.path.join(KNOWLEDGE_PIPELINE_ROOT, '_asset_library', 'textbooks'),
    "asset_lectures":   os.path.join(KNOWLEDGE_PIPELINE_ROOT, '_asset_library', 'lectures'),
    "outputs":          os.path.join(KNOWLEDGE_PIPELINE_ROOT, '_outputs'),
    "notebooks":        '/content/drive/MyDrive/1-Projects/Pathology_Notebook' # Default notebook location
}

# Loop through and create/verify each required directory
print("\n--> Verifying directory structure...")
for key, dir_path in PATHS.items():
    # The 'notebooks' path is external, so we just check if it exists.
    if key == "notebooks":
        if not os.path.exists(dir_path):
            print(f"    - ⚠️ WARNING: The default notebook directory was not found at {dir_path}")
        continue

    # For all other paths within our project, create them if they don't exist.
    if not os.path.exists(dir_path):
        os.makedirs(dir_path)
        print(f"    - Created missing directory: {os.path.relpath(dir_path, '/content/drive/MyDrive/')}")

print("✅ Project directory structure is verified and ready.")

# ------------------------------------------------------------------------------
# FINAL CONFIRMATION
# ------------------------------------------------------------------------------
print("\n" + "="*80)
print("✅ ENVIRONMENT INITIALIZATION COMPLETE!")
print("All subsequent blocks will now use the centralized PATHS dictionary.")
print("="*80)

# **Block 1: Process PDF Textbooks**

<!--**Core Purpose:**
*   To extract all text and image assets from a PDF, clean the text, and use AI to link figures to their correct captions.

**Key Architectural Decisions & Learnings:**
This block has undergone the most extensive debugging and refinement of any part of the pipeline. The final version is a result of overcoming numerous real-world challenges:

1.  **Chunk-Based Structure:** Early versions created one JSON object per page, which lost the hierarchical context of the document. The final architecture is **chunk-based**, creating one JSON object per heading/subheading, so every chunk carries its heading context into a Retrieval-Augmented Generation (RAG) system.
2.  **Strategic AI Model Selection:**
    *   **Text Cleaning:** Uses `gemini-2.5-flash-lite` for its high speed and low cost, which is ideal for this high-volume, relatively simple task.
    *   **Figure Linking:** Uses `gemini-2.5-flash` as a cost-effective but powerful model for the more complex multimodal task of linking a figure to its caption.
3.  **Stability over Convenience (Direct API Calls):** After encountering persistent and cryptic `localhost` errors, a critical decision was made to **abandon the `google-generativeai` client library** in this block. The current version sends direct HTTP requests (via `aiohttp`) for both the text-cleaning and the vision tasks, which proved to be far more reliable in the Colab environment.
4.  **"Fail Fast" Timeout:** To prevent the script from hanging on a single problematic page, the text-cleaning function (`get_gemini_enhanced_text_async`) has a **strict 30-second timeout.** If a page takes too long, it is marked with an error and the script moves on, ensuring the entire process does not stall.
5.  **Intelligent Stage Resuming:** This is the most important stability feature. The script is now idempotent.
    *   It checks if the `_CONTENT.json` file already exists. If it does, the entire time-consuming **Stage A (text extraction and cleaning) is skipped.**
    *   It then checks the `_FIGURES.json` file and will only analyze images for pages that have not already been processed. This allows you to stop and resume a long job without losing any progress.
6.  **Reliable Batch Saving:** The script saves the `_FIGURES.json` file after every 20 analyzed figures and once more at the end of the run. The logging reports the running total even during a resume run, confirming that work is being secured. -->
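
The chunk-based structure from point 1 can be illustrated with a small standalone function. This is a sketch under assumptions, not the block's exact code: `chunk_markdown_by_heading` is a hypothetical name, and it handles a single page of already-cleaned Markdown, whereas the real pipeline carries the current heading across pages.

```python
import re
from typing import Dict, List, Optional


def chunk_markdown_by_heading(markdown: str, page_number: int) -> List[Dict]:
    """Split Markdown into one chunk per `##` / `###` section (hypothetical helper)."""
    chunks: List[Dict] = []
    main_heading: Optional[str] = None
    sub_heading: Optional[str] = None

    def add_chunk(text: str) -> None:
        # Skip empty bodies; tag every chunk with the headings active right now.
        if text.strip():
            chunks.append({
                "page_number": page_number,
                "headings": {"main_heading": main_heading, "sub_heading": sub_heading},
                "content": text.strip(),
            })

    # The capturing group keeps each heading line in the split result:
    # [preamble, heading, body, heading, body, ...]
    parts = re.split(r'^((?:##|###)\s+.*)$', markdown, flags=re.MULTILINE)
    add_chunk(parts[0])
    for i in range(1, len(parts), 2):
        level, title = re.match(r'^(##|###)\s+(.*)', parts[i]).groups()
        if level == "##":
            # A new main heading resets the subheading.
            main_heading, sub_heading = title.strip(), None
        else:
            sub_heading = title.strip()
        add_chunk(parts[i + 1])
    return chunks


sample = (
    "Intro text.\n"
    "## Inflammation\n"
    "Acute overview.\n"
    "### Chemical Mediators\n"
    "Histamine and others.\n"
)
chunks = chunk_markdown_by_heading(sample, page_number=12)
for chunk in chunks:
    print(chunk["headings"], "->", chunk["content"])
```

Each chunk keeps both the main heading and the subheading it sits under, so a retrieved passage can be traced back to its place in the chapter.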

In [None]:
# @title {display-mode: "form"}
# ==============================================================================
# BLOCK 1: PROCESS PDF TEXTBOOKS (v6.2 - FINAL ENHANCED PROMPTS)
# ==============================================================================
#
# PURPOSE:
# This definitive version uses direct HTTP requests for all API calls to bypass
# environment issues. It includes enhanced, detailed prompts for both text
# enhancement and figure analysis to maximize quality and debuggability.
# ==============================================================================

import base64
import fitz
from typing import Dict, List, Optional, Tuple
import re
import os
import json
import asyncio
import aiohttp
from tqdm.asyncio import tqdm_asyncio

# This block assumes Block 0 has been run and its variables are available.

# --- Concurrency Settings ---
TEXT_CONCURRENCY_LIMIT = 20
VISION_CONCURRENCY_LIMIT = 8

# ------------------------------------------------------------------------------
# STEP 1.1: DEFINE ASYNC AI HELPER FUNCTIONS
# ------------------------------------------------------------------------------
print("--- STEP 1.1: DEFINING ASYNC AI HELPER FUNCTIONS (v6.2) ---")

async def get_gemini_enhanced_text_async(
    session: aiohttp.ClientSession,
    page_text: str,
    page_num: int,
    semaphore: asyncio.Semaphore
) -> Tuple[int, str]:
    """Asynchronously cleans text using a raw aiohttp call with a detailed prompt."""
    async with semaphore:
        if not page_text.strip():
            return page_num, ""

        url = f"https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash-lite:generateContent?key={GEMINI_API_KEY}"
        headers = {'Content-Type': 'application/json'}

        prompt = (
            "You are an expert AI assistant that cleans and formats raw OCR text from medical textbooks into high-quality, well-structured Markdown.\n"
            "Perform the following tasks on the RAW TEXT provided:\n"
            "1. **Correct OCR Errors:** Fix common errors like misrecognized characters (e.g., 'f1' instead of 'fi'), run-on words, and incorrect punctuation.\n"
            "2. **Establish Structure:** Identify headings and subheadings. Use `##` for main chapter-level headings and `###` for major subheadings. Use `**bold**` for key terms that are clearly emphasized.\n"
            "3. **Preserve Paragraphs:** Maintain logical paragraph breaks. Do not merge distinct paragraphs.\n"
            "4. **Format Lists:** Correctly format any bulleted or numbered lists.\n"
            "5. **Crucially, do not omit any text, especially figure captions (e.g., 'Figure 12-3 ...') or table titles.** Ensure these are preserved perfectly as they are critical for later processing.\n\n"
            "Return ONLY a single, valid JSON object with one key, `markdown`, containing the cleaned text.\n\n"
            f"RAW TEXT:\n---\n{page_text}"
        )
        payload = {"contents": [{"parts": [{"text": prompt}]}]}

        try:
            async with session.post(url, headers=headers, json=payload, timeout=aiohttp.ClientTimeout(total=30)) as response:
                if response.status == 200:
                    data = await response.json()
                    raw_text = data.get('candidates', [{}])[0].get('content', {}).get('parts', [{}])[0].get('text', '')
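                    # Prefer a fenced ```json block; otherwise take everything from the first '{' to the
                    # last '}' so that braces inside the markdown string do not truncate the JSON.
                    json_match = re.search(r'```json\s*(\{.*?\})\s*```|(\{.*\})', raw_text, re.DOTALL)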
                    if not json_match:
                        raise ValueError("No valid JSON object found in API response.")
                    json_str = json_match.group(1) if json_match.group(1) else json_match.group(2)
                    return page_num, json.loads(json_str).get("markdown", f"Error: Raw text was {page_text}")
                else:
                    error_text = await response.text()
                    return page_num, f"Error on page {page_num}: API returned status {response.status}. Response: {error_text}"
        except Exception as e:
            return page_num, f"Error on page {page_num}: {e}. Raw text: {page_text}"

def image_to_base64(image_path: str) -> Optional[str]:
    """Return the file's contents as a base64 string, or None if the file is missing."""
    try:
        with open(image_path, "rb") as image_file:
            return base64.b64encode(image_file.read()).decode('utf-8')
    except FileNotFoundError:
        return None

async def call_gemini_vision_async(
    session: aiohttp.ClientSession,
    figure_b64: str,
    page_context_b64s: List[str],
    page_context_markdowns: List[str],
    semaphore: asyncio.Semaphore
) -> Optional[Dict]:
    """Asynchronously links figures using a smarter, more descriptive prompt."""
    async with semaphore:
        try:
            model_name = "gemini-2.5-flash"
            url = f"https://generativelanguage.googleapis.com/v1beta/models/{model_name}:generateContent?key={GEMINI_API_KEY}"

            prompt = (
                "You are an expert pathology figure analyst. Your task is to find the specific caption that corresponds to the **TARGET FIGURE** provided at the end.\n"
                "CONTEXT: You will be given images and markdown text from the current page and possibly the previous page of a textbook.\n"
                "INSTRUCTIONS:\n"
                "1. Scrutinize the text to find a caption that clearly corresponds to the TARGET FIGURE (e.g., 'Figure 1.2', 'Fig. 3-4').\n"
                "2. If you find a clear match, return a JSON object with two keys: `figure_id` and `matched_caption`.\n"
                "3. **If you CANNOT find a clear caption, you MUST explain why.** Return a JSON object where `figure_id` is `null` and `matched_caption` is a brief string explaining the reason (e.g., 'No caption-like text found near the image.', 'Text refers to a table, not a figure.', etc.).\n"
                "ALWAYS return a valid JSON object."
            )

            parts = [{"text": prompt}]
            for md in page_context_markdowns: parts.append({"text": md})
            for img_b64 in page_context_b64s: parts.append({"inline_data": {"mime_type": "image/png", "data": img_b64}})
            parts.append({"text": "--- TARGET FIGURE ---"})
            parts.append({"inline_data": {"mime_type": "image/png", "data": figure_b64}})
            payload = {"contents": [{"parts": parts}]}
            headers = {'Content-Type': 'application/json'}

            async with session.post(url, headers=headers, json=payload, timeout=aiohttp.ClientTimeout(total=90)) as response:
                if response.status == 200:
                    data = await response.json()
                    raw_text = data.get('candidates', [{}])[0].get('content', {}).get('parts', [{}])[0].get('text', '')
                    if raw_text:
                        json_match = re.search(r'\{.*\}', raw_text, re.DOTALL)
                        if json_match:
                            return json.loads(json_match.group(0))
                else:
                    print(f"  - WARNING: Vision API returned status {response.status}")
            return None
        except Exception as e:
            print(f"  - CRITICAL: Async Vision API call failed. Error: {e}")
            return None

# ------------------------------------------------------------------------------
# STEP 1.2: DEFINE THE ASYNC PDF PROCESSING PIPELINE
# ------------------------------------------------------------------------------
print("\n--- STEP 1.2: DEFINING THE ASYNC PDF PROCESSING PIPELINE (v6.2) ---")

async def process_single_pdf_async(pdf_path: str, start_page: int, end_page: int):
    pdf_filename = os.path.basename(pdf_path)
    pdf_name_base = os.path.splitext(pdf_filename)[0].replace(' ', '_')
    print(f"\n{'='*80}\n🚀 Starting processing for: {pdf_filename} [Pages {start_page}-{end_page}]\n{'='*80}")
    # Derive every output location from the centralized PATHS dictionary.
    asset_dir = os.path.join(PATHS["asset_textbooks"], pdf_name_base)
    page_images_dir = os.path.join(asset_dir, "page_images")
    figure_images_dir = os.path.join(asset_dir, "figure_images")
    content_json_path = os.path.join(PATHS["content_textbooks"], f"{pdf_name_base}_CONTENT.json")
    figures_json_path = os.path.join(PATHS["content_textbooks"], f"{pdf_name_base}_FIGURES.json")
    os.makedirs(page_images_dir, exist_ok=True)
    os.makedirs(figure_images_dir, exist_ok=True)

    doc = fitz.open(pdf_path)
    total_pages = len(doc)
    actual_end_page = min(end_page, total_pages)

    all_content_chunks = []
    if os.path.exists(content_json_path):
        print("--> Found existing CONTENT file. Loading...")
        with open(content_json_path, 'r', encoding='utf-8') as f:
            all_content_chunks = json.load(f)
    else:
        print("\n--- STAGE A: Enhancing text for all pages concurrently ---")
        text_semaphore = asyncio.Semaphore(TEXT_CONCURRENCY_LIMIT)
        pages_to_process = range(start_page - 1, actual_end_page)
        async with aiohttp.ClientSession() as session:
            tasks = [get_gemini_enhanced_text_async(session, doc.load_page(p).get_text("text"), p + 1, text_semaphore) for p in pages_to_process]
            page_results = await tqdm_asyncio.gather(*tasks, desc=f"📚 Enhancing text for {pdf_filename}")
        current_main_heading, current_sub_heading = None, None
        for page_num, enhanced_markdown in sorted(page_results, key=lambda x: x[0]):
            if enhanced_markdown.startswith("Error"): all_content_chunks.append({"source_document": pdf_filename, "page_number": page_num, "headings": {"main_heading": current_main_heading, "sub_heading": current_sub_heading}, "content": enhanced_markdown}); continue
            parts = re.split(r'(^(?:##|###)\s+.*)', enhanced_markdown, flags=re.MULTILINE)
            if parts[0] and parts[0].strip(): all_content_chunks.append({"source_document": pdf_filename, "page_number": page_num, "headings": {"main_heading": current_main_heading, "sub_heading": current_sub_heading}, "content": parts[0].strip()})
            for i in range(1, len(parts), 2):
                heading_line = parts[i]; content = parts[i+1].strip() if (i + 1) < len(parts) else ""
                match = re.match(r'^(##|###)\s+(.*)', heading_line)
                if match:
                    level, title = match.groups()
                    if level == '##': current_main_heading, current_sub_heading = title.strip(), None
                    else: current_sub_heading = title.strip()
                if content: all_content_chunks.append({"source_document": pdf_filename, "page_number": page_num, "headings": {"main_heading": current_main_heading, "sub_heading": current_sub_heading}, "content": content})
        with open(content_json_path, 'w', encoding='utf-8') as f: json.dump(all_content_chunks, f, indent=4)
        print(f"\n✅ Text content for {len(all_content_chunks)} chunks saved.")

    print("\n--- STAGE C: Analyzing images concurrently ---")
    all_figures_data = []
    if os.path.exists(figures_json_path):
        with open(figures_json_path, 'r', encoding='utf-8') as f: all_figures_data = json.load(f)
        print(f"--> Resuming from {len(all_figures_data)} previously processed figures.")

    # Pages that already have saved figures are skipped on a resume run.
    processed_pages = {fig['source_page'] for fig in all_figures_data}
    vision_tasks = []
    vision_task_metadata = []
    vision_semaphore = asyncio.Semaphore(VISION_CONCURRENCY_LIMIT)

    print("-> Preparing image analysis tasks...")
    for page_num in range(start_page - 1, actual_end_page):
        current_page_idx = page_num + 1
        if current_page_idx in processed_pages: continue
        page = doc.load_page(page_num); images_on_page = page.get_images(full=True)
        if not images_on_page: continue
        page_image_path = os.path.join(page_images_dir, f"page_{current_page_idx:04d}.png")
        if not os.path.exists(page_image_path): page.get_pixmap(dpi=150).save(page_image_path)
        page_context_b64s, page_context_markdowns = [], []
        if page_num > 0:
            prev_page_path = os.path.join(page_images_dir, f"page_{page_num:04d}.png")
            if os.path.exists(prev_page_path):
                if b64 := image_to_base64(prev_page_path): page_context_b64s.append(b64)
                page_context_markdowns.append(f"--- PREVIOUS PAGE MARKDOWN ---\n" + "\n".join([c['content'] for c in all_content_chunks if c['page_number'] == page_num]))
        if b64 := image_to_base64(page_image_path): page_context_b64s.append(b64)
        page_context_markdowns.append(f"--- CURRENT PAGE MARKDOWN ---\n" + "\n".join([c['content'] for c in all_content_chunks if c['page_number'] == current_page_idx]))
        for img_index, img in enumerate(images_on_page):
            try:
                base_image = doc.extract_image(img[0])
                if not base_image or len(base_image["image"]) < 5000: continue
                figure_b64 = base64.b64encode(base_image["image"]).decode('utf-8')
                img_filename = f"{pdf_name_base}_page_{current_page_idx}_img_{img_index + 1}.{base_image['ext']}"; img_save_path = os.path.join(figure_images_dir, img_filename)
                with open(img_save_path, "wb") as f: f.write(base_image["image"])
                vision_tasks.append((figure_b64, page_context_b64s, page_context_markdowns)); vision_task_metadata.append({"source_page": current_page_idx, "image_path": img_save_path})
            except Exception as img_e: print(f"\n  - ⚠️ WARNING: Could not extract/process image {img_index + 1} on page {current_page_idx}. Error: {img_e}")

    if vision_tasks:
        async with aiohttp.ClientSession() as session:
            tasks = [call_gemini_vision_async(session, fig_b64, ctx_b64, ctx_md, vision_semaphore) for fig_b64, ctx_b64, ctx_md in vision_tasks]
            vision_results = await tqdm_asyncio.gather(*tasks, desc=f"🖼️ Analyzing figures for {pdf_filename}")

        for i, result in enumerate(vision_results):
            if result: # We now process every result, even if 'figure_id' is null
                meta = vision_task_metadata[i]
                all_figures_data.append({
                    "source_document": pdf_filename,
                    "source_page": meta["source_page"],
                    "figure_id": result.get('figure_id'),
                    "description": result.get('matched_caption'),
                    "image_path": os.path.relpath(meta["image_path"], PATHS["root"])
                })
                if (i + 1) % 20 == 0:
                    with open(figures_json_path, 'w', encoding='utf-8') as f: json.dump(all_figures_data, f, indent=4)
                    print(f"\n  -> BATCH SAVE: Progress secure. {len(all_figures_data)} total figures found so far.")

    with open(figures_json_path, 'w', encoding='utf-8') as f: json.dump(all_figures_data, f, indent=4)
    print(f"\n✅ Figure analysis complete. Saved {len(all_figures_data)} total linked figures.")
    print(f"\n{'='*80}\n✅ Finished processing: {pdf_filename}\n{'='*80}")

# ------------------------------------------------------------------------------
# INTERACTIVE EXECUTION
# ------------------------------------------------------------------------------
async def main():
    try:
        if 'PATHS' not in globals(): raise NameError("PATHS not defined. Run Block 0.")
        if 'GEMINI_API_KEY' not in globals(): raise NameError("GEMINI_API_KEY not found. Run Block 0.")
        pdf_files = sorted([f for f in os.listdir(PATHS["source_pdfs"]) if f.lower().endswith('.pdf') and not f.startswith('.')])
        if not pdf_files: print(f"❌ No PDFs found in your '{PATHS['source_pdfs']}' folder."); return
        print("\n--- STEP 1.3: SELECT PDF(S) TO PROCESS ---")
        for i, name in enumerate(pdf_files):
            print(f"  [{i+1}] {name}")
        user_input = input("\nEnter number(s) to process (e.g., 1, 3-5 or 'all'): ").strip().lower()
        page_range_input = input("Enter page range to process (e.g., '100-115' or 'all'): ").strip().lower()
        start_page, end_page = 1, float('inf')
        if page_range_input != 'all':
            try: start_page, end_page = map(int, page_range_input.split('-'))
            except ValueError: print("⚠️ Invalid page range format. Defaulting to all pages.")
        selected_indices = set()
        if user_input == 'all': selected_indices = set(range(len(pdf_files)))
        else:
            for part in user_input.split(','):
                part = part.strip()
                if '-' in part:
                    start, end = map(int, part.split('-')); selected_indices.update(range(start - 1, end))
                elif part.isdigit(): selected_indices.add(int(part) - 1)
        for choice in sorted(list(selected_indices)):
            if 0 <= choice < len(pdf_files):
                await process_single_pdf_async(os.path.join(PATHS["source_pdfs"], pdf_files[choice]), start_page, end_page)
            else: print(f"⚠️ Warning: Number {choice + 1} is out of range.")
        print("\n\n🎉 All selected PDF processing jobs are complete!")
    except Exception as e: print(f"\nAn unexpected error occurred: {e}")

# Correct way to run in Colab/Jupyter
await main()

# **Block 2: Process Lecture Videos**
(hidden text below)

<!-- This is the definitive, stable engine for converting video lectures into structured, AI-enriched JSON data.

**Core Purpose:**
*   To transcribe a video lecture, identify and extract each unique slide, and use AI to generate a clean title, summary, and polished transcript for each slide.

**Key Architectural Decisions & Learnings:**
1.  **Mandatory GPU Check:** The most critical lesson learned was the performance difference between CPU and GPU for transcription. Running the Whisper model on a CPU is unacceptably slow. This block now includes a **"pre-flight check"** that automatically verifies a GPU is active. If not, it **halts execution** and provides clear, step-by-step instructions for the user to switch to a GPU runtime. This prevents accidental, time-wasting runs on the wrong hardware.
2.  **Two-Step AI Enhancement:** Early attempts used a single, complex prompt to analyze a slide and generate a summary. This was unreliable. The final architecture uses a much more robust two-step process:
    *   **Step 1 (Vision):** The powerful `gemini-2.5-pro` model performs a simple, factual task: analyzing the slide image to extract the visible title and key text.
    *   **Step 2 (Text):** The fast and cost-effective `gemini-2.5-flash` model takes the factual data from Step 1, combines it with the spoken transcript, and performs the final text-generation and formatting task. Separating these concerns dramatically improved the reliability and quality of the output.
3.  **Direct API Calls:** Similar to Block 1, this block uses direct HTTP calls (via `aiohttp`) for all communication with the Gemini API. This was implemented to bypass the instability and cryptic `localhost` errors that were encountered with the `google-generativeai` client library in the Colab environment.
4.  **Batch Saving & Resuming:** For long lectures with many slides, the script saves its progress to the `final_ENHANCED_data.json` file every 10 slides. If the script is interrupted, it can be re-run, and it will intelligently skip any slides that have already been successfully processed, picking up where it left off.
5.  **Debug Mode (Video Clipping):** To enable rapid testing without processing a multi-hour video, the script includes an interactive prompt for a "debug duration." If a duration is entered (e.g., "5" for 5 minutes), the script uses `ffmpeg` to create a short temporary clip, runs the entire pipeline on that clip, and then automatically deletes it. -->
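
The batch-saving and resuming behaviour from point 4 can be sketched without any video or API calls. This is a minimal sketch under assumptions: `process_with_resume` and `fake_enhance` are hypothetical names, the `slide_id` key is an illustrative stand-in for however the real script identifies a slide, and the worker runs synchronously for clarity.

```python
import json
import os
import tempfile
from typing import Callable, Dict, List


def process_with_resume(
    items: List[Dict],
    output_path: str,
    worker: Callable[[Dict], Dict],
    batch_size: int = 10,
) -> List[Dict]:
    """Process items, skipping those already saved, and save every `batch_size` results."""
    results: List[Dict] = []
    if os.path.exists(output_path):
        with open(output_path, "r", encoding="utf-8") as f:
            results = json.load(f)
    done_ids = {r["slide_id"] for r in results}  # assumption: slides carry a 'slide_id'

    def save() -> None:
        with open(output_path, "w", encoding="utf-8") as f:
            json.dump(results, f, indent=4)

    new_count = 0
    for item in items:
        if item["slide_id"] in done_ids:
            continue  # already processed in an earlier run
        results.append(worker(item))
        new_count += 1
        if new_count % batch_size == 0:
            save()  # periodic checkpoint
    save()  # final save covers the last partial batch
    return results


calls: List[int] = []


def fake_enhance(slide: Dict) -> Dict:
    """Stand-in for the AI enhancement step; records which slides were processed."""
    calls.append(slide["slide_id"])
    return {"slide_id": slide["slide_id"], "summary": f"summary {slide['slide_id']}"}


out_path = os.path.join(tempfile.mkdtemp(), "final_ENHANCED_data.json")
slides = [{"slide_id": i} for i in range(25)]

process_with_resume(slides[:12], out_path, fake_enhance)  # simulate an interrupted run
calls_after_first = len(calls)
final = process_with_resume(slides, out_path, fake_enhance)  # re-run over everything
print(f"Worker calls: {len(calls)} for {len(final)} slides")
```

On the second call only the slides missing from the saved file reach the worker, which is the property that makes re-running after an interruption cheap.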


In [None]:
# @title {display-mode: "form"}
# ==============================================================================
# BLOCK 2: PROCESS LECTURE VIDEOS (v3.2 - STAGE A DEBUG)
# ==============================================================================
#
# SUMMARY:
# This version adds detailed debugging print statements to Stage A to diagnose
# why slide data is not being generated.
#
# ==============================================================================

import shutil
import cv2
import whisper
from skimage.metrics import structural_similarity as ssim
from PIL import Image
import numpy as np
import json
import os
import io
import base64
import re
from typing import Dict, Optional, List

import asyncio
import aiohttp
from tqdm.asyncio import tqdm_asyncio

# This block assumes Block 0 has been run and its variables are available.

# --- Concurrency Settings ---
API_CONCURRENCY_LIMIT = 10

# ------------------------------------------------------------------------------
# STEP 2.1: DEFINE ASYNC HELPER FUNCTIONS
# ------------------------------------------------------------------------------
print("--- STEP 2.1: DEFINING ASYNC HELPER FUNCTIONS (v3.2) ---")

# ... [All helper functions from v3.1 remain the same] ...
async def extract_key_info_from_image_async(session: aiohttp.ClientSession, image_obj: Image.Image) -> Dict:
    try:
        model_name = "gemini-2.5-pro"
        url = f"https://generativelanguage.googleapis.com/v1beta/models/{model_name}:generateContent?key={GEMINI_API_KEY}"
        # Encode the slide as a base64 JPEG for the inline_data payload.
        buffered = io.BytesIO()
        image_obj.convert("RGB").save(buffered, format="JPEG")
        img_str = base64.b64encode(buffered.getvalue()).decode("utf-8")
        prompt = ("Analyze the following slide image...\nReturn ONLY a valid JSON object with two keys: 'slide_title' and 'key_phrases'.")
        payload = {"contents": [{"parts": [{"text": prompt}, {"inline_data": {"mime_type": "image/jpeg", "data": img_str}}]}]}
        headers = {'Content-Type': 'application/json'}
        async with session.post(url, headers=headers, json=payload, timeout=aiohttp.ClientTimeout(total=90)) as response:
            if response.status == 200:
                data = await response.json()
                raw_text = data.get('candidates', [{}])[0].get('content', {}).get('parts', [{}])[0].get('text', '')
                json_match = re.search(r'\{.*\}', raw_text, re.DOTALL)
                if json_match: return json.loads(json_match.group(0))
            else: print(f"  - ⚠️ VISION API ERROR: Status {response.status}, Response: {await response.text()}")
        return {"slide_title": "Error - API Call Failed", "key_phrases": []}
    except Exception as e: return {"slide_title": "Error - Vision Step Exception", "key_phrases": [str(e)]}

async def generate_final_json_from_text_async(session: aiohttp.ClientSession, info: Dict, raw_transcript: str) -> Dict:
    try:
        model_name = 'gemini-2.5-flash'; url = f"https://generativelanguage.googleapis.com/v1beta/models/{model_name}:generateContent?key={GEMINI_API_KEY}"; title = info.get('slide_title', 'N/A'); key_phrases = ", ".join(info.get('key_phrases', [])); prompt = (f"You are an AI assistant... CONTEXT:\n- Slide Title: '{title}'\n- Key Phrases on Slide: '{key_phrases}'\n- Spoken Transcript: '{raw_transcript}'\n\nReturn ONLY a valid JSON object with three keys: `title`, `cleaned_transcript`, and `summary`."); payload = {"contents": [{"parts": [{"text": prompt}]}]}; headers = {'Content-Type': 'application/json'}
        async with session.post(url, headers=headers, json=payload, timeout=60) as response:
            if response.status == 200:
                data = await response.json(); raw_text = data.get('candidates', [{}])[0].get('content', {}).get('parts', [{}])[0].get('text', ''); json_match = re.search(r'\{.*\}', raw_text, re.DOTALL)
                if json_match: return json.loads(json_match.group(0))
            else: print(f"  - ⚠️ TEXT API ERROR: Status {response.status}, Response: {await response.text()}")
        return {"title": "Error - Text Step Failed", "cleaned_transcript": raw_transcript, "summary": "API Call Failed"}
    except Exception as e: return {"title": "Error - Text Step Exception", "cleaned_transcript": raw_transcript, "summary": str(e)}

async def enhance_single_slide_async(session: aiohttp.ClientSession, segment: Dict, semaphore: asyncio.Semaphore) -> Dict:
    async with semaphore:
        full_image_path = os.path.join(PATHS["root"], segment['image_path'])
        if not os.path.exists(full_image_path): segment.update({"title": "Error - Image Missing", "summary": "Image file not found."}); return segment
        try:
            with Image.open(full_image_path) as img:
                key_info = await extract_key_info_from_image_async(session, img);
                if "Error" in key_info.get("slide_title", ""): key_info = {"slide_title": "Title Unknown", "key_phrases": []}
                enhancements = await generate_final_json_from_text_async(session, key_info, segment['raw_transcript'])
                segment.update(enhancements)
        except Exception as e: segment.update({"title": "Error - Unhandled Exception", "summary": str(e)})
        return segment

# ------------------------------------------------------------------------------
# STEP 2.2: DEFINE THE MAIN PROCESSING PIPELINE
# ------------------------------------------------------------------------------
print("\n--- STEP 2.2: DEFINING THE MAIN PROCESSING PIPELINE (v3.2) ---")

async def process_single_video_async(video_path: str, debug_duration_seconds: Optional[int] = None):
    video_filename = os.path.basename(video_path)
    lecture_name = os.path.splitext(video_filename)[0].rsplit('_Part', 1)[0].rsplit(' Part', 1)[0]
    duration_info = f"for the first {debug_duration_seconds // 60} minutes" if debug_duration_seconds else "for the full duration"
    print(f"\n{'='*80}\n🚀 Starting processing for lecture: {lecture_name} ({duration_info})\n{'='*80}")

    # --- Path setup ---
    lecture_content_dir = os.path.join(PATHS['content_lectures'], lecture_name); lecture_asset_dir = os.path.join(PATHS['asset_lectures'], lecture_name); slide_images_dir = os.path.join(lecture_asset_dir, "slide_images"); structured_json_path = os.path.join(lecture_content_dir, 'final_structured_data.json'); enhanced_json_path = os.path.join(lecture_content_dir, 'final_ENHANCED_data.json'); os.makedirs(slide_images_dir, exist_ok=True); os.makedirs(lecture_content_dir, exist_ok=True)

    if os.path.exists(enhanced_json_path):
        action = input(f"Lecture ('{lecture_name}') has a final file. [S]kip, or [F]orce Re-enhancement? ").lower().strip()
        if action == 's': print("--> Skipping lecture."); return
        elif action != 'f': print("--> Invalid choice. Skipping lecture."); return
        else: print("--> Force Re-enhancement selected.")

    temp_video_path = None; original_video_path = video_path
    if debug_duration_seconds:
        temp_video_path = os.path.join(lecture_asset_dir, f"temp_clip_{video_filename}"); print(f"--> Creating temporary {debug_duration_seconds // 60}-minute clip...")
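        exit_code = os.system(f'ffmpeg -y -i "{original_video_path}" -t {debug_duration_seconds} -c copy "{temp_video_path}" > /dev/null 2>&1')
        # Do not report success (or continue with a missing clip) if ffmpeg failed.
        if exit_code != 0 or not os.path.exists(temp_video_path):
            print("❌ ERROR: ffmpeg failed to create the temporary clip. Skipping lecture.")
            if os.path.exists(temp_video_path): os.remove(temp_video_path)
            return
        video_path = temp_video_path; print("--> Clip created.")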

    try:
        # --- STAGE A (SYNCHRONOUS): TRANSCRIPTION & SLIDE EXTRACTION ---
        final_structured_data = []
        if not os.path.exists(structured_json_path):
            print(f"\n--- STAGE A: TRANSCRIBING & EXTRACTING SLIDES ---")
            print(f"[Whisper] Starting transcription for: {os.path.basename(video_path)}...")
            temp_screenshots_dir = os.path.join(lecture_asset_dir, 'screenshots_raw_temp'); os.makedirs(temp_screenshots_dir, exist_ok=True)
            try:
                model = whisper.load_model("base"); result = model.transcribe(video_path, fp16=False)
                print("--> Transcription complete. Extracting screenshots...")

                # --- NEW DEBUGGING PRINTS ---
                print(f"DEBUG: Whisper found {len(result['segments'])} transcript segments.")

                video_capture = cv2.VideoCapture(video_path); video_data = []
                for segment in result["segments"]:
                    screenshot_path = os.path.join(temp_screenshots_dir, f"frame_{segment['id']:04d}.jpg")
                    video_capture.set(cv2.CAP_PROP_POS_MSEC, segment['start'] * 1000)
                    success, frame = video_capture.read()
                    if success:
                        cv2.imwrite(screenshot_path, frame)
                        segment['screenshot_path'] = screenshot_path
                        video_data.append(segment)
                video_capture.release()

                # --- NEW DEBUGGING PRINTS ---
                print(f"DEBUG: Successfully extracted {len(video_data)} video frames to match segments.")

                print("[Consolidator] Grouping transcript based on slide changes...")
                consolidated_data = []
                if video_data:
                    current_group = {**video_data[0], 'consolidated_text': video_data[0]['text'].strip()}; prev_img = cv2.imread(video_data[0]['screenshot_path'])
                    if prev_img is not None:
                        prev_img_gray = cv2.cvtColor(prev_img, cv2.COLOR_BGR2GRAY)
                        for entry in video_data[1:]:
                            curr_img = cv2.imread(entry['screenshot_path'])
                            if curr_img is None: continue
                            curr_img_gray = cv2.cvtColor(cv2.resize(curr_img, (prev_img_gray.shape[1], prev_img_gray.shape[0])), cv2.COLOR_BGR2GRAY)
                            score, _ = ssim(prev_img_gray, curr_img_gray, full=True)
                            if score >= 0.90: current_group['consolidated_text'] += " " + entry['text'].strip(); current_group['end'] = entry['end']; current_group['screenshot_path'] = entry['screenshot_path']
                            else: consolidated_data.append(current_group); current_group = {**entry, 'consolidated_text': entry['text'].strip()}
                            prev_img_gray = curr_img_gray
                        consolidated_data.append(current_group)

                # --- NEW DEBUGGING PRINTS ---
                print(f"DEBUG: Consolidated into {len(consolidated_data)} unique slides.")

                for i, segment in enumerate(consolidated_data):
                    dest_path = os.path.join(slide_images_dir, f"slide_{i:04d}.jpg"); shutil.copy(segment['screenshot_path'], dest_path)
                    final_structured_data.append({"segment_id": i, "start_time": segment['start'], "end_time": segment['end'],"raw_transcript": segment['consolidated_text'],"image_path": os.path.relpath(dest_path, PATHS["root"])})

                with open(structured_json_path, 'w', encoding='utf-8') as f: json.dump(final_structured_data, f, indent=2)
                print("--> Slide consolidation complete.")
            except Exception as e: print(f"❌ ERROR during Stage A for {video_filename}: {e}"); return
            finally: shutil.rmtree(temp_screenshots_dir, ignore_errors=True)
        else:
            print("\n--- STAGE A: SKIPPED (Found Existing Data) ---")
            with open(structured_json_path, 'r', encoding='utf-8') as f: final_structured_data = json.load(f)

        # --- STAGE B (ASYNCHRONOUS): CONCURRENT DATA ENHANCEMENT ---
        print(f"\n--- STAGE B: ENHANCING DATA CONCURRENTLY ---")
        if not final_structured_data: print("❌ No data to enhance. Halting."); return
        data_to_enhance = [s for s in final_structured_data if "title" not in s or "Error" in s.get("title", "")]; processed_segments = [s for s in final_structured_data if s not in data_to_enhance]
        if not data_to_enhance: print("--> All slides already enhanced. Skipping API calls."); final_data = processed_segments
        else:
            print(f"[Enhancer] Starting concurrent API calls for {len(data_to_enhance)} slides...")
            semaphore = asyncio.Semaphore(API_CONCURRENCY_LIMIT);
            async with aiohttp.ClientSession() as session:
                tasks = [enhance_single_slide_async(session, segment, semaphore) for segment in data_to_enhance]
                enhancement_results = await tqdm_asyncio.gather(*tasks, desc=f"Enhancing {lecture_name}")
            enhanced_results_dict = {res['segment_id']: res for res in enhancement_results}; final_data = []
            for segment in final_structured_data:
                if segment['segment_id'] in enhanced_results_dict: final_data.append(enhanced_results_dict[segment['segment_id']])
                else: final_data.append(segment)
            with open(enhanced_json_path, 'w', encoding='utf-8') as f: json.dump(final_data, f, indent=2)
            print(f"\n--> All {len(enhancement_results)} slides enhanced and saved.")
        error_count = sum(1 for s in final_data if "Error" in s.get("title", "")); print(f"\n{'='*80}\n✅ Finished processing for lecture: {lecture_name}")
        if error_count > 0: print(f"⚠️  Completed with {error_count} errors.")
        print(f"{'='*80}")
    finally:
        if temp_video_path and os.path.exists(temp_video_path): os.remove(temp_video_path); print(f"--> Temporary clip deleted.")

# --- INTERACTIVE EXECUTION ---
async def main():
    try:
        if 'PATHS' not in globals(): raise NameError("PATHS not defined.")
        if 'GEMINI_API_KEY' not in globals() or not GEMINI_API_KEY: raise NameError("GEMINI_API_KEY not found.")
        video_files = sorted([f for f in os.listdir(PATHS["source_videos"]) if not f.startswith('.') and f.lower().endswith(('.mp4', '.mov', '.avi', '.mkv'))])
        if not video_files: print(f"❌ No videos found in '{PATHS['source_videos']}'.")
        else:
            print("\n--- SELECT VIDEO(S) TO PROCESS ---")
            for i, name in enumerate(video_files): print(f"  [{i+1}] {name}")
            user_input = input("\nEnter number(s) to process (e.g., 1, 3 or 'all'): ").strip().lower()
            duration_input = input("Enter duration in minutes for debugging (e.g., '5' or 'all'): ").strip().lower()
            debug_duration = None
            if duration_input != 'all':
                try: debug_duration = int(duration_input) * 60
                except ValueError: print("⚠️ Invalid duration. Defaulting to full video.")
            selected_indices = range(len(video_files)) if user_input == 'all' else [int(c.strip()) - 1 for c in user_input.split(',')]
            for choice in sorted(list(set(selected_indices))):
                if 0 <= choice < len(video_files):
                    selected_video_path = os.path.join(PATHS["source_videos"], video_files[choice])
                    await process_single_video_async(selected_video_path, debug_duration)
                else: print(f"⚠️ Warning: Number {choice + 1} is out of range.")
            print("\n\n🎉 All selected video processing jobs are complete!")
    except Exception as e: print(f"\nAn unexpected error occurred: {e}")

# Correct way to run in Colab/Jupyter
await main()

--- STEP 2.1: DEFINING ASYNC HELPER FUNCTIONS (v3.2) ---

--- STEP 2.2: DEFINING THE MAIN PROCESSING PIPELINE (v3.2) ---

--- SELECT VIDEO(S) TO PROCESS ---
  [1] BST_Lecture_1_Grossing.mp4
  [2] BST_Lecture_2_SoftTissue1.mp4
  [3] BST_Lecture_3_SofTissue2.mp4
  [4] BST_Lecture_4_SoftTissue3.mp4
  [5] BST_Lecture_5_Bone1.mp4
  [6] BST_Lecture_6_Bone2.mp4
  [7] Breast_Lecture_Epithelial Part 1_Chen.mp4
  [8] Breast_Lecture_Fibroepithelial.mp4
  [9] Breast_Lecture_Grossing.mp4
  [10] Breast_Lecture_IHC.mp4
  [11] Breast_Lecture_Invasive.mp4
  [12] Breast_Lecture_Lobular.mp4
  [13] Breast_Lecture_Normal.mp4
  [14] Breast_Lecture_Papillary.mp4
  [15] Breast_Lecture_Prognostics.mp4
  [16] Breast_Lecture_Rad-Path.mp4
  [17] Breast_Lecture_SpindleCell.mp4
  [18] Breast_Lecture_Treated.mp4
  [19] GI_Lecture_0_Gross_Liver.mp4
  [20] GI_Lecture_10_Colon.mp4
  [21] GI_Lecture_11_Esophagus.mp4
  [22] GI_Lecture_12_SmallIntestine.mp4
  [23] GI_Lecture_13_Stomach2.mp4
  [24] GI_Lecture_1_Liver1.mp4


Enhancing Other_Heme_Lecture_AML Plasma BMFailure BMSystemic: 100%|██████████| 46/46 [01:28<00:00,  1.93s/it]



--> All 46 slides enhanced and saved.

✅ Finished processing for lecture: Other_Heme_Lecture_AML Plasma BMFailure BMSystemic
--> Temporary clip deleted.

🚀 Starting processing for lecture: Other_Skin_Lecture_10_Blisters (for the first 25 minutes)
--> Creating temporary 25-minute clip...
--> Clip created.

--- STAGE A: TRANSCRIBING & EXTRACTING SLIDES ---
[Whisper] Starting transcription for: temp_clip_Other_Skin_Lecture_10_Blisters.mp4...
--> Transcription complete. Extracting screenshots...
DEBUG: Whisper found 363 transcript segments.
DEBUG: Successfully extracted 363 video frames to match segments.
[Consolidator] Grouping transcript based on slide changes...
DEBUG: Consolidated into 269 unique slides.
--> Slide consolidation complete.

--- STAGE B: ENHANCING DATA CONCURRENTLY ---
[Enhancer] Starting concurrent API calls for 269 slides...


Enhancing Other_Skin_Lecture_10_Blisters: 100%|██████████| 269/269 [06:41<00:00,  1.49s/it]


--> All 269 slides enhanced and saved.

✅ Finished processing for lecture: Other_Skin_Lecture_10_Blisters
--> Temporary clip deleted.


🎉 All selected video processing jobs are complete!





# Block 3: RAG Logic
<!--### 📚 For Textbooks and Figures (`_CONTENT.json` & `_FIGURES.json`)
This content is treated as dense, complex, and potentially ambiguous, so it goes through the entire intelligent funnel:

1.  **AI Classifier:** First, it filters out any non-pathologic content (like normal histology).
2.  **AI Entity Extractor:** It extracts the single most specific term so that the search query is precise.
3.  **Fuzzy Match & AI Verification (Flash):** It fuzzy-matches the query against the notebook headings. Matches scoring ≥ 95% are linked directly; plausible matches (85% to 94%) are checked by the fast Flash model.
4.  **Automatic Re-Search:** If Flash rejects the match, it re-searches with the specific term and links the result only if it scores ≥ 90%.
5.  **Final Pro Arbitration:** If the re-search also fails, it escalates to the more capable Pro model for a final, autonomous decision.

### 🎤 For Lectures (`_ENHANCED_data.json`)
Lectures are processed with a more direct approach because the content is more linear and less ambiguous.

1.  **No Classifier:** The script assumes all slides in a lecture are relevant and processes everything.
2.  **Simpler Matching:** It uses the AI Entity Extractor to get a good query from the slide title, but it only auto-links if the fuzzy match score is very high (`≥ 90%`).
3.  **No AI Verification/Arbitration:** It does not have the complex, multi-step verification and escalation process. If the initial high-confidence match fails, it moves on.

The schematic below visualizes the complex funnel used for **textbooks and figures**.-->

```
                  [ Start Processing New Item ]
                             |
                             V
              [ Pre-Filter (Check Length & Cache) ]
                             |
                             V
                       < Is Content Long Enough? > --No--> [ SKIP ]
                             |
                             Yes
                             |
                             V
              [ AI Classifier (Flash) Classifies Content ]
                             |
                             V
                       < Is it 'Pathologic'? > -----No--> [ SKIP ]
                             |
                             Yes
                             |
                             V
   [ AI Entity Extractor (Flash) Finds Most Specific Term ]
                             |
                             V
               [ Create Final, Focused Search Query ]
                             |
                             V
< Query Matches "Sticky Context" (Last Known Topic) with >=85% Confidence? >
       |                                                            |
       Yes                                                          No
       |                                                            |
       +------> [ LINK to Sticky Context ]                          V
                                                      [ Perform Fuzzy Search on ALL Headings ]
                                                                    |
                                                                    V
                                                      < Is Match Score >= 95%? > --Yes--> [ LINK (High Confidence) ]
                                                                    |
                                                                    No
                                                                    |
                                                                    V
                                                      < Is Match Score >= 85%? > --No--> [ SKIP (Too Ambiguous) ]
                                                                    |
                                                                    Yes (Plausible Match)
                                                                    |
                                                                    V
                                            [ AI Verifier (Flash) Checks the Plausible Match ]
                                                                    |
                                                                    V
                                                      < Did Flash VERIFY the Match? > --Yes--> [ LINK (Flash Verified) ]
                                                                    |
                                                                    No (Flash REJECTED)
                                                                    |
                                                                    V
                                            [ Attempt Automatic Re-Search with Specific Term ]
                                                                    |
                                                                    V
                                            < New Search Found a Match with >=90% Confidence? >
                                                  |                                     |
                                                  Yes                                   No
                                                  |                                     |
                                                  +--> [ LINK (Auto-Corrected) ]        V
                                                                          [ FINAL ARBITRATION: Re-verify Original Match ]
                                                                          [      Using Powerful PRO Model              ]
                                                                                        |
                                                                                        V
                                                                          < Did Pro VERIFY the Match? >
                                                                              |                     |
                                                                              Yes                   No
                                                                              |                     |
                                                                              V                     V
                                                                 [ LINK (Pro Arbitration) ]     [ SKIP (Final Rejection) ]
```
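
The matching half of the flowchart (everything after the search query is built) can be read as a chain of threshold checks. The following is a minimal sketch of that chain, not the pipeline's actual implementation: the function name `decide_link`, the constant names, and the callable parameters are all hypothetical, and only the threshold values are taken from the diagram. The AI steps are passed in as callables so that the more expensive checks run only when an earlier branch has not already decided the outcome.

```python
from typing import Callable, Optional

# Thresholds read off the flowchart above; the constant names are illustrative.
STICKY_MIN = 85           # query vs. "sticky context" (last known topic)
HIGH_CONFIDENCE_MIN = 95  # fuzzy match linked without verification
PLAUSIBLE_MIN = 85        # below this the match is skipped as too ambiguous
RESEARCH_MIN = 90         # automatic re-search with the specific term


def decide_link(
    sticky_score: Optional[int],
    best_score: int,
    flash_verifies: Callable[[], bool],
    research_score: Callable[[], int],
    pro_verifies: Callable[[], bool],
) -> str:
    """Return the flowchart outcome for one item (hypothetical helper)."""
    if sticky_score is not None and sticky_score >= STICKY_MIN:
        return "LINK (Sticky Context)"
    if best_score >= HIGH_CONFIDENCE_MIN:
        return "LINK (High Confidence)"
    if best_score < PLAUSIBLE_MIN:
        return "SKIP (Too Ambiguous)"
    # Plausible match (85 to 94): ask the fast model first.
    if flash_verifies():
        return "LINK (Flash Verified)"
    # Flash rejected the match: try a more targeted search.
    if research_score() >= RESEARCH_MIN:
        return "LINK (Auto-Corrected)"
    # Last resort: re-verify the original match with the Pro model.
    if pro_verifies():
        return "LINK (Pro Arbitration)"
    return "SKIP (Final Rejection)"


# Example: a plausible fuzzy match (88) that the Flash verifier accepts.
print(decide_link(None, 88, lambda: True, lambda: 0, lambda: False))
# prints: LINK (Flash Verified)
```

In a real run the three callables would wrap the async Gemini and fuzzy-search calls; plain synchronous lambdas are used here only to keep the sketch self-contained.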

## v7.0

In [None]:
# @title {display-mode: "form"}
# ==============================================================================
# BLOCK 3: RAG ENRICHMENT (v7.0 - FULLY AUTONOMOUS "GOD MODE")
# ==============================================================================
#
# PURPOSE:
# This version is fully autonomous, with verbose logging and a "Pro is God"
# final arbitration step. When faced with ambiguity, it tasks the Gemini 2.5 Pro
# model with making the final decision after all other automated steps fail.
# ==============================================================================

import json
import os
import re
import base64
from thefuzz import process as fuzzy_process
from typing import Optional, List, Dict, Set

import asyncio
import aiohttp

# ------------------------------------------------------------------------------
# STEP 3.0: USER CONFIGURATION
# ------------------------------------------------------------------------------
# If True, the initial verification step retries with the Pro model when the Flash
# call returns no usable verdict. The final arbitration always uses Pro.
USE_PRO_MODEL_FALLBACK = True
# ------------------------------------------------------------------------------

# ------------------------------------------------------------------------------
# STEP 3.1: DEFINE CORE RAG FUNCTIONS (v7.0)
# ------------------------------------------------------------------------------
print("--- STEP 3.1: DEFINING CORE RAG FUNCTIONS (v7.0) - FULLY AUTONOMOUS ---")

# --- CPU-Bound Helper Functions (Remain Synchronous) ---
def clean_heading(title: str) -> str:
    cleaned = re.sub(r'\{#.*?\}', '', title); cleaned = re.sub(r'^[A-Z]\\[uU][0-9a-fA-F]{4}', '', cleaned); return cleaned.strip()
def parse_notebook(notebook_path: str) -> list[dict]:
    print(f"\n[Parser] Reading notebook: {os.path.basename(notebook_path)}...");
    if not os.path.exists(notebook_path): print(f"❌ ERROR: Notebook file not found."); return []
    with open(notebook_path, 'r', encoding='utf-8') as f: content = f.read()
    parts = re.split(r'(^(?:##|###)\s+.*)', content, flags=re.MULTILINE); chunks, current_main, current_sub = [], None, None; source_name = os.path.basename(notebook_path)
    if parts and parts[0].strip(): chunks.append({"main_heading": None, "sub_heading": None, "content": parts[0].strip(), "source": source_name})
    for i in range(1, len(parts), 2):
        match = re.match(r'^(##|###)\s+(.*)', parts[i]); content = parts[i+1].strip() if (i + 1) < len(parts) else ""
        if match:
            level, raw_title = match.groups(); title = clean_heading(raw_title)
            if level == '##': current_main, current_sub = title, None
            else: current_sub = title
        if content: chunks.append({"main_heading": current_main, "sub_heading": current_sub, "content": content, "source": source_name})
    print(f"--> Successfully parsed {len(chunks)} content chunks."); return chunks
def find_fuzzy_match(query: str, notebook_headings: list, limit=1) -> Optional[list]:
    if not query: return None
    if limit == 1:
        result = fuzzy_process.extractOne(query, notebook_headings); return [result] if result else None
    return fuzzy_process.extract(query, notebook_headings, limit=limit)
def build_page_to_headings_map(content_data: list) -> dict:
    page_map = {}; current_main_heading = None
    for item in content_data:
        page_num = item.get('page_number') or item.get('source_page')
        if page_num:
            main_heading = item.get('headings', {}).get('main_heading'); sub_heading = item.get('headings', {}).get('sub_heading')
            if main_heading: current_main_heading = main_heading
            page_map[page_num] = {'main_heading': current_main_heading, 'sub_heading': sub_heading}
    return page_map

# --- ASYNC AI HELPER FUNCTIONS ---
async def classify_content_type_async(session: aiohttp.ClientSession, content_to_classify: str, api_key: str) -> str:
    if not content_to_classify or len(content_to_classify.split()) < 5: return 'ambiguous'
    try:
        model_name = "models/gemini-2.5-flash"; url = f"https://generativelanguage.googleapis.com/v1beta/{model_name}:generateContent?key={api_key}"; prompt = ("You are an expert pathologist... Classify the 'Text Content' into 'pathologic', 'normal', or 'ambiguous'...\n" f"Text Content:\n---\n{content_to_classify}\n---\n\n" "Return ONLY a single valid JSON object with one key, `classification`..."); payload = {"contents": [{"parts": [{"text": prompt}]}]}; headers = {'Content-Type': 'application/json'}
        async with session.post(url, headers=headers, json=payload, timeout=60) as response:
            if response.status == 200:
                data = await response.json(); raw_text = data.get('candidates', [{}])[0].get('content', {}).get('parts', [{}])[0].get('text', ''); json_match = re.search(r'\{.*\}', raw_text, re.DOTALL)
                if json_match: return json.loads(json_match.group(0)).get('classification', 'ambiguous')
        return 'ambiguous'
    except Exception: return 'ambiguous'
async def extract_specific_entity_with_gemini(session: aiohttp.ClientSession, content_to_analyze: str, api_key: str) -> Optional[str]:
    if not content_to_analyze: return None
    try:
        model_name = "models/gemini-2.5-flash"; url = f"https://generativelanguage.googleapis.com/v1beta/{model_name}:generateContent?key={api_key}"; prompt = ("You are a pathology expert... Analyze the 'TEXT TO ANALYZE' and identify the single most unique and diagnostically significant medical entity...\nINSTRUCTIONS:\n1. Discard common, non-specific terms (e.g., 'tumor', 'lesion', 'clear cell', 'bone', 'carcinoma').\n2. From the remaining specific terms, select the SINGLE most important entity that defines the diagnosis...\nEXAMPLES:\n- Text: '...a clear cell acanthoma...'\n- Output: {\"specific_entity\": \"acanthoma\"}...\n" f"TEXT TO ANALYZE:\n---\n{content_to_analyze}\n---\n\n" "Return ONLY a single valid JSON object with one key: 'specific_entity'."); payload = {"contents": [{"parts": [{"text": prompt}]}]}; headers = {'Content-Type': 'application/json'}
        async with session.post(url, headers=headers, json=payload, timeout=60) as response:
            if response.status == 200:
                data = await response.json(); raw_text = data.get('candidates', [{}])[0].get('content', {}).get('parts', [{}])[0].get('text', ''); json_match = re.search(r'\{.*\}', raw_text, re.DOTALL)
                if json_match: return json.loads(json_match.group(0)).get('specific_entity')
        return None
    except Exception: return None

async def _verify_heading_single_attempt(session: aiohttp.ClientSession, context: dict, candidate_heading: str, api_key: str, use_pro: bool) -> Optional[bool]:
    model_name = "models/gemini-2.5-pro" if use_pro else "models/gemini-2.5-flash"; print(f" (Using {model_name.split('/')[-1]}...) ", end="")
    try:
        url = f"https://generativelanguage.googleapis.com/v1beta/{model_name}:generateContent?key={api_key}"; prompt = ("You are an expert pathologist acting as a strict verifier... **CRITICAL RULES:**\n1. AVOID OVERGENERALIZATION...\n2. AVOID SUBSET ERRORS...\n" f"CANDIDATE HEADING: '{candidate_heading}'\n\nCONTEXT:\n{context['current_page_text']}\n\nITEM TO CLASSIFY:\n{context['item_content']}\n\n" "TASK: Is the 'Candidate Heading' the correct and precise match?\nReturn ONLY a single valid JSON object with one key, 'is_correct_match'..."); parts = [{"text": prompt}];
        if context.get('page_image_b64'): parts.append({"inline_data": {"mime_type": "image/png", "data": context['page_image_b64']}})
        payload = {"contents": [{"parts": parts}]}; headers = {'Content-Type': 'application/json'}
        async with session.post(url, headers=headers, json=payload, timeout=120) as response:
            if response.status == 200:
                data = await response.json(); raw_text = data.get('candidates', [{}])[0].get('content', {}).get('parts', [{}])[0].get('text', '')
                json_match = re.search(r'\{.*\}', raw_text, re.DOTALL)
                if json_match: return json.loads(json_match.group(0)).get('is_correct_match')
            # aiohttp exposes the HTTP status as `.status` (not `.status_code`), and the body
            # must be read while the response is still open, i.e. inside this `async with` block.
            print(f"\n  - ⚠️ Gemini API Warning: Status {response.status}, Response: {await response.text()}"); return None
    except Exception as e: print(f"\n  - ❌ CRITICAL: Gemini API call failed. Error: {e}"); return None

async def call_gemini_to_verify_heading_async(session: aiohttp.ClientSession, context: dict, candidate_heading: str, api_key: str, use_pro_fallback: bool) -> bool:
    result = await _verify_heading_single_attempt(session, context, candidate_heading, api_key, use_pro=False)
    if result is None and use_pro_fallback:
        print("Flash model failed, escalating to Pro model...", end="")
        result = await _verify_heading_single_attempt(session, context, candidate_heading, api_key, use_pro=True)
    return result if result is not None else False

async def call_gemini_pro_for_final_decision_async(session: aiohttp.ClientSession, context: dict, top_5_candidates: list, api_key: str) -> Optional[str]:
    """NEW: Asks Gemini Pro to act as the final arbiter and choose the best match from a list."""
    print("  --> Escalating to Gemini Pro for final arbitration...")
    try:
        model_name = "models/gemini-2.5-pro"; url = f"https://generativelanguage.googleapis.com/v1beta/{model_name}:generateContent?key={api_key}"
        candidate_list_str = "\n".join([f"- '{heading}'" for heading, score in top_5_candidates])
        prompt = (
            "You are the final arbiter, an expert pathologist tasked with making the best possible link with ambiguous information.\n\n"
            "Review the 'ITEM TO CLASSIFY' and the 'Top 5 Candidate Headings' from a notebook.\n\n"
            "INSTRUCTIONS:\n"
            "1. Select the SINGLE best and most precise heading from the candidate list that accurately describes the item.\n"
            "2. If and only if NONE of the candidates are a reasonable or correct match, you must indicate no match.\n\n"
            f"ITEM TO CLASSIFY:\n---\n{context['item_content']}\n---\n\n"
            f"Top 5 Candidate Headings:\n{candidate_list_str}\n\n"
            "Return a JSON object with one key, `chosen_heading`. The value must be the full string of the best heading from the list, or `null` if no candidate is a good fit."
        )
        parts = [{"text": prompt}];
        if context.get('page_image_b64'): parts.append({"inline_data": {"mime_type": "image/png", "data": context['page_image_b64']}})
        payload = {"contents": [{"parts": parts}]}; headers = {'Content-Type': 'application/json'}
        async with session.post(url, headers=headers, json=payload, timeout=120) as response:
            if response.status == 200:
                data = await response.json(); raw_text = data.get('candidates', [{}])[0].get('content', {}).get('parts', [{}])[0].get('text', '')
                json_match = re.search(r'\{.*\}', raw_text, re.DOTALL)
                if json_match: return json.loads(json_match.group(0)).get('chosen_heading')
        return None
    except Exception as e: print(f"\n  - ❌ CRITICAL: Gemini Pro arbitration failed. Error: {e}"); return None

async def process_files_async(file_paths: list, notebook_headings_map: dict, all_content_data: dict, PATHS: dict, GEMINI_API_KEY: str, use_pro_fallback: bool):
    notebook_headings_list = list(notebook_headings_map.keys())
    source_notebook_name = list(notebook_headings_map.values())[0]['source'] if notebook_headings_map else "Unknown"
    for file_path in file_paths:
        print(f"\n{'='*60}\n--- Processing: {os.path.basename(file_path)} ---\n{'='*60}")
        try:
            with open(file_path, 'r', encoding='utf-8') as f: data = json.load(f)
        except (IOError, json.JSONDecodeError) as e:
            print(f"  -> ERROR: Could not read file. Skipping. Error: {e}"); continue
        update_count, last_confident_heading = 0, None; skipped_content_cache: Set[str] = set()
        is_lecture = '_ENHANCED_data.json' in os.path.basename(file_path)
        page_map = build_page_to_headings_map(all_content_data.get(f"{os.path.basename(file_path).replace('_FIGURES.json','')}_CONTENT.json",[])) if os.path.basename(file_path).endswith('_FIGURES.json') else {}

        async with aiohttp.ClientSession() as session:
            for i, item in enumerate(data):
                links = []
                item_content_full = (item.get('description', '') or item.get('content', '')) if not is_lecture else item.get('cleaned_transcript','')
                if item_content_full in skipped_content_cache or len(item_content_full.split()) < 5: continue

                if not is_lecture:
                    content_type = await classify_content_type_async(session, item_content_full, GEMINI_API_KEY)
                    if content_type != 'pathologic':
                        print(f"\r  -> Progress: {i + 1}/{len(data)} items processed... [SKIPPING: '{content_type}' content]", end=""); continue

                query_heading = item.get('title') if is_lecture else (item.get('headings',{}).get('sub_heading') or item.get('headings',{}).get('main_heading'))
                if not query_heading and not is_lecture:
                    page_num = item.get('source_page') or item.get('page_number'); page_context = page_map.get(page_num, {}); query_heading = page_context.get('sub_heading') or page_context.get('main_heading')  # Accept either page key, matching build_page_to_headings_map.
                if not query_heading: continue

                specific_entity = await extract_specific_entity_with_gemini(session, item_content_full, GEMINI_API_KEY)
                final_query = specific_entity if specific_entity else query_heading

                if last_confident_heading and find_fuzzy_match(final_query, [last_confident_heading], limit=1)[0][1] >= 85:
                    print(f"\r  -> Progress: {i + 1}/{len(data)} items processed... [CONTEXT HIT: '{last_confident_heading}']", end=""); links = [{"heading": last_confident_heading, "source": source_notebook_name, "score": 1.00}]
                    item['notebook_links'] = links; update_count += 1; continue

                match_results = find_fuzzy_match(final_query, notebook_headings_list, limit=1)
                if not match_results: continue
                best_match, score = match_results[0]
                item_preview = (item_content_full[:120] + '...') if len(item_content_full) > 120 else item_content_full

                if is_lecture:
                    if score >= 90: links = [{"heading": best_match, "source": source_notebook_name, "score": 1.00}]; last_confident_heading = best_match
                else: # Textbook Logic
                    if score >= 95:
                        print(f"\n  [ITEM {i+1}] \"{item_preview}\"\n  -> MATCH: '{best_match}' (Score: {score}) -> ✅ HIGH CONFIDENCE. Auto-linking.")
                        links = [{"heading": best_match, "source": source_notebook_name, "score": 1.00}]; last_confident_heading = best_match
                    elif 85 <= score < 95:
                        print(f"\n  [ITEM {i+1}] \"{item_preview}\"\n  -> MATCH: '{best_match}' (Score: {score}) -> ❔ PLAUSIBLE. Consulting AI Verifier...", end="")
                        context = {'item_content': item_content_full}; page_num = item.get('source_page') or item.get('page_number'); project_name_base = os.path.basename(file_path).replace('_FIGURES.json', '').replace('_CONTENT.json', '')
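                        page_image_path = os.path.join(PATHS['asset_textbooks'], project_name_base, 'page_images', f"page_{page_num:04d}.png") if isinstance(page_num, int) else None  # Guard: ':04d' raises on a missing or non-integer page number.
                        if page_image_path and os.path.exists(page_image_path):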
                            with open(page_image_path, "rb") as f: context['page_image_b64'] = base64.b64encode(f.read()).decode('utf-8')
                        context['current_page_text'] = "\n".join([c['content'] for c in all_content_data.get(f"{project_name_base}_CONTENT.json", []) if (c.get('page_number') or c.get('source_page')) == page_num])

                        is_correct = await call_gemini_to_verify_heading_async(session, context, best_match, GEMINI_API_KEY, use_pro_fallback)

                        if is_correct:
                            print(f"\n  -> AI SAYS: ✅ VERIFIED. Auto-linking to '{best_match}'.")
                            links = [{"heading": best_match, "source": source_notebook_name, "score": round(score/100, 2)}]; last_confident_heading = best_match
                        else:
                            print(f"\n  -> AI SAYS: ❌ REJECTED. Attempting automatic re-search...")
                            second_chance_entity = await extract_specific_entity_with_gemini(session, item_content_full, GEMINI_API_KEY)
                            if second_chance_entity and second_chance_entity.lower() != final_query.lower():
                                print(f"    -> AI extracted specific term: '{second_chance_entity}'. Re-searching...")
                                new_matches = find_fuzzy_match(second_chance_entity, notebook_headings_list, limit=1)
                                if new_matches and new_matches[0][1] >= 90:
                                    new_best_match, new_score = new_matches[0]
                                    print(f"    --> SUCCESS on 2nd attempt: Found '{new_best_match}' ({new_score}). Auto-linking.")
                                    links = [{"heading": new_best_match, "source": source_notebook_name, "score": round(new_score/100, 2)}]; last_confident_heading = new_best_match

                            if not links:
                                print("    --> Second attempt failed. Escalating to Pro for final arbitration.")
                                top_5_matches = find_fuzzy_match(query_heading, notebook_headings_list, limit=5)
                                chosen_heading = await call_gemini_pro_for_final_decision_async(session, context, top_5_matches, GEMINI_API_KEY)
                                if chosen_heading in notebook_headings_list:  # Accept only headings that actually exist in the notebook.
                                    print(f"    --> PRO ARBITRATION SUCCESS: Linking to '{chosen_heading}'.")
                                    links = [{"heading": chosen_heading, "source": source_notebook_name, "score": 1.00}]; last_confident_heading = chosen_heading
                                else:
                                    print(f"    --> PRO ARBITRATION FAILED: No suitable match found. Skipping item.")
                                    skipped_content_cache.add(item_content_full); last_confident_heading = None
                    else: last_confident_heading = None

                if links: item['notebook_links'] = links; update_count += 1
                print(f"\r  -> Progress: {i + 1}/{len(data)} items processed...", end="")

        print(f"\n\n  --- Summary for {os.path.basename(file_path)} ---")
        if update_count > 0:
            print(f"  -> Found/updated links for {update_count} items. Saving updates...");
            with open(file_path, 'w', encoding='utf-8') as f: json.dump(data, f, indent=4)
            print("  ✅ File saved successfully.")
        else: print("  -> No new links found for any items.")

async def main():
    try:
        if 'PATHS' not in globals(): raise NameError("PATHS not defined. Run Block 0.")
        if 'GEMINI_API_KEY' not in globals() or not GEMINI_API_KEY: raise NameError("GEMINI_API_KEY not defined. Run Block 0.")
        KNOWLEDGE_PIPELINE_ROOT, CONTENT_LIBRARY_DIR, DEFAULT_NOTEBOOK_DIR = PATHS['root'], PATHS['content_library'], PATHS['notebooks']
        print("\n--- STEP 3.2: SELECT YOUR NOTEBOOK FILE ---")
        discovered_notebooks = sorted([f for f in os.listdir(DEFAULT_NOTEBOOK_DIR) if f.lower().endswith('.md')]) if os.path.isdir(DEFAULT_NOTEBOOK_DIR) else []
        if discovered_notebooks:
            [print(f"  [{i+1}] {name}") for i, name in enumerate(discovered_notebooks)]; print("\nSelect a notebook by number, or paste the full path.")
        else: print("No notebooks found. Please provide the full path to your .md file.")
        user_input = input("Enter selection: ").strip(); notebook_path = None
        try:
            choice_index = int(user_input) - 1
            if 0 <= choice_index < len(discovered_notebooks): notebook_path = os.path.join(DEFAULT_NOTEBOOK_DIR, discovered_notebooks[choice_index])
        except (ValueError, IndexError):
            if os.path.exists(user_input) and user_input.lower().endswith('.md'): notebook_path = user_input
        if not notebook_path: raise ValueError("No valid notebook file selected.")
        print(f"--> Using notebook: {os.path.basename(notebook_path)}")
        notebook_chunks = parse_notebook(notebook_path)
        if not notebook_chunks: raise ValueError("Notebook is empty or could not be parsed.")
        print("\n[Indexer] Creating heading map..."); notebook_headings_map = {f"{c['main_heading']} - {c['sub_heading']}" if c['sub_heading'] else c['main_heading']: c for c in notebook_chunks if c['main_heading']}; print(f"--> Indexed {len(notebook_headings_map)} unique headings.")
        all_content_files = [os.path.join(root, f) for root, _, files in os.walk(CONTENT_LIBRARY_DIR) for f in files if f.endswith(('_CONTENT.json', '_FIGURES.json', '_ENHANCED_data.json'))]
        if not all_content_files: print("❌ No content files found to process.")
        else:
            print("\nPlease select content file(s) to link to your notebook:"); [print(f"  [{i+1}] {os.path.relpath(path, KNOWLEDGE_PIPELINE_ROOT)}") for i, path in enumerate(all_content_files)]
            user_input_files = input("\nEnter number(s) separated by commas (e.g., '1, 3' or 'all'): ").strip().lower(); selected_files = []
            if user_input_files == 'all': selected_files = all_content_files
            else:
                try: indices = [int(c.strip()) - 1 for c in user_input_files.split(',')]; selected_files = [all_content_files[i] for i in indices if 0 <= i < len(all_content_files)]
                except ValueError: print("Invalid input."); selected_files = []
            if selected_files:
                all_content_data = {}; files_to_load_for_context = set(selected_files); print("\n[Context Loader] Identifying necessary context files...")
                for f in selected_files:
                    if f.endswith('_FIGURES.json'):
                        content_equivalent = f.replace('_FIGURES.json', '_CONTENT.json')
                        if os.path.exists(content_equivalent): files_to_load_for_context.add(content_equivalent)
                print(f"Pre-loading {len(files_to_load_for_context)} file(s) for context...")
                for file_path in files_to_load_for_context:
                    try:
                        with open(file_path, 'r', encoding='utf-8') as content_f: all_content_data[os.path.basename(file_path)] = json.load(content_f)
                    except Exception as e: print(f"  - ⚠️ Could not load context from {os.path.basename(file_path)}: {e}")
                print("--> Context loading complete.")
                await process_files_async(selected_files, notebook_headings_map, all_content_data, PATHS, GEMINI_API_KEY, USE_PRO_MODEL_FALLBACK)
                print("\n\n🎉 All selected RAG enrichment jobs are complete!")
            else: print("No files selected to process.")
    except Exception as e: print(f"\nAn unexpected error occurred: {e}")

# --- Final execution call ---
await main()

--- STEP 3.1: DEFINING CORE RAG FUNCTIONS (v7.0) - FULLY AUTONOMOUS ---

--- STEP 3.2: SELECT YOUR NOTEBOOK FILE ---
  [1] Bone and Soft Tissue Pathology Notebook 0920 v2.md
  [2] Breast Pathology Notebook.md
  [3] Gynecologic Pathology Notebook.md
  [4] Skin Pathology Notebook 0921.md

Select a notebook by number, or paste the full path.
Enter selection: 4
--> Using notebook: Skin Pathology Notebook 0921.md

[Parser] Reading notebook: Skin Pathology Notebook 0921.md...
--> Successfully parsed 230 content chunks.

[Indexer] Creating heading map...
--> Indexed 224 unique headings.

Please select content file(s) to link to your notebook:
  [1] _content_library/textbooks/SoftTissue_Enzinger_CONTENT.json
  [2] _content_library/textbooks/Breast_Atlas_FIGURES.json
  [3] _content_library/textbooks/Breast_Atlas_CONTENT.json
  [4] _content_library/textbooks/Breast_Biopsy_FIGURES.json
  [5] _content_library/textbooks/Breast_Biopsy_CONTENT.json
  [6] _content_library/textbooks/GI_Biopsy_Interpret
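
The score routing used by the v7.0 cell above can be summarized in a small, self-contained sketch. The thresholds (90 for lectures; 95 and 85 for textbook items) are taken from the code. The `Route` enum and the `route_match` function are hypothetical names introduced only for illustration, and the sketch leaves out the re-search and Pro arbitration steps that follow a rejected verification.

```python
from enum import Enum


class Route(Enum):
    """Hypothetical labels for what happens to a fuzzy match."""
    AUTO_LINK = "auto_link"            # link immediately, no AI call
    VERIFY_WITH_AI = "verify_with_ai"  # plausible match, ask the verifier
    NO_LINK = "no_link"                # score too low, leave the item unlinked


def route_match(score: int, is_lecture: bool) -> Route:
    """Map a fuzzy-match score (0-100) to a routing decision.

    Thresholds mirror the pipeline code: lectures link at >= 90;
    textbook items link at >= 95 and are verified in the 85-94 band.
    """
    if is_lecture:
        return Route.AUTO_LINK if score >= 90 else Route.NO_LINK
    if score >= 95:
        return Route.AUTO_LINK
    if score >= 85:
        return Route.VERIFY_WITH_AI
    return Route.NO_LINK
```

For example, `route_match(88, is_lecture=False)` falls in the 85 to 94 band, so the item would be sent to the AI verifier rather than linked directly.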

## v8.0
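
A minimal sketch of the concurrency pattern this version relies on, using only the standard library: a semaphore caps how many items are in flight, and `asyncio.gather(..., return_exceptions=True)` returns results in input order, with failures delivered as exception objects instead of aborting the batch. The names `process_item`, `run_all`, `demo`, and the cap of 3 are hypothetical, and `asyncio.sleep` stands in for the real API calls.

```python
import asyncio

MAX_CONCURRENT = 3  # hypothetical small cap, chosen only for the demo


async def process_item(i: int, semaphore: asyncio.Semaphore, state: dict):
    """Simulate one item's work while holding a semaphore slot."""
    async with semaphore:
        state["in_flight"] += 1
        state["peak"] = max(state["peak"], state["in_flight"])
        try:
            await asyncio.sleep(0.01)  # stand-in for an API request
            if i == 4:
                raise ValueError("simulated failure")
            return i * 10
        finally:
            state["in_flight"] -= 1


async def run_all(n: int):
    """Run n items concurrently; return (results, peak concurrency)."""
    semaphore = asyncio.Semaphore(MAX_CONCURRENT)
    state = {"in_flight": 0, "peak": 0}
    tasks = [process_item(i, semaphore, state) for i in range(n)]
    # Results come back in the same order as `tasks`; a failed task
    # appears as an exception object at its own index.
    results = await asyncio.gather(*tasks, return_exceptions=True)
    return results, state["peak"]


def demo(n: int = 8):
    """Entry point for a plain script (no event loop already running)."""
    return asyncio.run(run_all(n))
```

In a notebook with an already running event loop (such as Colab), `await run_all(8)` would be used in place of `demo()`. Because results keep their input order, index `i` in the results can be written back to item `i` in the data, which is how the block below attaches links.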

In [None]:
# @title {display-mode: "form"}
# ==============================================================================
# BLOCK 3: RAG ENRICHMENT (v8.1 - STABLE CONCURRENT "GOD MODE")
# ==============================================================================
#
# PURPOSE:
# This version is fully autonomous and highly concurrent. It processes items
# in parallel with asyncio.gather, which returns results in the same order as
# the input items. A semaphore caps the number of items in flight to reduce
# the risk of API rate-limiting, and the block retains the "Pro is God" final
# arbitration for ambiguous cases.
# ==============================================================================

import json
import os
import re
import base64
from thefuzz import process as fuzzy_process
from typing import Optional, List, Dict, Set

import asyncio
import aiohttp

# ------------------------------------------------------------------------------
# STEP 3.0: USER CONFIGURATION
# ------------------------------------------------------------------------------
# If True, the initial verification step is retried with Pro when the Flash
# attempt returns no usable answer. The final arbitration always uses Pro.
USE_PRO_MODEL_FALLBACK = True
# Maximum number of items processed concurrently. Each item awaits its API
# calls one at a time, so this also caps the number of requests in flight.
MAX_CONCURRENT_REQUESTS = 50
# ------------------------------------------------------------------------------

# ------------------------------------------------------------------------------
# STEP 3.1: DEFINE CORE RAG FUNCTIONS (v8.1)
# ------------------------------------------------------------------------------
print("--- STEP 3.1: DEFINING CORE RAG FUNCTIONS (v8.1) - STABLE & CONCURRENT ---")

# --- CPU-Bound Helper Functions (Remain Synchronous) ---
def clean_heading(title: str) -> str:
    cleaned = re.sub(r'\{#.*?\}', '', title); cleaned = re.sub(r'^[A-Z]\\[uU][0-9a-fA-F]{4}', '', cleaned); return cleaned.strip()
def parse_notebook(notebook_path: str) -> list[dict]:
    print(f"\n[Parser] Reading notebook: {os.path.basename(notebook_path)}...");
    if not os.path.exists(notebook_path): print(f"❌ ERROR: Notebook file not found."); return []
    with open(notebook_path, 'r', encoding='utf-8') as f: content = f.read()
    parts = re.split(r'(^(?:##|###)\s+.*)', content, flags=re.MULTILINE); chunks, current_main, current_sub = [], None, None; source_name = os.path.basename(notebook_path)
    if parts and parts[0].strip(): chunks.append({"main_heading": None, "sub_heading": None, "content": parts[0].strip(), "source": source_name})
    for i in range(1, len(parts), 2):
        match = re.match(r'^(##|###)\s+(.*)', parts[i]); content = parts[i+1].strip() if (i + 1) < len(parts) else ""
        if match:
            level, raw_title = match.groups(); title = clean_heading(raw_title)
            if level == '##': current_main, current_sub = title, None
            else: current_sub = title
        if content: chunks.append({"main_heading": current_main, "sub_heading": current_sub, "content": content, "source": source_name})
    print(f"--> Successfully parsed {len(chunks)} content chunks."); return chunks
def find_fuzzy_match(query: str, notebook_headings: list, limit=1) -> Optional[list]:
    if not query: return None
    if limit == 1:
        result = fuzzy_process.extractOne(query, notebook_headings); return [result] if result else None
    return fuzzy_process.extract(query, notebook_headings, limit=limit)
def build_page_to_headings_map(content_data: list) -> dict:
    page_map = {}; current_main_heading = None
    for item in content_data:
        page_num = item.get('page_number') or item.get('source_page')
        if page_num:
            main_heading = item.get('headings', {}).get('main_heading'); sub_heading = item.get('headings', {}).get('sub_heading')
            if main_heading: current_main_heading = main_heading
            page_map[page_num] = {'main_heading': current_main_heading, 'sub_heading': sub_heading}
    return page_map

# --- ASYNC AI HELPER FUNCTIONS ---
async def classify_content_type_async(session: aiohttp.ClientSession, content_to_classify: str, api_key: str) -> str:
    if not content_to_classify or len(content_to_classify.split()) < 5: return 'ambiguous'
    try:
        model_name = "models/gemini-2.5-flash"; url = f"https://generativelanguage.googleapis.com/v1beta/{model_name}:generateContent?key={api_key}"; prompt = ("You are an expert pathologist... Classify the 'Text Content' into 'pathologic', 'normal', or 'ambiguous'...\n" f"Text Content:\n---\n{content_to_classify}\n---\n\n" "Return ONLY a single valid JSON object with one key, `classification`..."); payload = {"contents": [{"parts": [{"text": prompt}]}]}; headers = {'Content-Type': 'application/json'}
        async with session.post(url, headers=headers, json=payload, timeout=60) as response:
            if response.status == 200:
                data = await response.json(); raw_text = data.get('candidates', [{}])[0].get('content', {}).get('parts', [{}])[0].get('text', ''); json_match = re.search(r'\{.*\}', raw_text, re.DOTALL)
                if json_match: return json.loads(json_match.group(0)).get('classification', 'ambiguous')
        return 'ambiguous'
    except Exception: return 'ambiguous'
async def extract_specific_entity_with_gemini(session: aiohttp.ClientSession, content_to_analyze: str, api_key: str) -> Optional[str]:
    if not content_to_analyze: return None
    try:
        model_name = "models/gemini-2.5-flash"; url = f"https://generativelanguage.googleapis.com/v1beta/{model_name}:generateContent?key={api_key}"; prompt = ("You are a pathology expert... Analyze the 'TEXT TO ANALYZE' and identify the single most unique and diagnostically significant medical entity...\nINSTRUCTIONS:\n1. Discard common, non-specific terms (e.g., 'tumor', 'lesion', 'clear cell', 'bone', 'carcinoma').\n2. From the remaining specific terms, select the SINGLE most important entity that defines the diagnosis...\nEXAMPLES:\n- Text: '...a clear cell acanthoma...'\n- Output: {\"specific_entity\": \"acanthoma\"}...\n" f"TEXT TO ANALYZE:\n---\n{content_to_analyze}\n---\n\n" "Return ONLY a single valid JSON object with one key: 'specific_entity'."); payload = {"contents": [{"parts": [{"text": prompt}]}]}; headers = {'Content-Type': 'application/json'}
        async with session.post(url, headers=headers, json=payload, timeout=60) as response:
            if response.status == 200:
                data = await response.json(); raw_text = data.get('candidates', [{}])[0].get('content', {}).get('parts', [{}])[0].get('text', ''); json_match = re.search(r'\{.*\}', raw_text, re.DOTALL)
                if json_match: return json.loads(json_match.group(0)).get('specific_entity')
        return None
    except Exception: return None

async def _verify_heading_single_attempt(session: aiohttp.ClientSession, context: dict, candidate_heading: str, api_key: str, use_pro: bool) -> Optional[bool]:
    model_name = "models/gemini-2.5-pro" if use_pro else "models/gemini-2.5-flash"
    try:
        url = f"https://generativelanguage.googleapis.com/v1beta/{model_name}:generateContent?key={api_key}"; prompt = ("You are an expert pathologist acting as a strict verifier... **CRITICAL RULES:**\n1. AVOID OVERGENERALIZATION...\n2. AVOID SUBSET ERRORS...\n" f"CANDIDATE HEADING: '{candidate_heading}'\n\nCONTEXT:\n{context['current_page_text']}\n\nITEM TO CLASSIFY:\n{context['item_content']}\n\n" "TASK: Is the 'Candidate Heading' the correct and precise match?\nReturn ONLY a single valid JSON object with one key, 'is_correct_match'..."); parts = [{"text": prompt}];
        if context.get('page_image_b64'): parts.append({"inline_data": {"mime_type": "image/png", "data": context['page_image_b64']}})
        payload = {"contents": [{"parts": parts}]}; headers = {'Content-Type': 'application/json'}
        async with session.post(url, headers=headers, json=payload, timeout=120) as response:
            if response.status == 200:
                data = await response.json(); raw_text = data.get('candidates', [{}])[0].get('content', {}).get('parts', [{}])[0].get('text', '')
                json_match = re.search(r'\{.*\}', raw_text, re.DOTALL)
                if json_match: return json.loads(json_match.group(0)).get('is_correct_match')
        return None
    except Exception: return None

async def call_gemini_to_verify_heading_async(session: aiohttp.ClientSession, context: dict, candidate_heading: str, api_key: str, use_pro_fallback: bool) -> bool:
    result = await _verify_heading_single_attempt(session, context, candidate_heading, api_key, use_pro=False)
    if result is None and use_pro_fallback:
        result = await _verify_heading_single_attempt(session, context, candidate_heading, api_key, use_pro=True)
    return result if result is not None else False

async def call_gemini_pro_for_final_decision_async(session: aiohttp.ClientSession, context: dict, top_5_candidates: list, api_key: str) -> Optional[str]:
    try:
        model_name = "models/gemini-2.5-pro"; url = f"https://generativelanguage.googleapis.com/v1beta/{model_name}:generateContent?key={api_key}"
        candidate_list_str = "\n".join([f"- '{heading}'" for heading, score in top_5_candidates]); prompt = ("You are the final arbiter, an expert pathologist tasked with making the best possible link with ambiguous information.\n\n" "Review the 'ITEM TO CLASSIFY' and the 'Top 5 Candidate Headings' from a notebook.\n\n" "INSTRUCTIONS:\n" "1. Select the SINGLE best and most precise heading from the candidate list that accurately describes the item.\n" "2. If and only if NONE of the candidates are a reasonable or correct match, you must indicate no match.\n\n" f"ITEM TO CLASSIFY:\n---\n{context['item_content']}\n---\n\n" f"Top 5 Candidate Headings:\n{candidate_list_str}\n\n" "Return a JSON object with one key, `chosen_heading`. The value must be the full string of the best heading from the list, or `null` if no candidate is a good fit.")
        parts = [{"text": prompt}];
        if context.get('page_image_b64'): parts.append({"inline_data": {"mime_type": "image/png", "data": context['page_image_b64']}})
        payload = {"contents": [{"parts": parts}]}; headers = {'Content-Type': 'application/json'}
        async with session.post(url, headers=headers, json=payload, timeout=120) as response:
            if response.status == 200:
                data = await response.json(); raw_text = data.get('candidates', [{}])[0].get('content', {}).get('parts', [{}])[0].get('text', '')
                json_match = re.search(r'\{.*\}', raw_text, re.DOTALL)
                if json_match: return json.loads(json_match.group(0)).get('chosen_heading')
        return None
    except Exception: return None

# --- CORE CONCURRENT PROCESSING LOGIC ---
async def _process_single_item_async(
    session: aiohttp.ClientSession, item: dict, is_lecture: bool, notebook_headings_list: list,
    notebook_headings_map: dict, page_map: dict, all_content_data: dict, file_path: str,
    PATHS: dict, GEMINI_API_KEY: str, use_pro_fallback: bool, semaphore: asyncio.Semaphore
) -> Optional[Dict]:
    """Processes a single item, handling all logic and API calls under a semaphore."""
    async with semaphore:
        item_content_full = (item.get('description', '') or item.get('content', '')) if not is_lecture else item.get('cleaned_transcript', '')
        if len(item_content_full.split()) < 5: return None

        if not is_lecture:
            content_type = await classify_content_type_async(session, item_content_full, GEMINI_API_KEY)
            if content_type != 'pathologic': return None

        query_heading = item.get('title') if is_lecture else (item.get('headings', {}).get('sub_heading') or item.get('headings', {}).get('main_heading'))
        if not query_heading and not is_lecture:
            page_num = item.get('source_page') or item.get('page_number'); page_context = page_map.get(page_num, {}); query_heading = page_context.get('sub_heading') or page_context.get('main_heading')  # Accept either page key, matching build_page_to_headings_map.
        if not query_heading: return None

        specific_entity = await extract_specific_entity_with_gemini(session, item_content_full, GEMINI_API_KEY)
        final_query = specific_entity if specific_entity else query_heading

        match_results = find_fuzzy_match(final_query, notebook_headings_list, limit=1)
        if not match_results: return None

        best_match, score = match_results[0]; source_notebook_name = list(notebook_headings_map.values())[0]['source']

        if is_lecture:
            if score >= 90: return {"heading": best_match, "source": source_notebook_name, "score": 1.00}
        else:
            if score >= 95: return {"heading": best_match, "source": source_notebook_name, "score": 1.00}
            elif 85 <= score < 95:
                context = {'item_content': item_content_full}; page_num = item.get('source_page') or item.get('page_number'); project_name_base = os.path.basename(file_path).replace('_FIGURES.json', '').replace('_CONTENT.json', '')
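                page_image_path = os.path.join(PATHS['asset_textbooks'], project_name_base, 'page_images', f"page_{page_num:04d}.png") if isinstance(page_num, int) else None  # Guard: ':04d' raises on a missing or non-integer page number.
                if page_image_path and os.path.exists(page_image_path):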
                    with open(page_image_path, "rb") as f: context['page_image_b64'] = base64.b64encode(f.read()).decode('utf-8')
                context['current_page_text'] = "\n".join([c['content'] for c in all_content_data.get(f"{project_name_base}_CONTENT.json", []) if (c.get('page_number') or c.get('source_page')) == page_num])

                is_correct = await call_gemini_to_verify_heading_async(session, context, best_match, GEMINI_API_KEY, use_pro_fallback)
                if is_correct: return {"heading": best_match, "source": source_notebook_name, "score": round(score / 100, 2)}
                else:
                    second_chance_entity = await extract_specific_entity_with_gemini(session, item_content_full, GEMINI_API_KEY)
                    if second_chance_entity and second_chance_entity.lower() != final_query.lower():
                        new_matches = find_fuzzy_match(second_chance_entity, notebook_headings_list, limit=1)
                        if new_matches and new_matches[0][1] >= 90:
                            new_best_match, new_score = new_matches[0]; return {"heading": new_best_match, "source": source_notebook_name, "score": round(new_score / 100, 2)}
                    top_5_matches = find_fuzzy_match(query_heading, notebook_headings_list, limit=5)
                    chosen_heading = await call_gemini_pro_for_final_decision_async(session, context, top_5_matches, GEMINI_API_KEY)
                    if chosen_heading in notebook_headings_list: return {"heading": chosen_heading, "source": source_notebook_name, "score": 1.00}  # Accept only headings that actually exist in the notebook.
        return None

async def process_files_async(file_paths: list, notebook_headings_map: dict, all_content_data: dict, PATHS: dict, GEMINI_API_KEY: str, use_pro_fallback: bool):
    notebook_headings_list = list(notebook_headings_map.keys())
    semaphore = asyncio.Semaphore(MAX_CONCURRENT_REQUESTS)

    for file_path in file_paths:
        print(f"\n{'='*60}\n--- Processing: {os.path.basename(file_path)} ---\n{'='*60}")
        try:
            with open(file_path, 'r', encoding='utf-8') as f: data = json.load(f)
        except (IOError, json.JSONDecodeError) as e:
            print(f"  -> ERROR: Could not read file. Skipping. Error: {e}"); continue

        is_lecture = '_ENHANCED_data.json' in os.path.basename(file_path)
        page_map = build_page_to_headings_map(all_content_data.get(f"{os.path.basename(file_path).replace('_FIGURES.json','')}_CONTENT.json",[])) if os.path.basename(file_path).endswith('_FIGURES.json') else {}

        async with aiohttp.ClientSession() as session:
            tasks = [
                _process_single_item_async(session, item, is_lecture, notebook_headings_list, notebook_headings_map, page_map, all_content_data, file_path, PATHS, GEMINI_API_KEY, use_pro_fallback, semaphore)
                for item in data
            ]
            print(f"--> Running {len(tasks)} item tasks concurrently... (This may take a while)")
            # Use asyncio.gather to run all tasks and preserve the order of results.
            # return_exceptions=True prevents one failed task from stopping all others.
            results = await asyncio.gather(*tasks, return_exceptions=True)
            print("--> Concurrent processing finished.")

        update_count = 0
        for i, result in enumerate(results):
            if isinstance(result, Exception):
                print(f"  - ⚠️ WARNING: Task for item {i} failed with an error: {result}")
                continue # Skip this item and move to the next
            if result: # 'result' is the link dictionary or None
                data[i]['notebook_links'] = [result]
                update_count += 1

        print(f"\n\n  --- Summary for {os.path.basename(file_path)} ---")
        if update_count > 0:
            print(f"  -> Found/updated links for {update_count} items. Saving updates...");
            with open(file_path, 'w', encoding='utf-8') as f: json.dump(data, f, indent=4)
            print("  ✅ File saved successfully.")
        else: print("  -> No new links found for any items.")

async def main():
    try:
        if 'PATHS' not in globals(): raise NameError("PATHS not defined. Run Block 0.")
        if 'GEMINI_API_KEY' not in globals() or not GEMINI_API_KEY: raise NameError("GEMINI_API_KEY not defined. Run Block 0.")
        KNOWLEDGE_PIPELINE_ROOT, CONTENT_LIBRARY_DIR, DEFAULT_NOTEBOOK_DIR = PATHS['root'], PATHS['content_library'], PATHS['notebooks']
        print("\n--- STEP 3.2: SELECT YOUR NOTEBOOK FILE ---")
        discovered_notebooks = sorted([f for f in os.listdir(DEFAULT_NOTEBOOK_DIR) if f.lower().endswith('.md')]) if os.path.isdir(DEFAULT_NOTEBOOK_DIR) else []
        if discovered_notebooks:
            [print(f"  [{i+1}] {name}") for i, name in enumerate(discovered_notebooks)]; print("\nSelect a notebook by number, or paste the full path.")
        else: print("No notebooks found. Please provide the full path to your .md file.")
        user_input = input("Enter selection: ").strip(); notebook_path = None
        try:
            choice_index = int(user_input) - 1
            if 0 <= choice_index < len(discovered_notebooks): notebook_path = os.path.join(DEFAULT_NOTEBOOK_DIR, discovered_notebooks[choice_index])
        except (ValueError, IndexError):
            if os.path.exists(user_input) and user_input.lower().endswith('.md'): notebook_path = user_input
        if not notebook_path: raise ValueError("No valid notebook file selected.")
        print(f"--> Using notebook: {os.path.basename(notebook_path)}")
        notebook_chunks = parse_notebook(notebook_path)
        if not notebook_chunks: raise ValueError("Notebook is empty or could not be parsed.")
        print("\n[Indexer] Creating heading map..."); notebook_headings_map = {f"{c['main_heading']} - {c['sub_heading']}" if c['sub_heading'] else c['main_heading']: c for c in notebook_chunks if c['main_heading']}; print(f"--> Indexed {len(notebook_headings_map)} unique headings.")
        all_content_files = [os.path.join(root, f) for root, _, files in os.walk(CONTENT_LIBRARY_DIR) for f in files if f.endswith(('_CONTENT.json', '_FIGURES.json', '_ENHANCED_data.json'))]
        if not all_content_files: print("❌ No content files found to process.")
        else:
            print("\nPlease select content file(s) to link to your notebook:")
            for i, path in enumerate(all_content_files): print(f"  [{i+1}] {os.path.relpath(path, KNOWLEDGE_PIPELINE_ROOT)}")
            user_input_files = input("\nEnter number(s) separated by commas (e.g., '1, 3' or 'all'): ").strip().lower(); selected_files = []
            if user_input_files == 'all': selected_files = all_content_files
            else:
                try: indices = [int(c.strip()) - 1 for c in user_input_files.split(',')]; selected_files = [all_content_files[i] for i in indices if 0 <= i < len(all_content_files)]
                except ValueError: print("Invalid input."); selected_files = []
            if selected_files:
                all_content_data = {}; files_to_load_for_context = set(selected_files); print("\n[Context Loader] Identifying necessary context files...")
                for f in selected_files:
                    if f.endswith('_FIGURES.json'):
                        content_equivalent = f.replace('_FIGURES.json', '_CONTENT.json')
                        if os.path.exists(content_equivalent): files_to_load_for_context.add(content_equivalent)
                print(f"Pre-loading {len(files_to_load_for_context)} file(s) for context...")
                for file_path in files_to_load_for_context:
                    try:
                        with open(file_path, 'r', encoding='utf-8') as content_f: all_content_data[os.path.basename(file_path)] = json.load(content_f)
                    except Exception as e: print(f"  - ⚠️ Could not load context from {os.path.basename(file_path)}: {e}")
                print("--> Context loading complete.")
                await process_files_async(selected_files, notebook_headings_map, all_content_data, PATHS, GEMINI_API_KEY, USE_PRO_MODEL_FALLBACK)
                print("\n\n🎉 All selected RAG enrichment jobs are complete!")
            else: print("No files selected to process.")
    except Exception as e: print(f"\nAn unexpected error occurred: {e}")

# --- Final execution call ---
# Installs and applies a patch to allow asyncio to run in Colab/Jupyter.
!pip install -q nest_asyncio
import nest_asyncio
nest_asyncio.apply()
# Run the main asynchronous function
asyncio.run(main())

--- STEP 3.1: DEFINING CORE RAG FUNCTIONS (v8.1) - STABLE & CONCURRENT ---

--- STEP 3.2: SELECT YOUR NOTEBOOK FILE ---
  [1] Bone and Soft Tissue Pathology Notebook 0920 v2.md
  [2] Breast Pathology Notebook.md
  [3] Gynecologic Pathology Notebook.md
  [4] Skin Pathology Notebook 0921.md

Select a notebook by number, or paste the full path.
Enter selection: 4
--> Using notebook: Skin Pathology Notebook 0921.md

[Parser] Reading notebook: Skin Pathology Notebook 0921.md...
--> Successfully parsed 230 content chunks.

[Indexer] Creating heading map...
--> Indexed 224 unique headings.

Please select content file(s) to link to your notebook:
  [1] _content_library/textbooks/SoftTissue_Enzinger_CONTENT.json
  [2] _content_library/textbooks/Breast_Atlas_FIGURES.json
  [3] _content_library/textbooks/Breast_Atlas_CONTENT.json
  [4] _content_library/textbooks/Breast_Biopsy_FIGURES.json
  [5] _content_library/textbooks/Breast_Biopsy_CONTENT.json
  [6] _content_library/textbooks/GI_Biopsy_Interp

## v8.2
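
The v8.2 cell that follows routes each fuzzy match by its score: 95 and above is linked automatically, 85 to 94 is sent for AI verification, and anything lower is left unlinked. A minimal sketch of that routing is shown here. The thresholds are taken from the cell; the function name `route_by_score` and the returned labels are hypothetical and used only for illustration.

```python
def route_by_score(score: int) -> str:
    """Return the action taken for a fuzzy-match score in the range 0-100.

    The thresholds mirror the v8.2 cell; the labels are illustrative only.
    """
    if score >= 95:
        return "auto_link"  # high confidence: link without verification
    if score >= 85:
        return "verify"     # plausible: ask the model to confirm the match
    return "skip"           # too weak: leave the item unlinked


for s in (97, 90, 60):
    print(s, route_by_score(s))
```

Keeping the thresholds in one small function like this would make them easy to tune, although the cell itself inlines the comparisons.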

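The v8.2 cell limits concurrency with an `asyncio.Semaphore` sized by `MAX_CONCURRENT_REQUESTS` and collects results with `asyncio.gather(..., return_exceptions=True)`. The sketch below shows that pattern in isolation. `run_bounded` and `fake_api_call` are hypothetical names, and the sleep stands in for a real network request.

```python
import asyncio


async def run_bounded(items, worker, limit):
    """Run worker(item) for every item, with at most `limit` running at once."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(item):
        # Each task waits here until one of the `limit` slots is free.
        async with semaphore:
            return await worker(item)

    # return_exceptions=True means a failing item does not raise out of
    # gather(); its exception object is returned in the matching result slot.
    return await asyncio.gather(*(guarded(i) for i in items), return_exceptions=True)


async def fake_api_call(n):
    # Stand-in for a network request; odd inputs fail on purpose.
    await asyncio.sleep(0.01)
    if n % 2:
        raise ValueError(f"item {n} failed")
    return n * 10


results = asyncio.run(run_bounded(range(4), fake_api_call, limit=2))
for i, r in enumerate(results):
    print(i, "error" if isinstance(r, Exception) else r)
```

Because results come back in input order, a caller can pair `results[i]` with the i-th item, which is how the cell writes links back into the loaded JSON.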
In [None]:
# @title {display-mode: "form"}
# ==============================================================================
# BLOCK 3: RAG ENRICHMENT (v8.2 - UNIFIED LOGIC "GOD MODE")
# ==============================================================================
#
# PURPOSE:
# This version applies the same multi-step AI verification and arbitration
# logic to BOTH lectures and textbooks. Fuzzy matches scoring >= 95 are linked
# automatically, scores from 85 to 94 go through AI verification, and lower
# scores are left unlinked. Items are processed concurrently for speed.
# ==============================================================================

import json
import os
import re
import base64
from thefuzz import process as fuzzy_process
from typing import Optional, List, Dict, Set

import asyncio
import aiohttp

# ------------------------------------------------------------------------------
# STEP 3.0: USER CONFIGURATION
# ------------------------------------------------------------------------------
# Fallback for the initial verification step. The final arbitration always uses Pro.
USE_PRO_MODEL_FALLBACK = True
# Maximum number of items processed concurrently. Each item holds its semaphore
# slot across all of its API calls, so this also caps in-flight API requests.
MAX_CONCURRENT_REQUESTS = 50
# ------------------------------------------------------------------------------

# ------------------------------------------------------------------------------
# STEP 3.1: DEFINE CORE RAG FUNCTIONS (v8.2)
# ------------------------------------------------------------------------------
print("--- STEP 3.1: DEFINING CORE RAG FUNCTIONS (v8.2) - UNIFIED & CONCURRENT ---")

# --- CPU-Bound Helper Functions (Remain Synchronous) ---
def clean_heading(title: str) -> str:
    cleaned = re.sub(r'\{#.*?\}', '', title); cleaned = re.sub(r'^[A-Z]\\[uU][0-9a-fA-F]{4}', '', cleaned); return cleaned.strip()
def parse_notebook(notebook_path: str) -> list[dict]:
    print(f"\n[Parser] Reading notebook: {os.path.basename(notebook_path)}...")
    if not os.path.exists(notebook_path): print(f"❌ ERROR: Notebook file not found."); return []
    with open(notebook_path, 'r', encoding='utf-8') as f: content = f.read()
    parts = re.split(r'(^(?:##|###)\s+.*)', content, flags=re.MULTILINE); chunks, current_main, current_sub = [], None, None; source_name = os.path.basename(notebook_path)
    if parts and parts[0].strip(): chunks.append({"main_heading": None, "sub_heading": None, "content": parts[0].strip(), "source": source_name})
    for i in range(1, len(parts), 2):
        match = re.match(r'^(##|###)\s+(.*)', parts[i]); content = parts[i+1].strip() if (i + 1) < len(parts) else ""
        if match:
            level, raw_title = match.groups(); title = clean_heading(raw_title)
            if level == '##': current_main, current_sub = title, None
            else: current_sub = title
        if content: chunks.append({"main_heading": current_main, "sub_heading": current_sub, "content": content, "source": source_name})
    print(f"--> Successfully parsed {len(chunks)} content chunks."); return chunks
def find_fuzzy_match(query: str, notebook_headings: list, limit=1) -> Optional[list]:
    if not query: return None
    if limit == 1:
        result = fuzzy_process.extractOne(query, notebook_headings); return [result] if result else None
    return fuzzy_process.extract(query, notebook_headings, limit=limit)
def build_page_to_headings_map(content_data: list) -> dict:
    page_map = {}; current_main_heading = None
    for item in content_data:
        page_num = item.get('page_number') or item.get('source_page')
        if page_num:
            main_heading = item.get('headings', {}).get('main_heading'); sub_heading = item.get('headings', {}).get('sub_heading')
            if main_heading: current_main_heading = main_heading
            page_map[page_num] = {'main_heading': current_main_heading, 'sub_heading': sub_heading}
    return page_map

# --- ASYNC AI HELPER FUNCTIONS ---
async def classify_content_type_async(session: aiohttp.ClientSession, content_to_classify: str, api_key: str) -> str:
    if not content_to_classify or len(content_to_classify.split()) < 5: return 'ambiguous'
    try:
        model_name = "models/gemini-2.5-flash"; url = f"https://generativelanguage.googleapis.com/v1beta/{model_name}:generateContent?key={api_key}"; prompt = ("You are an expert pathologist... Classify the 'Text Content' into 'pathologic', 'normal', or 'ambiguous'...\n" f"Text Content:\n---\n{content_to_classify}\n---\n\n" "Return ONLY a single valid JSON object with one key, `classification`..."); payload = {"contents": [{"parts": [{"text": prompt}]}]}; headers = {'Content-Type': 'application/json'}
        async with session.post(url, headers=headers, json=payload, timeout=aiohttp.ClientTimeout(total=60)) as response:
            if response.status == 200:
                data = await response.json(); raw_text = data.get('candidates', [{}])[0].get('content', {}).get('parts', [{}])[0].get('text', ''); json_match = re.search(r'\{.*\}', raw_text, re.DOTALL)
                if json_match: return json.loads(json_match.group(0)).get('classification', 'ambiguous')
        return 'ambiguous'
    except Exception: return 'ambiguous'
async def extract_specific_entity_with_gemini(session: aiohttp.ClientSession, content_to_analyze: str, api_key: str) -> Optional[str]:
    if not content_to_analyze: return None
    try:
        model_name = "models/gemini-2.5-flash"; url = f"https://generativelanguage.googleapis.com/v1beta/{model_name}:generateContent?key={api_key}"; prompt = ("You are a pathology expert... Analyze the 'TEXT TO ANALYZE' and identify the single most unique and diagnostically significant medical entity...\nINSTRUCTIONS:\n1. Discard common, non-specific terms (e.g., 'tumor', 'lesion', 'clear cell', 'bone', 'carcinoma').\n2. From the remaining specific terms, select the SINGLE most important entity that defines the diagnosis...\nEXAMPLES:\n- Text: '...a clear cell acanthoma...'\n- Output: {\"specific_entity\": \"acanthoma\"}...\n" f"TEXT TO ANALYZE:\n---\n{content_to_analyze}\n---\n\n" "Return ONLY a single valid JSON object with one key: 'specific_entity'."); payload = {"contents": [{"parts": [{"text": prompt}]}]}; headers = {'Content-Type': 'application/json'}
        async with session.post(url, headers=headers, json=payload, timeout=aiohttp.ClientTimeout(total=60)) as response:
            if response.status == 200:
                data = await response.json(); raw_text = data.get('candidates', [{}])[0].get('content', {}).get('parts', [{}])[0].get('text', ''); json_match = re.search(r'\{.*\}', raw_text, re.DOTALL)
                if json_match: return json.loads(json_match.group(0)).get('specific_entity')
        return None
    except Exception: return None

async def _verify_heading_single_attempt(session: aiohttp.ClientSession, context: dict, candidate_heading: str, api_key: str, use_pro: bool) -> Optional[bool]:
    model_name = "models/gemini-2.5-pro" if use_pro else "models/gemini-2.5-flash"
    try:
        url = f"https://generativelanguage.googleapis.com/v1beta/{model_name}:generateContent?key={api_key}"; prompt = ("You are an expert pathologist acting as a strict verifier... **CRITICAL RULES:**\n1. AVOID OVERGENERALIZATION...\n2. AVOID SUBSET ERRORS...\n" f"CANDIDATE HEADING: '{candidate_heading}'\n\nCONTEXT:\n{context['current_page_text']}\n\nITEM TO CLASSIFY:\n{context['item_content']}\n\n" "TASK: Is the 'Candidate Heading' the correct and precise match?\nReturn ONLY a single valid JSON object with one key, 'is_correct_match'..."); parts = [{"text": prompt}];
        if context.get('page_image_b64'): parts.append({"inline_data": {"mime_type": "image/png", "data": context['page_image_b64']}})
        payload = {"contents": [{"parts": parts}]}; headers = {'Content-Type': 'application/json'}
        async with session.post(url, headers=headers, json=payload, timeout=aiohttp.ClientTimeout(total=120)) as response:
            if response.status == 200:
                data = await response.json(); raw_text = data.get('candidates', [{}])[0].get('content', {}).get('parts', [{}])[0].get('text', '')
                json_match = re.search(r'\{.*\}', raw_text, re.DOTALL)
                if json_match: return json.loads(json_match.group(0)).get('is_correct_match')
        return None
    except Exception: return None

async def call_gemini_to_verify_heading_async(session: aiohttp.ClientSession, context: dict, candidate_heading: str, api_key: str, use_pro_fallback: bool) -> bool:
    result = await _verify_heading_single_attempt(session, context, candidate_heading, api_key, use_pro=False)
    if result is None and use_pro_fallback:
        result = await _verify_heading_single_attempt(session, context, candidate_heading, api_key, use_pro=True)
    return result if result is not None else False

async def call_gemini_pro_for_final_decision_async(session: aiohttp.ClientSession, context: dict, top_5_candidates: list, api_key: str) -> Optional[str]:
    try:
        model_name = "models/gemini-2.5-pro"; url = f"https://generativelanguage.googleapis.com/v1beta/{model_name}:generateContent?key={api_key}"
        candidate_list_str = "\n".join([f"- '{heading}'" for heading, score in top_5_candidates]); prompt = ("You are the final arbiter, an expert pathologist tasked with making the best possible link with ambiguous information.\n\n" "Review the 'ITEM TO CLASSIFY' and the 'Top 5 Candidate Headings' from a notebook.\n\n" "INSTRUCTIONS:\n" "1. Select the SINGLE best and most precise heading from the candidate list that accurately describes the item.\n" "2. If and only if NONE of the candidates are a reasonable or correct match, you must indicate no match.\n\n" f"ITEM TO CLASSIFY:\n---\n{context['item_content']}\n---\n\n" f"Top 5 Candidate Headings:\n{candidate_list_str}\n\n" "Return a JSON object with one key, `chosen_heading`. The value must be the full string of the best heading from the list, or `null` if no candidate is a good fit.")
        parts = [{"text": prompt}]
        if context.get('page_image_b64'): parts.append({"inline_data": {"mime_type": "image/png", "data": context['page_image_b64']}})
        payload = {"contents": [{"parts": parts}]}; headers = {'Content-Type': 'application/json'}
        async with session.post(url, headers=headers, json=payload, timeout=aiohttp.ClientTimeout(total=120)) as response:
            if response.status == 200:
                data = await response.json(); raw_text = data.get('candidates', [{}])[0].get('content', {}).get('parts', [{}])[0].get('text', '')
                json_match = re.search(r'\{.*\}', raw_text, re.DOTALL)
                if json_match: return json.loads(json_match.group(0)).get('chosen_heading')
        return None
    except Exception: return None

# --- CORE CONCURRENT PROCESSING LOGIC ---
async def _process_single_item_async(
    session: aiohttp.ClientSession, item: dict, is_lecture: bool, notebook_headings_list: list,
    notebook_headings_map: dict, page_map: dict, all_content_data: dict, file_path: str,
    PATHS: dict, GEMINI_API_KEY: str, use_pro_fallback: bool, semaphore: asyncio.Semaphore
) -> Optional[Dict]:
    """
    Processes a single item with a unified, robust logic for both lectures and
    textbooks, handling all API calls under a semaphore.
    """
    async with semaphore:
        # Fall back to an empty string so a missing or null field cannot break .split() below.
        item_content_full = ((item.get('description') or item.get('content')) if not is_lecture else item.get('cleaned_transcript')) or ''
        if len(item_content_full.split()) < 5: return None

        # For textbooks, first classify content and skip if not pathologic.
        if not is_lecture:
            content_type = await classify_content_type_async(session, item_content_full, GEMINI_API_KEY)
            if content_type != 'pathologic': return None

        # Determine the initial query heading from the item's metadata.
        query_heading = item.get('title') if is_lecture else (item.get('headings', {}).get('sub_heading') or item.get('headings', {}).get('main_heading'))
        if not query_heading and not is_lecture:
            page_num = item.get('source_page'); page_context = page_map.get(page_num, {}); query_heading = page_context.get('sub_heading') or page_context.get('main_heading')
        if not query_heading: return None

        # Use AI to find a more specific term to improve search accuracy.
        specific_entity = await extract_specific_entity_with_gemini(session, item_content_full, GEMINI_API_KEY)
        final_query = specific_entity if specific_entity else query_heading

        # Perform the initial fuzzy match against the notebook headings.
        match_results = find_fuzzy_match(final_query, notebook_headings_list, limit=1)
        if not match_results: return None

        best_match, score = match_results[0]
        source_notebook_name = list(notebook_headings_map.values())[0]['source']

        # --- UNIFIED LOGIC FOR LECTURES AND TEXTBOOKS ---
        if score >= 95:
            # High confidence: Auto-link without verification.
            return {"heading": best_match, "source": source_notebook_name, "score": 1.00}

        elif 85 <= score < 95:
            # Plausible confidence: Requires AI verification.
            context = {'item_content': item_content_full, 'current_page_text': ""}
            # For textbooks, add richer context (page image and surrounding text).
            if not is_lecture:
                page_num = item.get('source_page') or item.get('page_number')
                if page_num:
                    project_name_base = os.path.basename(file_path).replace('_FIGURES.json', '').replace('_CONTENT.json', '')
                    page_image_path = os.path.join(PATHS['asset_textbooks'], project_name_base, 'page_images', f"page_{page_num:04d}.png")
                    if os.path.exists(page_image_path):
                        with open(page_image_path, "rb") as f: context['page_image_b64'] = base64.b64encode(f.read()).decode('utf-8')
                    context['current_page_text'] = "\n".join([c['content'] for c in all_content_data.get(f"{project_name_base}_CONTENT.json", []) if (c.get('page_number') or c.get('source_page')) == page_num])

            is_correct = await call_gemini_to_verify_heading_async(session, context, best_match, GEMINI_API_KEY, use_pro_fallback)

            if is_correct:
                return {"heading": best_match, "source": source_notebook_name, "score": round(score / 100, 2)}
            else:
                # Verification failed. Try a "second chance" with another AI entity extraction.
                second_chance_entity = await extract_specific_entity_with_gemini(session, item_content_full, GEMINI_API_KEY)
                if second_chance_entity and second_chance_entity.lower() != final_query.lower():
                    new_matches = find_fuzzy_match(second_chance_entity, notebook_headings_list, limit=1)
                    if new_matches and new_matches[0][1] >= 90:
                        new_best_match, new_score = new_matches[0]
                        return {"heading": new_best_match, "source": source_notebook_name, "score": round(new_score / 100, 2)}

                # Second chance failed. Escalate to Gemini Pro for a final decision from the top 5 candidates.
                top_5_matches = find_fuzzy_match(query_heading, notebook_headings_list, limit=5)
                if top_5_matches:
                    chosen_heading = await call_gemini_pro_for_final_decision_async(session, context, top_5_matches, GEMINI_API_KEY)
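                    # Accept the model's choice only if it is one of the candidates it was offered.
                    if chosen_heading and chosen_heading in [h for h, _ in top_5_matches]: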
                        return {"heading": chosen_heading, "source": source_notebook_name, "score": 1.00}

        # If score is < 85 or all verification steps fail, do not link.
        return None

async def process_files_async(file_paths: list, notebook_headings_map: dict, all_content_data: dict, PATHS: dict, GEMINI_API_KEY: str, use_pro_fallback: bool):
    notebook_headings_list = list(notebook_headings_map.keys())
    semaphore = asyncio.Semaphore(MAX_CONCURRENT_REQUESTS)

    for file_path in file_paths:
        print(f"\n{'='*60}\n--- Processing: {os.path.basename(file_path)} ---\n{'='*60}")
        try:
            with open(file_path, 'r', encoding='utf-8') as f: data = json.load(f)
        except (IOError, json.JSONDecodeError) as e:
            print(f"  -> ERROR: Could not read file. Skipping. Error: {e}"); continue

        is_lecture = '_ENHANCED_data.json' in os.path.basename(file_path)
        page_map = build_page_to_headings_map(all_content_data.get(f"{os.path.basename(file_path).replace('_FIGURES.json','')}_CONTENT.json",[])) if os.path.basename(file_path).endswith('_FIGURES.json') else {}

        async with aiohttp.ClientSession() as session:
            tasks = [
                _process_single_item_async(session, item, is_lecture, notebook_headings_list, notebook_headings_map, page_map, all_content_data, file_path, PATHS, GEMINI_API_KEY, use_pro_fallback, semaphore)
                for item in data
            ]
            print(f"--> Running {len(tasks)} item tasks concurrently... (This may take a while)")
            results = await asyncio.gather(*tasks, return_exceptions=True)
            print("--> Concurrent processing finished.")

        update_count = 0
        for i, result in enumerate(results):
            if isinstance(result, Exception):
                print(f"  - ⚠️ WARNING: Task for item {i} failed with an error: {result}")
                continue
            if result:
                data[i]['notebook_links'] = [result]
                update_count += 1

        print(f"\n\n  --- Summary for {os.path.basename(file_path)} ---")
        if update_count > 0:
            print(f"  -> Found/updated links for {update_count} items. Saving updates...")
            with open(file_path, 'w', encoding='utf-8') as f: json.dump(data, f, indent=4)
            print("  ✅ File saved successfully.")
        else: print("  -> No new links found for any items.")

async def main():
    try:
        if 'PATHS' not in globals(): raise NameError("PATHS not defined. Run Block 0.")
        if 'GEMINI_API_KEY' not in globals() or not GEMINI_API_KEY: raise NameError("GEMINI_API_KEY not defined. Run Block 0.")
        KNOWLEDGE_PIPELINE_ROOT, CONTENT_LIBRARY_DIR, DEFAULT_NOTEBOOK_DIR = PATHS['root'], PATHS['content_library'], PATHS['notebooks']
        print("\n--- STEP 3.2: SELECT YOUR NOTEBOOK FILE ---")
        discovered_notebooks = sorted([f for f in os.listdir(DEFAULT_NOTEBOOK_DIR) if f.lower().endswith('.md')]) if os.path.isdir(DEFAULT_NOTEBOOK_DIR) else []
        if discovered_notebooks:
            for i, name in enumerate(discovered_notebooks): print(f"  [{i+1}] {name}")
            print("\nSelect a notebook by number, or paste the full path.")
        else: print("No notebooks found. Please provide the full path to your .md file.")
        user_input = input("Enter selection: ").strip(); notebook_path = None
        try:
            choice_index = int(user_input) - 1
            if 0 <= choice_index < len(discovered_notebooks): notebook_path = os.path.join(DEFAULT_NOTEBOOK_DIR, discovered_notebooks[choice_index])
        except (ValueError, IndexError):
            if os.path.exists(user_input) and user_input.lower().endswith('.md'): notebook_path = user_input
        if not notebook_path: raise ValueError("No valid notebook file selected.")
        print(f"--> Using notebook: {os.path.basename(notebook_path)}")
        notebook_chunks = parse_notebook(notebook_path)
        if not notebook_chunks: raise ValueError("Notebook is empty or could not be parsed.")
        print("\n[Indexer] Creating heading map..."); notebook_headings_map = {f"{c['main_heading']} - {c['sub_heading']}" if c['sub_heading'] else c['main_heading']: c for c in notebook_chunks if c['main_heading']}; print(f"--> Indexed {len(notebook_headings_map)} unique headings.")
        all_content_files = [os.path.join(root, f) for root, _, files in os.walk(CONTENT_LIBRARY_DIR) for f in files if f.endswith(('_CONTENT.json', '_FIGURES.json', '_ENHANCED_data.json'))]
        if not all_content_files: print("❌ No content files found to process.")
        else:
            print("\nPlease select content file(s) to link to your notebook:")
            for i, path in enumerate(all_content_files): print(f"  [{i+1}] {os.path.relpath(path, KNOWLEDGE_PIPELINE_ROOT)}")
            user_input_files = input("\nEnter number(s) separated by commas (e.g., '1, 3' or 'all'): ").strip().lower(); selected_files = []
            if user_input_files == 'all': selected_files = all_content_files
            else:
                try: indices = [int(c.strip()) - 1 for c in user_input_files.split(',')]; selected_files = [all_content_files[i] for i in indices if 0 <= i < len(all_content_files)]
                except ValueError: print("Invalid input."); selected_files = []
            if selected_files:
                all_content_data = {}; files_to_load_for_context = set(selected_files); print("\n[Context Loader] Identifying necessary context files...")
                for f in selected_files:
                    if f.endswith('_FIGURES.json'):
                        content_equivalent = f.replace('_FIGURES.json', '_CONTENT.json')
                        if os.path.exists(content_equivalent): files_to_load_for_context.add(content_equivalent)
                print(f"Pre-loading {len(files_to_load_for_context)} file(s) for context...")
                for file_path in files_to_load_for_context:
                    try:
                        with open(file_path, 'r', encoding='utf-8') as content_f: all_content_data[os.path.basename(file_path)] = json.load(content_f)
                    except Exception as e: print(f"  - ⚠️ Could not load context from {os.path.basename(file_path)}: {e}")
                print("--> Context loading complete.")
                await process_files_async(selected_files, notebook_headings_map, all_content_data, PATHS, GEMINI_API_KEY, USE_PRO_MODEL_FALLBACK)
                print("\n\n🎉 All selected RAG enrichment jobs are complete!")
            else: print("No files selected to process.")
    except Exception as e: print(f"\nAn unexpected error occurred: {e}")

# --- Final execution call ---
!pip install -q nest_asyncio
import nest_asyncio
nest_asyncio.apply()
asyncio.run(main())

# **Block 4: Build Knowledge Hub**
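
The cell that follows links each textbook entry to its page in the packaged PDF by percent-encoding the filename and appending a `#page=` fragment. A minimal sketch of that link format is shown here. The helper name `build_pdf_link` and the sample filename are hypothetical; the `pdfs/` prefix and the fragment come from the cell.

```python
from typing import Optional
from urllib.parse import quote


def build_pdf_link(pdf_filename: Optional[str], page: int) -> Optional[str]:
    """Build a relative link that opens the packaged PDF at a given page."""
    if not pdf_filename:
        return None  # no source PDF recorded for this entry
    # quote() percent-encodes spaces and other unsafe characters in the name.
    return f"pdfs/{quote(pdf_filename)}#page={page}"


print(build_pdf_link("Skin Atlas.pdf", 42))  # -> pdfs/Skin%20Atlas.pdf#page=42
```

Whether the viewer honours the `#page=` fragment depends on the browser's PDF viewer, so it is best treated as a convenience rather than a guarantee.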

In [None]:
# @title {display-mode: "form"}

import json
import os
import shutil
from urllib.parse import quote

# ------------------------------------------------------------------------------
# STEP 4.1: DEFINE THE HUB PUBLISHING PIPELINE
# ------------------------------------------------------------------------------
print("--- STEP 4.1: DEFINING THE HUB PUBLISHING PIPELINE ---")

def compile_and_package_hub(selected_lectures, selected_textbooks, package_path, KNOWLEDGE_PIPELINE_ROOT, CONTENT_LIBRARY_DIR, ASSET_LECTURES_DIR, SOURCE_PDFS_DIR):
    all_knowledge_entries, referenced_pdfs = [], set()
    print("\n[Compiler] Starting data aggregation...")
    pkg_assets_dir = os.path.join(package_path, 'assets')
    os.makedirs(pkg_assets_dir, exist_ok=True)

    if selected_textbooks:
        print(f"--> Processing {len(selected_textbooks)} textbook(s)...")
        for base_name in selected_textbooks:
            content_file = os.path.join(CONTENT_LIBRARY_DIR, 'textbooks', f"{base_name}_CONTENT.json"); figure_file = os.path.join(CONTENT_LIBRARY_DIR, 'textbooks', f"{base_name}_FIGURES.json")
            pdf_filename = None
            try:
                source_file = content_file if os.path.exists(content_file) else figure_file
                with open(source_file, 'r', encoding='utf-8') as f: pdf_filename = json.load(f)[0].get('source_document')
                if pdf_filename: referenced_pdfs.add(pdf_filename)
            except (IOError, IndexError, json.JSONDecodeError): print(f"  - Warning: Could not determine PDF filename for {base_name}.")

            if os.path.exists(content_file):
                with open(content_file, 'r', encoding='utf-8') as f:
                    for chunk in json.load(f):
                        pdf_link = f"pdfs/{quote(pdf_filename)}#page={chunk['page_number']}" if pdf_filename else None
                        title = chunk['headings'].get('sub_heading') or chunk['headings'].get('main_heading') or f"Page {chunk['page_number']}"
                        all_knowledge_entries.append({"type": "textbook_chunk", "source": f"Textbook: {pdf_filename or 'N/A'}", "title": title, "summary": (chunk.get('content', '')[:250] + '...'), "full_content": chunk.get('content', ''), "notebook_links": chunk.get('notebook_links', []), "searchable_text": f"textbook page {title} {chunk.get('content', '')}", "pdf_link": pdf_link, "page_number": chunk.get('page_number')})
            if os.path.exists(figure_file):
                 with open(figure_file, 'r', encoding='utf-8') as f:
                    for figure in json.load(f):
                        pdf_link = f"pdfs/{quote(pdf_filename)}#page={figure['source_page']}" if pdf_filename else None
                        source_img_path_full = os.path.join(KNOWLEDGE_PIPELINE_ROOT, figure.get('image_path', ''))
                        if os.path.exists(source_img_path_full):
                            dest_folder = os.path.join(pkg_assets_dir, 'figures', base_name); os.makedirs(dest_folder, exist_ok=True)
                            shutil.copy(source_img_path_full, dest_folder)
                            relative_image_path = os.path.join('assets', 'figures', base_name, os.path.basename(source_img_path_full))
                            all_knowledge_entries.append({"type": "textbook_figure", "source": f"Figure from: {pdf_filename or 'N/A'}", "title": f"Fig. {figure.get('figure_id', 'N/A')}", "summary": figure.get('description', ''), "image_path": relative_image_path, "notebook_links": figure.get('notebook_links', []), "searchable_text": f"figure fig {figure.get('figure_id', '')} {figure.get('description', '')}", "pdf_link": pdf_link, "full_content": figure.get('description', '')})

    if selected_lectures:
        print(f"--> Processing {len(selected_lectures)} selected lecture(s)...")
        for lecture_name in selected_lectures:
            json_path = os.path.join(CONTENT_LIBRARY_DIR, 'lectures', lecture_name, 'final_ENHANCED_data.json')
            asset_folder = os.path.join(ASSET_LECTURES_DIR, lecture_name)
            video_file = next((f for f in os.listdir(asset_folder) if f.lower().endswith(('.mp4', '.mov', '.mkv'))), None) if os.path.exists(asset_folder) else None
            if os.path.exists(json_path):
                # Copy the lecture video once per lecture rather than once per slide.
                relative_video_path = None
                if video_file:
                    dest_folder = os.path.join(pkg_assets_dir, 'videos', lecture_name); os.makedirs(dest_folder, exist_ok=True)
                    shutil.copy(os.path.join(asset_folder, video_file), dest_folder)
                    relative_video_path = os.path.join('assets', 'videos', lecture_name, video_file)
                with open(json_path, 'r', encoding='utf-8') as f:
                    for slide in json.load(f):
                        source_img_path_full = os.path.join(KNOWLEDGE_PIPELINE_ROOT, slide.get('image_path', ''))
                        if os.path.exists(source_img_path_full):
                            dest_folder = os.path.join(pkg_assets_dir, 'slides', lecture_name); os.makedirs(dest_folder, exist_ok=True)
                            shutil.copy(source_img_path_full, dest_folder)
                            relative_image_path = os.path.join('assets', 'slides', lecture_name, os.path.basename(source_img_path_full))
                            all_knowledge_entries.append({"type": "lecture_slide", "source": f"Lecture: {lecture_name.replace('_', ' ')}", "title": slide.get('title', 'N/A'), "summary": slide.get('summary', ''), "full_content": slide.get('cleaned_transcript', ''), "image_path": relative_image_path, "notebook_links": slide.get('notebook_links', []), "searchable_text": f"lecture slide {slide.get('title', '')} {slide.get('summary', '')}", "video_path": relative_video_path, "start_time": slide.get('start_time')})

    print(f"\n[Packager] Compiled {len(all_knowledge_entries)} entries. Copying {len(referenced_pdfs)} source PDF(s)...")
    pkg_pdfs_dir = os.path.join(package_path, 'pdfs'); os.makedirs(pkg_pdfs_dir, exist_ok=True)
    for pdf in referenced_pdfs:
        if os.path.exists(os.path.join(SOURCE_PDFS_DIR, pdf)): shutil.copy(os.path.join(SOURCE_PDFS_DIR, pdf), pkg_pdfs_dir)
    return all_knowledge_entries

def generate_knowledge_hub_html(knowledge_data: list, package_path: str):
    print("\n[HTML Generator] Building the final interface...")
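    # Escape "</" so text containing "</script>" cannot close the inline <script> tag early.
    embedded_json_data = json.dumps(knowledge_data).replace('</', '<\\/')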
    hub_html_path = os.path.join(package_path, 'KNOWLEDGE_HUB.html')
    html_template = f"""
    <!DOCTYPE html><html lang="en"><head><meta charset="UTF-8"><title>Knowledge Hub</title><style>:root{{--bg-color:#f8f9fa;--card-bg:#fff;--text-color:#212529;--primary-color:#4285F4;--shadow:0 2px 4px rgba(0,0,0,0.1)}}body{{font-family:Roboto,sans-serif;margin:0;background-color:var(--bg-color);}}.container{{width:95%;max-width:1800px;margin:20px auto}}.search-bar{{width:100%;max-width:800px;margin:0 auto 30px;position:relative}}#searchInput{{width:100%;font-size:1.2em;padding:12px 20px;border-radius:24px;border:1px solid #dfe1e5;box-shadow:var(--shadow)}}.filter-pills{{display:flex;justify-content:center;gap:12px;margin-bottom:25px}}.filter-pill{{padding:8px 16px;border-radius:18px;background-color:var(--card-bg);cursor:pointer;border:1px solid #dfe1e5;transition:all .2s}}.filter-pill.active{{background-color:var(--primary-color);color:#fff;border-color:var(--primary-color)}}#results-grid{{display:grid;grid-template-columns:repeat(auto-fill,minmax(320px,1fr));gap:25px}}#pagination{{text-align:center;margin-top:30px}}#pagination button{{padding:8px 16px;margin:0 5px;cursor:pointer;border-radius:4px;border:1px solid #dfe1e5;background-color:#fff}}.result-card{{background-color:var(--card-bg);border-radius:8px;overflow:hidden;display:flex;flex-direction:column;box-shadow:var(--shadow);text-decoration:none;color:inherit}}.card-image-container{{width:100%;height:200px;background-color:#eee;cursor:pointer}}.card-content{{padding:15px;flex-grow:1}}.card-footer{{padding:10px 15px;border-top:1px solid #dfe1e5;display:flex;justify-content:space-between;align-items:center}}.video-btn{{background-color:var(--primary-color);color:#fff;border:0;padding:6px 12px;border-radius:4px;cursor:pointer}}.links-container{{display:flex;flex-wrap:wrap;gap:5px;margin-top:10px}}.link-tag{{background-color:#e9ecef;padding:3px 
8px;border-radius:5px;font-size:.75em}}.modal-overlay{{position:fixed;top:0;left:0;width:100%;height:100%;background:rgba(0,0,0,.7);display:flex;justify-content:center;align-items:center;z-index:1000;opacity:0;visibility:hidden}}.modal-overlay.visible{{opacity:1;visibility:visible}}.modal-content{{background:#fff;padding:25px;border-radius:8px;width:80vw;max-width:1200px;height:80vh;display:flex;gap:20px}}.modal-text-content{{flex:1;overflow-y:auto;padding-right:15px}}.modal-image-container{{flex:1;display:flex;align-items:center;justify-content:center}}img{{max-width:100%;max-height:100%;object-fit:cover;}}.modal-video-content{{background:#000;padding:10px;border-radius:8px}}video{{max-width:85vw;max-height:85vh}}</style></head><body><div class="container"><div class="search-bar"><input type="text" id="searchInput" placeholder="Search..."></div><div class="filter-pills"><div class="filter-pill active" data-filter="all">All</div><div class="filter-pill" data-filter="lecture_slide">Lectures</div><div class="filter-pill" data-filter="textbook_figure">Figures</div><div class="filter-pill" data-filter="textbook_chunk">Textbook</div></div><div id="results-count"></div><div id="results-grid"></div><div id="pagination"></div></div><div class="modal-overlay" id="quickLookModal"><div class="modal-content"><div class="modal-text-content"><h2 id="modal-title"></h2><div id="modal-text"></div></div><div class="modal-image-container"><img id="modal-image" src=""></div></div></div><div class="modal-overlay" id="videoModal"><div class="modal-video-content"><video id="modalVideo" controls></video></div></div>
    <script>
        const data = {embedded_json_data};
        const RPP = 24; let currentPage = 1, currentFilter = 'all', currentSearch = '';
        const grid=document.getElementById("results-grid"), input=document.getElementById("searchInput"), count=document.getElementById("results-count"), pagination=document.getElementById("pagination"), pills=document.querySelectorAll('.filter-pill'), qlModal=document.getElementById("quickLookModal"), qlTitle=document.getElementById("modal-title"), qlText=document.getElementById("modal-text"), qlImg=document.getElementById("modal-image"), vModal=document.getElementById("videoModal"), vPlayer=document.getElementById("modalVideo");
        function render() {{
            const filtered = data.filter(s => (currentFilter === 'all' || s.type === currentFilter) && (!currentSearch || s.searchable_text.toLowerCase().includes(currentSearch)));
            count.textContent = `${{filtered.length}} results found.`;
            const pageData = filtered.slice((currentPage - 1) * RPP, currentPage * RPP);
            grid.innerHTML = pageData.map(s => {{
                const originalIndex = data.indexOf(s);
                const imageTag = s.image_path ? `<div class="card-image-container" onclick="openQuickLook(${{originalIndex}})"><img src="${{s.image_path}}"></div>` : '';
                const links = s.notebook_links || [];
                const linksTag = links.length > 0 ? `<div class="links-container">${{links.map(t => `<span class="link-tag">${{typeof t === 'string' ? t : (t.heading || '')}}</span>`).join('')}}</div>` : '';
                let footer = `<div class="card-footer"><span>${{s.source}}</span>`;
                if (s.video_path) {{ footer += `<button class="video-btn" onclick="event.stopPropagation(); event.preventDefault(); openVideo('${{s.video_path}}', ${{s.start_time}})">Go to Video</button>`; }}
                footer += `</div>`;
                const card_inner = `${{imageTag}}<div class="card-content" onclick="openQuickLook(${{originalIndex}})"><h3>${{s.title}}</h3><p>${{s.summary}}</p>${{linksTag}}</div>${{footer}}`;
                return s.pdf_link ? `<a href="${{s.pdf_link}}" target="_blank" class="result-card">${{card_inner}}</a>` : `<div class="result-card">${{card_inner}}</div>`;
            }}).join('');
            renderPagination(filtered.length);
        }}
        function renderPagination(total) {{ const totalPages = Math.ceil(total / RPP); pagination.innerHTML = ''; if (totalPages <= 1) return; pagination.innerHTML = `<button onclick="changePage(-1)" ${{currentPage === 1 ? 'disabled' : ''}}>Previous</button><span> Page ${{currentPage}} of ${{totalPages}} </span><button onclick="changePage(1)" ${{currentPage === totalPages ? 'disabled' : ''}}>Next</button>`; }}
        function changePage(dir) {{ currentPage += dir; render(); window.scrollTo(0, 0); }}
        function openQuickLook(index){{ const item=data[index]; qlTitle.textContent=item.title; qlText.innerHTML=(item.full_content||item.summary).replace(/\\n/g,'<br>'); qlImg.style.display=item.image_path?'flex':'none'; qlImg.src=item.image_path||''; qlModal.classList.add("visible"); }}
        function openVideo(path, time){{ vPlayer.src=path; vPlayer.currentTime=time; vModal.classList.add("visible"); vPlayer.play(); }}
        input.addEventListener("input", e => {{ currentPage = 1; currentSearch = e.target.value.toLowerCase().trim(); render(); }});
        pills.forEach(p => p.addEventListener('click', e => {{ pills.forEach(pi => pi.classList.remove('active')); e.target.classList.add('active'); currentPage = 1; currentFilter = e.target.dataset.filter; render(); }}));
        [qlModal, vModal].forEach(m => m.addEventListener("click", e => {{ if(e.target === m){{ m.classList.remove("visible"); if(m===vModal) vPlayer.pause(); }} }}));
        document.addEventListener('DOMContentLoaded', render);
    </script></body></html>
    """
    with open(hub_html_path, 'w', encoding='utf-8') as f: f.write(html_template)
    print("✅ Hub interface saved.")

# ------------------------------------------------------------------------------
# INTERACTIVE EXECUTION
# ------------------------------------------------------------------------------
try:
    KNOWLEDGE_PIPELINE_ROOT = '/content/drive/MyDrive/1-Projects/Knowledge_Pipeline'
    CONTENT_LIBRARY_DIR = os.path.join(KNOWLEDGE_PIPELINE_ROOT, '_content_library')
    ASSET_LECTURES_DIR = os.path.join(KNOWLEDGE_PIPELINE_ROOT, '_asset_library', 'lectures')
    SOURCE_PDFS_DIR = os.path.join(KNOWLEDGE_PIPELINE_ROOT, '_source_materials', 'pdfs')
    OUTPUTS_DIR = os.path.join(KNOWLEDGE_PIPELINE_ROOT, '_outputs')

    available_lectures = sorted([d for d in os.listdir(os.path.join(CONTENT_LIBRARY_DIR, 'lectures')) if os.path.isdir(os.path.join(CONTENT_LIBRARY_DIR, 'lectures', d))])
    available_textbooks = sorted(list(set([f.replace('_CONTENT.json', '').replace('_FIGURES.json', '') for f in os.listdir(os.path.join(CONTENT_LIBRARY_DIR, 'textbooks'))])))

    print("\nPlease select LECTURES to include (e.g., 1,3 or 'all'):")
    for i, name in enumerate(available_lectures):
        print(f"  [{i+1}] {name}")
    lecture_input = input("Lectures: ").strip().lower()
    selected_lectures = []
    if lecture_input == 'all': selected_lectures = available_lectures
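    elif lecture_input:
        picks = [int(c) for c in lecture_input.split(',') if c.strip()]
        # Reject 0 and out-of-range numbers; index -1 would otherwise silently select the last entry.
        if any(p < 1 or p > len(available_lectures) for p in picks): raise ValueError("Lecture selection out of range.")
        selected_lectures = [available_lectures[p-1] for p in picks]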

    print("\nPlease select TEXTBOOKS to include (e.g., 1,2 or 'all'):")
    for i, name in enumerate(available_textbooks):
        print(f"  [{i+1}] {name}")
    textbook_input = input("Textbooks: ").strip().lower()
    selected_textbooks = []
    if textbook_input == 'all': selected_textbooks = available_textbooks
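    elif textbook_input:
        picks = [int(c) for c in textbook_input.split(',') if c.strip()]
        # Reject 0 and out-of-range numbers; index -1 would otherwise silently select the last entry.
        if any(p < 1 or p > len(available_textbooks) for p in picks): raise ValueError("Textbook selection out of range.")
        selected_textbooks = [available_textbooks[p-1] for p in picks]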

    if not selected_lectures and not selected_textbooks: raise ValueError("No content selected.")

    package_name = input("\nEnter a name for your Hub Package: ").strip() or "Knowledge_Hub_Package"
    package_path = os.path.join(OUTPUTS_DIR, package_name)

    if os.path.exists(package_path):
        action = input(f"Warning: Package folder '{package_name}' already exists. [O]verwrite or [C]ancel? ").strip().lower()
        # Delete only on an explicit 'o'; any other answer cancels.
        if action != 'o': raise SystemExit("--> Creation cancelled.")
        shutil.rmtree(package_path)
    os.makedirs(package_path)

    compiled_data = compile_and_package_hub(selected_lectures, selected_textbooks, package_path, KNOWLEDGE_PIPELINE_ROOT, CONTENT_LIBRARY_DIR, ASSET_LECTURES_DIR, SOURCE_PDFS_DIR)
    generate_knowledge_hub_html(compiled_data, package_path)

    print(f"\n[Compressing] Zipping the package '{package_name}'...")
    zip_path_base = os.path.join(OUTPUTS_DIR, package_name)
    shutil.make_archive(base_name=zip_path_base, format='zip', root_dir=package_path)
    shutil.rmtree(package_path)

    print("\n" + "="*80)
    print("✅ KNOWLEDGE HUB PACKAGE CREATED & ZIPPED!")
    print(f"ACTION REQUIRED: Download '{package_name}.zip' from your '_outputs' folder, unzip, and open 'KNOWLEDGE_HUB.html'.")
    print("="*80)

except Exception as e:
    print(f"\nAn unexpected error occurred: {e}")

--- STEP 4.1: DEFINING THE HUB PUBLISHING PIPELINE ---

Please select LECTURES to include (e.g., 1,3 or 'all'):
  [1] BST_Lecture_1_Grossing
  [2] BST_Lecture_2_SoftTissue1
  [3] BST_Lecture_3_SofTissue2
  [4] BST_Lecture_4_SoftTissue3
  [5] BST_Lecture_5_Bone1
  [6] BST_Lecture_6_Bone2
  [7] Breast_Lecture_Epithelial
  [8] Breast_Lecture_Fibroepithelial
  [9] Breast_Lecture_Grossing
  [10] Breast_Lecture_IHC
  [11] Breast_Lecture_Invasive
  [12] Breast_Lecture_Lobular
  [13] Breast_Lecture_Normal
  [14] Breast_Lecture_Papillary
  [15] Breast_Lecture_Prognostics
  [16] Breast_Lecture_Rad-Path
  [17] Breast_Lecture_SpindleCell
  [18] Breast_Lecture_Treated
  [19] GI_Lecture_0_Gross_Liver
  [20] GI_Lecture_10_Colon
  [21] GI_Lecture_11_Esophagus
  [22] GI_Lecture_12_SmallIntestine
  [23] GI_Lecture_13_Stomach2
  [24] GI_Lecture_1_Liver1
  [25] GI_Lecture_2_Liver2
  [26] GI_Lecture_3_Liver3
  [27] GI_Lecture_4_Peds1
  [28] GI_Lecture_5_Peds2
  [29] GI_Lecture_7_IBD
  [30] GI_Lecture_8_Pan

In [None]:
# @title {display-mode: "form"}
# ==============================================================================
# BLOCK 5: BUILD RAG CURATION HUB (v2.2 - ENHANCED UI LAYOUT)
# ==============================================================================
#
# PURPOSE:
# Builds the RAG Curation Hub with a three-panel layout (item list, image,
# actions) designed to speed up the curation workflow.
#
# NEW FEATURES:
# 1.  **Dedicated Image Panel:** The middle panel is now solely for displaying the
#     page or slide image, maximizing visual context.
# 2.  **Split Action Panel:** The right-hand panel is now split vertically:
#     - **Top Half:** Contains all audit tools (current links, search, add/flag buttons).
#     - **Bottom Half:** Displays the full text content of the selected item.
# ==============================================================================

import json
import os
import shutil
import re
from IPython.display import display, clear_output

print("--- INITIALIZING RAG CURATION HUB BUILDER (v2.2) ---")

# ------------------------------------------------------------------------------
# STEP 5.1: DEFINE THE HUB GENERATION LOGIC
# ------------------------------------------------------------------------------

def generate_curation_hub_html(package_path: str, project_name: str, notebook_name: str, project_type: str):
    """
    Generates the self-contained HTML file with the enhanced 3-panel split layout.
    """
    print("  -> Generating CURATION_HUB.html file...")
    hub_html_path = os.path.join(package_path, 'CURATION_HUB.html')

    # HTML template updated with new structure and styles for the split right panel.
    html_template = f"""
    <!DOCTYPE html>
    <html lang="en">
    <head>
        <meta charset="UTF-8">
        <title>RAG Curation Hub: {project_name}</title>
        <script src="https://cdn.jsdelivr.net/npm/fuse.js/dist/fuse.min.js"></script>
        <style>
            :root {{ --bg-main: #f0f2f5; --bg-panel: #ffffff; --border-color: #d9d9d9; --text-color: #333; --primary: #1890ff; --danger: #f5222d; --success: #52c41a; --warning: #faad14; }}
            body {{ font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, 'Helvetica Neue', Arial, sans-serif; margin: 0; background-color: var(--bg-main); display: flex; height: 100vh; overflow: hidden; }}
            .panel {{ background-color: var(--bg-panel); border-right: 1px solid var(--border-color); display: flex; flex-direction: column; }}
            #nav-panel {{ width: 25%; min-width: 350px; }}
            #context-panel {{ width: 50%; padding: 16px; box-sizing: border-box; justify-content: center; align-items: center; background-color: #e9ecef; }}
            #action-panel {{ width: 25%; min-width: 350px; border-right: none; }}
            .panel-header {{ display:flex; justify-content: space-between; align-items: center; padding: 12px 16px; border-bottom: 1px solid var(--border-color); font-weight: 600; background-color: #fafafa; flex-shrink: 0; }}
            .panel-content {{ overflow-y: auto; flex-grow: 1; }}
            #action-panel-top {{ height: 50%; overflow-y: auto; border-bottom: 1px solid var(--border-color); padding-bottom: 10px; }}
            #action-panel-bottom {{ height: 50%; overflow-y: auto; padding: 16px; }}
            #item-list {{ list-style-type: none; padding: 0; margin: 0; }}
            .list-item {{ display: flex; align-items: center; padding: 10px 16px; border-bottom: 1px solid #f0f0f0; cursor: pointer; }}
            .list-item:hover {{ background-color: #e6f7ff; }}
            .list-item.active {{ background-color: #bae7ff; font-weight: 500;}}
            .list-item input[type='checkbox'] {{ margin-right: 12px; }}
            .flag-icon {{ margin-left: auto; color: var(--danger); font-size: 1.2em; }}
            #page-image {{ max-width: 100%; max-height: 100%; object-fit: contain; box-shadow: 0 4px 8px rgba(0,0,0,0.1); }}
            #current-links-list, #search-results-list {{ list-style-type: none; padding: 0 16px; }}
            .link-item, .result-item {{ display: flex; justify-content: space-between; align-items: center; padding: 8px; border-bottom: 1px solid #f0f0f0; }}
            .link-item button, .result-item button {{ background: none; border: none; cursor: pointer; font-size: 1.2em; }}
            #search-input, #manual-heading-input, #manual-source-input {{ width: calc(100% - 32px); padding: 8px; margin: 8px 16px 0 16px; border: 1px solid var(--border-color); border-radius: 4px; }}
            .action-button {{ display: block; width: calc(100% - 32px); padding: 10px; margin: 8px 16px; border: none; border-radius: 4px; font-size: 0.9em; cursor: pointer; }}
            #save-button {{ background-color: var(--success); color: white; padding: 12px; font-size: 1em; margin-top: auto; }}
        </style>
    </head>
    <body>
        <div class="panel" id="nav-panel">
            <div class="panel-header">Items to Review</div>
            <div class="panel-content"><ul id="item-list"></ul></div>
        </div>
        <div class="panel" id="context-panel">
            <img id="page-image" src="" alt="Context Image">
        </div>
        <div class="panel" id="action-panel">
            <div class="panel-header">Audit & Actions</div>
            <div id="action-panel-top" class="panel-content">
                <button id="flag-button" class="action-button">🚩 Flag for Review</button>
                <button id="apply-to-page-button" class="action-button">🔄 Apply Links to All on Page</button>
                <hr style="margin: 16px;">
                <h4>Current Links</h4>
                <ul id="current-links-list"></ul>
                <hr style="margin: 16px;">
                <h4>Add Link via Search</h4>
                <input type="text" id="search-input" placeholder="Search notebook headings...">
                <ul id="search-results-list"></ul>
                <hr style="margin: 16px;">
                <h4>Add Link Manually</h4>
                <input type="text" id="manual-heading-input" placeholder="Enter new heading...">
                <input type="text" id="manual-source-input" placeholder="Enter source notebook...">
                <button id="manual-add-button" class="action-button">➕ Add Manual Link</button>
            </div>
            <div id="action-panel-bottom" class="panel-content">
                <h4 id="content-header">Content</h4>
                <div id="item-content-display"></div>
            </div>
            <button id="save-button" class="action-button">💾 Save & Download Updated JSON(s)</button>
        </div>

        <script>
            let allItems = [], notebookChunks = [], fuse, currentItemIndex = -1;
            const reviewedItems = new Set();
            const PROJECT_TYPE = '{project_type}', PROJECT_NAME = '{project_name}', NOTEBOOK_NAME = '{notebook_name}';
            const dom = {{
                itemList: document.getElementById('item-list'),
                pageImage: document.getElementById('page-image'),
                itemContent: document.getElementById('item-content-display'),
                contentHeader: document.getElementById('content-header'),
                currentLinks: document.getElementById('current-links-list'),
                searchInput: document.getElementById('search-input'),
                searchResults: document.getElementById('search-results-list'),
                saveButton: document.getElementById('save-button'),
                flagButton: document.getElementById('flag-button'),
                applyToPageButton: document.getElementById('apply-to-page-button'),
                manualHeadingInput: document.getElementById('manual-heading-input'),
                manualSourceInput: document.getElementById('manual-source-input'),
                manualAddButton: document.getElementById('manual-add-button')
            }};

            async function initialize() {{
                try {{
                    let dataPromises = (PROJECT_TYPE === 'textbook') ?
                        [fetch(`./data/${{PROJECT_NAME}}_CONTENT.json`).then(r => r.json()), fetch(`./data/${{PROJECT_NAME}}_FIGURES.json`).then(r => r.json())] :
                        [fetch(`./data/final_ENHANCED_data.json`).then(r => r.json())];
                    dataPromises.push(fetch(`./notebook/{notebook_name}`).then(r => r.text()));

                    const results = await Promise.all(dataPromises);
                    allItems = (PROJECT_TYPE === 'textbook') ? [...results[0], ...results[1]] : results[0];
                    const notebookMd = results[results.length - 1];

                    const sortKey = PROJECT_TYPE === 'textbook' ? 'page_number' : 'start_time';
                    const pageKey = PROJECT_TYPE === 'textbook' ? 'source_page' : null;
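                    // Use ?? so a start_time of 0 is kept; || would treat it as missing and produce NaN.
                    const sortValue = item => item[sortKey] ?? (pageKey ? item[pageKey] : undefined) ?? 0;
                    allItems.sort((a, b) => sortValue(a) - sortValue(b));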
                    allItems.forEach((item, index) => {{ item.original_index = index; item.flagged_for_review = item.flagged_for_review || false; }});

                    parseNotebook(notebookMd, '{notebook_name}');
                    setupFuseSearch();
                    renderItemList();
                    if(allItems.length > 0) displayItem(0);
                }} catch (error) {{
                    console.error("Initialization Failed:", error);
                    alert("Error loading data. Check console for details and ensure all files are in the package.");
                }}
            }}

            function parseNotebook(markdown, sourceName) {{
                const splitPattern = /^(?:##|###)\\s+(.*)/gm;
                const chunks = []; let lastIndex = 0; let match;
                while ((match = splitPattern.exec(markdown)) !== null) {{
                    if (match.index > lastIndex) chunks.push({{ content: markdown.substring(lastIndex, match.index).trim() }});
                    chunks.push({{ heading: match[1].trim(), level: match[0].startsWith('###') ? 3 : 2, content: '' }});
                    lastIndex = splitPattern.lastIndex;
                }}
                if (lastIndex < markdown.length) chunks.push({{ content: markdown.substring(lastIndex).trim() }});
                let currentMain = null;
                notebookChunks = chunks.map(c => {{
                    if (c.level === 2) currentMain = c.heading;
                    if (c.level) return {{ main_heading: (c.level === 2 ? c.heading : currentMain), sub_heading: (c.level === 3 ? c.heading : null), source: sourceName }};
                    return null;
                }}).filter(Boolean);
            }}

            function setupFuseSearch() {{ fuse = new Fuse(notebookChunks, {{ keys: ['main_heading', 'sub_heading'], threshold: 0.3 }}); }}

            function renderItemList() {{
                dom.itemList.innerHTML = allItems.map((item, index) => {{
                    let title, pageOrTime;
                    if (PROJECT_TYPE === 'textbook') {{
                        const isFigure = item.type === 'textbook_figure';
                        title = isFigure ? `FIG: ${{item.figure_id || 'N/A'}}` : `CHUNK: ${{item.headings?.sub_heading || item.headings?.main_heading || 'Intro'}}`;
                        pageOrTime = `p.${{item.page_number || item.source_page}}`;
                    }} else {{
                        const mins = Math.floor(item.start_time / 60); const secs = Math.floor(item.start_time % 60);
                        title = `SLIDE: ${{item.title || 'Untitled'}}`;
                        pageOrTime = `t=${{mins}}:${{String(secs).padStart(2, '0')}}`;
                    }}
                    const flagIcon = item.flagged_for_review ? '<span class="flag-icon">🚩</span>' : '';
                    return `<li class="list-item ${{index === currentItemIndex ? 'active' : ''}}" onclick="displayItem(${{index}})">
                                <input type="checkbox" onclick="event.stopPropagation()" onchange="toggleReviewed(${{index}}, this.checked)" ${{reviewedItems.has(index) ? 'checked' : ''}}>
                                <span>${{title}} (${{pageOrTime}})</span>${{flagIcon}}
                            </li>`;
                }}).join('');
            }}

            function displayItem(index) {{
                currentItemIndex = index;
                const item = allItems[index];
                if (!item) return;

                if (PROJECT_TYPE === 'textbook') {{
                    const pageNum = item.page_number || item.source_page;
                    dom.pageImage.src = `./assets/${{PROJECT_NAME}}/page_images/page_${{String(pageNum).padStart(4, '0')}}.png`;
                    dom.contentHeader.textContent = "Chunk / Figure Content";
                    dom.itemContent.innerHTML = `<p>${{item.content || item.description}}</p>`;
                }} else {{
                    dom.pageImage.src = `./assets/${{PROJECT_NAME}}/slide_images/${{item.image_path.split('/').pop()}}`;
                    dom.contentHeader.textContent = "Enhanced Transcript";
                    dom.itemContent.innerHTML = `<p>${{item.cleaned_transcript || ''}}</p>`;
                }}
                updateFlagButtonState();
                renderCurrentLinks();
                renderItemList();
            }}

            function renderCurrentLinks() {{
                if (currentItemIndex === -1) return;
                const links = allItems[currentItemIndex].notebook_links || [];
                dom.currentLinks.innerHTML = links.map((link, index) => `
                    <li class="link-item">
                        <span>${{link.heading}} <i>(${{link.source.replace('.md','')}})</i></span>
                        <button onclick="deleteLink(${{index}})">❌</button>
                    </li>`).join('');
            }}

            function renderSearchResults(results) {{
                dom.searchResults.innerHTML = results.slice(0, 10).map(result => {{
                    const main = result.item.main_heading; const sub = result.item.sub_heading || '';
                    const fullHeading = sub ? `${{main}} - ${{sub}}` : main;
                    return `<li class="result-item"><span>${{fullHeading}}</span><button onclick="addLink('${{main.replace(/'/g, "\\\\'") }}', '${{sub.replace(/'/g, "\\\\'") }}')">➕</button></li>`;
                }}).join('');
            }}

            function updateFlagButtonState() {{
                const item = allItems[currentItemIndex];
                dom.flagButton.style.backgroundColor = item.flagged_for_review ? 'var(--danger)' : 'var(--warning)';
                dom.flagButton.textContent = item.flagged_for_review ? '🚩 Unflag Item' : '🚩 Flag for Review';
            }}

            function toggleReviewed(index, isChecked) {{ if (isChecked) reviewedItems.add(index); else reviewedItems.delete(index); }}
            function deleteLink(linkIndex) {{ if (currentItemIndex === -1) return; allItems[currentItemIndex].notebook_links.splice(linkIndex, 1); renderCurrentLinks(); }}

            function addLink(main, sub, source = NOTEBOOK_NAME, score = 1.00) {{
                if (currentItemIndex === -1) return;
                const newHeading = sub ? `${{main}} - ${{sub}}` : main;
                const newLink = {{ heading: newHeading, source: source, score: score }};
                if (!allItems[currentItemIndex].notebook_links) allItems[currentItemIndex].notebook_links = [];
                if (!allItems[currentItemIndex].notebook_links.some(l => l.heading === newHeading)) allItems[currentItemIndex].notebook_links.push(newLink);
                renderCurrentLinks();
                dom.searchInput.value = ''; dom.searchResults.innerHTML = '';
            }}

            function flagItem() {{ if (currentItemIndex === -1) return; allItems[currentItemIndex].flagged_for_review = !allItems[currentItemIndex].flagged_for_review; updateFlagButtonState(); renderItemList(); }}

            function applyToPage() {{
                if (currentItemIndex === -1 || PROJECT_TYPE !== 'textbook') {{
                    alert("This function is only available for textbook projects.");
                    return;
                }}
                const currentItem = allItems[currentItemIndex];
                const pageNum = currentItem.page_number || currentItem.source_page;
                if (!pageNum) {{ alert("This item does not have a page number."); return; }}
                const linksToApply = currentItem.notebook_links || [];
                let updatedCount = 0;
                allItems.forEach(item => {{
                    if ((item.page_number || item.source_page) === pageNum) {{
                        item.notebook_links = JSON.parse(JSON.stringify(linksToApply));
                        updatedCount++;
                    }}
                }});
                alert(`Applied links to ${{updatedCount}} items on page ${{pageNum}}.`);
                renderCurrentLinks();
            }}

            function saveAndDownload() {{
                let dataToSave = {{}};
                if (PROJECT_TYPE === 'textbook') {{
                    dataToSave[`${{PROJECT_NAME}}_CONTENT_CURATED.json`] = allItems.filter(i => i.type === 'textbook_chunk').map(i => {{ const newItem={{...i}}; delete newItem.original_index; return newItem; }});
                    dataToSave[`${{PROJECT_NAME}}_FIGURES_CURATED.json`] = allItems.filter(i => i.type === 'textbook_figure').map(i => {{ const newItem={{...i}}; delete newItem.original_index; return newItem; }});
                }} else {{
                    dataToSave['final_ENHANCED_data_CURATED.json'] = allItems.map(i => {{ const newItem={{...i}}; delete newItem.original_index; return newItem; }});
                }}
                for (const filename in dataToSave) {{
                    const blob = new Blob([JSON.stringify(dataToSave[filename], null, 4)], {{ type: 'application/json' }});
                    const a = document.createElement('a');
                    a.href = URL.createObjectURL(blob); a.download = filename; a.click(); URL.revokeObjectURL(a.href);
                }}
                alert("Your curated JSON file(s) are being downloaded!");
            }}

            // --- EVENT LISTENERS ---
            dom.searchInput.addEventListener('input', e => {{ const q = e.target.value; if (q.length > 2) renderSearchResults(fuse.search(q)); else dom.searchResults.innerHTML = ''; }});
            dom.saveButton.addEventListener('click', saveAndDownload);
            dom.flagButton.addEventListener('click', flagItem);
            dom.applyToPageButton.addEventListener('click', applyToPage);
            dom.manualAddButton.addEventListener('click', () => {{
                const heading = dom.manualHeadingInput.value.trim();
                const source = dom.manualSourceInput.value.trim() || NOTEBOOK_NAME;
                if (heading) addLink(heading, '', source);
                dom.manualHeadingInput.value = ''; dom.manualSourceInput.value = '';
            }});

            document.addEventListener('DOMContentLoaded', initialize);
        </script>
    </body>
    </html>
    """
    with open(hub_html_path, 'w', encoding='utf-8') as f:
        f.write(html_template)
    print("  ✅ Curation Hub HTML generated successfully.")

def build_curation_package(project_name, notebook_path, project_type, PATHS):
    """Copy the project's data, assets, and notebook into a package folder, generate the hub HTML, and zip it."""
    package_name = f"Curation_Session_{project_name}"
    package_path = os.path.join(PATHS['outputs'], package_name)
    print(f"\n[Builder] Creating package at: {os.path.relpath(package_path, '/content/drive/MyDrive/')}")

    if os.path.exists(package_path):
        shutil.rmtree(package_path)
    os.makedirs(package_path)

    data_dir = os.path.join(package_path, 'data')
    notebook_dir = os.path.join(package_path, 'notebook')
    assets_dir = os.path.join(package_path, 'assets', project_name)
    os.makedirs(data_dir); os.makedirs(notebook_dir); os.makedirs(assets_dir)

    print("  -> Copying data and asset files...")
    if project_type == 'textbook':
        for file_suffix in ['_CONTENT.json', '_FIGURES.json']:
            src = os.path.join(PATHS['content_textbooks'], f"{project_name}{file_suffix}")
            if os.path.exists(src): shutil.copy(src, data_dir)
            else: print(f"  - ⚠️ WARNING: {os.path.basename(src)} not found.")
        asset_src_dir = os.path.join(PATHS['asset_textbooks'], project_name)
        if os.path.isdir(asset_src_dir):
            shutil.copytree(asset_src_dir, assets_dir, dirs_exist_ok=True)
            print(f"    - Copied textbook assets for {project_name}.")
        else:
            print(f"  - ⚠️ WARNING: Asset directory not found: {asset_src_dir}")

    elif project_type == 'lecture':
        src = os.path.join(PATHS['content_lectures'], project_name, 'final_ENHANCED_data.json')
        if os.path.exists(src): shutil.copy(src, data_dir)
        else: print(f"  - ⚠️ WARNING: {os.path.basename(src)} not found.")
        asset_src_dir = os.path.join(PATHS['asset_lectures'], project_name)
        if os.path.isdir(asset_src_dir):
            shutil.copytree(asset_src_dir, assets_dir, dirs_exist_ok=True)
            print(f"    - Copied lecture assets for {project_name}.")
        else:
            print(f"  - ⚠️ WARNING: Asset directory not found: {asset_src_dir}")

    print("  -> Copying notebook file...")
    shutil.copy(notebook_path, notebook_dir)

    generate_curation_hub_html(package_path, project_name, os.path.basename(notebook_path), project_type)

    print(f"\n[Compressing] Zipping the package '{package_name}'...")
    zip_path_base = os.path.join(PATHS['outputs'], package_name)
    shutil.make_archive(base_name=zip_path_base, format='zip', root_dir=package_path)
    shutil.rmtree(package_path)

    print("\n" + "="*80)
    print("✅ RAG CURATION HUB PACKAGE CREATED & ZIPPED!")
    print(f"ACTION REQUIRED: Download '{package_name}.zip' from your '_outputs' folder, unzip it, and open 'CURATION_HUB.html' in your browser.")
    print("="*80)

# ------------------------------------------------------------------------------
# INTERACTIVE EXECUTION
# ------------------------------------------------------------------------------
try:
    if 'PATHS' not in globals():
        raise NameError("PATHS dictionary not found. Please run Block 0 first.")

    print("\n--- SELECT PROJECT TYPE TO CURATE ---")
    print("  [1] Textbook Project")
    print("  [2] Lecture Project")
    type_choice = input("Enter your choice (1 or 2): ").strip()

    project_type = None; available_projects = []
    if type_choice == '1':
        project_type = 'textbook'
        project_dir = PATHS['content_textbooks']
        available_projects = sorted(list(set([f.replace('_CONTENT.json', '').replace('_FIGURES.json', '') for f in os.listdir(project_dir) if f.endswith(('_CONTENT.json', '_FIGURES.json'))])))
    elif type_choice == '2':
        project_type = 'lecture'
        project_dir = PATHS['content_lectures']
        available_projects = sorted([d for d in os.listdir(project_dir) if os.path.isdir(os.path.join(project_dir, d))])
    else:
        raise ValueError("Invalid project type selected.")

    if not available_projects:
        print(f"❌ No processed {project_type} projects found to curate.")
    else:
        print(f"\n--- SELECT {project_type.upper()} PROJECT TO CURATE ---")
        for i, name in enumerate(available_projects):
            print(f"  [{i+1}] {name}")
        project_choice_input = input(f"Enter the number of the {project_type} project: ").strip()
        project_choice = int(project_choice_input) - 1

        if 0 <= project_choice < len(available_projects):
            selected_project_name = available_projects[project_choice]
            print(f"\nSelected Project: {selected_project_name}")

            print("\n--- SELECT YOUR NOTEBOOK FILE ---")
            DEFAULT_NOTEBOOK_DIR = PATHS['notebooks']
            discovered_notebooks = []
            if os.path.isdir(DEFAULT_NOTEBOOK_DIR):
                discovered_notebooks = sorted([f for f in os.listdir(DEFAULT_NOTEBOOK_DIR) if f.lower().endswith('.md')])

            if discovered_notebooks:
                print("Discovered notebooks in your default directory:")
                for i, name in enumerate(discovered_notebooks):
                    print(f"  [{i+1}] {name}")
                print("\nSelect by number, or paste the full path to any other .md file.")
            else:
                print("No notebooks found. Please provide the full path to your .md file.")

            notebook_input = input("Enter selection: ").strip()
            notebook_path = None
            try:
                notebook_choice_index = int(notebook_input) - 1
                if 0 <= notebook_choice_index < len(discovered_notebooks):
                    notebook_path = os.path.join(DEFAULT_NOTEBOOK_DIR, discovered_notebooks[notebook_choice_index])
            except (ValueError, IndexError):
                if os.path.exists(notebook_input) and notebook_input.lower().endswith('.md'):
                    notebook_path = notebook_input

            if notebook_path:
                print(f"--> Using notebook: {os.path.basename(notebook_path)}")
                build_curation_package(selected_project_name, notebook_path, project_type, PATHS)
            else:
                print("❌ Invalid notebook selection or path provided.")
        else:
            print("❌ Invalid project selection.")

except Exception as e:
    print(f"\nAn unexpected error occurred during execution: {e}")
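
The notebook selection step above accepts either a list number or a full path to a `.md` file. A minimal sketch of that resolution as a standalone function follows. The name `resolve_notebook_choice` and its parameters are hypothetical and do not exist in the pipeline; the sketch only illustrates the two accepted input forms.

```python
import os
import tempfile


def resolve_notebook_choice(user_input, notebook_dir, discovered):
    """Return the chosen notebook path, or None if the input is not usable.

    Hypothetical helper: `discovered` is the list of .md filenames shown to
    the user; `user_input` is either a 1-based index or a path to a .md file.
    """
    text = user_input.strip()
    try:
        index = int(text) - 1
    except ValueError:
        index = None

    if index is not None:
        # Numeric input: valid only if it points at a listed notebook.
        if 0 <= index < len(discovered):
            return os.path.join(notebook_dir, discovered[index])
        return None

    # Non-numeric input: treat it as a path to an existing .md file.
    if text.lower().endswith('.md') and os.path.isfile(text):
        return text
    return None


# Usage: create a throwaway notebook and resolve it by number and by path.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, 'notes.md')
    with open(sample, 'w', encoding='utf-8') as f:
        f.write('# Notes\n')
    by_number = resolve_notebook_choice('1', tmp, ['notes.md'])
    by_path = resolve_notebook_choice(sample, tmp, ['notes.md'])
    out_of_range = resolve_notebook_choice('5', tmp, ['notes.md'])
```

Keeping the resolution separate from the `input()` call makes the two branches easy to check without an interactive session.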

# Utility Blocks

## Project-Based Migration Utility

In [None]:
# @title {display-mode: "form"}
# ==============================================================================
# BLOCK M: PROJECT-BASED MIGRATION UTILITY (v1.3 - FINAL PATH FIX)
# ==============================================================================
#
# PURPOSE:
# This definitive version fixes the bug where the script could not find the
# source '..._STANDARDIZED.json' file due to naming inconsistencies (space vs.
# underscore). It now intelligently checks for both possible folder names.
# ==============================================================================

import json
import os
import re
import shutil

print("--- INITIALIZING PROJECT-BASED MIGRATION UTILITY (v1.3) ---")

# ------------------------------------------------------------------------------
# MIGRATION LOGIC
# ------------------------------------------------------------------------------

def migrate_lecture_project(project_name, old_base_dir, new_content_dir, new_asset_dir, pipeline_root):
    """Migrates a complete lecture project folder."""
    print(f"\n🚀 Migrating LECTURE project: {project_name}")
    old_project_path = os.path.join(old_base_dir, project_name)

    old_json_path = os.path.join(old_project_path, 'final_ENHANCED_data.json')
    old_screenshots_path = os.path.join(old_project_path, 'screenshots')
    source_video_file = next((f for f in os.listdir(old_project_path) if f.lower().endswith(('.mp4', '.mov', '.mkv'))), None)

    new_content_project_path = os.path.join(new_content_dir, project_name)
    new_asset_project_path = os.path.join(new_asset_dir, project_name)
    new_slide_images_path = os.path.join(new_asset_project_path, 'slide_images')
    new_json_path = os.path.join(new_content_project_path, 'final_ENHANCED_data.json')

    os.makedirs(new_content_project_path, exist_ok=True)
    os.makedirs(new_slide_images_path, exist_ok=True)

    print("  -> Migrating assets...")
    copied_images = 0
    if os.path.isdir(old_screenshots_path):
        for img in os.listdir(old_screenshots_path):
            shutil.copy(os.path.join(old_screenshots_path, img), new_slide_images_path)
            copied_images += 1
        print(f"    - Copied {copied_images} slide images.")
    if source_video_file:
        shutil.copy(os.path.join(old_project_path, source_video_file), new_asset_project_path)
        print(f"    - Copied video file: {source_video_file}")

    print("  -> Migrating and transforming JSON data...")
    if not os.path.exists(old_json_path):
        print(f"    - ⚠️ WARNING: 'final_ENHANCED_data.json' not found. Cannot migrate data.")
        return

    with open(old_json_path, 'r', encoding='utf-8') as f: old_data = json.load(f)
    new_data = []
    for item in old_data:
        new_item = {k: v for k, v in item.items() if k not in ['video_index', 'video_filename', 'keywords', 'raw_transcript', 'screenshot_path']}
        image_filename = os.path.basename(item.get('screenshot_path', ''))
        new_item['image_path'] = os.path.relpath(os.path.join(new_slide_images_path, image_filename), pipeline_root)
        if 'notebook_heading' in item: new_item['notebook_links'] = item['notebook_heading']
        new_data.append(new_item)

    with open(new_json_path, 'w', encoding='utf-8') as f: json.dump(new_data, f, indent=4)
    print("  ✅ Success! Project migration complete.")


def migrate_textbook_project(project_name, old_content_dir, old_asset_dir, old_anno_dir, new_content_dir, new_asset_dir, pipeline_root):
    """Migrates a complete textbook project, merging and repairing data."""
    print(f"\n🚀 Migrating TEXTBOOK project: {project_name}")

    # ======================= THE CRITICAL FIX (v1.3) =======================
    # Intelligently find the path to the "good" metadata file by checking for
    # both underscore and space versions of the project name in the old directory.

    project_name_with_space = project_name.replace('_', ' ')

    # Path with underscore (e.g., /Annotation_Project/SoftTissue_Enzinger/...)
    path_option_1 = os.path.join(old_anno_dir, project_name, f"{project_name}_STANDARDIZED.json")
    # Path with space (e.g., /Annotation_Project/SoftTissue Enzinger/...)
    path_option_2 = os.path.join(old_anno_dir, project_name_with_space, f"{project_name}_STANDARDIZED.json")

    old_standard_fig_path = None
    if os.path.exists(path_option_1):
        old_standard_fig_path = path_option_1
    elif os.path.exists(path_option_2):
        old_standard_fig_path = path_option_2

    # =====================================================================

    old_content_json_path = os.path.join(old_content_dir, f"{project_name}_CONTENT.json")
    old_corrected_fig_path = os.path.join(old_content_dir, f"{project_name}_STANDARDIZED_CORRECTED.json")
    old_asset_project_path = os.path.join(old_asset_dir, project_name)

    new_asset_project_path = os.path.join(new_asset_dir, project_name)
    new_content_json_path = os.path.join(new_content_dir, f"{project_name}_CONTENT.json")
    new_figure_json_path = os.path.join(new_content_dir, f"{project_name}_FIGURES.json")

    print("  -> Migrating assets...")
    if os.path.isdir(old_asset_project_path):
        if not os.path.exists(new_asset_project_path):
            shutil.copytree(old_asset_project_path, new_asset_project_path, dirs_exist_ok=True)
            print(f"    - Copied complete asset folder for {project_name}.")
        else:
            print("    - Asset folder already exists. Skipping copy.")

    if os.path.exists(old_content_json_path):
        print("  -> Migrating and transforming CONTENT file...")
        with open(old_content_json_path, 'r', encoding='utf-8') as f: old_data = json.load(f)
        all_content_chunks, current_main_heading, current_sub_heading = [], None, None
        for page in old_data:
            page_num, pdf_filename, markdown = page.get('page_number'), page.get('source_document'), page.get('content_markdown', '')
            parts = re.split(r'(^(?:##|###)\s+.*)', markdown, flags=re.MULTILINE)
            if parts[0] and parts[0].strip(): all_content_chunks.append({"source_document": pdf_filename, "page_number": page_num, "headings": {"main_heading": current_main_heading, "sub_heading": current_sub_heading}, "content": parts[0].strip()})
            for i in range(1, len(parts), 2):
                heading_line = parts[i]; content_for_heading = parts[i+1].strip() if (i + 1) < len(parts) else ""
                match = re.match(r'^(##|###)\s+(.*)', heading_line)
                if match:
                    level, title = match.groups()
                    if level == '##': current_main_heading, current_sub_heading = title.strip(), None
                    else: current_sub_heading = title.strip()
                if content_for_heading: all_content_chunks.append({"source_document": pdf_filename, "page_number": page_num, "headings": {"main_heading": current_main_heading, "sub_heading": current_sub_heading}, "content": content_for_heading})
        with open(new_content_json_path, 'w', encoding='utf-8') as f: json.dump(all_content_chunks, f, indent=4)
        print(f"    - Transformed {len(old_data)} pages into {len(all_content_chunks)} chunks.")

    print("  -> Merging and repairing FIGURES file...")
    if os.path.exists(old_corrected_fig_path) and old_standard_fig_path: # Check if the path was found
        good_metadata = {}
        with open(old_standard_fig_path, 'r', encoding='utf-8') as f:
            for item in json.load(f):
                filename = os.path.basename(item.get('image_path', ''))
                good_metadata[filename] = {"description": item.get('description'), "figure_id": item.get('figure_id')}
        with open(old_corrected_fig_path, 'r', encoding='utf-8') as f: old_data = json.load(f)
        new_data = []
        deduced_pdf_name = f"{project_name.replace('_', ' ')}.pdf"
        for item in old_data:
            filename = os.path.basename(item.get('image_path', ''))
            new_item = {"source_document": deduced_pdf_name, "source_page": item.get('source_page'), "image_path": os.path.relpath(os.path.join(new_asset_project_path, 'figure_images', filename), pipeline_root)}
            if filename in good_metadata:
                new_item['figure_id'] = good_metadata[filename].get('figure_id', 'N/A')
                new_item['description'] = good_metadata[filename].get('description')
            else:
                new_item['figure_id'] = item.get('figure_id', 'N/A')
                new_item['description'] = item.get('description')
            if 'notebook_heading' in item: new_item['notebook_links'] = item['notebook_heading']
            new_data.append(new_item)
        with open(new_figure_json_path, 'w', encoding='utf-8') as f: json.dump(new_data, f, indent=4)
        print(f"    - Successfully merged and repaired {len(new_data)} figure entries.")
    else:
        print(f"    - ⚠️ WARNING: Could not find both required _STANDARDIZED files for {project_name}. Skipping figure migration.")
    print("  ✅ Success! Project migration complete.")

# ------------------------------------------------------------------------------
# INTERACTIVE EXECUTION
# ------------------------------------------------------------------------------
try:
    OLD_DIDACTIC_ROOT = '/content/drive/MyDrive/1-Projects/Didactic_to_JSON'
    OLD_ANNO_ROOT = '/content/drive/MyDrive/1-Projects/Annotation_Project'
    OLD_OUTPUTS_DIR = os.path.join(OLD_DIDACTIC_ROOT, 'Outputs')
    OLD_CONTENT_DIR = os.path.join(OLD_DIDACTIC_ROOT, '_textbook_content')
    OLD_ASSET_DIR = os.path.join(OLD_DIDACTIC_ROOT, '_textbook_assets')

    KNOWLEDGE_PIPELINE_ROOT = '/content/drive/MyDrive/1-Projects/Knowledge_Pipeline'
    NEW_CONTENT_LECTURES_DIR = os.path.join(KNOWLEDGE_PIPELINE_ROOT, '_content_library', 'lectures')
    NEW_ASSET_LECTURES_DIR = os.path.join(KNOWLEDGE_PIPELINE_ROOT, '_asset_library', 'lectures')
    NEW_CONTENT_TEXTBOOKS_DIR = os.path.join(KNOWLEDGE_PIPELINE_ROOT, '_content_library', 'textbooks')
    NEW_ASSET_TEXTBOOKS_DIR = os.path.join(KNOWLEDGE_PIPELINE_ROOT, '_asset_library', 'textbooks')

    while True:
        print("\n" + "="*80)
        print("Select the type of OLD project you want to migrate:")
        print("  [1] LECTURE project (from the 'Outputs' folder)")
        print("  [2] TEXTBOOK project (from '_textbook_content' and '_textbook_assets')")
        print("  [3] Exit Migration Utility")

        choice = input("\nEnter your choice (1-3): ").strip()
        if choice == '3': print("Exiting utility."); break

        projects, old_dir = [], ''
        if choice == '1':
            old_dir = OLD_OUTPUTS_DIR
            projects = sorted([d for d in os.listdir(old_dir) if os.path.isdir(os.path.join(old_dir, d))])
            print("\nSelect LECTURE project(s) to migrate (e.g., 1, 3-5 or 'all'):")
        elif choice == '2':
            old_dir = OLD_CONTENT_DIR
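            # Only files with the expected suffixes count as projects; stray files in the folder are ignored.
            projects = sorted({f.replace('_CONTENT.json','').replace('_STANDARDIZED_CORRECTED.json','') for f in os.listdir(old_dir) if f.endswith(('_CONTENT.json', '_STANDARDIZED_CORRECTED.json'))})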
            print("\nSelect TEXTBOOK project(s) to migrate (e.g., 1, 2 or 'all'):")
        else:
            print("Invalid choice."); continue

        for i, name in enumerate(projects): print(f"  [{i+1}] {name}")
        user_input = input("Enter selection: ").strip().lower()

        selected_indices = set()
        if user_input == 'all':
            selected_indices = set(range(len(projects)))
        else:
            parts = user_input.split(',')
            for part in parts:
                part = part.strip()
                if '-' in part:
                    start, end = map(int, part.split('-'))
                    selected_indices.update(range(start - 1, end))
                elif part.isdigit():
                    selected_indices.add(int(part) - 1)

        selected_projects = [projects[i] for i in sorted(list(selected_indices)) if 0 <= i < len(projects)]

        if not selected_projects:
            print("No valid projects selected.")
            continue

        for proj_name in selected_projects:
            if choice == '1':
                migrate_lecture_project(proj_name, OLD_OUTPUTS_DIR, NEW_CONTENT_LECTURES_DIR, NEW_ASSET_LECTURES_DIR, KNOWLEDGE_PIPELINE_ROOT)
            elif choice == '2':
                migrate_textbook_project(proj_name, OLD_CONTENT_DIR, OLD_ASSET_DIR, OLD_ANNO_ROOT, NEW_CONTENT_TEXTBOOKS_DIR, NEW_ASSET_TEXTBOOKS_DIR, KNOWLEDGE_PIPELINE_ROOT)
except Exception as e:
    print(f"\nAn unexpected error occurred: {e}")
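
The migration prompt accepts selections such as `1, 3-5` or `all`. The sketch below shows one way to parse that input into 0-based indices. The function name `parse_selection` is hypothetical. Unlike the cell above, it skips malformed parts instead of raising, which is an assumption about the desired behavior rather than what the utility currently does.

```python
def parse_selection(user_input, item_count):
    """Turn input such as '1, 3-5' or 'all' into sorted 0-based indices.

    Hypothetical helper. Malformed parts and out-of-range numbers are skipped
    rather than raising, so one typo does not abort the whole session.
    """
    text = user_input.strip().lower()
    if text == 'all':
        return list(range(item_count))

    indices = set()
    for part in text.split(','):
        part = part.strip()
        if '-' in part:
            # A range like '3-5' is inclusive on both ends (1-based).
            start_text, _, end_text = part.partition('-')
            start_text, end_text = start_text.strip(), end_text.strip()
            if start_text.isdecimal() and end_text.isdecimal():
                indices.update(range(int(start_text) - 1, int(end_text)))
        elif part.isdecimal():
            indices.add(int(part) - 1)

    # Drop anything that does not point at a listed item.
    return sorted(i for i in indices if 0 <= i < item_count)


print(parse_selection('1, 3-5', 6))
```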


## Project Status Dashboard

v1.5

In [None]:
# @title {display-mode: "form"}
# ==============================================================================
# BLOCK S: PROJECT STATUS DASHBOARD (v1.5 - ERROR THRESHOLD)
# ==============================================================================
#
# PURPOSE:
# This definitive version incorporates an error tolerance. Files with 3 or fewer
# errors are now considered 'Processed', making the dashboard less sensitive to
# minor, expected issues like blank title slides.
# ==============================================================================

import os
import json
import pandas as pd
from IPython.display import display

print("--- INITIALIZING PROJECT STATUS DASHBOARD (v1.5) ---")

def check_pdf_status():
    """Scans both source and content to create a complete PDF status report."""
    if 'PATHS' not in globals(): raise NameError("PATHS not defined. Run Block 0.")
    source_dir = PATHS['source_pdfs']
    content_dir = PATHS['content_textbooks']

    source_projects = {os.path.splitext(f)[0].replace(' ', '_') for f in os.listdir(source_dir) if f.lower().endswith('.pdf') and not f.startswith('.')}
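    # Only files with the expected suffixes count as projects; stray files in the folder are ignored.
    content_projects = {f.replace('_CONTENT.json', '').replace('_FIGURES.json', '') for f in os.listdir(content_dir) if f.endswith(('_CONTENT.json', '_FIGURES.json'))}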
    master_project_list = sorted(list(source_projects.union(content_projects)))

    statuses = []

    for base_name in master_project_list:
        content_path = os.path.join(content_dir, f"{base_name}_CONTENT.json")
        figures_path = os.path.join(content_dir, f"{base_name}_FIGURES.json")
        assumed_pdf_name = base_name.replace('_', ' ') + ".pdf"
        source_path = os.path.join(source_dir, assumed_pdf_name)
        source_status = "✅ Found" if os.path.exists(source_path) else "❓ Missing (Migrated?)"
        has_content = os.path.exists(content_path)
        has_figures = os.path.exists(figures_path)

        if not has_content and not has_figures:
            content_status, figures_status = "⚪️ Unprocessed", "⚪️ Unprocessed"
        else:
            content_status = "❌ Missing"
            if has_content:
                try:
                    with open(content_path, 'r', encoding='utf-8') as f: data = json.load(f)
                    if not data: content_status = "⚠️ Empty"
                    else:
                        error_count = sum(1 for chunk in data if "Error processing page" in chunk.get('content', ''))
                        # Up to 3 errored chunks are tolerated and still count as processed.
                        if error_count <= 3: content_status = "✅ Processed"
                        elif error_count == len(data): content_status = "⚠️ All Chunks Errored"
                        else: content_status = f"🟡 Partial ({error_count} errors)"
                except (json.JSONDecodeError, IndexError): content_status = "⚠️ Corrupt"

            figures_status = "❌ Missing"
            if has_figures:
                try:
                    with open(figures_path, 'r', encoding='utf-8') as f: data = json.load(f)
                    num_figures = len(data)
                    if num_figures >= 100: figures_status = f"✅ {num_figures} figs"
                    elif num_figures > 0: figures_status = f"🟡 {num_figures} figs (Incomplete)"
                    else: figures_status = "⚠️ Empty"
                except json.JSONDecodeError: figures_status = "⚠️ Corrupt"

        statuses.append({
            "Source PDF": source_status, "Project Name": base_name,
            "Content JSON": content_status, "Figures JSON": figures_status
        })
    return statuses

def check_video_status():
    """Scans both source and content to create a complete video status report."""
    if 'PATHS' not in globals(): raise NameError("PATHS not defined. Run Block 0.")
    source_dir = PATHS['source_videos']
    content_dir = PATHS['content_lectures']
    asset_dir = PATHS['asset_lectures']

    source_videos = {os.path.splitext(f)[0].rsplit('_Part', 1)[0].rsplit(' Part', 1)[0] for f in os.listdir(source_dir) if not f.startswith('.') and f.lower().endswith(('.mp4', '.mov', '.mkv'))}
    content_lectures = {d for d in os.listdir(content_dir) if os.path.isdir(os.path.join(content_dir, d))}
    master_lecture_list = sorted(list(source_videos.union(content_lectures)))
    statuses = []

    for lecture_name in master_lecture_list:
        enhanced_path = os.path.join(content_dir, lecture_name, 'final_ENHANCED_data.json')
        source_status = "❓ Missing (Migrated?)"
        lecture_asset_path = os.path.join(asset_dir, lecture_name)
        if os.path.exists(lecture_asset_path) and any(f.lower().endswith(('.mp4', '.mov', '.mkv')) for f in os.listdir(lecture_asset_path)):
            source_status = "✅ Found (in assets)"
        elif any(f.startswith(lecture_name) and f.lower().endswith(('.mp4', '.mov', '.mkv')) for f in os.listdir(source_dir)):
            source_status = "✅ Found (in source)"

        json_status = "⚪️ Unprocessed"
        if os.path.exists(enhanced_path):
            try:
                with open(enhanced_path, 'r', encoding='utf-8') as f: data = json.load(f)
                if not data: json_status = "⚠️ Empty"
                else:
                    error_count = sum(1 for slide in data if "Error" in slide.get('title', ''))
                    # Up to 3 errored slides are tolerated and still count as processed.
                    if error_count <= 3: json_status = "✅ Processed"
                    elif error_count == len(data): json_status = "⚠️ All Slides Errored"
                    else: json_status = f"🟡 Partial ({error_count} errors)"
            except (json.JSONDecodeError, IndexError): json_status = "⚠️ Corrupt"

        statuses.append({
            "Source Video": source_status, "Lecture Folder": lecture_name, "Enhanced JSON": json_status
        })
    return statuses

# --- Main Execution ---
try:
    print("\n--- PDF PROCESSING STATUS ---")
    pdf_statuses = check_pdf_status()
    # Defined before the branch so the video table below can reuse it even
    # when there are no PDF rows. pandas 2.1+ prefers Styler.map over applymap.
    def colorize(val):
        s_val = str(val)
        if '✅' in s_val: color = '#d4edda' # Green
        elif '🟡' in s_val: color = '#fff3cd' # Yellow
        elif '⚪️' in s_val: color = '#e2e3e5' # Grey
        elif '⚠️' in s_val or '❓' in s_val or '❌' in s_val: color = '#f8d7da' # Red
        else: color = 'white'
        return f'background-color: {color}'
    if pdf_statuses:
        pdf_df = pd.DataFrame(pdf_statuses)
        display(pdf_df.style.applymap(colorize).set_properties(**{'text-align': 'left'}))
    else:
        print("No source PDFs or processed PDF content found.")

    print("\n\n--- VIDEO PROCESSING STATUS ---")
    video_statuses = check_video_status()
    if video_statuses:
        video_df = pd.DataFrame(video_statuses)
        display(video_df.style.applymap(colorize).set_properties(**{'text-align': 'left'}))
    else:
        print("No source videos or processed lecture content found.")

except Exception as e:
    print(f"\nAn unexpected error occurred while generating the status report: {e}")
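
The v1.5 dashboard treats a file with 3 or fewer errors as processed. The sketch below isolates that rule. The function name `classify_by_errors` and the plain-text labels are hypothetical. It also checks the all-errored case first, which differs from the cell above for files with three or fewer items; that ordering is an assumption, not the dashboard's current behavior.

```python
def classify_by_errors(total_items, error_count, tolerance=3):
    """Map an error count to a dashboard label.

    Hypothetical helper; `tolerance` mirrors the threshold of 3 used above.
    """
    if total_items == 0:
        return "Empty"
    if error_count == total_items:
        # Checked first so a tiny file that failed entirely is not reported as fine.
        return "All Errored"
    if error_count <= tolerance:
        return "Processed"
    return f"Partial ({error_count} errors)"


for total, errors in [(120, 2), (120, 9), (2, 2)]:
    print(total, errors, classify_by_errors(total, errors))
```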

v1.6

In [None]:
# @title {display-mode: "form"}
# ==============================================================================
# BLOCK S: PROJECT STATUS DASHBOARD (v1.6 - RAG AWARE)
# ==============================================================================
#
# PURPOSE:
# This definitive version makes the dashboard "RAG Aware." It adds a new column
# that inspects the output files and reports the percentage and raw count of
# items that have been successfully linked to a notebook by the RAG engine.
# ==============================================================================

import os
import json
import pandas as pd
from IPython.display import display

print("--- INITIALIZING PROJECT STATUS DASHBOARD (v1.6 - RAG AWARE) ---")

def get_rag_status(file_path):
    """Checks a JSON file and returns the RAG linking status."""
    if not os.path.exists(file_path):
        return "N/A"
    try:
        with open(file_path, 'r', encoding='utf-8') as f:
            data = json.load(f)
        if not data:
            return "⚠️ Empty"

        total_items = len(data)
        linked_items = sum(1 for item in data if 'notebook_links' in item and item['notebook_links'])

        # `data` is non-empty at this point, so total_items is at least 1.
        percentage = (linked_items / total_items) * 100

        if percentage >= 75:
            return f"✅ {percentage:.0f}% ({linked_items}/{total_items})"
        elif percentage > 0:
            return f"🟡 {percentage:.0f}% ({linked_items}/{total_items})"
        else:
            return f"⚪️ 0% (0/{total_items})"

    except (json.JSONDecodeError, IndexError):
        return "⚠️ Corrupt"
    except Exception:
        return "❓ Error"


def check_pdf_status():
    """Scans both source and content to create a complete PDF status report."""
    if 'PATHS' not in globals(): raise NameError("PATHS not defined. Run Block 0.")
    source_dir = PATHS['source_pdfs']
    content_dir = PATHS['content_textbooks']

    source_projects = {os.path.splitext(f)[0].replace(' ', '_') for f in os.listdir(source_dir) if f.lower().endswith('.pdf') and not f.startswith('.')}
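    # Only files with the expected suffixes count as projects; stray files in the folder are ignored.
    content_projects = {f.replace('_CONTENT.json', '').replace('_FIGURES.json', '') for f in os.listdir(content_dir) if f.endswith(('_CONTENT.json', '_FIGURES.json'))}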
    master_project_list = sorted(list(source_projects.union(content_projects)))

    statuses = []

    for base_name in master_project_list:
        content_path = os.path.join(content_dir, f"{base_name}_CONTENT.json")
        figures_path = os.path.join(content_dir, f"{base_name}_FIGURES.json")
        assumed_pdf_name = base_name.replace('_', ' ') + ".pdf"
        source_path = os.path.join(source_dir, assumed_pdf_name)
        source_status = "✅ Found" if os.path.exists(source_path) else "❓ Missing"

        # Processing Status
        content_proc_status = "⚪️ Unprocessed"
        if os.path.exists(content_path):
            try:
                with open(content_path, 'r', encoding='utf-8') as f: data = json.load(f)
                if not data: content_proc_status = "⚠️ Empty"
                else:
                    error_count = sum(1 for chunk in data if "Error processing page" in chunk.get('content', ''))
                    if error_count <= 3: content_proc_status = "✅ Processed"
                    else: content_proc_status = f"🟡 Partial ({error_count} errors)"
            except (json.JSONDecodeError, IndexError): content_proc_status = "⚠️ Corrupt"

        figures_proc_status = "⚪️ Unprocessed"
        if os.path.exists(figures_path):
            try:
                with open(figures_path, 'r', encoding='utf-8') as f: data = json.load(f)
                figures_proc_status = f"✅ {len(data)} figs" if data else "⚠️ Empty"
            except json.JSONDecodeError: figures_proc_status = "⚠️ Corrupt"

        # RAG Status
        content_rag_status = get_rag_status(content_path)
        figures_rag_status = get_rag_status(figures_path)


        statuses.append({
            "Source PDF": source_status, "Project Name": base_name,
            "Content JSON": content_proc_status, "Content RAG": content_rag_status,
            "Figures JSON": figures_proc_status, "Figures RAG": figures_rag_status
        })
    return statuses

def check_video_status():
    """Scans both source and content to create a complete video status report."""
    if 'PATHS' not in globals(): raise NameError("PATHS not defined. Run Block 0.")
    source_dir = PATHS['source_videos']
    content_dir = PATHS['content_lectures']
    asset_dir = PATHS['asset_lectures']

    source_videos = {os.path.splitext(f)[0].rsplit('_Part', 1)[0].rsplit(' Part', 1)[0] for f in os.listdir(source_dir) if not f.startswith('.') and f.lower().endswith(('.mp4', '.mov', '.mkv'))}
    content_lectures = {d for d in os.listdir(content_dir) if os.path.isdir(os.path.join(content_dir, d))}
    master_lecture_list = sorted(list(source_videos.union(content_lectures)))
    statuses = []

    for lecture_name in master_lecture_list:
        enhanced_path = os.path.join(content_dir, lecture_name, 'final_ENHANCED_data.json')
        source_status = "❓ Missing"
        lecture_asset_path = os.path.join(asset_dir, lecture_name)
        if os.path.exists(lecture_asset_path) and any(f.lower().endswith(('.mp4', '.mov', '.mkv')) for f in os.listdir(lecture_asset_path)):
            source_status = "✅ Found (in assets)"
        elif any(f.startswith(lecture_name) and f.lower().endswith(('.mp4', '.mov', '.mkv')) for f in os.listdir(source_dir)):
            source_status = "✅ Found (in source)"

        json_proc_status = "⚪️ Unprocessed"
        if os.path.exists(enhanced_path):
            try:
                with open(enhanced_path, 'r', encoding='utf-8') as f: data = json.load(f)
                if not data: json_proc_status = "⚠️ Empty"
                else:
                    error_count = sum(1 for slide in data if "Error" in slide.get('title', ''))
                    if error_count <= 3: json_proc_status = "✅ Processed"
                    else: json_proc_status = f"🟡 Partial ({error_count} errors)"
            except (json.JSONDecodeError, IndexError): json_proc_status = "⚠️ Corrupt"

        # RAG Status
        rag_status = get_rag_status(enhanced_path)

        statuses.append({
            "Source Video": source_status, "Lecture Folder": lecture_name,
            "Enhanced JSON": json_proc_status, "RAG Links": rag_status
        })
    return statuses

# --- Main Execution ---
# Defined at module level so the video report can still use it when there are no PDF rows.
# Note: pandas 2.1+ deprecates Styler.applymap in favor of Styler.map.
def colorize(val):
    """Return a background-color style based on the status emoji in a cell."""
    s_val = str(val)
    if '✅' in s_val: color = '#d4edda' # Green
    elif '🟡' in s_val: color = '#fff3cd' # Yellow
    elif '⚪️' in s_val: color = '#e2e3e5' # Grey
    elif '⚠️' in s_val or '❓' in s_val or '❌' in s_val: color = '#f8d7da' # Red
    else: color = 'white'
    return f'background-color: {color}'

try:
    print("\n--- PDF PROCESSING STATUS ---")
    pdf_statuses = check_pdf_status()
    if pdf_statuses:
        pdf_df = pd.DataFrame(pdf_statuses)
        # Reorder columns to group RAG status
        pdf_df = pdf_df[["Source PDF", "Project Name", "Content JSON", "Content RAG", "Figures JSON", "Figures RAG"]]
        display(pdf_df.style.applymap(colorize).set_properties(**{'text-align': 'left'}))
    else:
        print("No source PDFs or processed PDF content found.")

    print("\n\n--- VIDEO PROCESSING STATUS ---")
    video_statuses = check_video_status()
    if video_statuses:
        video_df = pd.DataFrame(video_statuses)
        display(video_df.style.applymap(colorize).set_properties(**{'text-align': 'left'}))
    else:
        print("No source videos or processed lecture content found.")

except Exception as e:
    print(f"\nAn unexpected error occurred while generating the status report: {e}")
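
The status rules applied to each `final_ENHANCED_data.json` file can be isolated into a small pure function, which makes the threshold easier to test and adjust. The following is a minimal sketch, not the notebook's own code: the function name `classify_enhanced_json` and the `max_tolerated_errors` parameter are hypothetical, and the default of 3 simply mirrors the `error_count <= 3` check in the report above.

```python
def classify_enhanced_json(slides, max_tolerated_errors=3):
    """Map a parsed slide list to one of the report's status labels.

    Hypothetical helper: the name and the parameter are illustrative only.
    """
    if not slides:
        return "⚠️ Empty"
    # Count slides whose title carries an error marker, as the report does.
    error_count = sum(
        1
        for slide in slides
        if isinstance(slide, dict) and "Error" in str(slide.get("title", ""))
    )
    if error_count <= max_tolerated_errors:
        return "✅ Processed"
    return f"🟡 Partial ({error_count} errors)"


# One failed slide is still within the tolerated range.
sample = [{"title": "Intro"}, {"title": "Error: timeout"}]
print(classify_enhanced_json(sample))
```

Keeping the tolerance as a parameter is a design choice of this sketch; the report itself hard-codes the value.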

## Directory Tree Generator

In [None]:
# @title {display-mode: "form"}
# UTILITY
# Cell 1: Directory Tree Generator

import os

def generate_tree(root_dir):
    """
    Generates and prints a file tree for a given directory, ignoring specified
    file extensions and directories.
    """
    # --- Configuration ---
    # Add any file extensions you want to ignore to this set (lowercase).
    ignored_extensions = {'.jpg', '.jpeg', '.png'}
    # Add any directory names you want to ignore to this set.
    ignored_dirs = {'__pycache__', '.git', '.ipynb_checkpoints'}
    # -------------------

    def tree_level(current_path, prefix=""):
        """A recursive function to build and print each level of the tree."""
        try:
            # Get all items in the current directory, excluding hidden ones
            entries = [e for e in os.listdir(current_path) if not e.startswith('.')]
        except OSError as e:
            print(f"Error accessing {current_path}: {e}")
            return

        # Filter out ignored directories and files with ignored extensions
        filtered_entries = []
        for entry in entries:
            full_path = os.path.join(current_path, entry)
            if os.path.isdir(full_path):
                if entry not in ignored_dirs:
                    filtered_entries.append(entry)
            else:
                if not entry.lower().endswith(tuple(ignored_extensions)):
                    filtered_entries.append(entry)

        filtered_entries.sort()
        pointers = ['├── '] * (len(filtered_entries) - 1) + ['└── ']

        for pointer, entry in zip(pointers, filtered_entries):
            print(prefix + pointer + entry)
            full_path = os.path.join(current_path, entry)
            if os.path.isdir(full_path):
                # Determine the prefix for the next level
                extension = '│   ' if pointer == '├── ' else '    '
                tree_level(full_path, prefix=prefix + extension)

    # --- Start the process ---
    print(f"{os.path.basename(os.path.abspath(root_dir))}/")
    tree_level(root_dir)


# --- RUN THE SCRIPT ---
# ❗️ Change this path to the directory you want to analyze.
# For example:
# - '/content/' for the default Colab directory
# - '/content/drive/MyDrive/Your_Project_Folder' for a folder in Google Drive
target_directory = '/content/drive/MyDrive/1-Projects/Knowledge_Pipeline'

generate_tree(target_directory)

Knowledge_Pipeline/
├── Pathology_Notebook
│   ├── Bone and Soft Tissue Pathology Notebook 0920 v2.md
│   ├── Bone and Soft Tissue Pathology Notebook.gdoc
│   ├── Breast Pathology Notebook.gdoc
│   ├── Breast Pathology Notebook.md
│   ├── Cytopathology Notebook .gdoc
│   ├── Endocrine Pathology Notebook.gdoc
│   ├── Gastrointestinal Pathology Notebook .gdoc
│   ├── Genitourinary Pathology Notebook.gdoc
│   ├── Gynecologic Pathology Notebook.gdoc
│   ├── Gynecologic Pathology Notebook.md
│   ├── Head & Neck Pathology Notebook.gdoc
│   ├── Pathology Notebook  (Main Document).gdoc
│   ├── Skin Pathology Notebook 0921.md
│   ├── Skin Pathology Notebook.gdoc
│   ├── Soft Tissue Pathology Notebook.gdoc
│   └── Thoracic Pathology Notebook.gdoc
├── YT_LINKS.json
├── _asset_library
│   ├── lectures
│   │   ├── BST_Lecture_1_Grossing
│   │   │   └── slide_images
│   │   ├── BST_Lecture_2_SoftTissue1
│   │   │   ├── generated_lecture_audio
│   │   │   └── slide_images
│   │   ├── BST_Lecture_3_So

## AI Crop Lecture Slides

In [None]:
# @title {display-mode: "form"}
# ==============================================================================
# BLOCK 2.1: CROP LECTURE SLIDES (UTILITY)
# ==============================================================================
#
# PURPOSE:
# An optional utility to intelligently crop the slide images for a processed
# lecture. This should be run AFTER Block 2 is complete for a given lecture.
# It uses a robust edge-detection method to find the main content area.
#
# ==============================================================================

import cv2
from PIL import Image
import numpy as np
from IPython.display import display, clear_output
import json
import os
import time

# ------------------------------------------------------------------------------
# INTERACTIVE EXECUTION
# ------------------------------------------------------------------------------
try:
    # --- Define required paths (taken from the PATHS dictionary created in Block 0) ---
    if 'PATHS' not in globals():
        raise NameError("PATHS dictionary not found. Please run Block 0 first.")
    KNOWLEDGE_PIPELINE_ROOT = PATHS['root']
    CONTENT_LECTURES_DIR = PATHS['content_lectures']
    ASSET_LECTURES_DIR = PATHS['asset_lectures']

    # --- Lecture Selection ---
    processed_lectures = sorted([d for d in os.listdir(CONTENT_LECTURES_DIR) if os.path.isdir(os.path.join(CONTENT_LECTURES_DIR, d))])
    if not processed_lectures:
        print("❌ No processed lectures found to crop.")
    else:
        print("\nPlease select the lecture folder whose slides you want to crop:")
        for i, name in enumerate(processed_lectures):
            print(f"  [{i+1}] {name}")
        choice = int(input("\nEnter the number of the lecture: ")) - 1

        if 0 <= choice < len(processed_lectures):
            lecture_name = processed_lectures[choice]
            print(f"\nSelected lecture: {lecture_name}")

            # Find the corresponding JSON and image assets
            json_path = os.path.join(CONTENT_LECTURES_DIR, lecture_name, 'final_ENHANCED_data.json')
            if not os.path.exists(json_path):
                # Fallback to structured data if not enhanced yet
                json_path = os.path.join(CONTENT_LECTURES_DIR, lecture_name, 'final_structured_data.json')

            if not os.path.exists(json_path):
                 print(f"❌ ERROR: Could not find a JSON data file for '{lecture_name}'.")
            else:
                with open(json_path, 'r', encoding='utf-8') as f:
                    lecture_data = json.load(f)

                all_image_paths = [os.path.join(KNOWLEDGE_PIPELINE_ROOT, s['image_path']) for s in lecture_data]

                # --- AI Crop Detection ---
                print("[Cropper] Analyzing random slides to find main content area...")
                # ... (Edge detection logic is unchanged)

                if detected_boxes:
                    # ... (Coordinate calculation is unchanged)

                    print("--> AI has determined the optimal crop. Displaying preview...")
                    preview_img_pil = Image.open(sample_paths[-1]).crop((LEFT_X, TOP_Y, RIGHT_X, BOTTOM_Y))
                    display(preview_img_pil)

                    time.sleep(1)
                    confirm_crop = input("\nIs this crop correct? (yes/no): ").lower().strip()

                    if confirm_crop.startswith('y'):
                        print(f"[Cropper] Applying crop to all {len(all_image_paths)} images...")
                        # ... (Cropping logic is unchanged)
                        print("✅ Cropping complete!")
                    else:
                        print("--> Cropping cancelled by user.")
                else:
                    print("--> AI could not detect a consistent content area. Skipping crop.")
        else:
            print("Invalid number selected.")

except Exception as e:
    print(f"\nAn unexpected error occurred during execution: {e}")


Please select the lecture folder whose slides you want to crop:
  [1] BST_Lecture_1_Grossing
  [2] BST_Lecture_2_SoftTissue1
  [3] BST_Lecture_3_SofTissue2
  [4] BST_Lecture_3_SofTissue2 (1)
  [5] BST_Lecture_4_SoftTissue3
  [6] BST_Lecture_5_Bone1
  [7] BST_Lecture_6_Bone2
  [8] Breast_Lecture_Epithelial
  [9] Breast_Lecture_Fibroepithelial
  [10] Breast_Lecture_Grossing
  [11] Breast_Lecture_IHC
  [12] Breast_Lecture_Invasive
  [13] Breast_Lecture_Lobular
  [14] Breast_Lecture_Normal
  [15] Breast_Lecture_Papillary
  [16] Breast_Lecture_Prognostics
  [17] Breast_Lecture_Rad-Path
  [18] Breast_Lecture_SpindleCell
  [19] Breast_Lecture_Treated
  [20] GI_Lecture_0_Gross_Liver
  [21] GI_Lecture_10_Colon
  [22] GI_Lecture_11_Esophagus
  [23] GI_Lecture_12_SmallIntestine
  [24] GI_Lecture_13_Stomach2
  [25] GI_Lecture_1_Liver1
  [26] GI_Lecture_2_Liver2
  [27] GI_Lecture_3_Liver3
  [28] GI_Lecture_4_Peds1
  [29] GI_Lecture_5_Peds2
  [30] GI_Lecture_7_IBD
  [31] GI_Lecture_8_Pancreas
  [32

KeyboardInterrupt: Interrupted by user
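
The edge-detection and coordinate steps are elided in the cell above, so the following is only a rough, dependency-free illustration of the general idea: locate strong intensity steps on a few sample slides, take a bounding box per slide, then combine the boxes with a per-coordinate median so one unusual slide does not skew the crop. It is a stand-in written with plain Python lists, not the notebook's OpenCV logic; `content_box`, `consensus_box`, and the `threshold` value are hypothetical names and settings.

```python
from statistics import median


def content_box(gray, threshold=40):
    """Return (left, top, right, bottom) around strong intensity steps.

    `gray` is a list of rows of 0-255 integers. Returns None when no step
    reaches `threshold`. The box is approximate: a forward difference marks
    the pixel just before a step, so the left/top edge can sit one pixel out.
    """
    height, width = len(gray), len(gray[0])
    xs, ys = [], []
    for y in range(height - 1):
        for x in range(width - 1):
            # Forward differences approximate horizontal and vertical gradients.
            dx = abs(gray[y][x + 1] - gray[y][x])
            dy = abs(gray[y + 1][x] - gray[y][x])
            if max(dx, dy) >= threshold:
                xs.append(x)
                ys.append(y)
    if not xs:
        return None
    return (min(xs), min(ys), max(xs) + 1, max(ys) + 1)


def consensus_box(boxes):
    """Combine per-slide boxes with a per-coordinate median."""
    boxes = [b for b in boxes if b is not None]
    if not boxes:
        return None
    return tuple(int(median(coord)) for coord in zip(*boxes))


# Synthetic 8x6 "slide": a bright rectangle on a dark background.
slide = [[0] * 8 for _ in range(6)]
for row in range(1, 4):
    for col in range(2, 6):
        slide[row][col] = 255

print(consensus_box([content_box(slide), content_box(slide), (0, 0, 8, 6)]))
```

The resulting tuple uses the same (left, upper, right, lower) order that `Image.crop` expects.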

## RAG Link Stripper

In [None]:
# @title {display-mode: "form"}
# ==============================================================================
# UTILITY BLOCK U1: RAG LINK STRIPPER (v1.0)
# ==============================================================================
#
# PURPOSE:
# This utility is designed to "undo" the work of a previous RAG enrichment run.
# It safely removes all "notebook_links" keys from your selected content JSON
# files, allowing you to re-run the RAG process from a clean slate.
#
# SAFETY:
# This script is non-destructive. Before modifying a file, it creates a
# backup with a .bak extension (e.g., your_file.json.bak). If you need to
# restore the links, you can delete the modified file and rename the backup.
# ==============================================================================

import json
import os
import shutil

print("--- INITIALIZING RAG LINK STRIPPER UTILITY (v1.0) ---")

def strip_links_from_file(file_path: str):
    """
    Reads a JSON file, removes the 'notebook_links' key from each item,
    creates a backup, and saves the modified file.
    """
    print(f"\n-> Processing file: {os.path.basename(file_path)}...")
    try:
        with open(file_path, 'r', encoding='utf-8') as f:
            data = json.load(f)

        if not data or not isinstance(data, list):
            print("  - Skipping: File is empty or not in the expected list format.")
            return

        links_found = 0
        for item in data:
            if 'notebook_links' in item:
                del item['notebook_links']
                links_found += 1

        if links_found > 0:
            # Create a backup
            backup_path = file_path + ".bak"
            shutil.copy2(file_path, backup_path)
            print(f"  - ✅ Backup created at: {os.path.basename(backup_path)}")

            # Save the modified data
            with open(file_path, 'w', encoding='utf-8') as f:
                json.dump(data, f, indent=4)
            print(f"  - ✅ Successfully removed links from {links_found} items.")
        else:
            print("  - No 'notebook_links' found to remove.")

    # Exception already covers FileNotFoundError and json.JSONDecodeError.
    except Exception as e:
        print(f"  - ❌ ERROR processing file: {e}")

# ------------------------------------------------------------------------------
# INTERACTIVE EXECUTION
# ------------------------------------------------------------------------------
try:
    if 'PATHS' not in globals():
        raise NameError("PATHS dictionary not found. Please run Block 0 first.")

    CONTENT_LIBRARY_DIR = PATHS['content_library']
    KNOWLEDGE_PIPELINE_ROOT = PATHS['root']

    print("\n[Scanner] Searching for content files with existing notebook links...")
    files_with_links = []
    all_content_files = []

    for root, _, files in os.walk(CONTENT_LIBRARY_DIR):
        for f in files:
            if f.endswith(('_CONTENT.json', '_FIGURES.json', '_ENHANCED_data.json')):
                all_content_files.append(os.path.join(root, f))

    for file_path in all_content_files:
        try:
            with open(file_path, 'r', encoding='utf-8') as f:
                content = f.read()
            if '"notebook_links":' in content:
                files_with_links.append(file_path)
        except Exception:
            continue

    if not files_with_links:
        print("\n✅ No files with existing 'notebook_links' were found. Your data is already clean!")
    else:
        print("\n--- SELECT FILES TO CLEAN ---")
        print("The following files contain existing notebook links:")
        for i, path in enumerate(files_with_links):
            print(f"  [{i+1}] {os.path.relpath(path, KNOWLEDGE_PIPELINE_ROOT)}")

        user_input = input("\nEnter number(s) to clean (e.g., 1, 3 or 'all'): ").strip().lower()

        selected_files_to_clean = []
        if user_input == 'all':
            selected_files_to_clean = files_with_links
        else:
            try:
                indices = [int(c.strip()) - 1 for c in user_input.split(',')]
                selected_files_to_clean = [files_with_links[i] for i in indices if 0 <= i < len(files_with_links)]
            except ValueError:
                print("❌ Invalid input. Please enter numbers or 'all'.")

        if selected_files_to_clean:
            print("\n--- STARTING CLEANUP PROCESS ---")
            for file_path in selected_files_to_clean:
                strip_links_from_file(file_path)
            print("\n\n🎉 Finished processing all selected files (check the log above for any errors).")
        else:
            print("\nNo valid files selected for cleanup.")

except Exception as e:
    print(f"\nAn unexpected error occurred: {e}")

--- INITIALIZING RAG LINK STRIPPER UTILITY (v1.0) ---

[Scanner] Searching for content files with existing notebook links...

--- SELECT FILES TO CLEAN ---
The following files contain existing notebook links:
  [1] _content_library/textbooks/Breast_Atlas_FIGURES.json
  [2] _content_library/textbooks/Breast_Biopsy_FIGURES.json
  [3] _content_library/textbooks/SoftTissue_Pattern_CONTENT.json
  [4] _content_library/textbooks/Skin_Elston_FIGURES.json
  [5] _content_library/textbooks/Bone_Dorfman_FIGURES.json
  [6] _content_library/textbooks/SoftTissue_Enzinger_FIGURES.json
  [7] _content_library/textbooks/Cyto_Pattern_FIGURES.json
  [8] _content_library/lectures/BST_Lecture_2_SoftTissue1/final_ENHANCED_data.json
  [9] _content_library/lectures/BST_Lecture_4_SoftTissue3/final_ENHANCED_data.json
  [10] _content_library/lectures/Gyn_Lecture_3_Cervix_Glandular/final_ENHANCED_data.json
  [11] _content_library/lectures/YT_Skin_Granulomatous_Dermatitis_Fung/final_ENHANCED_data.json
  [12] _conten
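
The restore step described in the SAFETY note (delete the modified file, then rename the backup) can be sketched as a small helper. This is an illustrative assumption rather than part of the utility: `restore_from_backup` is a hypothetical name, and it relies only on the `.bak` naming convention used by `strip_links_from_file`.

```python
import os


def restore_from_backup(file_path):
    """Put the .bak copy back in place of file_path.

    Returns True if a backup existed and was restored, False otherwise.
    Hypothetical helper; only the '.bak' suffix comes from the utility above.
    """
    backup_path = file_path + ".bak"
    if not os.path.exists(backup_path):
        return False
    # os.replace overwrites the destination, so the stripped file is dropped
    # and the backup takes its place in a single step.
    os.replace(backup_path, file_path)
    return True
```

Calling `restore_from_backup(path)` on a file that was never stripped simply returns `False` and leaves the file untouched.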

## YouTube to MP4

In [None]:
# @title {display-mode: "form"}
!pip install -q yt-dlp
# -y answers the confirmation prompt automatically; -qq keeps the log short
!apt-get install -y -qq ffmpeg
# Prompt the user for a single string containing all video information
multi_video_input = input("Enter your videos in the format [URL1],[File Name1];[URL2],[File Name2];... \n")

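# Split the input string into individual video entries based on the semicolon,
# dropping blanks left by trailing semicolons so the count below is accurate
video_entries = [e for e in multi_video_input.split(';') if e.strip()]

print(f"\nFound {len(video_entries)} video(s) to process.")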

# Loop through each video entry
for entry in video_entries:
    # Skip any empty entries that might result from trailing semicolons
    if not entry.strip():
        continue

    try:
        # Split the entry into URL and base filename based on the comma
        video_url, base_filename = entry.split(',', 1)

        # Clean up whitespace from the parts
        video_url = video_url.strip()
        base_filename = base_filename.strip()

        # --- Filename Formatting ---
        # Replace spaces with underscores
        formatted_filename = base_filename.replace(' ', '_')
        # Add the "YT_" prefix
        final_filename = f"YT_{formatted_filename}"

        # Specify the final output path and filename
        output_path = f"/content/drive/MyDrive/1-Projects/Knowledge_Pipeline/_source_materials/videos/{final_filename}"

        # --- Download and Conversion ---
        print("-" * 50)
        print(f"Processing: {base_filename}")
        print(f"Saving as: {output_path}.mp4")

        # Execute the yt-dlp command for the current video
        # (the URL is quoted so the shell does not interpret characters such as '&' or '?')
        !yt-dlp --quiet --format 'bestvideo[ext=mp4]+bestaudio[ext=m4a]/best[ext=mp4]/best' --merge-output-format mp4 -o "{output_path}.%(ext)s" "{video_url}"

        print(f"Successfully downloaded: {final_filename}.mp4")

    except ValueError:
        # Handle cases where an entry is malformed (e.g., no comma)
        print("-" * 50)
        print(f"Skipping malformed entry: '{entry}'")
        print("Please ensure it follows the 'URL,FileName' format.")

print("\n" + "=" * 50)
print("All videos have been processed.")
print("=" * 50)

Enter your videos in the format [URL1],[File Name1];[URL2],[File Name2];... 
https://www.youtube.com/live/I7_96ULhaME?si=9WKm87qZdtq9RPI8,HN Thyroid Thompson

Found 1 video(s) to process.
--------------------------------------------------
Processing: HN Thyroid Thompson
Saving as: /content/drive/MyDrive/1-Projects/Knowledge_Pipeline/_source_materials/videos/YT_HN_Thyroid_Thompson.mp4
Successfully downloaded: YT_HN_Thyroid_Thompson.mp4

All videos have been processed.
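
The input format used by this cell (`URL,File Name` pairs separated by semicolons) can be parsed and validated before any download starts, so malformed entries are reported up front. A minimal sketch follows, assuming the same `YT_` prefix and underscore rules as the cell above; `parse_video_entries` is a hypothetical helper and the URL in the usage line is a placeholder.

```python
def parse_video_entries(raw):
    """Split 'URL,Name;URL,Name' input into download jobs and rejected entries.

    Returns (jobs, rejected), where jobs is a list of (url, output_stem)
    tuples and rejected holds entries without a usable URL or name.
    """
    jobs, rejected = [], []
    for entry in raw.split(";"):
        entry = entry.strip()
        if not entry:
            continue  # ignore blanks left by trailing semicolons
        # Split on the first comma only, so names may contain commas.
        url, sep, name = entry.partition(",")
        url, name = url.strip(), name.strip()
        if not sep or not url or not name:
            rejected.append(entry)
            continue
        jobs.append((url, "YT_" + name.replace(" ", "_")))
    return jobs, rejected


jobs, rejected = parse_video_entries(
    "https://example.com/watch?v=abc,HN Thyroid Thompson; no-comma-here ;"
)
print(jobs)
print(rejected)
```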


## MKV to MP4

In [None]:
# @title {display-mode: "form"}
# Cell 1
import os
from google.colab import drive

# Mount your Google Drive
drive.mount('/content/drive')

# Define the source and destination directories
source_dir = '/content/drive/MyDrive/3-Resources/Pathology (pre UCSF)/Pathology videos (pre UCSF)/Video Lectures/HemePath Videos'
dest_dir = '/content/drive/MyDrive/1-Projects/Knowledge_Pipeline/_source_materials'

# Create the destination directory if it doesn't already exist
os.makedirs(dest_dir, exist_ok=True)

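import subprocess

# Loop through all files in the source directory
for filename in os.listdir(source_dir):
    if not filename.lower().endswith(".mkv"):
        continue
    input_path = os.path.join(source_dir, filename)
    output_filename = os.path.splitext(filename)[0] + ".mp4"
    output_path = os.path.join(dest_dir, output_filename)

    print(f"Processing: {filename}...")

    # Remux with ffmpeg: copy the streams into an MP4 container without re-encoding.
    # -n never overwrites an existing output file; -loglevel error only shows real problems.
    # Passing the arguments as a list avoids shell-quoting issues with unusual filenames.
    command = ['ffmpeg', '-n', '-i', input_path, '-c', 'copy', '-loglevel', 'error', output_path]
    result = subprocess.run(command)

    # Only report success when ffmpeg actually exited cleanly.
    if result.returncode == 0:
        print(f"Successfully converted to: {output_filename}")
    else:
        print(f"ffmpeg did not convert {filename} (exit code {result.returncode})")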

print("\nConversion complete! ✅")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Processing: Other_Heme_Reactive_Hodgkin_IHC.mkv...
Successfully converted to: Other_Heme_Reactive_Hodgkin_IHC.mp4
Processing: Other_Heme_Lecture_Myelodysplasia.mkv...
Successfully converted to: Other_Heme_Lecture_Myelodysplasia.mp4
Processing: Other_Heme_Lecture_Myeloid Intro.mkv...
Successfully converted to: Other_Heme_Lecture_Myeloid Intro.mp4
Processing: Other_Heme_Lecture_Eo MDS MPN.mkv...
Successfully converted to: Other_Heme_Lecture_Eo MDS MPN.mp4
Processing: Other_Heme_Lecture_Bone Marrow Failure Syndromes.mkv...
Successfully converted to: Other_Heme_Lecture_Bone Marrow Failure Syndromes.mp4
Processing: Other_Heme_Lecture_Bone Marrow Manifestations of Systemic Disease.mkv...
Successfully converted to: Other_Heme_Lecture_Bone Marrow Manifestations of Systemic Disease.mp4
Processing: Other_Heme_Lecture_AML.mkv...
Successfully converted to: Other_Heme_Lec

## Renaming files

In [None]:
# @title {display-mode: "form"}
# Cell 1
import os
from google.colab import drive

# Mount your Google Drive
drive.mount('/content/drive')

# Define the source and destination directories, and the prefix
source_dir = '/content/drive/MyDrive/3-Resources/Pathology (pre UCSF)/Pathology videos (pre UCSF)/Skin LLU Curriculum'
dest_dir = '/content/drive/MyDrive/1-Projects/Knowledge_Pipeline/_source_materials'
prefix = 'Other_Skin_Lecture_'

# Ensure the destination directory exists, create it if it doesn't
os.makedirs(dest_dir, exist_ok=True)

try:
    # List all files in the source directory
    all_files = os.listdir(source_dir)

    # Loop through each file
    for filename in all_files:
        # Check if it's an mp4 file and doesn't already have the prefix
        if filename.endswith(".mp4") and not filename.startswith(prefix):
            old_path = os.path.join(source_dir, filename)
            new_filename = f"{prefix}{filename}"
            # Construct the new path in the destination directory
            new_path = os.path.join(dest_dir, new_filename)

            # This command will rename AND move the file
            os.rename(old_path, new_path)
            print(f"Moved and renamed '{filename}' to '{new_filename}'")

    print("\nProcess complete! ✅")

except FileNotFoundError:
    print(f"Error: A directory was not found. Please check your paths:\nSource: {source_dir}")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Moved and renamed '2_Cysts_BenignEpidermal.mp4' to 'Other_Skin_Lecture_2_Cysts_BenignEpidermal.mp4'
Moved and renamed '1_NormalAnat_Intro.mp4' to 'Other_Skin_Lecture_1_NormalAnat_Intro.mp4'
Moved and renamed '20_21_Neural_Vascular.mp4' to 'Other_Skin_Lecture_20_21_Neural_Vascular.mp4'
Moved and renamed '10_Blisters.mp4' to 'Other_Skin_Lecture_10_Blisters.mp4'
Moved and renamed '4_Follicular_Sebaceous.mp4' to 'Other_Skin_Lecture_4_Follicular_Sebaceous.mp4'
Moved and renamed 'BasicExam_Review.mp4' to 'Other_Skin_Lecture_BasicExam_Review.mp4'
Moved and renamed '24_Viral,Parasite.mp4' to 'Other_Skin_Lecture_24_Viral,Parasite.mp4'
Moved and renamed '14_15_Alopecia_Depositional_Metabolic.mp4' to 'Other_Skin_Lecture_14_15_Alopecia_Depositional_Metabolic.mp4'
Moved and renamed '12_Vasculitis.mp4' to 'Other_Skin_Lecture_12_Vasculitis.mp4'
Moved and renamed '3_Malignan

## Compress images, then upload to ImgBB

In [None]:
import os
import json
import base64
import requests
import shutil
from google.colab import userdata
from PIL import Image

# --- Helper Functions ---

# Handles CMYK and RGBA images, plus other modes that cannot be written as JPEG.
def compress_image(source_path, destination_path, quality=75):
    """
    Compresses a single image to JPEG, flattening transparency onto white and
    converting other non-RGB modes (e.g. CMYK, palette) to RGB first.
    """
    try:
        with Image.open(source_path) as img:
            # Flatten transparency (RGBA) onto a white background.
            if img.mode == 'RGBA':
                background = Image.new("RGB", img.size, (255, 255, 255))
                background.paste(img, mask=img.split()[3])
                img = background
            elif img.mode not in ('RGB', 'L'):
                # CMYK, palette ('P'), and other modes are converted to RGB,
                # since not every mode can be written as a JPEG.
                img = img.convert('RGB')

            # Save as an optimized JPEG at the requested quality.
            img.save(destination_path, 'JPEG', optimize=True, quality=quality)
        return True
        return True
    except FileNotFoundError:
        print(f"  [Error] Source image not found at: {source_path}")
        return False
    except Exception as e:
        print(f"  [Error] Could not compress {os.path.basename(source_path)}. Error: {e}")
        return False

def upload_to_imgbb(image_path, api_key):
    """Uploads an image to ImgBB and returns the URL."""
    url = "https://api.imgbb.com/1/upload"
    try:
        with open(image_path, "rb") as file:
            payload = {
                "key": api_key,
                "image": base64.b64encode(file.read()),
            }
            # Send as form data; the timeout keeps a stalled upload from hanging the cell.
            response = requests.post(url, data=payload, timeout=60)
            response.raise_for_status()
            result = response.json()
            if result["success"]:
                return result["data"]["url"]
            else:
                print(f"  [Error] ImgBB API error: {result.get('error', {}).get('message', 'Unknown error')}")
                return None
    except requests.exceptions.RequestException as e:
        print(f"  [Error] Network/HTTP error: {e}")
        return None
    except Exception as e:
        print(f"  [Error] An unexpected error occurred during upload: {e}")
        return None

# --- Configuration & Setup ---
try:
    API_KEY = userdata.get('ImgBB_API_Key_1')
    if not API_KEY:
        raise ValueError("API key is empty. Please check Colab Secrets.")
except Exception as e:
    print(f"Fatal Error: Could not retrieve API key named 'ImgBB_API_Key_1'. {e}")
    raise SystemExit("API key not configured.")

BASE_DRIVE_PATH = '/content/drive/MyDrive/1-Projects/Knowledge_Pipeline'
CONTENT_LIBRARY_PATH = os.path.join(BASE_DRIVE_PATH, '_content_library/textbooks')
TEMP_COMPRESSION_DIR = '/content/temp_compressed_images'

if os.path.exists(TEMP_COMPRESSION_DIR):
    shutil.rmtree(TEMP_COMPRESSION_DIR)
os.makedirs(TEMP_COMPRESSION_DIR)

# --- Step 1: Find all possible JSON files and get user choice ---
json_files = [f for f in os.listdir(CONTENT_LIBRARY_PATH) if f.endswith('_FIGURES.json')]

if not json_files:
    print(f"No '*_FIGURES.json' files found in {CONTENT_LIBRARY_PATH}. Exiting.")
else:
    print("Please choose a JSON file to process by entering its number:")
    for i, filename in enumerate(json_files):
        print(f"  {i + 1}: {filename}")

    choice = -1
    while True:
        try:
            choice_input = input(f"\nEnter a number (1-{len(json_files)}): ")
            choice = int(choice_input)
            if 1 <= choice <= len(json_files):
                break
            else:
                print("Invalid number. Please try again.")
        except ValueError:
            print("Invalid input. Please enter a number.")

    # --- Step 2: Load the chosen JSON and process images ---
    chosen_json_filename = json_files[choice - 1]
    json_file_path = os.path.join(CONTENT_LIBRARY_PATH, chosen_json_filename)

    print(f"\nLoading JSON file: {chosen_json_filename}")
    with open(json_file_path, 'r', encoding='utf-8') as f:
        data = json.load(f)

    print(f"Successfully loaded {len(data)} records. Starting processing...")
    print("-" * 30)

    update_count = 0
    for record in data:
        figure_id = record.get("figure_id", "N/A")
        if not record.get("image_url"):
            relative_image_path = record.get("image_path")
            if not relative_image_path:
                print(f"Skipping record for Figure {figure_id}: missing 'image_path'.")
                continue

            source_image_path = os.path.join(BASE_DRIVE_PATH, relative_image_path)
            image_filename = os.path.basename(source_image_path)
            temp_compressed_path = os.path.join(TEMP_COMPRESSION_DIR, image_filename)

            print(f"Processing Figure {figure_id}:")
            if compress_image(source_image_path, temp_compressed_path):
                print(f"  > Compressed '{image_filename}' successfully.")
                image_url = upload_to_imgbb(temp_compressed_path, API_KEY)
                if image_url:
                    record["image_url"] = image_url
                    update_count += 1
                    print(f"  > Success! URL added: {image_url}")
                else:
                    print(f"  > Upload failed for '{image_filename}'.")
        else:
            print(f"Skipping Figure {figure_id}: 'image_url' already exists.")

    # --- Step 3: Save results and clean up ---
    if update_count > 0:
        print("-" * 30)
        print(f"A total of {update_count} records were updated.")
        print("Saving updated data back to the JSON file...")
        with open(json_file_path, 'w', encoding='utf-8') as f:
            json.dump(data, f, indent=4)
        print(f"Successfully saved updates to {chosen_json_filename}.")
    else:
        print("-" * 30)
        print("No new images were uploaded. The JSON file is unchanged.")

    shutil.rmtree(TEMP_COMPRESSION_DIR)
    print("\nProcess complete! Temporary files have been removed. ✨")

Please choose a JSON file to process by entering its number:
  1: Breast_Atlas_FIGURES.json
  2: Breast_Biopsy_FIGURES.json
  3: GI_Biopsy_Interpretation_(Neoplastic)_FIGURES.json
  4: Cyto_Cibas_FIGURES.json
  5: Cyto_Comprehensive_Part_One_FIGURES.json
  6: Cyto_GU_Paris_FIGURES.json
  7: Cyto_Thyroid_Bethesda_FIGURES.json
  8: Cyto_Milan_FIGURES.json
  9: Cyto_PSC_Lung_FIGURES.json
  10: Breast_FAQ_FIGURES.json
  11: Breast_Pattern_FIGURES.json
  12: Cyto_Comprehensive_Part_Two_FIGURES.json
  13: Cyto_Breast_Yokohama_FIGURES.json
  14: Cyto_Gyn_Bethesda_FIGURES.json
  15: Cyto_Serous_Fluids_FIGURES.json
  16: GI_Biopsy_Interpretation_(Non_Neoplastic)_FIGURES.json
  17: BST_Horvai_FIGURES.json
  18: SoftTissue_Pattern_FIGURES.json
  19: Bone_Pattern_FIGURES.json
  20: Bone_Atlas_FIGURES.json
  21: Skin_Elston_FIGURES.json
  22: Skin_Levers_FIGURES.json
  23: HN_Thompson_FIGURES.json
  24: GI_Intestinal_Atlas1_FIGURES.json
  25: GI_Liver_Macsween_FIGURES.json
  26: Peds_Course_review_

In [None]:
import json
import os
import re
import sys

# Ensure the 'requests' library is available
try:
    import requests
except ImportError:
    print("Error: The 'requests' library is not installed.")
    print("Please install it by running: !pip install requests")
    sys.exit(1)

from urllib.parse import urlparse

def update_json_from_file(json_file_path):
    """
    Reads JSON data from a file, scans a directory for images, validates them
    against URLs, removes the 'WSI_url' key, and updates the JSON with local paths.
    """
    # Load the JSON records from the file path supplied by the caller.
    print(f"Reading JSON data from '{json_file_path}'...")
    try:
        with open(json_file_path, 'r', encoding='utf-8') as f:
            json_data = json.load(f)
    except FileNotFoundError:
        print(f"Error: The file '{json_file_path}' was not found.")
        return
    except json.JSONDecodeError as e:
        print(f"Error decoding JSON from '{json_file_path}': {e}")
        print("Please validate the JSON file's content.")
        return
    # ----------------------------------------

    root_dir = '/content/drive/MyDrive/1-Projects/Pathology_Images'
    output_filename = 'data_updated.json'

    if not os.path.isdir(root_dir):
        print(f"Error: The root directory '{root_dir}' does not exist.")
        return

    def normalize_key(filename):
        name_without_ext = os.path.splitext(filename)[0]
        return re.sub(r'[^a-z0-9]', '', name_without_ext.lower())

    print(f"Scanning for local files in '{root_dir}'...")
    local_file_map = {}
    for dirpath, _, filenames in os.walk(root_dir):
        for filename in filenames:
            if filename.lower().endswith(('.png', '.jpg', '.jpeg')):
                full_path = os.path.join(dirpath, filename)
                normalized = normalize_key(filename)
                local_file_map[normalized] = full_path
    print(f"Scan complete. Mapped {len(local_file_map)} local files.")

    updated_count, not_found_count, mismatch_count = 0, 0, 0
    urls_to_process = ['Labeled_URL', 'Unlabeled_URL']

    for index, item in enumerate(json_data):
        print(f"Processing item {index + 1}/{len(json_data)}...", end='\r')
        if 'WSI_url' in item:
            del item['WSI_url']

        for url_key in urls_to_process:
            url_value = item.get(url_key)
            if url_value and isinstance(url_value, str) and url_value.startswith('http'):
                url_filename = os.path.basename(urlparse(url_value).path)
                normalized_url_key = normalize_key(url_filename)
                local_path = local_file_map.get(normalized_url_key)

                if local_path:
                    try:
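                        # HEAD request: fetch only the headers so the remote Content-Length
                        # can be compared with the local file size without downloading the image.
                        response = requests.head(url_value, timeout=5, allow_redirects=True)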
                        if response.status_code == 200 and 'content-length' in response.headers:
                            online_size = int(response.headers['content-length'])
                            local_size = os.path.getsize(local_path)

                            if online_size == local_size:
                                thumbnail_key = url_key.replace('_URL', '_Thumbnail_URL')
                                item[url_key] = local_path
                                if thumbnail_key in item:
                                    item[thumbnail_key] = local_path
                                updated_count += 1
                            else:
                                mismatch_count += 1
                        else:
                            mismatch_count += 1
                    except requests.exceptions.RequestException:
                        mismatch_count += 1
                else:
                    not_found_count += 1
    print("\nProcessing complete.")

    print(f"\nSaving updated JSON file to '{output_filename}'...")
    with open(output_filename, 'w', encoding='utf-8') as f:
        json.dump(json_data, f, indent=4)

    print("\n--- Validation & Update Summary ---")
    print(f"Total entries processed: {len(json_data)}")
    print(f"✅ URL fields updated: {updated_count}")
    print(f"⚠️ URLs skipped (mismatch/error): {mismatch_count}")
    print(f"❌ URLs with no local match: {not_found_count}")
    print("---------------------------------")

# --- HOW TO EXECUTE ---
# 1. Save your JSON data into a file named 'data.json'.
# 2. Run the script by calling the function with the filename.
update_json_from_file('data.json')

SyntaxError: f-string: expecting a valid expression after '{' (ipython-input-142705738.py, line 23)
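
The matching step in `update_json_from_file` depends entirely on `normalize_key`: the URL's basename and each local filename are lowercased, stripped of their extension, and reduced to `a-z0-9` before lookup. The sketch below isolates that step so it can be checked without Drive or network access. The URL and paths are made up for illustration, and `build_file_map` is a hypothetical helper that also reports key collisions, which the cell above does not do (there, a later file silently replaces an earlier one with the same key; here the first file is kept).

```python
import os
import re
from urllib.parse import urlparse


def normalize_key(filename):
    """Lowercase the name, drop the extension, keep only a-z and 0-9."""
    name_without_ext = os.path.splitext(filename)[0]
    return re.sub(r'[^a-z0-9]', '', name_without_ext.lower())


def build_file_map(paths):
    """Map normalized keys to paths and collect keys claimed by more than one file."""
    file_map, collisions = {}, {}
    for path in paths:
        key = normalize_key(os.path.basename(path))
        if key in file_map:
            # Keep the first path in the map, but remember every path that shares the key.
            collisions.setdefault(key, [file_map[key]]).append(path)
        else:
            file_map[key] = path
    return file_map, collisions


# Hypothetical inputs, for illustration only.
local_paths = [
    '/tmp/images/Basal Cell Carcinoma 1a.jpg',
    '/tmp/images/copy/basal-cell-carcinoma_1a.JPG',
]
url = 'https://example.org/slides/Basal_Cell_Carcinoma_1a.jpg?size=full'

file_map, collisions = build_file_map(local_paths)
url_key = normalize_key(os.path.basename(urlparse(url).path))
matched_path = file_map.get(url_key)

print(f"URL key: {url_key}")
print(f"Matched local file: {matched_path}")
print(f"Colliding keys: {list(collisions)}")
```

Because spaces, underscores, hyphens, and case are all discarded, two differently named files can share one key. Listing the collisions before rewriting the JSON makes it easier to see which local copy a URL would be pointed at.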

## Add descriptions for PPT image folders

In [None]:
import os
import re
import json
import time
import sys

# Ensure the 'google-generativeai' library is available
try:
    import google.generativeai as genai
    from google.colab import userdata
except ImportError:
    print("Error: Required Google libraries not found.")
    print("Please run '!pip install -q -U google-generativeai' in a cell first.")
    sys.exit(1)

def generate_description_with_gemini(entity_name, model, retries=3, delay=5):
    """
    Calls the Gemini API with a retry mechanism to generate a histologic description.
    """
    prompt = f"""
    TASK: Generate a concise microscopic description for the dermatopathology entity: "{entity_name}".
    INSTRUCTIONS:
    - Focus ONLY on the key histologic features as seen under a microscope.
    - Do not add any introductory or concluding phrases.
    - Your entire response MUST be only the description itself."""

    for attempt in range(retries):
        try:
            response = model.generate_content(prompt)
            clean_text = re.sub(r'[\*\#]', '', response.text).strip()
            return clean_text
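        except Exception as e:
            # Keep retry output short: report the exception type, not the full message.
            print(f"  -> API call for '{entity_name}' failed on attempt {attempt + 1}/{retries} ({type(e).__name__}).")
            if attempt < retries - 1:  # Skip the wait after the final attempt.
                time.sleep(delay)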

    print(f"  -> All retry attempts failed for '{entity_name}'.")
    return f"Error generating description for {entity_name} after {retries} attempts."


def create_json_with_auto_model():
    """
    Builds a new JSON file by first discovering an available Gemini model, then
    scanning an image folder and using the API to generate all descriptions.
    """
    # --- Configuration ---
    image_folder_to_scan = '/content/drive/MyDrive/1-Projects/Pathology_Images/Skin images'
    output_filename = 'skin_images_api_generated.json'

    # --- 1. Configure Gemini API ---
    print("Configuring Gemini API...")
    try:
        api_key = userdata.get('GEMINI_API_KEY')
        if not api_key:
            raise ValueError("Secret 'GEMINI_API_KEY' not found or is empty.")
        genai.configure(api_key=api_key)
    except Exception as e:
        print(f"FATAL ERROR: Could not configure Gemini API. Details: {e}")
        return

    # --- 2. Automatically find a suitable model ---
    print("Discovering available models for your API key...")
    suitable_model_name = None
    try:
        models = [m for m in genai.list_models() if 'generateContent' in m.supported_generation_methods]

        # Prioritize finding a 'flash' model, then 'pro', then any other.
        flash_models = [m for m in models if 'flash' in m.name]
        pro_models = [m for m in models if 'pro' in m.name]

        if flash_models:
            suitable_model_name = flash_models[0].name
        elif pro_models:
            suitable_model_name = pro_models[0].name
        elif models:
            suitable_model_name = models[0].name

        if not suitable_model_name:
            raise RuntimeError("No models that support 'generateContent' were found.")

        model = genai.GenerativeModel(suitable_model_name)
        print(f"--> Automatically selected model: '{suitable_model_name}'")

    except Exception as e:
        print(f"FATAL ERROR: Could not find a suitable model. Details: {e}")
        return

    # --- 3. Find all labeled image files ---
    print(f"\nScanning for labeled images (*a.jpg) in: {image_folder_to_scan}")
    image_paths_to_process = [
        os.path.join(dirpath, filename)
        for dirpath, _, filenames in os.walk(image_folder_to_scan)
        for filename in filenames
        if filename.lower().endswith('a.jpg')  # labeled images only; .jpeg/.png are not matched
    ]
    print(f"Found {len(image_paths_to_process)} labeled image files to process.")

    # --- 4. Process each image individually ---
    final_json_data = []
    print("\nGenerating descriptions via Gemini API (this will take some time)...")

    for i, image_path in enumerate(image_paths_to_process):
        base_filename = os.path.basename(image_path)
        entity_name_raw = os.path.splitext(base_filename)[0]
        # Strip a trailing " <number><optional letter>" suffix, e.g. "Psoriasis 2a" -> "Psoriasis".
        entity_name_clean = re.sub(r' \d+[a-z]?$', '', entity_name_raw, flags=re.IGNORECASE).strip()

        print(f"\n--- Processing {i+1}/{len(image_paths_to_process)}: {base_filename} ---")
        print(f" -> Querying Gemini API for '{entity_name_clean}'...")

        description = generate_description_with_gemini(entity_name_clean, model)
        print(f"Description Received: {description[:150].strip()}...")

        image_data = {
            "Entity_Name": entity_name_clean,
            "Image_URL": image_path,
            "Description": description
        }
        final_json_data.append(image_data)
        time.sleep(1) # Base delay to respect API rate limits

    # --- 5. Save the final JSON file ---
    print(f"\n\nProcessing complete. Saving results to '{output_filename}'...")
    with open(output_filename, 'w', encoding='utf-8') as f:
        json.dump(final_json_data, f, indent=4)

    print("\n--- Final Summary ---")
    print(f"✅ Total images processed: {len(final_json_data)}")
    print(f"Successfully created new file: {output_filename}")
    print("---------------------")

# --- Execute the script ---
create_json_with_auto_model()

Configuring Gemini API...
Discovering available models for your API key...
--> Automatically selected model: 'models/gemini-2.5-flash-preview-05-20'

Scanning for labeled images (*a.jpg) in: /content/drive/MyDrive/1-Projects/Pathology_Images/Skin images
Found 0 labeled image files to process.

Generating descriptions via Gemini API (this will take some time)...


Processing complete. Saving results to 'skin_images_api_generated.json'...

--- Final Summary ---
✅ Total images processed: 0
Successfully created new file: skin_images_api_generated.json
---------------------
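
The run above found 0 files matching `*a.jpg`, so the cell wrote an empty JSON file. The log does not say whether the folder is empty, the images use another extension, or the names do not end in `a`. A minimal sketch for checking this is to tally what the folder actually contains. `summarize_folder` is a hypothetical helper, and the demo uses a throwaway temp folder with made-up file names rather than the real Drive path.

```python
import os
import tempfile
from collections import Counter


def summarize_folder(root):
    """Count files by lowercase extension and by the last character of the stem."""
    by_ext, by_last_char = Counter(), Counter()
    for _dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            stem, ext = os.path.splitext(name)
            by_ext[ext.lower()] += 1
            by_last_char[stem[-1:].lower()] += 1
    return by_ext, by_last_char


# Demo on a temporary folder with made-up file names.
with tempfile.TemporaryDirectory() as root:
    for name in ['Lichen planus 1A.jpeg', 'Lichen planus 1B.jpeg', 'Psoriasis 2a.png']:
        open(os.path.join(root, name), 'w').close()

    by_ext, by_last_char = summarize_folder(root)

    # Same filter as the cell above, to show why it can come back empty.
    jpg_matches = [
        f for _dp, _dn, fn in os.walk(root) for f in fn if f.lower().endswith('a.jpg')
    ]

print(f"Files by extension: {dict(by_ext)}")
print(f"Files by last stem character: {dict(by_last_char)}")
print(f"Matches for '*a.jpg': {len(jpg_matches)}")
```

In this demo, two stems end in `a` but none of the files is a `.jpg`, so the filter matches nothing. Pointing `summarize_folder` at `image_folder_to_scan` would show whether the real folder has the same kind of mismatch.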


## Replace error descriptions

In [None]:
import os
import re
import json
import time
import sys

try:
    import google.generativeai as genai
    from google.colab import userdata
except ImportError:
    print("Error: Required Google libraries not found.")
    print("Please run '!pip install -q -U google-generativeai' in a cell first.")
    sys.exit(1)

def generate_description_with_gemini(entity_name, model):
    """
    Calls the Gemini API to generate a histologic description for a single entity.
    """
    prompt = f"""
    TASK: Generate a concise microscopic description for the dermatopathology entity: "{entity_name}".
    INSTRUCTIONS:
    - Focus ONLY on the key histologic features as seen under a microscope.
    - Do not add any introductory or concluding phrases like "Here is the description...".
    - Your entire response MUST be only the description itself."""
    try:
        response = model.generate_content(prompt)
        clean_text = re.sub(r'[\*\#]', '', response.text).strip()
        return clean_text
    except Exception as e:
        print(f"  -> Gemini API call failed for '{entity_name}': {e}")
        return f"Error generating description for {entity_name}."

def fix_existing_json_descriptions():
    """
    Reads an existing JSON file and uses the Gemini API to fix only the entries
    that have error messages or missing descriptions.
    """
    # --- Configuration ---
    input_json_path = 'skin_images_api_generated.json'
    output_json_path = 'skin_images_api_generated_FIXED.json'

    # --- 1. Configure Gemini API ---
    print("Configuring Gemini API...")
    try:
        api_key = userdata.get('GEMINI_API_KEY')
        if not api_key:
            raise ValueError("Secret 'GEMINI_API_KEY' not found or is empty.")
        genai.configure(api_key=api_key)
        model = genai.GenerativeModel('gemini-1.5-flash')
        print("Gemini API configured successfully.")
    except Exception as e:
        print(f"FATAL ERROR: Could not configure Gemini API. Details: {e}")
        return

    # --- 2. Load the existing JSON file ---
    print(f"\nReading existing data from '{input_json_path}'...")
    try:
        with open(input_json_path, 'r', encoding='utf-8') as f:
            json_data = json.load(f)
        print(f"Successfully loaded {len(json_data)} entries.")
    except FileNotFoundError:
        print(f"FATAL ERROR: The input file was not found. Make sure '{input_json_path}' exists.")
        return
    except json.JSONDecodeError as e:
        print(f"FATAL ERROR: The file is not a valid JSON file. Error: {e}")
        return

    # --- 3. Iterate, check, and fix descriptions ---
    updated_count = 0
    skipped_count = 0

    print("\nChecking and fixing descriptions...")
    for i, item in enumerate(json_data):
        description = item.get("Description", "")
        entity_name = item.get("Entity_Name", "Unknown Entity")

        # Regenerate when the description is missing or empty, or when it
        # starts with the error placeholder written by a failed API call.
        if not description or description.lower().startswith("error generating"):
            print(f"\n--- Fixing {i+1}/{len(json_data)}: {entity_name} ---")
            print(f" -> Bad description found: '{description[:50]}...' Querying API...")

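            new_description = generate_description_with_gemini(entity_name, model)
            item["Description"] = new_description
            # Count the entry as fixed only if the call did not return another error placeholder.
            if not new_description.lower().startswith("error generating"):
                updated_count += 1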

            print(f" -> New Description: {new_description[:150].strip()}...")
            time.sleep(1) # Respect API rate limits
        else:
            skipped_count += 1
            print(f"Skipping {i+1}/{len(json_data)}: {entity_name} (Description OK)", end='\r')

    # --- 4. Save the fixed JSON file ---
    print(f"\n\nProcessing complete. Saving results to '{output_json_path}'...")
    with open(output_json_path, 'w', encoding='utf-8') as f:
        json.dump(json_data, f, indent=4)

    print("\n--- Final Summary ---")
    print(f"✅ Total entries checked: {len(json_data)}")
    print(f"🛠️ Descriptions fixed/generated by API: {updated_count}")
    print(f"👍 Descriptions skipped (already good): {skipped_count}")
    print(f"Successfully saved fixed file to: {output_json_path}")
    print("---------------------")

# --- Execute the script ---
fix_existing_json_descriptions()
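
The two cells above share a pattern: a failed API call is stored as an "Error generating..." placeholder, and the fix-up pass later looks for that placeholder. A minimal sketch of the same idea with exponential backoff follows. `call_with_retries` and `needs_regeneration` are hypothetical helpers, and `flaky_generate` is a stand-in for `model.generate_content` so that the example runs without an API key. The `sleep` parameter is injectable only so the waits can be recorded instead of actually slept.

```python
import time


def call_with_retries(fn, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fn() up to `retries` times, doubling the wait between attempts.

    Returns fn()'s result, or None if every attempt raised.
    """
    for attempt in range(retries):
        try:
            return fn()
        except Exception as exc:
            print(f"  -> attempt {attempt + 1}/{retries} failed: {exc}")
            if attempt < retries - 1:  # no wait after the final attempt
                sleep(base_delay * 2 ** attempt)
    return None


def needs_regeneration(description):
    """Mirror the cell's check: missing, empty, or an error placeholder."""
    return not description or description.lower().startswith("error generating")


# Stand-in for the real API call: fails twice, then succeeds.
calls = {"n": 0}


def flaky_generate():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("simulated transient failure")
    return "placeholder description text"


waits = []  # record the requested delays instead of sleeping
result = call_with_retries(flaky_generate, retries=3, base_delay=1.0, sleep=waits.append)

print(f"Result: {result}")
print(f"Delays requested between attempts: {waits}")
print(f"Needs another pass: {needs_regeneration(result)}")
```

Returning `None` on total failure, rather than an error string, keeps error text out of the `Description` field. `needs_regeneration` treats `None`, an empty string, and the old placeholder the same way, so files written by the earlier cell can still be repaired.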