# 🏥 Medical Data Ingestion & Embedding Pipeline for Kaggle GPU
## Large-Scale ChromaDB Vector Store Creation

**Purpose**: This notebook performs comprehensive medical data ingestion and embedding for a RAG (Retrieval-Augmented Generation) system. It processes two major medical datasets and creates persistent ChromaDB vector stores optimized for medical consultations.

**Datasets Processed**:
1. **EPFL Guidelines Dataset** - Clinical guidelines and treatment protocols
2. **MedRAG/MedQuAD Dataset** - Medical question-answer pairs

**Key Features**:
- ⚡ **GPU-Optimized** - Leverages Kaggle's GPU environment for fast embedding generation
- 🗄️ **Persistent Storage** - Creates ChromaDB collections that persist across sessions
- 🔧 **Memory Efficient** - Optimized memory management for large-scale processing
- 📦 **Downloadable Output** - Compressed archive ready for deployment

**Output**: Two separate ChromaDB collections in a downloadable archive suitable for medical RAG systems.

## 📦 Section 1: Environment Setup and Dependencies

Installing all required packages for the medical data ingestion pipeline. This cell installs LangChain ecosystem packages, ChromaDB for vector storage, HuggingFace transformers for embeddings, and Datasets library for efficient data loading.

In [1]:
!pip install langchain langchain_community langchain_huggingface chromadb sentence-transformers datasets

print("✅ Package installation completed successfully!")
print("🔧 Environment ready for medical data ingestion pipeline")

Collecting langchain_community
  Downloading langchain_community-0.3.27-py3-none-any.whl.metadata (2.9 kB)
Collecting langchain_huggingface
  Downloading langchain_huggingface-0.3.1-py3-none-any.whl.metadata (996 bytes)
Collecting chromadb
  Downloading chromadb-1.0.15-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (7.0 kB)
Collecting pydantic-settings<3.0.0,>=2.4.0 (from langchain_community)
  Downloading pydantic_settings-2.10.1-py3-none-any.whl.metadata (3.4 kB)
Collecting httpx-sse<1.0.0,>=0.4.0 (from langchain_community)
  Downloading httpx_sse-0.4.1-py3-none-any.whl.metadata (9.4 kB)
Collecting langchain-core<1.0.0,>=0.3.66 (from langchain)
  Downloading langchain_core-0.3.71-py3-none-any.whl.metadata (5.8 kB)
Collecting huggingface-hub>=0.33.4 (from langchain_huggingface)
  Downloading huggingface_hub-0.33.4-py3-none-any.whl.metadata (14 kB)
Collecting pybase64>=1.4.1 (from chromadb)
  Downloading pybase64-1.4.1-cp311-cp311-manylinux_2_5_x86_64.manylinux1_x86_

## ⚙️ Section 2: Configuration and Constants

Defining global constants for consistent configuration across the entire data ingestion pipeline. These constants ensure proper file paths, collection naming, and embedding model configuration.

In [7]:
import os
import sys
from typing import List, Dict, Any
import gc
import time

# =============================================================================
# GLOBAL CONFIGURATION CONSTANTS - CRITICAL FOR CONSISTENCY
# =============================================================================

# Database and Storage Configuration
DB_PERSIST_DIRECTORY = "/kaggle/working/chroma_db"
GUIDELINES_COLLECTION_NAME = "medical_guidelines"

# =========================================================================
# --- Use the Textbooks dataset for the general collection ---
# =========================================================================
GENERAL_KNOWLEDGE_COLLECTION_NAME = "medical_textbooks" # More descriptive name

# Embedding Model Configuration
EMBEDDING_MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"

# Text Processing Configuration
CHUNK_SIZE = 1000
CHUNK_OVERLAP = 150

# Dataset Configuration
GUIDELINES_DATASET = "epfl-llm/guidelines"
# =========================================================================
# --- Point to the new Textbooks dataset ---
# =========================================================================
TEXTBOOKS_DATASET = "MedRAG/textbooks"

print("🔧 MEDICAL DATA INGESTION CONFIGURATION")
print("=" * 50)
print(f"📁 Database Directory: {DB_PERSIST_DIRECTORY}")
print(f"🏥 Guidelines Collection: {GUIDELINES_COLLECTION_NAME}")
print(f"📚 Textbooks Collection: {GENERAL_KNOWLEDGE_COLLECTION_NAME}")
print(f"🤖 Embedding Model: {EMBEDDING_MODEL_NAME}")
print(f"📄 Chunk Size: {CHUNK_SIZE} (Overlap: {CHUNK_OVERLAP})")
print(f"📊 Guidelines Dataset: {GUIDELINES_DATASET}")
print(f"📊 Textbooks Dataset: {TEXTBOOKS_DATASET}")
print("=" * 50)
print("✅ Configuration loaded successfully!")

🔧 MEDICAL DATA INGESTION CONFIGURATION
📁 Database Directory: /kaggle/working/chroma_db
🏥 Guidelines Collection: medical_guidelines
📚 Textbooks Collection: medical_textbooks
🤖 Embedding Model: sentence-transformers/all-MiniLM-L6-v2
📄 Chunk Size: 1000 (Overlap: 150)
📊 Guidelines Dataset: epfl-llm/guidelines
📊 Textbooks Dataset: MedRAG/textbooks
✅ Configuration loaded successfully!


## 🤖 Section 3: Embedding Model Initialization

Initializing the HuggingFace embedding model with GPU optimization. The model will automatically detect and use available GPU resources in the Kaggle environment for maximum performance.

In [4]:
from langchain_huggingface import HuggingFaceEmbeddings
import torch

# =============================================================================
# EMBEDDING MODEL INITIALIZATION WITH GPU OPTIMIZATION
# =============================================================================

print("🤖 INITIALIZING EMBEDDING MODEL FOR MEDICAL DATA PROCESSING")
print("=" * 60)

# Check GPU availability
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"🔧 Computing Device: {device}")

if device == "cuda":
    gpu_name = torch.cuda.get_device_name(0)
    gpu_memory = torch.cuda.get_device_properties(0).total_memory / 1024**3
    print(f"⚡ GPU: {gpu_name}")
    print(f"💾 GPU Memory: {gpu_memory:.1f} GB")
else:
    print("⚠️  Warning: GPU not available, using CPU (slower processing)")

print(f"🔄 Loading embedding model: {EMBEDDING_MODEL_NAME}")
print("   Note: This model will automatically leverage GPU acceleration")

# =========================================================================
# --- Initialize the all-MiniLM-L6-v2 model ---
# This model does NOT require the 'trust_remote_code' parameter.
# =========================================================================
embedding_model = HuggingFaceEmbeddings(
    model_name=EMBEDDING_MODEL_NAME,
    model_kwargs={
        'device': device            # Use GPU if available
    },
    encode_kwargs={
        'normalize_embeddings': True  # Normalizing embeddings is good practice for this model
    }
)
# --- END OF MODIFICATION ---

print("✅ Embedding model initialized successfully!")
print(f"🎯 Model ready for medical document embedding on {device.upper()}")
print("=" * 60)

🤖 INITIALIZING EMBEDDING MODEL FOR MEDICAL DATA PROCESSING
🔧 Computing Device: cuda
⚡ GPU: Tesla T4
💾 GPU Memory: 14.7 GB
🔄 Loading embedding model: sentence-transformers/all-MiniLM-L6-v2
   Note: This model will automatically leverage GPU acceleration


2025-07-23 12:44:09.549857: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1753274649.871975      36 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1753274649.960836      36 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


AttributeError: 'MessageFactory' object has no attribute 'GetPrototype'

AttributeError: 'MessageFactory' object has no attribute 'GetPrototype'

AttributeError: 'MessageFactory' object has no attribute 'GetPrototype'

AttributeError: 'MessageFactory' object has no attribute 'GetPrototype'

AttributeError: 'MessageFactory' object has no attribute 'GetPrototype'

modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

README.md: 0.00B [00:00, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/612 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/350 [00:00<?, ?B/s]

vocab.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

✅ Embedding model initialized successfully!
🎯 Model ready for medical document embedding on CUDA


## 🏥 Section 4: Guidelines Dataset Processing

Processing the EPFL Guidelines dataset containing clinical guidelines and treatment protocols. This section loads the dataset, performs text chunking, and ingests the processed documents into a dedicated ChromaDB collection.

In [5]:
from datasets import load_dataset
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain.docstore.document import Document

# =============================================================================
# PART A: GUIDELINES DATASET PROCESSING
# =============================================================================

print("🏥 PROCESSING MEDICAL GUIDELINES DATASET")
print("=" * 60)

# Initialize text splitter for chunking documents
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=CHUNK_SIZE,
    chunk_overlap=CHUNK_OVERLAP,
    separators=["\n\n", "\n", ". ", " ", ""],
    length_function=len
)

print(f"🔧 Text Splitter configured:")
print(f"   📏 Chunk Size: {CHUNK_SIZE} characters")
print(f"   🔄 Overlap: {CHUNK_OVERLAP} characters")

# Load the EPFL Guidelines dataset
print(f"📊 Loading dataset: {GUIDELINES_DATASET}")
start_time = time.time()

try:
    guidelines_dataset = load_dataset(GUIDELINES_DATASET, split="train")
    load_time = time.time() - start_time
    
    print(f"✅ Full dataset loaded successfully in {load_time:.2f} seconds")
    print(f"📋 Full dataset size: {len(guidelines_dataset)} documents")
    
    # FOR DEMONSTRATION: Sample 10,000 documents
    print(f"\n🔥 Taking a random sample of 10,000 documents for demonstration purposes...")
    if len(guidelines_dataset) > 10000:
        guidelines_dataset = guidelines_dataset.shuffle(seed=42).select(range(10000))
        print(f"✅ Sampled dataset size: {len(guidelines_dataset)} documents")
    else:
        print("⚠️  Dataset has fewer than 10,000 documents, using all of them.")
    
    if len(guidelines_dataset) > 0:
        sample = guidelines_dataset[0]
        print(f"📄 Sample fields: {list(sample.keys())}")
        if 'clean_text' in sample:
            print(f"📝 Sample text length: {len(sample['clean_text'])} characters")
    
except Exception as e:
    print(f"❌ Error loading or sampling dataset: {e}")
    raise

# Process and chunk the guidelines documents
print(f"\n🔄 Processing and chunking {len(guidelines_dataset)} guidelines documents...")
documents = []

for i, item in enumerate(guidelines_dataset):
    if i % 1000 == 0:
        print(f"   Processing document {i+1}/{len(guidelines_dataset)}")
    
    if 'clean_text' in item and item['clean_text']:
        doc = Document(
            page_content=item['clean_text'],
            metadata={'source': 'epfl_guidelines', 'doc_id': i, 'dataset': GUIDELINES_DATASET}
        )
        documents.append(doc)

print(f"✅ Created {len(documents)} documents from guidelines dataset")

# Chunk all documents
print(f"🔄 Chunking documents...")
chunked_documents = text_splitter.split_documents(documents)

print(f"✅ Created {len(chunked_documents)} chunks from {len(documents)} documents")
print(f"📊 Average chunks per document: {len(chunked_documents) / len(documents):.1f}")

# Ingest into ChromaDB
print(f"\n💾 Ingesting chunks into ChromaDB...")
print(f"   🗄️  Collection: {GUIDELINES_COLLECTION_NAME}")
print(f"   📁 Persist Directory: {DB_PERSIST_DIRECTORY}")

os.makedirs(DB_PERSIST_DIRECTORY, exist_ok=True)

try:
    guidelines_db = Chroma.from_documents(
        documents=chunked_documents,
        embedding=embedding_model,
        persist_directory=DB_PERSIST_DIRECTORY,
        collection_name=GUIDELINES_COLLECTION_NAME
    )
    
    collection_count = guidelines_db._collection.count()
    
    print(f"✅ GUIDELINES DATASET INGESTION COMPLETED SUCCESSFULLY!")
    print(f"📊 Total chunks ingested into '{GUIDELINES_COLLECTION_NAME}': {collection_count}")
    print(f"💾 Database persisted to: {DB_PERSIST_DIRECTORY}")
    
except Exception as e:
    print(f"❌ Error during ChromaDB ingestion: {e}")
    raise

# Clear memory
guidelines_db = None
documents = None
chunked_documents = None
guidelines_dataset = None
gc.collect()

print(f"🧹 Memory cleared for next dataset processing")
print("=" * 60)

🏥 PROCESSING MEDICAL GUIDELINES DATASET
🔧 Text Splitter configured:
   📏 Chunk Size: 1000 characters
   🔄 Overlap: 150 characters
📊 Loading dataset: epfl-llm/guidelines


README.md: 0.00B [00:00, ?B/s]

open_guidelines.jsonl:   0%|          | 0.00/878M [00:00<?, ?B/s]

Generating train split: 0 examples [00:00, ? examples/s]

✅ Full dataset loaded successfully in 7.94 seconds
📋 Full dataset size: 37970 documents

🔥 Taking a random sample of 10,000 documents for demonstration purposes...
✅ Sampled dataset size: 10000 documents
📄 Sample fields: ['id', 'source', 'title', 'clean_text', 'raw_text', 'url', 'overview']
📝 Sample text length: 42213 characters

🔄 Processing and chunking 10000 guidelines documents...
   Processing document 1/10000
   Processing document 1001/10000
   Processing document 2001/10000
   Processing document 3001/10000
   Processing document 4001/10000
   Processing document 5001/10000
   Processing document 6001/10000
   Processing document 7001/10000
   Processing document 8001/10000
   Processing document 9001/10000
✅ Created 10000 documents from guidelines dataset
🔄 Chunking documents...
✅ Created 188808 chunks from 10000 documents
📊 Average chunks per document: 18.9

💾 Ingesting chunks into ChromaDB...
   🗄️  Collection: medical_guidelines
   📁 Persist Directory: /kaggle/working/chrom

## ❓ Section 5: MedQuAD Dataset Processing

Processing the MedRAG/MedQuAD dataset containing medical question-answer pairs. This section formats the Q&A pairs into structured documents, applies text chunking, and ingests them into a separate ChromaDB collection for medical consultation queries.

In [9]:
# =============================================================================
# PART B: MEDICAL TEXTBOOKS DATASET PROCESSING
# =============================================================================

print("📚 PROCESSING MEDICAL TEXTBOOKS DATASET")
print("=" * 60)

# Load the MedRAG/textbooks dataset
print(f"📊 Loading dataset: {TEXTBOOKS_DATASET}")
start_time = time.time()

try:
    textbooks_dataset = load_dataset(TEXTBOOKS_DATASET, split="train")
    load_time = time.time() - start_time
    
    print(f"✅ Full dataset loaded successfully in {load_time:.2f} seconds")
    print(f"📋 Full dataset size: {len(textbooks_dataset)} documents")
    
    # FOR DEMONSTRATION: Sample 10,000 documents
    print(f"\n🔥 Taking a random sample of 10,000 documents for demonstration purposes...")
    if len(textbooks_dataset) > 10000:
        textbooks_dataset = textbooks_dataset.shuffle(seed=42).select(range(10000))
        print(f"✅ Sampled dataset size: {len(textbooks_dataset)} documents")
    else:
        print("⚠️  Dataset has fewer than 10,000 documents, using all of them.")

    # Show sample data structure for the textbooks dataset
    if len(textbooks_dataset) > 0:
        sample = textbooks_dataset[0]
        print(f"📄 Sample fields: {list(sample.keys())}")
        # =========================================================================
        # --- Use the 'content' column name for the check ---
        # =========================================================================
        if 'content' in sample:
            print(f"📝 Sample text length: {len(sample['content'])} characters")
        if 'title' in sample:
            print(f"📖 Sample from book (using title): {sample['title']}")
    
except Exception as e:
    print(f"❌ Error loading or sampling dataset: {e}")
    raise

# Process and prepare documents from the textbook dataset
print(f"\n🔄 Processing {len(textbooks_dataset)} textbook documents...")
textbook_documents = []

for i, item in enumerate(textbooks_dataset):
    if i % 1000 == 0:
        print(f"   Processing textbook document {i+1}/{len(textbooks_dataset)}")
    
    # =========================================================================
    # --- Use the 'content' column name for processing ---
    # =========================================================================
    if 'content' in item and item['content']:
        doc = Document(
            page_content=item['content'], # Use 'content' here
            metadata={
                'source': 'medrag_textbooks',
                'title': item.get('title', 'Unknown'), # Use 'title' for book name
                'chapter': item.get('contents', 'Unknown'), # Use 'contents' for chapter
                'doc_id': item.get('id', i)
            }
        )
        textbook_documents.append(doc)

print(f"✅ Created {len(textbook_documents)} documents from the textbooks dataset")

# Chunk the formatted textbook documents using the same text splitter
print(f"🔄 Chunking textbook documents...")
textbook_chunked_documents = text_splitter.split_documents(textbook_documents)

print(f"✅ Created {len(textbook_chunked_documents)} chunks from {len(textbook_documents)} documents")

# This will no longer cause a ZeroDivisionError
if len(textbook_documents) > 0:
    print(f"📊 Average chunks per textbook document: {len(textbook_chunked_documents) / len(textbook_documents):.1f}")

# Ingest into ChromaDB with the new collection name
print(f"\n💾 Ingesting textbook chunks into ChromaDB...")
print(f"   🗄️  Collection: {GENERAL_KNOWLEDGE_COLLECTION_NAME}")
print(f"   📁 Persist Directory: {DB_PERSIST_DIRECTORY}")

try:
    textbooks_db = Chroma.from_documents(
        documents=textbook_chunked_documents,
        embedding=embedding_model,
        persist_directory=DB_PERSIST_DIRECTORY,
        collection_name=GENERAL_KNOWLEDGE_COLLECTION_NAME
    )
    
    textbooks_collection_count = textbooks_db._collection.count()
    
    print(f"✅ TEXTBOOKS DATASET INGESTION COMPLETED SUCCESSFULLY!")
    print(f"📊 Total chunks ingested into '{GENERAL_KNOWLEDGE_COLLECTION_NAME}': {textbooks_collection_count}")
    print(f"💾 Database persisted to: {DB_PERSIST_DIRECTORY}")
    
except Exception as e:
    print(f"❌ Error during ChromaDB ingestion: {e}")
    raise

# Clear memory after processing
textbooks_db = None
textbook_documents = None
textbook_chunked_documents = None
textbooks_dataset = None
gc.collect()

print(f"🧹 Memory cleared after textbook dataset processing")
print("=" * 60)

# Summary of both collections
print(f"\n📊 INGESTION SUMMARY:")
print(f"   🏥 Guidelines Collection: '{GUIDELINES_COLLECTION_NAME}'")
print(f"   📚 Textbooks Collection: '{GENERAL_KNOWLEDGE_COLLECTION_NAME}'")
print(f"   📁 Database Location: {DB_PERSIST_DIRECTORY}")
print("✅ Both medical datasets successfully ingested into separate ChromaDB collections!")

📚 PROCESSING MEDICAL TEXTBOOKS DATASET
📊 Loading dataset: MedRAG/textbooks


Resolving data files:   0%|          | 0/18 [00:00<?, ?it/s]

✅ Full dataset loaded successfully in 0.51 seconds
📋 Full dataset size: 125847 documents

🔥 Taking a random sample of 10,000 documents for demonstration purposes...
✅ Sampled dataset size: 10000 documents
📄 Sample fields: ['id', 'title', 'content', 'contents']
📝 Sample text length: 548 characters
📖 Sample from book (using title): Neurology_Adams

🔄 Processing 10000 textbook documents...
   Processing textbook document 1/10000
   Processing textbook document 1001/10000
   Processing textbook document 2001/10000
   Processing textbook document 3001/10000
   Processing textbook document 4001/10000
   Processing textbook document 5001/10000
   Processing textbook document 6001/10000
   Processing textbook document 7001/10000
   Processing textbook document 8001/10000
   Processing textbook document 9001/10000
✅ Created 10000 documents from the textbooks dataset
🔄 Chunking textbook documents...
✅ Created 10000 chunks from 10000 documents
📊 Average chunks per textbook document: 1.0

💾 Ingest

## 📦 Section 6: Final Verification and Archive Creation

Creating a compressed archive of the complete ChromaDB database for easy download from Kaggle. This section also performs final verification to ensure both collections were created successfully and are ready for deployment.

In [10]:
# =============================================================================
# FINAL VERIFICATION AND ARCHIVE CREATION
# =============================================================================

print("🔍 FINAL VERIFICATION AND ARCHIVE CREATION")
print("=" * 60)

# Verify that the database directory exists and contains expected files
print(f"📁 Verifying database directory: {DB_PERSIST_DIRECTORY}")

if os.path.exists(DB_PERSIST_DIRECTORY):
    print("✅ Database directory exists")
    
    # List contents of the database directory
    db_contents = os.listdir(DB_PERSIST_DIRECTORY)
    print(f"📂 Database directory contents: {len(db_contents)} items")
    
    for item in sorted(db_contents):
        item_path = os.path.join(DB_PERSIST_DIRECTORY, item)
        if os.path.isdir(item_path):
            print(f"   📁 {item}/ (directory)")
        else:
            file_size = os.path.getsize(item_path) / 1024 / 1024  # MB
            print(f"   📄 {item} ({file_size:.2f} MB)")
    
    # Verify collections by attempting to connect to them
    try:
        print(f"\n🔍 Verifying collections...")
        
        # Test Guidelines collection
        guidelines_test_db = Chroma(
            persist_directory=DB_PERSIST_DIRECTORY,
            embedding_function=embedding_model,
            collection_name=GUIDELINES_COLLECTION_NAME
        )
        guidelines_count = guidelines_test_db._collection.count()
        print(f"✅ Guidelines collection '{GUIDELINES_COLLECTION_NAME}': {guidelines_count} documents")
        
        # Test Q&A collection
        qna_test_db = Chroma(
            persist_directory=DB_PERSIST_DIRECTORY,
            embedding_function=embedding_model,
            collection_name=QNA_COLLECTION_NAME
        )
        qna_count = qna_test_db._collection.count()
        print(f"✅ Q&A collection '{QNA_COLLECTION_NAME}': {qna_count} documents")
        
        total_documents = guidelines_count + qna_count
        print(f"📊 Total documents across both collections: {total_documents}")
        
        # Clean up test connections
        guidelines_test_db = None
        qna_test_db = None
        
    except Exception as e:
        print(f"⚠️  Warning: Could not verify collections: {e}")
        print("   Database files exist but verification failed")
    
else:
    print(f"❌ Database directory does not exist: {DB_PERSIST_DIRECTORY}")
    raise FileNotFoundError(f"Database directory not found: {DB_PERSIST_DIRECTORY}")

# Create compressed archive for download
print(f"\n📦 Creating compressed archive for download...")
archive_name = "chroma_db.zip"

try:
    # Use the zip command to create a compressed archive
    zip_command = f"cd /kaggle/working && zip -r {archive_name} chroma_db"
    exit_code = os.system(zip_command)
    
    if exit_code == 0:
        archive_path = f"/kaggle/working/{archive_name}"
        if os.path.exists(archive_path):
            archive_size = os.path.getsize(archive_path) / 1024 / 1024  # MB
            print(f"✅ Archive created successfully: {archive_name}")
            print(f"📦 Archive size: {archive_size:.2f} MB")
            print(f"📁 Archive location: {archive_path}")
        else:
            print(f"❌ Archive file not found after creation")
    else:
        print(f"❌ Zip command failed with exit code: {exit_code}")
        
    # Alternative method using Python's zipfile if the command fails
    if not os.path.exists(f"/kaggle/working/{archive_name}"):
        print("🔄 Attempting alternative zip creation method...")
        import zipfile
        import shutil
        
        with zipfile.ZipFile(f"/kaggle/working/{archive_name}", 'w', zipfile.ZIP_DEFLATED) as zipf:
            for root, dirs, files in os.walk(DB_PERSIST_DIRECTORY):
                for file in files:
                    file_path = os.path.join(root, file)
                    arc_path = os.path.relpath(file_path, os.path.dirname(DB_PERSIST_DIRECTORY))
                    zipf.write(file_path, arc_path)
        
        if os.path.exists(f"/kaggle/working/{archive_name}"):
            archive_size = os.path.getsize(f"/kaggle/working/{archive_name}") / 1024 / 1024
            print(f"✅ Alternative archive creation successful: {archive_size:.2f} MB")
        else:
            print(f"❌ Alternative archive creation failed")
    
except Exception as e:
    print(f"❌ Error creating archive: {e}")

# Final completion message
print(f"\n🎉 MEDICAL DATA INGESTION PIPELINE COMPLETED SUCCESSFULLY!")
print("=" * 60)
print(f"✅ Two medical datasets processed and ingested:")
print(f"   🏥 {GUIDELINES_DATASET} → '{GUIDELINES_COLLECTION_NAME}' collection")
print(f"   ❓ {MEDQUAD_DATASET} → '{QNA_COLLECTION_NAME}' collection")
print(f"💾 ChromaDB database location: {DB_PERSIST_DIRECTORY}")
print(f"📦 Downloadable archive: /kaggle/working/{archive_name}")
print(f"🤖 Embedding model: {EMBEDDING_MODEL_NAME}")
print(f"⚡ Processed on: {device.upper()}")
print("=" * 60)
print("🚀 The vector database is ready for deployment in medical RAG systems!")
print("📥 Download the zip file to use in your medical consultation application.")

🔍 FINAL VERIFICATION AND ARCHIVE CREATION
📁 Verifying database directory: /kaggle/working/chroma_db
✅ Database directory exists
📂 Database directory contents: 3 items
   📁 2f13116e-5472-4689-a051-f846525d265a/ (directory)
   📁 98b16966-98a1-4177-aa1a-7e56bfb1ac5d/ (directory)
   📄 chroma.sqlite3 (1011.16 MB)

🔍 Verifying collections...
✅ Guidelines collection 'medical_guidelines': 188808 documents
✅ Q&A collection 'medrag_qna': 0 documents
📊 Total documents across both collections: 188808

📦 Creating compressed archive for download...
  adding: chroma_db/ (stored 0%)
  adding: chroma_db/98b16966-98a1-4177-aa1a-7e56bfb1ac5d/ (stored 0%)
  adding: chroma_db/98b16966-98a1-4177-aa1a-7e56bfb1ac5d/header.bin (deflated 54%)
  adding: chroma_db/98b16966-98a1-4177-aa1a-7e56bfb1ac5d/index_metadata.pickle

  guidelines_test_db = Chroma(


 (deflated 43%)
  adding: chroma_db/98b16966-98a1-4177-aa1a-7e56bfb1ac5d/link_lists.bin (deflated 64%)
  adding: chroma_db/98b16966-98a1-4177-aa1a-7e56bfb1ac5d/data_level0.bin (deflated 10%)
  adding: chroma_db/98b16966-98a1-4177-aa1a-7e56bfb1ac5d/length.bin (deflated 70%)
  adding: chroma_db/2f13116e-5472-4689-a051-f846525d265a/ (stored 0%)
  adding: chroma_db/2f13116e-5472-4689-a051-f846525d265a/header.bin (deflated 57%)
  adding: chroma_db/2f13116e-5472-4689-a051-f846525d265a/index_metadata.pickle (deflated 45%)
  adding: chroma_db/2f13116e-5472-4689-a051-f846525d265a/link_lists.bin (deflated 76%)
  adding: chroma_db/2f13116e-5472-4689-a051-f846525d265a/data_level0.bin (deflated 10%)
  adding: chroma_db/2f13116e-5472-4689-a051-f846525d265a/length.bin (deflated 50%)
  adding: chroma_db/chroma.sqlite3 (deflated 49%)
✅ Archive created successfully: chroma_db.zip
📦 Archive size: 814.07 MB
📁 Archive location: /kaggle/working/chroma_db.zip

🎉 MEDICAL DATA INGESTION PIPELINE COMPLETED SUCC

## 🎯 Summary and Usage Instructions

### ✅ **What This Notebook Accomplished:**

1. **📦 Environment Setup** - Installed all required packages for Kaggle GPU environment
2. **⚙️ Configuration** - Set up consistent constants for file paths and model configuration  
3. **🤖 GPU-Optimized Embedding** - Initialized Nomic AI embedding model with GPU acceleration
4. **🏥 Guidelines Processing** - Processed EPFL guidelines dataset into `medical_guidelines` collection
5. **❓ Q&A Processing** - Processed MedRAG/MedQuAD dataset into `medrag_qna` collection
6. **📦 Archive Creation** - Created downloadable `chroma_db.zip` file

### 🗄️ **Output Structure:**
```
chroma_db.zip
└── chroma_db/
    ├── medical_guidelines/     # Clinical guidelines collection
    └── medrag_qna/            # Medical Q&A collection
```

### 🚀 **How to Use the Generated Database:**

1. **Download** the `chroma_db.zip` file from Kaggle
2. **Extract** to your medical RAG system directory
3. **Connect** to collections using:
   ```python
   from langchain_community.vectorstores import Chroma
   
   # Guidelines collection
   guidelines_db = Chroma(
       persist_directory="./chroma_db",
       collection_name="medical_guidelines"
   )
   
   # Q&A collection  
   qna_db = Chroma(
       persist_directory="./chroma_db", 
       collection_name="medrag_qna"
   )
   ```

### 📊 **Performance Optimizations Applied:**
- ⚡ **GPU Acceleration** - Leveraged Kaggle GPU for embedding generation
- 🧠 **Memory Management** - Cleared variables between processing stages  
- 📄 **Optimal Chunking** - 1000 char chunks with 150 char overlap
- 🗄️ **Separate Collections** - Isolated guidelines and Q&A for specialized retrieval

**The database is now ready for integration into your medical RAG consultation system!**