# HoneyBee Workshop Part 1: Clinical Text Preprocessing

## 1. Setup and Imports

In [None]:
import sys
import pandas as pd
import numpy as np
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

# Add HoneyBee to path
sys.path.append('/mnt/f/Projects/HoneyBee')

# Import HoneyBee components
from honeybee.loaders import Reader
from honeybee.processors import ClinicalProcessor
from honeybee.models import HuggingFaceEmbedder

print("HoneyBee clinical processing modules loaded successfully!")

HoneyBee clinical processing modules loaded successfully!


## 2. Load Sample Clinical Data

We'll use the sample PDF clinical report provided in the examples folder.

In [3]:
# Path to sample clinical report
sample_pdf = "../samples/sample.pdf"

# Initialize the clinical data loader
reader = Reader.reader.PDF()

# Load the PDF
print(f"Loading clinical report from: {sample_pdf}")
clinical_text = reader.read(sample_pdf)

# Display first 500 characters
print("\nFirst 500 characters of the clinical report:")
print("-" * 50)
print(clinical_text[:500])
print("-" * 50)

Loading clinical report from: ../samples/sample.pdf

First 500 characters of the clinical report:
--------------------------------------------------
Patient Name:  PATIENT P.N 1 AGESEX: M :RIN NAME :  AGESEX: PHYSICIAN:MATH.NO MED. REC. NO: SURGERY DATE: RECEIVE DATE:UUID:4854A37F- 5F68-4EA0-99F7-E0572EA9533F TCGA-06-0150-01A-PR Redacted iii 111111111111111111111111a111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111 1111 111 III ---------------------------------------------------------------- PATHOLOGICAL DIAGNOSIS: BRAIN BIOPSY: GLIOBLASTOMA 
--------------------------------------------------


## 3. Process Clinical Text

The ClinicalProcessor handles:
- Entity extraction (diseases, medications, procedures)
- Medical code normalization (SNOMED-CT, RxNorm, ICD-O-3)
- Temporal information extraction
- Text cleaning and structuring
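
The text cleaning step can be illustrated independently of HoneyBee. The sketch below is not `ClinicalProcessor`'s implementation; `clean_ocr_text` is a hypothetical helper that shows one simple way to strip the kind of scanner debris visible in the sample report above (long runs of a single repeated character) using only the standard library.

```python
import re

def clean_ocr_text(text: str) -> str:
    """Hypothetical helper (not part of HoneyBee): remove common OCR debris."""
    # Replace any run of 10 or more identical characters (e.g. "1111...", "-----")
    # with a single space; short runs such as "iii" are left alone.
    text = re.sub(r"(.)\1{9,}", " ", text)
    # Collapse the resulting whitespace into single spaces.
    text = re.sub(r"\s+", " ", text)
    return text.strip()

# A shortened stand-in for the noisy header of the sample report.
raw = (
    "Redacted iii " + "1" * 40 + " III " + "-" * 30
    + " PATHOLOGICAL DIAGNOSIS: BRAIN BIOPSY: GLIOBLASTOMA"
)
cleaned = clean_ocr_text(raw)
print(cleaned)
# Redacted iii III PATHOLOGICAL DIAGNOSIS: BRAIN BIOPSY: GLIOBLASTOMA
```

The threshold of 10 repeated characters is an arbitrary choice for this example; a real pipeline would tune it so that legitimate content is not removed.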

In [4]:
import json

# Initialize the clinical processor
processor = ClinicalProcessor()

# Process the clinical text
print("Processing clinical text...")
processed_data = processor.process_text(clinical_text)

# Display extracted entities
print("\nExtracted Medical Entities:")
print("-" * 50)
print(json.dumps(processed_data, indent=2))

Processing clinical text...

Extracted Medical Entities:
--------------------------------------------------
{
  "text": "Patient Name:  PATIENT P.N 1 AGESEX: M :RIN NAME :  AGESEX: PHYSICIAN:MATH.NO MED. REC. NO: SURGERY DATE: RECEIVE DATE:UUID:4854A37F- 5F68-4EA0-99F7-E0572EA9533F TCGA-06-0150-01A-PR Redacted iii 111111111111111111111111a111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111 1111 111 III ---------------------------------------------------------------- PATHOLOGICAL DIAGNOSIS: BRAIN BIOPSY: GLIOBLASTOMA PROLIFERATIVE INDEX, MIB-1: MORE THAN 40. ---------------------------------------------------------------- OperationSpecimen: Brain tumor. F.S. Clinical History and Pre-Op Dx: None given. GROSS PATHOLOGY: The specimen is labeled brain biopsy consiting of 0.2 x 0.4 cm pink-white tissue. A portion of the specimen has been examined by frozen section and is submitted in cassette 1

## 4. Generate Clinical Embeddings

HoneyBee supports multiple clinical embedding models:
- **GatorTron**: Large clinical language model
- **BioBERT**: Biomedical BERT
- **PubMedBERT**: BERT trained on PubMed abstracts
- **Clinical-T5**: T5 model for clinical text

In [5]:
# Initialize embedder with GatorTron-base (produces 1024-dimensional embeddings)
embedder = HuggingFaceEmbedder(model_name="UFNLP/gatortron-base", pooling_method='pooler_output')

embeddings = embedder.generate_embeddings(clinical_text)
embeddings = np.array(embeddings)

print(f"\nEmbedding shape: {embeddings.shape}")


Embedding shape: (1, 1024)
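
The `pooling_method` argument controls how per-token vectors are reduced to the single `(1, 1024)` vector shown above. As a conceptual illustration only, the sketch below implements masked mean pooling in NumPy on random stand-in data. It is an assumption that mean pooling is a useful comparison here; the cell above uses `'pooler_output'`, and `masked_mean_pool` is a hypothetical helper, not a HoneyBee function.

```python
import numpy as np

def masked_mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token vectors, ignoring padded positions.

    token_embeddings: (batch, tokens, hidden)
    attention_mask:   (batch, tokens), 1 for real tokens and 0 for padding
    """
    mask = attention_mask[..., None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    # Guard against division by zero for an all-padding row.
    counts = np.clip(mask.sum(axis=1), 1e-9, None)
    return summed / counts

# Random stand-in for model output: 1 document, 6 token positions, 1024 hidden units.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(1, 6, 1024))
mask = np.array([[1, 1, 1, 1, 0, 0]])  # the last two positions are padding

pooled = masked_mean_pool(tokens, mask)
print(pooled.shape)  # (1, 1024)
```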


## 5. Batch Processing Multiple Files

In real scenarios, you'll process multiple clinical documents. Here's how to handle batch processing:

In [6]:
# Example: Process multiple clinical texts
# In practice, you would have multiple files
sample_texts = [
    "Patient presents with adenocarcinoma of the lung. Started on carboplatin and paclitaxel.",
    "Follow-up CT scan shows partial response to treatment. No new metastases identified.",
    "Pathology report confirms invasive ductal carcinoma, ER+/PR+/HER2-. Grade 2."
]

# Process and embed each text
all_embeddings = []
all_metadata = []

for i, text in enumerate(sample_texts):
    print(f"\nProcessing text {i+1}...")
    
    # Process the raw string with process_text(), as in Section 3;
    # process() appears to expect a file path and rejects plain text
    # with an "Unsupported file format" error
    processed = processor.process_text(text)
    
    # Generate embedding
    embedding = embedder.generate_embeddings(text)
    
    all_embeddings.append(embedding)
    all_metadata.append({
        'text_id': f'sample_{i+1}',
        'text_preview': text[:50] + '...',
        'entities': processed.get('entities', {})
    })

# Stack embeddings
embeddings_matrix = np.vstack(all_embeddings)
print(f"\nTotal embeddings shape: {embeddings_matrix.shape}")


Processing text 1...

Processing text 2...

Processing text 3...

Total embeddings shape: (3, 1024)


## 6. Save Embeddings for Downstream Tasks

Save embeddings in formats compatible with downstream analysis:

In [7]:
# Create output directory
output_dir = Path("/mnt/f/Projects/HoneyBee/examples/mayo/outputs")
output_dir.mkdir(parents=True, exist_ok=True)

# Save embeddings as numpy array
np.save(output_dir / "clinical_embeddings.npy", embeddings_matrix)
print(f"Saved embeddings to: {output_dir / 'clinical_embeddings.npy'}")

# Save metadata
metadata_df = pd.DataFrame(all_metadata)
metadata_df.to_csv(output_dir / "clinical_metadata.csv", index=False)
print(f"Saved metadata to: {output_dir / 'clinical_metadata.csv'}")

Saved embeddings to: /mnt/f/Projects/HoneyBee/examples/mayo/outputs/clinical_embeddings.npy
Saved metadata to: /mnt/f/Projects/HoneyBee/examples/mayo/outputs/clinical_metadata.csv
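
Before handing the files to a downstream task, it can be worth confirming that they load back with matching row counts. The sketch below performs that round trip in a temporary directory with random stand-in data, so it runs without HoneyBee; the `demo_*` and `loaded_*` names are illustrative and not part of the workshop code.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp)

    # Stand-ins for embeddings_matrix and all_metadata from the cells above.
    demo_embeddings = np.random.default_rng(0).normal(size=(3, 1024)).astype(np.float32)
    demo_metadata = pd.DataFrame({"text_id": [f"sample_{i + 1}" for i in range(3)]})

    np.save(out / "clinical_embeddings.npy", demo_embeddings)
    demo_metadata.to_csv(out / "clinical_metadata.csv", index=False)

    # Reload both files as a downstream script would.
    loaded_embeddings = np.load(out / "clinical_embeddings.npy")
    loaded_metadata = pd.read_csv(out / "clinical_metadata.csv")

# Row i of the embedding matrix should describe row i of the metadata table.
assert loaded_embeddings.shape[0] == len(loaded_metadata)
print(loaded_embeddings.shape, list(loaded_metadata["text_id"]))
```

Note that CSV stores dictionary-valued columns such as `entities` as their string form, so a format like JSON may be a better fit if the entities need to be read back as structured data.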


## 7. Advanced: Using Different Embedding Models

Let's generate embeddings for the same text with different models and compare their output shapes:

In [8]:
# Compare different clinical embedding models
models_to_compare = [
    ("dmis-lab/biobert-v1.1", "BioBERT"),
    ("emilyalsentzer/Bio_ClinicalBERT", "ClinicalBERT"),
    # Add more models as needed
]

model_embeddings = {}
sample_text = sample_texts[0]  # Use first sample

for model_name, display_name in models_to_compare:
    try:
        print(f"\nGenerating embeddings with {display_name}...")
        embedder = HuggingFaceEmbedder(model_name=model_name, pooling_method='pooler_output')
        embedding = embedder.generate_embeddings(sample_text)
        model_embeddings[display_name] = embedding
        print(f"  Shape: {embedding.shape}")
    except Exception as e:
        print(f"  Error: {str(e)}")

print("\nEmbedding generation complete!")


Generating embeddings with BioBERT...
  Shape: (1, 768)

Generating embeddings with ClinicalBERT...
  Shape: (1, 768)

Embedding generation complete!
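
Vectors produced by different models live in different spaces, so they cannot be compared to each other directly. One possible approach, sketched below under that assumption, is to embed several texts with each model, compute a cosine similarity matrix per model, and then compare how the models rank the same pairs of texts. `cosine_similarity_matrix` is a hypothetical helper and the input is toy data, not real model output.

```python
import numpy as np

def cosine_similarity_matrix(x) -> np.ndarray:
    """Pairwise cosine similarity between the rows of x."""
    x = np.asarray(x, dtype=float)
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    # Normalize each row to unit length, guarding against zero vectors.
    unit = x / np.clip(norms, 1e-12, None)
    return unit @ unit.T

# Toy 2-dimensional "embeddings" for three texts.
demo = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]])
sim = cosine_similarity_matrix(demo)
print(np.round(sim, 3))
```

Applied to an `(n_texts, hidden_size)` matrix such as `embeddings_matrix` from Section 5, this would give an `n_texts x n_texts` matrix for one model; repeating it per model gives matrices of the same shape that can be compared entry by entry.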


## 8. Integration with HuggingFace Datasets

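For larger-scale work, pre-computed TCGA embeddings are available on the Hugging Face Hub. The cell below shows the loading call (commented out) and outlines what the dataset contains: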

In [9]:
# Example of loading pre-computed TCGA embeddings
from datasets import load_dataset

print("Loading TCGA clinical embeddings from HuggingFace...")
print("Dataset available at: https://huggingface.co/datasets/Lab-Rasool/TCGA")

# This would load the actual dataset
# clinical_dataset = load_dataset("Lab-Rasool/TCGA", "clinical", split="gatortron")

# For now, let's show the structure
print("\nTCGA dataset structure:")
print("- Clinical embeddings")
print("- Patient metadata")
print("- Cancer type labels")
print("- Survival information")

INFO:datasets:PyTorch version 2.7.0 available.
INFO:datasets:Duckdb version 1.2.1 available.


Loading TCGA clinical embeddings from HuggingFace...
Dataset available at: https://huggingface.co/datasets/Lab-Rasool/TCGA

TCGA dataset structure:
- Clinical embeddings
- Patient metadata
- Cancer type labels
- Survival information


## Summary and Next Steps

In this workshop, you learned to:
1. ✅ Load clinical text from a PDF report
2. ✅ Extract and normalize medical entities
3. ✅ Generate embeddings using clinical language models
4. ✅ Save embeddings for downstream analysis

**Next Workshop**: Part 2 - Radiology DICOM Processing

**Key Takeaways**:
- Clinical text requires specialized processing for medical entities
- Different embedding models capture different aspects of clinical language
- Proper preprocessing improves downstream task performance

**Exercise**: Try processing your own clinical text files and comparing different embedding models!