# HoneyBee Workshop Part 4: Downstream Classification Tasks

## Overview
In this workshop, you'll learn how to:
1. Load pre-computed embeddings from HuggingFace or local files
2. Perform cancer type classification using embeddings
3. Evaluate model performance with proper metrics
4. Compare different modalities and fusion strategies

**Duration**: 30 minutes

**Prerequisites**: 
- Completed Parts 1-3 or access to pre-computed embeddings
- Understanding of basic machine learning concepts

## 1. Setup and Imports

In [1]:
import os
import sys
import numpy as np
import pandas as pd
from pathlib import Path
import matplotlib.pyplot as plt
import seaborn as sns
from datasets import load_dataset
import warnings
warnings.filterwarnings('ignore')

# Machine learning imports
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix, classification_report
from sklearn.preprocessing import LabelEncoder
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Set random seed for reproducibility
np.random.seed(42)

print("Libraries loaded successfully!")

Libraries loaded successfully!


## 2. Load Pre-computed TCGA Embeddings

We'll use the TCGA embeddings from HuggingFace or local pre-computed files.

In [11]:
# Option 1: Load from HuggingFace (this cell only prints the dataset URL)
print("Loading TCGA embeddings...")
print("Dataset available at: https://huggingface.co/datasets/Lab-Rasool/TCGA")

# Option 2: Load from local pre-computed embeddings
# (Using the shared_data folder from results)
local_path = Path("/mnt/f/Projects/HoneyBee/results/shared_data/embeddings")

# Load clinical embeddings as example
if local_path.exists():
    print("\nLoading from local pre-computed embeddings...")
    
    # Clinical embeddings
    clinical_emb_path = local_path / "clinical_embeddings.pkl"
    if clinical_emb_path.exists():
        clinical_embeddings = pd.read_pickle(clinical_emb_path)

Loading TCGA embeddings...
Dataset available at: https://huggingface.co/datasets/Lab-Rasool/TCGA

Loading from local pre-computed embeddings...


## 3. Prepare Data for Classification

In [12]:
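# Align embeddings with labels.
# This cell expects `clinical_embeddings` to be a DataFrame indexed by patient ID
# and `labels_df` to be a DataFrame with 'patient_id' and 'cancer_type' columns.
# Sorting makes the patient order reproducible across runs (set order is not).
common_patients = sorted(set(clinical_embeddings.index) & set(labels_df['patient_id']))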
print(f"Common patients: {len(common_patients)}")

# Filter and align
X = clinical_embeddings.loc[common_patients]
y_df = labels_df[labels_df['patient_id'].isin(common_patients)]
y_df = y_df.set_index('patient_id').loc[common_patients]
y = y_df['cancer_type']

# Encode labels
le = LabelEncoder()
y_encoded = le.fit_transform(y)

print(f"\nFinal dataset shape: {X.shape}")
print(f"Classes: {le.classes_}")
print("Class distribution:")
print(y.value_counts())

AttributeError: 'dict' object has no attribute 'index'
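
The `AttributeError` above shows that `pd.read_pickle` returned a plain `dict` rather than a DataFrame, so `.index` is not available. The layout of that dict is not documented in this workshop, so inspect `type(clinical_embeddings)` and its keys before going further. The following is a minimal sketch, assuming the dict maps each patient ID to a 1-D embedding vector; `to_embedding_frame` is a hypothetical helper and the toy IDs are placeholders.

```python
import numpy as np
import pandas as pd


def to_embedding_frame(obj):
    """Return a DataFrame with one row per patient, indexed by patient ID.

    Assumption (not confirmed by the workshop files): a dict input maps
    patient ID -> 1-D embedding vector of a common length.
    """
    if isinstance(obj, pd.DataFrame):
        return obj
    if isinstance(obj, dict):
        ids = list(obj.keys())
        # Stack the per-patient vectors into an (n_patients, n_dims) matrix
        matrix = np.vstack([np.asarray(obj[i], dtype=float).ravel() for i in ids])
        return pd.DataFrame(matrix, index=ids)
    raise TypeError(f"Unsupported embeddings container: {type(obj).__name__}")


# Toy stand-in for the object returned by pd.read_pickle
toy = {"patient-1": [0.1, 0.2, 0.3], "patient-2": [0.4, 0.5, 0.6]}
toy_frame = to_embedding_frame(toy)
print(toy_frame.shape)
```

If the real pickle uses a different layout (for example separate arrays for IDs and vectors), the dict branch would need to be adapted accordingly.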

## 4. Visualize Embeddings with t-SNE

In [None]:
# Reduce dimensionality for visualization
print("Computing t-SNE visualization...")

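# First apply PCA to speed up t-SNE
# (n_components cannot exceed the number of samples or features in X)
pca = PCA(n_components=min(50, *X.shape), random_state=42)
X_pca = pca.fit_transform(X)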

# Apply t-SNE
tsne = TSNE(n_components=2, random_state=42, perplexity=30)
X_tsne = tsne.fit_transform(X_pca)

# Create visualization
plt.figure(figsize=(10, 8))
scatter = plt.scatter(X_tsne[:, 0], X_tsne[:, 1], 
                     c=y_encoded, cmap='tab10', alpha=0.7)
plt.colorbar(scatter, ticks=range(len(le.classes_)), label='Cancer Type')
plt.clim(-0.5, len(le.classes_) - 0.5)

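# Add legend; legend_elements() reuses the scatter's own colormap and limits,
# so legend colors match the points for any number of classes.
# Note: 'tab10' has only 10 distinct colors, so with more classes some share a color.
handles, _ = scatter.legend_elements(num=None)
plt.legend(handles, le.classes_, title="Cancer Type", 
          bbox_to_anchor=(1.15, 1), loc='upper left')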

plt.title('t-SNE Visualization of Clinical Embeddings')
plt.xlabel('t-SNE 1')
plt.ylabel('t-SNE 2')
plt.tight_layout()
plt.show()

## 5. Train Classification Model

We'll use a Random Forest as the baseline classifier: it handles high-dimensional embeddings without feature scaling and needs little tuning.

In [None]:
# Initialize classifier
rf_classifier = RandomForestClassifier(
    n_estimators=100,
    max_depth=10,
    random_state=42,
    n_jobs=-1
)

# Perform cross-validation
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

print("Performing 5-fold cross-validation...")
cv_scores = cross_val_score(rf_classifier, X, y_encoded, cv=cv, scoring='accuracy')

print(f"\nCross-validation scores: {cv_scores}")
print(f"Mean accuracy: {cv_scores.mean():.4f} (+/- {cv_scores.std() * 2:.4f})")
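
Accuracy alone can hide poor performance on rare classes. As a hedged sketch of how several metrics could be collected in one cross-validation pass, the example below uses scikit-learn's `cross_validate` on synthetic data; `X_demo`, `y_demo`, and the chosen metric list are illustrative stand-ins, not part of the workshop data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in for (X, y_encoded): 300 samples, 32 features, 3 classes
X_demo, y_demo = make_classification(
    n_samples=300, n_features=32, n_informative=10,
    n_classes=3, random_state=42,
)

clf = RandomForestClassifier(n_estimators=50, max_depth=10, random_state=42)
cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

metric_names = ["accuracy", "balanced_accuracy", "f1_macro"]
scores = cross_validate(clf, X_demo, y_demo, cv=cv_demo, scoring=metric_names)

# cross_validate returns one array of per-fold scores under "test_<metric>"
for name in metric_names:
    fold_scores = scores[f"test_{name}"]
    print(f"{name}: {fold_scores.mean():.3f} (std {fold_scores.std():.3f})")
```

Balanced accuracy and macro F1 weight every class equally, which tends to matter when cancer types have very different sample counts.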

## 6. Detailed Performance Evaluation

In [None]:
# Hold out a stratified 20% test set for detailed evaluation
# (the model is trained on the remaining 80%, not on the full dataset)
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y_encoded, test_size=0.2, stratify=y_encoded, random_state=42
)

# Train model
rf_classifier.fit(X_train, y_train)

# Make predictions
y_pred = rf_classifier.predict(X_test)

# Calculate metrics
accuracy = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred, average='weighted')

print(f"Test Accuracy: {accuracy:.4f}")
print(f"Test F1-score (weighted): {f1:.4f}")

# Detailed classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=le.classes_))

## 7. Confusion Matrix Visualization

In [None]:
# Generate confusion matrix
cm = confusion_matrix(y_test, y_pred)

# Plot confusion matrix
plt.figure(figsize=(10, 8))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=le.classes_, yticklabels=le.classes_)
plt.title('Confusion Matrix - Cancer Type Classification')
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.tight_layout()
plt.show()

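# Calculate per-class recall: the fraction of each true class predicted correctly
# (rows of the confusion matrix are true labels, so each row sum is the class size)
per_class_recall = cm.diagonal() / cm.sum(axis=1)
for i, cancer_type in enumerate(le.classes_):
    print(f"{cancer_type}: {per_class_recall[i]:.3f}")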
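
The diagonal-over-row-sum calculation above is per-class recall. A small worked example on toy labels (made up for illustration) shows that it matches scikit-learn's `recall_score` and the diagonal of a row-normalized confusion matrix:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Toy labels: 3 classes with 4, 2, and 4 true samples
y_true_demo = np.array([0, 0, 0, 0, 1, 1, 2, 2, 2, 2])
y_pred_demo = np.array([0, 0, 0, 1, 1, 1, 2, 2, 0, 0])

cm_demo = confusion_matrix(y_true_demo, y_pred_demo)

# Correct predictions per class divided by the number of true samples per class
recall_from_cm = cm_demo.diagonal() / cm_demo.sum(axis=1)

# The same quantity computed directly by scikit-learn
recall_direct = recall_score(y_true_demo, y_pred_demo, average=None)

# normalize="true" divides each row by its sum, so the diagonal is recall
cm_normalized = confusion_matrix(y_true_demo, y_pred_demo, normalize="true")

print(recall_from_cm)
```

Here class 0 has 3 of 4 samples correct (0.75), class 1 has 2 of 2 (1.0), and class 2 has 2 of 4 (0.5). Passing `cm_normalized` to `sns.heatmap` with `fmt='.2f'` would give a plot that is easier to read when class sizes differ.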

## 8. Feature Importance Analysis

In [None]:
# Get feature importances
importances = rf_classifier.feature_importances_
indices = np.argsort(importances)[::-1]

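# Plot the top 20 features (or fewer if the embedding has fewer dimensions)
top_k = min(20, len(importances))
plt.figure(figsize=(10, 6))
plt.title(f"Top {top_k} Feature Importances")
plt.bar(range(top_k), importances[indices[:top_k]])
# Label each bar with the embedding dimension it refers to, not its rank
plt.xticks(range(top_k), indices[:top_k], rotation=90)
plt.xlabel("Feature Index")
plt.ylabel("Importance")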
plt.tight_layout()
plt.show()

print(f"Total features: {len(importances)}")
print(f"Features with importance > 0.001: {(importances > 0.001).sum()}")

## 9. Multi-Modal Classification

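Compare and combine different modalities. The pathology and radiology embeddings in this cell are random placeholders, so the resulting accuracies illustrate the workflow only; replace them with real embeddings to measure whether fusion improves performance.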

In [None]:
# Load or create embeddings for multiple modalities
modalities = ['clinical', 'pathology', 'radiology']
modality_embeddings = {}

# Create mock embeddings for demonstration
for modality in modalities:
    if modality == 'clinical':
        modality_embeddings[modality] = X  # Use existing clinical
    else:
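        # Mock other modalities with random noise.
        # The width matches the clinical embeddings so that 'mean' fusion,
        # which averages element-wise, is well defined.
        modality_embeddings[modality] = pd.DataFrame(
            np.random.randn(len(X), X.shape[1]),
            index=X.index
        )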

# Fusion strategies
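def fuse_embeddings(embeddings_dict, method='concat'):
    """
    Fuse embeddings from multiple modalities.

    'concat' places the modalities side by side. 'mean' averages them
    element-wise, so all modalities must share the same dimensionality.
    Rows are assumed to be in the same patient order in every modality.
    """
    frames = list(embeddings_dict.values())
    if method == 'concat':
        fused = pd.concat(frames, axis=1, keys=list(embeddings_dict.keys()))
        # Flatten to unique string column names such as 'clinical_0'
        fused.columns = [f"{name}_{col}" for name, col in fused.columns]
        return fused
    elif method == 'mean':
        if len({frame.shape[1] for frame in frames}) != 1:
            raise ValueError("Mean fusion requires equal embedding dimensions")
        stacked = np.stack([frame.values for frame in frames])
        return pd.DataFrame(stacked.mean(axis=0), index=frames[0].index)
    else:
        raise ValueError(f"Unknown fusion method: {method}")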

# Compare fusion strategies
fusion_methods = ['concat', 'mean']
fusion_results = {}

for method in fusion_methods:
    print(f"\nTesting {method} fusion...")
    X_fused = fuse_embeddings(modality_embeddings, method)
    
    # Ensure alignment with labels
    X_fused = X_fused.loc[X.index]
    
    # Cross-validation (a separate name keeps the clinical-only cv_scores
    # from Section 5 intact for the results file)
    fusion_scores = cross_val_score(rf_classifier, X_fused, y_encoded, cv=cv, scoring='accuracy')
    fusion_results[method] = fusion_scores.mean()
    
    print(f"{method} fusion accuracy: {fusion_scores.mean():.4f} (+/- {fusion_scores.std() * 2:.4f})")

# Plot comparison
plt.figure(figsize=(8, 6))
methods = list(fusion_results.keys())
accuracies = list(fusion_results.values())
plt.bar(methods, accuracies)
plt.xlabel('Fusion Method')
plt.ylabel('Accuracy')
plt.title('Multi-Modal Fusion Comparison')
plt.ylim(0, 1)
for i, v in enumerate(accuracies):
    plt.text(i, v + 0.01, f'{v:.3f}', ha='center')
plt.tight_layout()
plt.show()
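
A fused score is easier to interpret next to single-modality baselines. The sketch below, which uses synthetic data rather than the workshop embeddings, scores each modality alone and then their concatenation; the modality names `informative` and `noise` are illustrative stand-ins for a real embedding and a mock one.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(42)

# Synthetic stand-ins: one informative modality and one pure-noise modality
X_info, y_demo = make_classification(
    n_samples=200, n_features=16, n_informative=8, n_classes=3,
    n_clusters_per_class=1, class_sep=2.0, random_state=42,
)
demo_modalities = {
    "informative": pd.DataFrame(X_info),
    "noise": pd.DataFrame(rng.standard_normal((200, 16))),
}

# Add the concatenation as one more candidate, with unique column names
concat = pd.concat(demo_modalities, axis=1)
concat.columns = [f"{name}_{col}" for name, col in concat.columns]
candidates = {**demo_modalities, "concat": concat}

clf = RandomForestClassifier(n_estimators=50, random_state=42)
cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Mean cross-validated accuracy for each candidate feature set
comparison = {
    name: cross_val_score(clf, frame.to_numpy(), y_demo, cv=cv_demo).mean()
    for name, frame in candidates.items()
}
for name, acc in comparison.items():
    print(f"{name}: {acc:.3f}")
```

If the fused score is no better than the best single modality, the added modality is probably contributing little signal, which is what one would expect from the random mock embeddings used above.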

## 10. Save Results and Model

In [None]:
# Create output directory
output_dir = Path("/mnt/f/Projects/HoneyBee/examples/mayo/outputs")
output_dir.mkdir(parents=True, exist_ok=True)  # also create missing parent folders

# Save results
results = {
    'accuracy': accuracy,
    'f1_score': f1,
    'cv_scores': cv_scores.tolist(),
    'confusion_matrix': cm.tolist(),
    'classes': le.classes_.tolist(),
    'fusion_results': fusion_results
}

import json
with open(output_dir / 'classification_results.json', 'w') as f:
    json.dump(results, f, indent=2)

# Save model
import joblib
joblib.dump(rf_classifier, output_dir / 'cancer_classifier.pkl')
joblib.dump(le, output_dir / 'label_encoder.pkl')

print(f"Results saved to: {output_dir}")
print("Model saved as: cancer_classifier.pkl")
print("Label encoder saved as: label_encoder.pkl")
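
Saving the label encoder alongside the classifier matters because the model predicts integer codes, not class names. As a minimal sketch of the reload step, the example below trains a small stand-in model on synthetic data, writes both objects to a temporary directory, and loads them back; the class names `TYPE_A` and `TYPE_B` are placeholders.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder

# Synthetic stand-ins for the trained classifier and label encoder
X_demo, y_int = make_classification(
    n_samples=120, n_features=8, n_informative=4, n_classes=2, random_state=0
)
names = np.array(["TYPE_A", "TYPE_B"])[y_int]  # placeholder class names
enc = LabelEncoder().fit(names)
model = RandomForestClassifier(n_estimators=20, random_state=0)
model.fit(X_demo, enc.transform(names))

with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    joblib.dump(model, tmp_dir / "cancer_classifier.pkl")
    joblib.dump(enc, tmp_dir / "label_encoder.pkl")

    # Reload both objects, as a later session would
    loaded_model = joblib.load(tmp_dir / "cancer_classifier.pkl")
    loaded_enc = joblib.load(tmp_dir / "label_encoder.pkl")

# Predictions come back as integers; the encoder maps them to class names
predicted_names = loaded_enc.inverse_transform(loaded_model.predict(X_demo[:5]))
print(predicted_names)
```

Files written by `joblib.dump` are pickles, so they should only be loaded from trusted sources and ideally with the same scikit-learn version that created them.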

## Summary and Next Steps

In this workshop, you learned to:
1. ✅ Load and prepare embedding data for classification
2. ✅ Visualize high-dimensional embeddings with t-SNE
3. ✅ Train and evaluate classification models
4. ✅ Generate confusion matrices and performance metrics
5. ✅ Compare fusion strategies for multi-modal data

**Next Workshop**: Part 5 - Retrieval Evaluation

**Key Takeaways**:
- Pre-trained embeddings capture meaningful cancer-specific patterns
- Random Forest is a reasonable baseline for high-dimensional embeddings and needs no feature scaling
- Multi-modal fusion can improve classification performance
- Proper evaluation requires cross-validation and multiple metrics

**Exercise**: Try different classifiers (SVM, XGBoost) and embedding combinations!
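
As a possible starting point for the exercise, the sketch below compares three scikit-learn classifiers under the same cross-validation splits. It runs on synthetic data; swap in `X` and `y_encoded` from the workshop. The hyperparameters are untuned defaults chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for (X, y_encoded)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=32, n_informative=10,
    n_classes=3, random_state=42,
)

# SVM and logistic regression are scale-sensitive, so they get a scaler;
# the pipeline refits the scaler inside each fold, which avoids leakage
classifiers = {
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=42),
    "svm_rbf": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
}

cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
results_demo = {
    name: cross_val_score(model, X_demo, y_demo, cv=cv_demo, scoring="accuracy").mean()
    for name, model in classifiers.items()
}
for name, acc in results_demo.items():
    print(f"{name}: {acc:.3f}")
```

`xgboost.XGBClassifier` follows the same scikit-learn estimator interface, so it can be added to the `classifiers` dict if the `xgboost` package is installed.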