# HoneyBee Workshop Part 7: Multi-Modal Integration

## Overview
In this workshop, you'll learn how to:
1. Combine embeddings from multiple modalities
2. Implement different fusion strategies
3. Evaluate multi-modal performance
4. Visualize integrated embeddings
5. Build end-to-end multi-modal pipelines

**Duration**: 30 minutes

**Prerequisites**: 
- Completed Parts 1-6
- Understanding of multi-modal learning concepts

## 1. Setup and Imports

In [None]:
import os
import sys
import numpy as np
import pandas as pd
from pathlib import Path
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm
import warnings
warnings.filterwarnings('ignore')

# Machine learning imports
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.metrics import accuracy_score, f1_score

# Add HoneyBee to path (adjust to the location of your local checkout)
sys.path.append('/mnt/f/Projects/HoneyBee')

# Set style
plt.style.use('seaborn-v0_8-darkgrid')
np.random.seed(42)

print("Libraries loaded successfully!")

## 2. Load Multi-Modal Embeddings

In [None]:
# Define modalities
modalities = ['clinical', 'pathology', 'radiology', 'molecular']

# Load embeddings (mock data for demonstration)
n_samples = 300
embeddings = {}

# Different embedding dimensions for each modality
embedding_dims = {
    'clinical': 768,    # BioBERT
    'pathology': 1024,  # UNI
    'radiology': 2048,  # RadImageNet
    'molecular': 512    # Molecular features
}

# Create mock embeddings that share a common 50-dimensional signal,
# so they are correlated across modalities
base_signal = np.random.randn(n_samples, 50)  # Shared signal
patient_ids = [f"TCGA-{i:04d}" for i in range(n_samples)]

for modality in modalities:
    dim = embedding_dims[modality]
    
    # Project base signal to modality dimension
    projection = np.random.randn(50, dim)
    modality_signal = base_signal @ projection
    
    # Add modality-specific noise
    noise = np.random.randn(n_samples, dim) * 0.5
    
    embeddings[modality] = pd.DataFrame(
        modality_signal + noise,
        index=patient_ids
    )
    
    print(f"{modality.capitalize()} embeddings: {embeddings[modality].shape}")

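# Create labels (drawn at random and independent of the embeddings, so
# classification accuracy in later sections should sit near chance, about 1/5)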
cancer_types = ['BRCA', 'LUAD', 'KIRC', 'THCA', 'PRAD']
labels = pd.Series(
    np.random.choice(cancer_types, n_samples),
    index=patient_ids,
    name='cancer_type'
)

print(f"\nTotal patients: {n_samples}")
print(f"Cancer types: {labels.value_counts().to_dict()}")
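
Fusion by row position assumes every modality table lists the same patients in the same order. The mock data guarantees this, but real exports often do not. Below is a minimal sketch of an alignment step, assuming each modality is a DataFrame indexed by patient ID. The helper name `align_modalities` and the toy patient IDs are hypothetical and not part of HoneyBee.

```python
import numpy as np
import pandas as pd

def align_modalities(embeddings_dict):
    """Restrict every modality to the patients present in all of them, in one shared order."""
    frames = list(embeddings_dict.values())
    common = frames[0].index
    for frame in frames[1:]:
        common = common.intersection(frame.index)
    common = common.sort_values()
    return {name: frame.loc[common] for name, frame in embeddings_dict.items()}

# Toy example: two modalities with partly overlapping, differently ordered patients
rng = np.random.default_rng(0)
toy = {
    'clinical': pd.DataFrame(rng.normal(size=(4, 3)), index=['P1', 'P2', 'P3', 'P4']),
    'pathology': pd.DataFrame(rng.normal(size=(3, 2)), index=['P4', 'P2', 'P5']),
}
aligned = align_modalities(toy)
print({name: list(frame.index) for name, frame in aligned.items()})
```

Running a step like this before any fusion call keeps a concatenated row from mixing two different patients.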

## 3. Implement Fusion Strategies

In [None]:
class MultiModalFusion:
    """
    Implements various fusion strategies for multi-modal embeddings
    """
    
    @staticmethod
    def early_fusion(embeddings_dict, method='concat'):
        """
        Early fusion: Combine embeddings before learning
        """
        if method == 'concat':
            # Simple concatenation
            return pd.concat(embeddings_dict.values(), axis=1)
        
        elif method == 'kronecker':
            # Kronecker product (for 2 modalities)
            if len(embeddings_dict) != 2:
                raise ValueError("Kronecker product requires exactly 2 modalities")
            
            emb1, emb2 = list(embeddings_dict.values())
            # Reduce dimensions first
            pca1 = PCA(n_components=50).fit_transform(emb1)
            pca2 = PCA(n_components=50).fit_transform(emb2)
            
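            # Per-sample Kronecker product: outer product of the two reduced
            # vectors, flattened to 50 * 50 = 2500 features
            kron_features = np.einsum('ij,ik->ijk', pca1, pca2).reshape(len(pca1), -1)
            
            return pd.DataFrame(kron_features, index=emb1.index)
        
        # Fail loudly instead of silently returning None
        raise ValueError(f"Unknown early fusion method: {method!r}")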
    
    @staticmethod
    def intermediate_fusion(embeddings_dict, n_components=100):
        """
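        Intermediate fusion: project each modality to the same number of
        PCA components, then average. The PCA axes of different modalities
        are not aligned with each other, so this is a simple heuristic,
        not a learned shared space.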
        """
        # Project each modality to same dimension
        projected = {}
        
        for modality, emb in embeddings_dict.items():
            pca = PCA(n_components=n_components)
            proj = pca.fit_transform(emb)
            projected[modality] = proj
        
        # Average in common space
        fused = np.mean(list(projected.values()), axis=0)
        return pd.DataFrame(fused, index=list(embeddings_dict.values())[0].index)
    
    @staticmethod
    def attention_fusion(embeddings_dict, temperature=1.0):
        """
        Attention-style fusion: per-patient weighted sum of modality projections.
        The weights here are random placeholders, not learned.
        """
        names = list(embeddings_dict)
        first = embeddings_dict[names[0]]
        n_samples = len(first)
        
        # Mock attention weights: random draws standing in for learned weights
        attention_weights = np.random.dirichlet(np.ones(len(names)), size=n_samples)
        
        # Apply temperature scaling (softmax over the mock weights)
        attention_weights = np.exp(attention_weights / temperature)
        attention_weights = attention_weights / attention_weights.sum(axis=1, keepdims=True)
        
        # Project each modality to a common dimension once, rather than
        # refitting PCA for every sample; result is (n_samples, n_modalities, 100)
        projected = np.stack(
            [PCA(n_components=100).fit_transform(embeddings_dict[m]) for m in names],
            axis=1
        )
        
        # Weighted combination across modalities
        fused = (attention_weights[:, :, None] * projected).sum(axis=1)
        
        return pd.DataFrame(fused, index=first.index), attention_weights

# Initialize fusion module
fusion = MultiModalFusion()
print("Fusion strategies implemented: early (concat, kronecker), intermediate, attention")
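
The `temperature` parameter of `attention_fusion` controls how sharply the weight concentrates on one modality. Below is a small worked sketch of softmax with temperature, assuming hypothetical per-modality relevance scores. The numbers are invented for illustration and are not workshop outputs.

```python
import numpy as np

def softmax_weights(scores, temperature=1.0):
    """Turn raw modality scores into weights that sum to 1."""
    scaled = np.asarray(scores, dtype=float) / temperature
    scaled -= scaled.max()          # subtract the max for numerical stability
    weights = np.exp(scaled)
    return weights / weights.sum()

# Hypothetical relevance scores for clinical, pathology, radiology, molecular
scores = [0.2, 1.0, 0.5, 0.1]

for t in (0.1, 1.0, 10.0):
    print(f"temperature={t:>4}: {np.round(softmax_weights(scores, t), 3)}")
```

Low temperatures push nearly all weight onto the highest-scoring modality, while high temperatures approach uniform weighting.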

## 4. Compare Fusion Strategies

In [None]:
# Apply different fusion strategies
fusion_results = {}

# 1. Early fusion - Concatenation
print("Applying early fusion (concatenation)...")
fused_concat = fusion.early_fusion(embeddings, method='concat')
fusion_results['Concatenation'] = fused_concat

# 2. Early fusion - Kronecker (using 2 modalities)
print("Applying early fusion (Kronecker product)...")
two_modalities = dict(list(embeddings.items())[:2])  # clinical and pathology
fused_kronecker = fusion.early_fusion(two_modalities, method='kronecker')
fusion_results['Kronecker'] = fused_kronecker

# 3. Intermediate fusion
print("Applying intermediate fusion...")
fused_intermediate = fusion.intermediate_fusion(embeddings)
fusion_results['Intermediate'] = fused_intermediate

# 4. Attention fusion
print("Applying attention fusion...")
fused_attention, attention_weights = fusion.attention_fusion(embeddings)
fusion_results['Attention'] = fused_attention

# Display fusion results
for method, fused_data in fusion_results.items():
    print(f"\n{method} fusion shape: {fused_data.shape}")
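
The Kronecker result is much wider than the other fused embeddings because the product of a d1-dimensional and a d2-dimensional vector has d1 × d2 entries. That is why each modality is reduced to 50 components first, giving 50 × 50 = 2,500 features. The sketch below uses small toy arrays to check that the row-by-row Kronecker product equals a flattened per-patient outer product.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=(4, 3))   # 4 patients, 3-dim modality A
b = rng.normal(size=(4, 2))   # 4 patients, 2-dim modality B

# Row-by-row Kronecker product
looped = np.array([np.kron(a[i], b[i]) for i in range(len(a))])

# Same result in one call: per-patient outer product, flattened
vectorized = np.einsum('ij,ik->ijk', a, b).reshape(len(a), -1)

print(looped.shape, np.allclose(looped, vectorized))
```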

## 5. Evaluate Fusion Performance

In [None]:
# Evaluate each fusion method on classification task
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
performance_results = {}

# Also evaluate individual modalities
all_embeddings = {**{f'Single-{k}': v for k, v in embeddings.items()}, **fusion_results}

for method, emb_data in all_embeddings.items():
    print(f"\nEvaluating {method}...")
    
    # Prepare data
    X = emb_data.values
    y = labels.values
    
    # Standardize features
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
    
    # Cross-validation
    clf = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
    scores = cross_val_score(clf, X_scaled, y, cv=cv, scoring='accuracy')
    
    performance_results[method] = {
        'mean_accuracy': scores.mean(),
        'std_accuracy': scores.std(),
        'scores': scores
    }
    
    print(f"  Accuracy: {scores.mean():.3f} (+/- {scores.std() * 2:.3f})")

# Plot comparison
plt.figure(figsize=(12, 6))
methods = list(performance_results.keys())
accuracies = [performance_results[m]['mean_accuracy'] for m in methods]
errors = [performance_results[m]['std_accuracy'] * 2 for m in methods]

# Color code single vs fusion
colors = ['lightblue' if m.startswith('Single') else 'orange' for m in methods]

bars = plt.bar(methods, accuracies, yerr=errors, capsize=5, color=colors)
plt.xlabel('Method')
plt.ylabel('Accuracy')
plt.title('Classification Performance: Single Modality vs Fusion')
plt.xticks(rotation=45, ha='right')
plt.ylim(0, 1)
plt.axhline(y=np.mean(accuracies), color='red', linestyle='--', alpha=0.5, label='Average')
plt.legend()
plt.tight_layout()
plt.show()
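
Because the labels in this workshop are drawn independently of the embeddings, it helps to read each bar against a chance-level baseline. Below is a minimal sketch using scikit-learn's `DummyClassifier`. The features and labels are regenerated here as synthetic stand-ins so the block runs on its own.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(300, 20))                       # stand-in features
y_demo = rng.choice(['BRCA', 'LUAD', 'KIRC', 'THCA', 'PRAD'], size=300)

cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
baseline = DummyClassifier(strategy='most_frequent')      # always predicts the majority class
baseline_scores = cross_val_score(baseline, X_demo, y_demo, cv=cv_demo, scoring='accuracy')

print(f"Chance-level accuracy: {baseline_scores.mean():.3f}")
```

A fusion method is only informative if it clears this baseline by more than the spread across folds.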

## 6. Visualize Attention Weights

In [None]:
# Visualize attention weights for different cancer types
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
axes = axes.flatten()

for idx, cancer_type in enumerate(cancer_types):
    # Get attention weights for this cancer type
    cancer_mask = labels == cancer_type
    cancer_attention = attention_weights[cancer_mask]
    
    # Average attention per modality
    avg_attention = cancer_attention.mean(axis=0)
    
    # Plot
    ax = axes[idx]
    bars = ax.bar(modalities, avg_attention)
    ax.set_title(f'{cancer_type} (n={cancer_mask.sum()})')
    ax.set_ylabel('Average Attention Weight')
    ax.set_ylim(0, 1)
    
    # Color highest attention
    max_idx = np.argmax(avg_attention)
    bars[max_idx].set_color('red')

# Remove empty subplot
axes[-1].remove()

plt.suptitle('Attention Weights by Cancer Type', fontsize=16)
plt.tight_layout()
plt.show()

# Heatmap of attention weights
plt.figure(figsize=(10, 8))
sns.heatmap(attention_weights[:50].T, cmap='YlOrRd', 
            xticklabels=False, yticklabels=modalities)
plt.xlabel('Patients (first 50)')
plt.ylabel('Modality')
plt.title('Attention Weights Heatmap')
plt.tight_layout()
plt.show()

## 7. Visualize Integrated Embeddings

In [None]:
# Compare t-SNE visualizations
fig, axes = plt.subplots(2, 2, figsize=(12, 12))

# Select methods to visualize
viz_methods = ['Single-clinical', 'Concatenation', 'Intermediate', 'Attention']

for idx, method in enumerate(viz_methods):
    ax = axes[idx // 2, idx % 2]
    
    # Get embeddings
    if method in all_embeddings:
        emb = all_embeddings[method]
    else:
        continue
    
    # Apply t-SNE
    print(f"Computing t-SNE for {method}...")
    
    # Reduce dimensions if needed
    if emb.shape[1] > 50:
        pca = PCA(n_components=50)
        emb_reduced = pca.fit_transform(emb)
    else:
        emb_reduced = emb.values
    
    tsne = TSNE(n_components=2, random_state=42)
    emb_2d = tsne.fit_transform(emb_reduced)
    
    # Plot
    for cancer_type in cancer_types:
        mask = labels == cancer_type
        ax.scatter(emb_2d[mask, 0], emb_2d[mask, 1], 
                  label=cancer_type, alpha=0.7, s=50)
    
    ax.set_title(method)
    ax.set_xlabel('t-SNE 1')
    ax.set_ylabel('t-SNE 2')
    if idx == 0:
        ax.legend(bbox_to_anchor=(1.05, 1), loc='upper left')

plt.suptitle('t-SNE Visualization of Different Fusion Methods', fontsize=16)
plt.tight_layout()
plt.show()
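
t-SNE plots are qualitative, and apparent clusters can be artifacts of the projection. A silhouette score computed on the embedding itself gives a number to put beside each panel. Below is a minimal sketch on synthetic data. The two toy matrices stand in for fused embeddings and are not workshop outputs.

```python
import numpy as np
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
labels_demo = np.repeat([0, 1, 2], 50)

# Structured case: each class is shifted to its own centre
centres = np.array([[0, 0], [6, 0], [0, 6]], dtype=float)
separated = centres[labels_demo] + rng.normal(size=(150, 2))

# Unstructured case: same kind of noise, no class-dependent shift
unstructured = rng.normal(size=(150, 2))

score_separated = silhouette_score(separated, labels_demo)
score_unstructured = silhouette_score(unstructured, labels_demo)
print(f"separated:    {score_separated:.3f}")
print(f"unstructured: {score_unstructured:.3f}")
```

Scores near 1 indicate tight, well-separated classes, and scores near 0 indicate overlapping ones.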

## 8. Missing Modality Handling

In [None]:
# Simulate missing modalities
missing_rate = 0.2  # 20% missing
n_missing = int(n_samples * missing_rate)

# Create missing patterns
missing_patterns = {
    'Missing Pathology': {'pathology': np.random.choice(n_samples, n_missing, replace=False)},
    'Missing Radiology': {'radiology': np.random.choice(n_samples, n_missing, replace=False)},
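    # Pathology and radiology indices are drawn independently, so the two
    # sets of affected patients overlap only by chance
    'Missing Both': {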
        'pathology': np.random.choice(n_samples, n_missing//2, replace=False),
        'radiology': np.random.choice(n_samples, n_missing//2, replace=False)
    }
}

# Evaluate performance with missing modalities
missing_results = {}

for pattern_name, missing_indices in missing_patterns.items():
    print(f"\nEvaluating {pattern_name}...")
    
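    # Copy each DataFrame (dict.copy() alone is shallow), so zeroing rows
    # below does not alter the original embeddings
    embeddings_missing = {mod: emb.copy() for mod, emb in embeddings.items()}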
    
    # Set missing values to zero (simple imputation)
    for modality, indices in missing_indices.items():
        embeddings_missing[modality].iloc[indices] = 0
    
    # Apply fusion
    fused_missing = fusion.early_fusion(embeddings_missing, method='concat')
    
    # Evaluate
    X = fused_missing.values
    y = labels.values
    
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
    
    clf = RandomForestClassifier(n_estimators=100, random_state=42)
    scores = cross_val_score(clf, X_scaled, y, cv=cv, scoring='accuracy')
    
    missing_results[pattern_name] = scores.mean()
    print(f"  Accuracy: {scores.mean():.3f}")

# Add complete data result
missing_results['Complete Data'] = performance_results['Concatenation']['mean_accuracy']

# Plot missing modality impact
plt.figure(figsize=(10, 6))
patterns = list(missing_results.keys())
accuracies = list(missing_results.values())

bars = plt.bar(patterns, accuracies)
bars[-1].set_color('green')  # Highlight complete data

plt.xlabel('Missing Pattern')
plt.ylabel('Accuracy')
plt.title('Impact of Missing Modalities on Performance')
plt.ylim(0, 1)

for bar, acc in zip(bars, accuracies):
    height = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2., height,
             f'{acc:.3f}', ha='center', va='bottom')

plt.tight_layout()
plt.show()
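
Zero-filling is the simplest imputation, but it gives the classifier no way to tell an imputed row from a real embedding. One common alternative fills missing rows with the modality mean over observed patients and appends an indicator column. Below is a minimal sketch, assuming missing patients are marked by all-NaN rows. The helper `impute_with_indicator` is hypothetical.

```python
import numpy as np
import pandas as pd

def impute_with_indicator(emb, name):
    """Fill all-NaN rows with the column means of observed rows and flag them."""
    missing = emb.isna().all(axis=1)
    filled = emb.fillna(emb.mean())   # DataFrame.mean skips NaN by default
    filled[f'{name}_missing'] = missing.astype(int)
    return filled

rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(5, 3)), index=[f'P{i}' for i in range(5)])
toy.iloc[[1, 3]] = np.nan             # two patients lack this modality

result = impute_with_indicator(toy, 'pathology')
print(result.round(3))
```

In a cross-validated setting the fill values should be computed from training folds only, in the same way as any other fitted preprocessing step.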

## 9. Build End-to-End Pipeline

In [None]:
class MultiModalPipeline:
    """
    End-to-end pipeline for multi-modal cancer analysis
    """
    
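    def __init__(self, fusion_method='attention', n_components=100):
        self.fusion_method = fusion_method
        self.n_components = n_components
        self.fusion = MultiModalFusion()
        self.projectors = {}  # per-modality PCA, fitted on training data
        self.scaler = StandardScaler()
        self.classifier = RandomForestClassifier(n_estimators=100, random_state=42)
        self.is_fitted = False
    
    def preprocess(self, embeddings_dict, fit=False):
        """
        Preprocess and fuse embeddings.
        
        With fit=True the per-modality PCA projections are learned; otherwise
        the stored projections are reused, so train and test features share
        the same axes and dimension.
        """
        if self.fusion_method == 'concat':
            return self.fusion.early_fusion(embeddings_dict, method='concat')
        
        # Project every modality to n_components dimensions
        projected = []
        for modality, emb in embeddings_dict.items():
            if fit:
                self.projectors[modality] = PCA(n_components=self.n_components).fit(emb)
            projected.append(self.projectors[modality].transform(emb))
        stacked = np.stack(projected, axis=1)  # (n_samples, n_modalities, n_components)
        index = next(iter(embeddings_dict.values())).index
        
        if self.fusion_method == 'attention':
            # Mock attention weights, as in MultiModalFusion.attention_fusion
            weights = np.random.dirichlet(np.ones(len(projected)), size=len(index))
            weights = np.exp(weights)
            weights = weights / weights.sum(axis=1, keepdims=True)
            self.attention_weights = weights
            fused = (weights[:, :, None] * stacked).sum(axis=1)
        else:
            # Intermediate fusion: plain average across modalities
            fused = stacked.mean(axis=1)
        
        return pd.DataFrame(fused, index=index)
    
    def fit(self, embeddings_dict, labels):
        """
        Fit the pipeline
        """
        # Preprocess, learning the projections from the training data
        X = self.preprocess(embeddings_dict, fit=True)
        
        # Scale
        X_scaled = self.scaler.fit_transform(X)
        
        # Train classifier
        self.classifier.fit(X_scaled, labels)
        self.is_fitted = True
        
        return self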
    
    def predict(self, embeddings_dict):
        """
        Make predictions
        """
        if not self.is_fitted:
            raise ValueError("Pipeline must be fitted before prediction")
        
        # Preprocess
        X = self.preprocess(embeddings_dict)
        
        # Scale
        X_scaled = self.scaler.transform(X)
        
        # Predict
        return self.classifier.predict(X_scaled)
    
    def predict_proba(self, embeddings_dict):
        """
        Predict probabilities
        """
        if not self.is_fitted:
            raise ValueError("Pipeline must be fitted before prediction")
        
        X = self.preprocess(embeddings_dict)
        X_scaled = self.scaler.transform(X)
        return self.classifier.predict_proba(X_scaled)

# Create and test pipeline
pipeline = MultiModalPipeline(fusion_method='attention')

# Split data
from sklearn.model_selection import train_test_split
train_idx, test_idx = train_test_split(
    range(n_samples), test_size=0.2, stratify=labels, random_state=42
)

# Prepare train/test embeddings
train_embeddings = {mod: emb.iloc[train_idx] for mod, emb in embeddings.items()}
test_embeddings = {mod: emb.iloc[test_idx] for mod, emb in embeddings.items()}
train_labels = labels.iloc[train_idx]
test_labels = labels.iloc[test_idx]

# Fit pipeline
print("Training multi-modal pipeline...")
pipeline.fit(train_embeddings, train_labels)

# Make predictions
predictions = pipeline.predict(test_embeddings)
probabilities = pipeline.predict_proba(test_embeddings)

# Evaluate
accuracy = accuracy_score(test_labels, predictions)
f1 = f1_score(test_labels, predictions, average='weighted')

print(f"\nPipeline Performance:")
print(f"  Test Accuracy: {accuracy:.3f}")
print(f"  Test F1-score: {f1:.3f}")

# Show example predictions
print("\nExample Predictions:")
for i in range(5):
    true_label = test_labels.iloc[i]
    pred_label = predictions[i]
    confidence = probabilities[i].max()
    print(f"  Patient {test_labels.index[i]}: True={true_label}, Pred={pred_label}, Conf={confidence:.2f}")
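
When a fusion step includes a learned projection such as PCA, the projection should be fitted on training patients and then reused unchanged on test patients. Refitting it on the test set produces axes that no longer match what the classifier saw, and a small test set may not even support the same number of components. Below is a minimal sketch of the pattern on synthetic data. The array shapes are arbitrary stand-ins.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
train = rng.normal(size=(240, 64))
test = rng.normal(size=(60, 64))

pca = PCA(n_components=20).fit(train)    # learn the projection from training data only
train_proj = pca.transform(train)
test_proj = pca.transform(test)          # reuse the same axes for unseen patients

print(train_proj.shape, test_proj.shape)
```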

## 10. Save Pipeline and Results

In [None]:
# Create output directory, including any missing parent folders
# (adjust the path to your local checkout)
output_dir = Path("/mnt/f/Projects/HoneyBee/examples/mayo/outputs")
output_dir.mkdir(parents=True, exist_ok=True)

# Save pipeline
import pickle
with open(output_dir / 'multimodal_pipeline.pkl', 'wb') as f:
    pickle.dump(pipeline, f)

# Save fusion comparison results
import json
fusion_summary = {
    method: {
        'mean_accuracy': float(results['mean_accuracy']),
        'std_accuracy': float(results['std_accuracy'])
    }
    for method, results in performance_results.items()
}

with open(output_dir / 'fusion_comparison.json', 'w') as f:
    json.dump(fusion_summary, f, indent=2)

# Save attention weights (those from the most recent preprocess call,
# which here is the test-set prediction)
if hasattr(pipeline, 'attention_weights'):
    np.save(output_dir / 'attention_weights.npy', pipeline.attention_weights)

# Create summary report
report = f"""
Multi-Modal Integration Workshop Results
======================================

Dataset:
- Patients: {n_samples}
- Modalities: {', '.join(modalities)}
- Cancer Types: {', '.join(cancer_types)}

Best Method (single modality or fusion): {max(fusion_summary, key=lambda x: fusion_summary[x]['mean_accuracy'])}
Best Accuracy: {max(fusion_summary[x]['mean_accuracy'] for x in fusion_summary):.3f}

Pipeline Performance:
- Test Accuracy: {accuracy:.3f}
- Test F1-score: {f1:.3f}

Missing Modality Impact:
- Complete Data: {missing_results['Complete Data']:.3f}
- Missing Pathology: {missing_results['Missing Pathology']:.3f}
- Missing Radiology: {missing_results['Missing Radiology']:.3f}
- Missing Both: {missing_results['Missing Both']:.3f}
"""

with open(output_dir / 'multimodal_summary.txt', 'w') as f:
    f.write(report)

print(f"Results saved to: {output_dir}")
print("\nSummary:")
print(report)
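
Saved artifacts are only useful if they load back cleanly. Below is a minimal round-trip sketch using a temporary directory and a small stand-in model in place of the workshop pipeline. The file names mirror the ones above, while the data and the accuracy value written to JSON are placeholders.

```python
import json
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(40, 5))
y_demo = rng.integers(0, 2, size=40)
model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp)
    with open(out / 'multimodal_pipeline.pkl', 'wb') as f:
        pickle.dump(model, f)
    with open(out / 'fusion_comparison.json', 'w') as f:
        json.dump({'Concatenation': {'mean_accuracy': 0.2}}, f, indent=2)  # placeholder value

    # Reload and confirm the restored model predicts identically
    with open(out / 'multimodal_pipeline.pkl', 'rb') as f:
        restored = pickle.load(f)
    with open(out / 'fusion_comparison.json') as f:
        summary = json.load(f)

same_predictions = bool((restored.predict(X_demo) == model.predict(X_demo)).all())
print(same_predictions, summary)
```

A pickle stores a reference to the object's class rather than its code, so a pipeline class defined in a notebook has to be defined again (or imported from a module) before the file can be loaded in a new session. Only unpickle files from sources you trust.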

## Summary and Next Steps

In this workshop, you learned to:
1. ✅ Implement multiple fusion strategies (early, intermediate, attention)
2. ✅ Compare fusion performance across methods
3. ✅ Visualize attention weights and integrated embeddings
4. ✅ Handle missing modalities gracefully
5. ✅ Build end-to-end multi-modal pipelines

**Workshop Series Complete!**

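**Key Takeaways**:
- On real data, multi-modal fusion often outperforms single modalities; with the random mock labels used here, every method scores near chance
- Learned attention weights can reveal modality importance; the weights in this workshop are random placeholders
- Missing modality handling is crucial for real-world deployment
- Different fusion strategies work better for different tasks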

**Next Steps**:
- Try advanced fusion methods (tensor fusion, graph neural networks)
- Implement cross-modal learning and translation
- Explore interpretability of multi-modal models
- Deploy your pipeline on new cancer datasets!

**Thank you for completing the HoneyBee Workshop Series!**