# MPEG-G Microbiome Classification - Federated Solution


**Performance Achieved** (weighted over the four clients' held-out validation splits):
- **Loss: 0.0296**
- **Accuracy: 0.9962**
  
**Federated Learning**: Neural Networks with FedAvg

**Approach**: Deterministic federated training

**Pipeline:** k-mer count data preprocessing → Federated client setup → Neural Networks → FedAvg → Predictions

**Runtime Tracking:** All execution times are captured for each step

## 1. Setup and Configuration

In [2]:

import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss, accuracy_score
from scipy.stats import gmean
import warnings
import time
from pathlib import Path
from collections import defaultdict
import os
import random

warnings.filterwarnings('ignore')

# COMPLETE DETERMINISTIC SETUP - MUST BE FIRST
def setup_deterministic_environment(seed=42):
    """Complete deterministic setup for reproducible results"""
    # Python random
    random.seed(seed)
    
    # NumPy random
    np.random.seed(seed)
    
    # PyTorch random
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
    
    # Environment variables
    os.environ['PYTHONHASHSEED'] = str(seed)
    os.environ['CUBLAS_WORKSPACE_CONFIG'] = ':4096:8'
    
    # PyTorch deterministic operations
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    
    # Force deterministic algorithms
    try:
        torch.use_deterministic_algorithms(True)
        print("✅ Full deterministic mode enabled")
    except Exception as e:
        print(f"⚠️ Partial deterministic mode: {e}")

# Apply deterministic setup IMMEDIATELY
SEED = 42
setup_deterministic_environment(SEED)

# Runtime tracking 
runtime_log = {}
start_time_total = time.time()

def log_runtime(step_name, start_time):
    """Log runtime for a step - compatible with microbiome_solution_notebook"""
    elapsed = time.time() - start_time
    runtime_log[step_name] = elapsed
    print(f"⏱️  {step_name}: {elapsed:.2f}s")
    return elapsed

print("🔧 Setup with FULL DETERMINISTIC MODE")
print(f"📅 Pipeline started at: {time.strftime('%Y-%m-%d %H:%M:%S')}")

# Kaggle paths 
DATA_DIR = Path("/kaggle/input/microbiome-challengezindi-kmercount") # Uploaded dataset train_kmercount.csv, test_kmercount.csv, Train.csv, Test.csv
OUTPUT_DIR = Path("/kaggle/working")
OUTPUT_DIR.mkdir(exist_ok=True)

# Check files
files_to_check = [
    DATA_DIR / "Train.csv",
    DATA_DIR / "train_kmercount.csv", 
    DATA_DIR / "test_kmercount.csv"
]

for file in files_to_check:
    if file.exists():
        size_mb = file.stat().st_size / (1024**2)
        print(f"✅ {file.name}: {size_mb:.1f} MB")
    else:
        print(f"❌ Missing: {file}")

# TEST DETERMINISM
print("\n🧪 Testing determinism...")
np.random.seed(SEED)
test_array1 = np.random.random(5)
np.random.seed(SEED)  
test_array2 = np.random.random(5)
print(f"NumPy deterministic: {np.array_equal(test_array1, test_array2)}")

torch.manual_seed(SEED)
test_tensor1 = torch.randn(3, 3)
torch.manual_seed(SEED)
test_tensor2 = torch.randn(3, 3)
print(f"PyTorch deterministic: {torch.equal(test_tensor1, test_tensor2)}")

✅ Full deterministic mode enabled
🔧 Setup with FULL DETERMINISTIC MODE
📅 Pipeline started at: 2025-09-18 14:17:24
✅ Train.csv: 0.1 MB
✅ train_kmercount.csv: 220.7 MB
✅ test_kmercount.csv: 80.4 MB

🧪 Testing determinism...
NumPy deterministic: True
PyTorch deterministic: True


In [3]:
# CONFIGURATION 
# =====================================================

# Random seed for reproducibility 
RANDOM_STATE = 42
SEED = 42 

# Apply deterministic setup IMMEDIATELY
setup_deterministic_environment(SEED)

# ML settings 
MAX_FEATURES = 2000
PSEUDOCOUNT = 1e-6

# Neural network specific settings
BATCH_SIZE = 64
LR = 0.001
EPOCHS = 20
DROPOUT = 0.2

# Device
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Configuration set:")
print(f"  Random state: {RANDOM_STATE}")
print(f"  Max features: {MAX_FEATURES}")
print(f"  Batch size: {BATCH_SIZE}")
print(f"  Learning rate: {LR}")
print(f"  Device: {device}")
print("🎯 FIXED PARAMETERS FOR REPRODUCIBILITY")

✅ Full deterministic mode enabled
Configuration set:
  Random state: 42
  Max features: 2000
  Batch size: 64
  Learning rate: 0.001
  Device: cuda
🎯 FIXED PARAMETERS FOR REPRODUCIBILITY


## 2. Data Preprocessing Pipeline

In [4]:
step_start = time.time()
print("📊 STEP 1: DATA PREPROCESSING PIPELINE")
print("-" * 40)

def clr_transform(data, pseudocount=PSEUDOCOUNT):
    """Apply Centered Log-Ratio transformation to compositional data"""
    # Add pseudocount to avoid log(0)
    data_pseudo = data + pseudocount
    # Calculate geometric mean for each sample
    geom_means = gmean(data_pseudo, axis=1)
    # Apply CLR transformation
    clr_data = np.log(data_pseudo / geom_means[:, np.newaxis])
    return clr_data

def deterministic_mutual_info(X, y):
    """Deterministic mutual information with fixed random state"""
    return mutual_info_classif(X, y, random_state=RANDOM_STATE)

# Load data 
kmercount_df = pd.read_csv(DATA_DIR / "train_kmercount.csv")
train_labels_df = pd.read_csv(DATA_DIR / "Train.csv")
test_kmercount_df = pd.read_csv(DATA_DIR / "test_kmercount.csv")

print(f"K-mer counts: {kmercount_df.shape}")
print(f"Train labels: {train_labels_df.shape}")
print(f"Test k-mer counts: {test_kmercount_df.shape}")

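# Fix filename mismatch - SAME AS CENTRALIZED
# regex=False treats '.mgb' as a literal string; as a regex, '.' would match any character
train_labels_df['filename'] = train_labels_df['filename'].str.replace('.mgb', '', regex=False)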

# Merge k-mer counts with sample types - SAME AS CENTRALIZED
train_features_df = kmercount_df.merge(train_labels_df, on='filename', how='inner')
print(f"Merged data: {train_features_df.shape}")


filename_col = 'filename'
sample_type_col = 'SampleType'

print("Sample distribution:", train_features_df[sample_type_col].value_counts().to_dict())

# Feature engineering 
feature_columns = [col for col in train_features_df.columns
                  if col not in [filename_col, sample_type_col, 'SubjectID', 'SampleID']]

X_full = train_features_df[feature_columns].fillna(0)

# Apply CLR transformation 
print("🔬 Applying CLR transformation...")
clr_start = time.time()
X_clr = clr_transform(X_full.values)
X_full = pd.DataFrame(X_clr, columns=feature_columns, index=X_full.index)
X_full = X_full.replace([np.inf, -np.inf], np.nan).fillna(0)
clr_time = time.time() - clr_start
print(f"   CLR transformation completed in {clr_time:.2f}s")

y = train_features_df[sample_type_col]

# Label encoding 
le = LabelEncoder()
y_encoded = le.fit_transform(y)
class_names = le.classes_

print(f"Features: {X_full.shape}")
print(f"Classes: {list(class_names)}")

# Feature selection
print("📈 Feature selection...")
fs_start = time.time()
non_zero_var_features = X_full.var() > 0
X_filtered = X_full.loc[:, non_zero_var_features]

n_features_to_select = min(MAX_FEATURES, X_filtered.shape[1])
selector = SelectKBest(score_func=deterministic_mutual_info, k=n_features_to_select)
X_selected = selector.fit_transform(X_filtered, y_encoded)
fs_time = time.time() - fs_start
print(f"   Feature selection completed in {fs_time:.2f}s")

print(f"Selected features: {X_selected.shape}")

# Scaling 
print("📏 Scaling...")
scaler = StandardScaler()
X_train_final = scaler.fit_transform(X_selected)

# Process test data with SAME pipeline
print("🧪 Processing test data...")
test_filename_col = 'filename'
test_feature_cols = [col for col in test_kmercount_df.columns if col != test_filename_col]

# Create test feature matrix with same columns as training
X_test_full = pd.DataFrame(0.0, index=test_kmercount_df.index, columns=feature_columns)

# Fill with common features
common_features = set(feature_columns) & set(test_feature_cols)
for feature in common_features:
    X_test_full[feature] = test_kmercount_df[feature].fillna(0)

print(f"Test features aligned: {len(common_features)} common features")

# Apply CLR transformation to test data
X_test_clr = clr_transform(X_test_full.values)
X_test_full = pd.DataFrame(X_test_clr, columns=feature_columns, index=X_test_full.index)
X_test_full = X_test_full.replace([np.inf, -np.inf], np.nan).fillna(0)

# Apply feature selection and scaling
X_test_filtered = X_test_full.loc[:, non_zero_var_features]
X_test_selected = selector.transform(X_test_filtered)
X_test_final = scaler.transform(X_test_selected)

print(f"Test data processed: {X_test_final.shape}")

# Store data dictionary 
data = {
    'X_train': X_train_final,
    'y_train': y_encoded,
    'X_test': X_test_final,
    'classes': class_names,
    'n_classes': len(class_names),
    'label_encoder': le,
    'test_ids': test_kmercount_df[test_filename_col].values,
    'n_features': X_train_final.shape[1]
}

log_runtime("data_preprocessing", step_start)
print("✅ Data preprocessing complete!")
print(f"   Training data: {data['X_train'].shape}")
print(f"   Test data: {data['X_test'].shape}")
print(f"   Classes: {list(data['classes'])}")
print(f"   Features: {data['n_features']}")

📊 STEP 1: DATA PREPROCESSING PIPELINE
----------------------------------------
K-mer counts: (2901, 32897)
Train labels: (2901, 4)
Test k-mer counts: (1068, 32897)
Merged data: (2901, 32900)
Sample distribution: {'Stool': 811, 'Skin': 787, 'Nasal': 710, 'Mouth': 593}
🔬 Applying CLR transformation...
   CLR transformation completed in 2.82s
Features: (2901, 32896)
Classes: ['Mouth', 'Nasal', 'Skin', 'Stool']
📈 Feature selection...
   Feature selection completed in 377.07s
Selected features: (2901, 2000)
📏 Scaling...
🧪 Processing test data...
Test features aligned: 32896 common features
Test data processed: (1068, 2000)
⏱️  data_preprocessing: 459.50s
✅ Data preprocessing complete!
   Training data: (2901, 2000)
   Test data: (1068, 2000)
   Classes: ['Mouth', 'Nasal', 'Skin', 'Stool']
   Features: 2000

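As a small illustration of the CLR step used above, the sketch below applies the same idea to a toy matrix. It is a minimal example rather than the notebook's exact function: the helper name `clr` and the toy counts are assumptions, and it uses the equivalent form "log of each value minus the row mean of the logs" instead of dividing by `gmean`.

```python
import numpy as np

def clr(counts, pseudocount=1e-6):
    """Centered log-ratio: log(x) minus the per-sample mean of log(x).

    Equivalent to log(x / geometric_mean(x)) computed row by row.
    """
    logged = np.log(np.asarray(counts, dtype=float) + pseudocount)
    return logged - logged.mean(axis=1, keepdims=True)

# Hypothetical toy counts: 2 samples x 3 k-mers
toy_counts = np.array([[10.0, 0.0, 5.0],
                       [1.0, 1.0, 1.0]])
toy_clr = clr(toy_counts)

# Each CLR-transformed sample sums to (numerically) zero
row_sums = toy_clr.sum(axis=1)
```

A sample whose counts are all equal maps to a row of zeros, which is a quick sanity check for any CLR implementation.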

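The test-alignment step above fills a zero matrix one column at a time. Assuming the only goal is "same columns as training, zeros where a k-mer is missing", a sketch with `DataFrame.reindex` can express this in one call. The k-mer names and the tiny frame below are hypothetical.

```python
import pandas as pd

# Hypothetical training feature order
train_columns = ["AAA", "AAC", "AAG"]

# Hypothetical test frame: "AAT" is not a training feature, "AAA"/"AAG" are missing
test_df = pd.DataFrame({
    "filename": ["s1", "s2"],
    "AAC": [3, None],
    "AAT": [7, 1],
})

aligned = (
    test_df.drop(columns="filename")
           .reindex(columns=train_columns, fill_value=0.0)  # add missing, drop extra, fix order
           .fillna(0.0)                                      # NaNs inside existing columns
)
```

Note that `fill_value` only applies to columns that `reindex` creates, so the trailing `fillna` is still needed for missing values inside columns that already exist.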
## 3. Federated Learning Setup

In [5]:
class SimpleDataset(Dataset):
    def __init__(self, X, y=None):
        self.X = torch.FloatTensor(X)
        self.y = torch.LongTensor(y) if y is not None else None
    
    def __len__(self):
        return len(self.X)
    
    def __getitem__(self, idx):
        if self.y is not None:
            return self.X[idx], self.y[idx]
        return self.X[idx]

def create_deterministic_dataloader(dataset, batch_size, shuffle=True):
    """Create DataLoader with deterministic behavior"""
    generator = torch.Generator()
    generator.manual_seed(SEED)
    
    return DataLoader(
        dataset, 
        batch_size=batch_size, 
        shuffle=shuffle,
        generator=generator,
        worker_init_fn=lambda worker_id: torch.manual_seed(SEED + worker_id),
        drop_last=False
    )

class SimpleMLP(nn.Module):
    """Simple MLP with deterministic initialization"""
    def __init__(self, input_dim, num_classes):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(input_dim, 512),
            nn.ReLU(),
            nn.BatchNorm1d(512),
            nn.Dropout(DROPOUT),
            
            nn.Linear(512, 256),
            nn.ReLU(), 
            nn.BatchNorm1d(256),
            nn.Dropout(DROPOUT),
            
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.BatchNorm1d(128),
            nn.Dropout(DROPOUT/2),
            
            nn.Linear(128, num_classes)
        )
        
        # Deterministic weight initialization
        self.reset_parameters()
    
    def reset_parameters(self):
        """Reset parameters with deterministic initialization"""
        torch.manual_seed(SEED)  # Ensure deterministic init
        for m in self.modules():
            if isinstance(m, nn.Linear):
                nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
                if m.bias is not None:
                    nn.init.constant_(m.bias, 0)
            elif isinstance(m, nn.BatchNorm1d):
                nn.init.constant_(m.weight, 1)
                nn.init.constant_(m.bias, 0)
    
    def forward(self, x):
        return self.network(x)

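To illustrate why `create_deterministic_dataloader` passes a seeded `torch.Generator`, the sketch below checks that two loaders built with the same seed shuffle in the same order. The helper `batch_order` and the toy dataset are assumptions for illustration only.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def batch_order(seed):
    """Return the shuffled batches of a toy dataset for a given generator seed."""
    dataset = TensorDataset(torch.arange(10))
    generator = torch.Generator()
    generator.manual_seed(seed)
    loader = DataLoader(dataset, batch_size=4, shuffle=True, generator=generator)
    return [batch[0].tolist() for batch in loader]

first = batch_order(42)
second = batch_order(42)   # same seed, fresh generator: same shuffle order
```

Because the generator's state advances each time the loader is iterated, the order is reproducible from run to run but is not identical from one epoch to the next.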
## 4. Federated Training

In [6]:
def create_deterministic_federated_splits():
    """Create deterministic federated splits"""
    step_start = time.time()
    print("\n🌐 Creating DETERMINISTIC federated splits...")
    
    # Reset random state for consistent splits
    setup_deterministic_environment(SEED)
    
    X_train = data['X_train']
    y_train = data['y_train']
    classes = data['classes']
    
    client_data = {}
    
    # DETERMINISTIC strategy: each client gets all of one class + deterministic samples from others
    for class_idx, class_name in enumerate(classes):
        print(f"   Client {class_name}:")
        
        # Reset random state for each client
        np.random.seed(SEED + class_idx)
        
        # Get all samples of this class
        class_mask = y_train == class_idx
        class_X = X_train[class_mask]
        class_y = y_train[class_mask]
        
        # Add some samples from other classes (deterministic selection)
        n_other_per_class = min(50, len(class_X) // 5)
        other_X_list = [class_X]
        other_y_list = [class_y]
        
        for other_idx in range(len(classes)):
            if other_idx != class_idx:
                other_mask = y_train == other_idx
                other_X = X_train[other_mask]
                other_y = y_train[other_mask]
                
                if len(other_X) >= n_other_per_class:
                    # Deterministic selection instead of random
                    indices = np.arange(len(other_X))
                    np.random.shuffle(indices)  # Uses seeded random state
                    selected_indices = indices[:n_other_per_class]
                    other_X_list.append(other_X[selected_indices])
                    other_y_list.append(other_y[selected_indices])
        
        # Combine
        client_X = np.vstack(other_X_list)
        client_y = np.concatenate(other_y_list)
        
        # Deterministic shuffle
        indices = np.arange(len(client_X))
        np.random.shuffle(indices)  # Uses seeded random state
        client_X = client_X[indices]
        client_y = client_y[indices]
        
        # Deterministic train/val split
        val_size = max(20, int(0.15 * len(client_X)))
        
        client_data[class_name] = {
            'X_train': client_X[:-val_size],
            'y_train': client_y[:-val_size],
            'X_val': client_X[-val_size:],
            'y_val': client_y[-val_size:]
        }
        
        train_dist = np.bincount(client_y[:-val_size], minlength=len(classes))
        print(f"      Train: {len(client_X)-val_size}, Val: {val_size}")
        print(f"      Distribution: {dict(zip(classes, train_dist))}")
    
    log_runtime("federated_splits", step_start)
    return client_data

class DeterministicFederatedClient:
    def __init__(self, name, client_data, n_features, n_classes):
        self.name = name
        self.data = client_data
        
        # Model with deterministic initialization
        self.model = SimpleMLP(n_features, n_classes).to(device)
        self.criterion = nn.CrossEntropyLoss()
        
        # Deterministic data loaders
        self.train_loader = create_deterministic_dataloader(
            SimpleDataset(client_data['X_train'], client_data['y_train']),
            batch_size=BATCH_SIZE, shuffle=True
        )
        self.val_loader = create_deterministic_dataloader(
            SimpleDataset(client_data['X_val'], client_data['y_val']),
            batch_size=BATCH_SIZE, shuffle=False
        )
    
    def train_local(self, global_weights=None, epochs=5):
        if global_weights is not None:
            self.model.load_state_dict(global_weights)
        
        # Deterministic optimizer
        torch.manual_seed(SEED)
        optimizer = torch.optim.Adam(self.model.parameters(), lr=LR)
        self.model.train()
        
        total_loss = 0
        total_samples = 0
        
        for epoch in range(epochs):
            for batch_x, batch_y in self.train_loader:
                batch_x, batch_y = batch_x.to(device), batch_y.to(device)
                
                optimizer.zero_grad()
                outputs = self.model(batch_x)
                loss = self.criterion(outputs, batch_y)
                loss.backward()
                optimizer.step()
                
                total_loss += loss.item() * len(batch_x)
                total_samples += len(batch_x)
        
        avg_loss = total_loss / total_samples
        return self.model.state_dict(), len(self.data['X_train']), avg_loss
    
    def evaluate(self):
        self.model.eval()
        total_loss = 0
        correct = 0
        total = 0
        
        with torch.no_grad():
            for batch_x, batch_y in self.val_loader:
                batch_x, batch_y = batch_x.to(device), batch_y.to(device)
                outputs = self.model(batch_x)
                loss = self.criterion(outputs, batch_y)
                
                total_loss += loss.item() * len(batch_x)
                correct += (outputs.argmax(1) == batch_y).sum().item()
                total += len(batch_x)
        
        return total_loss / total, correct / total

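The client sizes follow directly from the rules in `create_deterministic_federated_splits`: every client keeps all samples of its own class, borrows `min(50, own // 5)` samples from each other class, and holds out `max(20, int(0.15 * total))` samples for validation. The sketch below reproduces that arithmetic from the class counts reported in the preprocessing output; the helper name `client_split_sizes` is an assumption.

```python
def client_split_sizes(class_counts, own_class, max_other=50, val_frac=0.15, min_val=20):
    """Return (n_train, n_val) for the client that owns `own_class`."""
    own = class_counts[own_class]
    n_other = min(max_other, own // 5)
    # Borrow n_other samples from every other class that has enough samples
    borrowed = sum(n_other for name, count in class_counts.items()
                   if name != own_class and count >= n_other)
    total = own + borrowed
    n_val = max(min_val, int(val_frac * total))
    return total - n_val, n_val

# Class counts from the "Sample distribution" line of the preprocessing output
counts = {"Stool": 811, "Skin": 787, "Nasal": 710, "Mouth": 593}
sizes = {name: client_split_sizes(counts, name) for name in counts}
# e.g. Mouth: 593 + 3 * 50 = 743 samples, 111 for validation, 632 for training
```
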
## 5. Federated Training Execution

In [7]:
step_start = time.time()
print("🚀 STEP 2: FEDERATED LEARNING TRAINING")
print("-" * 40)

# Create deterministic federated splits
client_data_dict = create_deterministic_federated_splits()

# Initialize clients with deterministic setup
clients = {}
for name, client_data in client_data_dict.items():
    clients[name] = DeterministicFederatedClient(name, client_data, data['n_features'], data['n_classes'])

print(f"\n🌐 Federated clients ready: {sorted(list(clients.keys()))}")

def federated_averaging(client_weights, client_sizes):
    """Deterministic FedAvg"""
    total_size = sum(client_sizes)
    averaged = {}
    
    for key in client_weights[0].keys():
        averaged[key] = sum(
            w[key] * size / total_size 
            for w, size in zip(client_weights, client_sizes)
        )
    
    return averaged

# Enhanced federated parameters for better performance
FEDERATED_CONFIG = {
    'n_rounds': 10,              # Increased rounds
    'client_epochs': 5,          # Epochs per client per round
    'early_stopping': True,      # Stop if converged
    'patience': 5,               # Rounds to wait for improvement
    'learning_rate_decay': 0.95, # Decay LR each round
    'verbose': True              # Show detailed progress
}

def run_deterministic_federated_training(clients, config=FEDERATED_CONFIG):
    """DETERMINISTIC federated training with consistent client ordering"""
    step_start = time.time()
    print(f"🚀 DETERMINISTIC Federated Training ({config['n_rounds']} max rounds)...")

    # Initialize global model with deterministic weights
    global_model = SimpleMLP(data['n_features'], data['n_classes']).to(device)
    global_weights = global_model.state_dict()

    history = []
    best_loss = float('inf')
    patience_counter = 0
    current_lr = LR

    # CRITICAL: Always process clients in same sorted order
    client_names = sorted(clients.keys())
    print(f"   📋 Client order: {client_names}")

    for round_num in range(1, config['n_rounds'] + 1):
        print(f"\n📍 Round {round_num}/{config['n_rounds']} (LR: {current_lr:.6f})")

        # Reset random state for each round
        setup_deterministic_environment(SEED + round_num)

        # Client training with DETERMINISTIC ORDER
        client_weights = []
        client_sizes = []
        round_losses = []

        for name in client_names:  # Use sorted order, not dict iteration
            client = clients[name]
            print(f"   🏋️ {name}...", end=" ")

            weights, size, loss = client.train_local(
                global_weights,
                epochs=config['client_epochs']
            )
            client_weights.append(weights)
            client_sizes.append(size)
            round_losses.append(loss)
            print(f"loss={loss:.4f}")

        # Aggregate
        print("   ⚖️ Aggregating...")
        global_weights = federated_averaging(client_weights, client_sizes)

        # Update all clients
        for name in client_names:  # Same order
            clients[name].model.load_state_dict(global_weights)

        # Global evaluation
        total_loss = 0
        total_acc = 0
        total_samples = 0

        for name in client_names:  # Same order
            client = clients[name]
            loss, acc = client.evaluate()
            samples = len(client.data['X_val'])

            total_loss += loss * samples
            total_acc += acc * samples
            total_samples += samples

        global_loss = total_loss / total_samples
        global_acc = total_acc / total_samples

        # Store history
        round_info = {
            'round': round_num,
            'loss': global_loss,
            'acc': global_acc,
            'client_losses': round_losses,
            'lr': current_lr
        }
        history.append(round_info)

        # Progress reporting
        avg_client_loss = sum(round_losses) / len(round_losses)
        print(f"   🌐 Global: loss={global_loss:.4f}, acc={global_acc:.4f} ({global_acc*100:.1f}%)")
        print(f"   📊 Clients avg: {avg_client_loss:.4f}")

        # Check for improvement
        if global_loss < best_loss:
            best_loss = global_loss
            best_weights = global_weights.copy()
            patience_counter = 0
            print(f"   🌟 New best loss: {best_loss:.4f}")
        else:
            patience_counter += 1

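            # Patience exceeded
            if patience_counter >= config['patience']:
                print(f"   ⏰ Early stopping: no improvement for {config['patience']} rounds")
                break

        # Learning rate decay (every 5 rounds).
        # Note: current_lr is only tracked and printed here; train_local() builds its
        # Adam optimizer with the global LR, so the decay does not change training as written.
        if round_num % 5 == 0:
            current_lr *= config['learning_rate_decay']
            print(f"   📉 Learning rate decayed to: {current_lr:.6f}")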

    # Load best weights
    if 'best_weights' in locals():
        global_model.load_state_dict(best_weights)
        print(f"   📈 Loaded best weights (loss: {best_loss:.4f})")

    log_runtime("federated_training", step_start)
    
    print(f"\n✅ DETERMINISTIC federated training complete!")
    print(f"   🏆 Best loss: {best_loss:.4f}")
    print(f"   📊 Rounds completed: {len(history)}")
    print(f"   🎯 Final accuracy: {history[-1]['acc']:.4f} ({history[-1]['acc']*100:.1f}%)")

    return global_model, history

# Run deterministic federated training
federated_model, fed_history = run_deterministic_federated_training(clients)

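# Results - SAME VARIABLE NAMES AS CENTRALIZED
# Note: these are the last round's validation metrics; the returned model holds the
# best-loss weights, so the two coincide only when the last round was also the best one.
fed_final_loss = fed_history[-1]['loss']
fed_final_acc = fed_history[-1]['acc']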

print(f"\n📊 DETERMINISTIC Federated Results:")
print(f"   Final loss: {fed_final_loss:.4f}")
print(f"   Final accuracy: {fed_final_acc:.4f}")

model_to_use = federated_model
print(f"\n🎯 Using DETERMINISTIC federated model for predictions")

log_runtime("federated_training_execution", step_start)

🚀 STEP 2: FEDERATED LEARNING TRAINING
----------------------------------------

🌐 Creating DETERMINISTIC federated splits...
✅ Full deterministic mode enabled
   Client Mouth:
      Train: 632, Val: 111
      Distribution: {'Mouth': 508, 'Nasal': 41, 'Skin': 39, 'Stool': 44}
   Client Nasal:
      Train: 731, Val: 129
      Distribution: {'Mouth': 43, 'Nasal': 601, 'Skin': 40, 'Stool': 47}
   Client Skin:
      Train: 797, Val: 140
      Distribution: {'Mouth': 47, 'Nasal': 44, 'Skin': 662, 'Stool': 44}
   Client Stool:
      Train: 817, Val: 144
      Distribution: {'Mouth': 44, 'Nasal': 38, 'Skin': 44, 'Stool': 691}
⏱️  federated_splits: 0.14s

🌐 Federated clients ready: ['Mouth', 'Nasal', 'Skin', 'Stool']
🚀 DETERMINISTIC Federated Training (10 max rounds)...
   📋 Client order: ['Mouth', 'Nasal', 'Skin', 'Stool']

📍 Round 1/10 (LR: 0.001000)
✅ Full deterministic mode enabled
   🏋️ Mouth... loss=0.6467
   🏋️ Nasal... loss=0.5684
   🏋️ Skin... l

15.965360879898071

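The aggregation in `federated_averaging` is a sample-size-weighted mean: each parameter of the global model is the sum over clients of (n_k / n) times that client's parameter, where n_k is the client's training-set size and n is the total. The sketch below shows the same rule on plain NumPy arrays; the two toy "clients" and their parameter names are hypothetical.

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """Weighted average of per-client parameter dicts (FedAvg aggregation step)."""
    total = float(sum(client_sizes))
    return {
        key: sum(w[key] * (n / total) for w, n in zip(client_weights, client_sizes))
        for key in client_weights[0]
    }

# Two hypothetical clients with 100 and 300 training samples
w_a = {"layer.weight": np.array([1.0, 2.0]), "layer.bias": np.array([0.0])}
w_b = {"layer.weight": np.array([3.0, 6.0]), "layer.bias": np.array([4.0])}

averaged = fedavg([w_a, w_b], client_sizes=[100, 300])
# layer.weight: 0.25 * [1, 2] + 0.75 * [3, 6] = [2.5, 5.0]
# layer.bias:   0.25 * [0]    + 0.75 * [4]    = [3.0]
```

The larger client pulls the average toward its own parameters, which is why the client sizes returned by `train_local` matter for the global model.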
## 6. Test Data Processing & Predictions

In [8]:
step_start = time.time()
print("🎯 STEP 3: TEST DATA PROCESSING & PREDICTIONS")
print("-" * 40)

def generate_deterministic_predictions(model):
    """Generate deterministic test predictions"""
    print("Generating DETERMINISTIC test predictions...")
    
    # Reset random state for deterministic inference
    setup_deterministic_environment(SEED)
    
    # Create test dataset
    test_dataset = SimpleDataset(data['X_test'])
    test_loader = create_deterministic_dataloader(test_dataset, BATCH_SIZE, shuffle=False)
    
    # Generate predictions
    model.eval()
    all_predictions = []
    
    with torch.no_grad():
        for batch_x in test_loader:
            batch_x = batch_x.to(device)
            outputs = model(batch_x)
            probs = torch.softmax(outputs, dim=1).cpu().numpy()
            all_predictions.append(probs)
    
    test_predictions = np.vstack(all_predictions)
    
    # Verify prediction quality
    max_probs = test_predictions.max(axis=1)
    min_probs = test_predictions.min(axis=1)
    
    print(f"📊 Prediction Quality Analysis:")
    print(f"   Max probability - mean: {max_probs.mean():.3f}, min: {max_probs.min():.3f}")
    print(f"   Min probability - mean: {min_probs.mean():.3f}, max: {min_probs.max():.3f}")
    
    # Count confident predictions (max > 0.7) and near-uniform ones (max between 0.2 and 0.4)
    confident_samples = np.sum(max_probs > 0.7)
    uniform_samples = np.sum((max_probs < 0.4) & (max_probs > 0.2))  # Uniform-like
    
    print(f"   Confident predictions (max > 0.7): {confident_samples}/{len(max_probs)} ({confident_samples/len(max_probs)*100:.1f}%)")
    print(f"   Uniform-like predictions (0.2-0.4): {uniform_samples}/{len(max_probs)} ({uniform_samples/len(max_probs)*100:.1f}%)")
    
    # Create submission - SAME FORMAT AS CENTRALIZED
    submission_df = pd.DataFrame({
        'filename': data['test_ids']
    })
    
    # Add probability columns for each class - SAME AS CENTRALIZED
    for i, class_name in enumerate(data['classes']):
        submission_df[class_name] = test_predictions[:, i]
    
    # Save submission file - SAME NAMING AS CENTRALIZED
    output_file = f"submission_federated_logloss{fed_final_loss:.4f}.csv"
    submission_df.to_csv(output_file, index=False)
    
    print(f"✅ Submission saved: {output_file}")
    print("First 5 predictions:")
    print(submission_df.head())
    
    print(f"\n📊 Prediction Statistics:")
    print(f"   Total test samples: {len(test_predictions)}")
    print(f"   Prediction shape: {test_predictions.shape}")
    print(f"   Class names: {list(data['classes'])}")
    
    # Warning if predictions look uniform
    if uniform_samples > len(max_probs) * 0.5:
        print(f"\n   ⚠️ WARNING: Many predictions look uniform! Model may not be learning properly.")
        print(f"      Consider: more training, different architecture, or data issues")
    elif confident_samples > len(max_probs) * 0.7:
        print(f"\n   🎉 EXCELLENT: Most predictions are confident!")
    else:
        print(f"\n   ✅ GOOD: Reasonable prediction confidence")
    
    return submission_df, output_file

# Generate final deterministic predictions
final_submission, output_filename = generate_deterministic_predictions(model_to_use)

log_runtime("prediction_generation", step_start)
print("\n✅ Test data processing & predictions complete!")

🎯 STEP 3: TEST DATA PROCESSING & PREDICTIONS
----------------------------------------
Generating DETERMINISTIC test predictions...
✅ Full deterministic mode enabled
📊 Prediction Quality Analysis:
   Max probability - mean: 0.999, min: 0.534
   Min probability - mean: 0.000, max: 0.000
   Confident predictions (max > 0.7): 1067/1068 (99.9%)
   Uniform-like predictions (0.2-0.4): 0/1068 (0.0%)
✅ Submission saved: submission_federated_logloss0.0296.csv
First 5 predictions:
    filename         Mouth         Nasal          Skin         Stool
0  ID_ABHFUP  2.238452e-18  1.000000e+00  6.444092e-21  6.888429e-09
1  ID_ADBLNY  7.651327e-20  1.000000e+00  8.019523e-22  1.023180e-10
2  ID_AFAEMB  9.999987e-01  1.075555e-14  1.339720e-06  1.376729e-19
3  ID_AFBBWK  1.000000e+00  3.751782e-16  1.540580e-09  4.850357e-22
4  ID_AGHEZK  1.000000e+00  2.585032e-14  3.140381e-08  1.276789e-19

📊 Prediction Statistics:
   Total test samples: 1068
   Prediction shape: (1068, 4)
   Class name

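The confidence check in `generate_deterministic_predictions` looks only at the largest softmax probability per sample. As a rough illustration, the sketch below applies a numerically stable softmax to hypothetical logits and counts "confident" and "uniform-like" rows with the same thresholds; the `softmax` helper and the toy logits are assumptions, not notebook output.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax; subtracting the row max avoids overflow in exp()."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

# Hypothetical logits for 3 samples and 4 classes
toy_logits = np.array([[8.0, 0.0, 0.0, 0.0],      # clearly one class
                       [0.1, 0.0, 0.0, 0.0],      # almost uniform
                       [1000.0, 0.0, 0.0, 0.0]])  # would overflow without the shift

probs = softmax(toy_logits)
max_probs = probs.max(axis=1)

confident = int((max_probs > 0.7).sum())
uniform_like = int(((max_probs > 0.2) & (max_probs < 0.4)).sum())
```

With four classes the largest probability can never fall below 0.25, so the 0.2 to 0.4 band effectively flags predictions that are close to a uniform guess.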
## 7. Final Results and Runtime Summary

In [9]:
total_time = time.time() - start_time_total
runtime_log['total_pipeline'] = total_time

print("\n" + "=" * 60)
print("🎉 FEDERATED LEARNING PIPELINE COMPLETED SUCCESSFULLY!")
print("=" * 60)

print(f"📊 Final Results:")
print(f"   Federated Loss: {fed_final_loss:.4f}")
print(f"   Federated Accuracy: {fed_final_acc:.4f} ({fed_final_acc*100:.1f}%)")

print(f"\n⏱️  Detailed Runtime Summary:")
print(f"   {'Step':<30} {'Time (s)':<12} {'Time (min)':<12}")
print("-" * 54)

for step, runtime in runtime_log.items():
    print(f"   {step:<30} {runtime:<12.1f} {runtime/60:<12.1f}")

print(f"\n📁 Output files:")
print(f"   Federated submission: {output_filename}")

print(f"\n🎯 Key Deterministic Features Applied:")
print(f"   ✅ Complete random state control (Python/NumPy/PyTorch)")
print(f"   ✅ Deterministic DataLoader with generators")
print(f"   ✅ Fixed model weight initialization")
print(f"   ✅ Consistent client ordering in federated rounds")
print(f"   ✅ Deterministic mutual information feature selection")
print(f"   ✅ Seeded data splits and shuffling")

print(f"\n🌐 Federated Learning Features:")
print(f"   ✅ 4 federated clients (one per body site)")
print(f"   ✅ FedAvg aggregation algorithm")
print(f"   ✅ Deterministic client training order")
print(f"   ✅ Neural network models per client")

print(f"\n🕒 Total pipeline runtime: {total_time:.1f}s ({total_time/60:.1f} minutes)")
print(f"📅 Completed at: {time.strftime('%Y-%m-%d %H:%M:%S')}")



🎉 FEDERATED LEARNING PIPELINE COMPLETED SUCCESSFULLY!
📊 Final Results:
   Federated Loss: 0.0296
   Federated Accuracy: 0.9962 (99.6%)

⏱️  Detailed Runtime Summary:
   Step                           Time (s)     Time (min)  
------------------------------------------------------
   data_preprocessing             459.5        7.7         
   federated_splits               0.1          0.0         
   federated_training             15.4         0.3         
   federated_training_execution   16.0         0.3         
   prediction_generation          0.1          0.0         
   total_pipeline                 852.7        14.2        

📁 Output files:
   Federated submission: submission_federated_logloss0.0296.csv

🎯 Key Deterministic Features Applied:
   ✅ Complete random state control (Python/NumPy/PyTorch)
   ✅ Deterministic DataLoader with generators
   ✅ Fixed model weight initialization
   ✅ Consistent client ordering in federated rounds
   ✅ Deterministi