# Soft-Q-Routing vs Q-Routing Training Comparison in IAB Networks

This notebook demonstrates and compares the training behavior of two reinforcement learning routing algorithms in Integrated Access Backhaul (IAB) networks: Q-Routing and Soft-Q-Routing.

## Experiment Objectives
- **Compare Training Convergence**: Analyze how quickly each algorithm learns optimal routing policies
- **Training Stability**: Observe the stability of learning curves during training
- **Algorithm Behavior**: Understand the differences in how each algorithm explores and exploits the network

## Algorithm Overview

### Q-Routing (Traditional)
- **Deterministic Policy**: Uses max Q-value for action selection (greedy with ε-exploration)
- **Hard Decisions**: Makes crisp routing decisions based on highest Q-values
- **Exploration**: Relies on ε-greedy exploration strategy
- **Update Rule**: Standard Q-learning update with temporal difference
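
The update rule named above can be sketched in tabular form. This is only an illustration, not the project's `Agent.QAgent` code: the function name `q_learning_update`, the dictionary-based Q-table, the node labels, and the discount factor `gamma` are all assumptions made for the example.

```python
from collections import defaultdict


def q_learning_update(q_table, state, action, reward, next_state, next_actions,
                      alpha=0.1, gamma=0.9):
    """One temporal-difference update: Q(s, a) += alpha * (target - Q(s, a))."""
    # Greedy bootstrap: value of the best action available in the next state.
    best_next = max((q_table[(next_state, a)] for a in next_actions), default=0.0)
    td_target = reward + gamma * best_next
    td_error = td_target - q_table[(state, action)]
    q_table[(state, action)] += alpha * td_error
    return td_error


q = defaultdict(float)  # unseen (state, action) pairs start at 0.0
err = q_learning_update(q, state="bs0", action="bs1", reward=-1.0,
                        next_state="bs1", next_actions=["bs2", "ue3"])
```

Here a state stands for the node currently holding a packet and an action for the neighbor it is forwarded to; the real agent may encode both differently.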

### Soft-Q-Routing (Energy-Based)
- **Probabilistic Policy**: Uses Boltzmann/softmax distribution for action selection
- **Soft Decisions**: Considers all possible actions weighted by their Q-values
- **Temperature Parameter**: Controls exploration vs exploitation trade-off
- **Robust Learning**: Intended to be more resilient to local optima and network changes, because no action's selection probability ever drops to zero
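
The two action-selection rules can be compared directly as probability distributions over next hops. The sketch below uses only the standard library; the function names and the three Q-values are hypothetical and are not taken from the project's agents.

```python
import math
import random


def softmax_probs(q_values, temperature=1.0):
    """Boltzmann distribution over actions: p(a) is proportional to exp(Q(a) / T)."""
    scaled = [q / temperature for q in q_values]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def epsilon_greedy_probs(q_values, epsilon=0.1):
    """Epsilon-greedy: uniform with probability epsilon, otherwise the argmax action."""
    n = len(q_values)
    best = max(range(n), key=lambda i: q_values[i])
    probs = [epsilon / n] * n
    probs[best] += 1.0 - epsilon
    return probs


def sample_action(probs, rng=random):
    """Draw an action index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


q_values = [1.0, 0.9, -2.0]  # hypothetical Q-values for three next hops
hard = epsilon_greedy_probs(q_values, epsilon=0.1)
soft = softmax_probs(q_values, temperature=1.0)
cold = softmax_probs(q_values, temperature=0.05)
```

With these values, ε-greedy gives the second-best and the worst hop the same small exploration probability, while the softmax policy at temperature 1.0 splits most of the traffic between the two near-tied hops and rarely picks the clearly bad one. Lowering the temperature moves the softmax policy toward the greedy choice.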

## What We'll Analyze
1. **Training Progress**: Reward accumulation over episodes
2. **Learning Curves**: How quickly each algorithm improves
3. **Convergence Patterns**: Stability and final performance
4. **Action Selection**: Differences in routing decision patterns

## Experiment Setup
- **Training Only**: Focus on learning behavior without evaluation overhead
- **Same Environment**: Both algorithms train on identical network conditions
- **Controlled Comparison**: Same hyperparameters where applicable
- **Visualization**: Real-time training progress and final comparison

## 1. Install and Import Required Libraries

We'll start by installing all necessary packages for our experiment. This includes:
- **Core ML Libraries**: NumPy, pandas for data handling
- **Visualization**: Matplotlib, seaborn for plotting training progress
- **Deep Learning**: PyTorch for neural network components (if used)
- **Project Modules**: Custom IAB network simulation environment and agents

In [None]:
# Install required packages for the IAB routing experiment
import subprocess
import sys


# Define required packages for our routing experiment
required_packages = [
    'bunch==1.0.1',         # Data structure utilities
    'gym==0.26.2',          # Reinforcement learning environment framework
    'matplotlib==3.5.1',    # Plotting and visualization
    'networkx==2.7.1',      # Network topology analysis
    'numpy==1.21.5',        # Numerical computing
    'pandas==1.4.2',        # Data analysis and manipulation
    'scipy==1.7.3',         # Scientific computing
    'seaborn==0.11.2',      # Statistical data visualization
    'torch==1.13.0',        # Deep learning framework
    'tqdm==4.64.0',         # Progress bars for training loops
    'jupyter',              # Jupyter notebook support
    'ipykernel',            # IPython kernel for notebooks
    'ipywidgets'            # Interactive widgets for notebooks
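]

# Install the pinned packages into the environment running this kernel
subprocess.check_call([sys.executable, "-m", "pip", "install", *required_packages])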



In [None]:
# Import all required libraries for our routing experiment
import os
import sys
import logging
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm
import json
from datetime import datetime
import time

# Configure matplotlib for beautiful inline plotting
%matplotlib inline
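# 'seaborn-v0_8-*' style names exist only in Matplotlib >= 3.6; older releases
# (including the 3.5.1 pin above) ship the style as 'seaborn-darkgrid'
_style = 'seaborn-v0_8-darkgrid'
plt.style.use(_style if _style in plt.style.available else 'seaborn-darkgrid')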
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['font.size'] = 11

print("Importing project-specific modules...")


try:
    # Import custom IAB network simulation modules
    from Utils import Statistics                    # Performance metrics collection
    from Utils.ml_flow_utils import preprocess_meta_data  # Configuration loading
    from Environment import env                     # Network environment simulation
    from Agents import Agent                        # Q-learning and Soft-Q agents
    from trainer import trainer, evaluator         # Training and evaluation frameworks
    
    print("All project modules imported successfully!")
    
except ImportError as e:
    print(f"Error importing project modules: {e}")
    print("Make sure you're running this notebook from the project root directory.")
    
# Setup experiment logging
logging.basicConfig(
    level=logging.INFO, 
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger("soft_q_vs_q_experiment")

print("\nSystem Information:")
print(f"Python version: {sys.version.split()[0]}")
print(f"NumPy version: {np.__version__}")
print(f"Matplotlib version: {plt.matplotlib.__version__}")
print(f"Pandas version: {pd.__version__}")

print("\n Ready to start the Soft-Q vs Q-Routing experiment!")

## 2. Load and Configure Experiment Settings

Here we'll load the network configuration and set up our experiment parameters. We'll focus on training-only parameters since we're primarily interested in comparing the learning behavior of the two algorithms.

### Configuration Components:
- **Network Topology**: Number of base stations, users, and network capacity
- **Training Parameters**: Episodes, time steps, learning rates
- **Algorithm Settings**: Exploration parameters, update frequencies
- **Logging**: Where to save training progress and results
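
To get a feel for the exploration settings, it helps to work out how far the decay factors move ε and the temperature over one run. The sketch below assumes a multiplicative decay applied once per episode; the notebook does not say whether the agents decay per episode or per step, so treat that as an assumption. The numbers are the defaults used in the fallback configuration (ε = 0.1 with decay 0.995, temperature = 1.0 with decay 0.99, 50 episodes).

```python
def decayed_value(initial, decay, num_updates):
    """Value after applying a multiplicative decay num_updates times."""
    return initial * decay ** num_updates


episodes = 50
final_epsilon = decayed_value(0.1, 0.995, episodes)
final_temperature = decayed_value(1.0, 0.99, episodes)

print(f"epsilon after {episodes} episodes: {final_epsilon:.4f}")
print(f"temperature after {episodes} episodes: {final_temperature:.4f}")
```

Under the per-episode assumption, ε only falls from 0.1 to roughly 0.078 and the temperature from 1.0 to roughly 0.61, so both agents are still exploring noticeably at the end of a 50-episode run.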

In [None]:
# Load and configure experiment settings
print(" Loading experiment configuration...")

try:
    # Try to load existing configuration
    setting, args, temp_device, experiment = preprocess_meta_data()
    print(" Configuration loaded from existing settings!")
    
except Exception as e:
    print(f" Could not load existing configuration: {e}")
    print(" Creating default configuration for training experiment...")
    
    # Create a comprehensive configuration for our training experiment
    setting = {
        "NETWORK": {
            "holding capacity": 10,      # Buffer size at each node
            "number Basestation": 6,     # Number of base stations in IAB network
            "number user": 12            # Number of user equipment (UE) nodes
        },
        "Simulation": {
            # Training-focused parameters (no test settings needed)
            "training_episodes": 50,     # Number of training episodes
            "max_allowed_time_step_per_episode": 2000,  # Steps per episode
            "num_time_step_to_update_target_network": 200,  # Target network update frequency
        },
        "AGENT": {
            "enable_train": True,        # Enable training mode
            "checkpoint_frequency": 10,  # Save model every N episodes
            "learning_freq": 1,          # Learning frequency (every step)
            "rewardfunction": "latency_throughput",  # Reward function type
            "learning_rate": 0.001,      # Learning rate for both algorithms
            "epsilon": 0.1,              # Exploration rate for Q-learning
            "epsilon_decay": 0.995,      # Epsilon decay rate
            "temperature": 1.0,          # Temperature for Soft-Q (controls exploration)
            "temperature_decay": 0.99    # Temperature decay rate
        },
        "seed": 42,                      # Random seed for reproducibility
        "result_dir": "./training_results"  # Directory to save training results
    }
    experiment = None

# Configure derived network parameters
setting["capacity"] = setting["NETWORK"]["holding capacity"]
setting["num_nodes"] = setting["NETWORK"]["number Basestation"] + setting["NETWORK"]["number user"]
setting["num_bs"] = setting["NETWORK"]["number Basestation"]

# Create results directory
os.makedirs(setting["result_dir"], exist_ok=True)

print("\n" + "="*60)
print(" EXPERIMENT CONFIGURATION")
print("="*60)
print(f" Network Configuration:")
print(f"   • Base Stations: {setting['num_bs']}")
print(f"   • User Equipment: {setting['NETWORK']['number user']}")
print(f"   • Total Nodes: {setting['num_nodes']}")
print(f"   • Buffer Capacity: {setting['capacity']} packets per node")

print(f"\n Training Configuration:")
print(f"   • Training Episodes: {setting['Simulation']['training_episodes']}")
print(f"   • Time Steps per Episode: {setting['Simulation']['max_allowed_time_step_per_episode']}")
print(f"   • Target Network Updates: Every {setting['Simulation']['num_time_step_to_update_target_network']} steps")
print(f"   • Checkpoint Frequency: Every {setting['AGENT']['checkpoint_frequency']} episodes")

print(f"\n Algorithm Parameters:")
print(f"   • Learning Rate: {setting['AGENT']['learning_rate']}")
print(f"   • Q-Learning Epsilon: {setting['AGENT']['epsilon']} (decay: {setting['AGENT']['epsilon_decay']})")
print(f"   • Soft-Q Temperature: {setting['AGENT']['temperature']} (decay: {setting['AGENT']['temperature_decay']})")
print(f"   • Reward Function: {setting['AGENT']['rewardfunction']}")

print(f"\n Output:")
print(f"   • Results Directory: {setting['result_dir']}")
print(f"   • Random Seed: {setting['seed']}")
print("="*60)

print("\n Configuration ready for training experiment!")

## 3. Initialize Environment and Agent Classes

Now we'll set up the different components needed for our comparative training experiment:

### Environment Setup
- **Same Environment**: Both algorithms will train on identical network topologies
- **Dynamic Network**: The IAB network changes over time to test adaptability
- **Packet Routing**: Agents learn to route packets efficiently through the network

### Agent Differences
- **Q-Agent**: Traditional Q-learning with ε-greedy exploration
- **SoftQ-Agent**: Soft Q-learning with Boltzmann exploration policy

### Why This Comparison Matters
- **Exploration Strategy**: Different approaches to exploring unknown network states
- **Action Selection**: Hard vs soft decision making in routing
- **Convergence**: How stable and fast each algorithm learns optimal policies

In [None]:
# Define environment and agent lookup tables for our experiment
print(" Setting up environment and agent configurations...")

# Environment Classes - Both algorithms use the same Q-learning environment
# This ensures fair comparison since the underlying network dynamics are identical
envsLut = {
    'Q-Routing': env.dynetworkEnvQlearning,      # Standard Q-learning environment
    'Soft-Q-Routing': env.dynetworkEnvQlearning,  # Same environment for Soft-Q
}

# Agent Classes - This is where the key difference lies
agentsLut = {
    'Q-Routing': Agent.QAgent,        # Traditional Q-learning agent (ε-greedy)
    'Soft-Q-Routing': Agent.SoftQAgent,  # Soft Q-learning agent (Boltzmann policy)
}

# Trainer Classes - Both use the same trainer framework
trainerLut = {
    'Q-Routing': trainer.RLTabularTrainer,      # Standard RL trainer
    'Soft-Q-Routing': trainer.RLTabularTrainer,  # Same trainer for fair comparison
}

# Since we're focusing on training only, we'll simplify the evaluator setup
# We'll use basic statistics collection instead of full evaluation
evaluatorLut = {
    'Q-Routing': None,        # No evaluation, training metrics only
    'Soft-Q-Routing': None,   # No evaluation, training metrics only
}

# Define the algorithms we'll compare
agent_names = ['Q-Routing', 'Soft-Q-Routing']

print("\n" + "="*50)
print(" EXPERIMENT SETUP SUMMARY")
print("="*50)

for i, name in enumerate(agent_names, 1):
    env_class = envsLut[name]
    agent_class = agentsLut[name]
    trainer_class = trainerLut[name]
    
    print(f"\n{i}. {name}:")
    print(f"    Environment: {env_class.__name__}")
    print(f"    Agent Type: {agent_class.__name__}")
    print(f"    Trainer: {trainer_class.__name__}")
    
    # Explain the key differences
    if name == 'Q-Routing':
        print(f"    Policy: ε-greedy (deterministic with random exploration)")
        print(f"    Action Selection: Choose action with highest Q-value")
    else:
        print(f"    Policy: Boltzmann/Softmax (probabilistic)")
        print(f"    Action Selection: Sample from Q-value probability distribution")

print(f"\n Comparison Focus: Training behavior of {' vs '.join(agent_names)}")
print("="*50)

print("\n Environment and agent classes configured successfully!")

## 4. Create Training Experiment Class

We'll create a simplified experiment class focused entirely on training comparison. This class will:

- **Initialize Agents**: Set up both Q-learning and Soft-Q agents with identical network conditions
- **Training Infrastructure**: Prepare trainers and statistics collectors for both algorithms
- **Progress Tracking**: Monitor training progress in real-time
- **Fair Comparison**: Ensure both algorithms have identical starting conditions and training parameters
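
One way to give both algorithms identical starting conditions is to build a separate random generator from the shared seed for each run, instead of letting both draw from one global generator. The sketch below illustrates the idea with the standard library only; `make_rng` and `sample_traffic` are hypothetical helpers, not part of the project's environment, which receives its seed through the constructor.

```python
import random


def make_rng(seed):
    """Return an independent generator so each run replays the same draws."""
    return random.Random(seed)


def sample_traffic(rng, num_nodes, num_packets):
    """Hypothetical traffic generator: (source, destination) pairs, source != destination."""
    pairs = []
    for _ in range(num_packets):
        src, dst = rng.sample(range(num_nodes), 2)
        pairs.append((src, dst))
    return pairs


seed = 42
# 18 nodes = 6 base stations + 12 users, matching the default configuration
traffic_for_q = sample_traffic(make_rng(seed), num_nodes=18, num_packets=5)
traffic_for_soft_q = sample_traffic(make_rng(seed), num_nodes=18, num_packets=5)
identical = traffic_for_q == traffic_for_soft_q
```

Because each run owns its generator, training one algorithm first does not change the random draws the second one sees.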

In [None]:
# Define a simplified training-focused experiment class
class TrainingExperiment:
    """
    Simplified experiment class focused on comparing training behavior
    of Q-Routing vs Soft-Q-Routing algorithms
    """
    
    def __init__(self, setting, experiment, agent_names):
        """
        Initialize the training experiment
        
        Args:
            setting: Configuration dictionary
            experiment: Experiment metadata (can be None)
            agent_names: List of algorithm names to compare
        """
        print(" Initializing Training Experiment...")
        
        self.setting = setting
        self.experiment = experiment
        self.agent_names = agent_names
        
        # Storage for experiment components
        self.agents = {}      # Agent instances
        self.envs = {}        # Environment instances  
        self.trainers = {}    # Trainer instances
        self.stats = {}       # Statistics collectors
        self.paths = {}       # Result directories
        self.training_metrics = {}  # Training progress tracking
        
        self._initialize_components()
        
    def _initialize_components(self):
        """Initialize all components for each algorithm"""
        
        for name in self.agent_names:
            print(f"\n Setting up {name}...")
            
            # Create result directories
            self.paths[name] = os.path.join(self.setting["result_dir"], name)
            os.makedirs(self.paths[name], exist_ok=True)
            
            # Get component classes from lookup tables
            EnvClass = envsLut[name]
            AgentClass = agentsLut[name]
            TrainerClass = trainerLut[name]
            
            print(f"    Creating environment: {EnvClass.__name__}")
            # Initialize environment with consistent settings
            env_instance = EnvClass(
                setting=self.setting, 
                seed=self.setting["seed"], 
                algorithm=name, 
                rewardfun=self.setting["AGENT"]["rewardfunction"]
            )
            
            print(f"    Creating agent: {AgentClass.__name__}")
            # Get state space dimensions and create agent
            state_space = env_instance.get_state_space_dim(self.setting)
            agent_instance = AgentClass(env_instance.dynetwork, self.setting, state_space, None)
            
            # Store instances
            self.envs[name] = env_instance
            self.agents[name] = agent_instance
            
            print(f"    Setting up statistics collection...")
            # Initialize training statistics collector
            self.stats[name] = Statistics.TrainQLStatisticsCollector(
                setting=self.setting, 
                result_dir=self.paths[name], 
                algorithms=[name]
            )
            
            print(f"    Creating trainer: {TrainerClass.__name__}")
            # Initialize trainer for this algorithm
            self.trainers[name] = TrainerClass(
                time_steps=self.setting["Simulation"]["max_allowed_time_step_per_episode"],
                TARGET_UPDATE=self.setting["Simulation"]["num_time_step_to_update_target_network"],
                agent=agent_instance,
                stat_collector=self.stats[name],
                env=env_instance,
                name=name,
                writer=self.setting.get("train_writer", None),
                experiment=self.experiment,
                update_freq=self.setting["AGENT"]["learning_freq"]
            )
            
            # Initialize training metrics tracking
            self.training_metrics[name] = {
                'episode_rewards': [],
                'episode_lengths': [],
                'losses': [],
                'exploration_rates': [],
                'timestamps': []
            }
            
            print(f"    {name} setup complete!")
        
        print(f"\n Training experiment ready with {len(self.agent_names)} algorithms!")
        print(f" Results will be saved to: {self.setting['result_dir']}")

# Create the training experiment instance
print("Creating training experiment instance...")
experiment_instance = TrainingExperiment(
    setting=setting, 
    experiment=experiment, 
    agent_names=agent_names
)

print("\n" + "="*60)
print(" EXPERIMENT INITIALIZATION COMPLETE!")
print("="*60)
print(f" Ready to compare: {' vs '.join(agent_names)}")
print(f" Training episodes: {setting['Simulation']['training_episodes']}")
print(f" Steps per episode: {setting['Simulation']['max_allowed_time_step_per_episode']}")
print("="*60)

## 5. Verify Training Setup

Let's verify that our training components are properly configured and ready to run. We'll check:

- **Agent Configuration**: Confirm both agents are properly initialized
- **Training Parameters**: Verify learning rates, exploration settings
- **Environment State**: Ensure both algorithms start with identical conditions
- **Statistics Collection**: Confirm we can track training progress

In [None]:
# Verify that all training components are properly set up
print(" Verifying training setup...")

print("\n" + "="*50)
print(" TRAINING COMPONENT VERIFICATION")
print("="*50)

all_ready = True

for i, name in enumerate(agent_names, 1):
    print(f"\n{i}. {name} Configuration:")
    
    # Check agent
    agent = experiment_instance.agents[name]
    print(f"    Agent: {type(agent).__name__}")
    
    # Check environment (named env_instance so the imported `env` module is not shadowed)
    env_instance = experiment_instance.envs[name]
    print(f"    Environment: {type(env_instance).__name__}")
    
    # Check trainer (named trainer_instance so the imported `trainer` module is not shadowed)
    trainer_instance = experiment_instance.trainers[name]
    if trainer_instance is not None:
        print(f"    Trainer: {type(trainer_instance).__name__}")
    else:
        print("    Trainer: MISSING")
        all_ready = False
    
    # Check statistics collector
    stats = experiment_instance.stats[name]
    print(f"    Statistics: {type(stats).__name__}")
    
    # Check result directory
    path = experiment_instance.paths[name]
    if os.path.exists(path):
        print(f"    Results Path: {path}")
    else:
        print(f"    Results Path: {path} (MISSING)")
        all_ready = False
    
    # Display algorithm-specific parameters
    if name == 'Q-Routing':
        print(f"    Epsilon: {setting['AGENT']['epsilon']} (decay: {setting['AGENT']['epsilon_decay']})")
    else:  # Soft-Q-Routing
        print(f"    Temperature: {setting['AGENT']['temperature']} (decay: {setting['AGENT']['temperature_decay']})")

print(f"\n Shared Training Parameters:")
print(f"   • Learning Rate: {setting['AGENT']['learning_rate']}")
print(f"   • Training Episodes: {setting['Simulation']['training_episodes']}")
print(f"   • Steps per Episode: {setting['Simulation']['max_allowed_time_step_per_episode']}")
print(f"   • Checkpoint Frequency: {setting['AGENT']['checkpoint_frequency']} episodes")
print(f"   • Target Network Update: Every {setting['Simulation']['num_time_step_to_update_target_network']} steps")

print(f"\n Network Environment:")
print(f"   • Total Nodes: {setting['num_nodes']}")
print(f"   • Base Stations: {setting['num_bs']}")
print(f"   • User Equipment: {setting['NETWORK']['number user']}")
print(f"   • Buffer Capacity: {setting['capacity']} packets/node")

print("="*50)

if all_ready:
    print(" ALL COMPONENTS READY FOR TRAINING!")
    print(" Ready to start the comparative training experiment!")
else:
    print(" Some components are not properly configured.")
    print("Please check the error messages above before proceeding.")

print("="*50)

## 6. Run Training Experiment

Now we'll run the main training experiment! This will:

### Training Process
1. **Initialize**: Both algorithms start with random policies
2. **Learn**: Each algorithm experiences the same network episodes
3. **Track Progress**: Monitor rewards, convergence, and learning curves
4. **Save Checkpoints**: Periodically save model states

### What to Expect
- **Q-Routing**: May show more abrupt exploration early on, then converge quickly once greedy choices dominate
- **Soft-Q-Routing**: May show smoother exploration and potentially more stable convergence
- **Progress Bars**: Real-time training progress for both algorithms
- **Live Metrics**: Episode rewards and learning statistics

### Training Duration
This will take several minutes depending on your hardware. We'll see real-time progress as both algorithms learn!

In [None]:
# Run the training experiment with detailed progress tracking
def run_training_experiment(experiment_instance):
    """
    Run training for both algorithms and track their progress
    
    Returns:
        dict: Training results and metrics for both algorithms
    """
    
    print(" Starting Soft-Q vs Q-Routing Training Experiment!")
    print(f" Start time: {datetime.now().strftime('%H:%M:%S')}")
    print("="*70)
    
    results = {}
    training_start_time = time.time()
    
    # Train each algorithm
    for name in experiment_instance.agent_names:
        print(f"\nTRAINING {name.upper()}")
        print("="*70)
        
        # Get components for this algorithm
        trainer = experiment_instance.trainers[name]
        agent = experiment_instance.agents[name]
        metrics = experiment_instance.training_metrics[name]
        
        if trainer is None:
            print(f" No trainer available for {name}")
            results[name] = None
            continue
        
        # Training parameters
        total_episodes = experiment_instance.setting["Simulation"]["training_episodes"]
        checkpoint_freq = experiment_instance.setting["AGENT"]["checkpoint_frequency"]
        
        print(f" Episodes to train: {total_episodes}")
        print(f" Checkpoint frequency: Every {checkpoint_freq} episodes")
        print(f" Max steps per episode: {experiment_instance.setting['Simulation']['max_allowed_time_step_per_episode']}")
        
        # Training loop with progress tracking
        episode_rewards = []
        algorithm_start_time = time.time()
        
        for episode in tqdm(range(total_episodes), 
                           desc=f"Training {name}", 
                           ncols=100, 
                           colour='blue' if name == 'Q-Routing' else 'green'):
            
            try:
                # Record episode start time
                episode_start_time = time.time()
                
                # Train one episode
                trainer.train(episode)
                
                # Track episode metrics
                episode_duration = time.time() - episode_start_time
                
                # Get episode reward; falls back to 0 if the trainer does not
                # expose a `last_episode_reward` attribute
                episode_reward = getattr(trainer, 'last_episode_reward', 0)
                episode_rewards.append(episode_reward)
                
                # Store metrics ('episode_lengths' holds wall-clock seconds per episode)
                metrics['episode_rewards'].append(episode_reward)
                metrics['episode_lengths'].append(episode_duration)
                metrics['timestamps'].append(time.time())
                
                # Get current exploration parameter, defaulting to the configured value
                agent_cfg = experiment_instance.setting['AGENT']
                if name == 'Q-Routing':
                    exploration_param = getattr(agent, 'epsilon', agent_cfg['epsilon'])
                else:  # Soft-Q-Routing
                    exploration_param = getattr(agent, 'temperature', agent_cfg['temperature'])
                metrics['exploration_rates'].append(exploration_param)
                
                # Save checkpoint
                if episode % checkpoint_freq == 0 and episode > 0:
                    agent.save_agent(experiment_instance.paths[name])
                    
                    # Print progress update
                    avg_reward = np.mean(episode_rewards[-checkpoint_freq:])
                    print(f"\nEpisode {episode}: Avg Reward = {avg_reward:.3f}, "
                          f"Exploration = {exploration_param:.4f}")
                
            except Exception as e:
                print(f"\nError in episode {episode}: {e}")
                # Continue training despite errors
                continue
        
        # Final save
        agent.save_agent(experiment_instance.paths[name])
        algorithm_duration = time.time() - algorithm_start_time
        
        # Calculate training statistics
        total_reward = sum(episode_rewards)
        avg_reward = np.mean(episode_rewards) if episode_rewards else 0
        final_exploration = metrics['exploration_rates'][-1] if metrics['exploration_rates'] else 0
        
        print(f"\n{name} Training Complete!")
        print(f"     Training Time: {algorithm_duration:.1f} seconds")
        print(f"    Total Reward: {total_reward:.2f}")
        print(f"    Average Reward: {avg_reward:.3f}")
        print(f"    Final Exploration: {final_exploration:.4f}")
        print(f"    Model saved to: {experiment_instance.paths[name]}")
        
        # Store results
        results[name] = {
            'episode_rewards': episode_rewards,
            'training_time': algorithm_duration,
            'total_reward': total_reward,
            'average_reward': avg_reward,
            'final_exploration': final_exploration,
            'metrics': metrics,
            'agent': agent,
            'trainer': trainer
        }
    
    total_training_time = time.time() - training_start_time
    
    print("\n" + "="*70)
    print(" TRAINING EXPERIMENT COMPLETED!")
    print("="*70)
    print(f" Total Experiment Time: {total_training_time:.1f} seconds")
    print(f" Results Summary:")
    
    for name, result in results.items():
        if result is not None:
            print(f"    {name}: {result['average_reward']:.3f} avg reward, "
                  f"{result['training_time']:.1f}s training")
        else:
            print(f"    {name}: Training failed")
    
    print("="*70)
    return results

# Execute the training experiment
print(" Executing training experiment...")
print("This will train both algorithms and track their learning progress.")
print("Grab a coffee - this may take several minutes!\n")

# Run the experiment
training_results = run_training_experiment(experiment_instance)

print("\nTraining experiment completed successfully!")
print(" Ready to analyze and visualize the results!")

## 7. Analyze Training Results

Now let's analyze and visualize the training results! We'll create comprehensive plots to understand:

### Learning Curves Analysis
- **Episode Rewards**: How reward accumulates over training episodes
- **Convergence Speed**: Which algorithm learns faster
- **Training Stability**: How consistent each algorithm's learning is
- **Exploration Decay**: How exploration parameters change over time

### Algorithm Comparison
- **Final Performance**: Which algorithm achieved better final performance
- **Learning Efficiency**: Reward gained per training time
- **Exploration Strategy**: Differences in exploration vs exploitation balance
- **Training Dynamics**: Patterns in learning behavior

This analysis will help us understand the practical differences between traditional Q-Routing and Soft-Q-Routing in IAB networks.
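
Convergence speed and training stability can be turned into numbers before plotting. The sketch below is one possible set of definitions, not the project's statistics code: the helper names, the window size, and the 5% tolerance band are assumptions chosen for illustration. It is meant to be applied to a list of per-episode rewards such as `training_results[name]['episode_rewards']`.

```python
import statistics


def moving_average(values, window):
    """Trailing moving average; early points average over the samples seen so far."""
    smoothed = []
    running = 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            running -= values[i - window]
        smoothed.append(running / min(i + 1, window))
    return smoothed


def convergence_episode(values, window=5, tolerance=0.05):
    """Earliest episode from which the smoothed reward stays within
    a relative `tolerance` of its final value."""
    smoothed = moving_average(values, window)
    final = smoothed[-1]
    band = tolerance * max(abs(final), 1e-12)
    episode = len(smoothed) - 1
    while episode > 0 and abs(smoothed[episode - 1] - final) <= band:
        episode -= 1
    return episode


def tail_stability(values, tail=10):
    """Standard deviation of the last `tail` rewards (lower means steadier)."""
    return statistics.pstdev(values[-tail:])


# Made-up reward curve used only to exercise the helpers
rewards = [-10.0, -8.0, -6.0, -4.0, -3.0, -2.0, -2.0, -2.0, -2.0, -2.0, -2.0, -2.0]
converged_at = convergence_episode(rewards, window=3, tolerance=0.05)
stability = tail_stability(rewards, tail=5)
```

Comparing `converged_at` and `stability` across the two algorithms gives a compact summary to set beside the learning-curve plots; dividing the total reward by `training_time` gives the learning-efficiency figure mentioned above.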

In [None]:
# Comprehensive results visualization and comparison
def visualize_experiment_results(experiment_results, agent_names):
    """Create comprehensive visualizations comparing the algorithms"""
    
    # Set up the plotting style
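    # 'seaborn-v0_8' exists only in Matplotlib >= 3.6; older releases call it 'seaborn'
    plt.style.use('seaborn-v0_8' if 'seaborn-v0_8' in plt.style.available else 'seaborn')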
    colors = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728']
    
    # Create figure with subplots
    fig = plt.figure(figsize=(20, 15))
    
    # 1. Training Progress Comparison
    plt.subplot(3, 3, 1)
    for i, name in enumerate(agent_names):
        result = experiment_results.get(name)
        if result is not None:
            try:
                # run_training_experiment stores rewards under 'episode_rewards';
                # fall back to a 'train_stats' object if one was provided instead
                rewards = result.get('episode_rewards')
                if rewards is None:
                    rewards = getattr(result.get('train_stats'), 'rewards', [])
                plt.plot(range(len(rewards)), rewards,
                         label=name, color=colors[i], linewidth=2)
            except Exception as e:
                print(f"Could not plot training progress for {name}: {e}")
    
    plt.title('Training Progress: Episode Rewards', fontsize=14, fontweight='bold')
    plt.xlabel('Training Episodes')
    plt.ylabel('Episode Reward')
    plt.legend()
    plt.grid(True, alpha=0.3)
    
    # 2. Convergence Analysis
    plt.subplot(3, 3, 2)
    for i, name in enumerate(agent_names):
        result = experiment_results.get(name)
        if result is not None:
            try:
                # Losses live under result['metrics']['losses'] when recorded;
                # fall back to a 'train_stats' object if one was provided instead
                losses = result.get('metrics', {}).get('losses')
                if not losses:
                    losses = getattr(result.get('train_stats'), 'losses', [])
                if losses:
                    plt.plot(losses, label=f'{name} Loss', color=colors[i], linewidth=2)
            except Exception as e:
                print(f"Could not plot convergence for {name}: {e}")
    
    plt.title('Algorithm Convergence', fontsize=14, fontweight='bold')
    plt.xlabel('Training Steps')
    plt.ylabel('Loss/Error')
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.yscale('log')
    
    # 3. Test Performance Metrics
    plt.subplot(3, 3, 3)
    performance_data = {'Algorithm': [], 'Metric': [], 'Value': []}