# Basic Simulation Workflow

This notebook demonstrates how to set up a basic simulation of distributed deep learning training using the Distributed Training Simulator. We'll cover:

1. Setting up communication, compute, and memory models
2. Configuring the simulation parameters
3. Running a simulation and analyzing results
4. Visualizing the simulation outcomes
5. Investigating how throughput scales with device count

## Setup

First, let's import the necessary modules:

In [None]:
import dist_training_sim as dts
from dist_training_sim.visualization import Visualizer
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

## Creating Models

The distributed training simulator consists of three core models:

1. **Communication Model**: Simulates the network topology and communication patterns
2. **Compute Model**: Simulates computational resources and operations
3. **Memory Model**: Simulates memory systems and data movement

Let's create instances of each:

In [None]:
# Create a communication model with 8 nodes in a ring topology
comm_model = dts.CommunicationModel(num_nodes=8, topology=dts.Topology.RING)

# Create a compute model with 8 GPU devices
compute_model = dts.ComputeModel(num_devices=8, device_type=dts.DeviceType.GPU)

# Create a memory model with 8 devices
memory_model = dts.MemoryModel(num_devices=8)

## Configuring the Communication Model

Let's customize our communication model by applying the same high-speed link properties to every connection, then time a point-to-point transfer and a collective operation:

In [None]:
# Define high-speed links for our ring
high_speed_links = dts.LinkProperties()
high_speed_links.bandwidth_gbps = 100.0  # 100 Gbps
high_speed_links.latency_us = 5.0        # 5 microseconds
high_speed_links.error_rate = 0.0

# Set uniform link properties across all connections
comm_model.setUniformLinkProperties(high_speed_links)

# Simulate a point-to-point transfer of 100MB
transfer_time_ms = comm_model.simulateSendRecv(0, 1, 100 * 1024 * 1024)
print(f"Time to transfer 100MB between nodes 0 and 1: {transfer_time_ms:.2f} ms")

# Simulate a collective operation (all-reduce) with 100MB per node
allreduce_time_ms = comm_model.simulateAllReduce(100 * 1024 * 1024)
print(f"Time to perform all-reduce with 100MB per node: {allreduce_time_ms:.2f} ms")
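
To sanity-check the numbers above, it helps to have a back-of-the-envelope estimate. The sketch below uses the standard latency-plus-bandwidth (alpha-beta) cost model and the textbook ring all-reduce schedule of 2(N-1) steps, each moving a 1/N-sized chunk. It is plain Python and does not call the simulator; the helper names are hypothetical, and the simulator's internal model may differ (for example, by accounting for contention or the error rate).

```python
def send_recv_time_ms(num_bytes, bandwidth_gbps, latency_us):
    """Alpha-beta estimate: fixed link latency plus serialization time."""
    bytes_per_sec = bandwidth_gbps * 1e9 / 8  # gigabits/s -> bytes/s
    return latency_us / 1e3 + num_bytes / bytes_per_sec * 1e3


def ring_allreduce_time_ms(num_bytes, num_nodes, bandwidth_gbps, latency_us):
    """Ring all-reduce: 2*(N-1) steps, each sending a chunk of size bytes/N."""
    if num_nodes < 2:
        return 0.0  # nothing to exchange with a single node
    chunk = num_bytes / num_nodes
    steps = 2 * (num_nodes - 1)
    return steps * send_recv_time_ms(chunk, bandwidth_gbps, latency_us)


size = 100 * 1024 * 1024  # same payload as the cell above
p2p_ms = send_recv_time_ms(size, bandwidth_gbps=100.0, latency_us=5.0)
ring_ms = ring_allreduce_time_ms(size, num_nodes=8, bandwidth_gbps=100.0, latency_us=5.0)
print(f"Estimated point-to-point time: {p2p_ms:.2f} ms")
print(f"Estimated ring all-reduce time: {ring_ms:.2f} ms")
```

If the simulated times differ from these estimates by a large factor, the unit of `bandwidth_gbps` (gigabits versus gigabytes) is a good first thing to check.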

## Configuring the Compute Model

Now let's customize our compute model with specific device properties and workload characteristics:

In [None]:
# Define custom GPU properties (based on a high-end GPU)
gpu_props = dts.DeviceProperties()
gpu_props.flops = 15.0e12           # 15 TFLOPS
gpu_props.memory_bandwidth = 900.0  # 900 GB/s
gpu_props.memory_size = 32768       # 32 GB, assuming the field is in MB (32 * 1024)
gpu_props.cores = 108               # 108 SMs

# Set uniform device properties
compute_model.setUniformDeviceProperties(gpu_props)

# Configure batch size and model size
compute_model.setBatchSize(256)
compute_model.setModelSize(parameters=300e6, activations=150e6)  # 300M params, 150M activations

# Simulate a forward pass on device 0
forward_time_ms = compute_model.simulateOperation(dts.Operation.FORWARD_PASS, 0)
print(f"Forward pass time on device 0: {forward_time_ms:.2f} ms")

# Simulate a backward pass on device 0
backward_time_ms = compute_model.simulateOperation(dts.Operation.BACKWARD_PASS, 0)
print(f"Backward pass time on device 0: {backward_time_ms:.2f} ms")

# Simulate a full training iteration
iteration_time_ms = compute_model.simulateFullTrainingIteration(0)
print(f"Full training iteration time on device 0: {iteration_time_ms:.2f} ms")
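
For a rough idea of what to expect from these calls, a roofline-style bound is a useful reference: a pass takes at least as long as the slower of its arithmetic and its memory traffic. The sketch below is standalone Python under stated assumptions, not the simulator's own formula. It assumes about 2 FLOPs per parameter per sample for the forward pass (a common rule of thumb for dense layers), a backward pass costing twice the forward pass, and fp32 weights read once per pass.

```python
def pass_time_ms(flop_count, bytes_moved, peak_flops, mem_bw_bytes_per_s):
    """Roofline-style lower bound: the slower of compute and memory traffic."""
    compute_s = flop_count / peak_flops
    memory_s = bytes_moved / mem_bw_bytes_per_s
    return max(compute_s, memory_s) * 1e3


params = 300e6
batch = 256
peak_flops = 15.0e12       # 15 TFLOPS, as configured above
mem_bw = 900.0e9           # 900 GB/s, as configured above
weight_bytes = params * 4  # assumption: fp32 weights, read once per pass

forward_flops = 2 * params * batch  # assumption: ~2 FLOPs per parameter per sample
backward_flops = 2 * forward_flops  # assumption: backward ~2x the forward cost

forward_ms = pass_time_ms(forward_flops, weight_bytes, peak_flops, mem_bw)
backward_ms = pass_time_ms(backward_flops, weight_bytes, peak_flops, mem_bw)
print(f"Estimated forward pass: {forward_ms:.2f} ms")
print(f"Estimated backward pass: {backward_ms:.2f} ms")
```

Real devices rarely sustain peak FLOPS, so treat these figures as lower bounds rather than predictions.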

## Configuring the Memory Model

Let's customize our memory model to reflect a specific hardware configuration:

In [None]:
# Define a custom high-bandwidth memory tier
hbm_tier = dts.MemoryTier()
hbm_tier.type = dts.MemoryType.HBM
hbm_tier.capacity_bytes = 32 * 1024 * 1024 * 1024  # 32 GB
hbm_tier.bandwidth_gbps = 1200.0                   # 1.2 TB/s if the unit is GB/s (1200 Gbit/s would be 150 GB/s)
hbm_tier.latency_ns = 100.0                        # 100 ns

# Add this memory tier to device 0
memory_model.addMemoryTier(0, hbm_tier)

# Simulate allocating 10 GB on device 0's HBM
allocation_time_ms = memory_model.simulateAllocation(0, dts.MemoryType.HBM, 10 * 1024 * 1024 * 1024)
print(f"Time to allocate 10GB on device 0's HBM: {allocation_time_ms:.2f} ms")

# Simulate transferring 5 GB from device 0's HBM to device 0's CPU RAM
# (assumes device 0 has a CPU_RAM tier; only the HBM tier was added explicitly above)
mem_transfer_time_ms = memory_model.simulateTransfer(
    0, dts.MemoryType.HBM,
    0, dts.MemoryType.CPU_RAM,
    5 * 1024 * 1024 * 1024
)
print(f"Time to transfer 5GB from HBM to CPU RAM: {mem_transfer_time_ms:.2f} ms")
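
Because the tier's bandwidth field is named `bandwidth_gbps` while its comment quotes terabytes per second, it is worth seeing how much the unit interpretation matters. The standalone sketch below estimates a streaming transfer as access latency plus size divided by bandwidth, under both readings of the value 1200. The helper name is hypothetical and the formula is a simplification, not the simulator's model.

```python
GIB = 1024 ** 3


def estimate_transfer_ms(num_bytes, bandwidth_bytes_per_s, latency_ns=0.0):
    """Streaming estimate: access latency plus size divided by bandwidth."""
    return latency_ns / 1e6 + num_bytes / bandwidth_bytes_per_s * 1e3


bw_if_gigabytes = 1200.0 * 1e9     # reading 1200 as GB/s  -> 1.2 TB/s
bw_if_gigabits = 1200.0 * 1e9 / 8  # reading 1200 as Gbit/s -> 150 GB/s

fast_ms = estimate_transfer_ms(5 * GIB, bw_if_gigabytes, latency_ns=100.0)
slow_ms = estimate_transfer_ms(5 * GIB, bw_if_gigabits, latency_ns=100.0)
print(f"5 GiB at 1.2 TB/s: {fast_ms:.2f} ms")
print(f"5 GiB at 150 GB/s: {slow_ms:.2f} ms")
```

The two readings differ by a factor of eight, so comparing the simulator's reported time against both estimates is a quick way to infer which unit it uses. A transfer between two tiers is further limited by the slower tier, which this sketch does not model.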

## Creating a Simulator

Now, let's combine these models to create a complete training simulator:

In [None]:
# Create the simulator with our models
simulator = dts.TrainingSimulator(comm_model, compute_model, memory_model)

# Configure the simulator (batch and model sizes match those set on the compute model above)
simulator.setBatchSize(256)
simulator.setModelSize(parameters=300e6, activations=150e6)
simulator.setParallelStrategy(dts.ParallelStrategy.DATA_PARALLEL)
simulator.setOptimizer(dts.OptimizerType.ADAM)
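
The choice of strategy and optimizer determines how much state each device must hold. With plain data parallelism every device keeps a full replica of the weights and gradients, and Adam adds two moment estimates per parameter. The sketch below estimates that footprint in standalone Python, assuming fp32 (4 bytes) for every value and no sharding; the helper name is hypothetical, and activation memory is not included.

```python
def training_state_bytes(num_params, bytes_per_value=4, optimizer_slots=2):
    """Weights + gradients + optimizer slots, all stored at the same precision."""
    return num_params * bytes_per_value * (2 + optimizer_slots)


num_params = 300e6
state_bytes = training_state_bytes(num_params)  # Adam: two slots per parameter
grad_bytes = num_params * 4                     # payload all-reduced each step (fp32)

print(f"Per-replica training state: {state_bytes / 1e9:.1f} GB")
print(f"Gradient all-reduce payload: {grad_bytes / 1e9:.1f} GB")
```

Under these assumptions the model state fits comfortably in the 32 GB configured per device, which leaves the remainder for activations and workspace.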

## Running the Simulation

Let's run a simulation of a training iteration and examine the results:

In [None]:
# Run the simulation
results = simulator.simulateTrainingIteration()

# Print summary statistics
print(f"Total iteration time: {results.total_time_ms:.2f} ms")
print(f"Compute time: {results.compute_time_ms:.2f} ms")
print(f"Communication time: {results.communication_time_ms:.2f} ms")
print(f"Memory operation time: {results.memory_time_ms:.2f} ms")
print(f"Compute efficiency: {results.compute_efficiency:.2f}%")
print(f"Communication efficiency: {results.communication_efficiency:.2f}%")
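
When reading these numbers, note that the component times need not add up to the total: if the simulator overlaps communication with compute, their sum exceeds the wall-clock time. The small helper below, a hypothetical sketch that takes plain floats rather than the simulator's result object, reports each component's share of the total and the amount of overlap. The input values are placeholders for illustration, not simulator output.

```python
def time_breakdown(total_ms, compute_ms, comm_ms, memory_ms):
    """Return each component's share of total time (%) and the overlapped time."""
    if total_ms <= 0:
        raise ValueError("total_ms must be positive")
    parts = {"compute": compute_ms, "communication": comm_ms, "memory": memory_ms}
    shares = {name: 100.0 * ms / total_ms for name, ms in parts.items()}
    # Time hidden by overlap: how far the component sum exceeds the wall clock.
    overlap_ms = max(0.0, sum(parts.values()) - total_ms)
    return shares, overlap_ms


# Placeholder values for illustration only.
shares, overlap_ms = time_breakdown(total_ms=100.0, compute_ms=60.0, comm_ms=50.0, memory_ms=10.0)
for name, pct in shares.items():
    print(f"{name}: {pct:.1f}% of iteration time")
print(f"Overlapped time: {overlap_ms:.1f} ms")
```

In the notebook, the same function could be called with `results.total_time_ms`, `results.compute_time_ms`, `results.communication_time_ms`, and `results.memory_time_ms`.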

## Visualizing the Results

Let's visualize the simulation results using the visualization module:

In [None]:
# Create a visualizer
vis = Visualizer()

# Plot the training timeline
fig = vis.plot_training_timeline(results.events)
plt.show()

# Plot the communication graph
fig = vis.plot_communication_graph(results.comm_adjacency)
plt.show()

# Plot memory usage by device
memory_data = {
    "GPU VRAM": results.memory_usage_by_device,
    "CPU RAM": results.host_memory_usage_by_device
}
fig = vis.plot_memory_usage(range(8), memory_data, results.memory_capacity_by_device)
plt.show()

## Scaling Simulation

Let's simulate how the system scales across different numbers of devices:

In [None]:
# Define a range of device counts to test
device_counts = [1, 2, 4, 8, 16, 32]
throughputs = []
efficiencies = []

# Base throughput for efficiency calculation
base_samples_per_sec = None

for count in device_counts:
    # Create models with the specified device count
    temp_comm_model = dts.CommunicationModel(num_nodes=count, topology=dts.Topology.RING)
    temp_compute_model = dts.ComputeModel(num_devices=count, device_type=dts.DeviceType.GPU)
    temp_memory_model = dts.MemoryModel(num_devices=count)
    
    # Configure models
    temp_comm_model.setUniformLinkProperties(high_speed_links)
    temp_compute_model.setUniformDeviceProperties(gpu_props)
    temp_compute_model.setBatchSize(256 * count)  # Scale batch size with device count
    temp_compute_model.setModelSize(parameters=300e6, activations=150e6)
    
    # Create simulator
    temp_simulator = dts.TrainingSimulator(temp_comm_model, temp_compute_model, temp_memory_model)
    temp_simulator.setBatchSize(256 * count)
    temp_simulator.setModelSize(parameters=300e6, activations=150e6)
    temp_simulator.setParallelStrategy(dts.ParallelStrategy.DATA_PARALLEL)
    temp_simulator.setOptimizer(dts.OptimizerType.ADAM)  # match the main simulator's configuration
    
    # Run simulation
    temp_results = temp_simulator.simulateTrainingIteration()
    
    # Calculate throughput (samples/sec)
    samples_per_sec = (256 * count) / (temp_results.total_time_ms / 1000)
    throughputs.append(samples_per_sec)
    
    # Calculate efficiency relative to the first configuration, normalized
    # per device so the baseline does not have to be a single device
    if base_samples_per_sec is None:
        base_samples_per_sec = samples_per_sec / count
    ideal_throughput = base_samples_per_sec * count
    efficiencies.append(100 * samples_per_sec / ideal_throughput)
    
    print(f"{count} devices: {samples_per_sec:.2f} samples/sec, {efficiencies[-1]:.2f}% efficiency")

In [None]:
# Plot scaling results
fig = vis.plot_scaling_efficiency(device_counts, throughputs, efficiencies)
plt.show()
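
Because the batch size grows with the device count, this is a weak-scaling experiment, and the efficiency ratio used above reduces to single-device iteration time divided by N-device iteration time. The standalone sketch below shows the shape of curve to expect under a deliberately pessimistic analytic model: constant per-device compute time plus a ring all-reduce of the gradients with no compute/communication overlap. The 30 ms compute time is an assumed placeholder, gradients are assumed to be fp32, and none of this reflects how the simulator itself computes its results.

```python
def ring_allreduce_s(num_bytes, n, bytes_per_s, latency_s):
    """Ring all-reduce: 2*(n-1) steps, each moving a chunk of size bytes/n."""
    if n < 2:
        return 0.0
    return 2 * (n - 1) * (latency_s + num_bytes / n / bytes_per_s)


def weak_scaling_efficiency(n, compute_s, grad_bytes, bytes_per_s, latency_s):
    """Efficiency (%) = single-device step time / n-device step time, no overlap."""
    step_s = compute_s + ring_allreduce_s(grad_bytes, n, bytes_per_s, latency_s)
    return 100.0 * compute_s / step_s


compute_s = 0.030       # assumption: 30 ms of compute per device per iteration
grad_bytes = 300e6 * 4  # assumption: fp32 gradients for 300M parameters
link = 100.0e9 / 8      # 100 Gbps link, in bytes per second

for n in [1, 2, 4, 8, 16, 32]:
    eff = weak_scaling_efficiency(n, compute_s, grad_bytes, link, latency_s=5e-6)
    print(f"{n:>2} devices: {eff:.1f}% efficiency")
```

In this model the all-reduce cost approaches a constant of roughly twice the gradient size divided by link bandwidth, so efficiency drops sharply from one to two devices and then flattens. Overlapping communication with the backward pass, which real frameworks commonly do, would raise the curve.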

## Conclusion

In this notebook, we've demonstrated the basic workflow for using the Distributed Training Simulator:

1. Creating and configuring the three core models (communication, compute, and memory)
2. Setting up a simulation environment
3. Running simulations and analyzing results
4. Visualizing the outcomes of simulations
5. Investigating scaling performance

This provides a foundation for more advanced simulations and analyses of distributed training systems.