# Studio 2: Hot-Reloading with Workspace Synchronization

Welcome to Studio 2! In this notebook, you'll learn one of Monarch's most powerful features: **workspace synchronization**.

## The Problem

In traditional distributed training:
1. You launch a multi-node job (takes 5-10 minutes)
2. You realize you need to change a config value (e.g., learning rate)
3. You have to **stop everything** and restart (another 5-10 minutes)
4. Rinse and repeat...

This is incredibly frustrating and wastes valuable time and compute resources!

## The Solution: Workspace Sync

With Monarch's `proc_mesh.sync_workspace()`:
1. Launch your multi-node job once
2. Edit configs or code **locally**
3. Run `sync_workspace()` to propagate changes to all remote nodes
4. Re-run training with updated configs - **no restart needed!**

## What You'll Learn

- How workspace synchronization works
- Creating and modifying training configs locally
- Syncing changes to remote worker nodes
- Verifying synchronization across the cluster
- Practical hot-reload workflows

## Prerequisites

**Required:** Complete [Studio 1: Getting Started](./studio_1_getting_started.ipynb) first!

You should have:
- A running multi-node Lightning job
- An initialized Monarch process mesh
- Basic understanding of Monarch actors

**New to Monarch?** Start with [Studio 0: Monarch Basics](./studio_0_monarch_basics.ipynb) to learn the fundamentals!

## Lightning Studios Series

This is **Studio 2** of the series:

- **[Studio 0: Monarch Basics](./studio_0_monarch_basics.ipynb)** - Learn Monarch fundamentals
- **[Studio 1: Getting Started](./studio_1_getting_started.ipynb)** - Multi-node training
- **Studio 2: Workspace Sync** - Hot-reload configs (YOU ARE HERE)
- **[Studio 3: Interactive Debugging](./studio_3_interactive_debugging.ipynb)** - Debug distributed systems

## Quick Recap from Studio 1

If you completed Studio 1, you should have:
- `job` - Your Lightning MMT job
- `proc_mesh` - Your Monarch process mesh
- `NUM_NODES` and `NUM_GPUS` configured

If you need to restart, run the setup cells from Studio 1 first.

Let's get started!

---

# Setup (If Starting Fresh)

If you're continuing from Studio 1, **skip this section**. If you're starting fresh, run these cells to set up your environment.

In [None]:
# Only run if starting fresh (not continuing from Studio 1)
from lightning_sdk import Machine, MMT, Studio
import os

NUM_NODES = 2
NUM_GPUS = 8
TEAMSPACE = "general"
USER = "your-username"
MMT_JOB_NAME = f"Monarch-MMT-{NUM_NODES}-nodes"
REMOTE_ALLOWED_PORT_RANGE = "26601..26611"

os.environ["MONARCH_V0_WORKAROUND_DO_NOT_USE"] = "1"
os.environ["MONARCH_FILE_LOG"] = "debug"

# Launch job (see Studio 1 for full details). Later cells need `proc_mesh`,
# so uncomment and complete these calls before continuing.
# job, studio = launch_mmt_job(...)
# proc_mesh = setup_proc_mesh_from_job(job, NUM_NODES, NUM_GPUS)

---

# Workspace Synchronization Workflow

Let's dive into workspace sync with a practical example!

## Define File Checker Actor

First, we'll create an actor that can read and verify file contents on remote nodes. This helps us confirm that files are properly synchronized.

In [None]:
from monarch.actor import Actor, endpoint, current_rank
import os
import socket


class FileCheckerActor(Actor):
    """Actor to read and verify file contents on remote nodes."""

    def __init__(self):
        self.rank = current_rank().rank
        self.hostname = socket.gethostname()

    @endpoint
    def read_file(self, file_path: str) -> dict:
        """Read a file and return its contents."""
        try:
            with open(file_path, 'r') as f:
                content = f.read()
            return {
                "rank": self.rank,
                "hostname": self.hostname,
                "file_path": file_path,
                "content": content,
                "exists": True,
                "size": len(content)
            }
        except FileNotFoundError:
            return {
                "rank": self.rank,
                "hostname": self.hostname,
                "file_path": file_path,
                "exists": False,
                "error": "File not found"
            }
        except Exception as e:
            return {
                "rank": self.rank,
                "hostname": self.hostname,
                "file_path": file_path,
                "exists": False,
                "error": str(e)
            }

    @endpoint
    def file_exists(self, file_path: str) -> dict:
        """Check if a file exists on the remote node."""
        exists = os.path.exists(file_path)
        return {
            "rank": self.rank,
            "hostname": self.hostname,
            "file_path": file_path,
            "exists": exists
        }
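
An existence check cannot tell a stale copy from a fresh one. The sketch below shows one way to fingerprint a file so that copies can be compared by digest. `describe_file` is a hypothetical helper, not part of Monarch; its return value could be exposed through an extra endpoint on `FileCheckerActor` if you find it useful.

```python
import hashlib
import os
import tempfile


def describe_file(file_path: str) -> dict:
    """Return existence, size in bytes, and a SHA-256 digest for a file.

    Hypothetical helper for illustration (not a Monarch API).
    """
    if not os.path.isfile(file_path):
        return {"file_path": file_path, "exists": False}
    with open(file_path, "rb") as f:
        data = f.read()
    return {
        "file_path": file_path,
        "exists": True,
        "size": len(data),
        "sha256": hashlib.sha256(data).hexdigest(),
    }


# Local demonstration with a temporary file
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.toml")
    with open(demo_path, "w") as f:
        f.write("learning_rate = 0.001\n")
    before = describe_file(demo_path)

    # Overwrite the file, as a later edit would
    with open(demo_path, "w") as f:
        f.write("learning_rate = 0.0005\n")
    after = describe_file(demo_path)

    missing = describe_file(os.path.join(tmp, "nope.toml"))

print("digest changed:", before["sha256"] != after["sha256"])
print("missing file exists:", missing["exists"])
```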

## Spawn File Checker Actor

Spawn the file checker actor across all nodes.

In [None]:
# Spawn the file checker actor
file_checker = proc_mesh.spawn("file_checker", FileCheckerActor)
print("FileCheckerActor spawned across all nodes")

## Create a Local Configuration File

Let's create a training configuration file locally. This simulates a common workflow where you want to tweak hyperparameters.

In [None]:
# Create a local workspace directory for our custom config
local_workspace = "/teamspace/studios/this_studio/monarch_sync_example"
os.makedirs(local_workspace, exist_ok=True)

# Create a custom training configuration file
config_file_name = "custom_training_config.toml"
local_config_path = os.path.join(local_workspace, config_file_name)

# Write initial configuration
initial_config = """# TorchTitan Custom Training Configuration
# Version 1.0 - Initial configuration

[training]
batch_size = 32
learning_rate = 0.001
max_steps = 100
warmup_steps = 10

[model]
model_type = "llama3_8b"
seq_len = 1024

[optimizer]
optimizer_type = "AdamW"
weight_decay = 0.01
"""

with open(local_config_path, 'w') as f:
    f.write(initial_config)

print(f"✓ Created local config file: {local_config_path}")
print(f"\nInitial configuration:\n{'-'*50}")
print(initial_config)
print(f"{'-'*50}")

## Create Workspace and Perform Initial Sync

Now we'll create a Monarch `Workspace` object and sync our local directory to all remote nodes.

**This is the magic step!** 🪄

In [None]:
from monarch.tools.config.workspace import Workspace
from pathlib import Path

# Create a Workspace object pointing to our local directory
workspace = Workspace(dirs=[Path(local_workspace)])

print(f"Workspace configured: {workspace.dirs}")
print(f"\n🔄 Syncing workspace to {NUM_NODES * NUM_GPUS} remote processes...")

# Perform initial sync
await proc_mesh.sync_workspace(workspace=workspace, conda=False, auto_reload=False)

print("\n✅ Initial workspace sync completed!")

## Verify File on Remote Nodes

Let's verify that our config file was successfully synced to all remote worker nodes.

In [None]:
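# Construct the remote file path (files are synced to WORKSPACE_DIR).
# Note: this reads WORKSPACE_DIR from the local environment and assumes the
# remote nodes use the same value, falling back to /workspace.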
remote_workspace_root = os.environ.get("WORKSPACE_DIR", "/workspace")
remote_config_path = os.path.join(remote_workspace_root, "monarch_sync_example", config_file_name)

print(f"Checking file on remote nodes: {remote_config_path}\n")

# Check file existence on every rank (the loop below prints one line per node)
exists_results = await file_checker.file_exists.call(remote_config_path)

# Group by hostname to show node-level status
nodes_checked = set()
for result in exists_results:
    hostname = result['hostname']
    if hostname not in nodes_checked:
        status = "✓ EXISTS" if result['exists'] else "✗ NOT FOUND"
        print(f"  Node {hostname}: {status}")
        nodes_checked.add(hostname)

# Read file content from rank 0 to verify
print("\n📄 Reading config from rank 0:")
print(f"{'-'*50}")
read_results = await file_checker.read_file.call(remote_config_path)
if read_results[0]['exists']:
    print(read_results[0]['content'])
else:
    print(f"Error: {read_results[0].get('error', 'Unknown error')}")
print(f"{'-'*50}")
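
The remote path above is assembled by hand from the workspace root, the synced directory's name, and the file name. A small helper can make that mapping explicit. This is a sketch that assumes the layout used in this notebook, where a synced directory appears under the remote root by its own name. `remote_path_for` is a hypothetical helper, not a Monarch API.

```python
from pathlib import PurePosixPath


def remote_path_for(local_file: str, local_workspace: str,
                    remote_root: str = "/workspace") -> str:
    """Map a local file inside a synced directory to its expected remote path.

    Assumes the synced directory lands at <remote_root>/<directory name>,
    which mirrors how this notebook builds `remote_config_path`.
    """
    workspace_dir = PurePosixPath(local_workspace)
    # Raises ValueError if local_file is not inside local_workspace
    relative = PurePosixPath(local_file).relative_to(workspace_dir)
    return str(PurePosixPath(remote_root) / workspace_dir.name / relative)


local_ws = "/teamspace/studios/this_studio/monarch_sync_example"
print(remote_path_for(f"{local_ws}/custom_training_config.toml", local_ws))
print(remote_path_for(f"{local_ws}/extra/data.toml", local_ws, remote_root="/mnt/ws"))
```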

---

# Hot-Reload: Modify and Re-Sync

Now comes the powerful part! Let's modify our config locally and sync it again - **without restarting anything**.

## Modify Local Configuration

Let's say we want to:
- Decrease the learning rate (0.001 → 0.0005)
- Increase max steps (100 → 200)
- Change sequence length (1024 → 2048)

In [None]:
# Modify the configuration
updated_config = """# TorchTitan Custom Training Configuration
# Version 2.0 - Updated after initial run

[training]
batch_size = 32
learning_rate = 0.0005  # ← CHANGED: Reduced from 0.001
max_steps = 200          # ← CHANGED: Increased from 100
warmup_steps = 10

[model]
model_type = "llama3_8b"
seq_len = 2048           # ← CHANGED: Increased from 1024

[optimizer]
optimizer_type = "AdamW"
weight_decay = 0.01
"""

# Write updated config locally
with open(local_config_path, 'w') as f:
    f.write(updated_config)

print(f"✓ Updated local config file: {local_config_path}")
print(f"\nUpdated configuration:\n{'-'*50}")
print(updated_config)
print(f"{'-'*50}")
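
When a config changes between runs, a line-level diff is a quick way to review exactly what will be synced. The sketch below uses the standard library's `difflib` on two shortened versions of the config. The shortened strings are stand-ins for `initial_config` and `updated_config`.

```python
import difflib

old_config = """[training]
batch_size = 32
learning_rate = 0.001
max_steps = 100
"""

new_config = """[training]
batch_size = 32
learning_rate = 0.0005
max_steps = 200
"""

# Build a unified diff; lineterm="" avoids doubled newlines when joining
diff_lines = list(
    difflib.unified_diff(
        old_config.splitlines(),
        new_config.splitlines(),
        fromfile="config (v1)",
        tofile="config (v2)",
        lineterm="",
    )
)
print("\n".join(diff_lines))
```

Lines starting with `-` come from the old version and lines starting with `+` from the new one, so unintended edits stand out before they reach the cluster.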

## Re-Sync to Remote Nodes

Now sync the changes to all remote nodes. The sync itself takes only a moment for a small change like this, and no job restart is required!

In [None]:
print("🔄 Re-syncing updated workspace to remote nodes...")

# Sync again - Monarch only transfers what changed!
await proc_mesh.sync_workspace(workspace=workspace, conda=False, auto_reload=False)

print("\n✅ Workspace re-sync completed!")
print("\n💡 The updated config is now available on all remote nodes!")

## Verify Updated File on Remote Nodes

Let's confirm the updated config made it to the remote nodes.

In [None]:
print("📄 Reading updated config from rank 0:")
print(f"{'-'*50}")

read_results = await file_checker.read_file.call(remote_config_path)
if read_results[0]['exists']:
    remote_content = read_results[0]['content']
    print(remote_content)
    
    # Verify that all three changed values arrived
    expected_values = ["learning_rate = 0.0005", "max_steps = 200", "seq_len = 2048"]
    if all(value in remote_content for value in expected_values):
        print(f"{'-'*50}")
        print("\n✅ SUCCESS! Remote config matches local changes:")
        print("  ✓ Learning rate: 0.001 → 0.0005")
        print("  ✓ Max steps: 100 → 200")
        print("  ✓ Sequence length: 1024 → 2048")
    else:
        print(f"{'-'*50}")
        print("\n⚠️ Warning: Remote config may not have updated correctly")
else:
    print(f"Error: {read_results[0].get('error', 'Unknown error')}")
    print(f"{'-'*50}")
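
The check above inspects a single rank and looks for a few substrings. A stricter variant compares the full content reported by every rank against the local file. The sketch below assumes the per-rank results from `read_file` have already been collected into a plain list of dictionaries. How to do that from Monarch's return value is not covered here, and `find_stale_ranks` is a hypothetical helper.

```python
def find_stale_ranks(results: list, expected_content: str) -> list:
    """Return the ranks whose file is missing or differs from expected_content."""
    return [
        r["rank"]
        for r in results
        if not r.get("exists") or r.get("content") != expected_content
    ]


expected = "learning_rate = 0.0005\n"

# Fabricated sample results, shaped like the dicts returned by `read_file`
sample_results = [
    {"rank": 0, "exists": True, "content": "learning_rate = 0.0005\n"},
    {"rank": 1, "exists": True, "content": "learning_rate = 0.001\n"},  # stale copy
    {"rank": 2, "exists": False, "error": "File not found"},            # missing
    {"rank": 3, "exists": True, "content": "learning_rate = 0.0005\n"},
]

stale = find_stale_ranks(sample_results, expected)
print("stale ranks:", stale)
```

An empty list means every rank sees exactly the content you wrote locally.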

---

# Real-World Workflow Example

Here's how you'd use workspace sync in a real training scenario:

## Workflow: Iterative Training with Config Changes

```python
# 1. Initial training run
await async_main(config)  # Train with initial settings

# 2. Review results, decide to adjust learning rate
# Edit local config file...

# 3. Sync changes (< 1 second)
await proc_mesh.sync_workspace(workspace=workspace)

# 4. Re-run training with new config (no restart!)
config = config_manager.parse_args(manual_args)  # Reload config
await async_main(config)  # Train with updated settings

# 5. Repeat as needed!
```
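
To check the sync time on your own cluster instead of relying on an estimate, you can wrap the call in a timer. The sketch below is a generic helper for any awaitable. `timed` is a hypothetical name, and `fake_sync` is a stand-in so the example runs without a cluster.

```python
import asyncio
import time


async def timed(label: str, awaitable):
    """Await something, print how long it took, and return (result, seconds)."""
    start = time.perf_counter()
    result = await awaitable
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.3f}s")
    return result, elapsed


async def fake_sync():
    """Stand-in for a real sync call so this sketch runs anywhere."""
    await asyncio.sleep(0.05)
    return "synced"


# In a plain script; inside a notebook, use `await timed(...)` directly
result, elapsed = asyncio.run(timed("sync", fake_sync()))
```

In this notebook, where an event loop is already running, the equivalent would be `await timed("sync", proc_mesh.sync_workspace(workspace=workspace))`.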

### Time Savings

**Without Monarch:**
- Change config: 1 min
- Stop job: 1 min
- Restart job: 5-10 min
- **Total per iteration: ~7-12 min**

**With Monarch:**
- Change config: 1 min
- Sync: < 1 sec
- **Total per iteration: ~1 min**

**10x faster iteration!** 🚀
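
The arithmetic behind that estimate, using only the illustrative figures listed above (with "< 1 sec" taken as a full second):

```python
EDIT_MIN = 1            # change config
STOP_MIN = 1            # stop job
RESTART_MIN = (5, 10)   # restart job, low and high estimates
SYNC_MIN = 1 / 60       # "< 1 sec", treated as an upper bound of one second

# Per-iteration cost without and with workspace sync, in minutes
without_sync = tuple(EDIT_MIN + STOP_MIN + r for r in RESTART_MIN)
with_sync = EDIT_MIN + SYNC_MIN

speedup = tuple(w / with_sync for w in without_sync)
saved_over_10 = tuple(10 * (w - with_sync) for w in without_sync)

print(f"Without sync: {without_sync[0]}-{without_sync[1]} min per iteration")
print(f"Speedup: {speedup[0]:.1f}x to {speedup[1]:.1f}x")
print(f"Saved over 10 iterations: {saved_over_10[0]:.1f} to {saved_over_10[1]:.1f} min")
# Without sync: 7-12 min per iteration
# Speedup: 6.9x to 11.8x
# Saved over 10 iterations: 59.8 to 109.8 min
```

The editing time dominates once restarts are gone, so the gain depends mostly on how long your job takes to relaunch.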

## Advanced: Syncing Multiple Files and Directories

You can sync entire directory trees, not just single files!

In [None]:
# Example: Sync multiple directories
from pathlib import Path

# Create a workspace with multiple directories
multi_dir_workspace = Workspace(dirs=[
    Path("/teamspace/studios/this_studio/configs"),
    Path("/teamspace/studios/this_studio/custom_modules"),
    Path("/teamspace/studios/this_studio/data_processors"),
])

# Sync all directories at once (uncomment once these directories exist locally)
# await proc_mesh.sync_workspace(workspace=multi_dir_workspace)

print("\n💡 Tip: You can sync entire project directories, not just config files!")
print("This enables hot-reloading of:")
print("  • Training scripts")
print("  • Model definitions")
print("  • Data preprocessing code")
print("  • Custom layers and modules")
print("  • And more!")
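
One caveat when syncing Python code instead of configs: a process that has already imported a module keeps the old version in memory even after the file on disk changes. The sketch below demonstrates this with plain Python and `importlib.reload`. It says nothing about what the `auto_reload` option of `sync_workspace` does, and the module name `hot_module_demo` is made up for the demonstration.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Skip writing .pyc files so a cached bytecode file cannot mask the edit
sys.dont_write_bytecode = True

with tempfile.TemporaryDirectory() as tmp:
    module_path = Path(tmp) / "hot_module_demo.py"
    module_path.write_text("SCALE = 1\n")

    sys.path.insert(0, tmp)
    importlib.invalidate_caches()
    try:
        import hot_module_demo

        first = hot_module_demo.SCALE

        # Simulate a sync overwriting the file on disk
        module_path.write_text("SCALE = 20\n")
        still_old = hot_module_demo.SCALE   # the in-memory module is unchanged

        importlib.reload(hot_module_demo)   # re-executes the updated source
        reloaded = hot_module_demo.SCALE
    finally:
        # Clean up so the demo leaves no trace in the interpreter
        sys.path.remove(tmp)
        sys.modules.pop("hot_module_demo", None)

print(first, still_old, reloaded)
```

Config files read from disk at the start of each run do not have this problem, which is one reason they are the simplest thing to hot-reload.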

---

# 🎉 Congratulations! 🎉

You've mastered **workspace synchronization** with Monarch!

## What You Learned

- Creating a Monarch `Workspace` for local directories
- Syncing files to remote nodes with `proc_mesh.sync_workspace()`
- Verifying synchronization across the cluster
- Hot-reloading configs without job restarts
- Real-world iterative training workflows

## Key Takeaways

- **10x faster iteration** - No more waiting for job restarts
- **Edit locally, run remotely** - Keep your familiar dev environment
- **Sync is smart** - Only changed files are transferred
- **Works with any files** - Configs, code, data processors, etc.

## Next Steps

### 🐛 Studio 3: Interactive Debugging (Recommended Next)
Learn advanced debugging techniques in [Studio 3: Interactive Debugging](./studio_3_interactive_debugging.ipynb):
- Set breakpoints in distributed actors
- Debug specific ranks with `monarch debug`
- Inspect and modify environment variables
- Troubleshoot training issues interactively

### 📚 Back to Studio 1
Review the basics: [Studio 1: Getting Started](./studio_1_getting_started.ipynb)

---

## Try It Yourself!

Before moving on, try modifying the config one more time:
1. Change the batch size to 64
2. Sync the workspace
3. Verify the changes

This workflow will become second nature!
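
For step 1, you can edit the file by hand or change the value programmatically. The sketch below rewrites a single `key = value` line in TOML text while keeping any trailing comment. `set_toml_value` is a hypothetical helper that assumes simple one-line values and changes the first matching key it finds, whichever table it belongs to. Steps 2 and 3 reuse the sync and verification cells from earlier in this notebook.

```python
import re


def set_toml_value(text: str, key: str, new_value: str) -> str:
    """Replace the value on the first `key = ...` line, preserving comments."""
    pattern = rf"^([ \t]*{re.escape(key)}[ \t]*=[ \t]*)[^#\n]*?([ \t]*(?:#.*)?)$"
    updated, count = re.subn(
        pattern,
        lambda m: m.group(1) + new_value + m.group(2),
        text,
        count=1,
        flags=re.MULTILINE,
    )
    if count == 0:
        raise KeyError(f"{key!r} not found in config text")
    return updated


config_text = """[training]
batch_size = 32
learning_rate = 0.0005  # reduced from 0.001
"""

new_text = set_toml_value(config_text, "batch_size", "64")
print(new_text)
```

Writing `new_text` back to `local_config_path` and re-running the sync cell completes the exercise.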