# Interactive GPU Computing with srun on Explorer Cluster

This notebook provides a comprehensive guide to using `srun` for interactive GPU computing on Northeastern's Explorer cluster. You'll learn how to request GPU resources, monitor usage, and optimize your interactive sessions.

## Table of Contents
1. [Understanding srun](#understanding-srun)
2. [Basic GPU Requests](#basic-gpu-requests)
3. [Advanced Resource Configuration](#advanced-resource-configuration)
4. [Monitoring Your Session](#monitoring-session)
5. [Best Practices](#best-practices)
6. [Common Use Cases](#common-use-cases)
7. [Troubleshooting](#troubleshooting)

## 1. Understanding srun <a name="understanding-srun"></a>

The `srun` command asks the Slurm scheduler for resources and runs a command on them. When that command is a shell started with `--pty`, the job is **interactive**: unlike a batch job, it gives you a shell session on a compute node where you can run commands in real time.

### Key Benefits:
- Real-time interaction with compute nodes
- Perfect for development, debugging, and testing
- Immediate feedback from your code
- Ideal for Jupyter notebooks and interactive Python sessions

### Basic Syntax:
```bash
srun [options] [command]
```

## 2. Basic GPU Requests <a name="basic-gpu-requests"></a>

### 2.1 Simple GPU Request

The most basic way to request a GPU for interactive use:

```bash
srun --partition=gpu-interactive --nodes=1 --ntasks=1 --gres=gpu:1 --mem=8G --time=01:00:00 --pty /bin/bash
```

**Breakdown:**
- `--partition=gpu-interactive`: Use the interactive GPU partition
- `--nodes=1`: Request 1 compute node
- `--ntasks=1`: Run 1 task
- `--gres=gpu:1`: Request 1 GPU
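- `--mem=8G`: Request 8 GB of host (CPU) memory on the node; this is separate from the GPU's own memory
- `--time=01:00:00`: Set a 1-hour time limit, after which the session is terminated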
- `--pty /bin/bash`: Start an interactive bash shell
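
The options in the breakdown above can also be assembled programmatically, which helps when you reuse the same request with small variations. The following is a minimal sketch, not part of any cluster tooling: `build_srun_command` is a hypothetical helper, and its defaults simply mirror the values used in this notebook.

```python
import shlex

def build_srun_command(partition="gpu-interactive", gpus=1, mem=None,
                       time="01:00:00", nodes=None, ntasks=None,
                       cpus_per_task=None, shell="/bin/bash"):
    """Assemble an interactive srun command string (hypothetical helper)."""
    args = ["srun", f"--partition={partition}"]
    # Optional flags are emitted only when a value is given.
    if nodes is not None:
        args.append(f"--nodes={nodes}")
    if ntasks is not None:
        args.append(f"--ntasks={ntasks}")
    args.append(f"--gres=gpu:{gpus}")
    if mem is not None:
        args.append(f"--mem={mem}")
    if cpus_per_task is not None:
        args.append(f"--cpus-per-task={cpus_per_task}")
    args.append(f"--time={time}")
    args += ["--pty", shell]
    # shlex.join quotes any argument that would need it in a shell.
    return shlex.join(args)

# Reproduces the simple GPU request from section 2.1.
print(build_srun_command(nodes=1, ntasks=1, mem="8G"))
```

The function only builds the string; paste the printed command into a terminal on the login node to run it.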

### 2.2 Quick GPU Session

For quick testing (15 minutes):

```bash
srun --partition=gpu-interactive --gres=gpu:1 --time=00:15:00 --pty bash
```

### 2.3 Extended Development Session

For longer development work (4 hours with more resources):

```bash
srun --partition=gpu-interactive --nodes=1 --ntasks=1 --gres=gpu:1 --mem=32G --cpus-per-task=4 --time=04:00:00 --pty /bin/bash
```

## 3. Advanced Resource Configuration <a name="advanced-resource-configuration"></a>

### 3.1 Multiple GPUs

Request multiple GPUs for multi-GPU training:

```bash
# Request 2 GPUs
srun --partition=gpu-interactive --gres=gpu:2 --mem=64G --cpus-per-task=8 --time=02:00:00 --pty bash

# Request 4 GPUs
srun --partition=gpu-interactive --gres=gpu:4 --mem=128G --cpus-per-task=16 --time=02:00:00 --pty bash
```
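
The two requests above scale memory and CPUs linearly with the GPU count (32G and 4 CPUs per GPU). A small sketch of that rule follows; `multi_gpu_resources` is a hypothetical helper, and the per-GPU ratios are taken from these examples rather than from any cluster policy.

```python
def multi_gpu_resources(gpus, mem_per_gpu_gb=32, cpus_per_gpu=4):
    """Scale --mem and --cpus-per-task with the GPU count (hypothetical helper)."""
    if gpus < 1:
        raise ValueError("gpus must be at least 1")
    return {
        "gres": f"gpu:{gpus}",
        "mem": f"{gpus * mem_per_gpu_gb}G",
        "cpus_per_task": gpus * cpus_per_gpu,
    }

for n in (2, 4):
    r = multi_gpu_resources(n)
    print(f"--gres={r['gres']} --mem={r['mem']} --cpus-per-task={r['cpus_per_task']}")
```

Adjust the two ratios if your workload needs more data-loading CPUs or less host memory per GPU.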

### 3.2 Specific GPU Types

Request specific GPU models if available:

```bash
# Request specific GPU type (if available)
srun --partition=gpu-interactive --gres=gpu:a100:1 --mem=32G --time=02:00:00 --pty bash
srun --partition=gpu-interactive --gres=gpu:v100:1 --mem=32G --time=02:00:00 --pty bash
```
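
To see which GPU type names a partition actually offers, one option is to read the GRES column printed by `sinfo -p gpu-interactive -o "%N %G"`. The sketch below parses such output; the sample text, including the node names, is hypothetical, and `gpu_types_from_sinfo` is an illustrative helper rather than an existing tool.

```python
import re

def gpu_types_from_sinfo(text):
    """Collect GPU type names from sinfo's GRES column (hypothetical helper)."""
    # Typed entries look like gpu:<type>:<count>; untyped ones (gpu:4) are skipped.
    return sorted(set(re.findall(r"gpu:([\w.-]+):\d+", text)))

# Hypothetical sample output; real node names and GPU types will differ.
sample = """
NODELIST GRES
nodeA[01-04] gpu:a100:4
nodeB[01-02] gpu:v100:2(S:0-1)
nodeC01 (null)
"""
print(gpu_types_from_sinfo(sample))
```

Any name the function returns can be used in the middle field of `--gres=gpu:<type>:<count>`.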

### 3.3 Memory and CPU Configuration

Different memory configurations for different workloads:

```bash
# Light workload (small models, testing)
srun --partition=gpu-interactive --gres=gpu:1 --mem=16G --cpus-per-task=2 --time=01:00:00 --pty bash

# Medium workload (standard training)
srun --partition=gpu-interactive --gres=gpu:1 --mem=32G --cpus-per-task=4 --time=02:00:00 --pty bash

# Heavy workload (large models, data processing)
srun --partition=gpu-interactive --gres=gpu:1 --mem=64G --cpus-per-task=8 --time=04:00:00 --pty bash
```

## 4. Monitoring Your Session <a name="monitoring-session"></a>

### 4.1 Check Your Job Status

After submitting an `srun` request, check its status from a second terminal on the login node, because `srun` occupies the terminal it was started in:

In [None]:
# Run this in a terminal to check your jobs
# squeue -u $USER

# Example output explanation
print("""
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
12345 gpu-inter     bash username  R       5:23      1 gpu-node-01

Status meanings:
- R: Running
- PD: Pending (waiting for resources)
- CG: Completing
""")

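If you want to work with `squeue` output inside Python, the columns shown above can be split into dictionaries. This is a rough sketch that assumes the default column layout and job names without spaces; `parse_squeue` and `STATE_NAMES` are hypothetical names, and only the three state codes listed above are mapped.

```python
STATE_NAMES = {"R": "Running", "PD": "Pending", "CG": "Completing"}

def parse_squeue(text):
    """Parse default squeue output into one dict per job (hypothetical helper)."""
    lines = [line for line in text.strip().splitlines() if line.strip()]
    header = lines[0].split()
    jobs = []
    for line in lines[1:]:
        # The last column can contain spaces for pending jobs, so cap the split.
        fields = line.split(None, len(header) - 1)
        job = dict(zip(header, fields))
        job["STATE"] = STATE_NAMES.get(job["ST"], job["ST"])
        jobs.append(job)
    return jobs

sample = """
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
12345 gpu-inter     bash username  R       5:23      1 gpu-node-01
"""
for job in parse_squeue(sample):
    print(job["JOBID"], job["STATE"], job["NODELIST(REASON)"])
```

To feed it live data, the text could come from `subprocess.run(["squeue", "-u", user], capture_output=True, text=True).stdout` on a node where `squeue` is available.
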
### 4.2 Monitor GPU Usage

Once you're in your interactive session, monitor GPU usage:

In [None]:
# Commands to run in your srun session
print("""
# Check GPU status
nvidia-smi

# Monitor GPU usage continuously (update every 2 seconds)
watch -n 2 nvidia-smi

# Check GPU memory usage
nvidia-smi --query-gpu=memory.used,memory.total --format=csv
""")

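The CSV output of the `--query-gpu` command above is easy to turn into numbers. The sketch below parses a captured sample and computes the percentage of GPU memory in use; the sample values are made up, and `parse_gpu_memory` is a hypothetical helper.

```python
import csv
import io

def parse_gpu_memory(csv_text):
    """Return (used_mib, total_mib, percent_used) per GPU (hypothetical helper)."""
    reader = csv.reader(io.StringIO(csv_text.strip()), skipinitialspace=True)
    next(reader)  # skip the header row
    usage = []
    for row in reader:
        if not row:
            continue
        # Each cell looks like "1024 MiB"; keep only the number.
        used, total = (int(cell.split()[0]) for cell in row)
        usage.append((used, total, round(100 * used / total, 1)))
    return usage

# Made-up sample in the format printed by --format=csv.
sample = """memory.used [MiB], memory.total [MiB]
1024 MiB, 40960 MiB
"""
for used, total, pct in parse_gpu_memory(sample):
    print(f"{used} / {total} MiB ({pct}%)")
```

Unlike the PyTorch counters, these figures cover every process on the GPU, not only your own Python session.
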
### 4.3 Python GPU Monitoring

Monitor GPU usage from within Python:

In [None]:
import torch

def check_gpu_status():
    """Check GPU availability and PyTorch memory usage on the current device"""
    if torch.cuda.is_available():
        device = torch.cuda.current_device()
        print("GPU Available: True")
        print(f"GPU Count: {torch.cuda.device_count()}")
        print(f"Current GPU: {device}")
        print(f"GPU Name: {torch.cuda.get_device_name(device)}")

        # Memory held by PyTorch in this process only (other processes are not counted)
        memory_allocated = torch.cuda.memory_allocated(device) / 1024**3  # GiB
        memory_reserved = torch.cuda.memory_reserved(device) / 1024**3  # GiB
        print(f"Memory Allocated: {memory_allocated:.2f} GB")
        print(f"Memory Reserved: {memory_reserved:.2f} GB")
    else:
        print("No GPU available")

# Run the check
check_gpu_status()

## 5. Best Practices <a name="best-practices"></a>

### 5.1 Resource Optimization

- **Start small**: Begin with minimal resources and scale up as needed
- **Time limits**: Request only the time you actually need
- **Memory**: Don't over-request memory - it affects queue times
- **CPUs**: Match CPU count to your workload (typically 2-4 CPUs per GPU)
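
When estimating how much time to request, it can help to convert a duration into the string that `--time` expects. The following is a small sketch using the `[days-]HH:MM:SS` form; `to_slurm_time` is a hypothetical helper name.

```python
from datetime import timedelta

def to_slurm_time(duration):
    """Format a timedelta as a [days-]HH:MM:SS string for --time (hypothetical helper)."""
    total = int(duration.total_seconds())
    if total <= 0:
        raise ValueError("duration must be positive")
    days, rem = divmod(total, 86400)
    hours, rem = divmod(rem, 3600)
    minutes, seconds = divmod(rem, 60)
    hms = f"{hours:02d}:{minutes:02d}:{seconds:02d}"
    # Slurm writes whole days as a prefix separated by a hyphen.
    return f"{days}-{hms}" if days else hms

print(to_slurm_time(timedelta(minutes=15)))
print(to_slurm_time(timedelta(hours=4)))
```

Whether a given limit is accepted still depends on the partition's maximum, which `sinfo -p gpu-interactive` reports in its TIMELIMIT column.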

### 5.2 Queue Time Optimization

```bash
# Check queue status before submitting
sinfo -p gpu-interactive

# Check estimated start time
squeue -u $USER --start
```

### 5.3 Session Management

- Start `screen` or `tmux` on the login node before running `srun`, so the session survives a dropped SSH connection
- Save your work frequently
- Clean up GPU memory when switching tasks

In [None]:
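# GPU memory cleanup in Python
import gc
import torch

def cleanup_gpu():
    """Release cached GPU memory that PyTorch is no longer using"""
    if torch.cuda.is_available():
        # empty_cache() cannot free memory still referenced by live tensors,
        # so delete those references (e.g. `del model`) before calling this
        gc.collect()
        torch.cuda.empty_cache()
        print("GPU cache cleared")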
    else:
        print("No GPU to clean up")

# Use this between different experiments
cleanup_gpu()

## 6. Common Use Cases <a name="common-use-cases"></a>

### 6.1 Jupyter Notebook Session

Start a GPU session for Jupyter notebooks:

```bash
# Request GPU resources
srun --partition=gpu-interactive --gres=gpu:1 --mem=32G --time=03:00:00 --pty bash

# Once in the session, load your environment and start Jupyter
module load anaconda3
conda activate ml_course_env
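# Jupyter runs on the compute node, so reaching it from your own machine
# requires forwarding this port over SSH
jupyter notebook --no-browser --port=8888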
```

### 6.2 Interactive Python Development

```bash
# Start interactive session
srun --partition=gpu-interactive --gres=gpu:1 --mem=16G --time=02:00:00 --pty bash

# Load environment and start Python
module load anaconda3
conda activate ml_course_env
python
```

### 6.3 Model Testing and Debugging

```bash
# Quick testing session
srun --partition=gpu-interactive --gres=gpu:1 --mem=16G --time=01:00:00 --pty bash

# Run your test script
python test_model.py
```

## 7. Troubleshooting <a name="troubleshooting"></a>

### 7.1 Common Issues and Solutions

**Issue: Job stays in pending (PD) state**
```bash
# Check why job is pending
squeue -u $USER
squeue -j <JOBID> --start

# Common values in the NODELIST(REASON) column:
# - Resources: Requested resources are not currently available
# - Priority: Other jobs have higher priority
# - Limit-related codes (often starting with QOS or Assoc): Hit partition or user limits
```
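
The reason shown in the last `squeue` column can be translated into a short explanation. The sketch below covers a handful of standard Slurm reason codes; the wording of each explanation is my own, and `explain_pending` and `PENDING_REASONS` are hypothetical names.

```python
import re

PENDING_REASONS = {
    "Resources": "Requested resources are not currently available.",
    "Priority": "Other jobs have higher priority.",
    "Dependency": "The job is waiting for another job to finish.",
    "PartitionTimeLimit": "The requested time exceeds the partition limit.",
    "ReqNodeNotAvail": "A requested node is down, drained, or reserved.",
}

def explain_pending(nodelist_reason):
    """Explain the parenthesised reason from squeue's last column (hypothetical helper)."""
    match = re.fullmatch(r"\((\w+).*\)", nodelist_reason.strip())
    if match is None:
        # Running jobs show a node list here instead of a reason.
        return "Not pending: the column holds a node list."
    reason = match.group(1)
    return PENDING_REASONS.get(
        reason, f"Unrecognized reason '{reason}'; check scontrol show job <JOBID>."
    )

print(explain_pending("(Resources)"))
print(explain_pending("gpu-node-01"))
```

Codes not in the table fall through to a generic message, so the mapping can be extended as you encounter new reasons.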

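**Issue: Out of memory errors**
```bash
# Host (CPU) memory exhausted: request more memory with --mem
srun --partition=gpu-interactive --gres=gpu:1 --mem=64G --time=02:00:00 --pty bash

# GPU memory exhausted (e.g. "CUDA out of memory"): --mem does not help.
# Optimize your code to use less memory (smaller batches or models),
# or request a GPU type with more memory
```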

**Issue: Session disconnected**
```bash
# Start screen (or tmux) on the login node first, then run srun inside it.
# If srun runs outside screen, a dropped SSH connection ends the job.
screen -S gpu_session
srun --partition=gpu-interactive --gres=gpu:1 --mem=32G --time=02:00:00 --pty bash
# Your work here
# Ctrl+A, D to detach
# screen -r gpu_session to reattach (from the same login node)
```

### 7.2 Useful Commands for Troubleshooting

In [None]:
# Commands to run in terminal for troubleshooting
troubleshooting_commands = """
# Check partition information
sinfo -p gpu-interactive

# Check your job queue
squeue -u $USER

# Check job details
scontrol show job <JOBID>

# Check node information
sinfo -N -l

# Check your account limits
sacctmgr show assoc user=$USER

# Cancel a job if needed
scancel <JOBID>

# Check completed jobs
sacct -u $USER --starttime=today
"""

print(troubleshooting_commands)
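
The output of `scontrol show job <JOBID>` is a series of `Key=Value` pairs, which makes it simple to pull out individual fields. The following is a rough sketch: the sample is an abbreviated, hypothetical excerpt, `parse_scontrol` is an illustrative helper, and values that contain spaces are truncated by this simple split.

```python
def parse_scontrol(text):
    """Parse whitespace-separated Key=Value pairs (hypothetical helper)."""
    fields = {}
    for token in text.split():
        # Split on the first "=" only, so values such as cpu=4,mem=32G stay intact.
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return fields

# Abbreviated, hypothetical excerpt of scontrol show job output.
sample = """
JobId=12345 JobName=bash
   JobState=PENDING Reason=Resources Dependency=(null)
   RunTime=00:00:00 TimeLimit=02:00:00 TimeMin=N/A
   Partition=gpu-interactive
"""
info = parse_scontrol(sample)
print(info["JobState"], info["Reason"], info["TimeLimit"])
```

The `JobState` and `Reason` fields are usually the quickest way to see why a job has not started.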

## Quick Reference Card

### Essential srun Commands

| Purpose | Command |
|---------|----------|
| Quick GPU test | `srun --partition=gpu-interactive --gres=gpu:1 --time=00:15:00 --pty bash` |
| Standard development | `srun --partition=gpu-interactive --gres=gpu:1 --mem=32G --time=02:00:00 --pty bash` |
| Heavy workload | `srun --partition=gpu-interactive --gres=gpu:1 --mem=64G --cpus-per-task=8 --time=04:00:00 --pty bash` |
| Multi-GPU | `srun --partition=gpu-interactive --gres=gpu:2 --mem=64G --time=02:00:00 --pty bash` |

### Monitoring Commands

| Purpose | Command |
|---------|----------|
| Check your jobs | `squeue -u $USER` |
| GPU status | `nvidia-smi` |
| Partition info | `sinfo -p gpu-interactive` |
| Cancel job | `scancel <JOBID>` |

---

**Next Steps:**
- Try the basic GPU request examples above
- Monitor your resource usage
- Move to batch jobs (`sbatch`) for longer-running tasks
- Check out the job monitoring guide for advanced management