# Batch Job Submission with sbatch on Explorer Cluster

This notebook teaches you how to create and submit batch jobs using `sbatch` for long-running machine learning tasks on the Explorer cluster.

## Table of Contents
1. [Understanding sbatch](#understanding-sbatch)
2. [Basic Job Scripts](#basic-job-scripts)
3. [GPU Job Scripts](#gpu-job-scripts)
4. [Advanced Configurations](#advanced-configurations)
5. [Job Arrays](#job-arrays)
6. [Best Practices](#best-practices)

## 1. Understanding sbatch <a name="understanding-sbatch"></a>

The `sbatch` command submits a **batch job**: a script that Slurm queues and then runs without user interaction once the requested resources are available. It is well suited to:
- Long training runs
- Parameter sweeps
- Production workflows
- Automated experiments

### Key Benefits:
- Jobs keep running after you log out
- Resources are allocated only for the duration of the job
- Standard output and errors are written to log files automatically
- Several jobs can be queued and run at the same time

## 2. Basic Job Scripts <a name="basic-job-scripts"></a>

### 2.1 Simple CPU Job

Create a basic job script:

In [None]:
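# Create a simple job script.
# The shebang must be the first line of the file, so the string starts
# right after the opening quotes (a leading blank line makes sbatch reject it).
simple_job_script = '''#!/bin/bash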
#SBATCH --job-name=simple_job
#SBATCH --partition=short
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G
#SBATCH --time=02:00:00
#SBATCH --output=simple_job_%j.out
#SBATCH --error=simple_job_%j.err

# Load modules
module load anaconda3

# Activate environment
conda activate ml_course_env

# Run your script
python my_script.py
'''

# Save the script
with open('simple_job.sh', 'w') as f:
    f.write(simple_job_script)

print("Created simple_job.sh")
print("Submit with: sbatch simple_job.sh")
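The script can also be submitted from the notebook itself. The sketch below is one way to do that, assuming `sbatch` is on the `PATH` of the machine running the notebook; the helper names `parse_job_id` and `submit` are illustrative and not part of any Slurm API. By default `sbatch` prints `Submitted batch job <JOBID>`, which the helper parses so the ID can be reused with `squeue` or `scancel`.

```python
import re
import shutil
import subprocess


def parse_job_id(sbatch_output: str) -> str:
    """Extract the job ID from sbatch's 'Submitted batch job <id>' message."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_output)
    if match is None:
        raise ValueError(f"Unexpected sbatch output: {sbatch_output!r}")
    return match.group(1)


def submit(script_path: str) -> str:
    """Submit a job script with sbatch and return its job ID (hypothetical helper)."""
    if shutil.which("sbatch") is None:
        raise RuntimeError("sbatch not found; run this on a cluster login node")
    result = subprocess.run(
        ["sbatch", script_path],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if sbatch rejects the script
    )
    return parse_job_id(result.stdout)


# Usage on the cluster (left commented out so nothing is submitted by accident):
# job_id = submit("simple_job.sh")
# print("Job ID:", job_id)
```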

## 3. GPU Job Scripts <a name="gpu-job-scripts"></a>

### 3.1 Single GPU Training Job

In [None]:
# GPU training job script (the shebang must be the first line of the file)
gpu_job_script = '''#!/bin/bash
#SBATCH --job-name=gpu_training
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=4
#SBATCH --mem=32G
#SBATCH --time=08:00:00
#SBATCH --output=gpu_training_%j.out
#SBATCH --error=gpu_training_%j.err
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=your.email@northeastern.edu

# Print job info
echo "Job ID: $SLURM_JOB_ID"
echo "Node: $SLURM_NODELIST"
echo "GPU: $CUDA_VISIBLE_DEVICES"
echo "Start time: $(date)"

# Load modules
module load anaconda3
module load cuda/11.8

# Activate environment
conda activate ml_course_env

# Verify GPU
nvidia-smi

# Run training
python train_model.py --epochs 100 --batch-size 64 --lr 0.001

echo "End time: $(date)"
'''

with open('gpu_training.sh', 'w') as f:
    f.write(gpu_job_script)

print("Created gpu_training.sh")
print("Submit with: sbatch gpu_training.sh")

### 3.2 Multi-GPU Training Job

In [None]:
# Multi-GPU training script (the shebang must be the first line of the file)
multi_gpu_script = '''#!/bin/bash
#SBATCH --job-name=multi_gpu_training
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --gres=gpu:2
#SBATCH --cpus-per-task=8
#SBATCH --mem=64G
#SBATCH --time=12:00:00
#SBATCH --output=multi_gpu_%j.out
#SBATCH --error=multi_gpu_%j.err

# Load modules
module load anaconda3
module load cuda/11.8

# Activate environment
conda activate ml_course_env

# Check available GPUs
echo "Available GPUs: $CUDA_VISIBLE_DEVICES"
nvidia-smi

# Run multi-GPU training, one process per GPU
# (torchrun replaces the deprecated torch.distributed.launch)
torchrun --nproc_per_node=2 train_distributed.py
'''

with open('multi_gpu_training.sh', 'w') as f:
    f.write(multi_gpu_script)

print("Created multi_gpu_training.sh")
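`train_distributed.py` is not shown in this notebook. As a hedged sketch of how such a script typically finds its place among the worker processes: launchers such as `torchrun` export `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` to each process. The helper name `read_dist_env` and the single-process defaults are assumptions for illustration.

```python
import os
from typing import Mapping, NamedTuple


class DistEnv(NamedTuple):
    rank: int        # global index of this process
    local_rank: int  # index on this node, usually also the GPU index to use
    world_size: int  # total number of processes


def read_dist_env(env: Mapping[str, str] = os.environ) -> DistEnv:
    """Read launcher-provided variables, defaulting to a single-process run."""
    return DistEnv(
        rank=int(env.get("RANK", "0")),
        local_rank=int(env.get("LOCAL_RANK", "0")),
        world_size=int(env.get("WORLD_SIZE", "1")),
    )


# Example with a hand-written environment for the second of two processes.
print(read_dist_env({"RANK": "1", "LOCAL_RANK": "1", "WORLD_SIZE": "2"}))
```

In a real training script, `local_rank` would typically be passed to `torch.cuda.set_device` before the process group is initialized.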

## 4. Advanced Configurations <a name="advanced-configurations"></a>

### 4.1 Parameter Sweep Job

In [None]:
# Parameter sweep job (the shebang must be the first line of the file)
param_sweep_script = '''#!/bin/bash
#SBATCH --job-name=param_sweep
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=4
#SBATCH --mem=32G
#SBATCH --time=24:00:00
#SBATCH --output=param_sweep_%j.out
#SBATCH --error=param_sweep_%j.err

# Load environment
module load anaconda3
conda activate ml_course_env

# Parameter sweep: the 9 runs execute one after another and share the 24-hour limit
learning_rates=(0.001 0.01 0.1)
batch_sizes=(32 64 128)

for lr in "${learning_rates[@]}"; do
    for bs in "${batch_sizes[@]}"; do
        echo "Training with lr=$lr, batch_size=$bs"
        python train_model.py --lr "$lr" --batch-size "$bs" --experiment-name "lr_${lr}_bs_${bs}"
    done
done
'''

with open('param_sweep.sh', 'w') as f:
    f.write(param_sweep_script)

print("Created param_sweep.sh")
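Because the loop runs every combination one after another inside a single job, all of them share the `--time` limit. A quick check of the average budget per run, using the grid and limit from the script above:

```python
from itertools import product

learning_rates = [0.001, 0.01, 0.1]
batch_sizes = [32, 64, 128]
time_limit_hours = 24  # matches --time=24:00:00

# Every (lr, batch size) pair the nested bash loops will visit.
runs = list(product(learning_rates, batch_sizes))
budget_minutes = time_limit_hours * 60 / len(runs)

print(f"{len(runs)} runs, about {budget_minutes:.0f} minutes each")
# prints: 9 runs, about 160 minutes each
```

If a single run is expected to take longer than that, the later combinations will be cut off when the job reaches its time limit; a job array gives each combination its own limit instead.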

## 5. Job Arrays <a name="job-arrays"></a>

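A job array submits many near-identical tasks from a single script. Each task receives its own `SLURM_ARRAY_TASK_ID`, which the script can use to pick a seed or configuration: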

In [None]:
# Job array script (the shebang must be the first line of the file)
job_array_script = '''#!/bin/bash
#SBATCH --job-name=job_array
#SBATCH --partition=gpu
#SBATCH --array=1-10
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=4
#SBATCH --mem=32G
#SBATCH --time=04:00:00
#SBATCH --output=job_array_%A_%a.out
#SBATCH --error=job_array_%A_%a.err

# Load environment
module load anaconda3
conda activate ml_course_env

# Use array task ID for different seeds/configurations
SEED=$SLURM_ARRAY_TASK_ID

echo "Running job array task $SLURM_ARRAY_TASK_ID with seed $SEED"

# Run with different random seeds
python train_model.py --seed $SEED --experiment-name "run_$SEED"
'''

with open('job_array.sh', 'w') as f:
    f.write(job_array_script)

print("Created job_array.sh")
print("Submitting this creates one array job with 10 tasks, using seeds 1 through 10")
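The task ID can index into a parameter grid as well as serve as a seed. The sketch below maps `SLURM_ARRAY_TASK_ID` onto the learning-rate and batch-size grid from the parameter sweep; the function name `config_for_task` is hypothetical, and it assumes the array is declared as `--array=0-8` so that IDs start at zero.

```python
import os
from itertools import product

LEARNING_RATES = [0.001, 0.01, 0.1]
BATCH_SIZES = [32, 64, 128]
GRID = list(product(LEARNING_RATES, BATCH_SIZES))  # 9 combinations


def config_for_task(task_id: int) -> dict:
    """Return the learning rate and batch size for a zero-based array task ID."""
    if not 0 <= task_id < len(GRID):
        raise IndexError(f"task_id {task_id} is outside 0..{len(GRID) - 1}")
    lr, batch_size = GRID[task_id]
    return {"lr": lr, "batch_size": batch_size}


# Slurm sets SLURM_ARRAY_TASK_ID inside each array task; default to 0 elsewhere.
task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", "0"))
print(task_id, config_for_task(task_id))
```

Each task then trains exactly one combination under its own time limit, and the tasks can run in parallel when GPUs are free.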

## 6. Best Practices <a name="best-practices"></a>

### 6.1 Resource Estimation

In [None]:
# Resource estimation guidelines
resource_guide = """
Memory Guidelines (host RAM requested with --mem, not GPU memory):
- Small models (ResNet18): 16-32GB
- Medium models (ResNet50): 32-64GB
- Large models (Transformers): 64-128GB

Time Guidelines:
- Quick experiments: 1-4 hours
- Standard training: 8-24 hours
- Large experiments: 24-72 hours

CPU Guidelines:
- 2-4 CPUs per GPU for data loading
- More CPUs for heavy preprocessing
"""

print(resource_guide)

### 6.2 Job Submission and Management

In [None]:
# Job management commands
management_commands = """
# Submit job
sbatch job_script.sh

# Check job status
squeue -u $USER

# Check job details
scontrol show job <JOBID>

# Cancel job
scancel <JOBID>

# Cancel all your jobs
scancel -u $USER

# Check job history
sacct -u $USER --starttime=today

# Check job efficiency
seff <JOBID>
"""

print(management_commands)
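For scripted monitoring, `sacct` can emit pipe-delimited output with `--parsable2`. The sketch below parses that format; the sample text is made up for illustration and the helper name `parse_sacct` is hypothetical.

```python
def parse_sacct(text: str) -> list:
    """Parse `sacct --parsable2` output (header row plus '|'-separated rows)."""
    lines = [line for line in text.strip().splitlines() if line]
    if not lines:
        return []
    header = lines[0].split("|")
    return [dict(zip(header, line.split("|"))) for line in lines[1:]]


# Made-up sample of the output of:
#   sacct -u $USER --starttime=today --parsable2 --format=JobID,JobName,State,Elapsed
sample = """JobID|JobName|State|Elapsed
1001|gpu_training|COMPLETED|03:12:45
1001.batch|batch|COMPLETED|03:12:45
1002|param_sweep|TIMEOUT|24:00:10
"""

# Report every row that did not finish cleanly.
for row in parse_sacct(sample):
    if row["State"] != "COMPLETED":
        print(row["JobID"], row["State"])
# prints: 1002 TIMEOUT
```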

### 6.3 Example Training Script Integration

In [None]:
# Example Python training script that works well with sbatch
training_script_example = '''
import torch
import argparse
import os
import wandb

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--epochs', type=int, default=10)
    parser.add_argument('--batch-size', type=int, default=64)
    parser.add_argument('--lr', type=float, default=0.001)
    parser.add_argument('--seed', type=int, default=42)
    parser.add_argument('--experiment-name', type=str, default='experiment')
    args = parser.parse_args()
    
    # Set seed for reproducibility
    torch.manual_seed(args.seed)
    
    # Initialize wandb
    wandb.init(
        project="cluster-training",
        name=args.experiment_name,
        config=args
    )
    
    # Check GPU
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    print(f"Using device: {device}")
    
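    # Your training code here. The Linear layer is only a placeholder so the
    # script runs end to end; replace it with your real model and training loop.
    model = torch.nn.Linear(10, 1).to(device)
    # ...
    
    # Save model
    torch.save(model.state_dict(), f'model_{args.experiment_name}.pth')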
    
if __name__ == '__main__':
    main()
'''

with open('train_model.py', 'w') as f:
    f.write(training_script_example)

print("Created example train_model.py")

## Quick Reference

### Common SBATCH Directives

| Directive | Purpose | Example |
|-----------|---------|----------|
| `--job-name` | Job name | `--job-name=my_training` |
| `--partition` | Queue/partition | `--partition=gpu` |
| `--gres` | GPU resources | `--gres=gpu:1` |
| `--mem` | Memory | `--mem=32G` |
| `--time` | Time limit | `--time=08:00:00` |
| `--output` | Output file | `--output=job_%j.out` |
| `--error` | Error file | `--error=job_%j.err` |
| `--mail-type` | Email notifications | `--mail-type=END,FAIL` |
| `--array` | Job array | `--array=1-10` |
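
The scripts in this notebook repeat the same header with small variations. One way to cut the repetition, sketched here with a hypothetical helper `render_job_script`, is to generate the `#SBATCH` lines from a dictionary of the directives in the table:

```python
def render_job_script(directives: dict, commands: list) -> str:
    """Build a job script: shebang first, then #SBATCH lines, then commands."""
    lines = ["#!/bin/bash"]
    lines += [f"#SBATCH --{name}={value}" for name, value in directives.items()]
    lines.append("")  # blank line between header and body
    lines += commands
    return "\n".join(lines) + "\n"


script = render_job_script(
    {"job-name": "my_training", "partition": "gpu", "gres": "gpu:1",
     "mem": "32G", "time": "08:00:00", "output": "job_%j.out"},
    ["module load anaconda3", "python train_model.py"],
)
print(script)
```

Building the text line by line also guarantees that the shebang lands on the first line of the file.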

### Job Submission Workflow

1. **Create job script** with SBATCH directives
2. **Test locally** with a small dataset first
3. **Submit job** with `sbatch script.sh`
4. **Monitor progress** with `squeue -u $USER`
5. **Check outputs** in `.out` and `.err` files
6. **Analyze results** and iterate

---

**Next Steps:**
- Try submitting a simple job
- Monitor job progress
- Check the job monitoring guide for advanced management
- Scale up to job arrays for parameter sweeps