# GPU Training on SageMaker

This notebook shows how to launch GPU training jobs on SageMaker.

**Two Options:**
1. **Training Jobs** (Recommended) - Launch a separate GPU instance that shuts down automatically when the job ends
2. **Switch Instance** - Change the notebook instance to a GPU type (expensive: billed even while idle)

## Prerequisites

In [None]:
import boto3
import sagemaker
from datetime import datetime
import json
import os

# Initialize SageMaker session
session = sagemaker.Session()
role = sagemaker.get_execution_role()
region = session.boto_region_name
account_id = session.account_id()
bucket = f"gl-rl-model-sagemaker-{account_id}-{region}"

print(f"🎭 Role: {role}")
print(f"📍 Region: {region}")
print(f"📦 S3 Bucket: {bucket}")

## Option 1: Launch Training Job (Recommended)

✅ **Advantages:**
- You pay only while the job is running
- The instance shuts down automatically when training ends
- Spot instances can save up to about 70%
- Multiple jobs can run in parallel

In [None]:
# Training script is already in the scripts directory
# No need to copy - we'll reference it directly in the estimator

In [None]:
# Upload training data to S3
!aws s3 cp ../../data/training/query_pairs.jsonl s3://{bucket}/data/training/
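Before uploading, it can be worth checking that the JSONL file parses cleanly, since a malformed line would otherwise only surface after the GPU instance has started. The following is a minimal sketch; `validate_jsonl` is an illustrative helper, and it assumes each line holds one JSON object (the notebook does not describe the record schema, so no field names are checked).

```python
import json


def validate_jsonl(path):
    """Return (valid_count, errors); errors is a list of (line_number, message)."""
    valid = 0
    errors = []
    with open(path, encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            if not line.strip():
                continue  # ignore blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                errors.append((line_number, str(exc)))
                continue
            # Assumption: every record is a JSON object (dict), not a list or scalar.
            if not isinstance(record, dict):
                errors.append((line_number, "record is not a JSON object"))
                continue
            valid += 1
    return valid, errors
```

For example, `validate_jsonl("../../data/training/query_pairs.jsonl")` returns the number of usable records and the line numbers that need fixing.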

In [None]:
from sagemaker.pytorch import PyTorch

# Configure training job
estimator = PyTorch(
    entry_point='train.py',
    source_dir='../scripts',  # Point to scripts directory
    role=role,
    
    # GPU Instance - Choose one:
    instance_type='ml.g5.xlarge',     # A10G 24GB VRAM ($0.30/hr spot)
    # instance_type='ml.g4dn.xlarge',  # T4 16GB VRAM ($0.35/hr spot)
    # instance_type='ml.p3.2xlarge',   # V100 16GB VRAM ($1.15/hr spot)
    
    instance_count=1,
    
    # Framework
    framework_version='2.0',
    py_version='py310',
    
    # Hyperparameters
    hyperparameters={
        'model_name': 'Qwen/Qwen2.5-Coder-1.5B-Instruct',
        'epochs': 3,
        'batch_size': 4,
        'learning_rate': 3e-5,
        'lora_r': 8,
        'lora_alpha': 16,
        'gradient_checkpointing': True,
        'fp16': True,
    },
    
    # Cost Optimization - Use Spot Instances
    use_spot_instances=True,
    max_wait=86400,  # 24 hours
    max_run=86400,   # 24 hours
    
    # Output
    output_path=f's3://{bucket}/output',
    base_job_name='gl-rl-model-gpu',
    
    # Checkpointing for spot interruption recovery
    checkpoint_s3_uri=f's3://{bucket}/checkpoints',
    checkpoint_local_path='/opt/ml/checkpoints',
)

print("✅ Estimator configured")
print("💰 Cost estimate:")
print("  ml.g5.xlarge spot: ~$0.30/hour")
print("  Training time: ~2-4 hours")
print("  Total cost: ~$0.60-$1.20")
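The estimator passes each entry in `hyperparameters` to `train.py` as a `--key value` command-line argument, and booleans arrive as strings (typically `True`/`False`). The notebook does not show `train.py`, so the following is only a sketch of how such a script might read those arguments and find a checkpoint to resume from after a spot interruption. The `checkpoint-<step>` directory naming is an assumption (it matches the Hugging Face `Trainer` convention), and `str2bool`, `parse_args`, and `latest_checkpoint` are illustrative names.

```python
import argparse
import os
import re


def str2bool(value):
    """Interpret the string form of a boolean hyperparameter."""
    return str(value).strip().lower() in {"1", "true", "yes"}


def parse_args(argv=None):
    """Parse the hyperparameters configured on the estimator."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_name", type=str, default="Qwen/Qwen2.5-Coder-1.5B-Instruct")
    parser.add_argument("--epochs", type=int, default=3)
    parser.add_argument("--batch_size", type=int, default=4)
    parser.add_argument("--learning_rate", type=float, default=3e-5)
    parser.add_argument("--lora_r", type=int, default=8)
    parser.add_argument("--lora_alpha", type=int, default=16)
    parser.add_argument("--gradient_checkpointing", type=str2bool, default=True)
    parser.add_argument("--fp16", type=str2bool, default=True)
    # SageMaker exposes the 'training' channel and the model directory via env vars.
    parser.add_argument("--train_dir", default=os.environ.get("SM_CHANNEL_TRAINING", "/opt/ml/input/data/training"))
    parser.add_argument("--model_dir", default=os.environ.get("SM_MODEL_DIR", "/opt/ml/model"))
    parser.add_argument("--checkpoint_dir", default="/opt/ml/checkpoints")
    # parse_known_args tolerates any extra arguments the container may add.
    args, _ = parser.parse_known_args(argv)
    return args


def latest_checkpoint(checkpoint_dir):
    """Return the highest-numbered 'checkpoint-<step>' subdirectory, or None."""
    if not os.path.isdir(checkpoint_dir):
        return None
    best_step, best_path = -1, None
    for name in os.listdir(checkpoint_dir):
        match = re.fullmatch(r"checkpoint-(\d+)", name)
        path = os.path.join(checkpoint_dir, name)
        if match and os.path.isdir(path) and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path
```

Because SageMaker syncs `checkpoint_local_path` with `checkpoint_s3_uri`, a restarted spot job would find earlier checkpoints in `/opt/ml/checkpoints`, and the script can pass the result of `latest_checkpoint(args.checkpoint_dir)` to its training loop when it is not `None`.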

In [None]:
# Launch training job
job_name = f"gl-rl-gpu-{datetime.now().strftime('%Y%m%d-%H%M%S')}"

print(f"🚀 Launching job: {job_name}")

estimator.fit(
    inputs={'training': f's3://{bucket}/data/training'},
    job_name=job_name,
    wait=False  # Don't block notebook
)

print("\n✅ Job submitted!")
print(f"📊 Monitor: https://console.aws.amazon.com/sagemaker/home?region={region}#/jobs/{job_name}")
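Because `wait=False` returns as soon as the job is submitted, the notebook needs some other way to learn when training has finished. One option is a small polling loop, sketched here. `wait_for_training_job` is an illustrative helper, not part of boto3; it takes the `describe_training_job` method as an argument so the logic can be exercised without AWS access, and the poll interval and timeout defaults are arbitrary.

```python
import time

# Statuses after which a training job will not change any further.
TERMINAL_STATES = {"Completed", "Failed", "Stopped"}


def wait_for_training_job(describe, job_name, poll_seconds=60,
                          timeout_seconds=86400, sleep=time.sleep):
    """Poll until the job reaches a terminal status and return that status."""
    waited = 0
    while True:
        status = describe(TrainingJobName=job_name)["TrainingJobStatus"]
        if status in TERMINAL_STATES:
            return status
        if waited >= timeout_seconds:
            raise TimeoutError(f"{job_name} still {status} after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds
```

In this notebook it could be called as `wait_for_training_job(boto3.client('sagemaker', region_name=region).describe_training_job, job_name)`.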

## Monitor Training Progress

In [None]:
def check_job_status(job_name):
    """Check training job status"""
    sm = boto3.client('sagemaker', region_name=region)
    
    try:
        job = sm.describe_training_job(TrainingJobName=job_name)
        status = job['TrainingJobStatus']
        
        print(f"Job: {job_name}")
        print(f"Status: {status}")
        
        if status == 'InProgress':
            print(f"Secondary Status: {job.get('SecondaryStatus', 'Starting')}")
            if 'TrainingStartTime' in job:
                elapsed = datetime.now(job['TrainingStartTime'].tzinfo) - job['TrainingStartTime']
                print(f"Elapsed: {elapsed}")
                
        elif status == 'Completed':
            print("✅ Training completed!")
            print(f"Model: {job['ModelArtifacts']['S3ModelArtifacts']}")
            if 'TrainingStartTime' in job and 'TrainingEndTime' in job:
                duration = job['TrainingEndTime'] - job['TrainingStartTime']
                print(f"Duration: {duration}")
                
                # Rough estimate only: wall-clock duration x an assumed spot rate.
                # Actual billing is based on the job's BillableTimeInSeconds.
                hours = duration.total_seconds() / 3600
                spot_price = 0.30  # assumed ml.g5.xlarge spot price (USD/hour)
                cost = hours * spot_price
                print(f"Estimated cost: ${cost:.2f}")
                
        elif status == 'Failed':
            print(f"❌ Failed: {job.get('FailureReason', 'Unknown')}")
            
    except Exception as e:
        print(f"Error: {e}")

# Check your job (replace with actual job name)
# check_job_status(job_name)
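For managed spot jobs, the `describe_training_job` response also reports `TrainingTimeInSeconds` and `BillableTimeInSeconds`, which give a closer cost figure than a hard-coded spot rate. The sketch below applies the savings formula from the AWS documentation, `(1 - BillableTimeInSeconds / TrainingTimeInSeconds) * 100`. `estimate_job_cost` is an illustrative helper, and the on-demand price is an input you must supply (the $1.00/hr figure used in this notebook is approximate).

```python
def estimate_job_cost(job, on_demand_price_per_hour):
    """Estimate cost and spot savings from a describe_training_job response."""
    training_seconds = job.get("TrainingTimeInSeconds", 0)
    # Fall back to training time if the job did not report billable time.
    billable_seconds = job.get("BillableTimeInSeconds", training_seconds)
    instance_count = job.get("ResourceConfig", {}).get("InstanceCount", 1)

    # Billable time is wall-clock time, so multiply by the number of instances.
    cost = billable_seconds * instance_count / 3600 * on_demand_price_per_hour
    if training_seconds:
        savings_pct = 100.0 * (1 - billable_seconds / training_seconds)
    else:
        savings_pct = 0.0
    return {
        "billable_seconds": billable_seconds,
        "estimated_cost": cost,
        "spot_savings_pct": savings_pct,
    }
```

A typical call would be `estimate_job_cost(sm.describe_training_job(TrainingJobName=job_name), 1.00)`, where `sm` is a boto3 SageMaker client.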

## Quick Launch Script (Command Line)

To launch a training job from the command line instead of the Python SDK:

In [None]:
# Generate launch script
launch_script = f"""#!/bin/bash
# Quick GPU training launch script
#
# NOTE: Unlike the PyTorch estimator above, this raw API call does not package
# or upload train.py. The framework container only runs your script if you also
# pass --hyper-parameters with sagemaker_program and sagemaker_submit_directory
# (an S3 URI of the packaged source). Add those before relying on this script.

JOB_NAME="gl-rl-gpu-$(date +%Y%m%d-%H%M%S)"

aws sagemaker create-training-job \\
  --training-job-name "$JOB_NAME" \\
  --role-arn {role} \\
  --algorithm-specification TrainingImage=763104351884.dkr.ecr.{region}.amazonaws.com/pytorch-training:2.0-gpu-py310,TrainingInputMode=File \\
  --resource-config InstanceType=ml.g5.xlarge,InstanceCount=1,VolumeSizeInGB=50 \\
  --input-data-config '[{{"ChannelName":"training","DataSource":{{"S3DataSource":{{"S3DataType":"S3Prefix","S3Uri":"s3://{bucket}/data/training","S3DataDistributionType":"FullyReplicated"}}}}}}]' \\
  --output-data-config S3OutputPath=s3://{bucket}/output \\
  --enable-managed-spot-training \\
  --checkpoint-config S3Uri=s3://{bucket}/checkpoints \\
  --stopping-condition MaxRuntimeInSeconds=86400,MaxWaitTimeInSeconds=86400 \\
  --region {region}

echo "✅ Launched job: $JOB_NAME"
echo "📊 Monitor at: https://console.aws.amazon.com/sagemaker/home?region={region}#/jobs/$JOB_NAME"
"""

with open('launch_training.sh', 'w') as f:
    f.write(launch_script)

print("✅ Created launch_training.sh")
print("Run with: bash launch_training.sh")

## Download Trained Model

In [None]:
def download_model(job_name):
    """Download trained model from S3"""
    sm = boto3.client('sagemaker', region_name=region)
    
    job = sm.describe_training_job(TrainingJobName=job_name)
    
    if job['TrainingJobStatus'] == 'Completed':
        model_uri = job['ModelArtifacts']['S3ModelArtifacts']
        print(f"Downloading model from: {model_uri}")
        
        !aws s3 cp {model_uri} ./model.tar.gz
        !tar -xzf model.tar.gz
        
        print("✅ Model downloaded and extracted")
        !ls -la
    else:
        print(f"Job status: {job['TrainingJobStatus']}")

# download_model(job_name)
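`tar -xzf` unpacks into the current directory, which mixes model files with the notebook's own files. As an alternative, the archive can be unpacked into a dedicated folder from Python. This is a minimal sketch: `extract_model` and the default `model` directory are illustrative choices, and the checks simply reject links and any entry that would land outside the target folder.

```python
import os
import tarfile


def extract_model(archive_path, dest_dir="model"):
    """Extract a model.tar.gz into dest_dir and return the extracted file names."""
    os.makedirs(dest_dir, exist_ok=True)
    dest_root = os.path.realpath(dest_dir)
    with tarfile.open(archive_path, "r:gz") as tar:
        members = tar.getmembers()
        for member in members:
            # Refuse links and paths that resolve outside the destination.
            if member.issym() or member.islnk():
                raise ValueError(f"Link not allowed in archive: {member.name}")
            target = os.path.realpath(os.path.join(dest_root, member.name))
            if os.path.commonpath([dest_root, target]) != dest_root:
                raise ValueError(f"Unsafe path in archive: {member.name}")
        tar.extractall(dest_root)
        return sorted(m.name for m in members if m.isfile())
```

After `aws s3 cp` has fetched the archive, `extract_model("model.tar.gz")` lists what the training job actually saved.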

## Cost Optimization Guide

### Instance Selection

| Instance | GPU | VRAM | On-Demand (approx.) | Spot (approx.) | Use Case |
|----------|-----|------|-----------|------|----------|
| ml.g5.xlarge | A10G | 24GB | $1.00/hr | $0.30/hr | **Best value** |
| ml.g4dn.xlarge | T4 | 16GB | $0.73/hr | $0.35/hr | Budget option |
| ml.p3.2xlarge | V100 | 16GB | $3.83/hr | $1.15/hr | Fast training |

### Tips:
1. **Prefer spot instances** (up to about 70% savings at the prices above)
2. **Use checkpointing** so an interrupted spot job can resume instead of restarting
3. **Optimize batch size** for the available GPU memory
4. **Enable mixed precision** (`'fp16': True` in the hyperparameters)
5. **Use gradient checkpointing** for larger models
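The memory-related tips above come down to simple arithmetic. The sketch below estimates how much VRAM the model weights alone occupy and how many trainable parameters a LoRA adapter adds to a single linear layer. Both helpers are illustrative, the 1.5B parameter count is the nominal figure from the model name, and the layer dimensions in the usage note are hypothetical. Activations, gradients, optimizer state, and CUDA overhead are not included, so real usage is higher.

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Memory needed to hold the weights only, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


def lora_params(d_in, d_out, r):
    """Trainable parameters LoRA adds to one d_in x d_out linear layer.

    LoRA learns two low-rank matrices, A (r x d_in) and B (d_out x r).
    """
    return r * (d_in + d_out)
```

For a nominal 1.5B-parameter model, `weight_memory_gb(1.5e9, 2)` gives about 3 GB in fp16 versus about 6 GB for `weight_memory_gb(1.5e9, 4)` in fp32. With `lora_r=8`, a hypothetical 1536 x 1536 projection adds `lora_params(1536, 1536, 8)` = 24,576 trainable parameters, which is why LoRA fine-tuning fits comfortably on a 16 GB or 24 GB GPU.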

### Monitor Costs:
- [AWS Cost Explorer](https://console.aws.amazon.com/cost-management/)
- Set budget alerts in AWS Budgets
- Tag resources for cost tracking
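Tagging can be done at launch time, because the SageMaker estimator accepts a `tags` argument in the usual AWS `[{'Key': ..., 'Value': ...}]` form. The helper below is a small sketch that builds that list from a plain dict and enforces the AWS length limits (128 characters for keys, 256 for values). `build_tags` and the tag names in the usage note are hypothetical.

```python
def build_tags(tags):
    """Convert a dict into the list-of-dicts tag format used by AWS APIs."""
    result = []
    for key, value in tags.items():
        key, value = str(key), str(value)
        if not key or len(key) > 128:
            raise ValueError(f"Tag key must be 1-128 characters: {key!r}")
        if len(value) > 256:
            raise ValueError(f"Tag value must be at most 256 characters: {value!r}")
        result.append({"Key": key, "Value": value})
    return result
```

For example, passing `tags=build_tags({"project": "gl-rl-model", "purpose": "gpu-training"})` to the `PyTorch` estimator labels every job it launches. The tags typically need to be activated as cost allocation tags in the Billing console before they show up in Cost Explorer.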

## Alternative: Switch Notebook to GPU

⚠️ **Warning**: A GPU notebook instance is billed even when idle!

1. Stop this notebook instance
2. Update settings → change the instance type to ml.g5.xlarge
3. Start the instance
4. Run training directly in the notebook
5. **Remember to switch back to ml.t2.medium after training!**
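The stop, update, and start steps above can also be scripted with boto3. The following is a sketch under a few assumptions: `switch_notebook_instance_type` is an illustrative helper, `sm` is a boto3 SageMaker client, and the code runs from somewhere other than the notebook being changed (for example a laptop or CloudShell), since stopping the instance also stops its kernel.

```python
def switch_notebook_instance_type(sm, notebook_name, instance_type):
    """Stop a notebook instance, change its instance type, and start it again."""
    stopped = sm.get_waiter("notebook_instance_stopped")
    in_service = sm.get_waiter("notebook_instance_in_service")

    # The instance type can only be changed while the instance is stopped.
    sm.stop_notebook_instance(NotebookInstanceName=notebook_name)
    stopped.wait(NotebookInstanceName=notebook_name)

    sm.update_notebook_instance(
        NotebookInstanceName=notebook_name,
        InstanceType=instance_type,
    )
    # The update passes through an 'Updating' state before returning to 'Stopped'.
    stopped.wait(NotebookInstanceName=notebook_name)

    sm.start_notebook_instance(NotebookInstanceName=notebook_name)
    in_service.wait(NotebookInstanceName=notebook_name)
```

Calling it once with `'ml.g5.xlarge'` before training and again with `'ml.t2.medium'` afterwards makes the switch back harder to forget.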