# Interactive PyTorch GPU Training Tutorial

This notebook walks through using PyTorch with GPU acceleration: verifying the CUDA setup, moving tensors between devices, timing CPU against GPU, inspecting memory usage, and detecting multiple GPUs.

## Table of Contents
1. [GPU Setup and Verification](#gpu-setup)
2. [Basic GPU Operations](#basic-operations)
3. [Performance Comparison](#performance)
4. [Memory Management](#memory)
5. [Multi-GPU Training](#multi-gpu)

## 1. GPU Setup and Verification <a name="gpu-setup"></a>

In [None]:
import torch
import sys
import platform

def check_gpu_availability():
    print(f"PyTorch version: {torch.__version__}")
    print(f"Python version: {sys.version.split()[0]}")
    print(f"Operating System: {platform.system()} {platform.version()}")
    
    if torch.cuda.is_available():
        print("\n✅ CUDA is available!")
        print(f"CUDA version: {torch.version.cuda}")
        print(f"GPU device: {torch.cuda.get_device_name(0)}")
        print(f"Number of GPUs: {torch.cuda.device_count()}")
        
        # Get memory info
        memory_allocated = torch.cuda.memory_allocated(0)
        memory_reserved = torch.cuda.memory_reserved(0)
        print("\nGPU Memory:")
        print(f"- Allocated: {memory_allocated/1024**2:.2f} MB")
        print(f"- Reserved:  {memory_reserved/1024**2:.2f} MB")
    else:
        print("\n❌ CUDA is not available. Running on CPU only.")

check_gpu_availability()

## 2. Basic GPU Operations <a name="basic-operations"></a>

Let's explore how to move tensors and operations to GPU:

In [None]:
def demonstrate_gpu_operations():
    # Create tensors
    cpu_tensor = torch.randn(1000, 1000)
    
    print("1. Creating tensors:")
    print(f"CPU tensor device: {cpu_tensor.device}")
    
    if torch.cuda.is_available():
        # Move to GPU
        gpu_tensor = cpu_tensor.cuda()
        print(f"GPU tensor device: {gpu_tensor.device}")
        
        # Create tensor directly on GPU
        direct_gpu_tensor = torch.randn(1000, 1000, device='cuda')
        print(f"Direct GPU tensor device: {direct_gpu_tensor.device}")
        
        # Basic operations
        print("\n2. Basic operations on GPU:")
        result = gpu_tensor @ direct_gpu_tensor
        print(f"Matrix multiplication result device: {result.device}")
        
        # Moving back to CPU
        print("\n3. Moving back to CPU:")
        cpu_result = result.cpu()
        print(f"Result moved back to CPU device: {cpu_result.device}")
    else:
        print("GPU operations not available")

demonstrate_gpu_operations()
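
The cell above only does useful work when CUDA is present. A common alternative is to pick a device once and pass it everywhere, so the same code runs on either CPU or GPU. The following is a minimal sketch of that pattern; the variable names and the 256x256 size are illustrative choices, not part of the notebook above.

```python
import torch

# Pick the GPU when present, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Create one tensor directly on the chosen device and move another with .to().
a = torch.randn(256, 256, device=device)
b = torch.randn(256, 256).to(device)

# Both operands live on the same device, so the product does too.
product = a @ b
print(f"Computed on: {product.device}")
```

Creating the tensor with `device=device` avoids the extra host-to-device copy that `.to(device)` performs on a tensor first allocated on the CPU.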

## 3. Performance Comparison <a name="performance"></a>

Let's compare CPU and GPU timings for matrix multiplication at several matrix sizes:

In [None]:
import time

def benchmark_operations():
    sizes = [1000, 2000, 4000]
    results = []
    
    for size in sizes:
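        # CPU timing (perf_counter is a monotonic, high-resolution clock)
        a_cpu = torch.randn(size, size)
        b_cpu = torch.randn(size, size)
        
        # Warm-up so one-time setup costs are not measured
        _ = torch.matmul(a_cpu, b_cpu)
        
        start_time = time.perf_counter()
        _ = torch.matmul(a_cpu, b_cpu)
        cpu_time = time.perf_counter() - start_time
        
        # GPU timing
        if torch.cuda.is_available():
            a_gpu = a_cpu.cuda()
            b_gpu = b_cpu.cuda()
            
            # Warm-up: the first call pays for CUDA initialization
            _ = torch.matmul(a_gpu, b_gpu)
            torch.cuda.synchronize()
            
            start_time = time.perf_counter()
            _ = torch.matmul(a_gpu, b_gpu)
            # Kernels launch asynchronously; wait before stopping the clock
            torch.cuda.synchronize()
            gpu_time = time.perf_counter() - start_time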
        else:
            gpu_time = float('nan')
            
        results.append({
            'size': size,
            'cpu_time': cpu_time,
            'gpu_time': gpu_time,
            'speedup': cpu_time/gpu_time if gpu_time > 0 else float('nan')
        })
    
    # Print results
    print("Matrix Multiplication Performance Comparison:")
    print("\nSize\t\tCPU (s)\t\tGPU (s)\t\tSpeedup")
    print("-" * 60)
    for r in results:
        print(f"{r['size']}x{r['size']}\t{r['cpu_time']:.4f}\t\t{r['gpu_time']:.4f}\t\t{r['speedup']:.2f}x")

benchmark_operations()
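
A single timed run can be noisy. One way to smooth this out is to average over several repetitions after a warm-up. The sketch below assumes a hypothetical helper named `time_matmul` (not part of the notebook above) and an arbitrary default of 5 repetitions; it synchronizes only when the device is a GPU.

```python
import time
import torch

def time_matmul(size, device, repeats=5):
    """Return mean seconds per matmul of two size x size matrices (hypothetical helper)."""
    a = torch.randn(size, size, device=device)
    b = torch.randn(size, size, device=device)

    def sync():
        # CUDA kernels run asynchronously; wait for them before reading the clock.
        if device.type == "cuda":
            torch.cuda.synchronize(device)

    torch.matmul(a, b)  # warm-up run, excluded from the measurement
    sync()

    start = time.perf_counter()
    for _ in range(repeats):
        torch.matmul(a, b)
    sync()
    return (time.perf_counter() - start) / repeats

cpu_seconds = time_matmul(512, torch.device("cpu"))
print(f"CPU: {cpu_seconds:.6f} s per matmul")
if torch.cuda.is_available():
    gpu_seconds = time_matmul(512, torch.device("cuda"))
    print(f"GPU: {gpu_seconds:.6f} s per matmul")
```

Synchronizing once after the loop, rather than after every call, measures sustained throughput and keeps the synchronization overhead out of each iteration.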

## 4. Memory Management <a name="memory"></a>

PyTorch manages GPU memory through a caching allocator, so it helps to distinguish memory *allocated* to live tensors from memory *reserved* by the allocator for reuse:

In [None]:
def demonstrate_memory_management():
    if not torch.cuda.is_available():
        print("GPU not available")
        return
    
    print("Initial GPU Memory Usage:")
    print(f"Allocated: {torch.cuda.memory_allocated()/1024**2:.2f} MB")
    print(f"Reserved:  {torch.cuda.memory_reserved()/1024**2:.2f} MB")
    
    # Create some tensors
    tensors = []
    for i in range(5):
        tensors.append(torch.randn(1000, 1000, device='cuda'))
        print(f"\nAfter creating tensor {i+1}:")
        print(f"Allocated: {torch.cuda.memory_allocated()/1024**2:.2f} MB")
        print(f"Reserved:  {torch.cuda.memory_reserved()/1024**2:.2f} MB")
    
    # Drop the references, then return the now-unused cached blocks to the driver
    print("\nClearing tensors...")
    del tensors
    torch.cuda.empty_cache()
    
    print("\nFinal GPU Memory Usage:")
    print(f"Allocated: {torch.cuda.memory_allocated()/1024**2:.2f} MB")
    print(f"Reserved:  {torch.cuda.memory_reserved()/1024**2:.2f} MB")

demonstrate_memory_management()
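
To sanity-check the numbers printed above, it helps to work out how large each tensor should be. The sketch below assumes the default `float32` dtype (4 bytes per element) and uses a hypothetical helper, `tensor_megabytes`; the GPU part runs only when CUDA is available, and the measured growth may be rounded up slightly by the allocator.

```python
import torch

def tensor_megabytes(t):
    """Size of a tensor's data in MB, with 1 MB = 1024**2 bytes (hypothetical helper)."""
    return t.element_size() * t.nelement() / 1024**2

# 1000 x 1000 elements x 4 bytes (float32, assumed default dtype) = 4,000,000 bytes
sample = torch.empty(1000, 1000)
expected_mb = tensor_megabytes(sample)
print(f"Expected size per tensor: {expected_mb:.2f} MB")

if torch.cuda.is_available():
    # Track the high-water mark of allocated memory from this point on.
    torch.cuda.reset_peak_memory_stats()
    before = torch.cuda.memory_allocated()
    x = torch.randn(1000, 1000, device="cuda")
    grown = (torch.cuda.memory_allocated() - before) / 1024**2
    peak = torch.cuda.max_memory_allocated() / 1024**2
    print(f"Allocated grew by {grown:.2f} MB; peak so far {peak:.2f} MB")
```

Each tensor in the loop above should therefore add roughly 3.81 MB to the allocated figure, while the reserved figure grows in larger steps because the allocator requests memory from the driver in blocks.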

## 5. Multi-GPU Training <a name="multi-gpu"></a>

If more than one GPU is available, the following cell lists each device and wraps a small model in `DataParallel`:

In [None]:
def check_multi_gpu():
    if not torch.cuda.is_available():
        print("GPU not available")
        return
    
    num_gpus = torch.cuda.device_count()
    print(f"Number of GPUs available: {num_gpus}")
    
    if num_gpus > 1:
        print("\nGPU Details:")
        for i in range(num_gpus):
            print(f"\nGPU {i}:")
            print(f"Name: {torch.cuda.get_device_name(i)}")
            print(f"Memory Allocated: {torch.cuda.memory_allocated(i)/1024**2:.2f} MB")
            print(f"Memory Reserved: {torch.cuda.memory_reserved(i)/1024**2:.2f} MB")
            
        # Example of DataParallel: the module must be on the first GPU before wrapping
        model = torch.nn.Linear(100, 10).cuda()
        model = torch.nn.DataParallel(model)
        print(f"\nModel using DataParallel: {model.device_ids}")
    else:
        print("Multi-GPU training not available (only one GPU detected)")

check_multi_gpu()
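
The cell above wraps a model but never runs it. As a rough sketch of what a forward pass looks like, the example below wraps the model only when more than one GPU is visible and otherwise runs it unchanged on the single available device. The batch size of 64 is an arbitrary choice for illustration.

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# The parameters must live on the first GPU before the model is wrapped.
model = torch.nn.Linear(100, 10).to(device)

if torch.cuda.device_count() > 1:
    # DataParallel splits each input batch along dim 0 across the visible GPUs
    # and gathers the outputs back on the first device.
    model = torch.nn.DataParallel(model)

batch = torch.randn(64, 100, device=device)
output = model(batch)
print(f"Output shape: {tuple(output.shape)}, device: {output.device}")
```

Because the wrapper is applied conditionally, the same cell runs on a CPU-only machine, a single GPU, or several GPUs without modification.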

## Best Practices and Tips

1. **Memory Management**:
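   - Use `del` to drop references to tensors you no longer need
   - Call `torch.cuda.empty_cache()` to return unused cached memory to the driver; it does not free memory held by live tensors
   - Monitor memory usage with `torch.cuda.memory_summary()`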

2. **Performance Optimization**:
   - Use batch sizes large enough to keep the GPU busy, within the available memory
   - Minimize CPU-GPU data transfers
   - Call `torch.cuda.synchronize()` before reading the clock, since CUDA kernels launch asynchronously

3. **Multi-GPU Training**:
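   - Use `DataParallel` for quick single-process experiments
   - Prefer `DistributedDataParallel` for real training; the PyTorch documentation recommends it over `DataParallel`, even on a single machine
   - Balance load across GPUs; `DataParallel` gathers outputs on the first device, which therefore tends to use more memory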

4. **Common Pitfalls**:
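   - Check that all operands are on the same device before an operation; mixing CPU and GPU tensors typically raises a `RuntimeError`
   - Watch out for memory leaks, such as accumulating tensors that still reference the autograd graph
   - Be aware of implicit synchronization points such as `.item()`, `.cpu()`, and printing a GPU tensor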
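
Several of the tips above can be seen together in a short training loop. The sketch below is illustrative only: the model, synthetic data, learning rate, and step count are all assumptions chosen to keep the example small. It keeps every tensor on one device and accumulates the loss as a Python float rather than as a tensor.

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = torch.nn.Linear(20, 1).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.MSELoss()

running_loss = 0.0
for step in range(10):
    # Synthetic data created directly on the device avoids a host-to-device copy.
    inputs = torch.randn(32, 20, device=device)
    targets = torch.randn(32, 1, device=device)

    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    optimizer.step()

    # .item() returns a Python float, so no autograd graph is kept alive.
    # On a GPU it also forces a synchronization, so call it sparingly.
    running_loss += loss.item()

print(f"Mean loss over 10 steps: {running_loss / 10:.4f}")
```

Writing `running_loss += loss` instead would keep each step's autograd graph reachable, which is one common source of steadily growing memory use.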