# CPS GPU Cluster Test and Verification

This notebook verifies the functionality of the CPS GPU cluster after setting up OIDC authentication with Authentik.

## Cluster Configuration

- **JupyterHub**: v4.3.1 with Authentik OIDC authentication
- **GPU Operator**: v25.10.0 with Ubuntu 24.04 drivers
- **Available GPUs**: 8 GPUs total (2 per worker node across 4 nodes)
- **Storage**: NFS-backed persistent volumes
- **Access URL**: https://jupyterhub.cps.unileoben.ac.at

## Authentication Setup

- **Provider**: CPS Authentik (https://auth.cps.unileoben.ac.at)
- **Access Groups**: cps-users (regular users), cps-admins (administrators)
- **Profile Options**: CPU-only, Single GPU, Dual GPU, Research (4 GPUs)
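
The profile names above imply a GPU count per session, which can be compared with what the session actually sees. The following is a minimal sketch: `EXPECTED_GPUS` and `check_profile` are hypothetical names, the dictionary keys are the labels from the list above rather than the hub's real profile slugs, and the counts for CPU-only, Single GPU, and Dual GPU are inferred from the names.

```python
# Expected GPU count per profile, taken from the profile list above.
# Keys are display labels (an assumption), not actual JupyterHub profile slugs.
EXPECTED_GPUS = {
    "CPU-only": 0,
    "Single GPU": 1,
    "Dual GPU": 2,
    "Research": 4,
}


def check_profile(profile: str, detected_gpus: int) -> bool:
    """Return True if the detected GPU count matches the selected profile."""
    if profile not in EXPECTED_GPUS:
        raise KeyError(f"Unknown profile: {profile!r}")
    return detected_gpus == EXPECTED_GPUS[profile]
```

Once PyTorch is available, a call such as `check_profile("Dual GPU", torch.cuda.device_count())` would report whether the pod received the GPUs its profile promises.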

## 1. Environment and Authentication Info

First, let's verify the current user and environment.

In [None]:
import os
import getpass
import platform
from datetime import datetime

print("=== CPS GPU Cluster Environment ===")
print(f"Timestamp: {datetime.now()}")
print(f"Username: {getpass.getuser()}")
print(f"Hostname: {platform.node()}")
print(f"Python Version: {platform.python_version()}")
print(f"Platform: {platform.platform()}")
print(f"Architecture: {platform.machine()}")

# Check if running in JupyterHub
if 'JUPYTERHUB_USER' in os.environ:
    print(f"JupyterHub User: {os.environ['JUPYTERHUB_USER']}")
    print(f"JupyterHub Service: {os.environ.get('JUPYTERHUB_SERVICE_NAME', 'N/A')}")
    
# Check for NVIDIA environment variables
print(f"\nNVIDIA Environment:")
print(f"NVIDIA_VISIBLE_DEVICES: {os.environ.get('NVIDIA_VISIBLE_DEVICES', 'Not Set')}")
print(f"NVIDIA_DRIVER_CAPABILITIES: {os.environ.get('NVIDIA_DRIVER_CAPABILITIES', 'Not Set')}")

## 2. GPU Detection and NVIDIA Driver Verification

Check if GPUs are available and the NVIDIA drivers are working correctly.
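
For automated checks, the CSV query mode of `nvidia-smi` is easier to parse than the default table. The sketch below assumes the command shown in `COMMAND` and turns its output into a list of dictionaries; `parse_gpu_csv` is a hypothetical helper, and the sample values in the usage note are made up.

```python
import csv
import io

# Query fields understood by `nvidia-smi --query-gpu`; with `nounits`,
# memory.total is reported as a plain number of MiB.
QUERY = "index,name,memory.total,driver_version"
COMMAND = f"nvidia-smi --query-gpu={QUERY} --format=csv,noheader,nounits"


def parse_gpu_csv(text: str) -> list:
    """Parse the output of COMMAND into one dict per GPU."""
    gpus = []
    for row in csv.reader(io.StringIO(text), skipinitialspace=True):
        if len(row) != 4:
            continue  # skip blank or malformed lines
        index, name, memory_mib, driver = (field.strip() for field in row)
        gpus.append({
            "index": int(index),
            "name": name,
            "memory_mib": int(memory_mib),
            "driver_version": driver,
        })
    return gpus
```

Feeding it the captured standard output of `COMMAND`, for example a line such as `0, Example GPU, 40960, 0.0.0`, gives a structure whose length can be compared with the GPU count expected for the selected profile.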

In [None]:
import subprocess
import sys

def run_command(cmd):
    """Run a shell command and return the output"""
    try:
        result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
        return result.stdout.strip(), result.stderr.strip(), result.returncode
    except Exception as e:
        return "", str(e), 1

print("=== GPU Hardware Detection ===")

# Check nvidia-smi
stdout, stderr, code = run_command("nvidia-smi")
if code == 0:
    print("✅ nvidia-smi command successful")
    print(stdout)
else:
    print("❌ nvidia-smi failed:")
    print(stderr)

print("\n=== NVIDIA Driver Version ===")
stdout, stderr, code = run_command("nvidia-smi --query-gpu=driver_version --format=csv,noheader")
if code == 0:
    print(f"Driver Version: {stdout}")
else:
    print(f"Could not get driver version: {stderr}")

print("\n=== GPU Count and Models ===")
stdout, stderr, code = run_command("nvidia-smi --query-gpu=count,name --format=csv")
if code == 0:
    print(stdout)
else:
    print(f"Could not get GPU info: {stderr}")

## 3. PyTorch GPU Test

Test PyTorch integration with the available GPUs.
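
A raw elapsed time is hard to judge on its own, so it can help to convert it into an approximate throughput. The sketch below assumes the usual estimate of 2·n³ floating-point operations for multiplying two n×n matrices; `matmul_gflops` is a hypothetical helper and not part of PyTorch.

```python
def matmul_gflops(n: int, elapsed_ms: float) -> float:
    """Approximate GFLOP/s for an (n x n) @ (n x n) matrix product."""
    if elapsed_ms <= 0:
        raise ValueError("elapsed_ms must be positive")
    flops = 2 * n**3  # about n^3 multiplications and n^3 additions
    # flops / seconds / 1e9, with seconds = elapsed_ms / 1e3
    return flops / (elapsed_ms * 1e6)
```

For example, a 1000x1000 product that takes 1 ms corresponds to roughly 2000 GFLOP/s. A single small multiplication is likely to include one-time CUDA start-up cost, so timing a second run after a warm-up call usually gives a more representative figure.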

In [None]:
try:
    import torch
    print("✅ PyTorch imported successfully")
    print(f"PyTorch version: {torch.__version__}")
    
    # Check CUDA availability
    if torch.cuda.is_available():
        print("✅ CUDA is available")
        print(f"CUDA version: {torch.version.cuda}")
        print(f"Number of CUDA devices: {torch.cuda.device_count()}")
        
        # List all available devices
        for i in range(torch.cuda.device_count()):
            props = torch.cuda.get_device_properties(i)
            print(f"  GPU {i}: {props.name}")
            print(f"    - Memory: {props.total_memory / 1024**3:.1f} GB")
            print(f"    - Compute Capability: {props.major}.{props.minor}")
        
        # Test GPU computation
        print("\n=== GPU Computation Test ===")
        device = torch.device('cuda:0')
        print(f"Using device: {device}")
        
        # Create test tensors
        x = torch.randn(1000, 1000).to(device)
        y = torch.randn(1000, 1000).to(device)
        
        # Perform matrix multiplication
        start_time = torch.cuda.Event(enable_timing=True)
        end_time = torch.cuda.Event(enable_timing=True)
        
        start_time.record()
        result = torch.mm(x, y)
        end_time.record()
        
        torch.cuda.synchronize()
        elapsed_time = start_time.elapsed_time(end_time)
        
        print(f"✅ Matrix multiplication (1000x1000) completed in {elapsed_time:.2f} ms")
        print(f"Result shape: {result.shape}")
        print(f"GPU memory allocated: {torch.cuda.memory_allocated() / 1024**2:.1f} MB")
        
    else:
        print("❌ CUDA is not available")
        print("Running on CPU only")
        
except ImportError:
    print("❌ PyTorch not installed, installing...")
    subprocess.run([sys.executable, "-m", "pip", "install", "torch"], check=True)
    print("✅ PyTorch installed, please restart kernel and run again")

## 4. Storage Persistence Test

Test if the NFS storage is working and files persist across sessions.
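
Persistence across sessions can only be observed if a run reads what an earlier run left behind before writing anything new. The following is a minimal sketch of that idea; `record_run` and the marker file layout are assumptions for illustration.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def record_run(marker: Path):
    """Return the timestamp stored by the previous run (or None), then store a new one."""
    previous = None
    if marker.exists():
        # Read the old marker first so it is not lost when overwritten below.
        previous = json.loads(marker.read_text()).get("timestamp")
    marker.parent.mkdir(parents=True, exist_ok=True)
    payload = {"timestamp": datetime.now(timezone.utc).isoformat()}
    marker.write_text(json.dumps(payload))
    return previous
```

Calling `record_run(Path.home() / "cluster_test" / "marker.json")` in two separate sessions should return `None` the first time and the first session's timestamp the second time, provided the home directory is on persistent storage.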

In [None]:
import os
import json
import getpass
import platform
from pathlib import Path
from datetime import datetime

print("=== Storage Persistence Test ===")

# Create test directory
test_dir = Path.home() / "cluster_test"
test_dir.mkdir(exist_ok=True)

# Create a test file with a per-run timestamped name so earlier runs are not overwritten
test_file = test_dir / f"persistence_test_{datetime.now():%Y%m%d_%H%M%S}.json"
test_data = {
    "timestamp": datetime.now().isoformat(),
    "user": getpass.getuser(),
    "hostname": platform.node(),
    "test_type": "storage_persistence"
}

# Write test file
with open(test_file, 'w') as f:
    json.dump(test_data, f, indent=2)

print(f"✅ Created test file: {test_file}")

# Verify file contents
if test_file.exists():
    with open(test_file, 'r') as f:
        loaded_data = json.load(f)
    print(f"✅ File exists and contains: {loaded_data}")
else:
    print("❌ Test file was not created successfully")

# Check storage space
import shutil
total, used, free = shutil.disk_usage(Path.home())
print("\n=== Storage Information ===")
print(f"Home directory: {Path.home()}")
print(f"Total space: {total / 1024**3:.1f} GB")
print(f"Used space: {used / 1024**3:.1f} GB")
print(f"Free space: {free / 1024**3:.1f} GB")

# List test files left by earlier runs (the file just written is excluded)
previous_tests = sorted(p for p in test_dir.glob("*.json") if p != test_file)
if previous_tests:
    print(f"\n✅ Found {len(previous_tests)} previous test files:")
    for test in previous_tests:
        print(f"  - {test.name}")
else:
    print("\n📝 This is the first test run")

## 5. Cluster Resource Overview

Get an overview of the available cluster resources and current usage.
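
To confirm that storage is NFS-backed, the filesystem type is more reliable than matching keywords in mount paths. The sketch below assumes a Linux pod where `/proc/mounts` is readable and parses its whitespace-separated format; `nfs_mounts` is a hypothetical helper.

```python
def nfs_mounts(mounts_text: str) -> list:
    """Return (source, mount_point, fs_type) for each NFS entry in /proc/mounts text."""
    mounts = []
    for line in mounts_text.splitlines():
        fields = line.split()
        # Format: source mount_point fs_type options dump pass
        if len(fields) >= 3 and fields[2] in ("nfs", "nfs4"):
            mounts.append((fields[0], fields[1], fields[2]))
    return mounts
```

In a running pod this could be called as `nfs_mounts(open("/proc/mounts").read())`; an empty result would suggest that the home directory is not on an NFS volume.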

In [None]:
# Check Kubernetes context if kubectl is available
print("=== Kubernetes Resource Information ===")

stdout, stderr, code = run_command("kubectl get nodes -o wide")
if code == 0:
    print("Cluster Nodes:")
    print(stdout)
else:
    print("kubectl not available in user environment (expected)")

# Check for mounted volumes (-T adds the filesystem type column, so NFS mounts match on "nfs")
print("\n=== Mounted Filesystems ===")
stdout, stderr, code = run_command("df -hT")
if code == 0:
    lines = stdout.split('\n')
    for line in lines:
        if any(keyword in line.lower() for keyword in ['nfs', 'home', 'jupyter']):
            print(line)

# System resource information
print(f"\n=== Local System Resources ===")
stdout, stderr, code = run_command("lscpu | grep -E '(Model name|CPU\\(s\\)|Thread|Core)'")
if code == 0:
    print("CPU Information:")
    print(stdout)

stdout, stderr, code = run_command("free -h")
if code == 0:
    print("\nMemory Information:")
    print(stdout)

# Check if this is running in a GPU-enabled pod
if torch.cuda.is_available():
    print(f"\n=== GPU Resource Summary ===")
    print(f"GPUs Available: {torch.cuda.device_count()}")
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        print(f"  GPU {i}: {props.name} ({props.total_memory / 1024**3:.1f} GB)")

## Summary

This notebook verifies that the CPS GPU cluster is functioning correctly with:

1. ✅ **Authentication**: OIDC integration with Authentik
2. ✅ **GPU Access**: NVIDIA drivers and CUDA support
3. ✅ **Compute**: PyTorch GPU acceleration
4. ✅ **Storage**: Persistent NFS-backed volumes
5. ✅ **Profiles**: Multiple resource configurations available
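
The checkmarks in this list are static text, so they do not reflect what a particular run found. One option is to collect a boolean per check and render the list from those results. The sketch below assumes a plain dictionary of results; `format_summary` and `all_passed` are hypothetical helpers.

```python
def format_summary(results: dict) -> str:
    """Render a numbered pass/fail list from {check name: passed}."""
    lines = []
    for number, (name, passed) in enumerate(results.items(), start=1):
        mark = "✅" if passed else "❌"
        lines.append(f"{number}. {mark} {name}")
    return "\n".join(lines)


def all_passed(results: dict) -> bool:
    """Return True only if every recorded check succeeded."""
    return all(results.values())
```

A call such as `print(format_summary({"GPU Access": True, "Storage": False}))` would then show which checks actually succeeded in the current session.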

### Next Steps

- Test different profile options (CPU-only, Single GPU, Dual GPU, Research)
- Run actual ML workloads to verify performance
- Test collaborative features with multiple users
- Verify data persistence across pod restarts

### Support

For issues or questions, contact the CPS infrastructure team.