# GPU Test Notebook

This notebook provides a very basic set of checks for Jupyter notebooks with GPU support. It covers the following steps:

1. Notebook environment setup
2. Verify the installed Python version
3. Find the version of the installed TensorFlow or PyTorch packages
4. Find the GPU on the devices list
5. Check if GPU drivers (nvidia-smi or rocm-smi) load properly
6. Check if CUDA/ROCm drivers are installed

## 1. Notebook environment setup

To keep the rest of the notebook readable, the setup cell below does the following:

- imports the standard packages used in this notebook, i.e., `os`, `sys`, etc.;
- checks whether this environment is a TensorFlow or PyTorch image / environment;
- checks whether this environment contains a CUDA or ROCm GPU.

The results are stored in boolean variables (`pytorch`, `tensorflow`, `cuda`, `rocm`, `cpu_only`) and mirrored into environment variables, so later cells can branch on them.

In [1]:
import os
import sys

try:
    import tensorflow as tf
except ImportError:
    pass

try:
    import torch
except ImportError:
    pass

# Initialize framework and GPU type
framework = "NONE"
gpu_type = "CPU"

def import_pytorch():
    global framework, gpu_type
    import torch
    framework = "PYTORCH"

    if torch.cuda.is_available():
        # Check if it's actually ROCm by looking at device name
        try:
            device_name = torch.cuda.get_device_name(0)

            # Check for AMD/ROCm indicators in device name
            if any(keyword in device_name.upper() for keyword in ['AMD', 'RADEON', 'INSTINCT', 'MI300', 'MI250', 'MI100']):
                gpu_type = "ROCM"
            else:
                gpu_type = "CUDA"
        except Exception:
            # Fallback: check if ROCm version exists
            if hasattr(torch.version, 'hip') and torch.version.hip is not None:
                gpu_type = "ROCM"
            else:
                gpu_type = "CUDA"
    elif hasattr(torch.version, 'hip') and torch.version.hip is not None:
        gpu_type = "ROCM"

def import_tensorflow():
    global framework, gpu_type
    import tensorflow as tf
    framework = "TENSORFLOW"

    if len(tf.config.list_physical_devices('GPU')) > 0:
        try:
            gpu_details = tf.config.experimental.get_device_details(tf.config.list_physical_devices('GPU')[0])
            if 'AMD' in str(gpu_details) or 'ROCm' in str(gpu_details):
                gpu_type = "ROCM"
            else:
                gpu_type = "CUDA"
        except Exception:
            # Device details unavailable: fall back to CUDA
            gpu_type = "CUDA"

# Detect environment
try:
    import_pytorch()
except ImportError:
    try:
        import_tensorflow()
    except ImportError:
        pass

# Set environment variables (strings)
os.environ['ML_FRAMEWORK'] = framework
os.environ['GPU_TYPE'] = gpu_type

# Set global boolean variables for easy checking
pytorch = (framework == "PYTORCH")
tensorflow = (framework == "TENSORFLOW")
cuda = (gpu_type == "CUDA")
rocm = (gpu_type == "ROCM")
cpu_only = (gpu_type == "CPU")

# Also set environment variables as strings for boolean checks
os.environ['PYTORCH'] = str(pytorch).lower()
os.environ['TENSORFLOW'] = str(tensorflow).lower()
os.environ['CUDA'] = str(cuda).lower()
os.environ['ROCM'] = str(rocm).lower()
os.environ['CPU_ONLY'] = str(cpu_only).lower()
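
Because the flags above are exported as the strings `"true"` and `"false"`, a later cell or child process has to parse them back into booleans before branching on them. A minimal sketch of such a helper follows; the name `env_flag` and the `DEMO_*` variables are hypothetical and not part of the original notebook.

```python
import os


def env_flag(name: str, default: bool = False) -> bool:
    """Read a "true"/"false" environment variable as a boolean.

    `env_flag` is an illustrative helper, not part of the notebook.
    """
    value = os.environ.get(name)
    if value is None:
        return default
    return value.strip().lower() == "true"


# Demo variable only; the setup cell exports PYTORCH, TENSORFLOW, CUDA, ROCM, CPU_ONLY.
os.environ["DEMO_FLAG"] = "true"
os.environ.pop("DEMO_UNSET_FLAG", None)
print(env_flag("DEMO_FLAG"), env_flag("DEMO_UNSET_FLAG"))  # prints: True False
```

With the real variables, a call such as `env_flag("ROCM")` would return the same value as the `rocm` boolean set in the setup cell.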

## 2. Verify the installed Python version
Multiple notebook images are available, and their Python versions can differ because of system upgrades, images built with different Python versions, and other changes.

The following test only prints the Python version installed in this notebook, so you can verify that the expected Python version is actually running.

> Note: this is still a manual test. You need to know which version is supposed to run here and compare it with the output below.

In [2]:
print(sys.version)

3.12.11 (main, Aug 14 2025, 00:00:00) [GCC 11.5.0 20240719 (Red Hat 11.5.0-11)]
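
The manual comparison described in the note could be automated with a small check. The sketch below assumes the expected version is known in advance; `EXPECTED_PYTHON` and `python_matches` are hypothetical names, and the value `(3, 12)` is taken only from the sample output above.

```python
import sys

# Assumption: the image is expected to ship Python 3.12 (from the sample output).
EXPECTED_PYTHON = (3, 12)


def python_matches(expected, actual=None):
    """Return True if the leading components of `actual` equal `expected`.

    `actual` defaults to the version of the running interpreter.
    """
    if actual is None:
        actual = tuple(sys.version_info[:3])
    return tuple(actual[:len(expected)]) == tuple(expected)


status = "OK" if python_matches(EXPECTED_PYTHON) else "MISMATCH"
print(f"Python {sys.version.split()[0]}: {status}")
```

Comparing only the leading components lets the check pin a minor version (3.12) without failing on every patch release.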


## 3. Find the version of the installed TensorFlow or PyTorch packages

Because both TensorFlow and PyTorch are upgraded from time to time, it's important to know which version is installed on this system and whether it matches the expected one.

In [3]:
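if tensorflow:
    print(f"TensorFlow: {tf.__version__}")
elif pytorch:
    print(f"PyTorch: {torch.__version__}")
else:
    # Neither framework could be imported by the setup cell
    print("No TensorFlow or PyTorch installation detected")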

PyTorch: 2.7.1+rocm6.3
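
A version string like the one printed above carries two pieces of information: the release number and, after the `+`, a local build tag that here names the GPU backend. The following sketch separates the two with plain string handling; `split_version` is a hypothetical helper, and reading the tag as a backend label is an assumption based on this sample output rather than a guaranteed format.

```python
def split_version(version: str):
    """Split 'X.Y.Z+tag' into (release, tag); tag is '' when there is no '+'."""
    release, _, tag = version.partition("+")
    return release, tag


release, tag = split_version("2.7.1+rocm6.3")
print(release, tag)  # prints: 2.7.1 rocm6.3
```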


## 4. Find the GPU on the devices list
To check whether GPUs are present in the current setup, we rely on the framework's own device API: the TensorFlow Python client (`device_lib`) on TensorFlow images, or `torch.cuda` on PyTorch images.

- If the resulting device list contains items, there are GPUs available on this server;
- If the resulting device list is empty, no GPUs are available.

In [4]:
gpu_devices = []  # stays empty when no GPU is found

if tensorflow:
    from tensorflow.python.client import device_lib
    local_device_protos = device_lib.list_local_devices()
    gpu_devices = [x.name for x in local_device_protos if x.device_type == 'GPU']
elif pytorch and torch.cuda.is_available():
    gpu_count = torch.cuda.device_count()
    gpu_devices = [f"GPU:{i} ({torch.cuda.get_device_name(i)})" for i in range(gpu_count)]

for gpu in gpu_devices:
    print(f"- {gpu}")

- GPU:0 (AMD Instinct MI300X VF)
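
Seeing a device in the list does not yet show that it can run work. As a hedged extension for PyTorch images only, the sketch below runs a tiny matrix multiplication on the first GPU; `gpu_smoke_test` is a hypothetical helper that is not part of the original notebook, and it reports `"skipped"` when PyTorch or a GPU is missing.

```python
def gpu_smoke_test():
    """Run a tiny matrix multiply on the first GPU, if one is usable.

    Returns "ok" on success and "skipped" when PyTorch or a GPU is missing.
    """
    try:
        import torch
    except ImportError:
        return "skipped"
    if not torch.cuda.is_available():
        return "skipped"

    # ROCm builds of PyTorch also expose the device under the "cuda" name.
    device = torch.device("cuda:0")
    a = torch.ones((4, 4), device=device)
    b = torch.ones((4, 4), device=device)
    result = (a @ b).cpu()
    # Each entry of ones(4, 4) @ ones(4, 4) is 4.0.
    assert bool(torch.all(result == 4.0))
    return "ok"


print(gpu_smoke_test())
```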


## 5. Check if GPU drivers (nvidia-smi or rocm-smi) load properly
The NVIDIA System Management Interface (`nvidia-smi`) and the ROCm System Management Interface (`rocm-smi`) are command-line utilities intended to aid in the management and monitoring of NVIDIA and AMD GPU devices, respectively.

The following command simply runs `nvidia-smi` or `rocm-smi`, depending on the detected GPU type:

In [5]:
if cuda:
    !nvidia-smi
elif rocm:
    !rocm-smi



Device  Node  IDs              Temp        Power     Partitions          SCLK    MCLK    Fan  Perf  PwrCap  VRAM%  GPU%
              (DID,     GUID)  (Junction)  (Socket)  (Mem, Compute, ID)
0       2     0x74b5,   65402  40.0°C      141.0W    NPS1, SPX, 0        172Mhz  900Mhz  0%   auto  750.0W  0%     0%
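
The `!` prefix used in the cell above only works inside Jupyter/IPython. Outside a notebook, the same check could be done with the standard library, as in this sketch; `run_if_available` is a hypothetical helper, and the 30-second timeout is an arbitrary assumption.

```python
import shutil
import subprocess


def run_if_available(cmd, timeout=30):
    """Run `cmd` if its executable is on PATH; return None otherwise."""
    if shutil.which(cmd[0]) is None:
        return None
    return subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)


for tool in ("nvidia-smi", "rocm-smi"):
    proc = run_if_available([tool])
    if proc is None:
        print(f"{tool}: not found on PATH")
    else:
        print(f"{tool}: exit code {proc.returncode}")
```

Checking `PATH` first separates "tool not installed" from "tool installed but failing", which a bare shell call reports in the same way.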


## 6. Check if CUDA/ROCm drivers are installed
This test simply checks whether the CUDA or ROCm compiler tools are installed, by running `nvcc` on NVIDIA GPUs and `hipcc` on ROCm GPUs. Both are compiler drivers shipped with the respective toolkit, so a `command not found` result means the compiler is not on the `PATH` of this image. It does not by itself mean that the GPU is unusable.

> Note: the code is kept as simple as possible; run the cells that make sense for the tests you are doing. There is intentionally no extended logic to check the results automatically.

In [6]:
if cuda:
    !nvcc --version
elif rocm:
    !hipcc --version
else:
    print("GPU Type: None detected")

/usr/bin/sh: line 1: hipcc: command not found
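
A `command not found` message like the one above only says that the compiler is not on `PATH`. The sketch below looks for `nvcc` and `hipcc` on `PATH` first and then in a few extra directories. The helper name `find_tool` is hypothetical, and `/opt/rocm/bin` and `/usr/local/cuda/bin` are assumed default install locations that may not apply to this image.

```python
import os
import shutil


def find_tool(name, extra_dirs=()):
    """Return the full path to `name`, searching PATH first and then `extra_dirs`."""
    found = shutil.which(name)
    if found:
        return found
    if extra_dirs:
        return shutil.which(name, path=os.pathsep.join(extra_dirs))
    return None


# Assumed default install locations; adjust for the image in use.
EXTRA_DIRS = ("/opt/rocm/bin", "/usr/local/cuda/bin")

for compiler in ("nvcc", "hipcc"):
    print(f"{compiler}: {find_tool(compiler, EXTRA_DIRS) or 'not found'}")
```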
