# Purpose
Check whether GPUs are available and usable in the current Python environment.


# Debugging notes
## Make sure to set the correct venv/kernel in your notebook
The default `Python 3` kernel might not have access to the correct drivers.
<br>Instead, you might need to manually set the kernel to:
<br>`Python [conda env:root]`
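To confirm which interpreter the kernel is actually running, a minimal sketch along these lines may help. Note that `!which python` reports the first `python` on the shell's `PATH`, which is not necessarily the interpreter behind the notebook kernel. The helper name `describe_interpreter` is only illustrative.

```python
import sys


def describe_interpreter():
    """Return the path, environment prefix, and version of the running interpreter."""
    return {
        "executable": sys.executable,  # binary that runs this kernel
        "prefix": sys.prefix,          # root of the active environment
        "version": sys.version.split()[0],
    }


info = describe_interpreter()
for key, value in info.items():
    print(f"{key}: {value}")
```

If `executable` does not point into the expected conda environment, the notebook is probably attached to the wrong kernel.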

## Sometimes the best fix is `sudo reboot`
When in doubt, open a terminal and do `sudo reboot`.

For some reason, the NVIDIA drivers might not be loaded properly after shutting down a VM instance from the GUI:
- https://console.cloud.google.com/ai-platform/notebooks/list/instances?project=data-prod-165221


In [1]:
!which python

/opt/conda/bin/python


# Imports & notebook setup

In [2]:
%load_ext autoreload
%autoreload 2

In [6]:
from pprint import pprint

# import tensorflow_text
import tensorflow as tf
from tensorflow.python.client import device_lib

# import subclu
# from subclu.utils.eda import (
#     setup_logging, notebook_display_config, print_lib_versions,
# )

print(f"{tf.__name__} {tf.__version__}")

tensorflow 2.3.3


In [7]:
# setup_logging()

# Check GPUs/XLA_GPUs recognized by TensorFlow/Python

NOTE: `GPU`s and `XLA_GPU`s are recognized as two different device types.

https://www.tensorflow.org/xla
> **XLA: Optimizing Compiler for Machine Learning**
> XLA (Accelerated Linear Algebra) is a domain-specific compiler for linear algebra that can accelerate TensorFlow models with potentially no source code changes.
> 
> The results are improvements in speed and memory usage: e.g. in BERT MLPerf submission using 8 Volta V100 GPUs using XLA has achieved a ~7x performance improvement and ~5x batch size improvement

Other sources
- https://stackoverflow.com/questions/52943489/what-is-xla-gpu-and-xla-cpu-for-tensorflow


## What device gets used for calculations?

The device-placement log for the op should name a `GPU` or `XLA_GPU` device. If it names `CPU`, TensorFlow is not using the GPU.

In [8]:
%%time

tf.debugging.set_log_device_placement(True)

# Create some tensors
a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
c = tf.matmul(a, b)

print(c)

Executing op MatMul in device /job:localhost/replica:0/task:0/device:GPU:0
tf.Tensor(
[[22. 28.]
 [49. 64.]], shape=(2, 2), dtype=float32)
CPU times: user 776 ms, sys: 1.13 s, total: 1.91 s
Wall time: 5.18 s
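Instead of reading the placement log by eye, the device string can be checked in code. The sketch below only parses a device string like the one logged above, so it runs without TensorFlow. In the notebook, the string could be taken from an eager tensor's `.device` attribute (for example `c.device`). The helper names `device_type` and `ran_on_gpu` are hypothetical.

```python
def device_type(device_name):
    """Extract the device type (e.g. 'GPU') from a TensorFlow device string."""
    # Device strings end in '<TYPE>:<index>', e.g. '.../device:GPU:0'.
    parts = device_name.rsplit(":", 2)
    if len(parts) != 3 or not parts[2].isdigit():
        raise ValueError(f"Unrecognized device string: {device_name!r}")
    return parts[1]


def ran_on_gpu(device_name):
    """True if the device string names a GPU or XLA_GPU device."""
    return device_type(device_name) in {"GPU", "XLA_GPU"}


# Device string copied from the placement log above.
logged = "/job:localhost/replica:0/task:0/device:GPU:0"
print(device_type(logged), ran_on_gpu(logged))
```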


## List devices

Expected output on a VM with one working GPU (a Tesla T4 here):
```
Built with CUDA? True

GPUs
===
Num GPUs Available: 2
GPU details:
[   PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'),
    PhysicalDevice(name='/physical_device:XLA_GPU:0', device_type='XLA_GPU')]

Built with CUDA? True

All devices:
===
Num devices: 4
Details:
[   name: "/device:CPU:0"
device_type: "CPU"

...

,
    name: "/device:XLA_GPU:0"
device_type: "XLA_GPU"
memory_limit: 17179869184
locality {
}
incarnation: 6215884038941287466
physical_device_desc: "device: XLA_GPU device"
,
    name: "/device:GPU:0"
device_type: "GPU"
memory_limit: 14676252416
locality {
  bus_id: 1
  links {
  }
}
incarnation: 8485125904456880156
physical_device_desc: "device: 0, name: Tesla T4, pci bus id: 0000:00:04.0, compute capability: 7.5"
]

```

In [25]:
l_phys_gpus = (
    tf.config.list_physical_devices('GPU') +
    tf.config.list_physical_devices('XLA_GPU')
)

print(
    f"\nBuilt with CUDA? {tf.test.is_built_with_cuda()}"
    f"\n\nGPUs\n==="
    f"\nNum GPUs Available: {len(l_phys_gpus)}"
    f"\nGPU details:"
)
pprint(l_phys_gpus, indent=4,)


Built with CUDA? True

GPUs
===
Num GPUs Available: 2
GPU details:
[   PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'),
    PhysicalDevice(name='/physical_device:XLA_GPU:0', device_type='XLA_GPU')]
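The count of 2 above is the number of device entries, not the number of cards: the list concatenates `GPU` and `XLA_GPU` entries, while `lspci` and `nvidia-smi` in this notebook report a single Tesla T4. A minimal sketch of counting per device type follows. `Device` is a stand-in for TensorFlow's `PhysicalDevice` (which exposes the same `name` and `device_type` fields) so that the example runs without TensorFlow.

```python
from collections import Counter, namedtuple

# Stand-in for tf.config.PhysicalDevice; only used to keep the sketch self-contained.
Device = namedtuple("Device", ["name", "device_type"])


def count_by_type(devices):
    """Count device entries per device_type."""
    return Counter(d.device_type for d in devices)


# Entries copied from the output above.
devices = [
    Device("/physical_device:GPU:0", "GPU"),
    Device("/physical_device:XLA_GPU:0", "XLA_GPU"),
]
counts = count_by_type(devices)
print(f"GPU entries: {counts['GPU']}, XLA_GPU entries: {counts['XLA_GPU']}")
```

In the notebook, `count_by_type(l_phys_gpus)` should work the same way on the real list.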


In [26]:
l_all_local_devices = device_lib.list_local_devices()
print(
    f"\nBuilt with CUDA? {tf.test.is_built_with_cuda()}"
    f"\n\nAll devices:\n==="
    f"\nNum devices: {len(l_all_local_devices)}"
    f"\nDetails:"
)
pprint(l_all_local_devices, indent=4,)


Built with CUDA? True

All devices:
===
Num devices: 4
Details:
[   name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 6891486951834722320
,
    name: "/device:XLA_CPU:0"
device_type: "XLA_CPU"
memory_limit: 17179869184
locality {
}
incarnation: 8259629423329914367
physical_device_desc: "device: XLA_CPU device"
,
    name: "/device:XLA_GPU:0"
device_type: "XLA_GPU"
memory_limit: 17179869184
locality {
}
incarnation: 18206492687374442845
physical_device_desc: "device: XLA_GPU device"
,
    name: "/device:GPU:0"
device_type: "GPU"
memory_limit: 14676252416
locality {
  bus_id: 1
  links {
  }
}
incarnation: 15572239201293287089
physical_device_desc: "device: 0, name: Tesla T4, pci bus id: 0000:00:04.0, compute capability: 7.5"
]


# Check NVIDIA CLI

First, do we even see the GPUs?

In [12]:
!lspci | grep 3D

00:04.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)


Then, are they recognized by the nvidia-smi tool?

In [13]:
!nvidia-smi

Wed Jun 30 06:22:31 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01    Driver Version: 460.73.01    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   67C    P0    32W /  70W |  14378MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces
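For a check that is easier to use programmatically than the full table, `nvidia-smi` can emit selected fields as CSV through its `--query-gpu` and `--format` options. The sketch below is one possible wrapper; the function names and the choice of fields are assumptions, and `query_gpus()` returns `None` when the tool is missing or fails.

```python
import shutil
import subprocess

QUERY_FIELDS = ["name", "driver_version", "memory.used", "memory.total"]


def parse_gpu_rows(csv_text):
    """Turn 'csv,noheader' output from nvidia-smi into a list of dicts."""
    rows = []
    for line in csv_text.strip().splitlines():
        values = [v.strip() for v in line.split(",")]
        if len(values) == len(QUERY_FIELDS):
            rows.append(dict(zip(QUERY_FIELDS, values)))
    return rows


def query_gpus():
    """Run nvidia-smi and return one dict per GPU, or None if it is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={','.join(QUERY_FIELDS)}", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return None
    return parse_gpu_rows(result.stdout)


# Sample line built from the values in the table above.
sample = "Tesla T4, 460.73.01, 14378 MiB, 15109 MiB"
print(parse_gpu_rows(sample))
```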

## Debug NVIDIA drivers

If `nvidia-smi` doesn't detect the drivers, we might need to reinstall them.

- https://towardsdatascience.com/troubleshooting-gcp-cuda-nvidia-docker-and-keeping-it-running-d5c8b34b6a4c

If `dpkg -l | grep cuda` returns nothing (as in the cell below), no CUDA packages were installed through `apt`/`dpkg`, which is a bad sign.

In [14]:
!dpkg -l | grep nvidia

ii  libnvidia-container-tools             1.4.0-1                       amd64        NVIDIA container runtime library (command-line tools)
ii  libnvidia-container1:amd64            1.4.0-1                       amd64        NVIDIA container runtime library
ii  nvidia-container-runtime              3.5.0-1                       amd64        NVIDIA container runtime
ii  nvidia-container-toolkit              1.5.1-1                       amd64        NVIDIA container runtime hook
ii  nvidia-docker2                        2.6.0-1                       all          nvidia-docker CLI wrapper


In [15]:
!dpkg -l | grep cuda

In [16]:
!dmesg | grep NVIDIA

dmesg: read kernel buffer failed: Operation not permitted
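Since `dmesg` is not readable without elevated permissions here, another way to see whether the NVIDIA kernel module is loaded is to read `/proc/driver/nvidia/version`. This is a sketch under the assumption that the file is present and world-readable on this image whenever the module is loaded; the helper name is hypothetical.

```python
from pathlib import Path


def read_driver_version(path="/proc/driver/nvidia/version"):
    """Return the NVIDIA kernel module version text, or None if it cannot be read."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        # Missing file usually means the kernel module is not loaded.
        return None


version_text = read_driver_version()
print(version_text if version_text else "NVIDIA kernel module does not appear to be loaded")
```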


In [17]:
# !apt search nvidia-driver

## CUDA version

In [18]:
!nvcc -V

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Wed_Jul_22_19:09:09_PDT_2020
Cuda compilation tools, release 11.0, V11.0.221
Build cuda_11.0_bu.TC445_37.28845127_0
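The toolkit release reported by `nvcc` (11.0) is older than the CUDA version shown by `nvidia-smi` (11.2), which is the highest CUDA version the installed driver supports. A small sketch of that comparison follows; it parses the `nvcc -V` text copied from above, and the function name is an assumption.

```python
import re


def parse_nvcc_release(text):
    """Return the (major, minor) CUDA release from `nvcc -V` output, or None."""
    match = re.search(r"release (\d+)\.(\d+)", text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


# Line copied from the nvcc output above.
nvcc_output = "Cuda compilation tools, release 11.0, V11.0.221"
toolkit = parse_nvcc_release(nvcc_output)

# CUDA version from the nvidia-smi header above.
driver_max = (11, 2)

print(f"toolkit={toolkit}, driver supports up to {driver_max}")
print("toolkit within driver support:", toolkit is not None and toolkit <= driver_max)
```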


### Check `cudnn` & `cudatoolkit` versions in conda

This doesn't work because we don't have the right permissions. We might not need it, though, because by default the VMs created by Google don't have these drivers installed via conda either.

It's unclear what is missing, or what change breaks driver detection, when installing the requirements from my project.

In [22]:
# !conda search cudatoolkit

In [23]:
# !conda search cudnn

In [24]:
# this call works on my laptop, but fails on VM because of permissions
# !conda search cudnn --platform linux-64

# Check libraries

In [3]:
!pip list

Package                        Version
------------------------------ -------------------
absl-py                        0.10.0
aiohttp                        3.7.4.post0
ansiwrap                       0.8.4
anyio                          3.2.1
apache-beam                    2.28.0
appdirs                        1.4.4
argon2-cffi                    20.1.0
arrow                          1.1.1
asn1crypto                     1.4.0
astunparse                     1.6.3
async-generator                1.10
async-timeout                  3.0.1
attrs                          21.2.0
avro-python3                   1.9.2.1
backcall                       0.2.0
backports.functools-lru-cache  1.6.4
beatrix-jupyterlab             0.0.3
binaryornot                    0.4.4
black                          21.5b2
bleach                         3.3.0
blinker                        1.4
Bottleneck                     1.3.2
brotlipy                       0.7.0
cachetools                     4.2.2
caip-noteboo
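Rather than scanning the full listing, the versions of a few packages of interest can be looked up directly. This is a minimal sketch assuming Python 3.8+ for `importlib.metadata` (on Python 3.7 the `importlib_metadata` backport offers the same calls); the helper name and the package list are only examples.

```python
from importlib import metadata


def installed_versions(package_names):
    """Map each package name to its installed version, or None if not installed."""
    versions = {}
    for name in package_names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Example package names; adjust to the libraries the project depends on.
for name, version in installed_versions(["tensorflow", "absl-py"]).items():
    print(f"{name}: {version or 'not installed'}")
```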

In [4]:
conda list

# packages in environment at /opt/conda:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                        main  
_openmp_mutex             4.5                       1_gnu  
abseil-cpp                20210324.2           h9c3ff4c_0    conda-forge
absl-py                   0.10.0                   pypi_0    pypi
aiohttp                   3.7.4.post0      py37h5e8e339_0    conda-forge
alsa-lib                  1.2.3                h516909a_0    conda-forge
ansiwrap                  0.8.4                      py_0    conda-forge
anyio                     3.2.1            py37h89c1867_0    conda-forge
apache-beam               2.28.0                   pypi_0    pypi
appdirs                   1.4.4              pyh9f0ad1d_0    conda-forge
argon2-cffi               20.1.0           py37h5e8e339_2    conda-forge
arrow                     1.1.1              pyhd8ed1ab_0    conda-forge
arrow-cpp                 4.0.1           py37hac2aefd