# Purpose

2021-07-28
Let's install `"tensorflow-text == 2.3.0"` and see if the GPU is still detected after installing it.


---

Check whether GPUs are available and usable in the current Python environment.

**UPDATE:** With a fresh VM/notebook, we could see GPUs. This notebook runs the same code AFTER stopping and re-starting the VM. Maybe it's a problem with the way GCP handles the VM and not a problem created by installing new libraries?

Provenance:
- djb_01.03-test_gpus_available-instance-djb-subclu-inference-tf-2-3-20210630
- djb_01.031-test_gpus_available_AFTER_RESTART-instance-djb-subclu-inference-tf-2-3-20210630


# Debugging notes
## Make sure to set the correct venv/kernel in your notebook
The default `Python 3` kernel might not have the correct drivers.
<br>Instead, you might need to manually set the kernel to:
<br>`Python [conda env:root]`
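
A quick way to confirm which interpreter a kernel is actually running is to print it from inside the notebook. This is a minimal sketch; `describe_interpreter` is a hypothetical helper name, not part of any library. Based on the `which python` output further down, the expectation on this VM would be a path under `/opt/conda`.

```python
import sys


def describe_interpreter():
    """Return the paths that identify the Python environment of this kernel."""
    return {
        "executable": sys.executable,        # path of the running interpreter
        "prefix": sys.prefix,                # root directory of the active environment
        "version": sys.version.split()[0],   # e.g. '3.7.x'
    }


for key, value in describe_interpreter().items():
    print(f"{key}: {value}")
```

If `executable` points somewhere other than the conda root environment, the notebook is probably attached to a different kernel than intended.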

## Sometimes the best fix is `sudo reboot`
When in doubt, open a terminal and do `sudo reboot`.

For some reason, the NVIDIA drivers might not be loaded properly after shutting down a VM instance from the GUI:
- https://console.cloud.google.com/ai-platform/notebooks/list/instances?project=data-prod-165221
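
Before rebooting, it can help to check from Python whether the driver responds at all. The sketch below is an assumption about how one might wrap that check; `nvidia_smi_status` is a hypothetical helper, and it only relies on `nvidia-smi -L` (list GPUs) being available when the driver is loaded.

```python
import shutil
import subprocess


def nvidia_smi_status(timeout_s=10):
    """Return (ok, message) describing whether `nvidia-smi -L` runs cleanly."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return False, "nvidia-smi not found on PATH"
    try:
        result = subprocess.run(
            [exe, "-L"],  # -L lists the GPUs the driver can see
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,
            timeout=timeout_s,
        )
    except (OSError, subprocess.TimeoutExpired) as err:
        return False, f"nvidia-smi could not be run: {err}"
    if result.returncode != 0:
        # A non-zero exit code usually means the driver is not loaded.
        return False, result.stderr.strip() or result.stdout.strip()
    return True, result.stdout.strip()


ok, message = nvidia_smi_status()
print("driver OK" if ok else "driver NOT OK (consider `sudo reboot`)")
print(message)
```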


In [2]:
!which python

/opt/conda/bin/python


# Check libraries, BEFORE installing `"tensorflow-text == 2.3.0"`

In [3]:
# !pip list

In [4]:
# conda list

# Install `"tensorflow-text == 2.3.0"`

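We can't install `tensorflow-text==2.3.0` in the root environment because there is a numpy conflict *smh*: pip needs to replace numpy 1.19.5 with 1.18.5, and we don't have write permission to `/opt/conda` (see the `Permission denied` error below).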

So we have to add the `--user` flag. Even then we get some errors/warnings about incompatible libraries.
```bash
!pip install "tensorflow-text==2.3.0" --user
Installing collected packages: numpy, tensorflow-text
  WARNING: The scripts f2py, f2py3 and f2py3.7 are installed in '/home/jupyter/.local/bin' which is not on PATH.
  Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
  
  ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.

tfx-bsl 0.26.1 requires google-api-python-client<2,>=1.7.11, but you have google-api-python-client 2.10.0 which is incompatible.
tfx-bsl 0.26.1 requires pyarrow<0.18,>=0.17, but you have pyarrow 4.0.1 which is incompatible.
tensorflow-transform 0.26.0 requires pyarrow<0.18,>=0.17, but you have pyarrow 4.0.1 which is incompatible.
tensorflow-probability 0.11.0 requires cloudpickle==1.3, but you have cloudpickle 1.6.0 which is incompatible.
tensorflow-model-analysis 0.26.1 requires pyarrow<0.18,>=0.17, but you have pyarrow 4.0.1 which is incompatible.
tensorflow-data-validation 0.26.1 requires joblib<0.15,>=0.12, but you have joblib 1.0.1 which is incompatible.
tensorflow-data-validation 0.26.1 requires pyarrow<0.18,>=0.17, but you have pyarrow 4.0.1 which is incompatible.
apache-beam 2.28.0 requires httplib2<0.18.0,>=0.8, but you have httplib2 0.19.1 which is incompatible.
apache-beam 2.28.0 requires pyarrow<3.0.0,>=0.15.1, but you have pyarrow 4.0.1 which is incompatible.
apache-beam 2.28.0 requires typing-extensions<3.8.0,>=3.7.0, but you have typing-extensions 3.10.0.0 which is incompatible.

Successfully installed numpy-1.18.5 tensorflow-text-2.3.0

```
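
The conflict lines in this output follow a regular pattern, so they can be tallied to see which installed package triggers the most complaints. The sketch below is illustrative only: `CONFLICT_RE` and `parse_conflicts` are hypothetical names, the regular expression is an assumption based on the message format shown above, and only four of the ten lines are copied in to keep it short.

```python
import re
from collections import Counter

# A few conflict lines copied from the pip output above.
pip_output = """\
tfx-bsl 0.26.1 requires pyarrow<0.18,>=0.17, but you have pyarrow 4.0.1 which is incompatible.
tensorflow-probability 0.11.0 requires cloudpickle==1.3, but you have cloudpickle 1.6.0 which is incompatible.
apache-beam 2.28.0 requires httplib2<0.18.0,>=0.8, but you have httplib2 0.19.1 which is incompatible.
apache-beam 2.28.0 requires pyarrow<3.0.0,>=0.15.1, but you have pyarrow 4.0.1 which is incompatible.
"""

# Assumed shape: "<pkg> <ver> requires <spec>, but you have <pkg> <ver> which is incompatible."
CONFLICT_RE = re.compile(
    r"^(?P<package>\S+) (?P<version>\S+) requires (?P<spec>\S+), "
    r"but you have (?P<installed>\S+) (?P<installed_version>\S+) "
    r"which is incompatible\.$"
)


def parse_conflicts(text):
    """Return one dict per pip dependency-conflict line found in `text`."""
    conflicts = []
    for line in text.splitlines():
        match = CONFLICT_RE.match(line.strip())
        if match:
            conflicts.append(match.groupdict())
    return conflicts


conflicts = parse_conflicts(pip_output)
by_installed = Counter(c["installed"] for c in conflicts)
for name, count in by_installed.most_common():
    print(f"{name}: {count} conflicting requirement(s)")
```

In the full output above, `pyarrow 4.0.1` accounts for five of the ten conflict lines, so it is the first package to look at if these warnings ever turn into real failures.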


Error without the `--user` flag:
```bash
!pip install "tensorflow-text==2.3.0"
...
Installing collected packages: numpy, tensorflow-text
  Attempting uninstall: numpy
    Found existing installation: numpy 1.19.5
    Uninstalling numpy-1.19.5:
      Successfully uninstalled numpy-1.19.5
  Rolling back uninstall of numpy

...
ERROR: Could not install packages due to an OSError: [Errno 13] Permission denied: '/opt/conda/lib/python3.7/site-packages/numpy-1.18.5.dist-info/LICENSE.txt'
Consider using the `--user` option or check the permissions.
```
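
Because `--user` installs into a per-user directory instead of `/opt/conda`, it is worth checking which copy of a package Python will actually import. This is a minimal sketch using only the standard library; `module_location` and `is_in_user_site` are hypothetical helper names.

```python
import importlib.util
import site


def module_location(module_name):
    """Return the file a module would be imported from, or None if not found."""
    spec = importlib.util.find_spec(module_name)
    return None if spec is None else spec.origin


def is_in_user_site(path):
    """True if `path` lives under the per-user site-packages directory."""
    user_site = site.getusersitepackages()
    return path is not None and path.startswith(user_site)


print("user site-packages:", site.getusersitepackages())
for name in ["numpy", "tensorflow_text"]:
    location = module_location(name)
    print(f"{name}: {location} (user site: {is_in_user_site(location)})")
```

After the `--user` install, one would expect both packages to resolve to the user directory, which shadows the numpy 1.19.5 copy still present in the root environment.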

In [5]:
# !pip install "tensorflow-text==2.3.0" --user

In [6]:
# conda list

# Imports & notebook setup

In [7]:
%load_ext autoreload
%autoreload 2

In [14]:
from pprint import pprint
from pkg_resources import get_distribution

import tensorflow_text
import tensorflow as tf
from tensorflow.python.client import device_lib

# import subclu
# from subclu.utils.eda import (
#     setup_logging, notebook_display_config, print_lib_versions,
# )

# Print each library's installed version, tab-aligned
for lib_ in [tf, tensorflow_text]:
    sep_ = '\t' if len(lib_.__name__) > 7 else '\t\t'
    print(f"{lib_.__name__}{sep_}v: {get_distribution(lib_.__name__).version}")

tensorflow	v: 2.3.3
tensorflow_text	v: 2.3.0


In [4]:
# setup_logging()

# Check GPUs/XLA_GPUs recognized by TensorFlow/Python

NOTE: `GPU`s and `XLA_GPU`s are recognized as two different device types.

https://www.tensorflow.org/xla
> **XLA: Optimizing Compiler for Machine Learning**
> XLA (Accelerated Linear Algebra) is a domain-specific compiler for linear algebra that can accelerate TensorFlow models with potentially no source code changes.
> 
> The results are improvements in speed and memory usage: e.g. in BERT MLPerf submission using 8 Volta V100 GPUs using XLA has achieved a ~7x performance improvement and ~5x batch size improvement

Other sources
- https://stackoverflow.com/questions/52943489/what-is-xla-gpu-and-xla-cpu-for-tensorflow


## What device gets used for calculations?

It should be `GPU` or `XLA_GPU`

In [15]:
%%time

tf.debugging.set_log_device_placement(True)

# Create some tensors
a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
c = tf.matmul(a, b)

print(c)

Executing op MatMul in device /job:localhost/replica:0/task:0/device:GPU:0
tf.Tensor(
[[22. 28.]
 [49. 64.]], shape=(2, 2), dtype=float32)
CPU times: user 657 ms, sys: 383 ms, total: 1.04 s
Wall time: 1.03 s


## List devices

Expected GPU output
```
Built with CUDA? True

GPUs
===
Num GPUs Available: 2
GPU details:
[   PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'),
    PhysicalDevice(name='/physical_device:XLA_GPU:0', device_type='XLA_GPU')]

Built with CUDA? True

All devices:
===
Num devices: 4
Details:
[   name: "/device:CPU:0"
device_type: "CPU"

...

,
    name: "/device:XLA_GPU:0"
device_type: "XLA_GPU"
memory_limit: 17179869184
locality {
}
incarnation: 6215884038941287466
physical_device_desc: "device: XLA_GPU device"
,
    name: "/device:GPU:0"
device_type: "GPU"
memory_limit: 14676252416
locality {
  bus_id: 1
  links {
  }
}
incarnation: 8485125904456880156
physical_device_desc: "device: 0, name: Tesla T4, pci bus id: 0000:00:04.0, compute capability: 7.5"
]

```

In [16]:
l_phys_gpus = (
    tf.config.list_physical_devices('GPU') +
    tf.config.list_physical_devices('XLA_GPU')
)

print(
    f"\nBuilt with CUDA? {tf.test.is_built_with_cuda()}"
    f"\n\nGPUs\n==="
    f"\nNum GPUs Available: {len(l_phys_gpus)}"
    f"\nGPU details:"
)
pprint(l_phys_gpus, indent=4,)


Built with CUDA? True

GPUs
===
Num GPUs Available: 2
GPU details:
[   PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'),
    PhysicalDevice(name='/physical_device:XLA_GPU:0', device_type='XLA_GPU')]
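
Note that "Num GPUs Available: 2" adds the `GPU` and `XLA_GPU` entries together, so it appears to count the single Tesla T4 twice. Grouping the device names by type makes this easier to see. The sketch below works on the device name strings copied from the output above (so it runs without TensorFlow); `group_by_device_type` is a hypothetical helper.

```python
from collections import defaultdict

# Device names copied from the `tf.config.list_physical_devices` output
# in this notebook; on a live kernel they would come from TensorFlow itself.
device_names = [
    "/physical_device:GPU:0",
    "/physical_device:XLA_GPU:0",
]


def group_by_device_type(names):
    """Map device type (e.g. 'GPU') to the list of device indices."""
    grouped = defaultdict(list)
    for name in names:
        # Names look like '/physical_device:<TYPE>:<INDEX>'.
        _, device_type, index = name.rsplit(":", 2)
        grouped[device_type].append(int(index))
    return dict(grouped)


print(group_by_device_type(device_names))
# → {'GPU': [0], 'XLA_GPU': [0]}
```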


In [17]:
l_all_local_devices = device_lib.list_local_devices()
print(
    f"\nBuilt with CUDA? {tf.test.is_built_with_cuda()}"
    f"\n\nAll devices:\n==="
    f"\nNum devices: {len(l_all_local_devices)}"
    f"\nDetails:"
)
pprint(l_all_local_devices, indent=4,)


Built with CUDA? True

All devices:
===
Num devices: 4
Details:
[   name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 2031236102443331441
,
    name: "/device:XLA_CPU:0"
device_type: "XLA_CPU"
memory_limit: 17179869184
locality {
}
incarnation: 4577649183693087916
physical_device_desc: "device: XLA_CPU device"
,
    name: "/device:XLA_GPU:0"
device_type: "XLA_GPU"
memory_limit: 17179869184
locality {
}
incarnation: 12427452598662418691
physical_device_desc: "device: XLA_GPU device"
,
    name: "/device:GPU:0"
device_type: "GPU"
memory_limit: 303824896
locality {
  bus_id: 1
  links {
  }
}
incarnation: 8820588589459663090
physical_device_desc: "device: 0, name: Tesla T4, pci bus id: 0000:00:04.0, compute capability: 7.5"
]
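
The `memory_limit` reported for `/device:GPU:0` in this run (303824896) is much smaller than the one in the "Expected GPU output" block (14676252416). The values are in bytes, so converting them to MiB makes them comparable with the `nvidia-smi` memory figures in the next section. This is only a worked conversion; the interpretation that another process (for example, another notebook kernel) already holds most of the GPU memory is an assumption, although it would be consistent with `nvidia-smi` showing `15045MiB / 15109MiB` in use.

```python
BYTES_PER_MIB = 1024 ** 2


def bytes_to_mib(n_bytes):
    """Convert a byte count (as in `memory_limit`) to MiB."""
    return n_bytes / BYTES_PER_MIB


# `memory_limit` values for /device:GPU:0 copied from this notebook.
expected_limit = 14676252416  # from the "Expected GPU output" block
observed_limit = 303824896    # from this run

print(f"expected: {bytes_to_mib(expected_limit):,.2f} MiB")
print(f"observed: {bytes_to_mib(observed_limit):,.2f} MiB")
print(f"observed / expected: {observed_limit / expected_limit:.1%}")
```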


# Check NVIDIA CLI

First, do we even see the GPUs?

In [18]:
!lspci | grep 3D

00:04.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)


Then, are they recognized by the nvidia-smi tool?

In [19]:
!nvidia-smi

Thu Jul 29 04:20:20 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01    Driver Version: 460.73.01    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   75C    P0    33W /  70W |  15045MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

## Debug nvidia drivers

If `nvidia-smi` doesn't detect the drivers, we might need to reinstall them.

- https://towardsdatascience.com/troubleshooting-gcp-cuda-nvidia-docker-and-keeping-it-running-d5c8b34b6a4c

Getting nothing from `cuda` is bad... sigh

In [20]:
!dpkg -l | grep nvidia

ii  libnvidia-container-tools             1.4.0-1                       amd64        NVIDIA container runtime library (command-line tools)
ii  libnvidia-container1:amd64            1.4.0-1                       amd64        NVIDIA container runtime library
ii  nvidia-container-runtime              3.5.0-1                       amd64        NVIDIA container runtime
ii  nvidia-container-toolkit              1.5.1-1                       amd64        NVIDIA container runtime hook
ii  nvidia-docker2                        2.6.0-1                       all          nvidia-docker CLI wrapper


## CUDA version

In [21]:
!nvcc -V

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Wed_Jul_22_19:09:09_PDT_2020
Cuda compilation tools, release 11.0, V11.0.221
Build cuda_11.0_bu.TC445_37.28845127_0
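
`nvcc` reports CUDA 11.0 while `nvidia-smi` reported CUDA 11.2. These are not necessarily in conflict: `nvidia-smi` shows the highest CUDA version the installed driver supports, whereas `nvcc` shows the version of the installed toolkit. The sketch below parses both strings (copied from the outputs in this notebook) and checks that the toolkit is not newer than what the driver supports; `parse_version` is a hypothetical helper.

```python
import re

# Strings copied from the `nvidia-smi` and `nvcc -V` outputs in this notebook.
nvidia_smi_header = (
    "| NVIDIA-SMI 460.73.01    Driver Version: 460.73.01    CUDA Version: 11.2     |"
)
nvcc_output = "Cuda compilation tools, release 11.0, V11.0.221"


def parse_version(pattern, text):
    """Return (major, minor) ints captured by `pattern`, or None if no match."""
    match = re.search(pattern, text)
    return None if match is None else (int(match.group(1)), int(match.group(2)))


driver_cuda = parse_version(r"CUDA Version:\s*(\d+)\.(\d+)", nvidia_smi_header)
toolkit_cuda = parse_version(r"release\s*(\d+)\.(\d+)", nvcc_output)

print("driver supports up to CUDA:", driver_cuda)
print("installed toolkit (nvcc):  ", toolkit_cuda)
print("toolkit <= driver?", toolkit_cuda <= driver_cuda)
```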


### Check `cudnn` & `cudatoolkit` versions in conda

This doesn't work because we don't have the right permissions. We might not need it, though, because by default the VMs created by Google don't have these drivers installed via conda either.

It's unclear what is missing, or what change breaks driver detection, when installing the requirements from my project.

In [22]:
# !conda search cudatoolkit

In [23]:
# !conda search cudnn

In [24]:
# this call works on my laptop, but fails on VM because of permissions
# !conda search cudnn --platform linux-64