# Tensor Core

A **Tensor Core** is a computing unit in NVIDIA GPUs that multiplies two matrices and then adds a third matrix to the result, providing hardware-accelerated **General Matrix Multiplication** (GEMM). To leverage Tensor Cores in TensorFlow and PyTorch, you need the right hardware, software versions, and configuration. Tensor Cores are specialised processing units available in NVIDIA GPUs from the Volta architecture onwards. For further details, see [hardware/Neural Processing Unit](../80_hardware/gpu.md).
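
The multiply-then-add operation described above can be written out as a small reference sketch. This is plain Python that only illustrates the arithmetic `D = A x B + C`; the function name `gemm` and the 2x2 example matrices are made up for illustration, and the code says nothing about how the hardware schedules the work.

```python
def gemm(a, b, c):
    """Return D = A x B + C for matrices stored as nested lists."""
    rows, inner, cols = len(a), len(b), len(b[0])
    if any(len(row) != inner for row in a):
        raise ValueError("inner dimensions of A and B must match")
    if len(c) != rows or any(len(row) != cols for row in c):
        raise ValueError("C must have the same shape as A x B")
    # Each output element is a dot product (multiply) plus one element of C (add).
    return [
        [sum(a[i][k] * b[k][j] for k in range(inner)) + c[i][j] for j in range(cols)]
        for i in range(rows)
    ]


# Hypothetical example matrices
A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
C = [[1.0, 1.0], [1.0, 1.0]]

D = gemm(A, B, C)
print(D)  # [[20.0, 23.0], [44.0, 51.0]]
```

A Tensor Core performs this kind of fused operation on small matrix tiles in hardware, which is why deep learning layers that reduce to matrix multiplication benefit from it.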

To use full precision (FP32) in TensorFlow and PyTorch, you typically don't need to do anything special, as FP32 is the default precision used by most deep learning frameworks. 

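In AI model training, memory is often the bottleneck, so lower-precision Tensor Core modes are frequently used instead. A full treatment is beyond this notebook's scope: the TensorFlow example below trains in FP32 only, and the PyTorch example gives a brief demonstration of automatic mixed precision.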

## Prerequisites
1. **Hardware**: Ensure you have an NVIDIA GPU that supports Tensor Cores (Volta, Turing, Ampere, or newer architectures).
2. **CUDA Toolkit**: Install the CUDA toolkit version supported by your GPU.
3. **cuDNN Library**: Install the corresponding cuDNN library version.
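
The hardware prerequisite can be checked programmatically. The sketch below assumes the compute capability threshold of 7.0 quoted later in this notebook, and it assumes PyTorch is already installed if you want an actual device query; the helper names `supports_tensor_cores` and `detect_capability` are hypothetical. Without PyTorch or a CUDA device it only reports that no check was possible.

```python
def supports_tensor_cores(major: int, minor: int = 0) -> bool:
    """Tensor Cores are assumed present from compute capability 7.0 (Volta) onwards."""
    return (major, minor) >= (7, 0)


def detect_capability():
    """Return (major, minor) of GPU 0, or None if PyTorch or CUDA is unavailable."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(0)


capability = detect_capability()
if capability is None:
    print("No CUDA GPU (or no PyTorch) detected; cannot check Tensor Core support.")
else:
    major, minor = capability
    print(f"Compute capability {major}.{minor}, "
          f"Tensor Cores expected: {supports_tensor_cores(major, minor)}")
```

For instance, the Quadro RTX 4000 used in this notebook reports compute capability 7.5, so `supports_tensor_cores(7, 5)` returns `True`.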

In [2]:
!mamba install nvidia/label/cuda-11.8.0::cuda-toolkit -y



Looking for: ['nvidia/label/cuda-11.8.0::cuda-toolkit']

conda-forge/linux-64                                    

## TensorFlow
By default, TensorFlow uses Tensor Cores whenever possible on GPUs with [compute capability >= 7.0](https://developer.nvidia.com/cuda-gpus). To save GPU memory and speed up computation, you can manually switch to [mixed precision](https://keras.io/api/mixed_precision/).

### Install TensorFlow with GPU support

In [1]:
# Pin the library version so that TensorFlow and PyTorch stay compatible
# This step may take some time to download
!pip install tensorflow[and-cuda]==2.14.*
!pip install nvidia-cudnn-cu11==8.7.0.84

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com

### Fix Environment Path for CUDA Library
If cuDNN fails to load from the conda environment, you may need to run the following commands in a terminal to set the library paths properly.

❗The conda environment file automatically chooses mutually compatible versions of TensorFlow and PyTorch under the same CUDA (11.8) and cuDNN (8.7) settings. If you see CUDA or cuDNN version inconsistencies because the machine's base CUDA libraries were loaded by mistake, use the following conda environment settings to override the system-wide paths:

```bash
mkdir -p $CONDA_PREFIX/etc/conda/activate.d
echo 'CUDNN_PATH=$(dirname $(python -c "import nvidia.cudnn;print(nvidia.cudnn.__file__)"))' >> $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
echo 'export LD_LIBRARY_PATH=$CONDA_PREFIX/lib/:$CUDNN_PATH/lib:$LD_LIBRARY_PATH' >> $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
source $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
```

Restart the Jupyter kernel at this point, and continue from the cells below.

### Training a TensorFlow model with FP32 precision

In [2]:
import tensorflow as tf

In [3]:
# Define a simple model
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu', input_shape=(784,)),
    tf.keras.layers.Dense(10)
])

# Compile the model with the optimizer and loss function
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])

# Example dataset with MNIST
(train_images, train_labels), (test_images, test_labels) = tf.keras.datasets.mnist.load_data()
# Using float32 inputs keeps training at full computational precision on Tensor Cores
train_images = train_images.reshape(-1, 784).astype('float32') / 255
test_images = test_images.reshape(-1, 784).astype('float32') / 255

# Train the model
model.fit(train_images, train_labels, epochs=5, validation_data=(test_images, test_labels))

Epoch 1/5


2024-07-09 09:36:52.407937: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7f28b001f660 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2024-07-09 09:36:52.407961: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Quadro RTX 4000, Compute Capability 7.5
2024-07-09 09:36:52.412426: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:269] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
2024-07-09 09:36:52.547271: I ./tensorflow/compiler/jit/device_compiler.h:186] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.


Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.src.callbacks.History at 0x7f295872a620>

TensorFlow also allows Tensor Core usage to be controlled through environment variables. See the official documentation for the settings: https://docs.nvidia.com/deeplearning/frameworks/tensorflow-user-guide/index.html#tf_disable_tensor_op_math
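
For comparison with the FP32 run above, the mixed precision option linked at the start of this section can be switched on through the Keras policy API. The following is a minimal sketch rather than a tuned recipe: the function name `build_mixed_precision_model` is hypothetical, the layer sizes simply mirror the FP32 model, and keeping the final layer in `float32` is a common choice for numerical stability, not something this notebook prescribes. The global policy is restored afterwards so the rest of the notebook stays in FP32.

```python
def build_mixed_precision_model():
    """Build the same MNIST model with float16 compute and float32 variables."""
    import tensorflow as tf  # imported here so the sketch loads without TensorFlow

    # Layers created while this policy is active compute in float16.
    tf.keras.mixed_precision.set_global_policy("mixed_float16")
    try:
        model = tf.keras.Sequential([
            tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
            # Keep the output logits in float32 for a numerically stable loss.
            tf.keras.layers.Dense(10, dtype="float32"),
        ])
    finally:
        # Restore the default so later cells are unaffected.
        tf.keras.mixed_precision.set_global_policy("float32")
    return model
```

Calling `build_mixed_precision_model()` returns a model that can be compiled and fitted exactly like the FP32 model above; Keras applies loss scaling automatically when a model built under the `mixed_float16` policy is compiled.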

## PyTorch

As with TensorFlow, PyTorch can use Tensor Cores for mixed-precision training, which it provides through [AMP (Automatic Mixed Precision)](https://pytorch.org/docs/stable/amp.html).

### Install PyTorch with GPU support

In [7]:
# Keep an older CUDA and PyTorch version for environment consistency with TensorFlow
!pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118

Looking in indexes: https://download.pytorch.org/whl/cu118, https://pypi.ngc.nvidia.com
Collecting torchvision
  Downloading https://download.pytorch.org/whl/cu118/torchvision-0.18.1%2Bcu118-cp310-cp310-linux_x86_64.whl (6.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.3/6.3 MB 36.7 MB/s eta 0:00:00
Installing collected packages: torchvision
Successfully installed torchvision-0.18.1+cu118

In [8]:
# header import
import torch
import torch.nn as nn
import torch.optim as optim
from torch.cuda.amp import GradScaler, autocast
from torchvision import datasets, transforms

### Define a simple neural network

In [9]:
class SimpleNet(nn.Module):
    def __init__(self):
        super(SimpleNet, self).__init__()
        self.fc1 = nn.Linear(784, 128)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = x.view(-1, 784)
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        return x

In [10]:
# Hyperparameters
batch_size = 64
learning_rate = 0.01
epochs = 3

### Download training data and build PyTorch loader

In [11]:
# Data loaders
transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.5,), (0.5,))])
train_dataset = datasets.MNIST(root='../../data', train=True, transform=transform, download=True)
train_loader = torch.utils.data.DataLoader(dataset=train_dataset, batch_size=batch_size, shuffle=True)

Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
Failed to download (trying next):
HTTP Error 403: Forbidden

Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-images-idx3-ubyte.gz
Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-images-idx3-ubyte.gz to ../../data/MNIST/raw/train-images-idx3-ubyte.gz


100.0%


Extracting ../../data/MNIST/raw/train-images-idx3-ubyte.gz to ../../data/MNIST/raw

Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
Failed to download (trying next):
HTTP Error 403: Forbidden

Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-labels-idx1-ubyte.gz
Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-labels-idx1-ubyte.gz to ../../data/MNIST/raw/train-labels-idx1-ubyte.gz


100.0%


Extracting ../../data/MNIST/raw/train-labels-idx1-ubyte.gz to ../../data/MNIST/raw

Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
Failed to download (trying next):
HTTP Error 403: Forbidden

Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-images-idx3-ubyte.gz
Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-images-idx3-ubyte.gz to ../../data/MNIST/raw/t10k-images-idx3-ubyte.gz


100.0%


Extracting ../../data/MNIST/raw/t10k-images-idx3-ubyte.gz to ../../data/MNIST/raw

Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz
Failed to download (trying next):
HTTP Error 403: Forbidden

Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-labels-idx1-ubyte.gz
Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-labels-idx1-ubyte.gz to ../../data/MNIST/raw/t10k-labels-idx1-ubyte.gz


100.0%

Extracting ../../data/MNIST/raw/t10k-labels-idx1-ubyte.gz to ../../data/MNIST/raw






### Setup NN model, loss function, and optimizer

In [12]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = SimpleNet().to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=learning_rate)

### Enable Automatic Mixed Precision (AMP)

In [13]:
# AMP scaler
scaler = GradScaler()
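
To see why a gradient scaler is paired with mixed precision, the sketch below emulates half-precision rounding with the standard library only. It is an illustration under stated assumptions: the helper `to_fp16`, the gradient value `1e-8`, and the scale factor `2**16` are chosen for the example and are not taken from this notebook's training run.

```python
import struct


def to_fp16(value: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


LOSS_SCALE = 2.0 ** 16  # illustrative scale factor (an assumption)
tiny_grad = 1e-8        # hypothetical gradient too small for float16

# Stored directly in float16, the gradient underflows to zero.
unscaled = to_fp16(tiny_grad)

# Scaling first moves the value into float16's representable range;
# dividing by the scale afterwards (in higher precision) recovers it.
scaled = to_fp16(tiny_grad * LOSS_SCALE)
recovered = scaled / LOSS_SCALE

print(f"float16 without scaling: {unscaled}")
print(f"float16 with scaling, then unscaled: {recovered:.3e}")
```

`GradScaler` automates this idea: it multiplies the loss before `backward()`, unscales the gradients before the optimizer step, and adjusts the scale factor when overflows occur.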

### Training loop

In [14]:
for epoch in range(epochs):
    model.train()
    running_loss = 0.0
    for i, (images, labels) in enumerate(train_loader):
        images, labels = images.to(device), labels.to(device)

        optimizer.zero_grad()

        # Forward pass with autocast
        with autocast():
            outputs = model(images)
            loss = criterion(outputs, labels)

        # Backward pass and optimization with scaler
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()

        running_loss += loss.item()
        if (i + 1) % 100 == 0:
            print(f"Epoch [{epoch+1}/{epochs}], Step [{i+1}/{len(train_loader)}], Loss: {running_loss / 100:.4f}")
            running_loss = 0.0

print("Finished Training")

Epoch [1/3], Step [100/938], Loss: 1.8893
Epoch [1/3], Step [200/938], Loss: 1.1836
Epoch [1/3], Step [300/938], Loss: 0.7898
Epoch [1/3], Step [400/938], Loss: 0.6417
Epoch [1/3], Step [500/938], Loss: 0.5598
Epoch [1/3], Step [600/938], Loss: 0.4882
Epoch [1/3], Step [700/938], Loss: 0.4550
Epoch [1/3], Step [800/938], Loss: 0.4444
Epoch [1/3], Step [900/938], Loss: 0.4150
Epoch [2/3], Step [100/938], Loss: 0.4049
Epoch [2/3], Step [200/938], Loss: 0.3767
Epoch [2/3], Step [300/938], Loss: 0.3811
Epoch [2/3], Step [400/938], Loss: 0.3715
Epoch [2/3], Step [500/938], Loss: 0.3740
Epoch [2/3], Step [600/938], Loss: 0.3542
Epoch [2/3], Step [700/938], Loss: 0.3510
Epoch [2/3], Step [800/938], Loss: 0.3506
Epoch [2/3], Step [900/938], Loss: 0.3370
Epoch [3/3], Step [100/938], Loss: 0.3311
Epoch [3/3], Step [200/938], Loss: 0.3317
Epoch [3/3], Step [300/938], Loss: 0.3367
Epoch [3/3], Step [400/938], Loss: 0.3295
Epoch [3/3], Step [500/938], Loss: 0.3246
Epoch [3/3], Step [600/938], Loss:

## Additional Tips
- **Performance Monitoring**: Use `nvidia-smi` to monitor overall GPU usage, and an NVIDIA profiler such as `nvprof` or `nsight` to check whether Tensor Cores are actually being utilised.

In [5]:
!nvidia-smi

Mon Jul  8 23:34:04 2024       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.60.11    Driver Version: 525.60.11    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Quadro RTX 4000     Off  | 00000000:65:00.0 Off |                  N/A |
| 30%   32C    P8     5W / 125W |    360MiB /  8192MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

- **Profile Your Code**: Use TensorFlow's or PyTorch's built-in profilers to understand where your model spends most of its time and ensure mixed precision is being applied correctly.
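
As one way to follow the profiling tip, a single training step can be wrapped in PyTorch's built-in profiler. This is a rough sketch: the function name `profile_training_step` is hypothetical, it assumes PyTorch is installed, and it expects a model, loss function, optimizer, and one batch like those defined earlier in this notebook. It profiles a plain FP32 step; whether Tensor Core kernels appear in the report depends on your GPU and settings.

```python
def profile_training_step(model, images, labels, criterion, optimizer):
    """Profile one forward/backward/optimizer step and return a summary table."""
    import torch
    from torch.profiler import ProfilerActivity, profile

    # Always record CPU activity; add CUDA kernels when a GPU is present.
    activities = [ProfilerActivity.CPU]
    use_cuda = torch.cuda.is_available()
    if use_cuda:
        activities.append(ProfilerActivity.CUDA)

    with profile(activities=activities) as prof:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()

    # Sort by GPU time when available, otherwise by CPU time.
    sort_key = "cuda_time_total" if use_cuda else "cpu_time_total"
    return prof.key_averages().table(sort_by=sort_key, row_limit=10)
```

For example, take one batch from `train_loader`, move `images` and `labels` to `device`, and run `print(profile_training_step(model, images, labels, criterion, optimizer))` to list the ten most expensive operations.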

By following these steps, you should be able to take advantage of Tensor Cores in both TensorFlow and PyTorch, significantly accelerating the training process for deep learning models.

### FAQ

1. **This looks no different from normal TF/PyTorch code, so why mention it specifically?**

    By default, TF and PyTorch both run in FP32 (full precision) mode. When computational resources are limited (e.g. insufficient GPU memory or limited training time), you can switch to a mixed-precision model and use non-default precisions.

2. **Is FP32 always necessary for the best AI model?**

    No. Research has shown that AI training at lower precision can be significantly faster on low-end GPUs while reaching very similar model performance.

3. **So why is training precision discussed in a section on Tensor Cores?**

    Tensor Cores for different precisions are in fact physical computing units within the GPU. With older GPU models and deep learning packages, choosing full precision mode may still fall back to mixed-precision mode automatically. This is caused by the limited number of high-precision Tensor Cores in older or low-end GPUs.

    After we upgraded from V100 to A100 GPUs, our model no longer retained its original performance until we explicitly switched FP32 computation back to the CUDA cores. This phenomenon is known as the [precision loss problem](https://docs.nvidia.com/deeplearning/performance/mixed-precision-training/index.html).

4. **When is the best time to use non-FP32 precision?**

    Consider smart microscopy, i.e. performing bioimage analysis simultaneously with image acquisition, where computation is often offloaded to weaker GPUs. To keep up with the image acquisition speed, floating-point precision is one factor that can be sacrificed to speed up automated bioimage analysis.

5. **Can I simply use a newer version of TF/PyTorch to solve the problem?**

    - Yes, if you are training a new model from scratch.
    - However, some older AI models are very version-specific. Likewise, to fit both TF and PyTorch into the same environment, you may have to stay on older versions of the packages. One known example is running [UNet](https://github.com/lmb-freiburg/Unet-Segmentation?tab=readme-ov-file) with a pretrained cell segmentation model.
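
FAQ item 3 mentions explicitly forcing FP32 after moving to A100 GPUs. One possible cause, which is an assumption here and not confirmed by the text above, is the TF32 format that Ampere Tensor Cores can use for FP32 matrix maths. Under that assumption, the sketch below asks whichever frameworks are installed to avoid TF32; the function name `disable_tf32` is hypothetical.

```python
def disable_tf32():
    """Turn off TF32 math in the installed frameworks; return the ones configured."""
    configured = []

    try:
        import torch
        # Use full FP32 for matrix multiplications and cuDNN convolutions.
        torch.backends.cuda.matmul.allow_tf32 = False
        torch.backends.cudnn.allow_tf32 = False
        configured.append("torch")
    except ImportError:
        pass

    try:
        import tensorflow as tf
        # TensorFlow allows TF32 by default on GPUs that support it.
        tf.config.experimental.enable_tensor_float_32_execution(False)
        configured.append("tensorflow")
    except ImportError:
        pass

    return configured


print("TF32 disabled for:", disable_tf32())
```

Run this before building or training a model. On GPUs without TF32 support, such as the V100 or the Quadro RTX 4000 used here, the settings have no effect.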

## Further Reading
- [Keras Mixed Precision](https://keras.io/api/mixed_precision/)
- [PyTorch Automatic Mixed Precision](https://pytorch.org/tutorials/recipes/recipes/amp_recipe.html)