In this notebook, we'll cover the following topics divided into sections:

**1: Utilizing Multiple GPUs**
We'll explore how to leverage multiple GPUs for your neural network, employing either data parallelism or model parallelism for efficient training.

**2: Automating GPU Selection**
We'll discuss techniques to automate GPU selection when creating new objects, ensuring efficient resource utilization in systems with multiple GPUs.

**3: Diagnosing Memory Issues**
We'll cover techniques for diagnosing and analyzing memory-related issues that may arise during model training, providing insights into effective memory utilization and troubleshooting common problems.

#### Moving Tensors between CPUs and GPUs
To move a tensor between the CPU and a GPU in PyTorch, call its `to()` or `cuda()` method. Passing the index of a GPU (or a `torch.device`) selects the target device, and the method returns the tensor on that device.

In [4]:
import torch
import torch.nn as nn

In [2]:
if torch.cuda.is_available():
    dev = "cuda:0"
else:
    dev = "cpu"

device = torch.device(dev)

a = torch.zeros(4, 3)
a = a.to(device)  # on a GPU machine, a.to(0) or a.cuda(0) is equivalent

Similarly, this functionality extends to moving `nn.Module` objects between GPUs as well.

In [5]:
class myNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(5, 1)

    def forward(self, x):
        return self.net(x)

clf = myNetwork()
clf.to(device)

myNetwork(
  (net): Linear(in_features=5, out_features=1, bias=True)
)
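
One difference is easy to miss here: `Tensor.to()` returns the tensor on the target device and leaves the original where it was, while `nn.Module.to()` moves the module's parameters in place. The following is a minimal sketch of that difference; the variable names are illustrative, and the fallback to the CPU is an assumption made so that it also runs on a machine without a GPU.

```python
import torch
import torch.nn as nn

# Assumption: fall back to the CPU when no GPU is present
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

t = torch.zeros(4, 3)
moved = t.to(device)      # result lives on `device`; `t` itself stays on the CPU

layer = nn.Linear(5, 1)
layer.to(device)          # parameters are moved in place, no reassignment needed

# Modules have no .device attribute, so inspect one of the parameters instead
param_device = next(layer.parameters()).device
print(t.device, moved.device, param_device)
```

For tensors, always keep the return value (`t = t.to(device)`); for modules, `model.to(device)` on its own is enough.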

#### Retrieving Device Information
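To retrieve the device of a tensor, use the `get_device()` method, which returns the index of the GPU the tensor lives on. It is meant for GPU tensors: on a CPU tensor, current PyTorch releases return -1, so the `device` attribute is the safer choice in device-agnostic code.

In [6]:
dev = a.get_device()                    # GPU index, e.g. 0 (-1 for a CPU tensor)
b = torch.tensor(a.shape).to(a.device)  # a.device works for CPU and GPU tensors alike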
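
As a small illustration of the point about CPU tensors, the sketch below compares `get_device()` with the `device` attribute. The helper `describe` is a hypothetical name introduced only for this example.

```python
import torch

def describe(t):
    """Return (device type, device index) for any tensor; the index is None on the CPU."""
    return t.device.type, t.device.index

cpu_tensor = torch.zeros(2, 3)
cpu_info = describe(cpu_tensor)          # ("cpu", None)
cpu_index = cpu_tensor.get_device()      # -1 on a CPU tensor

if torch.cuda.is_available():
    gpu_tensor = cpu_tensor.to("cuda:0")
    gpu_info = describe(gpu_tensor)      # ("cuda", 0)
    gpu_index = gpu_tensor.get_device()  # 0
```

Because `tensor.device` can be passed straight back into `to()`, it avoids special-casing the -1 value.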

Additionally, `torch.cuda.set_device()` sets the default GPU, which is the one used when `cuda()` is called without an index.

In [7]:
torch.cuda.set_device(0)

tens = torch.Tensor(3, 4).cuda()
tens.get_device()

0

#### The new_* functions

The `new_*` methods, available since PyTorch `0.4`, offer a convenient way to create tensors based on existing ones. When a method like `new_ones` is called on a tensor, it returns a new tensor with the same data type and on the same device as the tensor it was called on.

In [8]:
ones = torch.ones((2,)).cuda(0)

# Create a tensor of ones of size (3,4) on the same device as "ones"
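newOnes = ones.new_ones((3,4))

# torch.randn, by contrast, creates a CPU tensor of the default dtype unless dtype and device are passed
randTensor = torch.randn(2,4)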

These `new_` functions provide a quick and efficient way to generate new tensors with desired properties while maintaining consistency with existing tensors. 
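
To make the contrast concrete, here is a minimal sketch with a few more `new_*` methods next to the plain factory functions. The `float64` dtype is an arbitrary choice that makes the inherited dtype visible, and the CPU fallback is an assumption so that the code runs without a GPU.

```python
import torch

base = torch.ones(2, dtype=torch.float64)
if torch.cuda.is_available():
    base = base.cuda(0)

# new_* methods inherit dtype and device from `base`
zeros = base.new_zeros((3, 4))
sevens = base.new_full((2, 2), 7.0)
from_data = base.new_tensor([1, 2, 3])

# Factory functions ignore `base` unless dtype and device are passed explicitly
plain = torch.randn(2, 4)
matched = torch.randn(2, 4, dtype=base.dtype, device=base.device)

# *_like functions copy shape, dtype and device from their argument
like = torch.randn_like(zeros)
```

For random tensors, which have no `new_*` counterpart, `torch.randn_like` or explicit `dtype=`/`device=` arguments serve the same purpose.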

#### Using Multiple GPUs

There are two primary ways to utilize multiple GPUs in PyTorch:

**Data Parallelism**
Data Parallelism involves dividing batches into smaller batches and processing them in parallel across multiple GPUs. In PyTorch, this is achieved through the `nn.DataParallel` class. You initialize a `nn.DataParallel` object with an `nn.Module` object representing your network and a list of GPU IDs across which the batches are to be parallelized.

In [9]:
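myNet = myNetwork()
myNet.to(0)                    # the module must live on device_ids[0] before the forward pass
parallel_net = nn.DataParallel(myNet, device_ids=[0])

inputs = torch.randn(8, 5)     # random batch of 8 samples; DataParallel splits it along dim 0
inputs = inputs.to(0)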

predictions = parallel_net(inputs)
loss = (1 - predictions).mean()
loss.backward()

One issue with DataParallel is that it can lead to asymmetrical load on one GPU (the main node). This can be mitigated by computing the loss during the forward pass or by implementing a parallel loss function layer.
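
One way to compute the loss during the forward pass is to wrap the network and the loss function in a single module, so that each GPU computes the loss for its own slice of the batch and only small loss tensors are gathered on the main GPU. The sketch below illustrates the idea; the wrapper `NetWithLoss`, the MSE loss, and the tensor shapes are assumptions made for the example, not part of the original notebook.

```python
import torch
import torch.nn as nn

class NetWithLoss(nn.Module):
    """Hypothetical wrapper: runs the model and the loss in one forward pass."""
    def __init__(self, model, loss_fn):
        super().__init__()
        self.model = model
        self.loss_fn = loss_fn

    def forward(self, inputs, targets):
        return self.loss_fn(self.model(inputs), targets)

# Assumption: fall back to the CPU so the sketch also runs without a GPU
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

wrapped = NetWithLoss(nn.Linear(5, 1), nn.MSELoss()).to(device)
parallel = nn.DataParallel(wrapped)      # uses every visible GPU, or none on a CPU-only machine

inputs = torch.randn(8, 5, device=device)
targets = torch.randn(8, 1, device=device)

# With several GPUs, one loss value per replica comes back; average them
loss = parallel(inputs, targets).mean()
loss.backward()
```

Averaging the per-replica losses matches the loss over the whole batch only when every GPU receives the same number of samples.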

**Model Parallelism**
Model Parallelism involves breaking the neural network into smaller sub-networks and executing these sub-networks on different GPUs. This approach is useful when the network is too large to fit inside a single GPU. Implementing Model Parallelism in PyTorch is straightforward:

1. Ensure the input and the network are always on the same device.

2. Use `to()` and `cuda()` to move activations between GPUs. Both are tracked by autograd, so gradients are copied back across GPUs during the backward pass.

In [10]:
class model_parallel(nn.Module):
    def __init__(self):
        super().__init__()
        self.sub_network1 = nn.Linear(100, 32)    # This part stays on GPU 0
        self.sub_network2 = nn.Linear(32, 10)     # This part stays on GPU 1

        self.sub_network1.cuda(0)
        self.sub_network2.cuda(1)

    def forward(self, x):
        x = x.cuda(0)
        x = self.sub_network1(x)
        x = x.cuda(1)
        x = self.sub_network2(x)
        return x

x = torch.randn(100)          # Random input
x = x.to(0)

net = model_parallel()        # No need to put it on GPUs as that has been taken care of in the init function

loss = (1 - net(x)).mean()
loss.backward()

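On a machine with fewer than two GPUs, device `cuda:1` does not exist, and the cell above fails with:

RuntimeError: CUDA error: invalid device ordinal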
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
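
To avoid the invalid device ordinal error on machines with fewer than two GPUs, the devices can be chosen at run time instead of being hard-coded. The following is a sketch under that assumption; `pick_two_devices` and `TwoStageNet` are hypothetical names, and with a single device both stages simply share it.

```python
import torch
import torch.nn as nn

def pick_two_devices():
    """Return two devices: distinct GPUs if available, otherwise the same device twice."""
    n = torch.cuda.device_count()
    if n >= 2:
        return torch.device("cuda:0"), torch.device("cuda:1")
    if n == 1:
        return torch.device("cuda:0"), torch.device("cuda:0")
    return torch.device("cpu"), torch.device("cpu")

class TwoStageNet(nn.Module):
    def __init__(self, dev0, dev1):
        super().__init__()
        self.dev0, self.dev1 = dev0, dev1
        self.stage1 = nn.Linear(100, 32).to(dev0)
        self.stage2 = nn.Linear(32, 10).to(dev1)

    def forward(self, x):
        x = self.stage1(x.to(self.dev0))     # input follows the first stage
        return self.stage2(x.to(self.dev1))  # activations follow the second stage

dev0, dev1 = pick_two_devices()
net = TwoStageNet(dev0, dev1)

out = net(torch.randn(4, 100))
loss = (1 - out).mean()
loss.backward()   # autograd routes gradients back through the device copies
```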

These methods enable efficient utilization of multiple GPUs in PyTorch, offering scalability and performance improvements for deep learning tasks.

#### Dealing with Memory Losses using the `del` keyword

PyTorch frees memory aggressively, but only once no Python reference to the object exists. Python loop variables outlive the loop, so the last tensor created inside a loop is still alive after the loop finishes:

In [11]:
for x in range(10):
    tensor = torch.randn(1, 4)

print(tensor)

tensor([[-0.0400,  0.6801,  0.3101,  1.3898]])


A good practice is to delete such variables with the `del` keyword once they are no longer needed. This drops the reference and lets PyTorch free the memory.

In [12]:
net = myNetwork()
opt = torch.optim.SGD(net.parameters(), lr=0.01)

inp = torch.randn(5,)

for x in range(10):
    out = net(inp)
    loss = (1 - out).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

print(out, loss)                      # these variables still exist
del out, loss                         # Free the memory taken by these variables

tensor([0.0464], grad_fn=<AddBackward0>) tensor(0.9536, grad_fn=<MeanBackward0>)


#### Using Python Data Types Instead Of 1-D Tensors

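Training loops often accumulate values in order to compute metrics. Done carelessly in PyTorch, this uses more memory than necessary: adding a loss tensor that requires gradients to a running total keeps the computation graph of every iteration alive. Accumulating a plain Python number obtained with `.item()` avoids this.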

In [13]:
total_loss = 0

for x in range(10):
    iter_loss = torch.randn(3, 4).mean()
    iter_loss.requires_grad = True     # losses are supposed to be differentiable
    total_loss += iter_loss            # use total_loss += iter_loss.item() instead
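
The difference between the two ways of accumulating is visible on the running total itself. In this minimal sketch (the squared-weights loss is a stand-in chosen for illustration), the tensor total carries a `grad_fn` that links back to every iteration, while the `.item()` total is an ordinary float.

```python
import torch

total_as_tensor = 0
total_as_float = 0.0

for _ in range(10):
    w = torch.randn(3, 4, requires_grad=True)
    iter_loss = (w ** 2).mean()                      # differentiable stand-in for a loss

    total_as_tensor = total_as_tensor + iter_loss    # keeps each iteration's graph alive
    total_as_float += iter_loss.item()               # plain float, graph can be freed

print(type(total_as_tensor), total_as_tensor.grad_fn is not None)
print(type(total_as_float))
```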

#### Using `torch.no_grad()` for inference
Whenever you run inference, or any other operation that does not require backpropagation of gradients, put the code inside the `torch.no_grad()` context manager. Inside it, autograd records no graph, so intermediate results are not kept around for a backward pass and memory use drops.

In [14]:
net = myNetwork()
inp = torch.randn(5,)

with torch.no_grad():
    out = net(inp)
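
A quick way to see what the context manager changes is to compare the outputs of the same forward pass with and without it. This is a small sketch; a bare `nn.Linear` layer stands in for the network.

```python
import torch
import torch.nn as nn

net = nn.Linear(5, 1)
inp = torch.randn(5)

tracked = net(inp)            # autograd records the graph

with torch.no_grad():
    untracked = net(inp)      # same values, but no graph is recorded

print(tracked.requires_grad, tracked.grad_fn is not None)
print(untracked.requires_grad, untracked.grad_fn is None)
```

Calling `backward()` on a result computed under `torch.no_grad()` raises an error, so the context manager belongs around inference code only.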

#### Emptying CUDA cache
While PyTorch aggressively frees up memory, a PyTorch process may not give the memory back to the OS even after you delete your tensors. The memory is cached so that it can be handed to newly created tensors without requesting more from the OS. Calling `torch.cuda.empty_cache()` releases the unused cached memory.

In [17]:
import torch
from GPUtil import showUtilization as gpu_usage

print("Initial GPU Usage")
gpu_usage()                             

tensorList = []
for x in range(10):
    tensorList.append(torch.randn(10000000, 10).cuda())   # reduce the size of tensor if you are getting OOM

print("GPU Usage after allocating a bunch of Tensors")
gpu_usage()

del tensorList

print("GPU Usage after deleting the Tensors")
gpu_usage()  

print("GPU Usage after emptying the cache")
torch.cuda.empty_cache()
gpu_usage()

Initial GPU Usage
| ID | GPU | MEM |
------------------
|  0 |  6% |  6% |
GPU Usage after allocating a bunch of Tensors
| ID | GPU | MEM |
------------------
|  0 |  0% | 53% |
GPU Usage after deleting the Tensors
| ID | GPU | MEM |
------------------
|  0 | 28% | 53% |
GPU Usage after emptying the cache
| ID | GPU | MEM |
------------------
|  0 | 28% |  6% |
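
The same experiment can be followed with PyTorch's own counters instead of `GPUtil`: `torch.cuda.memory_allocated()` reports memory held by live tensors, and `torch.cuda.memory_reserved()` reports what the caching allocator holds in total. The helper `cuda_memory_mib` below is a hypothetical name, and returning zeros on a CPU-only machine is an assumption made so that the sketch runs anywhere.

```python
import torch

def cuda_memory_mib():
    """Return (allocated, reserved) GPU memory in MiB, or (0.0, 0.0) without CUDA."""
    if not torch.cuda.is_available():
        return 0.0, 0.0
    mib = 1024 ** 2
    return torch.cuda.memory_allocated() / mib, torch.cuda.memory_reserved() / mib

device = "cuda" if torch.cuda.is_available() else "cpu"

tensors = [torch.randn(1000, 1000, device=device) for _ in range(10)]
after_alloc = cuda_memory_mib()

del tensors
after_del = cuda_memory_mib()      # allocated drops, reserved usually stays cached

if torch.cuda.is_available():
    torch.cuda.empty_cache()
after_empty = cuda_memory_mib()    # reserved drops as the cache is released

for label, stats in [("allocated tensors", after_alloc),
                     ("deleted tensors", after_del),
                     ("emptied cache", after_empty)]:
    print(f"{label}: allocated={stats[0]:.1f} MiB, reserved={stats[1]:.1f} MiB")
```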


#### Using CUDNN Backend
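Setting `torch.backends.cudnn.benchmark = True` lets cuDNN time several convolution algorithms for each new input shape and reuse the fastest one. This is especially beneficial if your input size is fixed. When input sizes change often (for example, variable-length sequences fed to RNNs), the repeated benchmarking can slow things down instead.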

In [18]:
import torch

torch.backends.cudnn.benchmark = True
torch.backends.cudnn.enabled = True

#### Using Half Precision Floats

One can use half precision floats if the GPU has FP16 support. It's simple enough to convert a normal model to its half precision variant.

In [19]:
inp = torch.randn(5,).cuda().half()

model = myNetwork().cuda().half()

model(inp)

tensor([0.6973], device='cuda:0', dtype=torch.float16, grad_fn=<AddBackward0>)

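Batch Norm layers have been reported to have convergence issues with half precision floats, so it's better to keep them in full precision during training. The cell below converts the whole model, Batch Norm included, to half precision and runs it only in eval mode.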

In [20]:
import torch
import torch.nn as nn

class myNetworkBN(nn.Module):
    def __init__(self):
        super().__init__()
        self.l1 = nn.Linear(10,5)
        self.bn = nn.BatchNorm1d(5)
        self.l2 = nn.Linear(5,1)
     
    def forward(self,x):
        x = self.l1(x)
        x = self.bn(x)
        x = self.l2(x)
        return x 

inp = torch.randn(10,).cuda().half().unsqueeze(0)       # Unsqueeze op to add mini-batch dimension

model = myNetworkBN().cuda().half().eval()              # Eval mode = use population statistics in BN

model(inp)

tensor([[-0.2389]], device='cuda:0', dtype=torch.float16,
       grad_fn=<AddmmBackward0>)
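
The advice to keep Batch Norm layers in full precision can be sketched as a small helper that converts a model to half precision and then switches its Batch Norm layers back to `float32`. The name `half_except_batchnorm` is hypothetical, and the sketch only checks the resulting dtypes; whether a `float16` activation can be fed straight into a `float32` Batch Norm layer depends on the backend, so test the forward pass on your own hardware.

```python
import torch
import torch.nn as nn

BATCHNORM_TYPES = (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)

def half_except_batchnorm(model):
    """Convert a model to float16 but keep Batch Norm layers in float32."""
    model.half()
    for module in model.modules():
        if isinstance(module, BATCHNORM_TYPES):
            module.float()   # restores weights, biases and running statistics
    return model

model = nn.Sequential(nn.Linear(10, 5), nn.BatchNorm1d(5), nn.Linear(5, 1))
model = half_except_batchnorm(model)

for name, param in model.named_parameters():
    print(name, param.dtype)
```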

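These techniques help manage GPU memory efficiently in PyTorch. Take care with half precision floats when values may grow large: `float16` cannot represent magnitudes above 65504, so larger values overflow to infinity. For mixed precision training, the Nvidia `apex` extension was the usual recommendation; PyTorch now ships equivalent functionality natively as `torch.autocast` together with a gradient scaler (`torch.cuda.amp.GradScaler`).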
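
As a rough illustration of mixed precision training with PyTorch's built-in tools rather than `apex`, the sketch below runs a few optimizer steps under `torch.autocast` with a gradient scaler. The model, data, and hyperparameters are placeholders, and mixed precision is switched off on CPU-only machines (an assumption made so that the sketch runs anywhere).

```python
import torch
import torch.nn as nn

use_amp = torch.cuda.is_available()          # assumption: enable AMP only on a GPU
device = torch.device("cuda" if use_amp else "cpu")

model = nn.Linear(10, 1).to(device)          # weights stay in float32
opt = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

inputs = torch.randn(16, 10, device=device)
targets = torch.randn(16, 1, device=device)

losses = []
for _ in range(3):
    opt.zero_grad()
    # Inside autocast, eligible ops run in reduced precision on the GPU
    with torch.autocast(device_type=device.type, enabled=use_amp):
        loss = nn.functional.mse_loss(model(inputs), targets)
    scaler.scale(loss).backward()            # scale the loss to avoid float16 underflow
    scaler.step(opt)                         # unscales gradients, then steps the optimizer
    scaler.update()
    losses.append(loss.item())
```

Unlike calling `.half()` on the model, this keeps the parameters in full precision and lowers precision only for individual operations.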