## Gradients

In [9]:
import torch
a = torch.rand(10, requires_grad=True)
b = torch.rand(10, requires_grad=True)
scalar = (a+b).sum()

In [10]:
assert a.grad is None # before backwards
scalar.backward()
assert a.grad is not None
assert b.grad is not None
print(a.grad)
print(b.grad)

tensor([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])
tensor([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])


#### You do not need to define a computational graph up front in PyTorch. The graph is recorded dynamically as tensor operations run, so everything is centered around tensor operations.
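The graph is recorded while ordinary tensor operations execute. The following is a minimal sketch of that behavior; the tensor shape and the multiplication by 2 are arbitrary choices for illustration, not taken from the cells above.

```python
import torch

a = torch.rand(3, requires_grad=True)

# Each operation on a tensor that requires grad is recorded as it runs.
out = (a * 2).sum()
print(out.grad_fn)  # the backward node recorded for the sum

# backward() walks the recorded graph and fills in a.grad.
out.backward()
print(a.grad)       # tensor([2., 2., 2.])
```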

#### Recording the computational graph costs both memory and compute, so disable it when you don't need gradients.
#### For example, at inference time you only need the model outputs and never backpropagate through a loss. In such cases, you can use the context manager torch.no_grad:

In [None]:
# Don't run
with torch.no_grad():
    predictions = model(x)
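A runnable sketch of the same idea follows. It assumes a small stand-in `torch.nn.Linear` model and a random input, neither of which comes from the cell above, and shows that outputs computed inside `torch.no_grad()` are not attached to any graph.

```python
import torch

# Hypothetical stand-in model and input, for illustration only.
model = torch.nn.Linear(4, 2)
x = torch.rand(3, 4)

# Normal forward pass: the output is attached to the graph.
with_graph = model(x)
print(with_graph.requires_grad)   # True

# Inside no_grad, no graph is recorded.
with torch.no_grad():
    predictions = model(x)
print(predictions.requires_grad)  # False
print(predictions.grad_fn)        # None
```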

#### In-place operations modify the contents of a tensor without copying its memory. Because they overwrite values that autograd may need for the backward pass, PyTorch restricts them: applying an in-place operation to a leaf tensor that requires gradients raises an error.
#### One thing to be wary of is the += operator: it triggers an in-place operation, following the behavior of numpy.

In [8]:
# RuntimeError: a leaf Variable that requires grad has been used in an in-place operation.
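A minimal sketch of code that triggers this kind of error is shown here, together with an out-of-place alternative. The tensor shape and variable names are illustrative, and the exact wording of the error message can differ between PyTorch versions.

```python
import torch

a = torch.rand(3, requires_grad=True)  # a leaf tensor that requires grad

error_message = None
try:
    a += 1  # in-place update of a leaf that requires grad
except RuntimeError as err:
    error_message = str(err)
    print("RuntimeError:", error_message)

# Out-of-place alternative: creates a new tensor and stays differentiable.
b = a + 1
b.sum().backward()
print(a.grad)  # tensor([1., 1., 1.])
```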

## Devices

#### GPU support - you can move tensors to the GPU by calling:

In [14]:
# Don't run

tensor.cuda()
# OR
tensor.to(torch.device("cuda"))

#### This function ensures that x and t are on the same device, regardless of where x is.

In [None]:
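def some_function(x):
    # Create t directly on the device that x lives on
    t = torch.zeros(10, device=x.device)
    # some_operation is a placeholder for any op that takes both tensors
    return some_operation(x, t)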
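As a runnable sketch of this device-agnostic pattern, the snippet below picks the GPU only when one is available and allocates the helper tensor on the input's device. The function `add_offset` is a hypothetical example, not part of the notebook.

```python
import torch

# Pick the GPU when PyTorch can see one, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def add_offset(x):
    # Hypothetical example function: allocate t directly on x's device.
    t = torch.ones(10, device=x.device)
    return x + t

x = torch.zeros(10).to(device)
result = add_offset(x)
print(result.device)
```

Because `t` is created from `x.device`, the same function works unchanged whether the caller passes a CPU or a GPU tensor.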

## Memory Management

### RuntimeError: CUDA Error: out of memory

#### There are multiple possible causes for this error, the most common ones being:

#### Problem 1: Batch sizes are too large. Although model weights do take up a lot of memory, the primary memory hog during training is usually the intermediate activations. The forward activations are stored for backpropagation, and their memory usage scales linearly with the batch size.
#### Solution: Using smaller batch sizes generally solves the memory issue, but can cause training to become unstable. You can compensate with gradient accumulation, where you accumulate the gradients of several mini-batches before running the optimizer step.
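Gradient accumulation can be sketched as follows. The model, the random data, the batch size of 2, and `accumulation_steps = 4` are all assumptions made for illustration; the point is only that `backward()` adds into `.grad` until the optimizer step clears it.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(4, 1)  # stand-in model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.MSELoss()

# Eight hypothetical mini-batches of 2 examples each.
batches = [(torch.rand(2, 4), torch.rand(2, 1)) for _ in range(8)]
accumulation_steps = 4  # effective batch size = 4 * 2 = 8
optimizer_steps = 0

optimizer.zero_grad()
for step, (inputs, targets) in enumerate(batches, start=1):
    loss = loss_fn(model(inputs), targets)
    # Scale so the summed gradients match the mean over the larger batch.
    (loss / accumulation_steps).backward()  # gradients add up in .grad
    if step % accumulation_steps == 0:
        optimizer.step()       # one update per accumulation_steps batches
        optimizer.zero_grad()  # reset before the next group
        optimizer_steps += 1

print(optimizer_steps)  # 2
```

Only one mini-batch of activations is held in memory at a time, while the optimizer sees gradients averaged over the larger effective batch.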


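#### Problem 2: After an OOM error, you often need to restart the kernel. In a notebook, the stored traceback keeps references to the tensors in the failing frames, so their GPU memory is not released.
#### Solution: Use the code below to create a decorator that saves you from having to restart the kernel.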

In [17]:
import functools
import traceback
import sys
 
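def get_ref_free_exc_info():
    """Return the current exception info with locals cleared from its traceback frames."""
    exc_type, val, tb = sys.exc_info()
    traceback.clear_frames(tb)  # drop references to locals (e.g. GPU tensors)
    return (exc_type, val, tb)

def gpu_mem_restore(func):
    """Reclaim GPU RAM if CUDA out of memory happened, or execution was interrupted"""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except:  # bare except on purpose: also catches KeyboardInterrupt
            exc_type, val, tb = get_ref_free_exc_info() # must!
            # Re-raise a fresh exception so the original one can be freed
            raise exc_type(val).with_traceback(tb) from None
    return wrapper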

You can use it like this:

In [18]:
@gpu_mem_restore
def do_something_on_the_gpu():
    print()

### Multi-GPU environments

Sometimes you want to constrain which process uses which GPU. The easiest way to do this is to set os.environ["CUDA_VISIBLE_DEVICES"] to the GPU(s) you want the process to be able to see, before CUDA is initialized. PyTorch will then not see any other GPUs.

Another way is to use torch.device("cuda:i") to manually select the device in your code, which is effective if you need fine-grained control over which tensor is on which device (e.g. with multi-GPU training).

A word of caution: torch.device("cuda:0") maps to the first device PyTorch sees, so if you set os.environ["CUDA_VISIBLE_DEVICES"] = "1", cuda:0 will map to physical GPU 1.
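The following sketch shows the ordering this relies on: the environment variable is set before torch is imported, so it is in place before CUDA is initialized. The choice of GPU "1" is arbitrary, and on a machine without that GPU the process simply sees no CUDA devices.

```python
import os

# Must be set before CUDA is initialized; safest is before importing torch.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

import torch

# On a multi-GPU machine this process now sees a single device, and
# torch.device("cuda:0") refers to physical GPU 1.
print(torch.cuda.device_count())
if torch.cuda.is_available():
    x = torch.zeros(1, device="cuda:0")
    print(x.device)
```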

#### CUDA version compatibility

You can check which CUDA versions are supported by each version of PyTorch at https://pytorch.org/get-started/previous-versions/.

You can check whether PyTorch has access to a GPU with the following function:

In [20]:
torch.cuda.is_available()

False
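When this returns False unexpectedly, it can help to print the installed versions alongside it. A small diagnostic sketch, using only attributes that PyTorch exposes:

```python
import torch

print(torch.__version__)          # installed PyTorch version
print(torch.version.cuda)         # CUDA version of this build, None for CPU-only builds
print(torch.cuda.is_available())  # whether a usable GPU was found
print(torch.cuda.device_count())  # number of GPUs visible to this process
```

If `torch.version.cuda` is None, the installed package is a CPU-only build, and no driver or environment change will make a GPU visible.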

### Moving things to the GPU and back

Often, you will want to move batches of tensors or dictionaries of tensors to the GPU/CPU. 

A utility function I find useful is to_device:

In [21]:
def to_device(x, device: torch.device):
    """Transfers tensor or collection of tensors to a device."""
    if isinstance(x, (list, tuple)):
        return type(x)(to_device(v, device) for v in x)
    elif isinstance(x, dict):
        return {k: to_device(v, device) for k, v in x.items()}
    else:
        return x.to(device)

### Converting numpy arrays to tensors



In [23]:
import numpy as np
a = np.zeros(10)
b = np.zeros(10, dtype=np.int64)

# Converting to Tensor converts ints to floats
print(torch.Tensor(a))
print(torch.Tensor(b))


tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])


In [26]:
# To convert to int tensor use lowercase tensor with dtype int
print(torch.tensor(b, dtype=torch.int))

tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=torch.int32)


#### Use torch.from_numpy for tensors that won't change
This function takes advantage of the fact that PyTorch and numpy can share the same underlying memory, so as long as you don't change the data type, the conversion does not require any memory copying.

#### torch.from_numpy is MUCH faster than torch.Tensor since it doesn't create a copy.

Be careful though, because torch.from_numpy shares the same underlying data, so any modifications to the tensor will propagate back to the numpy array (and vice-versa).
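The sharing works in both directions, as this small sketch shows. The variable names are illustrative; `torch.tensor` is included for contrast because it always copies its input.

```python
import numpy as np
import torch

array = np.zeros(3)
shared = torch.from_numpy(array)  # shares memory with array
copied = torch.tensor(array)      # always copies the data

array[0] = 1.0   # modify the numpy array in place
shared[1] = 2.0  # modify the tensor in place

print(array)   # [1. 2. 0.]
print(shared)  # tensor([1., 2., 0.], dtype=torch.float64)
print(copied)  # tensor([0., 0., 0.], dtype=torch.float64)
```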

### Converting tensors to numpy arrays

In [30]:
# Don't run
# detach from the graph, move to the CPU, then convert
tensor.detach().cpu().numpy()
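The detach step matters for tensors that require gradients. A minimal sketch, with an arbitrary example tensor `t`:

```python
import numpy as np
import torch

t = torch.rand(3, requires_grad=True)

try:
    t.numpy()  # fails: the tensor is still attached to the graph
except RuntimeError as err:
    print("RuntimeError:", err)

array = t.detach().cpu().numpy()  # detach first, then move to the CPU
print(type(array), array.dtype)
```

For a tensor that is already on the CPU, the resulting array shares memory with the tensor, so the same caution about in-place modifications applies.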