# AUTOMATIC MIXED PRECISION

- `torch.cuda.amp` provides Automatic Mixed Precision (AMP).
- AMP means running some operations in `torch.float32` (`float`) and others in `torch.float16` (`half`).
- This notebook trains a simple model and then applies `torch.cuda.amp.autocast` and `torch.cuda.amp.GradScaler` to it directly.
- The only software needed to run this notebook is Python and PyTorch.
- To get the most out of AMP, use a GPU built on a Tensor Core-enabled architecture (Volta, Turing, Ampere), e.g. RTX 3090 (Ampere).
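The reason `GradScaler` accompanies `autocast` can be illustrated without a GPU: `float16` has a narrow range, so very small gradient values round to zero unless they are scaled up first. The following is a minimal sketch using only the standard library; the gradient value and the scale factor are illustrative assumptions, not values taken from this notebook.

```python
import struct

def to_half(x: float) -> float:
    """Round a Python float to IEEE 754 half precision and convert it back."""
    return struct.unpack("e", struct.pack("e", x))[0]

grad = 1e-8        # hypothetical small gradient value
scale = 65536.0    # hypothetical scale factor (2**16)

unscaled_half = to_half(grad)         # too small for float16, rounds to zero
scaled_half = to_half(grad * scale)   # large enough to be represented in float16
recovered = scaled_half / scale       # unscale again in higher precision

print("float16 without scaling:", unscaled_half)
print("float16 with scaling, then unscaled:", recovered)
```

Scaling the loss multiplies every gradient by the same factor, so dividing by that factor before the optimizer step restores the original magnitudes.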

Background

- PyTorch
- `gc.collect`: [docs](https://docs.python.org/3/library/gc.html#gc.collect)


In [4]:
import gc, time, torch, os

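# Select which physical GPU the notebook sees. These must be set before the
# first CUDA call; adjust the device index for your own machine.
os.environ['CUDA_DEVICE_ORDER'] = 'PCI_BUS_ID'
os.environ['CUDA_VISIBLE_DEVICES'] = '6,'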

start_time = None

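def start_timer():
    # Free cached memory and reset peak-memory stats so each run is measured cleanly.
    global start_time
    gc.collect()
    torch.cuda.empty_cache()
    # reset_max_memory_allocated() is deprecated; this is its replacement.
    torch.cuda.reset_peak_memory_stats()
    # Wait for pending GPU work so it is not counted in the timing.
    torch.cuda.synchronize()
    start_time = time.time()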

def end_timer_and_print(msg):
    torch.cuda.synchronize()
    end_time = time.time()
    print("\n" + msg)
    print("Total Execution Time = {:.3f} sec".format(end_time - start_time))
    print("Max memory used by tensors = {} bytes".format(torch.cuda.max_memory_allocated()))

In [5]:
def make_model(in_size, out_size, num_layers):
    layers = []
    for _ in range(num_layers - 1):
        layers.append(torch.nn.Linear(in_size, in_size))
        layers.append(torch.nn.ReLU())
    layers.append(torch.nn.Linear(in_size, out_size))
    return torch.nn.Sequential(*layers).cuda()

In [6]:
batch_size = 512
in_size = 4096
out_size = 4096
num_layers = 3
num_batches = 50
epochs = 3

data = [torch.randn(batch_size, in_size, device="cuda") for _ in range(num_batches)]
targets = [torch.randn(batch_size, out_size, device="cuda") for _ in range(num_batches)]

loss_fn = torch.nn.MSELoss().cuda()

In [7]:
net = make_model(in_size, out_size, num_layers)
opt = torch.optim.SGD(net.parameters(), lr=0.001)

start_timer()
for epoch in range(epochs):
    for input, target in zip(data, targets):
        output = net(input)
        loss = loss_fn(output, target)
        loss.backward()
        opt.step()
        opt.zero_grad()
end_timer_and_print("Default precision:")




Default precision:
Total Execution Time = 2.532 sec
Max memory used by tensors = 1350681600 bytes
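Before applying AMP to the training loop, it may help to see what `autocast` does to tensor dtypes. The sketch below is not part of the original notebook: it uses the device-agnostic `torch.autocast` context manager and, as an assumption made so it also runs without a GPU, falls back to CPU autocast with `torch.bfloat16`. The layer sizes are arbitrary.

```python
import torch

# Assumption: use CUDA with float16 when available, otherwise CPU with bfloat16.
device_type = "cuda" if torch.cuda.is_available() else "cpu"
low_dtype = torch.float16 if device_type == "cuda" else torch.bfloat16

layer = torch.nn.Linear(8, 4).to(device_type)
x = torch.randn(2, 8, device=device_type)

# Outside autocast, the linear layer computes in float32.
y_full = layer(x)

# Inside autocast, the same layer computes in the lower-precision dtype.
with torch.autocast(device_type=device_type, dtype=low_dtype):
    y_low = layer(x)

print("weight dtype:", layer.weight.dtype)   # parameters stay in float32
print("output without autocast:", y_full.dtype)
print("output with autocast:", y_low.dtype)
```

The parameters themselves are never converted; only the computation inside the context manager runs in lower precision, which is why the model and optimizer setup do not change.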


In [12]:
use_amp = True

net = make_model(in_size, out_size, num_layers)
opt = torch.optim.SGD(net.parameters(), lr=0.001)
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

start_timer()
for epoch in range(epochs):
    for input, target in zip(data, targets):
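        # Run the forward pass and loss under autocast so eligible ops use float16.
        with torch.cuda.amp.autocast(enabled=use_amp):
            output = net(input)
            loss = loss_fn(output, target)
        # Scale the loss before backward() so small float16 gradients do not underflow.
        scaler.scale(loss).backward()
        # step() unscales the gradients and skips the update if they contain inf/NaN.
        scaler.step(opt)
        # Adjust the scale factor for the next iteration.
        scaler.update()
        opt.zero_grad()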

end_timer_and_print("Mixed precision:")




Mixed precision:
Total Execution Time = 0.840 sec
Max memory used by tensors = 1577232896 bytes
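The two printed results can be compared directly. The short calculation below uses only the numbers reported in the outputs above; the variable names are introduced here for illustration.

```python
# Values copied from the two outputs in this notebook.
default_time, amp_time = 2.532, 0.840            # seconds
default_mem, amp_mem = 1350681600, 1577232896    # bytes

speedup = default_time / amp_time
mem_ratio = amp_mem / default_mem

print(f"Speedup with AMP: {speedup:.2f}x")
print(f"Peak tensor memory: {default_mem / 2**30:.2f} GiB -> "
      f"{amp_mem / 2**30:.2f} GiB ({mem_ratio:.2f}x)")
```

In this run AMP cut the execution time to roughly a third, while peak tensor memory went up rather than down. The notebook does not investigate the cause; one plausible contributor is the `float16` copies of the `float32` weights that autocast creates, but that is an assumption rather than something measured here.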
