## CS336 - Lec1(tech)+Lec2

## Tokenizers


## Overview of Lec2:
- primitives needed to train a model
- tensors to models to optimizers to training loop
- efficiency / use of resources (memory in GB and compute in FLOPs)

In [1]:
import torch

def motivating_questions():

1. How long would it take to train a 70B parameter model on 15T tokens on 1024 H100s?   
2. What's the largest model that you can train on 8 H100s using AdamW (naively)?
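A rough back-of-the-envelope sketch for the first question. Every constant here is an assumption rather than something stated in these notes: the common rule of thumb of about 6 FLOPs per parameter per token for training, an H100 dense peak of half the 1979 teraFLOP/s with-sparsity figure, and a model FLOPs utilization (MFU) of 0.5.

```python
# Back-of-the-envelope estimate; every constant below is an assumption.
num_params = 70e9
num_tokens = 15e12
total_flops = 6 * num_params * num_tokens   # ~6 FLOPs per parameter per token (rule of thumb)

h100_flop_per_sec = 1979e12 / 2             # dense peak: half of the with-sparsity figure
mfu = 0.5                                   # assumed model FLOPs utilization
num_gpus = 1024

flops_per_day = h100_flop_per_sec * mfu * num_gpus * 60 * 60 * 24
days = total_flops / flops_per_day
print(f"{days:.0f} days")
```

Under these assumptions the answer comes out at roughly 144 days. The result scales inversely with the assumed MFU.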

### Memory Accounting

#### tensors_basics()
Tensors are the basic building block for storing everything: parameters, gradients, optimizer state, data, activations.

In [None]:
# create tensors in multiple ways:
x = torch.tensor([[1., 2, 3], [4, 5, 6]])
x = torch.zeros(4, 8)  # 4x8 matrix of all zeros 
x = torch.ones(4, 8)  # 4x8 matrix of all ones 
x = torch.randn(4, 8)  # 4x8 matrix of iid Normal(0, 1) samples 

In [None]:
# allocate but don't initialize the values:
x = torch.empty(4, 8)  # 4x8 matrix of uninitialized values

#### tensors_memory()
Almost everything is stored as floating-point numbers.

FLOAT32/fp32/single precision (Default)
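- float32 has three fields: the sign bit (bit 31, the leftmost), the exponent (bits 30-23), and the mantissa (bits 22-0)
- Value of a normalized number: $$(-1)^{\text{sign}}\times 2^{\text{exponent}-127}\times (1+\text{mantissa}/{2^{23}}) $$
- Example: 0.5 is $0.1_2$ in binary, which normalizes to $1.0\times 2^{-1}$, so sign = 0, exponent = 126 (binary 01111110), and mantissa = 0 (binary 000...)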
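The field layout above can be checked with the standard library alone. This is a minimal sketch; the helper names `float32_fields` and `float32_from_fields` are made up for illustration, and the reconstruction formula only covers normalized numbers (not zero, subnormals, inf, or NaN).

```python
import struct

def float32_fields(value: float) -> tuple:
    """Return (sign, exponent, mantissa) of value's float32 encoding."""
    # Pack as a big-endian float32, then reinterpret the 4 bytes as an unsigned int.
    (bits,) = struct.unpack(">I", struct.pack(">f", value))
    sign = bits >> 31                 # bit 31
    exponent = (bits >> 23) & 0xFF    # bits 30-23
    mantissa = bits & 0x7FFFFF        # bits 22-0
    return sign, exponent, mantissa

def float32_from_fields(sign: int, exponent: int, mantissa: int) -> float:
    """Apply the formula for normalized numbers."""
    return (-1) ** sign * 2.0 ** (exponent - 127) * (1 + mantissa / 2**23)

print(float32_fields(0.5))             # (0, 126, 0)
print(float32_from_fields(0, 126, 0))  # 0.5
```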

In [None]:
# examine memory usage
assert x.dtype == torch.float32 # Default type
assert x.numel() == 4 * 8
assert x.element_size() == 4  # Float is 4 bytes

def get_memory_usage(x: torch.Tensor):
    return x.numel() * x.element_size()

assert get_memory_usage(x) == 4 * 8 * 4  # 128 bytes

In [None]:
# One matrix in the FFN layer of GPT-3:
assert get_memory_usage(torch.empty(12288 * 4, 12288)) == 2304 * 1024 * 1024  # 2.3 GB

FLOAT16/fp16/half precision: sign bit 15, exponent bits 14-10, mantissa bits 9-0

In [None]:
# Half the memory, but poorly suited to very small or very large numbers
x = torch.zeros(4, 8, dtype=torch.float16)
assert x.element_size() == 2

x = torch.tensor([1e-8], dtype=torch.float16)
assert x == 0  # Underflow!

bfloat16
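- Developed by Google Brain in 2018 to address this problem for deep learning: sign bit 15, exponent bits 14-7, mantissa bits 6-0
- Wider range: it has the same dynamic range as float32 but lower precision (deep learning is fairly insensitive to noise and reduced precision)
- Quantities such as the optimizer's parameters still need high precision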

In [None]:
x = torch.tensor([1e-8], dtype=torch.bfloat16)
assert x != 0  # No underflow!
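The underflow behavior of the two 16-bit formats follows from their bit layouts. The sketch below derives the limits in plain Python, assuming both formats follow the usual IEEE-style conventions (biased exponent, implicit leading 1, all-ones exponent reserved for inf/NaN). The helper name `format_limits` is made up for illustration.

```python
def format_limits(exponent_bits: int, mantissa_bits: int) -> dict:
    """Derive range and precision limits from the exponent/mantissa widths."""
    bias = 2 ** (exponent_bits - 1) - 1
    # The all-ones exponent is reserved for inf/NaN, so the largest usable one is 2^e - 2.
    max_exponent = (2 ** exponent_bits - 2) - bias
    return {
        "max": (2 - 2.0 ** -mantissa_bits) * 2.0 ** max_exponent,
        "smallest_normal": 2.0 ** (1 - bias),
        "smallest_subnormal": 2.0 ** (1 - bias - mantissa_bits),
        "eps": 2.0 ** -mantissa_bits,  # gap between 1.0 and the next representable value
    }

fp16 = format_limits(exponent_bits=5, mantissa_bits=10)
bf16 = format_limits(exponent_bits=8, mantissa_bits=7)

for name, limits in [("fp16", fp16), ("bf16", bf16)]:
    print(name, limits)
```

Here 1e-8 is less than half of fp16's smallest subnormal (2^-24, about 6e-8), so it rounds to zero, while it sits well above bf16's smallest normal value. The price is precision: bf16's `eps` is 2^-7, against 2^-10 for fp16.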

FP8
- Proposed by NVIDIA for machine learning; supported on the H100

Mixed Precision Training 
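
One common form of mixed precision keeps the parameters in float32 while running eligible operations in a lower-precision dtype. A minimal sketch using `torch.autocast` on the CPU follows; the layer sizes are arbitrary, and this is only one of several ways to set up mixed precision.

```python
import torch

model = torch.nn.Linear(16, 4)   # parameters are created in float32
x = torch.randn(8, 16)

# Inside autocast, eligible ops such as the linear layer run in bfloat16.
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    y = model(x)

print(model.weight.dtype)  # parameters stay in float32
print(y.dtype)             # the activation is produced in bfloat16
```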

### Compute Accounting

#### tensors_on_gpu() 
Tensors are stored on the CPU by default, so they have to be moved to the GPU explicitly. Run sanity checks often.

In [None]:
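# Check that the tensor lives on the CPU by default
assert x.device == torch.device('cpu')

# Is a CUDA-capable GPU available?
torch.cuda.is_available() 

# How many GPU devices are available?
num_gpus = torch.cuda.device_count()
for i in range(num_gpus):
    properties = torch.cuda.get_device_properties(i)

# Move x to the GPU (requires a CUDA device)
y = x.to("cuda:0")
z = torch.zeros(32, 32, device="cuda:0")  # or create the tensor on the GPU directly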
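
A device-agnostic variant of the same steps is sketched below, so that it also runs on a machine without a GPU. The fallback to the CPU is an addition for illustration, not part of the original cell.

```python
import torch

# Pick the GPU when one is present, otherwise stay on the CPU.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

x = torch.zeros(32, 32)                  # created on the CPU by default
y = x.to(device)                         # moved to the chosen device
z = torch.zeros(32, 32, device=device)   # or created there directly

# Sanity check: both tensors ended up on the intended device type.
assert y.device.type == device.type
assert z.device.type == device.type

if device.type == "cuda":
    # Bytes currently occupied by tensors on the current GPU.
    print(torch.cuda.memory_allocated())
```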

#### tensor operations
1. tensor storage
2. tensor slicing (note whether the result references the same memory or creates new values)
3. tensor element-wise operation
4. tensor matmul

In [None]:
# Tensor multiplication
x = torch.ones(16, 32)
w = torch.ones(32, 2)
y = x @ w  # matrix multiplication; * is element-wise multiplication

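In deep learning, however, the same operation is usually applied separately to every example in a batch or every position in a sequence, so for convenience matrix multiplication broadcasts over the leading dimensions.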

In [None]:
x = torch.ones(4, 8, 16, 32)
w = torch.ones(32, 2)
y = x @ w

assert y.size() == torch.Size([4, 8, 16, 2])

In [None]:
x = torch.ones(2, 2, 3)  # batch, sequence, hidden
y = torch.ones(2, 2, 3)  # batch, sequence, hidden
z = x @ y.transpose(-2, -1)  # batch, sequence, sequence

# Aside: there are clearer and more efficient ways to express transposes like this one
# Einops is a library for manipulating tensors where dimensions are named.
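
To illustrate the named-dimension idea without an extra dependency, the sketch below uses `torch.einsum` as a stand-in for einops. The single-letter index names (`b`, `s`, `t`, `h`) are chosen for this example; einops lets you spell such names out in full.

```python
import torch

x = torch.ones(2, 2, 3)  # batch, seq1, hidden
y = torch.ones(2, 2, 3)  # batch, seq2, hidden

# Explicit transpose: the reader has to work out what -2 and -1 refer to.
z_transpose = x @ y.transpose(-2, -1)

# Einsum: b = batch, s = seq1, t = seq2, h = hidden.
# h is missing from the output, so it is summed over.
z_einsum = torch.einsum("bsh,bth->bst", x, y)

assert torch.equal(z_transpose, z_einsum)
```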

#### tensor operation flops
A floating-point operation (FLOP) is a basic operation like addition (x + y) or multiplication (x * y).

FLOPs: the amount of computation needed to complete a task
FLOP/s: the speed at which hardware can compute

Intuition:
- Training GPT-3 (2020) took 3.14e23 FLOPs.
- Training GPT-4 (2023) is speculated to take 2e25 FLOPs

- A100 has a peak performance of 312 teraFLOP/s
- H100 has a peak performance of 1979 teraFLOP/s with sparsity, and 50% of that (about 990 teraFLOP/s) without
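
Dividing an amount of work (FLOPs) by a speed (FLOP/s) gives a time. As a purely illustrative calculation with the figures above, here is how long GPT-3's training compute would take on a single A100 running at 100% of peak, which no real workload achieves:

```python
gpt3_train_flops = 3.14e23   # FLOPs: amount of work (figure quoted above)
a100_peak = 312e12           # FLOP/s: hardware speed (figure quoted above)

seconds = gpt3_train_flops / a100_peak
years = seconds / (60 * 60 * 24 * 365)
print(f"{years:.1f} years on one A100 at peak")
```

This comes out at about 32 years, which is why training at this scale is spread over thousands of GPUs.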

For a linear model, suppose there are n points, each with d dimensions, and the model maps each d-dimensional vector to k outputs.

In [None]:
if torch.cuda.is_available():
    B = 16384  # Number of points
    D = 32768  # Dimension
    K = 8192   # Number of outputs
else:
    B = 1024
    D = 256
    K = 64

device = "cuda:0" if torch.cuda.is_available() else "cpu"  # fall back to the CPU when there is no GPU
x = torch.ones(B, D, device=device)
w = torch.randn(D, K, device=device)
y = x @ w

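# In total this takes roughly the following number of FLOPs
# (one multiplication and one addition per (i, j, k) triple):
actual_num_flops = 2 * B * D * K
# In deep learning, matrix multiplication is generally far more expensive than any other operation.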
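
The FLOP count can be combined with a wall-clock measurement to estimate the achieved FLOP/s. This is a minimal CPU-only sketch using the small sizes from the cell above; the helper `linear_flops` is a made-up name, and a single untimed-warmup-free measurement like this is noisy. On a GPU the timing would also need `torch.cuda.synchronize()` before reading the clock.

```python
import time
import torch

def linear_flops(b: int, d: int, k: int) -> int:
    """FLOPs for a (b x d) @ (d x k) matmul: one multiply and one add per (i, j, k) triple."""
    return 2 * b * d * k

B, D, K = 1024, 256, 64          # the CPU sizes from the cell above
x = torch.ones(B, D)
w = torch.randn(D, K)

start = time.perf_counter()
y = x @ w
elapsed = max(time.perf_counter() - start, 1e-9)  # guard against a zero reading

achieved_flop_per_sec = linear_flops(B, D, K) / elapsed
print(f"{linear_flops(B, D, K)} FLOPs in {elapsed:.6f} s")
print(f"achieved: {achieved_flop_per_sec:.3e} FLOP/s")
```

Dividing the achieved FLOP/s by the hardware's peak FLOP/s gives the utilization.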