The parameter count of the GPT model follows from the shapes of its layers:

1. **Token embedding (`wte`)**: \( vocab\_size \times n\_embd \)
2. **Position embedding (`wpe`)**: \( block\_size \times n\_embd \)
3. **Transformer block**: each Transformer block contains the following parameters:
   - **LayerNorm**: there are two LayerNorm layers, each with \( 2 \times n\_embd \) parameters (weight and bias)
   - **Attention (Q, K, V)**: \( n\_embd \times 3 \times n\_embd + 3 \times n\_embd \)
   - **Attention output projection**: \( n\_embd \times n\_embd + n\_embd \)
   - **MLP (intermediate dense layers)**: \( (n\_embd \times 4 \times n\_embd + 4 \times n\_embd) + (4 \times n\_embd \times n\_embd + n\_embd) \)
4. **Final LayerNorm (`ln_f`)**: \( 2 \times n\_embd \)

Each attention head works on \( n\_embd // n\_head \) dimensions, but all heads are produced by one fused projection of width \( 3 \times n\_embd \), so the count does not depend on \( n\_head \). The output head (`lm_head`) shares its weight matrix with the token embedding (weight tying) and has no bias, so it adds no parameters.

Let us apply these formulas to the configuration \( vocab\_size = 50257 \), \( block\_size = 1024 \), \( n\_layer = 12 \), \( n\_embd = 768 \).

### Calculations

#### Embedding Layers

\[
50257 \times 768 = 38,597,376
\]

\[
1024 \times 768 = 786,432
\]

#### Transformer Block

In each Transformer block:

- **LayerNorm** (two layers):
  \[
  2 \times (2 \times 768) = 3,072
  \]

- **Attention (Q, K, V)**:
  \[
  768 \times 2304 + 2304 = 1,769,472 + 2,304 = 1,771,776
  \]

- **Attention output projection**:
  \[
  768 \times 768 + 768 = 589,824 + 768 = 590,592
  \]

- **MLP**:
  \[
  (768 \times 3072 + 3072) + (3072 \times 768 + 768) = 2,362,368 + 2,360,064 = 4,722,432
  \]

Total for one Transformer block:
\[
3,072 + 1,771,776 + 590,592 + 4,722,432 = 7,087,872
\]

#### Total Transformer Block Parameters
\[
12 \times 7,087,872 = 85,054,464
\]

#### Total Parameters
\[
38,597,376 \text{ (wte)} + 786,432 \text{ (wpe)} + 85,054,464 \text{ (Transformer blocks)} + 1,536 \text{ (ln\_f)} = 124,439,808
\]

This is the roughly 124 million parameters (124M) quoted for this configuration. It also equals the sum of the decayed and non-decayed parameter counts printed by `configure_optimizers` later in this notebook (124,318,464 + 121,344).
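
The count can also be reproduced in a few lines of plain Python. This is a minimal sketch: `count_gpt2_params` is a hypothetical helper that is not part of the notebook, and it assumes the layer layout of the `GPT` class defined in the cells below (biased linear layers, a learned position embedding, and an output head tied to the token embedding).

```python
def count_gpt2_params(vocab_size=50257, block_size=1024, n_layer=12, n_embd=768):
    """Count trainable parameters of a GPT-2 style model (hypothetical helper)."""
    wte = vocab_size * n_embd                   # token embedding, shared with lm_head
    wpe = block_size * n_embd                   # learned position embedding
    layer_norm = 2 * n_embd                     # weight + bias
    c_attn = n_embd * 3 * n_embd + 3 * n_embd   # fused Q, K, V projection
    attn_proj = n_embd * n_embd + n_embd        # attention output projection
    c_fc = n_embd * 4 * n_embd + 4 * n_embd     # MLP expansion
    mlp_proj = 4 * n_embd * n_embd + n_embd     # MLP contraction
    block = 2 * layer_norm + c_attn + attn_proj + c_fc + mlp_proj
    return wte + wpe + n_layer * block + layer_norm  # the last term is ln_f


total = count_gpt2_params()
print(f"{total:,}")  # 124,439,808
```

Note that `n_head` does not appear anywhere: splitting the embedding into heads changes how the projections are reshaped, not how many weights they hold.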

In [1]:
import glob
import inspect
import math
import os
import struct
from contextlib import nullcontext
from dataclasses import dataclass

import numpy as np
import torch
import torch._inductor.config as config
import torch.nn as nn
from torch.distributed.optim import ZeroRedundancyOptimizer
from torch.nn import functional as F

In [2]:
class NewGELU(nn.Module):
    """Careful there are a few versions of GeLU, this one is the exact one used by OpenAI"""
    def forward(self, input):
        return 0.5 * input * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) * (input + 0.044715 * torch.pow(input, 3.0))))
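
To see how close this tanh formula is to the exact GELU, the two can be compared on scalars with the standard library alone. This is an illustrative sketch, not part of the notebook; `gelu_tanh` and `gelu_exact` are hypothetical names, and the exact form assumed here is x·Φ(x) with Φ the standard normal CDF.

```python
import math


def gelu_tanh(x):
    """Tanh approximation, the same formula as NewGELU.forward."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))


def gelu_exact(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"x={x:+.1f}  tanh={gelu_tanh(x):+.6f}  exact={gelu_exact(x):+.6f}")
```

The two columns differ only in the later decimal places, which is why the variant matters mostly when reproducing pretrained weights bit for bit.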


In [3]:
FLASH = 0

In [4]:
class CausalSelfAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        assert config.n_embd % config.n_head == 0
        # key, query, value projections for all heads, but in a batch
        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd)
        # output projection
        self.c_proj = nn.Linear(config.n_embd, config.n_embd)
        self.c_proj.LLMC_RESIDUAL_SCALE_FLAG = 1
        # cache the head count and embedding size, used for reshaping in forward
        self.n_head = config.n_head
        self.n_embd = config.n_embd
        # not really a 'bias', more of a mask, but following the OpenAI/HF naming though
        self.register_buffer("bias", torch.tril(torch.ones(config.block_size, config.block_size))
                                     .view(1, 1, config.block_size, config.block_size))

    def forward(self, x):
        B, T, C = x.size() # batch size, sequence length, embedding dimensionality (n_embd)
        # calculate query, key, values for all heads in batch and move head forward to be the batch dim
        qkv = self.c_attn(x)
        q, k, v = qkv.split(self.n_embd, dim=2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)
        if FLASH:
            # flashattention
            y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        else:
            # manual implementation of attention
            # this materializes the large (T,T) matrix for all the queries and keys
            att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1)))
            att = att.masked_fill(self.bias[:,:,:T,:T] == 0, float('-inf'))
            att = F.softmax(att, dim=-1)
            y = att @ v # (B, nh, T, T) x (B, nh, T, hs) -> (B, nh, T, hs)
        y = y.transpose(1, 2).contiguous().view(B, T, C) # re-assemble all head outputs side by side
        # output projection
        y = self.c_proj(y)
        return y
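
The masking step in the manual (non-flash) branch can be illustrated without tensors. The sketch below is a pure-Python stand-in, with `causal_softmax` as a hypothetical helper name: it applies the same idea as `masked_fill(..., -inf)` followed by `softmax`, on a small score matrix for a single head.

```python
import math


def causal_softmax(scores):
    """Row-wise softmax over a T x T score matrix with future positions masked.

    Position i may only attend to positions j <= i, as with the lower-triangular
    `bias` buffer in CausalSelfAttention.
    """
    weights = []
    for i, row in enumerate(scores):
        masked = [s if j <= i else float("-inf") for j, s in enumerate(row)]
        peak = max(masked)                            # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in masked]   # exp(-inf) evaluates to 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


scores = [[0.0, 1.0, 2.0],
          [1.0, 1.0, 1.0],
          [0.5, 0.0, 0.5]]
for row in causal_softmax(scores):
    print([round(w, 3) for w in row])
```

The first row puts all of its weight on position 0 even though the raw scores favour later positions, because those positions lie in the future and are masked out.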

In [5]:
class MLP(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.c_fc    = nn.Linear(config.n_embd, 4 * config.n_embd)
        self.gelu    = NewGELU()
        self.c_proj  = nn.Linear(4 * config.n_embd, config.n_embd)
        self.c_proj.LLMC_RESIDUAL_SCALE_FLAG = 1 # special flag for residual scaling.

    def forward(self, x):
        x = self.c_fc(x)
        x = self.gelu(x)
        x = self.c_proj(x)
        return x

In [6]:
class Block(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.ln_1 = nn.LayerNorm(config.n_embd)
        self.attn = CausalSelfAttention(config)
        self.ln_2 = nn.LayerNorm(config.n_embd)
        self.mlp = MLP(config)

    def forward(self, x):
        x = x + self.attn(self.ln_1(x))
        x = x + self.mlp(self.ln_2(x))
        return x

In [7]:
@dataclass
class GPTConfig:
    block_size: int = 1024
    vocab_size: int = 50257
    n_layer: int = 12
    n_head: int = 12
    n_embd: int = 768

In [8]:

def print0(*args, **kwargs):
    # modified print that only prints from the master process
    # if this is not a distributed run, it's just a print
    if int(os.environ.get("RANK", 0)) == 0:
        print(*args, **kwargs)

In [9]:
class GPT(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config

        self.transformer = nn.ModuleDict(dict(
            wte = nn.Embedding(config.vocab_size, config.n_embd), # wte: word token embeddings
            wpe = nn.Embedding(config.block_size, config.n_embd), # wpe: positional embeddings
            h = nn.ModuleList([Block(config) for _ in range(config.n_layer)]), # h: transformer blocks
            ln_f = nn.LayerNorm(config.n_embd), # ln_f: final layer norm before output
        ))
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False) # lm_head: head for language modeling
        self.lm_head.LLMC_SKIP_INIT = 1 # don't init this one, we will tie weights
        self.transformer.wte.weight = self.lm_head.weight # https://paperswithcode.com/method/weight-tying

        # init all weights, use a torch rng object to be very careful
        self.init_rng = torch.Generator()
        self.init_rng.manual_seed(42)
        self.apply(self._init_weights)

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            # apply special scaled init to the residual projections, per GPT-2 paper
            std = 0.02 if not hasattr(module, 'LLMC_RESIDUAL_SCALE_FLAG') else 0.02 / math.sqrt(2 * self.config.n_layer)
            # we want to skip initializing lm_head, which shares parameters with wte
            # and wte was already initialized down below during the Embedding init
            if not hasattr(module, 'LLMC_SKIP_INIT'):
                torch.nn.init.normal_(module.weight, mean=0.0, std=std, generator=self.init_rng)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02, generator=self.init_rng)
    @classmethod
    def from_pretrained(cls, model_type):
        """Loads pretrained GPT-2 model weights from huggingface"""
        # assert model_type in {'gpt2', 'gpt2-medium', 'gpt2-large', 'gpt2-xl'}
        from transformers import GPT2LMHeadModel
        print("loading weights from pretrained gpt: %s" % model_type)

        # n_layer, n_head and n_embd are determined from model_type
        config_args = {
            'ytu-ce-cosmos/turkish-gpt2':   dict(n_layer=12, n_head=12, n_embd=768),  # 124M params
            'gpt2':                         dict(n_layer=12, n_head=12, n_embd=768),  # 124M params
            'gpt2-medium':                  dict(n_layer=24, n_head=16, n_embd=1024), # 350M params
            'gpt2-large':                   dict(n_layer=36, n_head=20, n_embd=1280), # 774M params
            'gpt2-xl':                      dict(n_layer=48, n_head=25, n_embd=1600), # 1558M params
        }[model_type]
        config_args['vocab_size'] = 50257 # always 50257 for GPT model checkpoints
        config_args['block_size'] = 1024 # always 1024 for GPT model checkpoints
        # create a from-scratch initialized minGPT model
        config = GPTConfig(**config_args)
        model = GPT(config)
        sd = model.state_dict()
        sd_keys = sd.keys()
        sd_keys = [k for k in sd_keys if not k.endswith('.attn.bias')] # discard this mask / buffer, not a param

        # init a huggingface/transformers model
        model_hf = GPT2LMHeadModel.from_pretrained(model_type)
        sd_hf = model_hf.state_dict()

        # copy while ensuring all of the parameters are aligned and match in names and shapes
        sd_keys_hf = sd_hf.keys()
        sd_keys_hf = [k for k in sd_keys_hf if not k.endswith('.attn.masked_bias')] # ignore these, just a buffer
        sd_keys_hf = [k for k in sd_keys_hf if not k.endswith('.attn.bias')] # same, just the mask (buffer)
        transposed = ['attn.c_attn.weight', 'attn.c_proj.weight', 'mlp.c_fc.weight', 'mlp.c_proj.weight']
        # basically the openai checkpoints use a "Conv1D" module, but we only want to use a vanilla Linear
        # this means that we have to transpose these weights when we import them
        assert len(sd_keys_hf) == len(sd_keys), f"mismatched keys: {len(sd_keys_hf)} != {len(sd_keys)}"
        for k in sd_keys_hf:
            if any(k.endswith(w) for w in transposed):
                # special treatment for the Conv1D weights we need to transpose
                assert sd_hf[k].shape[::-1] == sd[k].shape
                with torch.no_grad():
                    sd[k].copy_(sd_hf[k].t())
            else:
                # vanilla copy over the other parameters
                assert sd_hf[k].shape == sd[k].shape
                with torch.no_grad():
                    sd[k].copy_(sd_hf[k])

        return model

    def forward(self, idx, targets=None, return_logits=True):
        device = idx.device
        b, t = idx.size()
        assert t <= self.config.block_size, f"Cannot forward sequence of length {t}, block size is only {self.config.block_size}"
        pos = torch.arange(0, t, dtype=torch.long, device=device) # shape (t)

        # forward the GPT model itself
        tok_emb = self.transformer.wte(idx) # token embeddings of shape (b, t, n_embd)
        pos_emb = self.transformer.wpe(pos) # position embeddings of shape (t, n_embd)
        x = tok_emb + pos_emb

        for block in self.transformer.h:
            x = block(x)
        x = self.transformer.ln_f(x)

        if targets is not None:
            # if we are given some desired targets also calculate the loss
            logits = self.lm_head(x)
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1), ignore_index=-1)
        else:
            # inference-time mini-optimization: only forward the lm_head on the very last position
            logits = self.lm_head(x[:, [-1], :]) # note: using list [-1] to preserve the time dim
            loss = None

        # there are performance reasons why not returning logits is prudent, if not needed
        if not return_logits:
            logits = None

        return logits, loss

    def configure_optimizers(self, weight_decay, learning_rate, betas, device_type, zero_stage):
        # start with all of the candidate parameters
        param_dict = {pn: p for pn, p in self.named_parameters()}
        # filter out those that do not require grad
        param_dict = {pn: p for pn, p in param_dict.items() if p.requires_grad}
        # create optim groups. Any parameters that is 2D will be weight decayed, otherwise no.
        # i.e. all weight tensors in matmuls + embeddings decay, all biases and layernorms don't.
        decay_params = [p for n, p in param_dict.items() if p.dim() >= 2]
        nodecay_params = [p for n, p in param_dict.items() if p.dim() < 2]
        optim_groups = [
            {'params': decay_params, 'weight_decay': weight_decay},
            {'params': nodecay_params, 'weight_decay': 0.0}
        ]
        num_decay_params = sum(p.numel() for p in decay_params)
        num_nodecay_params = sum(p.numel() for p in nodecay_params)
        print0(f"num decayed parameter tensors: {len(decay_params)}, with {num_decay_params:,} parameters")
        print0(f"num non-decayed parameter tensors: {len(nodecay_params)}, with {num_nodecay_params:,} parameters")
        # Create AdamW optimizer and use the fused version if it is available
        fused_available = 'fused' in inspect.signature(torch.optim.AdamW).parameters
        use_fused = fused_available and device_type == 'cuda'
        print0(f"using fused AdamW: {use_fused}")
        if zero_stage == 1:
            print0("using ZeroRedundancyOptimizer")
            optimizer = ZeroRedundancyOptimizer(**optim_groups[0], optimizer_class=torch.optim.AdamW,
                                                lr=learning_rate, betas=betas, fused=use_fused)
            optimizer.add_param_group(optim_groups[1])
        else:
            print0("using regular AdamW")
            optimizer = torch.optim.AdamW(optim_groups, lr=learning_rate, betas=betas, fused=use_fused)
        return optimizer

    @torch.no_grad()
    def generate(self, idx, max_new_tokens, temperature=1.0, top_k=None):
        """
        Take a conditioning sequence of indices idx (LongTensor of shape (b,t)) and complete
        the sequence max_new_tokens times, feeding the predictions back into the model each time.
        Most likely you'll want to make sure to be in model.eval() mode of operation for this.
        """
        for _ in range(max_new_tokens):
            # if the sequence context is growing too long we must crop it at block_size
            idx_cond = idx if idx.size(1) <= self.config.block_size else idx[:, -self.config.block_size:]
            # forward the model to get the logits for the index in the sequence
            logits, _ = self(idx_cond)
            # pluck the logits at the final step and scale by desired temperature
            logits = logits[:, -1, :] / temperature
            # optionally crop the logits to only the top k options
            if top_k is not None:
                v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
                logits[logits < v[:, [-1]]] = -float('Inf')
            # apply softmax to convert logits to (normalized) probabilities
            probs = F.softmax(logits, dim=-1)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1)
            # append sampled index to the running sequence and continue
            idx = torch.cat((idx, idx_next), dim=1)

        return idx
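
The sampling step inside `generate` (temperature scaling, top-k filtering, softmax, then a multinomial draw) can be sketched on a plain list of logits. This is an illustration only: `top_k_probs` is a hypothetical helper, and Python's `random.choices` stands in for `torch.multinomial`.

```python
import math
import random


def top_k_probs(logits, k, temperature=1.0):
    """Scale logits by temperature, keep the k largest, and softmax the rest to zero."""
    scaled = [logit / temperature for logit in logits]
    kth = sorted(scaled, reverse=True)[min(k, len(scaled)) - 1]  # k-th largest value
    kept = [s if s >= kth else float("-inf") for s in scaled]    # ties at the threshold survive
    peak = max(kept)
    exps = [math.exp(s - peak) for s in kept]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
logits = [2.0, 1.0, 0.5, -1.0, -3.0]
probs = top_k_probs(logits, k=2)
next_token = rng.choices(range(len(logits)), weights=probs, k=1)[0]
print([round(p, 3) for p in probs], next_token)
```

With `k=2` only the two highest-scoring tokens keep a non-zero probability, so the sampled index is always 0 or 1.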


In [10]:
def _peek_data_shard(filename):
    # returns the number of tokens in the shard without loading it into memory
    # every byte of the file is treated as one token id, so this is the file size
    return os.path.getsize(filename)

def _load_data_shard(filename):
    # reads the whole shard; each byte becomes one integer token id (0..255)
    with open(filename, "rb") as f:
        tokens = list(f.read())
    return tokens

class DistributedDataLoader:
    def __init__(self, filename_pattern, B, T, process_rank, num_processes):
        self.process_rank = process_rank
        self.num_processes = num_processes
        self.B = B
        self.T = T

        # glob files that match the pattern
        self.files = sorted(glob.glob(filename_pattern))
        assert len(self.files) > 0, f"did not find any files that match the pattern {filename_pattern}"

        # load and validate all data shards, count number of tokens in total
        ntok_total = 0
        for fname in self.files:
            shard_ntok = _peek_data_shard(fname)
            assert shard_ntok >= num_processes * B * T + 1, f"dataset shard {fname} is too small for the current setting"
            ntok_total += shard_ntok
        self.ntok_total = ntok_total
        print0(f"DataLoader: total number of tokens: {ntok_total:,} across {len(self.files)} files")

        # kick things off
        self.current_shard = None
        self.reset()

    def reset(self):
        # we're being a bit clever here: if we already had shard 0 loaded,
        # then don't do the work to reload it, just reset the pointer
        if self.current_shard != 0:
            self.current_shard = 0
            self.tokens = _load_data_shard(self.files[self.current_shard])
        self.current_position = self.process_rank * self.B * self.T

    def advance(self): # advance to next data shard
        self.current_shard = (self.current_shard + 1) % len(self.files)
        self.current_position = self.process_rank * self.B * self.T
        self.tokens = _load_data_shard(self.files[self.current_shard])

    def next_batch(self):
        B = self.B
        T = self.T
        buf = self.tokens[self.current_position : self.current_position+B*T+1]
        # buf = torch.tensor(buf.astype(np.int32), dtype=torch.long)
        buf = torch.tensor(buf, dtype=torch.long)
        x = (buf[:-1]).view(B, T) # inputs
        y = (buf[1:]).view(B, T) # targets
        # advance the start pointer in current shard
        self.current_position += B * T * self.num_processes
        # if loading the next batch would be out of bounds advance the shard
        if self.current_position + (B * T * self.num_processes + 1) > len(self.tokens):
            self.advance()
        return x, y
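
How `next_batch` turns a flat token stream into inputs and targets, and how each rank starts at its own offset, can be seen on a toy sequence. The sketch below uses plain lists instead of tensors; `make_batch` is a hypothetical helper and the sizes are made up for readability.

```python
def make_batch(tokens, position, B, T):
    """Slice B*T + 1 tokens and split them into inputs x and next-token targets y."""
    buf = tokens[position : position + B * T + 1]
    flat_x, flat_y = buf[:-1], buf[1:]  # targets are the inputs shifted by one
    x = [flat_x[r * T : (r + 1) * T] for r in range(B)]
    y = [flat_y[r * T : (r + 1) * T] for r in range(B)]
    return x, y


toy_tokens = list(range(100))  # stand-in for a loaded shard
rows, cols, world_size = 2, 4, 2
for rank in range(world_size):
    start = rank * rows * cols  # same formula as reset()
    x, y = make_batch(toy_tokens, start, rows, cols)
    print(f"rank {rank}: x={x} y={y}")
```

Rank 0 reads tokens 0 to 8 and rank 1 reads tokens 8 to 16; after each batch the loader above moves every rank forward by `B * T * num_processes` so the ranks never overlap within a step.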


In [11]:
def write_fp16(tensor, file):
    t = tensor.detach().cpu().to(torch.float16)
    b = t.numpy().tobytes()
    file.write(b)

def write_fp32(tensor, file):
    t = tensor.detach().cpu().to(torch.float32)
    b = t.numpy().tobytes()
    file.write(b)

def write_bf16(tensor, file):
    t = tensor.detach().cpu().to(torch.bfloat16)
    # numpy doesn't have bf16 datatype so we have to trick it
    t = t.view(torch.int16) # trick: reinterpret as int16
    b = t.numpy().tobytes()
    file.write(b)
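
The int16 trick works because a bfloat16 value occupies two bytes whose bit pattern is the upper half of the corresponding float32 encoding (sign, 8 exponent bits, top 7 mantissa bits), so reinterpreting the storage preserves the bytes exactly. The sketch below illustrates that layout with the standard library. The helper names are hypothetical, and it truncates the low bits for simplicity, whereas PyTorch's conversion is assumed to round to nearest, so the last bit can differ for some values.

```python
import struct


def fp32_to_bf16_bits(x):
    """Keep the top 16 bits of the fp32 encoding (truncation, for illustration only)."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return bits >> 16


def bf16_bits_to_fp32(bits16):
    """Widen 16 stored bits back to fp32 by appending 16 zero bits."""
    (value,) = struct.unpack("<f", struct.pack("<I", bits16 << 16))
    return value


for x in (1.0, -2.0, 3.14159):
    bits = fp32_to_bf16_bits(x)
    print(f"{x:>8}  0x{bits:04X}  {bf16_bits_to_fp32(bits)}")
```

A reader on the C side can recover a float32 from each stored 16-bit value with the same shift-left-by-16 step.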

In [12]:
def write_tensors(model_tensors, L, file, dtype):
    # writes the GPT-2 model's weights to a binary file
    assert dtype in {"float16", "float32", "bfloat16"}
    write_fun = write_fp16 if dtype == "float16" else write_fp32 if dtype == "float32" else write_bf16
    write_fun(model_tensors["transformer.wte.weight"], file) # (V, C)
    write_fun(model_tensors["transformer.wpe.weight"], file) # (T, C)
    for i in range(L): # (L, C)
        write_fun(model_tensors[f"transformer.h.{i}.ln_1.weight"], file)
    for i in range(L): # (L, C)
        write_fun(model_tensors[f"transformer.h.{i}.ln_1.bias"], file)
    for i in range(L): # (L, 3C, C)
        write_fun(model_tensors[f"transformer.h.{i}.attn.c_attn.weight"], file)
    for i in range(L): # (L, 3C)
        write_fun(model_tensors[f"transformer.h.{i}.attn.c_attn.bias"], file)
    for i in range(L): # (L, C, C)
        write_fun(model_tensors[f"transformer.h.{i}.attn.c_proj.weight"], file)
    for i in range(L): # (L, C)
        write_fun(model_tensors[f"transformer.h.{i}.attn.c_proj.bias"], file)
    for i in range(L): # (L, C)
        write_fun(model_tensors[f"transformer.h.{i}.ln_2.weight"], file)
    for i in range(L): # (L, C)
        write_fun(model_tensors[f"transformer.h.{i}.ln_2.bias"], file)
    for i in range(L): # (L, 4C, C)
        write_fun(model_tensors[f"transformer.h.{i}.mlp.c_fc.weight"], file)
    for i in range(L): # (L, 4C)
        write_fun(model_tensors[f"transformer.h.{i}.mlp.c_fc.bias"], file)
    for i in range(L): # (L, C, 4C)
        write_fun(model_tensors[f"transformer.h.{i}.mlp.c_proj.weight"], file)
    for i in range(L): # (L, C)
        write_fun(model_tensors[f"transformer.h.{i}.mlp.c_proj.bias"], file)
    write_fun(model_tensors["transformer.ln_f.weight"], file) # (C, )
    write_fun(model_tensors["transformer.ln_f.bias"], file) # (C, )


In [14]:
@torch.no_grad()
def pad_vocab(tensor, multiple=128, value=0):
    """
    The dimension of the vocab size in GPT-2 is 50,257
    which is unfortunately a very unfriendly number for a lot of
    matrix operations on the GPU. So we pad it to the nearest
    friendlier multiple, e.g. 50,304 if multiple=128 when we
    export the weights into C land. This is a NOOP algorithmically
    and is only done to make the tensor operations more efficient.
    """
    assert tensor.ndim == 2
    V, C = tensor.shape
    # assert V == 50257, "just being defensive here"
    # calculate padded vocab size by rounding up to nearest multiple
    Vp = ((V + multiple - 1) // multiple) * multiple
    # pad the tensor
    pad_rows = Vp - V
    padded = tensor if pad_rows == 0 else F.pad(tensor, (0, 0, 0, pad_rows), value=value)
    assert padded.shape == (Vp, C)
    return padded
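
The rounding expression used in `pad_vocab` is easy to check in isolation. In this small sketch `round_up` is a hypothetical name for the same integer arithmetic.

```python
def round_up(n, multiple):
    """Smallest multiple of `multiple` that is >= n (same expression as in pad_vocab)."""
    return ((n + multiple - 1) // multiple) * multiple


print(round_up(50257, 128))  # 50304, i.e. 47 rows of padding
print(round_up(50304, 128))  # 50304, already aligned so nothing is added
```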

In [15]:
def write_model(model, filename, dtype):
    # everything we need to instantiate the model
    # 1) header is: version int, GPTConfig ints, padding to 1024 bytes
    assert dtype in {"float16", "float32", "bfloat16"}
    version = {
        "float16": 2, # 2: all tensors are fp16, padded vocab
        "float32": 3, # 3: all tensors are fp32, padded vocab
        "bfloat16": 5, # 5: all tensors are bf16, padded vocab
    }[dtype]
    header = torch.zeros(256, dtype=torch.int32)
    header[0] = 20240326 # magic
    header[1] = version # checkpoint version
    header[2] = model.config.block_size
    header[3] = model.config.vocab_size
    header[4] = model.config.n_layer
    header[5] = model.config.n_head
    header[6] = model.config.n_embd
    # 2) the parameters follow the header
    params = {name: param.cpu() for name, param in model.named_parameters()}
    # pad the vocab to a multiple of 128 here at export, for efficiency in C
    wte = params["transformer.wte.weight"] # (V, C)
    wte_padded = pad_vocab(wte) # (Vp, C)
    params["transformer.wte.weight"] = wte_padded # (Vp, C)
    print(f"padded vocab size from {wte.size(0)} to {wte_padded.size(0)}")
    header[7] = wte_padded.size(0) # padded vocab size store in header
    # now write to file
    with open(filename, "wb") as file:
        file.write(header.numpy().tobytes()) # header
        write_tensors(params, model.config.n_layer, file, dtype) # params
    print(f"wrote {filename}")
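
A quick way to sanity-check an exported file is to read the header back. The sketch below parses the 256-integer header layout written by `write_model`. `read_model_header` is a hypothetical helper, the field names are chosen here for readability, and little-endian byte order is an assumption (the export writes in the machine's native order). It runs on an in-memory fake header rather than a real checkpoint.

```python
import io
import struct

HEADER_INTS = 256  # 256 int32 values = 1024 bytes


def read_model_header(stream):
    """Parse the 1024-byte header laid out by write_model (assumes little-endian int32)."""
    raw = stream.read(HEADER_INTS * 4)
    fields = struct.unpack(f"<{HEADER_INTS}i", raw)
    assert fields[0] == 20240326, "bad magic number"
    names = ("magic", "version", "block_size", "vocab_size",
             "n_layer", "n_head", "n_embd", "padded_vocab_size")
    return dict(zip(names, fields))


# Fake header with the values this notebook would write for a bf16 export of gpt2.
values = [20240326, 5, 1024, 50257, 12, 12, 768, 50304] + [0] * (HEADER_INTS - 8)
header = read_model_header(io.BytesIO(struct.pack(f"<{HEADER_INTS}i", *values)))
print(header)
```

To inspect a real export, the same function could be called on a file opened with `open(path, "rb")`.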


In [16]:
def write_state(model, x, y, logits, loss, filename, dtype="float32"):
    # the state is used for debugging.
    # it contains information about the input, logits, loss, and the parameter gradients
    # this can be used for checking the computation correctness in C
    header = torch.zeros(256, dtype=torch.int32)
    header[0] = 20240327 # magic
    header[1] = 2 # run state version = 2 (1 -> 2 for padded vocab changes)
    header[2] = x.size(0) # batch size of the batch, B
    header[3] = x.size(1) # temporal extent of the batch, T
    grads = {name: param.grad.cpu() for name, param in model.named_parameters()}
    # pad the vocab grads here as well, to mirror write_model
    wte_grad = grads["transformer.wte.weight"] # (V, C)
    wte_grad_padded = pad_vocab(wte_grad, value=0) # (Vp, C) # TODO later maybe pad with nan?
    grads["transformer.wte.weight"] = wte_grad_padded # (Vp, C)
    print(f"padded vocab size in reference grads from {wte_grad.size(0)} to {wte_grad_padded.size(0)}")
    with open(filename, "wb") as file:
        # header
        file.write(header.numpy().tobytes())
        # input x
        file.write(x.cpu().numpy().astype("int32").tobytes()) # (B, T)
        # targets y
        file.write(y.cpu().numpy().astype("int32").tobytes()) # (B, T)
        # logits (result of the model forward pass)
        write_fp32(logits.cpu(), file)
        # loss (single float, result of the cross entropy loss)
        write_fp32(loss.cpu(), file)
        # gradients
        write_tensors(grads, model.config.n_layer, file, dtype)
    print(f"wrote {filename}")

In [17]:
def write_tokenizer(enc, filename):
    n = enc.max_token_value + 1
    header = torch.zeros(256, dtype=torch.int32)
    header[0] = 20240328 # magic
    header[1] = 2 # tokenizer version = 2 (1 -> 2: includes EOT token)
    header[2] = n # number of tokens
    header[3] = enc.eot_token # EOT token
    with open(filename, "wb") as file:
        file.write(header.numpy().tobytes())
        for i in range(n):
            b = enc.decode_bytes([i])
            length = len(b)
            assert length < 256, f"Token length exceeds 255: {length}"
            file.write(struct.pack("<B", length))  # Write the length as a 1-byte unsigned integer
            file.write(b)  # Write the actual bytes
    print(f"wrote {filename}")
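
The tokenizer file is a 1024-byte header followed by one length-prefixed record per token. A reader for that layout might look like the sketch below. `read_tokenizer` is a hypothetical helper, little-endian byte order is an assumption, and the three-token vocabulary is made up so the example runs entirely in memory.

```python
import io
import struct


def read_tokenizer(stream):
    """Parse the layout produced by write_tokenizer (assumes a little-endian int32 header)."""
    header = struct.unpack("<256i", stream.read(1024))
    assert header[0] == 20240328, "bad magic number"
    n_tokens, eot_token = header[2], header[3]
    tokens = []
    for _ in range(n_tokens):
        (length,) = struct.unpack("<B", stream.read(1))  # 1-byte length prefix
        tokens.append(stream.read(length))               # raw token bytes
    return tokens, eot_token


# Build a tiny fake tokenizer file in memory (3 made-up tokens, EOT id 2).
vocab = [b"a", b"bc", b"<eot>"]
buf = io.BytesIO()
buf.write(struct.pack("<256i", 20240328, 2, len(vocab), 2, *([0] * 252)))
for token in vocab:
    buf.write(struct.pack("<B", len(token)))
    buf.write(token)
buf.seek(0)

tokens, eot_token = read_tokenizer(buf)
print(tokens, eot_token)
```

The one-byte length prefix is why `write_tokenizer` asserts that every token is shorter than 256 bytes.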


In [18]:
# B, T = 2, 4
""" 

@dataclass
class GPTConfig:
    block_size: int = 128
    vocab_size: int = 64
    n_layer: int = 4
    n_head: int = 4
    n_embd: int = 16 """

B, T = 8, 1024 # batch size, sequence length
device = "cpu"
if torch.cuda.is_available():
    device = "cuda"
elif hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
    device = "mps"
device_type = 'cuda' if 'cuda' in device else 'cpu'
print(f"using device: {device} ({device_type})")

using device: mps (cpu)


In [19]:
ddp_rank = 0
ddp_local_rank = 0
zero_stage = 0
ddp_world_size = 1
master_process = True
seed_offset = 0
total_batch_size = B * T
tokens_per_fwdbwd = B * T
tokens_per_fwdbwd

8192

In [20]:
assert total_batch_size % tokens_per_fwdbwd == 0
grad_accum_steps = total_batch_size // tokens_per_fwdbwd
print0(f"total desired batch size: {total_batch_size}")
print0(f"=> calculated gradient accumulation steps: {grad_accum_steps}")

total desired batch size: 8192
=> calculated gradient accumulation steps: 1


In [21]:
# set up a context manager following the desired dtype and device
ptdtype = {'float32': torch.float32, 'bfloat16': torch.bfloat16, 'float16': torch.float16, 'float8': torch.float8_e4m3fn}['float16']
ctx = torch.amp.autocast(device_type=device_type, dtype=ptdtype) if device_type == "cuda" else nullcontext()
# rng / reproducibility
torch.manual_seed(42)
if torch.cuda.is_available():
    torch.cuda.manual_seed(42)

In [22]:
model = GPT.from_pretrained('gpt2')
model.train()
model.to(device)
model


loading weights from pretrained gpt: gpt2


GPT(
  (transformer): ModuleDict(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (h): ModuleList(
      (0-11): 12 x Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): CausalSelfAttention(
          (c_attn): Linear(in_features=768, out_features=2304, bias=True)
          (c_proj): Linear(in_features=768, out_features=768, bias=True)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): MLP(
          (c_fc): Linear(in_features=768, out_features=3072, bias=True)
          (gelu): NewGELU()
          (c_proj): Linear(in_features=3072, out_features=768, bias=True)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)

In [23]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")


In [25]:
tr_medikal_prompt = """Siz bir AI Medikal Asistan Chatbot'sunuz ve medikal soruları cevaplamak üzere eğitildiniz. Aşağıda, bir görevi tanımlayan bir talimat ve yanıt bağlamı verilmiştir. Talebe uygun bir yanıt yazın.

### Kullanıcı:
{}


### Asistan:
{}"""

# write an example for generating text
sample_text = "Teknolojinin gelişimi hayatımızı önemli ölçüde etkiledi. "
# sample_text = tr_medikal_prompt.format(sample_text, "")
sample_tokens = tokenizer.encode(sample_text)
sample_tokens = torch.tensor(sample_tokens, dtype=torch.long, device=device)[None, ...]

model.eval()
with torch.no_grad():
    sample_out = model.generate(sample_tokens, max_new_tokens=128, temperature=1, top_k=40)

# print the generated text
print0('---------------')
print0(f"Example input: {sample_text}")
print0(f"Generated output: {tokenizer.decode(sample_out[0].tolist())}")
print0('---------------')

---------------
Example input: Teknolojinin gelişimi hayatımızı önemli ölçüde etkiledi. 
Generated output: Teknolojinin gelişimi hayatımızı önemli ölçüde etkiledi. ğlákılıym ğlajülzı makir kıkurçırım.

Tepzalu öteşarı nikşarı niyilık yın kıkar, yuşlırımı zalekköm, yuşmöşrı makir, xasırırımüş, yıtıjemı, yıtıjı, yisu, xümö, xünyan
---------------


In [25]:
train_loader = DistributedDataLoader("gpt_train_tokens.bin", B, T, ddp_rank, ddp_world_size)
val_loader = DistributedDataLoader("gpt_val_tokens.bin", B, T, ddp_rank, ddp_world_size)


DataLoader: total number of tokens: 58,452,570 across 1 files
DataLoader: total number of tokens: 6,495,640 across 1 files


In [26]:
x, y = train_loader.next_batch()
x, y = x.to(device), y.to(device)
x.shape, y.shape

(torch.Size([8, 1024]), torch.Size([8, 1024]))

In [27]:
logits, loss = model(x, y)
loss.backward()
# save model params, in both float16 and bfloat16
model_to_size = {"gpt1Letter": "12K",  "gpt2": "124M", "gpt2-medium": "355M", "gpt2-large": "774M", "gpt2-xl": "1558M"}
model_to_size.update({f"d{d}": f"d{d}" for d in [12, 24, 36, 48]})
model_size_str = model_to_size["gpt2"] # e.g. "124M", or "d12"
write_model(model, f"gpt2_{model_size_str}.bin", dtype="float16")
write_model(model, f"gpt2_{model_size_str}_bf16.bin", dtype="bfloat16")
# save x, y, logits, loss, and parameter gradients, for debugging C
# always store these in fp32 to have an accurate reference (?)
write_state(model, x, y, logits, loss, f"gpt2_{model_size_str}_debug_state.bin")
# reset the train_loader for the optimization below
train_loader.reset()
# clear the grads here explicitly because otherwise we'd have a duplicate grad accumulation
# since in the training loop we do a backward() and then zero_grad() at the end of the loop
# this would cause an incorrect first training step
model.zero_grad()

padded vocab size from 50257 to 50304
wrote gpt2_124M.bin
padded vocab size from 50257 to 50304
wrote gpt2_124M_bf16.bin
padded vocab size in reference grads from 50257 to 50304
wrote gpt2_124M_debug_state.bin


In [28]:
weight_decay = 0.0
learning_rate = 3e-4
learning_rate_decay_frac = 0.0
warmup_iters = 0
num_iterations = 20
val_loss_every = 0
val_max_steps = 20
sample_every = 0
overfit_single_batch = 1
inference_only = 0
grad_clip = 1.0

In [29]:
raw_model = model # always contains the "raw" unwrapped model

# init the optimizer
optimizer = raw_model.configure_optimizers(weight_decay=weight_decay,
                                            learning_rate=learning_rate, betas=(0.9, 0.95),
                                            device_type=device_type, zero_stage=zero_stage)

num decayed parameter tensors: 50, with 124,318,464 parameters
num non-decayed parameter tensors: 98, with 121,344 parameters
using fused AdamW: False
using regular AdamW


In [30]:
# learning rate decay scheduler (cosine with warmup)
def get_lr(it):
    min_lr = learning_rate * learning_rate_decay_frac
    # 1) linear warmup for warmup_iters steps
    if it < warmup_iters:
        return learning_rate * (it+1) / warmup_iters
    # 2) if it > num_iterations, return min learning rate
    if it > num_iterations:
        return min_lr
    # 3) in between, use cosine decay down to min learning rate
    decay_ratio = (it - warmup_iters) / (num_iterations - warmup_iters)
    assert 0 <= decay_ratio <= 1
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio)) # coeff starts at 1 and goes to 0
    return min_lr + coeff * (learning_rate - min_lr)
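
To see the shape of this schedule without relying on notebook globals, the same logic can be written with the hyperparameters passed in explicitly. `cosine_lr` is a hypothetical standalone copy for illustration; its defaults mirror the values set in the hyperparameter cell above (no warmup, 20 iterations, decay all the way to zero).

```python
import math


def cosine_lr(it, base_lr=3e-4, decay_frac=0.0, warmup=0, total=20):
    """Standalone copy of get_lr with the hyperparameters passed in explicitly."""
    min_lr = base_lr * decay_frac
    if it < warmup:                      # 1) linear warmup
        return base_lr * (it + 1) / warmup
    if it > total:                       # 2) past the end: stay at the minimum
        return min_lr
    ratio = (it - warmup) / (total - warmup)
    coeff = 0.5 * (1.0 + math.cos(math.pi * ratio))  # goes from 1 down to 0
    return min_lr + coeff * (base_lr - min_lr)       # 3) cosine decay


for it in (0, 5, 10, 15, 20):
    print(it, f"{cosine_lr(it):.2e}")
```

With these defaults the rate starts at the full `3e-4`, passes through half of that at the midpoint, and reaches zero at the final step.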

In [31]:
import time

timings = []
norm = -1.0   # dummy value to print in inference-only mode
for step in range(num_iterations + 1):
    t0 = time.time()
    last_step = (step == num_iterations)

    # once in a while evaluate the validation dataset
    if (val_loss_every > 0 \
        and (step % val_loss_every == 0 or last_step)) \
        and (val_loader is not None):
        model.eval()
        val_loader.reset()
        with torch.no_grad():
            val_loss = 0.0
            for _ in range(val_max_steps):
                x, y = val_loader.next_batch()
                x, y = x.to(device), y.to(device)
                _, loss = model(x, y, return_logits=False)
                val_loss += loss.item()
            val_loss /= val_max_steps
        # log to console
        print0(f"val loss {val_loss}")


    # once in a while perform model inference on the master process
    if (sample_every > 0 \
        and (step % sample_every == 0 or last_step)) \
        and master_process:
        model.eval()
        # before we end, let's also do one round of inference
        # we'll kick off the generation with "<|endoftext|>", which designates the start of a new sequence
        start_ids = [63]
        xg = (torch.tensor(start_ids, dtype=torch.long, device=device)[None, ...])
        max_new_tokens = 32
        temperature = 1.0
        top_k = 40
        yg = raw_model.generate(xg, max_new_tokens, temperature=temperature, top_k=top_k)
        print0('---------------')
        print0(tokenizer.decode(yg[0].tolist()))
        print0('---------------')

    # bit confusing: we want to make sure to eval and sample on 0th iteration
    # but also after the very last iteration. so we loop for step <= num_iterations
    # instead of just < num_iterations (one extra due to <=), only to do
    # the validation/sampling one last time, and then we break right here as we're done.
    if last_step:
        break

    # --------------- TRAINING SECTION BEGIN -----------------
    model.train()
    # micro-batch loop where we do gradient accumulation to reach desired total batch size
    lossf = 0.0 # for getting the mean loss (as simple float) over the accumulation steps
    for micro_step in range(grad_accum_steps):
        # fetch a batch
        if not overfit_single_batch \
            or (overfit_single_batch and step == 0 and micro_step == 0):
            x, y = train_loader.next_batch()
            x, y = x.to(device), y.to(device)

        # forward pass
        with ctx:
            _, loss = model(x, y, return_logits=False)
            # we have to scale the loss to account for gradient accumulation,
            # because the gradients just add on each successive backward().
            # addition of gradients corresponds to a SUM in the objective, but
            # instead of a SUM we want MEAN, so we scale the loss here
            loss = loss / grad_accum_steps
            lossf += loss.detach() # keep track of the mean loss
        # backward pass
        if not inference_only:
            loss.backward()

    lossf = lossf.item()
    norm = torch.nn.utils.clip_grad_norm_(model.parameters(), grad_clip)
    # determine and set the learning rate for this iteration
    lr = get_lr(step)
    for param_group in optimizer.param_groups:
        param_group['lr'] = lr
    # step the optimizer
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
    # --------------- TRAINING SECTION END -------------------
    # everything that follows now is just diagnostics, prints, logging, etc.

    # wait on the CPU for all device work to end so we get accurate per-iteration timings below
    if device == "mps":
        torch.mps.synchronize()
    elif device == "cuda":
        torch.cuda.synchronize()
    # time and print
    t1 = time.time()
    # the 0th iteration is often an outlier (much slower); it is still logged here,
    # but it is excluded from the timing average collected below
    tokens_per_second = grad_accum_steps * ddp_world_size * B * T / (t1-t0)
    print0(f"step {step+1:4d}/{num_iterations} | train loss {lossf:.6f} | norm {norm:.4f} | lr {lr:.2e} | ({(t1-t0)*1000:.2f} ms | {tokens_per_second:.0f} tok/s)")


    # keep track of smooth timings over the last 20 iterations, skipping step 0
    if step > 0 and step > num_iterations - 20:
        timings.append(t1-t0)


step    1/20 | train loss 1200.992798 | norm 1803.9814 | lr 3.00e-04 | (3060.49 ms | 2677 tok/s)
step    2/20 | train loss 733.837280 | norm 1302.2609 | lr 2.98e-04 | (1774.81 ms | 4616 tok/s)
step    3/20 | train loss 367.639954 | norm 2673.2495 | lr 2.93e-04 | (1791.28 ms | 4573 tok/s)
step    4/20 | train loss 240.408386 | norm 765.4218 | lr 2.84e-04 | (1763.73 ms | 4645 tok/s)
step    5/20 | train loss 87.283707 | norm 364.1551 | lr 2.71e-04 | (1750.12 ms | 4681 tok/s)
step    6/20 | train loss 115.222900 | norm 1533.7216 | lr 2.56e-04 | (1773.80 ms | 4618 tok/s)
step    7/20 | train loss 82.457840 | norm 399.1787 | lr 2.38e-04 | (1751.40 ms | 4677 tok/s)
step    8/20 | train loss 47.452095 | norm 3697690.0000 | lr 2.18e-04 | (1551.99 ms | 5278 tok/s)
step    9/20 | train loss 55.643742 | norm 357.5972 | lr 1.96e-04 | (1879.64 ms | 4358 tok/s)
step   10/20 | train loss 32.141853 | norm 341.9620 | lr 1.73e-04 | (1752.77 ms | 4674 tok/s)
step   11/20 | train loss 8.183723 | norm 134.
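
The loss scaling inside the micro-batch loop can be checked on a toy problem. The sketch below uses a hypothetical one-parameter least-squares model with a hand-written gradient (not the GPT model, and no autograd) to show that accumulating the gradients of `loss / grad_accum_steps` over equally sized micro-batches gives the gradient of the mean loss over the full batch.

```python
# Toy model: prediction = w * x, loss = mean squared error over a batch.
def mse_grad(w, batch):
    """d/dw of mean((w*x - y)^2) over the (x, y) pairs in batch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

w = 0.5
data = [(1.0, 2.0), (2.0, 3.0), (3.0, 7.0), (4.0, 9.0)]  # made-up (x, y) pairs
grad_accum_steps = 2
micro_batches = [data[0:2], data[2:4]]  # equally sized micro-batches

# Each micro-batch contributes the gradient of (loss / grad_accum_steps),
# mirroring how successive backward() calls add into .grad.
accumulated = 0.0
for batch in micro_batches:
    accumulated += mse_grad(w, batch) / grad_accum_steps

full_batch = mse_grad(w, data)
print(accumulated, full_batch)
```

Both values come out to -25.0. Without the division by `grad_accum_steps`, the accumulated gradient would be twice as large, which is the SUM-versus-MEAN distinction the comment in the training loop describes.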

In [32]:
# print the average of the last 20 timings, to get something smooth-ish
timings = timings[-20:]
print0(f"final {len(timings)} iters avg: {np.mean(timings)*1000:.3f}ms")
# note: this counter only tracks CUDA allocations, so runs on CPU or MPS typically report 0 MiB
print0(f"peak memory consumption: {torch.cuda.max_memory_allocated() // 1024 // 1024} MiB")

final 19 iters avg: 1749.804ms
peak memory consumption: 0 MiB


In [33]:
model.eval()
with torch.no_grad():
    sample_out = model.generate(sample_tokens, max_new_tokens=128, temperature=1, top_k=40)

# print the generated text
print0('---------------')
print0(f"Example input: {sample_text}")
print0(f"Generated output: {tokenizer.decode(sample_out[0].tolist())}")
print0('---------------')

---------------
Example input: Siz bir AI Medikal Asistan Chatbot'sunuz ve medikal soruları cevaplamak üzere eğitildiniz. Aşağıda, bir görevi tanımlayan bir talimat ve yanıt bağlamı verilmiştir. Talebe uygun bir yanıt yazın.

### Kullanıcı:
Teknolojinin gelişimi hayatımızı önemli ölçüde etkiledi. 


### Asistan:

Generated output: Siz bir AI Medikal Asistan Chatbot'sunuz ve medikal soruları cevaplamak üzere eğitildiniz. Aşağıda, bir görevi tanımlayan bir talimat ve yanıt bağlamı verilmiştir. Talebe uygun bir yanıt yazın.

### Kullanıcı:
Teknolojinin gelişimi hayatımızı önemli ölçüde etkiledi. 


### Asistan:
<|endoftext|>]]"<|endoftext|>...""f<|endoftext|>...""u<|endoftext|>...]!K<|endoftext|>...].<|endoftext|>!""K-<|endoftext|>…]H
<|endoftext|>Ÿ4_u-<|endoftext|>�cS/I"
<|endoftext|>escuS:<|endoftext|>]]"*���i*.<|endoftext|>�5!:<|endoftext|>ς.<|endoftext|>."""C@HIC.<|endoftext|> :)"*,<|endoftext|>�.<|endoftext|> EURUSDJ#SS!
<|endoftext|>…]J.<|endoftext|>...]#JCS!s,<|endoftext|>!»CIKS!s"
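
The `generate` call above passes `temperature` and `top_k`. The model's `generate` method is not shown in this section, so the sketch below only illustrates the usual meaning of those two arguments on a plain list of logits. The helper `sample_top_k` and the example logits are hypothetical and are not a copy of the notebook's implementation.

```python
import math
import random

def sample_top_k(logits, temperature=1.0, top_k=40, rng=random):
    """Sample one index from logits after temperature scaling and top-k filtering."""
    scaled = [value / temperature for value in logits]
    k = min(top_k, len(scaled))
    # indices of the k largest scaled logits; everything else gets probability 0
    keep = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:k]
    # softmax over the kept logits only (subtract the max for numerical stability)
    peak = max(scaled[i] for i in keep)
    weights = [math.exp(scaled[i] - peak) for i in keep]
    return rng.choices(keep, weights=weights, k=1)[0]

rng = random.Random(0)  # fixed seed so the run is repeatable
logits = [2.0, 1.0, 0.5, -1.0, -3.0]  # made-up scores for a 5-token vocabulary
samples = [sample_top_k(logits, temperature=1.0, top_k=2, rng=rng) for _ in range(1000)]
print(sorted(set(samples)))
```

With `top_k=2` only the two highest-scoring indices (0 and 1) can ever be drawn. Top-k filtering restricts sampling to the model's own highest-probability tokens, so it cannot rescue a model whose probabilities are still poor, which is consistent with the mostly unreadable sample above after only 20 iterations on a single batch.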