# Improving training

Now that basic training works, there are a few things I need to add:

- Attention masking: I need to mask out attention to the padding tokens. The data loader should already provide the attention mask, so this should be pretty straightforward.
- Metrics: I'd like to track loss, FLOPs, memory usage, etc.
- Optimization: which training optimizations can I make? E.g. mixed-precision training, gradient accumulation, etc.
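As a quick reminder of why gradient accumulation is a safe optimization to add, here is a minimal, dependency-free sketch. It uses a toy one-parameter regression rather than the model from this notebook, and every name in it is illustrative. Averaging the gradients of equal-sized microbatches reproduces the full-batch gradient, so a large effective batch can be processed in smaller, memory-friendly pieces.

```python
import math


def mse_grad(w, xs, ys):
    """Gradient of mean((w * x - y)^2) with respect to the scalar weight w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


# Toy data and weight (illustrative values only).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.1, 5.9, 8.2]
w = 0.5

# Gradient computed on the full batch in one pass.
full_batch_grad = mse_grad(w, xs, ys)

# The same gradient, accumulated over equal-sized microbatches.
microbatch_size = 2
num_microbatches = len(xs) // microbatch_size
accumulated_grad = 0.0
for start in range(0, len(xs), microbatch_size):
    stop = start + microbatch_size
    micro_grad = mse_grad(w, xs[start:stop], ys[start:stop])
    # Scale each microbatch gradient so the running sum is a mean.
    accumulated_grad += micro_grad / num_microbatches

print(full_batch_grad, accumulated_grad)
```

The equality only holds exactly when the microbatches have equal size; with uneven microbatches each gradient would need to be weighted by its share of the samples.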

In [5]:
%load_ext autoreload
%autoreload 2

import os
from composer.utils import reproducibility

seed = 42
reproducibility.seed_all(seed)

HF_TOKEN = os.getenv("HF_TOKEN")
CACHE_DIR = "/datadrive/hf_cache"



# Investigating Attention Masking

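I'm realizing the padding attention mask doesn't really matter here.
The attention is causal and the tokenizer pads on the right (the padding zeros sit at the end of input_ids in the output below), so padding tokens can't affect the logits of the real tokens that come before them. They won't affect the loss either, since the labels at those positions are set to -100 and are ignored.

I don't think it's worth adding to the code; the is_causal flag is enough.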
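The claim above can be checked with a small, dependency-free sketch. It is a simplification, not the model's attention: a single head with q = k = v = x and no projections, and the helper names are made up for illustration. Two sequences share the same real tokens but have different right-padding; under causal attention the outputs at the real positions should be identical.

```python
import math
import random


def causal_attention(x):
    """Single-head causal self-attention with q = k = v = x (no projections)."""
    n, d = len(x), len(x[0])
    out = []
    for i in range(n):
        # Position i may only look at positions 0..i.
        scores = [
            sum(a * b for a, b in zip(x[i], x[j])) / math.sqrt(d)
            for j in range(i + 1)
        ]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        norm = sum(weights)
        out.append(
            [sum(weights[j] * x[j][c] for j in range(i + 1)) / norm for c in range(d)]
        )
    return out


def random_rows(n, d):
    """n random d-dimensional vectors standing in for token embeddings."""
    return [[random.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]


random.seed(0)
seq_len, n_real, d_model = 6, 4, 8

real_tokens = random_rows(n_real, d_model)
# Same real tokens, two different sets of right-padding embeddings.
padded_a = real_tokens + random_rows(seq_len - n_real, d_model)
padded_b = real_tokens + random_rows(seq_len - n_real, d_model)

out_a = causal_attention(padded_a)
out_b = causal_attention(padded_b)

real_positions_match = out_a[:n_real] == out_b[:n_real]
padding_positions_differ = out_a[n_real:] != out_b[n_real:]
print(real_positions_match, padding_positions_differ)
```

This argument relies on right padding. With left padding, or with bidirectional attention, the padding tokens would be visible to real tokens and an explicit mask would be needed.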

In [6]:
from transformers import PreTrainedTokenizerFast
import datasets

print("Loading datasets...")
wikihow_data: datasets.Dataset = datasets.load_dataset(
    "wikihow",
    name="all",
    data_dir=CACHE_DIR,
    cache_dir=CACHE_DIR,
    use_auth_token=HF_TOKEN,
    split="train",
    # streaming=True,
).shuffle(
    seed=seed
)  # type: ignore

tokenizer = PreTrainedTokenizerFast.from_pretrained("tokenizer/")

Loading datasets...


Found cached dataset wikihow (/datadrive/hf_cache/wikihow/all-data_dir=%2Fdatadrive%2Fhf_cache/1.2.0/5343fc81d685acaa086c9cc19eb8706206cd1f8b315792b04c1d7b92091c305e)
Loading cached shuffled indices for dataset at /datadrive/hf_cache/wikihow/all-data_dir=%2Fdatadrive%2Fhf_cache/1.2.0/5343fc81d685acaa086c9cc19eb8706206cd1f8b315792b04c1d7b92091c305e/cache-ca61b0a7a4447ccd.arrow


In [12]:
# NOTE: the printed output below has 15 entries, so it comes from a run with max_length=15.
seq_len = 6
sample = wikihow_data[0]["text"][:50]
tokenized_sample = tokenizer(
    sample,
    padding="max_length",
    max_length=seq_len,
)
print("input_ids", tokenized_sample["input_ids"])
print("attention_mask", tokenized_sample["attention_mask"])

input_ids [220, 2160, 2718, 3914, 254, 6906, 6060, 413, 232, 687, 0, 0, 0, 0, 0]
attention_mask [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]


In [26]:
import torch

sample_qk = torch.randn(seq_len, seq_len)
attn_mask = torch.ones(seq_len, seq_len, dtype=torch.bool).tril(diagonal=0)
print(attn_mask)


tensor([[ True, False, False, False, False, False],
        [ True,  True, False, False, False, False],
        [ True,  True,  True, False, False, False],
        [ True,  True,  True,  True, False, False],
        [ True,  True,  True,  True,  True, False],
        [ True,  True,  True,  True,  True,  True]])
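For reference, if a padding mask were wanted anyway, it could be folded into the causal mask above by also hiding the padded key columns. This is a dependency-free sketch: `causal_padding_mask` is a hypothetical helper, and the `attention_mask` is a shortened stand-in for the tokenizer output, not real data.

```python
def causal_padding_mask(attention_mask):
    """Return mask[i][j] = True when query position i may attend to key position j."""
    n = len(attention_mask)
    return [
        [j <= i and attention_mask[j] == 1 for j in range(n)]
        for i in range(n)
    ]


# Stand-in for tokenizer output: 4 real tokens followed by 2 padding tokens.
attention_mask = [1, 1, 1, 1, 0, 0]
mask = causal_padding_mask(attention_mask)
for row in mask:
    print(row)
```

For the four real positions the rows are the same as in the plain lower-triangular mask; only the rows belonging to padding positions change. That is another way to see that the padding mask is redundant for right-padded causal attention.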


In [13]:
from model import WSModel, WSConfig

model = WSModel(WSConfig())

# Metrics

Just gonna paste my whole training script here and mess with it.

In [27]:
import os
from typing import Any

import datasets
import torch.utils.data
from composer import Trainer
from composer.optim import DecoupledAdamW, LinearWithWarmupScheduler
from composer.utils import reproducibility
from model import ComposerWSModel, WSConfig
from transformers import DataCollatorForLanguageModeling, PreTrainedTokenizerFast

HF_TOKEN = os.getenv("HF_TOKEN")
CACHE_DIR = "/datadrive/hf_cache"

###### CONFIG ######
model_params = {
    "d_model": 64,
    "n_heads": 4,
    "n_layers": 2,
    "vocab_size": 8192,
}

seed = 42
optim = {
    "lr": 1e-4,
    "betas": (0.9, 0.98),
    "eps": 1.0e-06,
    "weight_decay": 1.0e-5,
}
learning_rate = {"t_warmup": "250ba", "alpha_f": 0.02}
precision = "fp32"

save_folder = "checkpoints/pretraining/"
save_interval = "500ba"
hf_save_folder = "huggingface_model/"

tokenizer_dir = "tokenizer/"
###### END CONFIG ######


reproducibility.seed_all(seed)

tokenizer = PreTrainedTokenizerFast.from_pretrained(tokenizer_dir)
config = WSConfig(**model_params)

text_column_name = "text"


def tokenize_function(examples: dict[str, Any]):
    """
    Tokenize dataset examples.
    """
    examples[text_column_name] = [
        line
        for line in examples[text_column_name]
        if len(line) > 0 and not line.isspace()
    ]
    return tokenizer(
        examples[text_column_name],
        padding="max_length",
        truncation=True,
        max_length=256,
        return_special_tokens_mask=True,
    )


print("Loading datasets...")
wikihow_data: datasets.Dataset = datasets.load_dataset(
    "wikihow",
    name="all",
    data_dir=CACHE_DIR,
    cache_dir=CACHE_DIR,
    use_auth_token=HF_TOKEN,
    split="train",
    # streaming=True,
).shuffle(
    seed=seed
)  # type: ignore

tokenized_train = wikihow_data.map(
    tokenize_function,
    batched=True,
    remove_columns=wikihow_data.column_names,  # collate_fn doesn't like other columns
    load_from_cache_file=False,
)

collate_fn = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False,
)

train_dataloader = torch.utils.data.DataLoader(
    tokenized_train, batch_size=64, collate_fn=collate_fn
)


Loading datasets...


Found cached dataset wikihow (/datadrive/hf_cache/wikihow/all-data_dir=%2Fdatadrive%2Fhf_cache/1.2.0/5343fc81d685acaa086c9cc19eb8706206cd1f8b315792b04c1d7b92091c305e)
Loading cached shuffled indices for dataset at /datadrive/hf_cache/wikihow/all-data_dir=%2Fdatadrive%2Fhf_cache/1.2.0/5343fc81d685acaa086c9cc19eb8706206cd1f8b315792b04c1d7b92091c305e/cache-ca61b0a7a4447ccd.arrow


Map:   0%|          | 0/157252 [00:00<?, ? examples/s]
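With mlm=False, DataCollatorForLanguageModeling copies input_ids into labels and replaces the pad-token positions with -100, which is the index PyTorch's cross-entropy ignores by default. The sketch below is a plain-Python stand-in for that loss (assuming the model's loss follows the same convention; the logits are made-up numbers) and shows that whatever the model predicts at ignored positions has no effect on the loss.

```python
import math

IGNORE_INDEX = -100


def masked_cross_entropy(logits, labels, ignore_index=IGNORE_INDEX):
    """Mean cross-entropy over positions whose label is not ignore_index."""
    total, counted = 0.0, 0
    for row, label in zip(logits, labels):
        if label == ignore_index:
            continue  # padding position: contributes nothing to the loss
        top = max(row)
        log_norm = top + math.log(sum(math.exp(v - top) for v in row))
        total += log_norm - row[label]
        counted += 1
    return total / counted


# Two real positions followed by two padding positions.
labels = [2, 0, IGNORE_INDEX, IGNORE_INDEX]

# Identical logits at the real positions, different logits at the padding positions.
logits_a = [[0.1, 0.2, 3.0], [1.5, 0.3, 0.2], [0.0, 0.0, 0.0], [9.0, -9.0, 0.5]]
logits_b = [[0.1, 0.2, 3.0], [1.5, 0.3, 0.2], [5.0, -2.0, 1.0], [0.0, 4.0, 4.0]]

loss_a = masked_cross_entropy(logits_a, labels)
loss_b = masked_cross_entropy(logits_b, labels)
print(loss_a, loss_b)
```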

In [38]:
composer_model = ComposerWSModel(config=config, tokenizer=tokenizer)
optimizer = DecoupledAdamW(
    composer_model.model.parameters(),
    **optim,  # lr, betas, eps, and weight_decay come from the CONFIG block above
)
lr_scheduler = LinearWithWarmupScheduler(**learning_rate)


In [36]:
from composer.loggers import WandBLogger
from composer.callbacks import SpeedMonitor

wandb_logger = WandBLogger(project="wabisabi")

In [39]:
trainer = Trainer(
    model=composer_model,  # This is the model from the HuggingFaceModel wrapper class.
    train_dataloader=train_dataloader,
    # eval_dataloader=eval_dataloader,
    max_duration="1ep",  # train for more epochs to get better performance
    optimizers=optimizer,
    schedulers=[lr_scheduler],
    device="gpu" if torch.cuda.is_available() else "cpu",
    precision=precision,  # "fp32", set in the CONFIG block
    progress_bar=True,
    loggers=[wandb_logger],
    callbacks=[SpeedMonitor()],
    # checkpointing
    save_folder=save_folder,
    save_filename="ep{epoch}-ba{batch}-rank{rank}.pt",
    save_interval=save_interval,
    save_overwrite=True,
)
try:
    # Start training
    trainer.fit()

    # Save Hugging Face model
    config.save_pretrained(hf_save_folder)
    tokenizer.save_pretrained(hf_save_folder)
    composer_model.model.save_pretrained(hf_save_folder)
finally:
    trainer.close()


******************************
Config:
node_name: unknown because NODENAME environment variable not set
num_gpus_per_node: 1
num_nodes: 1
rank_zero_seed: 1046410796

******************************


train          Epoch   0:    0%|| 0/2458 [00:00<?, ?ba/s]         



0,1
loss/train/total,██▇▆▅▄▄▃▃▂▂▂▂▂▂▂▂▂▂▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
metrics/train/LanguageCrossEntropy,██▇▆▅▄▄▃▃▂▂▂▂▂▂▂▂▂▂▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
throughput/batches_per_sec,█▇▆▆▆▆▅▄▄▅▅▅▆▄▄▅▄▄▄▃▃▃▄▄▃▄▄▅▄▅▄▂▁▂▃▂▁▁▁▁
throughput/device/batches_per_sec,█▇▆▆▆▆▅▄▄▅▅▅▆▄▄▅▄▄▄▃▃▃▄▄▃▄▄▅▄▅▄▂▁▂▃▂▁▁▁▁
throughput/device/flops_per_sec,█▇▆▆▆▆▅▄▄▅▅▅▆▄▄▅▄▄▄▃▃▃▄▄▃▄▄▅▄▅▄▂▁▂▃▂▁▁▁▁
throughput/device/mfu,█▇▆▆▆▆▅▄▄▅▅▅▆▄▄▅▄▄▄▃▃▃▄▄▃▄▄▅▄▅▄▂▁▂▃▂▁▁▁▁
throughput/device/samples_per_sec,█▇▆▆▆▆▅▄▄▅▅▅▆▄▄▅▄▄▄▃▃▃▄▄▃▄▄▅▄▅▄▂▁▂▃▂▁▁▁▁
throughput/flops_per_sec,█▇▆▆▆▆▅▄▄▅▅▅▆▄▄▅▄▄▄▃▃▃▄▄▃▄▄▅▄▅▄▂▁▂▃▂▁▁▁▁
throughput/samples_per_sec,█▇▆▆▆▆▅▄▄▅▅▅▆▄▄▅▄▄▄▃▃▃▄▄▃▄▄▅▄▅▄▂▁▂▃▂▁▁▁▁
time/batch,▁▁▁▁▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███

Run summary:

| Metric | Final value |
|---|---|
| loss/train/total | 6.01482 |
| metrics/train/LanguageCrossEntropy | 6.01482 |
| throughput/batches_per_sec | 9.39411 |
| throughput/device/batches_per_sec | 9.39411 |
| throughput/device/flops_per_sec | 618565059620.1593 |
| throughput/device/mfu | 0.07637 |
| throughput/device/samples_per_sec | 595.58629 |
| throughput/flops_per_sec | 618565059620.1593 |
| throughput/samples_per_sec | 595.58629 |
| time/batch | 2458.0 |
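The throughput numbers in the summary can be sanity-checked by hand. This sketch assumes that SpeedMonitor derives flops_per_sec from the model's flops_per_batch (66.46923264 G, as printed in the Optimization section) and the sample throughput, and that MFU is achieved FLOPs per second divided by the device's peak. Both are my reading of the numbers, not something confirmed from the callback's source.

```python
batch_size = 64
flops_per_batch = 66.46923264e9          # printed by flops_per_batch in the Optimization section
samples_per_sec = 595.58629              # throughput/samples_per_sec
reported_flops_per_sec = 618565059620.1593
reported_mfu = 0.07637

# FLOPs per sample times samples per second.
estimated_flops_per_sec = flops_per_batch / batch_size * samples_per_sec

# If MFU = achieved / peak, the peak the callback assumed for this device is:
implied_peak_flops = reported_flops_per_sec / reported_mfu

print(f"estimated: {estimated_flops_per_sec / 1e9:.2f} GFLOP/s")
print(f"reported:  {reported_flops_per_sec / 1e9:.2f} GFLOP/s")
print(f"implied peak: {implied_peak_flops / 1e12:.2f} TFLOP/s")
```

The estimate agrees with the logged value to within rounding of the inputs, and the implied device peak comes out at roughly 8.1 TFLOP/s. An MFU under 8% suggests there is a lot of headroom, which is not surprising for a model this small.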


# Optimization
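I wanted to compare Mosaic's FLOPs estimate (the MPT-style estimate returned by flops_per_batch) with the DeepSpeed profiler. It looks about right: the MPT estimate is 66.47 GFLOPs per batch, while DeepSpeed reports 20.03 GFLOPs for the forward pass alone. The backward pass costs roughly twice as much as the forward pass, so a full training step would be about 20.03 * 3 = 60.09 GFLOPs. Pretty close!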

In [89]:
batch = next(iter(train_dataloader))
mpt_estimate = composer_model.flops_per_batch(batch)
print(composer_model.n_active_params)
print(mpt_estimate / 1e9, "G")

610624
66.46923264 G
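The 66.47 G figure can be reproduced by hand. The sketch below assumes flops_per_batch follows the estimate used for MPT in MosaicML's llm-foundry (an assumption about ComposerWSModel, whose source is not shown here): 2 FLOPs per parameter per token for the weights, plus an attention term per layer, all multiplied by 3 to cover the forward and backward passes. With the config values from this notebook it lands on the printed number.

```python
n_params = 610_624             # composer_model.n_active_params
batch_size, seq_len = 64, 256  # batch["input_ids"].shape
d_model, n_layers = 64, 2      # from model_params

# Weights: one multiply-add (2 FLOPs) per parameter per token.
params_flops_per_seq = 2 * n_params * seq_len

# Attention: QK^T and the attention-weighted sum over V, 2 FLOPs per multiply-add.
attn_flops_per_seq = n_layers * 2 * 2 * d_model * seq_len**2

forward_flops_per_seq = params_flops_per_seq + attn_flops_per_seq

# Backward is counted as twice the forward pass, hence the factor of 3.
flops_per_batch = 3 * forward_flops_per_seq * batch_size

deepspeed_forward_flops = 20.03e9  # "fwd flops per GPU" from the DeepSpeed profile
deepspeed_train_flops = 3 * deepspeed_forward_flops

print(flops_per_batch / 1e9, "G (formula)")
print(deepspeed_train_flops / 1e9, "G (3 x DeepSpeed forward)")
print(deepspeed_train_flops / flops_per_batch, "ratio")
```

The two estimates differ by about 10%, which is close enough for tracking MFU.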


In [66]:
batch["input_ids"].shape

torch.Size([64, 256])

In [72]:
emb = torch.nn.Embedding(
    num_embeddings=config.vocab_size, embedding_dim=config.d_model
).to("cuda")
# Move the token ids to the same device as the embedding table before the lookup.
emb(batch["input_ids"].to("cuda"))


tensor([[[-1.0980e+00, -1.1528e-02, -7.1528e-01,  ...,  1.5902e+00,
           1.2393e+00,  1.0311e+00],
         [ 7.4352e-01, -1.4786e+00,  3.3844e-01,  ...,  1.5986e+00,
          -5.4977e-01, -1.1131e+00],
         ...
(output truncated)

In [88]:
from deepspeed.profiling.flops_profiler import get_model_profile

batch = next(iter(train_dataloader)).to("cuda")
batch_size = 64
test_composer_model = ComposerWSModel(config=config, tokenizer=tokenizer)
print(test_composer_model.model.device)
flops, macs, params = get_model_profile(
    model=test_composer_model.model,
    input_shape=(batch_size, 256),
    # kwargs={"input_ids": batch["input_ids"]},
)


cpu
cpu
torch.int64
cpu
torch.int64

-------------------------- DeepSpeed Flops Profiler --------------------------
Profile Summary at step 1:
Notations:
data parallel size (dp_size), model parallel size(mp_size),
number of parameters (params), number of multiply-accumulate operations(MACs),
number of floating-point operations (flops), floating-point operations per second (FLOPS),
fwd latency (forward propagation latency), bwd latency (backward propagation latency),
step (weights update latency), iter latency (sum of fwd, bwd and step latency)

params per gpu:                                               610.62 k
params of model = params per GPU * mp_size:                   0       
fwd MACs per GPU:                                             10.0 GMACs
fwd flops per GPU:                                            20.03 G 
fwd flops of model = fwd flops per GPU * mp_size:             20.03 G 
fwd latency:                                                  109.0 ms
fwd FLOPS per GPU = f