# Debugging

There's a bug somewhere - the model doesn't seem to be improving at all. I'll be trying to find it. What's the best way to do this?

First, I'll check my inputs. Maybe something's wrong with the training data + labels. To verify, I'll start by pretraining a tiny model implementation that I know works.

The tokenization appears fine.

# Loss

Trying to see if there's any issue with the labels and batch data. Can't find anything.
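
One cheap reference point for the loss itself: a model that predicts uniformly over the vocabulary should score a cross-entropy of ln(vocab_size), which is about 9.01 for the 8192-token vocabulary used here. The sketch below is plain Python (no torch) so it runs on its own. `masked_cross_entropy` is a hypothetical helper, not part of the training code. It averages only over positions whose label is not -100, the value the collator writes into padded positions.

```python
import math

IGNORE_INDEX = -100  # value the collator writes into padded label positions


def log_softmax(logits: list[float]) -> list[float]:
    """Numerically stable log-softmax over one position's logits."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]


def masked_cross_entropy(logits: list[list[float]], labels: list[int]) -> float:
    """Mean negative log-likelihood over positions whose label is not ignored."""
    losses = [
        -log_softmax(row)[label]
        for row, label in zip(logits, labels)
        if label != IGNORE_INDEX
    ]
    return sum(losses) / len(losses)


vocab_size = 8192
uniform_logits = [[0.0] * vocab_size for _ in range(4)]
labels = [17, 4095, IGNORE_INDEX, 0]  # the padded position is skipped

baseline = masked_cross_entropy(uniform_logits, labels)
print(round(baseline, 4), round(math.log(vocab_size), 4))
```

A training loss that stays near this baseline suggests the model is learning nothing beyond a uniform guess.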

In [2]:
from typing import Any
from transformers import DataCollatorForLanguageModeling, PreTrainedTokenizerFast
import torch
import datasets
from composer.utils import reproducibility

seed = 42
reproducibility.seed_all(seed)

CACHE_DIR = "/datadrive/hf_cache"
tokenizer_dir = "tokenizer/"

context_length = 256
batch_size = 128

tokenizer = PreTrainedTokenizerFast.from_pretrained(tokenizer_dir)

wikihow_data: datasets.Dataset = datasets.load_dataset(
    "wikihow",
    name="all",
    data_dir=CACHE_DIR,
    cache_dir=CACHE_DIR,
    split="train",
    # streaming=True,
).shuffle(
    seed=seed
)  # type: ignore

text_column_name = "text"


def tokenize_function(examples: dict[str, Any]):
    """
    Tokenize dataset examples.
    """
    examples[text_column_name] = [
        line
        for line in examples[text_column_name]
        if len(line) > 0 and not line.isspace()
    ]
    return tokenizer(
        examples[text_column_name],
        padding="max_length",
        truncation=True,
        max_length=context_length,
        return_special_tokens_mask=True,
    )


tokenized_train = wikihow_data.map(
    tokenize_function,
    batched=True,
    remove_columns=wikihow_data.column_names,  # collate_fn doesn't like other columns
    # load_from_cache_file=False,
)

collate_fn = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False,
)

train_dataloader = torch.utils.data.DataLoader(
    tokenized_train, batch_size=batch_size, collate_fn=collate_fn
)


Found cached dataset wikihow (/datadrive/hf_cache/wikihow/all-data_dir=%2Fdatadrive%2Fhf_cache/1.2.0/5343fc81d685acaa086c9cc19eb8706206cd1f8b315792b04c1d7b92091c305e)
Loading cached shuffled indices for dataset at /datadrive/hf_cache/wikihow/all-data_dir=%2Fdatadrive%2Fhf_cache/1.2.0/5343fc81d685acaa086c9cc19eb8706206cd1f8b315792b04c1d7b92091c305e/cache-ca61b0a7a4447ccd.arrow


Map:   0%|          | 0/157252 [00:00<?, ? examples/s]

In [9]:
wikihow_data[0]["text"]


'The general characteristics of cats vary from breed to breed, and even within breeds, cats vary from one another just like humans do. Don’t choose a name before you get your cat, because it certainly is not a “one name fits all” situation. Some cats are very vocal and active, while others are quiet and lazy. Some want to be in your lap all day, while others may prefer its own space.Consider these traits while you are considering names.\n\n\nThe name Lord Paddington IV is an excellent name, but it may work best for a calm, reserved cat. The name Spazzy would be great for a hyperactive, silly cat.\nGive your cat some time to acclimate to your home. It may seem timid and quiet at first, but really just be going through a period of adjustment.;\n, While of course there is more to your precious cat than its physical appearance, it’s a great way to generate some unique potential names. If you can find one that describes its appearance while also fitting its personality, you’ve struck gold!\

In [8]:
tokenizer.decode(tokenized_train[0]["input_ids"])


' the general characteristics of cats vary from breed to breed, and even within breeds, cats vary from one another just like humans do. don’t choose a name before you get your cat, because it certainly is not a “one name fits all” situation. some cats are very vocal and active, while others are quiet and lazy. some want to be in your lap all day, while others may prefer its own space.consider these traits while you are considering names.\n\n\nthe name lord paddington iv is an excellent name, but it may work best for a calm, reserved cat. the name spazzy would be great for a hyperactive, silly cat.\ngive your cat some time to acclimate to your home. it may seem timid and quiet at first, but really just be going through a period of adjustment.;\n, while of course there is more to your precious cat than its physical appearance, it’s a great way to generate some unique potential names. if you can find one that describes its appearance while also fitting its personality, you’ve struck gold!

In [11]:
train_data = iter(train_dataloader)


In [13]:
sample = next(train_data)
print(sample["input_ids"])
print(sample["labels"])


tensor([[1696,  306,  638,  ..., 1347, 3841, 2022],
        [6790, 3246,  790,  ...,  540,  260, 1075],
        [ 421, 1648,   12,  ...,  220, 3202,  758],
        ...,
        [2496, 7058, 6778,  ..., 8133,   26,  288],
        [ 317,  258,  391,  ...,  220, 2249,  254],
        [1732,  220, 6287,  ...,    0,    0,    0]])
tensor([[1696,  306,  638,  ..., 1347, 3841, 2022],
        [6790, 3246,  790,  ...,  540,  260, 1075],
        [ 421, 1648,   12,  ...,  220, 3202,  758],
        ...,
        [2496, 7058, 6778,  ..., 8133,   26,  288],
        [ 317,  258,  391,  ...,  220, 2249,  254],
        [1732,  220, 6287,  ..., -100, -100, -100]])


In [15]:
# checking to see if the rolling works as intended
from einops import rearrange

labels = sample["labels"]
# shift labels left
labels = torch.roll(labels, -1, dims=1)
labels[:, -1] = -100  # don't predict the last token
print(labels)
# flatten
labels = rearrange(labels, "batch seq_len -> (batch seq_len)")
print(labels)


tensor([[ 306,  638, 1610,  ..., 3841, 2022, -100],
        [3246,  790, 3729,  ...,  260, 1075, -100],
        [1648,   12,  313,  ..., 3202,  758, -100],
        ...,
        [7058, 6778,   39,  ...,   26,  288, -100],
        [ 258,  391, 1180,  ..., 2249,  254, -100],
        [ 220, 6287,  467,  ..., -100, -100, -100]])
tensor([ 306,  638, 1610,  ..., -100, -100, -100])
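
The same shift can be checked on a toy batch where the expected answer is obvious. This is a plain-Python sketch of the roll-and-mask step above; `shift_labels` is a hypothetical helper, and the token IDs are made up.

```python
IGNORE_INDEX = -100


def shift_labels(labels: list[list[int]]) -> list[list[int]]:
    """Shift each row left by one so position t holds the token at t + 1.

    The last position has no next token, so it is set to IGNORE_INDEX.
    """
    return [row[1:] + [IGNORE_INDEX] for row in labels]


input_ids = [
    [10, 11, 12, 13],
    [20, 21, 0, 0],  # 0 stands in for the pad token id
]
labels = [
    [10, 11, 12, 13],
    [20, 21, IGNORE_INDEX, IGNORE_INDEX],  # padded positions are masked
]

shifted = shift_labels(labels)

# Every non-ignored target must equal the *next* input token.
for ids_row, target_row in zip(input_ids, shifted):
    for t, target in enumerate(target_row):
        if target != IGNORE_INDEX:
            assert target == ids_row[t + 1]

print(shifted)
```

`torch.roll` wraps the first token around to the last position, which is why the cell above overwrites the last column with -100. Slicing avoids the wrap-around entirely.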


In [19]:
from model import WSConfig, WSModel
from transformers import PreTrainedTokenizerFast

hf_model_dir = "huggingface_model/"

test_config = WSConfig.from_pretrained(hf_model_dir)
test_model = WSModel.from_pretrained(hf_model_dir, config=test_config)
test_tokenizer = PreTrainedTokenizerFast.from_pretrained(hf_model_dir)

prompt = "the top of"
prompt_ids = test_tokenizer.encode(prompt, return_tensors="pt")
print("Prompt IDs:", prompt_ids)


output_ids = test_model.generate(prompt_ids, max_new_tokens=20)
print("Output IDs:", output_ids)

output_text = test_tokenizer.decode(output_ids[0])
print("Output Text:", output_text)

Either FairScale or torch distributed is not available, MixtureOfExperts will not be exposed. Please install them if you would like to use MoE


Prompt IDs: tensor([[ 220, 1178,  254]])
Output IDs: tensor([[ 220, 1178,  254,  220, 1178,  254,  220, 1178,  254,  220, 1178,  254,
          220, 1178,  254,  220, 1178,  254,  220, 1178,  254,  220, 1178]])
Output Text:  the top of the top of the top of the top of the top of the top of the top of the top


In [20]:
outputs = test_model(sample["input_ids"])
output_logits = rearrange(
    outputs.logits,
    "batch seq_len vocab_size -> (batch seq_len) vocab_size",
    vocab_size=8192,
)

In [42]:
token_ids = torch.argmax(outputs.logits[1, :, :], dim=1)
tokenizer.decode(token_ids)

'ising,gress.oneyment who a good of thekerud, and you hairop of a. you can a the a good in you to day have, the top.\n on the youor. theising, can to the ableer. the and you be able thatwork and. and. and a ifet.\n\n, the. the theising, and.. and aate.\n out of. theing the and. andate. andwork of and theosh.\n". theet. a hourinkical. be be able for a who in the. aies of\n, buttonam. and areshes.. and be a.. the aally. aize and\n is beu the the good. thes and.. andwork and awork of\n, andified. a,. you aerses. thety.s.\n ofising,als, not be aified.\n, and you can ale. and good,.. aified. theising, the media. be you can good. to you the ifship.ly.ments\n toper. the with azer and, and be, able it is\n.ship., aoney.'

In [23]:
outputs.logits.shape

torch.Size([128, 256, 8192])

In [22]:
output_logits.shape

torch.Size([32768, 8192])

In [16]:
print(labels.shape)


torch.Size([32768])


In [49]:
# logit rearranging seems fine
print("Second batch, first token logits", outputs.logits[1, 0, :])
print("Rearranged logits", output_logits[256, :])

Second batch, first token logits tensor([-3.5652,  2.6967,  3.3603,  ..., -0.8174, -0.5433, -4.8314],
       grad_fn=<SliceBackward0>)
Rearranged logits tensor([-3.5652,  2.6967,  3.3603,  ..., -0.8174, -0.5433, -4.8314],
       grad_fn=<SliceBackward0>)
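
The row index 256 used above follows from how the `(batch seq_len)` flattening orders rows: token `t` of batch element `b` lands at row `b * seq_len + t`. A small sketch of that mapping, using hypothetical helper names:

```python
def flat_index(batch_idx: int, token_idx: int, seq_len: int) -> int:
    """Row in the flattened (batch * seq_len) tensor for a given position."""
    return batch_idx * seq_len + token_idx


def unflatten(flat_idx: int, seq_len: int) -> tuple[int, int]:
    """Inverse mapping: flattened row -> (batch_idx, token_idx)."""
    return divmod(flat_idx, seq_len)


seq_len = 256
batch_size = 128

first_row = flat_index(1, 0, seq_len)        # second batch element, first token
last_row = flat_index(1, seq_len - 1, seq_len)
total_rows = batch_size * seq_len

print(first_row, last_row, total_rows)
print(unflatten(first_row, seq_len))
```

So the second batch element occupies rows 256 through 511, which is the `labels[256:512]` slice used for comparison.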


In [57]:
test_labels = torch.roll(sample["labels"], -1, dims=1)
test_labels[:, -1] = -100  # don't predict the last token
print(test_labels[1, :])
print(labels[256:512])


tensor([3246,  790, 3729,  260,  477,  898,  346,  208, 4597,  254,  458,   48,
         704,   12,  602,  313,  380, 1925,  391, 2140,  317,  258,  391, 6720,
         346,  208, 4475,  358, 3760,  692,   58, 2621,  251,  220, 1979,   14,
        3973, 4708,  991, 4843,  217,  254, 6790, 3246,  258,  871,  251,  318,
        4406,  237,  238,   12,  358, 1426,  318, 3410, 1304,   12, 2849,   12,
        2025,   12,  326, 1197, 3543, 2076,   14,  352,  154, 7057, 2509,  239,
         238, 6790, 3246,   12, 3868, 1928,   12,  326, 4089,  639,   14, 7394,
         509, 4224,  238, 3627,  239,   12, 6115,   12, 4089,  639,   12, 1304,
        2375,   12,  261, 2622, 2870,   14, 4064,   57,  238, 3543, 2076,  326,
         247,  312, 1272, 2076,  404,  612,  318, 5446,  284, 2094, 5629,  238,
        6585,  308, 5155, 1172,   14,  154, 1200,  673, 4460, 6518,  809, 1904,
         497, 5117,   57,   12,  720,  501, 4393,   57,  238, 2128, 4178,  534,
         326, 5336, 5313,   14,  440, 14

# Generate

Maybe the generate function is a little weird? Going to implement greedy decoding myself and see if there are differences.

Nope, it seems to be working fine.

The problem seems to be either my model implementation or how it's being trained. To be sure, I think it's worth training a reliable model implementation on the same data and trying to replicate the results.
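
For reference, a minimal sketch of greedy decoding, written against a stand-in scoring function instead of the real model so it runs on its own. The bigram table and every name here are made up for illustration. It appends token IDs directly and never re-tokenizes the growing string.

```python
from typing import Callable

Scores = list[float]


def greedy_decode(
    next_token_scores: Callable[[list[int]], Scores],
    prompt_ids: list[int],
    max_new_tokens: int,
) -> list[int]:
    """Repeatedly append the highest-scoring next token."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        scores = next_token_scores(ids)
        best = max(range(len(scores)), key=scores.__getitem__)  # argmax
        ids.append(best)
    return ids


# Toy "model" over a 4-token vocabulary: scores depend only on the last token.
BIGRAM: dict[int, Scores] = {
    0: [0.1, 0.9, 0.0, 0.0],  # after 0, prefer 1
    1: [0.0, 0.1, 0.8, 0.1],  # after 1, prefer 2
    2: [0.7, 0.1, 0.1, 0.1],  # after 2, prefer 0
    3: [0.25, 0.25, 0.25, 0.25],
}


def toy_scores(ids: list[int]) -> Scores:
    return BIGRAM[ids[-1]]


decoded = greedy_decode(toy_scores, [0], max_new_tokens=6)
print(decoded)
```

The toy model falls into a 0, 1, 2 cycle. Greedy decoding is deterministic, so once it revisits a state it repeats forever, which looks a lot like the "the top of the top of" loops in the outputs here.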

In [58]:
def generate(prompt: str):
    # The tokenizer returns a dict-like BatchEncoding; pass the ID tensor to the model.
    input = tokenizer(prompt, return_tensors="pt")
    output = test_model(input["input_ids"])
    print(output)

In [65]:
input["input_ids"].dtype

torch.int64

In [73]:
max_tokens = 20
prompt = "First, you must"

curr_str = prompt
for i in range(max_tokens):
    input_ids = tokenizer(curr_str, return_tensors="pt")["input_ids"]
    output = test_model(input_ids)
    pred_token_id = output.logits[0, -1, :].argmax()
    new_token = tokenizer.decode(pred_token_id)
    curr_str += new_token

In [79]:
generate_output = test_model.generate(
    tokenizer(prompt, return_tensors="pt")["input_ids"], max_new_tokens=20
)


In [82]:
tokenizer.decode(generate_output[0])

' first, you must be able to the top of the top of the top of the top of the top of the top'

In [74]:
curr_str

'First, you must be able to the top of the top of the top of the top of the top of the top'

# Same Data, Different Model

If I train a reliable model implementation on the same data but with different parameters and the results are similar, I can just try to scale up the model and see if it solves any issues.

The llm-foundry docs recommend using Docker images, so I'll do that.

Setup was a little annoying. For whatever reason, installing Docker Desktop didn't properly install the service, so I used Docker Engine instead. Then, to access GPUs inside the container, you need to install the NVIDIA Container Toolkit.

The command to run is:
```
sudo docker run -it --runtime=nvidia --gpus all -v /datadrive:/datadrive mosaicml/llm-foundry:2.0.1_cu118-latest /bin/bash
```

Actually... I'll keep this around, but I don't feel like plumbing the depths of the llm-foundry code. I'll just adapt my training script to use GPT-NeoX or something.

Ok, so it clearly trains with GPT-NeoX.

In [1]:
import os
from typing import Any

import datasets
import torch.utils.data
import torchinfo
import wandb
from composer import Callback, Logger, State, Time, Trainer
from composer.callbacks import SpeedMonitor
from composer.loggers import WandBLogger
from composer.optim import DecoupledAdamW, LinearWithWarmupScheduler
from composer.utils import reproducibility
from composer.models.huggingface import HuggingFaceModel
from transformers import DataCollatorForLanguageModeling, PreTrainedTokenizerFast
import transformers

HF_TOKEN = os.getenv("HF_TOKEN")
CACHE_DIR = "/datadrive/hf_cache"

seed = 42
reproducibility.seed_all(seed)


In [2]:
# editing config to match mine (mostly)
# just want a general sense of the model to be the same


name = "EleutherAI/gpt-neox-20b"
config = transformers.GPTNeoXConfig.from_pretrained(name, cache_dir=CACHE_DIR)

context_length = 256
batch_size = 128
config.hidden_size = 256
config.intermediate_size = 1024
config.num_attention_heads = 4
config.num_hidden_layers = 6
config.vocab_size = 8192

optim = {
    "lr": 6.0e-4,
    "betas": (0.9, 0.95),
    "eps": 1.0e-08,
    "weight_decay": 0.0,
}
learning_rate = {"t_warmup": "100ba", "alpha_f": 0.1}

model = transformers.GPTNeoXForCausalLM(config=config)
# tokenizer = transformers.AutoTokenizer.from_pretrained(name, cache_dir=CACHE_DIR)
# tokenizer.pad_token = tokenizer.eos_token
tokenizer = PreTrainedTokenizerFast.from_pretrained("tokenizer/")
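
A rough size estimate for this shrunken config, as a back-of-the-envelope sketch. It counts only the large weight matrices, ignores biases and layer norms, and assumes the input and output embeddings are not tied. `estimate_params` is a hypothetical helper; `torchinfo.summary` gives the exact number.

```python
def estimate_params(
    hidden_size: int,
    intermediate_size: int,
    num_layers: int,
    vocab_size: int,
    tied_embeddings: bool = False,
) -> dict[str, int]:
    """Approximate weight-matrix parameter counts for a decoder-only transformer."""
    embed = vocab_size * hidden_size
    unembed = 0 if tied_embeddings else vocab_size * hidden_size
    attention = 4 * hidden_size * hidden_size  # Q, K, V and output projections
    mlp = 2 * hidden_size * intermediate_size  # up and down projections
    blocks = num_layers * (attention + mlp)
    return {
        "embeddings": embed + unembed,
        "blocks": blocks,
        "total": embed + unembed + blocks,
    }


estimate = estimate_params(
    hidden_size=256,
    intermediate_size=1024,
    num_layers=6,
    vocab_size=8192,
)
print(estimate)
```

Under these assumptions the model is a little under 9M parameters, with almost half of them in the embedding matrices.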


In [3]:
text_column_name = "text"


def tokenize_function(examples: dict[str, Any]):
    """
    Tokenize dataset examples.
    """
    examples[text_column_name] = [
        line
        for line in examples[text_column_name]
        if len(line) > 0 and not line.isspace()
    ]
    return tokenizer(
        examples[text_column_name],
        padding="max_length",
        truncation=True,
        max_length=context_length,
        return_special_tokens_mask=True,
    )


print("Loading datasets...")
wikihow_data: datasets.Dataset = datasets.load_dataset(
    "wikihow",
    name="all",
    data_dir=CACHE_DIR,
    cache_dir=CACHE_DIR,
    use_auth_token=HF_TOKEN,
    split="train",
    # streaming=True,
).shuffle(
    seed=seed
)  # type: ignore

tokenized_train = wikihow_data.map(
    tokenize_function,
    batched=True,
    remove_columns=wikihow_data.column_names,  # collate_fn doesn't like other columns
    # load_from_cache_file=False,
)

collate_fn = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False,
)

train_dataloader = torch.utils.data.DataLoader(
    tokenized_train, batch_size=batch_size, collate_fn=collate_fn
)


Loading datasets...


Found cached dataset wikihow (/datadrive/hf_cache/wikihow/all-data_dir=%2Fdatadrive%2Fhf_cache/1.2.0/5343fc81d685acaa086c9cc19eb8706206cd1f8b315792b04c1d7b92091c305e)
Loading cached shuffled indices for dataset at /datadrive/hf_cache/wikihow/all-data_dir=%2Fdatadrive%2Fhf_cache/1.2.0/5343fc81d685acaa086c9cc19eb8706206cd1f8b315792b04c1d7b92091c305e/cache-ca61b0a7a4447ccd.arrow
Loading cached processed dataset at /datadrive/hf_cache/wikihow/all-data_dir=%2Fdatadrive%2Fhf_cache/1.2.0/5343fc81d685acaa086c9cc19eb8706206cd1f8b315792b04c1d7b92091c305e/cache-2206b29dbc9b948e.arrow


In [4]:
from composer.metrics.nlp import LanguageCrossEntropy

train_metrics = [LanguageCrossEntropy()]
composer_model = HuggingFaceModel(
    model=model,
    shift_labels=True,
    tokenizer=tokenizer,
    use_logits=True,
    metrics=train_metrics,
)
torchinfo.summary(
    composer_model.model, input_size=(batch_size, context_length), dtypes=[torch.long]
)

optimizer = DecoupledAdamW(
    composer_model.model.parameters(),
    **optim,
)
lr_scheduler = LinearWithWarmupScheduler(**learning_rate)


class SampleCallback(Callback):
    def __init__(
        self, sample_prompt: str, tokenizer: PreTrainedTokenizerFast, interval: str
    ):
        self.sample_prompt_ids = tokenizer.encode(sample_prompt, return_tensors="pt")
        self.interval = Time.from_timestring(interval)
        self.last_sample = Time(0, "ba")
        self.tokenizer = tokenizer

        # create table for samples
        self.table = wandb.Table(columns=["sample"])
        super().__init__()

    def batch_end(self, state: State, logger: Logger):
        if (state.timestamp.batch - self.last_sample) < self.interval:
            return
        output_ids = state.model.generate(
            state.device.tensor_to_device(self.sample_prompt_ids),
            max_new_tokens=30,
        )
        output_text = self.tokenizer.decode(output_ids[0])
        self.table.add_data(output_text)
        logger.log_metrics({"samples": self.table})

        self.last_sample = state.timestamp.batch


wandb_logger = WandBLogger(project="wabisabi")
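
To see what the schedule settings imply, here is a plain-Python sketch of a linear-warmup, linear-decay multiplier. It reflects my reading of how `LinearWithWarmupScheduler` behaves with `t_warmup="100ba"` and `alpha_f=0.1` (warm up from 0 to 1, then decay linearly to `alpha_f` by the end of training). It is not Composer's code, and `lr_multiplier` is a hypothetical name.

```python
def lr_multiplier(
    step: int,
    warmup_steps: int,
    total_steps: int,
    alpha_i: float = 1.0,
    alpha_f: float = 0.1,
) -> float:
    """Learning-rate multiplier: linear warmup, then linear decay to alpha_f."""
    if step < warmup_steps:
        return alpha_i * step / warmup_steps
    if total_steps <= warmup_steps:
        return alpha_i  # no room left for decay
    frac = min(1.0, (step - warmup_steps) / (total_steps - warmup_steps))
    return alpha_i + frac * (alpha_f - alpha_i)


base_lr = 6.0e-4
warmup_steps = 100
total_steps = 1229  # batches in one epoch, per the training progress bar

for step in (0, 50, 95, 100, 1229):
    print(step, base_lr * lr_multiplier(step, warmup_steps, total_steps))
```

Under this reading, any run stopped before batch 100 is still in warmup and has not yet reached the peak learning rate.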

In [5]:
save_folder = "gpt-neox/checkpoints/pretraining"
save_interval = "500ba"
hf_save_folder = "gpt-neox/huggingface_model/"

# Create Trainer Object
trainer = Trainer(
    model=composer_model,  # This is the model from the HuggingFaceModel wrapper class.
    train_dataloader=train_dataloader,
    # eval_dataloader=eval_dataloader,
    max_duration="1ep",  # train for more epochs to get better performance
    optimizers=optimizer,
    schedulers=[lr_scheduler],
    device="gpu" if torch.cuda.is_available() else "cpu",
    precision="fp32",
    progress_bar=True,
    loggers=[wandb_logger],
    callbacks=[
        SpeedMonitor(),
        SampleCallback("To cook pasta, the first step is to", tokenizer, "100ba"),
    ],
    # checkpointing
    save_folder=save_folder,
    save_filename="ep{epoch}-ba{batch}-rank{rank}.pt",
    save_interval=save_interval,
    save_overwrite=True,
)

try:
    # Start training
    trainer.fit()
finally:
    trainer.close()

print("Saving model...")
# Save Hugging Face model
config.save_pretrained(hf_save_folder)
tokenizer.save_pretrained(hf_save_folder)
composer_model.model.save_pretrained(hf_save_folder)
print("Done!")

Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
[34m[1mwandb[0m: Currently logged in as: [33mjohn-sungjin[0m. Use [1m`wandb login --relogin`[0m to force relogin


[2023-06-28 02:11:23,723] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)


******************************
Config:
node_name: unknown because NODENAME environment variable not set
num_gpus_per_node: 1
num_nodes: 1
rank_zero_seed: 789680082

******************************


train          Epoch   0:    0%|| 0/1229 [00:00<?, ?ba/s]         

You're using a PreTrainedTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


1ba
2ba
3ba
4ba
5ba
6ba
7ba
8ba
9ba
10ba


Run history:
loss/train/total,████▇▇▇▇▇▇▆▆▆▆▆▅▅▅▄▄▄▄▃▃▃▃▃▃▂▂▂▂▂▂▂▁▁▁▁▁
metrics/train/LanguageCrossEntropy,████▇▇▇▇▇▇▆▆▆▆▆▅▅▅▄▄▄▄▃▃▃▃▃▃▂▂▂▂▂▂▂▁▁▁▁▁
time/batch,▁▁▁▂▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███
time/batch_in_epoch,▁▁▁▂▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███
time/epoch,▁
time/sample,▁▁▁▂▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███
time/sample_in_epoch,▁▁▁▂▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███
time/token,▁▁▁▂▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███
time/token_in_epoch,▁▁▁▂▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███
time/total,▁▁▁▂▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███

Run summary:
loss/train/total,5.76012
metrics/train/LanguageCrossEntropy,5.76012
time/batch,95.0
time/batch_in_epoch,95.0
time/epoch,0.0
time/sample,12160.0
time/sample_in_epoch,12160.0
time/token,3112960.0
time/token_in_epoch,3112960.0
time/total,0.0211


KeyboardInterrupt: 
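
The numbers in the run summary can be cross-checked with a little arithmetic. This sketch only uses values that appear in the outputs above (dataset size, batch size, context length, batches seen, final loss). The variable names are mine.

```python
import math

num_examples = 157_252   # from the Map progress bar
batch_size = 128
context_length = 256
vocab_size = 8192
batches_seen = 95        # time/batch in the run summary
final_loss = 5.76012     # loss/train/total in the run summary

batches_per_epoch = math.ceil(num_examples / batch_size)
samples_seen = batches_seen * batch_size
tokens_seen = samples_seen * context_length  # counts padded positions too
perplexity = math.exp(final_loss)
uniform_loss = math.log(vocab_size)          # loss of a uniform guess

print("batches per epoch:", batches_per_epoch)
print("samples seen:", samples_seen)
print("tokens seen:", tokens_seen)
print("perplexity:", round(perplexity, 1))
print("uniform-guess loss:", round(uniform_loss, 2))
```

The batch, sample, and token counts match the logged values, and the loss is already well below the uniform-guess baseline after 95 batches, which agrees with the conclusion that this model trains.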