
feat(model parallelism): moving the labels to the same device as the logits for gpt2 and bart #22591

Merged · 2 commits merged into huggingface:main on Apr 5, 2023

Conversation

kausmeows (Contributor)

What does this PR do?

As suggested in #22561, this moves the labels to the same device as the logits they are compared to, for the BART and GPT-2 models.

This follows the approach referenced in #22535.

lm_logits = self.lm_head(outputs[0])
lm_logits = lm_logits + self.final_logits_bias.to(lm_logits.device)

masked_lm_loss = None
if labels is not None:
    labels = labels.to(lm_logits.device)
    loss_fct = CrossEntropyLoss()
    masked_lm_loss = loss_fct(lm_logits.view(-1, self.config.vocab_size), labels.view(-1))
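For context, the pattern the diff above applies can be sketched as a self-contained example (plain PyTorch with hypothetical names, not the actual modeling code):

```python
import torch
from torch.nn import CrossEntropyLoss

def language_modeling_loss(lm_logits, labels):
    # The key line from this PR: move labels to the logits' device so that
    # model-parallel runs (where lm_head sits on a different GPU than the
    # input batch) do not raise a cross-device error inside the loss.
    labels = labels.to(lm_logits.device)
    loss_fct = CrossEntropyLoss()
    vocab_size = lm_logits.size(-1)
    return loss_fct(lm_logits.view(-1, vocab_size), labels.view(-1))

# Toy shapes: (batch, seq_len, vocab) logits and (batch, seq_len) labels.
lm_logits = torch.randn(2, 5, 100)
labels = torch.randint(0, 100, (2, 5))
loss = language_modeling_loss(lm_logits, labels)
print(loss.item())
```

On a single device the `.to()` call is a no-op, so the change costs nothing in the common case.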

cc @sgugger, could you review this?

HuggingFaceDocBuilderDev commented Apr 5, 2023

The documentation is not available anymore as the PR was closed or merged.

sgugger (Collaborator) commented Apr 5, 2023

Thanks a lot for your PR! Could you apply make fix-copies so that the models copied from BART or GPT-2 are auto-updated?

kausmeows (Contributor, Author)

> Thanks a lot for your PR! Could you apply make fix-copies so that the models copied from BART or GPT-2 are auto-updated?

Hi, just did that!

sgugger (Collaborator) left a review comment


Thanks a lot!

@sgugger sgugger merged commit 1564189 into huggingface:main Apr 5, 2023
4 checks passed
kausmeows (Contributor, Author)

> Thanks a lot!

All good! ✨

@kausmeows kausmeows deleted the kaus branch April 5, 2023 18:40
innat commented Apr 5, 2023

Hi @kaustubh-s1, will this change fix model parallelism for GPT-2? I've just tried it and got:

 File "/opt/conda/envs/gpt_neox/lib/python3.9/site-packages/torch/nn/functional.py", line 2515, in layer_norm
    return torch.layer_norm(input, normalized_shape, weight, bias, eps, torch.backends.cudnn.enabled)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument weight in method wrapper_CUDA__native_layer_norm)

P.S. My setup is almost the same as this one; the only differences are below:

import torch
from transformers import AutoModelForCausalLM

def get_parallel_model(model_name):
    # Shard the model across available GPUs and load in half precision.
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        device_map='auto',
        torch_dtype=torch.float16,
        low_cpu_mem_usage=True
    )

    # setattr(model, 'model_parallel', True)
    # setattr(model, 'is_parallelizable', True)

    setattr(model, 'gradient_checkpointing', True)
    return model
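The layer_norm traceback above suggests an activation crossed a GPU shard boundary without being moved. A small debugging sketch (plain PyTorch, not from this PR) that lists which device each parameter landed on can help locate that boundary after loading with device_map='auto':

```python
import torch
from torch import nn

def parameter_devices(model):
    # Group parameter names by the device they were placed on; on a model
    # sharded across GPUs this typically shows keys like 'cuda:0' and 'cuda:1'.
    devices = {}
    for name, param in model.named_parameters():
        devices.setdefault(str(param.device), []).append(name)
    return devices

# Tiny stand-in model for illustration; a real run would inspect the model
# returned by get_parallel_model above.
model = nn.Sequential(nn.Linear(4, 4), nn.LayerNorm(4))
print(parameter_devices(model))
```

Once the shard boundary is known, the fix is the same shape as this PR: move the offending tensor to the device of the weights it is about to meet.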

kausmeows (Contributor, Author)

> Hi, @kaustubh-s1, does this change will fix model parallel for gpt2? I've just tried but got […]

Hi @innat. It should, I'd guess, but I don't have a multi-GPU setup so I can't say for sure. I just followed the steps in #22535 to move the labels to the same device as the logits. Theoretically, it should work.

novice03 pushed a commit to novice03/transformers that referenced this pull request Jun 23, 2023