CUDA OOM with HF Model #1058

Open
fadebek1 opened this issue Nov 9, 2022 · 0 comments

fadebek1 commented Nov 9, 2022

Hi, has the HF model been tested for training on CUDA? I am getting OOM errors no matter how small the batch size is. I'm using a V100 32GB. A code snippet is attached below. I've profiled the individual steps of HfPyTorchModel.train using nvidia-smi and narrowed the issue down: GPU memory spikes to fill all 32 GB right after the dataset is loaded. Is all of the data being loaded onto the GPU? Is this supposed to happen, and is there a way to disable this behavior? The error message also supports this, since PyTorch has only reserved 2.8 GB for the model itself.

import t5.data.mixtures
import functools
import t5.models
import seqio
import torch
import tensorflow_datasets as tfds
from transformers import Adafactor


model = t5.models.HfPyTorchModel("google/t5-v1_1-base", "/tmp", torch.device("cuda"))

# Point the registered GLUE tasks at the 2.0.0 TFDS version.
TaskRegistry = seqio.TaskRegistry
for b in tfds.text.glue.Glue.builder_configs.values():
    task = TaskRegistry.get("glue_%s_v002" % b.name)
    task.source._tfds_dataset._name = task.source._tfds_dataset._name.replace("1.0.0", "2.0.0")

model.train(
    "glue_v002_proportional",         # mixture to train on
    262144,                           # training steps
    5000,                             # save a checkpoint every 5000 steps
    {"inputs": 512, "targets": 512},  # sequence lengths
    "train",                          # split
    16,                               # batch size
    functools.partial(Adafactor, lr=1e-3, relative_step=False),  # optimizer
)
OutOfMemoryError: CUDA out of memory. Tried to allocate 192.00 MiB (GPU 0; 31.75 GiB total capacity;
2.71 GiB already allocated; 45.75 MiB free; 2.79 GiB reserved in total by PyTorch) If reserved 
memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.
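
Not from the original report, just a hedged sketch of one way to narrow this down: log how much GPU memory PyTorch's own allocator holds and compare it with the process-wide number nvidia-smi shows. The helper name report_gpu_memory and the max_split_size_mb value in the comments are illustrative, not taken from the t5 library.

import torch

def report_gpu_memory(tag):
    # torch.cuda.memory_allocated/memory_reserved only count PyTorch's own
    # allocations; nvidia-smi reports everything the process has grabbed,
    # including memory held by other libraries in the same process.
    alloc_gib = torch.cuda.memory_allocated() / 2**30
    reserved_gib = torch.cuda.memory_reserved() / 2**30
    print(f"{tag}: allocated={alloc_gib:.2f} GiB, reserved={reserved_gib:.2f} GiB")

report_gpu_memory("after HfPyTorchModel init")
# ... run model.train(...) from the snippet above, then compare these numbers
# with `watch nvidia-smi`. A large gap would mean the 32 GB is not being held
# by PyTorch tensors (consistent with the ~2.8 GiB reserved in the traceback).

# The traceback's max_split_size_mb hint can be tried by setting the allocator
# config before any CUDA allocation, e.g.
#   import os
#   os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"
# but that only addresses fragmentation inside PyTorch's own allocator.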