CUDA OOM with HF Model #1058

Open
fadebek1 opened this issue Nov 9, 2022 · 0 comments

fadebek1 commented Nov 9, 2022

Hi, has the HF model been tested for training on CUDA? I am getting OOM errors no matter how small the batch size is. I'm using a V100 32GB. A code snippet is attached below. I've profiled the individual steps of HfPyTorchModel.train using nvidia-smi and narrowed the issue down: GPU memory spikes to fill all 32 GB right after the dataset is loaded. Is all of the data being loaded onto the GPU? Is this supposed to happen, and is there a way to disable this behavior? The error message also supports this, since PyTorch has only reserved 2.8 GB for the model itself.

import t5.data.mixtures
import functools
import t5.models
import seqio
import torch
import tensorflow_datasets as tfds
from transformers import Adafactor


model = t5.models.HfPyTorchModel("google/t5-v1_1-base", "/tmp", torch.device("cuda"))

# Point the registered GLUE tasks at the 2.0.0 TFDS version.
TaskRegistry = seqio.TaskRegistry
for b in tfds.text.glue.Glue.builder_configs.values():
    task = TaskRegistry.get("glue_%s_v002" % b.name)
    task.source._tfds_dataset._name = task.source._tfds_dataset._name.replace("1.0.0", "2.0.0")

model.train(
    "glue_v002_proportional",         # mixture to train on
    262144,                           # training steps
    5000,                             # save a checkpoint every 5000 steps
    {"inputs": 512, "targets": 512},  # sequence lengths
    "train",                          # split
    16,                               # batch size
    functools.partial(Adafactor, lr=1e-3, relative_step=False),  # optimizer
)
OutOfMemoryError: CUDA out of memory. Tried to allocate 192.00 MiB (GPU 0; 31.75 GiB total capacity;
2.71 GiB already allocated; 45.75 MiB free; 2.79 GiB reserved in total by PyTorch) If reserved 
memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.
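
Not from the original report, just a hedged sketch of one way to narrow this down: log how much GPU memory PyTorch's own allocator holds and compare it with the process-wide number nvidia-smi shows. The helper name report_gpu_memory and the max_split_size_mb value in the comments are illustrative, not taken from the t5 library.

import torch

def report_gpu_memory(tag):
    # torch.cuda.memory_allocated/memory_reserved only count PyTorch's own
    # allocations; nvidia-smi reports everything the process has grabbed,
    # including memory held by other libraries in the same process.
    alloc_gib = torch.cuda.memory_allocated() / 2**30
    reserved_gib = torch.cuda.memory_reserved() / 2**30
    print(f"{tag}: allocated={alloc_gib:.2f} GiB, reserved={reserved_gib:.2f} GiB")

report_gpu_memory("after HfPyTorchModel init")
# ... run model.train(...) from the snippet above, then compare these numbers
# with `watch nvidia-smi`. A large gap would mean the 32 GB is not being held
# by PyTorch tensors (consistent with the ~2.8 GiB reserved in the traceback).

# The traceback's max_split_size_mb hint can be tried by setting the allocator
# config before any CUDA allocation, e.g.
#   import os
#   os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"
# but that only addresses fragmentation inside PyTorch's own allocator.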