
Impossible to finetune model bigger than 124M or 125M parameters (colab) #291

Blazeolmo opened this issue Mar 10, 2022 · 1 comment


@Blazeolmo
When I try to finetune GPT-2 355M (because GPT-Neo 350M is broken), I always get a CUDA out-of-memory error no matter what I do: not with a T4 (and no, I'm not planning on getting Colab Pro just to play with a text-generation AI), not with fp16 (which doesn't work either), and not even with gradient_checkpointing=True.

What am I supposed to do, create VRAM out of thin air? Can't there be something that limits VRAM usage and empties it? I find it astonishing that there seems to be no way to empty the VRAM other than a factory reset of the VM; it's VRAM, not disk storage. As it stands, a failed training attempt leaves the VRAM full, so you either can't train at all, lose half your progress, or have to factory-reset the VM before trying again.

Any potentially useful help is appreciated. Sorry if I was a little rude; I'm just tired of stuff on GitHub being more broken than a chair that has been set on fire.
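For the "empty the VRAM without a factory reset" part: in PyTorch this is possible in principle, as long as nothing in the session still references the GPU tensors. A minimal sketch (the `model` variable is a hypothetical stand-in for whatever the failed run left behind; this won't help if the training library itself is still holding references):

```python
import gc
import torch

def release_vram():
    """Release PyTorch's cached GPU blocks back to the driver without
    restarting the VM. Only frees memory that nothing references anymore."""
    gc.collect()                      # destroy dead Python-side references first
    if torch.cuda.is_available():
        torch.cuda.empty_cache()      # return cached allocations to CUDA

# Usage sketch: `model` stands in for the failed run's network/optimizer.
model = torch.nn.Linear(1024, 1024)
if torch.cuda.is_available():
    model = model.cuda()
del model          # drop the last reference; empty_cache() alone is not enough
release_vram()
```

Note this only reclaims memory between runs; it cannot prevent an OOM that happens mid-training because the model plus activations genuinely exceed the card.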

@Mennaruuk
Copy link

Mennaruuk commented Mar 14, 2022

This issue is most likely not due to the script; it has been reproduced in three different scripts this year. I reported it to Google and hope they can take a look at it. But yes, I've faced the same issue you're having. Personally, I don't see this getting fixed soon: it looks like Google is allocating fewer high-performance GPUs to free-tier users, which accounts for more crashes when training bigger models. For the time being, consider training the 124M model on GPU, or bigger models on TPU, which is slower but at least works. I don't recommend another service such as Gradient; their free tier won't get you anywhere with even the smallest model. Please consider wording this differently next time, though: there is a human on the other side.
