
Evaluation loop fails both with and without deepspeed #1472

Closed
ananyahjha93 opened this issue Aug 27, 2022 · 13 comments
@ananyahjha93
Contributor

**Environment**

  • OS: Ubuntu 20.04
  • Hardware (GPU, or instance type): 8xA100
  • cuda: 11.3
  • cudnn: 8
  • pytorch: 1.12.1
  • composer: dev branch installed from source
  • deepspeed: 0.7.2
  • transformers: 4.21.2

**To reproduce**

Steps to reproduce the behavior:

  1. Use C4Dataset to train HF BLOOM on multiple GPUs, with or without deepspeed (a rough sketch of the setup is below).
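
Roughly, the training script looks like this (a minimal sketch, not my exact code: build_c4_dataloader is a placeholder helper, the BLOOM checkpoint name is arbitrary, and the Trainer kwargs reflect the current Composer Trainer signature):

import transformers
from composer import Trainer
from composer.models import HuggingFaceModel

# Placeholder: build per-device train/eval dataloaders over C4 (not a real Composer helper).
train_dataloader = build_c4_dataloader(split="train", batch_size=64)
eval_dataloader = build_c4_dataloader(split="val", batch_size=64)

# The checkpoint name here is just a placeholder for the HF BLOOM model being trained.
hf_model = transformers.AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m")
tokenizer = transformers.AutoTokenizer.from_pretrained("bigscience/bloom-560m")
model = HuggingFaceModel(model=hf_model, tokenizer=tokenizer)

trainer = Trainer(
    model=model,
    train_dataloader=train_dataloader,
    eval_dataloader=eval_dataloader,
    max_duration="1ep",
    deepspeed_config={"zero_optimization": {"stage": 2}},  # drop this line to run without deepspeed
)
trainer.fit()

Launched across the 8 GPUs with the composer launcher, e.g. composer -n 8 train.py.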

**Expected behavior**

Eval loop should run without crashing.

**Additional context**

Error message without deepspeed:
[Screenshot: Screen Shot 2022-08-27 at 4 57 15 AM]

Error message with deepspeed:
[Screenshot: Screen Shot 2022-08-27 at 5 12 56 AM]

I'll try to debug a bit more to see what's wrong, but I'm posting it here in the meantime.

@ananyahjha93 ananyahjha93 added the bug Something isn't working label Aug 27, 2022
@hanlint
Contributor

hanlint commented Aug 27, 2022

Thanks @ananyahjha93, we did merge a major refactor this week (#1419) that may be causing some instability on dev. Was this working before on a versioned release?

@bandish-shah
Member

Thanks @ananyahjha93 for bringing this up. As @hanlint mentioned above, we’re in the process of a few major refactors, so dev could be unstable for some time. We suggest you stick to one of the official releases:
https://github.com/mosaicml/composer/releases

v0.9.0 is the most recent release, could you give that a try?

@ananyahjha93
Contributor Author

@bandish-shah I tried 0.9.0; the eval loop crashes because of an OOM error, which is weird because eval should not allocate additional memory on the GPU.

This is the GPU usage while training:
[Screenshot: Screen Shot 2022-08-29 at 9 21 24 PM]

But once the eval loop starts, the code crashes with this message:
[Screenshot: Screen Shot 2022-08-29 at 9 24 37 PM]

@hanlint
Contributor

hanlint commented Aug 30, 2022

🤔 If you are supplying the same batch_size=64 to both training and eval, this can occur, since grad_accum only applies to training, not to eval. Could you try setting a smaller batch size for eval?

Alternatively, if you are installing from dev, you can try grad_accum='auto' and it should adjust everything to fit in memory for both training and eval.
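
Concretely, something like this (a sketch, reusing the placeholder names from the repro sketch above; exact kwargs may differ slightly between releases):

from composer import Trainer

# Option 1 (works on v0.9.0): use a smaller per-device batch size for eval only.
eval_dataloader = build_c4_dataloader(split="val", batch_size=8)

# Option 2 (dev branch): let Composer pick the microbatch size automatically.
trainer = Trainer(
    model=model,
    train_dataloader=train_dataloader,
    eval_dataloader=eval_dataloader,
    max_duration="1ep",
    grad_accum="auto",  # adjusts to fit in GPU memory for both training and eval
)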

@ananyahjha93
Contributor Author

@hanlint that is most likely the obvious answer I have been overlooking.
Let me check and report back; if this was the only issue with my run, I'm sorry for the confusion.

@ananyahjha93
Contributor Author

@hanlint closing this issue. Yes, you were right: the batch size for the eval dataloader was the issue. A log-line warning about this alongside the grad_accum messaging would be nice, though!

@ananyahjha93
Contributor Author

@hanlint @bandish-shah the batch-size change mentioned above fixes 0.9.0, and eval runs fine with it. But on the dev branch, the problem of the eval dataloader process exiting randomly continues.

@bandish-shah
Member

Hi @ananyahjha93, we're still working through stability issues on dev, so it would be best to use a released version of Composer. Is there a reason you're trying to use dev? We'll keep this bug open regardless to see if a fix addresses it.

One thing that comes to mind looking at your environment information: I don't think we've tested Composer with Torch 1.12.1 + CUDA 11.3. Our latest Docker images use Torch 1.12.1 + CUDA 11.6.2, which is what we run our CI testing on:
https://hub.docker.com/layers/mosaicml/pytorch/1.12.1_cu116-python3.9-ubuntu20.04/images/sha256-f0dc580a51667970d48eb67b1407bd992959f681763b0c324d3cd21626d54e91?context=repo

It's a shot in the dark but would it be possible for you to try this configuration?

@hanlint hanlint self-assigned this Sep 5, 2022
@ananyahjha93
Contributor Author

ananyahjha93 commented Sep 5, 2022

So this seems to be a ninja version issue, because I was getting the following error on a different machine: zhanghang1989/PyTorch-Encoding#167

Using their fix, I followed these steps:

wget https://github.com/ninja-build/ninja/releases/download/v1.8.2/ninja-linux.zip
sudo unzip ninja-linux.zip -d /usr/local/bin/
sudo update-alternatives --install /usr/bin/ninja ninja /usr/local/bin/ninja 1 --force 

However, my dataloader crashed again, which made me look at the ninja version in the requirements file shared by @hanlint.

Updating ninja to 1.10.2 works:

wget https://github.com/ninja-build/ninja/releases/download/v1.10.2/ninja-linux.zip
sudo unzip ninja-linux.zip -d /usr/local/bin/
sudo update-alternatives --install /usr/bin/ninja ninja /usr/local/bin/ninja 1 --force 

However, my question is: why does installing ninja this way work, but not via pip install?
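
For debugging, here is a quick way to check which ninja binary is actually being resolved (a small diagnostic sketch; my guess is the pip-installed ninja lives somewhere else on PATH than /usr/local/bin):

import shutil
import subprocess

# Print which ninja binary is first on PATH and report its version.
ninja_path = shutil.which("ninja")
print("ninja on PATH:", ninja_path)
if ninja_path:
    result = subprocess.run([ninja_path, "--version"], capture_output=True, text=True)
    print("ninja version:", result.stdout.strip())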

@hanlint
Contributor

hanlint commented Sep 6, 2022

Yeah that's odd, my requirements were installed via pip. Are you sure that pip install --upgrade ninja==1.10.2 didn't work for you?

@ananyahjha93
Contributor Author

@hanlint no, that is the weird part.

That is why it was so difficult to find this bug last week.

@ananyahjha93
Contributor Author

@hanlint I think you can close this issue, since it was linked to a specific ninja version local to our systems. I don't think this is a bigger issue; pinning the ninja version would be an easy fix.

@hanlint
Contributor

hanlint commented Sep 16, 2022

OK thanks!

@hanlint hanlint closed this as completed Sep 16, 2022