
Evaluation loop fails both with and without deepspeed #1472

Closed
ananyahjha93 opened this issue Aug 27, 2022 · 13 comments
@ananyahjha93
Contributor

**Environment**

  • OS: Ubuntu 20.04
  • Hardware (GPU, or instance type): 8xA100
  • cuda: 11.3
  • cudnn: 8
  • pytorch: 1.12.1
  • composer: dev branch installed from source
  • deepspeed: 0.7.2
  • transformers: 4.21.2

**To reproduce**

Steps to reproduce the behavior:

  1. Use C4Dataset to train HF BLOOM on multiple GPUs, with or without deepspeed (a rough sketch of the setup is below).
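
Roughly, the training script looks like this (a minimal sketch, not my exact code: build_c4_dataloader is a placeholder helper, the BLOOM checkpoint name is arbitrary, and the Trainer kwargs reflect the current Composer Trainer signature):

import transformers
from composer import Trainer
from composer.models import HuggingFaceModel

# Placeholder: build per-device train/eval dataloaders over C4 (not a real Composer helper).
train_dataloader = build_c4_dataloader(split="train", batch_size=64)
eval_dataloader = build_c4_dataloader(split="val", batch_size=64)

# The checkpoint name here is just a placeholder for the HF BLOOM model being trained.
hf_model = transformers.AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m")
tokenizer = transformers.AutoTokenizer.from_pretrained("bigscience/bloom-560m")
model = HuggingFaceModel(model=hf_model, tokenizer=tokenizer)

trainer = Trainer(
    model=model,
    train_dataloader=train_dataloader,
    eval_dataloader=eval_dataloader,
    max_duration="1ep",
    deepspeed_config={"zero_optimization": {"stage": 2}},  # drop this line to run without deepspeed
)
trainer.fit()

Launched across the 8 GPUs with the composer launcher, e.g. composer -n 8 train.py.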

**Expected behavior**

Eval loop should run without crashing.

**Additional context**

Error message without deepspeed:
[Screenshot: Screen Shot 2022-08-27 at 4 57 15 AM]

Error message with deepspeed:
[Screenshot: Screen Shot 2022-08-27 at 5 12 56 AM]

I'll try to debug a bit more to see what's wrong, but I'm posting it here in the meantime.

@ananyahjha93 ananyahjha93 added the bug Something isn't working label Aug 27, 2022
@hanlint
Contributor

hanlint commented Aug 27, 2022

Thanks @ananyahjha93, we did merge a major refactor this week (#1419) that may be causing some instability on dev. Was this working before on a versioned release?

@bandish-shah
Member

Thanks @ananyahjha93 for bringing this up. As @hanlint mentioned above, we’re in the process of a few major refactors, so dev could be unstable for some time. We suggest you stick to one of the official releases:
https://github.com/mosaicml/composer/releases

v0.9.0 is the most recent release, could you give that a try?

@ananyahjha93
Contributor Author

@bandish-shah I tried 0.9.0; the eval loop crashes because of an OOM error, which is weird because eval should not allocate additional memory on the GPU.

This is the GPU usage while training:
[Screenshot: Screen Shot 2022-08-29 at 9 21 24 PM]

But once the eval loop starts, the code crashes with this message:
[Screenshot: Screen Shot 2022-08-29 at 9 24 37 PM]

@hanlint
Contributor

hanlint commented Aug 30, 2022

🤔 If you are supplying the same batch_size=64 to both training and eval, this can occur, since grad_accum only applies to training, not to eval. Could you try setting a smaller batch size for eval?

Alternatively, if you are installing from dev, you can try grad_accum='auto' and it should adjust everything to fit in memory for both training and eval.
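
Concretely, something like this (a sketch, reusing the placeholder names from the repro sketch above; exact kwargs may differ slightly between releases):

from composer import Trainer

# Option 1 (works on v0.9.0): use a smaller per-device batch size for eval only.
eval_dataloader = build_c4_dataloader(split="val", batch_size=8)

# Option 2 (dev branch): let Composer pick the microbatch size automatically.
trainer = Trainer(
    model=model,
    train_dataloader=train_dataloader,
    eval_dataloader=eval_dataloader,
    max_duration="1ep",
    grad_accum="auto",  # adjusts to fit in GPU memory for both training and eval
)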

@ananyahjha93
Contributor Author

@hanlint that is most likely the obvious answer I have been overlooking.
Let me check and report back; if this was the only issue with my run, I'm sorry for the confusion.

@ananyahjha93
Contributor Author

@hanlint closing this issue. Yes, you were right: the batch size for the eval dataloader was the issue. A log-line warning about this alongside the grad_accum messaging would be nice, though!

@ananyahjha93
Contributor Author

@hanlint @bandish-shah the batch-size change mentioned above fixes 0.9.0, and eval runs fine with it. But on the dev branch, the problem of the eval dataloader process exiting randomly continues.

@bandish-shah
Member

Hi @ananyahjha93, we're still working through stability issues on dev, so it would be best to use a released version of Composer. Is there a reason you're trying to use dev? We'll keep this bug open regardless to see if a fix addresses it.

One thing that comes to mind looking at your environment information: I don't think we've tested Composer with Torch 1.12.1 + CUDA 11.3. Our latest Docker images use Torch 1.12.1 + CUDA 11.6.2, which is what we run our CI testing on:
https://hub.docker.com/layers/mosaicml/pytorch/1.12.1_cu116-python3.9-ubuntu20.04/images/sha256-f0dc580a51667970d48eb67b1407bd992959f681763b0c324d3cd21626d54e91?context=repo

It's a shot in the dark but would it be possible for you to try this configuration?

@hanlint hanlint self-assigned this Sep 5, 2022
@ananyahjha93
Contributor Author

ananyahjha93 commented Sep 5, 2022

So this seems to be a ninja version issue, because I was getting the following error on a different machine: zhanghang1989/PyTorch-Encoding#167

Using their fix, I followed these steps:

wget https://github.com/ninja-build/ninja/releases/download/v1.8.2/ninja-linux.zip
sudo unzip ninja-linux.zip -d /usr/local/bin/
sudo update-alternatives --install /usr/bin/ninja ninja /usr/local/bin/ninja 1 --force 

However, my dataloader crashed again, which made me look at the ninja version in the requirements file shared by @hanlint.

Updating ninja to 1.10.2 works:

wget https://github.com/ninja-build/ninja/releases/download/v1.10.2/ninja-linux.zip
sudo unzip ninja-linux.zip -d /usr/local/bin/
sudo update-alternatives --install /usr/bin/ninja ninja /usr/local/bin/ninja 1 --force 

However, my question is: why does installing ninja this way work, but not via pip install?
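
For debugging, here is a quick way to check which ninja binary is actually being resolved (a small diagnostic sketch; my guess is the pip-installed ninja lives somewhere else on PATH than /usr/local/bin):

import shutil
import subprocess

# Print which ninja binary is first on PATH and report its version.
ninja_path = shutil.which("ninja")
print("ninja on PATH:", ninja_path)
if ninja_path:
    result = subprocess.run([ninja_path, "--version"], capture_output=True, text=True)
    print("ninja version:", result.stdout.strip())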

@hanlint
Contributor

hanlint commented Sep 6, 2022

Yeah that's odd, my requirements were installed via pip. Are you sure that pip install --upgrade ninja==1.10.2 didn't work for you?

@ananyahjha93
Contributor Author

@hanlint no, that is the weird part.

That is why it was so difficult to find this bug last week.

@ananyahjha93
Contributor Author

@hanlint I think you can close this issue, since it was linked to a specific ninja version local to our systems. I don't think this is a bigger issue; pinning the ninja version would be an easy fix.

@hanlint
Contributor

hanlint commented Sep 16, 2022

OK thanks!

@hanlint hanlint closed this as completed Sep 16, 2022