Wikikgv2 Model Training is Hanging #11

Closed
HarryShomer opened this issue Oct 4, 2022 · 15 comments

@HarryShomer commented Oct 4, 2022

Hi,

I'm having an issue where the model gets stuck during training, typically early in the first epoch. Below is an example from running smore/training/vec_scripts/train_shallow_wikikgv2.sh (unmodified except for the GPU settings) on 4 NVIDIA RTX A6000 (48GB) GPUs.

[screenshot: model_stuck]

It hangs forever unless I stop it with a keyboard interrupt. Doing so yields the following traceback (I've only posted a portion because it's very long and repetitive).

[screenshot: model_traceback]

It seems like something is going wrong in the multiprocessing, as it hangs while messages are being passed between processes.
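
In case it helps with reproducing, one way to see exactly where each worker is blocked is to attach to it with py-spy (my own tooling choice, not something from the repo; the PID below is a placeholder):

```bash
# Attach to one of the hung training processes (non-intrusive; the process
# keeps running) and print its current Python stack.
# Replace 12345 with a worker PID from `ps` or `nvidia-smi`.
py-spy dump --pid 12345
```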

Any help would be appreciated! @hyren

Thanks,
Harry

@hyren (Collaborator) commented Oct 9, 2022

Hi, this does not seem to be related to the smore repo. Here are some instructions I found; can you please check them?
https://www.rdmamojo.com/2012/05/18/libibverbs/

@HarryShomer (Author)

That doesn't seem to be the issue.

I tried setting up libibverbs, and while the warnings went away, the code is still hanging.

[screenshot: smore_hang]

@hyren (Collaborator) commented Oct 11, 2022

Hi, can you share more details about your environment?

@HarryShomer (Author)

Sure. Here's some basic info. Let me know if you would like anything else.

OS: Ubuntu 20.04.5 LTS
CUDA: 11.6.124
GPU(s): NVIDIA RTX A6000
Python: 3.9.12
PyTorch: 1.12.1
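
If a fuller report would help, I can also paste the output of PyTorch's built-in environment collector:

```bash
# Prints the OS, CUDA/cuDNN, GPU, and PyTorch build details in one report.
python -m torch.utils.collect_env
```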

@Hanjun-Dai (Collaborator)

Hi there, sorry for the inconvenience. Could you please quickly try with only 1 GPU and see if it still hangs? We are trying to figure out whether it is due to GPU-GPU communication or the C++ sampler we use.
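
For example (assuming the script simply picks up whichever GPUs are visible; if it takes an explicit device list, trim that inside the script instead):

```bash
# Restrict the run to a single GPU without touching the rest of the script.
CUDA_VISIBLE_DEVICES=0 bash smore/training/vec_scripts/train_shallow_wikikgv2.sh
```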

@HarryShomer (Author)

Sorry, I should have been clearer in my initial comment. The code doesn't hang when running with just one GPU. This only occurs when using multiple GPUs.

@Juanhui28

We tried one GPU again, and actually it still hangs. Sorry for the confusion.

@Hanjun-Dai (Collaborator)

Hi there,

Could you please pull the latest code on the wikikgv2 branch, add --train_async_rw=False to your script, and try again? Since I'm unable to reproduce your issue, I'd like to see whether this temporarily resolves it.
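
To be concrete, the change would look roughly like this: append the flag to the training command that train_shallow_wikikgv2.sh already runs. In the sketch below, $TRAIN_CMD is only a stand-in for that existing command:

```bash
# $TRAIN_CMD stands in for the python training invocation already in
# smore/training/vec_scripts/train_shallow_wikikgv2.sh; only the flag is new.
$TRAIN_CMD --train_async_rw=False
```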

@Juanhui28

Hi,

We tried your suggestion, and now training doesn't hang! But we ran into a similar issue during evaluation. We also noticed your suggestion in another issue and followed it, but evaluation still hangs.

I really appreciate your help!

@hyren (Collaborator) commented Oct 12, 2022

Hi, what script did you use?

@Juanhui28

Hi, train_shallow_wikikgv2.sh in the training/vec_scripts folder.

@hyren (Collaborator) commented Oct 12, 2022

Just to make sure: you used the script train_shallow_wikikgv2.sh and added the --train_async_rw=False flag?

@Juanhui28 commented Oct 12, 2022

Actually no, because when we add it we get an unrecognized-arguments error.

@hyren (Collaborator) commented Oct 12, 2022

Have you pulled the most recent commits?

@Juanhui28

Sorry, we made a mistake when we merged the code. We've now pulled the latest code, and it works for both training and evaluation!

Thank you so much!

@hyren closed this as completed Oct 13, 2022