error in training #40

Closed
nikky4D opened this issue Feb 10, 2022 · 4 comments

@nikky4D

nikky4D commented Feb 10, 2022

Hi, I encountered this error during training and I'm not sure what it means:

2022-02-09,21:22:00 | INFO | Rank 0 | Train Epoch: 9 [28800/43670 (66%)]        Loss: 0.493029  Data (t) 0.000  Batch (t) 0.235 LR: 0.000020    logit_scale 2.821
2022-02-09,21:22:24 | INFO | Rank 0 | Train Epoch: 9 [32000/43670 (73%)]        Loss: 0.642597  Data (t) 0.008  Batch (t) 0.274 LR: 0.000012    logit_scale 2.822
2022-02-09,21:22:48 | INFO | Rank 0 | Train Epoch: 9 [35200/43670 (81%)]        Loss: 0.442177  Data (t) 0.002  Batch (t) 0.243 LR: 0.000006    logit_scale 2.822
2022-02-09,21:23:13 | INFO | Rank 0 | Train Epoch: 9 [38400/43670 (88%)]        Loss: 0.435208  Data (t) 0.000  Batch (t) 0.255 LR: 0.000003    logit_scale 2.823
2022-02-09,21:23:37 | INFO | Rank 0 | Train Epoch: 9 [41600/43670 (95%)]        Loss: 0.295687  Data (t) 0.000  Batch (t) 0.240 LR: 0.000000    logit_scale 2.823
2022-02-09,21:24:36 | INFO | Rank 0 | Eval Epoch: 10 image_to_text_mean_rank: 40.2243   image_to_text_median_rank: 22.0000      image_to_text_R@1: 0.0628       image_to_text_R@5: 0.2063       image_to_text_R@10: 0.3273      text_to_image_mean_rank: 44.4849     text_to_image_median_rank: 25.0000      text_to_image_R@1: 0.0477       text_to_image_R@5: 0.1817       text_to_image_R@10: 0.2948      val_loss: 0.3798        epoch: 10.0000  num_elements: 6432.0000
Exception in thread Thread-5:
Traceback (most recent call last):
  File "C:\Users\nuzuegbunam\Anaconda3\envs\open_clip_3_9\lib\multiprocessing\connection.py", line 317, in _recv_bytes

Does anyone have any idea what this means?

@gabrielilharco
Collaborator

Hi @nikky4D, thanks for opening up this issue. Is this the full stack trace? I haven't encountered anything like this before, so I'm wondering whether this could potentially be a mismatch in python / pytorch / cuda versions. Do your versions match those in environment.yml?

Also, could you provide more details on what hardware you are using?
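A minimal sketch for collecting the version details asked about above, so they can be compared by hand against environment.yml. The helper name `collect_versions` is hypothetical and not part of this codebase; the PyTorch fields are filled in only if PyTorch is importable.

```python
import platform
import sys


def collect_versions():
    """Return interpreter, OS, and (if installed) PyTorch/CUDA versions."""
    info = {
        "python": platform.python_version(),
        "os": platform.system(),
    }
    try:
        import torch  # optional: only reported when PyTorch is installed
    except ImportError:
        info["torch"] = None
        info["cuda"] = None
    else:
        info["torch"] = torch.__version__
        # CUDA version PyTorch was built against; None on CPU-only builds
        info["cuda"] = torch.version.cuda
    return info


if __name__ == "__main__":
    for name, value in collect_versions().items():
        print(f"{name}: {value}")
```

Pasting this output into the issue alongside the GPU model gives maintainers the same information in one place.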

@nikky4D
Author

nikky4D commented Feb 10, 2022

For the stack trace: That is all I was given. I was running 10 epochs so this is at the final epoch.
For python/pytorch/cuda: I'm using Python 3.9.10 with PyTorch 1.10.2 and CUDA 11.3.
For environment.yml: My versions don't match those in environment.yml, as I'm using a Windows machine and couldn't install many of the packages it lists in my conda environment.

@gabrielilharco
Collaborator

I see, that could be the issue here. Our codebase was only tested with the versions in environment.yml and on Unix machines.

@nikky4D
Author

nikky4D commented Feb 11, 2022

You may be correct. I'm using 4 workers with a batch size of 32 on a 2080 Ti. Everything runs well until the last epoch, where, after calculating the evaluation metrics, I get that error. It appears to affect only the saving of the checkpoint for the epoch directly preceding the error, as I can't load that checkpoint. I can load the other checkpoints.

Anyway, I'll close this until I find a solution that would work.
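
For anyone hitting the same symptom, here is a minimal sketch for checking whether a saved checkpoint file can be read back at all. The helper `checkpoint_loads` and its `loader` parameter are hypothetical and not part of this codebase. By default it assumes the file was written with `torch.save` and reads it with `torch.load` on the CPU; it only reports whether loading raises, and does not diagnose why.

```python
def checkpoint_loads(path, loader=None):
    """Return (True, None) if `path` loads, else (False, error description)."""
    if loader is None:
        # Assumption: the checkpoint was written with torch.save.
        import torch

        def loader(p):
            return torch.load(p, map_location="cpu")

    try:
        loader(path)
    except Exception as exc:  # a truncated or corrupt file raises here
        return False, repr(exc)
    return True, None


if __name__ == "__main__":
    import sys

    # Usage: python check_ckpt.py <checkpoint path> [<checkpoint path> ...]
    for ckpt in sys.argv[1:]:
        ok, err = checkpoint_loads(ckpt)
        print(f"{ckpt}: {'ok' if ok else 'FAILED ' + err}")
```

Running it over every checkpoint from the run shows which files are unreadable, which helps confirm that only the last one is affected.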

@nikky4D nikky4D closed this as completed Feb 11, 2022
rom1504 added a commit that referenced this issue Nov 23, 2022
* Added support for Multilingual Dataset Wrapper and Multilingual MSCoco

* Removed temp file

* Delete model_loader.py

* Added default value to model_cache_dir params

* Added model_cache_dir option to test

* Delete multilingual_mscoco-old.py

* Converted Multilingual MS-COCO into own dataset

* Made Multilingual COCO independent from wrapper

* Delete multilingual_dataset.py

* Fixed broken import

Co-authored-by: Romain Beaumont <romain.rom1@gmail.com>