
Bugs? #1

Closed
WangWenhao0716 opened this issue Dec 10, 2021 · 14 comments

Comments

@WangWenhao0716

Congratulations! We really appreciate this work. When I run

python v107.py \
  -a tf_efficientnetv2_m_in21ft1k --dist-url 'tcp://localhost:10001' --multiprocessing-distributed --world-size 1 --rank 0 --seed 99999 \
  --epochs 10 --lr 0.5 --wd 1e-6 --batch-size 16 --ncrops 2 \
  --gem-p 1.0 --pos-margin 0.0 --neg-margin 1.1 --weight ./v98/train/checkpoint_0001.pth.tar \
  --input-size 512 --sample-size 1000000 --memory-size 1000 \
  ../input/training_images/

I get the following error:

Traceback (most recent call last):                                              
  File "v107.py", line 774, in <module>
    train(args)
  File "v107.py", line 425, in train
    mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, args))
  File "/home/wangwenhao/anaconda3/envs/ISC/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/wangwenhao/anaconda3/envs/ISC/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/home/wangwenhao/anaconda3/envs/ISC/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 5 terminated with the following error:
Traceback (most recent call last):
  File "/home/wangwenhao/anaconda3/envs/ISC/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/home/wangwenhao/fbisc-descriptor-1st/exp/v107.py", line 573, in main_worker
    train_one_epoch(train_loader, model, loss_fn, optimizer, scaler, epoch, args)
  File "/home/wangwenhao/fbisc-descriptor-1st/exp/v107.py", line 595, in train_one_epoch
    labels = torch.cat([torch.tile(i, dims=(args.ncrops,)), torch.tensor(j)])
ValueError: only one element tensors can be converted to Python scalars

Do you know how to fix it?
Thanks.

@lyakaap
Owner

lyakaap commented Dec 10, 2021

Congrats to you too! Your wins in both tracks are incredible :)

I haven't seen such an error before. Are the image files in the correct location? If they are, please print the i and j values and show me the output; it might be useful information for debugging.

@WangWenhao0716
Author

Thanks for your reply. In fact, I knew that might be useful for debugging and tried it yesterday, but I could not work it out by myself.

i =  tensor([1537, 1191])
j =  [tensor([1546283, 1867690]), tensor([1780914, 1504719]), tensor([1353055, 1878239]), tensor([1931255, 1205254]), tensor([1178165, 1401500]), tensor([1713147, 1749940]), tensor([1333900, 1671408]), tensor([1732070, 1593446]), tensor([1475793, 1149125]), tensor([1002561, 1548406]), tensor([1634161, 1714439]), tensor([1729160, 1631621]), tensor([1257713, 1890521]), tensor([1896319, 1713320]), tensor([1085255, 1081381]), tensor([1392220, 1799155]), tensor([1460125, 1605860]), tensor([1426539, 1045038]), tensor([1722017, 1349333]), tensor([1371985, 1360729]), tensor([1332006, 1671282]), tensor([1339213, 1493030]), tensor([1909343, 1060632]), tensor([1400760, 1459965]), tensor([1692564, 1535537]), tensor([1494376, 1822024]), tensor([1878225, 1558317]), tensor([1288187, 1682532]), tensor([1793712, 1596738]), tensor([1348662, 1096824])]

A toy example:

import torch
i = torch.Tensor([1537, 1191])
j = [torch.Tensor([1380528, 1715717]), torch.Tensor([1614647, 1619035])]
torch.cat([torch.tile(i, dims=(2,)), torch.tensor(j)])

It also results in:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/tmp/ipykernel_1097398/2617762925.py in <module>
----> 1 torch.cat([torch.tile(i, dims=(2,)), torch.tensor(j)])

ValueError: only one element tensors can be converted to Python scalars
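If it helps narrow things down: torch.tensor(j) seems to fail here because it tries to convert every tensor in the list to a Python scalar, which only works when each tensor holds a single element. Concatenating the list directly avoids that conversion; a minimal sketch against the toy example above (whether the resulting label ordering is what the loss expects, I am not sure):

import torch

i = torch.tensor([1537, 1191])
j = [torch.tensor([1380528, 1715717]), torch.tensor([1614647, 1619035])]

# torch.cat flattens the list of 1-D id tensors into a single 1-D tensor,
# so no element-to-scalar conversion happens.
labels = torch.cat([torch.tile(i, dims=(2,)), torch.cat(j)])
print(labels)  # tensor([1537, 1191, 1537, 1191, 1380528, 1715717, 1614647, 1619035])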

Looking forward to your reply.

@lyakaap
Owner

lyakaap commented Dec 12, 2021

In my case, i and j are as follows:

i:

tensor([1537]) 

j:

[tensor([1493751]), tensor([1594483]), tensor([1310616]), tensor([1566637]), tensor([1041634]), tensor([1321072]), tensor([1756446]), tensor([1876031]), tensor([1949834]), tensor([1317828]), tensor([1293972]), tensor([1700646]), tensor([1928488]), tensor([1719636]), tensor([1716178]), tensor([1565452]), tensor([1281302]), tensor([1904498]), tensor([1212152]), tensor([1821218]), tensor([1004454]), tensor([1903469]), tensor([1583914]), tensor([1809848]), tensor([1894128]), tensor([1311861]), tensor([1405172]), tensor([1122038]), tensor([1628859]), tensor([1761828])]

It is strange that the tensors have two elements in your case.
Please double-check your data location and your PyTorch version.

@WangWenhao0716
Author

It is strange that the tensors have two elements in your case. Please double-check your data location and your PyTorch version.

Thanks, that is interesting. I will double-check all the related files and get back to you.

@WangWenhao0716
Author

Hi, I have double-checked the PyTorch version:

>>> import torch
>>> torch.__version__
'1.9.0+cu111'

And the data directory:

input
  query_images
  reference_images
  training_images
  public_ground_truth.csv
exp
...

However, the problem still exists 😭😭😭.
Please make sure you are running v107.py rather than the other scripts (the others work fine for me).
It is very strange.
Or could we have a real-time meeting (e.g. Zoom) to reproduce the bug together?
Thanks a lot!

@WangWenhao0716
Author

I plan to do future work on this topic, so your method is crucial to my research as a benchmark. Thanks!

@lyakaap
Copy link
Owner

lyakaap commented Dec 14, 2021

That's truly strange.
Okay, let's arrange a real-time meeting via email: bepemgdlp@gmail.com

@WangWenhao0716
Author

That's truly strange. Okay, let's arrange a real-time meeting via email: bepemgdlp@gmail.com

I'm free any time today. Could you arrange a Zoom meeting at a time convenient for you? Zoom does not allow users in China to host meetings. Thanks!

@WangWenhao0716
Author

My email is wangwenhao0716@gmail.com

@lyakaap
Owner

lyakaap commented Dec 15, 2021

I noticed that my code in v107.py doesn't handle the case of using fewer than 16 GPUs.
I will fix it and commit.
Thanks @WangWenhao0716
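For reference, a rough sketch of the direction such a fix could take (not necessarily the exact committed change), assuming i holds the anchor ids of the per-GPU mini-batch and j is a list of 1-D id tensors, which is my reading of the values printed above:

import torch

def build_labels(i, j, ncrops):
    # With 16 GPUs each tensor in j holds a single id, so torch.tensor(j) works;
    # with fewer GPUs the per-GPU batch grows and each tensor holds several ids.
    # torch.cat(j) covers both cases without converting elements to Python scalars.
    return torch.cat([torch.tile(i, dims=(ncrops,)), torch.cat(j)])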

@WangWenhao0716
Author

Thanks for your reply and all your contributions.

@lyakaap
Owner

lyakaap commented Dec 16, 2021

Fixed it. Closing this issue.

@lyakaap lyakaap closed this as completed Dec 16, 2021
@WangWenhao0716
Author

All the other parts work well. Thanks again for your work. By the way, faiss works well on the A100 (faiss 1.7.1 with CUDA 11.1).
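In case it is useful to others, a minimal check along these lines (not necessarily the exact code used) confirms the GPU path:

import faiss
import numpy as np

d = 256
xb = np.random.rand(10000, d).astype('float32')  # database vectors
xq = np.random.rand(5, d).astype('float32')      # query vectors

res = faiss.StandardGpuResources()                    # allocate GPU resources
index = faiss.index_cpu_to_gpu(res, 0, faiss.IndexFlatL2(d))
index.add(xb)
distances, ids = index.search(xq, 5)                  # runs on the A100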

@lyakaap
Owner

lyakaap commented Dec 20, 2021

Thanks for reporting! Will take a look.
