Searching ITM is stuck at epoch 10 #8

Closed
ghost opened this issue Feb 27, 2021 · 6 comments
ghost commented Feb 27, 2021

Running search_itm.py gets stuck at epoch 10. No error is raised and the program does not terminate by itself.
The last output is the following:

evaluate percent 45.2755905511811
evaluate percent 47.24409448818898
evaluate percent 49.21259842519685
evaluate percent 51.181102362204726
evaluate percent 53.14960629921261
evaluate percent 55.118110236220474
evaluate percent 57.08661417322835
evaluate percent 59.055118110236215
evaluate percent 61.023622047244096
evaluate percent 62.99212598425197
evaluate percent 64.96062992125984
evaluate percent 66.92913385826772
evaluate percent 68.89763779527559
evaluate percent 70.86614173228347
evaluate percent 72.83464566929135
evaluate percent 74.80314960629921
evaluate percent 76.77165354330708
evaluate percent 78.74015748031496
evaluate percent 80.70866141732283
evaluate percent 82.67716535433071
evaluate percent 84.64566929133859
evaluate percent 86.61417322834646
evaluate percent 88.58267716535433
evaluate percent 90.5511811023622
evaluate percent 92.51968503937007
evaluate percent 94.48818897637796
evaluate percent 96.45669291338582
evaluate percent 98.4251968503937
(1014, 5070)
i2t stat num: 1014
i2t results: 14.89 37.48 50.79 10.00 34.80

t2i stat num: 5070
t2i results: 12.31 36.31 51.50 10.00 29.36

reset negative captions ...
reset negative captions ...
reset negative captions ...
reset negative captions ...

And the output of nvidia-smi has looked like the following ever since the program got stuck.

Sat Feb 27 18:48:18 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64       Driver Version: 440.64       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K40c          On   | 00000000:02:00.0 Off |                    0 |
| 23%   37C    P0    63W / 235W |   5573MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K40c          On   | 00000000:03:00.0 Off |                    0 |
| 23%   43C    P0    69W / 235W |   9840MiB / 11441MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K40m          On   | 00000000:82:00.0 Off |                    0 |
| N/A   33C    P0    62W / 235W |   5573MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K40m          On   | 00000000:83:00.0 Off |                    0 |
| N/A   34C    P0    68W / 235W |   9840MiB / 11441MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     24425      C   ...naconda3/envs/py36-t101-cu90/bin/python  5562MiB |
|    1     24426      C   ...naconda3/envs/py36-t101-cu90/bin/python  9827MiB |
|    2     24427      C   ...naconda3/envs/py36-t101-cu90/bin/python  5562MiB |
|    3     24428      C   ...naconda3/envs/py36-t101-cu90/bin/python  9827MiB |
+-----------------------------------------------------------------------------+

I have noticed that epoch 10 is NEG_START_EPOCH, but I have no idea what is going wrong there.
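For reference, one way to see where each worker hangs without killing anything is to dump the Python stacks on a signal. This is not in search_itm.py; it is just a minimal sketch, assuming a few lines can be added near the start of the spawned worker function:

import faulthandler
import signal
import sys

# Dump the Python stack of every thread in this process when it receives
# SIGUSR1 (e.g. `kill -USR1 <worker pid>` from another shell). The process
# keeps running, so this is safe to use on a hung worker.
faulthandler.register(signal.SIGUSR1, file=sys.stderr, all_threads=True)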


ghost commented Feb 27, 2021

After I killed the program manually, the traceback was as follows:

^CException ignored in: <bound method _DataLoaderIter.__del__ of <torch.utils.data.dataloader._DataLoaderIter object at 0x7f87c6c81898>>
Traceback (most recent call last):
  File "/home/zhouxx/anaconda3/envs/py36-t101-cu90/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 717, in __del__
Exception ignored in: <bound method _DataLoaderIter.__del__ of <torch.utils.data.dataloader._DataLoaderIter object at 0x7f01aa828c18>>
Traceback (most recent call last):
  File "/home/zhouxx/anaconda3/envs/py36-t101-cu90/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 717, in __del__
    self._shutdown_workers()
  File "/home/zhouxx/anaconda3/envs/py36-t101-cu90/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 685, in _shutdown_workers
    self.done_event.set()
  File "/home/zhouxx/anaconda3/envs/py36-t101-cu90/lib/python3.6/multiprocessing/synchronize.py", line 346, in set
    self._shutdown_workers()
  File "/home/zhouxx/anaconda3/envs/py36-t101-cu90/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 685, in _shutdown_workers
    self.done_event.set()
  File "/home/zhouxx/anaconda3/envs/py36-t101-cu90/lib/python3.6/multiprocessing/synchronize.py", line 346, in set
    with self._cond:
  File "/home/zhouxx/anaconda3/envs/py36-t101-cu90/lib/python3.6/multiprocessing/synchronize.py", line 230, in __enter__
    return self._lock.__enter__()
  File "/home/zhouxx/anaconda3/envs/py36-t101-cu90/lib/python3.6/multiprocessing/synchronize.py", line 95, in __enter__
    with self._cond:
  File "/home/zhouxx/anaconda3/envs/py36-t101-cu90/lib/python3.6/multiprocessing/synchronize.py", line 230, in __enter__
    return self._semlock.__enter__()
KeyboardInterrupt
    return self._lock.__enter__()
  File "/home/zhouxx/anaconda3/envs/py36-t101-cu90/lib/python3.6/multiprocessing/synchronize.py", line 95, in __enter__
    return self._semlock.__enter__()
KeyboardInterrupt
Traceback (most recent call last):
  File "search_itm.py", line 721, in <module>
    join=True
  File "/home/zhouxx/anaconda3/envs/py36-t101-cu90/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 167, in spawn
    while not spawn_context.join():
  File "/home/zhouxx/anaconda3/envs/py36-t101-cu90/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 73, in join
    timeout=timeout,
  File "/home/zhouxx/anaconda3/envs/py36-t101-cu90/lib/python3.6/multiprocessing/connection.py", line 911, in wait
    ready = selector.select(timeout)
  File "/home/zhouxx/anaconda3/envs/py36-t101-cu90/lib/python3.6/selectors.py", line 376, in select
    fd_event_list = self._poll.poll(timeout)
KeyboardInterrupt
^CError in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/home/zhouxx/anaconda3/envs/py36-t101-cu90/lib/python3.6/multiprocessing/popen_fork.py", line 28, in poll
    pid, sts = os.waitpid(self.pid, flag)
KeyboardInterrupt

And the output of nvidia-smi is now the following:

Sat Feb 27 19:02:11 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64       Driver Version: 440.64       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K40c          On   | 00000000:02:00.0 Off |                    0 |
| 23%   25C    P8    20W / 235W |     12MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K40c          On   | 00000000:03:00.0 Off |                    0 |
| 23%   42C    P0    69W / 235W |   9840MiB / 11441MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K40m          On   | 00000000:82:00.0 Off |                    0 |
| N/A   24C    P8    20W / 235W |     12MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K40m          On   | 00000000:83:00.0 Off |                    0 |
| N/A   34C    P0    67W / 235W |   9840MiB / 11441MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    1     24426      C   ...naconda3/envs/py36-t101-cu90/bin/python  9827MiB |
|    3     24428      C   ...naconda3/envs/py36-t101-cu90/bin/python  9827MiB |
+-----------------------------------------------------------------------------+

And after I killed the two processes on GPUs 1 and 3 with nvidia-smi | grep 'python' | awk '{ print $3 }' | xargs -n1 kill -9, I got the following output:

(py36-t101-cu90) zhouxx@gpu79:~/gprojects/mmnas$ Process SpawnProcess-2:
Traceback (most recent call last):
  File "/home/zhouxx/anaconda3/envs/py36-t101-cu90/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/home/zhouxx/anaconda3/envs/py36-t101-cu90/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/zhouxx/anaconda3/envs/py36-t101-cu90/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 21, in _wrap
    pass  # SIGINT; Killed by parent, do nothing
KeyboardInterrupt

(py36-t101-cu90) zhouxx@gpu79:~/gprojects/mmnas$ Process SpawnProcess-4:
Traceback (most recent call last):
  File "/home/zhouxx/anaconda3/envs/py36-t101-cu90/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/home/zhouxx/anaconda3/envs/py36-t101-cu90/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/zhouxx/anaconda3/envs/py36-t101-cu90/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 21, in _wrap
    pass  # SIGINT; Killed by parent, do nothing
KeyboardInterrupt
Process SpawnProcess-3:
Traceback (most recent call last):
  File "/home/zhouxx/anaconda3/envs/py36-t101-cu90/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/home/zhouxx/anaconda3/envs/py36-t101-cu90/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/zhouxx/anaconda3/envs/py36-t101-cu90/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 21, in _wrap
    pass  # SIGINT; Killed by parent, do nothing
KeyboardInterrupt
Process SpawnProcess-1:
Traceback (most recent call last):
  File "/home/zhouxx/anaconda3/envs/py36-t101-cu90/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/home/zhouxx/anaconda3/envs/py36-t101-cu90/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/zhouxx/anaconda3/envs/py36-t101-cu90/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 21, in _wrap
    pass  # SIGINT; Killed by parent, do nothing
KeyboardInterrupt
^C
(py36-t101-cu90) zhouxx@gpu79:~/gprojects/mmnas$ ^C
(py36-t101-cu90) zhouxx@gpu79:~/gprojects/mmnas$
(py36-t101-cu90) zhouxx@gpu79:~/gprojects/mmnas$ /home/zhouxx/anaconda3/envs/py36-t101-cu90/lib/python3.6/multiprocessing/semaphore_tracker.py:143: UserWarning: semaphore_tracker: There appear to be 20 leaked semaphores to clean up at shutdown
  len(cache))
/home/zhouxx/anaconda3/envs/py36-t101-cu90/lib/python3.6/multiprocessing/semaphore_tracker.py:143: UserWarning: semaphore_tracker: There appear to be 20 leaked semaphores to clean up at shutdown
  len(cache))

It seems that one process goes wrong (maybe out of memory, but there is no hint) while the others keep waiting.
I still cannot figure it out.
I would appreciate it if anyone could help.
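For what it's worth, here is a small sketch of how the out-of-memory guess could be confirmed (this is not part of mmnas and assumes psutil is available): logging the resident memory of each process around the "reset negative captions" step should show whether the host RAM runs out there.

import os
import psutil  # assumed to be installed; not an mmnas dependency

def log_memory(tag):
    # Print this process's resident memory and the remaining system RAM.
    rss_gb = psutil.Process(os.getpid()).memory_info().rss / 1024 ** 3
    avail_gb = psutil.virtual_memory().available / 1024 ** 3
    print('[pid %d] %s: rss=%.1f GiB, available=%.1f GiB'
          % (os.getpid(), tag, rss_gb, avail_gb), flush=True)

# Hypothetical usage: call log_memory('before negative reset') and
# log_memory('after negative reset') around the code that prints
# "reset negative captions ...".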


ghost commented Feb 27, 2021

It looks very similar to facebookresearch/fairseq#708 (comment).


ghost commented Feb 27, 2021

I have just noticed the prerequisites in the README.
I think the 150 GB memory requirement for ITM may be the root cause of the problem above.

I checked the RAM size of my server.

(py36-t041-cu90) zhouxx@gpu79:~/gprojects/mmnas/logs/ckpts$ free -m -h
              total        used        free      shared  buff/cache   available
Mem:            94G         71G        844M         22G         22G        528M
Swap:           29G         29G        1.2M

Is there any trick to reduce the memory cost without reducing the batch size?


ghost commented Feb 27, 2021

I also wonder why ITM requires so much memory.
I would appreciate it if anyone could explain that.


MIL-VLG commented Mar 2, 2021

Sorry for the late reply. ITM does indeed need that much memory for a deep model like MMnas. If your memory is not sufficient, you can try reducing the hidden dimension from 512 to 256.

The reason for the large memory footprint is that we need to forward each positive sample along with its negative samples through the network, which makes ITM more memory-consuming than the other tasks.
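A minimal PyTorch sketch of what this means (illustrative shapes only, not the actual MMnas code): with K negative captions per positive pair, the encoder effectively processes a batch that is (1 + K) times larger, and the activation memory grows by the same factor.

import torch

batch, num_neg, seq_len, hidden = 32, 4, 20, 512  # illustrative numbers

# Each image keeps its positive caption plus `num_neg` negative captions.
pos_caps = torch.randn(batch, 1, seq_len, hidden)
neg_caps = torch.randn(batch, num_neg, seq_len, hidden)

# Stack them into a single forward pass: the effective batch is now
# batch * (1 + num_neg), and so is the activation memory.
all_caps = torch.cat([pos_caps, neg_caps], dim=1)
all_caps = all_caps.view(batch * (1 + num_neg), seq_len, hidden)  # (160, 20, 512)

layer = torch.nn.Linear(hidden, hidden)
feats = layer(all_caps)  # activations are held for 160 sequences instead of 32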


ghost commented Mar 4, 2021

@MIL-VLG Got it! Thanks!

ghost closed this as completed Mar 4, 2021