
Memory Leak problem #2

Closed
liuzili97 opened this issue Apr 19, 2022 · 13 comments

@liuzili97

liuzili97 commented Apr 19, 2022

Thanks for your work! I am getting a memory leak while training.

I strictly followed the installation instructions in docs/INSTALL.md and trained the model with the following command:
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node 4 main_persformer.py --mod=persformer --batch_size=2 --nepochs=40

However, the memory consumption (system memory, not CUDA memory) gradually increases during training and eventually exhausts all of it.

Does this problem occur for you?

@ChonghaoSima
Collaborator

Hello, just trying to reproduce your problem: how much memory does your machine have, and after how many epochs does it happen?

@liuzili97
Author

liuzili97 commented Apr 19, 2022

Thanks for your reply. The machine has 256 GB of memory, and the problem happens within the first epoch.

In htop I can see the memory usage gradually increase from around 20 GB to 200+ GB, after which the process crashes. This takes about 10~20 minutes from the start of training.
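For reference, a rough way to log this growth outside of htop (sketch only; it simply greps the resident memory, RSS in KB, of any process whose command line contains main_persformer, once a minute):

# sketch: print RSS (in KB) of the training processes once a minute
while true; do
    date
    ps -eo pid,rss,cmd --sort=-rss | grep main_persformer | grep -v grep
    sleep 60
done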

@ChonghaoSima
Collaborator

ChonghaoSima commented Apr 19, 2022

Could you provide more information about your machine, such as the PyTorch version, CUDA version, Python version, etc.? We didn't run into a memory leak when training on 4x 3090s with 128 GB of memory.

@liuzili97
Author

My Python, CUDA, and PyTorch versions are 3.8.13, 11.1, and 1.8.1, respectively.

I will train the model on other machines later and see if the problem still exists.

@liuzili97
Author

It seems the same problem also occurs on other machines with Python 3.6.13.

@dyfcalid
Collaborator

So far everything works fine in our tests; we'll try other machines later to see if we can reproduce your problem.

@dyfcalid
Collaborator

Maybe you can pull the latest code and try again to see if the problem still exists.

@liuzili97
Author

> Maybe you can pull the latest code and try again to see if the problem still exists.

Thanks, I pulled the latest code, but the problem still exists.

Maybe it is caused by some unexpected environment problem. I will close this issue for now; if anyone else encounters it in the future, we can re-open it.

@nickle-fang

@liuzili97 I have run into the same problem, and my environment settings are the same as yours. Have you solved it?

@liuzili97
Author

> @liuzili97 I have run into the same problem, and my environment settings are the same as yours. Have you solved it?

No, I haven't.

@asadnorouzi

I also have the same issue! It consumes almost 99% of my system memory and crashes even before training starts (right after loading the dataset). I reported it in a separate issue: #33

@casialixiaodong

> Could you provide more information about your machine, such as the PyTorch version, CUDA version, Python version, etc.? We didn't run into a memory leak when training on 4x 3090s with 128 GB of memory.

Thanks for your great work. Could you tell me the gcc --version of the environment on your 4x 3090 machine? My server is 8x 3090 + CUDA 11.1 + PyTorch 1.8.0 + gcc 10.3.0 (Ubuntu 10.3.0-1ubuntu1~20.10), but I can't get past the step in INSTALL.md where I run cd models/nms/ followed by python setup.py install.
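For anyone else stuck on the same build step, one thing worth trying is pointing the extension build at an older host compiler such as gcc-9. This is just a sketch, assuming gcc-9/g++-9 are installed, and is not something confirmed by the authors:

# sketch only: build the nms extension with an explicitly chosen compiler
cd models/nms/
CC=gcc-9 CXX=g++-9 python setup.py install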

@ChonghaoSima
Collaborator

ChonghaoSima commented Aug 3, 2022 via email
