
Memory Leak problem #2

Closed
liuzili97 opened this issue Apr 19, 2022 · 13 comments

@liuzili97

liuzili97 commented Apr 19, 2022

Thanks for your work! I am getting a memory leak while training.

I strictly followed the installation instructions in docs/INSTALL.md and trained the model with the following command:
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node 4 main_persformer.py --mod=persformer --batch_size=2 --nepochs=40

However, the memory consumption (system memory, not CUDA memory) gradually increases during training and eventually exhausts all of it.

Does this problem occur for you?

@ChonghaoSima
Collaborator

Hello, just trying to reproduce your problem: how much memory does your machine have, and after how many epochs does it happen?

@liuzili97
Author

liuzili97 commented Apr 19, 2022

Thanks for your reply. The machine has 256 GB of memory, and the problem happens within the first epoch.

In htop I can see the memory usage gradually increase from around 20 GB to 200+ GB, after which the process crashes. This takes about 10~20 minutes from the start of training.
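For reference, a rough way to log this growth outside of htop (sketch only; it simply greps the resident memory, RSS in KB, of any process whose command line contains main_persformer, once a minute):

# sketch: print RSS (in KB) of the training processes once a minute
while true; do
    date
    ps -eo pid,rss,cmd --sort=-rss | grep main_persformer | grep -v grep
    sleep 60
done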

@ChonghaoSima
Collaborator

ChonghaoSima commented Apr 19, 2022

Could you provide more information about your machine, such as the PyTorch version, CUDA version, Python version, etc.? We didn't run into a memory leak when training on 4x 3090s with 128 GB of memory.

@liuzili97
Author

My Python, CUDA, and PyTorch versions are 3.8.13, 11.1, and 1.8.1, respectively.

I will train the model on other machines later and see if the problem still exists.

@liuzili97
Author

It seems the same problem also occurs on other machines with Python 3.6.13.

@dyfcalid
Collaborator

So far everything works fine in our tests; we'll try other machines later to see if we can reproduce your problem.

@dyfcalid
Collaborator

Maybe you can pull the latest code and try again to see if the problem still exists.

@liuzili97
Author

> Maybe you can pull the latest code and try again to see if the problem still exists.

Thanks, I pulled the latest code, but the problem still exists.

Maybe it is caused by some unexpected environment problem. I will close this issue for now; if anyone else encounters it in the future, we can re-open it.

@nickle-fang

@liuzili97 I have run into the same problem, and my environment settings are the same as yours. Have you solved it?

@liuzili97
Author

> @liuzili97 I have run into the same problem, and my environment settings are the same as yours. Have you solved it?

No, I haven't.

@asadnorouzi

I also have the same issue! It consumes almost 99% of my system memory and crashes even before training starts (right after loading the dataset). I reported it in a separate issue: #33

@casialixiaodong

> Could you provide more information about your machine, such as the PyTorch version, CUDA version, Python version, etc.? We didn't run into a memory leak when training on 4x 3090s with 128 GB of memory.

Thanks for your great work. Could you tell me the gcc --version of the environment on your 4x 3090 machine? My server is 8x 3090 + CUDA 11.1 + PyTorch 1.8.0 + gcc 10.3.0 (Ubuntu 10.3.0-1ubuntu1~20.10), but I can't get past the step in INSTALL.md where I run cd models/nms/ followed by python setup.py install.
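For anyone else stuck on the same build step, one thing worth trying is pointing the extension build at an older host compiler such as gcc-9. This is just a sketch, assuming gcc-9/g++-9 are installed, and is not something confirmed by the authors:

# sketch only: build the nms extension with an explicitly chosen compiler
cd models/nms/
CC=gcc-9 CXX=g++-9 python setup.py install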

@ChonghaoSima
Collaborator

ChonghaoSima commented Aug 3, 2022 via email
