
Thanks for your work. I wonder what hardware is suitable for training the model, specifically what type of GPU and how much GPU memory. I trained a few times on a Quadro 6000, but the log reported a SIGKILL in the second epoch. #7

Closed
nievuelo opened this issue Feb 23, 2023 · 5 comments

Comments

@nievuelo

nievuelo commented Feb 23, 2023

My GPU is a Quadro 6000 with 24 GB; my main CPU memory is 64 GB.

@nievuelo
Author

I mean, the first epoch seems to work well, but in the second epoch, around the middle of the epoch (like batch 4660/5033), my process is killed and PyCharm reports: "interrupted by signal 9: SIGKILL". I checked my Linux log and it said the CPU ran out of memory. But my machine has 64 GB of main memory, which I thought would be enough. I don't know what causes this kind of error.
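[Editor's note: a SIGKILL plus an out-of-memory message in the system log usually means the Linux OOM killer terminated the process. A minimal, hypothetical way to narrow down which batch exhausts RAM is to log the process's peak resident set size inside the training loop; `peak_rss_mb` below is an illustrative helper using Python's stdlib `resource` module, not code from this repo.]

```python
import resource

def peak_rss_mb():
    """Peak resident set size of this process in MB.

    On Linux, ru_maxrss is reported in KiB (on macOS it is in bytes,
    so the conversion below would need adjusting there).
    """
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024

# Hypothetical instrumentation of a training loop: print memory after
# each batch so the point where RAM balloons is visible in the log
# before the OOM killer fires.
for epoch in range(2):
    for batch_idx in range(3):  # placeholder for the real dataloader
        # ... training step would go here ...
        print(f"epoch {epoch} batch {batch_idx}: peak RSS {peak_rss_mb():.0f} MB")
```

If the logged value climbs steadily across batches rather than plateauing after the first epoch, something (often cached reports or accumulated tensors) is not being released.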

@Nicozwy
Owner

Nicozwy commented Feb 25, 2023

I think the CPU memory is too small, because we have to load many reports.
Have you tried loading a smaller portion of the reports for each claim?
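[Editor's note: a minimal sketch of the suggestion above, assuming the retrieved reports are held per claim before loading. The names `reports_by_claim` and `report_each_claim` are hypothetical, chosen to mirror the `report_each_claim` parameter discussed later in this thread.]

```python
def cap_reports(reports_by_claim, report_each_claim=6):
    """Keep only the first `report_each_claim` reports for each claim,
    bounding the memory needed to hold the evidence in RAM."""
    return {claim: reports[:report_each_claim]
            for claim, reports in reports_by_claim.items()}

# Toy data: one claim with 30 retrieved reports, one with 10.
data = {"claim-1": list(range(30)), "claim-2": list(range(10))}
capped = cap_reports(data, report_each_claim=6)
print({claim: len(reports) for claim, reports in capped.items()})
# → {'claim-1': 6, 'claim-2': 6}
```

Memory use then scales with `num_claims * report_each_claim` rather than with the full number of retrieved reports.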

@nievuelo
Author

Thanks a lot for your generous help. I will star this project.
But I mean it's not the storage capacity 199g, it's main memory capacity 64g. So I revised the eval_exp_fc5 from report each claim=12->report_each_claim=6, and change the report _each_claim from 30 to 15 in train_exp_fc5_liar_raw2. But it seems still throw kill sig problem. But the difference is before I change the parameter, it would crash during training time but after I revised it, it would crash during evaluation time. But I just had changed the parameter of report_each_claim in evaluate_model func from eval_exp_fc5. So could you help me to figure it out?

@nievuelo
Author

Or in other words, how much memory is required?

@Nicozwy
Owner

Nicozwy commented Feb 25, 2023

Hi, @nievuelo. I cannot figure it out from the limited information, but you could try running this code on another machine with a 3090 GPU, because we have tested it successfully on that hardware.
