How to modify the identifier of GPU and the number of GPU to train the model? #29
Comments
You can specify GPU ids with |
I added the statement os.environ['CUDA_VISIBLE_DEVICES'] = '0, 1' to the train_net script. When I run the train_net script for training, it reports an error: |
Oh, I see: I need to modify the IMS_PER_BATCH and IMS_PER_DEVICE parameters in the config script to change the batch size. |
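For readers following along, here is a minimal sketch of the consistency check this comment hints at. It assumes that IMS_PER_BATCH is the total batch size and IMS_PER_DEVICE is the per-GPU batch size, so that the total should equal the per-device value times the number of GPUs. That relationship is an assumption inferred from the parameter names, and `check_batch_config` is a hypothetical helper, not part of the repository.

```python
def check_batch_config(ims_per_batch, ims_per_device, num_gpus):
    """Hypothetical helper: verify the total batch size matches the GPU count.

    Assumes ims_per_batch == ims_per_device * num_gpus, which is inferred
    from the parameter names and not confirmed against the library.
    """
    expected = ims_per_device * num_gpus
    if ims_per_batch != expected:
        raise ValueError(
            f"IMS_PER_BATCH={ims_per_batch} but IMS_PER_DEVICE * num_gpus = {expected}"
        )
    return expected


# Example: 2 GPUs with 4 images each would need a total batch size of 8.
total = check_batch_config(ims_per_batch=8, ims_per_device=4, num_gpus=2)
print(total)
```

Running a check like this before launching training may make a batch-size mismatch easier to spot than a traceback from deep inside the trainer.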
When you use two GPUs, the error For changing the |
I have now modified the corresponding parameters in the config script, but running the train_net script still reports an error: |
Traceback (most recent call last): |
Could you provide more details about your command for training? |
I am using the train_net script under the tools folder for training. I adjusted some parameters in the config script, including IMS_PER_BATCH, IMS_PER_DEVICE, WARMUP_FACTOR and WARMUP_ITERS, and added an extra statement to the train_net script: os.environ['CUDA_VISIBLE_DEVICES'] = '0, 1'. |
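As an illustration of the os.environ approach described here, the sketch below sets CUDA_VISIBLE_DEVICES from a list of GPU ids. `select_gpus` is a hypothetical helper, not something from the repository. The variable has to be set before CUDA is initialized in the process (in practice, before the first CUDA call made through the deep learning framework), and the selected GPUs are then renumbered from 0 inside the process.

```python
import os


def select_gpus(gpu_ids):
    """Hypothetical helper: expose only the given physical GPU ids to this process."""
    # Join without spaces, e.g. [1, 2] -> "1,2".
    value = ",".join(str(i) for i in gpu_ids)
    # Must run before CUDA is initialized, otherwise it has no effect.
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value


# Example: skip GPU 0 (busy) and train on physical GPUs 1 and 2.
# Inside this process they will appear as devices 0 and 1.
visible = select_gpus([1, 2])
print(visible)
```

With two GPUs exposed this way, the GPU count passed to the training script would still need to be 2, since the environment variable only controls which devices are visible, not how many workers are launched.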
You need to add |
Now there is a new error with the 'dist URL' parameter: Sigh... your code is really too hard to run... |
Why not just follow the steps in README. It should work well. |
Using the method in the README to train, I can only change the number of GPUs; it does not let me choose which GPU ids to train on at all. |
It can. I gave an example above.
|
OK, I see. With 2 GPUs for training, it still reports an error: The number of GPUs required by your code is too large. My team only has 4 GPUs per machine, so I don't think I can train it... sigh... |
I am using 4 GPUs for training in the way you suggested, like this: But it still reports an error: How can I solve it? Thanks! |
Many reasons can produce this error. You can refer to this solution and have a try. |
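OK, I'll try to see if I can work it out. Thanks! |
This code is too hard to run. |
Yes, it is very hard to run. It is implemented on top of the cvpods library, so you need to install and compile that library, and then compile again inside the source code. It also needs at least four GPUs to run, which is very demanding on hardware... I previously tried running it on four 2080 Ti cards and it still reported an error, namely the error above. It is just too painful, and I no longer want to train this code. The encoder part of the paper is actually worth studying, but I can't be bothered to spend time on the rest... I still have my own experiments to run, sigh... |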
Hello, I want to use the 'train_net' script under the tools folder to train the yolof-res101-dc5-1x version of the network. Because the first GPU on my group's server is occupied by others, I want to train on other GPUs, but I did not find a statement for changing the GPU id in the 'setup' script. So I set the num_gpu, num_machines and machines_rank parameters all to 1, but training still runs on GPU 0. How can I solve this?
Thanks!