
Performance on GPUs #3

Closed
yihong-chen opened this issue Aug 22, 2016 · 9 comments

@yihong-chen

Hi,
I am trying to run CharCNN on 4 GeForce GTX 1080 GPUs. I am struggling with two problems.

  1. The 4 GPUs seem to have an unbalanced load. When I run nvidia-smi, the result is shown below:
     [screenshot: nvidia-smi output]
     What can I do to make full use of all the GPUs?
  2. I have been running CharCNN for several days, but the accuracy stays around 0.3 without improving. I just ran python training.py without any change to your code. The current status is shown below:
     [screenshot: training status output]
     Do I need to change some parameters or the optimizer? Have you managed to run CharCNN with better performance?

Thank you very much!

@yihong-chen
Author

I found that the problem can be solved by changing the optimizer from AdamOptimizer to MomentumOptimizer.

@mhjabreel
Owner

Hi @LaceyChen17 ,

Thank you for your comments. I only have a PC with a single GPU (an NVIDIA GeForce 720M), so I was not able to try the code on multiple GPUs. Regarding the convergence, I only ran the code for a few epochs, as my GPU is old and its capabilities are limited, but you are right: I realized that using the MomentumOptimizer can solve that issue.

Thank you very much

@zhang-jinyi

zhang-jinyi commented Nov 17, 2016

Hi @LaceyChen17 @mhjabreel

I changed the optimizer from AdamOptimizer to MomentumOptimizer.
Just the line below:
optimizer = tf.train.MomentumOptimizer(learning_rate, config.training.momentum)
and, of course,
momentum = 0.9 in config.py
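
For context, here is a minimal, self-contained sketch of how that optimizer change could be wired into a TF 1.x training op. The learning_rate, weights, and loss below are toy placeholders, not the repo's actual variables:

import tensorflow as tf

# Toy stand-ins for the repo's graph; only the optimizer wiring matters here.
learning_rate = 0.01                       # assumed base learning rate
momentum = 0.9                             # matches momentum = 0.9 in config.py
weights = tf.Variable([1.0, 2.0])          # placeholder parameters
loss = tf.reduce_mean(tf.square(weights))  # placeholder loss tensor
global_step = tf.Variable(0, trainable=False, name="global_step")

optimizer = tf.train.MomentumOptimizer(learning_rate, momentum)
train_op = optimizer.minimize(loss, global_step=global_step)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(10):
        _, step, current_loss = sess.run([train_op, global_step, loss])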

Finally, I ran python training.py for about 3000 steps, but the accuracy remains under 0.30 without any improvement.

Could you give me more details on how to deal with that?

@yihong-chen
Author

Maybe you just need to run more steps. I ran it on 2 NVIDIA Tesla M40s and the accuracy remained under 0.4 for almost 4 hours. The training accuracy vs. time curve is shown below:
[screenshot: training accuracy vs. time curve]

@zhang-jinyi

@LaceyChen17
I'll try that. Thank you for your patience.

@yihong-chen
Author

@renzhe0009 You are welcome ^_^

@jeremied3

Only after 33,000 steps (~2 days) did I see the validation accuracy climb from ~30% to ~70%, and eventually to 88%.

@ydzhang12345

Simply change the base learning rate to 1e-3 and use Adam; you will see the accuracy climb to 80% in fewer than 1000 steps.
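
For reference, a minimal sketch of that variant under the same assumptions as the earlier snippet (toy placeholder variables and loss, TF 1.x API):

import tensorflow as tf

# Only the switch to AdamOptimizer with a 1e-3 base rate is the point here.
learning_rate = 1e-3
weights = tf.Variable([1.0, 2.0])          # placeholder parameters
loss = tf.reduce_mean(tf.square(weights))  # placeholder loss tensor
global_step = tf.Variable(0, trainable=False, name="global_step")

optimizer = tf.train.AdamOptimizer(learning_rate)
train_op = optimizer.minimize(loss, global_step=global_step)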

@ayrtondenner

I just tried changing the base rate while keeping AdamOptimizer, but I didn't get better results; my network couldn't exceed 30% accuracy on the test data.

2018-06-25T17:01:41.139571: step 56010, loss 1.38633, acc 0.226562
2018-06-25T17:01:43.331433: step 56020, loss 1.38645, acc 0.179688
2018-06-25T17:01:45.562337: step 56030, loss 1.38628, acc 0.28125
2018-06-25T17:01:47.747149: step 56040, loss 1.38629, acc 0.265625
2018-06-25T17:01:49.989139: step 56050, loss 1.3863, acc 0.226562
2018-06-25T17:01:52.222080: step 56060, loss 1.38624, acc 0.289062
2018-06-25T17:01:54.453016: step 56070, loss 1.38631, acc 0.265625
2018-06-25T17:01:56.656879: step 56080, loss 1.3863, acc 0.289062
2018-06-25T17:01:58.912880: step 56090, loss 1.38638, acc 0.171875
2018-06-25T17:02:01.111723: step 56100, loss 1.38629, acc 0.273438

Evaluation:
2018-06-25T17:02:01.192915: step 56100, loss 1.38627, acc 0.226562
