Issues running AI Benchmark.. #40
Fails similarly on Vega:
EDIT: on WSL2 it fails even earlier on Vega:
Hi @oscarbg , this is something that we are actively looking into. As you noticed, tensorflow-directml's memory usage is very high at the moment, which is a problem when training with many batches. We will update this issue once we release a package that addresses these crashes.
Thanks @PatriceVignola!
Hey @oscarbg , we just released tensorflow-directml 1.15.3.dev200911 with many improvements to the memory allocator. You can try it out and tell us how it goes! Also, since we have now open-sourced our fork, new tensorflow-directml issues should be opened over here.
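(For anyone following along, a minimal sketch of picking up that dev build and confirming the device is visible; the exact DirectML device naming in the device list is an assumption, not something stated in this thread.)

```python
# Sketch (not from the thread): install the dev build named above and
# confirm TensorFlow can see the accelerator.
#   pip install --upgrade tensorflow-directml==1.15.3.dev200911
import tensorflow as tf
from tensorflow.python.client import device_lib

print(tf.__version__)
# The DirectML device should show up in this list; the exact device_type
# naming ("DML" vs "GPU") depends on the tensorflow-directml build.
print(device_lib.list_local_devices())
```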
Hi @PatriceVignola, on Titan V DirectML I get:
on CUDA I got:
so basically a 2x-3x performance loss using DirectML vs CUDA right now. Posting the full benchmark on Titan V on 460.15 drivers:
How do I run a single model using ai-benchmark?
I don't think it's possible without modifying the AIBenchmark scripts. You could (after pip-installing the package, for example) modify the loop in …
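(As a point of reference, a minimal sketch of driving the benchmark from Python with the package's documented entry points; `run_inference()`/`run_training()` are the coarsest subsets exposed, and selecting a single model still means editing the package's internal test loop, as noted above.)

```python
# Sketch using the ai-benchmark package's documented entry points.
# Per-model selection is not exposed here; that requires modifying the
# loop inside the package itself.
from ai_benchmark import AIBenchmark

benchmark = AIBenchmark()

# Full suite (all 19 tests, inference + training):
# results = benchmark.run()

# Coarser subsets available without modifying the package:
results = benchmark.run_inference()
# results = benchmark.run_training()
```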
My benchmarking fails after the 8th test...
what should I do?
Hi,
seeing my last issue (microsoft/DirectML#16) being closed,
I just updated to the latest (200626) tensorflow-directml to test on "native" Windows (tensorflow_directml-1.15.3.dev200626-cp37-cp37m-win_amd64).
I'm on an NV Titan V with the 451.48 driver.
Now the "1/19. MobileNet-V2" training step runs without issues,
so my last issue is solved,
but the benchmark still fails to run to completion; it now faults on the "2/19. Inception-V3" training step.
I think it may be a GPU memory allocation issue, as the Task Manager GPU tab shows "dedicated GPU memory" almost full before the training step (11.8/12 GB allocated).
It seems the DirectML backend may not be optimized for GPU memory usage, since I can run this benchmark on the CUDA backend without issues,
or maybe either AI Benchmark or the DirectML backend is not freeing GPU memory buffers between benchmark steps.
I hope we can end up running the full AI Benchmark on DirectML without issues.
Later I'll ask about better training performance, since:
1.2 - training | batch=50, size=224x224: 9138 ± 137 ms
seems too slow for a Titan V; at least on CUDA this is much faster.
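(A workaround sometimes tried for TF 1.x memory pressure is requesting on-demand GPU memory growth instead of an up-front reservation. Whether the DirectML backend honors these GPUOptions, and whether AI Benchmark's own sessions can be pointed at such a config without editing its scripts, is not confirmed in this thread, so treat this as a sketch only.)

```python
# Sketch (assumption, not confirmed for the DirectML backend): ask TF 1.15
# to grow device memory on demand rather than reserving most of it up front.
import tensorflow as tf

config = tf.ConfigProto()
config.gpu_options.allow_growth = True
# Optionally cap the fraction of device memory TF may claim:
# config.gpu_options.per_process_gpu_memory_fraction = 0.7

with tf.Session(config=config) as sess:
    # Build and run a model with this session. AI Benchmark creates its own
    # sessions internally, so this illustrates the config rather than being
    # a drop-in fix for the benchmark.
    pass
```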