
Issues running AI Benchmark.. #40

Closed · oscarbg opened this issue Jun 28, 2020 · 8 comments

oscarbg commented Jun 28, 2020

Hi,
seeing my last issue closed (microsoft/DirectML#16), I just updated to the latest (200626) tensorflow-directml to test on "native" Windows (tensorflow_directml-1.15.3.dev200626-cp37-cp37m-win_amd64).
I'm on an NV Titan V with the 451.48 driver.
Now the 1/19 MobileNet-V2 training step runs without issues, so my last issue is solved.
But the benchmark still fails to complete: it now faults on the 2/19 Inception-V3 training step.
I suspect a GPU memory allocation issue, since the Task Manager GPU tab shows "dedicated GPU memory" almost full just before the training step (11.8/12 GB allocated).
The DirectML backend may not be optimized for GPU memory usage, since I can run this benchmark on the CUDA backend without issues; or maybe either AI Benchmark or the DirectML backend is not freeing GPU memory "buffers" between benchmark steps.
I hope we can eventually run the full AI Benchmark on DirectML without issues.
Later I'll ask about better training performance, since:
1.2 - training | batch=50, size=224x224: 9138 ± 137 ms
seems too slow for a Titan V; at least on CUDA this is much faster.
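For what it's worth, stock TF 1.x exposes session-level GPU memory options; whether tensorflow-directml's DmlAllocator honors them is an assumption on my part, and AI Benchmark builds its own sessions internally, so using them would mean editing ai_benchmark/utils.py. A minimal sketch:

import tensorflow as tf

# Standard TF 1.x memory knobs; it is untested whether the DirectML
# allocator respects either of these.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True                    # allocate lazily
config.gpu_options.per_process_gpu_memory_fraction = 0.8  # cap at ~80% of device memory

with tf.Session(config=config) as sess:
    pass  # the benchmark would need this config passed to its own sessions

Here is the failing run: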

python
Python 3.7.7 (tags/v3.7.7:d7c567b08f, Mar 10 2020, 10:41:24) [MSC v.1900 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from ai_benchmark import AIBenchmark
>>> results = AIBenchmark().run()

>>   AI-Benchmark-v.0.1.2
>>   Let the AI Games begin..

*  TF Version: 1.15.3
*  Platform: Windows-10-10.0.19564-SP0
*  CPU: N/A
*  CPU RAM: 32 GB
*  GPU/0: N/A
*  GPU RAM: N/A GB
*  CUDA Version: 11.0
*  CUDA Build: V11.0.167

The benchmark is running...
The tests might take up to 20 minutes
Please don't interrupt the script

1/19. MobileNet-V2

1.1 - inference | batch=50, size=224x224: 52.5 ± 10.7 ms
1.2 - training  | batch=50, size=224x224: 9138 ± 137 ms

2/19. Inception-V3

2.1 - inference | batch=20, size=346x346: 812 ± 36 ms
2020-06-28 22:50:23.358757: F tensorflow/core/common_runtime/dml/dml_allocator.cc:97] Check failed: (((HRESULT)((hr))) >= 0) == true (0 vs. 1)

oscarbg commented Jun 28, 2020

fails similarly on Vega:

>>   AI-Benchmark-v.0.1.2
>>   Let the AI Games begin..

*  TF Version: 1.15.3
*  Platform: Windows-10-10.0.19564-SP0
*  CPU: N/A
*  CPU RAM: 32 GB
*  GPU/0: N/A
*  GPU RAM: N/A GB
*  CUDA Version: 11.0
*  CUDA Build: V11.0.167

The benchmark is running...
The tests might take up to 20 minutes
Please don't interrupt the script

1/19. MobileNet-V2

1.1 - inference | batch=50, size=224x224: 106 ± 33 ms
1.2 - training  | batch=50, size=224x224: 10541 ± 176 ms

2/19. Inception-V3

2.1 - inference | batch=20, size=346x346: 1238 ± 29 ms
2020-06-28 23:35:55.556323: F tensorflow/core/common_runtime/dml/dml_command_recorder.cc:150] Check failed: (((HRESULT)((((HRESULT)0x8007000EL)))) >= 0) == true (0 vs. 1)

EDIT: on WSL2 it fails even earlier on Vega:
after
1.1 - inference | batch=50, size=224x224: 136 ± 22 ms
it crashes and takes down the WSL2 process.

PatriceVignola (Contributor) commented:

Hi @oscarbg, this is something we are actively looking into. As you noticed, tensorflow-directml's memory usage is very high at the moment, which is a problem when training with many batches. We will update this issue once we release a package that addresses these crashes.


oscarbg commented Jun 29, 2020

Thanks @PatriceVignola!
Good to know the devs are aware and working on it.

PatriceVignola (Contributor) commented:

Hey @oscarbg, we just released tensorflow-directml 1.15.3.dev200911 with many improvements to the memory allocator. You can try it out and tell us how it goes!

Also, since we have now open-sourced our fork, new tensorflow-directml issues should be opened over here.
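To confirm the new build is active after upgrading (e.g. with "pip install --upgrade tensorflow-directml"), a quick check from the REPL; the exact DML device string is an assumption about what tensorflow-directml enumerates:

>>> import tensorflow as tf
>>> tf.__version__   # expect '1.15.3' from the dev200911 wheel
>>> from tensorflow.python.client import device_lib
>>> [d.name for d in device_lib.list_local_devices()]
# a DML device (e.g. '/device:DML:0') should appear alongside the CPU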

jstoecker transferred this issue from microsoft/DirectML on Sep 17, 2020

oscarbg commented Sep 20, 2020

Hi @PatriceVignola,
thanks for the update!
The new build works very nicely; memory usage is good now.
The only remaining issue seems to be closing the performance gap vs the CUDA backend.

On Titan V with DirectML I get:

Device Inference Score: 6468
Device Training Score: 5271
Device AI Score: 11739

On CUDA I got:

Device Inference Score: 15245
Device Training Score: 15619
Device AI Score: 30864

so DirectML is roughly 2-3x slower than CUDA right now (15245/6468 ≈ 2.4x on inference, 15619/5271 ≈ 3.0x on training).

Posting the full benchmark run on Titan V with 460.15 drivers:

>>> from ai_benchmark import AIBenchmark
>>> results = AIBenchmark().run()

>>   AI-Benchmark-v.0.1.2
>>   Let the AI Games begin..

*  TF Version: 1.15.3
*  Platform: Windows-10-10.0.20180-SP0
*  CPU: N/A
*  CPU RAM: 32 GB
*  GPU/0: N/A
*  GPU RAM: N/A GB
*  CUDA Version: N/A
*  CUDA Build: N/A

The benchmark is running...
The tests might take up to 20 minutes
Please don't interrupt the script

1/19. MobileNet-V2

1.1 - inference | batch=50, size=224x224: 56.2 ± 7.3 ms
1.2 - training  | batch=50, size=224x224: 1268 ± 10 ms

2/19. Inception-V3

2.1 - inference | batch=20, size=346x346: 87.4 ± 5.0 ms
2.2 - training  | batch=20, size=346x346: 447 ± 7 ms

3/19. Inception-V4

3.1 - inference | batch=10, size=346x346: 89.6 ± 4.8 ms
3.2 - training  | batch=10, size=346x346: 412 ± 6 ms

4/19. Inception-ResNet-V2

4.1 - inference | batch=10, size=346x346: 89.4 ± 1.8 ms
4.2 - training  | batch=8, size=346x346: 370 ± 5 ms

5/19. ResNet-V2-50

5.1 - inference | batch=10, size=346x346: 68.5 ± 2.6 ms
5.2 - training  | batch=10, size=346x346: 276 ± 5 ms

6/19. ResNet-V2-152

6.1 - inference | batch=10, size=256x256: 109 ± 4 ms
6.2 - training  | batch=10, size=256x256: 403 ± 8 ms

7/19. VGG-16

7.1 - inference | batch=20, size=224x224: 112 ± 2 ms
7.2 - training  | batch=2, size=224x224: 86.8 ± 1.9 ms

8/19. SRCNN 9-5-5

8.1 - inference | batch=10, size=512x512: 131 ± 3 ms
8.2 - inference | batch=1, size=1536x1536: 117 ± 4 ms
8.3 - training  | batch=10, size=512x512: 719 ± 13 ms

9/19. VGG-19 Super-Res

9.1 - inference | batch=10, size=256x256: 151 ± 3 ms
9.2 - inference | batch=1, size=1024x1024: 242 ± 4 ms
9.3 - training  | batch=10, size=224x224: 843 ± 9 ms

10/19. ResNet-SRGAN

10.1 - inference | batch=10, size=512x512: 176 ± 6 ms
10.2 - inference | batch=1, size=1536x1536: 159 ± 5 ms
10.3 - training  | batch=5, size=512x512: 479 ± 8 ms

11/19. ResNet-DPED

11.1 - inference | batch=10, size=256x256: 203 ± 2 ms
11.2 - inference | batch=1, size=1024x1024: 329 ± 5 ms
11.3 - training  | batch=15, size=128x128: 484 ± 5 ms

12/19. U-Net

12.1 - inference | batch=4, size=512x512: 493 ± 7 ms
12.2 - inference | batch=1, size=1024x1024: 550 ± 16 ms
12.3 - training  | batch=4, size=256x256: 488 ± 12 ms

13/19. Nvidia-SPADE

13.1 - inference | batch=5, size=128x128: 233 ± 6 ms
13.2 - training  | batch=1, size=128x128: 556 ± 6 ms

14/19. ICNet

14.1 - inference | batch=5, size=1024x1536: 349 ± 4 ms
14.2 - training  | batch=10, size=1024x1536: 1506 ± 7 ms

15/19. PSPNet

15.1 - inference | batch=5, size=720x720: 1086 ± 10 ms
15.2 - training  | batch=1, size=512x512: 398 ± 7 ms

16/19. DeepLab

16.1 - inference | batch=2, size=512x512: 672 ± 4 ms
16.2 - training  | batch=1, size=384x384: 474 ± 4 ms

17/19. Pixel-RNN

17.1 - inference | batch=50, size=64x64: 989 ± 7 ms
17.2 - training  | batch=10, size=64x64: 2643 ± 7 ms

18/19. LSTM-Sentiment

18.1 - inference | batch=100, size=1024x300: 681 ± 13 ms
18.2 - training  | batch=10, size=1024x300: 1388 ± 10 ms

19/19. GNMT-Translation

19.1 - inference | batch=1, size=1x20: 335 ± 5 ms

Device Inference Score: 6468
Device Training Score: 5271
Device AI Score: 11739

For more information and results, please visit http://ai-benchmark.com/alpha

oscarbg closed this as completed on Sep 20, 2020
megha1906 commented:

How do I run a single model using ai-benchmarks?

jstoecker (Contributor) commented:

> How do I run a single model using ai-benchmarks?

I don't think it's possible without modifying the AIBenchmark scripts. You could (after pip-installing the package, for example) modify the loop in run_tests (ai_benchmark/utils.py) to skip the models that you're not interested in.
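A hypothetical sketch of that edit; the real loop and attribute names inside run_tests may differ, so check them against the installed source before changing anything:

# Inside run_tests() in ai_benchmark/utils.py (identifier names below are
# assumptions -- adapt them to the actual loop):
WANTED_MODELS = {"MobileNet-V2", "Inception-V3"}  # models to keep

for test in tests:                     # the benchmark's per-model loop
    if test.model not in WANTED_MODELS:
        continue                       # skip models you're not interested in
    # ... original loop body runs unchanged for the kept models ...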


darkar18 commented Mar 15, 2022

My benchmark run fails on the 8th test...

>>   AI-Benchmark-v.0.1.2
>>   Let the AI Games begin..

*  TF Version: 1.15.5
*  Platform: Windows-10-10.0.22000-SP0
*  CPU: N/A
*  CPU RAM: 7 GB

The benchmark is running...
The tests might take up to 20 minutes
Please don't interrupt the script

1/19. MobileNet-V2

1.1 - inference | batch=50, size=224x224: 132 ± 2 ms
1.2 - training  | batch=50, size=224x224: 693 ± 1 ms

2/19. Inception-V3

2.1 - inference | batch=20, size=346x346: 150 ± 2 ms
2.2 - training  | batch=20, size=346x346: 483 ± 2 ms

3/19. Inception-V4

3.1 - inference | batch=10, size=346x346: 162 ± 2 ms
3.2 - training  | batch=10, size=346x346: 555 ± 11 ms

4/19. Inception-ResNet-V2

4.1 - inference | batch=10, size=346x346: 182 ± 2 ms
4.2 - training  | batch=8, size=346x346: 514 ± 2 ms

5/19. ResNet-V2-50

5.1 - inference | batch=10, size=346x346: 80.4 ± 2.9 ms
5.2 - training  | batch=10, size=346x346: 266 ± 1 ms

6/19. ResNet-V2-152

6.1 - inference | batch=10, size=256x256: 117 ± 2 ms
6.2 - training  | batch=10, size=256x256: 498 ± 3 ms

7/19. VGG-16

7.1 - inference | batch=20, size=224x224: 116 ± 1 ms
7.2 - training  | batch=2, size=224x224: 96.9 ± 1.5 ms

8/19. SRCNN 9-5-5

8.1 - inference | batch=10, size=512x512: 203 ± 4 ms
8.2 - inference | batch=1, size=1536x1536: 183 ± 5 ms
Traceback (most recent call last):
  File "C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow_core\python\client\session.py", line 1365, in _do_call
    return fn(*args)
  File "C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow_core\python\client\session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow_core\python\client\session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[10,64,512,512] and type float on /job:localhost/replica:0/task:0/device:DML:0 by allocator DmlAllocator
         [[{{node gradients/generator/Relu_grad/ReluGrad}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "ai-test.py", line 3, in <module>
    b.run()
  File "C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\ai_benchmark\__init__.py", line 64, in run
    use_CPU=self.use_CPU, precision=precision, _type="full", start_dir=self.cwd)
  File "C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\ai_benchmark\utils.py", line 635, in run_tests
    sess.run(train_step, feed_dict={input_: data, target_: target})
  File "C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow_core\python\client\session.py", line 956, in run
    run_metadata_ptr)
  File "C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow_core\python\client\session.py", line 1180, in _run
    feed_dict_tensor, options, run_metadata)
  File "C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow_core\python\client\session.py", line 1359, in _do_run
    run_metadata)
  File "C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow_core\python\client\session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[10,64,512,512] and type float on /job:localhost/replica:0/task:0/device:DML:0 by allocator DmlAllocator
         [[node gradients/generator/Relu_grad/ReluGrad (defined at C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow_core\python\framework\ops.py:1762) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.


Original stack trace for 'gradients/generator/Relu_grad/ReluGrad':
  File "ai-test.py", line 3, in <module>
    b.run()
  File "C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\ai_benchmark\__init__.py", line 64, in run
    use_CPU=self.use_CPU, precision=precision, _type="full", start_dir=self.cwd)
  File "C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\ai_benchmark\utils.py", line 615, in run_tests
    subTest.optimizer, subTest.learning_rate, testInfo.tf_ver_2)
  File "C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\ai_benchmark\utils.py", line 202, in constructOptimizer
    train_step = optimizer.minimize(loss_)
  File "C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow_core\python\training\optimizer.py", line 403, in minimize
    grad_loss=grad_loss)
  File "C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow_core\python\training\optimizer.py", line 512, in compute_gradients
    colocate_gradients_with_ops=colocate_gradients_with_ops)
  File "C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow_core\python\ops\gradients_impl.py", line 158, in gradients
    unconnected_gradients)
  File "C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow_core\python\ops\gradients_util.py", line 679, in _GradientsHelper
    lambda: grad_fn(op, *out_grads))
  File "C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow_core\python\ops\gradients_util.py", line 350, in _MaybeCompile
    return grad_fn()  # Exit early
  File "C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow_core\python\ops\gradients_util.py", line 679, in <lambda>
    lambda: grad_fn(op, *out_grads))
  File "C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow_core\python\ops\nn_grad.py", line 415, in _ReluGrad
    return gen_nn_ops.relu_grad(grad, op.outputs[0])
  File "C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow_core\python\ops\gen_nn_ops.py", line 11732, in relu_grad
    "ReluGrad", gradients=gradients, features=features, name=name)
  File "C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow_core\python\framework\op_def_library.py", line 794, in _apply_op_helper
    op_def=op_def)
  File "C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow_core\python\util\deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow_core\python\framework\ops.py", line 3371, in create_op
    attrs, op_def, compute_device)
  File "C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow_core\python\framework\ops.py", line 3440, in _create_op_internal
    op_def=op_def)
  File "C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow_core\python\framework\ops.py", line 1762, in __init__
    self._traceback = tf_stack.extract_stack()

...which was originally created as op 'generator/Relu', defined at:
  File "ai-test.py", line 3, in <module>
    b.run()
[elided 0 identical lines from previous traceback]
  File "C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\ai_benchmark\__init__.py", line 64, in run
    use_CPU=self.use_CPU, precision=precision, _type="full", start_dir=self.cwd)
  File "C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\ai_benchmark\utils.py", line 557, in run_tests
    input_, output_, train_vars_ = getModelSrc(test, testInfo, sess)
  File "C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\ai_benchmark\utils.py", line 241, in getModelSrc
    tf.train.import_meta_graph(test.model_src, clear_devices=True)
  File "C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow_core\python\training\saver.py", line 1453, in import_meta_graph
    **kwargs)[0]
  File "C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow_core\python\training\saver.py", line 1477, in _import_meta_graph_with_return_elements
    **kwargs))
  File "C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow_core\python\framework\meta_graph.py", line 809, in import_scoped_meta_graph_with_return_elements
    return_elements=return_elements)
  File "C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow_core\python\util\deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow_core\python\framework\importer.py", line 405, in import_graph_def
    producer_op_list=producer_op_list)
  File "C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow_core\python\framework\importer.py", line 517, in _import_graph_def_internal
    _ProcessNewOps(graph)
  File "C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow_core\python\framework\importer.py", line 243, in _ProcessNewOps
    for new_op in graph._add_new_tf_operations(compute_devices=False):  # pylint: disable=protected-access
  File "C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow_core\python\framework\ops.py", line 3575, in _add_new_tf_operations
    for c_op in c_api_util.new_tf_operations(self)
  File "C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow_core\python\framework\ops.py", line 3575, in <listcomp>
    for c_op in c_api_util.new_tf_operations(self)
  File "C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow_core\python\framework\ops.py", line 3465, in _create_op_from_tf_operation
    ret = Operation(c_op, self)
  File "C:\Users\alexv\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow_core\python\framework\ops.py", line 1762, in __init__
    self._traceback = tf_stack.extract_stack()

What should I do?
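For reference, the "Hint" in the traceback refers to TF 1.x RunOptions; wiring it in means editing the sess.run call in ai_benchmark/utils.py, roughly as below (a sketch, not the stock benchmark code):

import tensorflow as tf

# TF 1.x: ask the runtime to list live tensor allocations when an OOM occurs.
run_opts = tf.RunOptions(report_tensor_allocations_upon_oom=True)

# Per the traceback above, run_tests() calls:
#   sess.run(train_step, feed_dict={input_: data, target_: target})
# Passing options=run_opts there makes the next OOM dump the allocation list:
sess.run(train_step, feed_dict={input_: data, target_: target}, options=run_opts)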
