Deepspeech CUDNN_STATUS_EXECUTION_FAILED #523

@priyakasimbeg

Description

The DeepSpeech workload fails with a CUDNN_STATUS_EXECUTION_FAILED error in the cudnnRNNForward call.

Traceback:

I0922 23:02:03.070308 140024268789568 spec.py:333] Evaluating on the validation split.
I0922 23:02:03.270553 140024268789568 input_pipeline.py:20] Loading split = dev-clean
I0922 23:02:03.274869 140024268789568 input_pipeline.py:20] Loading split = dev-other
I0922 23:03:08.161050 140024268789568 spec.py:349] Evaluating on the test split.
I0922 23:03:08.366027 140024268789568 input_pipeline.py:20] Loading split = test-clean
2023-09-22 23:03:16.187352: E external/xla/xla/pjrt/pjrt_stream_executor_client.cc:2469] Execution of replica 6 failed: INTERNAL: Failed to execute XLA Runtime executable: run time error: custom call 'xla.gpu.custom_call' failed: jaxlib/gpu/rnn_kernels.cc:256: operation cudnnRNNForward( handle.get(), rnn_desc, fwdMode, (const int32_t*)seq_lengths_buf, input_data_desc, input_buf, output_data_desc, output_buf, h_desc, h_0_buf, h_n_buf, c_desc, c_0_buf, c_n_buf, weight_space_size, weights_buf, d.workspace_size, workspace_buf, d.reserve_space_size, reserve_space_buf) failed: CUDNN_STATUS_EXECUTION_FAILED.

Steps to Reproduce

Git commit: ae3587d

python3 submission_runner.py --framework=jax --workload=librispeech_deepspeech --submission_path=baselines/adamw/jax/submission.py --tuning_search_space=baselines/adamw/tuning_search_space.json --data_dir=/data/librispeech --num_tuning_trials=1 --experiment_dir=/experiment_runs --experiment_name=tests/regression_tests/adamw --overwrite=True --save_checkpoints=False --max_global_steps=10 --librispeech_tokenizer_vocab_path=/data/librispeech/spm_model.vocab 2

Source or Possible Fix

The DeepSpeech regression test failed in #511; I wrongly assumed the failure was transient.
I have since traced the change in behavior to the fixes to our shard_and_maybe_pad_np function in #515.
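
For anyone debugging this, here is a minimal sketch of what a pad-and-shard step generally does and how to inspect the sequence lengths an RNN kernel would receive. It is not the repository's shard_and_maybe_pad_np: the names pad_and_shard and sequence_lengths, the "weights" key, and the mask convention (1 = padded frame) are all assumptions made for illustration. Whether padded examples are what actually triggers the cuDNN failure is not established here.

```python
import numpy as np


def pad_and_shard(batch, num_devices, global_batch_size):
    """Hypothetical helper: zero-pad the batch axis, then split it per device."""
    if global_batch_size % num_devices != 0:
        raise ValueError("global_batch_size must be divisible by num_devices")
    real = next(iter(batch.values())).shape[0]
    pad_rows = global_batch_size - real
    if pad_rows < 0:
        raise ValueError("batch is larger than global_batch_size")

    # Assumed convention: weight 1.0 marks a real example, 0.0 a padded one.
    arrays = dict(batch)
    arrays["weights"] = np.ones(real, dtype=np.float32)

    sharded = {}
    for name, array in arrays.items():
        array = np.asarray(array)
        pad_width = [(0, pad_rows)] + [(0, 0)] * (array.ndim - 1)
        array = np.pad(array, pad_width)  # constant zero padding
        # (global, ...) -> (num_devices, per_device, ...)
        sharded[name] = array.reshape((num_devices, -1) + array.shape[1:])
    return sharded


def sequence_lengths(paddings):
    """Per-example lengths from a 0/1 mask, assuming 1 marks a padded frame."""
    return (1 - paddings).sum(axis=-1).astype(np.int32)


# Three real examples of 5 frames each; the third has only 3 valid frames.
paddings = np.zeros((3, 5), dtype=np.float32)
paddings[2, 3:] = 1.0
batch = {"inputs": np.ones((3, 5, 2), dtype=np.float32), "paddings": paddings}

sharded = pad_and_shard(batch, num_devices=2, global_batch_size=4)
lengths = sequence_lengths(sharded["paddings"])
print(sharded["inputs"].shape, lengths.tolist(), sharded["weights"].tolist())
```

Note that under this convention a zero-padded mask row reads as a full-length sequence, so only the weights distinguish the padded example from a real one. Comparing the lengths and shapes produced before and after #515 may help narrow down what changed in the inputs to cudnnRNNForward.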

Metadata

    Labels

    P1 Launch 2023: High priority issues for October 2023 AlgoPerf Launch
    🚀 Launch Blocker: Issues that are blocking launch of benchmark
