Deepspeech CUDNN_STATUS_EXECUTION_FAILED

The DeepSpeech workload fails with a `CUDNN_STATUS_EXECUTION_FAILED` error in the `cudnnRNNForward` call.

## Description
Traceback: 
```
I0922 23:02:03.070308 140024268789568 spec.py:333] Evaluating on the validation split.
I0922 23:02:03.270553 140024268789568 input_pipeline.py:20] Loading split = dev-clean
I0922 23:02:03.274869 140024268789568 input_pipeline.py:20] Loading split = dev-other
I0922 23:03:08.161050 140024268789568 spec.py:349] Evaluating on the test split.
I0922 23:03:08.366027 140024268789568 input_pipeline.py:20] Loading split = test-clean
2023-09-22 23:03:16.187352: E external/xla/xla/pjrt/pjrt_stream_executor_client.cc:2469] Execution of replica 6 failed: INTERNAL: Failed to execute XLA Runtime executable: run time error: custom call 'xla.gpu.custom_call' failed: jaxlib/gpu/rnn_kernels.cc:256: operation cudnnRNNForward( handle.get(), rnn_desc, fwdMode, (const int32_t*)seq_lengths_buf, input_data_desc, input_buf, output_data_desc, output_buf, h_desc, h_0_buf, h_n_buf, c_desc, c_0_buf, c_n_buf, weight_space_size, weights_buf, d.workspace_size, workspace_buf, d.reserve_space_size, reserve_space_buf) failed: CUDNN_STATUS_EXECUTION_FAILED.
```

## Steps to Reproduce
Git commit: [ae3587d](https://github.com/mlcommons/algorithmic-efficiency/pull/511/commits/ae3587d4c13fcd29aa4e70b793a79d8d439bc5b7)

```
python3 submission_runner.py --framework=jax --workload=librispeech_deepspeech --submission_path=baselines/adamw/jax/submission.py --tuning_search_space=baselines/adamw/tuning_search_space.json --data_dir=/data/librispeech --num_tuning_trials=1 --experiment_dir=/experiment_runs --experiment_name=tests/regression_tests/adamw --overwrite=True --save_checkpoints=False --max_global_steps=10 --librispeech_tokenizer_vocab_path=/data/librispeech/spm_model.vocab 2
```

## Source or Possible Fix
The DeepSpeech regression test failed in https://github.com/mlcommons/algorithmic-efficiency/pull/511. I initially assumed this was a transient failure.
I have since traced the change in behavior to the fixes to our `shard_and_maybe_pad_np` function in https://github.com/mlcommons/algorithmic-efficiency/pull/515.
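
For anyone debugging this, here is a minimal sketch of what a pad-then-shard step can do to per-example sequence lengths. `pad_and_shard` is a hypothetical helper written for illustration and is not the repository's `shard_and_maybe_pad_np`. The padding-mask convention (1.0 marks a padded frame) and the batch and device sizes are assumptions too. Whether zero-length padded examples are what actually trips `cudnnRNNForward` here has not been confirmed.

```python
import numpy as np


def pad_and_shard(batch, num_devices, pad_value=0.0):
    """Pad the leading axis to a multiple of num_devices, then reshape.

    Hypothetical helper for illustration only; this is NOT the
    repository's shard_and_maybe_pad_np.
    """
    batch = np.asarray(batch)
    remainder = batch.shape[0] % num_devices
    pad_size = 0 if remainder == 0 else num_devices - remainder
    if pad_size:
        # Pad only the leading (example) axis.
        pad_width = [(0, pad_size)] + [(0, 0)] * (batch.ndim - 1)
        batch = np.pad(batch, pad_width, constant_values=pad_value)
    sharded = batch.reshape((num_devices, -1) + batch.shape[1:])
    return sharded, pad_size


# Assumed setup: 10 utterances of 5 frames each, sharded over 8 devices.
# Assumed mask convention: 1.0 marks a padded frame, 0.0 a real frame.
paddings = np.zeros((10, 5), dtype=np.float32)
sharded, pad_size = pad_and_shard(paddings, num_devices=8, pad_value=1.0)

# Per-example sequence length = number of non-padded frames.
num_frames = sharded.shape[-1]
seq_lengths = (num_frames - sharded.sum(axis=-1)).astype(np.int32)
print(seq_lengths)
```

In this sketch, the examples added by padding end up with a sequence length of 0, so printing `seq_lengths` per shard just before the RNN call is one cheap way to see whether any device receives only such examples.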
