<a href="https://colab.research.google.com/github/liuyi3013/colab/blob/main/Batch_Inference_ASR.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this notebook, we show how to transcribe a batch of input sentences in parallel using a pre-trained model. Please use a GPU to speed up the code (Runtime => Change runtime type => GPU).

Let's install SpeechBrain and download some speech sentences first:

In [None]:
%%capture
!pip install speechbrain

In [None]:
# Download and unpack the test-clean subset of LibriSpeech
import shutil
from speechbrain.utils.data_utils import download_file

LIBRISPEECH_TEST_CLEAN_URL = "https://www.openslr.org/resources/12/test-clean.tar.gz"
download_file(LIBRISPEECH_TEST_CLEAN_URL, 'test-clean.tar.gz')
shutil.unpack_archive('test-clean.tar.gz', '.')

test-clean.tar.gz: 0.00B [00:00, ?B/s]

Downloading https://www.openslr.org/resources/12/test-clean.tar.gz to test-clean.tar.gz


test-clean.tar.gz: 347MB [00:19, 18.0MB/s]                           


Let's decode a single sentence:

In [None]:
from speechbrain.pretrained import EncoderDecoderASR
audio_1 = "/content/LibriSpeech/test-clean/1089/134686/1089-134686-0030.flac"

# Uncomment one of the lines below to use a different pre-trained model
#asr_model = EncoderDecoderASR.from_hparams(source="speechbrain/asr-crdnn-rnnlm-librispeech", savedir="pretrained_models/asr-crdnn-rnnlm-librispeech",  run_opts={"device":"cuda"})
#asr_model = EncoderDecoderASR.from_hparams(source="speechbrain/asr-crdnn-transformerlm-librispeech", savedir="pretrained_models/asr-crdnn-transformerlm-librispeech",  run_opts={"device":"cuda"})
asr_model = EncoderDecoderASR.from_hparams(source="speechbrain/asr-transformer-transformerlm-librispeech", savedir="pretrained_models/asr-transformer-transformerlm-librispeech",  run_opts={"device":"cuda"})
asr_model.transcribe_file(audio_1)

'BEWARE OF MAKING THAT MISTAKE'

The same sentence can also be decoded with transcribe_batch, which expects a batch of waveforms together with their lengths relative to the longest waveform in the batch:

In [None]:
import torch
import torchaudio

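snt_1, fs = torchaudio.load(audio_1)  # snt_1 has shape [1, time]
# A single sentence fills the whole batch, so its relative length is 1.0
wav_lens = torch.tensor([1.0])
asr_model.transcribe_batch(snt_1, wav_lens)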

(['BEWARE OF MAKING THAT MISTAKE'], [[28, 1934, 18, 7, 623, 15, 1966]])

Let's now decode a second sentence on its own:

In [None]:
audio_2 = "/content/LibriSpeech/test-clean/1089/134686/1089-134686-0007.flac"

snt_2, fs = torchaudio.load(audio_2)
wav_lens=torch.tensor([1.0])
asr_model.transcribe_batch(snt_2, wav_lens)

(['A COLD LUCID INDIFFERENCE REIGNED IN HIS SOUL'],
 [[9, 646, 2706, 520, 4024, 2992, 6, 10, 20, 575]])

Let's now decode both sentences within the same batch:

In [None]:
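# Zero-pad the shorter signal so both fit in one [batch, time] tensor
from torch.nn.utils.rnn import pad_sequence
batch = pad_sequence([snt_1.squeeze(), snt_2.squeeze()], batch_first=True, padding_value=0.0)
# Relative lengths: each signal's length divided by the padded batch length
wav_lens = torch.tensor([snt_1.shape[1] / batch.shape[1], snt_2.shape[1] / batch.shape[1]])
asr_model.transcribe_batch(batch, wav_lens)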


(['BEWARE OF MAKING THAT MISTAKE',
  'A COLD LUCID INDIFFERENCE REIGNED IN HIS SOUL'],
 [[28, 1934, 18, 7, 623, 15, 1966],
  [9, 646, 2706, 520, 4024, 2992, 6, 10, 20, 575]])
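
The padding step above can be looked at in isolation. The sketch below uses short synthetic tensors in place of real audio so that the arithmetic is easy to follow. The helper name pad_with_relative_lengths is hypothetical and not part of SpeechBrain.

```python
import torch
from torch.nn.utils.rnn import pad_sequence


def pad_with_relative_lengths(signals):
    """Zero-pad 1-D waveforms into one batch and return their relative lengths.

    Hypothetical helper for illustration; `signals` is a list of 1-D tensors.
    """
    batch = pad_sequence(signals, batch_first=True, padding_value=0.0)
    abs_lens = torch.tensor([s.shape[0] for s in signals], dtype=torch.float)
    # Divide by the padded length so the longest signal gets 1.0
    rel_lens = abs_lens / batch.shape[1]
    return batch, rel_lens


# Synthetic stand-ins for two waveforms of 3 and 5 samples
short = torch.ones(3)
long = torch.ones(5)

batch, rel_lens = pad_with_relative_lengths([short, long])
print(batch.shape)  # torch.Size([2, 5])
print(rel_lens)     # tensor([0.6000, 1.0000])
```

The two returned values have the same roles as batch and wav_lens in the cell above: the first row of the batch ends in two padding zeros, and its relative length of 0.6 marks where the real signal stops.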

Let's now set up a batch of 8 sentences:

In [None]:
audio_files = [
    '/content/LibriSpeech/test-clean/1089/134686/1089-134686-0030.flac',
    '/content/LibriSpeech/test-clean/1089/134686/1089-134686-0014.flac',
    '/content/LibriSpeech/test-clean/1089/134686/1089-134686-0007.flac',
    '/content/LibriSpeech/test-clean/1089/134691/1089-134691-0000.flac',
    '/content/LibriSpeech/test-clean/1089/134691/1089-134691-0003.flac',
    '/content/LibriSpeech/test-clean/1188/133604/1188-133604-0030.flac',
    '/content/LibriSpeech/test-clean/1089/134691/1089-134691-0019.flac',
    '/content/LibriSpeech/test-clean/1188/133604/1188-133604-0006.flac',
]

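sigs = []
lens = []
for audio_file in audio_files:
  snt, fs = torchaudio.load(audio_file)
  sigs.append(snt.squeeze())  # [1, time] -> [time]
  lens.append(snt.shape[1])   # length in samples

# Zero-pad every signal to the length of the longest one
batch = pad_sequence(sigs, batch_first=True, padding_value=0.0)

# Convert absolute lengths to lengths relative to the padded batch
lens = torch.Tensor(lens) / batch.shape[1]

asr_model.transcribe_batch(batch, lens)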


(['BEWARE OF MAKING THAT MISTAKE',
  'HE TRIED TO THINK HOW IT COULD BE',
  'A COLD LUCID INDIFFERENCE REIGNED IN HIS SOUL',
  'HE COULD WAIT NO LONGER',
  'THE UNIVERSITY',
  'HE KNOWS THEM BOTH',
  'A VOICE FROM BEYOND THE WORLD WAS CALLING',
  'THEN HE COMES TO THE BEAK OF IT'],
 [[28, 1934, 18, 7, 623, 15, 1966],
  [12, 501, 6, 8, 158, 93, 17, 76, 28],
  [9, 646, 2706, 520, 4024, 2992, 6, 10, 20, 575],
  [12, 76, 383, 54, 118, 47],
  [3, 4342, 22],
  [12, 1880, 65, 329],
  [9, 336, 50, 28, 854, 3, 254, 16, 2395],
  [74, 12, 1395, 8, 3, 28, 792, 7, 17]])

**Note:** We highly recommend building batches from sentences of similar length. Every sentence is zero-padded to the longest one in its batch, so mixing short and long sentences wastes computation on padding.
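
One possible way to follow this recommendation is to sort the sentences by length before splitting them into batches. The sketch below is only an illustration: the function names and the sample counts are made up, and it works on plain lists of lengths rather than on real audio.

```python
def batches_by_length(lengths, batch_size):
    """Group item indices into batches of similar length (hypothetical helper)."""
    # Sort indices from shortest to longest, then cut into consecutive chunks
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]


def padded_samples(lengths, batches):
    """Count the zeros added when each batch is padded to its longest item."""
    total = 0
    for batch in batches:
        longest = max(lengths[i] for i in batch)
        total += sum(longest - lengths[i] for i in batch)
    return total


# Made-up lengths in samples for four sentences
lengths = [32000, 160000, 48000, 150000]

in_order = [[0, 1], [2, 3]]                      # batches in file order
by_length = batches_by_length(lengths, batch_size=2)

print(by_length)                                 # [[0, 2], [3, 1]]
print(padded_samples(lengths, in_order))         # 230000
print(padded_samples(lengths, by_length))        # 26000
```

Because the batches hold indices into the original list, the transcriptions can be put back in the original order after decoding.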