# ASR with a pipeline

Automatic Speech Recognition (ASR) transcribes recorded speech into text.

In [1]:
"""
Use the MINDS-14 dataset: recordings of people asking an e-banking system questions in several languages and dialects, with an intent_class label for each recording

Start by loading the en-AU subset of the data to try out the pipeline,
and upsample it to a 16 kHz sampling rate, which is what most speech models require

To transcribe other subsets of MINDS-14 in a different language,
find a pre-trained ASR model for it: filter the model list by task, then by language
"""

from datasets import Audio, load_dataset

minds = load_dataset("PolyAI/minds14", name="en-AU", split="train")
minds = minds.cast_column("audio", Audio(sampling_rate=16_000))
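
To make the upsampling step concrete, here is a minimal sketch of what changing the sampling rate does to a signal. It uses plain linear interpolation on a Python list and assumes the original recordings are at 8 kHz. The helper name `resample_linear` is hypothetical, and `datasets` relies on a higher-quality resampler internally, so treat this as an illustration of the idea rather than what `cast_column` actually runs.

```python
def resample_linear(samples, orig_sr, target_sr):
    """Resample a 1-D list of samples by linear interpolation (illustration only)."""
    if not samples:
        return []
    # Keep the duration the same: the sample count scales with the rate ratio.
    n_out = int(round(len(samples) * target_sr / orig_sr))
    out = []
    for i in range(n_out):
        # Position of output sample i on the input sample axis.
        pos = i * orig_sr / target_sr
        left = int(pos)
        right = min(left + 1, len(samples) - 1)
        frac = pos - left
        out.append((1 - frac) * samples[left] + frac * samples[right])
    return out


# Toy signal standing in for an 8 kHz recording (assumed original rate).
eight_khz = [0.0, 1.0, 0.0, -1.0]
sixteen_khz = resample_linear(eight_khz, orig_sr=8_000, target_sr=16_000)
print(len(eight_khz), "->", len(sixteen_khz))
```

Doubling the rate doubles the number of samples while the audio covers the same duration, which is why the model sees the input it was trained on.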

In [2]:
"""
Use the automatic-speech-recognition pipeline from Hugging Face Transformers

Recall: pipelines are objects that offer a simple API for many NLP tasks
They are wrappers around a model and its pre- and post-processing steps
"""

from transformers import pipeline

asr = pipeline("automatic-speech-recognition")

No model was supplied, defaulted to facebook/wav2vec2-base-960h and revision 55bb623 (https://huggingface.co/facebook/wav2vec2-base-960h).
Using a pipeline without specifying a model name and revision in production is not recommended.
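
Following the warning above, one option is to name the checkpoint explicitly instead of relying on the default. The sketch below pins the model reported in the log and also wraps the audio in a dict carrying its sampling rate, an input form the ASR pipeline accepts. The helper names `build_asr` and `as_pipeline_input` are hypothetical, chosen for illustration.

```python
MODEL_ID = "facebook/wav2vec2-base-960h"  # the default reported in the log above


def build_asr(model_id=MODEL_ID):
    """Create an ASR pipeline for an explicitly named checkpoint."""
    # Imported lazily so that defining this helper does not require transformers.
    from transformers import pipeline

    return pipeline("automatic-speech-recognition", model=model_id)


def as_pipeline_input(audio):
    """Turn a decoded `datasets` audio dict into a fresh dict for the pipeline.

    Copying into a new dict leaves the dataset example untouched.
    """
    return {"raw": audio["array"], "sampling_rate": audio["sampling_rate"]}


# Usage (downloads the model on first call):
# asr = build_asr()
# asr(as_pipeline_input(minds[0]["audio"]))
```

Passing the sampling rate alongside the samples lets the pipeline check it against the rate the model expects, rather than assuming the raw array is already at 16 kHz.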


In [3]:
"""
Take an example from the dataset and pass its raw data to the pipeline
"""

example = minds[0]
asr(example["audio"]["array"])

{'text': 'I WOULD LIKE TO PAY MY ELECTRICITY BILL USING MY CAD CAN YOU PLEASE ASSIST'}

In [4]:
"""
Compared with the reference transcription for the example, the model has done a good job: the only error is "CAD" in place of "card"
"""

example["english_transcription"]

'I would like to pay my electricity bill using my card can you please assist'
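
To put a number on "a good job", a common measure is the word error rate (WER): the word-level edit distance between the reference and the model output, divided by the number of reference words. The following is a minimal sketch with a hypothetical helper `word_error_rate`; it lowercases both strings because the model emits upper-case text, and it does no other normalisation.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the ref words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


reference = "I would like to pay my electricity bill using my card can you please assist"
hypothesis = "I WOULD LIKE TO PAY MY ELECTRICITY BILL USING MY CAD CAN YOU PLEASE ASSIST"
wer = word_error_rate(reference, hypothesis)
print(f"WER: {wer:.3f}")
```

With one substituted word out of fifteen, the WER for this example is 1/15, or about 0.067.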

Note:
- a pre-trained model may exist that already solves your task really well, saving you plenty of time
- pipeline() takes care of all the pre/post-processing for you, so you don’t have to worry about getting the data into the right format for a model
- if the result isn’t ideal, this still gives you a quick baseline for future fine-tuning