In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import matplotlib.pyplot as plt
from torch.utils.data import DataLoader,Dataset
import torchvision
import torchaudio
import torchaudio.transforms as transforms
import soundfile as sf



In [2]:
from datasets import load_dataset          # `datasets` is the Hugging Face Datasets library
from datasets import Audio

# Load the Australian English subset of MINDS-14 and resample the audio to 16 kHz on decode
minds = load_dataset("PolyAI/minds14", name="en-AU", split="train")
minds = minds.cast_column("audio", Audio(sampling_rate=16000))
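
The `cast_column` call above asks `datasets` to resample each clip to 16 kHz when it is decoded. As a rough illustration of what resampling means, here is a minimal pure-Python sketch that upsamples a toy signal by linear interpolation. The function name `resample_linear` is hypothetical, the 8 kHz source rate is an assumption about the original recordings, and real resamplers (including whatever `datasets` uses internally) apply proper band-limited filtering rather than this simple interpolation.

```python
def resample_linear(samples, orig_sr, target_sr):
    """Resample a 1-D list of floats by linear interpolation (illustrative only)."""
    if not samples:
        return []
    # The number of output samples scales with the ratio of the two rates.
    n_out = int(round(len(samples) * target_sr / orig_sr))
    out = []
    for i in range(n_out):
        # Position of output sample i on the input sample axis.
        pos = i * orig_sr / target_sr
        left = int(pos)
        right = min(left + 1, len(samples) - 1)
        frac = pos - left
        out.append(samples[left] * (1.0 - frac) + samples[right] * frac)
    return out


# One second of a toy signal at an assumed 8 kHz source rate.
toy = [0.0, 1.0, 0.0, -1.0] * 2000
resampled = resample_linear(toy, orig_sr=8000, target_sr=16000)
print(len(toy), len(resampled))
```

The duration stays the same; only the number of samples per second changes, which is what the model cares about.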



To classify an audio recording into a set of classes, we can use the audio-classification pipeline from Transformers. In our case, we need a model that’s been fine-tuned for intent classification, and specifically on the MINDS-14 dataset. 

In [8]:
from transformers import pipeline
classifier = pipeline("audio-classification",model="anton-l/xtreme_s_xlsr_300m_minds14")

Some weights of the model checkpoint at anton-l/xtreme_s_xlsr_300m_minds14 were not used when initializing Wav2Vec2ForSequenceClassification: ['wav2vec2.encoder.pos_conv_embed.conv.weight_v', 'wav2vec2.encoder.pos_conv_embed.conv.weight_g']
- This IS expected if you are initializing Wav2Vec2ForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Wav2Vec2ForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of Wav2Vec2ForSequenceClassification were not initialized from the model checkpoint at anton-l/xtreme_s_xlsr_300m_minds14 and are newly initialized: ['wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original1', 'wav2vec2.encoder.pos

The pipeline function allows you to load pre-trained models that have been fine-tuned on specific datasets for specific tasks, such as the MINDS-14 dataset for intent classification in this example.


In [9]:
example = minds[0]
classifier(example["audio"]["array"])        # predicts 'pay_bill' with about 96% probability

[{'score': 0.9625310301780701, 'label': 'pay_bill'},
 {'score': 0.02867276780307293, 'label': 'freeze'},
 {'score': 0.0033498003613203764, 'label': 'card_issues'},
 {'score': 0.002005805494263768, 'label': 'abroad'},
 {'score': 0.0008484331192448735, 'label': 'high_value_payment'}]
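
The output above is a list of the five highest-scoring labels. A minimal sketch of how such a list can be produced from raw model logits, assuming the scores come from a softmax followed by a top-k selection. The helper `top_k_scores`, the logits, and the three-label mapping are made up for illustration and are not taken from the model.

```python
import math


def top_k_scores(logits, id2label, top_k=5):
    """Turn raw logits into sorted {'score', 'label'} dicts via softmax."""
    # Subtract the max before exponentiating for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Rank class indices from most to least probable and keep the first top_k.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [{"score": probs[i], "label": id2label[i]} for i in ranked[:top_k]]


# Hypothetical logits and label names, for illustration only.
labels = {0: "pay_bill", 1: "freeze", 2: "card_issues"}
preds = top_k_scores([4.0, 0.5, -1.0], labels, top_k=2)
print(preds[0]["label"])
```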

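Pipelines have preprocessing built in: the raw waveform array goes in directly, and the pipeline converts it into the input format the model expects before running inference. When a raw array is passed, it should already be at the sampling rate the model expects (16 kHz here), which is why the audio column was cast to 16000 Hz earlier.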
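
As an idea of what that built-in preprocessing can involve, the sketch below normalizes a waveform to zero mean and unit variance, a step that Wav2Vec2-style feature extractors commonly apply. This is an assumption about the general technique, not a copy of the pipeline's code; the function name `normalize_waveform` and the epsilon value are illustrative.

```python
import math


def normalize_waveform(samples, eps=1e-7):
    """Scale a waveform to zero mean and (approximately) unit variance."""
    mean = sum(samples) / len(samples)
    var = sum((x - mean) ** 2 for x in samples) / len(samples)
    # eps guards against division by zero on silent (constant) clips.
    scale = math.sqrt(var + eps)
    return [(x - mean) / scale for x in samples]


# A tiny made-up waveform, for illustration only.
wave = [0.1, 0.3, -0.2, 0.4, -0.1]
normed = normalize_waveform(wave)
print(normed)
```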


In [10]:

id2label = minds.features["intent_class"].int2str
id2label(example["intent_class"])
# The ground-truth label matches the pipeline's top prediction

'pay_bill'
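
The same check can be written as a small helper that works on the pipeline's output format. A minimal sketch: `is_top1_correct` is a hypothetical name, and the predictions are copied from the output of cell 9 above.

```python
def is_top1_correct(predictions, true_label):
    """Return True if the highest-scoring prediction matches the true label."""
    best = max(predictions, key=lambda p: p["score"])
    return best["label"] == true_label


# Predictions as printed by the classifier for minds[0].
predictions = [
    {"score": 0.9625310301780701, "label": "pay_bill"},
    {"score": 0.02867276780307293, "label": "freeze"},
    {"score": 0.0033498003613203764, "label": "card_issues"},
    {"score": 0.002005805494263768, "label": "abroad"},
    {"score": 0.0008484331192448735, "label": "high_value_payment"},
]
print(is_top1_correct(predictions, "pay_bill"))
```

Applying the helper over many examples and averaging the results would give a top-1 accuracy.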

# ASR With Pipeline

In [11]:
from transformers import pipeline

asr = pipeline("automatic-speech-recognition")

No model was supplied, defaulted to facebook/wav2vec2-base-960h and revision 55bb623 (https://huggingface.co/facebook/wav2vec2-base-960h).
Using a pipeline without specifying a model name and revision in production is not recommended.


Some weights of the model checkpoint at facebook/wav2vec2-base-960h were not used when initializing Wav2Vec2ForCTC: ['wav2vec2.encoder.pos_conv_embed.conv.weight_v', 'wav2vec2.encoder.pos_conv_embed.conv.weight_g']
- This IS expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-base-960h and are newly initialized: ['wav2vec2.masked_spec_embed', 'wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original1', 'wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original0']
You sho

In [13]:
example = minds[44]
asr(example["audio"]["array"])

{'text': "I I'D LIKE TO NIGHT YOUR PAYMENT WITH MY BANKICONP"}
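
The log above shows that the default checkpoint is loaded as `Wav2Vec2ForCTC`. The all-caps text with no punctuation is typical of CTC models that predict one character per audio frame. A minimal sketch of greedy CTC decoding, assuming `<pad>` acts as the blank token and `|` as the word delimiter; the tiny vocabulary and frame sequence are invented for illustration.

```python
def ctc_greedy_decode(frame_ids, id2token, blank="<pad>", word_delim="|"):
    """Collapse repeated frame predictions, drop blanks, and join characters."""
    tokens = []
    prev = None
    for idx in frame_ids:
        # A token is emitted only when the prediction changes between frames.
        if idx != prev:
            tok = id2token[idx]
            if tok != blank:
                tokens.append(tok)
        prev = idx
    return "".join(tokens).replace(word_delim, " ").strip()


# Hypothetical vocabulary and per-frame argmax ids.
vocab = {0: "<pad>", 1: "|", 2: "P", 3: "A", 4: "Y"}
frames = [2, 2, 0, 3, 3, 3, 0, 0, 4, 1, 1, 0, 2, 0, 3, 4, 4]
decoded = ctc_greedy_decode(frames, vocab)
print(decoded)
```

Because each character is chosen independently per frame, acoustically similar words can be confused, which is one plausible reason for errors like those in the transcript above.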

In [15]:
example["english_transcription"]

'hi ID like to make a payment with my bank account'
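
To put a number on how far the transcript is from the reference, one common metric is word error rate (WER): the word-level edit distance divided by the number of reference words. A minimal pure-Python sketch follows. The function name is hypothetical, and lowercasing both strings is a normalization choice made here, not something the notebook prescribes.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the ref words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


reference = "hi ID like to make a payment with my bank account"
hypothesis = "I I'D LIKE TO NIGHT YOUR PAYMENT WITH MY BANKICONP"
wer = word_error_rate(reference, hypothesis)
print(f"WER: {wer:.2f}")
```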

# Why Pipeline

Hugging Face lists the following benefits:


A pre-trained model may already exist that solves your task well, saving you plenty of time.

pipeline() takes care of all the pre- and post-processing for you, so you don’t have to worry about getting the data into the right format for a model.

If the result isn’t ideal, it still gives you a quick baseline for future fine-tuning.

Once you fine-tune a model on your custom data and share it on the Hub, the whole community can use it quickly and effortlessly via pipeline(), making AI more accessible.