Zero-shot audio classification takes an audio model that was pre-trained on one set of labelled examples and uses it to classify new examples from previously unseen classes, without any further training.

Currently, 🤗 Transformers supports one kind of model for zero-shot audio classification: the CLAP model. CLAP is a transformer-based model that takes both audio and text as inputs, and computes the similarity between the two. If we pass a text input that strongly correlates with an audio input, we’ll get a high similarity score. Conversely, passing a text input that is completely unrelated to the audio input will return a low similarity.

We can use this similarity prediction for zero-shot audio classification by passing one audio input to the model and multiple candidate labels. The model will return a similarity score for each of the candidate labels, and we can pick the one that has the highest score as our prediction.

Note that we can either pass the full set of labels to the model, or a hand-selected subset that we believe contains the correct label. Passing the full set of labels is more exhaustive, but it tends to lower classification accuracy because the classification space is larger. A smaller subset can score higher, provided the correct label is actually in our chosen subset.
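The selection rule described above can be sketched in plain Python. This is a minimal illustration, not the pipeline's actual code: the similarity values are made up, the helper names softmax and classify are hypothetical, and normalising with a softmax is an assumption about how similarities become scores.

```python
import math

def softmax(logits):
    """Turn raw similarity values into scores that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(similarity_logits, candidate_labels):
    """Score only the given candidate labels and sort them best-first."""
    logits = [similarity_logits[label] for label in candidate_labels]
    scores = softmax(logits)
    return sorted(zip(candidate_labels, scores), key=lambda pair: pair[1], reverse=True)

# Hypothetical audio-text similarities for one clip whose true label is "dog barking".
similarity_logits = {"dog barking": 2.0, "wolf howling": 2.3, "rain": 0.1, "siren": -0.5}

# Full label set: the distractor "wolf howling" narrowly wins.
full = classify(similarity_logits, list(similarity_logits))
# Hand-picked subset that still contains the true label.
subset = classify(similarity_logits, ["dog barking", "rain", "siren"])

print(full[0][0])    # wolf howling
print(subset[0][0])  # dog barking
```

With the full label set a near-miss class takes first place. Once that class is left out of the candidates, the remaining scores are renormalised and the correct label comes out on top.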

In [None]:
!pip install datasets transformers

In [2]:
from datasets import load_dataset
from transformers import pipeline

In [18]:
minds = load_dataset("PolyAI/minds14", name="en-US", split="train")
minds

Dataset({
    features: ['path', 'audio', 'transcription', 'english_transcription', 'intent_class', 'lang_id'],
    num_rows: 563
})

In [19]:
minds.features

{'path': Value(dtype='string', id=None),
 'audio': Audio(sampling_rate=8000, mono=True, decode=True, id=None),
 'transcription': Value(dtype='string', id=None),
 'english_transcription': Value(dtype='string', id=None),
 'intent_class': ClassLabel(names=['abroad', 'address', 'app_error', 'atm_limit', 'balance', 'business_loan', 'card_issues', 'cash_deposit', 'direct_debit', 'freeze', 'high_value_payment', 'joint_account', 'latest_transactions', 'pay_bill'], id=None),
 'lang_id': ClassLabel(names=['cs-CZ', 'de-DE', 'en-AU', 'en-GB', 'en-US', 'es-ES', 'fr-FR', 'it-IT', 'ko-KR', 'nl-NL', 'pl-PL', 'pt-PT', 'ru-RU', 'zh-CN'], id=None)}

In [20]:
# Candidate labels: the 14 intent classes listed under 'intent_class' above
label_list = ['abroad', 'address', 'app_error', 'atm_limit', 'balance', 'business_loan', 'card_issues', 'cash_deposit', 'direct_debit', 'freeze', 'high_value_payment', 'joint_account', 'latest_transactions', 'pay_bill']

In [21]:
# Pick one example and keep its raw waveform (a NumPy array sampled at 8 kHz)
example = minds[10]
array = example["audio"]["array"]

In [22]:
example

{'path': '/root/.cache/huggingface/datasets/downloads/extracted/a19fbc5032eacf25eab0097832db7b7f022b42104fbad6bd5765527704a428b9/en-US~JOINT_ACCOUNT/602baf1f5f67b421554f64ce.wav',
 'audio': {'path': '/root/.cache/huggingface/datasets/downloads/extracted/a19fbc5032eacf25eab0097832db7b7f022b42104fbad6bd5765527704a428b9/en-US~JOINT_ACCOUNT/602baf1f5f67b421554f64ce.wav',
  'array': array([ 0.        ,  0.        ,  0.00024414, ..., -0.00024414,
         -0.00024414,  0.        ]),
  'sampling_rate': 8000},
 'transcription': 'I need help setting up a joint account',
 'english_transcription': 'I need help setting up a joint account',
 'intent_class': 11,
 'lang_id': 4}
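One caveat before running the classifier, offered as an assumption rather than something this notebook verifies: the example above is sampled at 8,000 Hz, whereas the CLAP feature extractor is, to our knowledge, configured for 48,000 Hz input, and a raw NumPy array passed to the pipeline is taken to be at that rate already. The sketch below shows what resampling involves, using a crude linear interpolation in NumPy on a synthetic tone. The function name and the tone are hypothetical stand-ins, and the scores reported further down were produced without any resampling.

```python
import numpy as np

# Assumption: the sampling rate the CLAP feature extractor expects.
CLAP_SAMPLING_RATE = 48_000

def resample_linear(waveform, orig_sr, target_sr=CLAP_SAMPLING_RATE):
    """Crude linear-interpolation resampler, for illustration only."""
    waveform = np.asarray(waveform, dtype=np.float32)
    duration = len(waveform) / orig_sr               # clip length in seconds
    n_target = int(round(duration * target_sr))      # number of output samples
    old_times = np.arange(len(waveform)) / orig_sr   # timestamps of input samples
    new_times = np.arange(n_target) / target_sr      # timestamps of output samples
    return np.interp(new_times, old_times, waveform).astype(np.float32)

# Stand-in for example["audio"]["array"]: one second of a 440 Hz tone at 8 kHz.
tone_8k = np.sin(2 * np.pi * 440 * np.arange(8_000) / 8_000)
tone_48k = resample_linear(tone_8k, orig_sr=8_000)

print(len(tone_8k), "->", len(tone_48k))
```

In the notebook itself, the more usual route would be to let 🤗 Datasets resample on the fly, for example minds = minds.cast_column("audio", Audio(sampling_rate=48_000)) after from datasets import Audio, and then re-run the cells that follow.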

In [23]:
classifier = pipeline(
    task="zero-shot-audio-classification", model="laion/clap-htsat-unfused"
)
classifier(array, candidate_labels=label_list)

[{'score': 0.2917209267616272, 'label': 'high_value_payment'},
 {'score': 0.22422751784324646, 'label': 'joint_account'},
 {'score': 0.1863952875137329, 'label': 'freeze'},
 {'score': 0.11245342344045639, 'label': 'app_error'},
 {'score': 0.051380500197410583, 'label': 'latest_transactions'},
 {'score': 0.03209715336561203, 'label': 'direct_debit'},
 {'score': 0.025592271238565445, 'label': 'address'},
 {'score': 0.025141267105937004, 'label': 'business_loan'},
 {'score': 0.019772421568632126, 'label': 'cash_deposit'},
 {'score': 0.015598911792039871, 'label': 'card_issues'},
 {'score': 0.012746768072247505, 'label': 'pay_bill'},
 {'score': 0.0016031954437494278, 'label': 'abroad'},
 {'score': 0.000873863697052002, 'label': 'atm_limit'},
 {'score': 0.0003965037758462131, 'label': 'balance'}]

In [24]:
id2label = minds.features["intent_class"].int2str

In [25]:
id2label(example['intent_class'])

'joint_account'
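To make the comparison explicit, the top entries of the pipeline output above can be checked against the ground-truth label. This is a small sketch: the scores are rounded copies of the values printed above, and the variable names are only illustrative.

```python
# Top four entries copied from the pipeline output, rounded to four decimals.
predictions = [
    {"score": 0.2917, "label": "high_value_payment"},
    {"score": 0.2242, "label": "joint_account"},
    {"score": 0.1864, "label": "freeze"},
    {"score": 0.1125, "label": "app_error"},
]
true_label = "joint_account"  # value returned by id2label(example['intent_class'])

# The prediction is the label with the highest score.
top_label = max(predictions, key=lambda p: p["score"])["label"]

# Position of the true label in the ranking, counting from 1.
ranked = sorted(predictions, key=lambda p: p["score"], reverse=True)
true_rank = [p["label"] for p in ranked].index(true_label) + 1

print(top_label == true_label)  # False
print(true_rank)                # 2
```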

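As you can see, the prediction is wrong: the model ranks high_value_payment first, while the true label, joint_account, only comes second. A likely reason is that CLAP is pre-trained on generic audio classification data, similar to the environmental sounds in the ESC dataset, rather than on speech data specifically.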

For speech tasks like this one, we therefore fine-tune a speech model such as Whisper on the downstream task.