<a href="https://colab.research.google.com/github/maushamkumar/Hugging-Face/blob/main/Zero-Shot_Audio_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Zero-Shot Audio Classification

The librosa library may require ffmpeg to be installed.

The librosa documentation provides installation instructions for ffmpeg.

In [None]:
!pip install transformers
!pip install datasets
!pip install soundfile
!pip install librosa

- The following code suppresses warning messages.

In [None]:
from transformers.utils import logging
logging.set_verbosity_error()

# Prepare the dataset of audio recordings

In [None]:
from datasets import load_dataset, load_from_disk

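# ESC-50 is a collection of 5-second recordings of different sounds.
# Download the first 10 examples from the Hugging Face Hub:
dataset = load_dataset("ashraq/esc50",
                       split="train[0:10]")
# If a copy was previously saved locally with save_to_disk, load it instead:
# dataset = load_from_disk("ashraq/esc50")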

In [None]:
audio_sample = dataset[0]

In [None]:
audio_sample

In [None]:
from IPython.display import Audio as IPythonAudio
IPythonAudio(audio_sample["audio"]["array"],
             rate=audio_sample["audio"]["sampling_rate"])
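
As a quick sanity check on the "5 seconds" description, a clip's duration can be derived from the number of samples and the sampling rate. The sketch below uses a hypothetical stand-in dictionary (`fake_sample`, with an assumed rate of 44,100 Hz) shaped like `audio_sample`, so it runs without downloading the dataset. The helper name `clip_duration_seconds` is an illustration, not part of any library.

```python
def clip_duration_seconds(sample):
    """Return the clip length in seconds: number of samples / samples per second."""
    audio = sample["audio"]
    return len(audio["array"]) / audio["sampling_rate"]


# Hypothetical stand-in with the same layout as `audio_sample`:
# 5 seconds of silence at an assumed rate of 44,100 Hz.
fake_sample = {
    "audio": {
        "array": [0.0] * (5 * 44_100),
        "sampling_rate": 44_100,
    }
}

print(clip_duration_seconds(fake_sample))  # 5.0
```

In the notebook, the same helper could be called as `clip_duration_seconds(audio_sample)`.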

# Build the audio classification pipeline using 🤗 Transformers Library

In [None]:
from transformers import pipeline

In [None]:
zero_shot_classifier = pipeline(
    task="zero-shot-audio-classification",
    model="laion/clap-htsat-unfused")

# Sampling Rate for Transformer Models
- How long does 1 second of high-resolution audio (192,000 Hz) appear to the Whisper model, which is trained to expect audio at 16,000 Hz?

In [1]:
(1 * 192000) / 16000

12.0

- 1 second of high-resolution audio appears to the model as if it were 12 seconds of audio.

- How about 5 seconds of audio?

In [2]:
(5 * 192000) / 16000

60.0

- 5 seconds of high resolution audio appears to the model as if it is 60 seconds of audio.
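
The arithmetic above generalizes to any pair of sampling rates. The following is a minimal sketch; the function name `apparent_duration` and its parameter names are chosen here for illustration only.

```python
def apparent_duration(seconds, actual_rate, expected_rate):
    """Duration (in seconds) a model perceives when audio sampled at
    `actual_rate` is passed unchanged to a model that assumes `expected_rate`."""
    return seconds * actual_rate / expected_rate


print(apparent_duration(1, 192_000, 16_000))  # 12.0
print(apparent_duration(5, 192_000, 16_000))  # 60.0
# When the rates match, the duration is unchanged.
print(apparent_duration(5, 16_000, 16_000))   # 5.0
```

The last call shows why resampling matters: only when the input rate equals the rate the model expects does the model perceive the audio at its true length.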

In [None]:
zero_shot_classifier.feature_extractor.sampling_rate

In [None]:
audio_sample["audio"]["sampling_rate"]

- Set the correct sampling rate for the input and the model.

In [None]:
from datasets import Audio

In [None]:
dataset = dataset.cast_column(
    "audio",
    Audio(sampling_rate=48_000))

In [None]:
audio_sample = dataset[0]

In [None]:
audio_sample
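
A small guard can make a sampling-rate mismatch fail loudly before the audio reaches the classifier. This is a minimal sketch: `check_sampling_rate` and the stand-in sample dictionaries are hypothetical. In the notebook, the expected rate would come from `zero_shot_classifier.feature_extractor.sampling_rate`; it is assumed here to be 48,000 Hz to match the `cast_column` call above, and the 44,100 Hz value for the unresampled clip is likewise an assumption.

```python
def check_sampling_rate(sample, expected_rate):
    """Raise ValueError if the sample's rate differs from what the model expects."""
    actual_rate = sample["audio"]["sampling_rate"]
    if actual_rate != expected_rate:
        raise ValueError(
            f"Audio is sampled at {actual_rate} Hz but the model expects "
            f"{expected_rate} Hz; resample it first."
        )
    return actual_rate


EXPECTED_RATE = 48_000  # assumed value of the feature extractor's sampling rate

# Hypothetical stand-ins shaped like `audio_sample`.
resampled = {"audio": {"array": [0.0], "sampling_rate": 48_000}}
original = {"audio": {"array": [0.0], "sampling_rate": 44_100}}

print(check_sampling_rate(resampled, EXPECTED_RATE))  # 48000

try:
    check_sampling_rate(original, EXPECTED_RATE)
except ValueError as err:
    print(err)
```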

In [None]:
candidate_labels = ["Sound of a dog",
                    "Sound of vacuum cleaner"]

In [None]:
zero_shot_classifier(audio_sample["audio"]["array"],
                     candidate_labels=candidate_labels)

In [None]:
candidate_labels = ["Sound of a child crying",
                    "Sound of vacuum cleaner",
                    "Sound of a bird singing",
                    "Sound of an airplane"]

In [None]:
zero_shot_classifier(audio_sample["audio"]["array"],
                     candidate_labels=candidate_labels)
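
The pipeline returns a list of dictionaries, each holding a `label` and a `score`. A minimal sketch for picking the best match follows. The values in `example_output` are invented for illustration and are not results from the model, and `best_label` is a hypothetical helper.

```python
def best_label(predictions):
    """Return (label, score) for the highest-scoring prediction."""
    top = max(predictions, key=lambda p: p["score"])
    return top["label"], top["score"]


# Invented scores, shaped like the pipeline's output.
example_output = [
    {"score": 0.9, "label": "Sound of a dog"},
    {"score": 0.1, "label": "Sound of vacuum cleaner"},
]

label, score = best_label(example_output)
print(f"{label} ({score:.0%})")  # Sound of a dog (90%)
```

Because the scores are distributed across the candidate labels only, a high score means "the best of the options given", not that the sound truly belongs to that class. If none of the candidate labels describes the audio, one of them will still come out on top.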