# Lesson 5: Zero-Shot Audio Classification

- In the classroom, the libraries have already been installed for you.
- If you are running this code on your own machine, please install the following:
``` 
    !pip install transformers
    !pip install datasets
    !pip install soundfile
    !pip install librosa
```

The `librosa` library may require [ffmpeg](https://www.ffmpeg.org/download.html) to be installed.
- The [librosa](https://pypi.org/project/librosa/) page on PyPI provides installation instructions for ffmpeg.

- Here is some code that suppresses warning messages.

In [3]:
from transformers.utils import logging
logging.set_verbosity_error()

### Prepare the dataset of audio recordings

In [4]:
from datasets import load_dataset, load_from_disk

# ESC-50 is a collection of 5-second recordings of different sounds
# dataset = load_dataset("ashraq/esc50",
#                       split="train[0:10]")
dataset = load_from_disk("./models/ashraq/esc50/train")

In [14]:
audio_sample = dataset[:100]

In [15]:
audio_sample

{'filename': ['1-100032-A-0.wav',
  '1-100038-A-14.wav',
  '1-100210-A-36.wav',
  '1-100210-B-36.wav',
  '1-101296-A-19.wav',
  '1-101296-B-19.wav',
  '1-101336-A-30.wav',
  '1-101404-A-34.wav',
  '1-103298-A-9.wav',
  '1-103995-A-30.wav'],
 'fold': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
 'target': [0, 14, 36, 36, 19, 19, 30, 34, 9, 30],
 'category': ['dog',
  'chirping_birds',
  'vacuum_cleaner',
  'vacuum_cleaner',
  'thunderstorm',
  'thunderstorm',
  'door_wood_knock',
  'can_opening',
  'crow',
  'door_wood_knock'],
 'esc10': [True,
  False,
  False,
  False,
  False,
  False,
  False,
  False,
  False,
  False],
 'src_file': [100032,
  100038,
  100210,
  100210,
  101296,
  101296,
  101336,
  101404,
  103298,
  103995],
 'take': ['A', 'A', 'A', 'B', 'A', 'B', 'A', 'A', 'A', 'A'],
 'audio': [{'path': None,
   'array': array([0., 0., 0., ..., 0., 0., 0.]),
   'sampling_rate': 44100},
  {'path': None,
   'array': array([-0.01184082, -0.10336304, -0.14141846, ...,  0.06985474,
          

In [34]:
from IPython.display import Audio as IPythonAudio
def showaudio(i=0):
    # Return an inline audio player for the i-th clip in audio_sample
    clip = audio_sample["audio"][i]
    return IPythonAudio(clip["array"], rate=clip["sampling_rate"])

In [40]:
showaudio(i=2)

In [39]:
IPythonAudio(audio_sample["audio"][1]["array"],
             rate=audio_sample["audio"][1]["sampling_rate"]
            )

### Build the `audio classification` pipeline using the 🤗 Transformers library

In [10]:
from transformers import pipeline

In [11]:
zero_shot_classifier = pipeline(
    task="zero-shot-audio-classification",
    model="./models/laion/clap-htsat-unfused")

More info on [laion/clap-htsat-unfused](https://huggingface.co/laion/clap-htsat-unfused).

### Sampling Rate for Transformer Models
- How long does 1 second of high-resolution audio (192,000 Hz) appear to a model such as Whisper, which is trained to expect audio sampled at 16,000 Hz?

In [43]:
(1 * 192000) / 16000

12.0

- To the model, 1 second of high-resolution audio looks like 12 seconds of audio.

- How about 5 seconds of audio?

In [44]:
(5 * 192000) / 16000

60.0

- To the model, 5 seconds of high-resolution audio looks like 60 seconds of audio.
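
The arithmetic above can be wrapped in a small helper. This is a minimal sketch: the function name `apparent_duration` is made up for illustration and is not part of any library, and it assumes the model reads every incoming sample as if it had been recorded at the model's expected rate.

```python
def apparent_duration(seconds, audio_rate_hz, model_rate_hz):
    """Return how long a clip seems to a model that assumes model_rate_hz.

    The clip holds seconds * audio_rate_hz samples; the model reads them
    as if only model_rate_hz samples made up each second.
    """
    num_samples = seconds * audio_rate_hz
    return num_samples / model_rate_hz


print(apparent_duration(1, 192_000, 16_000))  # 12.0
print(apparent_duration(5, 192_000, 16_000))  # 60.0
print(apparent_duration(5, 44_100, 48_000))   # 4.59375
```

The last call uses the dataset and model rates inspected in the next cells: without resampling, a 5-second clip at 44,100 Hz would look slightly shorter than it really is to a model expecting 48,000 Hz.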

In [45]:
zero_shot_classifier.feature_extractor.sampling_rate

48000

In [47]:
audio_sample["audio"][0]["sampling_rate"]

44100

- Resample the input audio so that its sampling rate matches the rate the model expects (48,000 Hz here).

In [1]:
from datasets import Audio

In [None]:
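import pandas as pd
from scipy import stats

# Collect the sampling rate of every recording in the dataset
rates = [dataset[i]["audio"]["sampling_rate"] for i in range(len(dataset))]
df = pd.DataFrame({"obs": range(1, len(rates) + 1), "sampling_rate": rates})

# Summary statistics (count, min/max, mean, variance) of the sampling rates
stats.describe(df["sampling_rate"])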

In [5]:
dataset48K = dataset.cast_column(
    "audio",
     Audio(sampling_rate=48_000))
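
As a rough sanity check on what resampling does to the array length, the sketch below estimates the number of samples after conversion. The helper name `resampled_length` is hypothetical, and real resamplers may round differently at the edges, so treat the result as an estimate rather than what `datasets` computes internally.

```python
def resampled_length(num_samples, orig_rate_hz, target_rate_hz):
    """Estimate the sample count after resampling; the duration is unchanged."""
    return round(num_samples * target_rate_hz / orig_rate_hz)


# A 5-second clip recorded at 44,100 Hz holds 220,500 samples.
original = 5 * 44_100
print(resampled_length(original, 44_100, 48_000))  # 240000
```

With the real data, `len(dataset48K[0]["audio"]["array"])` should be close to this value for a 5-second clip.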

In [15]:
dataset48K[0:10]

{'filename': ['1-100032-A-0.wav',
  '1-100038-A-14.wav',
  '1-100210-A-36.wav',
  '1-100210-B-36.wav',
  '1-101296-A-19.wav',
  '1-101296-B-19.wav',
  '1-101336-A-30.wav',
  '1-101404-A-34.wav',
  '1-103298-A-9.wav',
  '1-103995-A-30.wav'],
 'fold': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
 'target': [0, 14, 36, 36, 19, 19, 30, 34, 9, 30],
 'category': ['dog',
  'chirping_birds',
  'vacuum_cleaner',
  'vacuum_cleaner',
  'thunderstorm',
  'thunderstorm',
  'door_wood_knock',
  'can_opening',
  'crow',
  'door_wood_knock'],
 'esc10': [True,
  False,
  False,
  False,
  False,
  False,
  False,
  False,
  False,
  False],
 'src_file': [100032,
  100038,
  100210,
  100210,
  101296,
  101296,
  101336,
  101404,
  103298,
  103995],
 'take': ['A', 'A', 'A', 'B', 'A', 'B', 'A', 'A', 'A', 'A'],
 'audio': [{'path': None,
   'array': array([0., 0., 0., ..., 0., 0., 0.]),
   'sampling_rate': 48000},
  {'path': None,
   'array': array([-0.01288922, -0.09524129, -0.14230728, ...,  0.03312215,
          

In [6]:
audio_sample = dataset48K[0]

In [7]:
audio_sample

{'filename': '1-100032-A-0.wav',
 'fold': 1,
 'target': 0,
 'category': 'dog',
 'esc10': True,
 'src_file': 100032,
 'take': 'A',
 'audio': {'path': None,
  'array': array([0., 0., 0., ..., 0., 0., 0.]),
  'sampling_rate': 48000}}

In [8]:
candidate_labels = ["Sound of a dog",
                    "Sound of vacuum cleaner"]

In [12]:
zero_shot_classifier(audio_sample["audio"]["array"],
                     candidate_labels=candidate_labels)

[{'score': 0.9985589385032654, 'label': 'Sound of a dog'},
 {'score': 0.0014411123702302575, 'label': 'Sound of vacuum cleaner'}]

In [13]:
candidate_labels = ["Sound of a child crying",
                    "Sound of vacuum cleaner",
                    "Sound of a bird singing",
                    "Sound of an airplane"]

In [14]:
zero_shot_classifier(audio_sample["audio"]["array"],
                     candidate_labels=candidate_labels)

[{'score': 0.6172530055046082, 'label': 'Sound of a bird singing'},
 {'score': 0.21602635085582733, 'label': 'Sound of vacuum cleaner'},
 {'score': 0.12547191977500916, 'label': 'Sound of an airplane'},
 {'score': 0.04124866798520088, 'label': 'Sound of a child crying'}]
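
Note that the correct label (a dog) is absent from this second list, yet the scores still sum to 1. That is consistent with the pipeline normalizing audio-text similarity scores across the candidate labels with a softmax, so each score is relative to the other candidates rather than an absolute confidence. The sketch below illustrates the effect with made-up similarity values; the numbers and the `softmax` helper are assumptions for illustration, not values produced by the model.

```python
import math


def softmax(logits):
    """Convert raw similarity scores into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical audio-text similarity scores for four labels,
# none of which matches the clip particularly well.
labels = ["child crying", "vacuum cleaner", "bird singing", "airplane"]
logits = [0.3, 2.0, 3.0, 1.4]

probabilities = softmax(logits)
for label, score in sorted(zip(labels, probabilities), key=lambda p: -p[1]):
    print(f"{label}: {score:.3f}")
```

Even when every candidate is a poor match, one of them still receives the largest share of the probability.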

In [29]:
dataset48K[0:20]['category']

['dog',
 'chirping_birds',
 'vacuum_cleaner',
 'vacuum_cleaner',
 'thunderstorm',
 'thunderstorm',
 'door_wood_knock',
 'can_opening',
 'crow',
 'door_wood_knock',
 'door_wood_knock',
 'clapping',
 'clapping',
 'clapping',
 'dog',
 'clapping',
 'thunderstorm',
 'fireworks',
 'fireworks',
 'fireworks']

In [30]:
# Use the unique category names as labels (sorted list for a stable order)
candidate_labels = sorted(set(dataset48K[0:20]['category']))

In [34]:
zero_shot_classifier(dataset48K[16]["audio"]["array"],
                     candidate_labels=candidate_labels)

[{'score': 0.9991794228553772, 'label': 'thunderstorm'},
 {'score': 0.0003758132515940815, 'label': 'fireworks'},
 {'score': 0.0002723723591770977, 'label': 'door_wood_knock'},
 {'score': 6.941783794900402e-05, 'label': 'crow'},
 {'score': 6.154872971819714e-05, 'label': 'chirping_birds'},
 {'score': 2.8280908736633137e-05, 'label': 'clapping'},
 {'score': 6.791943633288611e-06, 'label': 'dog'},
 {'score': 5.3546082199318334e-06, 'label': 'can_opening'},
 {'score': 1.0022015430877218e-06, 'label': 'vacuum_cleaner'}]
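
The pipeline returns a list of dictionaries, one per candidate label, as the output above shows. When checking many clips, a small helper that extracts the top label makes it easy to compare the prediction against the dataset's `category` field. This is a sketch: `top_label` is a hypothetical name, and the sample predictions are the first three entries copied from the output above.

```python
def top_label(predictions):
    """Return the label with the highest score from a pipeline result."""
    return max(predictions, key=lambda p: p["score"])["label"]


# First three entries of the output shown above for dataset48K[16].
predictions = [
    {"score": 0.9991794228553772, "label": "thunderstorm"},
    {"score": 0.0003758132515940815, "label": "fireworks"},
    {"score": 0.0002723723591770977, "label": "door_wood_knock"},
]

print(top_label(predictions))                    # thunderstorm
print(top_label(predictions) == "thunderstorm")  # True
```

Here the prediction matches the `category` listed for index 16 in the cell above.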

### Try it yourself! 
- Try this model with some other labels and audio files!