# Masked Modeling Duo (M2D) Example -- Audio Tagging

This notebook shows an example of audio tagging with a fine-tuned M2D model.
[M2D](https://github.com/nttcslab/m2d) is a self-supervised audio model pre-trained on [AudioSet](https://research.google.com/audioset/) without labels.
After pre-training, the model was fine-tuned on AudioSet with labels.

We use the fine-tuned model to predict AudioSet classes for audio segments.

In [1]:
# The code depends on these external modules.
! pip install timm einops nnAudio librosa >& /dev/null

import warnings; warnings.simplefilter('ignore')
import logging; logging.basicConfig(level=logging.INFO)
import numpy as np
import pandas as pd
from pathlib import Path
import torch
import zipfile
import librosa


In [2]:
# Downloads the AudioSet class definition. It has 527 classes.
! wget http://storage.googleapis.com/us_audioset/youtube_corpus/v1/csv/class_labels_indices.csv >& /dev/null
classes = pd.read_csv('class_labels_indices.csv').sort_values('mid').reset_index()
classes[:3]

|   | level_0 | index | mid | display_name |
|---|---------|-------|-----|--------------|
| 0 | 433 | 433 | /g/122z_qxw | Firecracker |
| 1 | 169 | 169 | /m/011k_j | Timpani |
| 2 | 108 | 108 | /m/01280g | Wild animals |
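
The leftmost column is the row number after sorting by `mid`, and the notebook later uses that number (via `classes.loc[i]`) to name a predicted class. The following is a minimal standard-library sketch of the same ordering, not the notebook's own code: the three rows are copied from the table above, and `load_sorted_names` is a hypothetical helper introduced only for illustration.

```python
import csv
import io

# Three rows in the layout of class_labels_indices.csv (index, mid, display_name).
# The values are copied from the table above; the real file has 527 rows.
CSV_TEXT = """index,mid,display_name
108,/m/01280g,Wild animals
433,/g/122z_qxw,Firecracker
169,/m/011k_j,Timpani
"""

def load_sorted_names(text):
    """Return display names ordered by `mid`, so list position equals the sorted row number."""
    rows = list(csv.DictReader(io.StringIO(text)))
    rows.sort(key=lambda r: r["mid"])
    return [r["display_name"] for r in rows]

names = load_sorted_names(CSV_TEXT)
for i, name in enumerate(names):
    print(i, name)
```

With this ordering, position 0 is Firecracker, which matches the first row of the table.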


In [3]:
# Also downloads example audio files for demonstration.
! wget https://github.com/nttcslab/msm-mae/releases/download/v0.0.1/AudioSetWav16k_examples.zip >& /dev/null
with zipfile.ZipFile("AudioSetWav16k_examples.zip", "r") as zip_ref:
    zip_ref.extractall(".")
! ls AudioSetWav16k/eval_segments

-0xzrMun0Rs_30.000.wav	-22tna7KHzI_28.000.wav	5hlsVoxJPNI_30.000.wav
-1nilez17Dg_30.000.wav	3tUlhM80ObM_0.000.wav	--U7joUcTCo_0.000.wav


## Download M2D
- portable_m2d.py -- A portable loader with no dependence on other files from the M2D repository.
- m2d_vit_base-80x1001p16x16-221006-mr7_as_46ab246d.zip -- An AudioSet fine-tuned weight file

In [4]:
! wget https://raw.githubusercontent.com/nttcslab/m2d/master/examples/portable_m2d.py
! wget https://github.com/nttcslab/m2d/releases/download/v0.3.0/m2d_vit_base-80x1001p16x16-221006-mr7_as_46ab246d.zip

with zipfile.ZipFile("m2d_vit_base-80x1001p16x16-221006-mr7_as_46ab246d.zip", "r") as zip_ref:
    zip_ref.extractall(".")
! find m2d_vit_base-80x1001p16x16-221006-mr7_as_46ab246d -name '*.pth'


## Create model

Two lines of code get a model ready for classification.

In [5]:
from portable_m2d import PortableM2D
model = PortableM2D(weight_file='m2d_vit_base-80x1001p16x16-221006-mr7_as_46ab246d/weights_ep69it3124-0.47929.pth', num_classes=527)

 using 150 parameters, while dropped 10 out of 160 parameters from m2d_vit_base-80x1001p16x16-221006-mr7_as_46ab246d/weights_ep69it3124-0.47929.pth
 (dropped: ['module.ar.runtime.to_spec.mel_basis', 'module.ar.runtime.to_spec.stft.wsin', 'module.ar.runtime.to_spec.stft.wcos', 'module.ar.runtime.to_spec.stft.window_mask', 'module.head.norm.running_mean'] ...)
<All keys matched successfully>


### An audio tagging example

In [6]:
from IPython.display import display, Audio

def show_topk(classes, m2d, wav_file, k=5):
    print(wav_file)
    # Loads and shows an audio clip.
    wav, _ = librosa.load(wav_file, mono=True, sr=m2d.cfg.sample_rate)
    display(Audio(wav, rate=m2d.cfg.sample_rate))
    wav = torch.tensor(wav).unsqueeze(0)
    # Predicts class probabilities for the clip (a batch of one).
    with torch.no_grad():
        probs = m2d(wav).squeeze(0).softmax(0)
    # Shows the top-k prediction results.
    topk_values, topk_indices = probs.topk(k=k)
    print(', '.join([f'{classes.loc[i].display_name} ({v*100:.1f}%)' for i, v in zip(topk_indices.numpy(), topk_values.numpy())]))
    print()

files = list(Path('AudioSetWav16k/eval_segments').glob('*.wav'))
files = np.random.choice(files, size=3, replace=False)

for fn in files:
    show_topk(classes, model, fn)

AudioSetWav16k/eval_segments/5hlsVoxJPNI_30.000.wav


Music (62.4%), Speech (26.3%), Singing (1.1%), Music for children (1.1%), Lullaby (0.8%)

AudioSetWav16k/eval_segments/3tUlhM80ObM_0.000.wav


Knock (93.8%), Music (1.0%), Silence (0.8%), Sound effect (0.4%), Musical instrument (0.3%)

AudioSetWav16k/eval_segments/-1nilez17Dg_30.000.wav


Speech (45.9%), Heart sounds, heartbeat (24.5%), Heart murmur (14.4%), Music (2.0%), Throbbing (1.8%)
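
`show_topk` applies softmax to the model output, so the printed percentages compete with each other and sum to 100% over all 527 classes. The following is a minimal pure-Python sketch of that softmax-then-top-k step. The helpers `softmax` and `topk` and the logit values are made up for illustration and are not taken from the model.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def topk(probs, k):
    """Return (index, probability) pairs for the k largest probabilities."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(i, probs[i]) for i in order[:k]]

logits = [2.0, 0.5, -1.0, 3.0]  # made-up scores for four classes
probs = softmax(logits)
for idx, p in topk(probs, k=2):
    print(f"class {idx}: {p * 100:.1f}%")
```

Because the scores sum to one, classes that are present at the same time share the probability mass.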



### Audio tagging with sliding window

The following shows how the predicted tags change over time by classifying a sliding window over the clip, much like sound event detection.
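
The window layout can be sketched on its own, without a model. The snippet below is a minimal illustration that assumes a 16 kHz sample rate (suggested by the `AudioSetWav16k` folder name) and uses a hypothetical helper `window_bounds`. It lists the sample range each window covers when a 2-second window starts every second.

```python
def window_bounds(num_samples, sr, hop=1.0, duration=2.0):
    """Return (start, end) sample indices of each window, clipped to the clip length."""
    bounds = []
    start_sec = 0.0
    while start_sec * sr < num_samples:
        start = int(start_sec * sr)
        end = min(int((start_sec + duration) * sr), num_samples)
        bounds.append((start, end))
        start_sec += hop
    return bounds

# A hypothetical 3.5-second clip at an assumed 16 kHz sample rate.
for start, end in window_bounds(num_samples=56000, sr=16000):
    print(start, end, f"{(end - start) / 16000:.1f}s")
```

Windows near the end of the clip come out shorter than the rest, so they need padding before they can be stacked into one batch.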

In [7]:
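# Tiles a waveform until it has at least `min_duration` samples, then crops it to exactly that length.
def repeat_if_short(w, min_duration=48000):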
    while w.shape[-1] < min_duration:
        w = np.concatenate([w, w], axis=-1)
    return w[..., :min_duration]

def show_topk_sliding_window(classes, m2d, wav_file, k=5, hop=1, duration=2.):
    print(wav_file)
    # Loads and shows an audio clip.
    wav, sr = librosa.load(wav_file, mono=True, sr=m2d.cfg.sample_rate)
    display(Audio(wav, rate=sr))
    # Crops the clip into segments of `duration` seconds, one starting every `hop` seconds.
    wavs = [wav[int(c * sr) : int((c + duration) * sr)] for c in np.arange(0, wav.shape[-1] / sr, hop)]
    # Tiles each segment to a common length so that the segments stack into one batch.
    wavs = [repeat_if_short(w) for w in wavs]
    wavs = torch.tensor(np.stack(wavs))
    # Predicts class probabilities for the batch segments.
    with torch.no_grad():
        probs_per_chunk = m2d(wavs).softmax(1)
    # Shows the top-k prediction results.
    for i, probs in enumerate(probs_per_chunk):
        topk_values, topk_indices = probs.topk(k=k)
        sec = f'{i * hop:g}s '
        print(sec, ', '.join([f'{classes.loc[j].display_name} ({v*100:.1f}%)' for j, v in zip(topk_indices.numpy(), topk_values.numpy())]))
    print()

for fn in files:
    show_topk_sliding_window(classes, model, fn)

AudioSetWav16k/eval_segments/5hlsVoxJPNI_30.000.wav


0s  Music (53.8%), Lullaby (7.6%), Female singing (6.6%), Singing (2.5%), Music for children (2.3%)
1s  Music (61.1%), Music for children (7.3%), Electronic music (5.0%), A capella (2.7%), Female singing (2.2%)
2s  Music (74.2%), Lullaby (9.3%), Music for children (4.3%), Singing (1.5%), Electronic music (0.4%)
3s  Music (68.1%), Music for children (4.6%), Humming (4.2%), Lullaby (3.6%), Singing (2.4%)
4s  Speech (89.5%), Female speech, woman speaking (1.2%), Narration, monologue (1.0%), Ping (0.9%), Busy signal (0.8%)
5s  Speech (60.2%), Female speech, woman speaking (26.5%), Narration, monologue (7.7%), Speech synthesizer (1.6%), Conversation (0.5%)
6s  Speech (81.9%), Female speech, woman speaking (7.9%), Speech synthesizer (1.7%), Narration, monologue (1.3%), Music (0.4%)
7s  Silence (14.6%), Owl (7.2%), Busy signal (6.8%), Heart sounds, heartbeat (4.3%), Music (4.1%)
8s  Music (74.9%), Silence (3.6%), Tick (1.7%), Tick-tock (1.6%), Violin, fiddle (1.2%)
9s  Music (74.3%), Pizzicat

AudioSetWav16k/eval_segments/3tUlhM80ObM_0.000.wav


0s  Knock (98.4%), Music (0.3%), Sound effect (0.1%), Bouncing (0.1%), Drum machine (0.1%)
1s  Knock (98.3%), Music (0.2%), Silence (0.2%), Sound effect (0.1%), Musical instrument (0.1%)
2s  Silence (96.0%), Music (0.6%), Vehicle (0.2%), Speech (0.2%), Inside, small room (0.1%)
3s  Silence (96.0%), Music (0.6%), Vehicle (0.2%), Speech (0.2%), Inside, small room (0.1%)
4s  Silence (96.0%), Music (0.6%), Vehicle (0.2%), Speech (0.2%), Inside, small room (0.1%)
5s  Knock (99.1%), Silence (0.4%), Music (0.1%), Inside, small room (0.0%), Bouncing (0.0%)
6s  Knock (96.8%), Music (0.5%), Sound effect (0.4%), Drum machine (0.2%), Bouncing (0.2%)
7s  Heart murmur (55.9%), Silence (9.3%), Heart sounds, heartbeat (8.0%), Music (3.3%), Sound effect (1.3%)

AudioSetWav16k/eval_segments/-1nilez17Dg_30.000.wav


0s  Heart sounds, heartbeat (78.8%), Heart murmur (13.3%), Throbbing (3.5%), Hum (1.2%), Pulse (0.5%)
1s  Heart sounds, heartbeat (72.5%), Heart murmur (13.3%), Throbbing (4.1%), Hum (1.9%), Speech (1.3%)
2s  Speech (50.4%), Heart sounds, heartbeat (26.6%), Heart murmur (15.1%), Throbbing (0.8%), Music (0.7%)
3s  Speech (79.2%), Female speech, woman speaking (11.7%), Narration, monologue (2.7%), Speech synthesizer (0.8%), Conversation (0.7%)
4s  Speech (75.0%), Female speech, woman speaking (16.6%), Narration, monologue (3.3%), Speech synthesizer (0.5%), Male speech, man speaking (0.3%)
5s  Speech (72.2%), Female speech, woman speaking (14.2%), Narration, monologue (8.9%), Speech synthesizer (1.2%), Male speech, man speaking (0.5%)
6s  Speech (73.4%), Female speech, woman speaking (16.0%), Narration, monologue (2.9%), Speech synthesizer (0.6%), Tick (0.3%)
7s  Speech (71.1%), Female speech, woman speaking (21.9%), Narration, monologue (2.4%), Speech synthesizer (1.0%), Conversation (0.
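
The padding step in `repeat_if_short` tiles a short segment and then crops it, so every segment in the batch ends up with the same number of samples. The following is a minimal list-based sketch of the same idea on toy data. `repeat_to_length` is a hypothetical stand-in written for illustration, with an added guard against empty input.

```python
def repeat_to_length(samples, min_length):
    """Tile `samples` until it has at least `min_length` items, then crop to exactly that length."""
    if not samples:
        raise ValueError("cannot repeat an empty sequence")
    out = list(samples)
    while len(out) < min_length:
        out = out + out  # doubles the length on each pass
    return out[:min_length]

print(repeat_to_length([1, 2, 3], 8))     # [1, 2, 3, 1, 2, 3, 1, 2]
print(repeat_to_length([1, 2, 3, 4], 2))  # [1, 2]
```

Tiling keeps the padded region acoustically similar to the segment itself, whereas zero-padding would add silence.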