# M2D-CLAP Example -- ESC-50 zero-shot classification

This example covers the CLAP part of our Interspeech 2024 paper and demonstrates zero-shot classification on the [ESC-50](https://github.com/karolpiczak/ESC-50) dataset.

This example reproduces the result reported in the paper, yet the inference code consists of simple calculations and does not rely on a high-level package.

```bibtex
@article{niizumi2024m2d-clap,
    title   = {{M2D-CLAP: Masked Modeling Duo Meets CLAP for Learning General-purpose Audio-Language Representation}},
    author  = {Daisuke Niizumi and Daiki Takeuchi and Yasunori Ohishi and Noboru Harada and Masahiro Yasuda and Shunsuke Tsubaki and Keisuke Imoto},
    journal = {to appear at Interspeech},
    year    = {2024},
    url     = {https://arxiv.org/abs/2406.02032}}
```

In [1]:
# The code depends on these external modules.
! pip install timm einops nnAudio librosa >& /dev/null

import warnings; warnings.simplefilter('ignore')
import logging; logging.basicConfig(level=logging.INFO)
import numpy as np
import pandas as pd
from pathlib import Path
import torch
import zipfile
import librosa

In [2]:
# Downloads the ESC-50 dataset.
! git clone https://github.com/karolpiczak/ESC-50.git

fatal: destination path 'ESC-50' already exists and is not an empty directory.


In [3]:
meta = pd.read_csv('ESC-50/meta/esc50.csv')
meta

Unnamed: 0,filename,fold,target,category,esc10,src_file,take
0,1-100032-A-0.wav,1,0,dog,True,100032,A
1,1-100038-A-14.wav,1,14,chirping_birds,False,100038,A
2,1-100210-A-36.wav,1,36,vacuum_cleaner,False,100210,A
3,1-100210-B-36.wav,1,36,vacuum_cleaner,False,100210,B
4,1-101296-A-19.wav,1,19,thunderstorm,False,101296,A
...,...,...,...,...,...,...,...
1995,5-263831-B-6.wav,5,6,hen,False,263831,B
1996,5-263902-A-36.wav,5,36,vacuum_cleaner,False,263902,A
1997,5-51149-A-25.wav,5,25,footsteps,False,51149,A
1998,5-61635-A-8.wav,5,8,sheep,False,61635,A


In [4]:
classes = meta.category.unique().tolist()
classes[:5]

['dog', 'chirping_birds', 'vacuum_cleaner', 'thunderstorm', 'door_wood_knock']
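
The class list above keeps categories in order of first appearance, so its indices differ from the `target` column of the CSV (for example, `chirping_birds` is index 1 here but has target 14). The following is a minimal sketch of that ordering and of a name-to-index lookup in plain Python. It uses a few hypothetical rows rather than the real metadata, and the names `categories`, `classes_demo`, and `class_to_index` are illustrative only.

```python
# Hypothetical excerpt of the `category` column (not the full ESC-50 metadata).
categories = ['dog', 'chirping_birds', 'vacuum_cleaner',
              'vacuum_cleaner', 'thunderstorm', 'dog']

# dict.fromkeys keeps first-appearance order, like pandas' Series.unique().
classes_demo = list(dict.fromkeys(categories))

# Name-to-index lookup: constant time per row instead of list.index's linear scan.
class_to_index = {name: i for i, name in enumerate(classes_demo)}

# Ground-truth indices expressed in the order of `classes_demo`.
labels = [class_to_index[c] for c in categories]

print(classes_demo)
print(labels)
```

Whichever ordering is used, the text embeddings and the ground-truth indices need to share it.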

## Download M2D
- portable_m2d.py -- A portable loader with no dependence on other files from the M2D repository.
- m2d_clap_vit_base-80x1001p16x16p16kpBpTI-2025.zip -- The M2D-CLAP weight file used in this example.

In [5]:
! wget https://raw.githubusercontent.com/nttcslab/m2d/master/examples/portable_m2d.py
! wget https://github.com/nttcslab/m2d/releases/download/v0.5.0/m2d_clap_vit_base-80x1001p16x16p16kpBpTI-2025.zip

with zipfile.ZipFile("m2d_clap_vit_base-80x1001p16x16p16kpBpTI-2025.zip", "r") as zip_ref:
    zip_ref.extractall(".")
! find m2d_clap_vit_base-80x1001p16x16p16kpBpTI-2025 -name '*.pth'


## Create model

A few lines of code get the model ready for classification.

In [6]:
from portable_m2d import PortableM2D
# Use flat_features=True for CLAP features only. For conventional audio features, flat_features should be False.
model = PortableM2D(weight_file='m2d_clap_vit_base-80x1001p16x16p16kpBpTI-2025/checkpoint-30.pth', flat_features=True)
model = model.to('cuda')
model.eval();

 using 166 parameters from m2d_clap_vit_base-80x1001p16x16p16kpBpTI-2025/checkpoint-30.pth
 (included audio_proj params: ['audio_proj.sem_token', 'audio_proj.sem_blocks.0.norm1.weight', 'audio_proj.sem_blocks.0.norm1.bias', 'audio_proj.sem_blocks.0.attn.qkv.weight', 'audio_proj.sem_blocks.0.attn.qkv.bias']
 (included text_proj params: []
 (dropped: [] )
<All keys matched successfully>


## Get text embeddings for classes

In [7]:
with torch.no_grad():
    class_text_embs = [model.encode_clap_text(f'{" ".join(c.split("_"))} can be heard') for c in classes]
class_text_embs = torch.vstack(class_text_embs).to('cpu')
class_text_embs.shape

 using model.text_encoder from m2d_clap_vit_base-80x1001p16x16p16kpBpTI-2025/checkpoint-30.pth


torch.Size([50, 768])
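
The text prompt for each class is built by replacing underscores with spaces and appending "can be heard". A small sketch of that string transformation on its own, with no model involved, is shown here. The helper name `class_to_prompt` is hypothetical and not part of `portable_m2d.py`.

```python
def class_to_prompt(name: str) -> str:
    """Turn an ESC-50 category name into the caption format used in the cell above."""
    return f'{" ".join(name.split("_"))} can be heard'


# Category names taken from the class list shown earlier.
for name in ['dog', 'chirping_birds', 'door_wood_knock']:
    print(class_to_prompt(name))
```

Changing the prompt template changes the text embeddings and can therefore change the zero-shot accuracy.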

## Get audio embeddings for samples

In [8]:
audio_embs = []
with torch.no_grad():
    for f in meta.filename.values:
        wav = librosa.load(f'ESC-50/audio/{f}', mono=True, sr=model.cfg.sample_rate)[0]
        wav = torch.tensor(wav).unsqueeze(0).to('cuda')
        audio_embs.append(model.encode_clap_audio(wav))
audio_embs = torch.vstack(audio_embs).to('cpu')
audio_embs.shape

torch.Size([2000, 768])

## Inference

In [9]:
# Calculate cosine similarity.
audio_embs = audio_embs / torch.norm(audio_embs, dim=-1, keepdim=True)
class_text_embs = class_text_embs / torch.norm(class_text_embs, dim=-1, keepdim=True)
similarities = class_text_embs @ audio_embs.T

# Prediction results.
preds = similarities.argmax(0).numpy()

# Ground truth labels.
GT = meta.category.apply(lambda x: classes.index(x)).values  # Convert meta.category into the class index.

print('Accuracy:', (preds == GT).mean())

Accuracy: 0.943


This matches the 94.3% accuracy reported in our paper.
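
The inference step is L2 normalization, a matrix product, and an argmax over classes. It can be checked in isolation with the sketch below, which uses NumPy on toy vectors. The embeddings and labels are made up for illustration and are unrelated to M2D-CLAP outputs, and the names `l2_normalize`, `toy_text`, `toy_audio`, and `toy_labels` are assumptions of this sketch.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit length so that dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


# Toy data: 3 class (text) embeddings and 4 sample (audio) embeddings, 4 dimensions each.
toy_text = np.array([[1.0, 0.0, 0.0, 0.0],
                     [0.0, 2.0, 0.0, 0.0],
                     [0.0, 0.0, 3.0, 0.0]])
toy_audio = np.array([[0.9, 0.1, 0.0, 0.2],
                      [0.1, 0.1, 5.0, 0.0],
                      [0.2, 0.7, 0.1, 0.1],
                      [0.5, 0.4, 0.0, 0.0]])
toy_labels = np.array([0, 2, 1, 1])

# Rows index classes and columns index samples, as in the cell above.
sims = l2_normalize(toy_text) @ l2_normalize(toy_audio).T  # shape (3, 4)

# For each sample, pick the class whose text embedding is most similar.
preds = sims.argmax(axis=0)
accuracy = (preds == toy_labels).mean()

print('Predictions:', preds.tolist())  # [0, 2, 1, 0]
print('Accuracy:', accuracy)           # 0.75
```

Because both sides are normalized, the vector lengths have no effect on the prediction. Only the direction of each embedding matters.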

---



Note: the results in the paper were obtained with [EVAR](https://github.com/nttcslab/eval-audio-repr), not with this notebook.
