Acoustic Prompting via Stage-wise Modulation for Few-Shot Learning in Audio Language Models (ASPL)

Hyebin Cho, Jaehyuk Jang, Changick Kim, Joon Son Chung


This repository implements our method, ASPL (Audio-side Prompt Learning), for few-shot audio classification. ASPL is designed as a plug-in module that can be integrated into CoOp, CoCoOp, and PALM, and is built on top of the PENGI audio-language model.

Overview

This repository contains:

ASPL (Audio-side Prompt Learning) for few-shot audio classification
audio-side prompting with ASPL
stage-wise audio prompting with ASPL+
training scripts for PALM, CoOp, and CoCoOp variants on multiple audio classification datasets

Installation

Create a conda environment.

conda create --name aspl python=3.8
conda activate aspl

Install dependencies.

git clone https://github.com/hyebin-c/aspl
cd aspl
pip install -r requirements.txt

Model

All experiments use PENGI as the underlying audio-language model.

Download the pre-trained PENGI checkpoint and place it in pengi/configs.

Model	Link	Size
PENGI	Download	2.2 GB

You can also download it with:

wget https://zenodo.org/records/8387083/files/base.pth
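Before launching training, it can help to confirm that the checkpoint sits where the code expects it. The following is a minimal sketch, not part of the repository: the helper names are hypothetical, and it assumes the checkpoint keeps the file name from the download URL (base.pth).

```python
from pathlib import Path

# Assumption: the checkpoint keeps the file name from the download URL.
CHECKPOINT_NAME = "base.pth"


def checkpoint_path(repo_root="."):
    """Return the expected checkpoint location under pengi/configs."""
    return Path(repo_root) / "pengi" / "configs" / CHECKPOINT_NAME


def checkpoint_is_present(repo_root="."):
    """True if the checkpoint file exists and is not empty."""
    path = checkpoint_path(repo_root)
    return path.is_file() and path.stat().st_size > 0
```

Calling checkpoint_is_present() from the repository root returns False if the file is missing, for instance because wget saved it to a different directory.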

Datasets

We keep the dataset preparation pipeline from the original PALM setup. Instructions for downloading and processing datasets are provided in DATASETS.md. A Jupyter notebook for downloading datasets is also provided at media/DownloadAudioDatasets.ipynb.

Dataset	Type	Classes	Size	Link
Beijing-Opera	Instrument Classification	4	69 MB	Instructions
CREMA-D	Emotion Recognition	6	606 MB	Instructions
ESC50	Sound Event Classification	50	881 MB	Instructions
ESC50-Actions	Sound Event Classification	10	881 MB	Instructions
GT-Music-Genre	Music Analysis	10	1.3 GB	Instructions
NS-Instruments	Instrument Classification	10	18.5 GB	Instructions
RAVDESS	Emotion Recognition	8	1.1 GB	Instructions
SESA	Surveillance Sound Classification	4	70 MB	Instructions
TUT2017	Acoustic Scene Classification	15	12.3 GB	Instructions
UrbanSound8K	Sound Event Classification	10	6.8 GB	Instructions
VocalSound	Vocal Sound Classification	6	8.2 GB	Instructions

Place all datasets in a single directory named Audio-Datasets, and set DATASET_ROOT in the shell scripts under scripts to the path of that directory.

Expected directory structure:

Audio-Datasets/
    ├── Beijing-Opera/
    ├── CREMA-D/
    ├── ESC50/
    ├── ESC50-Actions/
    ├── GT-Music-Genre/
    ├── NS-Instruments/
    ├── RAVDESS/
    ├── SESA/
    ├── TUT2017/
    ├── UrbanSound8K/
    └── VocalSound/

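The layout above can be checked before a long run. This is a minimal sketch under the assumption that each dataset only needs to exist as a sub-directory with the name shown; the function name is hypothetical, and the contents of each folder (see DATASETS.md) are not validated.

```python
from pathlib import Path

# Folder names taken from the expected directory structure above.
EXPECTED_DATASETS = [
    "Beijing-Opera",
    "CREMA-D",
    "ESC50",
    "ESC50-Actions",
    "GT-Music-Genre",
    "NS-Instruments",
    "RAVDESS",
    "SESA",
    "TUT2017",
    "UrbanSound8K",
    "VocalSound",
]


def missing_datasets(dataset_root):
    """Return the expected dataset folders that are absent under dataset_root."""
    root = Path(dataset_root)
    return [name for name in EXPECTED_DATASETS if not (root / name).is_dir()]
```

For example, missing_datasets("/path/to/Audio-Datasets") returns an empty list when all eleven folders are in place.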
Code Structure

There are three main folders in this repository.

pengi: PENGI-based model components and audio encoder code
palm: PALM, CoOp, CoCoOp, and ASPL/ASPL+ model implementations
utils: dataset loading, training, evaluation, and logging utilities

Run Experiments

The current release focuses on the ASPL and ASPL+ settings. Each launcher script takes a single argument that selects the variant:

ASPL: pass 1
ASPL+: pass 2

PALM

bash scripts/run_all_datasets_palm.sh 1
bash scripts/run_all_datasets_palm.sh 2

CoOp

bash scripts/run_all_datasets_coop.sh 1
bash scripts/run_all_datasets_coop.sh 2

CoCoOp

bash scripts/run_all_datasets_cocoop.sh 1
bash scripts/run_all_datasets_cocoop.sh 2

The launcher scripts currently fix the following settings:

LR=0.01
EPOCH=100
SHOT=16

By default, the launcher scripts use CUDA_VISIBLE_DEVICES=0. To run on a different GPU, set the variable at launch time, replacing 0 in the commands below with the desired GPU index:

CUDA_VISIBLE_DEVICES=0 bash scripts/run_all_datasets_palm.sh 1
CUDA_VISIBLE_DEVICES=0 bash scripts/run_all_datasets_coop.sh 2
CUDA_VISIBLE_DEVICES=0 bash scripts/run_all_datasets_cocoop.sh 2
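To drive several runs from Python instead of typing each command, the launch commands can be assembled programmatically. This is a minimal sketch: build_launch and the two lookup tables are hypothetical helpers, not part of the repository. It relies only on the script names, the 1/2 variant argument, and the CUDA_VISIBLE_DEVICES override described in this section.

```python
import os

# Launcher scripts named in this README, keyed by baseline method.
LAUNCHERS = {
    "palm": "scripts/run_all_datasets_palm.sh",
    "coop": "scripts/run_all_datasets_coop.sh",
    "cocoop": "scripts/run_all_datasets_cocoop.sh",
}

# Script argument per variant: 1 selects ASPL, 2 selects ASPL+.
VARIANT_ARG = {"ASPL": "1", "ASPL+": "2"}


def build_launch(method, variant, gpu=0):
    """Return (command, env) for one launcher run; nothing is executed here."""
    command = ["bash", LAUNCHERS[method], VARIANT_ARG[variant]]
    # Copy the current environment and override only the GPU selection.
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu))
    return command, env
```

The returned pair can be passed to subprocess.run(command, env=env, check=True) from the repository root.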

Logs are saved under logs using directories such as:

logs/palm_aspl1_16
logs/palm_aspl2_16
logs/coop_aspl1_16
logs/cocoop_aspl2_16
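The example names suggest a pattern of method, variant, and shot count. The sketch below encodes that pattern so that scripts collecting results can locate a run's logs. It is inferred from the four examples only: the assumption that the trailing number is the SHOT setting, and the helper itself, are not confirmed by the repository.

```python
from pathlib import Path


def log_dir(method, variant_arg, shot=16, root="logs"):
    """Build a log directory path such as logs/palm_aspl1_16.

    Assumptions inferred from the example names: variant_arg is the
    1/2 argument passed to the launcher, and the suffix is SHOT.
    """
    return Path(root) / f"{method}_aspl{variant_arg}_{shot}"
```

For instance, log_dir("coop", 1) points at logs/coop_aspl1_16.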

Citation

Citation information will be updated here.

Contact

If you have any questions or feedback, feel free to reach out at hyebin.cho@kaist.ac.kr.

License

This project is licensed under the MIT License. See LICENSE for more details.

Acknowledgement

We use PENGI for model instantiation. This repository builds on the original PALM codebase, which already includes prompt learning implementations adapted from CoOp and CoCoOp.
