
hyebin-c/aspl


Acoustic Prompting via Stage-wise Modulation for Few-Shot Learning in Audio Language Models (ASPL)

INTERSPEECH 2026 | arXiv (coming soon) | GitHub repository

Hyebin Cho, Jaehyuk Jang, Changick Kim, Joon Son Chung


[Figure: main figure]

This repository implements our method, ASPL (Audio-side Prompt Learning), for few-shot audio classification. ASPL is designed as a plug-in module that can be integrated into CoOp, CoCoOp, and PALM, and is built on top of the PENGI audio-language model.


Overview

This repository contains:

  • ASPL (Audio-side Prompt Learning) for few-shot audio classification
  • audio-side prompting with ASPL
  • stage-wise audio prompting with ASPL+
  • training scripts for PALM, CoOp, and CoCoOp variants on multiple audio classification datasets

Installation

  1. Create a conda environment.
conda create --name aspl python=3.8
conda activate aspl
  2. Install dependencies.
git clone https://github.com/hyebin-c/aspl
cd aspl
pip install -r requirements.txt

All experiments use PENGI as the underlying audio-language model.

Download the pre-trained PENGI checkpoint and place it in pengi/configs.

Model | Link | Size
------|------|-------
PENGI | Download | 2.2 GB

You can also download it with:

wget https://zenodo.org/records/8387083/files/base.pth
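The download and placement steps above can be combined. The following is a minimal sketch, assuming the checkpoint keeps its original file name (base.pth) inside pengi/configs; `CKPT_DIR` and `fetch_checkpoint` are illustrative names and are not part of this repository.

```shell
# Sketch: fetch the PENGI checkpoint into pengi/configs, skipping the
# download when the file is already there.
# CKPT_DIR and fetch_checkpoint are hypothetical names used for illustration.
CKPT_URL="https://zenodo.org/records/8387083/files/base.pth"
CKPT_DIR="${CKPT_DIR:-pengi/configs}"

fetch_checkpoint() {
    local target="$CKPT_DIR/$(basename "$CKPT_URL")"
    if [ -f "$target" ]; then
        echo "Checkpoint already present: $target"
        return 0
    fi
    mkdir -p "$CKPT_DIR"
    # -P tells wget which directory to save the file into.
    wget -P "$CKPT_DIR" "$CKPT_URL"
}
```

Run `fetch_checkpoint` from the repository root. The existence check is deliberately simple: a partial file left by an interrupted download would be treated as present, so delete it before retrying.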

We keep the dataset preparation pipeline from the original PALM setup. Instructions for downloading and processing datasets are provided in DATASETS.md. A Jupyter notebook for downloading datasets is also provided at media/DownloadAudioDatasets.ipynb.

Dataset | Type | Classes | Size | Link
--------|------|---------|------|-----
Beijing-Opera | Instrument Classification | 4 | 69 MB | Instructions
CREMA-D | Emotion Recognition | 6 | 606 MB | Instructions
ESC50 | Sound Event Classification | 50 | 881 MB | Instructions
ESC50-Actions | Sound Event Classification | 10 | 881 MB | Instructions
GT-Music-Genre | Music Analysis | 10 | 1.3 GB | Instructions
NS-Instruments | Instrument Classification | 10 | 18.5 GB | Instructions
RAVDESS | Emotion Recognition | 8 | 1.1 GB | Instructions
SESA | Surveillance Sound Classification | 4 | 70 MB | Instructions
TUT2017 | Acoustic Scene Classification | 15 | 12.3 GB | Instructions
UrbanSound8K | Sound Event Classification | 10 | 6.8 GB | Instructions
VocalSound | Vocal Sound Classification | 6 | 8.2 GB | Instructions

Place all datasets in a directory named Audio-Datasets, and set DATASET_ROOT in the shell scripts under scripts to the path of that directory.

Expected directory structure:

Audio-Datasets/
    ├── Beijing-Opera/
    ├── CREMA-D/
    ├── ESC50/
    ├── ESC50-Actions/
    ├── GT-Music-Genre/
    ├── NS-Instruments/
    ├── RAVDESS/
    ├── SESA/
    ├── TUT2017/
    ├── UrbanSound8K/
    └── VocalSound/
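Before launching a long run, it can help to confirm that the layout above is in place. The following is a rough sketch, not a script shipped with this repository: `check_datasets` is a hypothetical helper, and it only checks that each top-level folder exists, not that its contents were processed correctly.

```shell
# Sketch: verify that every expected dataset folder exists under a root path.
# check_datasets is a hypothetical helper; DATASET_ROOT mirrors the variable
# name used in the launcher scripts.
DATASET_ROOT="${DATASET_ROOT:-Audio-Datasets}"
DATASETS="Beijing-Opera CREMA-D ESC50 ESC50-Actions GT-Music-Genre NS-Instruments RAVDESS SESA TUT2017 UrbanSound8K VocalSound"

check_datasets() {
    local root="$1" missing=0 name
    for name in $DATASETS; do
        if [ ! -d "$root/$name" ]; then
            echo "missing: $root/$name"
            missing=$((missing + 1))
        fi
    done
    echo "$missing dataset folder(s) missing under $root"
    # Succeed only when nothing is missing.
    [ "$missing" -eq 0 ]
}
```

Calling `check_datasets "$DATASET_ROOT"` lists any missing folder and returns a non-zero status if at least one is absent, so it can gate a launcher call with `&&`.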

There are three main folders in this repository.

  • pengi: PENGI-based model components and audio encoder code
  • palm: PALM, CoOp, CoCoOp, and ASPL/ASPL+ model implementations
  • utils: dataset loading, training, evaluation, and logging utilities

The current release focuses on the ASPL and ASPL+ settings. Select the setting through the first argument of the launcher scripts:

  • ASPL: pass 1
  • ASPL+: pass 2

PALM

bash scripts/run_all_datasets_palm.sh 1
bash scripts/run_all_datasets_palm.sh 2

CoOp

bash scripts/run_all_datasets_coop.sh 1
bash scripts/run_all_datasets_coop.sh 2

CoCoOp

bash scripts/run_all_datasets_cocoop.sh 1
bash scripts/run_all_datasets_cocoop.sh 2
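The six commands above follow one pattern (method name in the script path, setting as the argument), so they can be driven by a loop. The following is a sketch only: `run_all` and the `DRY_RUN` switch are hypothetical additions for illustration, while the script paths are the ones listed above.

```shell
# Sketch: run every method (PALM, CoOp, CoCoOp) with both settings.
# run_all and DRY_RUN are hypothetical; set DRY_RUN=1 to print the commands
# instead of launching them.
run_all() {
    local method pass
    for method in palm coop cocoop; do
        for pass in 1 2; do   # 1 = ASPL, 2 = ASPL+
            if [ "${DRY_RUN:-0}" = "1" ]; then
                echo "bash scripts/run_all_datasets_${method}.sh ${pass}"
            else
                # Stop at the first launcher that fails.
                bash "scripts/run_all_datasets_${method}.sh" "$pass" || return 1
            fi
        done
    done
}
```

Setting `DRY_RUN=1` and calling `run_all` from the repository root prints the six launcher commands in order, which is a cheap way to confirm the sweep before committing GPU time.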

The launcher scripts currently fix the following settings:

  • LR=0.01
  • EPOCH=100
  • SHOT=16

By default, the launcher scripts use CUDA_VISIBLE_DEVICES=0. To run on a different GPU, set CUDA_VISIBLE_DEVICES at launch time, replacing 0 in the commands below with the index of the GPU you want:

CUDA_VISIBLE_DEVICES=0 bash scripts/run_all_datasets_palm.sh 1
CUDA_VISIBLE_DEVICES=0 bash scripts/run_all_datasets_coop.sh 2
CUDA_VISIBLE_DEVICES=0 bash scripts/run_all_datasets_cocoop.sh 2

Logs are saved under logs using directories such as:

  • logs/palm_aspl1_16
  • logs/palm_aspl2_16
  • logs/coop_aspl1_16
  • logs/cocoop_aspl2_16
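The example names suggest a `<method>_aspl<setting>_<shot>` pattern. This reading is inferred from the four directories listed above, and treating the trailing number as the shot count (SHOT=16) is an assumption. Under that assumption, a small hypothetical helper such as `log_dir` can build the path for a given run:

```shell
# Sketch: compose the log directory for a run.
# log_dir is a hypothetical helper; the naming pattern is inferred from the
# example directories, and the shot suffix defaulting to 16 is an assumption.
log_dir() {
    local method="$1" pass="$2" shot="${3:-16}"
    echo "logs/${method}_aspl${pass}_${shot}"
}
```

For instance, `ls "$(log_dir coop 1)"` would list the logs of the CoOp run with the ASPL setting, provided the pattern holds for that run.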

Citation information will be updated here.


If you have any questions or feedback, feel free to reach out at hyebin.cho@kaist.ac.kr.


This project is licensed under the MIT License. See LICENSE for more details.


We use PENGI for model instantiation. This repository builds on the original PALM codebase, which already includes prompt learning implementations adapted from CoOp and CoCoOp.

About

Official implementation of "Acoustic Prompting via Stage-wise Modulation for Few-Shot Learning in Audio Language Models", INTERSPEECH 2026.
