Hyebin Cho, Jaehyuk Jang, Changick Kim, Joon Son Chung
This repository contains:
- ASPL (Audio-side Prompt Learning) for few-shot audio classification
- audio-side prompting with ASPL
- stage-wise audio prompting with ASPL+
- training scripts for PALM, CoOp, and CoCoOp variants on multiple audio classification datasets
- Create a conda environment.
conda create --name aspl python=3.8
conda activate aspl
- Install dependencies.
git clone https://github.com/hyebin-c/aspl
cd aspl
pip install -r requirements.txt

All experiments use PENGI as the underlying audio-language model.
Download the pre-trained PENGI checkpoint and place it in pengi/configs.
| Model | Link | Size |
|---|---|---|
| PENGI | Download | 2.2 GB |
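Once the checkpoint is downloaded, a quick check can confirm it sits where the code expects it. The sketch below is illustrative and not part of the repository: the helper names are hypothetical, and it assumes the file keeps the name `base.pth` from the Zenodo release and that the script is run from the repository root.

```python
from pathlib import Path

# Assumption: the checkpoint keeps the file name used by the Zenodo release.
CHECKPOINT_NAME = "base.pth"


def checkpoint_path(repo_root="."):
    """Return the location where the PENGI checkpoint is expected (hypothetical helper)."""
    return Path(repo_root) / "pengi" / "configs" / CHECKPOINT_NAME


def checkpoint_ready(repo_root="."):
    """Return True if the checkpoint file exists and is not empty."""
    path = checkpoint_path(repo_root)
    return path.is_file() and path.stat().st_size > 0


if __name__ == "__main__":
    status = "found" if checkpoint_ready() else "missing"
    print(f"{checkpoint_path()}: {status}")
```

This only checks presence and non-zero size; it does not validate the file contents.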
You can also download it with:
wget https://zenodo.org/records/8387083/files/base.pth

We keep the dataset preparation pipeline from the original PALM setup. Instructions for downloading and processing datasets are provided in DATASETS.md. A Jupyter notebook for downloading datasets is also provided at media/DownloadAudioDatasets.ipynb.
| Dataset | Type | Classes | Size | Link |
|---|---|---|---|---|
| Beijing-Opera | Instrument Classification | 4 | 69 MB | Instructions |
| CREMA-D | Emotion Recognition | 6 | 606 MB | Instructions |
| ESC50 | Sound Event Classification | 50 | 881 MB | Instructions |
| ESC50-Actions | Sound Event Classification | 10 | 881 MB | Instructions |
| GT-Music-Genre | Music Analysis | 10 | 1.3 GB | Instructions |
| NS-Instruments | Instrument Classification | 10 | 18.5 GB | Instructions |
| RAVDESS | Emotion Recognition | 8 | 1.1 GB | Instructions |
| SESA | Surveillance Sound Classification | 4 | 70 MB | Instructions |
| TUT2017 | Acoustic Scene Classification | 15 | 12.3 GB | Instructions |
| UrbanSound8K | Sound Event Classification | 10 | 6.8 GB | Instructions |
| VocalSound | Vocal Sound Classification | 6 | 8.2 GB | Instructions |
All datasets should be placed in a directory named Audio-Datasets, and the path should be configured through DATASET_ROOT in the shell scripts under scripts.
Expected directory structure:
Audio-Datasets/
├── Beijing-Opera/
├── CREMA-D/
├── ESC50/
├── ESC50-Actions/
├── GT-Music-Genre/
├── NS-Instruments/
├── RAVDESS/
├── SESA/
├── TUT2017/
├── UrbanSound8K/
└── VocalSound/
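Before launching training, it can help to confirm that the layout above is complete. The following is a minimal sketch, not a script shipped with the repository: `missing_datasets` is a hypothetical helper, and reading `DATASET_ROOT` from the environment is an assumption (the repository configures it inside the shell scripts).

```python
import os
from pathlib import Path

# Directory names taken from the expected structure listed above.
EXPECTED_DATASETS = [
    "Beijing-Opera",
    "CREMA-D",
    "ESC50",
    "ESC50-Actions",
    "GT-Music-Genre",
    "NS-Instruments",
    "RAVDESS",
    "SESA",
    "TUT2017",
    "UrbanSound8K",
    "VocalSound",
]


def missing_datasets(dataset_root):
    """Return the expected dataset folders that are absent under dataset_root."""
    root = Path(dataset_root)
    return [name for name in EXPECTED_DATASETS if not (root / name).is_dir()]


if __name__ == "__main__":
    # Assumption: fall back to ./Audio-Datasets when DATASET_ROOT is not exported.
    root = os.environ.get("DATASET_ROOT", "Audio-Datasets")
    missing = missing_datasets(root)
    if missing:
        print("Missing dataset folders:", ", ".join(missing))
    else:
        print("All expected dataset folders are present under", root)
```

The check only looks at top-level folder names; it does not verify the files inside each dataset.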
There are three main folders in this repository.
- pengi: PENGI-based model components and audio encoder code
- palm: PALM, CoOp, CoCoOp, and ASPL/ASPL+ model implementations
- utils: dataset loading, training, evaluation, and logging utilities
The current release focuses on the ASPL and ASPL+ settings.
- ASPL: pass1
- ASPL+: pass2
bash scripts/run_all_datasets_palm.sh 1
bash scripts/run_all_datasets_palm.sh 2

bash scripts/run_all_datasets_coop.sh 1
bash scripts/run_all_datasets_coop.sh 2

bash scripts/run_all_datasets_cocoop.sh 1
bash scripts/run_all_datasets_cocoop.sh 2

The launcher scripts currently fix the following settings:
- LR=0.01
- EPOCH=100
- SHOT=16
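The six launcher invocations follow one pattern: a method name in the script path and a pass number as the only argument. The sketch below enumerates them; it is illustrative only, `launcher_commands` is a hypothetical helper, and reading the argument `1` as ASPL and `2` as ASPL+ is an assumption based on the pass mapping above.

```python
METHODS = ("palm", "coop", "cocoop")
PASSES = (1, 2)  # Assumption: 1 selects ASPL, 2 selects ASPL+.


def launcher_commands():
    """Build the argument list for every method/pass combination."""
    return [
        ["bash", f"scripts/run_all_datasets_{method}.sh", str(aspl_pass)]
        for method in METHODS
        for aspl_pass in PASSES
    ]


if __name__ == "__main__":
    # Print the commands rather than running them, so they can be reviewed first.
    for command in launcher_commands():
        print(" ".join(command))
```

Each argument list could be passed to `subprocess.run(command, check=True)` from the repository root to run the experiments in sequence.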
By default, the launcher scripts use CUDA_VISIBLE_DEVICES=0. If you want to run on a different GPU, override it at launch time:
CUDA_VISIBLE_DEVICES=0 bash scripts/run_all_datasets_palm.sh 1
CUDA_VISIBLE_DEVICES=0 bash scripts/run_all_datasets_coop.sh 2
CUDA_VISIBLE_DEVICES=0 bash scripts/run_all_datasets_cocoop.sh 2

Logs are saved under logs using directories such as:
- logs/palm_aspl1_16
- logs/palm_aspl2_16
- logs/coop_aspl1_16
- logs/cocoop_aspl2_16
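The example directory names suggest the pattern `<method>_aspl<pass>_<shot>`. The sketch below reconstructs it so results can be located programmatically; `log_dir` is a hypothetical helper, and reading the trailing `16` as the SHOT setting is an assumption inferred from the examples rather than confirmed by the scripts.

```python
from pathlib import Path


def log_dir(method, aspl_pass, shot=16, root="logs"):
    """Compose a log directory path following the pattern seen in the examples.

    Assumption: the trailing number is the few-shot setting (SHOT).
    """
    return Path(root) / f"{method}_aspl{aspl_pass}_{shot}"


if __name__ == "__main__":
    for method in ("palm", "coop", "cocoop"):
        for aspl_pass in (1, 2):
            print(log_dir(method, aspl_pass).as_posix())
```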
Citation information will be updated here.
If you have any questions or feedback, feel free to reach out at hyebin.cho@kaist.ac.kr.
This project is licensed under the MIT License. See LICENSE for more details.
We use PENGI for model instantiation. This repository builds on the original PALM codebase, which already includes prompt learning implementations adapted from CoOp and CoCoOp.
