Spectral Transcoder: Using Pretrained Urban Sound Classifiers On Arbitrary Spectral Representations
This repo contains code for the paper: Spectral Transcoder: Using Pretrained Urban Sound Classifiers On Undersampled Spectral Representations [1]. It uses the YAMNet [2] and PANN [3] pre-trained models in a teacher-student approach to transcode 125ms third-octave spectral representations into 32ms Mel representations. The results of the paper can be reproduced by following section 2, audio samples can be generated by following section 3, and complementary experiment results are given in section 4.
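For a quick mental model of the approach before diving into the experiments, here is a hedged PyTorch sketch of the teacher-student idea. All shapes, layer sizes, and the MSE-on-logits loss are assumptions chosen for illustration; the actual transcoder architectures and training code are in `exp_train_model/`.

```python
import torch
import torch.nn as nn

class Transcoder(nn.Module):
    """Toy mapping from a 1s third-octave patch (8 frames x 29 bands, 125ms)
    to a 1s Mel patch (101 frames x 64 bands, 32ms). Illustration only."""
    def __init__(self, t_in=8, n_tho=29, t_out=101, n_mel=64):
        super().__init__()
        self.t_out, self.n_mel = t_out, n_mel
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(t_in * n_tho, 512), nn.ReLU(),
            nn.Linear(512, t_out * n_mel),
        )

    def forward(self, x_tho):  # (B, t_in, n_tho)
        return self.net(x_tho).view(-1, self.t_out, self.n_mel)

def cnn_logits_loss(classifier, transcoder, x_tho, mel_true):
    """CNN-logits flavour: the frozen pretrained classifier is the teacher.
    The loss compares its outputs on the transcoded Mels with its outputs
    on the ground-truth Mels (MSE here; the paper's exact loss may differ).
    The CNN-mels flavour would instead compare mel_pred with mel_true."""
    mel_pred = transcoder(x_tho)
    with torch.no_grad():
        teacher_logits = classifier(mel_true)
    return nn.functional.mse_loss(classifier(mel_pred), teacher_logits)
```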
If you would like to use this algorithm in your own project, please refer to the code available at the following link:
👉 https://github.com/modantailleur/fast-to-ears
The codebase is developed with Python 3.9.15. To install the required dependencies, run the following command:
pip install -r requirements.txt
Additionally, download the pretrained models (PANN ResNet38 and EfficientNets) by executing the following command:
python3 download_pretrained_models.py
If you want to replicate the paper results, you can download the required datasets, namely the TAU Urban Acoustic Scenes 2020 Mobile, UrbanSound8k, and SONYC-UST datasets, using the following command:
python3 download_datasets.py
It is crucial to download the datasets using this command, as the names and metadata files of the datasets have been slightly modified (correcting the duration calculation issue in the original dataset). The datasets will be added to the root of the folder. If you want to specify a different path for the datasets (e.g. `mypath`), use the following command:
python3 download_datasets.py --output mypath
Please make sure that you have around 150 GB of free disk space for the datasets and for running the experiments.
To generate third-octave and Mel data from the TAU Urban Acoustic Scenes 2020 Mobile, which will be used to train and evaluate the models in the subsequent sections, execute the following command:
python3 exp_train_model/create_mel_tho_dataset.py
By default, the data will be stored in the root of the GitHub folder. If you prefer to store the data in a different location (e.g. `mypath`), use the following command:
python3 exp_train_model/create_mel_tho_dataset.py --output_path mypath
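As a rough illustration of the two representations involved, the snippet below computes a 32ms Mel spectrogram and a coarse 125ms third-octave-like representation with librosa and numpy. The sample rate, FFT sizes, band layout, and file name are assumptions for illustration; the exact settings used in the paper are those of `exp_train_model/create_mel_tho_dataset.py`.

```python
import librosa
import numpy as np

y, sr = librosa.load("example.wav", sr=32000)  # hypothetical input file

# 32 ms Mel spectrogram (window ~32 ms, hop ~10 ms), PANN-like settings.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                     hop_length=320, n_mels=64)
log_mel = librosa.power_to_db(mel)

# Coarse 125 ms third-octave-like representation: power summed into
# third-octave bands over 125 ms frames (a rough approximation of the
# output of a fast third-octave sound level meter).
hop_125ms = int(0.125 * sr)
spec = np.abs(librosa.stft(y, n_fft=4096, hop_length=hop_125ms)) ** 2
freqs = librosa.fft_frequencies(sr=sr, n_fft=4096)
centers = 1000.0 * 2.0 ** (np.arange(-17, 12) / 3.0)   # ~20 Hz to 12.5 kHz
lo, hi = centers * 2 ** (-1 / 6), centers * 2 ** (1 / 6)
tho = np.stack([spec[(freqs >= l) & (freqs < h)].sum(axis=0)
                for l, h in zip(lo, hi)])               # (29 bands, n_frames)
```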
The experiment plan is developed with doce. To reproduce only the results shown in the paper, use the following command (requires approximately 50h of computation on a single V100 GPU):
python3 exp_train_model/launch_experiment/launch_exp_py.py --exp_type restricted
To reproduce detailed results, with a comparison between different hyperparameters, use the following command (requires approximately 300h of computation on a single V100 GPU):
python3 exp_train_model/launch_experiment/launch_exp_py.py --exp_type detailed
Alternatively, you can launch the experiment using Slurm scripts (to run the code on the Jean Zay server). First, create the Slurm scripts by executing the following command:
python3 exp_train_model/launch_experiment/slurms_create.py
Then, launch the reference experiment using the following command:
python3 exp_train_model/launch_experiment/jean_zay_slurm_reference.slurm
Wait until the reference plan of the experiment is finished. Afterward, you can launch all the Slurm files on different GPUs using the following command (about 20h of computation on multiple V100 GPUs):
python3 exp_train_model/launch_experiment/launch_exp_slurms.py
You can then export the results in PNG format to the `results` folder using the following command:
python3 exp_train_model/export_experiment_results.py
The results will appear in the `export` folder and will be named `results_training_PANN.png` and `results_training_YamNet.png`.
If you want to show more detailed results, please read the doce tutorial and select the plan and the parameters you want to show. Here is an example command, which will create an export file `myresults.png` in the `export` folder:
python3 exp_train_model/main_doce_training.py -s cnn/dataset=full+classifier=PANN -d -e myresults.png
To plot the training curve of the trained CNN-logits model used in the paper (feel free to change the source code if you want any other model), execute the following command:
python3 exp_train_model/plot_training_curve.py
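If you prefer to inspect the curves yourself, a minimal matplotlib sketch along the lines of what `plot_training_curve.py` produces is shown below. The loss file names and their format are assumptions for illustration; the script reads the losses actually saved during training.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical paths: adapt to wherever the training script stores its losses.
train_loss = np.load("losses/cnn_logits_train_loss.npy")
valid_loss = np.load("losses/cnn_logits_valid_loss.npy")

plt.plot(train_loss, label="train")
plt.plot(valid_loss, label="validation")
plt.xlabel("epoch")
plt.ylabel("loss")
plt.legend()
plt.savefig("export/training_curve_cnn_logits.png")
```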
Generate the logit data (class predictions) from the classifiers and the transcoded classifiers on the UrbanSound8k and SONYC-UST datasets using the following commands:
python3 exp_classif_eval/create_logit_dataset.py --audio_dataset_name URBAN-SOUND-8K
python3 exp_classif_eval/create_logit_dataset.py --audio_dataset_name SONYC-UST
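For a single file, obtaining clip-level predictions from PANN looks roughly like the sketch below. The import path, checkpoint name, and input file are assumptions for illustration; `exp_classif_eval/create_logit_dataset.py` does this, for both the original and the transcoded inputs, over the whole datasets.

```python
import torch
import librosa
from pann.pytorch.models import ResNet38  # assumed import path inside this repo

model = ResNet38(sample_rate=32000, window_size=1024, hop_size=320,
                 mel_bins=64, fmin=50, fmax=14000, classes_num=527)
checkpoint = torch.load("ResNet38_mAP=0.434.pth",  # assumed checkpoint name
                        map_location="cpu")
model.load_state_dict(checkpoint["model"])
model.eval()

y, _ = librosa.load("example.wav", sr=32000)       # hypothetical input file
with torch.no_grad():
    out = model(torch.from_numpy(y).float().unsqueeze(0))  # (1, n_samples)
predictions = out["clipwise_output"]               # (1, 527) AudioSet classes
```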
The models used for this experiment are in the `reference_models` folder. To launch the experiment and train the additional fully connected layers, execute the following commands:
First command to execute:
python3 exp_classif_eval/main_doce_score.py -s reference/ -c
Second command to execute:
python3 exp_classif_eval/main_doce_score.py -s deep/ -c
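The "additional fully connected layers" are small heads mapping the 527 clip-level predictions to the target classes of each dataset. Below is a hedged sketch of that kind of head; layer sizes and class counts are assumptions for illustration, and the actual models are defined under `exp_classif_eval/`.

```python
import torch
import torch.nn as nn

n_audioset_classes = 527   # YAMNet / PANN output size
n_target_classes = 10      # UrbanSound8k; SONYC-UST has 8 coarse classes

head = nn.Sequential(
    nn.Linear(n_audioset_classes, 128),
    nn.ReLU(),
    nn.Linear(128, n_target_classes),
)

logits = torch.randn(16, n_audioset_classes)  # a batch from the logit dataset
scores = head(logits)  # trained with cross-entropy (UrbanSound8k, multi-class)
                       # or BCE (SONYC-UST, multi-label)
```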
Then, you can export the results of the experiment (`results_classif_urbansound8k.png`, `results_classif_sonycust.png`) in PNG format to the `export` folder using the following commands:
python3 exp_classif_eval/main_doce_score.py -s deep/dataset=URBAN-SOUND-8K -d [0] -e results_classif_urbansound8k.png
python3 exp_classif_eval/main_doce_score.py -s deep/dataset=SONYC-UST -d [1] -e results_classif_sonycust.png
Since Mel spectrograms can be inverted with librosa's `mel_to_audio` function, we can also invert transcoded Mel spectrograms and thus retrieve audio from third-octave spectrograms. You can try this with your own audio files by putting your wav file (`myfile.wav`) in the `audio` folder and executing this command:
python3 generate_audio.py myfile.wav
The generated wav files will be placed in the `audio_generated/myfile` folder. It will contain the normalized original file (`myfile_original.wav`) and the audio file generated from the PANN 32ms Mel spectrogram (`myfile_generated_from_groundtruth_mel.wav`). The folder will also contain the files generated from 125ms third-octave spectrograms transcoded into 32ms Mel spectrograms, with the different transcoding techniques mentioned in the paper: the audio file generated from the PINV transcoder (`myfile_generated_from_pinv.wav`), the audio file generated from the CNN-mels transcoder (`myfile_generated_from_cnn_mels.wav`), and finally the audio file generated from the CNN-logits transcoder (`myfile_generated_from_cnn_logits.wav`).
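Under the hood, the audio is recovered with librosa's Griffin-Lim-based `mel_to_audio`. A minimal sketch of that inversion step is shown below; the parameter values and file names are assumptions for illustration, and `generate_audio.py` handles the full pipeline (including the transcoding) end to end.

```python
import librosa
import soundfile as sf

y, sr = librosa.load("audio/myfile.wav", sr=32000)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                     hop_length=320, n_mels=64)

# In the real pipeline this Mel matrix would come from the transcoder;
# here we simply invert the ground-truth Mel spectrogram.
y_hat = librosa.feature.inverse.mel_to_audio(mel, sr=sr, n_fft=1024,
                                             hop_length=320)
sf.write("myfile_reconstructed.wav", y_hat, sr)  # hypothetical output name
```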
An audio example obtained from Freesound has been re-generated with this algorithm and is available in the `audio` and `audio_generated` folders. The file is named `birds.wav` (obtained from: https://freesound.org/people/hargissssound/sounds/345851/). Interestingly, the audio sounds quite realistic when the CNN-logits transcoder is used. Here are the generated audios from `birds.wav` (please unmute the videos before playing):
Original audio file
0_original.mp4
Audio generated from Mel spectrogram (ground truth)
1_mel.mp4
Audio generated from transcoded Mel spectrogram (PINV)
4_pinv.mp4
Audio generated from transcoded Mel spectrogram (CNN-mels)
3_cnn_mels.mp4
Audio generated from transcoded Mel spectrogram (CNN-logits)
2_cnn_logits.mp4
The figure below (run `plot_spectro_dcase2023.py` to replicate the figure) demonstrates the results of the transcoding process using various transcoding algorithms, namely the PINV, CNN-mels, and CNN-logits transcoders, on a 1-second audio excerpt from the evaluation dataset.
In our paper titled "Spectral Transcoder: Using Pretrained Urban Sound Classifiers on Undersampled Spectral Representations," we propose a revised version of the aggregation method introduced by F. Gontier et al. [4]. During the inference process, we group some of the classes that are considered relevant for each SONYC-UST and UrbanSound8k class. The groups used during inference can be found in the files `exp_classif_eval/sub_classes_sonyc_ust.xlsx` and `exp_classif_eval/sub_classes_urbansound8k.xlsx`.
In Gontier et al.'s paper, if one of YAMNet's top 3 predicted classes (among the 527 AudioSet classes) belongs to one of the meta-classes (traffic, voice, and birds in their paper), the meta-class is considered present. For SONYC-UST, we take the top 8 classes into account for the aggregation instead of the top 3. We believe that Gontier et al.'s method can produce false negatives in multi-label classification. For example, if there is some music in the audio excerpt, it is very likely that the 8 first predicted classes will all be related to music (e.g. 1: Music, 2: Musical Instrument, 3: Guitar, 4: Pop Music, 5: Drum, 6: Piano, 7: Bass Guitar, 8: Acoustic Guitar), so the class predicted at position 9 (e.g. Speech) will not be considered present in the audio excerpt. To address this issue, we propose grouping classes during inference, which reduces false negatives (e.g. 1: SONYC-UST Music, 2: SONYC-UST HumanVoice, 3: Mosquito, etc.). This aggregation approach leads to a higher mAUC than Gontier et al.'s method on SONYC-UST. A similar type of aggregation is employed for UrbanSound8k, but in a simpler multi-class classification paradigm (a meta-class is considered present if one of its classes has the highest score).
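A hedged sketch of this grouped inference is given below. The use of max-pooling over sub-class scores and the function name are assumptions for illustration; the actual groups are listed in the `sub_classes_*.xlsx` files mentioned above.

```python
import numpy as np

def grouped_topk_presence(scores, groups, k=8):
    """scores: (527,) class predictions from YAMNet/PANN.
    groups: dict meta-class name -> list of AudioSet class indices.
    Returns, for each meta-class, whether it is predicted present."""
    grouped_indices = {i for idx in groups.values() for i in idx}
    # One entry per meta-class (pooled over its sub-classes) ...
    pooled = {name: float(scores[idx].max()) for name, idx in groups.items()}
    # ... plus one entry per AudioSet class that belongs to no group.
    for i, s in enumerate(scores):
        if i not in grouped_indices:
            pooled[f"audioset_{i}"] = float(s)
    # A meta-class is predicted present if it appears among the top-k entries.
    top_k = sorted(pooled, key=pooled.get, reverse=True)[:k]
    return {name: name in top_k for name in groups}
```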
The results of this method, compared to the method where we add fully connected layers at the output of the pre-trained models (as explained in the paper), are presented in the tables below. When the "deep" parameter is set to 1, fully connected layers are used. Conversely, when "deep" is set to 0, the aggregation method described in the previous paragraph is employed.
The images above can be replicated using the following commands (after running part 2-b):
python3 exp_classif_eval/main_doce_score.py -s "{'dataset':'URBAN-SOUND-8K'}" -d [0] -e results_classif_urbansound8k.png
python3 exp_classif_eval/main_doce_score.py -s "{'dataset':'SONYC-UST'}" -d [1] -e results_classif_sonycust.png
[1] Tailleur, M., Lagrange, M., Aumond, P., & Tourre, V. (2023, September). Spectral transcoder: using pretrained urban sound classifiers on undersampled spectral representations. In 8th Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE).
[2] TensorFlow, "Sound classification with YAMNet," 2020, last accessed 09/05/2023. [Online]. Source code: https://github.com/tensorflow/models/tree/master/research/audioset/yamnet/.
[3] Q. Kong, Y. Cao, T. Iqbal, Y. Wang, W. Wang, and M. D. Plumbley, "PANNs: Large-scale pretrained audio neural networks for audio pattern recognition," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 2880–2894, 2020, publisher: IEEE. Source code: https://github.com/qiuqiangkong/audioset_tagging_cnn. Available: https://arxiv.org/abs/1912.10211.
[4] F. Gontier, V. Lostanlen, M. Lagrange, N. Fortin, C. Lavandier, and J.-F. Petiot, "Polyphonic training set synthesis improves self-supervised urban sound classification," The Journal of the Acoustical Society of America, vol. 149, no. 6, pp. 4309–4326, 2021, publisher: Acoustical Society of America. Available: https://hal-nantes-universite.archives-ouvertes.fr/hal-03262863/.
This repo contains a `yamnet` folder, which is a PyTorch port of the original YAMNet (link). It also contains a `pann` folder, which is the original PANN repository (link). We have extended the source code with supplementary functionalities that are significant for our research.
The `doce` folder of this repo contains the source code from the doce project (downloaded on 31/05/2023). As this project is still in progress, doce is not listed in the `requirements.txt` file, in order to ensure the reproducibility of the results.