2022 02 09
Reading notes. A few notes on the two papers that use CLIP for audio training (see the previous entry):
- The most significant difference from our setting is that the two papers work on sound datasets (that is, recordings of different events) rather than on recordings of utterances
- Interesting choices in terms of the audio encoder:
- Wu et al. use ResNet18 on audio spectrograms
- Zhao et al. use an image transformer from CLIP; they even initialize it with the weights of the pretrained image model, which gives much better performance than random initialization
Axes of exploration. Below I'm listing possible choices for the language documentation setting:
- language: Yorùbá, English (mostly for development purposes; allows us to systematically evaluate the effect of the amount of training data)
- audio encoder: MFCC + network, wav2vec + network (possibly shallower than the one used on top of MFCC)
- supervisory signal: image, text (allows us to get a sense of an upper-bound performance)
- type of supervision: categorical tags (as in the original work of Herman), embedding (as in the CLIP papers)
- evaluation task: keyword detection, keyword localisation
In terms of evaluation:
- if we are evaluating at utterance level, does it make sense to use the English annotations, given that the Yorùbá ones carry almost[1] the same semantics?
Exploring a different audio encoder. Check whether using a ResNet pretrained on images improves performance. For this I changed the input signal, replacing the MFCCs with a spectrogram. Unfortunately, the input dimensionality is larger, so I was not able to train with a large batch size (I had to resort to 64 samples):
CUDA_VISIBLE_DEVICES=0 python train_emb.py --config resnet18-pretrained-batch-size-64
CUDA_VISIBLE_DEVICES=1 python train_emb.py --config resnet18-batch-size-64
For the quantitative evaluation see the sheet 2021-08-31 · train-embedding-space from here.
These experiments were carried out on the English version of the dataset.
Training the audio encoder on Yorùbá. Use the same setup as for English, but change the audio input to the Yorùbá utterances:
- at training time, learn the audio encoder by aligning the audio embeddings to the CLIP image features;
- at test time, embed an English word in the same space (using the CLIP text encoder) and retrieve the most similar audio utterances.
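The two steps above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the notes do not say which objective train_emb.py uses, so the symmetric contrastive loss below (in the style of CLIP) is an assumption, and the function names, the temperature value, and the embedding shapes are hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale vectors along the last axis to unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def contrastive_alignment_loss(audio_emb, image_emb, temperature=0.07):
    """Symmetric contrastive loss between paired audio and image embeddings.

    Assumption: row i of audio_emb is paired with row i of image_emb.
    The temperature value is illustrative, not taken from the notes.
    """
    a = l2_normalize(audio_emb)
    v = l2_normalize(image_emb)
    logits = a @ v.T / temperature  # (batch, batch) cosine similarities

    def cross_entropy_on_diagonal(z):
        z = z - z.max(axis=1, keepdims=True)  # numerical stability
        log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the audio-to-image and image-to-audio directions.
    return 0.5 * (cross_entropy_on_diagonal(logits)
                  + cross_entropy_on_diagonal(logits.T))


def rank_utterances(text_emb, audio_embs):
    """Rank audio utterances by cosine similarity to a keyword embedding.

    text_emb stands in for the CLIP text embedding of an English word;
    audio_embs holds one audio embedding per utterance.
    """
    scores = l2_normalize(audio_embs) @ l2_normalize(text_emb)
    order = np.argsort(-scores)  # most similar first
    return order, scores
```

At training time only the audio encoder would receive gradients, since the CLIP image features act as fixed targets; at test time the returned scores can be fed directly to a ranking metric.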
Consider two variants:
- randomly initialize the audio encoder weights
- initialize the audio encoder weights from the English audio encoder
The evaluation is done in terms of keyword spotting, and we report average precision. We consider only the English transcriptions, since they correspond closely to the Yorùbá translations.
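For reference, here is a minimal sketch of how average precision could be computed for a single keyword. It is not taken from predict_emb.py: the helper names are hypothetical, and the labelling rule (an utterance is positive if the keyword occurs as a whitespace-separated token in its English transcription) is an assumption.

```python
import numpy as np


def keyword_labels(transcripts, keyword):
    """Mark an utterance as positive if the keyword is one of its tokens.

    Assumption: simple lowercased whitespace tokenization of the
    English transcription.
    """
    keyword = keyword.lower()
    return np.array([keyword in t.lower().split() for t in transcripts])


def average_precision(scores, labels):
    """Average precision of a ranking induced by scores.

    scores: one similarity score per utterance (higher is more relevant).
    labels: booleans, True where the keyword is present.
    Returns NaN if there are no positive utterances.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    if labels.sum() == 0:
        return float("nan")
    order = np.argsort(-scores, kind="stable")  # best score first
    hits = labels[order]
    precision_at_k = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    # Average the precision values at the ranks of the positives.
    return float(precision_at_k[hits].mean())
```

A score over the whole vocabulary would then be the mean of the per-keyword values, skipping keywords that have no positive utterances.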
The quantitative results are aggregated in the sheet 2022-02-09 · train-embedding-space · yoruba from here.
Scripts to run
- training:
CUDA_VISIBLE_DEVICES=1 python train_emb.py --config yoruba-abuja
CUDA_VISIBLE_DEVICES=2 python train_emb.py --config yoruba-abuja-from-en
- predict and evaluate:
python predict_emb.py --to-eval-keyword-spotting --config yoruba-abuja
python predict_emb.py --to-eval-keyword-spotting --config yoruba-abuja-from-en
- display qualitative results:
streamlit run show_keyword_spotting_yoruba.py -- --config yoruba-abuja
streamlit run show_keyword_spotting_yoruba.py -- --config yoruba-abuja-from-en
TODO
- Run training on Kayode's annotated data
- Run training by initializing from the English model
- ? Try pretrained wav2vec2 vectors
1: Does this cover the case of "òkun", which means both "ocean" and "beach"?