2022 02 09

Reading notes. A few notes on the two papers that use CLIP for audio training (see the previous entry):

The most significant difference for us is that both papers work on sound datasets (that is, recordings of different events) rather than on recordings of utterances.
Interesting choices in terms of the audio encoder:
- Wu et al. use ResNet18 on audio spectrograms
- Zhao et al. use an image transformer from CLIP; they even initialize it with the weights of the pretrained image model, which they show performs much better than random initialization

Axes of exploration. Below I'm listing possible choices for the language documentation setting:

language: Yorùbá, English (mostly for development purposes; allows us to systematically evaluate the effect of the amount of training data)
audio encoder: MFCC + network, wav2vec + network (possibly shallower than the one used on top of MFCC)
supervisory signal: image, text (allows us to get a sense of an upper-bound performance)
type of supervision: categorical tags (as in the original work of Herman), embedding (as in the CLIP papers)
evaluation task: keyword detection, keyword localisation
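As a rough sketch of how large this design space is, the axes above can be enumerated as a grid. The dictionary keys and value strings below are illustrative names only (they are not config names from the repository), and the grid assumes every combination is meaningful, which may not hold in practice.

```python
from itertools import product

# Axes copied from the list above; keys and value strings are illustrative.
AXES = {
    "language": ["yoruba", "english"],
    "audio_encoder": ["mfcc+network", "wav2vec+network"],
    "supervisory_signal": ["image", "text"],
    "supervision_type": ["categorical-tags", "embedding"],
    "evaluation_task": ["keyword-detection", "keyword-localisation"],
}


def enumerate_settings(axes):
    """Return one dict per combination of axis values."""
    names = list(axes)
    return [dict(zip(names, values)) for values in product(*axes.values())]


settings = enumerate_settings(AXES)
print(len(settings))  # five binary axes give 2 ** 5 = 32 combinations
```

With five binary axes a full sweep is already 32 runs, so it is probably more practical to vary one axis at a time around a fixed baseline.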

In terms of evaluation:

if we are evaluating at utterance level, does it make sense to use the English annotations, given that the Yorùbá ones carry almost¹ the same semantics?

Exploring a different audio encoder. Check if using a ResNet pretrained on images helps improve the performance. For this I changed the input signal, replacing the MFCCs with spectrograms. Unfortunately, the input dimensionality is larger, so I was not able to train with a large batch size (I had to resort to 64 samples):

CUDA_VISIBLE_DEVICES=0 python train_emb.py --config resnet18-pretrained-batch-size-64
CUDA_VISIBLE_DEVICES=1 python train_emb.py --config resnet18-batch-size-64

For the quantitative evaluation see the sheet 2021-08-31 · train-embedding-space from here. These experiments were carried out on the English version of the dataset.

Training the audio encoder on Yorùbá. Use the same setup as for English, but change the audio input to the Yorùbá utterances:

at training time, learn the audio encoder by aligning the audio embeddings to the CLIP image features;
at test time, embed an English word in the same space (using the CLIP text encoder) and retrieve the most similar audio utterances.
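The test-time retrieval step can be sketched as a cosine-similarity ranking. This is a minimal illustration and not the code in `predict_emb.py`: the function names are hypothetical, the toy 3-dimensional vectors stand in for real CLIP text embeddings and learned audio embeddings, and all vectors are assumed to be non-zero.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_utterances(query_embedding, audio_embeddings):
    """Return utterance ids sorted from most to least similar to the query."""
    scores = {
        utt_id: cosine_similarity(query_embedding, emb)
        for utt_id, emb in audio_embeddings.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Toy vectors; in the real setup these would come from the audio encoder.
audio_embeddings = {
    "utt-a": [0.9, 0.1, 0.0],
    "utt-b": [0.0, 1.0, 0.0],
    "utt-c": [0.6, 0.6, 0.1],
}
# Stand-in for the CLIP text embedding of an English keyword.
query = [1.0, 0.0, 0.0]

ranking = rank_utterances(query, audio_embeddings)
print(ranking)
```

Because the audio encoder is trained to match CLIP image features, and CLIP places text and images in the same space, the text query can be compared directly against audio embeddings without any Yorùbá text.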

Consider two variants:

randomly initialize the audio encoder weights
initialize the audio encoder weights from the English audio encoder

The evaluation is done in terms of keyword spotting, and we report average precision. We consider only the English transcriptions, since they correspond closely to the Yorùbá translations.
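For reference, average precision for a single keyword can be computed from the ranked utterances as below. This is a plain, non-interpolated sketch under the assumption that each utterance has a binary label (keyword present or absent) and a similarity score; the actual evaluation code may use a library implementation that handles ties differently.

```python
def average_precision(scores, labels):
    """Non-interpolated average precision; labels are 1 (present) or 0."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    num_hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            num_hits += 1
            # Precision at this rank, counted only at positive positions.
            precision_sum += num_hits / rank
    # Defined as 0.0 here when the keyword never occurs (a convention).
    return precision_sum / num_hits if num_hits else 0.0


# Four utterances ranked by score; the 1st and 3rd contain the keyword.
scores = [0.9, 0.8, 0.7, 0.6]
labels = [1, 0, 1, 0]
ap = average_precision(scores, labels)  # (1/1 + 2/3) / 2
print(round(ap, 4))
```

Averaging this value over all keywords would give a mean average precision; the notes do not say whether the reported number is per keyword or averaged.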

The quantitative results are aggregated in the sheet 2022-02-09 · train-embedding-space · yoruba from here.

Scripts to run

training:

CUDA_VISIBLE_DEVICES=1 python train_emb.py --config yoruba-abuja
CUDA_VISIBLE_DEVICES=2 python train_emb.py --config yoruba-abuja-from-en

predict and evaluate

python predict_emb.py --to-eval-keyword-spotting --config yoruba-abuja
python predict_emb.py --to-eval-keyword-spotting --config yoruba-abuja-from-en

display qualitative results

streamlit run show_keyword_spotting_yoruba.py -- --config yoruba-abuja
streamlit run show_keyword_spotting_yoruba.py -- --config yoruba-abuja-from-en

TODO

Run training on Kayode's annotated data
Run training by initializing from the English model
? Try pretrained wav2vec2 vectors

1: Does this cover the case of "òkun", which means both "ocean" and "beach"?
