
2022 02 23

Dan Oneata edited this page Feb 23, 2022 · 2 revisions

Use two variants of the Wav2Vec2 feature extractor (base-960h and large-xlsr-53) instead of the MFCC features. Note that the Wav2Vec2 features are 512-dimensional (as opposed to the 39-D MFCC features), but their sequences are about half as long in the temporal domain. The rest of the architecture remains unchanged.
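
The difference in temporal resolution can be checked with a little arithmetic. The sketch below counts the frames produced by the convolutional feature encoder of Wav2Vec2, using the kernel sizes and strides of the published architecture (shared by both variants), and compares that with an MFCC front end. The MFCC window and hop (25 ms and 10 ms at 16 kHz, no padding) are assumptions for illustration, not values taken from this project's configs, and the function names are hypothetical.

```python
# Sketch: number of frames for Wav2Vec2 conv features vs. MFCC features.
# The Wav2Vec2 kernel sizes and strides follow the published architecture;
# the MFCC window and hop are assumed, not read from this project's configs.

SAMPLE_RATE = 16_000
W2V2_KERNELS = (10, 3, 3, 3, 3, 2, 2)
W2V2_STRIDES = (5, 2, 2, 2, 2, 2, 2)


def wav2vec2_num_frames(num_samples: int) -> int:
    """Number of 512-D vectors the conv feature encoder emits (no padding)."""
    length = num_samples
    for kernel, stride in zip(W2V2_KERNELS, W2V2_STRIDES):
        length = (length - kernel) // stride + 1
    return length


def mfcc_num_frames(num_samples: int, window: int = 400, hop: int = 160) -> int:
    """Frame count for an assumed 25 ms window and 10 ms hop, no padding."""
    return (num_samples - window) // hop + 1


one_second = SAMPLE_RATE
print(wav2vec2_num_frames(one_second))  # 49
print(mfcc_num_frames(one_second))      # 98
```

The overall stride of the Wav2Vec2 encoder is 5 · 2⁶ = 320 samples (20 ms), twice the assumed 10 ms MFCC hop, which is where the factor of about two comes from.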

Train an audio model on top of the Wav2Vec2 features, but on the English data, so that I can later transfer it to the Yorùbá language.


Scripts to run:

# training
python train_emb.py --config yoruba-abuja-wav2vec2-base-960h
python train_emb.py --config yoruba-abuja-wav2vec2-large-xlsr-53
# evaluation
python predict_emb.py --to-eval-keyword-spotting --config yoruba-abuja-wav2vec2-base-960h
python predict_emb.py --to-eval-keyword-spotting --config yoruba-abuja-wav2vec2-large-xlsr-53

As before, the quantitative results are available in the Google sheet.

# training on the English variant of the Flickr8K dataset
CUDA_VISIBLE_DEVICES=4 python train_emb.py --config english-wav2vec2-base-960h
CUDA_VISIBLE_DEVICES=5 python train_emb.py --config english-wav2vec2-large-xlsr-53
# evaluation
python predict_emb.py --to-eval-keyword-spotting --config english-wav2vec2-base-960h
python predict_emb.py --to-eval-keyword-spotting --config english-wav2vec2-large-xlsr-53

Train on Yorùbá, but start from the English audio models:

# training
CUDA_VISIBLE_DEVICES=2 python train_emb.py --config yoruba-abuja-wav2vec2-base-960h-from-en
CUDA_VISIBLE_DEVICES=2 python train_emb.py --config yoruba-abuja-wav2vec2-large-xlsr-53-from-en
# evaluation
CUDA_VISIBLE_DEVICES=2 python predict_emb.py --to-eval-keyword-spotting --config yoruba-abuja-wav2vec2-base-960h-from-en
CUDA_VISIBLE_DEVICES=2 python predict_emb.py --to-eval-keyword-spotting --config yoruba-abuja-wav2vec2-large-xlsr-53-from-en
