# Mozilla Common Voice Corpus

Mozilla Common Voice (MCV) is a large collection of datasets for speech research. Each entry in the dataset consists of a unique MP3 file and a corresponding text file. Many of the 20,217 recorded hours in the dataset also include demographic metadata such as age, sex, and accent, which can help improve the accuracy of speech recognition engines.

We use only the German portion of the dataset, which is approximately 28 GB.

## Download

First, we install some prerequisite packages and download the data.

In [1]:
!sudo apt-get install -y wget sox libsox-fmt-mp3 parallel

Reading package lists... Done
Building dependency tree       
Reading state information... Done
parallel is already the newest version (20161222-1).
wget is already the newest version (1.19.4-1ubuntu2.2).
libsox-fmt-mp3 is already the newest version (14.4.2-3ubuntu0.18.04.1).
sox is already the newest version (14.4.2-3ubuntu0.18.04.1).
0 upgraded, 0 newly installed, 0 to remove and 187 not upgraded.


The dataset can be downloaded from https://commonvoice.mozilla.org/en/datasets through a web interface. After you register, you will receive a download URL. Replace the `DOWNLOAD_URL` placeholder in the command below with that URL and fetch the archive with `wget`:

In [3]:
!mkdir -p ./data/raw/mcv
!wget DOWNLOAD_URL -O ./data/raw/mcv/de.tar.gz

--2022-05-02 21:09:36--  http://download_url/
Resolving download_url (download_url)... failed: Temporary failure in name resolution.
wget: unable to resolve host address ‘download_url’


## Preprocessing

Next, we standardize the audio data and convert the raw dataset to the NeMo manifest format.
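
For orientation, a NeMo manifest is a text file with one JSON object per line, typically carrying `audio_filepath`, `duration` (in seconds), and `text`. The following is a minimal sketch of writing such a file with the Python standard library. The helper names `wav_duration` and `write_manifest` are hypothetical and are not taken from `process_mcv.py`, whose actual implementation may differ.

```python
import json
import wave


def wav_duration(path):
    """Return the duration of a WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def write_manifest(samples, manifest_path):
    """Write (wav_path, transcript) pairs as a JSON-lines manifest.

    Hypothetical helper for illustration only.
    """
    with open(manifest_path, "w", encoding="utf-8") as out:
        for wav_path, transcript in samples:
            entry = {
                "audio_filepath": wav_path,
                "duration": wav_duration(wav_path),
                "text": transcript,
            }
            # ensure_ascii=False keeps German umlauts readable in the file.
            out.write(json.dumps(entry, ensure_ascii=False) + "\n")
```

Writing one object per line means the manifest can be read back or filtered line by line, without loading the whole file into memory.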

**Audio data**: Audio data acquired from various sources is inherently heterogeneous (file format, sample rate, bit depth, number of audio channels, and so on). Therefore, as a preprocessing step, we build a separate data ingestion pipeline for each source and convert the audio to a common format with the following characteristics:
- WAV format
- Bit depth: 16 bits
- Sample rate: 16 kHz
- Single audio channel
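
The target format listed above can be produced with the `sox` tool installed earlier. The sketch below builds a matching command line and checks a resulting file with Python's `wave` module. The names `sox_command`, `convert_to_wav`, and `is_target_format` are hypothetical, and `process_mcv.py` may perform the conversion differently.

```python
import subprocess
import wave

TARGET_RATE = 16000   # 16 kHz sample rate
TARGET_CHANNELS = 1   # single (mono) channel
TARGET_BITS = 16      # bit depth


def sox_command(src, dst):
    """Build a sox command converting `src` to 16 kHz, mono, 16-bit `dst`.

    sox infers the output container (WAV) from the `.wav` extension of `dst`.
    """
    return [
        "sox", src,
        "-r", str(TARGET_RATE),
        "-c", str(TARGET_CHANNELS),
        "-b", str(TARGET_BITS),
        dst,
    ]


def convert_to_wav(src, dst):
    """Run sox; assumes sox with MP3 support (libsox-fmt-mp3) is on PATH."""
    subprocess.run(sox_command(src, dst), check=True)


def is_target_format(path):
    """Return True if the WAV file at `path` matches the target format."""
    with wave.open(path, "rb") as wav:
        return (
            wav.getframerate() == TARGET_RATE
            and wav.getnchannels() == TARGET_CHANNELS
            and wav.getsampwidth() * 8 == TARGET_BITS
        )
```

Calling `convert_to_wav("clip.mp3", "clip.wav")` and then `is_target_format("clip.wav")` can serve as a quick sanity check on a single clip before processing the full corpus.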


In [1]:
import os
import sys

CUR_DIR = os.getcwd()
sys.path.insert(0, os.path.join(CUR_DIR, "data_ingestion"))

In [2]:
!mkdir -p data/processed/mcv

OUT_DIR = os.path.join(CUR_DIR, "data/processed/mcv")
DATA_ROOT = os.path.join(CUR_DIR, "data/raw/mcv")

!python3 ./data_ingestion/process_mcv.py --data_root=$DATA_ROOT --data_temp=/tmp --data_out=$OUT_DIR --manifest_dir=$OUT_DIR

INFO:root:Find existing folder /tmp/CV_unpacked
INFO:root:b'/tmp/CV_unpacked\n/tmp/CV_unpacked/cv-corpus-5.1-2020-06-22\n/tmp/CV_unpacked/cv-corpus-5.1-2020-06-22/de\n/tmp/CV_unpacked/cv-corpus-5.1-2020-06-22/de/test.tsv\n/tmp/CV_unpacked/cv-corpus-5.1-2020-06-22/de/other.tsv\n/tmp/CV_unpacked/cv-corpus-5.1-2020-06-22/de/reported.tsv\n/tmp/CV_unpacked/cv-corpus-5.1-2020-06-22/de/dev.tsv\n/tmp/CV_unpacked/cv-corpus-5.1-2020-06-22/de/clips\n/tmp/CV_unpacked/cv-corpus-5.1-2020-06-22/de/invalidated.tsv\n/tmp/CV_unpacked/cv-corpus-5.1-2020-06-22/de/validated.tsv\n/tmp/CV_unpacked/cv-corpus-5.1-2020-06-22/de/train.tsv\n'
INFO:root:Converting mp3 to wav using 20 workers for /tmp/CV_unpacked/cv-corpus-5.1-2020-06-22/de/test.tsv.
INFO:root:Reading metadata using 20 workers for /tmp/CV_unpacked/cv-corpus-5.1-2020-06-22/de/test.tsv
100%|███████████████████████████████████| 15340/15340 [00:01<00:00, 8035.96it/s]
INFO:root:Creating manifests...
100%|█████████████████████████████████| 15340/15340 [0