# Mozilla Common Voice (MCV) Dataset

Mozilla Common Voice (MCV) is a large collection of speech data for research. Each entry in the dataset consists of a unique MP3 and a corresponding text file. Many of the 20,217 recorded hours in the dataset also include demographic metadata such as age, sex, and accent, which can help improve the accuracy of speech recognition engines.

We will only make use of the German portion of the dataset, which is ~28 GB.

## Download

First, we install the prerequisite packages and download the data.

In [None]:
!apt-get update && apt-get install -y wget sox libsox-fmt-mp3 parallel

The dataset can be downloaded [here](https://commonvoice.mozilla.org/en/datasets) using a web interface. Upon registration, you will receive a download URL, which can be used with `wget` as follows:

In [None]:
!mkdir -p ./data/raw/mcv
!wget "<DOWNLOAD_URL>" -O ./data/raw/mcv/de.tar.gz

## Preprocessing

Next, we standardize audio data and convert the raw format to NeMo manifest format.

**Audio data**: Audio data acquired from various sources are inherently heterogeneous (file format, sample rate, bit depth, number of audio channels, and so on). Therefore, as a preprocessing step, we build a separate data ingestion pipeline for each source and convert the audio data to a common format with the following characteristics:
- WAV format
- Bit depth: 16 bits
- Sample rate: 16 kHz
- Single audio channel
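
As a rough illustration of the target format listed above, the sketch below builds a `sox` command line that converts one file to 16 kHz, mono, 16-bit WAV, and checks an existing WAV file against those properties using Python's standard `wave` module. The helper names `build_sox_command` and `is_target_format` are hypothetical and are not taken from `process_mcv.py`, whose actual conversion logic may differ.

```python
import wave

TARGET_RATE = 16000    # 16 kHz sample rate
TARGET_CHANNELS = 1    # single (mono) channel
TARGET_SAMPWIDTH = 2   # 16 bits = 2 bytes per sample


def build_sox_command(src_path, dst_path):
    """Return a sox argument list that converts src_path to the target format.

    Hypothetical helper: it only builds the command, it does not run it.
    """
    return [
        "sox", src_path,
        "-r", str(TARGET_RATE),           # output sample rate
        "-c", str(TARGET_CHANNELS),       # output channel count
        "-b", str(TARGET_SAMPWIDTH * 8),  # output bit depth
        dst_path,
    ]


def is_target_format(wav_path):
    """Return True if the WAV file is 16 kHz, mono, 16-bit."""
    with wave.open(wav_path, "rb") as wav:
        return (
            wav.getframerate() == TARGET_RATE
            and wav.getnchannels() == TARGET_CHANNELS
            and wav.getsampwidth() == TARGET_SAMPWIDTH
        )
```

Once `sox` and `libsox-fmt-mp3` are installed, the returned list could be passed to `subprocess.run(cmd, check=True)` to perform the conversion, and `is_target_format` could serve as a quick sanity check on the output.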


In [None]:
import os
import sys

CUR_DIR = os.getcwd()
sys.path.insert(0, os.path.join(CUR_DIR, "data_ingestion"))

Notes:
- Pass the `--version="cv-corpus-xxx"` argument to `process_mcv.py` that matches the version of your downloaded corpus. The default value is `cv-corpus-5.1-2020-06-22`, which refers to the June 2020 release of the dataset.
- Depending on the corpus version, the metadata `.tsv` file uses either `accent` or `accents` as the column header. If your version uses `accents`, update the preprocessing script to look for `accents` instead of `accent`.
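
One way to handle the `accent` versus `accents` difference without editing the script by hand for each release is to inspect the header row first. The following is a minimal sketch under that assumption; `find_accent_column` is a hypothetical helper, not a function from `process_mcv.py`, and the column names in the sample header are illustrative only.

```python
import csv
import io


def find_accent_column(tsv_file):
    """Return 'accent' or 'accents', whichever the TSV header contains.

    Returns None if neither column name is present.
    """
    reader = csv.reader(tsv_file, delimiter="\t")
    header = next(reader, [])  # first row holds the column names
    for name in ("accent", "accents"):
        if name in header:
            return name
    return None


# Example with an in-memory header row (illustrative column names).
sample = io.StringIO("client_id\tpath\tsentence\tage\tgender\taccents\n")
print(find_accent_column(sample))
```

The returned name could then be used as the lookup key when reading each metadata row, so the same code works for both header variants.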

In [None]:
!mkdir -p data/processed/mcv

OUT_DIR = os.path.join(CUR_DIR, "data/processed/mcv")
DATA_ROOT = os.path.join(CUR_DIR, "data/raw/mcv")

!python3 ./data_ingestion/process_mcv.py --data_root=$DATA_ROOT --data_temp=/tmp --data_out=$OUT_DIR --manifest_dir=$OUT_DIR --save_meta true
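
For orientation, a NeMo ASR manifest is a JSON Lines file: each line is one JSON object with an `audio_filepath`, a `duration` in seconds, and the transcript `text`. The sketch below shows how such a line could be written and read back. The helper names are hypothetical, and the manifest actually written by `process_mcv.py` may contain additional fields.

```python
import json


def manifest_line(audio_filepath, duration, text):
    """Serialize one utterance as a single line of a NeMo-style manifest."""
    entry = {
        "audio_filepath": audio_filepath,
        "duration": round(duration, 3),  # seconds
        "text": text,
    }
    # ensure_ascii=False keeps German umlauts readable in the output file.
    return json.dumps(entry, ensure_ascii=False)


def read_manifest(lines):
    """Parse an iterable of manifest lines into a list of dicts, skipping blanks."""
    return [json.loads(line) for line in lines if line.strip()]
```

Reading the generated manifest with a helper like `read_manifest(open(path, encoding="utf-8"))` and summing the `duration` fields is a quick way to confirm how many hours of audio were processed.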

In [None]:
# Optional: to remove the raw dataset and free up disk space, uncomment the shell command below.

#! rm -rf data/raw/mcv