# VoxPopuli Dataset

[VoxPopuli](https://aclanthology.org/2021.acl-long.80) is a large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation. VoxPopuli provides:

- 400K hours of unlabelled speech data for 23 languages
- 1.8K hours of transcribed speech data for 16 languages
- 17.3K hours of speech-to-speech interpretation data for 15x15 directions
- 29 hours of transcribed speech data of non-native English intended for research in ASR for accented speech (15 L2 accents)

The raw data comes from European Parliament event recordings made between 2009 and 2020.

In this tutorial, we will only use the German portion of the dataset.

## Download

First, we clone the VoxPopuli repository and install its Python requirements.

In [3]:
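%%bash
# Clone only if the directory is not already present, so the cell can be re-run safely.
[ -d voxpopuli ] || git clone https://github.com/facebookresearch/voxpopuli.git
cd voxpopuli
pip3 install -r requirements.txt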

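Next, we create a folder for the raw data and download the recordings of the ASR subset. We then extract the transcribed German data (`--lang de`) from them.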

In [None]:
!mkdir -p ./data/raw/voxpopuli
!python3 -m voxpopuli.download_audios --root ./data/raw/voxpopuli --subset asr


In [None]:
!python3 -m voxpopuli.get_asr_data --root ./data/raw/voxpopuli --lang de
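
Both commands can run for a long time, so a quick sanity check on the download folder may be worthwhile before moving on. The sketch below simply counts files by extension under a root directory. The helper name `count_files_by_suffix` is hypothetical, and the sketch makes no assumption about the folder layout that VoxPopuli produces.

```python
from collections import Counter
from pathlib import Path


def count_files_by_suffix(root):
    """Count regular files under `root`, grouped by lowercase file extension."""
    counts = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            counts[path.suffix.lower()] += 1
    return counts


# Example usage, with the root path taken from the commands above:
# print(count_files_by_suffix("./data/raw/voxpopuli"))
```

If the counts are empty or much smaller than expected, the download or extraction step probably did not finish.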

## Preprocessing

Next, we standardize the audio data and convert the raw annotations to the NeMo manifest format.

**Audio data**: Audio acquired from different sources is inherently heterogeneous in file format, sample rate, bit depth, and number of channels. As a preprocessing step, we therefore build a separate data ingestion pipeline for each source and convert its audio to a common format with the following characteristics:
- WAV format
- Bit depth: 16 bits
- Sample rate: 16 kHz
- Channels: 1 (mono)
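
The internals of `process_voxpopuli.py` are not shown here, so the following is only a minimal sketch of how the target format above could be produced and verified; it is not the script's actual implementation. It assumes the `ffmpeg` command-line tool is installed for the conversion, and the helper names `build_ffmpeg_command` and `is_target_format` are hypothetical. The check itself uses only Python's standard `wave` module.

```python
import wave


def build_ffmpeg_command(src, dst):
    """Return an ffmpeg argument list that writes `dst` as 16 kHz mono 16-bit PCM WAV."""
    return [
        "ffmpeg", "-y",        # overwrite the output file if it already exists
        "-i", str(src),        # input file in any format ffmpeg can decode
        "-ar", "16000",        # sample rate: 16 kHz
        "-ac", "1",            # single audio channel
        "-c:a", "pcm_s16le",   # 16-bit little-endian PCM samples
        str(dst),
    ]


def is_target_format(wav_path):
    """Check a WAV file header for 16 kHz, mono, 16-bit samples."""
    with wave.open(str(wav_path), "rb") as wav:
        return (
            wav.getframerate() == 16000
            and wav.getnchannels() == 1
            and wav.getsampwidth() == 2  # sample width in bytes
        )
```

With `ffmpeg` available, a single file could be converted with `subprocess.run(build_ffmpeg_command("in.ogg", "out.wav"), check=True)` and then checked with `is_target_format("out.wav")`.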

In [1]:
!mkdir -p ./data/processed/voxpopuli
!python3 ./data_ingestion/datasets/voxpopuli/process_voxpopuli.py --data_root=./data/raw/voxpopuli/transcribed_data --out_dir=./data/processed/voxpopuli

100%|██████████████████████████████████| 108473/108473 [10:31<00:00, 171.78it/s]
100%|██████████████████████████████████████| 1968/1968 [00:11<00:00, 178.26it/s]
100%|██████████████████████████████████████| 2109/2109 [00:12<00:00, 175.65it/s]
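
For reference, a NeMo ASR manifest is a text file with one JSON object per line, each holding the keys `audio_filepath`, `duration` (in seconds) and `text`. The sketch below shows one way such a file could be written. The function name `write_manifest` and the shape of its input are assumptions for illustration, not the code used by `process_voxpopuli.py`.

```python
import json


def write_manifest(entries, manifest_path):
    """Write (audio_filepath, duration_seconds, text) tuples as JSON lines."""
    with open(manifest_path, "w", encoding="utf-8") as f:
        for audio_filepath, duration, text in entries:
            record = {
                "audio_filepath": audio_filepath,
                "duration": round(duration, 3),  # seconds
                "text": text,
            }
            # ensure_ascii=False keeps German umlauts readable in the file
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


# Example usage with a made-up entry:
# write_manifest([("clip_0001.wav", 4.2, "guten morgen")], "train_manifest.json")
```

Writing one object per line, rather than a single JSON array, lets the file be streamed and concatenated without parsing it as a whole.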
