# VoxPopuli Dataset

[VoxPopuli](https://aclanthology.org/2021.acl-long.80) is a large-scale multilingual speech corpus for representation learning, semi-supervised learning, and interpretation. VoxPopuli provides:

- 400K hours of unlabeled speech data for 23 languages
- 1.8K hours of transcribed speech data for 16 languages
- 17.3K hours of speech-to-speech interpretation data for 15x15 directions
- 29 hours of transcribed speech data of non-native English intended for research in ASR for accented speech (15 L2 accents)

The raw data comes from European Parliament event recordings made between 2009 and 2020.

In this tutorial, we will only use the German portion of the dataset.

## Download

First, we clone the VoxPopuli repository and install its Python dependencies.

In [None]:
%%bash 
git clone https://github.com/facebookresearch/voxpopuli.git
cd voxpopuli
pip3 install -r requirements.txt

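Next, we create a folder for the raw data and download the audio for the ASR subset. The cell after that segments the recordings and pairs them with the German transcripts.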

In [None]:
!mkdir -p ./data/raw/voxpopuli
!cd voxpopuli && python3 -m voxpopuli.download_audios --root ../data/raw/voxpopuli --subset asr


In [None]:
!cd voxpopuli && python3 -m voxpopuli.get_asr_data --root ../data/raw/voxpopuli --lang de
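
Before moving on, it can help to check what the previous step wrote. The sketch below reads a tab-separated manifest with the standard library only. The file location `transcribed_data/de/asr_train.tsv` and the column names it contains are assumptions here, not something this tutorial confirms, so check the header of the file on disk.

```python
import csv
from pathlib import Path


def read_tsv_manifest(tsv_path, limit=None):
    """Read a tab-separated manifest into a list of dicts keyed by the header row."""
    rows = []
    with open(tsv_path, newline="", encoding="utf-8") as f:
        # QUOTE_NONE: quote characters inside transcripts are kept verbatim.
        reader = csv.DictReader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        for i, row in enumerate(reader):
            if limit is not None and i >= limit:
                break
            rows.append(row)
    return rows


if __name__ == "__main__":
    # Assumed location; adjust if the tooling writes the manifest elsewhere.
    manifest = Path("./data/raw/voxpopuli/transcribed_data/de/asr_train.tsv")
    if manifest.exists():
        for row in read_tsv_manifest(manifest, limit=3):
            print(row)
    else:
        print(f"{manifest} not found; run the download steps first.")
```

Printing only the first few rows keeps the output short while still showing which columns are available.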

## Preprocessing

Next, we standardize the audio and convert the raw annotations to the NeMo manifest format.

**Audio data**: Audio data acquired from different sources is inherently heterogeneous (file format, sample rate, bit depth, number of audio channels, and so on). As a preprocessing step, we therefore build a separate data ingestion pipeline for each source and convert its audio to a common format with the following characteristics:
- WAV format
- Bit depth: 16 bits
- Sample rate: 16 kHz
- Single audio channel (mono)
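
The target format above can be produced and verified in several ways. As a minimal sketch, and not necessarily what `process_voxpopuli.py` does internally, the code below builds an `ffmpeg` command for the conversion and checks a WAV header with Python's `wave` module. It assumes `ffmpeg` is installed and on `PATH`; the function names are illustrative.

```python
import shutil
import subprocess
import wave

TARGET_RATE = 16000   # 16 kHz
TARGET_CHANNELS = 1   # mono
TARGET_SAMPWIDTH = 2  # bytes per sample, i.e. 16 bits


def ffmpeg_command(src, dst):
    """Build an ffmpeg call that writes 16 kHz, mono, 16-bit PCM WAV."""
    return [
        "ffmpeg", "-y",          # overwrite dst if it already exists
        "-i", str(src),
        "-ac", str(TARGET_CHANNELS),
        "-ar", str(TARGET_RATE),
        "-c:a", "pcm_s16le",     # 16-bit little-endian PCM
        str(dst),
    ]


def convert(src, dst):
    """Run the conversion; requires ffmpeg on PATH (an assumption of this sketch)."""
    if shutil.which("ffmpeg") is None:
        raise RuntimeError("ffmpeg not found on PATH")
    subprocess.run(ffmpeg_command(src, dst), check=True)


def is_target_format(wav_path):
    """Return True if the WAV header matches the target format."""
    with wave.open(str(wav_path), "rb") as w:
        return (
            w.getframerate() == TARGET_RATE
            and w.getnchannels() == TARGET_CHANNELS
            and w.getsampwidth() == TARGET_SAMPWIDTH
        )
```

Running `is_target_format` over a sample of the converted files is a cheap sanity check before training.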

In [None]:
!mkdir -p ./data/processed/voxpopuli
!python3 ./data_ingestion/process_voxpopuli.py --data_root=./data/raw/voxpopuli/transcribed_data --out_dir=./data/processed/voxpopuli
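
A NeMo manifest is a text file with one JSON object per line, typically carrying the keys `audio_filepath`, `duration`, and `text`. The sketch below shows how such a file could be written from (WAV path, transcript) pairs using only the standard library. The helper names are hypothetical, and the ingestion script used above is not shown in this tutorial, so treat this as an illustration of the format rather than a copy of that script.

```python
import json
import wave


def wav_duration(wav_path):
    """Duration in seconds, computed from the WAV header."""
    with wave.open(str(wav_path), "rb") as w:
        return w.getnframes() / w.getframerate()


def write_manifest(entries, manifest_path):
    """Write (wav_path, transcript) pairs as one JSON object per line."""
    with open(manifest_path, "w", encoding="utf-8") as f:
        for wav_path, text in entries:
            record = {
                "audio_filepath": str(wav_path),
                "duration": round(wav_duration(wav_path), 3),
                "text": text,
            }
            # ensure_ascii=False keeps German umlauts readable in the file.
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Reading the duration from the header avoids decoding the audio, which keeps manifest creation fast on large datasets.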

In [None]:
# Optional: to free disk space, uncomment the command below to remove the raw dataset.

#! rm -rf ./data/raw/voxpopuli