# Formatting Datasets

In this notebook, we build the same datasets as those used in the original LEAF paper.  
Although the original paper states that "Default train/test splits are always adopted", many of the datasets do not have a default split.  
Therefore, in this notebook, we define our own split for each dataset that lacks one.

***

## Common functions (run the following block before the other blocks)

In [None]:
from pathlib import Path

import librosa
import numpy as np
import pandas as pd
import scipy.io.wavfile
import shutil
import soundfile


def copy_with_format_wave(path_from, path_to, target_rate=16000):
    """Copy audio data after formatting it (16 kHz, PCM_16, mono)."""
    # Use PCM_16 to unify the bit depth and reduce the data volume.
    info = soundfile.info(path_from)
    if info.samplerate == target_rate and info.subtype == 'PCM_16' and info.channels == 1:
        # Just copy
        shutil.copyfile(path_from, path_to)
    else:
        # Resample to the target rate and convert to mono (as a side effect, the range becomes [-1.0, 1.0]).
        wave, sr = librosa.load(path_from, sr=target_rate, mono=True)
        # change the range to [-32768, 32767] (PCM_16)
        wave = np.clip(wave * 32768, -32768, 32767).astype(np.int16)
        scipy.io.wavfile.write(path_to, target_rate, wave)

***
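
The float-to-PCM_16 step inside `copy_with_format_wave` can be checked in isolation. The following is a minimal sketch using NumPy only; the helper name `float_to_pcm16` and the sample values are illustrative assumptions and are not part of the original notebook.

```python
import numpy as np


def float_to_pcm16(wave):
    """Scale a float waveform in [-1.0, 1.0] to int16, clipping at the int16 limits."""
    # Hypothetical helper that mirrors the scaling used in copy_with_format_wave.
    return np.clip(wave * 32768, -32768, 32767).astype(np.int16)


samples = np.array([-1.0, 0.0, 0.5, 1.0])  # made-up sample values
pcm = float_to_pcm16(samples)
print(pcm.tolist())  # [-32768, 0, 16384, 32767]
```

Note that +1.0 is clipped to 32767, because 32768 does not fit into a signed 16-bit integer.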

## TUT Urban 2018

### Dataset information

Ref: T. Heittola, A. Mesaros, and T. Virtanen, “TUT Urban Acoustic Scenes 2018, Development dataset.” Apr. 2018, doi: 10.5281/zenodo.1228142.

Dataset URL: https://zenodo.org/record/1228142#.YJZMsWamO3I

* Acoustic scenes
* 10 classes
* 8,640 samples
  * train: 6,122 samples
  * eval: 2,518 samples

### Split

We use the same split as the original.

### Procedure

1. Download all data from https://zenodo.org/record/1228142#.YJZMsWamO3I
2. Compose the directory tree as follows:

```
/foo/bar/TUT_urban_2018/
  ├ audio/____.wav
  ├ evaluation_setup/____.txt
  ├ LICENSE
  ├ meta.csv
  ├ README.html
  └ README.md
```

3. Format the dataset (48 kHz, 24 bit, 2 ch -> 16 kHz, 16 bit, 1 ch)

```
../datasets/TUT_urban_2018/
  ├ eval/____.wav
  ├ eval.csv
  ├ train/____.wav
  └ train.csv
```
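
After formatting, it can be useful to confirm that an output file really is 16 kHz, 16 bit, 1 ch. The sketch below does this with the standard-library `wave` module, which only reads uncompressed PCM WAV files. The helper name `is_target_format` and the temporary demo file are assumptions made for illustration.

```python
import tempfile
import wave
from pathlib import Path


def is_target_format(path, rate=16000, sample_width=2, channels=1):
    """Return True if a PCM WAV file is `rate` Hz, 16 bit (2 bytes per sample), mono."""
    with wave.open(str(path), 'rb') as f:
        return (f.getframerate() == rate
                and f.getsampwidth() == sample_width
                and f.getnchannels() == channels)


# Self-contained demo: write one second of 16 kHz, 16-bit, mono silence and check it.
demo_path = Path(tempfile.mkdtemp()) / 'demo.wav'
with wave.open(str(demo_path), 'wb') as f:
    f.setnchannels(1)
    f.setsampwidth(2)
    f.setframerate(16000)
    f.writeframes(b'\x00\x00' * 16000)

print(is_target_format(demo_path))
```

In practice, the same check could be looped over the files in `train/` and `eval/` of any formatted dataset.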

In [None]:
# from sklearn.preprocessing import LabelEncoder

# ################################################################################
# ### Configuration
# ################################################################################
# # ROOT_DIR = Path('/foo/bar/TUT_urban_2018')
# DIR_FROM = Path('/Users/sky/Documents/__earth/database/TUT_urban_2018')
# DIR_TO = Path('../datasets/TUT_urban_2018')


# ################################################################################
# ### Processing
# ################################################################################
# # Make directories.
# (DIR_TO/'train').mkdir(parents=True, exist_ok=True)
# (DIR_TO/'eval').mkdir(parents=True, exist_ok=True)

# # Read meta data (paths and labels)
# train_meta = pd.read_csv(DIR_FROM/'evaluation_setup/fold1_train.txt', delimiter='\t', header=None)
# eval_meta = pd.read_csv(DIR_FROM/'evaluation_setup/fold1_evaluate.txt', delimiter='\t', header=None)
# le = LabelEncoder().fit(train_meta.iloc[:, 1].values)  # With header=None, column 1 holds the label.

# for phase, meta in (('train', train_meta), ('eval', eval_meta)):
#     # Format meta data.
#     meta_2 = meta.copy()
#     meta_2.iloc[:, 0] = meta.iloc[:, 0].str.split('/').map(lambda x: x[1])  # 'audio/xxxx.wav' -> 'xxxx.wav'
#     meta_2.columns = ['audio_filename', 'label']  # Header
#     meta_2['label_id'] = le.transform(meta_2['label'].values)
#     meta_2.to_csv(DIR_TO/f'{phase}.csv', index=False)
#     # Format audio data.
#     for i, audio_path in enumerate(meta_2.audio_filename):
#         print(f'\r{phase:5} {i:08d} {audio_path}', end='')
#         copy_with_format_wave(DIR_FROM/'audio'/audio_path, DIR_TO/phase/audio_path)

***

## DCASE 2018 Task 3 Bird audio detection

### Dataset information

Paper: D. Stowell, M. D. Wood, H. Pamuła, Y. Stylianou, and H. Glotin, “Automatic acoustic detection of birds through deep learning: The first Bird Audio Detection challenge,” Methods Ecol. Evol., vol. 10, no. 3, pp. 368–380, Nov. 2018.

Dataset URL: http://dcase.community/challenge2018/task-bird-audio-detection

* Bird audio detection
* 2 classes
* 48,310 samples
  * train: 35,690 samples
    * freefield1010: 7,690 samples
    * warblrb10k: 8,000 samples
    * BirdVox-DCASE-20k: 20,000 samples
  * eval: 12,620 samples (labels not disclosed)
    * warblrb10k: 2,000 samples
    * Chernobyl: 6,620 samples
    * PolandNFC: 4,000 samples

### Split

DCASE 2018 was a competition, and the labels of its evaluation sets have not been published.
However, warblrb10k is included in both the training and the evaluation data.
Therefore, we use the labeled training portion of warblrb10k (8,000 samples) as our evaluation set.
In summary, the split is as follows:

* train: 27,690 samples
  * freefield1010: 7,690 samples
  * BirdVox-DCASE-20k: 20,000 samples
* eval: 8,000 samples
  * warblrb10k: 8,000 samples

### Procedure

1. Download all data from http://dcase.community/challenge2018/task-bird-audio-detection
2. Compose the directory tree as follows:

```
/foo/bar/DCASE2018_task3_bird_audio/
  └ train/
    ├ BirdVox-DCASE-20k/____.wav
    ├ ff1010bird/____.wav
    ├ warblrb10k/____.wav
    ├ BirdVoxDCASE20k_csvpublic.csv
    ├ ff1010bird_metadata_2018.csv
    └ warblrb10k_public_metadata_2018.csv
```

3. Format the dataset (44.1 kHz, 16 bit, 1 ch -> 16 kHz, 16 bit, 1 ch)

```
../datasets/DCASE2018_task3_bird_audio/
  ├ eval/____.wav
  ├ eval.csv
  ├ train/____.wav
  └ train.csv
```
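
Because the three source folders are merged into flat `train/` and `eval/` directories, the code block in this section prefixes each file name with its source dataset. The naming convention can be sketched as follows; the helper names and the item ID are hypothetical.

```python
def make_audio_filename(datasetid, itemid):
    """Build a flat file name that keeps the source dataset as a prefix."""
    return f'{datasetid}__{itemid}.wav'


def split_audio_filename(audio_filename):
    """Recover (source directory, original file name) from a flat file name."""
    dir_, file_ = audio_filename.split('__', 1)
    return dir_, file_


name = make_audio_filename('ff1010bird', 12345)  # 12345 is a made-up item ID
print(name)
print(split_audio_filename(name))
```

Splitting only on the first `__` keeps the sketch robust should an item ID ever contain a double underscore itself.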

In [None]:
# ################################################################################
# ### Configuration
# ################################################################################
# # ROOT_DIR = Path('/foo/bar/DCASE2018_task3_bird_audio/train')
# DIR_FROM = Path('/Users/sky/Documents/__earth/database/DCASE2018_task3_bird_audio/train')
# DIR_TO = Path('../datasets/DCASE2018_task3_bird_audio')


# ################################################################################
# ### Processing
# ################################################################################
# # Make directories.
# (DIR_TO/'train').mkdir(parents=True, exist_ok=True)
# (DIR_TO/'eval').mkdir(parents=True, exist_ok=True)

# # Read meta data (paths and labels)
# train_meta = pd.concat([
#     pd.read_csv(DIR_FROM/'ff1010bird_metadata_2018.csv'),
#     pd.read_csv(DIR_FROM/'BirdVoxDCASE20k_csvpublic.csv')
# ], axis=0)
# eval_meta = pd.read_csv(DIR_FROM/'warblrb10k_public_metadata_2018.csv')

# for phase, meta in (('train', train_meta), ('eval', eval_meta)):
#     # Format meta data.
#     meta_2 = meta.copy()
#     meta_2['audio_filename'] = meta['datasetid'].str.cat(meta['itemid'].astype(str), sep='__') + '.wav'
#     meta_2['label'] = meta['hasbird']
#     meta_2['label_id'] = meta['hasbird']
#     meta_2 = meta_2[['audio_filename', 'label', 'label_id']]
#     meta_2.to_csv(DIR_TO/f'{phase}.csv', index=False)
#     # Format audio data.
#     for i, audio_path in enumerate(meta_2.audio_filename):
#         print(f'\r{phase:5} {i:08d} {audio_path}', end='')
#         dir_, file_ = audio_path.split('__')
#         copy_with_format_wave(DIR_FROM/dir_/file_, DIR_TO/phase/audio_path)

***

## CREMA-D

### Dataset information

Paper: H. Cao, D. G. Cooper, M. K. Keutmann, R. C. Gur, A. Nenkova, and R. Verma, “CREMA-D: Crowd-sourced Emotional Multimodal Actors Dataset,” IEEE Trans Affect Comput, vol. 5, no. 4, pp. 377–390, Oct. 2014.

Dataset URL: https://github.com/CheyneyComputerScience/CREMA-D

* Emotion recognition
* 6 classes
* 7,442 samples (no split information)

### Split

The original CREMA-D does not define a split. However, the `tensorflow_datasets` package provided by TensorFlow sets up CREMA-D with a split ("train", "validation", "test").
If you run the following code in Google Colaboratory, you can get the speaker IDs of each partition (it can also be run locally if the `google.colab` download step is removed).

```python
!pip install -q tensorflow-datasets tensorflow pydub
```

```python
import tensorflow as tf
import tensorflow_datasets as tfds
from google.colab import files

for phase in ('train', 'validation', 'test'):
    ds = tfds.load('crema_d', split=phase)
    speakers = sorted(set(str(int(ex['speaker_id'].numpy())) for ex in ds.take(-1)))
    with open(f'{phase}_speakers.txt', 'w') as f:
        f.write('\n'.join(speakers))
    files.download(f'{phase}_speakers.txt')
```

We set up the split based on the files "train_speakers.txt", "validation_speakers.txt", and "test_speakers.txt" generated by the above code.
Among these, we assign "train_speakers.txt" to training, and "validation_speakers.txt" and "test_speakers.txt" to evaluation.
In summary, the split is as follows:

* train: 5,146 samples (63 speakers)
* eval: 2,296 samples
  * validation: 738 samples (9 speakers)
  * test: 1,558 samples (19 speakers)
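
The speaker-based assignment can be sketched as follows. CREMA-D file names consist of four underscore-separated fields, with the speaker ID first and the emotion label third. The helper names and the speaker set below are hypothetical; the real lists come from the speaker files generated above.

```python
from pathlib import PurePath


def parse_crema_d(filename):
    """Return (speaker_id, emotion_label) from a CREMA-D file name."""
    speaker_id, _sentence, emotion, _level = PurePath(filename).stem.split('_')
    return speaker_id, emotion


def assign_phase(filename, train_speakers):
    """Files of training speakers go to 'train'; all others go to 'eval'."""
    speaker_id, _ = parse_crema_d(filename)
    return 'train' if speaker_id in train_speakers else 'eval'


# Hypothetical speaker set, used only to make the example self-contained.
train_speakers = {'1001', '1002'}
for name in ('1001_DFA_ANG_XX.wav', '1091_IEO_HAP_HI.wav'):
    print(name, parse_crema_d(name), assign_phase(name, train_speakers))
```

Comparing the full speaker ID, rather than a file-name prefix, avoids accidental matches between IDs that share leading digits.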

### Procedure

1. Download all data except for binary data from https://github.com/CheyneyComputerScience/CREMA-D
2. Download the binary data using git-lfs.
3. Get the split information ("train_speakers.txt", "validation_speakers.txt", and "test_speakers.txt") by using tensorflow_datasets (see section Split).
4. Compose the directory tree as follows:

```
/foo/bar/CREMA-D/
  ├ train_speakers.txt       <- generated by using tensorflow_datasets (see section Split)
  ├ validation_speakers.txt  <- generated by using tensorflow_datasets (see section Split)
  ├ test_speakers.txt        <- generated by using tensorflow_datasets (see section Split)
  ├ AudioMP3/____.mp3
  ├ AudioWAV/____.wav
  ├ docs/
  ├ finishedEmoResponses.csv
  ├ finishedResponses.csv
  ├ finishedResponsesWithRepeatWithPractice.csv
  ├ LICENSE.txt
  ├ processedResults/
  ├ processFinishedResponses.R
  ├ README.md
  ├ readTabulatedVotes.R
  ├ SentenceFilenames.csv
  ├ summarizeVotes.r
  ├ tabulateVotesV2.r
  ├ VideoDemographics.csv
  └ VideoFlash/____.flv
```

5. Format the dataset (16 kHz, 16 bit, 1 ch -> 16 kHz, 16 bit, 1 ch; no change)

```
../datasets/CREMA-D/
  ├ eval/____.wav
  ├ eval.csv
  ├ train/____.wav
  └ train.csv
```

In [None]:
# from sklearn.preprocessing import LabelEncoder

# ################################################################################
# ### Configuration
# ################################################################################
# # ROOT_DIR = Path('/foo/bar/CREMA-D')
# DIR_FROM = Path('/Users/sky/Documents/__earth/database/CREMA-D/')
# DIR_TO = Path('../datasets/CREMA-D')


# ################################################################################
# ### Processing
# ################################################################################
# # Make directories.
# (DIR_TO/'train').mkdir(parents=True, exist_ok=True)
# (DIR_TO/'eval').mkdir(parents=True, exist_ok=True)

# # Format meta data (paths and labels)
# with open(DIR_FROM/'train_speakers.txt', 'r') as f:
#     train_speakers = f.read().split('\n')
# with open(DIR_FROM/'validation_speakers.txt', 'r') as f:
#     eval_speakers = f.read().split('\n')
# with open(DIR_FROM/'test_speakers.txt', 'r') as f:
#     eval_speakers += f.read().split('\n')
# train_meta = pd.DataFrame([
#     {
#         'orig_path': path,
#         'audio_filename': path.name,
#         'label': path.name.split('_')[2]
#     }
#     for train_speaker in train_speakers
#     for path in (DIR_FROM/'AudioWAV').rglob(f'{train_speaker}*.wav')
# ])
# eval_meta = pd.DataFrame([
#     {
#         'orig_path': path,
#         'audio_filename': path.name,
#         'label': path.name.split('_')[2]
#     }
#     for eval_speaker in eval_speakers
#     for path in (DIR_FROM/'AudioWAV').rglob(f'{eval_speaker}*.wav')
# ])
# le = LabelEncoder().fit(train_meta['label'].values)
# train_meta['label_id'] = le.transform(train_meta['label'].values)
# eval_meta['label_id'] = le.transform(eval_meta['label'].values)

# for phase, meta in (('train', train_meta), ('eval', eval_meta)):
#     # Format meta data.
#     meta_2 = meta
#     meta_2.drop('orig_path', axis=1).to_csv(DIR_TO/f'{phase}.csv', index=False)
#     # Format audio data.
#     for i, (orig_path, audio_path) in enumerate(zip(meta_2.orig_path, meta_2.audio_filename)):
#         print(f'\r{phase:5} {i:08d} {audio_path}', end='')
#         copy_with_format_wave(orig_path, DIR_TO/phase/audio_path)

***

## VoxCeleb1

### Dataset information

Paper: A. Nagrani, J. S. Chung, and A. Zisserman, “VoxCeleb: a large-scale speaker identification dataset,” arXiv preprint arXiv:1706.08612, 2017.

Dataset URL: https://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox1.html

* Speaker ID
* 153,516 samples (1,251 speakers)
  * train: 148,642 samples (1,211 speakers)
  * test: 4,874 samples (40 speakers)

### Split

We use both the original train and test sets.
For each speaker, we split the data so that roughly 90% is used for training and 10% for evaluation.
More precisely, we ensure that at least 10% of each speaker's samples go to the evaluation set, so the overall share of the evaluation set exceeds 10%.
To make the task harder, no YouTube video ID appears in both the training and evaluation sets.
Executing the code block below divided the dataset into the following numbers of samples:

* train: 128,086 samples (1,251 speakers; 83.4%)
* eval: 25,430 samples (1,251 speakers; 16.6%)
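
The splitting rule can be sketched in plain Python as follows: whole groups (here, video IDs) are shuffled and assigned to training while the cumulative share of samples stays below 90%, and the remaining groups go to evaluation. This is an illustrative sketch, not the notebook's implementation. The function name and toy data are assumptions, and because it uses `random` rather than NumPy's `RandomState`, it will not reproduce the sample counts listed above.

```python
import random
from collections import Counter


def split_groups(group_ids, train_fraction=0.9, seed=0):
    """Split groups so that train holds less than `train_fraction` of the samples.

    `group_ids` has one entry per sample (e.g. the video ID of each utterance).
    Returns two disjoint sets of group IDs: (train_groups, eval_groups).
    """
    counts = Counter(group_ids)
    groups = sorted(counts)
    random.Random(seed).shuffle(groups)  # fixed seed for a repeatable order
    total = len(group_ids)
    train_groups, eval_groups, cumulative = set(), set(), 0
    for group in groups:
        cumulative += counts[group]
        if cumulative / total < train_fraction:
            train_groups.add(group)
        else:
            eval_groups.add(group)
    return train_groups, eval_groups


# Toy data for one speaker: three videos with 5, 3, and 2 utterances.
video_ids = ['v1'] * 5 + ['v2'] * 3 + ['v3'] * 2
train_videos, eval_videos = split_groups(video_ids)
train_share = sum(v in train_videos for v in video_ids) / len(video_ids)
print(sorted(train_videos), sorted(eval_videos), train_share)
```

Since the last group in the shuffled order always brings the cumulative share to 100%, the evaluation set is never empty, which is why its overall share ends up above 10%.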

### Procedure

1. Download all data from https://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox1.html (form application required)
2. Compose the directory tree as follows:

```
/foo/bar/VoxCeleb1/
  ├ train/
  │   ├ txt/id____/____/____.txt
  │   └ wav/id____/____/____.wav
  ├ test/
  │   ├ txt/id____/____/____.txt
  │   └ wav/id____/____/____.wav
  └ vox1_meta.csv
```

3. Format the dataset (16 kHz, 16 bit, 1 ch -> 16 kHz, 16 bit, 1 ch; no change)

```
../datasets/VoxCeleb1/
  ├ eval/____.wav
  ├ eval.csv
  ├ train/____.wav
  └ train.csv
```

In [None]:
# from sklearn.preprocessing import LabelEncoder

# ################################################################################
# ### Configuration
# ################################################################################
# # ROOT_DIR = Path('/foo/bar/VoxCeleb1')
# DIR_FROM = Path('/Volumes/ARCHIVE_SSD/Database/_zip/VoxCeleb1')
# DIR_TO = Path('../datasets/VoxCeleb1')


# ################################################################################
# ### Preprocessing
# ################################################################################

# # Format meta data (audio_filename=speakerID__videoID__{number}.wav, label=speakerID)
# master_df = pd.DataFrame([
#     {
#         'orig_path': path,
#         'audio_filename': '__'.join(path.parts[-3:]),
#         'video_id': path.parts[-2],
#         'label': path.parts[-3]
#     }
#     for path in DIR_FROM.rglob('*.wav')
#     if not path.name.startswith('.')
# ])
# master_df['label_id'] = LabelEncoder().fit_transform(master_df['label'].values)
# train_df = []
# eval_df = []


# # Process for each label
# rng = np.random.RandomState(0)  # Fix random seed.
# for label, label_df in master_df.groupby('label'):
#     # Split videoIDs so that they do not overlap.
#     # The videoIDs are randomized.
#     # Train=90%, Eval=10%
#     video_ids = list(label_df['video_id'].unique())
#     rng.shuffle(video_ids)
#     video_ids_count = label_df.groupby('video_id').count().loc[video_ids]
#     video_ids_cumsum_count = video_ids_count.cumsum()
#     video_ids_relative_cumsum_count = video_ids_cumsum_count / len(label_df)
#     train_video_ids \
#         = video_ids_relative_cumsum_count.loc[video_ids_relative_cumsum_count.iloc[:, 0] < 0.9].index
#     train_video_ids = sorted(set(train_video_ids))
#     eval_video_ids = sorted(set(video_ids).difference(train_video_ids))
#     train_df.append(label_df.query('video_id in @train_video_ids'))
#     eval_df.append(label_df.query('video_id in @eval_video_ids'))
# #     print(f'{label} train={len(train_df[-1]):4} files, eval={len(eval_df[-1]):4} files')
# train_df = pd.concat(train_df, axis=0)
# eval_df = pd.concat(eval_df, axis=0)


# ################################################################################
# ### Processing
# ################################################################################
# # Make directories.
# (DIR_TO/'train').mkdir(parents=True, exist_ok=True)
# (DIR_TO/'eval').mkdir(parents=True, exist_ok=True)

# for phase, df in (('train', train_df), ('eval', eval_df)):
#     df.drop('orig_path', axis=1).to_csv(DIR_TO/f'{phase}.csv', index=False)
#     # Format audio data.
#     for i, (orig_path, audio_filename) in enumerate(zip(df.orig_path, df.audio_filename)):
#         print(f'\r{phase:5} {i:08d} {audio_filename}', end='')
#         copy_with_format_wave(orig_path, DIR_TO/phase/audio_filename)

***

## NSynth

### Dataset information

Paper: J. Engel et al., “Neural audio synthesis of musical notes with wavenet autoencoders,” in International Conference on Machine Learning, 2017, pp. 1068–1077.

Dataset URL: https://magenta.tensorflow.org/datasets/nsynth

* Musical instrument and pitch classification
* 305,979 samples (11 instruments, 112 pitches)
  * train: 289,205 samples (11 instruments, 112 pitches)
  * valid: 12,678 samples (10 instruments, 112 pitches)
  * test: 4,096 samples (10 instruments, 106 pitches)

### Split

The original training set is used for training, and the original validation and test sets are used for evaluation.
In summary, the split is as follows:

* train: 289,205 samples (11 instruments, 112 pitches)
* eval: 16,774 samples (10 instruments, 112 pitches)
  * valid: 12,678 samples (10 instruments, 112 pitches)
  * test: 4,096 samples (10 instruments, 106 pitches)
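
Because the evaluation data covers only pitches that also occur in training, the code block in this section fits the pitch-to-ID mapping on the training metadata and reuses it for evaluation. The sketch below shows the idea in plain Python, assigning IDs in sorted order as scikit-learn's `LabelEncoder` does. The function name and pitch values are made up for illustration.

```python
def fit_label_ids(train_labels):
    """Map each distinct training label to an integer ID, in sorted order."""
    return {label: i for i, label in enumerate(sorted(set(train_labels)))}


# Made-up pitch values standing in for the real metadata.
train_pitches = [60, 21, 108, 60]
eval_pitches = [108, 21]

pitch_ids = fit_label_ids(train_pitches)
print(pitch_ids)
print([pitch_ids[p] for p in eval_pitches])
```

A label that never appears in training has no ID and would raise a `KeyError` in this sketch, so the mapping must be fitted on the split with the widest label coverage.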

### Procedure

1. Download all data from https://magenta.tensorflow.org/datasets/nsynth
2. Compose the directory tree as follows:

```
/foo/bar/NSynth/
  ├ nsynth-train/
  │   ├ audio/____.wav
  │   └ examples.json
  ├ nsynth-valid/
  │   ├ audio/____.wav
  │   └ examples.json
  └ nsynth-test/
      ├ audio/____.wav
      └ examples.json
```

3. Format the dataset (16 kHz, 16 bit, 1 ch -> 16 kHz, 16 bit, 1 ch; no change)

```
../datasets/NSynth/
  ├ eval/____.wav
  ├ eval.csv
  ├ train/____.wav
  └ train.csv
```

In [None]:
# from sklearn.preprocessing import LabelEncoder

# ################################################################################
# ### Configuration
# ################################################################################
# # ROOT_DIR = Path('/foo/bar/NSynth')
# DIR_FROM = Path('/Users/sky/Documents/__earth/database/NSynth')
# DIR_TO = Path('../datasets/NSynth')


# ################################################################################
# ### Processing
# ################################################################################
# # Make directories.
# (DIR_TO/'train').mkdir(parents=True, exist_ok=True)
# (DIR_TO/'eval').mkdir(parents=True, exist_ok=True)

# # Format meta data
# train_meta = pd.read_json(DIR_FROM/'nsynth-train/examples.json').T
# le_pitch = LabelEncoder().fit(train_meta['pitch'].values)
# train_meta['audio_filename'] = train_meta['note_str'] + '.wav'
# train_meta['orig_path'] = 'nsynth-train/audio/' + train_meta['audio_filename']
# train_meta['pitch_id'] = le_pitch.transform(train_meta['pitch'].values)
# train_meta = train_meta[['orig_path', 'audio_filename', 'instrument_family_str', 'instrument_family', 'pitch', 'pitch_id']]
# train_meta.columns = ['orig_path', 'audio_filename', 'inst', 'inst_id', 'pitch', 'pitch_id']

# valid_meta = pd.read_json(DIR_FROM/'nsynth-valid/examples.json').T
# valid_meta['audio_filename'] = valid_meta['note_str'] + '.wav'
# valid_meta['orig_path'] = 'nsynth-valid/audio/' + valid_meta['audio_filename']
# valid_meta['pitch_id'] = le_pitch.transform(valid_meta['pitch'].values)
# valid_meta = valid_meta[['orig_path', 'audio_filename', 'instrument_family_str', 'instrument_family', 'pitch', 'pitch_id']]
# valid_meta.columns = ['orig_path', 'audio_filename', 'inst', 'inst_id', 'pitch', 'pitch_id']

# test_meta = pd.read_json(DIR_FROM/'nsynth-test/examples.json').T
# test_meta['audio_filename'] = test_meta['note_str'] + '.wav'
# test_meta['orig_path'] = 'nsynth-test/audio/' + test_meta['audio_filename']
# test_meta['pitch_id'] = le_pitch.transform(test_meta['pitch'].values)
# test_meta = test_meta[['orig_path', 'audio_filename', 'instrument_family_str', 'instrument_family', 'pitch', 'pitch_id']]
# test_meta.columns = ['orig_path', 'audio_filename', 'inst', 'inst_id', 'pitch', 'pitch_id']

# eval_meta = pd.concat([valid_meta, test_meta], axis=0)

# for phase, meta in (('train', train_meta), ('eval', eval_meta)):
#     meta_2 = meta
#     meta_2.drop('orig_path', axis=1).to_csv(DIR_TO/f'{phase}.csv', index=False)
#     # Format audio data.
#     for i, (orig_path, audio_path) in enumerate(zip(meta_2.orig_path, meta_2.audio_filename)):
#         print(f'\r{phase:5} {i:08d} {audio_path}', end='')
#         copy_with_format_wave(DIR_FROM/orig_path, DIR_TO/phase/audio_path)

***

## SpeechCommands

### Dataset information

Paper: P. Warden, “Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition,” arXiv [cs.CL], Apr. 09, 2018.

Dataset URLs:

| Version | URL |
| :- | :- |
| v0.0.1      | http://download.tensorflow.org/data/speech_commands_v0.01.tar.gz |
| v0.0.1_test | http://download.tensorflow.org/data/speech_commands_test_set_v0.01.tar.gz |
| v0.0.2      | http://download.tensorflow.org/data/speech_commands_v0.02.tar.gz |
| v0.0.2_test | http://download.tensorflow.org/data/speech_commands_test_set_v0.02.tar.gz |

* Speech commands
* 35 words
* Multiple subsets
  * v0.0.1
  * v0.0.1_test
  * v0.0.2: 105,829 samples (35 words) + 6 samples (6 noise types)
    * train: 84,843 samples (35 words)
    * validation: 9,981 samples (35 words)
    * testing: 11,005 samples (35 words)
    * noise: 6 samples (6 noise types)
  * v0.0.2_test

### Split

The original training set is used for training, and the original validation and test sets are used for evaluation.
In summary, the split is as follows:

* train: 84,843 samples (35 words)
* eval: 20,986 samples (35 words)
  * validation: 9,981 samples (35 words)
  * testing: 11,005 samples (35 words)
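
The archive ships only `validation_list.txt` and `testing_list.txt`, so the training set is whatever word recordings remain after those two lists are removed. A minimal sketch of that set arithmetic follows, together with the flattened `word__file.wav` naming used for the output. The function names and file names are hypothetical.

```python
def split_file_lists(all_files, validation_list, testing_list):
    """Return (train_files, eval_files); train is everything not listed for eval."""
    eval_files = sorted(set(validation_list) | set(testing_list))
    train_files = sorted(set(all_files) - set(eval_files))
    return train_files, eval_files


def flat_name(relative_path):
    """'word/file.wav' -> 'word__file.wav', keeping the label in the file name."""
    return '__'.join(relative_path.split('/')[-2:])


# Made-up relative paths in the same 'word/file.wav' layout as the archive.
all_files = ['bed/a.wav', 'bed/b.wav', 'cat/c.wav', 'cat/d.wav']
train_files, eval_files = split_file_lists(all_files, ['bed/b.wav'], ['cat/d.wav'])
print(train_files, eval_files)
print([flat_name(p) for p in train_files])
```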

### Procedure

1. Download all data from http://download.tensorflow.org/data/speech_commands_v0.02.tar.gz
2. Compose the directory tree as follows:

```
/foo/bar/speech_commands_v0.02/
  ├ _background_noise_/____.wav
  ├ backward/____.wav
  ├ bed/____.wav
  ├ bird/____.wav
  ├ cat/____.wav
  ├ dog/____.wav
  ├ down/____.wav
  ├ eight/____.wav
  ├ five/____.wav
  ├ follow/____.wav
  ├ forward/____.wav
  ├ four/____.wav
  ├ go/____.wav
  ├ happy/____.wav
  ├ house/____.wav
  ├ learn/____.wav
  ├ left/____.wav
  ├ marvin/____.wav
  ├ nine/____.wav
  ├ no/____.wav
  ├ off/____.wav
  ├ on/____.wav
  ├ one/____.wav
  ├ right/____.wav
  ├ seven/____.wav
  ├ sheila/____.wav
  ├ six/____.wav
  ├ stop/____.wav
  ├ three/____.wav
  ├ tree/____.wav
  ├ two/____.wav
  ├ up/____.wav
  ├ visual/____.wav
  ├ wow/____.wav
  ├ yes/____.wav
  ├ zero/____.wav
  ├ LICENSE
  ├ README.md
  ├ testing_list.txt
  └ validation_list.txt
```

3. Format the dataset (16 kHz, 16 bit, 1 ch -> 16 kHz, 16 bit, 1 ch; no change)

```
../datasets/speech_commands_v0.02/
  ├ noise/____.wav
  ├ eval/____.wav
  ├ eval.csv
  ├ train/____.wav
  └ train.csv
```

In [None]:
# from sklearn.preprocessing import LabelEncoder

# ################################################################################
# ### Configuration
# ################################################################################
# # ROOT_DIR = Path('/foo/bar/speech_commands_v0.02')
# DIR_FROM = Path('/Users/sky/Documents/__earth/database/speech_commands_v0.02')
# DIR_TO = Path('../datasets/speech_commands_v0.02')


# ################################################################################
# ### Processing
# ################################################################################
# # Make directories.
# (DIR_TO/'train').mkdir(parents=True, exist_ok=True)
# (DIR_TO/'eval').mkdir(parents=True, exist_ok=True)
# (DIR_TO/'noise').mkdir(parents=True, exist_ok=True)

# # Format meta data
# word_paths = []
# noise_paths = []
# for p in DIR_FROM.rglob('*.wav'):
#     if p.parts[-2] == '_background_noise_':
#         noise_paths.append(p)
#     else:
#         word_paths.append(p)
# with open(DIR_FROM/'validation_list.txt', 'r') as f:
#     valid_paths = [DIR_FROM/p for p in f.read().split('\n') if p != '']
# with open(DIR_FROM/'testing_list.txt', 'r') as f:
#     test_paths = [DIR_FROM/p for p in f.read().split('\n') if p != '']

# eval_paths = sorted(valid_paths + test_paths)
# train_paths = sorted(set(word_paths).difference(set(eval_paths)))

# train_meta = pd.DataFrame({'orig_path': train_paths})
# train_meta['audio_filename'] = train_meta['orig_path'].apply(lambda x: '__'.join(x.parts[-2:]))
# train_meta['label'] = train_meta['orig_path'].apply(lambda x: x.parts[-2])
# le = LabelEncoder().fit(train_meta['label'].values)
# train_meta['label_id'] = le.transform(train_meta['label'].values)

# eval_meta = pd.DataFrame({'orig_path': eval_paths})
# eval_meta['audio_filename'] = eval_meta['orig_path'].apply(lambda x: '__'.join(x.parts[-2:]))
# eval_meta['label'] = eval_meta['orig_path'].apply(lambda x: x.parts[-2])
# eval_meta['label_id'] = le.transform(eval_meta['label'].values)

# for phase, meta in (('train', train_meta), ('eval', eval_meta)):
#     meta_2 = meta
#     meta_2.drop('orig_path', axis=1).to_csv(DIR_TO/f'{phase}.csv', index=False)
#     # Format audio data.
#     for i, (orig_path, audio_path) in enumerate(zip(meta_2.orig_path, meta_2.audio_filename)):
#         print(f'\r{phase:5} {i:08d} {audio_path}', end='')
#         copy_with_format_wave(orig_path, DIR_TO/phase/audio_path)

# for noise_path in noise_paths:
#     copy_with_format_wave(noise_path, DIR_TO/'noise'/noise_path.name)

***

## Voxforge

### Dataset information

Paper (using Voxforge): S. Revay and M. Teschke, “Multiclass Language Identification using Deep Learning on Spectral Images of Audio Signals,” arXiv [cs.SD], May 10, 2019.

Dataset URL: www.voxforge.org/

* Language ID
* 6 languages
* No split information
* Living dataset (still being updated)

### Split

Voxforge is a living database.
Here, the dataset is constructed from a fixed list of URLs created by TensorFlow.
This list can be downloaded from https://storage.googleapis.com/tfds-data/downloads/voxforge/voxforge_urls.txt.
The following command downloads the compressed files in the list.
If you are using macOS with Homebrew, you can install `wget` by executing `brew install wget`.

```sh
wget -P your_target_directory -i /foo/bar/voxforge_urls.txt -x
```

**Note: Voxforge is a living database. Even if the fixed list created by TensorFlow is used, some of the audio in the list might have been deleted. This might result in a smaller sample size than expected.**

For each language, we split the dataset so that roughly 90% is used for training and 10% for evaluation.
We also make sure that no speaker appears in both the training set and the evaluation set; archives from the speaker name "anonymous" are always assigned to training.
More precisely, we ensure that at least 10% of each language's data goes to the evaluation set, so the overall share of the evaluation set exceeds 10%.

Executing the code block below divided the dataset into the following numbers of samples:

* train: 148,654 samples (6 languages; 84.3%)
* eval: 27,764 samples (6 languages; 15.7%)
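
The speaker-disjoint split depends on recovering a language and a speaker name from each downloaded archive path. A minimal sketch of that parsing follows, assuming the directory layout produced by `wget -x` and archive names that start with the speaker name followed by a hyphen. The function name and the example archive name are made up.

```python
from pathlib import PurePosixPath


def parse_archive(path):
    """Return the language and speaker name encoded in a Voxforge archive path."""
    p = PurePosixPath(path)
    return {
        'label': p.parts[-6],             # language directory, e.g. 'de'
        'speaker': p.stem.split('-')[0],  # text before the first hyphen
    }


# Made-up archive name placed under the expected directory layout.
example = ('/foo/bar/Voxforge/www.repository.voxforge1.org/downloads/'
           'de/Trunk/Audio/Main/16kHz_16bit/anonymous-20080425-abc.tgz')
print(parse_archive(example))
```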

### Procedure

1. Download the data using the fixed list created by TensorFlow.
   1. Download the fixed list from https://storage.googleapis.com/tfds-data/downloads/voxforge/voxforge_urls.txt.
   2. Download the zipped files by using the following code.
      ```sh
      wget -P your_target_directory -i /foo/bar/voxforge_urls.txt -x
      ```
      If you are using macOS with Homebrew, you can install `wget` by executing `brew install wget`.
2. Compose the directory tree as follows:

```
/foo/bar/Voxforge/www.repository.voxforge1.org/downloads/
  ├ de/Trunk/Audio/Main/16kHz_16bit/____.tgz
  ├ en/Trunk/Audio/Main/16kHz_16bit/____.tgz
  ├ es/Trunk/Audio/Main/16kHz_16bit/____.tgz
  ├ fr/Trunk/Audio/Main/16kHz_16bit/____.tgz
  ├ it/Trunk/Audio/Main/16kHz_16bit/____.tgz 
  └ ru/Trunk/Audio/Main/16kHz_16bit/____.tgz
```

3. Format the dataset (16 kHz, 16 bit, 1 ch -> 16 kHz, 16 bit, 1 ch; no change)

```
../datasets/Voxforge/
  ├ eval/____.wav
  ├ eval.csv
  ├ train/____.wav
  └ train.csv
```

In [None]:
# import tarfile
# import time

# from sklearn.preprocessing import LabelEncoder

# ################################################################################
# ### Configuration
# ################################################################################
# # ROOT_DIR = Path('/foo/bar/Voxforge')
# DIR_FROM = Path('/Volumes/ARCHIVE_SSD/Database/_zip/Voxforge')
# DIR_TO = Path('../datasets/Voxforge')


# ################################################################################
# ### Preprocessing
# ################################################################################
# master_df = pd.DataFrame([
#     {
#         'tgz_path': p,
#         'label': p.parts[-6],  # Language
#         'speaker': p.stem.split('-')[0]
#     }
#     for p in DIR_FROM.rglob('*.tgz')
# ])
# master_df['label_id'] = LabelEncoder().fit_transform(master_df['label'].values)
# train_df = []
# eval_df = []

# # Process for each label (language)
# rng = np.random.RandomState(0)  # Fix random seed.
# for label, label_df in master_df.groupby('label'):
#     # Split speakers so that they do not overlap.
#     # The speakers are randomized.
#     # However, "anonymous" is always included in training.
#     # Train=90%, Eval=10%
#     speakers_non_anonymous = list(label_df.query('speaker != "anonymous"')['speaker'].unique())
#     rng.shuffle(speakers_non_anonymous)
#     speakers = ['anonymous'] + speakers_non_anonymous
#     speakers_count = label_df.groupby('speaker').count().loc[speakers]
#     speakers_cumsum_count = speakers_count.cumsum()
#     speakers_relative_cumsum_count = speakers_cumsum_count / len(label_df)
#     train_speakers \
#         = speakers_relative_cumsum_count.loc[speakers_relative_cumsum_count.iloc[:, 0] < 0.9].index
#     train_speakers = sorted(set(train_speakers))
#     eval_speakers = sorted(set(speakers).difference(train_speakers))
#     train_df.append(label_df.query('speaker in @train_speakers'))
#     eval_df.append(label_df.query('speaker in @eval_speakers'))
#     print(f'{label} train={len(train_df[-1]):4} zips, eval={len(eval_df[-1]):4} zips')
# train_df = pd.concat(train_df, axis=0)
# eval_df = pd.concat(eval_df, axis=0)


# ################################################################################
# ### Processing
# ################################################################################
# # Make directories.
# (DIR_TO/'train').mkdir(parents=True, exist_ok=True)
# (DIR_TO/'eval').mkdir(parents=True, exist_ok=True)
# (DIR_TO/'_temp').mkdir(parents=True, exist_ok=True)

# for phase, df in (('train', train_df), ('eval', eval_df)):
#     meta = []
#     for _, row in df.iterrows():
#         tmp_dir = DIR_TO/'_temp'/row.tgz_path.stem
#         try:
#             with tarfile.open(row.tgz_path, 'r') as f:
#                 f.extractall(path=tmp_dir)
#         except tarfile.ReadError as e:
#             print('/'.join(row.tgz_path.parts[-8:]))
#         for orig_path in Path(tmp_dir).rglob('*.wav'):
#             audio_filename = f'{row.label}__{row.tgz_path.stem}__{orig_path.name}'
#             meta.append({
#                 'audio_filename': audio_filename,
#                 'label': row.label,
#                 'label_id': row.label_id,
#                 'speaker': row.speaker if row.speaker != 'anonymous' else None
#             })
#             copy_with_format_wave(orig_path, DIR_TO/phase/audio_filename)
#     pd.DataFrame(meta).to_csv(DIR_TO/f'{phase}.csv', index=False)

***

*End*