# Preprocessing data

In [1]:
import os

For keeping a git repo inside another git repo, see [git submodule](https://stackoverflow.com/questions/36236484/maintaining-a-git-repo-inside-another-git-repo).

<br>

## MusicNet

Download the dataset (the HDF5 file fetched below is about 7.1 GiB)
* [Deep Complex Networks: MusicNet](https://github.com/ChihebTrabelsi/deep_complex_networks)
  - [official page](https://homes.cs.washington.edu/~thickstn/musicnet.html)

In [2]:
musicnet_path = "./data/musicnet"
file_in = os.path.join(musicnet_path, "musicnet.h5")

if not os.path.exists(file_in):
    # exist_ok=True: the directory may already exist even if the file is missing
    os.makedirs(musicnet_path, exist_ok=True)

    !wget https://homes.cs.washington.edu/~thickstn/media/musicnet.h5 -P {musicnet_path}

--2019-11-24 23:58:12--  https://homes.cs.washington.edu/~thickstn/media/musicnet.h5
Resolving homes.cs.washington.edu (homes.cs.washington.edu)... 128.208.3.226, 2607:4000:200:12::e2
Connecting to homes.cs.washington.edu (homes.cs.washington.edu)|128.208.3.226|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7623507914 (7.1G)
Saving to: ‘./data/musicnet/musicnet.h5’


2019-11-25 00:22:06 (5.08 MB/s) - ‘./data/musicnet/musicnet.h5’ saved [7623507914/7623507914]
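An interrupted download leaves a truncated `musicnet.h5` on disk, and the `os.path.exists` guard above would then skip the download on the next run. A minimal sketch of a stricter check, assuming the file size reported in the `wget` log above is still what the server delivers (the helper name `is_complete` is hypothetical):

```python
import os

# Size reported by the server in the wget log ("Length: 7623507914").
# Assumption: the hosted file has not changed since that run.
EXPECTED_BYTES = 7623507914


def is_complete(path, expected_bytes=EXPECTED_BYTES):
    """Return True if `path` is a regular file of exactly `expected_bytes` bytes."""
    return os.path.isfile(path) and os.path.getsize(path) == expected_bytes
```

Replacing `os.path.exists(file_in)` with `is_complete(file_in)` in the download cell would trigger a new download whenever the file is missing or has a different size.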



Extract the test and validation subsets as specified in [Trabelsi et al. (2019)](https://arxiv.org/abs/1705.09792).

In [3]:
datasets = {
    "test": [
        "id_2303", "id_2382", "id_1819",
    ],

    "valid": [
        "id_2131", "id_2384", "id_1792", "id_2514", "id_2567", "id_1876",
    ],
}

Populate the train subset with the remaining keys.

In [4]:
import h5py
from itertools import chain

with h5py.File(file_in, "r") as h5_in:
    # every recording not reserved for test/valid goes to train
    reserved = set(chain.from_iterable(datasets.values()))
    remaining_keys = set(h5_in.keys()) - reserved
    datasets["train"] = sorted(remaining_keys)
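The set difference above silently ignores a mistyped id in the hard-coded test or validation lists: the key simply stays out of the file lookup and nothing fails. A small sanity check, sketched here with a hypothetical helper `check_partition`, confirms that the subsets are disjoint and together cover exactly the keys in the file:

```python
from collections import Counter
from itertools import chain


def check_partition(datasets, all_keys):
    """Return subset sizes; raise ValueError unless the subsets partition `all_keys`."""
    all_keys = set(all_keys)
    counts = Counter(chain.from_iterable(datasets.values()))

    duplicated = sorted(k for k, n in counts.items() if n > 1)
    unknown = sorted(set(counts) - all_keys)   # listed in a subset, not in the file
    missing = sorted(all_keys - set(counts))   # in the file, not in any subset

    if duplicated or unknown or missing:
        raise ValueError(
            f"duplicated={duplicated}, unknown={unknown}, missing={missing}"
        )
    return {name: len(keys) for name, keys in datasets.items()}
```

It could be called as `check_partition(datasets, h5_in.keys())` while the HDF5 file is open. For the run shown in this notebook the sizes should come out as 3, 6 and 321, matching the progress bars in the resampling output.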

Run the resampler on the keys of each subset. This takes a while: the train subset alone took over an hour in the run below.

The code is loosely based on the [resampling script of Trabelsi et al. (2019)](https://github.com/ChihebTrabelsi/deep_complex_networks/blob/master/musicnet/scripts/resample.py)
but has been adapted to work with HDF5 files.
* dependencies: [resampy](https://github.com/bmcfee/resampy)

In [5]:
from musicnet.dataset import resample_h5

for dataset, keys in datasets.items():
    file_out = os.path.join(musicnet_path, f"musicnet_11khz_{dataset}.h5")
    resample_h5(file_in, file_out, 44100, 11000, keys=sorted(keys))

.. resampling ./data/musicnet/musicnet.h5 (44100Hz) into ./data/musicnet/musicnet_11khz_test.h5 (11000Hz)


100%|██████████| 3/3 [00:13<00:00,  4.42s/it]

.. resampling ./data/musicnet/musicnet.h5 (44100Hz) into ./data/musicnet/musicnet_11khz_valid.h5 (11000Hz)



100%|██████████| 6/6 [01:08<00:00, 11.38s/it]

.. resampling ./data/musicnet/musicnet.h5 (44100Hz) into ./data/musicnet/musicnet_11khz_train.h5 (11000Hz)



100%|██████████| 321/321 [1:03:36<00:00, 11.89s/it]
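The target rate used above is 11000 Hz, not 11025 Hz, so the resampling ratio is 110/441 rather than exactly 1/4. If label boundaries are stored as sample indices at the original rate (an assumption about the file layout, not something this notebook confirms), they need the same rescaling before they can index the resampled audio. A minimal sketch with a hypothetical helper `rescale_index`:

```python
from fractions import Fraction

SR_IN, SR_OUT = 44100, 11000  # the rates passed to resample_h5 above


def rescale_index(index, sr_in=SR_IN, sr_out=SR_OUT):
    """Map a sample index at `sr_in` Hz to the index at `sr_out` Hz, rounded down."""
    return index * sr_out // sr_in


ratio = Fraction(SR_OUT, SR_IN)    # reduces to 110/441, slightly below 1/4
one_second = rescale_index(SR_IN)  # 44100 input samples map to 11000 output samples
```

Integer arithmetic avoids floating-point rounding on long recordings. The exact length of a resampled signal depends on how the resampler rounds, so this mapping should be treated as approximate at the very end of a recording.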


The `musicnet.dataset` module also implements a `torch.utils.data.Dataset` that reads the HDF5 files.
* dependencies: [ncls](https://github.com/biocore-ntnu/ncls), which is written in Cython/C and
offers a tremendous speedup over the pure-Python `IntervalTree`.
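The query that an interval index answers here is "which labelled intervals contain sample `t`". The brute-force sketch below only illustrates that query in plain Python. It is not the `ncls` API, and the `(start, end, label)` tuple layout and the half-open convention are assumptions made for illustration:

```python
def active_labels(intervals, t):
    """Return the labels of all half-open intervals [start, end) that contain `t`."""
    return [label for start, end, label in intervals if start <= t < end]


# Hypothetical note intervals in sample indices: (start, end, label)
notes = [(0, 10, "A4"), (5, 15, "C5"), (20, 30, "E5")]
overlapping = active_labels(notes, 7)  # both "A4" and "C5" are sounding at t=7
```

This scan costs time linear in the number of intervals for every query, which is the cost a dedicated interval structure avoids when a dataset issues one query per training window.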

<br>

## TIMIT

* [TIMIT](https://catalog.ldc.upenn.edu/LDC93S1) (the LDC release is paywalled)
  - [on Kaggle](https://www.kaggle.com/mfekadu/darpa-timit-acousticphonetic-continuous-speech)

In [6]:
# !pip install kaggle

From [Kaggle-api](https://github.com/Kaggle/kaggle-api#api-credentials)

> At the 'Account' tab of your user profile (`https://www.kaggle.com/<username>/account`)
select 'Create API Token'. This will download the token `kaggle.json`.
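The `kaggle` CLI looks for the downloaded token at `~/.kaggle/kaggle.json` by default and expects it to hold a `username` and a `key`. A small pre-flight check can save a confusing failure in the download cell. The helper below is a sketch with a hypothetical name (`kaggle_token_problems`), not part of the Kaggle package:

```python
import json
import os
import stat


def kaggle_token_problems(path=None):
    """Return a list of human-readable problems with the Kaggle API token file."""
    if path is None:
        path = os.path.join(os.path.expanduser("~"), ".kaggle", "kaggle.json")
    if not os.path.isfile(path):
        return [f"token file not found: {path}"]

    try:
        with open(path) as f:
            token = json.load(f)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return [f"token file is not valid JSON: {path}"]
    if not isinstance(token, dict):
        token = {}

    problems = [f"missing field: {field}"
                for field in ("username", "key") if field not in token]

    # On POSIX systems the token should not be accessible to group or others.
    if os.name == "posix":
        mode = stat.S_IMODE(os.stat(path).st_mode)
        if mode & (stat.S_IRWXG | stat.S_IRWXO):
            problems.append(f"permissions too open ({oct(mode)}); consider chmod 600")
    return problems
```

An empty list means the token looks usable; anything else can be printed before the download is attempted.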

Download the TIMIT dataset with the `kaggle` CLI.

In [7]:
timit_path = "./data/timit"
if not os.path.exists(timit_path):
    os.makedirs(timit_path, exist_ok=False)
    timit_uri = "mfekadu/darpa-timit-acousticphonetic-continuous-speech"

    !kaggle datasets download -p {timit_path} --unzip {timit_uri}

Downloading darpa-timit-acousticphonetic-continuous-speech.zip to ./data/timit
100%|████████████████████████████████████████| 829M/829M [00:09<00:00, 93.0MB/s]


Run preprocessing scripts

In [8]:
# TODO: TIMIT preprocessing is not implemented yet

<br>

In [None]:
assert False

<br>