Goal: explore how to process the finetuning dataset, the PhysioNet/CinC Challenge 2017 dataset.

What has been done: the dataset has been downloaded to Google Drive under this path:

```
Project > data > physionet
```

Inside the `physionet/` directory, we have 2 subdirectories:

- `training2017`
- `sample2017`

Per the [docs](https://physionet.org/content/challenge-2017/1.0.0/), `training2017` is the "real" dataset. The instructions in `finetuning/README.md` assume that `training2017` is extracted to `data/physionet`.

We do not do that here. Our copy keeps the `training2017` subdirectory, so we pass `data/physionet/training2017` as the data path instead of `data/physionet`.
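
Before pointing the loader at `data/physionet/training2017`, a quick look at the directory contents can catch a wrong path early. The sketch below is a minimal, hypothetical helper (`count_by_extension` is not part of the repo) that tallies files by extension. It assumes the challenge files sit directly inside the directory as WFDB `.hea`/`.mat` pairs, which is worth verifying against the actual download.

```python
from collections import Counter
from pathlib import Path


def count_by_extension(directory):
    """Return a Counter mapping file extension -> number of files.

    Hypothetical helper for illustration; it only looks at the top level
    of `directory` and ignores subdirectories.
    """
    root = Path(directory)
    if not root.is_dir():
        raise NotADirectoryError(f"not a directory: {root}")
    return Counter(p.suffix for p in root.iterdir() if p.is_file())


# Example usage (the path is an assumption; adjust it to your Drive layout):
# counts = count_by_extension("data/physionet/training2017")
# print(counts)  # one would expect matching numbers of .hea and .mat files
```

If the counts for `.hea` and `.mat` differ, or the directory turns out to be empty, the path most likely points at the wrong level.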


In [3]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# move to my repo
%cd /content/drive/MyDrive/DLHProject/Danielgitrepo

/content/drive/.shortcut-targets-by-id/1vlUILM7cToH5CoX1x0kWRpe55MbBogS-/Project/Danielgitrepo


In [None]:
# note: we assume running git-ops.ipynb in a separate tab and that git state is set up correctly

In [None]:
! git branch

* daniel-finetune-explore
  daniel-test
  master


Install project requirements

In [4]:
! pip install -r requirements.txt

Collecting wfdb (from -r requirements.txt (line 8))
  Downloading wfdb-4.1.2-py3-none-any.whl (159 kB)
Collecting jedi>=0.16 (from ipython>=5.0.0->ipykernel->-r requirements.txt (line 9))
  Downloading jedi-0.19.1-py2.py3-none-any.whl (1.6 MB)
Installing collected packages: jedi, wfdb
Successfully installed jedi-0.19.1 wfdb-4.1.2


In [9]:
# just for reproducibility, dump the installed library versions to a separate
# requirements.txt file
# ! pip freeze > requirements-daniel.txt

Next we prepare the train and test datasets. Before generating them, we make sure they are saved to the right destination; otherwise the data files would be written into the git repository, which we don't want. The next few cells work out where to save them.

In [8]:
# We're in the git repo
! pwd

/content/drive/.shortcut-targets-by-id/1vlUILM7cToH5CoX1x0kWRpe55MbBogS-/Project/Danielgitrepo


In [17]:
! ls /content/drive/.shortcut-targets-by-id/1vlUILM7cToH5CoX1x0kWRpe55MbBogS-/Project/data/

icentia11k	   icentia11k_subset_corrupted	physionet	    session_checkpoint.dat
icentia11k_subset  icentia11k_subset_unzipped	physionet_data.zip  temp.torrent


In [28]:
# Path to the raw input data
PHYSIONET_RAW = "/content/drive/.shortcut-targets-by-id/1vlUILM7cToH5CoX1x0kWRpe55MbBogS-/Project/data/physionet"

In [26]:
# To be on the safe side, we write output files to a *sister* directory of data/physionet
PHYSIONET_OUT_DIR = "/content/drive/.shortcut-targets-by-id/1vlUILM7cToH5CoX1x0kWRpe55MbBogS-/Project/data/physionet_finetune"
! mkdir -p $PHYSIONET_OUT_DIR
! ls "/content/drive/.shortcut-targets-by-id/1vlUILM7cToH5CoX1x0kWRpe55MbBogS-/Project/data/"

icentia11k		     icentia11k_subset_unzipped  physionet_finetune
icentia11k_subset	     physionet			 session_checkpoint.dat
icentia11k_subset_corrupted  physionet_data.zip		 temp.torrent


In [27]:
# double check
print(f'{PHYSIONET_OUT_DIR}/physionet_train.pkl')

/content/drive/.shortcut-targets-by-id/1vlUILM7cToH5CoX1x0kWRpe55MbBogS-/Project/data/physionet_finetune/physionet_train.pkl
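
The "don't write into the git repository" concern can also be checked programmatically rather than by eye. A minimal sketch, assuming the repo and output paths printed in the cells above; `is_inside` is a hypothetical helper, not something the repo provides.

```python
from pathlib import Path


def is_inside(path, parent):
    """Return True if `path` is `parent` itself or located somewhere below it."""
    path = Path(path).resolve()
    parent = Path(parent).resolve()
    try:
        # relative_to compares whole path components, so a sibling such as
        # ".../Danielgitrepo2" is not mistaken for a child of ".../Danielgitrepo".
        path.relative_to(parent)
        return True
    except ValueError:
        return False


# Paths copied from the notebook output above.
PROJECT_DIR = "/content/drive/.shortcut-targets-by-id/1vlUILM7cToH5CoX1x0kWRpe55MbBogS-/Project"
REPO_DIR = f"{PROJECT_DIR}/Danielgitrepo"
OUT_DIR = f"{PROJECT_DIR}/data/physionet_finetune"

# Fails loudly if the output directory would land inside the git repository.
assert not is_inside(OUT_DIR, REPO_DIR), "output directory is inside the git repo"
```

Running this right before the save step turns an easy-to-miss path mistake into an immediate error.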


In [6]:
from finetuning import datasets
from finetuning.utils import train_test_split
from transplant.utils import save_pkl

In [None]:
data = datasets.get_challenge17_data(
    db_dir=f"{PHYSIONET_RAW}/training2017",
    fs=250,  # keep sampling frequency the same as Icentia11k
    pad=16384,  # zero-pad recordings to a common length of 16384 samples (16384 / 250 Hz ≈ 65.5 seconds)
    normalize=True)  # normalize each recording with mean and std computed over the entire dataset
# maintain class ratio across both train and test sets by using the `stratify` argument
train_set, test_set = train_test_split(
    data, test_size=0.2, stratify=data['y'])
save_pkl(f'{PHYSIONET_OUT_DIR}/physionet_train.pkl', **train_set)
save_pkl(f'{PHYSIONET_OUT_DIR}/physionet_test.pkl', **test_set)

The above took about <> time.
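
To confirm that stratification worked, one can compare class proportions in the two splits. This is a minimal sketch under the assumption that the labels can be reduced to one hashable class id per record. This notebook does not show the format of `data['y']`; if it is one-hot encoded, convert it first, for example with an argmax over the class axis. `class_fractions` is a hypothetical helper, and the toy labels below are made up for illustration, not taken from the dataset.

```python
from collections import Counter


def class_fractions(labels):
    """Map each class label to its share of `labels` (shares sum to 1)."""
    labels = list(labels)
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    total = len(labels)
    return {cls: n / total for cls, n in counts.items()}


# Toy labels standing in for the real splits (made-up data, not PhysioNet results).
toy_train = ["N"] * 6 + ["A"] * 2
toy_test = ["N"] * 3 + ["A"] * 1

train_frac = class_fractions(toy_train)
test_frac = class_fractions(toy_test)

# With a stratified split, per-class shares should be close in both splits.
max_gap = max(abs(train_frac[c] - test_frac.get(c, 0.0)) for c in train_frac)
print(f"largest per-class gap: {max_gap:.3f}")
```

On the real splits a small non-zero gap is expected, since class counts rarely divide evenly at an 80/20 ratio.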