# Main difference

(Compared to `finetuning_explore.ipynb`)

In the previous notebook, we ran the sample script provided by the paper's authors. Comparing that code more closely with what the authors wrote in their paper, we found a discrepancy:

- The paper says 60-second samples were used, but the code uses about 65 seconds.

## Key contribution

Read in the raw data files and labels, then save them as pickles in the preread directory.

Further processing, such as changing the sample rate and padding, can then operate on the preread pickles instead of first having to read the raw data files.

We do it this way because reading the raw data files takes about 1 hour due to poor I/O on Google Drive, where the raw data files reside.
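
A minimal sketch of this read-once, cache-as-pickle pattern, using only the standard library. The names `load_or_build` and `slow_read` are hypothetical and exist only for illustration; the notebook itself relies on `save_pkl` and `load_pkl` from `transplant.utils`.

```python
import pickle
import tempfile
from pathlib import Path


def load_or_build(cache_path, build_fn):
    """Return the cached object if it exists, otherwise build and cache it."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        with cache_path.open("rb") as f:
            return pickle.load(f)
    obj = build_fn()  # the slow step, e.g. reading raw files from Drive
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    with cache_path.open("wb") as f:
        pickle.dump(obj, f)
    return obj


# Toy demonstration with a stand-in for the slow reader.
calls = []


def slow_read():
    calls.append(1)  # count how often the slow path runs
    return {"records": [1, 2, 3]}


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "preread" / "records.pkl"
    first = load_or_build(cache, slow_read)   # builds and writes the pickle
    second = load_or_build(cache, slow_read)  # served from the pickle
```

In this toy run the slow reader is invoked only once; the second call is answered from the pickle.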

Goal: explore how to process the finetuning dataset, which is PhysioNet 2017.

What has been done: the dataset has been downloaded to gdrive under this path:

```
Project > data > physionet
```

Inside the `physionet/` directory, we have 2 subdirectories:

- `training2017`
- `sample2017`

Per the [docs](https://physionet.org/content/challenge-2017/1.0.0/), `training2017` is the "real" dataset. The instructions in `finetuning/README.md` assume that `training2017` is extracted to `data/physionet`.

We are not going to do that. Instead of providing `data/physionet` as the path to the data, we change that to `data/physionet/training2017`.


In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [18]:
PROJECT_ROOT = '/content/drive/MyDrive/DLHProject'

Now we `cd` to the git repo to verify that the code we're going to run
is what we intend to run.

In [7]:
%cd $PROJECT_ROOT/Danielgitrepo

/content/drive/.shortcut-targets-by-id/1vlUILM7cToH5CoX1x0kWRpe55MbBogS-/Project/Danielgitrepo


Note that we will perform git operations in a separate colab tab, not in here.

The following git commands are local only and simply verify that we have the right code version checked out.

In [9]:
! git status

On branch master
Your branch is up to date with 'origin/master'.

Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
	modified:   git-ops.ipynb

no changes added to commit (use "git add" and/or "git commit -a")


In [10]:
! ls

environment.yml  finetuning	jupyter_notebooks  pretraining	requirements-daniel.txt  transplant
example.ipynb	 git-ops.ipynb	LICENSE		   README.md	requirements.txt


Install project requirements

In [11]:
! pip install -r requirements.txt

Collecting wfdb (from -r requirements.txt (line 8))
  Downloading wfdb-4.1.2-py3-none-any.whl (159 kB)
Collecting jedi>=0.16 (from ipython>=5.0.0->ipykernel->-r requirements.txt (line 9))
  Downloading jedi-0.19.1-py2.py3-none-any.whl (1.6 MB)
Installing collected packages: jedi, wfdb
Successfully installed jedi-0.19.1 wfdb-4.1.2


In [None]:
# just for reproducibility, dump the installed library versions to a separate
# requirements.txt file
# ! pip freeze > requirements-daniel.txt

Next we want to prepare the train and test datasets. Before generating them, we make sure they will be saved to the right destination; otherwise we would write data into the git repository, which we don't want. Thus, we spend the next few cells figuring out where to save these files.
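
One way to make this check explicit is to test whether the output directory lies inside the repository. The sketch below is illustrative only: `is_inside` is a hypothetical helper, and the two paths are assumed locations modeled on the layout used in this notebook.

```python
from pathlib import Path


def is_inside(path, parent):
    """Return True if `path` is `parent` itself or located somewhere below it."""
    path, parent = Path(path).resolve(), Path(parent).resolve()
    return path == parent or parent in path.parents


# Assumed locations, modeled on the layout used in this notebook.
repo_dir = "/content/drive/MyDrive/DLHProject/Danielgitrepo"
out_dir = "/content/drive/MyDrive/DLHProject/data/physionet_finetune_60s"

# False means the output directory is safely outside the git repository.
inside_repo = is_inside(out_dir, repo_dir)
```

A guard such as `assert not is_inside(out_dir, repo_dir)` placed before any `save_pkl` call would stop the notebook before data is written into the repository.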

In [12]:
# We're in the git repo
! pwd

/content/drive/.shortcut-targets-by-id/1vlUILM7cToH5CoX1x0kWRpe55MbBogS-/Project/Danielgitrepo


In [19]:
! ls $PROJECT_ROOT/data

icentia11k		     icentia11k_subset_unzipped  physionet_finetune	 temp.torrent
icentia11k_subset	     physionet			 physionet_finetune_60s
icentia11k_subset_corrupted  physionet_data.zip		 session_checkpoint.dat


In [40]:
# Path to the directory containing the raw PhysioNet data
PHYSIONET_RAW = f"{PROJECT_ROOT}/data/physionet"

In [10]:
# double check
# print(f'{PHYSIONET_OUT_DIR}/physionet_train.pkl')

/content/drive/.shortcut-targets-by-id/1vlUILM7cToH5CoX1x0kWRpe55MbBogS-/Project/data/physionet_finetune_60s/physionet_train.pkl


In [62]:
from finetuning import datasets
from finetuning.utils import train_test_split
from transplant.utils import save_pkl

Please note that the below cell is the initial code provided to us by the authors in `finetuning/README.md`.

The initial run of this on 2024-03-31 was successful.

However, there are some concerns about the correctness of the parameters passed to `get_challenge17_data()`, for example the `pad` parameter.

FYI, 16384 = 2^14 samples, and at 250 Hz that is 16384 / 250 = 65.536 seconds, not the 60 seconds stated in the paper.
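
The arithmetic can be checked directly. This small sketch uses the `fs` and `pad` values from the authors' sample call; `samples_for_60s` is our own illustrative name for the padded length that an exact 60-second window would need.

```python
fs = 250     # sampling frequency in Hz, as passed to get_challenge17_data
pad = 16384  # padded length in samples, as passed to get_challenge17_data

duration_s = pad / fs      # length of a padded recording in seconds
samples_for_60s = 60 * fs  # samples needed for exactly 60 seconds

print(pad == 2 ** 14)      # True
print(duration_s)          # 65.536
print(samples_for_60s)     # 15000
```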

In case we find problems with this and need to experiment with the padding settings, we now separate the read portion of `get_challenge17_data` from the normalization + padding portion.

In [28]:
# data = datasets.get_challenge17_data(
#     db_dir=f"{PHYSIONET_RAW}/training2017",
#     fs=250,  # keep sampling frequency the same as Icentia11k
#     pad=16384,  # zero-pad recordings to keep the same length at about 65 seconds
#     normalize=True,  # normalize each recording with mean and std computed over the entire dataset
#     verbose=True)
# # maintain class ratio across both train and test sets by using the `stratify` argument
# train_set, test_set = train_test_split(
#     data, test_size=0.2, stratify=data['y'])
# save_pkl(f'{PHYSIONET_OUT_DIR}/physionet_train.pkl', **train_set)
# save_pkl(f'{PHYSIONET_OUT_DIR}/physionet_test.pkl', **test_set)

The above took about 1 hour to read the data. We are not sure about the normalization + padding step.
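
To make the idea of a separate normalization + padding step concrete, here is a rough sketch on toy data. It is not the authors' implementation: the helper names `zero_pad` and `normalize_all` are hypothetical, and the choices made here (normalize first, then pad on the right, truncate longer signals) are assumptions that may differ from what `get_challenge17_data` actually does. Plain Python lists keep the sketch dependency-free; real code would more likely use NumPy arrays.

```python
from statistics import fmean, pstdev


def zero_pad(signal, length):
    """Right-pad `signal` with zeros to `length` samples (truncate if longer)."""
    signal = list(signal)[:length]
    return signal + [0.0] * (length - len(signal))


def normalize_all(signals):
    """Scale every signal with one mean and std computed over all samples."""
    flat = [x for s in signals for x in s]
    mean, std = fmean(flat), pstdev(flat)
    return [[(x - mean) / std for x in s] for s in signals]


# Toy "recordings" of different lengths stand in for the pre-read records.
raw = [[1.0, 2.0, 3.0], [4.0, 5.0]]
normalized = normalize_all(raw)                 # dataset-wide mean and std
padded = [zero_pad(s, 4) for s in normalized]   # every signal now has 4 samples
```

Because both helpers work on in-memory signals, trying a different padded length only means rerunning this step, not rereading the raw files.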

In [27]:
from transplant.datasets import physionet

In [41]:
! ls $PHYSIONET_RAW

sample2017  training2017


In [None]:
records, labels = physionet.read_challenge17_data(f"{PHYSIONET_RAW}/training2017", verbose=True)

Reading records:  52%|█████▏    | 4402/8528 [34:48<33:16,  2.07it/s]

Now we inspect the records and labels.

Here we see that `records` is a list of `Record` objects from the `wfdb` package. There are 8,528 records in total.

In [57]:
print(type(records))
print(len(records))
print(type(records[0]))

<class 'list'>
8528
<class 'wfdb.io.record.Record'>


`labels` is a DataFrame with the same number of rows as `len(records)` and 4 columns, one for each output class. Thus, each row in `labels` is a one-hot encoded vector.

In [60]:
print(type(labels))
print(labels.shape)
labels.head()

<class 'pandas.core.frame.DataFrame'>
(8528, 4)


| record_name | A | N | O | ~ |
|-------------|---|---|---|---|
| A00001      | 0 | 1 | 0 | 0 |
| A00002      | 0 | 1 | 0 | 0 |
| A00003      | 0 | 1 | 0 | 0 |
| A00004      | 1 | 0 | 0 | 0 |
| A00005      | 1 | 0 | 0 | 0 |
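
As a small illustration of the one-hot layout, the sketch below decodes the five rows shown above back into class names and counts them. It uses plain Python lists rather than the real DataFrame; `one_hot_rows`, `class_names`, and `class_counts` are illustrative names only.

```python
from collections import Counter

classes = ["A", "N", "O", "~"]  # column order of the labels DataFrame

# One-hot rows copied from the first five records (A00001 to A00005).
one_hot_rows = [
    [0, 1, 0, 0],
    [0, 1, 0, 0],
    [0, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
]

# The position of the single 1 in each row identifies the class.
class_names = [classes[row.index(1)] for row in one_hot_rows]
class_counts = Counter(class_names)

print(class_names)   # ['N', 'N', 'N', 'A', 'A']
```

On the real DataFrame, `labels.idxmax(axis=1)` should yield the same per-row class names, since each row contains exactly one 1.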


Now we save the extracted data to `PROJECT_DATA_DIR/physionet_preread`.

In [64]:
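# Assumption: PROJECT_DATA_DIR is not defined in any remaining cell of this
# notebook; we take it to be the shared data directory listed earlier.
PROJECT_DATA_DIR = f"{PROJECT_ROOT}/data"
preread_out_dir = f"{PROJECT_DATA_DIR}/physionet_preread"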
! mkdir -p $preread_out_dir

In [66]:
# Note: this saves the result to a dict {'data': records}
save_pkl(f"{preread_out_dir}/records.pkl", data=records)

In [72]:
# Note: this saves the result to a dict {'data': labels}
save_pkl(f"{preread_out_dir}/labels.pkl", data=labels)

In [73]:
# check round trip
from transplant.utils import load_pkl

Loading the records pickle file takes less than 5 seconds, compared with about 1 hour for the raw files. This means that any adjustment we need to make to the padding or sampling rate of the records will be fairly quick, since we read from the pickles in the preread directory instead of the raw data files.

We also verified that the reloaded records match the records read from the raw files.

In [91]:
%%time
# see note above about save_pkl (the cell magic must be the first line of the cell)
new_records = load_pkl(f"{preread_out_dir}/records.pkl")["data"]

CPU times: user 3.75 s, sys: 261 ms, total: 4.02 s
Wall time: 4.09 s


In [85]:
print(len(new_records))
print(new_records[0])
print(all(n == o for n, o in zip(new_records, records)))

8528
<wfdb.io.record.Record object at 0x7f486c163b20>
True


In [92]:
%%time
# see note above about save_pkl (the cell magic must be the first line of the cell)
new_labels = load_pkl(f"{preread_out_dir}/labels.pkl")["data"]

CPU times: user 6.51 ms, sys: 34 µs, total: 6.55 ms
Wall time: 22.6 ms


In [77]:
new_labels.shape

(8528, 4)

In [88]:
! ls $preread_out_dir

labels.pkl  records.pkl


TODO: new notebook to read in saved pickles from preread directory.