## Reading pickled finetuning data

See the `finetuning_data_pickles.ipynb` notebook for details on generating the pickles.

There are 2 pickles of interest:

1. `records.pkl`
2. `labels.pkl`

These 2 pickles are the input to the resampling and padding post-processing that produces the train and test sets for the finetuning process.

Both pickles are in the shared project folder, under the `data/physionet_preread/` directory.

First, we need to mount the drive. Either run the command below, or click the folder icon in the left sidebar and then click the dark gray folder icon with the Drive logo, which sits between the refresh symbol and the eye symbol.

In [None]:
# from google.colab import drive
# drive.mount('/content/drive')

In [8]:
ROOT = '/content/drive/MyDrive/DLHProject'
REPO = ROOT + '/Danielgitrepo'
DATA_DIR = ROOT + '/data'

Now we `cd` to the repo directory because it hosts the code that we will run.

In [10]:
%cd $REPO
! ls

/content/drive/.shortcut-targets-by-id/1vlUILM7cToH5CoX1x0kWRpe55MbBogS-/Project/Danielgitrepo
environment.yml  finetuning	jupyter_notebooks  pretraining	requirements-daniel.txt  transplant
example.ipynb	 git-ops.ipynb	LICENSE		   README.md	requirements.txt


In [12]:
# install dependencies
# for some reason, pip install -r requirements-daniel.txt didn't work
! pip install -r requirements.txt

Collecting wfdb (from -r requirements.txt (line 8))
  Downloading wfdb-4.1.2-py3-none-any.whl (159 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 160.0/160.0 kB 1.6 MB/s eta 0:00:00
Collecting jedi>=0.16 (from ipython>=5.0.0->ipykernel->-r requirements.txt (line 9))
  Downloading jedi-0.19.1-py2.py3-none-any.whl (1.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.6/1.6 MB 11.7 MB/s eta 0:00:00
Installing collected packages: jedi, wfdb
Successfully installed jedi-0.19.1 wfdb-4.1.2


In [14]:
PICKLE_IN_DIR = DATA_DIR + '/physionet_preread'
!ls $PICKLE_IN_DIR

labels.pkl  records.pkl


In [15]:
from transplant.utils import load_pkl, save_pkl

In [16]:
# Note that the object stored as pickle is a dictionary with 'data' key.
records = load_pkl(f"{PICKLE_IN_DIR}/records.pkl")["data"]
labels = load_pkl(f"{PICKLE_IN_DIR}/labels.pkl")["data"]
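
The layout noted in the comment above can be illustrated with the standard `pickle` module. This is a minimal sketch, assuming that `load_pkl` simply returns the unpickled object; the helpers `save_wrapped` and `load_wrapped` and the toy records are hypothetical and not part of the repo.

```python
import os
import pickle
import tempfile


def save_wrapped(path, data):
    """Store `data` under a 'data' key (hypothetical stand-in for the repo's pickles)."""
    with open(path, "wb") as f:
        pickle.dump({"data": data}, f)


def load_wrapped(path):
    """Load a pickle and return the object stored under its 'data' key."""
    with open(path, "rb") as f:
        return pickle.load(f)["data"]


# Toy stand-ins for the real records and labels.
toy_records = [[0.1, 0.2, 0.3], [0.4, 0.5]]
toy_labels = ["N", "A"]

with tempfile.TemporaryDirectory() as tmp_dir:
    records_path = os.path.join(tmp_dir, "records.pkl")
    labels_path = os.path.join(tmp_dir, "labels.pkl")
    save_wrapped(records_path, toy_records)
    save_wrapped(labels_path, toy_labels)
    loaded_records = load_wrapped(records_path)
    loaded_labels = load_wrapped(labels_path)

print(len(loaded_records), loaded_labels)
```

Wrapping the payload in a dictionary leaves room to store extra metadata next to `'data'` without changing how readers access the records.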

In [18]:
import functools

from finetuning import datasets
from transplant.datasets import physionet

Now we write a hacked version of `datasets.get_challenge17_data()` that takes `records` and `labels` directly and applies the normalization, padding, and resampling transformations.

In [25]:
def hacked_get_challenge17_data(records, labels, fs=None, pad=None, normalize=False, verbose=False):
    # Already taken care of by `finetuning_data_pickles.ipynb`
    # records, labels = physionet.read_challenge17_data(db_dir, verbose=verbose)
    if normalize:
        normalize = functools.partial(
            physionet.normalize_challenge17, inplace=True)
    data_set = datasets._prepare_data(
        records,
        labels,
        normalize_fn=normalize,
        fs=fs,
        pad=pad,
        verbose=verbose)
    return data_set

Before we run `hacked_get_challenge17_data()`, we need to prepare the output directory.

For convenience, we will ensure the transform parameters are part of the output directory name.

In [21]:
# set the transform parameters:
fs = 250 # hertz
pad = 250 * 60 # 60 seconds, as per the paper
normalize = True
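
To make the `pad` arithmetic concrete: at 250 Hz, 60 seconds corresponds to 250 × 60 = 15000 samples. The sketch below zero-pads a shorter toy signal to that length with NumPy. It is only an illustration; `pad_to_length` is a hypothetical helper, and the repo's `_prepare_data` may pad or truncate differently.

```python
import numpy as np


def pad_to_length(signal, target_len):
    """Zero-pad a 1-D signal at the end up to `target_len` samples (hypothetical helper).

    Signals that are already at least `target_len` long are returned unchanged.
    """
    signal = np.asarray(signal, dtype=float)
    missing = target_len - len(signal)
    if missing <= 0:
        return signal
    return np.pad(signal, (0, missing), mode="constant")


sample_rate_hz = 250
target_len = sample_rate_hz * 60  # 60 seconds at 250 Hz

short_signal = np.ones(sample_rate_hz * 30)  # a 30-second toy signal
padded = pad_to_length(short_signal, target_len)
print(padded.shape)
```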

In [22]:
out_path = DATA_DIR + f'/physionet_{fs}hz_{pad}pad_norm_{normalize}'
print(out_path)

/content/drive/MyDrive/DLHProject/data/physionet_250hz_15000pad_norm_True


In [24]:
! mkdir -p $out_path
! ls $DATA_DIR

icentia11k		     physionet				 physionet_preread
icentia11k_subset	     physionet_250hz_15000pad_norm_True  session_checkpoint.dat
icentia11k_subset_corrupted  physionet_data.zip			 temp.torrent
icentia11k_subset_unzipped   physionet_finetune


It takes about 9 seconds to read the pickles and apply the transformations. Contrast this with a 1-hour read time for the raw data files.

Working with pickles makes iteration on transformation of finetuning data **much** faster.

In [29]:
%%time
data = hacked_get_challenge17_data(
    records,
    labels,
    fs=fs,
    pad=pad,
    normalize=normalize,
    verbose=True
)

Resampling records: 100%|██████████| 8528/8528 [00:06<00:00, 1330.31it/s]


CPU times: user 6.76 s, sys: 1.47 s, total: 8.23 s
Wall time: 8.46 s


In [28]:
print(type(data))
print(data.keys())

<class 'dict'>
dict_keys(['x', 'y', 'record_ids', 'classes'])


Now we can run the train/test split routine. We take the code mostly as is from `finetuning/readme.md`, with one twist.

> We pass in `random_state=2024` for reproducibility.

This works because `train_test_split` wraps scikit-learn's [`train_test_split`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html).

> ⛔   The original readme code for generating the train and test data did not pass in a preset random state.
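
The effect of a fixed `random_state` can be checked directly against scikit-learn. The sketch below is an illustration on toy data, not the repo's wrapper: it calls `sklearn.model_selection.train_test_split` itself (imported under the alias `sk_train_test_split` to avoid clashing with the repo's function), and the arrays `toy_x` and `toy_y` are made up.

```python
import numpy as np
from sklearn.model_selection import train_test_split as sk_train_test_split

# Toy data: 100 samples, 80 of class 0 and 20 of class 1.
toy_x = np.arange(100).reshape(-1, 1)
toy_y = np.array([0] * 80 + [1] * 20)


def split(seed):
    """Stratified 80/20 split of the toy data with a fixed seed."""
    return sk_train_test_split(
        toy_x, toy_y, test_size=0.2, stratify=toy_y, random_state=seed)


x_train_a, x_test_a, y_train_a, y_test_a = split(2024)
x_train_b, x_test_b, y_train_b, y_test_b = split(2024)

# The same seed should select the same test samples on both calls.
print(np.array_equal(x_test_a, x_test_b))
# Stratification should keep the class ratio in the test set.
print(np.bincount(y_test_a))
```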

In [30]:
from finetuning.utils import train_test_split

In [31]:
%%time
# copy pasted directly from finetuning/readme.md
# maintain class ratio across both train and test sets by using the `stratify` argument
train_set, test_set = train_test_split(
    data, test_size=0.2, stratify=data['y'],
    # NEW: pass in random state for reproducibility
    random_state=2024,
)
save_pkl(f'{out_path}/physionet_train.pkl', **train_set)
save_pkl(f'{out_path}/physionet_test.pkl', **test_set)

CPU times: user 34.3 s, sys: 2.77 s, total: 37 s
Wall time: 42.9 s


The above took about 43 seconds to complete.