# Data preparation

The following code shows the development of the `utils` for data preparation. The following were considered:

Classes attributed to an `Unknown` disorder status, based on the dataset publication, were removed:

1. Removed from the training set:
- `None (half year after diagnosis of small vocal nodules)`
- `functional`
- `None (higher phonation)`

2. Removed from the test set:
- `None (half year post-phonomicrosurgery for polipoid mid-membranous lesions)`
- `None (one year after presumption of a pseudocyst/sulcus in left vocal fold)`
- `functional`

The train-test splits provided by the BAGLS dataset were taken as is. The train set was then further divided into train and validation splits during model training.
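This notebook does not show how the train/validation split is made. The sketch below is one simple possibility, a seeded random split over image ids. The function name `split_train_val`, the 0.2 validation fraction, and the seed are all assumptions for illustration, not values taken from the project.

```python
import random


def split_train_val(image_ids, val_fraction=0.2, seed=0):
    """Shuffle ids with a fixed seed and split off a validation subset."""
    ids = list(image_ids)
    rng = random.Random(seed)  # local generator, so the global state is untouched
    rng.shuffle(ids)
    n_val = int(round(len(ids) * val_fraction))
    # first n_val shuffled ids become validation, the rest stay in train
    return ids[n_val:], ids[:n_val]


# Hypothetical ids standing in for the "Image Id" column.
train_ids, val_ids = split_train_val(range(100), val_fraction=0.2, seed=0)
print(len(train_ids), len(val_ids))
```

Fixing the seed keeps the split identical across runs, which matters when validation scores are compared between experiments.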

## For `tensorflow` image data loader

In [1]:
import os
import json
import pandas as pd
from glob import glob

from tqdm import tqdm

In [6]:
import sys
sys.path.append("..")
import PATHS

In [2]:
# inspect sample
path = "../training/training/19722.meta"
with open(path, 'r') as file:
    data = json.load(file)
data

{'Video Id': 546,
 'Camera': 'KayPentax HSV 9710 (Photron)',
 'Sampling rate (Hz)': 4000,
 'Video resolution (px, HxW)': [512, 256],
 'Color': False,
 'Endoscope orientation': '70°',
 'Endoscope application': 'oral',
 'Age range (yrs)': '10-20',
 'Subject sex': 'w',
 'Subject disorder status': 'healthy',
 'Segmenter': 0,
 'Post-processed': 1}

In [3]:
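def get_meta(glob_string):
    """Load every .meta JSON file matched by glob_string into one DataFrame."""
    paths = glob(glob_string)
    df_list = []
    for path in tqdm(paths):
        # image id = file name without directory and extension (portable across OSes)
        id_ = os.path.splitext(os.path.basename(path))[0]
        # each file holds one JSON object; transpose so it becomes a single row
        temp = pd.read_json(path, orient="index").T
        temp["Image Id"] = id_
        df_list.append(temp)
    return pd.concat(df_list)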

In [4]:
glob_string = "../training/training/*.meta"
df_train = get_meta(glob_string)

100%|██████████| 55750/55750 [03:25<00:00, 271.02it/s]
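Loading takes a few minutes here because one DataFrame is built per file. As a hedged alternative, the sketch below reads each `.meta` file with `json.load` and collects plain dicts. The helper name `load_meta_records` and the two toy files are made up for the demo; only the `Image Id` and `Subject disorder status` keys come from the notebook.

```python
import json
import os
import tempfile
from glob import glob


def load_meta_records(glob_string):
    """Read every .meta JSON file into a dict and tag it with its image id."""
    records = []
    for path in sorted(glob(glob_string)):
        with open(path, "r") as file:
            record = json.load(file)
        # file name without directory and extension, e.g. "19722"
        record["Image Id"] = os.path.splitext(os.path.basename(path))[0]
        records.append(record)
    return records


# Tiny self-contained demo with two made-up metadata files.
with tempfile.TemporaryDirectory() as tmp:
    for name, status in [("1", "healthy"), ("2", "edema")]:
        with open(os.path.join(tmp, f"{name}.meta"), "w") as file:
            json.dump({"Video Id": 0, "Subject disorder status": status}, file)
    records = load_meta_records(os.path.join(tmp, "*.meta"))

print(records)
```

The resulting list can be passed to `pd.DataFrame(records)` to build the frame in a single step, which may be faster than concatenating tens of thousands of one-row frames.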


In [5]:
df_train.head()

| | Video Id | Camera | Sampling rate (Hz) | Video resolution (px, HxW) | Color | Endoscope orientation | Endoscope application | Age range (yrs) | Subject sex | Subject disorder status | Segmenter | Post-processed | Image Id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 347 | KayPentax HSV 9710 (Photron) | 4000 | [512, 256] | False | 70° | oral | 20-30 | w | Muscle tension dysphonia | 0 | 1 | 10772 |
| 0 | 449 | KayPentax HSV 9710 (Photron) | 4000 | [512, 256] | False | 70° | oral | 30-40 | m | healthy | 0 | 2 | 11097 |
| 0 | 254 | KayPentax HSV 9710 (Photron) | 4000 | [512, 256] | False | 70° | oral | 10-20 | w | healthy | 0 | 1 | 11596 |
| 0 | 319 | KayPentax HSV 9710 (Photron) | 4000 | [512, 256] | False | 70° | oral | 50-60 | m | Vocal insufficiency and contact granuloma | 0 | 2 | 12917 |
| 0 | 429 | KayPentax HSV 9710 (Photron) | 4000 | [512, 256] | False | 70° | oral | 20-30 | m | healthy | 0 | 2 | 1434 |


In [6]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 55750 entries, 0 to 0
Data columns (total 13 columns):
 #   Column                      Non-Null Count  Dtype 
---  ------                      --------------  ----- 
 0   Video Id                    55750 non-null  object
 1   Camera                      55750 non-null  object
 2   Sampling rate (Hz)          55750 non-null  object
 3   Video resolution (px, HxW)  55750 non-null  object
 4   Color                       55750 non-null  object
 5   Endoscope orientation       53150 non-null  object
 6   Endoscope application       55750 non-null  object
 7   Age range (yrs)             55750 non-null  object
 8   Subject sex                 55750 non-null  object
 9   Subject disorder status     55750 non-null  object
 10  Segmenter                   55750 non-null  object
 11  Post-processed              55750 non-null  object
 12  Image Id                    55750 non-null  object
dtypes: object(13)
memory usage: 6.0+ MB


In [7]:
label_col = 'Subject disorder status'

In [8]:
df_train[label_col].unique()

array(['Muscle tension dysphonia', 'healthy',
       'Vocal insufficiency and contact granuloma', '',
       'Muscle tension dysphonia with M. thyroarythaenoideus atrophy',
       'None (half year after diagnosis of small vocal nodules)', 'scar',
       'Muscle tension dysphonia with nodules',
       'Posterior insufficient glottic closure',
       'Posterior insufficient glottic closure (high phonation)', 'edema',
       'Muscle tension dysphonia with vocal insufficiency and M. thyroarythaenoideus atrophy',
       'Vocal insufficiency and M. thyroarythaenoideus atrophy',
       'laryngitis', 'Right vocal fold polyp with contraleral edema ',
       'Muscle tension dysphonia with vocal insufficiency',
       'Minimal anterior mucosal irregularity right vocal fold ',
       'Irregular vibration anterior and middle portion of both vocal folds',
       'paresis', 'Polyp',
       'Muscle tension dysphonia with contact granuloma', 'functional',
       'Cyst vocal fold left with posterior ins

In [9]:
def _remove_subset(df, col, vals):
    """Return dataset after filtering out vals from column"""
    mask = (df[col].isin(vals))
    return df.loc[~mask,:]
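Because `isin` matches strings exactly, a label that differs only by trailing whitespace would survive the filter, and some labels in this dataset do carry a trailing space. The sketch below uses a toy frame to show the difference. The `"functional "` spelling with a trailing space is invented for the demo and is not claimed to occur in BAGLS, and `remove_subset` is a stand-in that mirrors `_remove_subset`.

```python
import pandas as pd


def remove_subset(df, col, vals):
    """Drop rows whose value in `col` is one of `vals` (exact match)."""
    return df.loc[~df[col].isin(vals), :]


col = "Subject disorder status"
toy = pd.DataFrame({col: ["healthy", "functional", "functional ", "edema"]})

# Exact matching keeps the label with a trailing space.
exact = remove_subset(toy, col, ["functional"])

# Stripping whitespace first removes both spellings.
stripped = toy.assign(**{col: toy[col].str.strip()})
normalized = remove_subset(stripped, col, ["functional"])

print(exact[col].tolist())
print(normalized[col].tolist())
```

Whether labels should be normalized before filtering is a design choice. The notebook filters on the raw strings.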
    

In [10]:
remove_from_train = [
    'None (half year after diagnosis of small vocal nodules)',
    'functional',
    'None (higher phonation)',
]
df_train = _remove_subset(df_train, label_col, vals=remove_from_train)
df_train = df_train.reset_index(drop=True)
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 55150 entries, 0 to 55149
Data columns (total 13 columns):
 #   Column                      Non-Null Count  Dtype 
---  ------                      --------------  ----- 
 0   Video Id                    55150 non-null  object
 1   Camera                      55150 non-null  object
 2   Sampling rate (Hz)          55150 non-null  object
 3   Video resolution (px, HxW)  55150 non-null  object
 4   Color                       55150 non-null  object
 5   Endoscope orientation       52550 non-null  object
 6   Endoscope application       55150 non-null  object
 7   Age range (yrs)             55150 non-null  object
 8   Subject sex                 55150 non-null  object
 9   Subject disorder status     55150 non-null  object
 10  Segmenter                   55150 non-null  object
 11  Post-processed              55150 non-null  object
 12  Image Id                    55150 non-null  object
dtypes: object(13)
memory usage: 5.5+ MB


In [11]:
glob_string = "../test/test/*.meta"
df_test = get_meta(glob_string)
df_test.info()

100%|██████████| 3500/3500 [00:12<00:00, 276.33it/s]


<class 'pandas.core.frame.DataFrame'>
Int64Index: 3500 entries, 0 to 0
Data columns (total 13 columns):
 #   Column                      Non-Null Count  Dtype 
---  ------                      --------------  ----- 
 0   Video Id                    3500 non-null   object
 1   Camera                      3500 non-null   object
 2   Sampling rate (Hz)          3500 non-null   object
 3   Video resolution (px, HxW)  3500 non-null   object
 4   Color                       3500 non-null   object
 5   Endoscope orientation       2500 non-null   object
 6   Endoscope application       3500 non-null   object
 7   Age range (yrs)             3500 non-null   object
 8   Subject sex                 3500 non-null   object
 9   Subject disorder status     3500 non-null   object
 10  Segmenter                   3500 non-null   object
 11  Post-processed              3500 non-null   object
 12  Image Id                    3500 non-null   object
dtypes: object(13)
memory usage: 382.8+ KB


In [12]:
df_test[label_col].unique()

array(['healthy', '', 'laryngitis', 'Muscle tension dysphonia',
       'None (half year post-phonomicrosurgery for polipoid mid-membranous lesions)',
       'Vocal fold nodules (high phonation)',
       "Reinke's edema right vocal fold (earlier it was bilateral)",
       'Cyst vocal fold left (posterior insufficient glottic closure) ',
       "Reinke's edema right vocal fold with hourglass-shaped insufficient glottic closure",
       'Polyp',
       'Post-resection of extreme polipoid pendulating edema right vocal folds; extreme polipoid pendulating edema obstructing almost complete glottis left vocal folds',
       'Lateral-posterior vocal fold cyst (high phonation)',
       'None (one year after presumption of a pseudocyst/sulcus in left vocal fold)',
       'edema',
       'Bilateral vergeture with bowed insufficient glottic closure',
       'scar', 'functional',
       'Hourglass-shaped insufficient glottic closure (high phonation)',
       'spasmodic dysphonia', 'paresis'], dtype=

In [13]:
remove_from_test = [
    'None (half year post-phonomicrosurgery for polipoid mid-membranous lesions)',
    'None (one year after presumption of a pseudocyst/sulcus in left vocal fold)',
    'functional'
]
df_test = _remove_subset(df_test, label_col, vals=remove_from_test)
df_test = df_test.reset_index(drop=True)
df_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3300 entries, 0 to 3299
Data columns (total 13 columns):
 #   Column                      Non-Null Count  Dtype 
---  ------                      --------------  ----- 
 0   Video Id                    3300 non-null   object
 1   Camera                      3300 non-null   object
 2   Sampling rate (Hz)          3300 non-null   object
 3   Video resolution (px, HxW)  3300 non-null   object
 4   Color                       3300 non-null   object
 5   Endoscope orientation       2300 non-null   object
 6   Endoscope application       3300 non-null   object
 7   Age range (yrs)             3300 non-null   object
 8   Subject sex                 3300 non-null   object
 9   Subject disorder status     3300 non-null   object
 10  Segmenter                   3300 non-null   object
 11  Post-processed              3300 non-null   object
 12  Image Id                    3300 non-null   object
dtypes: object(13)
memory usage: 335.3+ KB


In [14]:
# create binary target column is_healthy: 1 for "healthy", 0 for every other status (including the empty string)
df_train['is_healthy'] = (df_train[label_col] == "healthy").astype(int)
df_test['is_healthy'] = (df_test[label_col] == "healthy").astype(int)

label_col = "is_healthy"

In [15]:
print("Train samples: ", df_train.shape[0])
print("Test samples: ", df_test.shape[0])

Train samples:  55150
Test samples:  3300


In [16]:
import os
import shutil

In [17]:
def create_dataset(ids, src, dst, class_label): 
    dst = os.path.join(dst, class_label)
    
    if os.path.exists(dst):
        # delete if exists
        shutil.rmtree(dst)
    os.makedirs(dst)
    for id_ in tqdm(ids):
        fname = f"{id_}.png"
        src_file = os.path.join(src, fname)
        
        fname = f"{id_}.{class_label}.png"
        dst_file = os.path.join(dst, fname)
        shutil.copyfile(src_file, dst_file)
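To make the resulting directory layout concrete, the sketch below runs a simplified variant of `create_dataset` on placeholder files in a temporary directory. The helper name `copy_into_class_dir` is hypothetical. Unlike the original, it skips the progress bar and does not delete an existing class directory.

```python
import os
import shutil
import tempfile


def copy_into_class_dir(ids, src, dst, class_label):
    """Copy <id>.png from src to dst/<class_label>/<id>.<class_label>.png."""
    class_dir = os.path.join(dst, class_label)
    os.makedirs(class_dir, exist_ok=True)
    for id_ in ids:
        src_file = os.path.join(src, f"{id_}.png")
        dst_file = os.path.join(class_dir, f"{id_}.{class_label}.png")
        shutil.copyfile(src_file, dst_file)


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "raw")
    dst = os.path.join(tmp, "dataset", "train")
    os.makedirs(src)
    for id_ in ["1", "2", "3"]:
        with open(os.path.join(src, f"{id_}.png"), "wb") as file:
            file.write(b"placeholder")  # stand-in bytes, not a real image
    copy_into_class_dir(["1", "3"], src, dst, "healthy")
    copy_into_class_dir(["2"], src, dst, "unhealthy")
    # map each class directory to the files it now contains
    layout = {label: sorted(os.listdir(os.path.join(dst, label)))
              for label in sorted(os.listdir(dst))}

print(layout)
```

One sub-directory per class is the layout that directory-based loaders such as `tf.keras.utils.image_dataset_from_directory` infer labels from.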

In [18]:
df_train_healthy = df_train[df_train[label_col] == 1]
df_train_unhealthy = df_train[df_train[label_col] == 0]

# use the "Image Id" column as ids
healthy_train_ids = df_train_healthy["Image Id"].tolist()
unhealthy_train_ids = df_train_unhealthy["Image Id"].tolist()

print("Healthy train size: ", len(healthy_train_ids))
print("Unhealthy train size: ", len(unhealthy_train_ids))

Healthy train size:  33950
Unhealthy train size:  21200


In [19]:
src = "../training/training"
dst = "../dataset/train"
create_dataset(healthy_train_ids, src, dst, class_label="healthy")
create_dataset(unhealthy_train_ids, src, dst, class_label="unhealthy")

100%|██████████| 33950/33950 [01:01<00:00, 551.39it/s]
100%|██████████| 21200/21200 [00:38<00:00, 546.21it/s]


In [20]:
df_test_healthy = df_test[df_test[label_col] == 1]
df_test_unhealthy = df_test[df_test[label_col] == 0]

# use the "Image Id" column as ids
healthy_test_ids = df_test_healthy["Image Id"].tolist()
unhealthy_test_ids = df_test_unhealthy["Image Id"].tolist()

print("Healthy test size: ", len(healthy_test_ids))
print("Unhealthy test size: ", len(unhealthy_test_ids))

Healthy test size:  1450
Unhealthy test size:  1850


In [21]:
src = "../test/test"
dst = "../dataset/test"
create_dataset(healthy_test_ids, src, dst, class_label="healthy")
create_dataset(unhealthy_test_ids, src, dst, class_label="unhealthy")

100%|██████████| 1450/1450 [00:02<00:00, 591.04it/s]
100%|██████████| 1850/1850 [00:03<00:00, 612.89it/s]


In [23]:
# save reference dfs
df_train.to_csv("../dataset/train.csv", index=False)
df_test.to_csv("../dataset/test.csv", index=False)

## Create 10 bootstraps of the evaluation set: `test.csv`

In [8]:
import pandas as pd
import os

df_test = pd.read_csv("../dataset/test.csv")

# create the output directory if it does not exist yet
os.makedirs(PATHS.bootstrap_dir, exist_ok=True)
num_bootstraps = 10
for i in range(num_bootstraps):
    save_path = os.path.join(PATHS.bootstrap_dir, f"test-{i}.csv")
    # resample the full test set with replacement (same size as the original)
    df_test.sample(df_test.shape[0], replace=True).to_csv(save_path, index=False)
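How the bootstrap files are consumed is not shown in this notebook. A common use is to score a model on each resample and report the spread of the metric. The sketch below illustrates that idea in memory with accuracy as the metric. The function name `bootstrap_accuracy`, the toy labels and predictions, and the seed are all assumptions.

```python
import random
import statistics


def bootstrap_accuracy(y_true, y_pred, num_bootstraps=10, seed=0):
    """Accuracy on resamples (with replacement) of the evaluation set."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(num_bootstraps):
        # draw n row indices with replacement, like df.sample(n, replace=True)
        idx = [rng.randrange(n) for _ in range(n)]
        correct = sum(y_true[i] == y_pred[i] for i in idx)
        scores.append(correct / n)
    return scores


# Made-up labels and predictions for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 1]
scores = bootstrap_accuracy(y_true, y_pred)
print(f"mean={statistics.mean(scores):.3f}, std={statistics.stdev(scores):.3f}")
```

Passing a fixed seed (or `random_state` in `DataFrame.sample`) makes the resamples reproducible. The cell above draws them without one.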

## Create sample dataset for model development

In [24]:
df_train = df_train.sample(frac=0.01)

df_train_healthy = df_train[df_train[label_col] == 1]
df_train_unhealthy = df_train[df_train[label_col] == 0]

# use the "Image Id" column as ids
healthy_train_ids = df_train_healthy["Image Id"].tolist()
unhealthy_train_ids = df_train_unhealthy["Image Id"].tolist()

print("Healthy train size: ", len(healthy_train_ids))
print("Unhealthy train size: ", len(unhealthy_train_ids))

src = "../training/training"
dst = "../sample-dataset/train"
create_dataset(healthy_train_ids, src, dst, class_label="healthy")
create_dataset(unhealthy_train_ids, src, dst, class_label="unhealthy")

Healthy train size:  335
Unhealthy train size:  217


100%|██████████| 335/335 [00:00<00:00, 726.38it/s]
100%|██████████| 217/217 [00:00<00:00, 745.13it/s]


In [25]:
df_test = df_test.sample(frac=0.01)

df_test_healthy = df_test[df_test[label_col] == 1]
df_test_unhealthy = df_test[df_test[label_col] == 0]

# use the "Image Id" column as ids
healthy_test_ids = df_test_healthy["Image Id"].tolist()
unhealthy_test_ids = df_test_unhealthy["Image Id"].tolist()

print("Healthy test size: ", len(healthy_test_ids))
print("Unhealthy test size: ", len(unhealthy_test_ids))

src = "../test/test"
dst = "../sample-dataset/test"
create_dataset(healthy_test_ids, src, dst, class_label="healthy")
create_dataset(unhealthy_test_ids, src, dst, class_label="unhealthy")

Healthy test size:  13
Unhealthy test size:  20


100%|██████████| 13/13 [00:00<00:00, 928.88it/s]
100%|██████████| 20/20 [00:00<00:00, 1051.19it/s]
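`sample(frac=0.01)` without a seed draws a different subset on every run and does not guarantee that the class ratio is preserved. If a reproducible, class-balanced development sample is wanted, one option is to sample within each class. The sketch below is a minimal illustration on a toy frame. The 600/400 class counts and the `random_state` value are made up.

```python
import pandas as pd

# Toy frame standing in for df_train, with a 60/40 class ratio.
toy = pd.DataFrame({
    "Image Id": range(1000),
    "is_healthy": [1] * 600 + [0] * 400,
})

# Sample 1% within each class so the ratio is kept, using a fixed seed.
sample = (
    toy.groupby("is_healthy", group_keys=False)
       .sample(frac=0.01, random_state=0)
)

counts = sample["is_healthy"].value_counts().to_dict()
print(counts)
```

`DataFrameGroupBy.sample` requires pandas 1.1 or later.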


## End