# Dataset preparation

## Creating detailed dataframe
I extracted additional data from the csv files that list the filepath of each image and stored it in a new dataframe. This detailed dataframe is used instead of the original MURA dataset csv files. It has the following columns:<br>
* sp_id (identifier created from other columns, may be used later for dataset filtering)
* filepath (filepath to each image from original csv files)
* label (positive/negative label, extracted from study folder name)
* body_part (body part shown in the X-ray scan, without the XR_ prefix)
* patient (numeric string representing one patient)
* study (number representing study for each patient)
* split (part of the dataset the image belongs to, based on the original data split)

<b>Note: this notebook also creates a second dataset csv file with an added custom test split.</b>
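
As a minimal sketch of how the columns above relate to a single filepath, the snippet below parses one path with the standard library only. The helper name `parse_path` is hypothetical and not part of this notebook; the pattern mirrors the one used in the parsing cell further down, and the `split` column is left out because it comes from the source csv file, not from the path.

```python
import re

# Same pattern as in the parsing cell of this notebook, written as a raw string
PATH_PATTERN = re.compile(
    r'.*/XR_(?P<body_part>.*)/[a-z]+(?P<patient>[0-9]+)'
    r'/[a-z]+(?P<study>[0-9]+)_(?P<label>[a-z]+)/'
)

def parse_path(filepath):
    """Split one MURA image path into the detailed dataframe columns (hypothetical helper)."""
    match = PATH_PATTERN.match(filepath)
    if match is None:
        raise ValueError(f"Unexpected path layout: {filepath}")
    row = match.groupdict()
    row['filepath'] = filepath
    # sp_id: first two letters of body part + patient + "_" + study + first letter of label
    row['sp_id'] = f"{row['body_part'][:2]}{row['patient']}_{row['study']}{row['label'][0]}"
    return row

example = parse_path('MURA-v1.1/train/XR_SHOULDER/patient00001/study1_positive/image1.png')
print(example['sp_id'], example['body_part'], example['patient'], example['study'], example['label'])
# prints: SH00001_1p SHOULDER 00001 1 positive
```

Note that `patient` stays a string here, so its leading zeros are preserved.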

In [45]:
import glob
import numpy as np
import os
import pandas as pd
import re
import shutil

In [55]:
# Change filepaths as needed
ROOT_DIR = "../" # root dir, in which MURA dir is stored
TRAIN_PATH = "../MURA-v1.1/train_image_paths.csv" # path for loading train set csv file
VALID_PATH = "../MURA-v1.1/valid_image_paths.csv" # path for loading validation set csv file

TRAIN_VALID_DETAILED_PATH = "../MURA-v1.1/tv_detailed_paths.csv" # path for storing new dataframe
TRAIN_VALID_TEST_DETAILED_PATH = "../MURA-v1.1/tvt_detailed_paths.csv" # path for storing new dataframe with test set

In [15]:
# Load dataframes
train_df = pd.read_csv(TRAIN_PATH, names=['filepath'])
valid_df = pd.read_csv(VALID_PATH, names=['filepath'])

print('Train image count:', len(train_df))
print('Valid image count:', len(valid_df))
print('Example filepath:', train_df.loc[0, 'filepath'])

Train image count: 36808
Valid image count: 3197
Example filepath: MURA-v1.1/train/XR_SHOULDER/patient00001/study1_positive/image1.png


In [16]:
# Parse data into new columns
rstring = r'.*/XR_(?P<body_part>.*)/[a-z]+(?P<patient>[0-9]+)/[a-z]+(?P<study>[0-9]+)_(?P<label>[a-z]+)/'

train_cols = train_df['filepath'].str.extract(rstring, expand=True)
valid_cols = valid_df['filepath'].str.extract(rstring, expand=True)

# Add "dataset type"
train_cols['split'] = 'train'
valid_cols['split'] = 'valid'

# Create study identifiers from other columns
train_cols['sp_id'] = (train_cols['body_part'].str.slice(0, 2) + train_cols['patient'] + "_"
                       + train_cols['study'] + train_cols['label'].str.get(0))
valid_cols['sp_id'] = (valid_cols['body_part'].str.slice(0, 2) + valid_cols['patient'] + "_"
                       + valid_cols['study'] + valid_cols['label'].str.get(0))

# Concat into final dataframes
train_expanded = pd.concat([train_df, train_cols], axis=1)
valid_expanded = pd.concat([valid_df, valid_cols], axis=1)

# Create detailed dataframe
detailed_df = pd.concat([train_expanded, valid_expanded])

# Reorder columns
detailed_df = detailed_df.reindex(columns=['sp_id', 'filepath', 'label', 'body_part', 'patient', 'study', 'split'])

### Save created dataframe as csv

In [17]:
detailed_df.to_csv(TRAIN_VALID_DETAILED_PATH)

In [18]:
# Show dataframe sample
display(detailed_df.head(5))

Unnamed: 0,sp_id,filepath,label,body_part,patient,study,split
0,SH00001_1p,MURA-v1.1/train/XR_SHOULDER/patient00001/study...,positive,SHOULDER,1,1,train
1,SH00001_1p,MURA-v1.1/train/XR_SHOULDER/patient00001/study...,positive,SHOULDER,1,1,train
2,SH00001_1p,MURA-v1.1/train/XR_SHOULDER/patient00001/study...,positive,SHOULDER,1,1,train
3,SH00002_1p,MURA-v1.1/train/XR_SHOULDER/patient00002/study...,positive,SHOULDER,2,1,train
4,SH00002_1p,MURA-v1.1/train/XR_SHOULDER/patient00002/study...,positive,SHOULDER,2,1,train


# Analysis of dataset
In this section I analysed the number of studies for each body part and their positive/negative ratio. I used this information to create a test set similar to the one described in the original MURA paper, because that test set is not publicly accessible.

* I calculated the share of studies that each body part has in the train split.
* I then used these shares to approximate the number of studies per body part that should be in the test set.
* For each body part, I picked that many studies at random from the train split and changed the "split" column value to "test" for those studies.
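
The third step can be sketched with the standard library alone. The snippet below is only an illustration on made-up study ids and made-up test sizes; the notebook itself uses pandas `sample`, which will pick different studies for the same seed. The function name `pick_test_studies` is hypothetical.

```python
import random

RD_SEED = 27  # same seed value as used later in this notebook

# Toy data (made up): study id -> body part
studies = {
    'EL00001_1p': 'ELBOW', 'EL00002_1n': 'ELBOW', 'EL00003_1n': 'ELBOW',
    'WR00004_1p': 'WRIST', 'WR00005_1n': 'WRIST',
    'WR00005_2n': 'WRIST', 'WR00006_1p': 'WRIST',
}
# Hypothetical number of test studies per body part
test_sizes = {'ELBOW': 1, 'WRIST': 2}

def pick_test_studies(studies, test_sizes, seed):
    """Draw the requested number of whole studies per body part, reproducibly."""
    rng = random.Random(seed)
    selected = []
    for body_part, n in sorted(test_sizes.items()):
        # Sort candidates so the result does not depend on dict ordering
        candidates = sorted(sp_id for sp_id, part in studies.items() if part == body_part)
        selected.extend(rng.sample(candidates, n))
    return selected

test_studies = pick_test_studies(studies, test_sizes, RD_SEED)

# Every image of a selected study would then get split == 'test'
split = {sp_id: ('test' if sp_id in test_studies else 'train') for sp_id in studies}
print(len(test_studies), "studies selected")
```

Sampling whole studies, not single images, keeps all images of one study in the same split.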

In [19]:
# Get study counts for each body part
group1 = detailed_df[['split', 'sp_id', 'body_part', 'label']].groupby(['split', 'sp_id', 'body_part', 'label']).size()
study_counts = group1.groupby(['split', 'body_part', 'label']).size().to_frame(name = 'study_cnt').reset_index()

# Keep only train split counts and drop the split column
# (dropping on a new object avoids modifying a filtered view in place)
study_counts = study_counts[study_counts['split'] == 'train'].drop(columns=['split'])

# Sum of studies
all_studies = study_counts['study_cnt'].sum()

# Get sum for each body part
study_sum = study_counts[['body_part', 'study_cnt']].groupby('body_part').sum()
study_sum = study_sum.reset_index(level=0)
study_sum['label'] = 'any'

# Add study sum for body parts to dataframe
study_counts = pd.concat([study_counts, study_sum], ignore_index=True)
study_counts.sort_values(by=['body_part'], inplace=True)
study_counts.reset_index(drop=True, inplace=True)

# Calculate portion of all studies
study_counts['share'] = np.round(study_counts['study_cnt'] / all_studies, decimals=3)

# Example values for one body part
study_counts.head(3)

Unnamed: 0,body_part,label,study_cnt,share
0,ELBOW,negative,1094,0.081
1,ELBOW,positive,660,0.049
2,ELBOW,any,1754,0.13


In [20]:
# Get average number of images per study for each body part
detailed_df[['sp_id', 'body_part', 'filepath']].groupby(['body_part','sp_id']).size().groupby('body_part').mean()

body_part
ELBOW       2.822176
FINGER      2.638389
FOREARM     2.104950
HAND        2.747368
HUMERUS     2.145805
SHOULDER    2.965837
WRIST       2.816067
dtype: float64

# Creating test set

In [21]:
ORIGINAL_TEST_SIZE = 207 # test set size from MURA paper
RD_SEED = 27 # seed for random selection of studies

# Create dataframe with share per body part
shares_df = study_counts[study_counts['label'] == 'any'].copy(deep=True)

# Calculate approximate number of studies in test split for each body part
shares_df['test_size'] = np.floor(shares_df['share'] * ORIGINAL_TEST_SIZE).astype('int')

print("Proposed number of selected studies:", shares_df.test_size.sum())
display(shares_df)

Proposed number of selected studies: 204


Unnamed: 0,body_part,label,study_cnt,share,test_size
2,ELBOW,any,1754,0.13,26
5,FINGER,any,1935,0.144,29
8,FOREARM,any,877,0.065,13
11,HAND,any,2018,0.15,31
14,HUMERUS,any,592,0.044,9
17,SHOULDER,any,2821,0.21,43
20,WRIST,any,3460,0.257,53
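
The proposed total is 204 and not 207 because each product of share and test size is rounded down. The sketch below reproduces the numbers from the table with plain Python. It also shows the largest remainder method as one possible alternative that reaches the target exactly. That alternative is an assumption for illustration only and is not what this notebook uses; the function name `largest_remainder` is hypothetical.

```python
import math

ORIGINAL_TEST_SIZE = 207  # test set size from the MURA paper, as used in this notebook

# Study counts per body part, copied from the table above
study_cnt = {
    'ELBOW': 1754, 'FINGER': 1935, 'FOREARM': 877, 'HAND': 2018,
    'HUMERUS': 592, 'SHOULDER': 2821, 'WRIST': 3460,
}
total = sum(study_cnt.values())

# Notebook approach: round the share to 3 decimals, then floor the product
floored = {part: math.floor(round(cnt / total, 3) * ORIGINAL_TEST_SIZE)
           for part, cnt in study_cnt.items()}

def largest_remainder(counts, target):
    """Alternative allocation (not used in this notebook) that sums to target."""
    total = sum(counts.values())
    # divmod on integers gives the floored quota and its exact remainder
    quota = {part: divmod(cnt * target, total) for part, cnt in counts.items()}
    sizes = {part: q for part, (q, _) in quota.items()}
    leftover = target - sum(sizes.values())
    # Hand the remaining slots to the body parts with the largest remainders
    for part in sorted(quota, key=lambda p: quota[p][1], reverse=True)[:leftover]:
        sizes[part] += 1
    return sizes

exact = largest_remainder(study_cnt, ORIGINAL_TEST_SIZE)
print(sum(floored.values()), sum(exact.values()))  # prints: 204 207
```

With these counts the three extra studies would go to ELBOW, FINGER and FOREARM, which have the largest fractional parts.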


## Selecting test set studies at random

In [23]:
# Create dataframe with train split
train_df = detailed_df[detailed_df['split'] == 'train'].copy(deep=True)
# Create dataframe with only sp_id and body parts
studies_df = train_df[['sp_id', 'body_part']].groupby(['sp_id', 'body_part']).size().to_frame('img_cnt').reset_index()

# Get all body parts
body_parts = train_df.body_part.unique()

test_studies = []
for body_part in body_parts:
    # Get number of studies from previously created dataframe
    sample_size = shares_df[shares_df['body_part'] == body_part].test_size.values[0]
    
    # Select random studies for test split
    test_studies.extend(studies_df[studies_df['body_part'] == body_part].sample(n=sample_size, random_state=RD_SEED).sp_id.values)

print("Selected", len(test_studies), "studies for test set")

# Change split column for selected study images to "test"
detailed_df.loc[detailed_df['sp_id'].isin(test_studies), 'split'] = 'test'

# Show test split
display(detailed_df[detailed_df['split'] == 'test'])

Selected 204 studies for test set


Unnamed: 0,sp_id,filepath,label,body_part,patient,study,split
132,SH00044_1p,MURA-v1.1/train/XR_SHOULDER/patient00044/study...,positive,SHOULDER,00044,1,test
133,SH00044_1p,MURA-v1.1/train/XR_SHOULDER/patient00044/study...,positive,SHOULDER,00044,1,test
134,SH00044_1p,MURA-v1.1/train/XR_SHOULDER/patient00044/study...,positive,SHOULDER,00044,1,test
185,SH00060_1p,MURA-v1.1/train/XR_SHOULDER/patient00060/study...,positive,SHOULDER,00060,1,test
186,SH00060_1p,MURA-v1.1/train/XR_SHOULDER/patient00060/study...,positive,SHOULDER,00060,1,test
...,...,...,...,...,...,...,...
36213,HA11023_1n,MURA-v1.1/train/XR_HAND/patient11023/study1_ne...,negative,HAND,11023,1,test
36214,HA11023_1n,MURA-v1.1/train/XR_HAND/patient11023/study1_ne...,negative,HAND,11023,1,test
36796,HA11181_1n,MURA-v1.1/train/XR_HAND/patient11181/study1_ne...,negative,HAND,11181,1,test
36797,HA11181_1n,MURA-v1.1/train/XR_HAND/patient11181/study1_ne...,negative,HAND,11181,1,test
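
Studies are selected by sp_id, so all images of one study share a split. A patient with more than one study could still, in principle, end up with studies in both train and test. The following is a small sanity check sketched on made-up rows, assuming the same column names as the detailed dataframe; the helper names are hypothetical.

```python
from collections import defaultdict

# Toy rows (made up) that mimic the detailed dataframe columns
rows = [
    {'sp_id': 'SH00044_1p', 'patient': '00044', 'split': 'test'},
    {'sp_id': 'SH00044_1p', 'patient': '00044', 'split': 'test'},
    {'sp_id': 'SH00044_2n', 'patient': '00044', 'split': 'train'},
    {'sp_id': 'HA11023_1n', 'patient': '11023', 'split': 'test'},
    {'sp_id': 'HA11024_1n', 'patient': '11024', 'split': 'train'},
]

def splits_per_key(rows, key):
    """Collect the set of splits in which each value of `key` occurs."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[key]].add(row['split'])
    return seen

def keys_in_multiple_splits(rows, key):
    """Return the values of `key` that occur in more than one split."""
    return sorted(k for k, splits in splits_per_key(rows, key).items() if len(splits) > 1)

print(keys_in_multiple_splits(rows, 'sp_id'))    # prints: []
print(keys_in_multiple_splits(rows, 'patient'))  # prints: ['00044']
```

Whether such patient overlap matters depends on how strictly the test set should be separated from the training data.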


## Move test images to separate directory

In [69]:
# Get test split src paths as list
src_paths = detailed_df.loc[detailed_df['split'] == 'test', 'filepath'].to_list()

# Create destination paths
dst_paths = [path.replace('train', 'test') for path in src_paths]

# Move images to test directory, while maintaining higher directories structure
for src_path, dst_path in zip(src_paths, dst_paths):
    src = os.path.join(ROOT_DIR, src_path)
    dst = os.path.join(ROOT_DIR, dst_path)

    # Create the destination study directory if it does not exist yet
    os.makedirs(os.path.dirname(dst), exist_ok=True)

    shutil.move(src, dst)
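
The cell above builds destination paths with a plain string replacement, which works for MURA paths because "train" occurs only once in them. As a more defensive sketch, the hypothetical helper `to_test_path` below swaps only the split directory, assuming it is always the second path component as in the example filepath printed earlier.

```python
from pathlib import PurePosixPath

def to_test_path(filepath, src_split='train', dst_split='test'):
    """Swap only the split directory (second path component) of a MURA path."""
    parts = list(PurePosixPath(filepath).parts)
    if len(parts) < 2 or parts[1] != src_split:
        raise ValueError(f"Expected '{src_split}' as the split directory: {filepath}")
    parts[1] = dst_split
    return str(PurePosixPath(*parts))

real_path = 'MURA-v1.1/train/XR_SHOULDER/patient00001/study1_positive/image1.png'
print(to_test_path(real_path))
# prints: MURA-v1.1/test/XR_SHOULDER/patient00001/study1_positive/image1.png

# Made-up file name that also contains "train", to show the difference
tricky_path = 'MURA-v1.1/train/XR_HAND/patient00001/study1_negative/training1.png'
print(tricky_path.replace('train', 'test'))  # file name becomes testing1.png
print(to_test_path(tricky_path))             # file name stays training1.png
```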

## Delete empty folders in train directory

In [99]:
study_pattern = re.compile(r'(.*)/.*')
patient_pattern = re.compile(r'(.*)/study[0-9]+_[a-z]*/.*')

# Remove all empty study directories
for path in src_paths:
    study_path = os.path.join(ROOT_DIR, study_pattern.match(path).group(1))
    # Several images share one study directory, so it may already be removed
    if os.path.exists(study_path) and len(os.listdir(study_path)) == 0:
        os.rmdir(study_path)

# Remove all empty patient directories
for path in src_paths:
    patient_path = os.path.join(ROOT_DIR, patient_pattern.match(path).group(1))
    if os.path.exists(patient_path) and len(os.listdir(patient_path)) == 0:
        os.rmdir(patient_path)
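
A different way to reach the same result, sketched here on a temporary directory with made-up folder names, is to walk the tree bottom-up and remove every directory that has become empty. The function name `remove_empty_dirs` is hypothetical, and this is not the approach used in the cell above.

```python
import os
import tempfile

def remove_empty_dirs(root):
    """Delete every empty directory below root, deepest first; return removed paths."""
    removed = []
    # topdown=False visits children before their parents
    for dirpath, _dirnames, _filenames in os.walk(root, topdown=False):
        # Re-list the directory, because children may have just been removed
        if dirpath != root and not os.listdir(dirpath):
            os.rmdir(dirpath)
            removed.append(dirpath)
    return removed

with tempfile.TemporaryDirectory() as root:
    # Study folder whose images were all moved away (now empty)
    os.makedirs(os.path.join(root, 'XR_HAND', 'patient00001', 'study1_negative'))
    # Study folder that still holds one image
    kept_study = os.path.join(root, 'XR_HAND', 'patient00002', 'study1_positive')
    os.makedirs(kept_study)
    with open(os.path.join(kept_study, 'image1.png'), 'wb') as handle:
        handle.write(b'')

    removed = remove_empty_dirs(root)
    remaining = sorted(os.listdir(os.path.join(root, 'XR_HAND')))

print(len(removed), remaining)  # prints: 2 ['patient00002']
```

The empty study folder and its then-empty patient folder are both removed, while the folder that still contains an image is left alone.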

## Update filepaths for test split images

In [107]:
# Replace "train" in filepath with "test"
test_mask = detailed_df['split'] == 'test'
detailed_df.loc[test_mask, 'filepath'] = detailed_df.loc[test_mask, 'filepath'].str.replace('train', 'test', regex=False)

### Save created dataframe as csv

In [109]:
detailed_df.to_csv(TRAIN_VALID_TEST_DETAILED_PATH)