This notebook splits the imaging data into training and test sets such that no patient is repeated in the test set and no test-set patient appears in the training set.

In [1]:
import pandas as pd
import random
#reading in a dataframe that contains image arrays, patient IDs ("subject"), and diagnosis
m2 = pd.read_pickle("mri_meta.pkl")

#cleaning patient IDs
m2["subject"] = m2["subject"].str.replace("s", "S").str.replace("\n", "")

#reading in the overlap test set
ts = pd.read_csv("overlap_test_set.csv")

#removing all scans of patients that appear in the overlap test set
m2 = m2[~m2["subject"].isin(ts["subject"])]

FileNotFoundError: [Errno 2] No such file or directory: 'mri_meta.pkl'

In [None]:
#there are 551 unique patients
subjects = list(set(m2["subject"].values))
len(subjects)

In [None]:
0.1*len(m2) #10% for testing

We have 3674 MRI scans from 551 patients (some patients have up to 16 scans).
We selected the test set so that it contains 367 MRI scans (10% of all scans), each from a different patient, as shown below.
No patient is repeated in the test set. Repeated patients are allowed only in training, and no patient is included in both the training and test sets.
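
The splitting rule described above can be illustrated on a small toy table. This is a minimal sketch, not the notebook's own code: the helper name `split_by_patient`, the `seed` argument, and the toy data are assumptions made for illustration, and only the `subject` and `label` column names come from the notebook.

```python
import random

import pandas as pd


def split_by_patient(df, n_test_patients, seed=0):
    """Return (train, test) with one scan per test patient and no shared patients.

    Hypothetical helper for illustration; assumes a "subject" column of patient IDs.
    """
    rng = random.Random(seed)
    patients = sorted(df["subject"].unique())
    test_ids = rng.sample(patients, n_test_patients)
    in_test = df["subject"].isin(test_ids)
    # One randomly chosen scan per held-out patient.
    test = df[in_test].groupby("subject").sample(n=1, random_state=seed)
    # All scans of the remaining patients go to training (repeats allowed here).
    train = df[~in_test]
    return train, test


# Toy data: four patients, some with several scans.
toy = pd.DataFrame({
    "subject": ["S1", "S1", "S2", "S3", "S3", "S3", "S4"],
    "label": [0, 0, 1, 1, 1, 1, 0],
})
train, test = split_by_patient(toy, n_test_patients=2)
```

Filtering the training rows by patient ID rather than by scan index is what keeps the remaining scans of a held-out patient out of training.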

In [None]:
#selecting 367 patient IDs
picked_ids = random.sample(subjects, 367) 

In [None]:
#creating the test set by sampling one scan per picked patient ID
#(DataFrame.append was removed in pandas 2.0, so the sampled rows are concatenated instead)
samples = [m2[m2["subject"] == pid].sample() for pid in picked_ids]
test = pd.concat(samples)[["img_array", "subject", "label"]]

In [None]:
indexes = list(set(m2.index) - set(test.index))

In [None]:
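#creating the training set from all scans of patients that were not picked for testing;
#filtering by scan index alone would leave the other scans of test patients in training
train = m2[~m2["subject"].isin(picked_ids)]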
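
A quick way to confirm that a split satisfies the two stated properties (no shared patients, no repeated test patients) is a small check like the one below. It is a sketch under assumptions: `check_patient_split` and the toy frames are hypothetical and are not part of the notebook.

```python
import pandas as pd


def check_patient_split(train, test, id_col="subject"):
    """Summarize patient-level leakage between two splits (hypothetical helper)."""
    shared = set(train[id_col]) & set(test[id_col])
    return {
        # Patients that appear in both splits; should be empty.
        "shared_patients": sorted(shared),
        # Number of test rows whose patient already occurred in the test set; should be 0.
        "repeated_test_patients": int(test[id_col].duplicated().sum()),
    }


# Toy example in which patient S2 leaks from training into testing.
toy_train = pd.DataFrame({"subject": ["S1", "S1", "S2"]})
toy_test = pd.DataFrame({"subject": ["S2", "S3"]})
report = check_patient_split(toy_train, toy_test)
print(report)
```

Running the same check on the notebook's `train` and `test` frames before saving them would flag any patient that ended up in both sets.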

In [None]:
train[["img_array"]].to_pickle("img_train.pkl")
test[["img_array"]].to_pickle("img_test.pkl")

In [None]:
train[["label"]].to_pickle("img_y_train.pkl")
test[["label"]].to_pickle("img_y_test.pkl")