This notebook splits the imaging data into training and test sets such that no patient is repeated in the test set and no test-set patient appears in the training set.

In [None]:
import pandas as pd
import random
#reading in a dataframe that contains image arrays, patient IDs ("subject"), and diagnosis
m2 = pd.read_pickle("mri_meta.pkl")

#cleaning patient IDs
m2["subject"] = m2["subject"].str.replace("s", "S").str.replace("\n", "")

#reading in the overlap test set
#ts = pd.read_csv("overlap_test_set.csv")

#removing ids from the overlap test set
#m2 = m2[~m2["subject"].isin(list(ts["subject"].values))]

0       1
1       2
2       2
3       2
4       2
       ..
2177    2
2178    0
2179    2
2180    2
2181    2
Name: label, Length: 2182, dtype: object


In [2]:
#counting the unique patients (382 in the run recorded below)
subjects = list(set(m2["subject"].values))
len(subjects)

382
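
The patient count can also be cross-checked with pandas directly, and the same family of calls shows how many scans each patient contributes. A minimal sketch on a toy frame: the `toy` DataFrame and its values are made up for illustration, and only the `subject` column name comes from this notebook.

```python
import pandas as pd

# Toy stand-in for m2; the IDs and labels are hypothetical.
toy = pd.DataFrame({
    "subject": ["S1", "S1", "S2", "S3", "S3", "S3"],
    "label": [0, 0, 1, 2, 2, 2],
})

# Number of distinct patient IDs.
n_patients = toy["subject"].nunique()

# Number of scans contributed by each patient.
scans_per_patient = toy["subject"].value_counts()

# Largest number of scans belonging to a single patient.
max_repeats = scans_per_patient.max()

print(n_patients, max_repeats)
```

Running the same three calls on `m2` would give the patient count and the maximum number of repeated scans per patient for the real data.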

In [3]:
0.1*len(m2) #10% of the scans, as a target size for the test set

218.20000000000002

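We have 3674 MRI scans from 551 patients (some patients repeated up to 16 times). Note that the outputs recorded in this notebook come from a run on 2182 scans from 382 patients.
We selected our testing set such that it has 367 unique MRIs (10% of the data). In the recorded run, the cell below draws 38 patient IDs, about 10% of the 382 patients.
We do not allow any repeating patients in the testing set. Repetition is allowed only during training, and no patient is included in both the training and testing sets.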
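
The constraints described above can be sketched as a single patient-level split. This is an illustrative sketch, not the exact procedure used in this notebook: the helper `split_by_patient`, its parameters, the toy data, and the choice to keep the first scan of each test patient are all assumptions. Sorting the IDs and seeding the generator are added only so that the draw is repeatable.

```python
import random

import pandas as pd


def split_by_patient(df, n_test_patients, seed=0, id_col="subject"):
    """Hypothetical helper: patient-level split with one scan per test patient."""
    # Sort the unique IDs so the draw does not depend on set ordering.
    ids = sorted(df[id_col].unique())
    rng = random.Random(seed)
    test_ids = set(rng.sample(ids, n_test_patients))
    in_test = df[id_col].isin(test_ids)
    # Keep only the first scan of each test patient (an assumption about
    # which scan to keep), so that no patient repeats in the test set.
    test_df = df[in_test].drop_duplicates(subset=id_col, keep="first")
    # Training keeps every scan of the remaining patients; repeats are allowed.
    train_df = df[~in_test]
    return train_df, test_df


# Toy stand-in for m2; the IDs and labels are hypothetical.
toy = pd.DataFrame({
    "subject": ["S1", "S1", "S2", "S3", "S3", "S4", "S5", "S5"],
    "label": [0, 0, 1, 2, 2, 1, 0, 0],
})
toy_train, toy_test = split_by_patient(toy, n_test_patients=2, seed=42)
```

With this layout, each test patient contributes exactly one row, and the extra scans of test patients are discarded rather than moved into training.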

In [5]:
#selecting 38 patient IDs at random for the test set (about 10% of the 382 unique patients)
picked_ids = random.sample(subjects, 38) 
other_ids = list(set(subjects)-set(picked_ids))

In [6]:
#creating the test set out of the patient IDs
#collecting every scan that belongs to a picked patient, one patient at a time
s = [m2[m2["subject"] == pid] for pid in picked_ids]
test = pd.concat(s)

test.to_csv("test.csv")

In [7]:
#creating the train set out of the patient IDs
#collecting every scan that belongs to the remaining patients, one patient at a time
s = [m2[m2["subject"] == pid] for pid in other_ids]
train = pd.concat(s)
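
Once both frames exist, it is worth confirming that no patient ID appears on both sides of the split. A minimal sketch of such a check, using a hypothetical helper `shared_patients` and made-up toy frames; only the `subject` column name is taken from this notebook.

```python
import pandas as pd


def shared_patients(train_df, test_df, id_col="subject"):
    """Hypothetical helper: return the patient IDs present in both splits."""
    return set(train_df[id_col]) & set(test_df[id_col])


# Toy frames with hypothetical IDs.
toy_train = pd.DataFrame({"subject": ["S1", "S1", "S2"]})
toy_test = pd.DataFrame({"subject": ["S3", "S4"]})
leaky_test = pd.DataFrame({"subject": ["S2", "S4"]})

clean_overlap = shared_patients(toy_train, toy_test)    # no shared patients
leaky_overlap = shared_patients(toy_train, leaky_test)  # "S2" is in both

print(clean_overlap, leaky_overlap)
```

Applied to the real frames, `assert not shared_patients(train, test)` would stop the notebook before anything is written to disk if a patient leaked across the split.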

In [8]:
train[["im1", "im2", "im3", "subject", "visit"]].to_pickle("img_train.pkl")
test[["im1", "im2", "im3", "subject", "visit"]].to_pickle("img_test.pkl")

In [None]:
train[["label"]].to_pickle("img_y_train.pkl")
test[["label"]].to_pickle("img_y_test.pkl")

797     0
914     0
1206    0
1691    0
2156    0
       ..
869     2
1064    2
1212    2
1601    2
1903    2
Name: label, Length: 216, dtype: object
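
Because patients are drawn at random without regard to diagnosis, the class proportions in the two splits can differ. A minimal sketch for comparing them side by side; the toy label Series are made up, and integer labels are an assumption (the recorded outputs show `dtype: object`).

```python
import pandas as pd

# Toy label vectors with hypothetical values.
toy_y_train = pd.Series([0, 0, 1, 2, 2, 2, 2, 1], name="label")
toy_y_test = pd.Series([0, 2, 2, 1], name="label")

# Fraction of each class in each split, aligned on the class label.
balance = pd.DataFrame({
    "train": toy_y_train.value_counts(normalize=True),
    "test": toy_y_test.value_counts(normalize=True),
}).fillna(0.0).sort_index()

print(balance)
```

A class missing from one split shows up as 0.0 in that column, which makes a badly unbalanced draw easy to spot.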
