This notebook splits the imaging data into training and testing such that there are no repeating patients in the test set and that the patients in the test set do not appear in training. 

In [96]:
import pandas as pd
import random
#reading in a dataframe that contains image arrays, patient IDs ("subject"), and diagnosis
m2 = pd.read_pickle("mri_meta.pkl")

#cleaning patient IDs
m2["subject"] = m2["subject"].str.replace("s", "S").str.replace("\n", "")

#reading in the overlap test set
ts = pd.read_csv("/Users/ishaponugoti/Desktop/DL AD/ADDetection/preprocess_overlap/overlap_test_set.csv")

#removing ids from the overlap test set
m2 = m2[~m2["subject"].isin(list(ts["subject"].values))]

In [97]:
#there are 551 unique patients
subjects = list(set(m2["subject"].values))
len(subjects)

193

In [98]:
m2 = m2[~m2['label'].isin([3, 4, 5])]
print(m2['label'].value_counts())


label
0    161
2     84
1     26
Name: count, dtype: int64


In [99]:
len(m2)

271

In [100]:
0.1*len(m2) #10% for testing

27.1

We have 3674 MRI scans from 551 patients (some patients repeated up to 16 times).
We selected our testing set such that it has 367 unique MRIs (10% of training) shwon below. 
We do not allow for any repeating patients in the testing set. We only allowed repetition during training, and no patient was included in both training and testing sets.

In [101]:
subjects = list(set(m2["subject"].values))
len(subjects)

179

In [102]:
#selecting 367 patient IDs
# picked_ids = random.sample(subjects, 367) 
picked_ids = random.sample(subjects, 100) 

In [103]:
#creating the test set out of the patient IDs
# test = pd.DataFrame(columns = ["img_array", "subject", "label"]) 
# for i in range(len(picked_ids)):
#     s = m2[m2["subject"] == picked_ids[i]].sample()
#     test = test.append(s)

# Initialize an empty list to store the DataFrame slices
dataframes = []

# Loop through each id in picked_ids and sample the corresponding rows from m2
for picked_id in picked_ids:
    s = m2[m2["subject"] == picked_id].sample(n=1)
    dataframes.append(s)

# Concatenate all the DataFrame slices into a single DataFrame
test = pd.concat(dataframes, ignore_index=True)

# Now 'test' contains your sampled data


In [104]:
indexes = list(set(m2.index) - set(test.index))

In [105]:
#creating the training set using all the other data points
train = m2[m2.index.isin(indexes)]

In [106]:
import numpy as np
# train = train[~train['label'].isin([3, 4, 5])]
print("Unique labels in training data:", np.unique(train[["label"]]))
print("Unique labels in testing data:", np.unique(test[["label"]]))

Unique labels in training data: [0 1 2]
Unique labels in testing data: [0 1 2]


In [107]:
len(train[["label"]])

185

In [108]:
len(test["label"])

100

In [109]:
train[["img_array"]].to_pickle("img_train.pkl")
test[["img_array"]].to_pickle("img_test.pkl")

In [110]:
train[["label"]].to_pickle("img_y_train.pkl")
test[["label"]].to_pickle("img_y_test.pkl")