### Data Loading 2: TFRecords Object 
We need to express our dataset as TFRecords, which is the format the nobrainer models use. Since we are going to use transfer learning to train them, our dataset format has to be compatible with theirs. It has the added benefit of living in files on disk, which we need since our TF-GPU kernel is not compatible with our nobrainer kernel. 

### ChatGPT Function 
The nobrainer library has a function to convert a dataset from CSV format to TFRecords. Before I dive into implementing that, I wanted to see if ChatGPT could come up with a working adaptation of the function I was already using. Below is what it gave me. 

In [6]:
import tensorflow as tf
import os
import nibabel as nib
import numpy as np
from sklearn.model_selection import train_test_split

### Summarizing the Above Code
It's a mess, and frankly I have no idea if it's complete or if it works. I am stubborn, so before moving on I wanted to try a different approach: have it just return an unsplit dataset object, then have it modify that function to return a TFRecords object. Here is what I got: 

In [25]:
def serialize_example(feature0, feature1):
    """
    Creates a tf.train.Example message from two input features.
    """
    feature = {
        'data': tf.train.Feature(float_list=tf.train.FloatList(value=feature0.flatten())),
        'label': tf.train.Feature(int64_list=tf.train.Int64List(value=[feature1])),
    }
    example_proto = tf.train.Example(features=tf.train.Features(feature=feature))
    return example_proto.SerializeToString()

def load_data_old(top_dir):
    # Get a list of all the file paths and labels
    file_paths = []
    labels = []
    # os.listdir returns entries in arbitrary order, so sort to keep label indices stable
    for i, folder in enumerate(sorted(os.listdir(top_dir))):
        for file in os.listdir(os.path.join(top_dir, folder)):
            if 'max' in str(file):
                file_paths.append(os.path.join(top_dir, folder, file))
                labels.append(i)

    # Load the images into a numpy array
    data = []
    for filename in file_paths:
        img = nib.load(filename)
        data.append(img.get_fdata())
    data = np.array(data)

    # Convert labels to numpy array
    labels = np.array(labels)

    # Reassign labels assigned to 2 to 0
    labels[labels==2] = 0

    # Serialize the numpy arrays into a tfrecords file
    with tf.io.TFRecordWriter('data.tfrecords') as writer:
        for d, l in zip(data, labels):
            serialized_example = serialize_example(d, l)
            writer.write(serialized_example)

    # Create a tf.data.Dataset from the tfrecords file
    dataset = tf.data.TFRecordDataset('data.tfrecords')

    return dataset


### Pretty Good 
That seems like it should work pretty nicely. Now I just need to figure out a reasonable way to test it to make sure that it's working the way I think it should. 

In [12]:
path_to_data = os.path.join(os.getcwd(), 'datasets')
path_to_data

'c:\\Users\\inspect\\InSpect\\datasets'

In [13]:
dataset = load_data_old(path_to_data)

In [16]:
dataset.element_spec

TensorSpec(shape=(), dtype=tf.string, name=None)
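
Each element of this dataset is a scalar string holding one serialized `tf.train.Example`, so it has to be parsed before the volumes can be inspected. Here is a minimal sketch of how that might look, assuming the `'data'` and `'label'` feature names from `serialize_example` above and a volume shape that is known in advance. The helper name `parse_example` and its `volume_shape` parameter are my own and are not part of the original function.

```python
import numpy as np
import tensorflow as tf


def parse_example(serialized, volume_shape):
    """Decode one serialized tf.train.Example back into (volume, label)."""
    n_voxels = int(np.prod(volume_shape))
    feature_spec = {
        # FloatList values are stored as float32 in a flat 1-D list.
        'data': tf.io.FixedLenFeature([n_voxels], tf.float32),
        'label': tf.io.FixedLenFeature([], tf.int64),
    }
    parsed = tf.io.parse_single_example(serialized, feature_spec)
    # Restore the 3-D shape that was lost when the volume was flattened.
    volume = tf.reshape(parsed['data'], volume_shape)
    return volume, parsed['label']
```

Something like `dataset.map(lambda s: parse_example(s, shape))`, with `shape` set to the shape of one scan, should then yield `(volume, label)` pairs whose shapes and labels can be compared against the original files.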

### Could Work but Not Ideal 
The problem with this is that the data is not split the way I want it to be, and I also won't be able to fit a lot of these models in memory without splitting. So it looks like I will have to do things the "right way". Let's start learning how to use the nobrainer library to build these models. Learning nobrainer will be good experience anyway, so the discomfort will be worth it. We will start with the tutorial "train_binary_classification.ipynb" to see exactly how the input data needs to be formatted. Then we will return to the data loading tutorial, except we will follow along with our SPECT scan code. 

In [17]:
import nobrainer

In [18]:
import random

csv_of_filepaths = nobrainer.utils.get_data() 
filepaths = nobrainer.io.read_csv(csv_of_filepaths)

# Add random boolean values (our labels)
filepaths = [(x, random.choice([0, 1])) for x, _ in filepaths]

train_paths = filepaths[:9]
evaluate_paths = filepaths[9:]

In [24]:
type(filepaths)

list

In [20]:
np.shape(filepaths)

(10, 2)

In [19]:
filepaths

[('C:\\Users\\inspect\\AppData\\Local\\Temp\\nobrainer-data\\datasets\\sub-01_t1.mgz',
  0),
 ('C:\\Users\\inspect\\AppData\\Local\\Temp\\nobrainer-data\\datasets\\sub-02_t1.mgz',
  1),
 ('C:\\Users\\inspect\\AppData\\Local\\Temp\\nobrainer-data\\datasets\\sub-03_t1.mgz',
  0),
 ('C:\\Users\\inspect\\AppData\\Local\\Temp\\nobrainer-data\\datasets\\sub-04_t1.mgz',
  0),
 ('C:\\Users\\inspect\\AppData\\Local\\Temp\\nobrainer-data\\datasets\\sub-05_t1.mgz',
  0),
 ('C:\\Users\\inspect\\AppData\\Local\\Temp\\nobrainer-data\\datasets\\sub-06_t1.mgz',
  1),
 ('C:\\Users\\inspect\\AppData\\Local\\Temp\\nobrainer-data\\datasets\\sub-07_t1.mgz',
  1),
 ('C:\\Users\\inspect\\AppData\\Local\\Temp\\nobrainer-data\\datasets\\sub-08_t1.mgz',
  0),
 ('C:\\Users\\inspect\\AppData\\Local\\Temp\\nobrainer-data\\datasets\\sub-09_t1.mgz',
  0),
 ('C:\\Users\\inspect\\AppData\\Local\\Temp\\nobrainer-data\\datasets\\sub-10_t1.mgz',
  1)]
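
So the tutorial's input is just a list of `(filepath, label)` tuples, split into train and evaluate sets by slicing. For our own labelled scans, a shuffled split that holds out a share of each label may be preferable. Below is a minimal sketch in plain Python. The helper name `split_by_label`, the `eval_fraction` default and the fixed seed are all assumptions of mine, not anything nobrainer provides.

```python
import random
from collections import defaultdict


def split_by_label(pairs, eval_fraction=0.1, seed=0):
    """Shuffle (path, label) pairs and hold out a fraction of each label."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    by_label = defaultdict(list)
    for path, label in pairs:
        by_label[label].append((path, label))

    train, evaluate = [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        # Hold out at least one example per label when the group has more than one.
        n_eval = max(1, round(len(group) * eval_fraction)) if len(group) > 1 else 0
        evaluate.extend(group[:n_eval])
        train.extend(group[n_eval:])
    return train, evaluate
```

Calling `split_by_label(filepaths)` on a list like the one above would return two lists in the same `(filepath, label)` format, so they could stand in for `train_paths` and `evaluate_paths`.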

### Nobrainer Input Shape 
So I am dumb because I wasn't rock solid on how CSVs work. This is actually a pretty trivial problem to solve: nobrainer reads the CSV into a list of (filepath, label) pairs, and my original data loading function already finds all the filepaths and labels. I just need to modify it a bit. 

In [46]:
import csv

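def load_data(top_dir):
    """Write dataCsv.csv with one (path, label) row per 'max' scan under top_dir."""
    # Get a list of all the file paths and labels
    file_paths = []
    labels = []
    # os.listdir returns entries in arbitrary order, so sort to keep label indices stable
    for i, folder in enumerate(sorted(os.listdir(top_dir))):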
        for file in os.listdir(os.path.join(top_dir, folder)):
            if 'max' in str(file):
                file_paths.append(os.path.join(top_dir, folder, file))
                labels.append(i)

    # Reassign labels assigned to 2 to 0
    labels = np.array(labels)
    labels[labels==2] = 0

    # Create a list of tuples with file paths and labels
    data = list(zip(file_paths, labels))

    # Save the data to a .csv file
    with open('dataCsv.csv', 'w', newline='') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(['path', 'label'])
        writer.writerows(data)


In [47]:
# get path to data 
path_to_data = os.path.join(os.getcwd(), 'datasets')
path_to_data

'c:\\Users\\inspect\\InSpect\\datasets'

In [48]:
# call the function to make the CSV 
load_data(path_to_data)

### Results 
Inspection of the CSV confirms that the function is working as intended. Now we should be able to follow along with the data loading tutorial using our own dataset! This format isn't technically allowed by this tutorial (it allows for data, label or pathtodata, pathtolabel), but a later tutorial uses it, so I think it should be fine. Let's press on! 
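
That inspection can also be done in code. Here is a small sketch using only the standard library, assuming the `path`/`label` header written by `load_data` above. The helper name `summarize_label_csv` is hypothetical.

```python
import csv
from collections import Counter


def summarize_label_csv(csv_path):
    """Return (number of rows, Counter of labels) for a path,label CSV with a header."""
    with open(csv_path, newline='') as csvfile:
        reader = csv.DictReader(csvfile)
        # Labels come back as strings because csv does no type conversion.
        labels = [row['label'] for row in reader]
    return len(labels), Counter(labels)


# Example call against the file written above:
# n_rows, label_counts = summarize_label_csv('dataCsv.csv')
```

The row count should match the number of scans found, and the label counts should only contain `'0'` and `'1'` after the remapping of label 2.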

In [53]:
# getting the correct shape 
data_path = os.getcwd() + '/datasets/ptsd/'
example_filename = os.path.join(data_path, 'sub-1928_ses-rest_spect_MNI_max.nii')
img = nib.load(example_filename)
shape = img.shape
shape

(79, 95, 79)

### Sharding: Breaking up our dataset based on its size
According to the TensorFlow documentation (https://www.tensorflow.org/tutorials/load_data/tfrecord), we need to shard our dataset to take full advantage of this optimization. We will be using 1 host, so we want about **10 shards**. Ideally each shard would be larger than 100 MB. The TFRecord file we made with our naive implementation earlier was 2.3 GB, so each shard should be about 230 MB, and we are good to go! Since the function takes the number of examples per shard, we just divide our dataset size by num_shards to get examples per shard. (Because of the integer division this will usually give us 11 shards, but 10 of them will be optimally sized.)

In [65]:
num_shards = 10
dataset_size = 1033
examples_per_shard = dataset_size // num_shards 
examples_per_shard

103
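
To see where the eleventh shard comes from, the shard sizes can be worked out directly. This sketch assumes the converter fills shards in order, `examples_per_shard` at a time, with the remainder going into a final smaller shard. `shard_sizes` is a helper made up for illustration.

```python
import math


def shard_sizes(dataset_size, examples_per_shard):
    """List how many examples land in each shard when shards are filled in order."""
    if dataset_size <= 0:
        return []
    n_shards = math.ceil(dataset_size / examples_per_shard)
    # Every shard except the last is full; the last one takes whatever is left.
    sizes = [examples_per_shard] * (n_shards - 1)
    sizes.append(dataset_size - examples_per_shard * (n_shards - 1))
    return sizes


sizes = shard_sizes(1033, 1033 // 10)
print(len(sizes), sizes[-1])  # prints: 11 3
```

With 1033 examples and 103 per shard, ten shards are full and the last one holds only the 3 leftover examples.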

In [63]:
# we save our path to csv 
path_to_csv = os.path.join(os.getcwd(), 'dataCsv.csv')
path_to_csv

'c:\\Users\\inspect\\InSpect\\dataCsv.csv'

### Generating the Object 
If we have everything set up correctly, we should be able to call the function using the parameters we defined above. First, let's check that all our stuff makes sense. 

In [76]:
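filepaths = nobrainer.io.read_csv(path_to_csv)
# NOTE: train_paths and evaluate_paths still hold the tutorial's sample files from the
# earlier cell, so this verifies those 10 files rather than the CSV we just read.
invalid = nobrainer.io.verify_features_labels(train_paths)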
assert not invalid

invalid = nobrainer.io.verify_features_labels(evaluate_paths)
assert not invalid

Verifying 9 examples
Verifying 1 examples


In [111]:
# C:\Users\inspect\InSpect\dataCsv.csv
path_to_csv

'c:\\Users\\inspect\\InSpect\\dataCsv.csv'

In [113]:
!nobrainer convert \
    --csv='c:\Users\inspect\InSpect\dataCsv.csv' \
    --tfrecords-template='data/data_shard-{shard:03d}.tfrec' \
    --examples-per-shard=103 \
    --volume-shape=79 95 79 \
    --verbose


Usage: nobrainer convert [OPTIONS]
Try 'nobrainer convert --help' for help.

Error: Invalid value for '-c' / '--csv': Path "'c:\\Users\\inspect\\InSpect\\dataCsv.csv'" does not exist.
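
The error message shows the path with the single quotes still attached, so the quotes reached nobrainer as part of the value instead of being stripped by the shell (the Windows command prompt only treats double quotes as quoting characters). One possible way around shell quoting altogether is to build the argument list in Python and hand it to `subprocess.run`. This is only a sketch: `build_convert_command` is a hypothetical helper, the flags are copied from the cell above, and I have not confirmed that the conversion itself succeeds with them.

```python
import subprocess


def build_convert_command(csv_path, tfrecords_template, examples_per_shard, volume_shape):
    """Build the argument list for `nobrainer convert` without any shell quoting."""
    return [
        'nobrainer', 'convert',
        f'--csv={csv_path}',
        f'--tfrecords-template={tfrecords_template}',
        f'--examples-per-shard={examples_per_shard}',
        # Same tokens a shell would produce from `--volume-shape=79 95 79`.
        f'--volume-shape={volume_shape[0]}', str(volume_shape[1]), str(volume_shape[2]),
        '--verbose',
    ]


cmd = build_convert_command(
    r'c:\Users\inspect\InSpect\dataCsv.csv',
    'data/data_shard-{shard:03d}.tfrec',
    103,
    (79, 95, 79),
)
# subprocess.run(cmd, check=True)  # uncomment to actually run the conversion
```

Because each argument is passed as its own list element, no quote characters are needed around the path at all.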
