<div style="text-align: center; font-size: 30px; font-weight: bold; padding:20px">University of London: BSc Computer Science (Final Project)</div>
<div style="text-align: center; font-size: 30px; margin-top: 10px;">Using Neural Network based Graph Model for Alzheimer's Classification</div>
<br>
<div style="text-align: center; font-size: 20px; margin-top: 10px;">Pragya Modi</div>
<div style="text-align: center; font-size: 20px; margin-top: 15px;">190308090</div>

<div style="font-size: 70px; font-weight: bold; border-bottom: 6px solid black; padding-bottom:20px ">1. Building the Dataset</div>

<div style="font-size: 30px; padding:10px">
<h1>1.1 Introduction</h1>


<h2>1.1.1 Project Overview </h2>

<p>
This project aims to build a neural-network-based graph model for Alzheimer's classification. Although many studies already address this problem, very few reflect real-world clinical practice. Most focus on building an image classification model from MRI scans alone. In practice, however, patient information such as family history, disease history, and test results is considered alongside the MRI scans. 

Therefore, this project aims to build a graph model that takes into account both patient information (features) and MRI scans (images).
</p>

<h2>1.1.2 Data Overview </h2>

<p>
The dataset used for this project is the OASIS brain scan dataset <code> [1] </code>. The image data was downloaded in NIFTI format, and the features data was available across multiple CSV files. </p>

<p>
<b> NIFTI Format: </b>  NIFTI (Neuroimaging Informatics Technology Initiative) is an open file format commonly used for storing brain imaging data obtained from MRI scans.
</p>

</div>




<div style="font-size: 30px; padding:10px">
<h1>1.2 Cleaning Features Data</h1>

In [1]:
# imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

<div style="font-size: 30px; padding:10px">

<h2> 1.2.1 Reading Raw Data </h2>

</div>

In [2]:
# load the demographics data
demographics_df = pd.read_csv('../data/features_data/demographics.csv')
# load the CDR data
cdr_df = pd.read_csv('../data/features_data/cdr.csv')
# load the json data
json_df = pd.read_csv('../data/features_data/json.csv')

In [3]:
cdr_df.head()

Unnamed: 0,OASISID,OASIS_session_label,days_to_visit,age at visit,MMSE,memory,orient,judgment,commun,homehobb,...,dx1_code,dx2_code,dx3_code,dx4_code,dx5_code,dx1,dx2,dx3,dx4,dx5
0,OAS30001,OAS30001_UDSb4_d0000,0,65.19,28.0,0.0,0.0,0.0,0.0,0.0,...,1.0,,,,,Cognitively normal,.,.,.,.
1,OAS30001,OAS30001_UDSb4_d0339,339,66.12,28.0,0.0,0.0,0.0,0.0,0.0,...,1.0,,,,,Cognitively normal,.,.,.,.
2,OAS30001,OAS30001_UDSb4_d0722,722,67.17,30.0,0.0,0.0,0.0,0.0,0.0,...,1.0,,,,,Cognitively normal,.,.,.,.
3,OAS30001,OAS30001_UDSb4_d1106,1106,68.22,30.0,0.0,0.0,0.0,0.0,0.0,...,1.0,,,,,Cognitively normal,.,.,.,.
4,OAS30001,OAS30001_UDSb4_d1456,1456,69.18,30.0,0.0,0.0,0.0,0.0,0.0,...,1.0,,,,,Cognitively normal,.,.,.,.


<div style="font-size: 30px; padding:10px">

<h2>1.2.2 Cleaning Individual Files </h2>

<h3> JSON information </h3>

<p>
MRI scans are acquired with different weightings (for example, T1w and T2w). T2w scans are used for this project. Therefore, <code>json_df</code> is filtered to keep the rows where <code>scan category</code> is <code>T2w</code>.

Next, the relevant columns are extracted from the data frame and the column names are changed as required. 
</p>

</div>

In [4]:
# explore the JSON data
json_df.head()

Unnamed: 0,subject_id,label,acccession,release version,scan category,filename,Modality,MagneticFieldStrength,Manufacturer,ManufacturersModelName,...,PhaseOversampling,EchoNumber,Interpolation2D,RawImage,ConsistencyInfo,MultibandAccelerationFactor,PhaseEncodingAxis,AcquisitionDateTime,PhaseEncodingLines,AccelFactPE
0,OAS30044,OAS30044_MR_d0061,CENTRAL_E09340,2018 Release,fieldmap,sub-OAS30044_ses-d0061_echo-1_run-01_fieldmap....,MR,3.0,Siemens,Biograph_mMR,...,,,,,,,,,,
1,OAS31435,OAS31435_MR_d0071,CENTRAL02_E02488,2022 Release,asl,sub-OAS31435_sess-d0071_asl.json,MR,3.0,Siemens,Biograph_mMR,...,,,,,,,,,,
2,OAS31172,OAS31172_MR_d0407,CENTRAL_E11127,2018 Release,T2star,sub-OAS31172_ses-d0407_T2star.json,MR,3.0,Siemens,TrioTim,...,,,,,,,,,,
3,OAS31430,OAS31430_MR_d0060,CENTRAL02_E02831,2022 Release,T2star,sub-OAS31430_sess-d0060_T2star.json,MR,3.0,Siemens,MAGNETOM_Vida,...,,,,,,,,,,
4,OAS30096,OAS30096_MR_d2948,CENTRAL02_E02487,2022 Release,T1w,sub-OAS30096_sess-d2948_T1w.json,MR,3.0,Siemens,Biograph_mMR,...,,,,,,,,,,


In [5]:
# filter for the rows with scan category of 'T2w'
json_df = json_df[json_df['scan category'] == 'T2w']
# get the columns relevant to the features for this project
json_df = json_df[['subject_id', 'label']]
# rename columns to be more descriptive
# (assigning the result avoids modifying a filtered slice in place)
json_df = json_df.rename(columns={'subject_id': 'id', 'label': 'session'})

# get an overview of the updated json dataframe
json_df.head()

Unnamed: 0,id,session
6,OAS30574,OAS30574_MR_d1917
8,OAS30505,OAS30505_MR_d5746
13,OAS30919,OAS30919_MR_d4506
17,OAS31398,OAS31398_MR_d0098
20,OAS30576,OAS30576_MR_d0084
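
<div style="font-size: 30px; padding:10px">
<p>
The filter, column selection, and rename described above can also be written as one chained expression. The sketch below uses a small toy frame with made-up values (not OASIS data) and the hypothetical names <code>toy_json_df</code>, <code>t2w_df</code>, and <code>unique_sessions</code>. It also shows an optional <code>drop_duplicates</code> step, under the assumption that only one row per session is wanted when a session contains several T2w scans.
</p>
</div>

```python
import pandas as pd

# toy stand-in for json.csv; the values are illustrative only
toy_json_df = pd.DataFrame({
    "subject_id": ["OAS30001", "OAS30001", "OAS30001", "OAS30002"],
    "label": ["OAS30001_MR_d0129", "OAS30001_MR_d0129",
              "OAS30001_MR_d0129", "OAS30002_MR_d0371"],
    "scan category": ["T1w", "T2w", "T2w", "T2w"],
})

# filter the rows and select the columns in a single .loc call,
# then rename on the returned frame instead of in place
t2w_df = (
    toy_json_df
    .loc[toy_json_df["scan category"] == "T2w", ["subject_id", "label"]]
    .rename(columns={"subject_id": "id", "label": "session"})
    .reset_index(drop=True)
)

# optional (assumption): keep one row per session if a session has several T2w scans
unique_sessions = t2w_df.drop_duplicates().reset_index(drop=True)

print(t2w_df)
print(unique_sessions)
```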


<div style="font-size: 30px; padding:10px">

<h3> Demographic Information </h3>

<p>
The demographics file from the OASIS brain dataset <code>[1]</code> provides detailed information about each participant. However, not all of these values are used in this project, so the columns that are not relevant are dropped. Lastly, the column names are changed as required. 
</p>

</div>

In [6]:
# explore the demographics data
demographics_df.head()

Unnamed: 0,OASISID,Subject_accession,AgeatEntry,AgeatDeath,GENDER,EDUC,SES,racecode,race,ETHNIC,AIAN,NHPI,ASIAN,AA,WHITE,daddem,momdem,HAND,APOE
0,OAS30001,,65.1945,,2,12.0,4.0,5,White,0.0,0.0,0.0,0.0,0.0,1.0,5.0,1.0,R,23.0
1,OAS30002,,67.2521,76.9397,1,18.0,2.0,5,White,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,R,34.0
2,OAS30003,,58.8137,,2,18.0,1.0,5,White,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,R,33.0
3,OAS30004,,55.1342,,2,17.0,1.0,5,White,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,L,23.0
4,OAS30005,,48.063,,2,16.0,3.0,2,ASIAN,,0.0,0.0,1.0,0.0,0.0,0.0,0.0,R,33.0


In [7]:
# drop the columns that are not relevant to this project
demographics_df.drop(['Subject_accession', 'AgeatDeath', 'ETHNIC', 'AIAN', 'NHPI', 'ASIAN', 'AA', 'WHITE', 'daddem', 'momdem', 'APOE', 'race'], axis=1, inplace=True)

# rename columns to be more descriptive
demographics_df.rename(columns={'OASISID': 'id', 'AgeatEntry': 'age', 'GENDER': 'gender', 'racecode': 'race_code', 'HAND': 'hand', 'EDUC': 'education', 'SES': 'socio_economic_statis'}, inplace=True)

# get an overview of the updated demographics dataframe
demographics_df.head()

Unnamed: 0,id,age,gender,education,socio_economic_statis,race_code,hand
0,OAS30001,65.1945,2,12.0,4.0,5,R
1,OAS30002,67.2521,1,18.0,2.0,5,R
2,OAS30003,58.8137,2,18.0,1.0,5,R
3,OAS30004,55.1342,2,17.0,1.0,5,L
4,OAS30005,48.063,2,16.0,3.0,2,R


<div style="font-size: 30px; padding:10px">

<h3> Clinical Dementia Rating </h3>

<p>
Similar to the other two data frames, this dataset is cleaned by taking only the relevant columns from the raw data and then changing the column names as required.

Since this is a longitudinal study, patients came in for multiple testing sessions. In some cases, the CDR value for the same patient differed across sessions. However, to keep this research simple, this difference was discarded and the maximum CDR value per subject was used when building the features dataset. The same per-subject maximum is applied to the MMSE column.
</p>

</div>


In [8]:
# get the relevant columns from the dataframe
cdr_df = cdr_df[['OASISID', 'MMSE', 'CDRTOT']]

# rename columns to be more descriptive
cdr_df = cdr_df.rename(columns={'OASISID': 'id', 'MMSE': 'mmse', 'CDRTOT': 'cdr'})

# keep the maximum value of each column for each subject
# (mmse and cdr are maximised independently, so they may come from different sessions)
cdr_df = cdr_df.groupby('id', as_index=False).max()

# get an overview of the updated CDR dataframe
cdr_df.head()

Unnamed: 0,id,mmse,cdr
0,OAS30001,30.0,0.0
1,OAS30002,30.0,0.0
2,OAS30003,30.0,0.0
3,OAS30004,30.0,0.0
4,OAS30005,30.0,0.0
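
<div style="font-size: 30px; padding:10px">
<p>
<code>groupby(...).max()</code> takes the maximum of every column separately, so the MMSE and CDR values of a subject can originate from different visits. The sketch below contrasts this with an alternative that keeps the complete row of the visit with the highest CDR. The toy values and the names <code>toy_cdr_df</code>, <code>per_column_max</code>, and <code>worst_visit</code> are illustrative assumptions, not part of the project data.
</p>
</div>

```python
import pandas as pd

# toy longitudinal data: subject S1 has three visits, S2 has two
toy_cdr_df = pd.DataFrame({
    "id":   ["S1", "S1", "S1", "S2", "S2"],
    "mmse": [29.0, 30.0, 24.0, 28.0, 27.0],
    "cdr":  [0.0, 0.0, 1.0, 0.5, 0.0],
})

# column-wise maximum per subject: mmse and cdr may come from different visits
per_column_max = toy_cdr_df.groupby("id", as_index=False).max()

# alternative: keep the whole row of the visit with the highest cdr
worst_rows = toy_cdr_df.groupby("id")["cdr"].idxmax()
worst_visit = toy_cdr_df.loc[worst_rows].reset_index(drop=True)

print(per_column_max)  # S1 pairs mmse 30.0 with cdr 1.0
print(worst_visit)     # S1 pairs mmse 24.0 with cdr 1.0
```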


<div style="font-size: 30px; padding:10px">

<h2>1.2.3 Merging the data</h2>

<h3> Demographics and JSON Data </h3>

<p>
In this section, <code>demographics_df</code> and <code>json_df</code> are merged on <code>id</code>. <code>json_df</code> also contains a <code>session</code> column, but a single participant can have multiple sessions, and the participant is identified by the <code>id</code>. Therefore, the data frames are merged on <code>id</code> so that all sessions belonging to the same patient share the same demographic information. 

Lastly, the <code>session</code> column is edited to remove the <code>id</code> and the scan type (MR), which are added as a prefix and separated by <code>_</code>.
</p>

<h3> Merging with the CDR Values </h3>

<p>Next, this new data frame is merged with the CDR values, again on <code>id</code>, so that every session of a patient carries that patient's MMSE and CDR values.</p>

</div>

In [9]:
# merge JSON data and Demographics data on 'id'
features_df = pd.merge(json_df, demographics_df, on='id')

# merge the new dataframe with the CDR dataframe on 'id'
features_df = pd.merge(features_df, cdr_df, on=['id'])

# update the session column to remove extra information
features_df['session'] = features_df['session'].apply(lambda x: x.split('_')[2])

# get an overview of the merged dataset
features_df.head()

Unnamed: 0,id,session,age,gender,education,socio_economic_statis,race_code,hand,mmse,cdr
0,OAS30574,d1917,72.1644,2,18.0,1.0,5,R,30.0,0.0
1,OAS30574,d0918,72.1644,2,18.0,1.0,5,R,30.0,0.0
2,OAS30574,d0918,72.1644,2,18.0,1.0,5,R,30.0,0.0
3,OAS30505,d5746,67.8986,1,16.0,1.0,5,R,30.0,2.0
4,OAS30505,d4682,67.8986,1,16.0,1.0,5,R,30.0,2.0
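
<div style="font-size: 30px; padding:10px">
<p>
Because <code>pd.merge</code> performs an inner join by default, sessions whose <code>id</code> has no match on the other side are dropped without a warning. The sketch below, on toy data with the hypothetical names <code>sessions</code>, <code>subjects</code>, and <code>merged</code>, shows this behaviour together with the <code>validate</code> argument, which checks that each <code>id</code> occurs only once on the subject side.
</p>
</div>

```python
import pandas as pd

# many sessions per subject (left) and one row per subject (right); toy values
sessions = pd.DataFrame({
    "id": ["S1", "S1", "S2", "S3"],
    "session": ["S1_MR_d0010", "S1_MR_d0400", "S2_MR_d0020", "S3_MR_d0030"],
})
subjects = pd.DataFrame({
    "id": ["S1", "S2"],
    "age": [65.2, 71.0],
})

# validate="many_to_one" raises an error if an id is duplicated in `subjects`;
# the inner join drops S3 because it has no subject row
merged = pd.merge(sessions, subjects, on="id", how="inner", validate="many_to_one")

# keep only the day token of the session label, e.g. "S1_MR_d0010" -> "d0010"
merged["session"] = merged["session"].str.split("_").str[2]

print(merged)
```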


<div style="font-size: 30px; padding:10px">

<h2>1.2.4 Cleaning the Combined Dataset</h2>

<p>
In this section, all numeric values are converted to float. The <code>age</code> and <code>race_code</code> columns are additionally scaled to the range <code>0 to 1</code>; the remaining numeric columns keep their original ranges. 

Categorical data like <code>hand</code> is also converted to numerical data. Lastly, rows with missing values are removed so that every row is complete. 
</p>


</div>

In [10]:
# converting the categorical data to numeric values
# (any value other than 'L', including a missing value, is mapped to 1.)
features_df['hand'] = features_df['hand'].apply(lambda x: 0. if x == 'L' else 1.)

# scale race_code and age to the range 0 to 1
features_df['race_code'] = features_df['race_code'].astype(float) / 6
max_age = np.max(features_df['age'])
features_df['age'] = features_df['age'].astype(float) / max_age

# convert the remaining numeric columns to float (original ranges are kept)
for col in ['gender', 'mmse', 'education', 'socio_economic_statis']:
  features_df[col] = features_df[col].astype(float)

# remove the rows with missing values
features_df.dropna(inplace=True)

# reorder the columns
features_df = features_df[['id', 'session', 'cdr', 'age', 'gender', 'education', 'socio_economic_statis', 'race_code', 'hand', 'mmse']]

# get an overview of the updated features dataframe
features_df.head()

Unnamed: 0,id,session,cdr,age,gender,education,socio_economic_statis,race_code,hand,mmse
0,OAS30574,d1917,0.0,0.754598,2.0,18.0,1.0,0.833333,1.0,30.0
1,OAS30574,d0918,0.0,0.754598,2.0,18.0,1.0,0.833333,1.0,30.0
2,OAS30574,d0918,0.0,0.754598,2.0,18.0,1.0,0.833333,1.0,30.0
3,OAS30505,d5746,2.0,0.709992,1.0,16.0,1.0,0.833333,1.0,30.0
4,OAS30505,d4682,2.0,0.709992,1.0,16.0,1.0,0.833333,1.0,30.0
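
<div style="font-size: 30px; padding:10px">
<p>
If every feature is wanted in the range <code>0 to 1</code>, min-max scaling is one possible option for the columns that keep their raw ranges. The sketch below is an illustration on toy values, not the method used in this notebook; <code>toy_features</code> and <code>numeric_cols</code> are hypothetical names. It also maps <code>hand</code> through an explicit dictionary, so that unexpected values become missing instead of being treated as right-handed.
</p>
</div>

```python
import pandas as pd

# toy feature table; the values are illustrative only
toy_features = pd.DataFrame({
    "education": [12.0, 18.0, 16.0],
    "mmse": [30.0, 24.0, 27.0],
    "hand": ["R", "L", "R"],
})

# explicit mapping: values that are not in the dictionary become NaN
toy_features["hand"] = toy_features["hand"].map({"L": 0.0, "R": 1.0})

# min-max scaling: (x - min) / (max - min), computed per column
# (a constant column would divide by zero and needs separate handling)
numeric_cols = ["education", "mmse"]
col_min = toy_features[numeric_cols].min()
col_max = toy_features[numeric_cols].max()
toy_features[numeric_cols] = (toy_features[numeric_cols] - col_min) / (col_max - col_min)

print(toy_features)
```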


In [11]:
# check the shape of the dataframe
features_df.shape

(4014, 10)

<div style="font-size: 30px; padding:10px">

<h2>1.2.5 Checking the split between classes </h2>

<p>
The level of Alzheimer's for each patient is defined by the CDR value. A value of <code>0</code> indicates no dementia, whereas values greater than <code>0</code> indicate increasing levels of severity. 
</p>

<p>
The split between these classes is not equal. Therefore, instead of dividing the data into multiple classes with different levels of Alzheimer's, the data is divided into two classes: Alzheimer's and no Alzheimer's.
</p>


</div>

In [12]:
# checking the split between classes
features_df.groupby('cdr').count()

Unnamed: 0_level_0,id,session,age,gender,education,socio_economic_statis,race_code,hand,mmse
cdr,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
0.0,2319,2319,2319,2319,2319,2319,2319,2319,2319
0.5,896,896,896,896,896,896,896,896,896
1.0,401,401,401,401,401,401,401,401,401
2.0,348,348,348,348,348,348,348,348,348
3.0,50,50,50,50,50,50,50,50,50


<div style="font-size: 30px; padding:10px">

<h2>1.2.6 Creating a more balanced dataset </h2>

<p>
As seen above, the data is not split equally between the classes. While deep learning models can be trained on imbalanced data, for the purposes of this research the classes are merged to make the dataset more balanced. A CDR of 0 is the class with no Alzheimer's, and anything greater than 0 is set to 1 to denote the presence of Alzheimer's.
</p>

</div>

In [13]:
# combining the classes to get a more balanced dataset

# map a CDR of 0 to class 0 and any other CDR value to class 1
features_df['cdr'] = features_df['cdr'].apply(lambda x: 0 if x == 0 else 1)

# get an overview of the updated dataset
features_df.groupby('cdr').count()

Unnamed: 0_level_0,id,session,age,gender,education,socio_economic_statis,race_code,hand,mmse
cdr,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
0,2319,2319,2319,2319,2319,2319,2319,2319,2319
1,1695,1695,1695,1695,1695,1695,1695,1695,1695
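
<div style="font-size: 30px; padding:10px">
<p>
The binary counts can be checked against the per-level counts reported in section 1.2.5. The sketch below collapses the five CDR levels into two classes and computes the share of each class; <code>cdr_counts</code>, <code>binary_counts</code>, and <code>binary_share</code> are hypothetical names, and the numbers are copied from the outputs above.
</p>
</div>

```python
import pandas as pd

# row counts per CDR level, as reported in section 1.2.5
cdr_counts = pd.Series({0.0: 2319, 0.5: 896, 1.0: 401, 2.0: 348, 3.0: 50})

# collapse every non-zero CDR level into class 1
is_positive = (cdr_counts.index > 0).astype(int)
binary_counts = cdr_counts.groupby(is_positive).sum()

# share of each class in the merged dataset
binary_share = binary_counts / binary_counts.sum()

print(binary_counts)
print(binary_share.round(3))
```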


In [14]:
features_df.shape

(4014, 10)

<div style="font-size: 30px; padding:10px">

<h2>1.2.7 Writing the dataframe to a CSV file </h2>
</div>


In [15]:
# save as csv
features_df.to_csv('../data/clean_data/features_data.csv', index=False)

<div style="font-size: 30px; padding:10px">
<h1>1.3 Cleaning Image Data</h1>

<h2> 1.3.1 Understanding the Image Data </h2>

<p> The image data retrieved is in <code>NIFTI</code> format. Images are extracted from this format using <code>conversion-script.sh</code> in the <code>scripts</code> folder. The extraction produces a folder of images for each scan, containing the individual slices of the <code>MRI</code> scan. </p>

<p> To get the slice providing the most information, a deep learning model is built to sort and pick the slice with the most information. </p>

<p> A small dataset is created manually using the extracted images and a model is built. </p>

<p> Next, selected images from a single folder are passed to the model and inference is run. The image with the highest predicted probability is stored in a separate folder to be used later when building the final dataset. </p>


<h2> 1.3.2 Using a Deep Learning Model to Filter Images </h2>

<p>The first step for cleaning the image data is to build a deep learning model that helps sort images. </p>

<p>The model built is a simple Convolutional Neural Network for image classification. The data used in this model was sorted manually, with around 100 images in each of the two classes. </p>
</div>

In [16]:
# imports
import tensorflow as tf
from tensorflow.keras import layers, utils
from matplotlib import pyplot as plt
import os, shutil, pathlib
import keras

<div style="font-size: 30px; padding:10px">
<h3>Importing the Data and Splitting into Train and Validation Set</h3>
</div>

In [37]:
train = utils.image_dataset_from_directory(
  "../data/image_with_classess",
  validation_split=0.25,
  subset="training",
  seed=1337,
  image_size=(128, 128),
  batch_size=8,
  label_mode="binary"
)

val = utils.image_dataset_from_directory(
  "../data/image_with_classess",
  validation_split=0.25,
  subset="validation",
  seed=1337,
  image_size=(128, 128),
  batch_size=8,
  label_mode="binary"
)

Found 203 files belonging to 2 classes.
Using 153 files for training.
Found 203 files belonging to 2 classes.
Using 50 files for validation.
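
<div style="font-size: 30px; padding:10px">
<p>
The reported split sizes are consistent with the validation count being the truncated product of the number of files and <code>validation_split</code>. The sketch below reproduces the arithmetic under that assumption; <code>split_counts</code> is a hypothetical helper, not a Keras function.
</p>
</div>

```python
from typing import Tuple


def split_counts(n_files: int, validation_split: float) -> Tuple[int, int]:
    """Return (train, validation) sizes, assuming the validation count
    is n_files * validation_split rounded down."""
    n_val = int(n_files * validation_split)
    return n_files - n_val, n_val


# 203 files with a 0.25 validation split
print(split_counts(203, 0.25))  # (153, 50)
```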


<div style="font-size: 30px; padding:10px">
<h3>Building the model</h3>

<p> The highest validation accuracy on this model is <code>96%</code>. </p>
</div>

In [40]:
cleanup_model = tf.keras.Sequential([
  layers.Input(shape=(128, 128, 3)),
  layers.Rescaling(1./255),
  layers.Conv2D(8, 3, activation="relu"),
  layers.MaxPooling2D(2),
  layers.Conv2D(16, 3, activation="relu"),
  layers.MaxPooling2D(2),
  layers.Conv2D(32, 3, activation="relu"),
  layers.MaxPooling2D(2),
  layers.Conv2D(64, 3, activation="relu"),
  layers.MaxPooling2D(2),
  layers.Conv2D(128, 3, activation="relu"),
  layers.GlobalAveragePooling2D(),
  layers.Dropout(0.7),
  layers.Dense(1, activation='sigmoid')
])

cleanup_model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['accuracy'])

cleanup_model.summary()

Model: "sequential_8"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 rescaling_8 (Rescaling)     (None, 128, 128, 3)       0         
                                                                 
 conv2d_40 (Conv2D)          (None, 126, 126, 8)       224       
                                                                 
 max_pooling2d_33 (MaxPoolin  (None, 63, 63, 8)        0         
 g2D)                                                            
                                                                 
 conv2d_41 (Conv2D)          (None, 61, 61, 16)        1168      
                                                                 
 max_pooling2d_34 (MaxPoolin  (None, 30, 30, 16)       0         
 g2D)                                                            
                                                                 
 conv2d_42 (Conv2D)          (None, 28, 28, 32)       
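
<div style="font-size: 30px; padding:10px">
<p>
The output shapes and parameter counts in the summary can be derived by hand. A <code>Conv2D</code> layer with a 3x3 kernel, stride 1, and no padding reduces each spatial dimension by 2 and has <code>(3 * 3 * input_channels + 1) * filters</code> parameters; a 2x2 max-pooling layer halves each spatial dimension, rounding down. The sketch below applies these rules to the layer stack defined above; <code>conv2d_valid</code> and <code>max_pool</code> are hypothetical helper names.
</p>
</div>

```python
def conv2d_valid(size, in_channels, filters, kernel=3):
    """Output size and parameter count of a stride-1 Conv2D without padding."""
    out_size = size - kernel + 1
    params = (kernel * kernel * in_channels + 1) * filters  # +1 for the bias
    return out_size, params


def max_pool(size, pool=2):
    """Output size of a max-pooling layer (rounded down)."""
    return size // pool


size, channels = 128, 3
filter_counts = [8, 16, 32, 64, 128]
rows = []
for i, filters in enumerate(filter_counts):
    size, params = conv2d_valid(size, channels, filters)
    rows.append((f"conv {i + 1}", size, filters, params))
    channels = filters
    # every convolution except the last one is followed by max pooling
    if i < len(filter_counts) - 1:
        size = max_pool(size)

for name, out_size, filters, params in rows:
    print(f"{name}: ({out_size}, {out_size}, {filters}) params={params}")
```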

<div style="font-size: 30px; padding:10px">
<h3>Training The Model</h3>
</div>

In [41]:
history = cleanup_model.fit(
  train.cache(),
  epochs=15,
  validation_data=val.cache()
)

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


<div style="font-size: 30px; padding:10px">
<h3>Running the Model to Filter and Save Images</h3>

<p> The next step is to run the model on every folder and store the image with the highest predicted probability in a separate folder.</p>

<p>The middle 10 images were taken from each folder and inference was run on these. After this, the image with the highest probability was chosen and copied to the output folder under the name <code>id_session.jpg</code>. </p>
</div>

In [42]:
# imports
import time

# create variables for main folders
input_dir = '../data/raw_files'
output_dir = '../data/clean_data/selected_images/'

shutil.rmtree(output_dir, ignore_errors=True)
os.makedirs(output_dir, exist_ok=True)

# get a list of all the folders in the input directory
list_of_dirs = [x for x in os.listdir(input_dir) if x not in ['.DS_Store']]
# length of the list of directories
dirs_len = len(list_of_dirs)

# initialise a variable to store the number of folders processed
processed = 0

# iterate through the list of folders
for img_dir in list_of_dirs:
  # increase the processed counter
  processed += 1

  # get the path to the folder
  f = f"{input_dir}/{img_dir}"

  # get list of files in the folder
  list_of_images = [x for x in os.listdir(f) if x not in ['.DS_Store']]
  # initialise an array to store the tensors for selected images
  image_tensors = []
  # get the middle index of the list of images
  middle = int(len(list_of_images) / 2)

  # if length of the list of images is less than 10, skip the folder
  if len(list_of_images) < 10:
    print(f"{img_dir} has {len(list_of_images)} images")
    continue

  # get the middle 10 images from the list of images
  middle_images = np.sort(list_of_images)[middle - 5:middle + 5]

  # iterate through the middle images
  for img in middle_images:
    # get the path to the image
    folder_path = f'{f}/{img}'
    # create a tensorflow image object from the image
    img = tf.keras.preprocessing.image.load_img(folder_path, target_size=(128, 128))
    img = tf.keras.preprocessing.image.img_to_array(img)
    # add the image to the array
    image_tensors.append(img)

  # run the predictions on the images
  preds = cleanup_model.predict(np.array(image_tensors))
  # get the index of the highest prediction
  preds = [p[0] for p in preds]
  index = np.argmax(preds)
  # zip the middle images with the predictions
  preds = list(zip(middle_images, preds))

  # print the processed percentage
  print(f"Processed: {int(processed / dirs_len * 100)}%", end="\r")

  # get the image and pred corresponding to the highest prediction
  (img, pred) = preds[index] 

  # get the subject name and the session number
  sub_name = img_dir.split("_")[0].split("-")[1]
  ses_name = img_dir.split("_")[1].split("-")[1]  

  # get the path to the source image
  source = pathlib.Path(f"{f}/{img}").expanduser()
  # get the path to the destination image
  destination = pathlib.Path(f'{output_dir}{sub_name}_{ses_name}.jpg').expanduser()
  # copy the image to the destination
  shutil.copy(source, destination)

Processed: 0%
Processed: 100%
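
<div style="font-size: 30px; padding:10px">
<p>
The selection logic of the loop above (sort the file names, take the middle 10, keep the one with the highest score) can be isolated in a small function. In the sketch below, <code>pick_best_middle_slice</code> and <code>fake_score</code> are hypothetical names; <code>fake_score</code> only stands in for the model prediction. The file names are assumed to be zero-padded, since a lexicographic sort would otherwise place <code>10</code> before <code>2</code>.
</p>
</div>

```python
import numpy as np


def pick_best_middle_slice(filenames, score_fn, window=10):
    """Return (filename, score) of the best-scoring slice among the
    `window` slices around the middle of the sorted list, or None if
    there are fewer than `window` files. `window` is assumed to be even."""
    if len(filenames) < window:
        return None
    ordered = sorted(filenames)
    middle = len(ordered) // 2
    half = window // 2
    candidates = ordered[middle - half:middle + half]
    scores = np.array([score_fn(name) for name in candidates])
    best = int(np.argmax(scores))
    return candidates[best], float(scores[best])


def fake_score(name):
    """Hypothetical scorer: prefers slices close to index 12."""
    index = int(name.split("_")[1].split(".")[0])
    return 1.0 - abs(index - 12) / 24


names = [f"slice_{i:03d}.jpg" for i in range(24)]
best_name, best_score = pick_best_middle_slice(names, fake_score)
print(best_name, best_score)
```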

<div style="font-size: 30px; padding:10px">
<h1>1.4 Building the Final Dataset</h1>

<h2>1.4.1 Matching features data to images</h2>

<p>The first step to build the dataset is to match the images with the features. A simple function is written to iterate through the features CSV, get the <code>id</code> and the <code>session</code> value from there, match them and store them in an array alongside the label. </p>
</div>

In [46]:
def match_images_and_features(features_dir, images_path):

  # initialise arrays to store the features, images and labels
  global_images = []
  global_features = []
  global_labels = []

  # open the features file
  with open(features_dir) as features_data:
    # store all the lines from the file in a list
    feature_rows = features_data.read().splitlines()

    
    # iterate through the list of rows
    for i in range(1, len(feature_rows)):
      # separate the row based on the delimiter to get the columns
      feature_rows[i] = feature_rows[i].split(',')
      # get the participant id and session number
      participant_id = feature_rows[i][0]
      session = feature_rows[i][1]
      # get the label
      label = float(feature_rows[i][2])
      # get the features
      feature_rows[i] = feature_rows[i][3:]
      # convert the features values to floats
      features = [float(x) for x in feature_rows[i]]
      # convert the features to a numpy array
      features = np.array(features)

      # get the path for the corresponding image in the images folder
      image_path = f'{images_path}/{participant_id}_{session}.jpg'
      # create a path object to the image
      image_path = pathlib.Path(image_path)
      image_path = image_path.expanduser()

      # if the image exists, add the features and label to the global arrays
      if image_path.exists():
        # get the image tensor object
        img = tf.keras.preprocessing.image.load_img(image_path,
          target_size=(128, 128), color_mode='grayscale')
        img = tf.keras.preprocessing.image.img_to_array(img)
        # add the image to the global array
        global_images.append(img)
        # add the features to the global array
        global_features.append(features)
        # add the label to the global array
        global_labels.append(label)
  
  print('DATA READY')

  # return the global arrays
  return global_images, global_features, global_labels
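
<div style="font-size: 30px; padding:10px">
<p>
The row parsing in the function above can also be delegated to the standard <code>csv</code> module, which reads the header and handles the splitting. The sketch below only shows the matching step and leaves out image loading; <code>match_rows_to_images</code> is a hypothetical name, and the temporary files exist purely to demonstrate the behaviour.
</p>
</div>

```python
import csv
import pathlib
import tempfile


def match_rows_to_images(features_csv, images_dir):
    """Yield (image_path, features, label) for every CSV row whose
    '<id>_<session>.jpg' file exists in images_dir."""
    images_dir = pathlib.Path(images_dir).expanduser()
    with open(features_csv, newline="") as handle:
        for row in csv.DictReader(handle):
            image_path = images_dir / f"{row['id']}_{row['session']}.jpg"
            if not image_path.exists():
                continue
            label = float(row["cdr"])
            # every column except the identifiers and the label is a feature
            features = [float(value) for key, value in row.items()
                        if key not in ("id", "session", "cdr")]
            yield image_path, features, label


# demonstration on temporary files: only S1 has a matching image
with tempfile.TemporaryDirectory() as tmp:
    tmp = pathlib.Path(tmp)
    (tmp / "S1_d0010.jpg").touch()
    csv_path = tmp / "features.csv"
    csv_path.write_text(
        "id,session,cdr,age,mmse\n"
        "S1,d0010,1,0.7,24.0\n"
        "S2,d0020,0,0.6,30.0\n"
    )
    matched = [(path.name, feats, label)
               for path, feats, label in match_rows_to_images(csv_path, tmp)]

print(matched)
```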

<div style="font-size: 30px; padding:10px">

<h2>1.4.2 Function to Create TensorFlow dataset from Input and Labels</h2>

<p>Next, a function is created to use the inputs and labels from the function above to create a TF dataset. </p>
</div>


In [47]:
def save_dataset_using_input(inputs, buffer, labels, output_dir):

  # check if the output path exists
  if os.path.exists(output_dir):
    # delete the output path if it exists
    shutil.rmtree(output_dir, ignore_errors=True)

  # create the output path
  os.makedirs(output_dir, exist_ok=True)

  # create a tf.data.Dataset object from the input data
  input_slices = tf.data.Dataset.from_tensor_slices(inputs)
  label_slices = tf.data.Dataset.from_tensor_slices(labels)
  data_s = tf.data.Dataset.zip((input_slices, label_slices))

  # shuffle the data
  dataset = data_s.shuffle(buffer_size=buffer)

  # create a tf.data.Dataset object from the input data and store to the output path
  tf.data.experimental.save(dataset, output_dir, compression='GZIP')
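
<div style="font-size: 30px; padding:10px">
<p>
The important property of the shuffle step is that each image, its features, and its label stay together while the order of the examples changes. The sketch below illustrates this property with plain NumPy and a shared permutation; it is an analogue for illustration, not TensorFlow code, and all array names and values are hypothetical.
</p>
</div>

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# toy stand-ins: example i has image value i, feature 10 * i and label i % 2
images = np.arange(5, dtype="float32").reshape(5, 1, 1, 1)
features = np.arange(5, dtype="float64").reshape(5, 1) * 10.0
labels = np.arange(5) % 2

# one shared permutation reorders all three arrays in the same way
order = rng.permutation(len(labels))
shuffled_inputs = {"images": images[order], "features": features[order]}
shuffled_labels = labels[order]

# every shuffled example still carries its own features and label
for img, feat, label in zip(shuffled_inputs["images"],
                            shuffled_inputs["features"],
                            shuffled_labels):
    print(int(img[0, 0, 0]), feat[0], label)
```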

<div style="font-size: 30px; padding:10px">

<h2>1.4.3 Create and Store the Final Dataset</h2>

<p>In the last section, the functions above are used to get the inputs and store the TF dataset to file for further use.  </p>
</div>

In [49]:
# get the inputs
global_images, global_features, global_labels = match_images_and_features("../data/clean_data/features_data.csv", "../data/clean_data/selected_images")

# save the dataset to the output path
save_dataset_using_input({"images": global_images, "features": global_features}, len(global_labels), global_labels, '../data/tfdataset/final_dataset.tfrecords.gz')

# print the length to compare with the features CSV length
print(len(global_features))

DATA READY
4014


<div style="font-size: 30px; padding:10px">
<h1>1.5 References</h1>

<p>[1] Pamela J. LaMontagne, Tammie LS. Benzinger, John C. Morris, Sarah Keefe, Russ Hornbeck, Chengjie Xiong, Elizabeth Grant, Jason Hassenstab, Krista Moulder, Andrei G. Vlassenko, Marcus E. Raichle, Carlos Cruchaga, and Daniel Marcus. 2019. OASIS-3: Longitudinal Neuroimaging, Clinical, and Cognitive Dataset for Normal Aging and Alzheimer Disease. Radiology and Imaging. DOI:https://doi.org/10.1101/2019.12.13.19014902 </p>

<br>

<p> [2] Daniel S. Marcus, Tracy H. Wang, Jamie Parker, John G. Csernansky, John C. Morris, and Randy L. Buckner. 2007. Open Access Series of Imaging Studies (OASIS): Cross-sectional MRI Data in Young, Middle Aged, Nondemented, and Demented Older Adults. Journal of Cognitive Neuroscience 19, 9 (September 2007), 1498–1507. DOI:https://doi.org/10.1162/jocn.2007.19.9.1498 </p>

</div>

----
<div style="text-align: center; font-size: 30px; font-weight: bold; padding:20px">End of File</div>

----