# 0 - Data Preparation

<p> 
Before creating this notebook and writing the code, I went through our current dataset and compiled all of the image folders (each named after a person) into one folder. Inside that folder, I renamed each image folder to the measured glucose value of the corresponding person. The result is a single folder whose subfolders are named by glucose value, each containing the images recorded at that value.
</p>
<p>
I also removed many "bad" images from the dataset; these were images that had been captured incorrectly. In addition, many of the images from the second image capture were renamed to random numbers so that the folders could be merged into the single folder of subdirectories described above.
</p>

# 1 - Importing Prerequisites

In [101]:
#Importing Python Libraries
import os
import glob
import numpy as np
import pandas as pd
from pathlib import Path
import tensorflow as tf
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# 2 - Creating Dataset

In [102]:
#Initializing Print Settings for Dataframes
pd.set_option('display.max_columns', None)
pd.set_option('display.expand_frame_repr', False)

In [103]:
#Getting the Directory of this Notebook for Later Use
directory = os.path.join(os.getcwd(), 'data_second')
print(directory)

X:\Machine Learning\Glucose Estimation\data_second


In [104]:
#Creating Series for Image-Filepaths and Glucose Values

#Creating list with all image filepaths (one subfolder level below the dataset directory).
files = glob.glob(os.path.join(directory, '*', '*'))

#Normalizing path separators to forward slashes.
files = [f.replace('\\', '/') for f in files]

#Reading each glucose value from the name of the image's parent folder.
#The folder name is used instead of a fixed character offset so the code works from any location.
values = [int(Path(f).parent.name) for f in files]

#Converting lists into pandas Series for creating a DataFrame
files = pd.Series(files, name='Filepath')
values = pd.Series(values, name='Glucose')
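The label extraction above relies on each image sitting directly inside a folder named after its glucose value. A minimal, standalone sketch of that idea follows; the helper name `glucose_from_path` is hypothetical and not part of the notebook, and the example paths simply mirror the layout shown in the outputs.

```python
from pathlib import PurePosixPath


def glucose_from_path(filepath):
    """Return the glucose label encoded in the parent folder name of an image path."""
    # Normalize Windows separators so the same logic works on any platform.
    normalized = filepath.replace("\\", "/")
    # The folder that directly contains the image is named after the glucose value.
    return int(PurePosixPath(normalized).parent.name)


example = "X:\\Machine Learning\\Glucose Estimation\\data_second\\100\\image0 (2).jpg"
label = glucose_from_path(example)
print(label)
```

Because the label comes from the folder name rather than a character position, the same function works if the dataset is moved to a different drive or directory.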

In [105]:
#Combining the Series into a Dataframe
images = pd.concat([files, values], axis=1)
images

Unnamed: 0,Filepath,Glucose
0,X:/Machine Learning/Glucose Estimation/data_second/100/image0 (2).jpg,100
1,X:/Machine Learning/Glucose Estimation/data_second/100/image0 (3).jpg,100
2,X:/Machine Learning/Glucose Estimation/data_second/100/image0.jpg,100
3,X:/Machine Learning/Glucose Estimation/data_second/100/image1 (2).jpg,100
4,X:/Machine Learning/Glucose Estimation/data_second/100/image1 (3).jpg,100
...,...,...
1151,X:/Machine Learning/Glucose Estimation/data_second/99/image5.jpg,99
1152,X:/Machine Learning/Glucose Estimation/data_second/99/image6.jpg,99
1153,X:/Machine Learning/Glucose Estimation/data_second/99/image7.jpg,99
1154,X:/Machine Learning/Glucose Estimation/data_second/99/image8.jpg,99


# 3 - Data Processing

In [106]:
#Shuffling the Dataset

#Setting Random State for Replication and Resetting Indices for Ordering
ds = images.sample(frac=1.0, random_state=7).reset_index(drop=True)
ds

Unnamed: 0,Filepath,Glucose
0,X:/Machine Learning/Glucose Estimation/data_second/84/image5.jpg,84
1,X:/Machine Learning/Glucose Estimation/data_second/101/524356.jpg,101
2,X:/Machine Learning/Glucose Estimation/data_second/95/image2.jpg,95
3,X:/Machine Learning/Glucose Estimation/data_second/85/image13 (2).jpg,85
4,X:/Machine Learning/Glucose Estimation/data_second/84/image13 (2).jpg,84
...,...,...
1151,X:/Machine Learning/Glucose Estimation/data_second/91/image13 (2).jpg,91
1152,X:/Machine Learning/Glucose Estimation/data_second/110/image12.jpg,110
1153,X:/Machine Learning/Glucose Estimation/data_second/140/image6.jpg,140
1154,X:/Machine Learning/Glucose Estimation/data_second/147/342.jpg,147


In [107]:
#Splitting the Dataset

#Holding out 25% of the data for testing (a relatively large share because the dataset is small), then resetting indices again.
train, test = train_test_split(ds, train_size=0.75, random_state = 7)
train = train.reset_index(drop=True)
test = test.reset_index(drop=True)
train

Unnamed: 0,Filepath,Glucose
0,X:/Machine Learning/Glucose Estimation/data_second/112/image9.jpg,112
1,X:/Machine Learning/Glucose Estimation/data_second/123/image10.jpg,123
2,X:/Machine Learning/Glucose Estimation/data_second/95/image7 (2).jpg,95
3,X:/Machine Learning/Glucose Estimation/data_second/105/image11.jpg,105
4,X:/Machine Learning/Glucose Estimation/data_second/83/2.jpg,83
...,...,...
862,X:/Machine Learning/Glucose Estimation/data_second/98/image6.jpg,98
863,X:/Machine Learning/Glucose Estimation/data_second/79/image1.jpg,79
864,X:/Machine Learning/Glucose Estimation/data_second/113/image9.jpg,113
865,X:/Machine Learning/Glucose Estimation/data_second/109/image7.jpg,109


In [108]:
#Creating Image Processors for Normalizing Image Data

#Dividing the RGB value of each pixel by 255 rescales the values from 0-255 to 0-1.
#This normalizes the inputs, much like feature scaling for numeric data.
#Inputs on a small, consistent scale generally make training more stable and efficient.

#A validation set (10% of the training data) is held out for monitoring the model during training.
train_generator = tf.keras.preprocessing.image.ImageDataGenerator(
    rescale=1./255,
    validation_split=0.1
)

#A validation set is not needed for testing.
test_generator = tf.keras.preprocessing.image.ImageDataGenerator(
    rescale=1./255
)
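To make the effect of `rescale=1./255` concrete, here is a small sketch in plain Python. The helper `rescale_pixel` and the sample colors are made up for illustration; the generators apply the same multiplication to every pixel of every image.

```python
def rescale_pixel(rgb, factor=1.0 / 255):
    """Multiply each 0-255 channel value by the factor, as rescale=1./255 does."""
    return tuple(channel * factor for channel in rgb)


black = rescale_pixel((0, 0, 0))
white = rescale_pixel((255, 255, 255))
orange = rescale_pixel((255, 128, 0))

# Every channel now lies between 0 and 1 (up to floating-point rounding).
for name, pixel in [("black", black), ("white", white), ("orange", orange)]:
    print(name, [round(channel, 3) for channel in pixel])
```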

In [109]:
#Uses the previous image generators to load the images as batches of tensors.
#The tensors are numeric arrays containing the RGB values for each pixel.
#Each image tensor has 3 dimensions: height, width, and color channels.
#The original images are 480 x 640 x 3; after resizing they are 120 x 160 x 3.


#First the dataframe and its columns are selected for creating the training data.
#Setting target_size to (120, 160), given as (height, width), shrinks the images for speed/efficiency.
#Setting class_mode to raw passes the Glucose values through as numeric targets, which is what regression needs.
#The batch size determines how many images are processed in a single iteration.
#Using 32 as the batch size keeps the memory needed per step modest.
#We also shuffle the training data so that each batch is a random sample of the data.
#We set the random seed to make the generation replicable.

#We first create the training subset for our model (the data used to train).
train_data = train_generator.flow_from_dataframe(
    dataframe=train,
    x_col='Filepath',
    y_col='Glucose',
    target_size=(120, 160),
    color_mode='rgb',
    class_mode='raw',
    batch_size=32,
    shuffle=True,
    seed=7,
    subset='training'
)

#Then we create the validation subset for our model (the data used to test performance during training).
val_data = train_generator.flow_from_dataframe(
    dataframe=train,
    x_col='Filepath',
    y_col='Glucose',
    target_size=(120, 160),
    color_mode='rgb',
    class_mode='raw',
    batch_size=32,
    shuffle=True,
    seed=7,
    subset='validation'
)

#Finally we create the testing subset for our model (the data used to test performance after training).
test_data = test_generator.flow_from_dataframe(
    dataframe=test,
    x_col='Filepath',
    y_col='Glucose',
    target_size=(120, 160),
    color_mode='rgb',
    class_mode='raw',
    batch_size=32,
    shuffle=False
)

Found 781 validated image filenames.
Found 86 validated image filenames.
Found 289 validated image filenames.
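The three counts reported above can be reproduced with a little arithmetic. The sketch below assumes the train share is rounded down and the validation share is truncated to an integer; that rounding is my assumption about the library behavior, chosen because it matches the printed counts.

```python
import math

total_images = 1156
train_fraction = 0.75        # train_size passed to train_test_split
validation_fraction = 0.1    # validation_split passed to the training generator

# 75% of the rows go to the train dataframe, the rest to the test dataframe.
train_rows = math.floor(total_images * train_fraction)
test_rows = total_images - train_rows

# The generator then reserves a share of the train dataframe for validation.
validation_rows = int(train_rows * validation_fraction)
training_rows = train_rows - validation_rows

print(training_rows, validation_rows, test_rows)  # 781 86 289
```

So the model is fit on 781 images, monitored on 86, and evaluated on 289.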


# 4 - Model Creation

In [110]:
#Creating the model for training.


#The input layer defines the shape of the tensors created by the generators: 120 x 160 x 3.

#Convolutional layers slide a 3x3 window (kernel) across the image to extract features such as edges, corners, and shapes.
#A small 3x3 window suits our small images and lets the layer pick up fine local patterns.
#At each position, the layer takes the dot product of the kernel weights and the pixels under the window.
#That single value becomes the pixel at the window's center in the output feature map.
#The window positions overlap with one another, but the window never extends outside of the image.
#Different filters use different values (weights) in their windows to find different features: edges, shapes, and other patterns.
#The number of filters grows with depth (16, 32, 64): early layers detect simple, local features, and later layers combine them into more complex ones.
#Because the window is 3x3 and must stay inside the image, a 1-pixel border is lost on each side at every convolution.

#Max Pool layers downscale the feature maps by keeping only the maximum of each 2x2 area.
#This downscaling keeps the tensors manageable as the number of filters increases.

#The Flatten layer takes all of the features extracted from the image and arranges them into a single 1-D vector.

#Dense layers are fully connected layers that learn patterns across all of the extracted features.

#The output layer then combines the outputs of the last Dense layer into a single linear value (Glucose).


inputs = tf.keras.Input(shape=(120, 160, 3))

x = tf.keras.layers.Conv2D(filters=16, kernel_size=(3, 3), activation='relu')(inputs)
x = tf.keras.layers.MaxPool2D()(x)

x = tf.keras.layers.Conv2D(filters=32, kernel_size=(3, 3), activation='relu')(x)
x = tf.keras.layers.MaxPool2D()(x)

x = tf.keras.layers.Conv2D(filters=64, kernel_size=(3, 3), activation='relu')(x)

x = tf.keras.layers.Flatten()(x)

x = tf.keras.layers.Dense(64, activation='relu')(x)
x = tf.keras.layers.Dense(64, activation='relu')(x)

outputs = tf.keras.layers.Dense(1, activation='linear')(x)


#Creates the model using the previous layers.
model = tf.keras.Model(inputs=inputs, outputs=outputs)


#Compiles the model with the Adam optimizer and uses MSE as the loss function.
#MSE (mean squared error) is the mean of the squared differences between the predicted and actual glucose values.
model.compile(
    optimizer='adam',
    loss='mse'
)

#Summarizes the features of the models.
model.summary()

Model: "model_4"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_5 (InputLayer)        [(None, 120, 160, 3)]     0         
                                                                 
 conv2d_10 (Conv2D)          (None, 118, 158, 16)      448       
                                                                 
 max_pooling2d_9 (MaxPooling  (None, 59, 79, 16)       0         
 2D)                                                             
                                                                 
 conv2d_11 (Conv2D)          (None, 57, 77, 32)        4640      
                                                                 
 max_pooling2d_10 (MaxPoolin  (None, 28, 38, 32)       0         
 g2D)                                                            
                                                                 
 conv2d_12 (Conv2D)          (None, 26, 36, 64)        1849
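The output shapes in the summary follow directly from the layer settings. As a rough check, the sketch below assumes unpadded 3x3 convolutions with stride 1 and 2x2 max pooling (the Keras defaults, as far as I know); the helper names are hypothetical.

```python
def conv_valid(size, kernel=3):
    """Output length of an unpadded convolution with stride 1."""
    return size - kernel + 1


def max_pool(size, pool=2):
    """Output length of non-overlapping pooling; a leftover row or column is dropped."""
    return size // pool


def conv_params(in_channels, filters, kernel=3):
    """Weights per filter (kernel * kernel * in_channels) plus one bias, times the filter count."""
    return (kernel * kernel * in_channels + 1) * filters


height, width = 120, 160
shapes = []
for layer in ["conv", "pool", "conv", "pool", "conv"]:
    step = conv_valid if layer == "conv" else max_pool
    height, width = step(height), step(width)
    shapes.append((layer, height, width))

for layer, h, w in shapes:
    print(layer, h, w)

print(conv_params(3, 16), conv_params(16, 32))  # 448 4640
```

The printed sizes (118 x 158, 59 x 79, 57 x 77, 28 x 38, 26 x 36) match the summary, and the two parameter counts match the first two convolutional layers.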

# 5 - Model Training

In [111]:
#Fits the model to the training data and monitors it on the validation data.


#Trains for at most 100 epochs (full passes over the training data).
#The EarlyStopping callback stops training once the validation loss has not improved for 5 consecutive epochs.
#With restore_best_weights=True, the weights from the best epoch are restored for the final model.
model.fit(
    train_data,
    validation_data=val_data,
    epochs=100,
    callbacks=[
        tf.keras.callbacks.EarlyStopping(
            monitor='val_loss',
            patience=5,
            restore_best_weights=True
        )
    ]
)

Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100


<keras.callbacks.History at 0x26ac5857340>
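The stopping rule can be sketched in a few lines of plain Python. This is a simplified model of the patience logic, not the Keras implementation itself; the function name `simulate_early_stopping` and the validation losses are made up for illustration.

```python
def simulate_early_stopping(val_losses, patience=5):
    """Return (epoch training stops at, epoch with the best validation loss)."""
    best_loss = float("inf")
    best_epoch = None
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            # New best: remember it and reset the patience counter.
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    # Patience never ran out, so training runs through the last epoch.
    return len(val_losses), best_epoch


# Hypothetical validation losses: improvement stalls after epoch 3.
losses = [500, 480, 470, 471, 472, 473, 474, 475, 400]
stopped_at, best_epoch = simulate_early_stopping(losses)
print(stopped_at, best_epoch)  # 8 3
```

Under this logic, a run that the callback ends after epoch 15 would have had its best validation loss at epoch 10, and those are the weights `restore_best_weights=True` would keep.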

# 6 - Results

In [112]:
#Runs the model on the testing data.
#Squeezes the (n, 1) output array into a 1-D array of predictions.
predicted_glucose = np.squeeze(model.predict(test_data))
true_glucose = test_data.labels



In [113]:
#Showing the different values that our model predicted compared to their actual counterparts.
#The predictions cluster between roughly 100 and 105 regardless of the true value.
#This suggests the model has learned little from the images and predicts close to a constant (underfitting).
print(predicted_glucose)
print(true_glucose)

[101.46773  101.16337  101.268196 102.08155  102.28543  101.821175
 101.77609  104.64228  102.03384  101.81301  103.49935  102.60635
 101.60746  101.6601   102.238045 104.44096  102.04885  101.368416
 100.89205  102.068756 101.744316 101.53761  103.02194  101.415085
 101.49498  103.84246  102.21619  101.63508  101.6478   101.000534
 108.735054 101.71859  101.35136  101.639565 102.111244 101.97823
 101.46116  102.27172  101.78421  101.47797  101.79011  101.76888
 102.05133  101.66015  102.120094 108.01523  106.0877   101.65765
 101.72177  101.81748  101.2004   101.21171  101.41065  101.29216
 101.23504  102.04482  110.19067  101.32485  101.505295 101.61008
 102.37527  100.14978  101.179886 101.81262  101.75602  102.10489
 101.69028  101.36074  101.71201  101.52713  100.81574  101.77793
 102.43168  101.396675 101.52246  101.29345  101.10218  101.540085
 104.981865 101.884346 101.78011  102.01992  102.047134 104.381035
 101.64436  101.06092  103.33722  101.484505 100.50694  101.63083
 102

In [115]:
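#Finds the RMSE by taking the square root of the test MSE returned by evaluate.
#RMSE penalizes large errors more heavily, so it is at least as large as the mean absolute error.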
rmse = np.sqrt(model.evaluate(test_data, verbose=0))
print("Test RMSE: {:.5f}".format(rmse))
print("On Average We Are {:.2f} Off When Predicting Glucose".format(rmse))

Test RMSE: 22.25247
On Average We Are 22.25 Off When Predicting Glucose
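RMSE is often read as "the average error", but it is not the same as the mean absolute error (MAE). The sketch below compares the two on made-up values (not taken from this notebook) where every prediction is the same number, similar to the near-constant predictions above.

```python
import math


def rmse(y_true, y_pred):
    """Root of the mean squared difference between true and predicted values."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean of the absolute differences between true and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical glucose values and a model that always predicts 101.
true_values = [80, 100, 102, 150]
predicted = [101, 101, 101, 101]

rmse_value = rmse(true_values, predicted)
mae_value = mae(true_values, predicted)
print("RMSE: {:.2f}".format(rmse_value))
print("MAE:  {:.2f}".format(mae_value))
```

Here the MAE is 18.00 while the RMSE is about 26.66, because the single large miss on the 150 reading dominates the squared errors. Reporting both gives a fuller picture of how far off the predictions typically are.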
