<img src="https://drive.google.com/uc?export=view&id=1FNyRcLn_ZqLm263UvW1gEDW1_QaTr9hG" alt="MLDS header" width="100%">

#**Data Loading and Preprocessing**
---
In this notebook, we load, preprocess, and save the images used in the applied project (MLDS5).

In [1]:
import os
import random

import gdown
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf

plt.style.use("ggplot")

# Fix the seed of every RNG in use so that runs are reproducible
random.seed(0)
np.random.seed(0)
tf.random.set_seed(0)

##**1. Image loading and characterization**
---

In [5]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


File paths:

In [None]:
train_path = "/content/gdrive/MyDrive/Colab Notebooks/MLDS - UNAL/Deep Learning (MLDS 5)/Applied Project/Data/WBC_Data/Train"
test_A_path = "/content/gdrive/MyDrive/Colab Notebooks/MLDS - UNAL/Deep Learning (MLDS 5)/Applied Project/Data/WBC_Data/Test-A"
test_B_path = "/content/gdrive/MyDrive/Colab Notebooks/MLDS - UNAL/Deep Learning (MLDS 5)/Applied Project/Data/WBC_Data/Test-B"
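
The three paths above share a long common prefix. A minimal sketch of deriving them from a single base directory, assuming the folder layout shown above; `DATA_ROOT`, `SPLITS`, and `split_paths` are hypothetical names introduced only for this illustration:

```python
import os

# Hypothetical name: the common prefix of the three paths defined above
DATA_ROOT = ("/content/gdrive/MyDrive/Colab Notebooks/MLDS - UNAL/"
             "Deep Learning (MLDS 5)/Applied Project/Data/WBC_Data")

# Sub-folder names as they appear in the paths above
SPLITS = ("Train", "Test-A", "Test-B")

# Map each split name to its full path
split_paths = {split: os.path.join(DATA_ROOT, split) for split in SPLITS}

train_path = split_paths["Train"]
test_A_path = split_paths["Test-A"]
test_B_path = split_paths["Test-B"]

for split, path in split_paths.items():
    # os.path.isdir reports whether the folder is reachable from this runtime
    print(f"{split}: {path} (exists: {os.path.isdir(path)})")
```

Keeping the prefix in one place means that moving the data folder only requires changing `DATA_ROOT`.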

**Load datasets as NumPy arrays:**

**- Train dataset**

In [None]:
#/content/gdrive/MyDrive/Colab Notebooks/MLDS - UNAL/Deep Learning (MLDS 5)/Applied Project/Data/X_train_prep.npy
train_images = []
labels = []

class_names = ["Basophil", "Eosinophil", "Monocyte", "Lymphocyte", "Neutrophil"]

for i, class_name in enumerate(class_names):
    temp_path = os.path.join(train_path, class_name)

    if not os.path.exists(temp_path):
        print(f"Directory not found: {temp_path}")
        continue

    for im_path in os.listdir(temp_path):
        full_im_path = os.path.join(temp_path, im_path)
        if not os.path.isfile(full_im_path):
            print(f"Skipping non-file entry: {full_im_path}")
            continue

        # target_size is (height, width); load_img returns an RGB image by default
        train_images.append(np.array(tf.keras.preprocessing.image.load_img(full_im_path,
                                                                           target_size=(224, 224))))
        labels.append(i)


X_train = np.array(train_images)
y_train = np.array(labels)

#save the created arrays:
np.save('X_train.npy', X_train)
np.save('y_train.npy', y_train)


#X_train_tf = tf.convert_to_tensor(X_train)
#y_train_tf = tf.convert_to_tensor(y_train)


In [None]:
display(X_train.shape)
display(y_train.shape)

(10175, 224, 224, 3)

(10175,)
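
The shapes above confirm how many images were loaded, but not how they are distributed across the five classes. A minimal sketch of such a check, assuming integer labels in the same order as `class_names`; `class_counts` is a hypothetical helper and `y_demo` is a toy stand-in, not the real `y_train`:

```python
import numpy as np

class_names = ["Basophil", "Eosinophil", "Monocyte", "Lymphocyte", "Neutrophil"]

def class_counts(labels, names):
    """Return a {class name: number of images} dict for integer labels."""
    counts = np.bincount(labels, minlength=len(names))
    return {name: int(n) for name, n in zip(names, counts)}

# Toy stand-in for y_train (NOT the real labels), used only to make the sketch runnable
y_demo = np.array([0, 0, 1, 2, 2, 2, 4])

counts = class_counts(y_demo, class_names)
for name, n in counts.items():
    # Print the absolute count and the share of the dataset for each class
    print(f"{name:<12}{n:>6}{n / len(y_demo):>8.1%}")
```

Calling `class_counts(y_train, class_names)` on the real labels would show whether the classes are imbalanced, which matters when choosing metrics later.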

In [None]:
test_A_images = []
labels_A = []

class_names_A = ["Basophil", "Eosinophil", "Monocyte", "Lymphocyte", "Neutrophil"]

for i, class_name in enumerate(class_names_A):
    temp_path_A = os.path.join(test_A_path, class_name)

    if not os.path.exists(temp_path_A):
        print(f"Directory not found: {temp_path_A}")
        continue

    for im_path in os.listdir(temp_path_A):
        full_im_path_A = os.path.join(temp_path_A, im_path)
        if not os.path.isfile(full_im_path_A):
            print(f"Skipping non-file entry: {full_im_path_A}")
            continue

        # target_size is (height, width); load_img returns an RGB image by default
        test_A_images.append(np.array(tf.keras.preprocessing.image.load_img(full_im_path_A,
                                                                           target_size=(224, 224))))
        labels_A.append(i)

X_test_A = np.array(test_A_images)
y_test_A = np.array(labels_A)

#save the created arrays:
np.save('X_test_A.npy', X_test_A)
np.save('y_test_A.npy', y_test_A)


#X_test_A = tf.convert_to_tensor(X_test_A)
#y_test_A = tf.convert_to_tensor(y_test_A)

In [None]:
display(X_test_A.shape)
display(y_test_A.shape)

(4339, 224, 224, 3)

(4339,)
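
The training and Test-A loops differ only in the root folder. One way to avoid the duplication is to first collect file paths and labels with a single function and decode the images afterwards. A minimal sketch under that assumption; `list_labeled_files` is a hypothetical helper, and the demo uses empty placeholder files rather than real images:

```python
import os
import tempfile

def list_labeled_files(root, class_names):
    """Return (paths, labels) for the files under root/<class_name>/."""
    paths, labels = [], []
    for label, class_name in enumerate(class_names):
        class_dir = os.path.join(root, class_name)
        if not os.path.isdir(class_dir):
            print(f"Directory not found: {class_dir}")
            continue
        # sorted() makes the order reproducible; os.listdir order is arbitrary
        for file_name in sorted(os.listdir(class_dir)):
            full_path = os.path.join(class_dir, file_name)
            if os.path.isfile(full_path):
                paths.append(full_path)
                labels.append(label)
    return paths, labels

# Demo on a throwaway directory with empty placeholder files (not real images)
with tempfile.TemporaryDirectory() as demo_root:
    for class_name, n_files in [("Basophil", 2), ("Eosinophil", 1)]:
        os.makedirs(os.path.join(demo_root, class_name))
        for k in range(n_files):
            open(os.path.join(demo_root, class_name, f"img_{k}.jpg"), "w").close()

    demo_paths, demo_labels = list_labeled_files(
        demo_root, ["Basophil", "Eosinophil", "Monocyte"])

print(demo_labels)
```

With real data, the returned paths could be passed one by one to `tf.keras.preprocessing.image.load_img` exactly as in the loops above, once for `train_path` and once for `test_A_path`.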

##**2. Data Augmentation and Preprocessing**
---
We load the saved arrays, apply the Keras `preprocess_input` transformation, one-hot encode the labels, and configure data augmentation.

###**Training and Test Images**
---

We load the previously saved image arrays and labels:

In [2]:
# X_train:
X_train_id = '14rVXKfHIh794IZ4PpiLCE0VCLZFMM_wW'
output_X_train = 'X_train.npy'
gdrive_url_1 = f'https://drive.google.com/uc?id={X_train_id}'
gdown.download(gdrive_url_1, output_X_train, quiet=False)
X_train = np.load(output_X_train)

# y_train (needed below for the shape check and the one-hot encoding):
y_train_id = '1-Dlv_m8OJjILUTXXhVvkxMoUiq3E6Zng'
output_y_train = 'y_train.npy'
gdrive_url_2 = f'https://drive.google.com/uc?id={y_train_id}'
gdown.download(gdrive_url_2, output_y_train, quiet=False)
y_train = np.load(output_y_train)

Downloading...
From: https://drive.google.com/uc?id=14rVXKfHIh794IZ4PpiLCE0VCLZFMM_wW
To: /content/X_train.npy
100%|██████████| 1.53G/1.53G [00:06<00:00, 249MB/s]


In [4]:
display(X_train.shape)
display(y_train.shape)

(10175, 224, 224, 3)

(10175,)

In [4]:
# X_test_A:
X_test_A_id = '1-JVDUYPO0lJYNGCU7pkEKfG30JQ0MhhO'
output_X_test = 'X_test_A.npy'
gdrive_url_3 = f'https://drive.google.com/uc?id={X_test_A_id}'
gdown.download(gdrive_url_3, output_X_test, quiet=False)
X_test_A = np.load(output_X_test)

# y_test_A:
y_test_id = '1-KDSVDLgMi3Cj3-KhFGZiNsewwpfCzN7'
output_y_test = 'y_test_A.npy'
gdrive_url_4 = f'https://drive.google.com/uc?id={y_test_id}'
gdown.download(gdrive_url_4, output_y_test, quiet=False)
y_test_A = np.load(output_y_test)

Downloading...
From: https://drive.google.com/uc?id=1-JVDUYPO0lJYNGCU7pkEKfG30JQ0MhhO
To: /content/X_test_A.npy
100%|██████████| 653M/653M [00:09<00:00, 71.0MB/s]
Downloading...
From: https://drive.google.com/uc?id=1-KDSVDLgMi3Cj3-KhFGZiNsewwpfCzN7
To: /content/y_test_A.npy
100%|██████████| 34.8k/34.8k [00:00<00:00, 67.7MB/s]


In [5]:
display(X_test_A.shape)
display(y_test_A.shape)

(4339, 224, 224, 3)

(4339,)

We transform the datasets with the `preprocess_input` function of _MobileNetV2_, which scales pixel values from [0, 255] to [-1, 1]:

In [3]:
X_train_prep = tf.keras.applications.mobilenet_v2.preprocess_input(X_train)
#X_test_A_prep = tf.keras.applications.mobilenet_v2.preprocess_input(X_test_A)
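
To make the effect of this step concrete: to our understanding, `mobilenet_v2.preprocess_input` divides each pixel by 127.5 and subtracts 1. The sketch below reproduces that arithmetic in plain NumPy on a tiny synthetic batch and estimates the memory cost of converting the training array from `uint8` to `float32`. The function name `scale_to_unit_range` is hypothetical, and the sketch is an illustration of the scaling rather than the Keras implementation itself:

```python
import numpy as np

def scale_to_unit_range(images):
    """Map pixel values from [0, 255] to [-1, 1] as float32."""
    return images.astype(np.float32) / 127.5 - 1.0

# Tiny synthetic batch standing in for X_train (pixel values 0, 127 and 255 only)
demo = np.array([0, 127, 255], dtype=np.uint8).reshape(1, 1, 1, 3)
scaled = scale_to_unit_range(demo)
print(scaled.min(), scaled.max())

# Memory footprint of the full training array before and after the conversion,
# using the shape reported earlier: (10175, 224, 224, 3)
n_values = 10175 * 224 * 224 * 3
print(f"uint8:   {n_values * 1 / 1e9:.2f} GB")
print(f"float32: {n_values * 4 / 1e9:.2f} GB")
```

The fourfold growth in memory is a plausible reason to preprocess one array at a time, as the commented-out test line suggests.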

Save `X_train_prep` as a NumPy array (uncomment the `X_test_A_prep` lines in this cell and the previous one to do the same for the test set):

In [4]:
# Save X_train_prep and X_testA_prep to numpy files
np.save('X_train_prep.npy', X_train_prep)
#np.save('X_test_A_prep.npy', X_test_A_prep)

In [8]:
# Directory on the local machine where the file will be saved
save_directory = r'C:\Users\User\Documents\Francisco\MLDS_UNAL\Módulo 5_Deep_Learning\Applied Project\Data'

# Save X_train_prep to a NumPy file in that directory
np.save(os.path.join(save_directory, 'X_train_prep.npy'), X_train_prep)


We encode the labels using a _one-hot representation_:

In [7]:
Y_train = tf.keras.utils.to_categorical(y_train)
Y_test_A = tf.keras.utils.to_categorical(y_test_A)
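
As an illustration of what the encoding produces, the following NumPy sketch builds the same kind of matrix by indexing an identity matrix. `one_hot` is a hypothetical helper and `y_demo` contains toy labels, not the real `y_train`:

```python
import numpy as np

def one_hot(labels, num_classes):
    """Return a (len(labels), num_classes) float32 matrix with a single 1 per row."""
    return np.eye(num_classes, dtype=np.float32)[labels]

y_demo = np.array([0, 3, 1, 4])   # toy labels, not the real y_train
Y_demo = one_hot(y_demo, num_classes=5)
print(Y_demo)

# argmax undoes the encoding and recovers the integer labels
recovered = Y_demo.argmax(axis=1)
print(recovered)
```

When `num_classes` is omitted, `to_categorical` infers it from the largest label present, so passing `num_classes=5` explicitly is a safer choice if a subset might lack the last class.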

Create a **generator** object with the `ImageDataGenerator` class:

In [8]:
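# No `rescale` here: X_train_prep was already scaled by preprocess_input,
# so dividing by 255 again would shrink the pixel values to almost zero
train_datagen = tf.keras.preprocessing.image.ImageDataGenerator(width_shift_range = 0.05,
                                                                  height_shift_range = 0.05,
                                                                  horizontal_flip = True,
                                                                  vertical_flip = True,
                                                                  validation_split = 0.2)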



We now define the generators for the training and validation subsets of `X_train_prep`:

**Note:** We do not apply the `ImageDataGenerator` to the test datasets, as we want to evaluate the model on the original images.

In [9]:
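# Both generators come from train_datagen: validation_split is defined on that object,
# and `subset` selects the training (80%) or validation (20%) portion of the same arrays
train_gen = train_datagen.flow(x = X_train_prep,
                               y = Y_train,
                               batch_size = 32,
                               subset = 'training')

val_gen = train_datagen.flow(x = X_train_prep,
                             y = Y_train,
                             batch_size = 32,
                             subset = 'validation')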
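
Because the images were read folder by folder, the arrays are ordered by class. As far as we can tell, `flow` with `validation_split` slices the arrays by position and takes the validation subset from the start, so an unshuffled input could leave the validation set with very few classes. The sketch below illustrates the problem and a possible remedy on toy labels; `y_demo` and the other names are hypothetical stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy labels sorted by class, mimicking the order in which the folders were read
y_demo = np.repeat(np.arange(5), 20)          # 100 labels: 20 per class
split_idx = int(len(y_demo) * 0.2)            # same fraction as validation_split

# Without shuffling, the first 20% of the labels contains a single class
val_classes_sorted = np.unique(y_demo[:split_idx])

# One shared permutation would keep images and labels aligned
perm = rng.permutation(len(y_demo))
y_shuffled = y_demo[perm]
val_classes_shuffled = np.unique(y_shuffled[:split_idx])

print("validation classes, sorted input:  ", val_classes_sorted)
print("validation classes, shuffled input:", val_classes_shuffled)
```

Applied to the real data, the same permutation would index both arrays before the generators are created, for example `X_train_prep[perm]` and `Y_train[perm]` with `perm = rng.permutation(len(X_train_prep))`.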

Save the two generator objects for use when training the model:

In [None]:
import pickle

# Save the train_gen and val_gen objects to files
with open('train_generator.pkl', 'wb') as train_gen_file:
  pickle.dump(train_gen, train_gen_file)

with open('validation_generator.pkl', 'wb') as val_gen_file:
  pickle.dump(val_gen, val_gen_file)
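
Pickling a generator stores a full copy of the underlying arrays inside the pickle, and in the Keras versions we are aware of the iterator also holds a thread lock, which `pickle` may refuse to serialize. A possible alternative, sketched below under those assumptions, is to save the arrays and the augmentation settings as plain data and rebuild the generators in the training notebook. The file names and the small random arrays are hypothetical stand-ins for the real data:

```python
import json
import os
import tempfile

import numpy as np

# Shift, flip and split options of the generator defined above, kept as plain data
augmentation_config = {
    "width_shift_range": 0.05,
    "height_shift_range": 0.05,
    "horizontal_flip": True,
    "vertical_flip": True,
    "validation_split": 0.2,
}

# Small random arrays standing in for X_train_prep and Y_train
X_demo = np.random.default_rng(0).random((4, 8, 8, 3), dtype=np.float32)
Y_demo = np.eye(5, dtype=np.float32)[[0, 1, 2, 3]]

with tempfile.TemporaryDirectory() as out_dir:
    data_file = os.path.join(out_dir, "train_data.npz")      # hypothetical file names
    config_file = os.path.join(out_dir, "augmentation.json")

    # Save the arrays and the settings side by side
    np.savez(data_file, X=X_demo, Y=Y_demo)
    with open(config_file, "w") as f:
        json.dump(augmentation_config, f)

    # At training time: reload both files
    with np.load(data_file) as data:
        X_loaded, Y_loaded = data["X"], data["Y"]
    with open(config_file) as f:
        config_loaded = json.load(f)

print(X_loaded.shape, Y_loaded.shape, config_loaded)
```

In the training notebook, the reloaded settings could be passed as `ImageDataGenerator(**config_loaded)` and the arrays to `flow`, which recreates both generators without serializing them.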