<a href="https://colab.research.google.com/github/lawrenceN/ASPBaseApp/blob/master/Copy_of_01_Classical_Image_Data_Preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Preprocessing and Visualization Techniques

This notebook shows how to read images from a local file system and preprocess them into NumPy arrays.

This notebook also shares a quick way to visualize the train / validation / test class distribution, along with checks that the data stays intact when it is converted from one format to another for a machine learning pipeline.

## Mounting Data from Google Drive

If you are running this notebook on Colab and would like to mount data from Google Drive, run the cells below and check that you can view the contents of the `Data` folder.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
!ls './drive/My Drive/Data/'

In [None]:
!ls './drive/My Drive/Data/cars/train/'

## Download Images 

Download the images from the official [Stanford Cars dataset project page](https://ai.stanford.edu/~jkrause/cars/car_dataset.html). 

## Import Statements

These are the libraries needed to run this notebook. If any of them is missing, install it by creating a new cell and running the following command:
```!pip install <library-name>```
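
To find out which libraries need installing before running the imports, a small check along the following lines may help. This is only a sketch: the helper name `find_missing_packages` and the `REQUIRED_MODULES` mapping are hypothetical and not part of the original notebook. Note that the `cv2` module is provided by the `opencv-python` package.

```python
from importlib.util import find_spec

# Modules imported in this notebook, mapped to the pip package that provides them.
# (Hypothetical mapping for illustration; extend it if you add more imports.)
REQUIRED_MODULES = {
    "cv2": "opencv-python",
    "tqdm": "tqdm",
    "matplotlib": "matplotlib",
    "numpy": "numpy",
}

def find_missing_packages(required=REQUIRED_MODULES):
    '''Return the pip package names whose modules cannot be found'''
    return [package for module, package in required.items()
            if find_spec(module) is None]

# An empty list means every library is already available
print(find_missing_packages())
```

Each name in the printed list can then be passed to `!pip install`.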

In [None]:
from collections import Counter
from cv2 import imread, resize, cvtColor, COLOR_BGR2RGB
from glob import glob
from random import randint
from tqdm import tqdm


import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

In [None]:
root_folder = "./drive/My Drive/Data"
class_names = ["swift", "wagonr"]
dataset_name = "cars"
train_folder = "train"
val_folder = "validation"
test_folder = "test"

In [None]:
train_list = []
for class_name in class_names:
    for file_name in glob(f"{root_folder}/{dataset_name}/{train_folder}/{class_name}/*.jpg"):
        train_list.append(file_name)

print(len(train_list))

In [None]:
val_list = []
for class_name in class_names:
    for file_name in glob(f"{root_folder}/{dataset_name}/{val_folder}/{class_name}/*.jpg"):
        val_list.append(file_name)

print(len(val_list))

In [None]:
test_list = []
for class_name in class_names:
    for file_name in glob(f"{root_folder}/{dataset_name}/{test_folder}/{class_name}/*.jpg"):
        test_list.append(file_name)

print(len(test_list))
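
The three cells above repeat the same loop for each split. As an optional sketch, the pattern could be wrapped in a single helper. The name `collect_image_paths` and its `extension` parameter are assumptions made for illustration, not part of the original notebook. Sorting the matches is an added choice that keeps the file order reproducible between runs, since `glob` does not guarantee any ordering.

```python
from glob import glob

def collect_image_paths(split_dir, class_names, extension="jpg"):
    '''Return the image paths found under split_dir/<class_name>/, sorted per class'''
    paths = []
    for class_name in class_names:
        # Same pattern as the loops above: <split_dir>/<class_name>/*.jpg
        pattern = f"{split_dir}/{class_name}/*.{extension}"
        paths.extend(sorted(glob(pattern)))
    return paths
```

With the variables defined earlier, the training list would then be built as `collect_image_paths(f"{root_folder}/{dataset_name}/{train_folder}", class_names)`.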

## Helper Methods

In [None]:
# Defining some constants
new_w, new_h = 100, 100
n_channels = 3

In [None]:
def read_and_process_image(file_path, show_details=False):
  '''Reads the image at file_path, resizes it, converts it to RGB and scales it to [0, 1]'''
  # Read image using OpenCV (imread returns None if the file cannot be read)
  img = imread(file_path)
  if img is None:
    raise FileNotFoundError(f"Could not read image: {file_path}")

  if show_details: print(f"Shape: {img.shape}")  # Print only when needed

  # Resize the image to a constant width and height
  img = resize(img, (new_w, new_h))
  # OpenCV loads images as BGR; convert to RGB for matplotlib
  img = cvtColor(img, COLOR_BGR2RGB)
  # Normalize pixel values from [0, 255] to [0, 1]
  img = img / 255

  if show_details: print(f"Resized shape: {img.shape}") # Print only when needed

  return img

In [None]:
# Call the method to process the image 
img = read_and_process_image(train_list[10], show_details=True)

# Plots image inline due to the `matplotlib inline` command above
plt.imshow(img)

In [None]:
def show_images(images_list):
    '''Method for debugging and visualization of images'''
    n: int = len(images_list)
    f = plt.figure(figsize=(15, 15))
    columns = 4
    rows = 4
    for i in range(columns*rows):
        image_path = images_list[randint(0, n - 1)]
        fol_name = image_path.split("/")[-2]
        # Debug, plot figure
        ax = f.add_subplot(rows, columns, i + 1)
        ax.set_title(fol_name)
        img = read_and_process_image(image_path)
        plt.axis('off')
        plt.imshow(img)

    plt.show(block=True)

In [None]:
show_images(train_list)

## Visualization of Class Distribution

It is important to check the class distribution of the train, validation and test data. This reveals any class imbalance in the data before a classifier is trained.
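
Alongside a bar chart, the imbalance can also be expressed as per-class proportions. The sketch below is a minimal illustration: the function name `class_proportions` and the example paths are hypothetical, and it assumes, as the rest of this notebook does, that the parent folder of each image is its class name.

```python
from collections import Counter

def class_proportions(image_path_list):
    '''Return {class_name: fraction of images}, using the parent folder as the class'''
    counts = Counter(path.split("/")[-2] for path in image_path_list)
    total = sum(counts.values())
    return {name: count / total for name, count in counts.items()}

# Made-up paths for illustration only
example_paths = [
    "cars/train/swift/a.jpg",
    "cars/train/swift/b.jpg",
    "cars/train/swift/c.jpg",
    "cars/train/wagonr/d.jpg",
]
print(class_proportions(example_paths))  # {'swift': 0.75, 'wagonr': 0.25}
```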

In [None]:
def visualize_classes(image_path_list):
    image_class_list = []
    for image_path in image_path_list:
        fol_name = image_path.split("/")[-2]
        image_class_list.append(fol_name)

    image_class_counter = Counter(image_class_list)
    plt.bar(image_class_counter.keys(), image_class_counter.values())

    return image_class_list

In [None]:
# Train data 
train_class_list = visualize_classes(train_list)

In [None]:
# Validation data 
val_class_list = visualize_classes(val_list)

In [None]:
# Test data 
test_class_list = visualize_classes(test_list)

## Convert Data to NumPy

In [None]:
def prepare_numpy_image_array(image_path_list):
  '''Prepare a NumPy version of the image data for further usage in training'''
  # OpenCV images are laid out as (height, width, channels)
  images_numpy = np.zeros((len(image_path_list), new_h, new_w, n_channels))
  
  for idx in tqdm(range(len(image_path_list))):
    img = read_and_process_image(image_path_list[idx])
    images_numpy[idx, :, :, :] = img

  return images_numpy

In [None]:
train_images_numpy = prepare_numpy_image_array(train_list)
print(f"Train images shape: {train_images_numpy.shape}")

In [None]:
val_images_numpy = prepare_numpy_image_array(val_list)
print(f"Validation images shape: {val_images_numpy.shape}")

In [None]:
test_images_numpy = prepare_numpy_image_array(test_list)
print(f"Test images shape: {test_images_numpy.shape}")

## Validate NumPy and Original Data

This step checks that the images stored in the NumPy array match the result of applying the same transformations to the original files. We look at several example indices to make sure the data has been preserved.
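
Besides comparing plots by eye, the same check can be automated. The following is a sketch under stated assumptions: `check_random_indices` is a hypothetical helper, and the stand-in loader and data exist only so that the block runs on its own. In this notebook, `load_fn` would be `read_and_process_image`.

```python
from random import sample

import numpy as np

def check_random_indices(images_numpy, image_path_list, load_fn, n_checks=5):
    '''Re-load a few random images and compare them with the stored array rows'''
    n_checks = min(n_checks, len(image_path_list))
    for idx in sample(range(len(image_path_list)), n_checks):
        expected = load_fn(image_path_list[idx])
        if not np.allclose(images_numpy[idx], expected):
            return False
    return True

# Stand-in data: each fake "path" maps to a constant-valued image
fake_paths = ["0.jpg", "1.jpg", "2.jpg"]

def fake_loader(path):
    value = int(path.split(".")[0]) / 10
    return np.full((4, 4, 3), value)

fake_array = np.stack([fake_loader(p) for p in fake_paths])
print(check_random_indices(fake_array, fake_paths, fake_loader))  # True
```

For the test split, the call would be `check_random_indices(test_images_numpy, test_list, read_and_process_image)`.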

In [None]:
print(np.amin(test_images_numpy), np.amax(test_images_numpy))

In [None]:
plt.imshow(test_images_numpy[100, :, :, :])

In [None]:
# Call the method to process the image 
img = read_and_process_image(test_list[100], show_details=True)

# Plots image inline due to the `matplotlib inline` command above
plt.imshow(img)

## Save to Pickle 

Pickling saves the preprocessed arrays and class lists to local disk in their current state, so the preprocessing steps do not have to be repeated.

In [None]:
import pickle as pkl

pickle_path = f"{root_folder}/{dataset_name}/train_val_test_numpy.pkl"
# Use a context manager so the file is closed once the dump finishes
with open(pickle_path, "wb") as pickle_file:
    pkl.dump([train_images_numpy, train_class_list,
              val_images_numpy, val_class_list,
              test_images_numpy, test_class_list],
             pickle_file)
print(f"Saved NumPy arrays to {pickle_path}")
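
To read the data back later, the pickle is loaded and unpacked in the same order in which the objects were dumped. The round trip below is a minimal sketch that uses stand-in data and a temporary file (both assumptions for illustration) so that it runs without the dataset.

```python
import os
import pickle as pkl
import tempfile

import numpy as np

# Stand-in data so the sketch runs on its own
images = np.random.rand(2, 100, 100, 3)
labels = ["swift", "wagonr"]

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.pkl")
    with open(demo_path, "wb") as pickle_file:
        pkl.dump([images, labels], pickle_file)
    # Unpack in the same order in which the objects were dumped
    with open(demo_path, "rb") as pickle_file:
        loaded_images, loaded_labels = pkl.load(pickle_file)

# Both comparisons should hold if the round trip preserved the data
print(np.array_equal(images, loaded_images), labels == loaded_labels)
```

For the file written above, the load would unpack six names in the order train, validation, test, each as an images array followed by its class list.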