# Preprocessing Pipeline

This is the pipeline I modified for the second version of my model. I wanted to work with smaller images, keep the expected image shape in which each pixel is represented by three color values, and remove unnecessary noise from my images. Building on my work in the ```exploration.ipynb``` notebook, I found that the features of these retinal scans were most visible after converting the images to grayscale, increasing contrast with the ```equalizeHist``` function, and suppressing noise by dividing each image by a Gaussian-blurred copy of itself, which evens out the background and separates the features of interest from it. I also increased contrast further with an adaptive thresholding function to make the veins in the image more prominent.

After wrapping these image modifications in a function, I applied it to every image in the dataset and saved the output so that I could reuse it if a later part of the pipeline broke. From there, I did an 80-20 train-test split of the images to follow the spec, splitting each class separately so that the training and testing sets contain each class in the same proportion. I then saved the split data as NumPy ```.npy``` files to use for modeling.

## From the Spec:
- Referenceable retinopathy folders (Class 1): '03 Moderate NPDR', '04 Severe NPDR', '05 PDR', '06 Mild NPDR, with DME', '07 Moderate NPDR, with DME', '08 Severe NPDR, with DME', '09 PDR, with DME'
- Non-referenceable folders (Class 2): '01 No DR', '02 Mild NPDR'
- Ignore folders: '00 5-Field Images', '10 Ungradable'
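
The folder-to-class mapping in the spec above can be sketched as a small lookup on the two-digit folder prefix. This is a minimal illustration, not the notebook's own code: the function name `classify_folder` and the return values `"ref"`, `"nonref"`, and `None` (for ignored folders) are assumptions chosen for the example.

```python
# Hypothetical helper: maps a dataset folder name to its class using the
# two-digit prefixes listed in the spec.
REF_PREFIXES = {"03", "04", "05", "06", "07", "08", "09"}  # Class 1
NONREF_PREFIXES = {"01", "02"}                             # Class 2
IGNORED_PREFIXES = {"00", "10"}                            # skipped entirely


def classify_folder(folder_name):
    """Return "ref", "nonref", or None (ignored) for a folder name."""
    prefix = folder_name[:2]
    if prefix in REF_PREFIXES:
        return "ref"
    if prefix in NONREF_PREFIXES:
        return "nonref"
    if prefix in IGNORED_PREFIXES:
        return None
    # Folders that match no known prefix fail loudly instead of being
    # silently assigned to a class.
    raise ValueError(f"unexpected folder name: {folder_name!r}")


print(classify_folder("03 Moderate NPDR"))  # ref
print(classify_folder("02 Mild NPDR"))      # nonref
print(classify_folder("10 Ungradable"))     # None
```

Listing the referenceable prefixes explicitly means a stray folder in the data directory raises an error rather than ending up in a class by default.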

In [1]:
# imports
import os
from pathlib import Path
import cv2 #opencv-python
from matplotlib import pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split #scikit-learn

## Function Definitions

In [2]:
# img_reshaper(orig_img: 2d image array): given an array of shape (x,y) representing an image
    # where each pixel is represented by a single value, produces an array of shape (x,y,3)
    # where each pixel is represented by a triple of identical values
def img_reshaper(orig_img):
    # stack three copies of the single channel along a new last axis;
    # same values as a per-pixel [item, item, item] loop, but vectorized
    # and returned as a numpy array instead of nested lists
    return np.stack([orig_img, orig_img, orig_img], axis=-1)

In [3]:
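# Preprocess(img: string): given filepath img to an image, returns the preprocessed image
    # with three identical values per pixel, shape (150,150,3)
    # raises ValueError if img cannot be read as an image
def Preprocess(img):
    # cv2.imread returns None (it does not raise) when the file is missing or unreadable
    i = cv2.imread(img)
    if i is None:
        raise ValueError(f"input {img} is not a valid filepath.")
    # shrink, convert to grayscale, and equalize the histogram to raise contrast
    small = cv2.resize(i, (150,150))
    img_gray = cv2.cvtColor(small, cv2.COLOR_BGR2GRAY)
    img_hcontrast = cv2.equalizeHist(img_gray)
    # estimate the background with a heavy blur, then divide it out
    blur = cv2.GaussianBlur(img_hcontrast, (0,0), sigmaX=33, sigmaY=33)
    divide = cv2.divide(img_hcontrast, blur, scale=255)
    # binarize against a Gaussian-weighted 11x11 neighborhood mean, offset by 2
    th3 = cv2.adaptiveThreshold(divide, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                cv2.THRESH_BINARY, 11, 2)
    shaped = img_reshaper(th3)
    return shaped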
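
To see why dividing by a blurred copy flattens the background, here is a NumPy-only sketch of the arithmetic behind the `cv2.divide(..., scale=255)` step. It is an illustration under assumptions, not the pipeline itself: `divide_normalize` is a hypothetical stand-in that approximates OpenCV's saturating integer division, and the known `background` array plays the role that the Gaussian blur only estimates in `Preprocess`.

```python
import numpy as np


def divide_normalize(image, background, scale=255):
    """Approximate cv2.divide(image, background, scale=scale) for uint8 input.

    Hypothetical helper: computes image * scale / background, rounds,
    clips to [0, 255], and maps division by zero to 0.
    """
    image = image.astype(np.float64)
    background = background.astype(np.float64)
    ratio = np.zeros_like(image)
    nonzero = background != 0
    ratio[nonzero] = image[nonzero] * scale / background[nonzero]
    return np.clip(np.rint(ratio), 0, 255).astype(np.uint8)


# Synthetic 4x16 "scan": illumination rises from left to right (60, 70, ..., 210).
background = np.tile(np.arange(60, 220, 10, dtype=np.uint8), (4, 1))

# A dark "vessel" in column 5 that is half as bright as its surroundings.
image = background.copy()
image[:, 5] = background[:, 5] // 2

flattened = divide_normalize(image, background)

# The uneven illumination is gone: every background pixel maps to 255,
# while the vessel column stays at roughly half that value.
print(np.unique(np.delete(flattened, 5, axis=1)))  # [255]
print(np.unique(flattened[:, 5]))
```

In the real pipeline the background is not known exactly, so the result is only as flat as the blur's estimate of it.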

## Preprocessing Application

In [4]:
# initialize data
ref_imgs = []
nonref_imgs = []

# iterate through all images in dataset
data_dir = os.path.join(os.getcwd(), "../data/SAUNAR/")
img_dirs = [name for name in os.listdir(data_dir) if os.path.isdir(os.path.join(data_dir, name))]
for d in img_dirs:
    print(f"starting directory {d[:2]}")
    # process and store nonref images from the 01 and 02 folders
    if d[:2] in ("01", "02"):
        for i in os.listdir(os.path.join(data_dir, d)):
            nonref_imgs.append(Preprocess(os.path.join(data_dir, d, i)))
    # skip non-classifiable images from 00 and 10 folders
    elif d[:2] in ("00", "10"):
        pass
    # process and store ref images from all other folders
    else:
        for i in os.listdir(os.path.join(data_dir, d)):
            ref_imgs.append(Preprocess(os.path.join(data_dir, d, i)))

starting directory 06
starting directory 05
starting directory 01
starting directory 07
starting directory 03
starting directory 09
starting directory 10
starting directory 00
starting directory 02
starting directory 08
starting directory 04


In [5]:
# save processed image data to file
np.save("../data/processed/ref_imgsn.npy", ref_imgs)
np.save("../data/processed/nonref_imgsn.npy", nonref_imgs)
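
Because `np.save` converts a list of images into a single array before writing, it can help to stack the list explicitly and check the shape and dtype first. The sketch below uses random stand-in images and a temporary directory; the file name `demo_imgs.npy` and the variable names are assumptions for illustration only.

```python
import os
import tempfile

import numpy as np

# Stand-in for the list of preprocessed images: five 150x150x3 uint8 arrays.
rng = np.random.default_rng(0)
imgs = [rng.integers(0, 256, size=(150, 150, 3), dtype=np.uint8) for _ in range(5)]

# np.stack raises if any image has a different shape, so a malformed
# image is caught here rather than later during modeling.
stacked = np.stack(imgs)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_imgs.npy")  # hypothetical file name
    np.save(path, stacked)
    loaded = np.load(path)

print(loaded.shape, loaded.dtype)  # (5, 150, 150, 3) uint8
```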

In [6]:
# load processed image data for splitting
ref_imgs_load = np.load("../data/processed/ref_imgsn.npy")
nonref_imgs_load = np.load("../data/processed/nonref_imgsn.npy")

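# train test split, done per class so both sets keep the same class proportions
# labels: 0 = referenceable (Class 1 in the spec), 1 = non-referenceable (Class 2)
Xref_train, Xref_test, yref_train, yref_test = train_test_split(
    ref_imgs_load, np.zeros(len(ref_imgs_load)), test_size=.2)
Xnref_train, Xnref_test, ynref_train, ynref_test = train_test_split(
    nonref_imgs_load, np.ones(len(nonref_imgs_load)), test_size=.2)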
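
scikit-learn can also produce a class-proportional split in a single call through the `stratify` argument of `train_test_split`, which additionally shuffles the two classes together. The sketch below is an alternative under assumptions, not the method used in this notebook: the toy arrays, the 70/30 class balance, and `random_state=0` are illustrative choices.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in data: 70 samples of class 0 and 30 samples of class 1.
X = np.arange(100).reshape(100, 1)
y = np.concatenate((np.zeros(70), np.ones(30)))

# stratify=y keeps the 70/30 class ratio in both splits;
# random_state makes the split reproducible between runs.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

print(len(X_train), len(X_test))              # 80 20
print(int(y_train.sum()), int(y_test.sum()))  # 24 6
```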


In [7]:
# combine arrays for both classes to produce final split arrays
Xtrain = np.concatenate((Xref_train, Xnref_train))
Ytrain = np.concatenate((yref_train, ynref_train))
Xtest = np.concatenate((Xref_test, Xnref_test))
Ytest = np.concatenate((yref_test, ynref_test))

# save split to files for modeling
np.save("../data/test/Xtestnew.npy", Xtest)
np.save("../data/test/Ytestnew.npy", Ytest)
np.save("../data/train/Xtrainnew.npy", Xtrain)
np.save("../data/train/Ytrainnew.npy", Ytrain)
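
The combined arrays hold every image of the first class ahead of every image of the second, because `np.concatenate` preserves order. Some training setups benefit from shuffling images and labels together before use. A minimal sketch with a shared permutation follows; the toy arrays and the seed are assumptions, and whether shuffling is needed depends on how the model consumes the data.

```python
import numpy as np

# Toy stand-ins for the concatenated arrays: class 0 first, then class 1.
X = np.arange(10).reshape(10, 1)
y = np.concatenate((np.zeros(6), np.ones(4)))

# One permutation applied to both arrays keeps each image paired with its label.
rng = np.random.default_rng(42)  # fixed seed only for reproducibility
order = rng.permutation(len(X))
X_shuffled, y_shuffled = X[order], y[order]

# In this toy data, rows 0-5 belong to class 0 and rows 6-9 to class 1,
# so the pairing can be checked directly.
print(np.array_equal(y_shuffled, (X_shuffled[:, 0] >= 6).astype(float)))  # True
```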