# 4 Data Preprocessing and Modeling<a id='2_Data_wrangling'></a>

In [1]:
# imports for handling lists
import itertools
# handling warnings
import warnings
# skimage imports
from skimage import data, color, filters, morphology, graph, measure, exposure
from skimage.filters import threshold_otsu, threshold_local, try_all_threshold, sobel, gaussian
from skimage.transform import rotate, rescale, resize
from skimage.feature import canny
from skimage.io import imsave
from skimage.util import img_as_ubyte
# scipy for image
from scipy import ndimage as ndi
# import for file interaction
import os
import io
from pathlib import Path
import cv2
# imports for reading from zip files
import zipfile
from PIL import Image
# array and data frame imports
import numpy as np
import pandas as pd
# helper functions
import helpers as h
# visualization tools
from tqdm.notebook import tqdm
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

The first step will be to read the dataset where the labels and image IDs are stored and check whether the latter are unique. In this preprocessing and modeling stage, we will use a different library for interacting with the storage paths, so a couple more transformations to the tabular data will be necessary. Let's explore them.

In [7]:
# loading the dataset saved in 03_EDA
dir_path = r'C:\SPRINGBOARD\retinopathy-detection' # path to repository
labels = pd.read_csv(r'{}\data_processed\labels_sizes_aug.csv'.format(dir_path))

In [8]:
labels.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 5 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   Image name              1309 non-null   object
 1   Zip File                1309 non-null   object
 2   Image Size              1309 non-null   object
 3   label                   1309 non-null   int64 
 4   Risk of macular edema   1309 non-null   int64 
dtypes: int64(2), object(3)
memory usage: 51.3+ KB


In [9]:
labels.head()

| | Image name | Zip File | Image Size | label | Risk of macular edema |
|---|---|---|---|---|---|
| 0 | 20051019_38557_0100_PP.tif | Base11.zip | (1488, 2240, 3) | 3 | 1 |
| 1 | 20051020_43808_0100_PP.tif | Base11.zip | (1488, 2240, 3) | 0 | 0 |
| 2 | 20051020_43832_0100_PP.tif | Base11.zip | (1488, 2240, 3) | 1 | 0 |
| 3 | 20051020_43882_0100_PP.tif | Base11.zip | (1488, 2240, 3) | 2 | 0 |
| 4 | 20051020_43906_0100_PP.tif | Base11.zip | (1488, 2240, 3) | 3 | 2 |


The first transformation we will apply to this data frame is renaming the 'Image name' column to 'image_id'.

In [10]:
# renaming image name column
labels.rename(columns={'Image name': 'image_id'}, inplace=True)
labels.head()

| | image_id | Zip File | Image Size | label | Risk of macular edema |
|---|---|---|---|---|---|
| 0 | 20051019_38557_0100_PP.tif | Base11.zip | (1488, 2240, 3) | 3 | 1 |
| 1 | 20051020_43808_0100_PP.tif | Base11.zip | (1488, 2240, 3) | 0 | 0 |
| 2 | 20051020_43832_0100_PP.tif | Base11.zip | (1488, 2240, 3) | 1 | 0 |
| 3 | 20051020_43882_0100_PP.tif | Base11.zip | (1488, 2240, 3) | 2 | 0 |
| 4 | 20051020_43906_0100_PP.tif | Base11.zip | (1488, 2240, 3) | 3 | 2 |


Now, since we will specify the file extension while handling images, let's remove the '.tif' extension from the image_id values.

In [13]:
labels['image_id'] = labels['image_id'].apply(lambda x: x.split('.', 1)[0]) # removing '.tif' string from image_id
labels.head()

| | image_id | Zip File | Image Size | label | Risk of macular edema |
|---|---|---|---|---|---|
| 0 | 20051019_38557_0100_PP | Base11.zip | (1488, 2240, 3) | 3 | 1 |
| 1 | 20051020_43808_0100_PP | Base11.zip | (1488, 2240, 3) | 0 | 0 |
| 2 | 20051020_43832_0100_PP | Base11.zip | (1488, 2240, 3) | 1 | 0 |
| 3 | 20051020_43882_0100_PP | Base11.zip | (1488, 2240, 3) | 2 | 0 |
| 4 | 20051020_43906_0100_PP | Base11.zip | (1488, 2240, 3) | 3 | 2 |
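
Note that `split('.', 1)[0]` cuts at the first dot in the name, which is fine here because these IDs contain a single dot. As a hedged alternative, the sketch below uses `PurePath(...).stem`, which drops only the final suffix. The `sample` data frame is a hypothetical stand-in for `labels`, not part of the original notebook.

```python
from pathlib import PurePath

import pandas as pd

# hypothetical sample mirroring the 'image_id' column before the extension is removed
sample = pd.DataFrame({'image_id': ['20051019_38557_0100_PP.tif',
                                    '20051020_43808_0100_PP.tif']})

# PurePath(...).stem drops only the final suffix ('.tif'), whereas
# split('.', 1)[0] would cut at the first dot found anywhere in the name
sample['image_id'] = sample['image_id'].apply(lambda x: PurePath(x).stem)
print(sample['image_id'].tolist())
```

Either approach gives the same result on this dataset; the `stem` version is only safer if an ID ever contains an extra dot.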


Good. Now it is time to check whether all of our image IDs are unique in the data frame (very important for the modeling stage).

In [15]:
# checking the uniqueness of each image ID
labels['image_id'].nunique() == labels.shape[0]

True
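
Had the check returned `False`, it would be useful to see which IDs repeat before dropping anything. The following is a minimal sketch on a hypothetical `sample` data frame (the IDs and labels are made up for illustration) showing how the repeated rows could be listed and then removed.

```python
import pandas as pd

# hypothetical sample with one repeated ID
sample = pd.DataFrame({'image_id': ['a_PP', 'b_PP', 'a_PP'],
                       'label': [0, 1, 0]})

# keep=False marks every row of a duplicated ID, not only the later repeats
dupes = sample[sample['image_id'].duplicated(keep=False)]
print(dupes)

# drop_duplicates keeps the first occurrence of each ID by default
deduped = sample.drop_duplicates('image_id')
print(deduped['image_id'].is_unique)
```

`Series.is_unique` is an equivalent, shorter form of comparing `nunique()` against the number of rows.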

Good, there is no need to drop duplicates since all the IDs are unique. Still, let's run the step anyway to show how it would be done and to confirm that it leaves the data frame unchanged:

In [16]:
labels.drop_duplicates('image_id', inplace=True)

In [17]:
labels.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1309 entries, 0 to 1308
Data columns (total 5 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   image_id                1309 non-null   object
 1   Zip File                1309 non-null   object
 2   Image Size              1309 non-null   object
 3   label                   1309 non-null   int64 
 4   Risk of macular edema   1309 non-null   int64 
dtypes: int64(2), object(3)
memory usage: 61.4+ KB


As expected, no changes were made to the data frame. Let's save it to a CSV file for later use in modeling.
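
The notebook does not show the save step itself, so the sketch below illustrates one way it might look. The output file name `labels_clean.csv` and the temporary folder are assumptions made for the example; in the notebook the file would presumably go somewhere under `dir_path`.

```python
import os
import tempfile

import pandas as pd

# stand-in for the cleaned `labels` data frame
sample = pd.DataFrame({'image_id': ['20051019_38557_0100_PP'], 'label': [3]})

# hypothetical output location, used here so the example runs anywhere
out_dir = tempfile.mkdtemp()
out_path = os.path.join(out_dir, 'labels_clean.csv')

# index=False avoids writing the row index as an extra unnamed column
sample.to_csv(out_path, index=False)

# reading the file back confirms the columns survive the round trip
reloaded = pd.read_csv(out_path)
print(list(reloaded.columns))
```

Passing `index=False` matters when the file is read again later: without it, the index comes back as an extra unnamed column.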

Now it is time to load and preprocess the image data. Our goal here is to preprocess the images and then create training and validation datasets, each with folders of images divided by class:

* --> train_folder
  * --> class_0
  * --> class_1
  * --> class_2
  * --> class_3

And then a folder called:

* --> validation_folder
  * --> class_0
  * --> class_1
  * --> class_2
  * --> class_3
  
Let's see how we can achieve this:
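
Before any image is written, the folder layout listed above has to exist. A minimal sketch of creating it with `pathlib` follows; the temporary `root` folder is an assumption for the example and would be replaced by a folder under `dir_path` in practice.

```python
import tempfile
from pathlib import Path

# hypothetical root; the notebook would use a folder under dir_path instead
root = Path(tempfile.mkdtemp())

# folder and class names follow the layout listed above
for split in ('train_folder', 'validation_folder'):
    for label in range(4):
        # parents=True creates missing parents, exist_ok=True makes re-runs safe
        (root / split / 'class_{}'.format(label)).mkdir(parents=True, exist_ok=True)

# collect the created class folders relative to the root
created = sorted(str(p.relative_to(root)) for p in root.glob('*/*'))
print(len(created))
```
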

In [12]:
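sums, sums_squared = 0, 0

for c, image_id in enumerate(tqdm(labels['image_id'])):
    # image_id no longer carries the extension, so '.tif' is added back here
    img_path = r'{}\data_processed\data_original\{}.tif'.format(dir_path, image_id)
    tif = plt.imread(img_path) / 255  # scale pixel values to [0, 1]
    tif_array = cv2.resize(tif, (224, 224)).astype(np.float64)  # dsize is a (width, height) tuple
    label = labels['label'].iloc[c]
    train_or_val = 'train' if c < 1000 else 'val'  # first 1000 images for training, the rest for validation
    # raw string so that '\t' in '\train_val_data' is not read as a tab character
    save_path = r'{}\data_processed\train_val_data\{}\{}'.format(dir_path, train_or_val, str(label))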
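
The loop initializes `sums` and `sums_squared` but the cell as shown never updates them. One plausible use, and this is an assumption rather than something the notebook states, is accumulating a running mean and standard deviation of pixel intensities for later normalization. The sketch below shows that technique on randomly generated stand-in images.

```python
import numpy as np

rng = np.random.default_rng(0)
# hypothetical stand-ins for the resized images, values already scaled to [0, 1]
images = [rng.random((224, 224, 3)) for _ in range(5)]

sums, sums_squared = 0.0, 0.0
for img in images:
    # accumulate the per-image mean intensity and mean squared intensity
    sums += img.mean()
    sums_squared += (img ** 2).mean()

mean = sums / len(images)
# Var[x] = E[x^2] - E[x]^2; averaging per-image means is exact here
# because every image has the same number of pixels
std = np.sqrt(sums_squared / len(images) - mean ** 2)
print(mean, std)
```

Accumulating two scalars per image avoids holding the whole dataset in memory, which is the usual reason for computing the statistics this way.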

## 4.n Summary<a id='2.7_Summary'></a>