# 2 Data Wrangling<a id='2_Data_wrangling'></a>

## 2.1 Contents<a id='2.1_Contents'></a>
* [2 Data Wrangling](#2_Data_wrangling)
  * [2.1 Contents](#2.1_Contents)
  * [2.2 Introduction](#2.2_Introduction)
    * [2.2.1 Recap Of Retinopathy Detection Problem](#2.2.1_recap_reti)
    * [2.2.2 Retinopathy Dataset](#2.2.2_reti_dataset)
    * [2.2.3 Introduction To Notebook](#2.2.3_Introduction_To_Notebook)
  * [2.3 Imports](#2.3_Imports)
  * [2.4 Objectives](#2.4_Objectives)
  * [2.5 Load The Messidor Data](#2.5_Load_The_messidor_Data)
    * [2.5.1 Handling tabular data for labels](#2.5.1_htb_d)
    * [2.5.2 Handling image shapes for analysis](#2.5.2_his_a)
    * [2.5.3 String processing for tabular data](#2.5.3_spro_t)
    * [2.5.4 Joining image sizes and labels](#2.5.4_joims_l)
  * [2.6 Reading Image Data](#2.6_rimd_l)
    * [2.6.1 Dealing with image sizes](#2.6.1_dims_l)
    * [2.6.2 Saving data into local directory for EDA](#2.6.2_sdim_eda)
  * [2.7 Summary](#2.7_Summary)

## 2.2 Introduction<a id='2.2_Introduction'></a>

The **retina** is the layer of nerve tissue that lines the inner back two-thirds of the eye. Its central region is in charge of central vision and its outer region is in charge of peripheral vision. Together, these parts are responsible for receiving visual images, which is the first step in decoding the visual signal. We can think of it as a sensor.

<p style="text-align:center;"><img src="https://www.doctordiegoruizcasas.com/ext/r/800x592-204/el_ojo_humano.jpg" alt="Drawing" style="width: 500px;"/></p>

The term **retinopathy** is used for diseases that are localized at the retina. There are several types of retinopathy; most of them affect the small retinal blood vessels and can be diagnosed with a medical device called an ophthalmoscope.

These diseases can have many causes, including congenital conditions, hypertension, or any other pathology that affects small blood vessels. However, the most common type of retinopathy is the one caused by diabetic complications, called (unsurprisingly) **Diabetic Retinopathy**.

The disease has two subclasses, **nonproliferative** and **proliferative**. The term refers to the growth of abnormal blood vessels in the retina; the proliferative form is the more dangerous one, since it can impair the patient's vision (up to blindness in severe cases). The nonproliferative form can also progress to the proliferative one, so regular eye examinations and treatment (when necessary) are important for controlling this disease.

**Diabetes** had reached impressive numbers by 2017, [with almost 425 million people living with the disease and an estimated 629 million by 2045](https://www.nature.com/articles/s41433-019-0566-0). According to this [article](https://www.asrs.org/patients/retinal-diseases/3/diabetic-retinopathy) from the [American Society of Retina Specialists](https://www.asrs.org/), Diabetic Retinopathy affects half of the patients with diabetes and is the number one cause of irreversible blindness in working-age people, but there is substantial scientific evidence that early detection and timely treatment can prevent vision loss.

In recent years, there have been screening programs based on the **analysis of fundus photographs** by specially trained ophthalmologists (mostly remote graders). However, although expert eyes are behind the analysis, the diagnostic accuracy achieved may not be optimal, and scaling and [sustaining such programs has been found to be challenging](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5733521/). Additionally, the cost of this type of service can be quite high even for developed countries.

**Deep learning**, a machine learning (ML) technique, [has shown promising diagnostic performance in several applications](https://chemport.cas.org/cgi-bin/sdcgi?APP=ftslink&action=reflink&origin=npg&version=1.0&coi=1%3ACAS%3A528%3ADC%2BC2MXht1WlurzP&md5=0114b2ed20ab1840e1c58bd5ab4170b8) such as image recognition and computer vision tasks. These technologies have been widely adopted in many domains, including healthcare and medicine. For medical imaging analysis in general, it has [achieved robust results in various medical specialities](https://pubs.rsna.org/doi/10.1148/radiol.2017162326) such as radiology; for ophthalmology in particular, deep learning (DL) continues the tradition of [autonomous and assisted analysis of retinal photographs](https://www.nature.com/articles/s41433-019-0566-0). Such **artificial intelligence (AI) systems** have been [demonstrated to lower cost, improve diagnostic accuracy, and increase patient access to Diabetic Retinopathy screening](https://www.nature.com/articles/s41433-019-0566-0). Recent works on DL in ophthalmology showcase its potential to at least partially replace human graders while providing a similar level of accuracy. Nonetheless, these systems are currently adopted more as medical aids than as replacements.


### 2.2.1 Recap Of Retinopathy Detection Problem<a id='2.2.1_recap_reti'></a>

As the introduction shows, Diabetic Retinopathy is a global eye health issue. Given the rising prevalence of diabetes and an aging population, retinopathy screening can pose a significant challenge even in developed countries. In recent years, the scientific community has adopted artificial intelligence, in the form of machine learning and deep learning, to develop automated Diabetic Retinopathy detection algorithms. This project aims to implement one of these technologies, a Convolutional Neural Network, using the dataset and technical methodologies described in this notebook. Although many academics have published robust diagnostic performance of AI algorithms for Diabetic Retinopathy screening, future research is required to address several challenges, for example the clinical deployment model, to expedite the translation of these novel technologies into the healthcare setting (_text adapted from this [article](https://www.nature.com/articles/s41433-019-0566-0)_).

### 2.2.2 Retinopathy Dataset<a id='2.2.2_reti_dataset'></a>

The dataset consists of 1200 color eye fundus images of the posterior pole. The images and the following information were taken from the [Messidor database](https://www.adcis.net/en/third-party/messidor/).

These images were acquired by 3 ophthalmologic departments using a color video 3CCD camera mounted on a Topcon TRC NW6 non-mydriatic retinograph with a 45-degree field of view. Images were captured using 8 bits per color plane at 1440 × 960, 2240 × 1488 or 2304 × 1536 pixels.

800 images were acquired with pupil dilation (one drop of Tropicamide at 0.5%) and 400 without dilation.

The 1200 images are packaged in 3 sets, one per ophthalmologic department. Each set is divided into 4 zipped subsets containing 100 images in TIFF format and an Excel file with medical diagnoses for each image that will be used for the label identification.

**Medical diagnoses**

Two diagnoses have been provided by the medical experts for each image:

1. Retinopathy grade
2. Risk of macular edema

**Retinopathy grade**

* 0 (Normal): (μA = 0) AND (H = 0)
* 1: (0 < μA <= 5) AND (H = 0)
* 2: ((5 < μA < 15) OR (0 < H < 5)) AND (NV = 0)
* 3: (μA >= 15) OR (H >=5) OR (NV = 1)

Where:

* μA: number of microaneurysms
* H: number of hemorrhages
* NV = 1: neovascularization
* NV = 0: no neovascularization
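
The grading rules above can be sketched as a small function. This is only an illustration of the published criteria, not code from the Messidor project: the function name and arguments are hypothetical, counts are assumed to be non-negative integers, and the most severe grade is checked first so that overlapping cases (for example a few microaneurysms together with neovascularization) resolve to the higher grade.

```python
def retinopathy_grade(mu_a, h, nv):
    """Return the retinopathy grade (0-3) from lesion counts.

    mu_a: number of microaneurysms (assumed non-negative int)
    h:    number of hemorrhages (assumed non-negative int)
    nv:   True if neovascularization is present
    """
    # Grade 3: (mu_a >= 15) OR (h >= 5) OR (NV = 1)
    if mu_a >= 15 or h >= 5 or nv:
        return 3
    # Grade 2: ((5 < mu_a < 15) OR (0 < h < 5)) AND (NV = 0)
    if 5 < mu_a < 15 or 0 < h < 5:
        return 2
    # Grade 1: (0 < mu_a <= 5) AND (h = 0); h is already 0 at this point
    if 0 < mu_a <= 5:
        return 1
    # Grade 0: no microaneurysms and no hemorrhages
    return 0


example_grades = [retinopathy_grade(0, 0, False),
                  retinopathy_grade(3, 0, False),
                  retinopathy_grade(6, 0, False),
                  retinopathy_grade(1, 0, True)]
print(example_grades)
```

In this project the grades are read directly from the Excel label files, so a function like this is only useful for understanding what each label means.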

**Risk of macular edema**

Hard exudates have been used to grade the risk of macular edema.

* 0 (No risk): No visible hard exudate
* 1: Shortest distance between macula and hard exudates > one papilla diameter
* 2: Shortest distance between macula and hard exudates <= one papilla diameter
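
The same kind of sketch works for the macular edema risk. The function below is hypothetical and assumes that the shortest macula-to-exudate distance and the papilla diameter are given in the same unit, with `None` meaning that no hard exudate is visible.

```python
def macular_edema_risk(shortest_distance, papilla_diameter):
    """Return the risk of macular edema (0-2) from hard exudate distance.

    shortest_distance: shortest distance between macula and hard exudates,
                       or None when no hard exudate is visible
    papilla_diameter:  papilla diameter, in the same unit as the distance
    """
    if shortest_distance is None:
        return 0  # no visible hard exudate
    if shortest_distance > papilla_diameter:
        return 1  # exudates farther than one papilla diameter
    return 2      # exudates within one papilla diameter


example_risks = [macular_edema_risk(None, 1.0),
                 macular_edema_risk(2.0, 1.0),
                 macular_edema_risk(0.5, 1.0)]
print(example_risks)
```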

All images contained in the database were used for making actual clinical diagnoses. To ensure the utmost protection of patient privacy, information that might allow us to identify a patient has been discarded. To minimize any further risk of breach of privacy, the use of this database is restricted to individuals or organizations that obtained the database directly from this [website](https://www.adcis.net/en/third-party/messidor/). The .zip files used here were downloaded from that site and stored locally on a Windows computer.

**Links**

Other databases with retinal images are available on the following sites:

* Stare project: Retinal color images and results of automatic location of the optic nerve.
* [Drive project](https://grand-challenge.org/api/): Retinal color images and results of automatic segmentation of blood vessels.

Insight on how to use the database and the files in it was taken from this article:

Decencière et al. Feedback on a publicly distributed database: the Messidor database. Image Analysis & Stereology, v. 33, n. 3, p. 231-234, Aug. 2014. ISSN 1854-5165. Available [here](http://www.ias-iss.org/ojs/IAS/article/view/1155) or [here](http://dx.doi.org/10.5566/ias.1155).

### 2.2.3 Introduction To Notebook<a id='2.2.3_Introduction_To_Notebook'></a>

In this notebook, we will be doing step 2 of the Data Science Method (data wrangling). This step focuses on collecting the data, importing it in a suitable way, organizing it, applying basic transformations to it and making sure it's well defined. Paying attention to these tasks will pay off greatly in later notebooks.

Some data cleaning will be done at this stage as well, so we can understand the data better and produce a new, cleaned data set as input to our third step (Exploratory Data Analysis).

At the end of this notebook, we will select some useful features from the tabular data, based on the statistical analysis made on them, and deliver a proper image database for further analysis and, eventually, for use as model input.

## 2.3 Imports<a id='2.3_Imports'></a>

In [2]:
# warnings handling
import warnings
# skimage imports
from skimage import data, color, filters, morphology, graph, measure, exposure
from skimage.filters import threshold_otsu, threshold_local, try_all_threshold, sobel, gaussian
from skimage.transform import rotate, rescale, resize
from skimage.feature import canny
from skimage.io import imsave
# scipy for image
from scipy import ndimage as ndi
# imports for file interaction
import os
import io
# imports for reading from zip files
import zipfile
from PIL import Image
# array and data frame imports
import numpy as np
import pandas as pd
# helper functions
import helpers as h
# visualization tools
import matplotlib.pyplot as plt
%matplotlib inline

## 2.4 Objectives<a id='2.4_Objectives'></a>

The following three objectives are considered as the criteria for success in this project:

* [x] Implementing a deep learning algorithm that can successfully detect at least 80% of Diabetic Retinopathy classes (grades).
* [x] Implementing a deep learning algorithm that can successfully detect at least 80% of risk of macular edema.
* [x] In addition to the classifier, implementing an automated pipeline to deploy the model in a production setting.

## 2.5 Load The Messidor Data<a id='2.5_Load_The_messidor_Data'></a>

We'll start by identifying and reading the zip files in the PATH location where the images and xls label files are stored.

In [3]:
# folder path
dir_path = r'C:\SPRINGBOARD\retinopathy-detection'
# list files and directories (os.path.join avoids the invalid '\d' escape)
file_names = os.listdir(os.path.join(dir_path, 'data'))
display(file_names)

['Base11.zip',
 'Base12.zip',
 'Base13.zip',
 'Base14.zip',
 'Base21.zip',
 'Base22.zip',
 'Base23.zip',
 'Base24.zip',
 'Base31.zip',
 'Base32.zip',
 'Base33.zip',
 'Base34.zip']

### 2.5.1 Handling tabular data for labels <a id='2.5.1_htb_d'></a>

In this section, we will concatenate all the Excel files to gather every image file name together with its labels.
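
The cell below relies on a project helper, `h.read_zip`, whose source is not shown in this notebook. As a rough idea of what such a helper may do, here is a minimal sketch that returns the raw bytes of one member of a zip archive. The name `read_zip_member` and the in-memory demo archive are assumptions made for illustration only.

```python
import io
import zipfile


def read_zip_member(zip_source, member_name):
    """Return the raw bytes of one member of a zip archive.

    zip_source can be a path or a file-like object.
    """
    with zipfile.ZipFile(zip_source) as archive:
        return archive.read(member_name)


# Small self-contained demo with an in-memory archive
demo_buffer = io.BytesIO()
with zipfile.ZipFile(demo_buffer, 'w') as archive:
    archive.writestr('Base00/labels.txt', 'hello')

demo_bytes = read_zip_member(demo_buffer, 'Base00/labels.txt')
print(demo_bytes)
```

Wrapping the returned bytes in `io.BytesIO` gives `pd.read_excel` a file-like object, so the spreadsheet never has to be extracted to disk.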

In [4]:
warnings.simplefilter(action='ignore', category=FutureWarning)
data_dir = os.path.join(dir_path, 'data')
label_frames = []  # one data frame per xls file

for file in file_names:
    # reading the zip files
    if '.zip' in file:
        zip_path = os.path.join(data_dir, file)
        # reading internal files
        with zipfile.ZipFile(zip_path) as zip_file:
            for int_file in zip_file.namelist():
                # reading xls files for labels
                if '.xls' in int_file:
                    df = pd.read_excel(io.BytesIO(h.read_zip(zip_path, int_file))) # loading excel label files
                    label_frames.append(df)
                    print('Excel File {} ready!'.format(len(label_frames)))

# concatenating all xls files (DataFrame.append was removed in pandas 2.0)
df_total = pd.concat(label_frames, ignore_index=True)
df_total = df_total.drop(columns='Ophthalmologic department')
df_total.to_csv(os.path.join(dir_path, 'data_processed', 'labels.csv'), index=False) # saving the dataframe to path

Excel File 1 ready!
Excel File 2 ready!
Excel File 3 ready!
Excel File 4 ready!
Excel File 5 ready!
Excel File 6 ready!
Excel File 7 ready!
Excel File 8 ready!
Excel File 9 ready!
Excel File 10 ready!
Excel File 11 ready!
Excel File 12 ready!


Now that we have all .xls files read and concatenated, let's explore the type of variables we have in the data frame and other relevant information.

In [5]:
df_total.head()

| | Image name | Retinopathy grade | Risk of macular edema |
|---|---|---|---|
| 0 | 20051019_38557_0100_PP.tif | 3 | 1 |
| 1 | 20051020_43808_0100_PP.tif | 0 | 0 |
| 2 | 20051020_43832_0100_PP.tif | 1 | 0 |
| 3 | 20051020_43882_0100_PP.tif | 2 | 0 |
| 4 | 20051020_43906_0100_PP.tif | 3 | 2 |


In [6]:
df_total.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1200 entries, 0 to 1199
Data columns (total 3 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   Image name              1200 non-null   object
 1   Retinopathy grade       1200 non-null   int64 
 2   Risk of macular edema   1200 non-null   int64 
dtypes: int64(2), object(1)
memory usage: 28.2+ KB


### 2.5.2 Handling image shapes for analysis <a id='2.5.2_his_a'></a>

In this section, we read the tif images from the zip files and extract the shape of each image for the resizing analysis. This is the code we used for generating the dictionary with zip file names as keys and (image sizes, file name) as values:

<p style="text-align:center;"><img src="https://raw.githubusercontent.com/pablo-git8/retinopathy-detection/main/images/getting_imge_sizes.jpg" alt="Code1" style="width: 800px;"/></p>

** _Code is presented as image snippets because the runtimes were quite long_
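
Since the original code is only available as an image, here is a minimal sketch of one way to collect image shapes straight from a zip file. It is not the author's code: the function name is hypothetical, and it maps member names to shapes rather than building the dictionary described above. It assumes Pillow is installed and reads only the image header (`img.size` and `img.getbands()`), which avoids decoding every pixel.

```python
import io
import zipfile

from PIL import Image


def image_shapes_in_zip(zip_source, extension='.tif'):
    """Map each image member to (height, width, channels)."""
    shapes = {}
    with zipfile.ZipFile(zip_source) as archive:
        for member in archive.namelist():
            if member.lower().endswith(extension):
                with archive.open(member) as handle, Image.open(handle) as img:
                    width, height = img.size  # PIL reports (width, height)
                    shapes[member] = (height, width, len(img.getbands()))
    return shapes


# Demo with a tiny synthetic TIFF stored in an in-memory archive
tiff_bytes = io.BytesIO()
Image.new('RGB', (6, 4)).save(tiff_bytes, format='TIFF')

demo_zip = io.BytesIO()
with zipfile.ZipFile(demo_zip, 'w') as archive:
    archive.writestr('Base00/demo.tif', tiff_bytes.getvalue())

demo_shapes = image_shapes_in_zip(demo_zip)
print(demo_shapes)
```

The (height, width, channels) order matches the shape that `np.asarray(Image.open(...))` would report for an RGB image.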

Then this code was used for appending the name of the zip file to each row (name of the tif file) and saving the data frame into a reusable (loadable) csv file named img_sizes.csv:

<p style="text-align:center;"><img src="https://raw.githubusercontent.com/pablo-git8/retinopathy-detection/main/images/saving_df_img_sizes.jpg" alt="Code1" style="width: 900px;"/></p>

### 2.5.3 String processing for tabular data <a id='2.5.3_spro_t'></a>

Now let's read the csv file to explore its contents and remove the 'BaseX' string from the file names to add consistency to our data frames.
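
The helper `h.remove_base_x` used below is not shown in this notebook. A plausible sketch, assuming every file name has the form `BaseXX/<image>.tif` with forward slashes (as zip member names do), simply keeps the last path component. The function body is an assumption, not the project's actual helper.

```python
import posixpath


def remove_base_x(file_name):
    """Drop the leading 'BaseXX/' folder and keep only the image file name."""
    # Zip member names always use '/' as separator, hence posixpath
    return posixpath.basename(file_name)


demo_name = remove_base_x('Base11/20051019_38557_0100_PP.tif')
print(demo_name)
```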

In [7]:
base_df = pd.read_csv('C:/SPRINGBOARD/retinopathy-detection/data_processed/img_sizes.csv')

In [8]:
base_df.head() # exploring the base_df

| | File Name | Image Size | Zip File |
|---|---|---|---|
| 0 | Base11/20051019_38557_0100_PP.tif | (1488, 2240, 3) | Base11.zip |
| 1 | Base11/20051020_43808_0100_PP.tif | (1488, 2240, 3) | Base11.zip |
| 2 | Base11/20051020_43832_0100_PP.tif | (1488, 2240, 3) | Base11.zip |
| 3 | Base11/20051020_43882_0100_PP.tif | (1488, 2240, 3) | Base11.zip |
| 4 | Base11/20051020_43906_0100_PP.tif | (1488, 2240, 3) | Base11.zip |


In [9]:
base_df['Image name'] = base_df['File Name'].apply(h.remove_base_x) # removing 'BaseX' string from File name
base_df.drop(columns=['File Name'], inplace=True) # column no longer needed for identification
base_df = h.swap_columns(base_df, 'Image name', 'Image Size') # putting image name in the first column

In [10]:
base_df.head() # exploring the base_df modified

| | Image name | Zip File | Image Size |
|---|---|---|---|
| 0 | 20051019_38557_0100_PP.tif | Base11.zip | (1488, 2240, 3) |
| 1 | 20051020_43808_0100_PP.tif | Base11.zip | (1488, 2240, 3) |
| 2 | 20051020_43832_0100_PP.tif | Base11.zip | (1488, 2240, 3) |
| 3 | 20051020_43882_0100_PP.tif | Base11.zip | (1488, 2240, 3) |
| 4 | 20051020_43906_0100_PP.tif | Base11.zip | (1488, 2240, 3) |


In [11]:
base_df.tail()

| | Image name | Zip File | Image Size |
|---|---|---|---|
| 1195 | 20051208_41318_0400_PP.tif | Base34.zip | (1536, 2304, 3) |
| 1196 | 20051208_41373_0400_PP.tif | Base34.zip | (1536, 2304, 3) |
| 1197 | 20051208_41570_0400_PP.tif | Base34.zip | (1536, 2304, 3) |
| 1198 | 20051208_41707_0400_PP.tif | Base34.zip | (1536, 2304, 3) |
| 1199 | 20051208_42314_0400_PP.tif | Base34.zip | (1536, 2304, 3) |


As a sanity check, let's explore the information of this data frame.

In [12]:
base_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1200 entries, 0 to 1199
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Image name  1200 non-null   object
 1   Zip File    1200 non-null   object
 2   Image Size  1200 non-null   object
dtypes: object(3)
memory usage: 28.2+ KB


### 2.5.4 Joining image sizes and labels <a id='2.5.4_joims_l'></a>

Good. Now that we have this information available, let's join the two data frames into one so we can have it all together for future analysis.

In [13]:
img_labels_sizes = pd.merge(base_df, df_total, how='inner', on='Image name') # merging the two data frames

Again, let's explore the information of the merged data frame and confirm that we do, in fact, have 1200 rows.

In [14]:
img_labels_sizes.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1200 entries, 0 to 1199
Data columns (total 5 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   Image name              1200 non-null   object
 1   Zip File                1200 non-null   object
 2   Image Size              1200 non-null   object
 3   Retinopathy grade       1200 non-null   int64 
 4   Risk of macular edema   1200 non-null   int64 
dtypes: int64(2), object(3)
memory usage: 56.2+ KB
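
Beyond checking the row count, the merge itself can be asked to verify that every image name appears exactly once on each side. The sketch below uses two tiny made-up data frames (not the real Messidor tables) to show the `validate` and `indicator` parameters of `pd.merge`; an outer join is used here only so that unmatched rows, if any, stay visible.

```python
import pandas as pd

# Toy stand-ins for base_df and df_total (values are made up)
sizes = pd.DataFrame({'Image name': ['a.tif', 'b.tif'],
                      'Image Size': ['(960, 1440, 3)', '(1488, 2240, 3)']})
labels = pd.DataFrame({'Image name': ['a.tif', 'b.tif'],
                       'Retinopathy grade': [0, 3]})

# validate raises MergeError if a key is duplicated on either side;
# indicator adds a '_merge' column telling where each row came from
merged = pd.merge(sizes, labels, how='outer', on='Image name',
                  validate='one_to_one', indicator=True)

# rows present in only one of the two frames
unmatched = merged[merged['_merge'] != 'both']
print(len(merged), len(unmatched))
```

An empty `unmatched` frame means every image has both a size and a label, which is what the 1200 non-null counts above suggest for the real data.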


As a final step, we will save the data to a new CSV file for easy loading in the future.

In [15]:
img_labels_sizes.to_csv(r'{}\data_processed\labels_sizes.csv'.format(dir_path), index=False) # saving the final tabular dataset

## 2.6 Reading Image Data <a id='2.6_rimd_l'></a>

Now that we have our tabular data ready, it is time to start with the images and provide an uncompressed image database for our next stage 03 (Exploratory Data Analysis).

One crucial feature that our data needs for the future modeling stages is a homogeneous image size. Therefore, this uncompressed database is going to collect the .tif images at a single size. This is going to be our benchmark image database, since it will keep the original color and RGB features (no image processing or other operations).

However, before doing so, we need to confirm how many image sizes we have so we can apply a re-shaping operation in a correct way.

### 2.6.1 Dealing with image sizes <a id='2.6.1_dims_l'></a>

Let's see and confirm how many different image sizes we have among all of our images.

In [16]:
set(img_labels_sizes['Image Size']) # unique sizes of all images in zip files

{'(1488, 2240, 3)', '(1536, 2304, 3)', '(960, 1440, 3)'}
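
Note that the sizes above are strings, because the `Image Size` column was read back from a CSV file (its dtype is `object`). If numeric tuples are needed later, one option is `ast.literal_eval`, sketched here on the three values shown above.

```python
import ast

# The three size strings as they appear in the 'Image Size' column
size_strings = {'(1488, 2240, 3)', '(1536, 2304, 3)', '(960, 1440, 3)'}

# literal_eval safely parses each string into a tuple of ints
size_tuples = sorted(ast.literal_eval(s) for s in size_strings)
print(size_tuples)
```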

In [17]:
# taking 3 images from different files with different sizes
img_nparray1 = plt.imread('20051020_43808_0100_PP.tif') # (1488, 2240, 3) Base11.zip

zip_file = zipfile.ZipFile(os.path.join(dir_path, 'data', 'Base31.zip')) # (1536, 2304, 3) Base31.zip
ifile = zip_file.open('20051116_44482_0400_PP.tif')
img_nparray2 = np.asarray(Image.open(ifile))

zip_file = zipfile.ZipFile(os.path.join(dir_path, 'data', 'Base21.zip')) # (960, 1440, 3) Base21.zip
ifile = zip_file.open('20060407_41937_0200_PP.tif')
img_nparray3 = np.asarray(Image.open(ifile))

print(img_nparray1.shape)
print(img_nparray2.shape)
print(img_nparray3.shape)

(1488, 2240, 3)
(1536, 2304, 3)
(960, 1440, 3)


We have three different image sizes:

* (1488, 2240, 3)
* (1536, 2304, 3)
* (960, 1440, 3)

So, we must come up with a reduced size that suits all three. One candidate is **(372, 560, 3)**. However, since some deep learning models require square images as input to the first layer, and since smaller images significantly reduce the training/computational time, we can resize our database to **(800, 800, 3)**.
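
To make the trade-off concrete, the short sketch below computes the width/height ratio of the three sizes and, as one possible option, the largest centered square crop for each. All three images are close to a 3:2 ratio, so resizing them directly to a square changes the aspect ratio. The cropping helper is hypothetical and is not a step of this notebook; whether to crop, pad, or simply resize is left to the EDA stage.

```python
from fractions import Fraction

# (height, width) of the three image sizes found in the dataset
sizes = [(1488, 2240), (1536, 2304), (960, 1440)]

for height, width in sizes:
    ratio = Fraction(width, height)
    print((height, width), 'width/height =', ratio, '~', round(float(ratio), 3))


def center_square_box(height, width):
    """Return (top, left, side) of the largest centered square crop."""
    side = min(height, width)
    return (height - side) // 2, (width - side) // 2, side


crop_boxes = [center_square_box(h, w) for h, w in sizes]
print(crop_boxes)
```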

This last step will be done in the 03_EDA part. For now, we will generate our image database with the original files.

### 2.6.2 Saving data into local directory for EDA <a id='2.6.2_sdim_eda'></a>

Since we have already used the same code block a couple of times, it is time to use our helper function to read each file and apply a specific operation to it. This is basically what the helper function does:

<p style="text-align:center;"><img src="https://raw.githubusercontent.com/pablo-git8/retinopathy-detection/main/images/helper_read_func.jpg" alt="Code1" style="width: 700px;"/></p>

So, let's now use this function to save all images in the data_processed/data_original directory. In this way we will be creating our benchmark database of images in our own working directory!

In [18]:
# Creating database of images in our working directory
#h.parse_images(file_names=file_names, dir_path=dir_path, op='original')

<p style="text-align:center;"><img src="https://raw.githubusercontent.com/pablo-git8/retinopathy-detection/main/images/creating_original_databse.jpg" alt="Code1" style="width: 600px;"/></p>
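
Because `h.parse_images` is only shown as an image, here is a rough standard-library sketch of what saving the untouched originals could look like. It copies each .tif member out of a zip file into one flat directory without decoding it. The function name, its parameters, and the demo archive are assumptions; the real helper also supports other operations through its `op` argument.

```python
import io
import os
import posixpath
import shutil
import tempfile
import zipfile


def extract_images_flat(zip_source, output_dir, extension='.tif'):
    """Copy every image member of a zip into output_dir, dropping inner folders."""
    os.makedirs(output_dir, exist_ok=True)
    saved = []
    with zipfile.ZipFile(zip_source) as archive:
        for member in archive.namelist():
            if member.lower().endswith(extension):
                target = os.path.join(output_dir, posixpath.basename(member))
                # stream the bytes so large images are never fully held in memory
                with archive.open(member) as src, open(target, 'wb') as dst:
                    shutil.copyfileobj(src, dst)
                saved.append(target)
    return saved


# Demo with an in-memory archive and a temporary output directory
demo_zip = io.BytesIO()
with zipfile.ZipFile(demo_zip, 'w') as archive:
    archive.writestr('Base00/demo.tif', b'placeholder bytes, not a real image')
    archive.writestr('Base00/labels.xls', b'ignored by the extension filter')

with tempfile.TemporaryDirectory() as tmp_dir:
    saved_files = extract_images_flat(demo_zip, tmp_dir)
    saved_names = [os.path.basename(path) for path in saved_files]

print(saved_names)
```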

## 2.7 Summary<a id='2.7_Summary'></a>

In this data wrangling stage, we can conclude that:

- There are 1200 images stored in RGB 3D format in 12 zip files (100 images each) 
- Each zip file includes a spreadsheet with the labels, referencing the images by their file names with the tif extension
- The images come in three different sizes: (1488, 2240, 3), (1536, 2304, 3) and (960, 1440, 3)
- The above sizes can be somewhat large for model input, so it may be necessary to resize the images
- This step was particularly useful to unpack all information, save it into the local directory and provide the necessary adjustment for future reading, exploring and accessing the data