### [Dataset citation]
##### <EyePACS dataset 2015>
- Emma Dugas, Jared, Jorge, Will Cukierski. (2015). Diabetic Retinopathy Detection. Kaggle. https://kaggle.com/competitions/diabetic-retinopathy-detection

##### <APTOS 2019>
- Karthik, Maggie, Sohier Dane. (2019). APTOS 2019 Blindness Detection. Kaggle. https://kaggle.com/competitions/aptos2019-blindness-detection

##### <INDIAN DIABETIC RETINOPATHY IMAGE DATASET (IDRID)>
- Prasanna Porwal, Samiksha Pachade, Ravi Kamble, Manesh Kokare, Girish Deshmukh, Vivek Sahasrabuddhe, Fabrice Meriaudeau, April 24, 2018, "Indian Diabetic Retinopathy Image Dataset (IDRiD)", IEEE Dataport, doi: https://dx.doi.org/10.21227/H25W98.
- https://www.kaggle.com/datasets/gami4388/diabetic-retinopathy-resized-train-15-19-dg?select=resized_train_15_19_DG

##### <Messidor 2>
- Abramoff et al, Automated analysis of retinal images for detection of referable diabetic retinopathy, JAMA Ophthalmol. 2013;131:351-7, and in Abramoff et al, Improved automated detection of diabetic retinopathy on a publicly available dataset through integration of deep learning, IOVS. 57:5200-06.
- https://medicine.uiowa.edu/eye/abramoff
- https://www.kaggle.com/datasets/mariaherrerot/messidor2preprocess/data

### [Package load]

In [1]:
import torch
import torchvision
import torchvision.transforms as transforms
import cv2
import numpy as np
import pandas as pd
from PIL import Image
import glob
import os
from tqdm.notebook import tqdm
import matplotlib.pyplot as plt
import shutil

# Needed so that matplotlib does not raise an error when run
# (allows duplicate OpenMP runtimes to be loaded in the same process)
os.environ['KMP_DUPLICATE_LIB_OK']='True'

In [2]:
original_data_dir = os.path.join('.', 'original data')
original_test_csv = pd.read_csv(os.path.join('.','test_label.csv'), index_col=0)
original_train_csv = pd.read_csv(os.path.join('.','train_label_original.csv'), index_col=0)
original_val_csv = pd.read_csv(os.path.join('.','val_label.csv'), index_col=0)

concat_csv = pd.concat([original_test_csv, original_train_csv, original_val_csv])
concat_csv

Unnamed: 0,diagnosis
20051020_45050_0100_PP,3
20051020_54209_0100_PP,3
20051020_57761_0100_PP,3
20051216_45226_0200_PP,3
20051216_47000_0200_PP,3
...,...
26067_right,0
36551_left,2
25142_right,2
5124_right,0


In [3]:
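# Pool all labels and re-split them 60/20/20 into train/val/test.
# No random_state is set, so the split differs from run to run.
train_csv = concat_csv.sample(frac=0.6, replace=False)
train_id = train_csv.index.to_list()
concat_csv = concat_csv.drop(train_id)
# Split the remaining 40% in half: validation and test
val_csv = concat_csv.sample(frac=0.5, replace=False)
val_id = val_csv.index.to_list()
test_csv = concat_csv.drop(val_id)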

print(len(train_csv))
print(len(val_csv))
print(len(test_csv))

train_csv.to_csv(os.path.join('.', 'train_label_rescale.csv'), index=True)
val_csv.to_csv(os.path.join('.', 'val_label_rescale.csv'), index=True)
test_csv.to_csv(os.path.join('.', 'test_label_rescale.csv'), index=True)

24304
8102
8101
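
The split above draws a new random sample on every run. A minimal sketch of a repeatable variant is shown here, assuming a fixed seed is acceptable; `split_labels`, the `seed` parameter, and the toy labels are hypothetical and not part of the original notebook. Like the cell above, it assumes the index holds unique image IDs, since `drop` removes every row that shares an ID.

```python
import pandas as pd

def split_labels(df, seed=0):
    """Split df 60/20/20 into train/val/test; `seed` makes the split repeatable."""
    train = df.sample(frac=0.6, replace=False, random_state=seed)
    rest = df.drop(train.index)
    val = rest.sample(frac=0.5, replace=False, random_state=seed)
    test = rest.drop(val.index)
    return train, val, test

# Toy labels standing in for concat_csv (IDs and grades are made up)
toy = pd.DataFrame({'diagnosis': [0, 1, 2, 3, 4] * 4},
                   index=[f'img_{i}' for i in range(20)])
train, val, test = split_labels(toy, seed=42)
print(len(train), len(val), len(test))
```

Calling `split_labels` twice with the same seed returns the same three index sets, which keeps the saved label files consistent with images that were already preprocessed.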


### [Image Rescale]

In [4]:
import math

def img_crop(img):
    # Pad all four sides so the canvas is as large as the image diagonal
    img_h, img_w = img.shape[0], img.shape[1]
    diag = int((img_h**2 + img_w ** 2)**0.5)    # length of the diagonal
    add_h, add_w = int((diag - img_h)/2), int((diag - img_w)/2)     # padding to add on top/bottom and left/right
    img = cv2.copyMakeBorder(img, add_h, add_h, add_w, add_w, cv2.BORDER_CONSTANT,value=0)  # add a border filled with the constant 0 (black)

    # Rotate by the arctangent angle
    img_h, img_w = img.shape[0], img.shape[1]
    degree = math.degrees(math.atan(img_h / img_w)) # angle given by arctan(height / width)
    x_center, y_center = int(img_w/2), int(img_h/2) # image center as (x, y) = (column, row)
    matrix = cv2.getRotationMatrix2D((x_center, y_center), -degree, 1)  # rotate clockwise by degree (hence -degree) about the center; scale stays 1
    img = cv2.warpAffine(img, matrix, (diag, diag))   # apply the rotation matrix to img
    
    # Crop
    img = cv2.copyMakeBorder(img, 10,10,10,10,cv2.BORDER_CONSTANT,value=[0,0,0])    # add 10 px of black on every side of the rotated image so the border is cut cleanly
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    _, gray = cv2.threshold(gray, 5, 255, cv2.THRESH_BINARY)    # brightness <= 5 becomes 0, anything brighter becomes 255, so the background is all 0
    contours,hierarchy = cv2.findContours(gray,cv2.RETR_EXTERNAL,cv2.CHAIN_APPROX_SIMPLE)
        # findContours expects a binary image, hence the grayscale + threshold preprocessing above
        # cv2.RETR_EXTERNAL: return only the outermost contours
        # cv2.CHAIN_APPROX_SIMPLE: return only the contour vertices, i.e. the outline of the outermost region
    contours = max(contours, key=cv2.contourArea)   # keep the largest contour
    x,y,w,h = cv2.boundingRect(contours)
        # smallest upright rectangle (bounding box) enclosing the contour: x, y is the top-left corner, w, h the size
    img = img[y:y+h, x:x+w]

    # Reverse rotation
    img_H, img_W = img.shape[0], img.shape[1]
    x_center, y_center = int(img_W/2), int(img_H/2)   # center as (x, y) = (column, row)
    matrix = cv2.getRotationMatrix2D((x_center, y_center), degree, 1)   # rotate back by the same angle
    img = cv2.warpAffine(img, matrix, (img_W, img_H))   # dsize is given as (width, height)
    
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    _,gray = cv2.threshold(gray, 10, 255, cv2.THRESH_BINARY)
    contours,hierarchy = cv2.findContours(gray,cv2.RETR_EXTERNAL,cv2.CHAIN_APPROX_SIMPLE)
    contours = max(contours, key=cv2.contourArea)
    x,y,w,h = cv2.boundingRect(contours)
    img = img[y:y+h, x:x+w]
    return img
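
The threshold-and-crop step in `img_crop` can be illustrated without OpenCV. The sketch below is a simplified NumPy-only stand-in, not the notebook's method: it crops to the bounding box of all pixels brighter than the threshold instead of the largest contour, and it uses a plain channel mean rather than `cv2.COLOR_BGR2GRAY`. The name `crop_to_foreground` and the synthetic disc image are hypothetical.

```python
import numpy as np

def crop_to_foreground(img, threshold=5):
    """Crop img to the bounding box of pixels brighter than `threshold`."""
    gray = img.mean(axis=2) if img.ndim == 3 else img
    mask = gray > threshold            # same rule as THRESH_BINARY: strictly above the threshold
    if not mask.any():                 # fully dark image: nothing to crop
        return img
    rows = np.flatnonzero(mask.any(axis=1))
    cols = np.flatnonzero(mask.any(axis=0))
    return img[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]

# Synthetic "fundus": a bright disc of radius 30 on a black 100 x 140 canvas
canvas = np.zeros((100, 140, 3), dtype=np.uint8)
yy, xx = np.ogrid[:100, :140]
disc = (yy - 50) ** 2 + (xx - 70) ** 2 <= 30 ** 2
canvas[disc] = 200

cropped = crop_to_foreground(canvas)
print(cropped.shape)   # (61, 61, 3)
```

Taking the largest contour, as the notebook does, is more robust when stray bright pixels exist outside the retina; the bounding box of the whole mask is only equivalent when the fundus is the sole bright region.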

In [5]:
def img_resize(img, mode, size):
    # Training images are resized to a larger square (1200 for 1024, 600 for 512);
    # validation and test images are resized to the target size itself.
    if size == 1024:
        train_size = 1200
        val_test_size = 1024
    elif size == 512:
        train_size = 600
        val_test_size = 512
    else:
        raise ValueError('size must be 1024 or 512')
    if mode == 'train':
        target_size = train_size
    elif mode in ('val', 'test'):
        target_size = val_test_size
    else:
        return img   # unknown mode: return the image unchanged, as before
    if min(img.shape[0], img.shape[1]) <= target_size:
        interpolation = cv2.INTER_CUBIC   # image is smaller than the target, so it is enlarged
    else:
        interpolation = cv2.INTER_AREA    # image is larger than the target, so it is shrunk
    img = cv2.resize(img, dsize=(target_size,target_size), interpolation=interpolation)
    return img
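
The selection rule in `img_resize` (which target size, which interpolation) can be isolated as a pure function that is easy to check without loading images. This is a minimal sketch; `pick_resize_params` is a hypothetical helper, and it returns the interpolation as a string name rather than the `cv2` constant so that it runs without OpenCV.

```python
def pick_resize_params(height, width, mode, size):
    """Return (target side length, interpolation name) for one image."""
    targets = {
        1024: {'train': 1200, 'val': 1024, 'test': 1024},
        512: {'train': 600, 'val': 512, 'test': 512},
    }
    target = targets[size][mode]   # KeyError for an unsupported size or mode
    # Enlarge with cubic interpolation, shrink with area interpolation
    interpolation = 'INTER_CUBIC' if min(height, width) <= target else 'INTER_AREA'
    return target, interpolation

print(pick_resize_params(480, 640, 'train', 512))    # (600, 'INTER_CUBIC')
print(pick_resize_params(2000, 3000, 'val', 512))    # (512, 'INTER_AREA')
```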

In [18]:
data_dir = '.'

def rescale_and_save(mode, path_list, size):
    print('start preprocessing...')
    out_dir = os.path.join(data_dir, f'rescale_{size}', str(mode))   # e.g. ./rescale_512/train
    os.makedirs(out_dir, exist_ok=True)
    # Skip images that an earlier run already wrote to out_dir
    completed_list = sorted(glob.glob(os.path.join(out_dir, '*')))
    completed_list_name = set(os.path.split(x)[-1] for x in completed_list)
    working_list = [x for x in path_list if os.path.split(x)[-1] not in completed_list_name]
    for img_path in tqdm(working_list):
        img_name = os.path.split(img_path)[-1]
        img = cv2.imread(img_path)
        if img is None:   # cv2.imread returns None when a file cannot be read
            print('skipped unreadable file:', img_path)
            continue
        img = img_crop(img)
        img = img_resize(img, mode=mode, size = size)
        cv2.imwrite(os.path.join(out_dir, img_name), img)
    print('finish!')
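
`rescale_and_save` is resumable because it skips any file whose name already exists in the output folder. The skip logic alone can be sketched as follows; `pending_paths` and the file names are hypothetical, and a temporary directory stands in for the real output folder.

```python
import os
import tempfile

def pending_paths(path_list, out_dir):
    """Return the paths whose file name is not yet present in out_dir."""
    done = set(os.listdir(out_dir)) if os.path.isdir(out_dir) else set()
    return [p for p in path_list if os.path.basename(p) not in done]

with tempfile.TemporaryDirectory() as out_dir:
    # Pretend a.jpg was already processed by an earlier run
    open(os.path.join(out_dir, 'a.jpg'), 'w').close()
    sources = [os.path.join('src', name) for name in ('a.jpg', 'b.jpg', 'c.jpg')]
    todo = pending_paths(sources, out_dir)
    todo_names = [os.path.basename(p) for p in todo]
    print(todo_names)   # ['b.jpg', 'c.jpg']
```

Because the check is by file name only, a partially written image from an interrupted run would also be treated as done; deleting the last written file before resuming avoids that.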

In [8]:
total_file_path = sorted(glob.glob(os.path.join(original_data_dir, 'sorted','*', '*')))

# Match each label ID to its image file by file stem; sets keep the membership test fast
train_csv = pd.read_csv(os.path.join(data_dir,'train_label_rescale.csv'), index_col=0)
train_id = train_csv.index.to_list()
train_id_set = set(train_id)
train_path = [x for x in total_file_path if os.path.basename(x).split('.')[0] in train_id_set]
print("print(len(train_csv)): ", len(train_csv))
print("len(train_path): ", len(train_path))

val_csv = pd.read_csv(os.path.join(data_dir,'val_label_rescale.csv'), index_col=0)
val_id = val_csv.index.to_list()
val_id_set = set(val_id)
val_path = [x for x in total_file_path if os.path.basename(x).split('.')[0] in val_id_set]
print("print(len(val_csv)): ", len(val_csv))
print("len(val_path): ", len(val_path))

test_csv = pd.read_csv(os.path.join(data_dir,'test_label_rescale.csv'), index_col=0)
test_id = test_csv.index.to_list()
test_id_set = set(test_id)
test_path = [x for x in total_file_path if os.path.basename(x).split('.')[0] in test_id_set]
print("print(len(test_csv)): ", len(test_csv))
print("len(test_path): ", len(test_path))

print(len(train_csv)):  24304
len(train_path):  24304
print(len(val_csv)):  8102
len(val_path):  8102
print(len(test_csv)):  8101
len(test_path):  8101
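
The counts above agree, but when they do not, it helps to see which IDs are unmatched. A minimal sketch of such a check follows, using the same stem rule (`os.path.basename(x).split('.')[0]`) as the cell above; `match_report` and the example IDs and paths are hypothetical.

```python
import os

def match_report(label_ids, file_paths):
    """Return (IDs with no image file, file stems with no label)."""
    stems = {os.path.basename(p).split('.')[0] for p in file_paths}
    ids = set(label_ids)
    return sorted(ids - stems), sorted(stems - ids)

labels = ['10_left', '10_right', '11_left']
files = [
    os.path.join('sorted', '0', '10_left.jpeg'),
    os.path.join('sorted', '2', '10_right.jpeg'),
    os.path.join('sorted', '1', '99_left.jpeg'),
]
missing, unlabeled = match_report(labels, files)
print(missing, unlabeled)   # ['11_left'] ['99_left']
```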


In [19]:
rescale_and_save('test', test_path, 512)

start preprocessing...



finish!


In [20]:
rescale_and_save('val', val_path, 512)

start preprocessing...



finish!


In [21]:
rescale_and_save('train', train_path, 512)

start preprocessing...



finish!
