# Diabetic Retinopathy

## Description

The Diabetic Retinopathy (DR) dataset is used to predict whether an image contains signs of diabetic retinopathy. Its features were extracted from the [Messidor](https://www.adcis.net/en/third-party/messidor/) image set.

The column attributes are as follows:

| Column Name | Description |
| - | - |
| assessment_quality | Quality of assessment |
| prescreening_result | Pre-screening analysis results |
| mas_alpha_5 | Number of microaneurysms (MA) detected at confidence level alpha=0.5 |
| mas_alpha_6 | Number of microaneurysms (MA) detected at confidence level alpha=0.6 |
| mas_alpha_7 | Number of microaneurysms (MA) detected at confidence level alpha=0.7 |
| mas_alpha_8 | Number of microaneurysms (MA) detected at confidence level alpha=0.8 |
| mas_alpha_9 | Number of microaneurysms (MA) detected at confidence level alpha=0.9 |
| mas_alpha_10 | Number of microaneurysms (MA) detected at confidence level alpha=1.0 |
| exudates_alpha_50 | Number of exudates found at confidence level alpha=0.50 divided by the diameter of the region of interest (ROI) |
| exudates_alpha_57 | Number of exudates found at confidence level alpha≈0.57 divided by the diameter of the region of interest (ROI) |
| exudates_alpha_64 | Number of exudates found at confidence level alpha≈0.64 divided by the diameter of the region of interest (ROI) |
| exudates_alpha_71 | Number of exudates found at confidence level alpha≈0.71 divided by the diameter of the region of interest (ROI) |
| exudates_alpha_79 | Number of exudates found at confidence level alpha≈0.79 divided by the diameter of the region of interest (ROI) |
| exudates_alpha_86 | Number of exudates found at confidence level alpha≈0.86 divided by the diameter of the region of interest (ROI) |
| exudates_alpha_93 | Number of exudates found at confidence level alpha≈0.93 divided by the diameter of the region of interest (ROI) |
| exudates_alpha_100 | Number of exudates found at confidence level alpha=1.00 divided by the diameter of the region of interest (ROI) |
| macula_disc_distance | Euclidean distance between the center of the macula and the center of the optic disc, divided by the diameter of the region of interest (ROI) |
| disc_diameter | Diameter of the optic disc |
| am_fm_result | Binary result from AM/FM-based classification |
| contains_DR | Class label 0 (no signs of diabetic retinopathy) and 1 (contains signs of diabetic retinopathy) |

[Source](https://archive.ics.uci.edu/ml/datasets/Diabetic+Retinopathy+Debrecen+Data+Set)

## Importing the Dataset

In [None]:
import pandas as pd
from scipy.io import arff

column_names = ['assessment_quality',
                'prescreening_result',
                'mas_alpha_5',
                'mas_alpha_6',
                'mas_alpha_7',
                'mas_alpha_8',
                'mas_alpha_9',
                'mas_alpha_10',
                'exudates_alpha_50',
                'exudates_alpha_57',
                'exudates_alpha_64',
                'exudates_alpha_71',
                'exudates_alpha_79',
                'exudates_alpha_86',
                'exudates_alpha_93',
                'exudates_alpha_100',
                'macula_disc_distance',
                'disc_diameter',
                'am_fm_result',
                'contains_DR']

with open("../../datasets/classification/diabetic_retinopathy.arff", "r") as dataset_file:
    raw_data, meta = arff.loadarff(dataset_file)
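
`arff.loadarff` returns the records together with a metadata object describing the ARFF header. As a minimal sketch of how both can be inspected, the example below parses a tiny in-memory ARFF document rather than the real file. The relation name, attribute names and values (`toy`, `quality`, `ma_count`, `label`) are made up for illustration and are not taken from the dataset.

```python
import io

from scipy.io import arff

# Hypothetical three-attribute ARFF document standing in for the real file.
toy_arff = io.StringIO("""@relation toy
@attribute quality numeric
@attribute ma_count numeric
@attribute label {0,1}
@data
1,22,0
1,24,1
""")

toy_data, toy_meta = arff.loadarff(toy_arff)

# Attribute names and ARFF types as declared in the header.
print(toy_meta.names())
print(toy_meta.types())

# The records come back as a NumPy structured array with one field per attribute.
print(toy_data.dtype)
```

Comparing `meta.names()` against a hand-written `column_names` list is one way to confirm that both have the same length before the names are applied to a DataFrame.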

## Preparing the Dataset

In [None]:
# Convert the raw numpy dataset to a pandas DataFrame. This allows for mixed datatypes within the same multidimensional matrix object.
processed_data = pd.DataFrame(raw_data.tolist(), columns=column_names)

# Cast the integer-valued feature columns to int.
processed_data['assessment_quality'] = processed_data['assessment_quality'].astype(int)
processed_data['prescreening_result'] = processed_data['prescreening_result'].astype(int)
processed_data['mas_alpha_5'] = processed_data['mas_alpha_5'].astype(int)
processed_data['mas_alpha_6'] = processed_data['mas_alpha_6'].astype(int)
processed_data['mas_alpha_7'] = processed_data['mas_alpha_7'].astype(int)
processed_data['mas_alpha_8'] = processed_data['mas_alpha_8'].astype(int)
processed_data['mas_alpha_9'] = processed_data['mas_alpha_9'].astype(int)
processed_data['mas_alpha_10'] = processed_data['mas_alpha_10'].astype(int)

# Cast the binary AM/FM feature and the target column to int.
processed_data['am_fm_result'] = processed_data['am_fm_result'].astype(int)
processed_data['contains_DR'] = processed_data['contains_DR'].astype(int)
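
The per-column casts above can also be written as a single call, since `DataFrame.astype` accepts a mapping from column name to dtype. The following is a sketch on a toy frame, assuming the columns arrive as floats after loading; the values are invented and `toy` and `integer_columns` are illustrative names.

```python
import pandas as pd

# Toy frame standing in for processed_data right after loading (values are invented).
toy = pd.DataFrame({
    "mas_alpha_5": [22.0, 24.0],
    "disc_diameter": [0.10, 0.14],
    "contains_DR": [0.0, 1.0],
})

integer_columns = ["mas_alpha_5", "contains_DR"]

# One astype call with a {column: dtype} mapping converts every listed column;
# columns not listed in the mapping keep their original dtype.
toy = toy.astype({name: int for name in integer_columns})

print(toy.dtypes)
```

Keeping the integer column names in one list makes it harder to miss a column than repeating the assignment line by line.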

The following block prints the shape and column datatypes of the processed dataset.

In [None]:
print(processed_data.shape)
print(processed_data.dtypes)
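
Beyond printing the dtypes, a quick sanity check can confirm that the binary columns hold only 0 and 1 and show how the two classes are distributed. The sketch below runs on a toy frame with invented values; with the real data, `processed_data` would take the place of `toy`.

```python
import pandas as pd

# Toy stand-in for processed_data after the integer conversion (values are invented).
toy = pd.DataFrame({
    "am_fm_result": [0, 1, 0],
    "contains_DR": [0, 1, 1],
    "disc_diameter": [0.10, 0.14, 0.11],
})

binary_columns = ["am_fm_result", "contains_DR"]

# Each listed column should have an integer dtype and contain only 0 or 1.
for name in binary_columns:
    assert pd.api.types.is_integer_dtype(toy[name]), name
    assert toy[name].isin([0, 1]).all(), name

# Number of rows per class label in the target column.
class_counts = toy["contains_DR"].value_counts().sort_index()
print(class_counts)
```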