# Diabetic Retinopathy

## Description

The Diabetic Retinopathy (DR) dataset is used to predict whether an image contains signs of diabetic retinopathy. Its features were extracted from the [Messidor](https://www.adcis.net/en/third-party/messidor/) image set.

The column attributes are as follows:

| Column Name | Description |
| - | - |
| assessment_quality | Quality of assessment |
| prescreening_result | Pre-screening analysis results |
| mas_alpha_5 | Number of microaneurysms (MA) detected at confidence level alpha=0.5 |
| mas_alpha_6 | Number of microaneurysms (MA) detected at confidence level alpha=0.6 |
| mas_alpha_7 | Number of microaneurysms (MA) detected at confidence level alpha=0.7 |
| mas_alpha_8 | Number of microaneurysms (MA) detected at confidence level alpha=0.8 |
| mas_alpha_9 | Number of microaneurysms (MA) detected at confidence level alpha=0.9 |
| mas_alpha_10 | Number of microaneurysms (MA) detected at confidence level alpha=1.0 |
| exudates_alpha_50 | Number of exudates found at confidence level alpha=0.50 divided by the diameter of the region of interest (ROI) |
| exudates_alpha_57 | Number of exudates found at confidence level alpha≈0.57 divided by the diameter of the region of interest (ROI) |
| exudates_alpha_64 | Number of exudates found at confidence level alpha≈0.64 divided by the diameter of the region of interest (ROI) |
| exudates_alpha_71 | Number of exudates found at confidence level alpha≈0.71 divided by the diameter of the region of interest (ROI) |
| exudates_alpha_79 | Number of exudates found at confidence level alpha≈0.79 divided by the diameter of the region of interest (ROI) |
| exudates_alpha_86 | Number of exudates found at confidence level alpha≈0.86 divided by the diameter of the region of interest (ROI) |
| exudates_alpha_93 | Number of exudates found at confidence level alpha≈0.93 divided by the diameter of the region of interest (ROI) |
| exudates_alpha_100 | Number of exudates found at confidence level alpha=1.00 divided by the diameter of the region of interest (ROI) |
| macula_disc_distance | Euclidean distance between the center of the macula and the center of the optic disc, divided by the diameter of the region of interest (ROI) |
| disc_diameter | Diameter of the optic disc |
| am_fm_result | Binary result from AM/FM-based classification |
| contains_DR | Class label 0 (no signs of diabetic retinopathy) and 1 (contains signs of diabetic retinopathy) |

[Source](https://archive.ics.uci.edu/ml/datasets/Diabetic+Retinopathy+Debrecen+Data+Set)
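
The column suffixes in the table follow a regular pattern, so the name list can also be generated rather than typed out. The sketch below is illustrative only: the exudate suffixes (50, 57, ..., 100) happen to match eight evenly spaced confidence levels between 0.5 and 1.0, which is an inference from the names and not something the source states. The variable names are hypothetical.

```python
import numpy as np

# Microaneurysm counts: alpha = 0.5, 0.6, ..., 1.0 with suffixes 5..10.
ma_columns = [f"mas_alpha_{k}" for k in range(5, 11)]

# Exudate features: assumed to be eight evenly spaced levels in [0.5, 1.0].
# The suffix is the level times 100, rounded to the nearest integer.
exudate_alphas = np.linspace(0.5, 1.0, 8)
exudate_columns = [f"exudates_alpha_{int(round(a * 100))}" for a in exudate_alphas]

generated_column_names = (
    ["assessment_quality", "prescreening_result"]
    + ma_columns
    + exudate_columns
    + ["macula_disc_distance", "disc_diameter", "am_fm_result", "contains_DR"]
)
print(generated_column_names)
```

The generated list has the same 20 names, in the same order, as the table above.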

## Importing the Dataset

In [None]:
import pandas as pd
from scipy.io import arff

column_names = ['assessment_quality',
                'prescreening_result',
                'mas_alpha_5',
                'mas_alpha_6',
                'mas_alpha_7',
                'mas_alpha_8',
                'mas_alpha_9',
                'mas_alpha_10',
                'exudates_alpha_50',
                'exudates_alpha_57',
                'exudates_alpha_64',
                'exudates_alpha_71',
                'exudates_alpha_79',
                'exudates_alpha_86',
                'exudates_alpha_93',
                'exudates_alpha_100',
                'macula_disc_distance',
                'disc_diameter',
                'am_fm_result',
                'contains_DR']
features = column_names[:-1]
target = column_names[-1]

with open("../../datasets/classification/diabetic_retinopathy.arff", "r") as dataset_file:
    raw_data, meta = arff.loadarff(dataset_file)
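
To see what `arff.loadarff` returns without needing the dataset file, the following sketch parses a tiny in-memory ARFF document. The relation, attribute names, and values are made up for illustration. It assumes that nominal attributes are returned as byte strings, which is why the class column is decoded before casting.

```python
import io

import pandas as pd
from scipy.io import arff

# A made-up three-row ARFF document with two numeric attributes and one nominal class.
arff_text = """@relation toy
@attribute quality numeric
@attribute ma_count numeric
@attribute class {0,1}
@data
1,22,0
1,24,1
0,62,1
"""

toy_raw, toy_meta = arff.loadarff(io.StringIO(arff_text))

# The metadata object lists attribute names and their declared types.
print(toy_meta.names())
print(toy_meta.types())

# Build a DataFrame, then decode the nominal column from bytes before casting to int.
toy_frame = pd.DataFrame(toy_raw.tolist(), columns=toy_meta.names())
toy_frame["class"] = toy_frame["class"].str.decode("utf-8").astype(int)
print(toy_frame.dtypes)
```

Checking `meta.types()` on the real file is a quick way to confirm which columns are numeric and which are nominal before casting them.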

## Preparing the Dataset

In [None]:
# Convert the raw NumPy record array to a pandas DataFrame, which allows a different datatype per column.
prepared_data = pd.DataFrame(raw_data.tolist(), columns=column_names)

# Cast the integer-valued feature columns to int.
prepared_data['assessment_quality'] = prepared_data['assessment_quality'].astype(int)
prepared_data['prescreening_result'] = prepared_data['prescreening_result'].astype(int)
prepared_data['mas_alpha_5'] = prepared_data['mas_alpha_5'].astype(int)
prepared_data['mas_alpha_6'] = prepared_data['mas_alpha_6'].astype(int)
prepared_data['mas_alpha_7'] = prepared_data['mas_alpha_7'].astype(int)
prepared_data['mas_alpha_8'] = prepared_data['mas_alpha_8'].astype(int)
prepared_data['mas_alpha_9'] = prepared_data['mas_alpha_9'].astype(int)
prepared_data['mas_alpha_10'] = prepared_data['mas_alpha_10'].astype(int)
prepared_data['am_fm_result'] = prepared_data['am_fm_result'].astype(int)

# Cast the target column to int.
prepared_data['contains_DR'] = prepared_data['contains_DR'].astype(int)

The following block prints the shape and column datatypes of the processed dataset.

In [None]:
print(prepared_data.shape)
print(prepared_data.dtypes)

## Preprocessing the Dataset

In [None]:
from sklearn.model_selection import train_test_split

X_full = prepared_data[features].copy()
y_full = prepared_data[target].copy()

# Split the dataset into a training set (60%) and a held-out set (40%) for testing and validation.
X_train, X_test_and_val, y_train, y_test_and_val = train_test_split(X_full, y_full,
                                                        train_size=0.6,
                                                        random_state=0)
# Split the held-out set in half: one half for testing, the other for validation.
X_test, X_val, y_test, y_val = train_test_split(X_test_and_val, y_test_and_val,
                                                        train_size=0.5,
                                                        random_state=0)
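
The two chained calls above give a 60/20/20 split into training, test, and validation sets. A minimal sketch on synthetic data (1000 made-up rows, not the DR dataset) confirms the resulting sizes:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 samples, 2 features, alternating binary labels.
X_demo = np.arange(2000).reshape(1000, 2)
y_demo = np.arange(1000) % 2

# First split: 60% training, 40% held out.
X_tr, X_rest, y_tr, y_rest = train_test_split(X_demo, y_demo,
                                              train_size=0.6, random_state=0)
# Second split: halve the held-out part into test and validation sets.
X_te, X_va, y_te, y_va = train_test_split(X_rest, y_rest,
                                          train_size=0.5, random_state=0)

print(len(X_tr), len(X_te), len(X_va))
```

If the class proportions should be preserved in each part, `train_test_split` also accepts a `stratify` argument; the notebook does not use it.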

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# Fit the scaler on the training data only, then transform the training data.
X_train_scaled = scaler.fit_transform(X_train)

# Apply the same transformation to the test and validation data without refitting.
X_test_scaled = scaler.transform(X_test)
X_val_scaled = scaler.transform(X_val)
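
Fitting the scaler on the training set alone means the test and validation data are standardized with the training mean and standard deviation, so no information from the held-out data leaks into preprocessing. The sketch below illustrates this on random synthetic data; the array names are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data drawn from the same distribution.
rng = np.random.default_rng(0)
train_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
test_demo = rng.normal(loc=50.0, scale=10.0, size=(50, 3))

demo_scaler = StandardScaler()
train_scaled = demo_scaler.fit_transform(train_demo)  # learns mean and std from training data
test_scaled = demo_scaler.transform(test_demo)        # reuses the training statistics

# Training columns have mean 0 and standard deviation 1 (up to rounding).
print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
# Test columns are only approximately standardized, since they use training statistics.
print(test_scaled.mean(axis=0), test_scaled.std(axis=0))
```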

## Training on Multiple Classifiers

In [None]:
# Import the scikit-learn submodules used by the classifiers below.
import sklearn.linear_model
import sklearn.svm
import sklearn.tree
import sklearn.ensemble
import sklearn.neighbors
import sklearn.naive_bayes
from utilities import train_estimators, plot_estimator_scores
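
The `utilities` module is not shown in this notebook. Judging only from how `train_estimators` is called below, it appears to fit one estimator per value of the adjusted parameter. The following is a hypothetical sketch of such a helper, not the actual implementation; the name `train_estimators_sketch` and the toy data are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression


def train_estimators_sketch(X, y, estimator_class, parameter_name, parameter_values, **kwargs):
    """Hypothetical helper: fit one estimator for each value of `parameter_name`."""
    estimators = []
    for value in parameter_values:
        # Combine the swept parameter with the fixed keyword arguments.
        estimator = estimator_class(**{parameter_name: value}, **kwargs)
        estimators.append(estimator.fit(X, y))
    return estimators


# Toy data standing in for the scaled training set.
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)
c_values = [0.01, 0.1, 1.0]

fitted = train_estimators_sketch(X_toy, y_toy, LogisticRegression,
                                 "C", c_values,
                                 max_iter=1000, random_state=0)
for c, estimator in zip(c_values, fitted):
    print(f"C={c}: training accuracy {estimator.score(X_toy, y_toy):.3f}")
```

A plotting counterpart would presumably call `score` on each fitted estimator for the training, test, and validation sets and plot the results against the parameter values.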

### Logistic Regression Classification

In [None]:
adjusted_parameter = 'C'
adjusted_parameter_values = [1e-05, 0.0001, 0.001, 0.01, 0.1, 1.0]

LogisticRegressionEstimators = train_estimators(X_train_scaled, y_train,
                                                sklearn.linear_model.LogisticRegression,
                                                adjusted_parameter, adjusted_parameter_values,
                                                max_iter=10000,
                                                random_state=0)
plot_estimator_scores(LogisticRegressionEstimators,
                        adjusted_parameter, adjusted_parameter_values,
                        X_train_scaled, y_train, X_test_scaled, y_test, X_val_scaled, y_val)

### SVM Classification

In [None]:
adjusted_parameter = 'C'
adjusted_parameter_values = [0.01, 0.1, 1.0, 10.0, 100.0]

SVMEstimators = train_estimators(X_train_scaled, y_train,
                                    sklearn.svm.SVC,
                                    adjusted_parameter, adjusted_parameter_values,
                                    gamma=0.0001,
                                    max_iter=10000,
                                    random_state=0)
plot_estimator_scores(SVMEstimators,
                        adjusted_parameter, adjusted_parameter_values,
                        X_train_scaled, y_train, X_test_scaled, y_test, X_val_scaled, y_val)

### Decision Tree Classification

In [None]:
adjusted_parameter = 'max_depth'
adjusted_parameter_values = [1, 5, 10, 20, 50, 100]

DecisionTreeEstimators = train_estimators(X_train_scaled, y_train,
                                            sklearn.tree.DecisionTreeClassifier,
                                            adjusted_parameter, adjusted_parameter_values,
                                            splitter='random',
                                            random_state=0)
plot_estimator_scores(DecisionTreeEstimators,
                        adjusted_parameter, adjusted_parameter_values,
                        X_train_scaled, y_train, X_test_scaled, y_test, X_val_scaled, y_val)

### Random Forest Classification

In [None]:
adjusted_parameter = 'max_depth'
adjusted_parameter_values = [1, 5, 10, 20, 50, 100]

RandomForestEstimators = train_estimators(X_train_scaled, y_train,
                                        sklearn.ensemble.RandomForestClassifier,
                                        adjusted_parameter, adjusted_parameter_values,
                                        random_state=0)
plot_estimator_scores(RandomForestEstimators,
                        adjusted_parameter, adjusted_parameter_values,
                        X_train_scaled, y_train, X_test_scaled, y_test, X_val_scaled, y_val)

### K-Nearest Neighbours Classification

In [None]:
adjusted_parameter = 'weights'
adjusted_parameter_values = ['uniform', 'distance']

KNearestEstimators = train_estimators(X_train_scaled, y_train,
                                        sklearn.neighbors.KNeighborsClassifier,
                                        adjusted_parameter, adjusted_parameter_values,
                                        n_neighbors=2)
plot_estimator_scores(KNearestEstimators,
                        adjusted_parameter, adjusted_parameter_values,
                        X_train_scaled, y_train, X_test_scaled, y_test, X_val_scaled, y_val)

In [None]:
adjusted_parameter = 'algorithm'
adjusted_parameter_values = ['auto', 'ball_tree', 'kd_tree', 'brute']

KNearestEstimators = train_estimators(X_train_scaled, y_train,
                                        sklearn.neighbors.KNeighborsClassifier,
                                        adjusted_parameter, adjusted_parameter_values,
                                        n_neighbors=2)
plot_estimator_scores(KNearestEstimators,
                        adjusted_parameter, adjusted_parameter_values,
                        X_train_scaled, y_train, X_test_scaled, y_test, X_val_scaled, y_val)

### AdaBoost Classification

In [None]:
adjusted_parameter = 'n_estimators'
adjusted_parameter_values = [10, 50, 100, 500, 1000, 5000]

AdaBoostEstimators = train_estimators(X_train_scaled, y_train,
                                        sklearn.ensemble.AdaBoostClassifier,
                                        adjusted_parameter, adjusted_parameter_values,
                                        random_state=0)
plot_estimator_scores(AdaBoostEstimators,
                        adjusted_parameter, adjusted_parameter_values,
                        X_train_scaled, y_train, X_test_scaled, y_test, X_val_scaled, y_val)

### Gaussian Naive Bayes Classification

In [None]:
# No parameter is swept here, so a single estimator is fit.
# Note that it is trained and scored on the unscaled features, unlike the estimators above.
gaussian_nb = sklearn.naive_bayes.GaussianNB()
NaiveBayesEstimator = gaussian_nb.fit(X_train, y_train)
gaussian_nb_train_score = NaiveBayesEstimator.score(X_train, y_train)
gaussian_nb_test_score = NaiveBayesEstimator.score(X_test, y_test)
gaussian_nb_val_score = NaiveBayesEstimator.score(X_val, y_val)
print(f'{gaussian_nb_train_score=}, {gaussian_nb_val_score=}, {gaussian_nb_test_score=}')