# Diabetic Retinopathy Debrecen Data Set

## Data source:
<https://archive.ics.uci.edu/ml/datasets/Diabetic+Retinopathy+Debrecen+Data+Set>

## Attribute Information:

0. quality: The binary result of quality assessment: 0 = bad quality, 1 = sufficient quality.
1. prescreen: The binary result of pre-screening, where 1 indicates severe retinal abnormality and 0 indicates its absence.
2. ma_detection_0.5: The number of microaneurysms (MAs) detected at the confidence level alpha = 0.5.
3. ma_detection_0.6: The number of MAs detected at the confidence level alpha = 0.6.
4. ma_detection_0.7: The number of MAs detected at the confidence level alpha = 0.7.
5. ma_detection_0.8: The number of MAs detected at the confidence level alpha = 0.8.
6. ma_detection_0.9: The number of MAs detected at the confidence level alpha = 0.9.
7. ma_detection_1.0: The number of MAs detected at the confidence level alpha = 1.0.
8. exudates_0.1: The number of exudates found at the confidence level alpha = 0.1 with normalization*.
9. exudates_0.2: The number of exudates found at the confidence level alpha = 0.2 with normalization*.
10. exudates_0.3: The number of exudates found at the confidence level alpha = 0.3 with normalization*.
11. exudates_0.4: The number of exudates found at the confidence level alpha = 0.4 with normalization*.
12. exudates_0.5: The number of exudates found at the confidence level alpha = 0.5 with normalization*.
13. exudates_0.6: The number of exudates found at the confidence level alpha = 0.6 with normalization*.
14. exudates_0.7: The number of exudates found at the confidence level alpha = 0.7 with normalization*.
15. exudates_0.8: The number of exudates found at the confidence level alpha = 0.8 with normalization*.

* Exudates are represented by a set of points rather than by the number of pixels that make up the lesions, so these features are normalized by dividing the number of lesions by the diameter of the region of interest (ROI) to compensate for different image sizes.

16. dist_macula_optic: The Euclidean distance between the center of the macula and the center of the optic disc, which provides important information about the patient's condition. This feature is also normalized by the diameter of the ROI.
17. diameter_optic: The diameter of the optic disc.
18. am_fm: The binary result of the AM/FM-based classification.
19. Class: Class label. 1 = contains signs of Diabetic Retinopathy (DR) (Accumulative label for the Messidor classes 1, 2, 3), 0 = no signs of DR.
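
The normalization described in the footnote to the exudate features can be illustrated with a small sketch. The helper name `normalize_by_roi` and the numbers below are hypothetical and are not taken from the data set; the sketch only assumes that the raw count is divided by the ROI diameter, as the footnote states.

```python
import math


def normalize_by_roi(lesion_count: int, roi_diameter: float) -> float:
    """Divide a raw lesion count by the ROI diameter (hypothetical helper)."""
    if roi_diameter <= 0:
        raise ValueError("roi_diameter must be positive")
    return lesion_count / roi_diameter


# Hypothetical example: the same retina imaged at two resolutions.
# If the number of detected points grows in proportion to the image size,
# the normalized values agree even though the raw counts differ.
small_image = normalize_by_roi(60, 720.0)
large_image = normalize_by_roi(120, 1440.0)

print(small_image, large_image)
print(math.isclose(small_image, large_image))
```

This is why the exudate columns hold fractional values rather than integer counts.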


# Load the libraries and data

In [19]:
from scipy.io import arff
import pandas as pd
import numpy as np
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.metrics import classification_report

data = arff.loadarff('messidor_features.arff')
df = pd.DataFrame(data[0])
df.columns = ["quality", "prescreen", 
              "ma_detection_0.5", "ma_detection_0.6", "ma_detection_0.7", 
              "ma_detection_0.8", "ma_detection_0.9", "ma_detection_1.0",
              "exudates_0.1", "exudates_0.2", "exudates_0.3",
              "exudates_0.4", "exudates_0.5", "exudates_0.6",
              "exudates_0.7", "exudates_0.8",
              "dist_macula_optic", "diameter_optic", "am_fm", "Class"
             ]

# Preview the data

In [61]:
df.head()

| | quality | prescreen | ma_detection_0.5 | ma_detection_0.6 | ma_detection_0.7 | ma_detection_0.8 | ma_detection_0.9 | ma_detection_1.0 | exudates_0.1 | exudates_0.2 | exudates_0.3 | exudates_0.4 | exudates_0.5 | exudates_0.6 | exudates_0.7 | exudates_0.8 | dist_macula_optic | diameter_optic | am_fm | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 1.0 | 22.0 | 22.0 | 22.0 | 19.0 | 18.0 | 14.0 | 49.895756 | 17.775994 | 5.27092 | 0.771761 | 0.018632 | 0.006864 | 0.003923 | 0.003923 | 0.486903 | 0.100025 | 1.0 | 0 |
| 1 | 1.0 | 1.0 | 24.0 | 24.0 | 22.0 | 18.0 | 16.0 | 13.0 | 57.709936 | 23.799994 | 3.325423 | 0.234185 | 0.003903 | 0.003903 | 0.003903 | 0.003903 | 0.520908 | 0.144414 | 0.0 | 0 |
| 2 | 1.0 | 1.0 | 62.0 | 60.0 | 59.0 | 54.0 | 47.0 | 33.0 | 55.831441 | 27.993933 | 12.687485 | 4.852282 | 1.393889 | 0.373252 | 0.041817 | 0.007744 | 0.530904 | 0.128548 | 0.0 | 1 |
| 3 | 1.0 | 1.0 | 55.0 | 53.0 | 53.0 | 50.0 | 43.0 | 31.0 | 40.467228 | 18.445954 | 9.118901 | 3.079428 | 0.840261 | 0.272434 | 0.007653 | 0.001531 | 0.483284 | 0.11479 | 0.0 | 0 |
| 4 | 1.0 | 1.0 | 44.0 | 44.0 | 44.0 | 41.0 | 39.0 | 27.0 | 18.026254 | 8.570709 | 0.410381 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.475935 | 0.123572 | 0.0 | 1 |


# Check for missing values

In [62]:
df.isna().any()

quality              False
prescreen            False
ma_detection_0.5     False
ma_detection_0.6     False
ma_detection_0.7     False
ma_detection_0.8     False
ma_detection_0.9     False
ma_detection_1.0     False
exudates_0.1         False
exudates_0.2         False
exudates_0.3         False
exudates_0.4         False
exudates_0.5         False
exudates_0.6         False
exudates_0.7         False
exudates_0.8         False
dist_macula_optic    False
diameter_optic       False
am_fm                False
Class                False
dtype: bool
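
`df.isna().any()` only reports whether a column contains at least one missing value. When a count per column is more useful, `isna().sum()` can be used instead. The sketch below runs on a small made-up frame (`toy` is a hypothetical stand-in, not the Messidor data) so that the behaviour is visible.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a few deliberately missing entries.
toy = pd.DataFrame(
    {
        "quality": [1.0, 1.0, np.nan],
        "prescreen": [1.0, np.nan, np.nan],
        "am_fm": [0.0, 1.0, 0.0],
    }
)

# Number of missing values in each column.
missing_per_column = toy.isna().sum()

# Total number of missing cells in the whole frame.
total_missing = int(missing_per_column.sum())

print(missing_per_column)
print("total missing:", total_missing)
```

On the data set above, every entry of the per-column count would be zero, since no column reports a missing value.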

# Clean the data
- Convert the 'Class' variable from its byte-string representation (for example `b'0'`) to an integer value

In [65]:
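# loadarff returns nominal values as bytes, so astype(str) yields text such as "b'0'";
# strip the leading b and the quotes to leave only the digit.
df["Class"] = [j.replace("b", "").replace("'", "") for j in df["Class"].astype(str)]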
df["Class"] = df["Class"].astype(int)
df["Class"].head()

0    0
1    0
2    1
3    0
4    1
Name: Class, dtype: int32
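
The cleaning step is needed because `scipy.io.arff.loadarff` returns nominal attributes as byte strings. The sketch below reproduces this on a tiny in-memory ARFF document (the `toy_arff` text is made up for illustration) and shows an alternative conversion with `Series.str.decode`, which should give the same integers as the string replacement used above.

```python
from io import StringIO

import pandas as pd
from scipy.io import arff

# Hypothetical miniature ARFF file with one numeric and one nominal attribute.
toy_arff = """@relation toy
@attribute score numeric
@attribute Class {0,1}
@data
0.5,0
1.5,1
2.5,1
"""

data, meta = arff.loadarff(StringIO(toy_arff))
toy = pd.DataFrame(data)

# Nominal values arrive as bytes, e.g. b'0' and b'1'.
first_raw_value = toy["Class"].iloc[0]
print(repr(first_raw_value))

# Decode the bytes to text, then cast to integers.
toy["Class"] = toy["Class"].str.decode("utf-8").astype(int)
labels = toy["Class"].tolist()
print(labels)
```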

# Data Partitioning
- Subset the features as X and the target class as Y.
- Split the data into a 60% training set and a 40% test set. The split is stratified on Y so that the training and test sets keep the same class proportions.

In [70]:
Y = df.iloc[:, -1]
X = df.iloc[:, :-1]
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size = 0.4, stratify = Y, random_state = 123)
print(x_train.shape, x_test.shape, y_train.shape, y_test.shape)
print(y_train.mean(), y_test.mean()) # proportion of class = 1 is similar for training and test set

(690, 19) (461, 19) (690,) (461,)
0.5304347826086957 0.5314533622559653
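
To see what the `stratify` argument contributes, the following sketch compares a stratified and a non-stratified split on synthetic labels. The arrays `X_toy` and `y_toy` are made up for illustration (30% positives) and are not the retinopathy data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 1000 samples, 30% of them in class 1.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.array([1] * 300 + [0] * 700)

# Stratified split: class proportions are preserved in both parts.
_, _, y_tr_strat, y_te_strat = train_test_split(
    X_toy, y_toy, test_size=0.4, stratify=y_toy, random_state=123
)

# Plain random split: proportions can drift between the two parts.
_, _, y_tr_plain, y_te_plain = train_test_split(
    X_toy, y_toy, test_size=0.4, random_state=123
)

strat_gap = abs(y_tr_strat.mean() - y_te_strat.mean())
plain_gap = abs(y_tr_plain.mean() - y_te_plain.mean())

print("stratified gap in class-1 proportion:", strat_gap)
print("plain gap in class-1 proportion:", plain_gap)
```

The near-identical proportions printed for the real data (about 0.530 and 0.531) are the same effect.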


# Modelling
Four models were trained on the training set and evaluated on the test set:
1. Logistic Regression
2. Random Forest classifier
3. K-Nearest Neighbor classifier
4. Linear Support Vector Machine (SVM)

### Logistic Regression

In [71]:
reg = LogisticRegression()
reg.fit(x_train, y_train)
y_pred = reg.predict(x_test)
reg.score(x_test, y_test)
classification_report(y_test, y_pred, output_dict = True)



{'0': {'precision': 0.7142857142857143,
  'recall': 0.8101851851851852,
  'f1-score': 0.7592190889370932,
  'support': 216},
 '1': {'precision': 0.8101851851851852,
  'recall': 0.7142857142857143,
  'f1-score': 0.7592190889370932,
  'support': 245},
 'accuracy': 0.7592190889370932,
 'macro avg': {'precision': 0.7622354497354498,
  'recall': 0.7622354497354498,
  'f1-score': 0.7592190889370932,
  'support': 461},
 'weighted avg': {'precision': 0.7652518105338062,
  'recall': 0.7592190889370932,
  'f1-score': 0.7592190889370932,
  'support': 461}}
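
The figures in this report can be traced back to a confusion matrix. The counts below are not printed by the notebook; they are inferred from the reported supports (216 and 245) and recalls (175/216 and 175/245), so treat them as a reconstruction. The sketch recomputes precision, recall, F1 and accuracy from those counts.

```python
import numpy as np

# Reconstructed confusion matrix (rows: true class, columns: predicted class).
# Inferred from the report above, not produced by the original notebook.
cm = np.array(
    [
        [175, 41],   # true 0: predicted 0, predicted 1
        [70, 175],   # true 1: predicted 0, predicted 1
    ]
)

correct = cm.diagonal()

# Recall: correct predictions divided by the number of true members of each class.
recall = correct / cm.sum(axis=1)

# Precision: correct predictions divided by the number of predictions of each class.
precision = correct / cm.sum(axis=0)

# F1: harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

# Accuracy: all correct predictions divided by all samples.
accuracy = cm.trace() / cm.sum()

print("precision:", precision)
print("recall:   ", recall)
print("f1:       ", f1)
print("accuracy: ", accuracy)
```

Because both classes happen to have 175 correct predictions, the precision of one class equals the recall of the other, and both F1 scores coincide with the accuracy.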

### Random Forest Classifier

In [54]:
rf = RandomForestClassifier()
rf.fit(x_train, y_train)
y_pred = rf.predict(x_test)
rf.score(x_test, y_test)
classification_report(y_test, y_pred, output_dict = True)



{'0': {'precision': 0.6375,
  'recall': 0.7083333333333334,
  'f1-score': 0.6710526315789473,
  'support': 216},
 '1': {'precision': 0.7149321266968326,
  'recall': 0.6448979591836734,
  'f1-score': 0.6781115879828327,
  'support': 245},
 'accuracy': 0.6746203904555315,
 'macro avg': {'precision': 0.6762160633484162,
  'recall': 0.6766156462585033,
  'f1-score': 0.67458210978089,
  'support': 461},
 'weighted avg': {'precision': 0.6786515640796615,
  'recall': 0.6746203904555315,
  'f1-score': 0.6748041376938105,
  'support': 461}}

### K-Nearest Neighbors Classifier

In [59]:
knn = KNeighborsClassifier()
knn.fit(x_train, y_train)
y_pred = knn.predict(x_test)
knn.score(x_test, y_test)
classification_report(y_test, y_pred, output_dict = True)

{'0': {'precision': 0.6290322580645161,
  'recall': 0.7222222222222222,
  'f1-score': 0.6724137931034483,
  'support': 216},
 '1': {'precision': 0.7183098591549296,
  'recall': 0.6244897959183674,
  'f1-score': 0.6681222707423581,
  'support': 245},
 'accuracy': 0.6702819956616052,
 'macro avg': {'precision': 0.6736710586097229,
  'recall': 0.6733560090702948,
  'f1-score': 0.6702680319229032,
  'support': 461},
 'weighted avg': {'precision': 0.6764791393381632,
  'recall': 0.6702819956616052,
  'f1-score': 0.6701330491154502,
  'support': 461}}

### Linear Support Vector Machine (SVM)

In [60]:
svc = LinearSVC()
svc.fit(x_train, y_train)
y_pred = svc.predict(x_test)
svc.score(x_test, y_test)
classification_report(y_test, y_pred, output_dict = True)



{'0': {'precision': 0.6554054054054054,
  'recall': 0.8981481481481481,
  'f1-score': 0.7578124999999999,
  'support': 216},
 '1': {'precision': 0.8666666666666667,
  'recall': 0.5836734693877551,
  'f1-score': 0.6975609756097562,
  'support': 245},
 'accuracy': 0.7310195227765727,
 'macro avg': {'precision': 0.761036036036036,
  'recall': 0.7409108087679517,
  'f1-score': 0.727686737804878,
  'support': 461},
 'weighted avg': {'precision': 0.7676809130171386,
  'recall': 0.7310195227765727,
  'f1-score': 0.7257916247817575,
  'support': 461}}
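
The four reports above give test accuracies of roughly 0.759 (Logistic Regression), 0.731 (Linear SVM), 0.675 (Random Forest) and 0.670 (K-Nearest Neighbors). One way to collect such numbers side by side is to loop over the models and store each score in a table. The sketch below does this on synthetic data from `make_classification`, so its scores will differ from the ones above. The `max_iter` and `random_state` settings are assumptions added for reproducibility and were not used in the original cells.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC

# Synthetic stand-in for the 19 features (not the retinopathy data).
X_syn, y_syn = make_classification(
    n_samples=500, n_features=19, n_informative=8, random_state=0
)
x_tr, x_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.4, stratify=y_syn, random_state=123
)

# Same four model families as above; hyperparameters here are assumptions.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=0),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Linear SVM": LinearSVC(max_iter=10000),
}

rows = []
for name, model in models.items():
    model.fit(x_tr, y_tr)
    # score() returns the mean accuracy on the held-out set.
    rows.append({"model": name, "accuracy": model.score(x_te, y_te)})

# One row per model, best accuracy first.
summary = (
    pd.DataFrame(rows)
    .sort_values("accuracy", ascending=False)
    .reset_index(drop=True)
)
print(summary)
```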