**Group 5**

1. Lionel Rozario
2. **FAIZALABBAS SAIYED**
3. Sai Karthik Sana
4. Nagul meera Shaik
5. Rahul Thodupunoori
6. Sasidhar Yellanki

# All Import Statements

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, ConfusionMatrixDisplay
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.calibration import CalibratedClassifierCV
from sklearn.naive_bayes import GaussianNB
from ucimlrepo import fetch_ucirepo
from sklearn import metrics

# fetch dataset
diabetic_retinopathy_debrecen = fetch_ucirepo(id=329)

# data (as pandas dataframes)
X = diabetic_retinopathy_debrecen.data.features
y = diabetic_retinopathy_debrecen.data.targets

# Direct CSV URL for the same dataset (features and the 'Class' column together)
url = 'https://archive.ics.uci.edu/static/public/329/data.csv'

# Read the CSV file into a DataFrame
df = pd.read_csv(url)
df.head()

# Probabilistic Classifier Code

In [None]:
# Assuming 'Class' column contains the target variable
X = df.drop('Class', axis=1)
y = df['Class']

# Mapping the labels for binary classification (0: No Diabetic Retinopathy, 1: Diabetic Retinopathy)
y_binary = y.apply(lambda x: 0 if x == 0 else 1)

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y_binary, test_size=0.2, random_state=42)

# Scaling the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Initializing and training the logistic regression model
base_model = LogisticRegression(max_iter=1000)
base_model.fit(X_train_scaled, y_train)  

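# Calibrating the already-fitted model's probabilities with Platt scaling (method='sigmoid').
# cv='prefit' tells CalibratedClassifierCV not to refit base_model; note that the
# calibrator is fitted here on the same training data as base_model.
calibrated_model = CalibratedClassifierCV(base_model, method='sigmoid', cv='prefit')
calibrated_model.fit(X_train_scaled, y_train)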

# Predicting probabilities on the test set
y_prob = calibrated_model.predict_proba(X_test_scaled)[:, 1]

# Converting probabilities to binary predictions based on a threshold (e.g., 0.5)
y_pred = (y_prob > 0.5).astype(int)

# Evaluating the model
probabilistic_accuracy = accuracy_score(y_test, y_pred)
conf_matrix_probabilistic = confusion_matrix(y_test, y_pred)
classification_rep = classification_report(y_test, y_pred)

***Binary Classifier:***

**Original Labels:**

* Initially, the dataset is separated into two parts: the features (X) and the original labels (y) from the 'Class' column, which indicate whether a person has diabetic retinopathy (1) or not (0).

**Modified Labels for Binary Classification:**

* For the binary classification task, a new set of labels (y_binary) is created, where 1 indicates the presence of diabetic retinopathy and 0 indicates its absence. Because 'Class' already takes only the values 0 and 1, the mapping leaves the labels unchanged and simply makes the binary encoding explicit.
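The relabeling step can be sketched on a small made-up Series standing in for df['Class'] (the values below are illustrative and not taken from the dataset). Under that assumption, it shows that the mapping leaves labels that are already 0/1 unchanged.

```python
import pandas as pd

# Hypothetical stand-in for df['Class']; the real column comes from the dataset.
labels = pd.Series([0, 1, 1, 0, 1], name="Class")

# Same mapping as in the notebook: 0 stays 0, anything else becomes 1.
labels_binary = labels.apply(lambda x: 0 if x == 0 else 1)

# For labels that are already 0/1 the mapping changes nothing.
print(labels_binary.tolist())        # [0, 1, 1, 0, 1]
print(labels_binary.equals(labels))  # True
```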

***Splitting the dataset:***
* We create two sets of data: one for training the model and one for testing its accuracy.

* train_test_split takes the features and the modified labels (y_binary), which indicate whether someone has diabetic retinopathy, and splits them into two groups.

* About 80% of the data is used to train the model. The remaining 20% is kept aside to test how well the model predicts on data it has not seen before. This separation helps check that the model generalizes to new cases rather than only fitting the ones it learned from.

* The use of random_state=42 makes the split reproducible, so results can be compared when the code is run multiple times.
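The effect of test_size and random_state described above can be checked with a short sketch. The arrays X_demo and y_demo are synthetic stand-ins, not the retinopathy data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 samples, 3 features, binary labels.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = rng.integers(0, 2, size=100)

# Two calls with the same random_state produce identical splits.
X_tr_a, X_te_a, y_tr_a, y_te_a = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)
X_tr_b, X_te_b, y_tr_b, y_te_b = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)

print(X_tr_a.shape, X_te_a.shape)      # (80, 3) (20, 3)
print(np.array_equal(X_te_a, X_te_b))  # True
```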

# Probabilistic Classification Results

In [None]:
print(f"Accuracy of Probabilistic Classification:\n{probabilistic_accuracy}")
print(f"Classification Report:\n{classification_rep}")
print(f"conf_matrix_probabilistic:\n{conf_matrix_probabilistic}")

cm_display_probabilistic = ConfusionMatrixDisplay(confusion_matrix=conf_matrix_probabilistic, display_labels=[0, 1])
cm_display_probabilistic.plot(cmap='Blues', values_format='d')
plt.title('Confusion Matrix Probabilistic Classification')
plt.show()

***Results:***

**Accuracy:**

* The model is about 72% accurate, meaning it correctly predicts whether a person has diabetic retinopathy or not for roughly 72% of the cases.

**Precision and Recall:**

* Precision is higher for cases with diabetic retinopathy (80%) than for cases without it (65%). When the model says someone has diabetic retinopathy, it is right 80% of the time.

* Recall for cases with diabetic retinopathy is 66%, so the model finds about two thirds of the actual cases and misses the rest. Recall for cases without the condition is higher, at 80%.

**Confusion Matrix:**

The confusion matrix shows specific counts of correct and incorrect predictions. For this instance, it correctly predicted 82 cases with diabetic retinopathy and 84 cases without, but it incorrectly classified 21 cases without the condition as having it.
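Accuracy, precision and recall can all be derived from the confusion matrix. The sketch below assumes the scikit-learn layout (rows are true labels, columns are predicted labels); the helper name per_class_metrics and the counts in cm_demo are hypothetical and chosen only for illustration.

```python
import numpy as np

def per_class_metrics(cm):
    """Return accuracy and per-class precision and recall from a confusion
    matrix whose rows are true labels and whose columns are predictions."""
    cm = np.asarray(cm, dtype=float)
    true_pos = np.diag(cm)
    precision = true_pos / cm.sum(axis=0)  # column sums = predicted counts
    recall = true_pos / cm.sum(axis=1)     # row sums = actual counts
    accuracy = true_pos.sum() / cm.sum()
    return accuracy, precision, recall

# Hypothetical counts, laid out as [[TN, FP], [FN, TP]].
cm_demo = [[40, 10],
           [20, 30]]
accuracy, precision, recall = per_class_metrics(cm_demo)
print(accuracy)   # 70 correct out of 100
print(precision)  # class 0: 40/60, class 1: 30/40
print(recall)     # class 0: 40/50, class 1: 30/50
```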

**Summary:**

The model is reasonably accurate, but its weakest figures are the 66% recall for cases with diabetic retinopathy and the 65% precision for cases without it; both reflect actual cases being predicted as not having the condition. Further adjustments, such as fine-tuning the model, could enhance its performance. The confusion matrix provides a detailed breakdown of correct and incorrect predictions.
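One possible adjustment, offered here as an assumption rather than something the notebook evaluates, is the decision threshold applied to the predicted probabilities. The sketch below uses made-up labels and probabilities and a hypothetical helper, precision_recall_at, to show how precision and recall for class 1 move as the threshold changes.

```python
import numpy as np

def precision_recall_at(y_true, y_prob, threshold):
    """Precision and recall for class 1 when predicting 1 if y_prob > threshold."""
    y_true = np.asarray(y_true)
    y_hat = (np.asarray(y_prob) > threshold).astype(int)
    tp = int(np.sum((y_hat == 1) & (y_true == 1)))
    fp = int(np.sum((y_hat == 1) & (y_true == 0)))
    fn = int(np.sum((y_hat == 0) & (y_true == 1)))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Made-up labels and probabilities, for illustration only.
y_true_demo = [0, 0, 1, 1, 1, 0, 1, 0]
y_prob_demo = [0.2, 0.45, 0.4, 0.6, 0.9, 0.55, 0.35, 0.1]

for t in (0.3, 0.5, 0.7):
    p, r = precision_recall_at(y_true_demo, y_prob_demo, t)
    print(f"threshold={t:.1f}  precision={p:.2f}  recall={r:.2f}")
```

On this toy data, a lower threshold catches every positive case at the cost of more false alarms, while a higher threshold does the opposite.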

# Euclidean Distance Classifier

In [None]:
X = df.drop('Class', axis=1)
Y = df['Class']

# Reconsidering the labels for binary classification
Y_binary = Y.apply(lambda x: 1 if x == 1 else 0)

# Converting X to a NumPy array
X = X.to_numpy()

# Splitting the data into training and testing sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y_binary, test_size=0.1, random_state=42)

# Creating a KNN classifier
model_distance = KNeighborsClassifier(n_neighbors=5, metric='euclidean')

# Fitting the model on the training data
model_distance.fit(X_train, Y_train)

# Making predictions on the test data
predictions_distance = model_distance.predict(X_test)

# Evaluating the model
accuracy_euclidean = accuracy_score(Y_test, predictions_distance)
classification_report_euclidean = classification_report(Y_test, predictions_distance)
conf_matrix_euclidean = confusion_matrix(Y_test, predictions_distance)


***Binary Classifier:***

**Original Labels:**

* Initially, the dataset is separated into two parts: the features (X) and the original labels (Y) from the 'Class' column, which indicate whether a person has diabetic retinopathy (1) or not (0).

**Modified Labels for Binary Classification:**

* For the binary classification task, a new set of labels (Y_binary) is created, where 1 indicates the presence of diabetic retinopathy and 0 indicates its absence. Because 'Class' already takes only the values 0 and 1, the mapping leaves the labels unchanged and simply makes the binary encoding explicit.

***Splitting the dataset:***
* We create two sets of data: one for training the model and one for testing its accuracy.

* train_test_split takes the features and the modified labels (Y_binary), which indicate whether someone has diabetic retinopathy, and splits them into two groups.

* About 90% of the data is used to train the model (test_size=0.1). The remaining 10% is kept aside to test how well the model predicts on data it has not seen before. This separation helps check that the model generalizes to new cases rather than only fitting the ones it learned from.

* The use of random_state=42 makes the split reproducible, so results can be compared when the code is run multiple times.

# Euclidean Distance Classifier Results

In [None]:
# Printing results
print(f"Accuracy (Euclidean Distance): {accuracy_euclidean}")
print(f"Classification Report (Euclidean Distance):\n{classification_report_euclidean}")
print(f"Confusion Matrix (Euclidean Distance):\n{conf_matrix_euclidean}")

cm_display_knn = ConfusionMatrixDisplay(confusion_matrix=conf_matrix_euclidean, display_labels=[0, 1])
cm_display_knn.plot(cmap='Blues', values_format='d')
plt.title('Confusion Matrix Euclidean')
plt.show()

***Results:***

**Accuracy:**

* The model is about 68% accurate, meaning it correctly predicts whether a person has diabetic retinopathy or not for around 68% of the cases.

**Precision and Recall:**

* Precision is higher for cases with diabetic retinopathy (75%) than for cases without it (62%). When the model says someone has diabetic retinopathy, it is right 75% of the time.

* Recall is higher for cases without the condition (73%) than for cases with it (65%). The model correctly identifies 65% of the people with diabetic retinopathy and misses the remaining 35%.

**Confusion Matrix:**

* The confusion matrix shows specific counts of correct and incorrect predictions. For instance, it correctly predicted 42 cases with diabetic retinopathy and 37 cases without, but it incorrectly classified 14 cases without the condition as having it.

**Summary:**

The KNN model with Euclidean distance is reasonably accurate. By recall, it is slightly better at identifying cases without diabetic retinopathy (73%) than cases with it (65%). The confusion matrix provides a detailed breakdown of correct and incorrect predictions, giving insights into where the model can be enhanced.
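To make the mechanism concrete, here is a minimal sketch of what a KNN classifier with Euclidean distance does for a single query point. The function knn_predict and the toy data are hypothetical; scikit-learn's KNeighborsClassifier is the implementation actually used in this notebook and may break ties differently.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=5):
    """Predict the label of x_query by majority vote among its k nearest
    training points under Euclidean distance."""
    X_train = np.asarray(X_train, dtype=float)
    diffs = X_train - np.asarray(x_query, dtype=float)
    distances = np.sqrt((diffs ** 2).sum(axis=1))
    nearest = np.argsort(distances)[:k]
    votes = Counter(np.asarray(y_train)[nearest].tolist())
    return votes.most_common(1)[0][0]

# Tiny made-up dataset: two well-separated clusters in 2-D.
X_demo = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
y_demo = [0, 0, 0, 1, 1, 1]

print(knn_predict(X_demo, y_demo, [0.5, 0.5], k=3))  # 0
print(knn_predict(X_demo, y_demo, [5.5, 5.5], k=3))  # 1
```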

# Cosine Similarity Classifier

In [None]:
# Separating features and target variable
X = df.drop('Class', axis=1)
y = df['Class']

# Splitting data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scaling numerical features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

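# Computing the cosine similarity between every test and training sample
# (for inspection only; the classifier below computes its own distances)
cosine_matrix = cosine_similarity(X_test_scaled, X_train_scaled)

# K-Nearest Neighbors with cosine distance (1 - cosine similarity).
# Without metric='cosine', KNeighborsClassifier defaults to Euclidean distance.
knn = KNeighborsClassifier(n_neighbors=11, metric='cosine')
knn.fit(X_train_scaled, y_train)
y_pred = knn.predict(X_test_scaled)

# Evaluating accuracy, classification report and confusion matrix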
cosine_accuracy = accuracy_score(y_test, y_pred)
classification_report_cosine = classification_report(y_test, y_pred)
conf_matrix_cosine = confusion_matrix(y_test, y_pred)

* KNN handles this binary classification task without explicit label modification, because the 'Class' column already contains only the labels 0 and 1. Cosine similarity is used as the distance metric between samples; it affects only how neighbors are found, so the model is trained and evaluated on the original labels directly.
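As a small illustration of how cosine similarity differs from Euclidean distance, the sketch below compares three made-up vectors. The helper cosine_similarity_vec is hypothetical; it computes the same quantity that sklearn.metrics.pairwise.cosine_similarity returns for a single pair of vectors.

```python
import numpy as np

def cosine_similarity_vec(a, b):
    """Cosine of the angle between two vectors: a.b / (|a| |b|)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

a = [1.0, 0.0]
b = [2.0, 0.0]  # same direction as a, different length
c = [0.0, 3.0]  # orthogonal to a

print(cosine_similarity_vec(a, b))              # 1.0
print(cosine_similarity_vec(a, c))              # 0.0
print(1 - cosine_similarity_vec(a, b))          # cosine distance: 0.0
print(float(np.linalg.norm(np.subtract(a, b)))) # Euclidean distance: 1.0
```

Cosine similarity depends only on direction, so a and b count as identical even though their Euclidean distance is 1. In scikit-learn, passing metric='cosine' to KNeighborsClassifier selects the corresponding cosine distance.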

***Splitting the Dataset:***

* We are dividing the dataset into two parts: one for training a machine learning model and the other for testing the model's accuracy. 

* About 80% of the data is used for training, teaching the model to make predictions. The remaining 20% is kept aside to see how well the model can predict on new data it hasn't seen before. 

* The use of random_state=42 ensures that the data is split in a consistent way, making it easier to compare results when the code is run multiple times.

# Cosine Similarity Classifier Results

In [None]:
print(f"Accuracy:\n{cosine_accuracy}")
print(f"classification_report_cosine:\n{classification_report_cosine}")
print(f"Confusion Matrix:\n{conf_matrix_cosine}")

cm_display_cosine = ConfusionMatrixDisplay(confusion_matrix=conf_matrix_cosine, display_labels=[0, 1])
cm_display_cosine.plot(cmap='Blues', values_format='d')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Confusion Matrix')
plt.show()

***Results:***

**Accuracy:**

* The model has an accuracy of about 67%, meaning it correctly predicts whether a person has diabetic retinopathy or not for approximately 67% of the cases.

**Precision and Recall:**

* Precision for predicting cases with diabetic retinopathy (class 1) is 73%, indicating that when the model predicts someone has diabetic retinopathy, it's correct about 73% of the time.

* Recall for predicting cases without diabetic retinopathy (class 0) is 71%, meaning the model correctly identifies 71% of the actual cases without diabetic retinopathy.

**Confusion Matrix:**

* The confusion matrix shows specific counts of correct and incorrect predictions. For this instance, it correctly predicted 82 cases with diabetic retinopathy and 73 cases without, but it incorrectly classified 30 cases without the condition as having it.

**Summary:**

The KNN model with cosine similarity performs reasonably well. The accuracy indicates fair overall performance, and the precision and recall values show how well the model identifies positive and negative cases. Fine-tuning the model is a potential way to improve its predictive capabilities. The confusion matrix further details the specific types of correct and incorrect predictions.
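One way such fine-tuning could look, offered as a sketch rather than something the notebook ran, is choosing n_neighbors by cross-validation. The data below is synthetic (make_classification), and the candidate values of k are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled retinopathy features and labels.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=42)

candidate_k = [3, 5, 7, 11, 15]
mean_scores = {}
for k in candidate_k:
    # Scaling lives inside the pipeline so each fold is scaled on its own training part.
    model = make_pipeline(
        StandardScaler(),
        KNeighborsClassifier(n_neighbors=k, metric="cosine"),
    )
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    mean_scores[k] = scores.mean()

best_k = max(mean_scores, key=mean_scores.get)
print(mean_scores)
print("best k:", best_k)
```

The same loop could be run on the training split of the real data, keeping the test split untouched for the final evaluation.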