# Diabetes Prediction Model

## Importing Required Libraries

In [1]:
import pandas as pd
import numpy as np
import pickle
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, accuracy_score

## Loading the Dataset

In [2]:
data = pd.read_csv('diabetes_risk_prediction_dataset.csv')
data.head() 

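| | Age | Gender | Polyuria | Polydipsia | sudden weight loss | weakness | Polyphagia | Genital thrush | visual blurring | Itching | Irritability | delayed healing | partial paresis | muscle stiffness | Alopecia | Obesity | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 40 | Male | No | Yes | No | Yes | No | No | No | Yes | No | Yes | No | Yes | Yes | Yes | Positive |
| 1 | 58 | Male | No | No | No | Yes | No | No | Yes | No | No | No | Yes | No | Yes | No | Positive |
| 2 | 41 | Male | Yes | No | No | Yes | Yes | No | No | Yes | No | Yes | No | Yes | Yes | No | Positive |
| 3 | 45 | Male | No | No | Yes | Yes | Yes | Yes | No | Yes | No | Yes | No | No | No | No | Positive |
| 4 | 60 | Male | Yes | Yes | Yes | Yes | Yes | No | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Positive |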


## Data Cleaning and Preprocessing

### Checking for Missing and Duplicate Values

In [3]:
data.isnull().sum()  # Missing values per column (not echoed: a cell shows only its last expression)
data.duplicated().any()  # True if any row exactly duplicates an earlier row

True

### Dropping Duplicates

In [4]:
data.drop_duplicates(inplace=True)
data.duplicated().any()  # Verify no duplicates remain

False

### Understanding Data Types 

In [5]:
data.dtypes 

Age                    int64
Gender                object
Polyuria              object
Polydipsia            object
sudden weight loss    object
weakness              object
Polyphagia            object
Genital thrush        object
visual blurring       object
Itching               object
Irritability          object
delayed healing       object
partial paresis       object
muscle stiffness      object
Alopecia              object
Obesity               object
class                 object
dtype: object

## Encoding Categorical Data

In [6]:
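le = LabelEncoder()
# Encode every column except the numeric 'Age'. LabelEncoder assigns codes in
# sorted label order, so a column holding 'No' and 'Yes' becomes 0 and 1.
columns_to_encode = [col for col in data.columns if col != 'Age']
for column in columns_to_encode:
    # le is refit on each column, so afterwards it only holds the mapping for 'class'
    data[column] = le.fit_transform(data[column])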
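
Because the loop above refits a single `le` on every column, only the mapping for the last column (`class`) is still available afterwards. The following is a minimal sketch of one way to keep every mapping, assuming a small hypothetical frame `toy` in place of the real dataset and a hypothetical `encoders` dictionary; neither name appears in the original notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical three-row stand-in for the real dataset (illustration only).
toy = pd.DataFrame({
    "Age": [40, 58, 41],
    "Gender": ["Male", "Female", "Male"],
    "Polyuria": ["No", "Yes", "Yes"],
    "class": ["Positive", "Negative", "Positive"],
})

# Keep one fitted encoder per categorical column so each mapping
# can be inspected or inverted later.
encoders = {}
for column in toy.columns.drop("Age"):
    encoders[column] = LabelEncoder().fit(toy[column])
    toy[column] = encoders[column].transform(toy[column])

# Codes follow the sorted order of the labels within each column.
mappings = {
    column: {str(label): code for code, label in enumerate(encoder.classes_)}
    for column, encoder in encoders.items()
}
print(mappings)
```

Keeping the fitted encoders also makes it possible to call `inverse_transform` on a column, or to encode new inputs consistently at prediction time.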

## Splitting Dataset into Features and Target

In [7]:
X = data.iloc[:, :-1].values  # Features
y = data.iloc[:, -1].values  # Target

## Splitting Dataset into Training and Testing Sets

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
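
The split above does not pass `stratify`, so the class balance of the test set depends on the random seed. As a sketch on synthetic labels (the arrays `X_demo` and `y_demo` are hypothetical, not the real data), this is how `stratify` could be used to preserve class proportions in both subsets:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 samples, 60 of class 0 and 40 of class 1.
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array([0] * 60 + [1] * 40)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# Class counts per subset; stratification keeps the 60/40 ratio in each.
train_counts = np.bincount(y_tr)
test_counts = np.bincount(y_te)
print(train_counts, test_counts)
```

In the notebook itself, the equivalent change would be adding `stratify=y` to the existing `train_test_split` call; whether that alters the reported accuracies has not been checked here.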

## Model Training

### Logistic Regression

In [9]:
logistic_classifier = LogisticRegression(max_iter=1000)
logistic_classifier.fit(X_train, y_train)

### K-Nearest Neighbors

In [10]:
KNN_classifier = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
KNN_classifier.fit(X_train, y_train)

### Support Vector Machine (Linear Kernel)

In [11]:
SVM_classifier = SVC(kernel='linear', random_state=0)
SVM_classifier.fit(X_train, y_train)

### Support Vector Machine (RBF Kernel)

In [12]:
kernal_SVM_classifier = SVC(kernel='rbf', random_state=0)
kernal_SVM_classifier.fit(X_train, y_train)

### Naive Bayes

In [13]:
Naive_Bayes_classifier = GaussianNB()
Naive_Bayes_classifier.fit(X_train, y_train)

### Decision Tree

In [14]:
Decision_Tree_classifier = DecisionTreeClassifier(random_state=0)
Decision_Tree_classifier.fit(X_train, y_train)

### Random Forest

In [15]:
Random_Forest_classifier = RandomForestClassifier(n_estimators=10, criterion='entropy', random_state=0)
Random_Forest_classifier.fit(X_train, y_train)

## Model Evaluation

### Predictions

In [17]:
logistic_y_pred = logistic_classifier.predict(X_test)
KNN_y_pred = KNN_classifier.predict(X_test)
SVM_y_pred = SVM_classifier.predict(X_test)
kernal_SVM_y_pred = kernal_SVM_classifier.predict(X_test)
naive_bayes_y_pred = Naive_Bayes_classifier.predict(X_test)
decision_tree_y_pred = Decision_Tree_classifier.predict(X_test)
random_forest_y_pred = Random_Forest_classifier.predict(X_test)

### Accuracy and Confusion Matrices

In [18]:
logistic_accuracy = accuracy_score(y_test, logistic_y_pred)
KNN_accuracy = accuracy_score(y_test, KNN_y_pred)
SVM_accuracy = accuracy_score(y_test, SVM_y_pred)
KSVM_accuracy = accuracy_score(y_test, kernal_SVM_y_pred)
naive_bayes_accuracy = accuracy_score(y_test, naive_bayes_y_pred)
decision_tree_accuracy = accuracy_score(y_test, decision_tree_y_pred)
random_forest_accuracy = accuracy_score(y_test, random_forest_y_pred)

print("Logistic Regression Accuracy:", logistic_accuracy)
print("KNN Accuracy:", KNN_accuracy)
print("SVM Accuracy:", SVM_accuracy)
print("Kernel SVM Accuracy:", KSVM_accuracy)
print("Naive Bayes Accuracy:", naive_bayes_accuracy)
print("Decision Tree Accuracy:", decision_tree_accuracy)
print("Random Forest Accuracy:", random_forest_accuracy)

Logistic Regression Accuracy: 0.8627450980392157
KNN Accuracy: 0.7058823529411765
SVM Accuracy: 0.9411764705882353
Kernel SVM Accuracy: 0.7450980392156863
Naive Bayes Accuracy: 0.8823529411764706
Decision Tree Accuracy: 0.8431372549019608
Random Forest Accuracy: 0.9019607843137255
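
Accuracy alone does not show which kind of error a model makes, and the imported `confusion_matrix` is never called above. A minimal sketch of how it could be applied, using hypothetical label arrays `y_true_demo` and `y_pred_demo` rather than the notebook's actual predictions:

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels: 1 = Positive, 0 = Negative (illustration only).
y_true_demo = np.array([1, 1, 1, 0, 0, 1, 0, 1, 0, 0])
y_pred_demo = np.array([1, 0, 1, 0, 1, 1, 0, 1, 0, 0])

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]].
cm = confusion_matrix(y_true_demo, y_pred_demo)
tn, fp, fn, tp = cm.ravel()

accuracy = accuracy_score(y_true_demo, y_pred_demo)
recall = tp / (tp + fn)  # share of true positives that were detected
print(cm)
print("accuracy:", accuracy, "recall:", recall)
```

With the notebook's own variables, the corresponding call would be `confusion_matrix(y_test, SVM_y_pred)`, and likewise for the other prediction arrays. For a screening task, false negatives (missed positive cases) are usually the costlier error, which is why recall is worth reporting next to accuracy.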


## Comparing Models

In [19]:
models = pd.DataFrame(
    {
        'Classifier': ['Logistic Regression', 'KNN', 'SVM', 'Kernel SVM', 'Naive Bayes', 'Decision Tree', 'Random Forest'],
        'Accuracy': [logistic_accuracy, KNN_accuracy, SVM_accuracy, KSVM_accuracy, naive_bayes_accuracy, decision_tree_accuracy, random_forest_accuracy]
    }
)

models.sort_values(by='Accuracy', ascending=False)

| | Classifier | Accuracy |
|---|---|---|
| 2 | SVM | 0.941176 |
| 6 | Random Forest | 0.901961 |
| 4 | Naive Bayes | 0.882353 |
| 0 | Logistic Regression | 0.862745 |
| 5 | Decision Tree | 0.843137 |
| 3 | Kernel SVM | 0.745098 |
| 1 | KNN | 0.705882 |


## Recommendations
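The accuracies above come from a single 80/20 split and are consistent with a test set of 51 rows (for example, 0.9412 = 48/51), so one misclassified sample shifts accuracy by about two percentage points. Based on them, the following models are recommended for deployment: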

1. Best Model: **SVM** with an accuracy of 94.12%
Recommendation: The linear-kernel SVM achieves the highest test accuracy of the seven models compared, which makes it the best choice for deployment here.

2. Second Best Model: **Random Forest** with an accuracy of 90.20%
Recommendation: Random Forest, trained here with 10 trees and the entropy criterion, is the runner-up. Averaging over many trees makes it less prone to overfitting than a single decision tree, so it is a strong alternative when a non-linear model is preferred.

3. Third Best Model: **Naive Bayes** with an accuracy of 88.24%
Recommendation: Naive Bayes performs well and is fast to train. It is a good option when a simple model with quick results is needed, and it works best when the features are close to conditionally independent given the class.

Additional Considerations:
- **Logistic Regression (86.27%)**: Logistic Regression performs solidly and is the most interpretable model here, since each coefficient shows the direction and strength of a feature's effect on the predicted log-odds. It is a good choice when predictions must be explained.

- **Decision Tree (84.31%)**: A decision tree is easy to read as a set of if/else rules, but it overfits readily when its depth is left unconstrained, as it is here with the default settings. It can still offer useful insight into which features drive the prediction.

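- **Kernel SVM (74.51%)**: The RBF-kernel SVM has the second-lowest accuracy. A likely cause is that the features were never scaled (`StandardScaler` is imported but not applied), so `Age` sits on a much larger scale than the 0/1 encoded columns. RBF kernels are sensitive to feature scale as well as to `C` and `gamma`, which were left at their defaults. It is not recommended for deployment in its current form.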

- **K-Nearest Neighbors (70.59%)**: KNN has the lowest accuracy and is not recommended for deployment in its current form. It classifies by Euclidean distance, so the unscaled `Age` column likely dominates the 0/1 encoded features. KNN is also slower at prediction time on larger datasets because it must search the stored training set.
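
Distance-based models such as KNN can be sensitive to feature scale, and a single train/test split gives a noisy estimate on a small dataset. The sketch below uses synthetic data, not the diabetes dataset, to illustrate both points: it compares KNN with and without `StandardScaler` under 5-fold cross-validation. The names `X_demo`, `y_demo`, `wide`, and `binary` are hypothetical, and the wide-scale noise column is an exaggerated stand-in for an unscaled numeric feature.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Six binary features; the label depends only on the first three.
binary = rng.integers(0, 2, size=(400, 6))
y_demo = (binary[:, :3].sum(axis=1) >= 2).astype(int)

# One uninformative feature on a much larger numeric scale.
wide = rng.normal(0.0, 100.0, size=(400, 1))
X_demo = np.hstack([wide, binary])

raw_knn = KNeighborsClassifier(n_neighbors=5)
# The pipeline refits the scaler inside each fold, so no test data leaks into it.
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

raw_scores = cross_val_score(raw_knn, X_demo, y_demo, cv=5)
scaled_scores = cross_val_score(scaled_knn, X_demo, y_demo, cv=5)

print("KNN without scaling:", raw_scores.mean())
print("KNN with scaling:   ", scaled_scores.mean())
```

Applying the same pattern to the notebook's `X` and `y` would show whether scaling changes the ranking of KNN and the RBF-kernel SVM; that experiment is not part of the original results.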

## Save .pkl Files

In [23]:
best_model = SVM_classifier

# Save the model
with open('diabetes_model.pkl', 'wb') as model_file:
    pickle.dump(best_model, model_file)
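
To reuse a saved model, it can be read back with `pickle.load`. The following sketch trains a tiny stand-in classifier and round-trips it through a temporary file; the names `demo_model`, `restored`, `X_demo`, and `y_demo` are hypothetical and the data is made up for illustration.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.svm import SVC

# Tiny made-up training set, used only to obtain a fitted model.
X_demo = np.array(
    [[0, 0], [0, 1], [1, 0], [1, 1], [2, 2], [2, 3], [3, 2], [3, 3]],
    dtype=float,
)
y_demo = np.array([0, 0, 0, 0, 1, 1, 1, 1])
demo_model = SVC(kernel="linear", random_state=0).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.pkl")
    # Save the fitted model, then load it back from disk.
    with open(path, "wb") as model_file:
        pickle.dump(demo_model, model_file)
    with open(path, "rb") as model_file:
        restored = pickle.load(model_file)

# The restored model should reproduce the original predictions.
same = np.array_equal(restored.predict(X_demo), demo_model.predict(X_demo))
print("predictions match:", same)
```

For `diabetes_model.pkl`, new inputs would need the same column order and the same 0/1 encoding used during training. Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that wrote them.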