<a href="https://www.kaggle.com/code/manishkr1754/diabetes-prediction?scriptVersionId=142553674" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

---
<center><h1>Diabetes Prediction</h1></center>
<center><h3>Part of 30 Days 30 ML Projects Challenge</h3></center>

---

## 1) Understanding Problem Statement
---

**Diabetes** is a chronic medical condition that affects how your body processes glucose (sugar) in the blood. It occurs when either the pancreas doesn't produce enough insulin (a hormone that regulates blood sugar) or when the body can't effectively use the insulin it produces. This results in elevated blood sugar levels, which can lead to various health complications if not managed properly.

The goal of this project is to use machine learning **to predict whether an individual has diabetes based on certain medical variables**. This is a **binary classification problem**. Such a prediction can support early diagnosis and intervention, which can improve the management of diabetes.

## 2) Understanding Data
---
The project uses a diabetes dataset that contains several medical variables (independent variables) and one outcome variable (dependent variable) for each individual.

### Dataset Description:

- **Pregnancies:** The number of times a woman has been pregnant.
- **Glucose:** Plasma glucose concentration measured 2 hours after an oral glucose tolerance test.
- **BloodPressure:** Diastolic blood pressure (in mmHg).
- **SkinThickness:** Triceps skin fold thickness (in mm).
- **Insulin:** 2-hour serum insulin level (in μU/ml).
- **BMI (Body Mass Index):** A measure of body fat based on weight in kilograms divided by height in meters squared.
- **Age:** The age of the individual in years.
- **DiabetesPedigreeFunction:** A score that indicates the likelihood of diabetes based on family history.
- **Outcome:** The outcome variable with two possible values:
  - 0: Indicates that the individual does not have diabetes.
  - 1: Indicates that the individual has diabetes.
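
The BMI definition in the list above can be written as a formula. The worked numbers below (70 kg, 1.75 m) are hypothetical and are not taken from the dataset:

$$
\text{BMI} = \frac{\text{weight (kg)}}{\left(\text{height (m)}\right)^2}
\qquad\text{e.g.}\qquad
\frac{70}{1.75^2} = \frac{70}{3.0625} \approx 22.9
$$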


## 3) Getting System Ready
---
Importing required libraries


In [None]:
import numpy as np
import pandas as pd

# for model building
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

## 4) Data Eyeballing
---

### Loading Data

In [None]:
diabetes_data = pd.read_csv('Datasets/Day2_Diabetes_Data.csv') 

In [None]:
diabetes_data

In [None]:
print('The size of Dataframe is: ', diabetes_data.shape)
print('-'*100)
print('The Column Name, Record Count and Data Types are as follows: ')
diabetes_data.info()
print('-'*100)

In [None]:
# Defining numerical & categorical columns
numeric_features = [feature for feature in diabetes_data.columns if diabetes_data[feature].dtype != 'O']
categorical_features = [feature for feature in diabetes_data.columns if diabetes_data[feature].dtype == 'O']

# print columns
print('We have {} numerical features : {}'.format(len(numeric_features), numeric_features))
print('\nWe have {} categorical features : {}'.format(len(categorical_features), categorical_features))

In [None]:
print('Missing values in each column of the DataFrame are as follows: ')
print('-'*100)
# Count of missing values per column, largest first
total = diabetes_data.isnull().sum().sort_values(ascending=False)
# Share of missing values per column, in percent
percent = (diabetes_data.isnull().mean() * 100).sort_values(ascending=False)
pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])

In [None]:
print('Summary Statistics of numerical features for DataFrame are as follows:')
print('-'*100)
diabetes_data.describe()

In [None]:
diabetes_data.groupby('Outcome').mean()

In [None]:
diabetes_data['Outcome'].value_counts()

### No Data Cleaning and Preprocessing Needed
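
The missing-value check above only counts `NaN` entries. In medical data, a zero in a column such as glucose or BMI cannot be a real measurement and may be a placeholder for a missing value. Whether that applies to this file is not shown in the notebook, so the following is only a sketch of how one might check. The `sample` frame and its values are made up for illustration, and the list of columns in which zero is treated as implausible is an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame standing in for diabetes_data (values are made up).
sample = pd.DataFrame({
    "Pregnancies": [6, 1, 8, 0],  # zero is a valid value in this column
    "Glucose": [148, 0, 183, 89],
    "BloodPressure": [72, 66, 0, 66],
    "SkinThickness": [35, 29, 0, 23],
    "Insulin": [0, 0, 0, 94],
    "BMI": [33.6, 26.6, 0.0, 28.1],
})

# Columns where a zero is assumed to be physiologically implausible.
zero_means_missing = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]

# Count the zeros in each of those columns.
zero_counts = (sample[zero_means_missing] == 0).sum()
print(zero_counts)

# One possible treatment: mark the zeros as NaN, then fill with the column median.
cleaned = sample.copy()
cleaned[zero_means_missing] = cleaned[zero_means_missing].replace(0, np.nan)
cleaned[zero_means_missing] = cleaned[zero_means_missing].fillna(
    cleaned[zero_means_missing].median()
)
print(cleaned)
```

If the counts come back as zero on the real data, no further cleaning is needed for this issue. Median imputation is only one option; dropping the affected rows is another.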

## 5) Model Building
---

### Creating Feature Matrix (Independent Variables) & Target Variable (Dependent Variable)

In [None]:
# separating the data and labels
X = diabetes_data.drop(columns='Outcome') # Feature matrix
y = diabetes_data['Outcome'] # Target variable

In [None]:
X

In [None]:
y

### Data Standardization

In [None]:
scaler = StandardScaler()

In [None]:
scaler.fit(X)

In [None]:
standardized_data = scaler.transform(X)

In [None]:
standardized_data

In [None]:
X = standardized_data

In [None]:
X

### Train-Test Split

In [None]:
# Stratified 80/20 split; train_test_split was already imported in section 3
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=45)

In [None]:
print(X.shape, X_train.shape, X_test.shape)

In [None]:
print(y.shape, y_train.shape, y_test.shape)
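
In the cells above, the scaler is fitted on all rows before the split, so the test rows contribute to the mean and standard deviation used for scaling. A common alternative is to split first and let a scikit-learn `Pipeline` fit the scaler on the training rows only. The sketch below uses synthetic data from `make_classification` as a stand-in for the diabetes features; the names `X_demo`, `y_demo` and `pipe` are illustrative and do not appear in the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 8 diabetes features and the binary Outcome column.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=45)

# Split first, so the test rows stay unseen during preprocessing.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=45
)

# The pipeline fits StandardScaler on the training rows only, then the classifier.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_tr, y_tr)

# At prediction time the test rows are scaled with the training statistics.
test_accuracy = pipe.score(X_te, y_te)
print(f"Test accuracy: {test_accuracy:.3f}")
```

The same pipeline object can be passed to cross-validation or hyperparameter search utilities, which then repeat the scaling inside each fold.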

### Model Comparison : Training & Evaluation

In [None]:
# Model classes to compare; each is instantiated with its default hyperparameters
models = [LogisticRegression, SVC, DecisionTreeClassifier, RandomForestClassifier]
accuracy_scores = []
precision_scores = []
recall_scores = []
f1_scores = []

for model in models:
    # Fit on the training split and predict on the held-out test split
    classifier = model().fit(X_train, y_train)
    y_pred = classifier.predict(X_test)

    # Precision, recall and F1 are computed for the positive class (Outcome = 1)
    accuracy_scores.append(accuracy_score(y_test, y_pred))
    precision_scores.append(precision_score(y_test, y_pred))
    recall_scores.append(recall_score(y_test, y_pred))
    f1_scores.append(f1_score(y_test, y_pred))

In [None]:
classification_metrics_df = pd.DataFrame({
    "Model": ["Logistic Regression", "SVM", "Decision Tree", "Random Forest"],
    "Accuracy": accuracy_scores,
    "Precision": precision_scores,
    "Recall": recall_scores,
    "F1 Score": f1_scores
})

classification_metrics_df.set_index('Model', inplace=True)
classification_metrics_df
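
The table above comes from a single train-test split, so small differences between models may partly reflect which rows landed in the test set. As a hedged sketch of a more stable comparison, the same four metrics can be averaged over stratified cross-validation folds. The data here is synthetic (`make_classification`), and the fixed `random_state` values and the names `candidates` and `cv_metrics_df` are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the diabetes features and labels.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=45)

# Seeding the tree-based models makes their scores repeatable.
candidates = {
    "Logistic Regression": LogisticRegression(),
    "SVM": SVC(),
    "Decision Tree": DecisionTreeClassifier(random_state=45),
    "Random Forest": RandomForestClassifier(random_state=45),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=45)
scoring = ["accuracy", "precision", "recall", "f1"]

rows = {}
for name, estimator in candidates.items():
    # Scaling sits inside the pipeline, so it is refitted on each training fold.
    pipe = make_pipeline(StandardScaler(), estimator)
    scores = cross_validate(pipe, X_demo, y_demo, cv=cv, scoring=scoring)
    rows[name] = {metric: scores[f"test_{metric}"].mean() for metric in scoring}

cv_metrics_df = pd.DataFrame.from_dict(rows, orient="index")
print(cv_metrics_df.round(3))
```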

### Inference
Judged on accuracy alone, the best model is the Random Forest Classifier. Note that the Decision Tree and Random Forest models are trained without a fixed `random_state`, so their scores, and with them this ranking, can change between runs.

The choice of the best model may depend on factors such as the balance between precision and recall as well as overall accuracy. Here, where the primary goal is to predict whether an individual has diabetes, the chosen model should prioritize predictive accuracy and reliability.

- Based on the metrics above and the goal of accurately predicting diabetes, the **SVM (Support Vector Machine) model** appears to perform best: it has the highest accuracy, precision and F1 score among the models evaluated.

- **Logistic Regression** and **Random Forest** also perform well and could be considered as alternatives.


**`Note:`** In real-world projects, the best model is not chosen on accuracy alone; other evaluation metrics, the business context and model interpretability also need to be taken into account.
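
The balance between precision and recall discussed in this section can be read directly from a confusion matrix: false negatives are individuals with diabetes that the model misses, and false positives are healthy individuals it flags. The sketch below shows one way to inspect this, using the `confusion_matrix` function already imported in section 3 together with `classification_report`. The data is synthetic, and the class weights and variable names are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic, mildly imbalanced stand-in for the diabetes data.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, weights=[0.65, 0.35], random_state=45
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=45
)

clf = SVC().fit(X_tr, y_tr)
y_hat = clf.predict(X_te)

# With labels=[0, 1] the flattened matrix is ordered as tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_te, y_hat, labels=[0, 1]).ravel()
print(f"True negatives: {tn}, false positives: {fp}")
print(f"False negatives: {fn}, true positives: {tp}")

# Per-class precision, recall and F1 in one table.
print(classification_report(y_te, y_hat, target_names=["No diabetes", "Diabetes"]))
```

When missed cases are the costlier error, recall for the positive class is the number to watch; when false alarms are costlier, precision matters more.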