# Cloud Faculty Institute Workshop: Predict Diabetes from Medical Records

## Overview
This project is an entry in a Kaggle competition on predicting diabetes from medical records. The dataset originates from the National Institute of Diabetes and Digestive and Kidney Diseases, and the goal is to predict whether a patient has diabetes from a set of diagnostic measurements.

## Dataset Description

### Context
The dataset consists of medical records from female patients who are at least 21 years old and of Pima Indian heritage. The primary objective is to predict the onset of diabetes using various medical predictor variables.

### Content
The dataset includes eight predictor variables and one target variable, `Outcome`. The columns are:
- `Pregnancies`: Number of pregnancies the patient has had
- `Glucose`: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
- `BloodPressure`: Diastolic blood pressure (mm Hg)
- `SkinThickness`: Triceps skin fold thickness (mm)
- `Insulin`: 2-Hour serum insulin (mu U/ml)
- `BMI`: Body mass index (weight in kg/(height in m)^2)
- `DiabetesPedigreeFunction`: Diabetes pedigree function (a function which scores likelihood of diabetes based on family history)
- `Age`: Age (years)
- `Outcome`: Class variable (0 or 1) indicating whether the patient has diabetes (1) or not (0)

### Files
- `train.csv`: The training dataset containing the predictor variables and the target variable.
- `test.csv`: The test dataset containing the predictor variables but not the target variable.
- `sample_submission.csv`: A sample submission file in the correct format for predictions.

## Acknowledgements
The dataset was referenced from:
Smith, J.W., Everhart, J.E., Dickson, W.C., Knowler, W.C., & Johannes, R.S. (1988). Using the ADAP learning algorithm to forecast the onset of diabetes mellitus. In Proceedings of the Symposium on Computer Applications and Medical Care (pp. 261--265). IEEE Computer Society Press.

## Objective
The challenge is to build a machine learning model that can accurately predict whether or not the patients in the test dataset have diabetes. This involves exploring the data, preprocessing it, selecting appropriate models, and fine-tuning them to achieve the best performance.

In [8]:
# Importing the libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

In [9]:
# Loading the dataset
train = pd.read_csv('data/train.csv')
test = pd.read_csv('data/test.csv')

train.head()

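| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome | Id |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 | 0 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 | 1 |
| 2 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 | 3 |
| 3 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 | 4 |
| 4 | 5 | 116 | 74 | 0 | 0 | 25.6 | 0.201 | 30 | 0 | 5 |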


In [10]:
# Do we need to impute any missing data?
train.isna().sum()

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
Id                          0
dtype: int64
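
A zero count from `isna()` only rules out literal `NaN` values. The rows printed by `train.head()` contain zeros in `SkinThickness` and `Insulin`, which are not plausible measurements, so zeros may be standing in for missing data. The sketch below retypes those five rows and counts zeros per column. Treating zero as "missing" in these columns is an assumption that should be checked against the competition's data description.

```python
import pandas as pd

# The five rows shown by train.head(), retyped so the snippet runs on its own.
sample = pd.DataFrame({
    "Glucose": [148, 85, 89, 137, 116],
    "BloodPressure": [72, 66, 66, 40, 74],
    "SkinThickness": [35, 29, 23, 35, 0],
    "Insulin": [0, 0, 94, 168, 0],
    "BMI": [33.6, 26.6, 28.1, 43.1, 25.6],
})

# Assumption: a zero in these columns is a placeholder rather than a real measurement.
zero_counts = (sample == 0).sum()
print(zero_counts)
```

Running the same expression on the full `train` DataFrame (restricted to these columns) would show how widespread the issue is before deciding whether to impute.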

## Data Preprocessing and Model Training

In this section, we will perform several key steps to prepare the data and train our first machine learning model, a K-Nearest Neighbors (KNN) classifier. The following steps are included:

1. **Dropping Unnecessary Columns**:
   - We will drop the `Id` column from the training dataset as it does not provide any useful information for predicting diabetes.

2. **Data Splitting**:
   - We will split the training data into training and testing sets. This helps us to evaluate the model on unseen data and get an estimate of its performance.
   - We use `train_test_split` with stratification to ensure that the distribution of the target variable (`Outcome`) is the same in both the training and testing sets.

3. **Feature Normalization**:
   - We will standardize the feature values using `StandardScaler`, which rescales each feature to zero mean and unit variance. This matters for distance-based algorithms like KNN, where features measured on larger scales would otherwise dominate the distance.

4. **Training the KNN Model**:
   - We will initialize and train a K-Nearest Neighbors classifier with `n_neighbors=5`.
   - The model will be trained on the normalized training data.

5. **Model Evaluation**:
   - We will evaluate the model's performance on the held-out split (not `test.csv`, which has no labels) by calculating its accuracy.
   - Additionally, we will print the confusion matrix and classification report to get a detailed understanding of the model's performance in terms of precision, recall, and F1-score.
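
As a small illustration of the scaling step, the sketch below fits `StandardScaler` on a toy training matrix and applies the same fitted statistics to a toy test row. The toy values are taken from two columns of the preview rows and the variable names are hypothetical. The point is that the scaler's mean and standard deviation come from the training rows only, and that the result matches the manual formula `(x - mean) / std`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data: column 0 is on a glucose-like scale, column 1 on a pedigree-like scale.
X_train_toy = np.array([[148.0, 0.627], [85.0, 0.351], [89.0, 0.167], [137.0, 2.288]])
X_test_toy = np.array([[116.0, 0.201]])

scaler_toy = StandardScaler()
X_train_scaled = scaler_toy.fit_transform(X_train_toy)  # learns mean/std from training rows
X_test_scaled = scaler_toy.transform(X_test_toy)        # reuses the training mean/std

# Manual standardization with the population standard deviation (ddof=0).
manual = (X_train_toy - X_train_toy.mean(axis=0)) / X_train_toy.std(axis=0)
print(np.allclose(X_train_scaled, manual))
print(X_test_scaled)
```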


In [11]:
# Dropping the Id column
train = train.drop(columns=["Id"])

# Preprocessing: hold out 25% of the rows (the train_test_split default), stratified on Outcome
X_train, X_test, y_train, y_test = train_test_split(
    train.drop(columns=["Outcome"]),
    train["Outcome"],
    stratify=train["Outcome"],
    random_state=42,
)

# Normalizing
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Training
knn = KNN(n_neighbors=5)
knn.fit(X_train, y_train)

# Evaluating the model
y_pred = knn.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("")
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("")
print("Classification Report:\n", classification_report(y_test, y_pred))

Accuracy: 0.8014184397163121

Confusion Matrix:
 [[78 13]
 [15 35]]

Classification Report:
               precision    recall  f1-score   support

           0       0.84      0.86      0.85        91
           1       0.73      0.70      0.71        50

    accuracy                           0.80       141
   macro avg       0.78      0.78      0.78       141
weighted avg       0.80      0.80      0.80       141





#### Accuracy
- **0.8014**: The model correctly predicts 80.14% of the cases.

#### Confusion Matrix
- **True Negatives (TN)**: 78
- **False Positives (FP)**: 13
- **False Negatives (FN)**: 15
- **True Positives (TP)**: 35

#### Classification Report

##### Class 0 (No Diabetes)
- **Precision**: 0.84
- **Recall**: 0.86
- **F1-Score**: 0.85

##### Class 1 (Diabetes)
- **Precision**: 0.73
- **Recall**: 0.70
- **F1-Score**: 0.71

#### Summary Metrics
- **Accuracy**: 0.80
- **Macro Average Precision**: 0.78
- **Macro Average Recall**: 0.78
- **Macro Average F1-Score**: 0.78
- **Weighted Average Precision**: 0.80
- **Weighted Average Recall**: 0.80
- **Weighted Average F1-Score**: 0.80
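
The figures in this section follow directly from the four confusion-matrix counts. The short calculation below reproduces them using the standard definitions of precision, recall, and F1, with the macro average as the plain mean over classes and the weighted average weighted by class support.

```python
# Counts from the confusion matrix: rows are true classes, columns are predictions.
tn, fp, fn, tp = 78, 13, 15, 35

accuracy = (tp + tn) / (tp + tn + fp + fn)

# Class 1 (Diabetes)
precision_1 = tp / (tp + fp)
recall_1 = tp / (tp + fn)
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

# Class 0 (No Diabetes): the roles of the counts are mirrored.
precision_0 = tn / (tn + fn)
recall_0 = tn / (tn + fp)
f1_0 = 2 * precision_0 * recall_0 / (precision_0 + recall_0)

# Macro average: unweighted mean. Weighted average: weighted by support.
support_0, support_1 = tn + fp, fn + tp
macro_f1 = (f1_0 + f1_1) / 2
weighted_f1 = (support_0 * f1_0 + support_1 * f1_1) / (support_0 + support_1)

print(f"accuracy={accuracy:.4f}")
print(f"class 1: precision={precision_1:.2f} recall={recall_1:.2f} f1={f1_1:.2f}")
print(f"class 0: precision={precision_0:.2f} recall={recall_0:.2f} f1={f1_0:.2f}")
print(f"macro f1={macro_f1:.2f} weighted f1={weighted_f1:.2f}")
```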

The model achieves an overall accuracy of 80.14% on the held-out split. It predicts the absence of diabetes (class 0, recall 0.86) more reliably than its presence (class 1, recall 0.70): 15 of the 50 diabetic patients in the split are missed. Further tuning may improve these results.
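
One way to approach the tuning mentioned above is a cross-validated grid search over `n_neighbors`, with the scaler wrapped in a `Pipeline` so that it is refit inside each fold. This is a minimal sketch, not the notebook's method: it runs on synthetic data from `make_classification` as a stand-in for the real features, and the candidate values of `n_neighbors` are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the eight predictors and the Outcome column (assumption).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=5, random_state=42
)

# Scaling lives inside the pipeline so each CV fold fits it on its own training part.
pipe = Pipeline([("scaler", StandardScaler()), ("knn", KNeighborsClassifier())])
param_grid = {"knn__n_neighbors": [3, 5, 7, 9, 11, 15]}  # arbitrary candidates

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(pipe, param_grid, cv=cv, scoring="accuracy")
search.fit(X_demo, y_demo)

print("Best n_neighbors:", search.best_params_["knn__n_neighbors"])
print("Mean CV accuracy:", round(search.best_score_, 3))
```

On the real data, `X_demo` and `y_demo` would be replaced by the unscaled feature columns and `Outcome`; a scorer such as `"recall"` may be preferable to accuracy if missed diabetes cases are the main concern.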

### What if we categorize BMI?

In [12]:
# Categorizing BMI
train['BMI_Category'] = pd.cut(train['BMI'], bins=[0, 18.5, 24.9, 29.9, float('inf')], labels=['Underweight', 'Normal', 'Overweight', 'Obese'])

# Creating dummy variables for the BMI categories
train = pd.get_dummies(train, columns=['BMI_Category'], drop_first=True)
train.head()

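| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome | BMI_Category_Normal | BMI_Category_Overweight | BMI_Category_Obese |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 | 0 | 0 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 | 0 | 1 | 0 |
| 2 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 | 0 | 1 | 0 |
| 3 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 | 0 | 0 | 1 |
| 4 | 5 | 116 | 74 | 0 | 0 | 25.6 | 0.201 | 30 | 0 | 0 | 1 | 0 |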
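
Two details of this binning are easy to miss. By default `pd.cut` uses right-closed intervals and excludes the lowest edge, so a BMI of exactly 18.5 lands in `Underweight` and a BMI of 0 falls outside every bin and becomes `NaN`. With `drop_first=True`, both an `Underweight` row and a `NaN` row then end up with all-zero dummy columns. The sketch below shows this on a handful of hand-picked values; whether the training data actually contains zero BMIs is not shown in the preview and would need to be checked.

```python
import pandas as pd

# Hand-picked values sitting on or near the bin edges, plus a zero.
bmi = pd.Series([0.0, 18.5, 24.9, 25.0, 29.9, 30.0, 33.6])

cats = pd.cut(
    bmi,
    bins=[0, 18.5, 24.9, 29.9, float("inf")],
    labels=["Underweight", "Normal", "Overweight", "Obese"],
)
print(pd.DataFrame({"BMI": bmi, "BMI_Category": cats}))

# Same encoding as in the notebook: the first category (Underweight) is dropped.
dummies = pd.get_dummies(cats, drop_first=True)
print(dummies)
```

If zeros do occur, passing `include_lowest=True` or handling them before binning would keep them from being silently merged with the baseline category.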


In [13]:
# Splitting the dataset
X_train, X_test, y_train, y_test = train_test_split(train.drop(columns=['Outcome', 'BMI']), train['Outcome'], stratify=train['Outcome'], random_state=42)

# Normalizing the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Training the KNN model
knn = KNN(n_neighbors=5)
knn.fit(X_train, y_train)

# Evaluating the model
y_pred = knn.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))

Accuracy: 0.7943262411347518
Confusion Matrix:
 [[78 13]
 [16 34]]
Classification Report:
               precision    recall  f1-score   support

           0       0.83      0.86      0.84        91
           1       0.72      0.68      0.70        50

    accuracy                           0.79       141
   macro avg       0.78      0.77      0.77       141
weighted avg       0.79      0.79      0.79       141



The updated model achieves an accuracy of 79.43%, slightly lower than the original 80.14%. The only change in the confusion matrix is one additional false negative (16 instead of 15), so precision and recall for class 1 (Diabetes) drop marginally, from 0.73 to 0.72 and from 0.70 to 0.68. Categorizing BMI therefore did not improve the model's performance on this split. Further experimentation with different features and models may be necessary to enhance predictive accuracy.
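
A rough check on how much weight this gap can bear: with 141 held-out rows, the two accuracies differ by a single prediction. Treating each prediction as an independent Bernoulli trial (a simplifying assumption), the binomial standard error of the accuracy can be compared with the gap.

```python
import math

n = 141                    # rows in the held-out split
acc_original = 113 / n     # 78 + 35 correct predictions
acc_binned = 112 / n       # 78 + 34 correct predictions

# Binomial standard error of an accuracy estimate: sqrt(p * (1 - p) / n)
se = math.sqrt(acc_original * (1 - acc_original) / n)
gap = acc_original - acc_binned

print(f"standard error = {se:.4f}")
print(f"gap            = {gap:.4f}")
```

The gap (about 0.007) is several times smaller than the standard error (about 0.034), so a single split of this size cannot distinguish the two feature sets; cross-validation would give a steadier comparison.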

### Trying XGBoost

In [16]:
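import xgboost as xgb

# Gradient-boosted trees with 10 boosting rounds; X_train/X_test are the scaled,
# BMI-categorized features from the previous cell
xg_cl = xgb.XGBClassifier(objective='binary:logistic', n_estimators=10, seed=123)
xg_cl.fit(X_train, y_train)
preds = xg_cl.predict(X_test)

# Accuracy: fraction of held-out rows predicted correctly
accuracy = float(np.sum(preds==y_test))/y_test.shape[0]
print("accuracy: %f" % (accuracy))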

accuracy: 0.815603
