## Diabetes dataset 
### Predict if a person is at risk of developing diabetes

### This Dataset is Freely Available

### Overview:
The data was collected and made available by the "National Institute of Diabetes and Digestive and Kidney Diseases" as part of the Pima Indians Diabetes Database. 

`Diabetes.csv` is available [from Kaggle](https://www.kaggle.com/uciml/pima-indians-diabetes-database). We want to answer two questions: which features are most strongly correlated with a positive diagnosis, and, if we could ask a patient only two questions, which two should we ask and how would we turn the answers into a risk of being diagnosed?
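
One simple way to approach the first question is to rank features by the absolute value of their Pearson correlation with `Outcome`. The sketch below is library-free and runs on made-up toy values (not rows from `Diabetes.csv`); the helper name `pearson` is an illustrative choice. In the notebook itself, pandas' `DataFrame.corr` would be the more natural tool.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up toy values for illustration only; these are NOT rows from Diabetes.csv.
toy = {
    "Glucose": [85, 90, 150, 160, 100, 170],
    "Age":     [30, 45, 28, 50, 33, 41],
}
outcome = [0, 0, 1, 1, 0, 1]

# Rank features by the absolute strength of their correlation with the outcome.
ranking = sorted(toy, key=lambda name: abs(pearson(toy[name], outcome)), reverse=True)
print(ranking)
```

Correlation only captures linear, one-feature-at-a-time relationships, so a ranking like this is a starting point rather than an answer to the two-question problem.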


The following features have been provided to help us predict whether a person is diabetic or not:
* **Pregnancies:** Number of times pregnant
* **Glucose:** Plasma glucose concentration at 2 hours in an oral glucose tolerance test
* **BloodPressure:** Diastolic blood pressure (mm Hg)
* **SkinThickness:** Triceps skin fold thickness (mm)
* **Insulin:** 2-Hour serum insulin (mu U/ml)
* **BMI:** Body mass index (weight in kg / (height in m)^2)
* **DiabetesPedigreeFunction:** Diabetes pedigree function (a function which scores likelihood of diabetes based on family history)
* **Age:** Age (years)
* **Outcome:** Class variable (0 if non-diabetic, 1 if diabetic)

### Binary Classification problem - XGBoost

In [None]:
# Install xgboost in the notebook instance (uncomment the next line to run it).
#!pip install xgboost==0.90

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import xgboost as xgb


from sklearn.model_selection import train_test_split
from xgboost import plot_importance

from sklearn.metrics import accuracy_score
from sklearn.feature_selection import SelectFromModel

import warnings
warnings.filterwarnings('ignore')

In [None]:
data = pd.read_csv("../Data/Diabetes.csv")

In [None]:
data.describe()

In [None]:
data.info()

In [None]:
# Keep only rows where none of the columns (except the first and last) contains a 0
data = data[~(data[data.columns[1:-1]] == 0).any(axis=1)]
data.reset_index(inplace=True, drop=True)
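
The row filter above is dense, so here is a plain-Python sketch of the same rule: a row is dropped if any column other than the first and the last holds a 0. The rows are made-up toy values (not rows from `Diabetes.csv`), and `has_zero_measurement` is a hypothetical helper name used only for illustration.

```python
# Made-up toy rows in the dataset's column order (first = Pregnancies, last = Outcome).
# These are NOT rows from Diabetes.csv.
columns = ["Pregnancies", "Glucose", "BMI", "Outcome"]
rows = [
    [0, 120, 31.2, 0],   # kept: zeros appear only in the first/last columns
    [2,   0, 28.0, 1],   # dropped: Glucose is 0
    [1, 135,  0.0, 1],   # dropped: BMI is 0
    [3, 110, 25.5, 0],   # kept
]

def has_zero_measurement(row):
    """True if any column other than the first and last holds a 0."""
    return any(value == 0 for value in row[1:-1])

kept = [row for row in rows if not has_zero_measurement(row)]
print(len(kept), "of", len(rows), "rows kept")
```

The first and last columns are excluded because 0 is a legitimate value for `Pregnancies` and `Outcome`. Dropping whole rows can discard a large share of the data, so it is worth checking how many rows remain after this step.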

### Dealing with Missing Values

In [None]:
# Count the missing (NaN) values in each column with isnull()
print(data.isnull().sum())
# print(data.isnull().any().sum())  # number of columns with at least one missing value

In [None]:
# Drop the Insulin column; removing a column leaves the row index unchanged,
# so no index reset is needed here.
data.drop(columns=['Insulin'], inplace=True)

In [None]:
# Replace missing values in each column with that column's median
# (use data.mean() instead to fill with the mean)
#data.fillna(data.mean(), inplace=True)
data.fillna(data.median(), inplace=True)

# Alternative: drop every row that contains a missing value
#data = data.dropna()
#data.reset_index(inplace=True, drop=True)
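
To make the median fill concrete, here is a minimal library-free sketch of what it does for a single column: compute the median of the observed values, then substitute it wherever a value is missing. The column values are made up, `None` stands in for a missing measurement, and `impute_median` is a hypothetical helper name.

```python
from statistics import median

def impute_median(values):
    """Replace None entries with the median of the observed (non-None) entries."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]

# Made-up toy column; None marks a missing measurement.
bmi = [22.0, None, 30.0, 41.0, None, 26.0]
print(impute_median(bmi))
```

For these toy values the median of the observed entries is 28.0, while their mean is 29.75. The median is less sensitive to a few extreme values, which is one common reason to prefer it for skewed measurements.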

### Split Data

In [None]:
X = data.iloc[:, :-1]            # Features: all columns except the last
y = data.iloc[:, -1].to_numpy()  # Target: last column (Outcome)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

### Launch XGBoost classifier

In [None]:
# Create the classifier
# XGBoost training parameter reference:
#   https://xgboost.readthedocs.io/en/latest/parameter.html
classifier = xgb.XGBClassifier(objective="binary:logistic")

In [None]:
classifier

In [None]:
classifier.fit(X_train,
               y_train, 
               eval_metric=['logloss'])

### Plot Feature Importance

In [None]:
# plot feature importance
plot_importance(classifier)
plt.show()

### Feature Selection using Feature Importance
* Feature importance scores can be used for feature selection in scikit-learn.
* This is done using the SelectFromModel class that takes a model and can transform a dataset into a subset with selected features.
* This class can take a pre-trained model, such as one trained on the entire training dataset. 
* It can then use a threshold to decide which features to select. 
* This threshold is used when you call the transform() method on the SelectFromModel instance to consistently select the same features on the training dataset and the test dataset.
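
The thresholding rule described in the list can be sketched without any library: `SelectFromModel` keeps the features whose importance is greater than or equal to the threshold. The importance scores below are made up, and `select_features` is a hypothetical helper that only mimics that rule; it is not scikit-learn's implementation.

```python
def select_features(importances, threshold):
    """Return indices of features whose importance is >= threshold."""
    return [i for i, score in enumerate(importances) if score >= threshold]

# Made-up importance scores for four hypothetical features.
importances = [0.10, 0.40, 0.20, 0.30]

# Using each importance as a threshold, from smallest to largest,
# removes one feature at a time (when all scores are distinct).
for threshold in sorted(importances):
    kept = select_features(importances, threshold)
    print(f"threshold={threshold:.2f} keeps {len(kept)} feature(s): {kept}")
```

This is why looping over the sorted importances, as the next cell does, evaluates the model on progressively smaller feature subsets, ending with only the single most important feature.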


In [None]:
# Fit the model on all training data
model = xgb.XGBClassifier(objective="binary:logistic", use_label_encoder=False)
model.fit(X_train, y_train, eval_metric=['logloss'])
# Make predictions for the test data and evaluate
y_pred = model.predict(X_test)
predictions = [round(value) for value in y_pred]
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
# Refit the model using each feature importance in turn as the selection threshold
thresholds = np.sort(model.feature_importances_)
for thresh in thresholds:
    # Select the features whose importance is at least `thresh`
    selection = SelectFromModel(model, threshold=thresh, prefit=True)
    select_X_train = selection.transform(X_train)
    # Train a fresh model on the selected features only
    selection_model = xgb.XGBClassifier(objective="binary:logistic", use_label_encoder=False)
    selection_model.fit(select_X_train, y_train, eval_metric=['logloss'])
    # Evaluate on the test set, reduced to the same features
    select_X_test = selection.transform(X_test)
    y_pred = selection_model.predict(select_X_test)
    predictions = [round(value) for value in y_pred]
    accuracy = accuracy_score(y_test, predictions)
    print("Thresh=%.3f, n=%d, Accuracy: %.2f%%" % (thresh, select_X_train.shape[1], accuracy * 100.0))

You can see that the performance of the model generally decreases as the number of selected features shrinks.