<a href="https://colab.research.google.com/github/vbosstech/disease-diagnostic-from-symptoms/blob/master/diabetes_risk_assessment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Machine Learning Workflow on Diabetes Data

In [0]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## Step 1.  Data Preparation
`Acquiring and preparing a data set`

Using the [Pima Indians Diabetes Database](https://archive.ics.uci.edu/ml/datasets/pima+indians+diabetes) provided by the UCI Machine Learning Repository.

## Step 2.   Data Exploration

Analyse and “**get to know**” the data set: identify potential features and see whether data cleaning is needed.



In [0]:
from google.colab import drive
# drive.mount('/content/gdrive')
drive.mount("/content/gdrive", force_remount=True)

In [0]:
diabetes = pd.read_csv('/content/gdrive/My Drive/machine-learning/disease-diagnostic-from-symptoms/dataset/diabetes.csv')
print(diabetes.columns)

In [0]:
diabetes.head()

In [0]:
print("Diabetes data set dimensions : {}".format(diabetes.shape))

In [0]:
diabetes.groupby('Outcome').size()

*   **Visualize** the data with *Matplotlib* and *Seaborn* to understand and explain it.
*   Find the **data distribution** of the **features**.




In [0]:
diabetes.hist(figsize=(9, 9))

In [0]:
# histograms for the two responses separately. 
diabetes.groupby('Outcome').hist(figsize=(9, 9))

## Step 3.  Data Cleaning
“**Better data beats fancier algorithms**”: cleaner data generally gives you better resulting models. There are several factors to consider in the data cleaning process, including:
* 3.1. Duplicate or irrelevant observations.
* 3.2. Bad labeling of data, same category occurring multiple times.
* 3.3. Missing or null data points.
* 3.4. Unexpected outliers.

### 3.1. Duplicate or irrelevant observations.
Not applicable, because this is a standard data set.

### 3.2. Bad labeling of data, same category occurring multiple times.
Not applicable, because this is a standard data set.

### 3.3. Missing or Null Data points

In [0]:
diabetes.isnull().sum()

In [0]:
diabetes.isna().sum()

### 3.4. Unexpected Outliers
Unexpected outliers can be either useful or potentially harmful. Here, several columns contain zeros that are not physiologically plausible.

In [0]:
# A living person cannot have diastolic blood pressure of zero
print("Total : ", diabetes[diabetes.BloodPressure == 0].shape[0])
print(diabetes[diabetes.BloodPressure == 0].groupby('Outcome')['Age'].count())

In [0]:
# Even after fasting, a glucose level would not be as low as zero.
print("Total : ", diabetes[diabetes.Glucose == 0].shape[0])
print(diabetes[diabetes.Glucose == 0].groupby('Outcome')['Age'].count())

In [0]:
# Skin fold thickness can't be less than 10 mm, let alone zero.
print("Total : ", diabetes[diabetes.SkinThickness == 0].shape[0])
print(diabetes[diabetes.SkinThickness == 0].groupby('Outcome')['Age'].count())

In [0]:
# BMI should not be zero or close to zero unless the person is severely underweight, which could be life threatening.
print("Total : ", diabetes[diabetes.BMI == 0].shape[0])
print(diabetes[diabetes.BMI == 0].groupby('Outcome')['Age'].count())

In [0]:
# In a rare situation a person can have zero insulin
print("Total : ", diabetes[diabetes.Insulin == 0].shape[0])
print(diabetes[diabetes.Insulin == 0].groupby('Outcome')['Age'].count())

**Handle invalid data values:**
* **Ignore/remove these cases**: This is not feasible in most cases because it would mean losing valuable information. It might work for “BMI”, “glucose” and “blood pressure”, which have just a few invalid data points.
* **Put average/mean values**: This might work for some data sets, but in our case putting a mean value in the blood pressure column would send a wrong signal to the model.
* **Avoid using features**: It is possible to leave the features with a lot of invalid values out of the model. This may work for “skin thickness”, but it's hard to predict that.
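
The three options above can be sketched on a small made-up frame. This is only an illustration: the `toy` values are hypothetical and not taken from the Pima data set, and median imputation is one assumed way of "putting average values", not something this notebook applies.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not from the Pima data set); 0 marks an invalid reading.
toy = pd.DataFrame({
    "Glucose":       [148, 85, 0, 89, 137],
    "BloodPressure": [72, 66, 64, 0, 40],
    "SkinThickness": [35, 0, 0, 23, 0],
    "Outcome":       [1, 0, 1, 0, 1],
})
invalid_cols = ["Glucose", "BloodPressure", "SkinThickness"]

# Count invalid zeros per column to judge which option fits each feature.
zero_counts = (toy[invalid_cols] == 0).sum()
print(zero_counts)

# Option 1: drop rows where a mostly valid column is zero.
dropped = toy[(toy.Glucose != 0) & (toy.BloodPressure != 0)]

# Option 2: mark zeros as missing, then fill with the column median.
imputed = toy.copy()
imputed[invalid_cols] = imputed[invalid_cols].replace(0, np.nan)
imputed[invalid_cols] = imputed[invalid_cols].fillna(imputed[invalid_cols].median())

# Option 3: leave out a feature dominated by invalid values.
reduced = toy.drop(columns=["SkinThickness"])

print(dropped.shape, reduced.shape)
print(imputed)
```

Counting the zeros first makes the trade-off visible: dropping rows is cheap when a column has few invalid values, while a column that is mostly zeros is a candidate for exclusion.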

In [0]:
# Remove the rows in which “BloodPressure”, “BMI” or “Glucose” is zero.
diabetes_mod = diabetes[(diabetes.BloodPressure != 0) & (diabetes.BMI != 0) & (diabetes.Glucose != 0)]
print(diabetes_mod.shape)

## Step 4.  Feature Engineering
[Feature Engineering](https://elitedatascience.com/feature-engineering-best-practices) is the process of transforming the gathered data into features that better represent the problem we are trying to solve to the model, to improve its performance and accuracy.
Feature engineering highlights the **important features** and helps bring **domain expertise** on the problem to the table. It also helps **avoid overfitting the model** despite providing many input features.

Assign the **features** to the **X variable** and the **response** to the **y variable**.

In [0]:
# Features/Response
feature_names = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']
X = diabetes_mod[feature_names]
y = diabetes_mod.Outcome

## Step 5.  Model Selection
**Model selection** or **algorithm selection** selects the model that performs best for the data set at hand.

*   Calculate the “**Classification Accuracy (Testing Accuracy)**” of a set of classification models, with mostly default parameters, to determine which model performs best on the diabetes data set.
*   The contenders for the best classifier are **Logistic Regression**, **Random Forest**, **Gradient Boosting**, **Decision Tree**, **K-Nearest Neighbors**, **Support Vector Classifier** and **Gaussian Naive Bayes**.



In [0]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble     import RandomForestClassifier
from sklearn.ensemble     import GradientBoostingClassifier
from sklearn.naive_bayes  import GaussianNB
from sklearn.tree         import DecisionTreeClassifier
from sklearn.neighbors    import KNeighborsClassifier
from sklearn.svm          import SVC

from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score

In [0]:
# Initial model selection process
models = []

models.append(('LR',  LogisticRegression(solver='lbfgs', max_iter=4000)))
models.append(('RF',  RandomForestClassifier(n_estimators=100)))
models.append(('GB',  GradientBoostingClassifier()))
models.append(('GNB', GaussianNB()))
models.append(('DT',  DecisionTreeClassifier()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('SVC', SVC(gamma=0.001)))
# models.append(('SVC', SVC(gamma='scale')))

**Evaluation Methods**
It is general practice to avoid training and testing on the same data. The goal of the model is to predict **out-of-sample data**, and an overly complex model can **overfit** the training data, so accuracy measured on that same data overstates how well it generalizes. There are two precautions:
* **5.1. Train/Test Split**:
  * The “**train_test_split**” method splits the data set into two portions: the **training set** is used to train the model, and the **testing set** is used to test the model and evaluate its accuracy.
  * “**accuracy_score**” evaluates the accuracy of the respective model in the train/test split method.
*   **5.2. K-Fold Cross Validation** splits the data set into **K equal partitions** (“folds”), then uses 1 fold as the testing set and the union of the other folds as the **training set**. The model is tested for accuracy, and the process is repeated so that each fold serves as the testing set once. The **average testing accuracy** across the folds is the testing accuracy. Note: this estimate is more reliable than a single split and uses the data more efficiently.
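
The fold mechanics described above can be sketched on toy labels. The arrays `X_toy` and `y_toy` are hypothetical stand-ins (not the diabetes data), and the stratified variant is used because it is the splitter this notebook applies in section 5.2.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical labels: 40 samples, 25% positive (not the diabetes data).
y_toy = np.array([1] * 10 + [0] * 30)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

tested = []           # indices that appeared in a test fold
positive_shares = []  # share of positives in each test fold
for fold, (train_idx, test_idx) in enumerate(skf.split(X_toy, y_toy), start=1):
    tested.extend(test_idx.tolist())
    positive_shares.append(y_toy[test_idx].mean())
    print("fold", fold, "train size", len(train_idx), "test size", len(test_idx))

# Each sample is used for testing exactly once across the folds,
# and stratification keeps the class ratio the same in every fold.
print(sorted(tested) == list(range(len(y_toy))))
print(positive_shares)
```

Because every sample is tested once, the averaged score depends less on one lucky or unlucky split than a single train/test split does.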

### 5.1. Using Train/Test split

In [0]:
# Train/Test split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify = diabetes_mod.Outcome, random_state=0)

In [0]:
names = []
scores = []

for name, model in models:
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    scores.append(accuracy_score(y_test, y_pred))
    names.append(name)

tr_split = pd.DataFrame({'Name': names, 'Score': scores})
print(tr_split)

### 5.2. Using K-Fold cross validation

In [0]:
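# random_state only takes effect with shuffle=True; recent scikit-learn versions raise an error otherwise.
strat_k_fold = StratifiedKFold(n_splits=10, shuffle=True, random_state=10)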

names = []
scores = []

for name, model in models:
    
    score = cross_val_score(model, X, y, cv=strat_k_fold, scoring='accuracy').mean()
    names.append(name)
    scores.append(score)

kf_cross_val = pd.DataFrame({'Name': names, 'Score': scores})
print(kf_cross_val)

In [0]:
# Plot the accuracy scores using "seaborn"
axis = sns.barplot(x = 'Name', y = 'Score', data = kf_cross_val)
axis.set(xlabel='Classifier Algorithm', ylabel='Accuracy')

for p in axis.patches:
    height = p.get_height()
    axis.text(p.get_x() + p.get_width()/2, height + 0.005, '{:1.4f}'.format(height), ha="center") 
    
plt.show()

We can see that **Logistic Regression**, **Random Forest**, **Gradient Boosting** and **Gaussian Naive Bayes** have performed better than the rest. At this baseline level, ***Logistic Regression*** performs better than the other algorithms.

## Step 6. Feature Selection (Revisited)
Analyze the selected model, Logistic Regression, and how feature selection affects it. Common feature selection methods include:
* **Recursive Feature Elimination**: **RFE** works by recursively removing attributes and building a model on those attributes that remain. It uses the model accuracy to identify which attributes (and combination of attributes) contribute the most to predicting the target attribute.
* **Univariate Feature Selection**: Statistical tests can be used to select those features that have the strongest relationship with the output variable.
* **Principal Component Analysis**: **PCA** uses linear algebra to transform the dataset into a compressed form. Generally this is called a data reduction technique. A property of PCA is that you can choose the number of dimensions or principal components in the transformed result.
* **Feature Importance**: Bagged decision trees like Random Forest and Extra Trees can be used to estimate the importance of features.
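
The notebook continues with RFE only. As a rough sketch of the other three methods, the example below runs them on synthetic data from `make_classification`. The data, the names in `demo_names`, and the choices `k=3` and `n_components=2` are all assumptions made for illustration, not part of the diabetes analysis.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in data: 6 features, 3 of them informative.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
demo_names = ["f{}".format(i) for i in range(X_demo.shape[1])]

# Univariate selection: keep the 3 features with the highest ANOVA F-score.
selector = SelectKBest(score_func=f_classif, k=3).fit(X_demo, y_demo)
kept = [name for name, keep in zip(demo_names, selector.get_support()) if keep]
print("SelectKBest kept:", kept)

# PCA: project the 6 features onto 2 principal components.
X_pca = PCA(n_components=2).fit_transform(X_demo)
print("PCA output shape:", X_pca.shape)

# Feature importance: impurity-based scores from a random forest.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_demo, y_demo)
importances = forest.feature_importances_
for name, score in sorted(zip(demo_names, importances), key=lambda pair: -pair[1]):
    print("{}: {:.3f}".format(name, score))
```

Unlike the other two, PCA returns new combined axes rather than a subset of the original columns, so its output can no longer be read as individual features.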

In [0]:
# Use Recursive Feature Elimination with cross-validation (RFECV) as the feature selection method.
from sklearn.feature_selection import RFECV

### Logistic Regression

In [0]:
logreg_model = LogisticRegression(solver='lbfgs', max_iter=4000)

rfecv = RFECV(estimator=logreg_model, step=1, cv=strat_k_fold, scoring='accuracy')
rfecv.fit(X, y)

plt.figure()
plt.title('Logistic Regression CV score vs No of Features')
plt.xlabel("Number of features selected")
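plt.ylabel("Cross-validation score (accuracy)")
# grid_scores_ was removed in scikit-learn 1.2; cv_results_ holds the mean CV score per feature count.
mean_scores = rfecv.cv_results_["mean_test_score"]
plt.plot(range(1, len(mean_scores) + 1), mean_scores)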
plt.show()

* Feeding **5 features** to the model gives the best accuracy score.
* **RFECV** exposes **support_**, an attribute that marks which features contribute the most to the prediction.
* We can compare the model with the **original features** and with the **RFECV selected features** to see if there is an improvement in the accuracy scores.

In [0]:
# Keep the names of the features that RFECV marked as selected.
new_features = [name for name, selected in zip(feature_names, rfecv.support_) if selected]

print(new_features)

In [0]:
# Calculate accuracy scores 
X_new = diabetes_mod[new_features]

initial_score = cross_val_score(logreg_model, X, y, cv=strat_k_fold, scoring='accuracy').mean()
print("Initial accuracy : {} ".format(initial_score))

fe_score = cross_val_score(logreg_model, X_new, y, cv=strat_k_fold, scoring='accuracy').mean()
print("Accuracy after Feature Selection : {} ".format(fe_score))

### Gradient Boost

In [0]:
gb_model = GradientBoostingClassifier()

gb_rfecv = RFECV(estimator=gb_model, step=1, cv=strat_k_fold, scoring='accuracy')
gb_rfecv.fit(X, y)

plt.figure()
plt.title('Gradient Boost CV score vs No of Features')
plt.xlabel("Number of features selected")
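plt.ylabel("Cross-validation score (accuracy)")
# grid_scores_ was removed in scikit-learn 1.2; cv_results_ holds the mean CV score per feature count.
gb_mean_scores = gb_rfecv.cv_results_["mean_test_score"]
plt.plot(range(1, len(gb_mean_scores) + 1), gb_mean_scores)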
plt.show()

In [0]:
# Keep the names of the features that RFECV marked as selected.
new_features = [name for name, selected in zip(feature_names, gb_rfecv.support_) if selected]

print(new_features)

In [0]:
X_new_gb = diabetes_mod[new_features]

initial_score = cross_val_score(gb_model, X, y, cv=strat_k_fold, scoring='accuracy').mean()
print("Initial accuracy : {} ".format(initial_score))

fe_score = cross_val_score(gb_model, X_new_gb, y, cv=strat_k_fold, scoring='accuracy').mean()
print("Accuracy after Feature Selection : {} ".format(fe_score))

## Step 7.  Model Parameter Tuning
* Use **[Logistic Regression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression)** for the **Model Parameter Tuning** because it was more accurate than **Gradient Boosting**.
* Instead of searching for optimum parameters manually, we can use **GridSearchCV**, which does an “exhaustive search over specified parameter values for an estimator”.

In [0]:
from sklearn.model_selection import GridSearchCV

In [0]:
# Specify parameters
c_values = list(np.arange(1, 10))

param_grid = [
    {'C': c_values, 'penalty': ['l1'], 'solver' : ['liblinear'], 'multi_class' : ['ovr']},
    {'C': c_values, 'penalty': ['l2'], 'solver' : ['liblinear', 'newton-cg', 'lbfgs'], 'multi_class' : ['ovr']}
]

In [0]:
# Fit the data to GridSearchCV, which performs a K-fold cross validation on the data for each combination of the parameters.
# The iid argument was removed in scikit-learn 0.24, so it is not passed here.
grid = GridSearchCV(LogisticRegression(), param_grid, cv=strat_k_fold, scoring='accuracy')
grid.fit(X_new, y)

In [0]:
## After training & scoring, GridSearchCV provides some useful attributes to find the best parameters and the best estimator.
print(grid.best_params_)
print(grid.best_estimator_)

In [0]:
# Feed the best parameters to the Logistic Regression model and observe whether its accuracy has increased.
# The values below are hard-coded and assumed to match grid.best_params_ from the previous cell.
logreg_new = LogisticRegression(C=1, multi_class='ovr', penalty='l2', solver='liblinear')
final_score = cross_val_score(logreg_new, X_new, y, cv=strat_k_fold, scoring='accuracy').mean()
print("Final accuracy : {} ".format(final_score))