# End-to-End Model Development and Deployment

Diabetes is a chronic disease that affects millions of people worldwide. Here we are particularly interested in analyzing diabetes in female patients.

**Problem Statement**
Develop a machine learning model to predict diabetes in women and deploy it as a web app in Streamlit.

**Dataset Description**
This is the Pima Indians Dataset from kaggle.com. It contains data on 768 women of Pima heritage aged 21 years and above. It is an open-source dataset.

**Steps of the Modelling Process**
1. Import all libraries and view the data set 
2. Do the Data Sanity Check
3. Clean the data
4. Perform Exploratory Data Analysis 
5. Preprocess the data for modelling
6. Fit and evaluate Machine Learning Models
7. Optimize the best model
8. Interpret the tuned model
9. Prepare for deployment by creating a pipeline 
10. Deploy in Streamlit 

### Step1: Import libraries and the dataset


In [1]:
# data manipulation and EDA libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")

# data preprocessing libraries
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from imblearn.over_sampling import SMOTE

# data modelling libraries
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier

# data metrics libraries
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score, roc_auc_score
from sklearn.metrics import confusion_matrix, classification_report

# model interpretation and deployment libraries
import shap
import pickle
from sklearn.pipeline import Pipeline
import streamlit as st

print("All libraries are imported")



ImportError: cannot import name '_MissingValues' from 'sklearn.utils._param_validation' (C:\ProgramData\anaconda3\Lib\site-packages\sklearn\utils\_param_validation.py)
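An `ImportError` like this one is most likely raised by the `imblearn` import and usually indicates that the installed imbalanced-learn release expects a newer scikit-learn than the one in the environment. A minimal sketch for checking the installed versions before upgrading either package; the helper name `installed_versions` is hypothetical and not part of the original notebook:

```python
from importlib import metadata


def installed_versions(packages):
    """Return {distribution name: version string, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# distribution names as published on PyPI
print(installed_versions(["scikit-learn", "imbalanced-learn"]))
```

If the two versions turn out to be incompatible, upgrading scikit-learn (or installing a matching imbalanced-learn release) and restarting the kernel is the usual remedy.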

In [2]:
!pip install xgboost --quiet

In [None]:
!pip install shap

In [None]:
!pip install streamlit

In [4]:
data=pd.read_csv('diabetes.csv')

In [5]:
data.head()

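| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | Yes |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | No |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | Yes |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | No |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | Tested_Positive |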


**Attributes of the data**
1. Pregnancies - The number of times the patient was pregnant
2. Glucose - The serum glucose level of the patient
3. BloodPressure - Diastolic blood pressure (mm Hg)
4. SkinThickness - Triceps skin fold thickness (mm)
5. Insulin - The serum insulin level of the patient
6. BMI - Body Mass Index (Wt/Ht^2), a measure of obesity
7. DiabetesPedigreeFunction - A genetic propensity towards diabetes based on family history
8. Age - Age of the patient
9. Outcome - The target variable with two levels (Yes/No)

### Step2: Data Sanity Check
- get the basic info of the data
- look for null values
- look for duplicate rows
- look for corrupted data
- get the data summary statistics (both numerical and categorical)
- look for erroneous values in the data 


In [6]:
# get the shape of the data 
data_shape=data.shape
print('Rows =', data_shape[0], ' Columns=', data_shape[1])

Rows = 768  Columns= 9


In [7]:
# get the basic info (data.info() prints its report and returns None)
info=data.info()

# get the data type
dtype=data.dtypes

info, dtype

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    object 
dtypes: float64(2), int64(6), object(1)
memory usage: 54.1+ KB


(None,
 Pregnancies                   int64
 Glucose                       int64
 BloodPressure                 int64
 SkinThickness                 int64
 Insulin                       int64
 BMI                         float64
 DiabetesPedigreeFunction    float64
 Age                           int64
 Outcome                      object
 dtype: object)

In [8]:
# check for unique levels in categorical
data['Outcome'].nunique()

4

In [9]:
# get the value counts
data['Outcome'].value_counts()

Outcome
No                 470
Yes                248
Tested_Negative     30
Tested_Positive     20
Name: count, dtype: int64

In [10]:
# check for nulls and duplicates
nulls=data.isnull().sum()

dups=data.duplicated().sum()

nulls, dups

(Pregnancies                 0
 Glucose                     0
 BloodPressure               0
 SkinThickness               0
 Insulin                     0
 BMI                         0
 DiabetesPedigreeFunction    0
 Age                         0
 Outcome                     0
 dtype: int64,
 0)

In [11]:
# look for corrupt characters in the data: select rows in which no cell holds a real number
# (axis must be passed by keyword; DataFrame.any() no longer accepts it positionally)
data[~data.applymap(np.isreal).any(axis=1)]
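A complementary way to look for corrupt entries is to coerce each supposedly numeric column with `pd.to_numeric` and list the rows where a non-missing value fails to convert. This is a minimal sketch on a toy frame, not the notebook's own check; the helper name `non_numeric_rows` and the toy values are assumptions for illustration:

```python
import pandas as pd


def non_numeric_rows(frame, columns):
    """Return rows where any of the given columns holds a non-numeric, non-missing value."""
    # values that cannot be parsed as numbers become NaN
    coerced = frame[columns].apply(pd.to_numeric, errors="coerce")
    # a cell is "bad" if it was present originally but did not survive the coercion
    bad = coerced.isna() & frame[columns].notna()
    return frame[bad.any(axis=1)]


# toy data with two deliberately corrupted cells
toy = pd.DataFrame({"Glucose": [148, "?", 183], "BMI": [33.6, 26.6, "n/a"]})
bad_rows = non_numeric_rows(toy, ["Glucose", "BMI"])
print(bad_rows)
```

On the real data the call would be `non_numeric_rows(data, feature_columns)`, where `feature_columns` is the list of the eight predictor columns.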

In [None]:
# Summary statistics of numerical and categorical data
num_stats=data.describe().T
cat_stats=data.describe(include='O').T
num_stats, cat_stats

**Data Summary**
1. The dataset has 768 rows and 9 columns
2. The dataset has 8 numerical variables (int64 and float64) and one categorical variable (Outcome)
3. **The categorical variable Outcome has 4 levels, which we need to clean and reduce to 2 levels (Yes=1 / No=0)**
4. There are no missing values or duplicate rows
5. There are no corrupt characters in the data
6. **Many columns have a minimum value of 0, which is physiologically not feasible, so we have to impute these zeros with the column medians**
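The last point of the summary can be quantified by counting the zeros in each affected column. A minimal sketch on a toy frame (the helper name `zero_counts` and the toy values are assumptions for illustration):

```python
import pandas as pd


def zero_counts(frame, columns):
    """Count how many entries equal 0 in each of the given columns."""
    return (frame[columns] == 0).sum()


toy = pd.DataFrame({
    "Glucose": [148, 85, 0],
    "Insulin": [0, 0, 94],
    "BMI": [33.6, 26.6, 23.3],
})
counts = zero_counts(toy, ["Glucose", "Insulin", "BMI"])
print(counts)
```

Applied to the notebook's `data`, the same call with the columns to be imputed shows how much of each column would be replaced by the median.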

### Step3: Data Cleaning
- encode categorical Outcome variable
- impute columns with minimum value 0

In [None]:
data.columns

In [None]:
# create a copy of the data
df=data.copy()

In [None]:
# impute the 0 values with the column median
# (note: each median is computed with the zeros still included)
cols=['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin','BMI']
zerofill=lambda x: x.replace(0, x.median())
df[cols]=df[cols].apply(zerofill, axis=0)
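Because the zeros are still present when the median is taken, a column with many zeros gets a median that is pulled downwards. One possible variant, sketched here as an assumption rather than the notebook's method, treats the zeros as missing first and computes the median from the remaining values only (the helper name `impute_zeros_with_median` is hypothetical):

```python
import numpy as np
import pandas as pd


def impute_zeros_with_median(frame, columns):
    """Replace zeros in the given columns with the median of the non-zero values."""
    out = frame.copy()
    for col in columns:
        # mark zeros as missing so they do not influence the median
        non_zero = out[col].replace(0, np.nan)
        out[col] = non_zero.fillna(non_zero.median())
    return out


# toy column in which more than half of the values are zero
toy = pd.DataFrame({"Insulin": [0, 0, 0, 94, 168]})
imputed = impute_zeros_with_median(toy, ["Insulin"])
print(imputed)
```

In this toy column the median including zeros is itself 0, so replacing zeros with it would change nothing, whereas the median of the non-zero values is 131.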

In [None]:
# check the minimum values
df.min()

In [None]:
# categorical encoding
d={
    'Yes':1,'Tested_Positive':1, 'No':0, 'Tested_Negative':0
}
df['Outcome']=df["Outcome"].map(d)
df['Outcome'].value_counts()
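`Series.map` silently turns any label that is missing from the dictionary into `NaN`, so it can be worth failing loudly if an unexpected label appears. A minimal sketch, assuming the four labels seen in the value counts are the only valid ones; the function name `encode_outcome` is hypothetical:

```python
import pandas as pd

LABEL_MAP = {"Yes": 1, "Tested_Positive": 1, "No": 0, "Tested_Negative": 0}


def encode_outcome(labels, mapping=LABEL_MAP):
    """Map raw Outcome labels to 0/1 and raise if any label is not in the mapping."""
    encoded = labels.map(mapping)
    unmapped = labels[encoded.isna()].unique()
    if len(unmapped) > 0:
        raise ValueError(f"Unmapped Outcome labels: {list(unmapped)}")
    return encoded.astype(int)


toy = pd.Series(["Yes", "No", "Tested_Positive", "Tested_Negative"])
encoded = encode_outcome(toy)
print(encoded.tolist())
```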

### Step4: Exploratory Data Analysis 
- Univariate analysis
   - numerical data - histograms and boxplots
   - categorical data - bar plots
- Bivariate analysis
   - bivariate bar charts
   - scatter plots
- Correlation analysis
   - correlation matrix and heatmaps

In [None]:
# Univariate analysis for numerical data
df.hist()
plt.tight_layout()
plt.show()

In [None]:
# create individual box plots and histograms
def histplot_boxplot(data, feature, figsize=(12, 7), bins=None):
    print("Univariates for ...", feature)
    # boxplot on top (25% of the height), histogram below, sharing the x-axis
    fig, (ax_box, ax_hist)=plt.subplots(
        nrows=2,
        sharex=True,
        gridspec_kw={'height_ratios':(0.25, 0.75)},
        figsize=figsize)
    sns.boxplot(data=data, x=feature, color='violet', ax=ax_box, showmeans=True)
    # bins='auto' is the seaborn default, used when no bin count is given
    sns.histplot(data=data, x=feature, ax=ax_hist, bins=bins if bins else 'auto')
    # mark the mean (green, dashed) and the median (black, solid) on the histogram
    ax_hist.axvline(data[feature].mean(), color='green', linestyle='--')
    ax_hist.axvline(data[feature].median(), color='black', linestyle='-')
    plt.show()

In [None]:
for col in df.select_dtypes(exclude='O').columns:
    histplot_boxplot(data=df, feature=col)

In [None]:
# count the outliers in each column using the 1.5*IQR rule
num_outliers={}
for col in df.columns:
    q1=df[col].quantile(0.25)
    q3=df[col].quantile(0.75)
    iqr=q3-q1
    outliers=((df[col]< (q1-1.5*iqr)) | (df[col]>(q3+1.5*iqr)))
    num_outliers[col]=outliers.sum()
num_outliers

In [None]:
# univariate bar chart for categorical Outcome
plt.figure(figsize=(12,7))
ax=sns.countplot(data=df, x='Outcome', color='green')
# annotate each bar with its share of all rows
for p in ax.patches:
    x=p.get_x()+p.get_width()/2
    y=p.get_height()
    ax.annotate("{:.2g}%".format(100.*y/len(df)), (x, y), ha='center', va='bottom')
plt.title('Univariate Bar Chart for Outcome')
plt.show()

**Bivariate Analysis**
- bivariate bar graph

In [None]:
df.columns

In [None]:
cols=['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'DiabetesPedigreeFunction', 'Age']
for col in cols:
    print("Bivariates between Outcome and {}".format(col))
    df.groupby('Outcome')[col].mean().plot(kind='bar', color='orange')
    plt.show()

In [None]:
# pairplot
sns.pairplot(df, hue='Outcome')

In [None]:
plt.figure(figsize=(12,7))
sns.heatmap(df.corr(), annot=True, cmap='Spectral', vmax=+1, vmin=-1)
plt.show()
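Alongside the heatmap, the correlations of each feature with the target can be read off as a sorted list. A minimal sketch on toy data; the helper name `outcome_correlations` and the toy values (including the `Noise` column) are assumptions for illustration:

```python
import pandas as pd


def outcome_correlations(frame, target="Outcome"):
    """Pearson correlation of every other column with the target, highest first."""
    return frame.corr()[target].drop(target).sort_values(ascending=False)


toy = pd.DataFrame({
    "Glucose": [85, 89, 137, 148, 183],
    "Noise": [3, 1, 2, 3, 1],
    "Outcome": [0, 0, 1, 1, 1],
})
ranked = outcome_correlations(toy)
print(ranked)
```

On the notebook's cleaned frame the call would be `outcome_correlations(df)`, which ranks the same values that appear in the Outcome row of the heatmap.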

**Observations**
1. Pregnancies, Insulin, DiabetesPedigreeFunction and Age are right-skewed
2. BloodPressure, Insulin, SkinThickness and DiabetesPedigreeFunction have many outliers
3. Outlier counts have been obtained, but we will not treat these outliers
4. The Outcome variable is imbalanced: about 65% of the women do not have diabetes (500 of 768) and about 35% do (268 of 768)
5. Women with higher Pregnancies, Glucose, BMI, Age and DiabetesPedigreeFunction are more prone to diabetes. To confirm this we use pairplots and heatmaps.
6. Based on the KDE plots, the distributions of Pregnancies, Glucose, Age and DiabetesPedigreeFunction differ markedly between the two outcome classes, suggesting that they are risk factors for diabetes
7. Scatter plots show a strong positive trend between Glucose and Insulin, Glucose and BMI, and Glucose and Age. These may be risk factors for diabetes. We confirm with a heatmap
8. The heatmap shows that Glucose, BMI and Age are risk factors for diabetes
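The class shares can be checked directly from the raw value counts reported in the sanity check (No 470, Yes 248, Tested_Negative 30, Tested_Positive 20). A small worked calculation using the same label mapping as the cleaning step:

```python
# raw label counts as printed by data['Outcome'].value_counts()
raw_counts = {"No": 470, "Yes": 248, "Tested_Negative": 30, "Tested_Positive": 20}
label_map = {"Yes": 1, "Tested_Positive": 1, "No": 0, "Tested_Negative": 0}

# merge the four raw labels into the two classes
class_counts = {0: 0, 1: 0}
for label, n in raw_counts.items():
    class_counts[label_map[label]] += n

total = sum(class_counts.values())
class_share = {k: round(100 * v / total, 1) for k, v in class_counts.items()}

print(class_counts)  # {0: 500, 1: 268}
print(class_share)   # {0: 65.1, 1: 34.9}
```

So class 0 (no diabetes) is the majority class with 500 of 768 rows, and class 1 (diabetes) is the minority with 268 rows.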

### Step5: Data Preprocessing