# End-to-End Model Development and Deployment

Diabetes is a chronic disease that affects millions worldwide.
In particular, we are interested in analysing diabetes in female patients.

**Problem Statement**
Develop a machine learning model to predict diabetes in women and deploy it as a web app in Streamlit.

**Dataset Description**
This is the Pima Indians Diabetes dataset from kaggle.com. It contains records for 768 women of Pima heritage aged 21 years and above. It is an open-source dataset.

**Steps of the Modelling Process**
1. Import all libraries and view the dataset
2. Do the Data Sanity Check
3. Clean the data
4. Perform Exploratory Data Analysis
5. Preprocess the data for modelling
6. Fit and evaluate machine learning models
7. Optimize the best model
8. Interpret the tuned model
9. Prepare for deployment by creating a pipeline
10. Deploy in Streamlit

### Step 1: Import Libraries and the Dataset

In [5]:
# data manipulation and EDA libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")

# data preprocessing libraries
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from imblearn.over_sampling import SMOTE

# data modelling libraries
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier

# data metrics libraries
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score, roc_auc_score
from sklearn.metrics import confusion_matrix, classification_report

# model interpretation and deployment libraries
import shap
import pickle
from sklearn.pipeline import Pipeline
import streamlit as st

print("All libraries are imported")



All libraries are imported


In [4]:
!pip install streamlit

Collecting streamlit
  Using cached streamlit-1.27.1-py2.py3-none-any.whl (7.5 MB)
Collecting gitpython!=3.1.19,<4,>=3.0.7
  Using cached GitPython-3.1.37-py3-none-any.whl (190 kB)
Collecting protobuf<5,>=3.20
  Downloading protobuf-4.24.3-cp310-abi3-win_amd64.whl (430 kB)
     -------------------------------------- 430.5/430.5 kB 5.4 MB/s eta 0:00:00
Collecting cachetools<6,>=4.0
  Using cached cachetools-5.3.1-py3-none-any.whl (9.3 kB)
Collecting blinker<2,>=1.0.0
  Using cached blinker-1.6.2-py3-none-any.whl (13 kB)
Collecting validators<1,>=0.2
  Using cached validators-0.22.0-py3-none-any.whl (26 kB)
Collecting pyarrow>=6.0
  Downloading pyarrow-13.0.0-cp310-cp310-win_amd64.whl (24.3 MB)
     --------------------------------------- 24.3/24.3 MB 21.8 MB/s eta 0:00:00
Collecting tzlocal<6,>=1.1
  Using cached tzlocal-5.0.1-py3-none-any.whl (20 kB)
Collecting altair<6,>=4.0
  Using cached altair-5.1.1-py3-none-any.whl (520 kB)
Collecting pydeck<1,>=0.8.0b4
  Using cached pydeck-0.8.1

In [2]:
!pip install shap

Collecting shap
  Downloading shap-0.42.1-cp310-cp310-win_amd64.whl (462 kB)
     -------------------------------------- 462.3/462.3 kB 4.8 MB/s eta 0:00:00
Collecting slicer==0.0.7
  Using cached slicer-0.0.7-py3-none-any.whl (14 kB)
Installing collected packages: slicer, shap
Successfully installed shap-0.42.1 slicer-0.0.7


In [18]:
!pip install xgboost


Collecting xgboost
  Using cached xgboost-2.0.0-py3-none-win_amd64.whl (99.7 MB)
Installing collected packages: xgboost
Successfully installed xgboost-2.0.0


In [11]:
!pip install imbalanced-learn --quiet


In [4]:
!pip install --upgrade scikit-learn


Collecting scikit-learn
  Downloading scikit_learn-1.3.1-cp38-cp38-win_amd64.whl (9.3 MB)
     ---------------------------------------- 9.3/9.3 MB 7.6 MB/s eta 0:00:00
Installing collected packages: scikit-learn
  Attempting uninstall: scikit-learn
    Found existing installation: scikit-learn 1.0.2
    Uninstalling scikit-learn-1.0.2:
      Successfully uninstalled scikit-learn-1.0.2
Successfully installed scikit-learn-1.3.1


In [6]:
data=pd.read_csv('diabetes.csv')

In [7]:
data.head()

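|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | Yes |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | No |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | Yes |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | No |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | Tested_Positive |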


**Attributes of the data**
1. Pregnancies - The number of times the patient has been pregnant
2. Glucose - The patient's plasma glucose concentration (2-hour oral glucose tolerance test)
3. BloodPressure - Diastolic blood pressure (mm Hg)
4. SkinThickness - Triceps skin fold thickness (mm)
5. Insulin - The serum insulin level of the patient
6. BMI - Body Mass Index (weight in kg / (height in m)^2), a measure of obesity
7. DiabetesPedigreeFunction - A score for genetic propensity towards diabetes based on family history
8. Age - Age of the patient in years
9. Outcome - The target variable, intended to have two levels (Yes/No)
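The BMI formula in the attribute list can be checked with a few lines of Python. This is only an illustrative sketch: the function name `bmi` and the example weight and height are assumptions made for the demonstration and do not come from the dataset.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body Mass Index: weight in kilograms divided by height in metres squared."""
    if height_m <= 0:
        raise ValueError("height_m must be positive")
    return weight_kg / height_m ** 2


# Hypothetical patient (made-up values): 70 kg, 1.75 m tall
example_bmi = bmi(70.0, 1.75)
print(round(example_bmi, 1))
```

A BMI of 0, as seen in some rows of this dataset, cannot result from this formula for a real patient, which is why such values are treated as missing later on.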

### Step 2: Data Sanity Check
- get the basic info of the data
- look for null values
- look for duplicate rows
- look for corrupted data
- get the data summary statistics (both numerical and categorical)
- look for erroneous values in the data
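The checks listed above could also be gathered into one small helper. The sketch below is not the notebook's own code: the function name `sanity_report` and the toy DataFrame are assumptions for illustration, and the cells that follow run the same checks one at a time on the real data.

```python
import numpy as np
import pandas as pd


def sanity_report(frame: pd.DataFrame) -> dict:
    """Collect basic sanity checks for a DataFrame into one dictionary."""
    numeric = frame.select_dtypes(include="number")
    return {
        "shape": frame.shape,
        "null_counts": frame.isnull().sum().to_dict(),
        "duplicate_rows": int(frame.duplicated().sum()),
        # non-numeric columns may hide stray text or inconsistent labels
        "non_numeric_columns": frame.select_dtypes(exclude="number").columns.tolist(),
        # numeric columns whose minimum is 0 deserve a closer look
        "zero_min_columns": [c for c in numeric.columns if numeric[c].min() == 0],
    }


# Tiny made-up frame (not the real dataset) to show what the report contains
toy = pd.DataFrame({
    "Glucose": [148, 0, 183, 148],
    "BMI": [33.6, 26.6, np.nan, 33.6],
    "Outcome": ["Yes", "No", "Tested_Positive", "Yes"],
})
report = sanity_report(toy)
print(report)
```

On the real data the call would be `sanity_report(data)`; the summary statistics from `describe()` are still worth reading separately.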

In [8]:
#get the shape of the data
data_shape=data.shape
print('Rows =',data_shape[0],'Columns=',data_shape[1])

Rows = 768 Columns= 9


In [9]:
#get the basic info (data.info() prints its report and returns None, so info is None)
info=data.info()

#get the data type
dtype=data.dtypes

info,dtype

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    object 
dtypes: float64(2), int64(6), object(1)
memory usage: 54.1+ KB


(None,
 Pregnancies                   int64
 Glucose                       int64
 BloodPressure                 int64
 SkinThickness                 int64
 Insulin                       int64
 BMI                         float64
 DiabetesPedigreeFunction    float64
 Age                           int64
 Outcome                      object
 dtype: object)

In [10]:
#check for unique levels in categorical
data['Outcome'].nunique()

4

In [11]:
#get the value counts of target column
data['Outcome'].value_counts()

No                 470
Yes                248
Tested_Negative     30
Tested_Positive     20
Name: Outcome, dtype: int64

In [16]:
#Check for nulls and duplicates
nulls=data.isnull().sum()

dups=data.duplicated().sum()

nulls,dups

(Pregnancies                 0
 Glucose                     0
 BloodPressure               0
 SkinThickness               0
 Insulin                     0
 BMI                         0
 DiabetesPedigreeFunction    0
 Age                         0
 Outcome                     0
 dtype: int64,
 0)

In [19]:
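#Look for corrupt characters: select rows in which no cell is a real number
#(axis is passed by keyword because newer pandas versions reject any(1))
data[~data.applymap(np.isreal).any(axis=1)]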

|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|


In [21]:
#Summary statistics of numerical and categorical data
num_stats=data.describe().T #transpose so that each variable is a row and each statistic a column
cat_stats=data.describe(include='O').T
num_stats,cat_stats

(                          count        mean         std     min       25%  \
 Pregnancies               768.0    3.845052    3.369578   0.000   1.00000   
 Glucose                   768.0  120.894531   31.972618   0.000  99.00000   
 BloodPressure             768.0   69.105469   19.355807   0.000  62.00000   
 SkinThickness             768.0   20.536458   15.952218   0.000   0.00000   
 Insulin                   768.0   79.799479  115.244002   0.000   0.00000   
 BMI                       768.0   31.992578    7.884160   0.000  27.30000   
 DiabetesPedigreeFunction  768.0    0.471876    0.331329   0.078   0.24375   
 Age                       768.0   33.240885   11.760232  21.000  24.00000   
 
                                50%        75%     max  
 Pregnancies                 3.0000    6.00000   17.00  
 Glucose                   117.0000  140.25000  199.00  
 BloodPressure              72.0000   80.00000  122.00  
 SkinThickness              23.0000   32.00000   99.00  
 Insulin   

**Data Summary**
1. The dataset has 768 rows and 9 columns.
2. The dataset has 8 numerical variables (int64 and float64) and one categorical variable (Outcome).
3. **The categorical variable Outcome has 4 levels, which we need to clean and reduce to 2 levels (Yes=1/No=0).**
4. There are no missing values or duplicate rows.
5. There are no corrupt characters in the data.
6. **Several columns (for example Glucose, BloodPressure, SkinThickness, Insulin and BMI) have a minimum value of 0, which is not physiologically feasible, so we have to impute these zeros with the column median.**

### Step 3: Data Cleaning
- encode categorical Outcome variable
- impute columns with minimum value 0
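The two cleaning steps above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the notebook's own implementation: the names `LABEL_MAP`, `ZERO_AS_MISSING` and `clean` are hypothetical, the list of columns treated as "0 means missing" is an assumption, and the median is computed from the non-zero values only, which is one possible reading of "column median".

```python
import pandas as pd

# Assumed mapping of the four observed Outcome labels to a binary target
LABEL_MAP = {"Yes": 1, "Tested_Positive": 1, "No": 0, "Tested_Negative": 0}

# Assumed list of columns in which 0 stands for a missing measurement
ZERO_AS_MISSING = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]


def clean(frame: pd.DataFrame, zero_cols=ZERO_AS_MISSING) -> pd.DataFrame:
    """Return a cleaned copy: binary Outcome and zeros replaced by the median."""
    out = frame.copy()
    # labels missing from LABEL_MAP would become NaN and should be inspected
    out["Outcome"] = out["Outcome"].map(LABEL_MAP)
    for col in zero_cols:
        if col in out.columns:
            values = out[col].astype(float)
            # assumption: median of the non-zero values only
            median = values[values != 0].median()
            out[col] = values.where(values != 0, median)
    return out


# Tiny made-up frame (not the real dataset)
toy = pd.DataFrame({
    "Glucose": [148, 0, 183, 89],
    "BMI": [33.6, 26.6, 0.0, 28.1],
    "Outcome": ["Yes", "No", "Tested_Positive", "Tested_Negative"],
})
cleaned = clean(toy)
print(cleaned)
```

Working on a copy keeps the raw data untouched, so the cleaned result can always be compared against the original values.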

In [22]:
data.columns

Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'],
      dtype='object')

In [23]:
#Create a copy of the data
df=data.copy()