# End to End Machine Learning With Deployment

### Part 1 - EDA of the Medical Dataset
1. Import the libraries
2. Load and view the data
3. Clean the data
4. Complete EDA of the data (deploy an EDA page in Streamlit)

### Part 2 - Modelling of the dataset
5. Preprocessing for modelling
6. Fit and evaluate various models
7. Optimize the chosen model
8. Interpret the model
9. Create a pipeline for the model
10. Pickle the model
11. Deploy the model in Streamlit

### Step1. Import the libraries

In [1]:
!python -V

Python 3.8.13


In [3]:
from platform import python_version

In [4]:
print(python_version())

3.10.9


In [6]:
pip install imbalanced-learn

Note: you may need to restart the kernel to use updated packages.


In [2]:
pip install xgboost

Note: you may need to restart the kernel to use updated packages.


In [7]:
pip install streamlit

Collecting streamlit
  Using cached streamlit-1.22.0-py2.py3-none-any.whl (8.9 MB)
Collecting cachetools>=4.0
  Using cached cachetools-5.3.0-py3-none-any.whl (9.3 kB)
Collecting rich>=10.11.0
  Using cached rich-13.3.5-py3-none-any.whl (238 kB)
Collecting protobuf<4,>=3.12
  Using cached protobuf-3.20.3-cp310-cp310-win_amd64.whl (904 kB)
Collecting pyarrow>=4.0
  Using cached pyarrow-12.0.0-cp310-cp310-win_amd64.whl (21.5 MB)
Collecting tzlocal>=1.1
  Using cached tzlocal-5.0.1-py3-none-any.whl (20 kB)
Collecting gitpython!=3.1.19
  Using cached GitPython-3.1.31-py3-none-any.whl (184 kB)
Collecting blinker>=1.0.0
  Using cached blinker-1.6.2-py3-none-any.whl (13 kB)
Collecting pympler>=0.9
  Using cached Pympler-1.0.1-py3-none-any.whl (164 kB)
Collecting altair<5,>=3.2.0
  Using cached altair-4.2.2-py3-none-any.whl (813 kB)
Collecting validators>=0.2
  Using cached validators-0.20.0-py3-none-any.whl
Collecting pydeck>=0.1.dev5
  Using cached pydeck-0.8.1b0-py2.py3-none-any.whl (4.8 MB

In [8]:
pip install --upgrade scikit-learn

Collecting scikit-learn
  Using cached scikit_learn-1.2.2-cp310-cp310-win_amd64.whl (8.3 MB)
Installing collected packages: scikit-learn
  Attempting uninstall: scikit-learn
    Found existing installation: scikit-learn 1.2.1
    Uninstalling scikit-learn-1.2.1:
      Successfully uninstalled scikit-learn-1.2.1
Successfully installed scikit-learn-1.2.2
Note: you may need to restart the kernel to use updated packages.


In [None]:
conda update -c conda-forge scikit-learn

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

plt.style.use('fivethirtyeight')

# libraries for preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from imblearn.over_sampling import SMOTE

# libraries for modeling
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier, AdaBoostClassifier
from xgboost import XGBClassifier
from sklearn.svm import SVC

# libraries for model evaluation
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
from sklearn.metrics import confusion_matrix, classification_report
# plot_confusion_matrix was removed in scikit-learn 1.2; the Display classes replace it
from sklearn.metrics import ConfusionMatrixDisplay, RocCurveDisplay, PrecisionRecallDisplay

# library for deployment
import streamlit as st

print("All needed libraries are imported")

### Step2. Load and View the data

In [None]:
data=pd.read_csv('data.csv')
data.head(10)

In [None]:
data.shape

In [None]:
data.dtypes

In [None]:
data.isnull().sum()

In [None]:
data.duplicated().sum()

In [None]:
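# rows in which no value is a real number (the axis must be passed by keyword in recent pandas)
data[~data.applymap(np.isreal).any(axis=1)]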

In [None]:
data.describe().T

In [None]:
data.columns

**Observations**
1. The data has 768 rows and 10 columns
2. The first column, 'Unnamed: 0', is redundant
3. All columns are numerical except Outcome
4. There are no explicit nulls in the data
5. However, missing values are present, encoded as 0s
6. There are no duplicates or corrupt characters
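
Observation 5 can be verified by counting zeros in the columns where a zero is not a plausible measurement. The sketch below runs on a small made-up frame rather than `data.csv`; the helper name `count_zero_placeholders` and the sample values are assumptions for illustration only.

```python
import pandas as pd

# Toy frame standing in for the real data (values are made up for illustration)
toy = pd.DataFrame({
    'Glucose': [148, 0, 183, 89],
    'BloodPressure': [72, 66, 0, 0],
    'BMI': [33.6, 26.6, 23.3, 28.1],
})

def count_zero_placeholders(frame, columns):
    """Return the number of zeros per column (hypothetical helper)."""
    return (frame[columns] == 0).sum()

zero_counts = count_zero_placeholders(toy, ['Glucose', 'BloodPressure', 'BMI'])
print(zero_counts)
```

On the real data, the same call with the relevant column names would show how many entries need imputation in each column.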


### Step 3. Clean the data

In [None]:
# remove the redundant columns
data.drop('Unnamed: 0', axis=1, inplace=True)

In [None]:
data.head(2)

In [None]:
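# treat zeros in the columns at positions 1 to 5 as missing and replace them with the column median
zerofill=lambda x:x.replace(0, x.median())
cols=data.columns[1:6]
data[cols]=data[cols].apply(zerofill, axis=0)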
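
Note that `x.median()` in the cell above is computed with the zeros still in the column, so a column with many zeros gets a median that is pulled toward zero. One possible alternative, shown as a sketch on made-up values, is to mark zeros as missing first and fill them with the median of the observed entries. The helper name `fill_zeros_with_observed_median` is hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up values: half of the entries are zero placeholders
toy = pd.DataFrame({'Insulin': [0, 0, 0, 94, 168, 88]})

def fill_zeros_with_observed_median(col):
    """Replace zeros with the median of the non-zero entries (hypothetical helper)."""
    observed = col.replace(0, np.nan)          # treat zeros as missing
    return observed.fillna(observed.median())  # median ignores NaN by default

filled = toy.apply(fill_zeros_with_observed_median, axis=0)
print(filled['Insulin'].tolist())
```

Whether this is preferable depends on the column; it mainly matters for features where a large share of the values is zero.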

In [None]:
data.describe().T


In [None]:
d={'Yes':1, 'No':0}
df=data.copy()
df['Outcome']=df['Outcome'].map(d)

In [None]:
df.head(2)

### Step4. Complete EDA of the data (deploy an EDA page in Streamlit)

**Univariate Analysis**
1. Numerical: histograms and boxplots
2. Categorical: bar charts

**Bivariate Analysis**
1. Categorical vs numerical bar chart
2. Scatter plots and line plots
3. Pairplots

**Correlations**
1. Correlation Matrix
2. Heatmap

**Univariate Analysis**

In [None]:
def histograms(data):
    print('Histograms')
    data.hist()
    plt.tight_layout()
    plt.show()

In [None]:
histograms(df)

In [None]:
def boxplot_histplot(data, feature, bins=None, figsize=(12,7)):
    print("Boxplot and Histplot for", feature)
    # boxplot on top, histogram below, sharing the x-axis
    fig, (ax_box, ax_hist)=plt.subplots(
        nrows=2,
        sharex=True,
        gridspec_kw={'height_ratios':(0.25, 0.75)},
        figsize=figsize)

    sns.boxplot(data=data, x=feature, showmeans=True, color='orange', ax=ax_box)
    # fall back to seaborn's automatic binning when bins is not given
    # (the misspelled, invalid 'pallete' argument has been dropped)
    sns.histplot(data=data, x=feature, bins=bins if bins else 'auto', ax=ax_hist)
    # dashed green line = mean, solid black line = median
    ax_hist.axvline(data[feature].mean(), color='green', linestyle='--')
    ax_hist.axvline(data[feature].median(), color='black', linestyle='-')
    plt.show()

In [None]:
for col in data.select_dtypes(exclude='O').columns:
    boxplot_histplot(data=data, feature=col, bins=None, figsize=(12,7))

In [None]:
def barchart(data, feature):
    print("Univariate Countplot of ", feature)
    plt.figure(figsize=(12,7))
    ax=sns.countplot(data=data, x=feature, color='green')
    for p in ax.patches:
        x=p.get_bbox().get_points()[:,0]
        y=p.get_bbox().get_points()[1,1]
        ax.annotate("{:.3g}%".format(100.*y/len(data)), (x.mean(), y), ha='center', va='bottom')
    plt.show()

In [None]:
barchart(data=data, feature='Outcome')

**Observations**
1. Insulin, DPF and Age are highly right-skewed with a large number of outliers (we may need a data transformation such as log)
2. Age and Pregnancies are also right-skewed, with some extreme values which may be legitimate (need to consult domain experts)
3. The Outcome variable is imbalanced (Yes:No = 1:2); we need to address the imbalance before modelling
4. Missing values have been taken care of
5. Label encoding is done
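
The log transformation suggested in observation 1 could look like the sketch below. It uses a synthetic log-normal sample in place of a real column such as Insulin, and `np.log1p` (log of 1 + x) so that zero values remain valid; the sample parameters are arbitrary.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic right-skewed sample standing in for a feature such as Insulin
skewed = pd.Series(rng.lognormal(mean=4.0, sigma=0.8, size=1000))

# log(1 + x) compresses the long right tail and is defined at x = 0
transformed = np.log1p(skewed)

print('skew before:', round(skewed.skew(), 2))
print('skew after :', round(transformed.skew(), 2))
```

A skewness close to zero after the transformation indicates a more symmetric distribution, which tends to suit distance-based and linear models better.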

**Bivariate Analysis**
1. Categorical vs Numerical barchart
2. Scatter plots and Line plots 
3. Pairplots 

In [None]:
def catnum(data, feature1, feature2):
    print("Bivariate Barchart between {0} and {1}".format(feature1, feature2))
    data.groupby(feature1)[feature2].mean().plot(kind='bar', color='orange')
    plt.ylabel(feature2)
    plt.show()
    

In [None]:
for col in data.select_dtypes(exclude='O').columns:
    catnum(data=data, feature1='Outcome', feature2=col)

In [None]:
def lineplot_scatterplot(data, feature1, feature2):
    plt.figure(figsize=(16,7))
    print("Bivariate Charts for {0} and {1}".format(feature1, feature2))
    plt.subplot(1,2,1)
    sns.lineplot(data=data, x=feature1, y=feature2, color='green')
    plt.title('Lineplot between features')
    
    plt.subplot(1,2,2)
    sns.scatterplot(data=data, x=feature1, y=feature2, color='orange')
    plt.title('Scatterplot between features')
    plt.show()
    

In [None]:
for col in df.select_dtypes(exclude='O').columns:
    lineplot_scatterplot(data=df, feature1='Glucose', feature2=col)

In [None]:
sns.pairplot(df, kind='reg')

**Correlations**
1. Correlation Matrix
2. Heatmap

In [None]:
df.corr()

In [None]:
df.corr()['Outcome']

In [None]:
plt.figure(figsize=(12,7))
sns.heatmap(df.corr(), annot=True, cmap='Spectral', vmin=-1, vmax=+1)

**Observations of Bivariate Analysis**
1. Women with higher Pregnancies, Glucose, DPF and Insulin are more likely to be diabetic
2. Glucose and Insulin, and BMI and SkinThickness, appear to have high multicollinearity
3. Glucose and BMI appear to be the strongest predictors of diabetes
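
To back observation 2 with numbers, one option is to list the feature pairs whose absolute correlation exceeds a threshold. This is a sketch on synthetic data; the function name `correlated_pairs` and the 0.5 threshold are illustrative choices, not values taken from this dataset.

```python
import numpy as np
import pandas as pd

def correlated_pairs(frame, threshold=0.5):
    """List feature pairs whose absolute Pearson correlation exceeds threshold."""
    corr = frame.corr().abs()
    # keep the strict upper triangle so each pair appears only once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

rng = np.random.default_rng(1)
a = rng.normal(size=200)
toy = pd.DataFrame({
    'a': a,
    'b': a + rng.normal(scale=0.3, size=200),   # strongly tied to a
    'c': rng.normal(size=200),                  # independent noise
})

high = correlated_pairs(toy, threshold=0.5)
print(high)
```

Applied to `df.drop('Outcome', axis=1)`, the same helper would return the pairs that deserve a closer look before modelling.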

### App for EDA 

In [None]:
!pip install streamlit

In [None]:
%%writefile eda.py
import streamlit as st
st.title("The EDA Page")

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
plt.style.use('fivethirtyeight')

# Step 2. Load and view the data

data=pd.read_csv('data.csv')
st.subheader('Data View')
st.write(data.head())

st.subheader('Descriptives')
st.write(data.describe().T)

st.subheader('Histograms')
data.hist()
plt.tight_layout()
# pass the figure explicitly; calling st.pyplot() without a figure is deprecated
st.pyplot(plt.gcf())

# End of Part 1

### Step 5. Data preprocessing
1. Separate the features and the label
2. Null value imputation
3. Label encoding
4. Handling class imbalance
5. Train-test split
6. Feature scaling
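
The order of these steps matters for steps 4 to 6: resampling and scaling are usually fitted on the training part only, so that the test part stays untouched. The sketch below illustrates that order on synthetic data. To keep it dependent on scikit-learn alone, it uses plain random oversampling via `sklearn.utils.resample` as a stand-in for SMOTE; all array names are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.utils import resample

rng = np.random.default_rng(42)
X = rng.normal(size=(300, 4))
y = np.array([1] * 100 + [0] * 200)          # 1:2 imbalance, like Yes:No

# 1. split first, keeping the class ratio in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# 2. oversample the minority class in the training part only
#    (random oversampling here; SMOTE would take this place)
minority, majority = X_tr[y_tr == 1], X_tr[y_tr == 0]
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=42)
X_bal = np.vstack([majority, minority_up])
y_bal = np.array([0] * len(majority) + [1] * len(minority_up))

# 3. fit the scaler on the training part, then apply it to the test part
scaler = StandardScaler().fit(X_bal)
X_bal_s, X_te_s = scaler.transform(X_bal), scaler.transform(X_te)

print(np.bincount(y_bal), np.bincount(y_te))
```

The training classes end up balanced while the test set keeps the original class ratio, which is what the model will face after deployment.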

In [None]:
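def preprocess(data,label):
    # separate the features and the label
    x=data.drop(label,axis=1)
    y=data[label]
    # split first so the test set contains only real, untouched samples
    x_train,x_test,y_train,y_test=train_test_split(
        x,y,test_size=0.2,random_state=42,stratify=y)
    # oversample the minority class in the training set only (avoids leakage)
    sm=SMOTE(random_state=42)
    x_train,y_train=sm.fit_resample(x_train,y_train)
    return x_train,x_test,y_train,y_test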

In [None]:
x_train,x_test,y_train,y_test=preprocess(df,'Outcome')

In [None]:
sc=StandardScaler()
x_train=sc.fit_transform(x_train)
x_test=sc.transform(x_test)

### Step 6. Fit and evaluate models
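
The models in this step are compared on accuracy, recall, precision and F1. As a reminder of what the last three measure, here is a minimal pure-Python sketch (the helper `binary_metrics` is hypothetical and not part of scikit-learn). It also shows why the true labels must be passed first: swapping the arguments exchanges precision and recall.

```python
def binary_metrics(y_true, y_pred):
    """Precision, recall and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]

correct_order = binary_metrics(y_true, y_pred)   # true labels first
swapped_order = binary_metrics(y_pred, y_true)   # arguments swapped
print(correct_order)
print(swapped_order)
```

For a medical screening task, recall on the positive class is often the metric to watch, since a missed case is usually costlier than a false alarm.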

In [None]:
def print_metrics(y_test,y_pred,model_name):
    print('the results of the model',model_name)
    print('')
    print('accuracy_score=',accuracy_score(y_test,y_pred))
    print('')
    print('recall_score=',recall_score(y_test,y_pred))
    print('')
    print('precision=',precision_score(y_test,y_pred))
    print('')
    print('f1_score=',f1_score(y_test,y_pred))
    

In [None]:
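def plot_metrics(clf,x_test,y_test,model_name):
    # plot_confusion_matrix, plot_roc_curve and plot_precision_recall_curve were
    # removed in scikit-learn 1.2; the Display classes below replace them
    from sklearn.metrics import ConfusionMatrixDisplay, RocCurveDisplay, PrecisionRecallDisplay
    ConfusionMatrixDisplay.from_estimator(clf,x_test,y_test,display_labels=[0,1])
    RocCurveDisplay.from_estimator(clf,x_test,y_test)
    PrecisionRecallDisplay.from_estimator(clf,x_test,y_test)
    plt.show()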

In [None]:
x=df.drop('Outcome',axis=1)
y=df['Outcome']
# class counts before resampling
y.value_counts()

In [None]:
sm=SMOTE()
x,y=sm.fit_resample(x,y)

In [None]:
y.value_counts()

In [None]:
#fit the knn method
knn=KNeighborsClassifier()
knn.fit(x_train,y_train)
y_pred=knn.predict(x_test)
print_metrics(y_test,y_pred,'KNN')

In [None]:
plot_metrics(knn,x_test, y_test,'KNN')

In [None]:
#optimize k
neighbors=np.arange(1,20)
train_accuracies=np.empty(len(neighbors))
test_accuracies=np.empty(len(neighbors))

for i,k in enumerate(neighbors):
    knn=KNeighborsClassifier(n_neighbors=k)
    knn.fit(x_train,y_train)
    train_accuracies[i]=knn.score(x_train,y_train)
    test_accuracies[i]=knn.score(x_test,y_test)

plt.plot(neighbors, train_accuracies, label='Train_accuracies')
plt.plot(neighbors, test_accuracies, label='Test_accuracies')
plt.legend()
plt.title("Model Complexity Curves")
plt.xlabel("Number of neighbors")
plt.ylabel("Accuracy")
plt.show()
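
The curve above picks k by looking at test accuracy, which uses the test set for tuning. A common alternative is to choose k by cross-validation on the training data only. The following sketch assumes synthetic data from `make_classification` in place of `x_train` and `y_train`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the scaled training data
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

neighbors = np.arange(1, 20)
# mean 5-fold cross-validated accuracy for each candidate k
cv_means = np.array([
    cross_val_score(KNeighborsClassifier(n_neighbors=int(k)),
                    X_demo, y_demo, cv=5).mean()
    for k in neighbors
])

best_k = int(neighbors[cv_means.argmax()])
print('best k by 5-fold CV:', best_k)
```

The test set is then used once at the end, to report the performance of the chosen k.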

In [None]:
#refit the model

knn=KNeighborsClassifier(n_neighbors=16)
knn.fit(x_train,y_train)
y_pred=knn.predict(x_test)
print_metrics(y_test,y_pred,'KNN')

# Fit and evaluate all the models and choose the best to deploy

In [None]:

clfs = {
    'LogisticRegression': LogisticRegression(),
    'Naive Bayes': GaussianNB(),
    'KNN':KNeighborsClassifier(),
    'DecisionTreeClassifier': DecisionTreeClassifier(),
    'RandomForestClassifier':RandomForestClassifier(),
    'AdaBoostClassifier':AdaBoostClassifier(),
    'GradientBoostingClassifier':GradientBoostingClassifier(),
    'XGBClassifier':XGBClassifier(),
    'SVM':SVC()
}

# collect one row of metrics per model
rows = []

# fit and evaluate each model
for clf_name, clf in clfs.items():
    print('Fitting Classifier....', clf_name)
    clf.fit(x_train, y_train)
    y_pred=clf.predict(x_test)
    rows.append({
        'Model_name':clf_name,
        'Accuracy':accuracy_score(y_test, y_pred),
        'Recall':recall_score(y_test, y_pred),
        'Precision':precision_score(y_test, y_pred),
        'f1_score':f1_score(y_test, y_pred)
    })

# DataFrame.append was removed in pandas 2.0, so build the frame in one go
models_report = pd.DataFrame(rows).sort_values(by='f1_score', ascending=False)
models_report

In [None]:
rfc=RandomForestClassifier()
rfc.fit(x_train,y_train)

In [None]:
#optimize the model using GridSearchCV
param_grid={
    'n_estimators':[100,150,200,250,300],
    'min_samples_leaf':range(1,5,1),
    'min_samples_split':range(2,10,2),
    'max_depth':[1,2,3,4,5],
    'criterion':['entropy','gini'],
    
}
n_folds=3
cv=GridSearchCV(estimator=rfc,param_grid=param_grid,cv=n_folds,n_jobs=-1,verbose=5,return_train_score=False)
cv.fit(x_train,y_train)
cv.best_score_
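
Before running a grid search of this size it can help to count how many model fits it implies. The sketch below repeats the same `param_grid` and uses `ParameterGrid` from scikit-learn to count the candidates; the variable names `n_candidates` and `n_fits` are illustrative.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    'n_estimators': [100, 150, 200, 250, 300],
    'min_samples_leaf': range(1, 5, 1),
    'min_samples_split': range(2, 10, 2),
    'max_depth': [1, 2, 3, 4, 5],
    'criterion': ['entropy', 'gini'],
}
n_folds = 3

# every combination of parameter values is one candidate model
n_candidates = len(ParameterGrid(param_grid))
# each candidate is fitted once per cross-validation fold
n_fits = n_candidates * n_folds
print(n_candidates, 'candidates,', n_fits, 'cross-validation fits')
```

With the default `refit=True`, `GridSearchCV` additionally fits the best candidate once more on the whole training set. If the count is too high for the available time, a smaller grid or `RandomizedSearchCV` are possible alternatives.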

In [None]:
# retrieve the best model found by the grid search; it is interpreted below
rfc_tuned=cv.best_estimator_
rfc_tuned

In [None]:
!pip install shap

In [None]:
# feature importance / model interpretation

import shap
# explain the tuned model; the SHAP values and the features must come from the same data
value=shap.TreeExplainer(rfc_tuned).shap_values(x_test)
shap.summary_plot(value, x_test, plot_type='bar', feature_names=x.columns)

In [None]:
# Create the deployment model pipeline

from sklearn.pipeline import Pipeline

In [None]:
sc=StandardScaler()
rfc=rfc_tuned
steps=[('sc',sc),('rfc',rfc)]
pipeline=Pipeline(steps)
x_train,x_test,y_train,y_test=preprocess(df,'Outcome')
pipeline.fit(x_train,y_train)
y_pred=pipeline.predict(x_test)
print_metrics(y_test,y_pred,'Pipeline RFC')

In [None]:
# freeze (serialise) the fitted pipeline to disk
import pickle
with open('rfc.pickle','wb') as model:
    pickle.dump(pipeline,model)
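
A quick way to check that a pickled pipeline behaves like the original is a round trip: serialise, load back, and compare predictions. This sketch uses a small synthetic dataset and swaps in `LogisticRegression` for the random forest to keep it fast; `demo_pipeline` and `restored` are illustrative names.

```python
import pickle

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
demo_pipeline = Pipeline([('sc', StandardScaler()),
                          ('clf', LogisticRegression())]).fit(X_demo, y_demo)

# serialise to bytes and load back (a file opened with 'wb'/'rb' works the same way)
restored = pickle.loads(pickle.dumps(demo_pipeline))

same = np.array_equal(demo_pipeline.predict(X_demo), restored.predict(X_demo))
print('predictions identical after round trip:', same)
```

The scikit-learn documentation advises loading a pickle with the same library version that created it, so it is worth pinning the version in the environment where the app runs.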

# Deploy the pickled model

In [None]:
df.columns

In [None]:
%%writefile app.py
import streamlit as st
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
import pickle
st.title('Medical Diagnostic Prediction App')
st.markdown('Does the person have diabetes?')

# step 1: load the trained model
with open('rfc.pickle','rb') as model:
    clf=pickle.load(model)

#step2: get the user input from frontend

pregs=st.number_input('Pregnancies',0,20,step=1)
glucose=st.slider('Glucose',42,200,42)
bp=st.slider('BloodPressure',20,140,20)
skin=st.slider('SkinThickness',7,99,7)
insulin=st.slider('Insulin',14,850,14)
bmi=st.slider('BMI',18,70,18)
dpf=st.slider('DiabetesPedigreeFunction',0.05,2.50,0.05)
age=st.slider('Age',21,90,21)


# step 3: convert user input to model input

data={
    'Pregnancies':pregs,
    'Glucose':glucose,
    'BloodPressure':bp,
    'SkinThickness':skin,
    'Insulin':insulin,
    'BMI':bmi,
    'DiabetesPedigreeFunction':dpf,
    'Age':age
}

input_data=pd.DataFrame([data])

# step 4: get the prediction and show the result when the button is clicked
if st.button("Predict"):
    prediction = clf.predict(input_data)[0]
    if prediction==0:
        st.write("The Person is Healthy")
    else:
        st.write("The Person has Diabetes")
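
Because the pipeline was fitted on a DataFrame, the input row should carry the same column names in the same order as the training features. The helper below is a hypothetical sketch (`build_input_row` and `FEATURE_ORDER` are not part of the app) that reorders a dictionary of inputs and fails early when a feature is missing. The feature list is copied from the dictionary in `app.py`.

```python
import pandas as pd

# Training feature order, copied from the input dictionary in app.py
FEATURE_ORDER = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness',
                 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']

def build_input_row(values, feature_order=FEATURE_ORDER):
    """Return a one-row DataFrame with columns in training order (hypothetical helper)."""
    missing = [name for name in feature_order if name not in values]
    if missing:
        raise KeyError(f'missing features: {missing}')
    return pd.DataFrame([values])[feature_order]

# keys deliberately given in a different order than the model expects
row = build_input_row({'Age': 33, 'BMI': 28, 'Glucose': 120, 'Pregnancies': 2,
                       'BloodPressure': 70, 'SkinThickness': 20,
                       'Insulin': 80, 'DiabetesPedigreeFunction': 0.5})
print(row.columns.tolist())
```

In the app, the result of such a helper could be passed to `clf.predict` in place of `input_data`.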