# Target Problem
## Dataset
#### - Business Case
I have created a diabetes prediction model that determines whether or not a person has diabetes. It uses a ready-made Kaggle dataset that includes BMI, blood pressure, age, blood glucose, insulin, number of pregnancies, and other measurements. These details will be collected through web forms for a small fee, and the model will predict whether or not the person is diabetic. This is more convenient for clients than hard-to-get endocrinologist appointments.

#### - Link to database: https://www.kaggle.com/datasets/mathchi/diabetes-data-set?resource=download
## Machine Learning Problem
We will use the body measurements provided in the dataset (age, glucose, blood pressure, insulin, BMI, and others) to predict whether a person has diabetes. Because there are only two possible outcomes (0 or 1), this is a binary classification problem. Since the dependent variable (target) is categorical, i.e. 1 for diabetic and 0 for non-diabetic, I will use a Logistic Regression model.
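To make the link between Logistic Regression and the 0/1 outcome concrete, here is a minimal sketch of how such a model turns a weighted sum of features into a class label. The weights, bias and feature values are hypothetical and chosen only for illustration; they are not taken from the fitted model in this notebook.

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients and one standardised feature vector (illustrative only).
weights = np.array([0.8, -0.3])
bias = -0.1
features = np.array([1.5, 0.5])

# Linear score -> probability -> class label using a 0.5 threshold.
probability = sigmoid(weights @ features + bias)
prediction = int(probability >= 0.5)  # 1 = diabetic, 0 = non-diabetic
print(probability, prediction)
```

The 0.5 threshold is the default used by scikit-learn's predict method; a different threshold could be applied to the predicted probabilities if needed.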

In [1]:
# Importing libraries
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score,confusion_matrix

import warnings
warnings.filterwarnings('ignore')

# Automate ML Workflow 
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

# Hyperparameter tuning
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn import model_selection
from mlflow import log_metric, log_param
import mlflow
import mlflow.sklearn

# DVC
import dvc.api
from io import StringIO

In [2]:
# Reading the data set
df = pd.read_csv("diabetes.csv")

## Data Exploration

In [3]:
# Printing the dataframe's first five rows.
df.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [7]:
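# Summary statistics: every count is 768, so there are no nulls, but minimums of 0
# in columns such as Glucose, BloodPressure and BMI point to missing values stored as zeros.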
df.describe()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
count,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0
mean,3.845052,120.894531,69.105469,20.536458,79.799479,31.992578,0.471876,33.240885,0.348958
std,3.369578,31.972618,19.355807,15.952218,115.244002,7.88416,0.331329,11.760232,0.476951
min,0.0,0.0,0.0,0.0,0.0,0.0,0.078,21.0,0.0
25%,1.0,99.0,62.0,0.0,0.0,27.3,0.24375,24.0,0.0
50%,3.0,117.0,72.0,23.0,30.5,32.0,0.3725,29.0,0.0
75%,6.0,140.25,80.0,32.0,127.25,36.6,0.62625,41.0,1.0
max,17.0,199.0,122.0,99.0,846.0,67.1,2.42,81.0,1.0


# Automate ML Workflow 

In [8]:
# Checking which columns have zero values
for colname in df.columns[:8]:
    print('0s in "{variable}": {count}'.format(
        variable=colname,
        count=np.count_nonzero(df[colname] == 0)))

0s in "Pregnancies": 111
0s in "Glucose": 5
0s in "BloodPressure": 35
0s in "SkinThickness": 227
0s in "Insulin": 374
0s in "BMI": 11
0s in "DiabetesPedigreeFunction": 0
0s in "Age": 0
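The same per-column zero count can also be obtained without an explicit loop. The sketch below uses a small made-up DataFrame (not the real dataset) to show the idea:

```python
import pandas as pd

# Toy data standing in for a few columns of the diabetes dataset (values are illustrative).
toy = pd.DataFrame({
    "Glucose": [148, 0, 183],
    "Insulin": [0, 0, 94],
    "Age": [50, 31, 32],
})

# Comparing the whole frame to 0 gives a boolean frame; summing counts the True values per column.
zero_counts = (toy == 0).sum()
print(zero_counts)
```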


In [9]:
# Replacing all zero values with NaN in our dataset.
# Note: a zero in "Pregnancies" can be a genuine value, but it is treated as missing here as well.
cols = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]
df[cols] = df[cols].replace(['0', 0], np.nan)

In [10]:
# Checking again whether any columns still have zero values.
for colname in df.columns[:8]:
    print('{variable}: {count}'.format(
        variable=colname,
        count=np.count_nonzero(df[colname] == 0)))

Pregnancies: 0
Glucose: 0
BloodPressure: 0
SkinThickness: 0
Insulin: 0
BMI: 0
DiabetesPedigreeFunction: 0
Age: 0


In [11]:
# Using Imputer to replace NaN with mean of the variable.
imp = SimpleImputer(strategy = "mean", missing_values = np.nan)
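As a quick illustration of what mean imputation does, the sketch below applies the same SimpleImputer settings to a tiny made-up array; the numbers are illustrative and unrelated to the dataset.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy matrix with one missing value in each column.
toy = np.array([[1.0, np.nan],
                [3.0, 4.0],
                [np.nan, 8.0]])

imputer = SimpleImputer(strategy="mean", missing_values=np.nan)
filled = imputer.fit_transform(toy)

# statistics_ holds the per-column means learned during fit: 2.0 and 6.0 here.
print(imputer.statistics_)
print(filled)
```

Because the means are learned in fit and reused in transform, placing the imputer inside a pipeline means the test set is filled with training-set means rather than its own.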

In [12]:
# Separating the features (X) from the target (y).
X = df.drop(columns="Outcome")
y = df["Outcome"]

In [13]:
# Data splitting
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.15, random_state = 42)
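About 35% of the rows have Outcome equal to 1, so a purely random split can shift the class balance slightly between the train and test sets. One optional variation, not used in this notebook, is to pass stratify to train_test_split. The sketch below uses synthetic labels with a similar balance as an assumption, not the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic features and labels (roughly 35% positives), standing in for X and y.
rng = np.random.default_rng(0)
y_demo = (rng.random(768) < 0.35).astype(int)
X_demo = rng.normal(size=(768, 8))

# stratify=y_demo keeps the share of positives nearly identical in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.15, random_state=42, stratify=y_demo
)
print(y_demo.mean(), y_tr.mean(), y_te.mean())
```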

In [14]:
pipeline = Pipeline([("imputation", imp),
         ("scaler", StandardScaler()),
         ("classifier", LogisticRegression())])
pipeline.fit(X_train,y_train)

print("Training Accuracy: ", pipeline.score(X_train,y_train))
print("Testing Accuracy: ", pipeline.score(X_test,y_test))

Training Accuracy:  0.7760736196319018
Testing Accuracy:  0.7672413793103449


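A machine learning pipeline chains the preprocessing steps (here imputation and scaling) and the classifier into a single object, so the transformations fitted on the training data are applied in exactly the same way whenever predictions are made. This reduces the time it takes to go from raw data to predictions, and continuous streams of raw data collected over time can be processed by the same pipeline. Because the whole workflow sits behind one fit/predict interface, teams with varied technical expertise can readily use it, which puts machine learning in the hands of the people who will actually use the predictions.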
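As a rough sketch of how a fitted pipeline is used on a new record, the example below rebuilds the same three steps on synthetic data (make_classification is an assumption standing in for the diabetes features) and predicts for a row that has a missing measurement.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the eight diabetes features.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

demo_pipeline = Pipeline([
    ("imputation", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression()),
])
demo_pipeline.fit(X_demo, y_demo)

# A new record with one missing measurement: the pipeline imputes, scales and predicts in one call.
new_record = X_demo[:1].copy()
new_record[0, 4] = np.nan
label = demo_pipeline.predict(new_record)
proba = demo_pipeline.predict_proba(new_record)
print(label, proba)
```

The caller never handles the missing value or the scaling directly, which is the practical benefit described above.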

# Hyperparameter tuning

Instead of searching for the best parameters by hand, we can use GridSearchCV, which performs an exhaustive search over specified parameter values for an estimator and evaluates each combination with cross-validation.

### 1. Tuning using GridSearchCV

In [15]:
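# Specifying parameters
# Caveats: many of these combinations are incompatible (for example, 'l1' is not
# supported by 'lbfgs', 'elasticnet' needs l1_ratio, and dual=True is only valid
# for 'liblinear' with 'l2'); GridSearchCV scores such failed fits as NaN.
# 'dict' is not a valid class_weight value (use None, 'balanced' or a
# {label: weight} mapping), and recent scikit-learn releases reject both it
# and the string 'none' for penalty.
parameters = [
    {'classifier' : [LogisticRegression()],
     'classifier__penalty' : ['l1', 'l2', 'elasticnet', 'none'],
     'classifier__C' : [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ],
     'classifier__fit_intercept' : [True, False],
     'classifier__dual' : [True, False],
     'classifier__class_weight' : ['dict', 'balanced'],
     'classifier__solver' : ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']
    }
]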
cv_1 = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

grid = GridSearchCV(estimator=pipeline, param_grid=parameters, cv=cv_1, scoring='accuracy')
grid.fit(X_train, y_train)

print("BEST PARAMETERS:", grid.best_params_, "\n")
print("BEST ESTIMATOR:", grid.best_estimator_)

BEST PARAMETERS: {'classifier': LogisticRegression(C=0.1, class_weight='dict', penalty='l1', solver='saga'), 'classifier__C': 0.1, 'classifier__class_weight': 'dict', 'classifier__dual': False, 'classifier__fit_intercept': True, 'classifier__penalty': 'l1', 'classifier__solver': 'saga'} 

BEST ESTIMATOR: Pipeline(steps=[('imputation', SimpleImputer()), ('scaler', StandardScaler()),
                ('classifier',
                 LogisticRegression(C=0.1, class_weight='dict', penalty='l1',
                                    solver='saga'))])


In [16]:
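# Note: X and y include the training rows, so this score is not a held-out estimate;
# grid.score(X_test, y_test) would measure performance on unseen data.
print(f'Accuracy: {grid.score(X,y):.3f}')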

Accuracy: 0.773
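For comparison, here is a minimal sketch of a grid that contains only settings every listed solver supports, evaluated on a held-out test set. It runs on synthetic data from make_classification (an assumption, not the diabetes data), and the grid values are illustrative rather than the ones tuned above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data, split into train and held-out test sets.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.15, stratify=y_demo, random_state=42
)

demo_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(max_iter=1000)),
])

# Every combination below is valid: all three solvers support the default L2 penalty.
valid_grid = {
    "classifier__C": [0.1, 0.5, 1.0],
    "classifier__solver": ["lbfgs", "liblinear", "saga"],
    "classifier__class_weight": [None, "balanced"],
}

search = GridSearchCV(
    demo_pipeline,
    param_grid=valid_grid,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=1),
    scoring="accuracy",
)
search.fit(X_tr, y_tr)

# Score on rows the search never saw.
test_accuracy = search.score(X_te, y_te)
print(search.best_params_, test_accuracy)
```

Keeping the grid free of incompatible combinations means no fits are silently scored as NaN, and scoring on the held-out split gives an estimate that is not inflated by the training rows.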


### 2. Tuning using pipeline and mlflow

In [17]:
mlflow.set_experiment('Hyperparameter Tuning')
# defining parameters in arrays
classifier__C_arr = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0 ]
classifier__solver_arr = ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']
classifier__fit_intercept_arr = [True, False]

In [18]:
mlflow.end_run()

In [19]:
for i in classifier__solver_arr:
    for j in classifier__fit_intercept_arr:
        for l in classifier__C_arr:
            with mlflow.start_run():
                logistic = LogisticRegression(solver=i, C=l, max_iter=500, fit_intercept=j)

                pipeline = Pipeline([("imputation", imp), ("scaler", StandardScaler()),  ("classifier", logistic)])
                pipeline.fit(X_train,y_train)

                train_score = pipeline.score(X_train,y_train)
                test_score = pipeline.score(X_test,y_test)

                #prediction = pipeline.predict(X_test)

                mlflow.log_param('classifier', i)
                mlflow.log_param('C', l)
                mlflow.log_param('fit_intercept', j)

                mlflow.log_metric('train_accuracy', train_score)
                mlflow.log_metric('test_accuracy', test_score)
                    

#### From the MLflow UI, the parameters that give the best accuracy are:
#### 1). solver: saga
#### 2). C: 0.1
#### 3). fit_intercept: True

# Data versioning using DVC

#### Creating a new version of our dataset and making a new csv file at destination.

In [20]:
df.head()  # this is our old dataset

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6.0,148.0,72.0,35.0,,33.6,0.627,50,1
1,1.0,85.0,66.0,29.0,,26.6,0.351,31,0
2,8.0,183.0,64.0,,,23.3,0.672,32,1
3,1.0,89.0,66.0,23.0,94.0,28.1,0.167,21,0
4,,137.0,40.0,35.0,168.0,43.1,2.288,33,1


In [21]:
# We will add a few columns to our dataset and then version it.

# Adding BMI descriptor
def new_bmi(row):
    # Contiguous ranges, so values such as 24.95 or 29.95 do not fall between categories.
    # A missing (NaN) BMI matches no branch and yields None.
    if row["BMI"] < 18.5:
        return "Under"
    elif row["BMI"] < 25:
        return "Healthy"
    elif row["BMI"] < 30:
        return "Over"
    elif row["BMI"] >= 30:
        return "Obese"

df = df.assign(BMI_DESC=df.apply(new_bmi, axis=1))

# Adding Glucose descriptor
def new_glucose(row):
    if row["Glucose"] >= 60 and row["Glucose"] <= 130:
        return "Normal"
    else:
        return "Abnormal"

df = df.assign(GLU_DESC=df.apply(new_glucose, axis=1))


df.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome,BMI_DESC,GLU_DESC
0,6.0,148.0,72.0,35.0,,33.6,0.627,50,1,Obese,Abnormal
1,1.0,85.0,66.0,29.0,,26.6,0.351,31,0,Over,Normal
2,8.0,183.0,64.0,,,23.3,0.672,32,1,Healthy,Abnormal
3,1.0,89.0,66.0,23.0,94.0,28.1,0.167,21,0,Over,Normal
4,,137.0,40.0,35.0,168.0,43.1,2.288,33,1,Obese,Abnormal
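The BMI bucketing above can also be written without a row-wise apply. The sketch below is one possible alternative using pd.cut on a few made-up BMI values; it is not the method used in this notebook.

```python
import numpy as np
import pandas as pd

# Illustrative BMI values, including a boundary case (24.95) and a missing value.
bmi = pd.Series([17.0, 23.3, 24.95, 28.1, 33.6, np.nan], name="BMI")

bmi_desc = pd.cut(
    bmi,
    bins=[0, 18.5, 25, 30, np.inf],
    right=False,  # intervals are [low, high), so 25.0 counts as "Over"
    labels=["Under", "Healthy", "Over", "Obese"],
)
print(bmi_desc.tolist())
```

Missing BMI values stay missing in the result instead of being assigned to a category.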


In [22]:
# Converting categorical to numerical data(one hot encoding).
ohe = pd.get_dummies(df, dtype=int)  # dtype=int keeps the indicator columns as 0/1 on recent pandas versions
ohe.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome,BMI_DESC_Healthy,BMI_DESC_Obese,BMI_DESC_Over,BMI_DESC_Under,GLU_DESC_Abnormal,GLU_DESC_Normal
0,6.0,148.0,72.0,35.0,,33.6,0.627,50,1,0,1,0,0,1,0
1,1.0,85.0,66.0,29.0,,26.6,0.351,31,0,0,0,1,0,0,1
2,8.0,183.0,64.0,,,23.3,0.672,32,1,1,0,0,0,1,0
3,1.0,89.0,66.0,23.0,94.0,28.1,0.167,21,0,0,0,1,0,0,1
4,,137.0,40.0,35.0,168.0,43.1,2.288,33,1,0,1,0,0,1,0


In [23]:
ohe.shape

(768, 15)

In [24]:
# Removing 100 rows from our data to create a new version
np.random.seed(10)
remove_n = 100
drop_indices = np.random.choice(ohe.index, remove_n, replace=False)
df_subset = ohe.drop(drop_indices)

In [25]:
df_subset.shape

(668, 15)

In [26]:
# Saving this new data as a csv file
df_subset.to_csv('data/diabetes_v2.csv', index=False)

### Importing data from local storage using Python API for DVC

### Version = 'v1.0'

In [27]:
data_url = dvc.api.get_url(
    path='data/diabetes.csv',
    repo='/Users/kvnsvni/Downloads/AI',
    rev='v1.0')

df_v1 = pd.read_csv('.'+data_url);
df_v1.shape

(768, 9)

### Version = 'v2.0'

In [28]:
data_url = dvc.api.get_url(
    path='data/diabetes.csv',
    repo='/Users/kvnsvni/Downloads/AI',
    rev='v2.0')

df_v2 = pd.read_csv('.'+data_url);
df_v2.shape

(668, 15)

In [29]:
# Training on the 'v1.0' dataset with the tuned parameters
version = 'v1.0'
# Tuned values from the search above, named explicitly so the logged
# parameters do not depend on the leftover loop variables (i, l, j).
solver, C, fit_intercept = 'saga', 0.1, True
# Data splitting
X = df_v1.drop(columns="Outcome")
y = df_v1["Outcome"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.15, random_state = 42)
mlflow.set_experiment('DVC')

with mlflow.start_run():
    # class_weight='dict' is not a valid value, so the default (None) is used instead.
    logistic = LogisticRegression(solver=solver, C=C, max_iter=500, fit_intercept=fit_intercept, penalty='l1')

    pipeline = Pipeline([("imputation", imp), ("scaler", StandardScaler()),  ("classifier", logistic)])
    pipeline.fit(X_train,y_train)

    train_score = pipeline.score(X_train,y_train)
    test_score = pipeline.score(X_test,y_test)

    mlflow.log_param('classifier', solver)
    mlflow.log_param('C', C)
    mlflow.log_param('fit_intercept', fit_intercept)
    mlflow.log_param('data_version', version)

    mlflow.log_metric('train_accuracy', train_score)
    mlflow.log_metric('test_accuracy', test_score)

In [30]:
# Training on the new 'v2.0' dataset with the tuned parameters
version = 'v2.0'
# Tuned values from the search above, named explicitly so the logged
# parameters do not depend on the leftover loop variables (i, l, j).
solver, C, fit_intercept = 'saga', 0.1, True
# Data splitting
A = df_v2.drop(columns="Outcome")
b = df_v2["Outcome"]
A_train, A_test, b_train, b_test = train_test_split(A, b, test_size = 0.15, random_state = 42)
mlflow.set_experiment('DVC')

with mlflow.start_run():
    # class_weight='dict' is not a valid value, so the default (None) is used instead.
    logistic = LogisticRegression(solver=solver, C=C, max_iter=500, fit_intercept=fit_intercept, penalty='l1')

    pipeline = Pipeline([("imputation", imp), ("scaler", StandardScaler()),  ("classifier", logistic)])
    pipeline.fit(A_train,b_train)

    train_score = pipeline.score(A_train,b_train)
    test_score = pipeline.score(A_test,b_test)

    mlflow.log_param('classifier', solver)
    mlflow.log_param('C', C)
    mlflow.log_param('fit_intercept', fit_intercept)
    mlflow.log_param('data_version', version)

    mlflow.log_metric('train_accuracy', train_score)
    mlflow.log_metric('test_accuracy', test_score)

## Comparing the logged runs for both datasets, we can conclude that the new dataset version has the higher accuracy.