## **About this Script**

This script generates logistic regression model from scratch, in python, for a diabetes dataset.

This dataset describes the medical records for over 700 female Pima native Americans with 8 characteristics. These covariates include:
*   preg = Number of times pregnant
*   plas = Plasma glucose concentration a 2 hours in an oral glucose tolerance test
*   pres = Diastolic blood pressure (mm Hg)
*   skin = Triceps skin fold thickness (mm)
*   test = 2-Hour serum insulin (mu U/ml)
*   mass = Body mass index (weight in kg/(height in m)^2)
*   pedi = Diabetes pedigree function
*   age = Age (years)
The outcome variable is yes/1/diabetes or no/0/no diabetes

If you have questions, please contact: maese005@umn.edu

## **Step 1:** Get Data and load libraries

Here is the link to the github which stores the data: https://github.com/maese005/GLBIO-2021

The data was originally obtained from Kaggle: https://www.kaggle.com/kumargh/pimaindiansdiabetescsv

This dataset is also packaged in R: https://www.rdocumentation.org/packages/pdp/versions/0.7.0/topics/pima

Please download this file into your Google Drive. From there, we will mount Google Collab to Google Drive to access the data. 

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [97]:
import numpy as np 
import statistics
import csv
import pandas as pd
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression

In [None]:
#Read the data as a .csv and view. 
with open('/content/drive/My Drive/AI_Workshop/Code and Exercises/pima-indians-diabetes.csv', 'r') as file:
    reader = csv.reader(file)
    for row in reader:
        print(row)

In [81]:
#You can also read the .csv file using pandas. 
result = pd.read_csv('/content/drive/My Drive/AI_Workshop/Code and Exercises/pima-indians-diabetes.csv', header=None) #header=None so that you don't read the first row as column names.
result.shape #768 people and 8 features with 1 output.
result.head() #Take a look at the data (only works on pandas dataframe, not numpy)

Unnamed: 0,0,1,2,3,4,5,6,7,8
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [82]:
result = result.to_numpy() #In order to index, you must convert pandas data frame to numpy. 

## **Step 2:** Prepare the input (X) and output (y) data

In [90]:
X = result[:,0:8]
X.shape #768 8
X

array([[  6.   , 148.   ,  72.   , ...,  33.6  ,   0.627,  50.   ],
       [  1.   ,  85.   ,  66.   , ...,  26.6  ,   0.351,  31.   ],
       [  8.   , 183.   ,  64.   , ...,  23.3  ,   0.672,  32.   ],
       ...,
       [  5.   , 121.   ,  72.   , ...,  26.2  ,   0.245,  30.   ],
       [  1.   , 126.   ,  60.   , ...,  30.1  ,   0.349,  47.   ],
       [  1.   ,  93.   ,  70.   , ...,  30.4  ,   0.315,  23.   ]])

In [None]:
y = np.round(result[:,8])
y

## **Step 3:** Define our logistic regression function and cross validation function from scratch.

In [92]:
#Define class(MyLogisticReg2).

class MyLogisticReg2:
    
    #Initializes the parameters w and w0. 
    def __init__(self, d, lr, n_iters):  
        #Store these values.
        #d is the number of dimensions/features. 
        self.d=d #For boston, d=13. 
        self.lr=lr
        self.n_iters=n_iters
        
        #Create some weights. Set them to none at first.
        #Create the bias. Set it to none. 
        #We will have to come up with them.
        self.weights=None
        self.bias=None
    
    #Develop a fit method. 
    #This is the training step and involves gradient descent.
    #x is a numpy vector of size m*n where m is the number of samples and n is the number of features for each sample.
    #y is also of size m. Each training sample has 1 vector.
    def fit(self, X, y):
        #We need to initialize the weights/our parameters. 
        #Initialize the parameters.
        n_samples, n_features = X.shape

        #Initialize the weights by creating a vector of only 0's. It's size is the number of features.
        self.weights = np.zeros(n_features)
        #Set the bias to 0 at first.
        #Note: you can also use random numbers for the initialization, but 0 is just fine.
        self.bias=0
        
        #Use gradient descent. Iteratively update the weights.
        for _ in range(self.n_iters): #n_iters is the number of iterations we want to have.
            linear_model = np.dot(X, self.weights) + self.bias #This is wx+b. Use np.dot to multiple the vectors. 
            #Then apply the sigmoid function. Apply a helper method below.
            y_predicted = self.sigmoid(linear_model) #This is our approximation of y.
            #Update our weights using the update rules.
            
            #This is the derivative with respect to w.
            dw=(1/n_samples) * np.dot(X.T, (y_predicted-y)) #y predicted minus the actual y.
            
            #The derivative with respect to bias is the same but without the x. 
            db=(1/n_samples)*np.sum(y_predicted-y)
            
            #Now that we have our derivatives, update the parameters.
            self.weights-=self.lr * dw
            self.bias-=self.lr * db
        #set_trace() 
        
    #Develop a predict method. Input the new test samples that you want to predict. 
    def predict(self, X):
        #First, approximate the data using a linear model.
        linear_model=np.dot(X, self.weights) + self.bias
        #Then apply a sigmoid function to gete th probability.
        y_predicted=self.sigmoid(linear_model)
        #Predict the y class. Use a list comprehension. 
        y_predicted_cls=[1 if i > 0.5 else 0 for i in y_predicted] #Do this for each value in y_predicted.
        return y_predicted_cls
    
    def sigmoid(self, linear_model):
        return 1/(1+np.exp(-linear_model))


Store the logistic regression function we just created in a variable (model) so that we can reference it in our k fold cross validation function. 

In [93]:
model=MyLogisticReg2(d=2, lr=0.01, n_iters=1000)
#model.fit(xtrain, ytrain)
#model.predict(xtest)

In [94]:
#Create k fold cross validation function.
#This function performs k fold cross validation on X and y using method and returns the error rate in each fold.
#The method used is my logistric regression function.

def my_cross_val(X, y, k, model):
    from sklearn.metrics import accuracy_score
    error_rate=[0 for x in range(k)] #Or error_rate=np.zeros(10)
    for i in range(k):
        random_array=np.random.rand(X.shape[0])
        split=random_array<np.percentile(random_array,70)
        data_train3=X[split]
        target_train3=y[split]
        data_test3=X[~split]
        target_test3=y[~split]
        
        BOSTON.fit(data_train3, target_train3)
        #set_trace()
        y_prediction=BOSTON.predict(data_test3)
        #set_trace()
        error_rate[i]=(1-accuracy_score(target_test3, y_prediction)) #Output is the error.
    return (error_rate, np.mean(error_rate), statistics.stdev(error_rate))
    #Will report the error rates across folds, mean across the error rates, standard deviation.

## **Step 4:** Compare performance to sklearns logistic regression model. 

In [96]:
 #Apply the cross validation code to the datasets...accuracy is approximately 70%
 my_cross_val(X, y, 5, model)

([0.35064935064935066,
  0.33333333333333337,
  0.5974025974025974,
  0.4025974025974026,
  0.2943722943722944],
 0.39567099567099573,
 0.11928744045486377)

In [99]:
#Prepare sklearn's default logistic regression model.
#I am going to compare the imputations from this model with those calculated using my LR model.

myLR=LogisticRegression(penalty='l2',solver='lbfgs', multi_class='multinomial', max_iter=5000)

my_cross_val(X, y, 5, myLR)
#We can see that our model performs slightly better / almost equal to sklearns. 

([0.37229437229437234,
  0.6277056277056277,
  0.4372294372294372,
  0.4372294372294372,
  0.316017316017316],
 0.43809523809523804,
 0.1174588992685196)