### Pima Indians Diabetes Prediction
     
This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. Its objective is to predict diagnostically whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. I downloaded the data from Kaggle; it is available [here](https://www.kaggle.com/uciml/pima-indians-diabetes-database).

#### Curiosity 

I chose this data to predict whether a person has diabetes without using any machine learning library. I was curious whether the problem could be solved using only mathematics and the data analysis libraries. High accuracy is not the goal here; following that curiosity is.

To keep the work structured, I followed these steps:

 1. Data Understanding
 2. Data Preparation
 3. Model

### Data Understanding

In [1]:
import pandas as pd     # import data analysis libraries
import numpy as np

from math import exp 

In [2]:
df= pd.read_csv(r"E:\Study\Projects\diabetes\diabetes.csv") # Read file
df.head()

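| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |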


In [3]:
df.columns  # Columns list

Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'],
      dtype='object')

### Data preparation

In [4]:
# Prepare the input variable (X), the output variable (Y), and theta.

# Collect all feature columns for the input variable
X = df[['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
        'BMI', 'DiabetesPedigreeFunction', 'Age']].values

# Prepend a column of ones (the intercept term) to X using np.c_
X = np.c_[np.ones(len(X)), X]

# m is the number of rows (samples), n the number of columns (intercept + features)
m, n = X.shape

# Initialize theta as an (n x 1) vector of zeros
theta = np.zeros((n, 1))

# Output variable Y holds the target labels, reshaped to (m x 1)
Y = df['Outcome'].values
Y = np.reshape(Y, (m, 1))

print("theta:{0}, X:{1}, Y:{2}".format(theta.shape, X.shape, Y.shape))

theta:(9, 1), X:(768, 9), Y:(768, 1)


In [5]:
# Normalize the features

Xmax, Xmin = X.max(), X.min()  # global max and min over the whole matrix
Xavg = np.average(X)           # global average over the whole matrix
print(Xmax, Xmin, Xavg)

X = (X - Xavg) / (Xmax - Xmin)  # mean normalization with a single global scale
print(X.shape)

846.0 0.0 40.0984810474537
(768, 9)
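The cell above scales the whole matrix with one global average, maximum, and minimum, so a column with a large range (here the maximum of 846) sets the scale for every other column. A common alternative is to normalize each column separately. The following is a minimal sketch of that idea on made-up toy values; `normalize_features` is a hypothetical helper, not part of the notebook, and it would normally be applied before the column of ones is added.

```python
import numpy as np

def normalize_features(features):
    """Mean-normalize each column: (x - column mean) / (column max - column min)."""
    features = np.asarray(features, dtype=float)
    col_mean = features.mean(axis=0)
    col_range = features.max(axis=0) - features.min(axis=0)
    col_range[col_range == 0] = 1.0  # avoid dividing by zero for constant columns
    return (features - col_mean) / col_range

# Toy data (made-up values): two columns on very different scales.
raw = np.array([[1.0, 100.0],
                [2.0, 300.0],
                [3.0, 500.0]])
scaled = normalize_features(raw)
print(scaled)
```

After this step each column is centered on zero and spans a range of one, regardless of its original units.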


### Model

In [6]:
# We are using logistic regression, so we start with the sigmoid function.

def sigmoid(X, theta):
    """Compute the sigmoid hypothesis g(z) = 1 / (1 + e^(-z)) with z = X . theta.
    The sigmoid maps every z into the interval (0, 1); since the goal is to
    classify the output as 0 or 1, it plays a central role here."""

    hyp = 1 + np.exp(-np.dot(X, theta))  # denominator: 1 + e^(-z)
    hx = np.divide(1, hyp)
    return hx

hx = sigmoid(X, theta)
print("hypothesis shape", hx.shape)

hypothesis shape (768, 1)
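When the entries of z become very large in magnitude, `np.exp(-z)` can overflow and emit a runtime warning. One way to avoid this, shown here as a sketch rather than as the notebook's method, is to evaluate the sigmoid with a different but equivalent expression for negative inputs. `stable_sigmoid` is a hypothetical name and expects an array of z values.

```python
import numpy as np

def stable_sigmoid(z):
    """Sigmoid that never exponentiates a large positive number."""
    z = np.asarray(z, dtype=float)
    out = np.empty_like(z)
    pos = z >= 0
    # For z >= 0, e^(-z) is at most 1, so the usual form is safe.
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    # For z < 0, use e^z / (1 + e^z), where e^z is below 1.
    exp_z = np.exp(z[~pos])
    out[~pos] = exp_z / (1.0 + exp_z)
    return out

values = stable_sigmoid(np.array([-1000.0, 0.0, 1000.0]))
print(values)
```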


In [7]:
# Calculating the cost function

def costf(X, Y, theta):
    """Compute the cost, i.e. the error between predicted and expected values:
        cost = -1/m * sum[y log(hx) + (1 - y) log(1 - hx)]
    """
    hx = sigmoid(X, theta)
    cost = np.dot(Y.T, np.log(hx)) + np.dot((1 - Y).T, np.log(1 - hx))
    jtheta = -cost / m

    return jtheta

jtheta = costf(X, Y, theta)
print("Cost function", jtheta)

Cost function [[0.69314718]]
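The value 0.6931 can be checked by hand. With every theta equal to zero, z is 0 for every row, so the hypothesis is 1/2 for every patient and the cost no longer depends on the labels:

$$
h_\theta(x) = \frac{1}{1 + e^{0}} = \frac{1}{2}, \qquad
J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log\tfrac{1}{2} + \left(1 - y^{(i)}\right)\log\tfrac{1}{2}\right] = \log 2 \approx 0.6931
$$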


In [8]:
# Gradient descent - minimizing the cost function (jtheta)

def grad(X, Y, hx):
    """Take one gradient descent step to reduce the cost.
    thetaj starts at zero and stores the updated parameters.
    Update rule: thetaj := thetaj - 1/m * sum((hx - y) * xj), where xj are the features."""

    thetaj = np.zeros(theta.shape)
    # np.sum adds the per-feature gradients into a single scalar,
    # so every entry of thetaj receives the same update.
    thetaj += thetaj - np.divide(np.sum(np.dot(X.T, hx - Y)), m)

    return thetaj

final_theta = grad(X, Y, hx)

print(final_theta.shape)
final_theta

(9, 1)


array([[0.02158164],
       [0.02158164],
       [0.02158164],
       [0.02158164],
       [0.02158164],
       [0.02158164],
       [0.02158164],
       [0.02158164],
       [0.02158164]])
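All nine parameters above are identical because the gradient is summed into one number before the update. For comparison, here is a minimal sketch of batch gradient descent that keeps one gradient entry per feature, recomputes the hypothesis on every iteration, and uses a learning rate. It runs on synthetic data, not on the diabetes data; `sigmoid_z`, `gradient_descent`, the learning rate, and the iteration count are illustrative assumptions.

```python
import numpy as np

def sigmoid_z(z):
    """Plain sigmoid of an already computed z."""
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(features, labels, learning_rate=0.1, n_iters=500):
    """Batch gradient descent for logistic regression with a per-feature gradient."""
    n_samples, n_features = features.shape
    weights = np.zeros((n_features, 1))
    for _ in range(n_iters):
        predictions = sigmoid_z(features @ weights)                   # (n_samples, 1)
        gradient = features.T @ (predictions - labels) / n_samples   # (n_features, 1)
        weights -= learning_rate * gradient
    return weights

# Synthetic, linearly separable toy data (not the diabetes data).
rng = np.random.default_rng(0)
points = rng.normal(size=(200, 2))
toy_X = np.c_[np.ones(len(points)), points]                  # intercept column + 2 features
toy_Y = (points[:, [0]] + points[:, [1]] > 0).astype(float)

toy_theta = gradient_descent(toy_X, toy_Y)
toy_accuracy = np.mean((sigmoid_z(toy_X @ toy_theta) >= 0.5) == toy_Y) * 100
print(toy_theta.ravel(), toy_accuracy)
```

Because the gradient keeps its `(n_features, 1)` shape, each parameter moves according to its own feature.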

In [9]:
# Compute the training accuracy

p=sigmoid(X,final_theta)>=0.5

print("trained accuracy:\n", np.mean(p==Y)*100)

trained accuracy:
 64.32291666666666
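To put an accuracy figure in context, it helps to compare it with the accuracy of always predicting the most frequent class. The sketch below assumes the class split commonly reported for this dataset (500 negative and 268 positive out of 768 rows); treat that split as an assumption and check it with `df['Outcome'].value_counts()`. `majority_baseline_accuracy` is a hypothetical helper.

```python
import numpy as np

def majority_baseline_accuracy(labels):
    """Accuracy (in percent) of always predicting the most frequent class."""
    labels = np.asarray(labels).ravel()
    positive_rate = labels.mean()
    return max(positive_rate, 1.0 - positive_rate) * 100

# Assumed class split: 500 negatives, 268 positives (verify against the real data).
assumed_labels = np.array([0] * 500 + [1] * 268)
baseline = majority_baseline_accuracy(assumed_labels)
print(baseline)
```

A model is only learning something useful from the features if its accuracy is clearly above this baseline.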


#### Adding more features

In [10]:
""" To go a bit further, we can derive additional features from the existing ones.
    Training on these extra features may add a little accuracy to the model.
    Here we take two columns, X1 and X2, and build polynomial terms from them.
"""

X1 = X[:, [0]]  # column 0 of X (the intercept column), shape (m, 1)
X2 = X[:, [1]]  # column 1 of X (Pregnancies), shape (m, 1)

degree = 6

for i in range(degree):
    for j in range(i):
        out = np.multiply(np.power(X1, i - j), np.power(X2, j))  # new feature X1^(i-j) * X2^j
        X = np.c_[X, out]

[m, n] = X.shape
theta = np.zeros((n, 1))
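The loop above appends 15 columns, one for each pair with 0 <= j < i < 6. A more common variant of polynomial feature mapping builds every monomial X1^(i-j) * X2^j up to the chosen degree, including the pure X2 powers. The sketch below shows that variant on toy inputs; `map_feature` is a hypothetical helper and not the notebook's exact loop.

```python
import numpy as np

def map_feature(x1, x2, degree=6):
    """Return all monomials x1^(i-j) * x2^j for 0 <= j <= i <= degree as columns."""
    x1 = np.asarray(x1, dtype=float).reshape(-1, 1)
    x2 = np.asarray(x2, dtype=float).reshape(-1, 1)
    columns = [np.ones_like(x1)]  # the degree-0 term
    for i in range(1, degree + 1):
        for j in range(i + 1):
            columns.append((x1 ** (i - j)) * (x2 ** j))
    return np.hstack(columns)

# Toy inputs: with degree=2 the columns are 1, x1, x2, x1^2, x1*x2, x2^2.
mapped = map_feature([1.0, 2.0], [3.0, 4.0], degree=2)
print(mapped)
```

For a given degree d this produces (d + 1)(d + 2) / 2 columns, so 28 columns for degree 6.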

### Regularized Logistic Regression

In [11]:
lam=1000 #lambda

In [12]:
# Regularized cost function

def costf_reg(X, Y, theta):
    """Since we added many features, we should regularize. Regularization trades off
    fit against the size of the parameters to keep the hypothesis simple and avoid overfitting.
        j(theta) = -1/m * sum[y log(hx) + (1 - y) log(1 - hx)] + lambda/(2m) * sum(theta_j^2), j >= 1"""

    hx = sigmoid(X, theta)
    cost = np.dot(Y.T, np.log(hx)) + np.dot((1 - Y).T, np.log(1 - hx))
    cost_reg = lam / (2 * m) * np.sum(np.square(theta[1:]))  # theta0 is not regularized
    jtheta = -cost / m + cost_reg

    return jtheta

jtheta = costf_reg(X, Y, theta)
print("Cost shape:", jtheta.shape)

Cost shape: (1, 1)


In [13]:
alpha=0.03
epoch=1000

In [14]:
def grad_reg(X, Y, hx, alpha, epoch):
    """Gradient descent with the regularization term.
    Note that theta0 is not regularized: the j = 0 update is in the else branch
    and the j >= 1 update is in the if branch.
        for j = 0:  thetaj := thetaj - alpha * 1/m * sum((h(x) - y) * xj)
        for j >= 1: thetaj := thetaj - alpha * [1/m * sum((h(x) - y) * xj) + lambda/m * thetaj]"""
    thetaj = np.zeros(theta.shape)

    # hx is computed once outside the function and held fixed across epochs;
    # np.sum again adds the per-feature gradients into a single scalar.
    for _ in range(epoch):
        for j in range(0, len(theta)):
            if j != 0:
                thetaj[j] += thetaj[j] - np.dot(np.divide(np.sum(np.dot(X.T, hx - Y)), m) + np.dot(np.divide(lam, m), theta[j]), alpha)
            else:
                thetaj[j] += thetaj[j] - np.dot(np.divide(np.sum(np.dot(X.T, hx - Y)), m), alpha)

    return thetaj

final_theta_reg = grad_reg(X, Y, hx, alpha, epoch)
print(final_theta.shape)  # final_theta is the result from In [8]
final_theta

(9, 1)


array([[0.02158164],
       [0.02158164],
       [0.02158164],
       [0.02158164],
       [0.02158164],
       [0.02158164],
       [0.02158164],
       [0.02158164],
       [0.02158164]])

In [15]:
# Compute the training accuracy

p = sigmoid(X, final_theta_reg) >= 0.5
print("trained accuracy:\n", np.mean(p == Y) * 100)

trained accuracy:
 65.36458333333334
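The update rules in the `grad_reg` docstring can also be written in vectorized form. The sketch below is one way to do that, assuming the hypothesis is recomputed each iteration and the penalty uses the current parameters; it runs on synthetic data, and `sigmoid_z`, `regularized_cost`, `regularized_gradient_descent`, and all hyperparameter values are illustrative assumptions rather than the notebook's settings.

```python
import numpy as np

def sigmoid_z(z):
    return 1.0 / (1.0 + np.exp(-z))

def regularized_cost(features, labels, weights, lam):
    """Log loss plus lambda/(2m) * sum of squared weights, excluding the intercept."""
    n_samples = len(labels)
    h = sigmoid_z(features @ weights)
    eps = 1e-12  # keeps log() away from zero
    data_term = -(labels.T @ np.log(h + eps) + (1 - labels).T @ np.log(1 - h + eps)) / n_samples
    penalty = lam / (2 * n_samples) * np.sum(weights[1:] ** 2)
    return float(data_term.item() + penalty)

def regularized_gradient_descent(features, labels, lam=1.0, learning_rate=0.1, n_iters=500):
    n_samples, n_features = features.shape
    weights = np.zeros((n_features, 1))
    for _ in range(n_iters):
        h = sigmoid_z(features @ weights)
        gradient = features.T @ (h - labels) / n_samples
        gradient[1:] += (lam / n_samples) * weights[1:]  # the intercept is not penalized
        weights -= learning_rate * gradient
    return weights

# Synthetic toy data (not the diabetes data).
rng = np.random.default_rng(0)
points = rng.normal(size=(200, 2))
toy_X = np.c_[np.ones(len(points)), points]
toy_Y = (points[:, [0]] + points[:, [1]] > 0).astype(float)

weak = regularized_gradient_descent(toy_X, toy_Y, lam=0.0)
strong = regularized_gradient_descent(toy_X, toy_Y, lam=100.0)
print(np.linalg.norm(weak[1:]), np.linalg.norm(strong[1:]))
```

A larger lambda pulls the non-intercept parameters toward zero, which is the trade-off the regularized cost describes.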




In [16]:
alpha=0.01

In [17]:
def sgd(X, Y, hx):
    """Trying another algorithm, stochastic gradient descent, for better efficiency.
    Update rule: b = b + learning_rate * (y - hx) * hx * (1 - hx) * x,
    evaluated here as a single matrix expression over the full dataset."""
    b = np.zeros(theta.shape)
    b += b + np.dot(alpha, np.dot(np.dot(X.T, (1 - hx)), np.dot((Y - hx).T, hx)))
    return b

theta_sgd = sgd(X, Y, hx)
theta_sgd.shape

(24, 1)

In [18]:
# Compute the training accuracy

p=sigmoid(X,theta_sgd)>=0.5
print("trained accuracy:\n", np.mean(p==Y)*100)

trained accuracy:
 65.10416666666666
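In its usual form, stochastic gradient descent updates the parameters after every single sample instead of once for the whole dataset. The sketch below illustrates that per-sample loop on synthetic data. Note that it uses the log-loss update `(y - hx) * x`, which omits the `hx * (1 - hx)` factor from the rule quoted in the docstring above; `sgd_logistic` and its hyperparameters are illustrative assumptions.

```python
import numpy as np

def sigmoid_z(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_logistic(features, labels, learning_rate=0.05, n_epochs=20, seed=0):
    """Per-sample SGD for logistic regression using the log-loss gradient."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = features.shape
    labels = np.asarray(labels, dtype=float).ravel()
    weights = np.zeros(n_features)
    for _ in range(n_epochs):
        for idx in rng.permutation(n_samples):  # visit samples in random order
            prediction = sigmoid_z(features[idx] @ weights)
            weights += learning_rate * (labels[idx] - prediction) * features[idx]
    return weights

# Synthetic toy data (not the diabetes data).
rng = np.random.default_rng(1)
points = rng.normal(size=(200, 2))
toy_X = np.c_[np.ones(len(points)), points]
toy_Y = (points[:, 0] + points[:, 1] > 0).astype(float)

sgd_weights = sgd_logistic(toy_X, toy_Y)
sgd_accuracy = np.mean((sigmoid_z(toy_X @ sgd_weights) >= 0.5) == toy_Y) * 100
print(sgd_weights, sgd_accuracy)
```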


### Summary

So we walked through the whole process in a simple manner. There is a lot we could do to improve it: in the data preparation step we could work more with the data to get better results, and we could try different algorithms. For example, the SGD variant used here took little extra work and reached an accuracy of about 65%, close to the other two approaches. With cleaner data, this algorithm might give a different result.

So there you have an implementation of logistic regression with and without a regularization term, plus the SGD algorithm, for predicting whether a patient has diabetes.