Logistic Regression

**Problem Statement**<br>
World Health Organization has estimated 12 million deaths occur worldwide, every year due to Heart diseases. Half the deaths in the United States and other developed countries are due to cardio vascular diseases. The early prognosis of cardiovascular diseases can aid in making decisions on lifestyle changes in high risk patients and in turn reduce the complications. This research intends to pinpoint the most relevant/risk factors of heart disease as well as predict the overall risk using logistic regression.


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

**Reading the dataset from file**

In [2]:
data=pd.read_csv('framingham.csv')

In [3]:
print(data.shape)
data.head()

(4238, 16)


Unnamed: 0,male,age,education,currentSmoker,cigsPerDay,BPMeds,prevalentStroke,prevalentHyp,diabetes,totChol,sysBP,diaBP,BMI,heartRate,glucose,TenYearCHD
0,1,39,4.0,0,0.0,0.0,0,0,0,195.0,106.0,70.0,26.97,80.0,77.0,0
1,0,46,2.0,0,0.0,0.0,0,0,0,250.0,121.0,81.0,28.73,95.0,76.0,0
2,1,48,1.0,1,20.0,0.0,0,0,0,245.0,127.5,80.0,25.34,75.0,70.0,0
3,0,61,3.0,1,30.0,0.0,0,1,0,225.0,150.0,95.0,28.58,65.0,103.0,1
4,0,46,3.0,1,23.0,0.0,0,0,0,285.0,130.0,84.0,23.1,85.0,85.0,0


In [16]:
data=(data-data.min())/(data.max()-data.min())

**Checking for any NaN value**

In [90]:
data.isnull().any().any()

True

**We can count the number of NaN values**

In [6]:
data.isnull().sum().sum()

645

**Delete the rows containing null or NaN values**

In [17]:
data.dropna(inplace=True)
data.reset_index(drop=True, inplace=True)
data.isnull().sum().sum()

0

In [65]:
data.shape

(3656, 16)

**Checking the correlation of features with each other**

In [None]:
data.corr()

**Splitting the data to train and test**

In [18]:
data_train=data.iloc[:2559]
data_test=data.iloc[2559:]

**Separating the target value and feature value**

In [19]:
x_train=data_train.iloc[:,:15].to_numpy()
print("Type of x_train:",type(x_train))

Type of x_train: <class 'numpy.ndarray'>


In [20]:
y_train=data_train.iloc[:,15:].to_numpy()
print(y_train)

[[0.]
 [0.]
 [0.]
 ...
 [0.]
 [1.]
 [0.]]


**Converting into array for easier calculation**

In [None]:
x_train1=np.array(x_train)
y_train=np.array(y_train)

**Sigmoid Function**

In [9]:
def sigmoid(z):
    g=1/(1+np.exp(-z))
    return g

**Here we calculate the cost using formula**

In [10]:
def compute_cost(X, y, w, b, lambda_= 1):
    m, n = X.shape
    loss_tmp=0.
    z_wb=np.zeros(m)
    z_wb=np.dot(X,w)
    print(z_wb[:5])
    z_wb=z_wb+b
    z=1/(1+np.exp(-z_wb))
    for i in range(m):
        loss_tmp=loss_tmp-((y[i]*np.log(z[i]))+((1-y[i])*np.log(1-z[i])))
    total_cost=loss_tmp/m
    return total_cost

**Calculate gradient**

In [12]:
def compute_gradient(X, y, w, b, lambda_=None): 
    """
    Computes the gradient for logistic regression 
 
    Args:
      X : (ndarray Shape (m,n)) variable such as house size 
      y : (array_like Shape (m,1)) actual value 
      w : (array_like Shape (n,1)) values of parameters of the model      
      b : (scalar)                 value of parameter of the model 
      lambda_: unused placeholder.
    Returns
      dj_dw: (array_like Shape (n,1)) The gradient of the cost w.r.t. the parameters w. 
      dj_db: (scalar)                The gradient of the cost w.r.t. the parameter b. 
    """
    m, n = X.shape
    dj_dw = np.zeros(w.shape)
    dj_db = 0.
    dj_dw_i=np.zeros(w.shape)
    f_wb  = np.zeros(m)
    for i in range(m):
      for j in range(n):
        X[i][j]=(X[i][j])**j
        z_wb = np.dot(X[i],w) + b
        f_wb[i] =sigmoid(z_wb)
        dj_db_i = f_wb[i]-y[i]
        dj_db += dj_db_i
        
        for j in range(n):
            dj_dw_i[j] = dj_db_i*X[i,j]
            dj_dw[j] +=dj_dw_i[j]       
    dj_dw = dj_dw/m
    dj_db = dj_db/m
        
    return dj_db, dj_dw

**Gradient descent algorithm**

In [13]:
def gradient_descent(X ,y, w_in, b_in,cost_function, gradient_function, alpha, num_iters,lambda_):
    m = len(X)
        
        # An array to store cost J and w's at each iteration primarily for graphing later
    J_history = []
    w_history = []
    
    for i in range(num_iters):

        # Calculate the gradient and update the parameters
        dj_db, dj_dw = gradient_function(X, y, w_in, b_in, lambda_)   

        # Update Parameters using w, b, alpha and gradient
        w_in = w_in - alpha * dj_dw               
        b_in = b_in - alpha * dj_db              
    
        # Save cost J at each iteration
        if i<100000:      # prevent resource exhaustion 
            cost =  cost_function(X, y, w_in, b_in, lambda_)
            J_history.append(cost)

        # Print cost every at intervals 10 times or as many iterations if < 10
        if (num_iters%10) == 0 or i == (num_iters-1):
            w_history.append(w_in)
            print(f"Iteration {i:4}: Cost {float(J_history[-1]):8.2f}   ")
        
    return w_in, b_in, J_history, w_history #return w and J,w history for graphing

**Main Function**

In [21]:
np.random.seed(1)
m=x_train.shape[1]
initial_w = np.zeros(m).reshape(-1,1)
initial_b = -8


# Some gradient descent settings
iterations = 500
alpha = 0.001

w,b, J_history,_ = gradient_descent(x_train ,y_train, initial_w, initial_b,compute_cost, compute_gradient, alpha, iterations, 0)

[[0.00297585]
 [0.00282504]
 [0.00403354]
 [0.00580639]
 [0.00416199]]
Iteration    0: Cost     1.22   
[[0.00580723]
 [0.00556017]
 [0.00806548]
 [0.01135615]
 [0.00807301]]
Iteration    1: Cost     1.22   
[[0.0085695 ]
 [0.00832656]
 [0.01209814]
 [0.0168515 ]
 [0.01192682]]
Iteration    2: Cost     1.22   
[[0.01131318]
 [0.01110181]
 [0.01613075]
 [0.02241956]
 [0.01585335]]
Iteration    3: Cost     1.22   
[[0.01405474]
 [0.01387717]
 [0.0201633 ]
 [0.02802202]
 [0.01981429]]
Iteration    4: Cost     1.22   
[[0.0167962 ]
 [0.01665249]
 [0.0241958 ]
 [0.03362621]
 [0.02377698]]
Iteration    5: Cost     1.22   
[[0.01953762]
 [0.01942778]
 [0.02822824]
 [0.03923033]
 [0.02773963]]
Iteration    6: Cost     1.22   
[[0.022279  ]
 [0.02220302]
 [0.03226062]
 [0.04483438]
 [0.03170221]]
Iteration    7: Cost     1.21   
[[0.02502035]
 [0.02497823]
 [0.03629295]
 [0.05043835]
 [0.03566475]]
Iteration    8: Cost     1.21   
[[0.02776165]
 [0.0277534 ]
 [0.04032523]
 [0.05604226]
 [0.0396

KeyboardInterrupt: 

**Mentioning features and target**

In [None]:
x_test=data_test.iloc[:,:15].to_numpy()
y_test=data_test.iloc[:,15:].to_numpy()
# x_test1=np.array(x_test)
# y_test1=np.array(y_test)

**Predict function**

In [None]:
def predict(X, w, b): 
    """
    Predict whether the label is 0 or 1 using learned logistic
    regression parameters w
    
    Args:
    X : (ndarray Shape (m, n))
    w : (array_like Shape (n,))      Parameters of the model
    b : (scalar, float)              Parameter of the model

    Returns:
    p: (ndarray (m,1))
        The predictions for X using a threshold at 0.5
    """
    m, n = X.shape   
    f_wb=np.zeros(m)
    p = np.zeros(m)
   
    
    # Loop over each example
    for i in range(m):   
        z_wb = np.dot(X[i],w) + b
        f_wb[i] =sigmoid(z_wb)

        # Apply the threshold
        if (f_wb[i]>=0.3):
            p[i]=1
        else:
            p[i]=0
    return p

In [None]:
y_pred=predict(x_test,w,b)

**Confusion MAtrix For accuracy**

matrix=confusion_matrix(y_test,y_pred)
print("Confusion Matrix:")