# **Binary Logistic Regression** using Gradient Descent 

**Binary Logistic Regression** is a discriminative classifier: instead of first calculating the likelihood and then calculating the posterior probability from it, we calculate the posterior probability directly by writing down a probability function for it. 
### Let's see how:

We know that the posterior probability of one of the classes, for each example (row) in our data, looks like this when the data has only 2 classes:

$$ \textbf{P(class='1' | X)} = \frac{1}{1 + e^{-(\hat\theta_0 + \hat\theta^{T}.X)}}$$ 

And, for another class:

$$\textbf{P(class='0' | X)} = 1 - \frac{1}{1 + e^{-(\hat\theta_0 + \hat\theta^{T}.X)}}$$
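
As a quick numerical illustration of the two formulas above, here is a minimal sketch with made-up parameters. The names `posterior_class1`, `theta0`, `theta` and the toy matrix `X_toy` are assumptions for illustration and are not part of the tumor dataset used later.

```python
import numpy as np

def posterior_class1(theta0, theta, X):
    """P(class='1' | X) for every row of X, using the sigmoid formula."""
    z = theta0 + X @ theta          # linear score, one value per example
    return 1.0 / (1.0 + np.exp(-z))

# Toy data: 3 examples, 2 features (values are made up)
X_toy = np.array([[0.5, -1.0],
                  [0.0,  0.0],
                  [2.0,  1.0]])
theta0 = 0.0
theta = np.array([1.0, -1.0])

p1 = posterior_class1(theta0, theta, X_toy)   # P(class='1' | X)
p0 = 1.0 - p1                                 # P(class='0' | X)
print(p1, p0)
```

The two probabilities always add up to 1 for each example, and a linear score of exactly 0 gives a probability of 0.5 for both classes.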

**Suppose:** our data (already preprocessed and normalized) has N rows (examples) and M columns (features).

**1.** Each example i among the N examples has an actual class that it belongs to. That class is either `0` or `1` (in encoded form).

**2.** Let's see what the class column looks like: 

\begin{bmatrix}
0\\
0\\
1\\
\vdots\\
1
\end{bmatrix}

**3.** To build a probability function for the posterior, we combine the posterior probabilities of both classes into a single **likelihood function**. Let's see how:

***A.*** For each example i, the posterior probability can be written as:
    
$$\boldsymbol{ \textbf{P(class=}C_i\textbf{)}} = (\hat p)^{C_i}\ (1-\hat p)^{1 - C_i}$$ 

where $\hat p$ is the posterior probability for class `1` and $C_i$ is the class label for a given example $i$.
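
To see why this single expression covers both classes, here is a small sketch. The function name `bernoulli_prob` and the value `p_hat = 0.8` are assumptions chosen only for illustration.

```python
def bernoulli_prob(p_hat, c):
    """Probability assigned to the observed label c (0 or 1) when P(class=1) = p_hat."""
    return (p_hat ** c) * ((1.0 - p_hat) ** (1 - c))

p_hat = 0.8  # assumed posterior probability of class 1

print(bernoulli_prob(p_hat, 1))  # label 1: the exponent keeps p_hat
print(bernoulli_prob(p_hat, 0))  # label 0: the exponent keeps 1 - p_hat
```

The exponent that equals 0 turns its factor into 1, so only the probability of the observed class survives.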

***B.*** Combining the above formula for all examples (treating the examples as independent):

$\textbf{Likelihood} $ for $(C_1 = \text{'0'}\ \cap\ C_2 = \text{'0'}\ \cap\ \dots \cap\ C_N = \text{'1'})\ $ is:


$$\textbf{L}\ =\ \prod^N_{i = 1}\ (\hat p)^{C_i}\ (1- \hat p)^{1 - C_i}$$


$$\textbf{where,}\ \ \ \ \ {\hat p}\ =\ \frac{1}{1 + e^{-(\hat\theta_0 + \hat\theta^{T}.X)}}$$ 


$$\textbf{and}\ C_i\ \text{is the class label for the}\ i^{\text{th}}\ \text{example}$$

***C.*** Taking $\log_e$ of both sides, our Likelihood Function becomes:
$$\ $$

$$\boldsymbol{\log_e{\left(\textbf{L}(\theta_0, \theta_1, \theta_2)\right)}}\ =\ \sum_{i=1}^{N}\ \left[C_i\ .\log_e{\frac{1}{1 + e^{-(\hat\theta_0 + \hat\theta^{T}.X)}}}\right]\ \ +\  \left[(1 - C_i)\ .\log_e{\left(1 - \frac{1}{1 + e^{-(\hat\theta_0 + \hat\theta^{T}.X)}}\right)}\right]$$

$$\ $$   

This is our final **Log Likelihood Function**, which we have to maximize, i.e:
$$\ $$

$$\boldsymbol{\underset{\hat\theta_0,\hat\theta_1, \hat\theta_2}{\textbf{max}}}\ \ \log_e{\left(\textbf{L}(\theta_0, \theta_1, \theta_2)\right)}$$

$$\textbf{=>}\ \ \ \text{-}\ \boldsymbol{\underset{\hat\theta_0,\hat\theta_1, \hat\theta_2}{\textbf{min}}}\ \ \left[-\log_e{\left(\textbf{L}(\theta_0, \theta_1, \theta_2)\right)}\right]$$

$$\ $$
Maximizing the log likelihood is the same as minimizing its negative, so we move the negative sign inside the function. Our new objective, the negative log likelihood, and the optimization problem become:
$$\ $$

$$\boldsymbol{- \log_e{\left(\textbf{L}(\theta_0, \theta_1, \theta_2)\right)}}\ =\ - \ \sum_{i=1}^{N}\ \left\{\left[C_i\ .\log_e{\frac{1}{1 + e^{-(\hat\theta_0 + \hat\theta^{T}.X)}}}\right]\ \ +\ \left[(1 - C_i)\ .\log_e{\left(1 - \frac{1}{1 + e^{-(\hat\theta_0 + \hat\theta^{T}.X)}}\right)}\right]\right\}$$

$$\ $$

$$\boldsymbol{\underset{\hat\theta_0,\hat\theta_1, \hat\theta_2}{\textbf{min}}}\ \ - \ \log_e{\left(\textbf{L}(\theta_0, \theta_1, \theta_2)\right)}$$
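
The negative log likelihood can be computed straight from the formula, but `log` of a probability that rounds to 0 gives `-inf`. The sketch below shows the direct version next to a rewritten version based on `np.logaddexp`. The rewrite is a swapped-in numerical technique, not something this notebook uses, and all function and variable names here are hypothetical.

```python
import numpy as np

def neg_log_likelihood_naive(theta0, theta, X, C):
    """Negative log likelihood computed directly from the formula."""
    p = 1.0 / (1.0 + np.exp(-(theta0 + X @ theta)))
    return -np.sum(C * np.log(p) + (1 - C) * np.log(1.0 - p))

def neg_log_likelihood_stable(theta0, theta, X, C):
    """Same quantity, using log(1 + e^z) = logaddexp(0, z) to avoid log(0)."""
    z = theta0 + X @ theta
    # -log(sigmoid(z)) = logaddexp(0, -z);  -log(1 - sigmoid(z)) = logaddexp(0, z)
    return np.sum(C * np.logaddexp(0.0, -z) + (1 - C) * np.logaddexp(0.0, z))

# Made-up toy data: 4 examples, 2 features
X_toy = np.array([[0.5, -1.0], [0.0, 0.0], [2.0, 1.0], [-1.5, 0.5]])
C_toy = np.array([1, 0, 1, 0])
theta0, theta = 0.1, np.array([1.0, -1.0])

nll_naive = neg_log_likelihood_naive(theta0, theta, X_toy, C_toy)
nll_stable = neg_log_likelihood_stable(theta0, theta, X_toy, C_toy)
print(nll_naive, nll_stable)
```

For moderate parameter values both versions agree; the second one stays finite even when the linear score becomes very large in magnitude.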

**4.** Now we are going to solve our minimization problem using **Gradient Descent**:
$$\ $$

$$\theta_{\textbf{0}, final}\ =\ \theta_{\textbf{0}, initial} - \epsilon\ .\frac{\partial}{\partial\theta_{\textbf{0}}}\left[-\log_e{L(\theta_0, \theta_1, \theta_2)}\right] \Bigg|_{\substack{\theta_0 = \theta_{0,initial}\\ \theta_1 = \theta_{1,initial}\\ \theta_2 = \theta_{2, initial}}}$$

$$\ $$

$$\boldsymbol{\theta_{\text{final}}}\ = \ \begin{bmatrix}
                                                \theta_{\textbf{1},final} \\
                                                \theta_{\textbf{2},final}
                                                \end{bmatrix}
                                                \ =\ \begin{bmatrix}
                                                \theta_{\textbf{1},initial} \\
                                                \theta_{\textbf{2},initial}
                                                \end{bmatrix} 
                                                \ -\ \epsilon\ .\nabla\left[-\log_e{L(\theta_0, \theta_1, \theta_2)}\right] \Bigg|_{\substack{\theta_0 = \theta_{0,initial}\\ \theta_1 = \theta_{1,initial}\\ \theta_2 = \theta_{2, initial}}}
                                                $$
                                                
                                                
$$\textbf{where,}\ \ \ \boldsymbol{\epsilon}\ \text{is the step-size}$$
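
To make the update rule concrete, here is a minimal sketch of gradient descent on a small synthetic dataset. It steps against the gradient of the negative log likelihood, which is the same as stepping along the gradient of the log likelihood. The data, the starting point, the step size and the fixed number of iterations are all assumptions chosen for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nll(theta0, theta, X, C):
    """Negative log likelihood for labels C in {0, 1}."""
    p = sigmoid(theta0 + X @ theta)
    return -np.sum(C * np.log(p) + (1 - C) * np.log(1.0 - p))

# Toy, non-separable data generated from a fixed seed (values are made up)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 2))
true_logits = 0.5 + X_toy @ np.array([1.5, -1.0])
C_toy = (rng.random(200) < sigmoid(true_logits)).astype(float)

theta0, theta = 0.0, np.zeros(2)   # assumed starting point
epsilon = 1e-3                     # step size (assumed value)
history = [nll(theta0, theta, X_toy, C_toy)]

for _ in range(500):
    error = C_toy - sigmoid(theta0 + X_toy @ theta)
    grad0 = -np.sum(error)          # d(-log L)/d(theta_0)
    grad = -X_toy.T @ error         # d(-log L)/d(theta_j) for each feature j
    theta0 = theta0 - epsilon * grad0
    theta = theta - epsilon * grad
    history.append(nll(theta0, theta, X_toy, C_toy))

print(history[0], history[-1])
```

With a small enough step size the objective goes down at every iteration; a step size that is too large can make it oscillate or diverge.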

***A.*** Calculating $\boldsymbol{\frac{\partial}{\partial\theta_{\textbf{0}}}\left[-\log_e{L(\theta_0, \theta_1, \theta_2)}\right]}$ and $\boldsymbol{\nabla\left[-\log_e{L(\theta_0, \theta_1, \theta_2)}\right]}$, we find their values as:

$$\boldsymbol{\frac{\partial}{\partial\theta_{\textbf{0}}}\left[-\log_e{L(\theta_0, \theta_1, \theta_2)}\right]}\ =\ -\ \sum_{i=1}^{N}\left(C_i - \frac{1}{1 + e^{-(\hat\theta_0 + \hat\theta^{T}.X)}}\right)$$

$$\ $$

$$\boldsymbol{\nabla\left[-\log_e{L(\theta_0, \theta_1, \theta_2)}\right]}\ = \ \begin{bmatrix}
                            \frac{\partial}{\partial\theta_{\textbf{1}}}\left[-\log_e{L(\theta_0, \theta_1, \theta_2)}\right] \\
                            \frac{\partial}{\partial\theta_{\textbf{2}}}\left[-\log_e{L(\theta_0, \theta_1, \theta_2)}\right]
                            \end{bmatrix}\ = \ 
                            \begin{bmatrix}
                 -\ \sum\limits_{i=1}^{N}\ x_1^i\ .\left(C_i - \frac{1}{1 + e^{-(\hat\theta_0 + \hat\theta^{T}.X)}}\right)\\
                 -\ \sum\limits_{i=1}^{N}\ x_2^i\ .\left(C_i - \frac{1}{1 + e^{-(\hat\theta_0 + \hat\theta^{T}.X)}}\right)
                 \end{bmatrix}
                            $$
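
One way to sanity-check the closed-form expressions above is to compare them with finite differences. The sketch below treats them as the gradient of the negative log likelihood $-\log_e L$ (the function being minimized) and checks them on made-up data. The helper names `analytic_gradient` and `numeric_gradient`, the toy data and the test point are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nll(params, X, C):
    """Negative log likelihood; params = [theta_0, theta_1, ..., theta_M]."""
    p = sigmoid(params[0] + X @ params[1:])
    return -np.sum(C * np.log(p) + (1 - C) * np.log(1.0 - p))

def analytic_gradient(params, X, C):
    """Gradient of the negative log likelihood from the closed-form expressions."""
    error = C - sigmoid(params[0] + X @ params[1:])
    return np.concatenate(([-np.sum(error)], -X.T @ error))

def numeric_gradient(params, X, C, h=1e-6):
    """Central finite differences, one parameter at a time."""
    grad = np.zeros_like(params)
    for j in range(params.size):
        step = np.zeros_like(params)
        step[j] = h
        grad[j] = (nll(params + step, X, C) - nll(params - step, X, C)) / (2 * h)
    return grad

rng = np.random.default_rng(1)
X_toy = rng.normal(size=(50, 2))                    # made-up features
C_toy = rng.integers(0, 2, size=50).astype(float)   # made-up labels
params = np.array([0.2, -0.4, 0.7])                 # arbitrary test point

g_analytic = analytic_gradient(params, X_toy, C_toy)
g_numeric = numeric_gradient(params, X_toy, C_toy)
print(np.max(np.abs(g_analytic - g_numeric)))
```

If the two gradients disagree by more than a small numerical error, the usual suspects are a sign mistake or a missing feature factor $x_j^i$.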

## Let's Code it

In [1]:
import pandas as pd
import numpy as np

In [11]:
data_ini = pd.read_csv('/home/pramila/Desktop/DataSets/TumorData.csv')

In [12]:
data = data_ini.copy()

In [13]:
data.drop(labels=['Unnamed: 32', 'id'], axis=1, inplace=True)

In [14]:
data.head()

| | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | symmetry_mean | ... | radius_worst | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | M | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | ... | 25.38 | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 |
| 1 | M | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | ... | 24.99 | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 |
| 2 | M | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | ... | 23.57 | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 |
| 3 | M | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | 0.2597 | ... | 14.91 | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 |
| 4 | M | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | 0.1809 | ... | 22.54 | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 |


In [15]:
data['diagnosis'] = data['diagnosis'].map({'M': 1, 'B': 0})  # encode malignant as 1, benign as 0

In [16]:
training_data = data.iloc[:int(0.75*data.shape[0]),:]

In [17]:
testing_data = data.iloc[int(0.75*data.shape[0]):, :]  # remaining 25% of the rows

In [18]:
C = training_data['diagnosis']

In [19]:
C

0      1
1      1
2      1
3      1
4      1
      ..
421    0
422    0
423    0
424    0
425    0
Name: diagnosis, Length: 426, dtype: int64

In [20]:
C = np.array(C).reshape(C.shape[0],1)

In [21]:
X = training_data.drop(labels='diagnosis', axis=1)

In [22]:
X = (X - X.mean()) / X.std()  #Standardizing the training data (zero mean, unit standard deviation per column)
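
Standardization subtracts each column's mean and divides by its standard deviation. When the testing rows are scaled later, a common choice is to reuse the statistics computed on the training rows, so that both sets live on the same scale. The sketch below shows this on two tiny made-up tables; `train_features` and `test_features` are hypothetical stand-ins, not the tumor data.

```python
import numpy as np
import pandas as pd

# Made-up stand-ins for the training and testing feature tables
train_features = pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0],
                               'b': [10.0, 20.0, 30.0, 40.0]})
test_features = pd.DataFrame({'a': [2.5, 5.0],
                              'b': [25.0, 0.0]})

# Statistics come from the training rows only
train_mean = train_features.mean()
train_std = train_features.std()

train_scaled = (train_features - train_mean) / train_std
test_scaled = (test_features - train_mean) / train_std   # reuse training statistics

print(train_scaled)
print(test_scaled)
```

After scaling, every training column has mean 0 and standard deviation 1, while the testing columns are only approximately centered.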

In [23]:
X = np.array(X)

In [24]:
#X

In [28]:
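def the_sigmoid_output(theta0, theta1):
    '''calculates posterior probability P(class='1' | X) for every row of X,
    i.e probability represented as sigmoid function of theta0 + X.theta1
    '''
    return 1 / (1 + np.exp(-(theta0 + np.matmul(X, theta1))))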

array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
        0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0,
        1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0,
        1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1,
        1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0,
        0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1,
        1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0,
        0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0,
        1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1,
        1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 

In [None]:
def neg_log_likelihood(theta0, theta1, C):
    '''Calculates negative log likelihood function's value for given thetas
    '''
    sigmoid_output = the_sigmoid_output(theta0, theta1)
    
    first_term = np.matmul(C.T, np.log(sigmoid_output))
    second_term = np.matmul((1-C).T, np.log(1 - sigmoid_output))
        
    # Negate the log likelihood and unwrap the 1x1 matrix into a scalar
    return -(first_term + second_term).item()

In [None]:
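def derivative_theta0(theta0, theta1, C):
    '''Returns derivative of the negative log likelihood with respect to theta0,
    which we are going to need in gradient descent
    '''
    return -np.sum(C - the_sigmoid_output(theta0, theta1))

In [None]:
tolerance = 10**(-7)

step_size = 10**(-4)

# Assumed starting point: all parameters at zero
theta0_initial = 0.0
theta_initial = np.zeros((X.shape[1], 1))

while True:
    
    # Gradients of the negative log likelihood at the current parameters
    derivative0 = derivative_theta0(theta0_initial, theta_initial, C)
    gradient_matrix = -np.matmul(X.T, C - the_sigmoid_output(theta0_initial, theta_initial))
    
    # Gradient descent steps against the gradient
    theta0_final = theta0_initial - (step_size * (derivative0))
    
    theta_final = theta_initial - (step_size * (gradient_matrix))
    
    # Stop once the objective changes by less than the tolerance
    if abs(neg_log_likelihood(theta0_final, theta_final, C) - neg_log_likelihood(theta0_initial, theta_initial, C)) < tolerance :
        break
    
    theta0_initial =  theta0_final
    theta_initial = theta_final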
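
Once the loop has converged, the fitted parameters can be turned into class predictions by thresholding the posterior probability. The sketch below is one possible way to do that; `predict_labels`, `accuracy`, the 0.5 threshold and the demo values are assumptions and are not part of the notebook's code.

```python
import numpy as np

def predict_labels(theta0, theta, X, threshold=0.5):
    """Assign class 1 when the posterior probability reaches the threshold."""
    p = 1.0 / (1.0 + np.exp(-(theta0 + X @ theta)))
    return (p >= threshold).astype(int)

def accuracy(predicted, actual):
    """Fraction of examples whose predicted label matches the actual label."""
    return np.mean(predicted.ravel() == actual.ravel())

# Made-up parameters and data, only to show the call pattern
theta0_demo = -0.5
theta_demo = np.array([[2.0], [-1.0]])        # column vector, shape (M, 1)
X_demo = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
C_demo = np.array([[1], [0], [1], [0]])

predictions = predict_labels(theta0_demo, theta_demo, X_demo)
print(predictions.ravel(), accuracy(predictions, C_demo))
```

On real data the same two functions would be called with the fitted parameters and the standardized testing features in place of the demo values.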
    
    
    