# **Binary Logistic Regression** using Gradient Descent 

**Binary Logistic Regression** is a discriminative classifier: instead of first modelling the likelihood and then deriving the posterior probability from it, we model the posterior probability directly with a parametric function (the sigmoid).
### Let's see how:

We know that the posterior probability of one of the classes (when the data has only 2 classes), for each example (row) in our data, looks like:

$$ \textbf{P(class='1' | X)} = \frac{1}{1 + e^{-(\hat\theta_0 + \hat\theta^{T}.X)}}$$ 

And, for another class:

$$\textbf{P(class='0' | X)} = 1 - \frac{1}{1 + e^{-(\hat\theta_0 + \hat\theta^{T}.X)}}$$
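
As a quick numerical check of the two formulas above, the sketch below evaluates both posteriors for a single example. The parameter values and the feature vector are made up for illustration; they do not come from the tumor dataset used later.

```python
import numpy as np

def sigmoid(z):
    """Logistic function 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters and one example with two features (illustrative values only)
theta0 = -0.5
theta = np.array([1.2, -0.7])
x = np.array([0.8, 0.3])

p_class1 = sigmoid(theta0 + theta @ x)   # P(class='1' | X)
p_class0 = 1.0 - p_class1                # P(class='0' | X)

print(p_class1, p_class0)
```

The two values always sum to 1, which is why only one sigmoid needs to be evaluated per example.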

**Suppose:** our data (already preprocessed and normalized) has N rows (examples) and M columns (features).

**1.** For each example i of the N examples, there is an actual class that it belongs to. That class is either `0` or `1` (in encoded form).

**2.** Let's see what the class column looks like:

$$\begin{bmatrix}
0\\
0\\
1\\
\vdots\\
1
\end{bmatrix}$$

**3.** To build a model of the posterior probability, we combine the posterior probabilities of both classes into a single **likelihood function**. Let's see how:

***A.*** For each example i, the posterior probability can be written as:
    
$$\boldsymbol{ \textbf{P(class=}C_i\textbf{)}} = (\hat p)^{C_i}\ (1-\hat p)^{1 - C_i}$$ 

where $\hat p$ is the posterior probability for class `1` and $C_i$ is the class label for a given example $i$.
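
The compact form above simply selects $\hat p$ when $C_i = 1$ and $1 - \hat p$ when $C_i = 0$. A minimal sketch, where the function name `bernoulli_posterior` and the value `0.8` are assumptions made for illustration:

```python
def bernoulli_posterior(p_hat, c):
    """P(class = c) for a label c in {0, 1}, given p_hat = P(class = 1)."""
    return (p_hat ** c) * ((1 - p_hat) ** (1 - c))

p_hat = 0.8  # hypothetical posterior for class 1

print(bernoulli_posterior(p_hat, 1))  # the exponent pattern keeps p_hat
print(bernoulli_posterior(p_hat, 0))  # the exponent pattern keeps 1 - p_hat
```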

***B.*** Combining the above formula for all examples:

$\textbf{Likelihood}$ of observing $(C_1 = \text{'0'}\ \cap\ C_2 = \text{'0'}\ \cap\ \dots \cap\ C_N = \text{'1'})$, assuming the examples are independent, is:


$$\textbf{L}\ =\ \prod^N_{i = 1}\ (\hat p)^{C_i}\ (1- \hat p)^{1 - C_i}$$


$$\textbf{where,}\ \ \ \ \ {\hat p}\ =\ \frac{1}{1 + e^{-(\hat\theta_0 + \hat\theta^{T}.X)}}$$ 


$$\textbf{and}\ C_i\ \text{is the class label for the}\ i^{\text{th}}\ \text{example}$$

***C.*** Taking $\log_e$ of both sides, our likelihood function becomes:
$$\ $$

$$\boldsymbol{\log_e{\left(\textbf{L}(\theta_0, \theta_1, \theta_2)\right)}}\ =\ \sum_{i=1}^{N}\ \left(\left[C_i\ .\log_e{\frac{1}{1 + e^{-(\hat\theta_0 + \hat\theta^{T}.X)}}}\right]\ \ +\  \left[(1 - C_i)\ .\log_e{\left(1 - \frac{1}{1 + e^{-(\hat\theta_0 + \hat\theta^{T}.X)}}\right)}\right]\right)$$

$$\ $$   
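
The step from the product to the sum can be verified numerically. In this sketch the four posteriors and labels are hypothetical values; it computes the likelihood as a product and the log-likelihood as a sum, and the two agree after exponentiating.

```python
import numpy as np

# Hypothetical class-1 posteriors and true labels for four examples
p_hat = np.array([0.9, 0.2, 0.7, 0.4])
C = np.array([1, 0, 1, 0])

# Likelihood: product of the per-example terms p^C * (1 - p)^(1 - C)
likelihood = np.prod(p_hat ** C * (1 - p_hat) ** (1 - C))

# Log-likelihood: sum of the per-example log terms
log_likelihood = np.sum(C * np.log(p_hat) + (1 - C) * np.log(1 - p_hat))

print(likelihood, np.exp(log_likelihood))
```

Working with the sum is usually preferable in practice, because a product of many probabilities quickly underflows to zero for large N.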

This is our final **Log Likelihood Function**, which we have to maximize, i.e:
$$\ $$

$$\boldsymbol{\underset{\hat\theta_0,\hat\theta_1, \hat\theta_2}{\textbf{max}}}\ \ \log_e{\left(\textbf{L}(\theta_0, \theta_1, \theta_2)\right)}$$

$$\textbf{=>}\ \ \ \boldsymbol{\underset{\hat\theta_0,\hat\theta_1, \hat\theta_2}{\textbf{min}}}\ \ \text{-}\ \log_e{\left(\textbf{L}(\theta_0, \theta_1, \theta_2)\right)}$$

$$\ $$
Writing the negative sign into the log-likelihood expression gives the **negative log-likelihood** function, and the optimization problem becomes:
$$\ $$

$$\boldsymbol{- \log_e{\left(\textbf{L}(\theta_0, \theta_1, \theta_2)\right)}}\ =\ - \ \sum_{i=1}^{N}\ \left(\left[C_i\ .\log_e{\frac{1}{1 + e^{-(\hat\theta_0 + \hat\theta^{T}.X)}}}\right]\ \ +\ \left[(1 - C_i)\ .\log_e{\left(1 - \frac{1}{1 + e^{-(\hat\theta_0 + \hat\theta^{T}.X)}}\right)}\right]\right)$$

$$\ $$

$$\boldsymbol{\underset{\hat\theta_0,\hat\theta_1, \hat\theta_2}{\textbf{min}}}\ \ - \ \log_e{\left(\textbf{L}(\theta_0, \theta_1, \theta_2)\right)}$$

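**4.** Now we are going to solve our minimization problem using **Gradient Descent**. Descending along the gradient of $-\log_e L$ is the same as stepping along the gradient of $\log_e L$, so each update adds $\epsilon$ times the derivative of $\log_e L$: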
$$\ $$

$$\theta_{\textbf{0}, final}\ =\ \theta_{\textbf{0}, initial} + \epsilon\ .\frac{\partial}{\partial\theta_{\textbf{0}}}\left[\log_e{L(\theta_0, \theta_1, \theta_2)}\right] \Bigg|_{\substack{\theta_0 = \theta_{0,initial}\\ \theta_1 = \theta_{1,initial}\\ \theta_2 = \theta_{2, initial}}}$$

$$\ $$

$$\boldsymbol{\theta_{\text{final}}}\ = \ \begin{bmatrix}
                                                \theta_{\textbf{1},final} \\
                                                \theta_{\textbf{2},final}
                                                \end{bmatrix}
                                                \ =\ \begin{bmatrix}
                                                \theta_{\textbf{1},initial} \\
                                                \theta_{\textbf{2},initial}
                                                \end{bmatrix} 
                                                \ +\ \epsilon\ .\nabla\log_e{L(\theta_0, \theta_1, \theta_2)} \Bigg|_{\substack{\theta_0 = \theta_{0,initial}\\ \theta_1 = \theta_{1,initial}\\ \theta_2 = \theta_{2, initial}}}
                                                $$
                                                
                                                
$$\textbf{where,}\ \ \ \boldsymbol{\epsilon}\ \text{is the step-size}$$
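
The update rule can be tried on a one-parameter toy problem before applying it to the log-likelihood. The concave function $f(\theta) = -(\theta - 3)^2$, the starting point and the step-size below are all assumptions chosen for illustration; repeatedly adding $\epsilon$ times the derivative moves $\theta$ toward the maximizer $\theta = 3$.

```python
def grad(theta):
    """Derivative of the toy objective f(theta) = -(theta - 3)^2."""
    return -2.0 * (theta - 3.0)

theta = 0.0      # arbitrary starting point
epsilon = 0.1    # step-size

for _ in range(100):
    # theta_final = theta_initial + epsilon * derivative at theta_initial
    theta = theta + epsilon * grad(theta)

print(theta)
```

A step-size that is too large would make the iterates overshoot and diverge, which is one reason a small value such as the one above is a cautious default.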

***A.*** Calculating $\boldsymbol{\frac{\partial}{\partial\theta_{\textbf{0}}}\log_e{L(\theta_0, \theta_1, \theta_2)}}$ and $\boldsymbol{\nabla\log_e{L(\theta_0, \theta_1, \theta_2)}}$ we find their values as:

$$\boldsymbol{\frac{\partial}{\partial\theta_{\textbf{0}}}\log_e{L(\theta_0, \theta_1, \theta_2)}}\ =\ \sum_{i=1}^{N}\left(C_i - \frac{1}{1 + e^{-(\hat\theta_0 + \hat\theta^{T}.X)}}\right)$$

$$\ $$

$$\boldsymbol{\nabla\log_e{L(\theta_0, \theta_1, \theta_2)}}\ = \ \begin{bmatrix}
                            \frac{\partial}{\partial\theta_{\textbf{1}}}\log_e{L(\theta_0, \theta_1, \theta_2)} \\
                            \frac{\partial}{\partial\theta_{\textbf{2}}}\log_e{L(\theta_0, \theta_1, \theta_2)}
                            \end{bmatrix}\ = \ 
                            \begin{bmatrix}
                 \sum\limits_{i=1}^{N}\ x_1^i\ .\left(C_i - \frac{1}{1 + e^{-(\hat\theta_0 + \hat\theta^{T}.X)}}\right)\\
                 \sum\limits_{i=1}^{N}\ x_2^i\ .\left(C_i - \frac{1}{1 + e^{-(\hat\theta_0 + \hat\theta^{T}.X)}}\right)
                 \end{bmatrix}
                            $$
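
A finite-difference comparison is a simple way to check derivative formulas like these. The sketch below takes $\sum_i (C_i - \hat p_i)$ and $\sum_i x^i (C_i - \hat p_i)$ as the derivatives of $\log_e L$ and compares them with central differences. The random data, the parameter values and the function names are assumptions made only for this check.

```python
import numpy as np

def log_likelihood(theta0, theta, X, C):
    """Log-likelihood of labels C under the sigmoid model."""
    p = 1.0 / (1.0 + np.exp(-(theta0 + X @ theta)))
    return np.sum(C * np.log(p) + (1 - C) * np.log(1 - p))

def analytic_gradient(theta0, theta, X, C):
    """Derivatives of the log-likelihood with respect to theta0 and theta."""
    p = 1.0 / (1.0 + np.exp(-(theta0 + X @ theta)))
    residual = C - p
    return np.sum(residual), X.T @ residual

# Hypothetical data: 20 examples, 2 features, random binary labels
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
C = rng.integers(0, 2, size=20)
theta0, theta = 0.1, np.array([0.3, -0.2])

d0, d = analytic_gradient(theta0, theta, X, C)

# Central differences with a small perturbation h
h = 1e-6
num_d0 = (log_likelihood(theta0 + h, theta, X, C)
          - log_likelihood(theta0 - h, theta, X, C)) / (2 * h)
num_d = np.array([
    (log_likelihood(theta0, theta + h * e, X, C)
     - log_likelihood(theta0, theta - h * e, X, C)) / (2 * h)
    for e in np.eye(2)
])

print(np.isclose(d0, num_d0, atol=1e-5), np.allclose(d, num_d, atol=1e-5))
```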

## Let's Code It

### Prepping Data

In [None]:
import pandas as pd
import numpy as np

In [None]:
data_ini = pd.read_csv('/home/pramila/Desktop/DataSets/TumorData.csv')

In [None]:
data = data_ini.copy()

In [None]:
data.drop(labels=['Unnamed: 32', 'id'], axis=1, inplace=True)

In [None]:
data.head()

In [None]:
data['diagnosis'] = data['diagnosis'].map({'M': 1, 'B': 0})  # Encode 'M' as 1 and 'B' as 0

### Test-Train

In [None]:
training_data = data.iloc[:int(0.75*data.shape[0]),:]

In [None]:
testing_data = data.iloc[int(0.75*data.shape[0]):, :]  # Remaining 25% of the rows
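
The 75/25 split relies on slicing with a colon on both sides of the split index. A small sketch on a hypothetical 8-row frame (`toy` and its columns are made up for illustration) shows the shapes that result:

```python
import pandas as pd

# Hypothetical 8-row frame standing in for the tumor data
toy = pd.DataFrame({'diagnosis': [1, 0, 0, 1, 0, 1, 1, 0],
                    'feature': range(8)})

split = int(0.75 * toy.shape[0])   # index of the first test row

toy_train = toy.iloc[:split, :]    # first 75% of the rows
toy_test = toy.iloc[split:, :]     # remaining 25% of the rows

print(toy_train.shape, toy_test.shape)
```

Note that `toy.iloc[split, :]` without the colon would return only the single row at position `split` as a Series, not the remaining rows.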

### Normalizing the training data

In [None]:
C = training_data['diagnosis']

In [None]:
C

In [None]:
C = np.array(C).reshape(C.shape[0],1)

In [None]:
X = training_data.drop(labels='diagnosis', axis=1)

In [None]:
X = (X - X.mean()) / X.std()  # Standardize each feature: zero mean, unit standard deviation
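
Standardization subtracts each column's mean and divides by its standard deviation. The sketch below checks the result on a hypothetical 5 x 2 matrix (`toy_X` is not part of the dataset) whose columns are on very different scales.

```python
import numpy as np

# Hypothetical feature matrix: 5 examples, 2 features on different scales
toy_X = np.array([[1.0, 200.0],
                  [2.0, 180.0],
                  [3.0, 240.0],
                  [4.0, 210.0],
                  [5.0, 170.0]])

mean = toy_X.mean(axis=0)
std = toy_X.std(axis=0, ddof=1)   # ddof=1 matches the pandas default for .std()

toy_X_scaled = (toy_X - mean) / std

print(toy_X_scaled.mean(axis=0))          # approximately 0 for every column
print(toy_X_scaled.std(axis=0, ddof=1))   # approximately 1 for every column
```

When the test data is used later, a common practice is to scale it with the mean and standard deviation computed from the training data, so that both sets share the same transformation.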

In [None]:
X = np.array(X)

In [None]:
#X

In [None]:
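def the_sigmoid_output(theta0, theta, X):
    '''Calculates the posterior probability P(class=1 | X) with the sigmoid function.

    X has shape (N, M) and theta has shape (M, 1); the result has shape (N, 1).
    '''
    # Linear score theta0 + x . theta for every row of X
    exp_power = theta0 + np.matmul(X, theta)
    
    denominator = 1 + np.exp(-exp_power)
    
    return 1/denominator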

In [None]:
def neg_log_likelihood(theta0, theta, X, C):
    '''Calculates the negative log-likelihood for the given thetas
    '''
    sigmoid_output = the_sigmoid_output(theta0, theta, X)
    
    # Sum over examples of C_i * log(p_i)
    first_term = np.matmul(C.T, np.log(sigmoid_output))
    
    # Sum over examples of (1 - C_i) * log(1 - p_i)
    second_term = np.matmul((1-C).T, np.log(1 - sigmoid_output))
        
    # Negate the log-likelihood and return it as a scalar
    return -(first_term + second_term).item()

In [None]:
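def derivative_theta0(theta0, theta, X):
    '''Returns the derivative of the log-likelihood with respect to theta0.

    Uses the label array C defined above.
    '''
    sigmoid_output = the_sigmoid_output(theta0, theta, X)
    
    # Sum the residuals (C_i - p_i) over all examples to get a scalar
    return np.sum(C - sigmoid_output)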

In [None]:
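def derivative_theta(theta0, theta, X):
    '''Returns the gradient of the log-likelihood with respect to theta, shape (M, 1).

    Uses the label array C defined above.
    '''
    sigmoid_output = the_sigmoid_output(theta0, theta, X)
    
    # Row j is the sum over examples of x_j^i * (C_i - p_i)
    return np.matmul(X.T, (C - sigmoid_output))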

In [None]:
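tolerance = 10**(-7)

step_size = 10**(-4)

# Starting point: all zeros (an assumed choice; other finite starting values could be used)
theta0_initial = 0.0

theta_initial = np.zeros((X.shape[1], 1))

while True:
    
    derivative0 = derivative_theta0(theta0_initial, theta_initial, X)
    
    gradient_matrix = derivative_theta(theta0_initial, theta_initial, X)
    
    # Step along the gradient of the log-likelihood (descent on the negative log-likelihood)
    theta0_final = theta0_initial + (step_size * (derivative0))
    
    theta_final = theta_initial + (step_size * (gradient_matrix))
    
    # Stop once the negative log-likelihood changes by less than the tolerance
    if (abs(neg_log_likelihood(theta0_final, theta_final, X, C) - neg_log_likelihood(theta0_initial, theta_initial, X, C)) < tolerance) :
        break
    
    theta0_initial =  theta0_final
    theta_initial = theta_final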
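
Once the parameters are learned, the posterior can be turned into class labels. The following self-contained sketch runs the same kind of update on synthetic, linearly separable data and then labels an example `1` when its posterior is at least 0.5. The data, the `demo_` names, the fixed iteration count and the 0.5 threshold are all assumptions made for illustration, not part of the notebook above.

```python
import numpy as np

def predict(theta0, theta, X, threshold=0.5):
    """Label an example 1 when its class-1 posterior reaches the threshold."""
    p = 1.0 / (1.0 + np.exp(-(theta0 + X @ theta)))
    return (p >= threshold).astype(int)

# Synthetic data: 200 examples, 2 features, class 1 when the features sum to > 0
rng = np.random.default_rng(0)
demo_X = rng.normal(size=(200, 2))
demo_C = (demo_X[:, [0]] + demo_X[:, [1]] > 0).astype(int)   # shape (200, 1)

demo_theta0, demo_theta = 0.0, np.zeros((2, 1))
step = 1e-2

for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(demo_theta0 + demo_X @ demo_theta)))
    residual = demo_C - p                                    # (C_i - p_i) per example
    demo_theta0 = demo_theta0 + step * np.sum(residual)
    demo_theta = demo_theta + step * (demo_X.T @ residual)

accuracy = np.mean(predict(demo_theta0, demo_theta, demo_X) == demo_C)
print(accuracy)
```

The same `predict` idea could be applied to the held-out test rows, after scaling them consistently with the training data.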
    
    
    