# Logistic Regression

This tutorial covers logistic regression. It aims to achieve the following: describe a situation in which logistic regression is a useful algorithm, provide a mathematical perspective on logistic regression, and apply logistic regression to a sample dataset to show how it can be implemented in Python, R and Julia.

### Estimating probability
Logistic regression models the probability of a categorical (here binary) outcome, which makes it a natural tool for classification.

Imagine we want to predict the probability that a student will pass an exam based on their test marks across the year. Logistic regression provides us with a tool to turn a set of input features into a probability measure.

The probability of the response is conditional on a set of variables, which can themselves be either continuous or discrete.

Logistic regression works in a very similar manner to linear regression, but the linear prediction is passed through a sigmoid function, which maps any real number to a value between 0 and 1. That output can then be interpreted as a probability.
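To make the squashing behaviour concrete, here is a minimal sketch using NumPy. The input values `z` are made up for illustration and stand in for the output of a linear model; they are not taken from any dataset.

```python
import numpy as np

def sigmoid(z):
    # Map any real number (or array of them) into the open interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical outputs of a linear model (illustrative values only)
z = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
p = sigmoid(z)

for zi, pi in zip(z, p):
    print(f"z = {zi:+.1f}  ->  p = {pi:.4f}")
```

Large negative inputs map close to 0, large positive inputs map close to 1, and an input of 0 maps to exactly 0.5.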

## Mathematics
Logistic regression finds its basis in the logit function, the logarithm (in base $b$) of the odds $\frac{p}{1-p}$:
$$
 \ell = \log_b \frac{p}{1-p}
$$

Consider the case in which we have a set of predictors $x = [x_1, \dots, x_n]$ and a binary response $y$. We denote the probability that $y = 1$ by $p = P(y=1)$. Assume the logit is a linear function of the predictors, governed by the weights $\theta = [\theta_1, \dots, \theta_n]$:

$$
    \ell = \log_b \frac{p}{1-p} = \theta^T \cdot x
$$

The odds $\frac{p}{1-p}$ can be recovered by exponentiating the logit function:

$$
\frac{p}{1-p} = b^{\theta^T \cdot x}
$$

By simple algebra we can extract the probability

$$p = \frac{b^{\theta^T \cdot x}}{b^{\theta^T \cdot x} + 1} = \frac{1}{1 + b^{-\theta^T \cdot x}} = S_b(\theta^T \cdot x)$$

where $S_b$ is the sigmoid function (the inverse of the logit) with base $b$. This means that once $\theta$ is fixed we can easily compute the probability of the response. In most applications the base $b$ is taken to be $e$, although it can occasionally be easier to work in other bases.
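The inverse relationship between the logit and the sigmoid can be checked numerically. The sketch below uses only the standard library; the function names `logit` and `sigmoid` and the example probability `p = 0.8` are chosen here for illustration.

```python
import math

def logit(p, b=math.e):
    # Log-odds of p, with the logarithm taken in base b
    return math.log(p / (1.0 - p), b)

def sigmoid(z, b=math.e):
    # Inverse of the logit: maps log-odds in base b back to a probability
    return 1.0 / (1.0 + b ** (-z))

p = 0.8  # illustrative probability; the odds are 0.8 / 0.2 = 4
for b in (2.0, math.e, 10.0):
    z = logit(p, b)
    p_back = sigmoid(z, b)
    print(f"base {b:.3f}: logit = {z:.4f}, recovered p = {p_back:.4f}")
```

Changing the base rescales the log-odds (in base 2 the odds of 4 give a logit of 2), but applying the sigmoid with the same base always recovers the original probability.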

### Finding Theta

Logistic regression concerns itself with finding the feature weights $\theta$ that maximise the probability of the observations in a training set. Consider a generalised linear model

$$
h_\theta(X) = \frac{1}{1+e^{-\theta \cdot X}} = Pr(Y=1|X;\theta)
$$

This therefore returns some value $0 \leq h_\theta(X) \leq 1$. Since $Y \in \{0,1\}$, $Pr(y|X;\theta)=h_{\theta}(X)^y(1-h_{\theta}(X))^{(1-y)}$. Given this probability, a likelihood function can be calculated under the assumption that all observations in the sample are independent.

$$L(\theta| y;x) = Pr(Y| X; \theta) = \prod_iPr(y_i|x_i;\theta) = \prod_ih_{\theta}(x_i)^{y_i}(1-h_{\theta}(x_i))^{(1-y_i)}$$

Breaking this down, consider the case where the true label is $y=1$. Then $(1-h_{\theta}(x))^{(1-y)} = 1$, leaving $h_{\theta}(x)^y = h_{\theta}(x)$, so in this case we want the predicted probability to be high. If the true label is $y=0$ the opposite holds: the term reduces to $1-h_{\theta}(x)$, so we want the predicted probability to be low. We therefore want to find the parameters $\theta$ that maximise this likelihood function. In practice this is done by minimising the negative log-likelihood, for example with gradient descent.
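Taking the negative logarithm of the likelihood and averaging over the $m$ observations gives the cost $J(\theta) = -\frac{1}{m}\sum_i \left[ y_i \log h_{\theta}(x_i) + (1-y_i) \log(1-h_{\theta}(x_i)) \right]$, whose gradient is $\frac{1}{m} X^T (h_{\theta}(X) - y)$. The following is a minimal sketch of plain gradient descent on that cost. The toy data, the learning rate and the iteration count are all assumptions made for illustration, not tuned values.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def negative_log_likelihood(theta, X, y):
    # Average of -[y log h + (1 - y) log(1 - h)] over all observations
    h = sigmoid(X @ theta)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

def gradient(theta, X, y):
    # Gradient of the average negative log-likelihood: X^T (h - y) / m
    return X.T @ (sigmoid(X @ theta) - y) / X.shape[0]

# Toy data (made up for illustration): a column of ones for the intercept
# plus one feature. The classes overlap, so the optimum is finite.
X = np.array([[1.0, 0.5], [1.0, 1.0], [1.0, 1.5], [1.0, 2.0],
              [1.0, 2.5], [1.0, 3.0], [1.0, 3.5], [1.0, 4.0]])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0])

theta = np.zeros(X.shape[1])
learning_rate = 0.1  # assumed step size, not tuned
initial_cost = negative_log_likelihood(theta, X, y)

for _ in range(5000):
    # Step against the gradient to reduce the cost
    theta = theta - learning_rate * gradient(theta, X, y)

final_cost = negative_log_likelihood(theta, X, y)
print("initial cost:", initial_cost)
print("final cost:  ", final_cost)
print("theta:       ", theta)
```

Starting from $\theta = 0$ every prediction is 0.5, so the initial cost equals $\ln 2$; each step then moves $\theta$ in the direction that makes the observed labels more probable.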



## Components
In this section we will build the components needed to perform logistic regression. We will then analyse the various components on which logistic regression is based.

### Python

```python
import numpy as np
from scipy.optimize import fmin_tnc
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

X = data.data
y = data.target
# Start from all-zero weights. theta is 1-D so that predictions have the
# same shape as y and the cost is not broadcast to an (m, m) matrix.
theta = np.zeros(X.shape[1])

# Define the model we are using

def sigmoid(x):
    # Activation function used to map any real value between 0 and 1
    return 1 / (1 + np.exp(-x))


def net_input(theta, x):
    # Computes the weighted sum of inputs
    return np.dot(x, theta)


def probability(theta, x):
    # Returns the probability after passing through sigmoid
    return sigmoid(net_input(theta, x))


# Define the cost function

def cost_function(theta, x, y):
    # Computes the average negative log-likelihood over the training samples
    m = x.shape[0]
    total_cost = -(1 / m) * np.sum(
        y * np.log(probability(theta, x)) + (1 - y) * np.log(
            1 - probability(theta, x)))
    return total_cost


# Define the optimisation method

def gradient(theta, x, y):
    # Computes the gradient of the cost function at the point theta
    m = x.shape[0]
    return (1 / m) * np.dot(x.T, sigmoid(net_input(theta, x)) - y)


def fit(x, y, theta):
    # Minimise the cost with SciPy's truncated Newton (TNC) solver;
    # fmin_tnc returns a tuple whose first element is the optimised weights
    opt_weights = fmin_tnc(func=cost_function, x0=theta,
                           fprime=gradient, args=(x, y.flatten()))
    return opt_weights[0]


parameters = fit(X, y, theta)

print("Initial Cost: ", cost_function(theta, X, y))
print("Optimised Cost: ", cost_function(parameters, X, y))
```

With all weights at zero every predicted probability is 0.5, so the initial cost is $\ln 2 \approx 0.6931$:

```
Initial Cost:  0.6931...
Optimised Cost:  0.04099637846897899
```
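Once weights have been fitted, the predicted probabilities can be turned into class labels by thresholding. The sketch below is self-contained and uses made-up weights and data rather than the fitted breast cancer parameters; the helpers `predict` and `accuracy` and the threshold of 0.5 are assumptions chosen for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(theta, X, threshold=0.5):
    # Hypothetical helper: label an observation 1 when its predicted
    # probability reaches the threshold, otherwise 0
    return (sigmoid(X @ theta) >= threshold).astype(int)

def accuracy(y_true, y_pred):
    # Fraction of labels predicted correctly
    return np.mean(y_true == y_pred)

# Made-up weights and data, standing in for fitted parameters
theta = np.array([-3.0, 1.5])  # [intercept, slope]
X = np.array([[1.0, 0.5], [1.0, 1.0], [1.0, 2.5], [1.0, 3.0], [1.0, 4.0]])
y = np.array([0, 0, 1, 1, 1])

y_pred = predict(theta, X)
print("predictions:", y_pred)
print("accuracy:   ", accuracy(y, y_pred))
```

A threshold of 0.5 corresponds to predicting 1 whenever the linear term $\theta^T \cdot x$ is non-negative; other thresholds trade off false positives against false negatives.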
