# Logistic Regression
## Binary Logistic Regression


### Derivation Process
Assume that the features $X_i$ are conditionally independent given $Y$,
where $X = \langle x_1, x_2, \dots, x_n \rangle$ is the feature vector.

Then model $P(X_i \mid Y=y_k)$ as a Gaussian $N(\mu_{ik}, \sigma_i)$ whose variance does not depend on the class $k$,
and model $P(Y)$ as Bernoulli($\pi$), with $\pi = P(Y=1)$.


It follows from Bayes' rule that

$$ P(Y=1 \mid X) = \frac{P(X \mid Y=1)P(Y=1)}{P(X)} = \frac{P(X \mid Y=1)P(Y=1)}{P(X \mid Y=1)P(Y=1)+P(X \mid Y=0)P(Y=0)} = \frac{1}{1+\frac{P(X \mid Y=0)P(Y=0)}{P(X \mid Y=1)P(Y=1)}} = \frac{1}{1+\exp\left(\ln\frac{P(X \mid Y=0)P(Y=0)}{P(X \mid Y=1)P(Y=1)}\right)}$$
$$= \frac{1}{1+\exp\left(\ln\frac{1-\pi}{\pi}+\sum_{i=1}^n \ln\frac{P(X_i \mid Y=0)}{P(X_i \mid Y=1)}\right)}$$

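Substitute the Gaussian PDF below. Because the variance $\sigma_i^2$ is shared by both classes, the quadratic terms in $x_i$ cancel in the log-ratio and the exponent becomes linear in $X$.

$$P(x_i\mid y_k, \mu_{ik},\sigma_i^2)={\frac {1}{\sqrt {2\pi \sigma_i ^{2}}}}e^{-{\frac {(x_i-\mu_{ik} )^{2}}{2\sigma_i ^{2}}}}$$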
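Written out for one feature, as a worked step that uses the class-independent $\sigma_i$ from the model above, the log-ratio is:

$$\ln\frac{P(X_i \mid Y=0)}{P(X_i \mid Y=1)} = \frac{(X_i-\mu_{i1})^2-(X_i-\mu_{i0})^2}{2\sigma_i^2} = \frac{\mu_{i0}-\mu_{i1}}{\sigma_i^2}X_i + \frac{\mu_{i1}^2-\mu_{i0}^2}{2\sigma_i^2}$$

Collecting terms, with $\pi = P(Y=1)$, the coefficients can be read off as:

$$\theta_i = \frac{\mu_{i1}-\mu_{i0}}{\sigma_i^2}, \qquad \theta_0 = \ln\frac{\pi}{1-\pi} + \sum_{i=1}^n \frac{\mu_{i0}^2-\mu_{i1}^2}{2\sigma_i^2}$$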

$$ P(Y=1 \mid X) = \frac{1}{1+\exp({-\theta_0 - \sum_{i=1}^n \theta_iX_i})}$$

Which implies:

$$ P(Y=0 \mid X) = \frac{\exp({-\theta_0 - \sum_{i=1}^n \theta_iX_i})}{1+\exp({-\theta_0 - \sum_{i=1}^n \theta_iX_i})}$$

and


$$ \frac{P(Y=0 \mid X)}{P(Y=1 \mid X)} = \exp({-\theta_0 - \sum_{i=1}^n \theta_iX_i}) $$

So,


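$$ \ln{\frac{P(Y=0 \mid X)}{P(Y=1 \mid X)}} = -\theta_0 - \sum_{i=1}^n \theta_iX_i $$

**This has the form of a Generalized Linear Model: the log-odds $\ln\frac{P(Y=1 \mid X)}{P(Y=0 \mid X)} = \theta_0 + \sum_{i=1}^n \theta_iX_i$ is linear in $X$.**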

- *Definition of Ordinary Linear Model*: 

The conditional mean of $Y$ given $X$ is a linear function of $X$ with coefficients $\theta$: $E(Y \mid X) = X\theta$

- *Definition of Generalized Linear Model*: 

The conditional mean of $Y$ given $X$ is a linear function of $X$ passed through a transformation: $E(Y \mid X) = g^{-1}(X\theta)$

($g$ is called the link function; for logistic regression, $g$ is the logit.)


**$\frac{1}{1+\exp({-\theta_0 - \sum_{i=1}^n \theta_iX_i})}$ is also called the Sigmoid Function**

**The logit is the log-odds**

**Odds = $P(Y=1 \mid x)/ (1-P(Y=1 \mid x))$**
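The relationship between the sigmoid, the odds, and the logit can be sketched in a few lines of Python. This is a minimal illustration; the function names and the value of `z` are chosen here for the example and are not part of the notes above. It shows that the logit inverts the sigmoid.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function 1 / (1 + exp(-z)), written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def odds(p: float) -> float:
    """Odds of an event with probability p (requires 0 <= p < 1)."""
    return p / (1.0 - p)

def logit(p: float) -> float:
    """Log-odds, the inverse of the sigmoid (requires 0 < p < 1)."""
    return math.log(odds(p))

z = 0.8                # hypothetical value of theta_0 + sum_i theta_i * X_i
p = sigmoid(z)         # P(Y=1 | X)
print(p, odds(p), logit(p))  # logit(p) recovers z up to rounding
```

Branching on the sign of `z` keeps `math.exp` from overflowing when `z` is a large negative number.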

### Definition of Logistic Regression

$$F(x) = P(Y=1 \mid X) = \frac{1}{1+\exp({-\theta_0 - \sum_{i=1}^n \theta_iX_i})}$$

$F(x)$ gives the probability that $Y=1$ given $X$.
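The derivation can be checked numerically. The sketch below assumes two features, made-up Gaussian parameters, and a per-feature variance shared by both classes. It computes $P(Y=1 \mid X)$ once directly from Bayes' rule and once from the logistic form, with $\theta_i = (\mu_{i1}-\mu_{i0})/\sigma_i^2$ and $\theta_0 = \ln\frac{\pi}{1-\pi} + \sum_i \frac{\mu_{i0}^2-\mu_{i1}^2}{2\sigma_i^2}$ obtained by expanding the Gaussian log-ratio. All names and numbers are hypothetical.

```python
import math

def gaussian_pdf(x: float, mu: float, sigma: float) -> float:
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

# Hypothetical model parameters (not taken from any data set).
prior1 = 0.3            # pi = P(Y=1)
mu0 = [0.0, 2.0]        # class-0 means, one per feature
mu1 = [1.0, -1.0]       # class-1 means, one per feature
sigma = [1.5, 0.7]      # per-feature standard deviation, shared by both classes

def posterior_bayes(x):
    """P(Y=1 | x) from Bayes' rule with conditionally independent Gaussian features."""
    joint1, joint0 = prior1, 1.0 - prior1
    for xi, m0, m1, s in zip(x, mu0, mu1, sigma):
        joint1 *= gaussian_pdf(xi, m1, s)
        joint0 *= gaussian_pdf(xi, m0, s)
    return joint1 / (joint1 + joint0)

# Coefficients implied by the Gaussian assumptions.
theta = [(m1 - m0) / s ** 2 for m0, m1, s in zip(mu0, mu1, sigma)]
theta0 = math.log(prior1 / (1.0 - prior1)) + sum(
    (m0 ** 2 - m1 ** 2) / (2 * s ** 2) for m0, m1, s in zip(mu0, mu1, sigma)
)

def posterior_logistic(x):
    """P(Y=1 | x) from the logistic form 1 / (1 + exp(-theta_0 - sum_i theta_i x_i))."""
    z = theta0 + sum(t * xi for t, xi in zip(theta, x))
    return 1.0 / (1.0 + math.exp(-z))

x = [0.4, 0.9]
print(posterior_bayes(x), posterior_logistic(x))  # the two values agree up to rounding
```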

## Maximum Likelihood Estimation

From the derivation above, $P(Y)$ is modeled as Bernoulli($\pi$), and

$$ P(Y=1 \mid X) = \frac{1}{1+\exp({-\theta_0 - \sum_{i=1}^n \theta_iX_i})}$$

Thus $h_{\theta}(x_i) = \frac{1}{1+\exp({-\theta_0 - \sum_{j} \theta_j x_{ij}})}$, where $x_{ij}$ is the $j$-th feature of the $i$-th training example, estimates the Bernoulli parameter $\pi_i = P(Y=1 \mid x_i)$ for that example.

$$P(Y=y_i \mid x_i) = \pi_i^{y_i}(1-\pi_i)^{1-{y_i}}$$

$$P(Y=y_i \mid x_i) = h_{\theta}(x_i)^{y_i}(1-h_{\theta}(x_i))^{1-{y_i}}$$

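Assuming the training examples are independent, and letting $n$ now denote the number of training examples rather than the number of features, the likelihood is

$$ L(\theta) = P(Y_1,Y_2,\dots,Y_n \mid x_1,x_2,\dots,x_n) = P(Y_1\mid x_1)P(Y_2 \mid x_2)\cdots P(Y_n\mid x_n) = \prod_{i=1}^n P(Y_i \mid x_i) $$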

$$ =  \prod_{i=1} ^n h_{\theta}(x_i)^{y_i}(1-h_{\theta}(x_i))^{1-{y_i}}$$

Taking the logarithm gives the log-likelihood:

$$l(\theta) = \ln{L} = \sum_{i=1}^n [{y_i} \ln{h_{\theta}(x_i)}+{(1-{y_i})}\ln{(1-h_{\theta}(x_i))}]$$

Maximizing it,

$$\arg\max_{\theta}\sum_{i=1}^n [{y_i} \ln{h_{\theta}(x_i)}+{(1-{y_i})}\ln{(1-h_{\theta}(x_i))}]$$

is equivalent to

$$\arg\min_{\theta}\sum_{i=1}^n [-{y_i} \ln{h_{\theta}(x_i)}-{(1-{y_i})}\ln{(1-h_{\theta}(x_i))}]$$

This is our objective function, also called the Cross Entropy Error or Log Loss.
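As a small illustration, the log loss can be computed directly from labels and predicted probabilities. The sketch below follows the summed (not averaged) form used above. The function name and the clipping constant `eps` are assumptions added here to avoid `log(0)`.

```python
import math

def log_loss(y_true, p_pred, eps=1e-15):
    """Summed cross-entropy: -sum_i [y_i ln p_i + (1 - y_i) ln(1 - p_i)]."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip so that log() never sees 0
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total

# Hypothetical labels and predicted probabilities P(Y=1 | x_i).
y = [1, 0, 1]
confident_and_right = [0.9, 0.1, 0.8]
confident_and_wrong = [0.1, 0.9, 0.2]
print(log_loss(y, confident_and_right), log_loss(y, confident_and_wrong))
```

Confident predictions on the wrong side are penalized far more heavily than confident correct ones, which is what makes this loss suitable for probability estimates.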


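Good news: the negative log-likelihood $-l(\theta)$ is convex in $\theta$ (equivalently, $l(\theta)$ is concave), so any local optimum is a global one.


Bad news: there is no closed-form solution for the $\theta$ that maximizes $l(\theta)$.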

## Optimization

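Writing $\theta^T x_i$ for the linear term, with the intercept $\theta_0$ absorbed into $\theta$ by adding a constant feature equal to 1:

$$ l(\theta) = \sum_{i=1}^n \left[{y_i} \ln{h_{\theta}(x_i)}+{(1-{y_i})}\ln{(1-h_{\theta}(x_i))}\right] $$

$$ = \sum_{i=1}^n \left[y_i\left(\ln{\frac{1}{1+e^{-\theta^T x_i}}} - \ln{\frac{e^{-\theta^T x_i}}{1+e^{-\theta^T x_i}}}\right) + \ln{\frac{e^{-\theta^T x_i}}{1+e^{-\theta^T x_i}}}\right] $$

$$ = \sum_{i=1}^n \left[y_i\,\theta^T x_i - \ln{\left(1+e^{\theta^T x_i}\right)}\right] $$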
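As a worked step not spelled out in the notes, differentiating $l(\theta)$ term by term, using the fact that the sigmoid $h$ satisfies $h' = h(1-h)$, gives the gradient. Here $x_{ij}$ denotes the $j$-th feature of example $i$, with $x_{i0} = 1$ assumed for the intercept:

$$\frac{\partial l(\theta)}{\partial \theta_j} = \sum_{i=1}^n \left(y_i - h_{\theta}(x_i)\right)x_{ij}$$

The matrix of second derivatives is

$$\frac{\partial^2 l(\theta)}{\partial \theta_j \partial \theta_k} = -\sum_{i=1}^n h_{\theta}(x_i)\left(1-h_{\theta}(x_i)\right)x_{ij}x_{ik}$$

which is negative semidefinite, consistent with $l(\theta)$ being concave.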


Setting the first derivative to zero gives equations that are nonlinear in $\theta$, so the optimum cannot be found analytically and must be computed iteratively.

Common iterative methods:

- Newton's Method
- Gradient Descent
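A minimal sketch of the gradient descent option in pure Python follows. It minimizes the log loss divided by the number of examples, which has the same minimizer as the summed form. It uses the gradient $\sum_i (h_{\theta}(x_i) - y_i)\,x_{ij}$ of the negative log-likelihood. The function names, learning rate, iteration count, and toy data are all assumptions made for illustration. Newton's Method would additionally use the second derivatives and is not shown here.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_proba(theta, x):
    """P(Y=1 | x); theta[0] is the intercept theta_0."""
    return sigmoid(theta[0] + sum(t * v for t, v in zip(theta[1:], x)))

def fit_gradient_descent(X, y, lr=0.1, n_iter=5000):
    """Minimize the mean log loss by batch gradient descent, starting from theta = 0."""
    theta = [0.0] * (len(X[0]) + 1)
    for _ in range(n_iter):
        grad = [0.0] * len(theta)
        for xi, yi in zip(X, y):
            err = predict_proba(theta, xi) - yi  # h_theta(x_i) - y_i
            grad[0] += err                       # intercept: constant feature 1
            for j, v in enumerate(xi, start=1):
                grad[j] += err * v
        # Step against the averaged gradient of the negative log-likelihood.
        theta = [t - lr * g / len(X) for t, g in zip(theta, grad)]
    return theta

# Hypothetical, non-separable toy data with a single feature.
X = [[0.5], [1.0], [1.5], [2.0], [2.5], [3.0]]
y = [0, 0, 1, 0, 1, 1]
theta = fit_gradient_descent(X, y)
print(theta, predict_proba(theta, [0.5]), predict_proba(theta, [3.0]))
```

The toy data is deliberately not linearly separable. On separable data the maximum likelihood estimate does not exist (the coefficients grow without bound), so a fixed iteration count or a regularization term would be needed.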