
# Logistic Regression

Linear regression fits a model by minimizing the RSS (Residual Sum of Squares), but it does not suit every problem. For example, if the dependent variable ($y$) is non-numerical (e.g. a binary category), a fitted line can produce values outside the range $[0, 1]$ that cannot be interpreted as class membership. So, we redefine the problem as estimating the probability of a binary class.

----

### Summary
- The output of the logistic (sigmoid) function always lies in the range $ 0 < f(x) < 1 $.
- Its output can therefore be interpreted as a probability.
- So, we can use it for binary classification.

$$ f(x) = \frac{1}{1+e^{-x}} $$
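
The function above can be sketched in a few lines of plain Python. This is a minimal sketch, not a reference implementation: the name `sigmoid` and the two-branch form (used here to avoid overflow in `math.exp` for large negative inputs) are choices made for illustration.

```python
import math

def sigmoid(x: float) -> float:
    """Logistic (sigmoid) function f(x) = 1 / (1 + e^(-x))."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    # For negative x, use the equivalent form e^x / (1 + e^x)
    # so that math.exp never receives a large positive argument.
    z = math.exp(x)
    return z / (1.0 + z)

for x in (-5.0, 0.0, 5.0):
    print(x, sigmoid(x))
```

The output grows with $x$ and is symmetric around $f(0) = 0.5$, so $f(x) + f(-x) = 1$.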


----
### Logistic Regression for Binary Classification

- Now we can model $ P(y=1|X) $ when $ y \in \{0, 1\} $.
- The following equation is the logistic regression model.

$$ P(y=1|X) = \frac{1}{1+e^{-(b_0 + WX)}} $$
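
As a rough illustration, the model can be evaluated for a single sample as follows. The function name `predict_proba` and the numeric values of `w`, `b0` and `x` are hypothetical and chosen only to make the example concrete.

```python
import math

def predict_proba(x, w, b0):
    """Return P(y=1 | x) for one sample x, weights w and intercept b0."""
    z = b0 + sum(w_j * x_j for w_j, x_j in zip(w, x))  # linear part b0 + W.x
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical parameters and sample, chosen only for illustration.
w = [0.8, -0.4]
b0 = -0.2
x = [1.5, 2.0]
print(predict_proba(x, w, b0))
```

Here the linear part is $-0.2 + 0.8 \cdot 1.5 - 0.4 \cdot 2.0 = 0.2$, so the predicted probability is slightly above 0.5.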

----
### How to Derive
- It is derived from the concept of $odds$.
- The equation below shows that as $P(A)$ increases, the odds also increase.

$$ odds = \frac{P(A)}{P(A^c)} = \frac{P(A)}{1-P(A)}$$
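
A small numeric check makes the behaviour of the odds visible. This is only a sketch; the function name `odds` and the sample probabilities are illustrative.

```python
def odds(p: float) -> float:
    """Odds of an event with probability p, for 0 <= p < 1."""
    return p / (1.0 - p)

for p in (0.1, 0.5, 0.8, 0.99):
    print(p, odds(p))
```

The odds equal 1 at $P(A) = 0.5$ and grow without bound as $P(A)$ approaches 1.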

- Suppose we model the probability of a specific class directly as a linear function, as in the equation below. This does not work: the range of the left-hand side is $ 0 $ to $ 1 $, but the range of the right-hand side is $ -\infty $ to $ \infty $.

$$ P(y=1|X) = b_0 + WX $$

- Let's turn this into an odds problem.

$$ \frac{P(y=1|X)}{1 - P(y=1|X)} = b_0 + WX $$

- But the range of the left-hand side is still $ 0 $ to $ \infty $. So, we take the $log$.

$$ ln(\frac{P(y=1|X)}{1 - P(y=1|X)}) = b_0 + WX $$

- Now, the range of both sides is $ -\infty $ to $ \infty $.
- The equations below solve for $ P(y=1|X) $ step by step.

$$ \frac{P(y=1|X)}{1 - P(y=1|X)} = e^{b_0 + WX} $$

$$ P(y=1|X) = e^{b_0 + WX} (1 - P(y=1|X)) $$

$$ P(y=1|X) = \frac{e^{b_0 + WX}}{1+e^{b_0 + WX}} $$

$$ \therefore P(y=1|X) = \frac{1}{1+e^{-(b_0 + WX)}} $$
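
The derivation says that the log-odds (logit) and the logistic function are inverses of each other. A quick numeric check, with illustrative helper names `logit` and `sigmoid` and arbitrary test values, is sketched below.

```python
import math

def sigmoid(z: float) -> float:
    """Map a linear score z = b0 + WX to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(p: float) -> float:
    """Map a probability back to the log-odds ln(p / (1 - p))."""
    return math.log(p / (1.0 - p))

# Arbitrary linear scores, chosen only for illustration.
for z in (-3.0, -0.5, 0.0, 2.0):
    p = sigmoid(z)
    print(z, p, logit(p))  # the last column recovers z up to rounding
```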

----
### Decision Boundary

Generally, the decision threshold for binary classification is a probability of 0.5, and logistic regression follows the same rule. Let's derive the decision boundary.

- Set up the inequality below to find the decision boundary (the criterion for classifying a sample as 1).

$$ P(y=1|X) > P(y=0|X) $$

- The steps below solve the inequality line by line.

> - $$ P(y=1|X) > 1 - P(y=1|X) $$
> - $$ P(y=1|X) > 0.5 $$
> - $$ \therefore P(y=1|X) = \frac{1}{1+e^{-(b_0 + WX)}} > 0.5 $$

or

> - $$ P(y=1|X) > 1 - P(y=1|X) $$
> - $$ \frac{P(y=1|X)}{1-P(y=1|X)} > 1$$
> - $$ ln(\frac{P(y=1|X)}{1-P(y=1|X)}) > 0$$
> - $$ \therefore b_0 + WX > 0 $$
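
Both derivations lead to the same rule, so a classifier can threshold either the probability or the linear score. The sketch below compares the two; the function names, parameters and samples are hypothetical and only serve as an illustration.

```python
import math

def linear_score(x, w, b0):
    """Compute b0 + W.x for one sample."""
    return b0 + sum(w_j * x_j for w_j, x_j in zip(w, x))

def classify_by_probability(x, w, b0):
    """Predict 1 when P(y=1 | x) > 0.5."""
    p = 1.0 / (1.0 + math.exp(-linear_score(x, w, b0)))
    return int(p > 0.5)

def classify_by_score(x, w, b0):
    """Predict 1 when b0 + W.x > 0, without evaluating the sigmoid."""
    return int(linear_score(x, w, b0) > 0)

# Hypothetical parameters and samples, chosen only for illustration.
w, b0 = [1.0, -2.0], 0.5
samples = [[3.0, 1.0], [0.0, 1.0], [1.0, 2.0], [2.0, 0.5]]
for x in samples:
    print(x, classify_by_probability(x, w, b0), classify_by_score(x, w, b0))
```

Thresholding the score directly is cheaper and shows that the boundary $ b_0 + WX = 0 $ is linear in $X$.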

----
### Logistic Regression with MLE

Now we need a cost function for gradient descent. Because the logistic model is non-linear in its parameters, there is no closed-form solution like the one for linear regression, so we estimate the parameters by maximum likelihood estimation (MLE). The likelihood is the joint probability of the observed data points under a particular probability distribution, viewed as a function of the parameters. In this case, since $y$ is binary, the appropriate distribution is the Bernoulli distribution (a probability mass function) rather than the normal distribution.

$$ L(parameters | data) = \prod_{i=1}^{m} f(data_i | parameters) $$

- Bernoulli trial: an experiment (event) whose outcome is binary.
- For example, the probabilities of the two outcomes are given below.

$$ P(y=1) = p, P(y=0) = 1-p $$

$$ P(Y=y_i) = p^{y_{i}} (1-p)^{1-y_i}, \quad y_i \in \{0, 1\} $$

- The Bernoulli distribution is therefore written as follows.

$$ Bern(y; p) = p^{y} (1-p)^{1-y} $$
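
To make the Bernoulli formula and the likelihood product concrete, here is a minimal sketch. The names `bernoulli_pmf` and `likelihood`, the outcome list `ys`, and the assumption of a single shared $p$ for all outcomes are illustrative.

```python
def bernoulli_pmf(y: int, p: float) -> float:
    """Bern(y; p) = p^y * (1 - p)^(1 - y) for y in {0, 1}."""
    return (p ** y) * ((1.0 - p) ** (1 - y))

def likelihood(ys, p):
    """Joint probability of independent binary outcomes with a shared p."""
    result = 1.0
    for y in ys:
        result *= bernoulli_pmf(y, p)
    return result

ys = [1, 0, 1, 1]  # hypothetical outcomes
for p in (0.25, 0.5, 0.75):
    print(p, likelihood(ys, p))
```

Of the three candidates, $p = 0.75$ gives the largest likelihood, which matches the fraction of ones in the data.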

- Now we can put these equations together. (Note: $ f(x)=logistic(x) $, $x_i$ is the $i$-th sample, and $ p_i = P(y_i=1|x_i) $.)

$$ L = \prod_i p_i^{y_{i}} (1-p_i)^{1-y_i}$$

$$ L = \prod_i f(b_0 + Wx_i)^{y_{i}} (1-f(b_0 + Wx_i))^{1-y_i}$$

$$ ln(L) = \sum_i y_i ln(f(b_0 + Wx_i)) + \sum_i (1-y_i)ln(1-f(b_0 + Wx_i))$$

- So, the log-likelihood of the logistic regression model is defined. Maximizing it is equivalent to minimizing its negative, so the negative log-likelihood is the cost function of logistic regression.
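
The log-likelihood can be evaluated numerically as sketched below. The function returns its negative averaged over the samples (often called the cross-entropy loss), which is the quantity to minimize. The function names, the toy data and the clipping constant `eps` are assumptions made for illustration.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def negative_log_likelihood(X, y, w, b0):
    """Average negative log-likelihood of labels y given samples X."""
    eps = 1e-12  # assumed clipping constant, keeps log() away from 0
    total = 0.0
    for x_i, y_i in zip(X, y):
        p = sigmoid(b0 + sum(w_j * x_j for w_j, x_j in zip(w, x_i)))
        p = min(max(p, eps), 1.0 - eps)
        total += y_i * math.log(p) + (1 - y_i) * math.log(1.0 - p)
    return -total / len(y)

# Hypothetical one-feature data set, chosen only for illustration.
X = [[-2.0], [-1.0], [1.0], [2.0]]
y = [0, 0, 1, 1]
print(negative_log_likelihood(X, y, [0.0], 0.0))  # all-zero parameters
print(negative_log_likelihood(X, y, [1.0], 0.0))  # a better-fitting weight
```

With all parameters at zero every prediction is 0.5, so the cost equals $ln(2)$; a weight that separates the classes gives a lower cost.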

----
### Gradient Descent

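- After a lengthy derivation [(link)](https://stats.stackexchange.com/questions/278771/how-is-the-cost-function-from-logistic-regression-derivated), we obtain the gradient of the cost function, the negative log-likelihood averaged over the $m$ samples, $ J = -\frac{1}{m}ln(L) $. Here $x_{ij}$ is the $j$-th feature of the $i$-th sample and $\alpha$ is the learning rate.

$$ \frac{\partial J}{\partial w_j} = \frac{1}{m} \sum_{i=1}^{m}(f(b_0 + Wx_i) - y_i)x_{ij} $$

$$ w_j = w_j - \frac{\alpha}{m} \sum_{i=1}^{m}(f(b_0 + Wx_i) - y_i)x_{ij} $$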
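
The update rule can be sketched as batch gradient descent in plain Python. This is a minimal sketch under stated assumptions: the function name `fit_logistic`, the toy data, the learning rate and the iteration count are all illustrative, and the intercept $b_0$ is updated with the same rule by treating its feature as the constant 1.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, alpha=0.5, n_iters=5000):
    """Fit weights w and intercept b0 by batch gradient descent."""
    m, n = len(X), len(X[0])
    w, b0 = [0.0] * n, 0.0
    for _ in range(n_iters):
        # Prediction error f(b0 + W.x_i) - y_i for every sample.
        errors = [
            sigmoid(b0 + sum(w_j * x_j for w_j, x_j in zip(w, x_i))) - y_i
            for x_i, y_i in zip(X, y)
        ]
        # Gradient of the averaged negative log-likelihood.
        grad_w = [sum(e * x_i[j] for e, x_i in zip(errors, X)) / m for j in range(n)]
        grad_b0 = sum(errors) / m
        # Simultaneous update of all parameters.
        w = [w_j - alpha * g for w_j, g in zip(w, grad_w)]
        b0 -= alpha * grad_b0
    return w, b0

# Hypothetical one-feature data with overlapping classes.
X = [[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0]]
y = [0, 0, 0, 1, 0, 1, 1, 1]
w, b0 = fit_logistic(X, y)
print(w, b0, -b0 / w[0])  # the last value is the decision boundary b0 + w*x = 0
```

The toy classes overlap on purpose: with perfectly separable data the likelihood has no finite maximum and the weights would keep growing.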

----
#### References
- https://machinelearningmastery.com/logistic-regression-for-machine-learning/
- https://ratsgo.github.io/machine%20learning/2017/04/02/logistic/
- https://stats.stackexchange.com/questions/278771/how-is-the-cost-function-from-logistic-regression-derivated