# Logistic Regression

## Basic Logistic Regression
Logistic regression models the relationship between independent variables and a categorical dependent variable. Despite the name, it is used for classification rather than regression, so "logistic classification" would be a more accurate label. The output of a logistic model is the probability that a data point belongs to a class.

Classification: Predicting the probability that a data point belongs to a specific class (label).  
Logit: shortened name for logistic model.

### Assumptions of Logistic Regression
- Logit does not make many of the key assumptions of linear models fit by ordinary least squares.
    - It does not require a linear relationship between the dependent and independent variables.
    - Error terms (residuals) do not need to be normally distributed.
    - Homoscedasticity is not required.
    - The dependent variable is not measured on an interval or ratio scale (that is, it is not continuous).
- What assumptions do apply?
    - The dependent variable must follow a binomial distribution (for a binary outcome, each observation is either 0 or 1).
    - Observations must be independent of each other. In other words, they should not come from repeated measurements or matched data.
    - Little or no multicollinearity among the independent variables. This means that the independent variables should not be too highly correlated with each other.
    - The log odds must be a linear function of the independent variables.

### The Sigmoid Function
A sigmoid curve is fit to the data, and a decision boundary on that curve separates the classes.
$$
    \text{sigmoid}(\eta) = \frac{1}{1+\exp(-\eta)}
$$

Where,  
The output is bounded, $0 < \text{sigmoid}(\eta) < 1$, so it can be read as a probability.  
This is unlike linear regression, where predictions can range over $\pm\infty$.
- as $\eta \to -\infty$, $\text{sigmoid}(\eta) \to 0$
- as $\eta \to +\infty$, $\text{sigmoid}(\eta) \to 1$
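
The limiting behaviour of the sigmoid can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the function name `sigmoid` and the overflow-avoiding rearrangement are illustrative choices, not something these notes prescribe.

```python
import numpy as np

def sigmoid(eta):
    """Map any real number (or array) into the interval between 0 and 1."""
    eta = np.asarray(eta, dtype=float)
    # Only ever exponentiate a non-positive number so exp() cannot overflow.
    exp_neg = np.exp(-np.abs(eta))
    # eta >= 0: 1 / (1 + e^-eta);  eta < 0: e^eta / (1 + e^eta) (same value).
    return np.where(eta >= 0, 1.0 / (1.0 + exp_neg), exp_neg / (1.0 + exp_neg))

print(sigmoid(0.0))                      # 0.5, the midpoint of the curve
print(sigmoid(np.array([-50.0, 50.0])))  # very close to 0 and very close to 1
```

Both branches compute the same function; the split only keeps the intermediate values finite for large inputs.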

![](./img/logit_sigmoid.png)

### Log Odds
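The model passes the linear predictor $\beta X$ through the sigmoid to get a probability:
$$
    p(x) = \frac{1}{1+e^{-(\beta X)}}
$$

Solving for $\beta X$ shows that it equals the log of the odds, which is why the model is linear in the log odds:
$$
    \beta X = \log\left(\frac{p(x)}{1-p(x)}\right)
$$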

Where,  
$p(x)$: probability that an observation belongs to a class.  
$\beta X$: input to the function (the linear predictor, e.g. $mx+b$); it appears negated in the exponent.  
$e$: base of natural log.
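
To see that the two equations above are inverses of each other, here is a small sketch using only the standard library. The helper names `sigmoid` and `log_odds` and the example probability 0.8 are assumptions made for illustration.

```python
import math

def sigmoid(z):
    """Probability for a given linear predictor z (the beta * X term)."""
    return 1.0 / (1.0 + math.exp(-z))

def log_odds(p):
    """Log of the odds p / (1 - p) for a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))

p = 0.8
z = log_odds(p)    # odds are 0.8 / 0.2 = 4, so z = log(4)
print(z)           # about 1.386
print(sigmoid(z))  # recovers 0.8 up to floating-point rounding
```

A probability of 0.5 corresponds to odds of 1 and log odds of 0, which is where the linear predictor changes sign.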

### Steps in Logistic Regression
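1. Set a threshold value to distinguish between the two classes, for example:
    - $p < 0.5$: class 0
    - $p > 0.5$: class 1
    - In practice the best threshold depends on the problem, for example on the relative cost of false positives and false negatives.
1. Fit a regression line (the linear predictor) that best distinguishes between the two classes:
    - Examine the distance between the line and each data point.
1. Plug the regression formula into the logistic model (the sigmoid function).
1. Get an output prediction for each data point.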
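
The steps above can be sketched end to end as follows. This is only an illustration: the intercept and slope are hypothetical, hand-picked values rather than coefficients fitted to data, and assigning a probability of exactly 0.5 to class 1 is an assumption, since the threshold rule above leaves ties unspecified.

```python
import numpy as np

# Hypothetical, hand-picked coefficients (NOT fitted to any data).
intercept = -4.0  # b in m*x + b
slope = 2.0       # m in m*x + b

x = np.array([0.5, 1.5, 2.0, 2.5, 3.5])

linear = intercept + slope * x        # regression line evaluated at each point
prob = 1.0 / (1.0 + np.exp(-linear))  # plug the line into the sigmoid
label = (prob >= 0.5).astype(int)     # apply the 0.5 threshold (ties -> class 1)

for xi, pi, li in zip(x, prob, label):
    print(f"x={xi:.1f}  p={pi:.3f}  class={li}")
```

The point at `x = 2.0` sits exactly on the decision boundary, where the linear predictor is 0 and the predicted probability is 0.5.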

### Calculating Probability with The Sigmoid Function
What is the probability that Y belongs to a particular class?
$$
    P(Y=1|X)
$$

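Treat the class as determined by the value $\beta X + \epsilon$, where $\epsilon$ is the error (the distance between a point and the line). For the points near the line, what is the probability that this value is greater than 0?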
$$
    P(\beta X + \epsilon > 0|X) = P(\epsilon > -(\beta X)|X)
$$

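The error terms are assumed to follow a standard logistic distribution, which is symmetric about 0, so $P(\epsilon > -(\beta X)|X) = P(\epsilon < \beta X|X)$. That is the CDF of the logistic distribution evaluated at $\beta X$, and this CDF is the sigmoid function.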
$$
    P(Y=1|X) = \frac{1}{1+e^{-(\beta X)}}
$$
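
The link between logistic errors and the sigmoid can be checked by simulation. The sketch below assumes NumPy; the seed, the sample size, and the three example values of the linear predictor are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Draw error terms from a standard logistic distribution (location 0, scale 1).
eps = rng.logistic(loc=0.0, scale=1.0, size=200_000)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# For a few hypothetical values z of the linear predictor, compare the
# simulated share of draws with z + eps > 0 against sigmoid(z).
for z in (-2.0, 0.0, 1.5):
    simulated = np.mean(z + eps > 0)
    print(f"z={z:+.1f}  simulated={simulated:.3f}  sigmoid={sigmoid(z):.3f}")
```

The simulated fractions should agree with the sigmoid values up to sampling noise, which shrinks as the sample size grows.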

![](./img/logit_logodds.png)

## Cost Function: Cross Entropy
Cross entropy measures the difference between two probability distributions over a given set of events. Here the events are the two classes:
- $y=0$
- $y=1$

$$
    J(\theta) = \frac{1}{m}\sum_{i=1}^{m}Cost(h_{\theta}(x^{(i)}), y^{(i)})
$$

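Where $m$ is the number of observations and $h_{\theta}(x)$ is the model's predicted probability that $y=1$. The cost of a single observation is:
$$
    Cost(h_{\theta}(x),y) =
    \begin{cases}
    -\log(h_{\theta}(x)) &\text{if }y=1 \\
    -\log(1-h_{\theta}(x)) &\text{if }y=0
    \end{cases}
$$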

This graph shows log loss as a function of the predicted probability: a wrong answer given with more confidence (a higher predicted probability for the wrong class) gets a greater penalty.

![](./img/cross_entropy.png)
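
The cost function above can be sketched in a few lines. Because $y$ is either 0 or 1, the two cases can be written as the single expression $-[y\log(h) + (1-y)\log(1-h)]$, which is what this example uses. The function name `cross_entropy`, the clipping constant `eps`, and the example labels and probabilities are assumptions made for illustration.

```python
import numpy as np

def cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean log loss over m observations with labels in {0, 1}."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip predictions away from exactly 0 and 1 so log() stays finite
    # (eps is an illustrative choice, not part of the formula).
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1.0 - eps)
    per_example = -(y_true * np.log(y_prob)
                    + (1.0 - y_true) * np.log(1.0 - y_prob))
    return per_example.mean()

y = [1, 1, 0, 0]
print(cross_entropy(y, [0.9, 0.8, 0.2, 0.1]))  # confident and right: about 0.164
print(cross_entropy(y, [0.1, 0.2, 0.8, 0.9]))  # confident and wrong: about 1.956
```

The second call reverses every prediction, so each observation is penalised by the log of a small probability, and the average cost is roughly twelve times larger.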