## Logistic Regression

- Instead of predicting exactly 0 or 1, **logistic regression** generates a probability: a value between 0 and 1, exclusive.
- Consider a logistic regression model for spam detection.
  - If the model infers a value of 0.932 on a particular email message, it implies a 93.2% probability that the message is spam.
    - More precisely, it means that in the limit of infinite training examples, the set of examples for which the model predicts 0.932 will actually be spam 93.2% of the time, and the remaining 6.8% will not.
    
- **Probability estimates** are calibrated
  - for example, $P(\text{house will sell}) \times \text{price}$ = expected outcome
- Regularization is extremely important for logistic regression
  - **log loss** only approaches 0 asymptotically; it never reaches it
  - without regularization, the model will keep trying to drive loss to 0 in high dimensions
- Two strategies are especially useful:
  - $L_{2}$ **regularization** (aka $L_{2}$ weight decay) - penalizes huge weights
  - **Early stopping** - limiting training steps or learning rate

- Linear logistic regression is extremely efficient.
  - very fast training and prediction times
  - Short/wide models use a lot of RAM
  - if we need non-linearities, we can get them by adding in **feature cross products**

### Logistic Regression: Calculating a Probability 

Many problems require a probability estimate as output. Logistic regression is an extremely efficient mechanism for calculating probabilities. Practically speaking, you can use the returned probability in either of the following two ways:
  - "As is"
  - Converted to a binary category
  
Let's consider how we might use the probability "as is." Suppose we create a logistic regression model to predict the probability that a dog will bark during the middle of the night. We'll call that probability:

$P(bark\mid night)$

If the logistic regression model predicts a $P(bark\mid night)$ of 0.05, then over a year, the dog's owners should be startled awake approximately 18 times:

$$\text{startled} = P(bark\mid night) \cdot \text{nights}$$

$$18 \approx 0.05 \cdot 365$$

In many cases, you'll map the logistic regression output into the solution to a binary classification problem, in which the goal is to correctly predict one of two possible labels (e.g., "spam" or "not spam").
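
The two ways of using the probability can be sketched in a few lines of Python. This is only an illustration: the helper `to_label` and the 0.5 decision threshold are assumptions introduced here, not something the text above prescribes.

```python
# Probability taken from the barking example above.
p_bark_per_night = 0.05
nights = 365

# "As is": treat the probability as an expected rate over many nights.
expected_startles = p_bark_per_night * nights

# "Converted to a binary category": compare against a decision threshold.
# The default threshold of 0.5 is an assumption for illustration only.
def to_label(probability, threshold=0.5):
    return 1 if probability >= threshold else 0

print(round(expected_startles, 2))   # roughly 18 nights per year
print(to_label(p_bark_per_night))    # 0 -> "no bark"
print(to_label(0.932))               # 1 -> e.g. "spam" in the earlier example
```

In practice the threshold is a tunable choice that depends on the cost of each kind of mistake.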

You might be wondering how a logistic regression model can ensure output that always falls between 0 and 1. As it happens, a sigmoid function, defined as follows, produces output having those same characteristics:

$$y = \frac{1}{1 + e^{-z}}$$

The sigmoid function yields the following plot:

<img src="google_ml_img\sigmoidfunction.png">


If z represents the output of the linear layer of a model trained with logistic regression, then sigmoid(z) will yield a value (a probability) between 0 and 1. In mathematical terms:

$$y' = \frac{1}{1 + e^{-(z)}}$$

where:

- $y'$ is the output of the logistic regression model for a particular example.
- $z = b + w_1x_1 + w_2x_2 + \ldots + w_Nx_N$
  - The $w$ values are the model's learned weights, and $b$ is the bias.
  - The $x$ values are the feature values for a particular example.
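
As a minimal sketch of the definitions above, the following Python computes $z$ and then $y'$ for a single example. The bias, weights, and feature values are made up for illustration, and `predict_probability` is a hypothetical helper name.

```python
import math

def sigmoid(z):
    """Logistic function, written to avoid overflow for large negative z."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_probability(bias, weights, features):
    """Compute z = b + w1*x1 + ... + wN*xN, then squash it with the sigmoid."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

# Hypothetical model parameters and one example's feature values.
bias = 1.0
weights = [2.0, -1.0, 5.0]
features = [0.0, 10.0, 2.0]

# z = 1 + (2)(0) + (-1)(10) + (5)(2) = 1
y_prime = predict_probability(bias, weights, features)
print(round(y_prime, 3))  # sigmoid(1), about 0.731
```

Note that in floating point the sigmoid can round to exactly 0.0 or 1.0 for very large $|z|$, even though the mathematical function never reaches those values.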

Note that z is also referred to as the log-odds because the inverse of the sigmoid states that z can be defined as the log of the probability of the "1" label (e.g., "dog barks") divided by the probability of the "0" label (e.g., "dog doesn't bark"):

$$z = \log\left(\frac{y}{1-y}\right)$$
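
A quick numerical check of this inverse relationship, as a sketch: applying the log-odds formula to a sigmoid output should recover the original $z$. The function name `log_odds` is chosen here for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_odds(y):
    """Inverse of the sigmoid; y must be strictly between 0 and 1."""
    return math.log(y / (1.0 - y))

z = 1.0
y = sigmoid(z)
recovered = log_odds(y)
print(y, recovered)  # recovered matches z up to floating-point error

# A probability of 0.5 means even odds, so the log-odds are 0.
print(log_odds(0.5))
```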

**Sigmoid function with ML labels**

<img src="google_ml_img\LogisticRegressionOutput.svg">

### Loss function for Logistic Regression

The loss function for linear regression is squared loss. The loss function for logistic regression is **Log Loss**, which is defined as follows:

$$\text{Log Loss} = \sum_{(x,y)\in D} -y\log(y') - (1 - y)\log(1 - y')$$


where:

- $D$ is the data set containing many labeled examples, which are $(x,y)$ pairs.
- $y$ is the label in a labeled example. Since this is logistic regression, every value of $y$ must be either 0 or 1.
- $y'$ is the predicted value (somewhere between 0 and 1), given the set of features in $x$.
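
The formula translates directly into code. The sketch below sums the per-example loss exactly as written above; the `eps` clipping is an added assumption to keep `log(0)` from occurring when a prediction is exactly 0 or 1, and the label and prediction lists are invented for illustration.

```python
import math

def log_loss(labels, predictions, eps=1e-15):
    """Sum of -y*log(y') - (1 - y)*log(1 - y') over all examples."""
    total = 0.0
    for y, y_prime in zip(labels, predictions):
        # Clip predictions away from 0 and 1 so the logarithm stays finite.
        p = min(max(y_prime, eps), 1.0 - eps)
        total += -y * math.log(p) - (1 - y) * math.log(1.0 - p)
    return total

# Hypothetical labels and two sets of predictions.
labels = [1, 0, 1]
mostly_right = [0.9, 0.1, 0.8]
mostly_wrong = [0.1, 0.9, 0.2]

print(log_loss(labels, mostly_right))  # small loss
print(log_loss(labels, mostly_wrong))  # much larger loss
```

Confident wrong predictions are penalized far more heavily than confident correct ones are rewarded, which is the asymmetry that makes Log Loss suitable for probability outputs.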

The equation for Log Loss is closely related to Shannon's entropy measure from information theory. It is also the negative logarithm of the likelihood function, assuming a Bernoulli distribution of $y$. Indeed, minimizing the loss function yields a maximum likelihood estimate.
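
The maximum likelihood connection can be written out in a few steps. As a sketch, assume the examples in $D$ are independent and that each label $y$ follows a Bernoulli distribution with parameter $y'$. The probability of one labeled example is then:

$$P(y \mid x) = (y')^{y}\,(1 - y')^{1 - y}$$

Taking the negative logarithm of the product over all examples gives back Log Loss:

$$-\log \prod_{(x,y)\in D} (y')^{y}\,(1 - y')^{1 - y} = \sum_{(x,y)\in D} -y\log(y') - (1 - y)\log(1 - y')$$

Because the logarithm is monotonic, the weights that minimize Log Loss are the same weights that maximize the likelihood.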


#### Regularization in Logistic Regression

Regularization is extremely important in logistic regression modeling. Without regularization, the asymptotic nature of logistic regression would keep driving loss towards 0 in high dimensions. Consequently, most logistic regression models use one of the following two strategies to dampen model complexity:
  - L2 regularization.
  - Early stopping, that is, limiting the number of training steps or the learning rate.

A third strategy, L1 regularization, is also available.

Imagine that you assign a unique id to each example, and map each id to its own feature. If you don't specify a regularization function, the model will become completely overfit. That's because the model would try to drive loss to zero on all examples and never get there, driving the weights for each indicator feature to +infinity or -infinity. This can happen in high-dimensional data with feature crosses, when there’s a huge mass of rare crosses that happen only on one example each.

Fortunately, using L2 regularization or early stopping will prevent this problem.
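
The unique-id scenario can be simulated with a small sketch. Here each example has its own one-hot indicator feature, so $z$ for example $i$ is just its own weight. The learning rate, the L2 strength `l2_lambda`, the step counts, and the `train` helper are all assumptions picked for illustration.

```python
import math

def sigmoid(z):
    """Logistic function, written to avoid overflow for large negative z."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def train(labels, steps, learning_rate=0.5, l2_lambda=0.0):
    """Gradient descent where example i owns indicator feature i.

    The log loss gradient for weight i is (sigmoid(w_i) - y_i); the L2
    penalty (l2_lambda / 2) * w_i**2 adds l2_lambda * w_i to it.
    """
    weights = [0.0] * len(labels)
    for _ in range(steps):
        for i, y in enumerate(labels):
            grad = (sigmoid(weights[i]) - y) + l2_lambda * weights[i]
            weights[i] -= learning_rate * grad
    return weights

labels = [1, 0, 1, 0]
for steps in (100, 1_000, 10_000):
    plain = train(labels, steps)
    with_l2 = train(labels, steps, l2_lambda=0.1)
    print(
        steps,
        round(max(abs(w) for w in plain), 2),    # keeps growing with more steps
        round(max(abs(w) for w in with_l2), 2),  # settles at a finite value
    )
```

Without the penalty, the largest weight magnitude keeps increasing as training continues; with the L2 term it levels off. The first column also hints at why early stopping helps: stopping at fewer steps leaves the unregularized weights smaller.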
