How to memorize this? How to present it in a lecture?

1. Loss functions help us target the observed value of the parameter p
2. Parametrize p through x => the loss changes, but not by much - we get logistic regression
3. A closer look at how the loss is optimized for logistic regression
4. Generalization to an arbitrary distribution of the Response

## Logistic Loss

Logistic Loss = Negative Loglikelihood = Binary cross-entropy

$$-\log{L(p)} = -\sum_{i=1}^{N} \bigg[y_i \cdot \log{p_i} + (1-y_i) \cdot \log{(1-p_i)}\bigg]$$

#### Derivation
Suppose we have a set of binary variables $\overline{y} = \{y_1, y_2, \dots, y_N\}$, where each $y_i \sim B(p)$ is a Bernoulli random variable with parameter $p$

*The most typical example of a Bernoulli trial is a fair coin flip (p=1/2)

In most problems we do not know the parameter $p$, but we can infer it. This is typically done by maximizing the likelihood of the observed data

Recall the likelihood function $l(\theta)$ = the probability of the data given the parameter $\theta$: $P(x | \theta)$
<img src="img/logistic_mle1.png" width=250>

The maximum of the likelihood function corresponds to the most probable set of parameters $\theta_{ML}$

---

The likelihood of one Bernoulli trial is
$$L(p) = \begin{cases}
    p,& \text{if y = 1}\\
    1-p,& \text{if y = 0}
\end{cases}
$$

Or in one line
$$L(p) = p^{y} \times (1-p)^{(1-y)}$$

The likelihood function of one Bernoulli trial can be illustrated in 3-D as a surface $L(x,p)$<br> Here $x \in \{0,1\}$ is the observed outcome and $p \in [0,1]$
- Red line = P(0|p)
- Blue line = P(1|p)
<img src="img/logistic_mle2.png" width=200>

In this case the likelihood maximization is straightforward: just set p = y and it will be the most probable model

---

Now take the likelihood of a series of Bernoulli trials, say<br>[0,1,0,0,1,1,1,0,1,0,1 ...]<br>

Here a 3D illustration is harder, since it is not clear how to order the $y$ combinations along the $x$ axis. But we can slice at this particular combination $\overline{y}$=[0,1,0,0,1,1,1,0,1,0,1 ...], and intuitively the likelihood will be a bell-shaped curve in $p$ whose maximum is at the rate of "ones" $p=\frac{|y|}{N}$, where $|y| = \sum_i y_i$ is the number of "ones"

More formally

$$L(p) = \prod_{i=1}^{N} P(y_i|p) = \prod_{i=1}^{N} p^{y_i} \cdot (1-p)^{1-y_i} \rightarrow \max_{p}$$

Or, since the trials are independent, we can group the multiplications. Thus the probability of the series is:

$$L(p) = p^{|y|} \cdot (1-p)^{N-|y|} \rightarrow \max_{p}$$

*NOTE that if we multiply by $\binom N {|y|}$ we get the Binomial probability = the probability of any series with exactly $|y|$ "ones" in $N$ trials

It is easier to optimize sums => apply logarithm and simplify to make a Log-Likelihood

$$\log{L(p)} = \log{\bigg[\prod_{i=1}^{N} p^{y_i} \cdot (1-p)^{1-y_i}\bigg]} = \sum_{i=1}^{N} \bigg[ \log{p^{y_i}} + \log{(1-p)^{1-y_i}}\bigg] = \sum_{i=1}^{N} \bigg[y_i \cdot \log{p} + (1-y_i) \cdot \log{(1-p)}\bigg]$$

NOTE each factor is a probability in $(0,1]$, so its logarithm is non-positive and the log-likelihood is negative (or zero). Maximizing it means making it as close to zero as possible.

In Machine Learning it is more common to minimize a loss function than to maximize a probability, so we flip the sign and solve a minimization problem

$$V(y,p)= - \sum_{i=1}^{N} \bigg[y_i \cdot \log{p} + (1-y_i) \cdot \log{(1-p)}\bigg] \rightarrow \min_{p}$$

Hence,<br>
we showed that finding a proper model for a binary variable is equivalent to maximizing its loglikelihood OR minimizing the logistic loss
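
A small numeric sketch of the chain above, assuming the eleven listed values of the series as the whole sample and an arbitrary candidate `p = 0.3` (both are illustrative choices, not part of the derivation). It checks that the product form, the grouped form and the logistic loss describe the same quantity:

```python
import math

# The eleven listed values of the series (the "..." tail is ignored here).
y = [0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1]
p = 0.3  # arbitrary candidate parameter, chosen only for illustration


def likelihood(y, p):
    """Product of per-trial Bernoulli likelihoods p^y * (1-p)^(1-y)."""
    prod = 1.0
    for yi in y:
        prod *= p ** yi * (1 - p) ** (1 - yi)
    return prod


def log_loss(y, p):
    """Logistic loss = negative log-likelihood for a single parameter p."""
    return -sum(yi * math.log(p) + (1 - yi) * math.log(1 - p) for yi in y)


ones, n = sum(y), len(y)
grouped = p ** ones * (1 - p) ** (n - ones)  # p^|y| * (1-p)^(N-|y|)

print(likelihood(y, p), grouped)              # same value, two ways
print(-math.log(likelihood(y, p)), log_loss(y, p))
```

Minimizing `log_loss` over `p` and maximizing `likelihood` over `p` pick the same parameter, because the logarithm is monotone.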

## Maximum Likelihood

If $p$ is the only parameter of the model, we can solve for it analytically. Let's prove formally that the log-likelihood $L(p)$ attains its maximum at the sample mean of $y$

__Proof__

Take the loglikelihood
$$L(p) = \sum_{i=1}^{N} \bigg[y_i \cdot \log{p} + (1-y_i) \cdot \log{(1-p)}\bigg]$$

Get rid of the sums to make the expression more concise, where $|y| = \sum_i y_i$ is the number of "ones"

$$L(p) = |y| \cdot \log{(p)} + (N - |y|) \cdot \log{(1-p)}$$


Now let's write the first and second order conditions for the maximum. The first derivative must vanish

$$
\frac{\partial{L}}{\partial{p}} = \frac{1}{p} \cdot |y| - \frac{1}{1-p} \cdot (N - |y|) = 0
$$

Or if we rearrange
$$\frac{1-p}{p} = \frac{N - |y|}{|y|}$$

This is equivalent to
$$\frac{1}{p} - 1  = \frac{N}{|y|} - 1$$

Thus the necessary condition for the maximum is that $p$ equals the sample average of $y$

$$p  = \frac{|y|}{N} = \frac{1}{N}\sum_{i=1}^N y_i$$

To prove that it is a maximum (not a saddle point or a minimum) let's check the second order condition, which takes the form

$$\frac{\partial^2{L}}{\partial{p^2}} = -\frac{1}{p^2} \cdot \sum y_i - \frac{1}{(1-p)^2} \cdot \sum (1-y_i) < 0
$$

Both terms are non-positive and at least one of them is strictly negative for any $p \in (0,1)$, so this condition always holds
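
The two conditions can be checked numerically. This is a minimal sketch, assuming a sample with 6 "ones" out of 11 trials (an illustrative choice) and the derivatives $\frac{|y|}{p} - \frac{N-|y|}{1-p}$ and $-\frac{|y|}{p^2} - \frac{N-|y|}{(1-p)^2}$ of the grouped log-likelihood:

```python
import math


def log_likelihood(ones, n, p):
    """Grouped Bernoulli log-likelihood: |y| log p + (N - |y|) log(1 - p)."""
    return ones * math.log(p) + (n - ones) * math.log(1 - p)


def first_derivative(ones, n, p):
    return ones / p - (n - ones) / (1 - p)


def second_derivative(ones, n, p):
    return -ones / p ** 2 - (n - ones) / (1 - p) ** 2


ones, n = 6, 11          # assumed sample: 6 "ones" in 11 trials
p_hat = ones / n         # candidate maximum = sample mean

print(first_derivative(ones, n, p_hat))   # numerically zero
print(second_derivative(ones, n, p_hat))  # negative => local maximum
print(log_likelihood(ones, n, p_hat) > log_likelihood(ones, n, p_hat + 0.05))
```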




__NOTE__ Sort of similar logic is applied when training quantile regression

## Logistic Regression

In logistic regression we model $p$ as a sigmoid $\sigma$ applied to a linear combination of the features $x$

$$p = \sigma(z) = \frac{1}{1+e^{-z}} = \frac{1}{1+e^{-(w_0 + w_1 x_1 + w_2 x_2 + \dots + w_n x_n)}}$$

The dependency chain becomes a little more complicated
$X \rightarrow z \rightarrow P$

Task: find parameters $w$ that make the logloss between p and y minimal

<img src="img/logistic_intro.png" width=500>

Unlike the logloss and the linear part, which are both convex, the sigmoid is not convex


If we plug in the expression for $p=\sigma(z)$ with $z=f(w,x)$ and encode the classes as $y \in \{-1,1\}$, we can rewrite the Logistic Loss of a single observation as

$$L(y,z) = \log{(1+e^{-y \cdot z})}$$


## Logistic Regression Loss

Margin-based loss = symmetric variant of the loss function, where $z=X\beta$ for either class is plotted over the X axis

Take the logistic loss for a single observation:

$$L=\begin{cases}
-\log{(p)}, \text{ if } y=1 \\
-\log{(1-p)}, \text{ if } y=0
\end{cases}
$$

Plug in $p$ as a sigmoid over $z = \langle w,x \rangle$ to get:

$$L=\begin{cases}
-\log{(p)} = -\log{\big(\frac{1}{1+e^{-z}}\big)} = \log{\big(1+e^{-z}\big)}, \text{ if } y=1 \\
-\log{(1-p)} = -\log{\big(\frac{1}{1+e^{z}}\big)} = \log{\big(1+e^{z}\big)}, \text{ if } y=0
\end{cases}
$$

Observation = the two branches are mirror images of each other in the new variable $z$ (swap $z$ for $-z$)

Let's use $y \in \{-1,1\}$ as a class indicator. Thus we can rewrite in one line as $$L=\log{(1+e^{-y \cdot z})}$$

__Important__ the class tag $y$ acts only as a sign indicator => it selects which branch applies and does not scale the loss



Hence,<br>
we reformulate the logistic Regression task as:
$$w^{*} = \underset{w}{\mathrm{argmin}} \bigg( \sum_{i=1}^N \log{\big(1 + e^{-y_i \langle x_i, w \rangle}\big)} \bigg)$$
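
A quick check that the margin form agrees with the original $\{0,1\}$ form of the loss. This is a sketch with hypothetical helper names (`loss_01`, `loss_margin`) and a handful of arbitrary $z$ values; the label mapping $y_{\pm} = 2y - 1$ is the only assumption:

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def loss_01(y01, z):
    """Logistic loss with labels in {0, 1} and p = sigmoid(z)."""
    p = sigmoid(z)
    return -(y01 * math.log(p) + (1 - y01) * math.log(1 - p))


def loss_margin(ypm, z):
    """Margin form of the same loss with labels in {-1, +1}."""
    return math.log(1.0 + math.exp(-ypm * z))


# Compare both forms on a few (label, z) pairs; 2*y - 1 maps {0,1} to {-1,+1}.
pairs = [(y01, z) for y01 in (0, 1) for z in (-3.0, -0.5, 0.0, 1.2, 4.0)]
max_gap = max(abs(loss_01(y01, z) - loss_margin(2 * y01 - 1, z)) for y01, z in pairs)
print(max_gap)  # difference is at the level of floating-point rounding
```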

Comparison with some other popular margin-based losses

__NOTE__ the logistic loss is defined up to the logarithm base. Here it is plotted for $\log_2$, so the loss equals 1 at zero margin



<img src="img/logistic_margin.png" width=500>

### Alternative formulations

Sometimes logistic regression is stated in "Logit" form
$$X\beta = \log\bigg(\frac{p}{1-p}\bigg)$$

Here logit = inverse of the sigmoid - a function that links linear domain to the response domain

It defines the same set of points but in slightly different notation

$X \rightarrow z \leftarrow P$



### Solution

Let's define some notation <br> $y \in \{0,1\}$ - correct output <br> $\hat{y} = \sigma(z) \in (0,1)$ - predicted output <br> $z = \sum_{i=1}^{n} w_i x_i$ - linear part of the regression

Recall the logistic loss:

$$L = - \sum_{i=1}^{N} \bigg[ y_i \log(\hat{y_i}) + (1-y_i) \log{(1-\hat{y_i})} \bigg]$$

Let's compute the gradient for a single loss instance by applying the chain rule:

$$\frac{\partial L}{\partial w_i} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} \cdot \frac{\partial z}{\partial w_i}$$

Individual derivatives will be the following:

1. Derivative of logistic loss: $\frac{\partial L}{\partial \hat{y}} = -\bigg( \frac{y}{\hat{y}} - \frac{1-y}{1-\hat{y}} \bigg)$<br><br>

2. Derivative of the sigmoid: $\frac{\partial \hat{y}}{\partial z} = (1-\hat{y}) \cdot \hat{y}$<br><br>

3. Derivative of the linear part $\frac{\partial z}{\partial w_i} = x_i$<br><br>


Now let's plug them in and simplify the expression

$$\frac{\partial L}{\partial w_i} = -\bigg(\frac{y}{\hat{y}} - \frac{1-y}{1-\hat{y}}\bigg) \cdot (1-\hat{y}) \cdot \hat{y} \cdot x_i = -\big(y \cdot (1-\hat{y}) - (1-y) \cdot \hat{y}\big) \cdot x_i = (\hat{y}-y) \cdot x_i$$


So the gradient descent step is
$$w_i := w_i - \eta \cdot (\hat{y}-y) \cdot x_i$$
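
A minimal gradient-descent sketch of this update in pure Python. The toy data, the learning rate `eta = 0.05` and the number of steps are assumptions made for illustration; the gradient used is $(\hat{y}-y) \cdot x_i$ summed over the samples, i.e. the gradient of the loss being minimized:

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def loss(w, X, y):
    """Logistic loss summed over all samples."""
    total = 0.0
    for xi, yi in zip(X, y):
        p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)))
        total -= yi * math.log(p) + (1 - yi) * math.log(1 - p)
    return total


def gradient(w, X, y):
    """Gradient of the loss: sum over samples of (y_hat - y) * x."""
    grad = [0.0] * len(w)
    for xi, yi in zip(X, y):
        p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)))
        for j, xj in enumerate(xi):
            grad[j] += (p - yi) * xj
    return grad


# Toy, non-separable data; the first column is a constant 1 for the intercept.
X = [[1.0, 0.5], [1.0, 1.5], [1.0, 2.0], [1.0, 3.0], [1.0, 3.5], [1.0, 4.5]]
y = [0, 0, 1, 0, 1, 1]

w = [0.0, 0.0]
eta = 0.05  # assumed learning rate, small enough for this data set
losses = [loss(w, X, y)]
for _ in range(200):
    g = gradient(w, X, y)
    w = [wj - eta * gj for wj, gj in zip(w, g)]
    losses.append(loss(w, X, y))

print(w, losses[0], losses[-1])
```

Each step moves `w` against the gradient, so the recorded loss should go down from one step to the next for a sufficiently small `eta`.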

## Logistic Loss and Cross-entropy

Let's recall some definitions from information theory

Entropy of a distribution $H(P) = - \sum_{i=1}^N P_i \cdot \log{P_i}$

Cross-entropy of two distributions is $H(P,Q) = - \sum_{i=1}^N P_i \cdot \log{Q_i}$

One can clearly see that when $N=2$ (distributions are binary) cross-entropy takes the same form as a logloss

The Kullback-Leibler divergence measures the gap between the two distributions

$$D_{KL}(P \| Q) = \sum_{i} P_i \cdot \log{\frac{P_i}{Q_i}} = \sum_{i} P_i \cdot \log{P_i} - \sum_{i} P_i \cdot \log{Q_i} = H(P,Q) - H(P)$$

How do those distributions relate to binary distributions?<br>
- $P = y$<br>real distribution - defined by "true" parameter p=P<br><br>
- $Q = \hat{y}$<br>predicted distribution - defined by candidate parameter p=Q<br>

Here $H(P)$ is defined over the observed data and does not depend on $Q$, thus it is constant. So to minimize $H(P,Q)$ means to minimize $D_{KL}$

Hence,<br>
to find proper distribution for $\hat{y}$ = to minimize a gap between the "real" and "predicted" = to minimize $D_{KL}$
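
A short numeric sketch of these relations, assuming two arbitrary binary distributions `P` and `Q` (illustrative values). It checks that the KL divergence equals cross-entropy minus entropy, and that for a one-hot "true" distribution the cross-entropy reduces to the logloss of a single observation:

```python
import math


def entropy(P):
    return -sum(p * math.log(p) for p in P if p > 0)


def cross_entropy(P, Q):
    return -sum(p * math.log(q) for p, q in zip(P, Q) if p > 0)


def kl_divergence(P, Q):
    return sum(p * math.log(p / q) for p, q in zip(P, Q) if p > 0)


P = [0.7, 0.3]  # assumed "real" distribution
Q = [0.6, 0.4]  # assumed "predicted" distribution

print(kl_divergence(P, Q), cross_entropy(P, Q) - entropy(P))

# One-hot P (observed y = 1) and predicted probability y_hat = 0.8:
y_hat = 0.8
print(cross_entropy([1.0, 0.0], [y_hat, 1 - y_hat]), -math.log(y_hat))
```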

## Convexity

Logistic Loss is convex

#### By $\hat{y}$

Logarithm is a concave function<br>
Logloss is a negative sum of logarithms of $\hat{y}$ and $1-\hat{y}$ with non-negative weights => convex



#### By $w$

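A twice-differentiable function is convex <=> its second derivative is non-negative everywhere. Let's check for the margin loss with $y=1$ (the case $y=-1$ is the mirror image $z \rightarrow -z$)

$$L'(z) = \frac{d}{dz} \bigg[\log(1+e^{-z}) \bigg] = -\frac{e^{-z}}{1+e^{-z}} = - \frac{1}{1+ e^{z}} = -\sigma(-z)$$

$$L''(z) = \frac{d}{dz} \big(- \sigma(-z)\big) = \frac{e^{z}}{(1+e^{z})^2} = \sigma(z) \cdot (1-\sigma(z)) > 0$$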

NOTE Although the sigmoid is not convex, the Logistic loss is convex in $z$, and $z$ is linear in $w$ => the loss is convex in $w$, so any local minimum that gradient methods converge to is a global one
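
The sign of the curvature can also be checked numerically. This sketch compares a central finite difference of the margin loss $\log(1+e^{-z})$ with the closed form $\sigma(z) \cdot (1-\sigma(z))$; the grid of $z$ values and the step `h` are arbitrary assumptions:

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def margin_loss(z):
    return math.log(1.0 + math.exp(-z))


def second_derivative_fd(f, z, h=1e-4):
    """Central finite-difference estimate of f''(z)."""
    return (f(z + h) - 2.0 * f(z) + f(z - h)) / (h * h)


zs = [-4.0 + 0.5 * k for k in range(17)]  # grid from -4 to 4
numeric = [second_derivative_fd(margin_loss, z) for z in zs]
closed_form = [sigmoid(z) * (1.0 - sigmoid(z)) for z in zs]

print(min(numeric))  # stays positive on the whole grid
print(max(abs(a - b) for a, b in zip(numeric, closed_form)))
```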

### Logit model

Logit is the inverse of the sigmoid function. It maps a probability $p \in (0,1)$ back to $\mathbb{R}$


An alternative way to pose a logistic regression problem is through the logit function

$$\log{\bigg(\frac{p}{1-p}\bigg)}=w_0 + w_1 x_1 + w_2 x_2 + \dots + w_n x_n$$

$\mathrm{logit}(p) = \sigma^{-1}(p) = \log{\bigg(\frac{p}{1-p}\bigg)}$
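
A tiny round-trip sketch showing that the logit undoes the sigmoid; the test points are arbitrary:

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))


zs = [-5.0, -1.0, 0.0, 2.0, 5.0]
round_trip = [logit(sigmoid(z)) for z in zs]
print(round_trip)  # recovers zs up to floating-point rounding
```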

## Generalized Linear Model

We have linear regression and logistic regression. They share many common elements.
Let's define a universal regression model that covers both

GLM model is: $E[y|x] = \mu = g^{-1}(X\beta)$

$E[y|x] = \mu$ = mean of the response variable<br>
$g$ = link function (identity for linear regression, logit for logistic regression)<br>
$g^{-1}$ = mean function (inverse link)

In general GLMs don't have a closed-form (aka analytical) solution => they are fit with the IRLS (iteratively reweighted least squares) procedure
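
A minimal sketch of what an IRLS fit can look like for the logistic case, written as Newton steps $w \leftarrow w + (X^T S X)^{-1} X^T (y - p)$ with $S = \mathrm{diag}(p_i (1-p_i))$. It is specialized to one feature plus an intercept so that the 2x2 system can be solved by hand; the function name `irls_logistic`, the toy data and the iteration count are assumptions for illustration:

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def irls_logistic(xs, ys, n_iter=25):
    """Fit p = sigmoid(w0 + w1 * x) with Newton / IRLS steps."""
    w0, w1 = 0.0, 0.0
    for _ in range(n_iter):
        ps = [sigmoid(w0 + w1 * x) for x in xs]
        # Score vector X^T (y - p)
        g0 = sum(y - p for y, p in zip(ys, ps))
        g1 = sum((y - p) * x for x, y, p in zip(xs, ys, ps))
        # Weighted normal matrix X^T S X with weights s_i = p_i (1 - p_i)
        s = [p * (1.0 - p) for p in ps]
        a = sum(s)
        b = sum(si * x for si, x in zip(s, xs))
        c = sum(si * x * x for si, x in zip(s, xs))
        det = a * c - b * b
        # Solve the 2x2 linear system by Cramer's rule and update the weights
        w0 += (c * g0 - b * g1) / det
        w1 += (a * g1 - b * g0) / det
    return w0, w1


# Toy, non-separable data (a separable set would push the weights to infinity).
xs = [0.5, 1.5, 2.0, 3.0, 3.5, 4.5]
ys = [0, 0, 1, 0, 1, 1]
w0, w1 = irls_logistic(xs, ys)
print(w0, w1)
```

At convergence the score vector is zero, which is the first-order condition of the maximum likelihood problem.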

## Bishop

Here the model is viewed from a slightly different, more general probabilistic angle. At each point of the space $x$ there is a latent variable: the class label. And each of the two classes can produce the response with some probability

$$P(C_1|x) = \frac{P(x|C_1)P(C_1)}{P(x|C_1)P(C_1) + P(x|C_2)P(C_2)} = \frac{1}{1+\exp \big[\log \frac{P(x|C_2)P(C_2)}{P(x|C_1)P(C_1)}\big]} = \frac{1}{1+\exp \big[- \log \frac{P(x|C_1)P(C_1)}{P(x|C_2)P(C_2)}\big]} = \frac{1}{1+e^{-\alpha}}$$

where $\alpha=\log{\big(\frac{P(x|C_1)P(C_1)}{P(x|C_2)P(C_2)}\big)}$ is the logarithm of the ratio of the class likelihoods (weighted by the class priors)
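
A small sketch of this identity, assuming Gaussian class-conditional densities with a shared standard deviation and made-up priors and means (all of these numbers are illustrative assumptions). It computes the posterior once by Bayes' rule and once as a sigmoid of $\alpha$:

```python
import math


def gauss_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


# Assumed priors and class-conditional parameters.
prior1, prior2 = 0.4, 0.6
mu1, mu2, sigma = 1.0, -1.0, 1.5


def posterior_bayes(x):
    """P(C1 | x) computed directly by Bayes' rule."""
    num = gauss_pdf(x, mu1, sigma) * prior1
    return num / (num + gauss_pdf(x, mu2, sigma) * prior2)


def alpha(x):
    """Log-ratio of prior-weighted class likelihoods."""
    return math.log(gauss_pdf(x, mu1, sigma) * prior1 / (gauss_pdf(x, mu2, sigma) * prior2))


points = [-3.0, -1.0, 0.0, 0.7, 2.5]
gaps = [abs(posterior_bayes(x) - sigmoid(alpha(x))) for x in points]
print(max(gaps))  # the two routes agree up to rounding
```

With a shared variance the quadratic terms in $x$ cancel and $\alpha$ is linear in $x$, which is how the linear part of logistic regression arises in this view.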