## Logistic Regression

Binary classification task: $\mathbb{Y} = \{-1, +1\}$

Linear classifier

$a(x)=sign(b(x) - t) = sign(\langle w, x\rangle - t)$

Error rate loss function

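$\min_w \frac{1}{N} \sum^{N}_{i=1} [y_i \langle w, x_i \rangle < 0] = \min_w \frac{1}{N} \sum^{N}_{i=1} [M_i < 0]$, where $M_i = y_i \langle w, x_i \rangle$ is the margin of object $x_i$ (with the threshold $t$ set to 0)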

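We can optimize a differentiable upper bound instead, the logistic loss (it bounds $[M_i < 0]$ from above when the logarithm is taken base 2; changing the base only rescales the objective)

$\min_w \frac{1}{N} \sum^{N}_{i=1} \log(1+\exp(-M_i))$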
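To make the relation between the error rate and its smooth upper bound concrete, here is a minimal sketch in plain Python. The toy data, the weight vector, and the function names are hypothetical, and the base-2 logarithm is an assumption made so that each term really dominates the indicator $[M_i < 0]$.

```python
import math

def margins(w, X, y):
    """Margins M_i = y_i * <w, x_i> for labels y_i in {-1, +1}."""
    return [yi * sum(wj * xj for wj, xj in zip(w, xi)) for xi, yi in zip(X, y)]

def error_rate(M):
    """Fraction of objects with a negative margin (the 0-1 loss)."""
    return sum(1 for m in M if m < 0) / len(M)

def logistic_bound(M):
    """Mean of log2(1 + exp(-M_i)); with base 2 each term is >= [M_i < 0]."""
    return sum(math.log2(1.0 + math.exp(-m)) for m in M) / len(M)

# Hypothetical toy data: 4 objects with 2 features each.
X = [[1.0, 2.0], [2.0, -1.0], [-1.0, -1.5], [-0.5, 0.3]]
y = [+1, +1, -1, +1]
w = [1.0, 0.5]

M = margins(w, X, y)
print("error rate:", error_rate(M))
print("logistic upper bound:", logistic_bound(M))
```

Only the last object has a negative margin here, so the error rate is one quarter, and the smooth bound is larger than that.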

__Can we use $b(x)=\langle w, x \rangle$ as a probability estimate?__

__Linear classifier__

Not directly: $\langle w, x \rangle$ can be any real number. Let us map the output to the interval $[0, 1]$

Sigmoid function $\sigma(\langle w, x \rangle) = \large \frac{1}{1+\exp(-\langle w, x \rangle)}$

__Logistic Regression__

Binary classification task: $\mathbb{Y} = \{-1, +1\}$

Predicted probability:  

$P(y_i=1|x_i) = b(x_i)$

Use the sigmoid function to map outputs to the range from 0 to 1

$b(x)= \sigma(\langle w, x \rangle) = \large \frac{1}{1+\exp(-\langle w, x \rangle)}$

- In some tasks it is important to predict class probabilities

- We can apply the sigmoid function to the output of the model to get numbers between 0 and 1

- Finally, we want to train the model in such a way that these numbers can be interpreted as probabilities
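The mapping from a linear score to a number between 0 and 1 can be sketched as follows. This is an illustrative implementation, not a reference one: the function names are hypothetical, and the two-branch form is one common way to avoid overflow in `exp` for large negative scores.

```python
import math

def sigmoid(z):
    """Numerically stable sigma(z) = 1 / (1 + exp(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)  # z < 0 here, so exp(z) cannot overflow
    return ez / (1.0 + ez)

def predict_proba(w, x):
    """b(x) = sigma(<w, x>): estimated probability of the class +1."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)))

# Hypothetical weights and object.
w = [0.5, -1.0]
x = [2.0, 0.5]
print(predict_proba(w, x))
```

A score of zero (a point on the separating hyperplane) maps to 0.5, and scores far from zero map close to 0 or 1.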

## MLE for Logistic Regression

$y_i \in \{-1, +1\}$

$P(y_i=1|x_i) = \large \frac{1}{1+\exp(-\langle w, x_i \rangle)}$

$P(y_i=-1|x_i) = 1-P(y_i=1|x_i) = \large \frac{1}{1+\exp(\langle w, x_i \rangle)}$

We use maximum likelihood estimation (MLE) to find the optimal parameters $w$.

MLE takes the likelihood of the model and maximizes it with respect to the desired parameters.

$\log L = \log \prod^N_{i=1} P(y_i|x_i) \to \max_w$

$ - \log \prod^N_{i=1} P(y_i|x_i) \to \min_w$

---

$- \sum^N_{i=1} \log P(y_i|x_i) $ (the logarithm of a product of probabilities is the sum of the logarithms of the probabilities)

$= -\sum^N_{i=1} [[y_i=1] \log \frac{1}{1+\exp(-\langle w, x_i \rangle)} + [y_i=-1] \log \frac{1}{1+\exp(\langle w, x_i \rangle)}]$ (we have two options, $y\in \{-1, +1\}$)

$=\sum^N_{i=1} [[y_i=1] \log (1+\exp(-\langle w, x_i \rangle)) + [y_i=-1] \log (1+\exp(\langle w, x_i \rangle))]$

$= \sum^N_{i=1} \log(1 + \exp(-y_i\langle w, x_i \rangle)) \to min_w$ 

This is the final loss, which we minimize with respect to $w$.

Here $y_i\langle w, x_i \rangle$ is the __margin__: if it is positive, the sign of the scalar product and the target variable agree, meaning that the classifier is correct on this object.
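The derivation above can be checked numerically and turned into a small training loop. The sketch below is only an illustration under stated assumptions: plain gradient descent with a fixed step size is my choice of optimizer (the notes do not name one), the toy data and all function names are hypothetical, and the first feature is a constant 1 that plays the role of a bias term.

```python
import math

def dot(w, x):
    return sum(wj * xj for wj, xj in zip(w, x))

def log1p_exp(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def logistic_loss(w, X, y):
    """Margin form: sum_i log(1 + exp(-y_i <w, x_i>))."""
    return sum(log1p_exp(-yi * dot(w, xi)) for xi, yi in zip(X, y))

def nll_two_cases(w, X, y):
    """Indicator form: -sum_i log P(y_i | x_i), treating y_i = +1 and y_i = -1 separately."""
    total = 0.0
    for xi, yi in zip(X, y):
        p_pos = 1.0 / (1.0 + math.exp(-dot(w, xi)))
        total -= math.log(p_pos if yi == 1 else 1.0 - p_pos)
    return total

def gradient(w, X, y):
    """Gradient of logistic_loss: -sum_i y_i x_i sigma(-M_i)."""
    g = [0.0] * len(w)
    for xi, yi in zip(X, y):
        s = math.exp(-log1p_exp(yi * dot(w, xi)))  # sigma(-M_i) = 1 / (1 + exp(M_i))
        for j, xij in enumerate(xi):
            g[j] -= yi * xij * s
    return g

def fit(X, y, lr=0.1, steps=500):
    """Plain gradient descent from w = 0 (an assumed, simple optimizer)."""
    w = [0.0] * len(X[0])
    for _ in range(steps):
        w = [wj - lr * gj for wj, gj in zip(w, gradient(w, X, y))]
    return w

# Hypothetical, non-separable toy data; the first feature is a constant 1.
X = [[1.0, 0.5], [1.0, 1.5], [1.0, 2.0], [1.0, 3.0], [1.0, 1.0], [1.0, 2.5]]
y = [-1, -1, +1, +1, +1, -1]

w_hat = fit(X, y)
print("loss at w = 0:", logistic_loss([0.0, 0.0], X, y))
print("loss after training:", logistic_loss(w_hat, X, y))
```

The two loss functions implement the indicator form and the margin form of the same quantity, so they should agree up to rounding for any $w$.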

## Logistic Regression: Proper Probabilities

Upper bound on the error rate

$\tilde{l}(M) = \log(1+e^{-M})$

Logistic regression loss

$\min_w \sum^{N}_{i=1} \log(1+\exp(-y_i\langle w, x_i \rangle))$

We say that the model $b(x)$ predicts probabilities correctly if, among the objects with $b(x)=p$, the proportion of positive objects is $p$.

### predicting probabilities

Consider objects $x_1, \dots, x_n$ for which $b(x)$ outputs (approximately) the same probability $p$:

$\sum^n_{i=1} l(y_i, b(x_i)) = \sum^n_{i=1} l(y_i, p)$

What is the optimal $p$ for these objects?

$p_* = \arg\min_p \sum^n_{i=1} l(y_i, p)$

We expect that $p_* = \frac{1}{n} \sum^n_{i=1} [y_i = +1]$ (the proportion of positive objects among them).

### Log loss

$p_* = \arg\min_p \sum^n_{i=1} \{-[y_i=+1]\log p - [y_i=-1]\log (1-p) \}$

Calculate the derivative and find optimal probability:

$\sum_i \{ -\frac{[y_i=+1]}{p} + \frac{[y_i=-1]}{1-p} \} = -\frac{n_+}{p} + \frac{n_-}{1-p} = 0$

$p_* = \frac{n_+}{n_+ + n_-} = \frac{1}{n} \sum^n_{i=1} [y_i=+1]$
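The result of this derivation can be sanity-checked with a brute-force search. The sketch below is illustrative only: the labels are hypothetical, and the grid search over $p$ stands in for the analytic solution.

```python
import math

def log_loss_sum(p, y):
    """sum_i { -[y_i=+1] log p - [y_i=-1] log(1 - p) } for a constant prediction p."""
    return sum(-math.log(p) if yi == 1 else -math.log(1.0 - p) for yi in y)

# Hypothetical group of objects that all receive the same prediction p.
y = [+1, +1, -1, +1, -1, -1, -1, +1, -1, -1]

# Search p over a fine grid inside (0, 1).
grid = [k / 1000 for k in range(1, 1000)]
p_star = min(grid, key=lambda p: log_loss_sum(p, y))

share_positive = sum(1 for yi in y if yi == 1) / len(y)
print("grid minimizer:", p_star)
print("share of positives:", share_positive)
```

The grid minimizer coincides with the share of positive labels, as the derivative condition predicts.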

- We can formulate the condition that the model estimates probabilities correctly
- We then choose a loss function that satisfies this condition
- Log-loss is one example of such a loss
- Another example is MSE, although MSE tends to work poorly for classification tasks
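To see why MSE also satisfies the condition, here is a short worked derivation. It assumes the squared error is measured between the indicator $[y_i=+1]$ and the predicted probability $p$; this encoding is an assumption, since the notes do not spell it out.

$p_* = \arg\min_p \sum^n_{i=1} ([y_i=+1] - p)^2$

$\frac{d}{dp} \sum^n_{i=1} ([y_i=+1] - p)^2 = -2\sum^n_{i=1} ([y_i=+1] - p) = -2(n_+ - np) = 0$

$p_* = \frac{n_+}{n} = \frac{1}{n} \sum^n_{i=1} [y_i=+1]$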

## Support Vector Machine

### Linear classifier and hinge loss

Binary classification task $\mathbb{Y}=\{-1, +1\}$

We can optimize a convex upper bound on the error rate, the hinge loss (it is differentiable everywhere except at $M_i = 1$)

$\min_w \frac{1}{N}\sum^N_{i=1} \max(0, 1-M_i)$

### Margin of a classifier

Recall that a linear classifier defines a hyperplane

$\langle w, x \rangle = 0$

The distance from any point $x$ to the hyperplane is

$\frac{| \langle w, x \rangle |}{\|w\|}$

Margin of a classifier: the distance from the hyperplane to the nearest object in the dataset

We will maximize the margin of the classifier while also trying not to make mistakes (that is, not to misclassify any object).
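The two definitions above, the point-to-hyperplane distance and the margin of a classifier, translate directly into code. This is a minimal sketch; the weight vector, the points, and the function names are hypothetical.

```python
import math

def dot(w, x):
    return sum(wj * xj for wj, xj in zip(w, x))

def norm(w):
    return math.sqrt(dot(w, w))

def distance_to_hyperplane(w, x):
    """|<w, x>| / ||w||: distance from x to the hyperplane <w, x> = 0."""
    return abs(dot(w, x)) / norm(w)

def classifier_margin(w, X):
    """Distance from the hyperplane to the nearest object in the dataset."""
    return min(distance_to_hyperplane(w, x) for x in X)

# Hypothetical weights and dataset.
w = [3.0, 4.0]
X = [[1.0, 1.0], [-2.0, 0.5], [0.0, -2.5]]

for x in X:
    print(x, distance_to_hyperplane(w, x))
print("margin:", classifier_margin(w, X))
```

Here $\|w\| = 5$, so the three distances are $7/5$, $4/5$ and $10/5$, and the margin is the smallest of them.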

### Linearly separable case

__Condition 1__: Maximal margin

The distance from any point x to the hyperplane is:

$\frac{| \langle w, x \rangle |}{\|w\|}$

Margin of a classifier:

$ \min_{i=1, \dots, N} \frac{| \langle w, x_i \rangle |}{\|w\|}$

__Condition 2__: No mistakes

$y_i \langle w, x_i \rangle > 0, i = 1,\dots , N$

### Simple assumption

For a linear classifier:

$a(x) = sign(\langle w, x \rangle)$

If we divide $w$ by a scalar $\alpha>0$, the output will not change:

$a(x) = sign (\frac{\langle w, x \rangle}{\alpha}) = sign(\langle w, x \rangle)$

Let us divide $w$ by $\min_{i=1,\dots, N} |\langle w, x_i \rangle| > 0$. As a result:

$\min_{i=1,\dots, N} |\langle \tilde{w}, x_i \rangle| = 1$

where $\tilde{w} = \frac{w}{\min_{i=1,\dots, N} |\langle w, x_i \rangle|}$

### Margin of a classifier

$ \min_{i=1, \dots, N} \frac{| \langle w, x_i \rangle |}{\|w\|}$

Given our assumption that

$\min_{i=1, \dots, N} | \langle w, x_i \rangle | = 1$

the resulting margin is

$ \min_{i=1, \dots, N} \frac{| \langle w, x_i \rangle |}{\|w\|} = \frac{\min_{i=1, \dots, N}| \langle w, x_i \rangle |}{\|w\|} = \frac{1}{\|w\|}$
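The normalization argument can be verified numerically: rescaling $w$ leaves the hyperplane and its margin unchanged, and after rescaling the margin equals one over the norm of the rescaled vector. The sketch below uses hypothetical data and helper names.

```python
import math

def dot(w, x):
    return sum(wj * xj for wj, xj in zip(w, x))

def norm(w):
    return math.sqrt(dot(w, w))

def margin(w, X):
    """min_i |<w, x_i>| / ||w||."""
    return min(abs(dot(w, x)) for x in X) / norm(w)

def normalize(w, X):
    """Divide w by min_i |<w, x_i>| so that the smallest |<w, x_i>| becomes 1."""
    s = min(abs(dot(w, x)) for x in X)
    if s == 0:
        raise ValueError("an object lies exactly on the hyperplane")
    return [wj / s for wj in w]

# Hypothetical weights and dataset.
w = [3.0, 4.0]
X = [[1.0, 1.0], [-2.0, 0.5], [0.0, -2.5]]

w_tilde = normalize(w, X)
print("min |<w~, x_i>|:", min(abs(dot(w_tilde, x)) for x in X))
print("margin before:", margin(w, X))
print("1 / ||w~||:", 1.0 / norm(w_tilde))
```

The margin computed from the original $w$ and the value $1 / \|\tilde{w}\|$ agree, which is the identity used in the next step.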

### Linearly separable case

__Condition 1__: Maximal margin

$\max_w \frac{1}{\|w\|}$

__Condition 2__: No mistakes

$y_i \langle w, x_i \rangle > 0, i = 1,\dots, N$

__Condition 3__: Normalization (the assumption above)

$|\langle w, x_i \rangle| \ge 1, i = 1,\dots, N$

Conditions 2 and 3 together:

$y_i \langle w, x_i \rangle \ge 1, i = 1,\dots, N$

Maximizing $\frac{1}{\|w\|}$ is equivalent to minimizing $\|w\|^2$, so the final optimization problem is:

$\begin{cases} \min_w \|w\|^2 \\  y_i \langle w, x_i \rangle \ge 1, \quad i = 1,\dots, N \end{cases}$

### Linearly non-separable case

$\begin{cases} \min_{w, \xi} \|w\|^2 \\  y_i \langle w, x_i \rangle \ge 1 - \xi_i \\ \xi_i \ge 0 \end{cases}$

Here the slack variables $\xi_i$ allow some objects to violate the margin. In this form, however, we could choose very large $\xi_i$ and get a large margin with many mistakes, so we also penalize large $\xi_i$.

### Support Vector Machine

$\begin{cases} \min_{w, \xi} \|w\|^2 + C\sum^N_{i=1} \xi_i \\  y_i \langle w, x_i \rangle \ge 1 - \xi_i \\ \xi_i \ge 0 \end{cases}$

A small $C$ means a small penalty for mistakes, so a wide margin is preferred; a large $C$ penalizes mistakes heavily.

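Combine the two constraints on $\xi_i$:

$\xi_i \ge \max(0, 1-y_i \langle w, x_i \rangle )$

Since the objective is minimized over $\xi_i$, this inequality becomes an equality at the optimum, and we can rewrite the optimization task without constraints:

$\min_w \|w\|^2 + C\sum^{N}_{i=1} \max(0, 1- y_i \langle w, x_i \rangle)$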
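For a fixed $w$, the smallest feasible slack values and the resulting objective can be computed directly. The following sketch uses hypothetical data, weights, and function names.

```python
def dot(w, x):
    return sum(wj * xj for wj, xj in zip(w, x))

def slacks(w, X, y):
    """Smallest feasible slack per object: xi_i = max(0, 1 - y_i <w, x_i>)."""
    return [max(0.0, 1.0 - yi * dot(w, xi)) for xi, yi in zip(X, y)]

def svm_objective(w, X, y, C):
    """||w||^2 + C * sum_i max(0, 1 - y_i <w, x_i>)."""
    return dot(w, w) + C * sum(slacks(w, X, y))

# Hypothetical weights and dataset.
w = [1.0, -1.0]
X = [[2.0, 0.0], [0.5, 0.0], [0.0, 1.0], [1.0, 2.0]]
y = [+1, +1, -1, +1]

print("slacks:", slacks(w, X, y))
print("objective with C = 1:", svm_objective(w, X, y, C=1.0))
```

Objects with a margin of at least 1 get zero slack, an object inside the margin gets a slack between 0 and 1, and a misclassified object gets a slack above 1.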

Hinge loss with regularization (divide the objective by $C$ and set $\lambda = \frac{1}{C}$)

$\min_w \sum^{N}_{i=1} \max(0, 1- y_i \langle w, x_i \rangle) + \lambda \|w\|^2$

>Recall the hinge loss, which was an upper bound on the error rate: the maximum of zero and one minus the margin. Here we also have the squared norm of the weight vector, which is exactly L2 regularization. So the support vector machine, which we derived from very different intuitive assumptions, is equivalent to optimizing the hinge loss with L2 regularization.
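The regularized hinge-loss objective can be minimized with a subgradient method, because the hinge loss is not differentiable at margin 1. The sketch below is one possible illustration, not the solver used by any particular library: the decreasing step size, the number of steps, the toy data, and the function names are all assumptions, and the first feature is a constant 1 acting as a bias term.

```python
import math

def dot(w, x):
    return sum(wj * xj for wj, xj in zip(w, x))

def objective(w, X, y, lam):
    """sum_i max(0, 1 - y_i <w, x_i>) + lam * ||w||^2."""
    hinge = sum(max(0.0, 1.0 - yi * dot(w, xi)) for xi, yi in zip(X, y))
    return hinge + lam * dot(w, w)

def subgradient(w, X, y, lam):
    """One subgradient of the objective at w."""
    g = [2.0 * lam * wj for wj in w]
    for xi, yi in zip(X, y):
        if yi * dot(w, xi) < 1.0:  # the hinge term of this object is active
            for j, xij in enumerate(xi):
                g[j] -= yi * xij
    return g

def fit(X, y, lam=0.1, steps=200):
    """Subgradient descent from w = 0, returning the best iterate seen."""
    w = [0.0] * len(X[0])
    best_w, best_val = w, objective(w, X, y, lam)
    for t in range(1, steps + 1):
        lr = 0.1 / math.sqrt(t)  # assumed decreasing step size
        g = subgradient(w, X, y, lam)
        w = [wj - lr * gj for wj, gj in zip(w, g)]
        val = objective(w, X, y, lam)
        if val < best_val:
            best_w, best_val = w, val
    return best_w

# Hypothetical linearly separable toy data; the first feature is a constant 1.
X = [[1.0, 2.0, 2.0], [1.0, 3.0, 3.0], [1.0, -1.0, -2.0], [1.0, -2.0, -1.0]]
y = [+1, +1, -1, -1]
lam = 0.1

w_hat = fit(X, y, lam)
print("objective at w = 0:", objective([0.0, 0.0, 0.0], X, y, lam))
print("objective after training:", objective(w_hat, X, y, lam))
```

The best iterate is tracked explicitly because a subgradient step, unlike a gradient step on a smooth function, is not guaranteed to decrease the objective.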

![image.png](attachment:image.png)

Both the logistic regression loss and the support vector machine (hinge) loss are upper bounds on the error rate. These upper bounds are slightly different, and you can see how they differ on the graph.

The hinge loss used by the support vector machine has very low values in the right part of the plot: it is exactly zero for margins $M \ge 1$. For margins close to 0 and $-1$, however, the hinge loss is slightly larger than the logistic one.
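The comparison described here can be reproduced by evaluating both losses at a few margins. This is a small illustrative sketch; the function names and the chosen margin values are hypothetical, and the logistic loss uses the natural logarithm, as in $\log(1+e^{-M})$.

```python
import math

def hinge(m):
    """Hinge loss max(0, 1 - M)."""
    return max(0.0, 1.0 - m)

def logistic(m):
    """Logistic loss log(1 + exp(-M)) with the natural logarithm."""
    return math.log(1.0 + math.exp(-m))

for m in [-2.0, -1.0, 0.0, 1.0, 2.0]:
    print(f"M = {m:+.1f}  hinge = {hinge(m):.4f}  logistic = {logistic(m):.4f}")
```

For margins of 1 and above the hinge loss is exactly zero while the logistic loss is small but positive; for margins of 0 and below the hinge loss is the larger of the two.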


- The support vector machine is based on the idea of maximizing the margin of a classifier. That is, we want the hyperplane to be as far as possible from all the points in the data.

- If the dataset is not linearly separable, we cannot build a classifier that perfectly splits the points. Therefore, we have to decide which is more important: the width of the margin or the number of errors the classifier makes.

- We also saw that the support vector machine is equivalent to training a linear classifier with hinge loss and L2 regularization.