# Logistic Regression #

## ***Vocabulary***

none

# Lecture Notes #

## ***1.10.0 Introduction***

### **Loss Functions**
- Classification
    - loss function: we predict $sign(w^Tx)$ for some vector $w$; we are penalized if $sign(w^Tx^i) \ne y^i$, and there is no penalty if $sign(w^Tx^i) = y^i$. Equivalently, we examine $y^i*(w^Tx^i)$: if it is positive there is no penalty, and if it is negative we incur a penalty.
    - Penalty can be 0-1 loss: $\phi_{0-1}(y^i*w^Tx^i)$, where
$$ \phi_{0-1}(z) = \begin{cases} 
          1\;if\;z\;\le\;0 \\
          0\;if\;z\;\gt\;0 
          \end{cases}
$$  
- Linear Regression
    - loss function (square loss): our objective was $\underset{w}{\min} \frac{1}{m} \sum_{i=1}^{m}(w^Tx^i-y^i)^2$
    - Square loss penalty: the squared error $(w^Tx^i-y^i)^2$
    - We knew that if $\mathbf{E}[Y|X] = w^Tx$, then linear regression is a great option. Even if it did not hold, we could use GD to find the best fitting line

### **Optimization Problems**
- Classification
    - $\underset{w}{\min} \frac{1}{m} \sum_{i=1}^{m}\phi_{0-1}(y^i*w^Tx^i)$, this is what we want to solve for classification.
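
The losses and the classification objective above can be sketched in a few lines of plain Python. This is only an illustration: the toy data, the weight vector, and helper names such as `dot` and `empirical_zero_one_loss` are assumptions made for the example, and labels are taken to be in $\{-1,+1\}$.

```python
def dot(w, x):
    """Inner product w^T x for two equal-length lists."""
    return sum(wj * xj for wj, xj in zip(w, x))

def zero_one_loss(z):
    """phi_{0-1}(z): penalty 1 when the margin z <= 0, otherwise 0."""
    return 1 if z <= 0 else 0

def square_loss(w, x, y):
    """Squared error (w^T x - y)^2 used in linear regression."""
    return (dot(w, x) - y) ** 2

def empirical_zero_one_loss(w, X, Y):
    """(1/m) * sum_i phi_{0-1}(y^i * w^T x^i): fraction of training mistakes."""
    return sum(zero_one_loss(y * dot(w, x)) for x, y in zip(X, Y)) / len(X)

# Hypothetical toy data with labels in {-1, +1}.
X = [[3.0, 1.0], [1.0, 2.0], [-1.0, 1.0], [2.0, 3.0]]
Y = [1, -1, -1, 1]
w = [1.0, -1.0]

print(empirical_zero_one_loss(w, X, Y))  # one mistake out of four -> 0.25
print(square_loss(w, X[0], 0.5))         # (2.0 - 0.5)^2 -> 2.25
```

Evaluating the objective for one fixed $w$ is cheap; the difficulty lies in searching over all $w$ for the minimizer.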

---

***Question: When does perceptron find a $w$ with small loss?***
<br><br>
Recall that perceptron required $\exists \;w \;such\;that\; \forall x: y*w^Tx > \rho$. That is, there had to be a $w$ whose prediction $sign(w^Tx)$ was correct on every example (positive when $y$ is positive, negative when $y$ is negative), with a margin of at least $\rho$. This implies convergence, with the number of mistakes at most $\frac{1}{\rho^2}$.

<br>

***Question: What if there is no margin? (There may not even be a $w$ that is consistent with all the labels in our training set!)***
<br><br>
We can still use $\underset{w}{\min} \frac{1}{m} \sum_{i=1}^{m}\phi_{0-1}(y^i*w^Tx^i)$, because it still minimizes the number of mistakes on our training set. The goal in this case is to find a linear function that separates the data while making as few mistakes as possible.

<br> 

What is the computational complexity of this optimization problem?

$$\underset{w}{\min} \frac{1}{m} \sum_{i=1}^{m}\phi_{0-1}(y^i*w^Tx^i)$$

This problem is NP-hard, so it is unlikely to admit a polynomial-time solution. Unlike the linear regression optimization problem, it is neither convex nor differentiable. This problem is sometimes referred to as "agnostically learning a halfspace". (There is a model of learning called agnostic learning, which generalizes PAC learning to the case where no halfspace fits all of the labels.)

Because of this, we aim to find a loss function for classification that is more relaxed. That will be logistic loss, which will turn out to be convex and differentiable.

---

### **TLDR;**

<br>
<center>
    <img src="images/1.10.1.png" alt="Professor Notes" />
</center>
<br>

We will be looking at relaxing the 0-1 loss function to create logistic regression.

## ***1.10.1 Losses***

#### **Introducing some new losses**
- $\phi_{logistic}$
- $\phi_{hinge}$
- $\phi_{exp}$

---

#### **Logistic Loss Function**

$$ \phi_{logistic}(z) = log(1+e^{-z}) $$

Notice if we plug in our values:

$$ \phi_{logistic}(y^i*w^Tx^i) = log(1+e^{-(y^i*w^Tx^i)}) $$

If $y^i*w^Tx^i$, which is often referred to as the ***margin***, is $\ll 0$, then $w^Tx^i$ has a different sign than $y^i$, meaning the guess was incorrect. This makes $ \phi_{logistic}(y^i*w^Tx^i)$ large, because the exponent $-(y^i*w^Tx^i)$ is then large and positive. Likewise, if the guess was correct with a large margin, then $ \phi_{logistic}(y^i*w^Tx^i)$ is small, approaching 0.

<br>
<center>
    <img src="images/1.10.2.png" alt="Professor Notes" />
</center>
<br>
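
To make this behavior concrete, here is a small sketch of the logistic loss in plain Python. The function name `logistic_loss` and the sample margins are assumptions for illustration; the branch on the sign of $z$ is one common way to avoid overflow in $e^{-z}$ when the margin is very negative.

```python
import math

def logistic_loss(z):
    """phi_logistic(z) = log(1 + e^{-z}), computed without overflow."""
    if z >= 0:
        return math.log1p(math.exp(-z))
    # For z < 0, use the identity log(1 + e^{-z}) = -z + log(1 + e^{z}).
    return -z + math.log1p(math.exp(z))

# Very negative margin -> large loss; large positive margin -> loss near 0.
for z in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(f"margin {z:+6.1f} -> loss {logistic_loss(z):.4f}")
```

At a margin of 0 the loss is $log(2)$, and for very negative margins it grows roughly linearly in $-z$.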

---

#### **Hinge Loss Function**

$$ \phi_{hinge}(z) = max\{1-z,\;0\} $$

Notice if we plug in our values:

$$ \phi_{hinge}(y^i*w^Tx^i) = max\{1-(y^i*w^Tx^i),\;0\} $$

When our prediction is correct by a margin of 1 or greater, there will be no loss. When our prediction is incorrect, the loss is at least 1 and grows linearly as the margin becomes more negative. Notice that with this loss function, even if we are correct but the margin is only 0.5, we will still incur some loss (here, 0.5).

<br>
<center>
    <img src="images/1.10.3.png" alt="Professor Notes" />
</center>
<br>
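
A minimal sketch of the hinge loss in plain Python, evaluated at a few illustrative margins (the function name and the sample values are assumptions for the example):

```python
def hinge_loss(z):
    """phi_hinge(z) = max{1 - z, 0} for a margin z = y * w^T x."""
    return max(1.0 - z, 0.0)

# Correct with margin >= 1: no loss. Correct with margin 0.5: loss 0.5.
# Incorrect (negative margin): loss above 1.
for z in (2.0, 1.0, 0.5, 0.0, -1.0):
    print(f"margin {z:+.1f} -> hinge loss {hinge_loss(z):.1f}")
```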

---

#### **Exponential Loss Function**

$$ \phi_{exp}(z) = e^{-z} $$

This has similar properties: the loss is large when the margin is very negative and approaches 0 as the margin grows.

---

#### **Visualizing the Losses**
All of these losses plotted together:

<br>
<center>
    <img src="images/1.10.4.png" alt="Professor Notes" />
</center>
<br>

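Notice they are all convex. Logistic loss and exponential loss are differentiable everywhere; hinge loss is differentiable everywhere except at the kink $z = 1$.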

We will be focusing on logistic loss for the rest of the notes.
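
For readers without the figure, the following sketch tabulates the losses at a few margins and runs a simple midpoint test of convexity on those sample points. The sample margins, the `LOSSES` dictionary, and the helper `midpoint_convex` are assumptions for illustration; passing the test on a handful of points is a sanity check, not a proof of convexity.

```python
import math

LOSSES = {
    "0-1": lambda z: 1.0 if z <= 0 else 0.0,
    "logistic": lambda z: math.log1p(math.exp(-z)),
    "hinge": lambda z: max(1.0 - z, 0.0),
    "exp": lambda z: math.exp(-z),
}

margins = [-2.0, -1.0, 0.0, 0.5, 1.0, 2.0]

# One row per margin, one column per loss.
print("margin " + " ".join(f"{name:>9}" for name in LOSSES))
for z in margins:
    print(f"{z:+6.1f} " + " ".join(f"{f(z):9.4f}" for f in LOSSES.values()))

def midpoint_convex(f, points):
    """Check f((a+b)/2) <= (f(a)+f(b))/2 on every pair of sample points."""
    return all(f((a + b) / 2) <= (f(a) + f(b)) / 2 + 1e-12
               for a in points for b in points)

# The relaxed losses pass the check; the 0-1 loss does not.
for name, f in LOSSES.items():
    print(f"{name:>9} passes midpoint check: {midpoint_convex(f, margins)}")
```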

## ***1.10.2 Logistic Loss Optimization***

#### **Optimization Problem Associated with Logistic Loss:**

$$L(w) = \frac{1}{m} \sum_{i=1}^{m}log(1+exp(-y^i*w^Tx^i))$$

so we want to find $\underset{w}{\min}\;L(w)$. Enter the **sigmoid function**:

$$ g(z) = \frac{1}{1+e^{-z}}$$

Notice that
- as $z$ gets larger, $g(z) \to 1$
- as $z$ gets smaller, $g(z) \to 0$.

<br>
<center>
    <img src="images/1.10.5.png" alt="Professor Notes" style="width: 50%;"/>
</center>
<br>

#### **Properties of sigmoid**
Fact, for the sigmoid function, $ g(z) = \frac{1}{1+e^{-z}}$: $$g(z) + g(-z) = 1$$

Also, we assume that $\exists \;w$ such that, for labels $Y \in \{0,1\}$, $$\mathbf{E}[Y|X] = g(w^TX)$$

Which means that $$Pr[Y=1|X] = g(w^Tx)$$

Thus, given $x$, if $w^Tx$ is large, the probability that $y$ equals 1 is very large. Likewise, if $w^Tx$ is very negative, the probability that $y$ equals 1 is very small. This is due to the shape of the sigmoid function, and this is the relationship between the sigmoid function and the halfspace scenario.
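
The sigmoid and its symmetry property are easy to check numerically. The sketch below is plain Python; the function name `sigmoid` and the sample scores standing in for $w^Tx$ are assumptions for illustration, and the two branches are just one way to avoid overflow for large $|z|$.

```python
import math

def sigmoid(z):
    """g(z) = 1 / (1 + e^{-z}), written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)          # for z < 0, g(z) = e^z / (1 + e^z)
    return ez / (1.0 + ez)

# Hypothetical values of w^T x and the modeled Pr[Y = 1 | x] = g(w^T x).
for score in (-5.0, -1.0, 0.0, 1.0, 5.0):
    p = sigmoid(score)
    # The last column checks the fact g(z) + g(-z) = 1.
    print(f"w^T x = {score:+.1f}  Pr[Y=1|x] = {p:.4f}  "
          f"g(z) + g(-z) = {p + sigmoid(-score):.4f}")
```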

#### **Relation to Halfspaces**
If your $w^Tx$ was large in the halfspace scenario, your label was definitely 1, because $w^Tx$ was positive. In logistic regression, we instead assume that the probability that the label equals 1 is very large, but there is still some chance that it equals 0 (and likewise for a very negative $w^Tx$). Hence the relaxation.

In the sigmoid function, if $w^Tx = 0$, you will be equally likely to have a label 1 or 0.

#### **Model for Logistic Regression**

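Writing the labels as $y^i \in \{-1,+1\}$ (as in the margin $y^i*w^Tx^i$, with $-1$ playing the role of label 0), the fact $g(z) + g(-z) = 1$ lets one formula cover both labels:

$$ Pr[Y=y^i|x^i;w] = g(y^i*w^Tx^i)$$

Given a training set $S$, what is the most likely $w$?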

$$ Likelihood(w) = \prod_{i=1}^{m}p(Y=y^i|x^i;w) = \prod_{i=1}^{m}g(y^i*w^Tx^i) $$

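$$ Log\text{-}Likelihood(w) = \sum_{i=1}^{m}log\;g(y^i*w^Tx^i) = -\sum_{i=1}^{m}log(1+exp(-y^i*w^Tx^i)) $$

Notice that the negative log-likelihood we just derived is, up to the factor $\frac{1}{m}$, the logistic loss $L(w)$ we looked at above. Maximizing the likelihood is therefore the same as minimizing $L(w)$.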
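
As a quick numerical check of this relationship, the sketch below computes the likelihood as a product of sigmoids and compares its negative logarithm with $m \cdot L(w)$. The toy data, the weight vector, and the helper names are assumptions for illustration; labels are in $\{-1,+1\}$.

```python
import math

def dot(w, x):
    return sum(wj * xj for wj, xj in zip(w, x))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logistic_loss(z):
    return math.log1p(math.exp(-z))

# Hypothetical training set and weight vector.
X = [[1.0, 2.0], [-1.0, 0.5], [0.5, -1.5]]
Y = [1, -1, -1]
w = [0.3, -0.2]
m = len(X)

# Likelihood(w) = prod_i g(y^i * w^T x^i)
likelihood = math.prod(sigmoid(y * dot(w, x)) for x, y in zip(X, Y))

# L(w) = (1/m) * sum_i log(1 + exp(-y^i * w^T x^i))
L = sum(logistic_loss(y * dot(w, x)) for x, y in zip(X, Y)) / m

# The two printed numbers agree up to floating-point error.
print(-math.log(likelihood), m * L)
```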

## ***1.10.3 Minimizing Logistic Loss***

#### **New Goal: Minimizing L(w)**

Since the goal of finding the logistic loss function was to find a convex and differentiable function, and we achieved that, we can now use **gradient descent** on logistic loss. This is **logistic regression**.

#### **Computing the gradient of L(w)**
1. Finding $ \phi_{logistic}'(z)$:
$$ \phi_{logistic}(z) = log(1+e^{-z}) $$
$$ \phi_{logistic}'(z) = \frac{-e^{-z}}{1+e^{-z}} = \frac{-1}{1+e^z} = -g(-z)$$
2. Finding the partial derivative of $ \phi_{logistic}(y*w^Tx)$ with respect to $w_k$:
$$ \frac{\partial \;\phi_{logistic}(y*w^Tx)}{\partial\; w_k} = -g(-y*w^Tx)*y*x_k$$

Thus, for the entire dataset, the gradient is:

$$\nabla L(w) = -\frac{1}{m}\sum_{i=1}^m g(-y^i*w^Tx^i)*y^i*x^i$$

With this formula, we can directly apply gradient descent, $w \gets w-\eta\nabla L(w)$. This precisely tells us how to find the max likelihood $w$.
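
The update rule can be sketched end to end in plain Python. Everything specific here is an assumption for illustration: the toy dataset (with a constant 1.0 appended to each $x$ as a bias feature), the step size `eta=0.1`, the fixed number of steps, and the helper names. Labels are in $\{-1,+1\}$.

```python
import math

def dot(w, x):
    return sum(wj * xj for wj, xj in zip(w, x))

def sigmoid(z):
    """g(z) = 1 / (1 + e^{-z}), written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def logistic_loss(z):
    """log(1 + e^{-z}), written to avoid overflow for very negative z."""
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))

def loss(w, X, Y):
    """L(w) = (1/m) * sum_i log(1 + exp(-y^i * w^T x^i))."""
    return sum(logistic_loss(y * dot(w, x)) for x, y in zip(X, Y)) / len(X)

def gradient(w, X, Y):
    """grad L(w) = -(1/m) * sum_i g(-y^i * w^T x^i) * y^i * x^i."""
    m = len(X)
    grad = [0.0] * len(w)
    for x, y in zip(X, Y):
        coeff = -sigmoid(-y * dot(w, x)) * y / m
        for k, xk in enumerate(x):
            grad[k] += coeff * xk
    return grad

def gradient_descent(X, Y, eta=0.1, steps=500):
    """Repeat the update w <- w - eta * grad L(w), starting from w = 0."""
    w = [0.0] * len(X[0])
    for _ in range(steps):
        grad = gradient(w, X, Y)
        w = [wk - eta * gk for wk, gk in zip(w, grad)]
    return w

# Hypothetical data: label +1 when the first feature exceeds the second.
X = [[2.0, 1.0, 1.0], [1.0, 3.0, 1.0], [3.0, 0.5, 1.0], [0.5, 2.0, 1.0]]
Y = [1, -1, 1, -1]

w = gradient_descent(X, Y)
print("loss at w = 0:    ", loss([0.0, 0.0, 0.0], X, Y))
print("loss after descent:", loss(w, X, Y))
print("margins y * w^T x: ", [round(y * dot(w, x), 3) for x, y in zip(X, Y)])
```

A fixed step size and step count keep the sketch short; in practice one would usually stop when the gradient norm or the change in loss falls below a tolerance.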

---
#### **Multinomial logistic regression**

What happens if we have multiple labels for y? Instead of $y \in \{0,1\}$, what if $y \in \{1, ..., k\}$?

$$ Pr[y=1|x] \propto e^{(w^1)^Tx}$$
$$ Pr[y=j|x] \propto e^{(w^j)^Tx}$$
$$ Pr[y=k|x] = 1-\sum_{i=1}^{k-1} Pr[y=i|x]$$

#### **Cross-Entropy Loss**

What is the associated loss with multinomial regression? Cross-entropy loss is a generalization of logistic loss.

Imagine $Y$ is a vector of length $k$ with entries $y_1, ..., y_k$, with a 1 in the $j^{th}$ position and 0 elsewhere, if the correct label is $j$. (This is called **one-hot encoding** of labels). Let's say our guess for the probability that $y$ has the label $i$ is $P_i$. The cross-entropy loss is then

$$ CrossEntropy(Y, P) = -\sum_{i=1}^k y_i\;log(P_i) $$
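
A small sketch of cross-entropy with a one-hot label, in plain Python. The probability vectors and the function name `cross_entropy` are assumptions for illustration. The last lines check the two-class case, where the loss coincides with the logistic loss $log(1+e^{-z})$.

```python
import math

def cross_entropy(y_onehot, probs):
    """-sum_i y_i * log(P_i); entries with y_i = 0 contribute nothing."""
    return -sum(y * math.log(p) for y, p in zip(y_onehot, probs) if y != 0)

y = [0, 1, 0]                        # correct label is class 2 of 3
confident = [0.1, 0.8, 0.1]          # hypothetical guess favoring class 2
unsure = [0.6, 0.2, 0.2]             # hypothetical guess favoring class 1
print(cross_entropy(y, confident))   # equals -log(0.8)
print(cross_entropy(y, unsure))      # equals -log(0.2), a larger loss

# Two-class special case: with P = (g(z), 1 - g(z)) and the first class
# correct, cross-entropy equals the logistic loss log(1 + e^{-z}).
z = 0.7
g = 1.0 / (1.0 + math.exp(-z))
print(cross_entropy([1, 0], [g, 1.0 - g]), math.log1p(math.exp(-z)))
```

Only the probability assigned to the correct class enters the loss, so the loss is small exactly when that probability is close to 1.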

#### **Softmax**

Softmax turns real values into probabilities, much as we used the sigmoid function to map $w^Tx$ to a probability.

Softmax takes a vector of $k$ real-valued coordinates and maps them to a vector of probabilities by:

$$ (z_1, ..., z_k) \leadsto \left(\frac{e^{z_1}}{Z}, \frac{e^{z_2}}{Z}, ..., \frac{e^{z_k}}{Z}\right) $$

Where $Z = \sum_{i=1}^{k}e^{z_i}$. Since we divided all the values in the vector by $Z$ (normalized the vector), all the values will add to 1.

Thus, performing a softmax really corresponds to taking some length-$k$ vector of real values and mapping it to probabilities between 0 and 1 that all sum to 1. In other words, it takes real-valued scores and transforms them into probabilities that correspond to guesses for what the class label should be.
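
A minimal softmax sketch in plain Python, applied to scores from per-class weight vectors as in the multinomial model. The weight vectors `W`, the input `x`, and the helper names are assumptions for illustration. Subtracting the largest score before exponentiating is a standard trick that leaves the output unchanged while avoiding overflow.

```python
import math

def softmax(scores):
    """Map k real-valued scores to k probabilities that sum to 1."""
    shift = max(scores)                       # does not change the result
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)                         # the normalizer
    return [e / total for e in exps]

def dot(w, x):
    return sum(wj * xj for wj, xj in zip(w, x))

# Hypothetical per-class weight vectors w^1, w^2, w^3 and one input x.
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
x = [2.0, 0.5]

scores = [dot(wj, x) for wj in W]             # one real-valued score per class
probs = softmax(scores)
print("scores:", scores)
print("probabilities:", [round(p, 4) for p in probs])
print("predicted class:", probs.index(max(probs)) + 1)
```

The class with the largest score always receives the largest probability, so the predicted label is the same whether one compares scores or probabilities.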

# Personal Notes #