# 3. Logistic Regression

### Introduction

In a binary classification problem, the variable we are trying to predict, $y$, takes one of two values: 0 or 1.

$ y \in \{0,1\}$  with 0 = **Negative class** | 1 = **Positive class**

**Note**: In a multiclass classification problem, our variable can assume multiple values.  

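In the simplest case, the **binary classification problem**, linear regression is a poor fit because its output can fall below 0 or above 1. Logistic regression instead constrains the hypothesis to lie between 0 and 1: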

$ 0 \le h_\theta(x) \le 1$ 

### Hypothesis Representation

$ h_\theta(x) = g(\theta^T x) $  

where g(z) is the **sigmoid function** or **logistic function**:

$g(z) = \frac{1}{1+e^{-z}}$, so that $h_\theta(x) = \frac{1}{1+e^{-\theta^T x}}$
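
As a minimal sketch of the definitions above, the sigmoid and the hypothesis can be written with NumPy. The function names `sigmoid` and `hypothesis` are illustrative choices, and the sketch assumes `x[0] = 1` so that $\theta_0$ acts as the intercept.

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z)), applied element-wise."""
    return 1.0 / (1.0 + np.exp(-np.asarray(z, dtype=float)))

def hypothesis(theta, x):
    """h_theta(x) = g(theta^T x) for one feature vector x.

    Assumption: x[0] == 1, so theta[0] is the intercept term.
    """
    return sigmoid(np.dot(theta, x))

print(sigmoid(0.0))            # 0.5, the midpoint of the curve
print(sigmoid([-10.0, 10.0]))  # very close to 0 and very close to 1
```

Whatever the value of $\theta^T x$, the output stays strictly between 0 and 1, which is what lets us read it as a probability.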

### Interpretation

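$h_\theta(x)$ = estimated probability that y = 1 on input x, parameterized by $\theta$:  
$h_\theta(x) = P(y=1 \mid x; \theta)$

Since y can only be 0 or 1, $P(y=0 \mid x; \theta) = 1 - h_\theta(x)$.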

### Decision Boundary

Suppose we predict:  

$h_\theta(x) \ge 0.5 \to y = 1$  

$h_\theta(x) < 0.5 \to y = 0$  

By design of our logistic function:

$z \ge 0 \to g(z) \ge 0.5 $

Therefore:

$\theta^Tx \ge 0 \to  h_{\theta}(x) = g(\theta^Tx) \ge 0.5 $  

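The **decision boundary** is the set of points where $\theta^T x = 0$ (equivalently $h_\theta(x) = 0.5$). It separates the region where we predict y = 1 from the region where we predict y = 0.   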
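
To make the rule concrete, here is a small sketch that predicts directly from the sign of $\theta^T x$. The parameter vector and the three sample points are hypothetical values chosen for illustration; with $\theta = [-3, 1, 1]$ the boundary is the line $x_1 + x_2 = 3$.

```python
import numpy as np

def predict(theta, X):
    """Return 1 where theta^T x >= 0 (equivalently h_theta(x) >= 0.5), else 0."""
    return (X @ theta >= 0).astype(int)

# Hypothetical parameters: predict y = 1 when -3 + x1 + x2 >= 0.
theta = np.array([-3.0, 1.0, 1.0])

# Each row is [1, x1, x2]; the leading 1 is the intercept feature.
X = np.array([
    [1.0, 1.0, 1.0],   # x1 + x2 = 2, below the boundary
    [1.0, 2.0, 1.0],   # x1 + x2 = 3, exactly on the boundary
    [1.0, 3.0, 2.0],   # x1 + x2 = 5, above the boundary
])

print(predict(theta, X))   # [0 1 1]
```

Note that the sigmoid never has to be evaluated to make a prediction: checking the sign of $\theta^T x$ is enough.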

### Cost Function

**Recap: Linear Regression**

$Cost(h_{\theta}(x),y) = \frac{1}{2}(h_{\theta}(x) - y)^2$

**Logistic Regression**

$Cost(h_{\theta}(x),y) =
\begin{cases}
-\log(h_{\theta}(x)) & y = 1 \\
-\log(1-h_{\theta}(x)) & y = 0
\end{cases}$

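**Note**: we cannot reuse the squared-error cost from linear regression. Because $h_\theta(x)$ is now the non-linear sigmoid of $\theta^T x$, the resulting $J(\theta)$ would not be _convex_, and gradient descent could get stuck in one of many local optima.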

### Simplified Cost Function

$Cost(h_{\theta}(x), y) = -y \log(h_{\theta}(x)) - (1-y) \log(1-h_{\theta}(x))$

$ J(\theta) = -\frac{1}{m} \sum_{i=1}^m \left[ y^{(i)} \log h_{\theta}(x^{(i)}) + (1-y^{(i)}) \log (1- h_{\theta}(x^{(i)})) \right]$

To fit parameters $\theta$:

**min** $ J(\theta)$ 
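
The cost $J(\theta)$ can be sketched in vectorized form as follows. The toy dataset and the clipping constant `eps` are assumptions added for illustration; the clipping only guards against evaluating the logarithm at 0 and is not part of the formula itself.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    """Average logistic cost over m examples: -(1/m) * sum(y*log(h) + (1-y)*log(1-h))."""
    m = y.size
    h = sigmoid(X @ theta)
    eps = 1e-12                      # assumption: clip to avoid log(0)
    h = np.clip(h, eps, 1 - eps)
    return -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / m

# Hypothetical toy data: rows are [1, x1], labels are 0 or 1.
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 3.0], [1.0, 4.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

print(cost(np.zeros(2), X, y))            # log(2), about 0.693: h = 0.5 everywhere
print(cost(np.array([-5.0, 2.0]), X, y))  # lower, since this theta separates the classes
```

With $\theta = 0$ every prediction is 0.5, so each example contributes $-\log(0.5) = \log 2$ regardless of its label.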

### Gradient Descent

Repeat {

$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta) = \theta_j - \frac{\alpha}{m} \sum_{i=1}^m (h_\theta (x^{(i)}) - y^{(i)}) x_j^{(i)} $ 

_simultaneously update all $\theta_j$_
    
}

**Note**: this update rule looks exactly like the one we had for linear regression, but keep in mind that the hypothesis has changed: $h_\theta(x)$ is now $g(\theta^T x)$ rather than $\theta^T x$.
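
A minimal sketch of batch gradient descent for logistic regression, using the gradient of the averaged cost $J(\theta)$. The learning rate, iteration count, and toy data are assumptions chosen for illustration, not recommended settings.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, alpha=0.1, num_iters=5000):
    """Minimize the logistic cost J(theta) with batch gradient descent."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(num_iters):
        h = sigmoid(X @ theta)          # predictions for all m examples
        grad = X.T @ (h - y) / m        # dJ/dtheta_j for every j at once
        theta = theta - alpha * grad    # simultaneous update of all theta_j
    return theta

# Hypothetical toy data: rows are [1, x1], labels are 0 or 1.
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 3.0], [1.0, 4.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

theta = gradient_descent(X, y)
print(theta)
print((sigmoid(X @ theta) >= 0.5).astype(int))  # predictions on the training set
```

Computing the whole gradient vector before touching `theta` is what makes the update simultaneous: no $\theta_j$ is updated using an already-modified value of another component.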

### Advanced Optimization

Gradient descent is only one of many optimization algorithms that can minimize $J(\theta)$. Others include:
* Conjugate gradient
* BFGS 
* L-BFGS

**Advantages**:

1. No need to pick $\alpha$ manually
2. Often converge faster than gradient descent

**Disadvantages**:

1. More complex to implement and debug, so they are usually taken from a library
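
In practice these algorithms are rarely written by hand. As one possible sketch, SciPy's `scipy.optimize.minimize` can run BFGS on the logistic cost if we hand it a function that returns both the cost and its gradient (`jac=True`). The helper name `cost_and_grad` and the toy data are assumptions; the data is deliberately not perfectly separable so that a finite minimum exists.

```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_and_grad(theta, X, y):
    """Return (J(theta), gradient of J) for logistic regression."""
    m = y.size
    h = np.clip(sigmoid(X @ theta), 1e-12, 1 - 1e-12)  # assumption: avoid log(0)
    J = -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / m
    grad = X.T @ (h - y) / m
    return J, grad

# Hypothetical toy data with overlapping classes: rows are [1, x1].
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 2.0],
              [1.0, 2.5], [1.0, 3.0], [1.0, 4.0]])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])

# jac=True tells SciPy that cost_and_grad returns the gradient as well.
result = minimize(cost_and_grad, x0=np.zeros(2), args=(X, y),
                  jac=True, method="BFGS")

print(result.x)    # fitted theta
print(result.fun)  # final value of J(theta)
```

No learning rate appears anywhere in this sketch: the optimizer chooses its own step sizes through a line search.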

### Multiclass classification

#### One vs. all classification

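To solve a multiclass classification problem, we can split it into individual binary classification problems, each separating one class from all the others. 

We therefore need $K$ classifiers, where $K$ is the number of classes in our dataset. Classifier $k$ estimates $h_\theta^{(k)}(x) = P(y = k \mid x; \theta)$, and for a new input $x$ we predict the class whose classifier outputs the highest probability.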
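
A minimal sketch of one-vs-all, assuming classes are labelled 0, 1, 2 and each binary classifier is trained with plain gradient descent. The function names, hyperparameters, and the three-cluster toy dataset are all hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_binary(X, y, alpha=0.1, num_iters=5000):
    """Train one logistic regression classifier with batch gradient descent."""
    theta = np.zeros(X.shape[1])
    for _ in range(num_iters):
        theta = theta - alpha * (X.T @ (sigmoid(X @ theta) - y)) / y.size
    return theta

def fit_one_vs_all(X, labels, num_classes):
    """Train one classifier per class; row k separates class k from the rest."""
    return np.array([fit_binary(X, (labels == k).astype(float))
                     for k in range(num_classes)])

def predict_one_vs_all(Theta, X):
    """Pick, for each example, the class whose classifier is most confident."""
    return np.argmax(sigmoid(X @ Theta.T), axis=1)

# Hypothetical toy data: rows are [1, x1, x2], three well-separated clusters.
X = np.array([
    [1, 0, 0], [1, 1, 0], [1, 0, 1],   # class 0
    [1, 4, 0], [1, 5, 1], [1, 4, 1],   # class 1
    [1, 0, 4], [1, 1, 5], [1, 1, 4],   # class 2
], dtype=float)
labels = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])

Theta = fit_one_vs_all(X, labels, num_classes=3)
print(predict_one_vs_all(Theta, X))   # predicted class for each training example
```

Each row of `Theta` is an ordinary binary logistic regression model; the only multiclass-specific step is the `argmax` at prediction time.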