# Logistic regression

<a href="https://nbviewer.jupyter.org/github/hongjiaherng/ML-Collections/blob/main/just4funml/notes/note_logistic_regression.ipynb" 
   target="_parent">
   <img align="left" 
      src="https://raw.githubusercontent.com/jupyter/design/master/logos/Badges/nbviewer_badge.svg" 
      width="109" height="20">
</a>

## 1. Brief explanation

`Logistic regression` is a binary classifier that classifies an instance based on the estimated probability that it belongs to the positive class (the two classes are labeled, say, 0 and 1). It assigns an instance to the positive class if that probability is >= 0.5; otherwise it assigns the instance to the negative class.
<br><br>

In fact, we can customize the classification threshold to meet specific needs. For example, if we set the threshold to 0.7, then only instances with a probability of 0.7 or above are classified as positive, and the rest are classified as negative.
<br><br>

Logistic regression draws a decision boundary that separates two classes of data. To use it for multiclass classification, we need a technique such as `One-vs-all` (aka `One-vs-rest`). In simple terms, it trains K different models on the data, one per class, where each model learns to recognize one particular class against all the others.
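The one-vs-rest idea can be sketched as follows. This is a minimal illustration rather than a reference implementation: `fit_binary`, `fit_one_vs_rest`, and `predict_one_vs_rest` are hypothetical helper names, the toy data is made up, and plain batch gradient descent with a fixed learning rate is assumed as the training method.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_binary(X_with_bias, y, lr=0.1, n_iters=2000):
    """Train one binary logistic regression model with batch gradient descent."""
    m, n = X_with_bias.shape
    theta = np.zeros(n)
    for _ in range(n_iters):
        y_proba = sigmoid(X_with_bias @ theta)
        gradients = (-1 / m) * (X_with_bias.T @ (y - y_proba))
        theta -= lr * gradients
    return theta

def fit_one_vs_rest(X_with_bias, y, n_classes):
    """Train K models; model k treats class k as positive and every other class as negative."""
    return np.array([fit_binary(X_with_bias, (y == k).astype(float))
                     for k in range(n_classes)])

def predict_one_vs_rest(thetas, X_with_bias):
    """Pick, for each instance, the class whose model reports the highest probability."""
    probas = sigmoid(X_with_bias @ thetas.T)  # shape (m, K)
    return np.argmax(probas, axis=1)

# Toy data (made up): three clusters in 2-D, each linearly separable from the other two
X = np.array([[0.0, 2.0], [0.5, 2.5],
              [-2.0, -2.0], [-2.5, -1.5],
              [2.0, -2.0], [2.5, -1.5]])
y = np.array([0, 0, 1, 1, 2, 2])
X_with_bias = np.c_[np.ones(len(X)), X]  # prepend the bias feature x_0 = 1

thetas = fit_one_vs_rest(X_with_bias, y, n_classes=3)  # one row of parameters per class
y_pred = predict_one_vs_rest(thetas, X_with_bias)
```

Each row of `thetas` is an ordinary binary model, so the multiclass prediction is just a comparison of K probabilities.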


## 2. Model hypothesis
- In this part, we use a single instance $\mathbf{x}$ (with $x_0 = 1$ as the bias feature) to describe the model hypothesis

First, understand that:<br><br>
$
\mathbf{x} = 
\begin{bmatrix}
x_0 \\
x_1 \\
\vdots \\
x_n \\
\end{bmatrix}
,
\theta =
\begin{bmatrix}
\theta_0 \\
\theta_1 \\
\vdots \\
\theta_n \\
\end{bmatrix}
,
y \in \lbrace 0, 1 \rbrace
$

### Sigmoid function
- We plug the weighted sum of the features, $\mathbf{z}$, into the `sigmoid function`, which maps $\mathbf{z}$ to a value between 0 and 1

$
\mathbf{z} = \theta^\top \mathbf{x}
$
<br>
$
\sigma(\mathbf{z}) = \frac{1}{1 + \mathbf{e}^{-\mathbf{z}}}
$
<br><br>

$
h_\theta(\mathbf{x}) = \sigma(\mathbf{\theta^\top \mathbf{x}}) = \frac{1}{1 + \mathbf{e}^{-\theta^\top \mathbf{x}}}
$
, where
$
0 < h_\theta(\mathbf{x}) < 1
$
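The hypothesis can be written in a few lines of NumPy. This is a minimal sketch: the function names `sigmoid` and `hypothesis` and the parameter values are made up for illustration, and $\mathbf{x}$ is assumed to already contain the bias feature $x_0 = 1$.

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued z to a value between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(theta, x):
    """h_theta(x) = sigmoid(theta^T x); x is assumed to include x_0 = 1."""
    return sigmoid(theta @ x)

theta = np.array([-1.0, 2.0])  # made-up parameters [theta_0, theta_1]
x = np.array([1.0, 0.5])       # x_0 = 1 (bias), x_1 = 0.5

h = hypothesis(theta, x)       # z = -1 + 2 * 0.5 = 0, so h = sigmoid(0) = 0.5
```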

### Prediction
- The model predicts 1 if the hypothesis is greater than or equal to a threshold value (0.5 here), otherwise 0

$
\hat{y} = 
\left\{\begin{matrix}
1 & \text{if } h_\theta(\mathbf{x}) \geq 0.5 \\ 
0 & \text{if } h_\theta(\mathbf{x}) < 0.5
\end{matrix}\right.
$
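The prediction rule, including a custom threshold, might look like this. The helper name `predict` and the probability values are illustrative assumptions.

```python
import numpy as np

def predict(y_proba, threshold=0.5):
    """Return 1 where the predicted probability reaches the threshold, else 0."""
    return (y_proba >= threshold).astype(int)

y_proba = np.array([0.2, 0.5, 0.65, 0.9])      # made-up values of h_theta(x)

default_pred = predict(y_proba)                # array([0, 1, 1, 1])
strict_pred = predict(y_proba, threshold=0.7)  # array([0, 0, 0, 1])
```

Raising the threshold makes the model more conservative about predicting the positive class: here the instance with probability 0.65 flips from 1 to 0.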

## 3. Cost function

$
J(\theta) = - \frac{1}{m}\sum\limits_{i=1}^m \left [ y^{(i)} \log (h_\theta(x^{(i)})) + (1-y^{(i)}) \log (1 - h_\theta(x^{(i)})) \right ]
$

- Why this cost function? (Try substituting values of $h_\theta(x^{(i)})$ between 0 and 1)
    - If $y^{(i)} = 1$, only the $\log (h_\theta(x^{(i)}))$ term remains and the second term vanishes, so
        - if we predict a high $h_\theta(x^{(i)})$, the cost is low, because the instance does in fact belong to the positive class
        - if we predict a low $h_\theta(x^{(i)})$, the cost is high, because the instance belongs to the positive class but the model says otherwise
    - If $y^{(i)} = 0$, only the $\log (1 - h_\theta(x^{(i)}))$ term remains and the first term vanishes, so
        - if we predict a low $h_\theta(x^{(i)})$, the cost is low, because the instance does in fact belong to the negative class
        - if we predict a high $h_\theta(x^{(i)})$, the cost is high, because the instance belongs to the negative class but the model says otherwise
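The substitution suggested above can be tried directly. This is a small sketch; `instance_cost` is a hypothetical helper that evaluates the bracketed term of the cost function (with the leading minus sign) for a single instance.

```python
import numpy as np

def instance_cost(y, h):
    """Cost contributed by one instance with label y and predicted probability h."""
    return -(y * np.log(h) + (1 - y) * np.log(1 - h))

h_values = np.array([0.01, 0.5, 0.99])

cost_if_positive = instance_cost(1, h_values)  # shrinks as h approaches 1
cost_if_negative = instance_cost(0, h_values)  # grows as h approaches 1

print(cost_if_positive)
print(cost_if_negative)
```

For a positive instance the cost falls from about 4.61 at $h = 0.01$ to about 0.01 at $h = 0.99$; for a negative instance the pattern is mirrored, so a confident wrong prediction is penalized heavily in both cases.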

### Minimizing $J(\theta)$
- The following gradient vector is what we need to compute

$
\nabla_{\theta} J(\theta) = 
\begin{bmatrix}
\dfrac{\partial}{\partial\theta_0} J(\theta) \\
\dfrac{\partial}{\partial\theta_1} J(\theta) \\
\vdots \\
\dfrac{\partial}{\partial\theta_n} J(\theta) \\
\end{bmatrix}
$

### Compute derivative of $J(\theta)$

- First, let's compute the derivative of sigmoid function w.r.t $\mathbf{z}$ so that we can simplify the process later

$
\sigma(\mathbf{z}) = \dfrac{1}{1 + \mathbf{e}^{-z}}
$

$
\begin{align*}
\dfrac{d\sigma(\mathbf{z})}{d\mathbf{z}} &=
\dfrac{d}{d\mathbf{z}} \left( 1 + e^{-\mathbf{z}} \right)^{-1} \\ &=
- \dfrac{1}{(1 + e^{-\mathbf{z}})^{2}} (e^{-\mathbf{z}}) \dfrac{d}{d\mathbf{z}} \left(-\mathbf{z}\right) \\ &=
- \dfrac{e^{-\mathbf{z}}}{(1 + e^{-\mathbf{z}})^{2}} \left(-1\right) \\ &=
\dfrac{ e^{-\mathbf{z}} }{ 1 + e^{-\mathbf{z}} } \left(\dfrac{ 1 }{ 1 + e^{-\mathbf{z}} }\right) \\ &=
\sigma(\mathbf{z}) \dfrac{1 + e^{-\mathbf{z}} - 1 }{1 + e^{-\mathbf{z}}}\\ &=
\sigma(\mathbf{z})\left( 1- \sigma(\mathbf{z}) \right)
\end{align*}
$
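The identity $\sigma'(\mathbf{z}) = \sigma(\mathbf{z})(1 - \sigma(\mathbf{z}))$ can be sanity-checked numerically. The sketch below compares it with a central finite difference; the step size `eps` and the grid of test points are arbitrary choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5, 5, 11)  # a few test points
eps = 1e-6                  # finite-difference step (arbitrary small value)

analytic = sigmoid(z) * (1 - sigmoid(z))
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)

max_abs_diff = np.max(np.abs(analytic - numeric))
print(max_abs_diff)  # expected to be tiny if the derivation is right
```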

- Compute the partial derivative of cost function w.r.t. $\theta_j$

$
J(\theta) = - \frac{1}{m}\sum\limits_{i=1}^m \left [ y^{(i)} \log (h_\theta(x^{(i)})) + (1-y^{(i)}) \log (1 - h_\theta(x^{(i)})) \right ]
$

$
\begin{align*}\frac{\partial}{\partial \theta_j} J(\theta) 
&= 
\frac{\partial}{\partial \theta_j} \frac{-1}{m}\sum_{i=1}^m \left [ y^{(i)} \log (h_\theta(\mathbf{x}^{(i)})) + (1-y^{(i)}) \log (1 - h_\theta(\mathbf{x}^{(i)})) \right ] 
\newline&= 
- \frac{1}{m}\sum_{i=1}^m \left [     y^{(i)} \frac{\partial}{\partial \theta_j} \log (h_\theta(\mathbf{x}^{(i)}))   + (1-y^{(i)}) \frac{\partial}{\partial \theta_j} \log (1 - h_\theta(\mathbf{x}^{(i)}))\right ] 
\newline&= 
- \frac{1}{m}\sum_{i=1}^m \left [     \frac{y^{(i)} \frac{\partial}{\partial \theta_j} h_\theta(\mathbf{x}^{(i)})}{h_\theta(\mathbf{x}^{(i)})}   + \frac{(1-y^{(i)})\frac{\partial}{\partial \theta_j} (1 - h_\theta(\mathbf{x}^{(i)}))}{1 - h_\theta(\mathbf{x}^{(i)})}\right ] 
\newline&= 
- \frac{1}{m}\sum_{i=1}^m \left [     \frac{y^{(i)} \frac{\partial}{\partial \theta_j} \sigma(\theta^T \mathbf{x}^{(i)})}{h_\theta(\mathbf{x}^{(i)})}   + \frac{(1-y^{(i)})\frac{\partial}{\partial \theta_j} (1 - \sigma(\theta^T \mathbf{x}^{(i)}))}{1 - h_\theta(\mathbf{x}^{(i)})}\right ] 
\newline&= 
- \frac{1}{m}\sum_{i=1}^m \left [     \frac{y^{(i)} \sigma(\theta^T \mathbf{x}^{(i)}) (1 - \sigma(\theta^T \mathbf{x}^{(i)})) \frac{\partial}{\partial \theta_j} \theta^T \mathbf{x}^{(i)}}{h_\theta(\mathbf{x}^{(i)})}   + \frac{- (1-y^{(i)}) \sigma(\theta^T \mathbf{x}^{(i)}) (1 - \sigma(\theta^T \mathbf{x}^{(i)})) \frac{\partial}{\partial \theta_j} \theta^T \mathbf{x}^{(i)}}{1 - h_\theta(\mathbf{x}^{(i)})}\right ] 
\newline&= 
- \frac{1}{m}\sum_{i=1}^m \left [     \frac{y^{(i)} h_\theta(\mathbf{x}^{(i)}) (1 - h_\theta(\mathbf{x}^{(i)})) \frac{\partial}{\partial \theta_j} \theta^T \mathbf{x}^{(i)}}{h_\theta(\mathbf{x}^{(i)})}   - \frac{(1-y^{(i)}) h_\theta(\mathbf{x}^{(i)}) (1 - h_\theta(\mathbf{x}^{(i)})) \frac{\partial}{\partial \theta_j} \theta^T \mathbf{x}^{(i)}}{1 - h_\theta(\mathbf{x}^{(i)})}\right ] 
\newline&= 
- \frac{1}{m}\sum_{i=1}^m \left [     y^{(i)} (1 - h_\theta(\mathbf{x}^{(i)})) \mathbf{x}^{(i)}_j - (1-y^{(i)}) h_\theta(\mathbf{x}^{(i)}) \mathbf{x}^{(i)}_j\right ] \newline&= 
- \frac{1}{m}\sum_{i=1}^m \left [     y^{(i)} (1 - h_\theta(\mathbf{x}^{(i)})) - (1-y^{(i)}) h_\theta(\mathbf{x}^{(i)}) \right ] \mathbf{x}^{(i)}_j 
\newline&= 
- \frac{1}{m}\sum_{i=1}^m \left [     y^{(i)} - y^{(i)} h_\theta(\mathbf{x}^{(i)}) - h_\theta(\mathbf{x}^{(i)}) + y^{(i)} h_\theta(\mathbf{x}^{(i)}) \right ] \mathbf{x}^{(i)}_j 
\newline&= 
- \frac{1}{m}\sum_{i=1}^m \left [ y^{(i)} - h_\theta(\mathbf{x}^{(i)}) \right ] \mathbf{x}^{(i)}_j  
\end{align*}
$
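One way to gain confidence in the final expression is a numerical gradient check. The sketch below uses random made-up data and compares the derived gradient with central finite differences of the cost; the helper names `cost` and `gradient`, the data shapes, and the step size are all assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X_with_bias, y):
    """Unregularized cross-entropy cost J(theta)."""
    m = len(y)
    y_proba = sigmoid(X_with_bias @ theta)
    return (-1 / m) * np.sum(y * np.log(y_proba) + (1 - y) * np.log(1 - y_proba))

def gradient(theta, X_with_bias, y):
    """Gradient from the derivation: -(1/m) * sum((y - h) * x_j) for every j."""
    m = len(y)
    y_proba = sigmoid(X_with_bias @ theta)
    return (-1 / m) * (X_with_bias.T @ (y - y_proba))

rng = np.random.default_rng(0)
X_with_bias = np.c_[np.ones(20), rng.normal(size=(20, 3))]  # 20 instances, bias + 3 features
y = rng.integers(0, 2, size=20).astype(float)
theta = rng.normal(size=4)

eps = 1e-6
numeric = np.zeros_like(theta)
for j in range(len(theta)):
    step = np.zeros_like(theta)
    step[j] = eps  # perturb only theta_j
    numeric[j] = (cost(theta + step, X_with_bias, y)
                  - cost(theta - step, X_with_bias, y)) / (2 * eps)

max_abs_diff = np.max(np.abs(gradient(theta, X_with_bias, y) - numeric))
print(max_abs_diff)  # should be close to zero if the two agree
```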

### Summary on cost function

#### Cost function (Unregularized)

$
J(\theta) = - \frac{1}{m}\sum\limits_{i=1}^m \left [ y^{(i)} \log (h_\theta(x^{(i)})) + (1-y^{(i)}) \log (1 - h_\theta(x^{(i)})) \right ]
$

$
\begin{align*}
\frac{\partial}{\partial \theta_j} J(\theta) &=
- \frac{1}{m}\sum_{i=1}^m \left [ y^{(i)} - h_\theta(x^{(i)}) \right ] x^{(i)}_j
\end{align*}
$

```python
# Vectorized implementation
cost = (-1 / m) * np.sum(y * np.log(y_proba_predict) + (1 - y) * np.log(1 - y_proba_predict))
gradients = (-1 / m) * (X_with_bias.T @ (y - y_proba_predict))
```
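To show where names such as `m`, `X_with_bias`, and `y_proba_predict` come from, here is a self-contained sketch that plugs the two vectorized expressions into a batch gradient descent loop. Gradient descent, the learning rate, the iteration count, and the toy data are assumptions made for illustration; the notes above only derive the cost and its gradient.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data (made up): one feature, label is 1 when the feature is positive
X = np.array([[-2.0], [-1.0], [-0.5], [0.5], [1.0], [2.0]])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])

m = len(y)
X_with_bias = np.c_[np.ones(m), X]  # prepend x_0 = 1 to every instance
theta = np.zeros(X_with_bias.shape[1])
learning_rate = 0.5                 # assumed value, not tuned

costs = []
for _ in range(200):
    y_proba_predict = sigmoid(X_with_bias @ theta)
    cost = (-1 / m) * np.sum(y * np.log(y_proba_predict) + (1 - y) * np.log(1 - y_proba_predict))
    gradients = (-1 / m) * (X_with_bias.T @ (y - y_proba_predict))
    theta -= learning_rate * gradients  # step against the gradient
    costs.append(cost)

y_pred = (sigmoid(X_with_bias @ theta) >= 0.5).astype(float)
```

With `theta` initialized to zeros every probability starts at 0.5, so the first recorded cost equals $\log 2$, and the recorded costs should shrink from there as the loop runs.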

#### Cost function with $l_2$ regularization

$
J(\theta) = - \frac{1}{m}\sum\limits_{i=1}^m \left [ y^{(i)} \log (h_\theta(x^{(i)})) + (1-y^{(i)}) \log (1 - h_\theta(x^{(i)})) \right ] + \dfrac{ \lambda }{ 2m } \sum\limits_{j=1}^{n} \theta_j^{2} 
$

$
\begin{align*}
\frac{\partial}{\partial \theta_j} J(\theta) &=
- \frac{1}{m}\sum_{i=1}^m \left [ y^{(i)} - h_\theta(x^{(i)}) \right ] x^{(i)}_j + \dfrac{\lambda}{m}\theta_j 
 \qquad \text{where } j \in \lbrace 1, 2, \ldots, n \rbrace
\end{align*}
$

```python
# Vectorized implementation (alpha plays the role of lambda in the formula above)
cost = (-1 / m) * np.sum(y * np.log(y_proba_predict) + (1 - y) * np.log(1 - y_proba_predict))
l2_loss = (alpha / (2 * m)) * np.sum(np.square(theta[1:]))  # bias term theta[0] is not regularized
cost += l2_loss

gradients = (-1 / m) * (X_with_bias.T @ (y - y_proba_predict)) + (alpha / m) * np.r_[0, theta[1:]]
```

#### Cost function with $l_1$ regularization

$
J(\theta) = - \frac{1}{m}\sum\limits_{i=1}^m \left [ y^{(i)} \log (h_\theta(x^{(i)})) + (1-y^{(i)}) \log (1 - h_\theta(x^{(i)})) \right ] + \dfrac{ \lambda }{ m } \sum\limits_{j=1}^{n} |\theta_j|
$

$
\begin{align*}
\frac{\partial}{\partial \theta_j} J(\theta) &=
- \frac{1}{m}\sum_{i=1}^m \left [ y^{(i)} - h_\theta(x^{(i)}) \right ] x^{(i)}_j + \dfrac{\lambda}{m}\operatorname{sign}(\theta_j) 
 \qquad \text{where } j \in \lbrace 1, 2, \ldots, n \rbrace
\end{align*}
$

```python
# Vectorized implementation (alpha plays the role of lambda; the bias term theta[0] is not regularized)
cost = (-1 / m) * np.sum(y * np.log(y_proba_predict) + (1 - y) * np.log(1 - y_proba_predict))
l1_loss = (alpha / m) * np.sum(np.abs(theta[1:]))
cost += l1_loss

gradients = (-1 / m) * (X_with_bias.T @ (y - y_proba_predict)) + (alpha / m) * np.r_[0, np.sign(theta[1:])]
```

***References:*** 
- [Coursera Machine Learning Week 3 Resource](https://www.coursera.org/learn/machine-learning/resources/Zi29t)