# Sigmoid function

$$S(x) = \frac{1}{1 + e^{-x}}$$
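As a minimal sketch, the formula above can be evaluated in plain Python. The function name `sigmoid` and the split into two branches are choices made for this illustration: the branch for negative inputs uses the algebraically equivalent form $e^{x} / (1 + e^{x})$ so that `math.exp` is never called with a large positive argument.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic sigmoid S(x) = 1 / (1 + e^{-x})."""
    if x >= 0:
        # exp(-x) <= 1 here, so this cannot overflow.
        return 1.0 / (1.0 + math.exp(-x))
    # For negative x, use e^x / (1 + e^x), which is the same value
    # but keeps the argument of exp() non-positive.
    z = math.exp(x)
    return z / (1.0 + z)


print(sigmoid(0.0))  # 0.5
```

The output always lies between $0$ and $1$, and $S(x) + S(-x) = 1$.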

# Logistic Regression

Logistic regression models the probability that an input $\vec{x_i}$ belongs to class $1$ by passing a linear function of the input through the sigmoid:

$$S(f_{\Theta}(\vec{x_i}))=\frac{1}{1 + e^{-(\theta_0 + \theta_1 x^{(i)}_1 + \theta_2 x^{(i)}_2 + \cdots + \theta_n x^{(i)}_n)}}$$

where,

$$f_{\Theta}(\vec{x_i}) = \theta_0 + \theta_1 x^{(i)}_1 + \theta_2 x^{(i)}_2 + \cdots + \theta_n x^{(i)}_n= \vec{\theta}^{T} \vec{x_i}, \quad \vec{\theta}=\begin{bmatrix}
  \theta_0 \\
  \theta_1 \\
  \theta_2 \\
  \vdots \\
  \theta_n
\end{bmatrix}, \quad \vec{x_i}=\begin{bmatrix}
  1 \\
  x^{(i)}_1 \\
  x^{(i)}_2 \\
  \vdots \\
  x^{(i)}_n
\end{bmatrix}$$

There are two cases:

$$\begin{cases}
    \hat{y_i} = 1 & \text{if $S(f_{\Theta}(\vec{x_i})) \geq 0.5$} \\
    \hat{y_i} = 0 & \text{if $S(f_{\Theta}(\vec{x_i})) \lt 0.5$}
\end{cases}$$

where $\hat{y_i}$ is the predicted class for the input $\vec{x_i}$.
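The two cases above can be sketched in a few lines of Python. This is only an illustration: the names `predict` and `threshold` are hypothetical, and the input list `x` is assumed to already carry the leading $1$ that pairs with $\theta_0$.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic sigmoid, written to avoid overflow in exp()."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(theta: list[float], x: list[float], threshold: float = 0.5) -> int:
    """Return 1 if S(theta^T x) >= threshold, else 0.

    Assumes x already starts with the constant 1 for the bias term.
    """
    z = sum(t * v for t, v in zip(theta, x))  # theta^T x
    return 1 if sigmoid(z) >= threshold else 0


theta = [-1.0, 2.0]               # illustrative parameters, not fitted to anything
print(predict(theta, [1.0, 1.0]))  # z = 1, S(z) > 0.5, so class 1
print(predict(theta, [1.0, 0.0]))  # z = -1, S(z) < 0.5, so class 0
```

Because $S(z) \geq 0.5$ exactly when $z \geq 0$, thresholding the probability at $0.5$ is the same as checking the sign of $\vec{\theta}^{T} \vec{x_i}$.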

This is our goal: given training inputs $\vec{x_i} \in \mathbb{R}^{n + 1}$ with labels $y_i \in \{0, 1\}$, we want to find an appropriate $\vec{\theta}$ such that $S(f_{\Theta}(\vec{x_i}))$ is close to $1$ when $y_i = 1$ and close to $0$ when $y_i = 0$, where $0 < S(f_{\Theta}(\vec{x_i})) < 1$.

## Loss function

Considering each data input $\vec{x_i}$ in the training set:

$$\begin{cases}
    P(y_i = 1|\vec{x_i}, \vec{\theta})=S(\vec{\theta}^{T} \vec{x_i}) = p_i \\
    P(y_i = 0|\vec{x_i}, \vec{\theta})= 1-S(\vec{\theta}^{T} \vec{x_i}) = 1-p_i
\end{cases}$$

Hence, the cost for a single training example $(\vec{x_i}, y_i)$ is the negative log of the probability the model assigns to its true label:

$$\begin{cases}
    -\log(p_i) & \text{if $y_i=1$ } \\
    -\log(1-p_i) & \text{if $y_i=0$}
\end{cases}$$

### Why $-\log(x)$


<p align="center">
  <img src="../Images/LogisticLog.png" alt="Plot of y = -log(x)" width=300 height=auto>
</p>

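In the picture above, the function $y=-\log(x)$ grows without bound as $x$ approaches $0$ from the right and falls to $0$ as $x$ approaches $1$. A low probability assigned to the true label is therefore penalized heavily, while a probability close to $1$ costs almost nothing. In other words: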

$$\begin{cases}
    \lim_{{x \to 0^{+}}} -\log(x) = \infty \\
    \lim_{{x \to 1}} -\log(x) = 0
\end{cases}$$

### General Formula for Loss Function

The loss function for the logistic regression model at every input $\vec{x_i}$ is:

$$\mathcal{L}(\vec{\theta}, \vec{x_i}, y_i)=-y_i\log(p_i)-(1-y_i)\log(1-p_i), \text{where}:$$

$$y_i=0, 1; p_i=S(\vec{\theta}^{T}\vec{x_i})$$
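A minimal sketch of this per-example loss in Python follows. The name `log_loss` and the clipping constant `eps` are assumptions added for the illustration: clipping keeps `math.log` away from $0$ when a probability is exactly $0$ or $1$, and is not part of the formula itself.

```python
import math


def log_loss(p: float, y: int, eps: float = 1e-12) -> float:
    """Per-example loss -y*log(p) - (1-y)*log(1-p).

    p is the predicted probability of class 1, y is the true label (0 or 1).
    eps is an illustrative safeguard against log(0), not part of the formula.
    """
    p = min(max(p, eps), 1.0 - eps)
    return -y * math.log(p) - (1 - y) * math.log(1.0 - p)


# A prediction near the true label is cheap, one far from it is expensive.
print(log_loss(0.9, 1))  # small loss
print(log_loss(0.1, 1))  # large loss
```

With $y_i = 1$ only the first term survives, and with $y_i = 0$ only the second, so the single formula reproduces both cases of the piecewise cost.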

## Loss function optimization

### Derivative of the loss function
First, we need the derivative of $\mathcal{L}(\vec{\theta}, \vec{x_i}, y_i)$ with respect to $\vec{\theta}$. The result is stated here; the derivation is given in the "Calculating Derivative" subsection below:

$$\frac{\partial \mathcal{L}(\vec{\theta}, \vec{x_i}, y_i)}{\partial \theta}=(S(\vec{\theta}^{T}\vec{x_i}) - y_i)\vec{x_i}$$
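One way to sanity-check this formula is to compare it with a central finite-difference approximation of the loss. The sketch below does that for one arbitrary example; the helper names (`loss`, `analytic_grad`, `numeric_grad`), the step size `h`, and the sample values are all made up for the illustration.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic sigmoid, written to avoid overflow in exp()."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def loss(theta: list[float], x: list[float], y: int) -> float:
    """Per-example loss -y*log(p) - (1-y)*log(1-p) with p = S(theta^T x)."""
    p = sigmoid(sum(t * v for t, v in zip(theta, x)))
    return -y * math.log(p) - (1 - y) * math.log(1.0 - p)


def analytic_grad(theta: list[float], x: list[float], y: int) -> list[float]:
    """Gradient from the closed form (S(theta^T x) - y) * x."""
    p = sigmoid(sum(t * v for t, v in zip(theta, x)))
    return [(p - y) * v for v in x]


def numeric_grad(theta: list[float], x: list[float], y: int, h: float = 1e-6) -> list[float]:
    """Central finite differences, one coordinate of theta at a time."""
    grad = []
    for j in range(len(theta)):
        up = list(theta)
        down = list(theta)
        up[j] += h
        down[j] -= h
        grad.append((loss(up, x, y) - loss(down, x, y)) / (2.0 * h))
    return grad


theta = [0.1, -0.4, 0.7]   # arbitrary parameters
x = [1.0, 2.0, -1.5]       # leading 1 is the bias input
y = 1

for a, n in zip(analytic_grad(theta, x, y), numeric_grad(theta, x, y)):
    print(f"analytic={a:.6f}  numeric={n:.6f}")
```

If the closed form is right, the two columns agree to several decimal places.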

### Calculating Derivative

$$\mathcal{L}(\vec{\theta}, \vec{x_i}, y_i)=-y_i\log(p_i)-(1-y_i)\log(1-p_i) \text{, then:}$$

$$\frac{\partial \mathcal{L}(\vec{\theta}, \vec{x_i}, y_i)}{\partial \theta}=\frac{\partial \mathcal{L}(\vec{\theta}, \vec{x_i}, y_i)}{\partial p_i} \cdot \frac{\partial p_i}{\partial \theta}$$

$$\Longleftrightarrow \frac{\partial \mathcal{L}(\vec{\theta}, \vec{x_i}, y_i)}{\partial \theta} = \frac{p_i - y_i}{p_i(1-p_i)} \cdot \frac{\partial}{\partial \theta} S(\vec{\theta}^{T}\vec{x_i})$$

$$\Longleftrightarrow \frac{\partial \mathcal{L}(\vec{\theta}, \vec{x_i}, y_i)}{\partial \theta} = \frac{p_i - y_i}{p_i(1-p_i)} \cdot S(\vec{\theta}^{T}\vec{x_i})(1- S(\vec{\theta}^{T}\vec{x_i}))\vec{x_i}$$


$$\Longleftrightarrow \frac{\partial \mathcal{L}(\vec{\theta}, \vec{x_i}, y_i)}{\partial \theta} = (S(\vec{\theta}^{T}\vec{x_i}) - y_i) \cdot \vec{x_i}$$
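The steps above rely on two intermediate results that are not written out. As a short sketch using the same symbols, with $z = \vec{\theta}^{T}\vec{x_i}$ introduced here only as shorthand, the derivative of the loss with respect to $p_i$ is:

$$\frac{\partial \mathcal{L}(\vec{\theta}, \vec{x_i}, y_i)}{\partial p_i} = -\frac{y_i}{p_i} + \frac{1-y_i}{1-p_i} = \frac{-y_i(1-p_i) + (1-y_i)p_i}{p_i(1-p_i)} = \frac{p_i - y_i}{p_i(1-p_i)}$$

and the derivative of the sigmoid is:

$$\frac{dS(z)}{dz} = \frac{e^{-z}}{(1 + e^{-z})^2} = \frac{1}{1 + e^{-z}} \cdot \frac{e^{-z}}{1 + e^{-z}} = S(z)(1 - S(z))$$

Since $\frac{\partial z}{\partial \theta} = \vec{x_i}$, the chain rule gives $\frac{\partial}{\partial \theta} S(\vec{\theta}^{T}\vec{x_i}) = S(\vec{\theta}^{T}\vec{x_i})(1- S(\vec{\theta}^{T}\vec{x_i}))\vec{x_i}$, and the factor $p_i(1-p_i)$ cancels in the final step.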

### Optimization with Stochastic Gradient Descent (SGD)

Given a dataset of $\textbf{N}$ samples $\vec{x_i}$, Stochastic Gradient Descent finds $\vec{\theta}$ with the following steps:

1. Choose the number of epochs and initialize $\vec{\theta}$ randomly
2. At each epoch:

    2.1 Shuffle $\textbf{N}$ samples
    
    2.2 Iterate over every $\vec{x_i}$ and update $\vec{\theta}$:

    $$\vec{\theta}_{t+1} = \vec{\theta}_{t} - \eta \frac{\partial \mathcal{L}(\vec{\theta}_{t}, \vec{x_i}, y_i)}{\partial \theta}, \text{ where } \eta \text{ is the learning rate and } t \text{ counts the updates}$$

After the last epoch, the final $\vec{\theta}$ is the learned parameter vector.
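The procedure above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not a tuned implementation: the function name `sgd_logistic`, the default `epochs` and `lr` values, the small uniform initialization range, the fixed `seed`, and the toy dataset are all invented for the example.

```python
import math
import random


def sigmoid(z: float) -> float:
    """Logistic sigmoid, written to avoid overflow in exp()."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def sgd_logistic(X, y, epochs=100, lr=0.1, seed=0):
    """Fit theta with per-sample SGD on the logistic loss.

    X is a list of feature lists (without the leading 1), y a list of 0/1 labels.
    Returns theta with theta[0] as the bias term.
    """
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    # Step 1: initialize theta randomly (small values, an arbitrary choice).
    theta = [rng.uniform(-0.01, 0.01) for _ in range(len(X[0]) + 1)]
    indices = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(indices)              # Step 2.1: shuffle the N samples
        for i in indices:                 # Step 2.2: one update per sample
            xi = [1.0] + list(X[i])       # prepend 1 for the bias term
            p = sigmoid(sum(t * v for t, v in zip(theta, xi)))
            # theta <- theta - lr * (S(theta^T x_i) - y_i) * x_i
            theta = [t - lr * (p - y[i]) * v for t, v in zip(theta, xi)]
    return theta


# Toy one-feature data: small values are class 0, large values are class 1.
X = [[0.0], [1.0], [2.0], [4.0], [5.0], [6.0]]
y = [0, 0, 0, 1, 1, 1]

theta = sgd_logistic(X, y, epochs=500, lr=0.1)
preds = [1 if sigmoid(theta[0] + theta[1] * row[0]) >= 0.5 else 0 for row in X]
print(theta, preds)
```

Each update uses the gradient of a single example, $(S(\vec{\theta}^{T}\vec{x_i}) - y_i)\vec{x_i}$, so one epoch performs $\textbf{N}$ updates. In practice the learning rate and the number of epochs would need tuning for the data at hand.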


