# Logistic Regression

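Logistic regression is a binary classifier: given an input $x$, it outputs an estimate of the probability that the label $y$ is 1.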

## Logistic Regression Prediction

To make a prediction with logistic regression, we use the following formulas:

\begin{equation}
\hat{y} = \sigma (w^Tx + b)
\end{equation}

\begin{equation}
\sigma(z) = \frac{1}{1 + e^{-z}}
\end{equation}

$\sigma(z)$: The sigmoid function. Its output always lies strictly between 0 and 1

$\hat{y}^{(i)}$: The predicted label for example $i$

$w$: Vector of weights

$b$: Bias (a real number)

$x^{(i)}$: The $i$th input or training example

${y}^{(i)}$: Ground truth label for example $i$

We want $\hat{y}$ to be as close as possible to $y$, or $\hat{y} \approx y$
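The prediction formula can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function names `sigmoid` and `predict_proba`, the row-per-example layout of `X`, the toy values, and the 0.5 decision threshold are all assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid, 1 / (1 + exp(-z)), applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(w, b, X):
    """Return y_hat = sigmoid(w^T x + b) for every row x of X.

    Assumed layout: X has shape (m, n) with one example per row,
    w has shape (n,), and b is a scalar.
    """
    return sigmoid(X @ w + b)

# Hypothetical toy values, chosen only for illustration.
w = np.array([1.0, -2.0])
b = 0.5
X = np.array([[0.0, 0.0],
              [2.0, 1.0],
              [-1.0, 3.0]])

y_hat = predict_proba(w, b, X)

# Turning y_hat into a hard 0/1 label with a 0.5 threshold is a common
# convention, not something stated in these notes.
labels = (y_hat >= 0.5).astype(int)
print(y_hat, labels)
```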

## Loss (error) function $\mathcal{L}$

\begin{equation*}
\mathcal{L}(\hat{y}^{(i)},y^{(i)}) = -(y^{(i)}\log{\hat{y}^{(i)}} + (1 - y^{(i)}) \log({1 - \hat{y}^{(i)}}))
\end{equation*}

This function will measure how close our output $\hat{y}$ is to the true label $y$.

### Intuition 
We want $\mathcal{L}$ to be as small as possible. 

If $y = 1$ then $\mathcal{L}(\hat{y},y) = - \log \hat{y}$. 

The second term vanishes because $(1 - y) = (1 - 1) = 0$. To make $- \log \hat{y}$ as small as possible, we want $\log \hat{y}$ to be as large as possible, which means we want $\hat{y}$ to be as large as possible. Since the sigmoid function $\sigma (z)$ can never exceed one, the best we can do is push $\hat{y}$ close to 1.

If $y = 0$ then $\mathcal{L}(\hat{y},y) = - \log (1 - \hat{y})$

By similar reasoning, the first term vanishes, and we want $\hat{y}$ to be as small as possible, because that makes $\log (1 - \hat{y})$ large and therefore the loss $- \log (1 - \hat{y})$ small.
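To check the intuition numerically, here is a small sketch that evaluates the loss for a good and a bad prediction in each case. The function name `loss` and the sample probabilities 0.9 and 0.1 are assumptions chosen for illustration.

```python
import numpy as np

def loss(y_hat, y):
    """Cross-entropy loss for one example: -(y log y_hat + (1 - y) log(1 - y_hat))."""
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Case y = 1: the loss shrinks as y_hat approaches 1.
loss_good_pos = loss(0.9, 1)  # prediction close to the true label
loss_bad_pos = loss(0.1, 1)   # prediction far from the true label

# Case y = 0: the loss shrinks as y_hat approaches 0.
loss_good_neg = loss(0.1, 0)
loss_bad_neg = loss(0.9, 0)

print(loss_good_pos, loss_bad_pos)
print(loss_good_neg, loss_bad_neg)
```

In both cases the prediction that is closer to the true label gets the smaller loss.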

Another option is to use the squared error function: $\mathcal{L}(\hat{y},y) = \frac{1}{2}(\hat{y} - y)^2$. However, combined with the sigmoid this produces a non-convex surface, which is bad for gradient descent: it may settle in a local optimum rather than find the global optimum.

## Cost function $J$ used in Logistic Regression

\begin{equation*}
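J(w,b) = \frac{1}{m}\sum_{i=1}^m \mathcal{L}(\hat{y}^{(i)},y^{(i)}) = - \frac{1}{m}\sum_{i=1}^m \left( y^{(i)}\log{\hat{y}^{(i)}} + (1 - y^{(i)}) \log({1 - \hat{y}^{(i)}}) \right)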
\end{equation*}

The cost function is the average of the per-example losses: the sum of $\mathcal{L}$ over all training examples, divided by the number of examples $m$. It measures how well the parameters $w$ and $b$ perform on the training set. The optimization problem is therefore to minimize $J(w,b)$: we want to find the $w$ and $b$ that make $J$ as small as possible.
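As a sketch, the cost can be computed as the mean of the per-example losses. The function name `cost`, the row-per-example layout of `X`, and the toy data are assumptions made for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(w, b, X, y):
    """J(w, b): mean cross-entropy loss over the m rows of X."""
    y_hat = sigmoid(X @ w + b)
    losses = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    return losses.mean()

# Hypothetical toy data: 4 examples with 2 features each.
X = np.array([[0.5, 1.0],
              [-1.5, 2.0],
              [3.0, -0.5],
              [-2.0, -1.0]])
y = np.array([1, 0, 1, 0])

# With all-zero parameters every y_hat is 0.5, so J equals log(2).
J_zero = cost(np.zeros(2), 0.0, X, y)

# Weights that follow the first feature fit these labels better.
J_better = cost(np.array([1.0, 0.0]), 0.0, X, y)

print(J_zero, J_better)
```

Parameters that fit the training labels better give a lower $J$, which is what the optimization exploits.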

## Gradient Descent

_Repeat until convergence:_
\begin{equation}
w = w - \alpha \frac{\partial J(w, b)}{\partial w}
\end{equation}

\begin{equation}
b = b - \alpha \frac{\partial J(w, b)}{\partial b}
\end{equation}

$\alpha$: Learning rate. Controls how large a step we take toward the minimum on each iteration
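The update rule can be sketched as follows. The notes do not derive the partial derivatives, so this sketch relies on the standard results for the cross-entropy cost, $\frac{\partial J}{\partial w} = \frac{1}{m}\sum_{i=1}^m (\hat{y}^{(i)} - y^{(i)})x^{(i)}$ and $\frac{\partial J}{\partial b} = \frac{1}{m}\sum_{i=1}^m (\hat{y}^{(i)} - y^{(i)})$. The function names, the toy data, the value of `alpha`, and the fixed iteration count (used in place of a real convergence check) are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(w, b, X, y):
    """Mean cross-entropy loss J(w, b)."""
    y_hat = sigmoid(X @ w + b)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def gradient_descent(X, y, alpha=0.1, num_iters=1000):
    """Run a fixed number of gradient descent steps, starting from zeros."""
    m, n = X.shape
    w = np.zeros(n)
    b = 0.0
    history = []  # cost after each update, to watch it decrease
    for _ in range(num_iters):
        y_hat = sigmoid(X @ w + b)
        error = y_hat - y
        dw = X.T @ error / m   # dJ/dw
        db = error.mean()      # dJ/db
        w = w - alpha * dw
        b = b - alpha * db
        history.append(cost(w, b, X, y))
    return w, b, history

# Hypothetical 1-feature toy data; the labels overlap, so the classes
# are not perfectly separable.
X = np.array([[-2.0], [-1.0], [-0.5], [0.5], [1.0], [2.0]])
y = np.array([0, 0, 1, 0, 1, 1])

w, b, history = gradient_descent(X, y)
print(w, b, history[0], history[-1])
```

A practical loop would usually stop when the change in $J$ (or the size of the gradient) falls below a tolerance rather than after a fixed number of steps.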

### Initializing $w$
Typically $w$ is initialized to zero. Since the error surface is convex, gradient descent should arrive at the same minimum from any initialization.
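A quick way to see this is to run gradient descent from two different starting points and compare the results. This is a sketch under assumptions: the function name `fit`, the toy data, the learning rate, the iteration count, and the second starting point are all made up for the example, and the gradients are the standard ones for the cross-entropy cost. The toy labels are deliberately not perfectly separable, so that a finite minimum exists.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit(X, y, w, b, alpha=0.5, num_iters=5000):
    """Run gradient descent from the given starting point (w, b)."""
    m = X.shape[0]
    for _ in range(num_iters):
        error = sigmoid(X @ w + b) - y
        w = w - alpha * (X.T @ error) / m
        b = b - alpha * error.mean()
    return w, b

# Hypothetical 1-feature toy data with overlapping labels.
X = np.array([[-2.0], [-1.0], [-0.5], [0.5], [1.0], [2.0]])
y = np.array([0, 0, 1, 0, 1, 1])

# Start once from zeros and once from an arbitrary non-zero point.
w_zero, b_zero = fit(X, y, np.zeros(1), 0.0)
w_other, b_other = fit(X, y, np.array([3.0]), -2.0)

print(w_zero, b_zero)
print(w_other, b_other)
```

Both runs should end at essentially the same $w$ and $b$.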