# Logistic Regression
___

## Phrasing the problem
Given $x$, we want $\hat{y} = P(y = 1 \mid x)$ where $x \in \mathbb{R}^{n_x}$.

For a standard layer, we'll have $\hat{y} = \sigma(\omega^Tx + b)$ - a linear transformation with a sigmoid applied to it. Here $\omega \in \mathbb{R}^{n_x}$ is the weight vector and $b \in \mathbb{R}$ is the bias.

The sigmoid function is $\sigma(z) = \frac{1}{1+e^{-z}}$, where $z$ stands for the scalar input (for us, $\omega^Tx + b$). As $z \rightarrow \infty$, $\sigma(z)$ goes to 1; conversely, $\sigma(z)$ goes to 0 as $z \rightarrow -\infty$.
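As a quick illustration of the prediction step, here is a minimal NumPy sketch. The function names `sigmoid` and `predict_proba` and the numeric values are made up for this example; they are not part of the notes.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid, applied elementwise."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(w, b, x):
    """Return y_hat = sigmoid(w^T x + b) for a single feature vector x."""
    return sigmoid(np.dot(w, x) + b)

# Illustrative values only (n_x = 3); not taken from the notes.
w = np.array([0.5, -0.25, 0.1])
b = 0.0
x = np.array([1.0, 2.0, 3.0])

y_hat = predict_proba(w, b, x)   # a probability strictly between 0 and 1
print(y_hat)
print(sigmoid(10.0), sigmoid(-10.0))  # close to 1 and close to 0
```

Whatever the size of $\omega^Tx + b$, the sigmoid squashes it into $(0, 1)$, which is what lets us read $\hat{y}$ as a probability.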

## The Loss Function

We *could* use the squared error:

$\mathcal{L}(\hat{y}, y) = \frac{1}{2}(\hat{y}-y)^2$

But combined with the sigmoid, this ends up being non-convex in $\omega$ and $b$, which makes the optimum hard to find - gradient descent can get stuck in local optima.

So what we do use is:

$\mathcal{L}(\hat{y}, y) = -(y\log\hat{y} + (1-y)\log(1-\hat{y}))$

Now, when $y=1$ - when that's our target, that is - then the second term disappears:

$\mathcal{L}(\hat{y}, y) = -\log\hat{y}$

So if $\hat{y}$ is near 1, the loss is just about $0$ (that's good for a loss function!), and it grows without bound as $\hat{y}$ approaches 0.

Likewise, when $y=0$ - when our target is 0 - then the function becomes:

$\mathcal{L}(\hat{y}, y) = -\log(1-\hat{y})$

And likewise, the loss is just about $0$ when $\hat{y}$ is near 0; as $\hat{y}$ goes to 1, $1-\hat{y}$ approaches 0 and the loss grows without bound.
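To see both cases numerically, here is a small sketch of the loss for a single example. The name `logistic_loss` and the probe values are chosen for illustration only.

```python
import numpy as np

def logistic_loss(y_hat, y):
    """Loss for one example: -(y*log(y_hat) + (1-y)*log(1-y_hat)), y in {0, 1}."""
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Probe a confident-right, an unsure, and a confident-wrong prediction.
for y_hat in (0.99, 0.5, 0.01):
    print(f"y_hat={y_hat}: loss if y=1 -> {logistic_loss(y_hat, 1):.4f}, "
          f"loss if y=0 -> {logistic_loss(y_hat, 0):.4f}")
```

The loss is small when the prediction agrees with the target and blows up as the prediction becomes confidently wrong. Note that $\hat{y}$ must stay strictly inside $(0, 1)$ here, since $\log 0$ is undefined.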

## The Cost Function

Very good - now we want to define our cost function, $J(\omega,b)$:

$J(\omega,b) = \frac{1}{m}\sum_{i=1}^{m}\mathcal{L}(\hat{y}^{(i)},y^{(i)})$

Which is to say, it's the average of the loss across all $m$ training examples.

The loss function is applied to a **single training example**; the cost function is the **cost of your parameters** over the whole training set.
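A vectorized sketch of the cost, assuming the examples are stacked as columns of a matrix `X` with shape `(n_x, m)` and the targets as a vector `Y` of length `m`. That layout, the function name `cost`, and the toy data are assumptions made for this example.

```python
import numpy as np

def cost(w, b, X, Y):
    """Average logistic loss over m examples.

    X: shape (n_x, m), one example per column (an assumed layout).
    Y: shape (m,), targets in {0, 1}.
    """
    Z = w @ X + b                      # shape (m,)
    Y_hat = 1.0 / (1.0 + np.exp(-Z))   # sigmoid, elementwise
    losses = -(Y * np.log(Y_hat) + (1 - Y) * np.log(1 - Y_hat))
    return losses.mean()               # the 1/m * sum over examples

# Toy data, invented for illustration: n_x = 2, m = 4.
X = np.array([[0.0, 1.0, 2.0, 3.0],
              [1.0, 0.0, 1.0, 0.0]])
Y = np.array([0.0, 0.0, 1.0, 1.0])

J_zero = cost(np.zeros(2), 0.0, X, Y)
print(J_zero)
```

With all parameters at zero, every prediction is $0.5$, so each example contributes $\log 2$ and the cost is $\log 2$ as well.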

## Gradient Descent

So when we calculate the gradients for gradient descent, we start at the end of the computation and work our way back.

For us, this means that we're starting at our loss function and working backwards to $\omega$ and $b$ - and we need to do this across multiple examples.
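Here is a rough sketch of what that backward pass and the resulting updates might look like. The gradient expressions used ($\partial\mathcal{L}/\partial z = \hat{y} - y$ with $z = \omega^Tx + b$, then averaged over the examples) are the standard result for this loss with a sigmoid output; they are not derived in these notes. The learning rate `alpha`, the step count, the `(n_x, m)` data layout, and the toy data are all assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_and_grads(w, b, X, Y):
    """Forward pass, then work backwards from the loss to w and b."""
    A = sigmoid(w @ X + b)                                  # y_hat per example
    J = -np.mean(Y * np.log(A) + (1 - Y) * np.log(1 - A))   # cost
    dZ = A - Y                   # dL/dz for each example (standard result)
    dw = X @ dZ / X.shape[1]     # dJ/dw, averaged over the m examples
    db = dZ.mean()               # dJ/db, averaged over the m examples
    return J, dw, db

def gradient_descent(X, Y, alpha=0.1, steps=200):
    """Plain batch gradient descent; alpha and steps are illustrative."""
    w = np.zeros(X.shape[0])
    b = 0.0
    history = []
    for _ in range(steps):
        J, dw, db = cost_and_grads(w, b, X, Y)
        history.append(J)
        w = w - alpha * dw
        b = b - alpha * db
    return w, b, history

# Toy data, invented for illustration: n_x = 2, m = 4.
X = np.array([[0.0, 1.0, 2.0, 3.0],
              [1.0, 0.0, 1.0, 0.0]])
Y = np.array([0.0, 0.0, 1.0, 1.0])

w, b, history = gradient_descent(X, Y)
print(history[0], history[-1])   # the cost should go down
```

Each iteration makes one forward pass over all the examples, one backward pass to get the averaged gradients, and one update of $\omega$ and $b$.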