# Machine Learning: Week 3 - Logistic Regression
## Logistic Regression
Now we are switching from regression problems to classification problems. Don’t be confused by the name “Logistic Regression”; it is named that way for historical reasons and is actually an approach to classification problems, not regression problems.

## Classification
Instead of taking values from a continuous range, each entry of our output vector $y$ will be either 0 or 1.

$$\large
y\in \{0,1\}
$$

Where $0$ is usually taken as the **“negative class”** and $1$ as the **“positive class”**, but you are free to assign any representation to it.

For now we consider only two classes, which is called a _**“Binary Classification Problem”**_.

One method is to use linear regression and map all predictions of 0.5 or more to 1 and all predictions below 0.5 to 0. This method doesn’t work well: the relationship between the features and a 0/1 label is not linear, and a linear hypothesis can output values far below 0 or far above 1.
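To make the weakness of thresholded linear regression concrete, here is a minimal sketch on made-up data (the numbers and the helper names `fit_line` and `crossing_point` are assumptions for illustration, not part of the course material). It fits a least-squares line to 0/1 labels and finds where that line crosses 0.5, once on a clean dataset and once with a single far-away positive example added.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def crossing_point(intercept, slope, level=0.5):
    """The x value at which the fitted line equals `level`."""
    return (level - intercept) / slope


# Hypothetical 1-D data: examples with x <= 4 are negative, x >= 5 positive.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 0, 0, 1, 1, 1, 1]

clean_cut = crossing_point(*fit_line(xs, ys))

# Add one clearly positive example far to the right and refit.
outlier_cut = crossing_point(*fit_line(xs + [30], ys + [1]))

print("cut-off without outlier:", clean_cut)
print("cut-off with outlier:   ", outlier_cut)
```

With the extra point at $x = 30$, the cut-off moves to the right of $x = 5$, so the example at $x = 5$ (labelled 1) would now be predicted as 0, even though the new point is nowhere near the boundary.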

For instance, if we are trying to build a spam classifier for email, then $x^{(i)}$ may be some features of an email, and $y^{(i)}$ may be $1$ if it is spam, and $0$ otherwise. Hence, $y\in \{0,1\}$. $0$ is called the negative class, and $1$ the positive class, and they are sometimes also denoted by the symbols “-” and “+.” Given $x^{(i)}$, the corresponding $y^{(i)}$ is also called the label for the training example.


## Hypothesis Representation
Our hypothesis should satisfy:

$$\large
0\leqslant h_\theta(x) \leqslant 1
$$

Our new form uses the **“Sigmoid Function”**, also called the **“Logistic Function”:**

![sigmoidFunction](./Week3_Images/SigmoidFunction.png)

The function $g(z)$, shown here, maps any real number to the $(0, 1)$ interval, making it useful for transforming an arbitrary-valued function into a function better suited for classification. Try playing with an interactive plot of the sigmoid function: https://www.desmos.com/calculator/bgontvxotm.

We start with our old hypothesis (linear regression), except that we want to restrict its output to lie between $0$ and $1$. This is accomplished by plugging $\theta^Tx$ into the Logistic Function.

$h_\theta(x)$ will give us the probability that our output is $1$. For example, $h_\theta(x)=0.7$ means there is a $70\%$ probability that our output is $1$, and therefore a $30\%$ probability that it is $0$.

$$\large
 \begin{array}{l}
h_{\theta }( x) =P( y=1|x;\theta ) =1-P( y=0|x;\theta )\\
\\
\ P( y=0|x;\theta ) +P( y=1|x;\theta ) =1
\end{array}
$$
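The probabilistic reading of the hypothesis can be sketched in a few lines of Python. This assumes the standard logistic function $g(z) = \frac{1}{1+e^{-z}}$ and the convention $x_0 = 1$; the parameter and feature values below are made up for illustration.

```python
import math


def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def hypothesis(theta, x):
    """h_theta(x) = g(theta^T x), read as P(y = 1 | x; theta)."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return sigmoid(z)


theta = [-1.0, 0.5]  # hypothetical parameters (theta_0, theta_1)
x = [1.0, 4.0]       # x_0 = 1 is the intercept term, x_1 = 4 (made up)

p_positive = hypothesis(theta, x)  # P(y = 1 | x; theta)
p_negative = 1.0 - p_positive      # P(y = 0 | x; theta)

print("P(y=1):", p_positive)
print("P(y=0):", p_negative)
```

Here $\theta^Tx = 1$, so the hypothesis outputs roughly $0.73$, and the two probabilities sum to $1$ by construction.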


## Decision Boundary
In order to get our discrete 0 or 1 classification, we can translate the output of the hypothesis function as follows:

$$\large
 \begin{array}{l}
h_{\theta }( x) \geqslant 0.5\rightarrow y=1\\
h_{\theta }( x) \ < 0.5\rightarrow y=0
\end{array}
$$

Our logistic function $g$ outputs a value greater than or equal to 0.5 exactly when its input is greater than or equal to zero:

$$\large
 \begin{array}{l}
g( z) \geqslant 0.5\\
\text{when}\ z\geqslant 0
\end{array}
$$

Remember:

$$\large
\begin{array}{l}
z=0,\ e^{0} =1\Longrightarrow g( z) =\frac{1}{2}\\
\\
z\rightarrow \infty ,\ e^{-z}\rightarrow 0\Longrightarrow g( z) \rightarrow 1\\
\\
z\rightarrow -\infty ,\ e^{-z}\rightarrow \infty \Longrightarrow g( z) \rightarrow 0
\end{array}
$$
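These three cases can be checked numerically. The sketch below is one possible way to write the sigmoid, split into two branches so that `math.exp` never receives a large positive argument (an implementation choice assumed here to avoid overflow, not something the notes prescribe).

```python
import math


def sigmoid(z):
    """Logistic function g(z), evaluated without overflowing exp()."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z use the equivalent form e^z / (1 + e^z).
    e = math.exp(z)
    return e / (1.0 + e)


for z in (-1000.0, -5.0, 0.0, 5.0, 1000.0):
    print(z, sigmoid(z))
```

Mathematically $g(z)$ only approaches $0$ and $1$ and never reaches them; in floating point the values at $z = \pm 1000$ round to exactly $0$ and $1$.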

So if our input to $g$ is $\theta^Tx$, then that means:

$$\large
 \begin{array}{l}
h_{\theta }( x) \ =\ g\left( \theta ^{T} x\right) \geqslant 0.5\\
\text{when}\ \theta ^{T} x\geqslant 0
\end{array}
$$

From these statements we can now say:

$$\large
 \begin{array}{l}
\theta ^{T} x\geqslant 0\Longrightarrow y=1\ \\
\theta ^{T} x< 0\Longrightarrow y=0
\end{array}
$$
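The equivalence between thresholding the probability at 0.5 and checking the sign of the linear term can be illustrated as follows. The parameter vector and sample points are hypothetical, and each point carries a leading $x_0 = 1$.

```python
import math


def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def linear_term(theta, x):
    """Computes theta^T x for plain Python lists."""
    return sum(t * xi for t, xi in zip(theta, x))


def predict_from_probability(theta, x):
    """Predict 1 when h_theta(x) >= 0.5, else 0."""
    return 1 if sigmoid(linear_term(theta, x)) >= 0.5 else 0


def predict_from_sign(theta, x):
    """Predict 1 when theta^T x >= 0, else 0 (no sigmoid needed)."""
    return 1 if linear_term(theta, x) >= 0 else 0


theta = [-3.0, 1.0, 1.0]  # hypothetical parameters
points = [
    [1.0, 0.0, 0.0],   # theta^T x = -3
    [1.0, 1.0, 2.0],   # theta^T x =  0 (on the boundary)
    [1.0, 2.5, 2.5],   # theta^T x =  2
    [1.0, 4.0, -2.0],  # theta^T x = -1
]

by_probability = [predict_from_probability(theta, p) for p in points]
by_sign = [predict_from_sign(theta, p) for p in points]
print(by_probability, by_sign)
```

Both rules give the same labels on these points, which is why the prediction step can skip the sigmoid and look only at the sign of $\theta^Tx$.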

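The decision boundary is the line that separates the region where we predict $y = 0$ from the region where we predict $y = 1$. It is created by our hypothesis function: it is the set of points where $\theta^Tx = 0$, that is, where $h_\theta(x) = 0.5$.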

Example:

$$\large
\begin{array}{l}
\theta =\begin{bmatrix}
5\\
-1\\
0
\end{bmatrix}\\
h_{\theta }( x) =g( \theta _{0} +\theta _{1} x_{1} +\theta _{2} x_{2})\\
y=1\ \text{if}\ 5+( -1) x_{1} +0x_{2} \geqslant 0\\
5-x_{1} \geqslant 0\\
-x_{1} \geqslant -5\\
x_{1} \leqslant 5
\end{array}
$$

In this case, our decision boundary is a straight vertical line placed on the graph where $x_1 = 5$. Everything on or to the left of that line denotes $y = 1$, while everything to the right denotes $y = 0$.
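The worked example can be checked directly in code. The sketch below uses the example's $\theta = (5, -1, 0)$; the test points are made up, and the function name `predict` is only illustrative.

```python
theta = [5.0, -1.0, 0.0]  # theta_0, theta_1, theta_2 from the example


def predict(theta, x1, x2):
    """Predict y = 1 when theta_0 + theta_1*x1 + theta_2*x2 >= 0, else 0."""
    z = theta[0] + theta[1] * x1 + theta[2] * x2
    return 1 if z >= 0 else 0


# x2 has no influence because theta_2 = 0; only x1 relative to 5 matters.
left = predict(theta, 3.0, 10.0)         # x1 < 5
on_boundary = predict(theta, 5.0, -2.0)  # x1 = 5
right = predict(theta, 7.0, 0.0)         # x1 > 5

print(left, on_boundary, right)
```

Changing `x2` to any other value leaves the three predictions unchanged, which matches the boundary being a vertical line.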

The input to the sigmoid function $g(z)$ doesn’t need to be linear in the features as $\theta^Tx$ is. It could be a function that describes a circle (e.g. $z=\theta_0+\theta_1x^{2}_1+\theta_2x^{2}_2$) or any other shape that fits our data.
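As a sketch of a non-linear boundary, take the circular form $z=\theta_0+\theta_1x^{2}_1+\theta_2x^{2}_2$ with the hypothetical choice $\theta = (-1, 1, 1)$. Then $z \geqslant 0$ exactly when $x_1^2 + x_2^2 \geqslant 1$, so the decision boundary is the unit circle. The parameter values and test points are assumptions for illustration.

```python
theta = [-1.0, 1.0, 1.0]  # hypothetical: boundary is x1^2 + x2^2 = 1


def predict_circle(theta, x1, x2):
    """Predict y = 1 when theta_0 + theta_1*x1^2 + theta_2*x2^2 >= 0."""
    z = theta[0] + theta[1] * x1 ** 2 + theta[2] * x2 ** 2
    return 1 if z >= 0 else 0


centre = predict_circle(theta, 0.0, 0.0)   # inside the circle
inside = predict_circle(theta, 0.5, 0.5)   # inside: 0.25 + 0.25 < 1
on_edge = predict_circle(theta, 1.0, 0.0)  # exactly on the circle
outside = predict_circle(theta, 2.0, 2.0)  # outside the circle

print(centre, inside, on_edge, outside)
```

Points inside the circle are predicted as $y = 0$ and points on or outside it as $y = 1$; flipping the signs of all three parameters would swap the two regions.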