## Classification and Representation

- **Classification**

    - Classification resembles a regression problem, except that the values we want to predict take on only a small number of discrete values.

    - Focus on the **binary classification problem**, in which y can take on only two values, 0 and 1. 0 is called the "negative class" (-), and 1 the "positive class" (+).

- **Hypothesis Representation**

    - Let's change the form for our hypothesis $h_\theta(x)$ to satisfy $0 \le h_\theta(x) \le 1$. This is accomplished by plugging $\theta^Tx$ into the Logistic Function (or Sigmoid Function):

        $$h_\theta(x) = g(\theta^Tx)$$

        $$g(z) = \dfrac{1}{1 + e^{-z}}$$

        $$z = \theta^Tx$$
    
    - With the sigmoid function, as z goes to $-\infty$, h approaches 0 (already at z = -5, h is about 0.007), and as z goes to $+\infty$, h approaches 1 (at z = 5, h is about 0.993).

    - $h_\theta(x)$ will give us the **probability** that our output is 1. For example, $h_\theta(x) = 0.7$ gives us a probability of 70% that our output is 1 and 30% that our output is 0.

        $h_\theta(x) = P(y=1|x;\theta) = 1 - P(y = 0|x;\theta)$ 
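
As a rough illustration of the hypothesis above, here is a minimal Python sketch using only the standard library. The function names `sigmoid` and `hypothesis` are chosen for this example, and the input `x` is assumed to already include the bias term $x_0 = 1$.

```python
import math

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z)), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, use the algebraically equivalent form e^z / (1 + e^z).
    ez = math.exp(z)
    return ez / (1.0 + ez)

def hypothesis(theta, x):
    """h_theta(x) = g(theta^T x); x is assumed to include x_0 = 1."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return sigmoid(z)

for z in (-5, 0, 5):
    print(z, round(sigmoid(z), 4))
# prints:
# -5 0.0067
# 0 0.5
# 5 0.9933
```

The two-branch form is a design choice for numerical safety: it never calls `math.exp` on a large positive argument.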

- **Decision Boundary**

    - In order to get our discrete 0 or 1 classification, we can translate the output of the hypothesis function as follows:

        $h_\theta(x) \ge 0.5 \to y = 1$

        $h_\theta(x) < 0.5 \to y = 0$
    
    - And: $g(z) \ge 0.5$ when $z \ge 0$

    - Recall:

        $z = 0 \Rightarrow e^{0} = 1 \Rightarrow g(z) = 1/2$

        $z \to \infty \Rightarrow e^{-z} \to 0 \Rightarrow g(z) \to 1$

        $z \to -\infty \Rightarrow e^{-z} \to \infty \Rightarrow g(z) \to 0$

    - So:

        $\theta^Tx \ge 0 \Rightarrow y = 1$

        $\theta^Tx < 0 \Rightarrow y = 0$

    - The **decision boundary** is the line that separates the region where y = 0 from the region where y = 1. It is determined by our hypothesis function: it is the set of points where $\theta^Tx = 0$.
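
The decision rule can be sketched in a few lines of Python. The parameter vector `theta = [-3, 1, 1]` below is an illustrative assumption, not a value from these notes; it gives the boundary $x_1 + x_2 = 3$. Each input is assumed to start with the bias term $x_0 = 1$.

```python
def predict(theta, x):
    """Return 1 when theta^T x >= 0 (equivalently h_theta(x) >= 0.5), else 0."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return 1 if z >= 0 else 0

# Hypothetical parameters: the boundary theta^T x = 0 is the line x1 + x2 = 3.
theta = [-3.0, 1.0, 1.0]

print(predict(theta, [1.0, 1.0, 1.0]))  # z = -1, below the boundary: prints 0
print(predict(theta, [1.0, 2.0, 2.0]))  # z = 1, above the boundary: prints 1
print(predict(theta, [1.0, 1.5, 1.5]))  # z = 0, on the boundary: prints 1
```

Note that the sigmoid never needs to be evaluated for prediction; the sign of $\theta^Tx$ is enough.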

## Logistic Regression Model

- **Cost Function**

    - We cannot use the same cost function that we use for linear regression. Because the logistic hypothesis is nonlinear in $\theta$, the squared-error cost becomes wavy, with many local optima. In other words, it is not a convex function.

    - Instead, our cost function for logistic regression looks like:

        $J(\theta) = \dfrac{1}{m}\sum_{i=1}^mCost(h_\theta(x^{(i)}), y^{(i)})$

        $Cost(h_\theta(x), y)  = -log(h_\theta(x))$ if y = 1

        $Cost(h_\theta(x), y)  = -log(1 - h_\theta(x))$ if y = 0

    - And:

        $Cost(h_\theta(x), y) = 0$ if $h_\theta(x) = y$

        $Cost(h_\theta(x), y) \to \infty$ if y = 0 and $h_\theta(x) \to 1$

        $Cost(h_\theta(x), y) \to \infty$ if y = 1 and $h_\theta(x) \to 0$
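
To see how the per-example cost behaves, here is a small Python sketch. The function name `cost` is chosen for this example, and `h` is assumed to lie strictly between 0 and 1 (except where the logarithm's argument is exactly 1).

```python
import math

def cost(h, y):
    """Per-example cost: -log(h) if y == 1, -log(1 - h) if y == 0."""
    if y == 1:
        return -math.log(h)
    return -math.log(1.0 - h)

# A confident correct prediction is cheap; a confident wrong one is expensive.
for h in (0.99, 0.5, 0.01):
    print(h, round(cost(h, 1), 3), round(cost(h, 0), 3))
```

For `h = 0.5` both costs equal $\log 2 \approx 0.693$, and the cost for `y = 1` grows without bound as `h` shrinks toward 0.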

- **Simplified Cost Function**

    - We can compress our cost function into:
    
        $Cost(h_\theta(x),y) = -ylog(h_\theta(x)) - (1-y)log(1-h_\theta(x))$
    
    - We can fully write out our entire cost as follows:
        
        $J(\theta) = -\frac{1}{m}\sum_{i=1}^m[y^{(i)}log(h_\theta(x^{(i)})) + (1-y^{(i)})log(1-h_\theta(x^{(i)}))]$
    
    - Vectorized implementation is:

        $h = g(X\theta)$

        $J(\theta) = \dfrac{1}{m} \left( -y^Tlog(h) - (1-y)^Tlog(1-h)\right)$

    - Gradient Descent

        - Repeat until convergence, updating all $\theta_j$ simultaneously:

        $\theta_j := \theta_j - \alpha\dfrac{\partial}{\partial\theta_j}J(\theta)$

        - Work out the derivative part using calculus:

        $\theta_j := \theta_j - \dfrac{\alpha}{m}\sum_{i=1}^m(h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)}$

        - And the vectorized implementation is:

        $\theta := \theta - \dfrac{\alpha}{m}X^T(g(X\theta) - y)$
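
The cost and the update rule can be put together in a short Python sketch using only the standard library. Everything named here (`predict_proba`, `cost_J`, `gradient_step`, the toy data, `alpha = 0.1`, 2000 iterations, and the clipping constant `eps`) is an assumption made for illustration, not part of the original notes. Each row of `X` starts with the bias term $x_0 = 1$.

```python
import math

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_proba(theta, X):
    """h = g(X theta), computed row by row."""
    return [sigmoid(sum(t * xi for t, xi in zip(theta, row))) for row in X]

def cost_J(theta, X, y, eps=1e-12):
    """J(theta) = -(1/m) * sum(y*log(h) + (1-y)*log(1-h))."""
    m = len(y)
    total = 0.0
    for hi, yi in zip(predict_proba(theta, X), y):
        hi = min(max(hi, eps), 1.0 - eps)  # clip to keep log() finite
        total += yi * math.log(hi) + (1 - yi) * math.log(1.0 - hi)
    return -total / m

def gradient_step(theta, X, y, alpha):
    """One update: theta := theta - (alpha/m) * X^T (g(X theta) - y)."""
    m = len(y)
    errors = [hi - yi for hi, yi in zip(predict_proba(theta, X), y)]
    grad = [sum(e * row[j] for e, row in zip(errors, X)) / m
            for j in range(len(theta))]
    # Build a new list so every theta_j is updated simultaneously.
    return [t - alpha * g for t, g in zip(theta, grad)]

# Toy data (hypothetical): one feature, small values labeled 0, large labeled 1.
X = [[1.0, 0.5], [1.0, 1.0], [1.0, 1.5], [1.0, 3.0], [1.0, 3.5], [1.0, 4.0]]
y = [0, 0, 0, 1, 1, 1]

theta = [0.0, 0.0]
initial = cost_J(theta, X, y)  # equals log(2) when theta is all zeros
for _ in range(2000):
    theta = gradient_step(theta, X, y, alpha=0.1)
final = cost_J(theta, X, y)

print("initial cost:", initial)
print("final cost:", final)
print("theta:", theta)
```

The update has the same form as the one for linear regression; only the definition of $h_\theta(x)$ differs.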