### Logistic Regression is a model used in classification problems where the desired prediction is a discrete value

### Example Classification Problems

- Identifying email spam
- Determining if online ecommerce transactions are fraudulent
- Classifying tumors as malignant or benign

### Outline

- Linear vs Logistic Regression for Classification
- Hypothesis Function
- Interpretation of Hypothesis Output
- Decision Boundary
- Cost Function
- Simplified Cost Function
- Gradient Descent
- Advanced Optimization
- Multiclass Classification

### Linear vs Logistic Regression for Classification

Applying Linear Regression to a classification problem is typically not a good idea because the y values are discrete.

- Outliers can heavily skew the fitted line and shift the classification threshold, which hurts the accuracy of the model
- In Linear Regression the output of the hypothesis function can be > 1 or < 0, even though the class labels can only be 0 or 1

**Because of these limitations, we ditch the linear model and adopt the logistic model for classification problems**
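As a rough illustration of the out-of-range limitation, the sketch below fits an ordinary least-squares line to a small made-up dataset with 0/1 labels and evaluates it outside the training range. The data and the helper name `linear_hypothesis` are assumptions chosen for illustration only.

```python
import numpy as np

# Hypothetical 1-D dataset: small x values are class 0, large x values are class 1.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0])

# Design matrix with a leading column of ones for the intercept term.
X = np.column_stack([np.ones_like(x), x])

# Ordinary least-squares fit of theta = [theta_0, theta_1].
theta, *_ = np.linalg.lstsq(X, y, rcond=None)


def linear_hypothesis(x_new):
    """Linear Regression hypothesis: theta_0 + theta_1 * x."""
    return theta[0] + theta[1] * x_new


# Outside the training range the "prediction" leaves the [0, 1] interval.
print(linear_hypothesis(0.0))   # negative
print(linear_hypothesis(10.0))  # greater than 1
```

Neither output can be read as a probability, which is the motivation for squashing the linear output with a Sigmoid function.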

### Hypothesis Function

Since we want our output values to be contained within 0 and 1, we want to use a Sigmoid or Logistic function as the basis for our hypothesis function.

In order to step from the Linear model to the Logistic model, we start with our same Linear hypothesis function:

$h_{Θ}(x) = Θ^{T}x  $

In order to ensure the output values are between 0 and 1, we pass the linear output $Θ^{T}x$ as the argument $z$ of a Sigmoid/Logistic function $ g(z) $:

$g(z) = \frac{1}{1 + e^{-z}} $

Which means the hypothesis function ultimately becomes equivalent to:

$h_{Θ}(x) = \frac{1}{1 + e^{-Θ^{T}x}} $

This is the Sigmoid/Logistic function, which asymptotes at 0 as its input approaches -inf and asymptotes at 1 as its input approaches inf. All of the function's return values lie between 0 and 1.
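The hypothesis above can be sketched in a few lines of NumPy. The function names `sigmoid` and `hypothesis` and the sample values are assumptions for illustration, and the feature vector is assumed to include a leading 1 for the intercept term.

```python
import numpy as np


def sigmoid(z):
    """Sigmoid/Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))


def hypothesis(theta, x):
    """Logistic hypothesis h_theta(x) = g(theta^T x)."""
    return sigmoid(theta @ x)


# g(0) sits exactly halfway between the two asymptotes.
print(sigmoid(0.0))

# Large negative inputs approach 0, large positive inputs approach 1.
print(sigmoid(-50.0), sigmoid(50.0))

# Hypothetical parameters and features (x[0] = 1 is the intercept term).
theta = np.array([-3.0, 1.0, 1.0])
x = np.array([1.0, 2.0, 2.0])
print(hypothesis(theta, x))  # g(1), a value between 0.5 and 1
```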

### Interpretation of Hypothesis Output

The hypothesis function output is the estimated probability that y = 1 (i.e. a positive classification) given the feature values x, written $h_{Θ}(x) = P(y=1 \mid x; Θ)$.

For example, in a problem where we are trying to classify malignant tumors (y=1), then if $h_{Θ}(x) = 0.7 $, we can say there is a 70% chance that the tumor is malignant based on the feature values. Conversely, the probability that the tumor is benign (y=0) would be 30% in this example.

### Decision Boundary

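The Decision Boundary separates the region where we predict $y = 1$ from the region where we predict $y = 0$. It is described by the inequality (or system of inequalities) that determines when $y = 1$ is predicted.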

**A very simple example:**

Suppose we predict $y=1$ if $h_{θ}(x) \ge 0.5$

and we predict $y=0$ if $h_{θ}(x) \lt 0.5$

**A harder example:**

Suppose our hypothesis function looks like this:

$h_{θ}(x) = g(θ_{0} + θ_{1}x_{1} + θ_{2}x_{2})$

and for whatever reason we know that the θ parameters are as follows:

$θ_{0}=-3, θ_{1}=1, θ_{2}=1$

Which can also be represented in vector form like this:

$θ = \begin{bmatrix}
-3\\
1\\
1\\
\end{bmatrix}$

Since $g(z) \ge 0.5$ exactly when $z \ge 0$, our prediction rule then looks like this:

Predict $y=1$ if $-3 + x_{1} + x_{2} \ge 0$

**Note that $-3 + x_{1} + x_{2}$ is just $θ^Tx$**

This can be simplified to:

$x_{1} + x_{2} \ge 3$

The line $x_{1} + x_{2} = 3$ is the decision boundary.

You can visualize this inequality pretty easily by plugging in values for $x_{1}$ and $x_{2}$. Start with the $x_{1}$- and $x_{2}$-intercepts, especially for a simple linear inequality like this example.

The points along the actual Decision Boundary represent $h_{θ}(x) = 0.5$, in other words an even split in probability between $y=1$ and $y=0$. This corresponds to the inflection point of the Sigmoid function: the predicted probability of $y=1$ increases as you move away from the boundary into the $y=1$ region and decreases as you move into the $y=0$ region.

![DECISION_BOUNDARY](../img/decision_boundary.png)
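The harder example can be checked numerically. This is a minimal sketch using the θ values given above; the function name `predict` and the test points are assumptions for illustration.

```python
import numpy as np


def sigmoid(z):
    """Sigmoid/Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))


# Parameters from the example: theta_0 = -3, theta_1 = 1, theta_2 = 1.
theta = np.array([-3.0, 1.0, 1.0])


def predict(x1, x2, threshold=0.5):
    """Return 1 if h_theta(x) >= threshold, otherwise 0."""
    x = np.array([1.0, x1, x2])  # leading 1 pairs with theta_0
    return int(sigmoid(theta @ x) >= threshold)


# x1 + x2 = 2 < 3, so the point falls on the y = 0 side.
print(predict(1.0, 1.0))

# x1 + x2 = 4 >= 3, so the point falls on the y = 1 side.
print(predict(2.0, 2.0))

# The intercepts (3, 0) and (0, 3) lie on the boundary, where h_theta(x) = 0.5.
print(sigmoid(theta @ np.array([1.0, 3.0, 0.0])))
print(sigmoid(theta @ np.array([1.0, 0.0, 3.0])))
```

Thresholding $h_{θ}(x)$ at 0.5 gives the same answer as checking $x_{1} + x_{2} \ge 3$ directly, because the Sigmoid function crosses 0.5 exactly where its input is 0.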

### Non-Linear Decision Boundaries



![MULTI_DECISION_BOUNDARY](../img/multi_decision_boundary.png)

##### NOTE ON DECISION BOUNDARIES

The Decision Boundary is a property of the Hypothesis function and its theta parameters, NOT of the dataset itself. The dataset is used to fit the theta parameters, but once they are fixed, the Decision Boundary is determined entirely by the model.

### Cost Function

REMINDER: Minimizing the Cost Function is really just solving for the theta parameters that best fit the dataset. It's that simple in concept.

##### Can we use the same Cost Function as in Linear Regression?

No. With the Sigmoid function inside the hypothesis, the squared-error Cost Function from Linear Regression becomes non-convex in the theta parameters. This is important because a non-convex function can have many local minima, making it unreliable to use Gradient Descent in an attempt to find the global minimum. Remember, the goal of Gradient Descent, and what the Cost Function is ultimately used for, is minimization! With a convex function, any local minimum is also the global minimum, so Gradient Descent (with a suitable learning rate) will find it.

See below for visual representation of non-convex and convex functions:

![CONVEX_COST_FUNCTIONS](../img/convex_cost_functions.png)
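The non-convexity can be checked on a deliberately tiny case. The sketch below assumes a single training example with $x = 1$, $y = 1$ and one parameter θ (no intercept), and tests the midpoint form of the convexity condition, $f(\frac{a+b}{2}) \le \frac{f(a)+f(b)}{2}$. The helper names are hypothetical. It only demonstrates that convexity fails; it does not show multiple local minima, which need a larger dataset to appear.

```python
import numpy as np


def sigmoid(z):
    """Sigmoid/Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))


def squared_error_cost(theta):
    """Squared-error cost for one example with x = 1, y = 1."""
    return (sigmoid(theta * 1.0) - 1.0) ** 2


def violates_midpoint_convexity(f, a, b):
    """True if f at the midpoint of [a, b] lies above the chord through f(a) and f(b)."""
    midpoint_value = f((a + b) / 2.0)
    chord_value = (f(a) + f(b)) / 2.0
    return bool(midpoint_value > chord_value)


# A convex function would never return True here.
print(violates_midpoint_convexity(squared_error_cost, -10.0, 0.0))
```

The cost is nearly flat for very negative θ and then drops, so the curve bulges above the straight line joining the two endpoints, which a convex function cannot do.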

### Logistic Regression Cost Function

For the reasons mentioned above, we use a different Cost Function in Logistic Regression as opposed to Linear Regression (to avoid non-convex functions).

We need a Cost Function that, when $y=1$, is equal to 0 if the prediction is correct (i.e. $h_{θ}(x)=1$) and approaches infinity as $h_{θ}(x)$ approaches zero, with the mirror-image behavior when $y=0$. The idea here is to penalize the algorithm more heavily the farther the Hypothesis Function output is from the actual observation.

The new Cost Function we will use for Logistic Regression is the following:

![LOGISTIC_COST_FUNCTION](../img/logistic_cost_function.png)
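Assuming the image shows the standard per-example log-loss, that is $-\log(h_{θ}(x))$ when $y=1$ and $-\log(1 - h_{θ}(x))$ when $y=0$, the behavior described above can be sketched as follows. The function name `cost` and the sample values are assumptions for illustration.

```python
import numpy as np


def cost(h, y):
    """Per-example logistic cost: -log(h) if y == 1, else -log(1 - h)."""
    if y == 1:
        return -np.log(h)
    return -np.log(1.0 - h)


# Confident and correct: the cost is zero.
print(abs(cost(1.0, 1)))

# Confident and wrong: the cost grows quickly as h moves away from y.
print(cost(0.1, 1))
print(cost(0.01, 1))

# The y = 0 case mirrors the y = 1 case.
print(cost(0.9, 0))
```

Note that `cost(0.0, 1)` would evaluate to infinity (with a NumPy warning), matching the requirement that the penalty is unbounded for a confident wrong prediction.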