# Logistic Regression Model
## Cost Function

![image.png](attachment:image.png)

We have a training set of m examples. Each example is represented by an (n+1)-dimensional feature vector, with x0 = 1 by convention. Because this is a classification problem, y is either 0 or 1. Given this training set, how do we choose the parameters theta?

We cannot reuse the squared-error cost function from linear regression: with the logistic hypothesis plugged in, it becomes non-convex in theta.

![image.png](attachment:image.png)

On a non-convex function, gradient descent is not guaranteed to reach the global minimum. The non-convexity comes from the non-linear sigmoid function inside h.

The logistic regression cost function is:

![image-2.png](attachment:image-2.png)

![image-3.png](attachment:image-3.png)

Notice that if y=1 and h=1, then Cost = 0. As h(x) --> 0, Cost goes to infinity, which should be intuitive: if y = 1 and we predict h = 0, there should be a large penalty.

If y=0, then our curve looks like this:

![image-4.png](attachment:image-4.png)
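
To make the two curves concrete, here is a minimal sketch of the per-example cost, assuming the piecewise form -log(h) for y=1 and -log(1-h) for y=0. The function name `example_cost` and the sample values of h are made up for illustration.

```python
import math

def example_cost(h, y):
    """Per-example logistic cost: -log(h) if y == 1, -log(1 - h) if y == 0."""
    if y == 1:
        return -math.log(h)
    return -math.log(1.0 - h)

# Confident and correct prediction: almost no penalty.
print(example_cost(0.99, 1))

# Confident and wrong prediction: the penalty keeps growing as h approaches 0.
for h in (0.1, 0.01, 0.001):
    print(h, example_cost(h, 1))
```

The y=0 case mirrors this: the cost is 0 when h=0 and grows without bound as h approaches 1.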

## Simplified Cost Function and Gradient Descent

![image.png](attachment:image.png)

Because y is always 0 or 1, the two cases of the cost function can be written as a single equation.

<center>$Cost(h_\theta(x),y) = -y\log(h_\theta(x))-(1-y)\log(1-h_\theta(x))$</center>

One of the two terms always drops out: when y=1 only the first term remains, and when y=0 only the second.
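
A quick numerical check of the single-equation form, as a sketch. The function name `cost` and the value h = 0.8 are arbitrary choices for illustration.

```python
import math

def cost(h, y):
    # Single-equation form; assumes y is 0 or 1 and 0 < h < 1.
    return -y * math.log(h) - (1 - y) * math.log(1 - h)

h = 0.8
print(cost(h, 1), -math.log(h))      # y = 1: only the first term survives
print(cost(h, 0), -math.log(1 - h))  # y = 0: only the second term survives
```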

![image.png](attachment:image.png)

This cost function can be derived statistically using the principle of maximum likelihood estimation, which is why it is the one we use. The derivation is beyond the scope of this material.

We are going to try to find the parameters theta that minimize J(theta). 

![image-2.png](attachment:image-2.png)


![image-3.png](attachment:image-3.png)

The update rule looks identical to the one for linear regression. The difference is the hypothesis: here h is the sigmoid of theta^T x rather than theta^T x itself.

Vectorized implementation:

![image-4.png](attachment:image-4.png)
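
A minimal NumPy sketch of the vectorized cost and gradient descent update, assuming the standard forms J(theta) = -(1/m)[y^T log(h) + (1-y)^T log(1-h)] and theta := theta - (alpha/m) X^T (h - y). The toy data, the learning rate, and the iteration count are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    """J(theta) for logistic regression, with h = sigmoid(X @ theta)."""
    m = len(y)
    h = sigmoid(X @ theta)
    return -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / m

def gradient_descent(X, y, alpha=0.1, iterations=1000):
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iterations):
        h = sigmoid(X @ theta)
        # One matrix product updates every theta_j simultaneously.
        theta = theta - (alpha / m) * (X.T @ (h - y))
    return theta

# Made-up data: the first column is x0 = 1, the classes overlap slightly.
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 2.0],
              [1.0, 3.0], [1.0, 3.5], [1.0, 4.5]])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])

theta = gradient_descent(X, y)
print("cost at theta = 0:   ", cost(np.zeros(2), X, y))
print("cost after training: ", cost(theta, X, y))
```

The only line that differs from linear regression is the one computing h, which applies the sigmoid.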

## Advanced Optimization

We can compute the cost function and the derivative of the cost function. We have also seen one algorithm that minimizes the cost: gradient descent. There are other algorithms, though:
- Conjugate gradient
- BFGS
- L-BFGS

The details of these are well beyond the scope of this course. Their advantages are that there is no need to manually pick a learning rate and that they are often faster than gradient descent. Their disadvantage is that they are more complex.
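
In practice we would call a library rather than implement these ourselves. The sketch below assumes SciPy is available and uses `scipy.optimize.minimize` with `method="BFGS"`; the course itself may use a different tool. We only supply a function that returns the cost and its gradient. The toy data and the name `cost_and_gradient` are made up for illustration.

```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_and_gradient(theta, X, y):
    """Return J(theta) and its gradient for unregularized logistic regression."""
    m = len(y)
    h = sigmoid(X @ theta)
    J = -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / m
    grad = X.T @ (h - y) / m
    return J, grad

# Made-up data: the first column is x0 = 1, the classes overlap slightly.
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 2.0],
              [1.0, 3.0], [1.0, 3.5], [1.0, 4.5]])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])

# jac=True tells SciPy that the function returns (cost, gradient).
result = minimize(cost_and_gradient, x0=np.zeros(2), args=(X, y),
                  method="BFGS", jac=True)
print("theta:", result.x)
print("cost: ", result.fun)
```

Note that no learning rate appears anywhere: the optimizer chooses its own step sizes.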

# Multiclass Classification

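One vs. all: for each class, train a separate logistic regression classifier that predicts the probability that y belongs to that class. To classify a new input x, run every classifier and pick the class whose classifier outputs the highest value.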

![image.png](attachment:image.png)

![image-2.png](attachment:image-2.png)
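
A small sketch of one vs. all, assuming each binary classifier is trained with plain gradient descent. The helper names (`train_binary`, `one_vs_all`, `predict`), the three made-up clusters, and the hyperparameters are all illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_binary(X, y, alpha=0.1, iterations=5000):
    """Fit one logistic regression classifier with gradient descent."""
    theta = np.zeros(X.shape[1])
    for _ in range(iterations):
        theta -= (alpha / len(y)) * (X.T @ (sigmoid(X @ theta) - y))
    return theta

def one_vs_all(X, labels, num_classes):
    # Row k holds the parameters of the "class k vs. everything else" classifier.
    return np.array([train_binary(X, (labels == k).astype(float))
                     for k in range(num_classes)])

def predict(all_theta, X):
    # Run every classifier and pick the class with the highest output.
    return np.argmax(sigmoid(X @ all_theta.T), axis=1)

# Made-up data: three well-separated clusters, x0 = 1 in the first column.
X = np.array([[1.0, 0.0, 0.0], [1.0, 0.5, 0.2],
              [1.0, 4.0, 0.1], [1.0, 4.5, 0.4],
              [1.0, 2.0, 4.0], [1.0, 2.3, 4.5]])
labels = np.array([0, 0, 1, 1, 2, 2])

all_theta = one_vs_all(X, labels, num_classes=3)
print(predict(all_theta, X))
```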

# Solving the Problem of Overfitting

## The Problem of Overfitting

![image.png](attachment:image.png)

![image-2.png](attachment:image-2.png)

![image-3.png](attachment:image-3.png)

If we think overfitting is happening, what can we do?
1. Reduce the number of features
    - Manually select which features to keep
    - Model selection algorithm (later in course)
2. Regularization
    - Keep all the features, but reduce magnitude/values of parameters
    - Works well when we have lots of features, each of which contributes a bit to predicting y. 


## Cost Function

Small values for the parameters lead to a simpler hypothesis that is less prone to overfitting.

We don't know in advance which parameters to penalize, so we shrink all of them by adding an extra regularization term to the end of the cost function.

![image.png](attachment:image.png)

By convention, the regularization term shrinks every parameter except theta zero.

Lambda is called the regularization parameter and controls a trade-off between two goals. The first goal, captured by the first term, is to fit the training data well. The second goal, captured by the regularization term, is to keep the parameters small.

If lambda is too large, the parameters are penalized so heavily that all of them except theta zero end up close to 0. The hypothesis is then roughly a flat line and underfits the data.
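
A minimal sketch of the regularized cost for linear regression, assuming the usual form J(theta) = (1/2m)[sum of squared errors + lambda * sum of theta_j^2 for j >= 1]. The function name `regularized_cost`, the variable `lam`, and the data are made up for illustration.

```python
import numpy as np

def regularized_cost(theta, X, y, lam):
    m = len(y)
    residual = X @ theta - y
    fit_term = residual @ residual / (2 * m)
    # theta[0] is skipped: by convention it is not penalized.
    penalty = lam * np.sum(theta[1:] ** 2) / (2 * m)
    return fit_term + penalty

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])
theta = np.array([0.0, 2.0])  # fits the made-up data exactly

print(regularized_cost(theta, X, y, lam=0.0))  # fit term only
print(regularized_cost(theta, X, y, lam=3.0))  # fit term plus penalty on theta_1
```

Even a perfect fit now has a nonzero cost when lambda > 0, which is what pushes the minimizer toward smaller parameters.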

## Regularized Linear Regression

![image-3.png](attachment:image-3.png)

![image-4.png](attachment:image-4.png)
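
A sketch of both ways to fit regularized linear regression, assuming the standard gradient descent update (with theta zero left unpenalized) and the standard regularized normal equation. The function names, `lam`, and the data are made up for illustration.

```python
import numpy as np

def gradient_step(theta, X, y, alpha, lam):
    """One regularized gradient descent update; theta[0] is not shrunk."""
    m = len(y)
    grad = X.T @ (X @ theta - y) / m
    reg = (lam / m) * theta
    reg[0] = 0.0
    return theta - alpha * (grad + reg)

def normal_equation(X, y, lam):
    """Solve (X^T X + lam * M) theta = X^T y, where M is the identity with M[0, 0] = 0."""
    M = np.eye(X.shape[1])
    M[0, 0] = 0.0
    return np.linalg.solve(X.T @ X + lam * M, X.T @ y)

# Made-up data lying exactly on the line y = 2x - 1.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0])

theta_gd = np.zeros(2)
for _ in range(2000):
    theta_gd = gradient_step(theta_gd, X, y, alpha=0.1, lam=1.0)

print("gradient descent, lam = 1:", theta_gd)
print("normal equation,  lam = 1:", normal_equation(X, y, 1.0))
print("normal equation,  lam = 0:", normal_equation(X, y, 0.0))
print("normal equation,  lam = 1e6:", normal_equation(X, y, 1e6))
```

With a very large lambda the slope is driven toward 0 and only the unpenalized intercept remains, which is the flat-line underfitting described earlier.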

## Regularized Logistic Regression

![image.png](attachment:image.png)

![image-2.png](attachment:image-2.png)

![image-3.png](attachment:image-3.png)
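
A sketch of the regularized logistic regression cost and gradient, assuming the standard forms: the penalty (lambda/2m) * sum of theta_j^2 for j >= 1 is added to the cost, and (lambda/m) * theta_j is added to each gradient component except the one for theta zero. The function name, `lam`, and the data are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def regularized_cost_and_gradient(theta, X, y, lam):
    m = len(y)
    h = sigmoid(X @ theta)
    # Copy of theta with theta_0 zeroed out, so theta_0 is never penalized.
    penalized = np.concatenate(([0.0], theta[1:]))
    J = (-(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / m
         + lam * (penalized @ penalized) / (2 * m))
    grad = X.T @ (h - y) / m + (lam / m) * penalized
    return J, grad

# Made-up data with x0 = 1 in the first column.
X = np.array([[1.0, 0.2, 1.5], [1.0, 1.1, 0.3],
              [1.0, 2.4, 2.2], [1.0, 3.0, 0.9]])
y = np.array([0.0, 0.0, 1.0, 1.0])
theta = np.array([0.5, -1.0, 2.0])

J0, grad0 = regularized_cost_and_gradient(theta, X, y, lam=0.0)
J1, grad1 = regularized_cost_and_gradient(theta, X, y, lam=1.0)
print("cost without / with regularization:", J0, J1)
print("gradient without regularization:", grad0)
print("gradient with regularization:   ", grad1)
```

The first gradient component is the same in both cases, because theta zero is excluded from the penalty. A function like this can be handed to gradient descent or to an advanced optimizer unchanged.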