# Week 3

## Logistic Regression Model

Want $0 \le h_\theta(x) \le 1$

So define $h_\theta(x) = g(\theta^T x)$  

where $g(z) = \frac{1}{1 + \exp(-z)}$  

<div>
<img src="attachment:image.png" width="500" align="center"/>  
</div>
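
As a minimal sketch of the definitions above, the logistic function $g$ and the hypothesis $h_\theta$ can be written in plain Python. The function names `sigmoid` and `hypothesis` are chosen here for illustration, and the branch on the sign of `z` is an added safeguard against overflow, not part of the definition.

```python
import math

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + exp(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For very negative z, exp(-z) would overflow; this form is equivalent.
    ez = math.exp(z)
    return ez / (1.0 + ez)

def hypothesis(theta, x):
    """h_theta(x) = g(theta^T x), with theta and x as plain lists."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return sigmoid(z)

print(sigmoid(0))                          # 0.5
print(hypothesis([0.0, 0.0], [1.0, 5.0]))  # 0.5, since theta^T x = 0
```

Whatever the value of $\theta^T x$, the output stays in the range $[0, 1]$, which is the property the model asks for.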


## Interpretation of Hypothesis Output

$h_\theta(x)$ = estimated probability that y=1 on input x

Example:  
Classify a tumor as malignant ($y=1$) or benign ($y=0$) from its size:  
$x=\begin{bmatrix}
x_0\\
x_1
\end{bmatrix}
= \begin{bmatrix}
1\\
\text{tumorSize}
\end{bmatrix}
$

$h_\theta(x) = 0.7$ means an estimated 70% chance that the tumor is malignant.

Same as:  

$h_\theta(x) = P(y=1 | x ; \theta)$  means "Probability that y = 1, given x, parameterized by theta"
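
The probabilistic reading can be illustrated with a short sketch. The parameter values and the tumor size below are hypothetical and chosen only so the arithmetic is easy to follow. Because $y$ is either 0 or 1, the probability of the other class is $1 - h_\theta(x)$.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

theta = [-4.0, 1.0]  # hypothetical parameters, not taken from the notes
x = [1.0, 5.0]       # x_0 = 1, tumorSize = 5 (arbitrary units)

p_malignant = sigmoid(sum(t * v for t, v in zip(theta, x)))  # P(y=1 | x; theta)
p_benign = 1.0 - p_malignant                                 # P(y=0 | x; theta)

print(round(p_malignant, 3), round(p_benign, 3))  # 0.731 0.269
```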


## Decision Boundary

<div>
<img src="attachment:image-3.png" width="500" align="center"/>  
</div>

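Note the parameter vector $\theta$ in the figure. The model predicts $y=1$ when $h_\theta(x) \ge 0.5$, i.e. when $\theta^T x \ge 0$; for the $\theta$ shown, this condition can be rewritten as $x_1 + x_2 \ge 3$.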
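
A small sketch of this decision rule, assuming $\theta = [-3, 1, 1]$. That value is an assumption made here because it reproduces $x_1 + x_2 \ge 3$; the text does not restate the $\theta$ from the figure. Since $g(z) \ge 0.5$ exactly when $z \ge 0$, the prediction only needs the sign of $\theta^T x$.

```python
def predict(theta, x1, x2):
    """Predict y = 1 when theta^T x >= 0, which is the same as h_theta(x) >= 0.5."""
    z = theta[0] + theta[1] * x1 + theta[2] * x2
    return 1 if z >= 0 else 0

theta = [-3.0, 1.0, 1.0]  # assumed parameters (see the note above)

for x1, x2 in [(1, 1), (2, 2), (1, 2)]:
    # The last column applies the rewritten rule x1 + x2 >= 3 directly.
    print((x1, x2), predict(theta, x1, x2), int(x1 + x2 >= 3))
```

The two right-hand columns agree on every point, including the point $(1, 2)$, which lies exactly on the boundary.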

## Logistic regression cost function

$Cost(h_\theta(x),y) = \begin{cases}
-\log(h_\theta(x)) &\text{if } y=1 \\
-\log(1-h_\theta(x)) &\text{if } y=0 \\ 
\end{cases}
$

<div>
    <img src="attachment:image.png" width="500" align="center"/>
    <img src="attachment:image-2.png" width="500" align="center"/>
</div>


Alternate form that satisfies both piecewise conditions:  
$Cost(h_\theta(x),y) = -y\log(h_\theta(x)) - (1-y)\log(1-h_\theta(x))$
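
To see why the single-line form matches the piecewise definition, here is a quick check with both written as functions. The names `cost_piecewise` and `cost_combined` are illustrative. When $y=1$ the second term vanishes, and when $y=0$ the first term vanishes.

```python
import math

def cost_piecewise(h, y):
    """Piecewise cost: -log(h) if y == 1, -log(1 - h) if y == 0."""
    return -math.log(h) if y == 1 else -math.log(1.0 - h)

def cost_combined(h, y):
    """Single-expression form: -y*log(h) - (1 - y)*log(1 - h)."""
    return -y * math.log(h) - (1 - y) * math.log(1.0 - h)

for h in (0.1, 0.5, 0.9):
    for y in (0, 1):
        print(h, y, cost_piecewise(h, y), cost_combined(h, y))
```

The cost is small when the prediction agrees with the label (for example $h = 0.9$, $y = 1$) and grows as the prediction moves toward the wrong class.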

$J(\theta)=\frac{1}{m} \sum_{i=1}^{m} Cost(h_\theta(x^{(i)}), y^{(i)})$  
$= -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)}\log(h_\theta(x^{(i)})) + (1-y^{(i)})\log(1-h_\theta(x^{(i)})) \right]$  

$J(\theta)$ is the overall cost function, averaged over all $m$ training examples. $Cost(h_\theta(x^{(i)}), y^{(i)})$ is the cost of a single example: the penalty for predicting $h_\theta(x^{(i)})$ when the true label is $y^{(i)}$.  
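
The overall cost can be sketched as an average of the per-example costs. The dataset below is hypothetical (four tumors with made-up sizes and labels), and `cost_J` is an illustrative name.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cost_J(theta, X, y):
    """J(theta) = -(1/m) * sum_i [ y_i*log(h_i) + (1 - y_i)*log(1 - h_i) ]."""
    m = len(y)
    total = 0.0
    for x_i, y_i in zip(X, y):
        h_i = sigmoid(sum(t * v for t, v in zip(theta, x_i)))
        total += y_i * math.log(h_i) + (1 - y_i) * math.log(1.0 - h_i)
    return -total / m

# Hypothetical data: each row is [x_0 = 1, tumorSize]
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 4.0], [1.0, 5.0]]
y = [0, 0, 1, 1]

print(cost_J([0.0, 0.0], X, y))   # log(2), about 0.693: every h is 0.5
print(cost_J([-3.0, 1.0], X, y))  # lower, because this theta separates the classes
```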

To fit the parameters $\theta$, minimize $J(\theta)$. Then, to make a prediction for a new $x$, output:

$h_\theta(x) = \frac{1}{1 + \exp(-\theta^T x)}$  (for logistic regression)  

To minimize $J(\theta)$ ($\alpha$ is the learning rate):  
repeat {  
    $\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)}$  
}  (simultaneously update all $\theta_j$)  

This update rule has the same form as the one for linear regression. The difference is that the hypothesis $h_\theta(x)$ is now the logistic function $g(\theta^T x)$ rather than the linear function $\theta^T x$.
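
The update loop might be sketched as follows. This version includes the $\frac{1}{m}$ factor that comes from differentiating $J(\theta)$ as it is defined above (an average over $m$ examples). The dataset, learning rate, and iteration count are hypothetical choices for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def h(theta, x):
    return sigmoid(sum(t * v for t, v in zip(theta, x)))

def cost_J(theta, X, y):
    m = len(y)
    return -sum(yi * math.log(h(theta, xi)) + (1 - yi) * math.log(1.0 - h(theta, xi))
                for xi, yi in zip(X, y)) / m

def gradient_descent(X, y, alpha=0.1, iterations=1000):
    m, n = len(y), len(X[0])
    theta = [0.0] * n
    for _ in range(iterations):
        errors = [h(theta, xi) - yi for xi, yi in zip(X, y)]
        # Compute every partial derivative first, then update all theta_j together.
        grad = [sum(e * xi[j] for e, xi in zip(errors, X)) / m for j in range(n)]
        theta = [t - alpha * g for t, g in zip(theta, grad)]
    return theta

# Hypothetical data: rows are [x_0 = 1, tumorSize]
X = [[1.0, float(s)] for s in range(1, 7)]
y = [0, 0, 0, 1, 1, 1]

theta = gradient_descent(X, y)
print(cost_J([0.0, 0.0], X, y), cost_J(theta, X, y))  # cost before and after
print(h(theta, [1.0, 1.0]), h(theta, [1.0, 6.0]))     # small tumor vs large tumor
```

Only `h` distinguishes this from gradient descent for linear regression; replacing it with a plain dot product would give the linear-regression update.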

<div>
    <img src="attachment:image.png" width="500" align="left"/>
    <img src="attachment:image-2.png" width="300" align="left"/>
</div>


## Advanced Optimization

Optimization Algorithms:
- Gradient descent
- Conjugate gradient
- BFGS
- L-BFGS

Latter 3:
- Advantages:
    - no need to manually pick alpha
    - often faster than gradient descent
- Disadvantages:
    - more complex
    
![image.png](attachment:image.png)

```python
def costFunction(theta):
    # code to compute J(theta)
    jVal = "..."

    # code to compute partial derivative d/(d theta_0) J(theta)
    gradient0 = "..."
    # code to compute partial derivative d/(d theta_1) J(theta)
    gradient1 = "..."

    # collect the partial derivatives into a single gradient list
    gradient = [gradient0, gradient1]
    return [jVal, gradient]
```
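
As one possible way to fill in such a template for logistic regression, the sketch below returns both $J(\theta)$ and its gradient, and checks the gradient against a finite-difference approximation. The helper names `make_cost_function` and `numeric_gradient`, the dataset, and the test point are all hypothetical; this is not the example from the figure.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def make_cost_function(X, y):
    """Return a function theta -> (jVal, gradient) for logistic regression."""
    m = len(y)

    def cost_function(theta):
        hs = [sigmoid(sum(t * v for t, v in zip(theta, xi))) for xi in X]
        jVal = -sum(yi * math.log(hi) + (1 - yi) * math.log(1.0 - hi)
                    for hi, yi in zip(hs, y)) / m
        gradient = [sum((hi - yi) * xi[j] for hi, yi, xi in zip(hs, y, X)) / m
                    for j in range(len(theta))]
        return jVal, gradient

    return cost_function

def numeric_gradient(f, theta, eps=1e-5):
    """Central-difference approximation of the gradient of f(theta)[0]."""
    grad = []
    for j in range(len(theta)):
        up, down = list(theta), list(theta)
        up[j] += eps
        down[j] -= eps
        grad.append((f(up)[0] - f(down)[0]) / (2 * eps))
    return grad

# Hypothetical data: rows are [x_0 = 1, feature]
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 4.0], [1.0, 5.0]]
y = [0, 0, 1, 1]

cost_function = make_cost_function(X, y)
jVal, gradient = cost_function([0.5, -0.25])
approx = numeric_gradient(cost_function, [0.5, -0.25])
print(jVal, gradient, approx)
```

A function of this shape, returning the cost together with its gradient, is the kind of input that library optimizers typically expect. For instance, `scipy.optimize.minimize` accepts such a function when called with `jac=True` and a method such as `"BFGS"`, `"L-BFGS-B"`, or `"CG"`; SciPy is not used in the sketch itself.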

## The problem of overfitting

### Regularization

$J(\theta) = \frac{1}{2m}\left[\sum_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)})^2 + \lambda \sum_{j=1}^{n} \theta_j^2\right]$

In the penalty term $\lambda \sum_{j=1}^{n} \theta_j^2$, $\lambda$ is the regularization parameter (the sum starts at $j=1$, so $\theta_0$ is not penalized). It controls the tradeoff between two goals:
1. Fit training data well. 
2. Keep parameters small.

If $\lambda$ is too large, the parameters $\theta_1, \dots, \theta_n$ are pushed toward zero and the model underfits. We say that the hypothesis has "too strong a preconception".
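
A minimal sketch of the regularized cost, assuming the linear hypothesis $h_\theta(x) = \theta^T x$ that goes with the squared-error form of $J(\theta)$ above. The data and the name `regularized_cost` are hypothetical. The example uses a $\theta$ that fits the data exactly, so any remaining cost comes from the penalty alone.

```python
def regularized_cost(theta, X, y, lam):
    """J(theta) = (1/2m) * [ sum of squared errors + lam * sum_{j>=1} theta_j^2 ]."""
    m = len(y)
    squared_error = sum(
        (sum(t * v for t, v in zip(theta, xi)) - yi) ** 2 for xi, yi in zip(X, y)
    )
    penalty = lam * sum(t ** 2 for t in theta[1:])  # theta_0 is not penalized
    return (squared_error + penalty) / (2 * m)

# Hypothetical data generated by y = 1 + 2*x_1
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [3.0, 5.0, 7.0]
theta = [1.0, 2.0]  # fits the data exactly

print(regularized_cost(theta, X, y, lam=0.0))  # 0.0
print(regularized_cost(theta, X, y, lam=3.0))  # 2.0
```

With $\lambda = 3$ the cost is $\frac{1}{2 \cdot 3}(0 + 3 \cdot 2^2) = 2$, so a perfect fit is still penalized for having a large $\theta_1$. This is the tradeoff between the two goals.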

### Regularized Linear Regression

Gradient descent for regularized linear regression.
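
Differentiating the regularized $J(\theta)$ above gives the update $\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) x_0^{(i)}$ for the intercept, and $\theta_j := \theta_j - \alpha \left[ \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)} + \frac{\lambda}{m} \theta_j \right]$ for $j \ge 1$. One step of this update might be sketched as follows; the function name, data, and hyperparameter values are hypothetical.

```python
def regularized_gd_step(theta, X, y, alpha, lam):
    """One simultaneous gradient-descent update for regularized linear regression."""
    m = len(y)
    errors = [sum(t * v for t, v in zip(theta, xi)) - yi for xi, yi in zip(X, y)]
    new_theta = []
    for j, t in enumerate(theta):
        grad = sum(e * xi[j] for e, xi in zip(errors, X)) / m
        if j >= 1:
            grad += (lam / m) * t  # the penalty applies to theta_1..theta_n only
        new_theta.append(t - alpha * grad)
    return new_theta

# Hypothetical data generated by y = 1 + 2*x_1
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [3.0, 5.0, 7.0]

# Starting from an exact fit, the error terms are zero, so only the penalty acts:
# theta_0 stays at 1.0 and theta_1 shrinks from 2.0 to about 1.8.
print(regularized_gd_step([1.0, 2.0], X, y, alpha=0.1, lam=3.0))
```

The regularized update for $j \ge 1$ can also be read as multiplying $\theta_j$ by $(1 - \alpha \frac{\lambda}{m})$ before applying the usual gradient step, which is why the parameter shrinks even when the fit is perfect.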



# Week 3 Assignment: Logistic Regression
