# Week 3

## 1. Classification and Representation

### - Classification
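* Classification problems are solved with **Logistic Regression**<br>
: Despite its name, Logistic Regression is a classification algorithm; Linear Regression is NOT suited to classification<br>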

### &nbsp; cf) Linear Regression vs Logistic Regression<br>
* **Similarities**<br>
&nbsp; - Both are supervised Machine Learning algorithms.<br>
&nbsp; - Both are parametric models, i.e. both compute their predictions from a linear combination of the features, $\theta^Tx$. <br>

* **Differences**<br>
&nbsp; - Linear Regression is used for *regression* problems, whereas Logistic Regression is used for *classification* problems.<br>
&nbsp; - Linear Regression provides a *continuous output*, while Logistic Regression outputs a probability that is thresholded into a *discrete* class label.<br>
&nbsp; - The purpose of Linear Regression is to find the *best-fitted line*, while Logistic Regression goes one step further and passes the linear output through the sigmoid curve.<br>
&nbsp; - The loss for Linear Regression is the *mean squared error*, whereas Logistic Regression uses the log loss (cross-entropy), which is derived from *maximum likelihood estimation*.

### - Hypothesis Representation
* Logistic Function : $h(x) = g(\theta^Tx) = \frac{1}{1+e^{-\theta^Tx}}$ &nbsp; $(0\le h(x) \le 1)$<br>
* $h(x)$ is the probability that $y=1$, given $x$.
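
The hypothesis above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function names `sigmoid` and `hypothesis` and the parameter values are assumptions chosen for the example, not part of the course material.

```python
import math

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def hypothesis(theta, x):
    """h(x) = g(theta^T x); x must already include the bias term x_0 = 1."""
    z = sum(t * v for t, v in zip(theta, x))
    return sigmoid(z)

theta = [-3.0, 1.0, 1.0]   # hypothetical parameters
x = [1.0, 2.0, 1.0]        # x_0 = 1, followed by two features
p = hypothesis(theta, x)   # theta^T x = 0, so p = 0.5
```

Note that `math.exp(-z)` overflows for very negative `z`, so a production version would need a numerically safer formulation.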

### - Decision Boundary (for discrete 0 or 1)
* $h(x) \ge 0.5 \rightarrow y = 1$
* $h(x) \lt 0.5 \rightarrow y = 0$
* The decision boundary is where $h(x)=g(\theta^Tx)=0.5 \Leftrightarrow \theta^Tx=0$
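
Because $g(z) \ge 0.5$ exactly when $z \ge 0$, a prediction only needs the sign of $\theta^Tx$. The sketch below illustrates this; the function name `predict` and the parameter vector are hypothetical and chosen so that the boundary is the line $x_1 + x_2 = 3$.

```python
def predict(theta, x):
    """Return 1 if theta^T x >= 0 (equivalently h(x) >= 0.5), else 0."""
    z = sum(t * v for t, v in zip(theta, x))
    return 1 if z >= 0 else 0

theta = [-3.0, 1.0, 1.0]                # hypothetical: boundary is x_1 + x_2 = 3
print(predict(theta, [1.0, 2.0, 2.0]))  # 1  (x_1 + x_2 = 4, above the boundary)
print(predict(theta, [1.0, 1.0, 1.0]))  # 0  (x_1 + x_2 = 2, below the boundary)
```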

## 2. Logistic Regression Model

### - Cost Function<br>
* We cannot use the same cost function that we use for linear regression: with the logistic hypothesis, the squared-error cost is non-convex and has many local optima.<br>
* $J(\theta) = \frac{1}{m} \sum_{i=1}^{m}Cost(h_\theta(x_i),y_i)$<br>
&nbsp; $Cost(h_\theta(x),y) = -y\log(h_\theta(x))-(1-y)\log(1-h_\theta(x))$ &nbsp; ($y$ = 0 or 1)<br>
&nbsp; Therefore, $J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}[y_i\log(h_\theta(x_i))+(1-y_i)\log(1-h_\theta(x_i))]$<br>
* This cost function guarantees that $J(\theta)$ is convex for logistic regression.
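
As a rough illustration, the cost $J(\theta)$ can be computed directly from its definition. The data set and the function name `cost` below are made up for the example; the code assumes every row of `X` starts with the bias term $x_0 = 1$.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cost(theta, X, y):
    """J(theta) = -(1/m) * sum[y*log(h) + (1-y)*log(1-h)]."""
    m = len(y)
    total = 0.0
    for x_i, y_i in zip(X, y):
        h = sigmoid(sum(t * v for t, v in zip(theta, x_i)))
        total += y_i * math.log(h) + (1 - y_i) * math.log(1 - h)
    return -total / m

# Made-up data: first column is x_0 = 1, second is a single feature.
X = [[1.0, 0.5], [1.0, 1.5], [1.0, 3.0], [1.0, 4.0]]
y = [0, 0, 1, 1]

J_zero = cost([0.0, 0.0], X, y)    # h = 0.5 for every example, so J = log(2)
J_fit = cost([-4.5, 2.0], X, y)    # a theta that separates the classes gives a lower cost
```

If $h_\theta(x)$ saturates to exactly 0 or 1 in floating point, `math.log` would fail, so this sketch is only meant for small, well-scaled inputs.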


### - Gradient Descent<br>
* $\theta_j := \theta_j - \alpha\frac{1}{m}\sum_{i=1}^{m}(h_\theta(x_i)-y_i)x_{ij}$ where $j$=0, 1, ..., n, and all $\theta_j$ are updated simultaneously<br>
* Notice that this algorithm is identical in form to the one we used in linear regression; only the definition of $h_\theta(x)$ differs.
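
The update rule above might be implemented as follows. This is a sketch on made-up data; the helper name `gradient_step`, the learning rate, and the iteration count are assumptions for the example only.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient_step(theta, X, y, alpha):
    """Apply the update rule once, to every theta_j simultaneously."""
    m = len(y)
    # errors[i] = h_theta(x_i) - y_i, computed with the old theta
    errors = [sigmoid(sum(t * v for t, v in zip(theta, x_i))) - y_i
              for x_i, y_i in zip(X, y)]
    return [t_j - alpha / m * sum(e * x_i[j] for e, x_i in zip(errors, X))
            for j, t_j in enumerate(theta)]

# Made-up data: first column is x_0 = 1, second is a single feature.
X = [[1.0, 0.5], [1.0, 1.5], [1.0, 3.0], [1.0, 4.0]]
y = [0, 0, 1, 1]

first = gradient_step([0.0, 0.0], X, y, alpha=0.1)   # approximately [0.0, 0.0625]

theta = [0.0, 0.0]
for _ in range(1000):
    theta = gradient_step(theta, X, y, alpha=0.1)
```

Building the whole new list from the old `theta` is what makes the update simultaneous: no $\theta_j$ is overwritten before the other gradients have been computed.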


### - Advanced Optimization<br>
* Conjugate gradient<br>
* BFGS<br>
* L-BFGS<br>
* Code in Octave/Matlab<br>
: function [jVal, gradient] = costFunction(theta)<br>
&nbsp; &nbsp; &nbsp; &nbsp; jVal = [...code to compute J(theta)...];<br>
&nbsp; &nbsp; &nbsp; &nbsp; gradient = [...code to compute derivative of J(theta)...];<br>
&nbsp; end<br>
&nbsp; % 'GradObj', 'on' tells fminunc that costFunction also returns the gradient<br>
&nbsp; options = optimset('GradObj', 'on', 'MaxIter', 100);<br>
&nbsp; initialTheta = zeros(2,1);<br>
&nbsp; % fminunc minimizes costFunction starting from initialTheta<br>
&nbsp; [optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);


## 3. Multiclass Classification

&nbsp; : We now have more than two categories. Instead of y = {0, 1}, we have y = {0, 1, ..., n}.<br>
### - One vs All(Rest)<br>
* Split the problem into n+1 binary classification problems, one per class: in each, one class is relabeled 1 and all the others 0.<br>
&nbsp; i.e. y=0 $\rightarrow$ 1, &nbsp; &nbsp; y=1, 2, ... , n $\rightarrow$ 0<br>
&nbsp; &nbsp; &nbsp; y=1 $\rightarrow$ 1, &nbsp; &nbsp; y=0, 2, ... , n $\rightarrow$ 0<br>
&nbsp; &nbsp; &nbsp; ...<br>
&nbsp; &nbsp; &nbsp; y=n $\rightarrow$ 1, &nbsp; &nbsp; y=0, 1, ... , n-1 $\rightarrow$ 0<br>
* Training each classifier gives $h_\theta^{(i)}(x)$, the probability that $y=i$, for $i$ = 0, 1, ..., n.<br>
* For a new input $x$, compute each of $h_\theta^{(0)}(x), h_\theta^{(1)}(x), ..., h_\theta^{(n)}(x)$ and choose the class with the highest probability.<br>
&nbsp; &nbsp; $\Rightarrow prediction = \arg\max_{i}(h_\theta^{(i)}(x))$ 
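
The prediction step of one-vs-all can be sketched as below. The parameter vectors in `thetas` are hypothetical stand-ins for classifiers that have already been trained, and the function name `predict_one_vs_all` is an assumption for this example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_one_vs_all(thetas, x):
    """thetas[i] holds the parameters of the binary classifier for class i."""
    probs = [sigmoid(sum(t * v for t, v in zip(theta, x))) for theta in thetas]
    # The predicted class is the index of the largest probability.
    best = max(range(len(probs)), key=lambda i: probs[i])
    return best, probs

# Hypothetical, already-trained parameters for classes 0, 1 and 2.
thetas = [[2.0, -1.0], [0.0, 0.0], [-4.0, 1.0]]

label_a, _ = predict_one_vs_all(thetas, [1.0, 0.0])   # class 0
label_b, _ = predict_one_vs_all(thetas, [1.0, 3.0])   # class 1
label_c, _ = predict_one_vs_all(thetas, [1.0, 5.0])   # class 2
```

The individual probabilities do not need to sum to 1, since each classifier is trained independently; only their ranking matters for the prediction.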



## 4. Solving the Problem of Overfitting

### - Options to correct Overfitting<br>
1) Reduce the number of features<br>
&nbsp; - Manually select which features to keep.<br>
&nbsp; - Use a model selection algorithm.<br>
2) Regularization<br>
&nbsp; - Keep all the features, but reduce the magnitude of parameters $\theta_j$.<br>
&nbsp; - Regularization works well when we have a lot of slightly useful features.

### - Modifying Cost Function<br>
* If our hypothesis function overfits, we can reduce the weight that some of its terms carry by penalizing their parameters in the cost function.<br>
* $\min_{\theta}J(\theta) = \min_{\theta}\frac{1}{2m} [\sum_{i=1}^{m}(h_\theta(x_i)-y_i)^2 + \lambda\sum_{j=1}^{n}\theta_j^2]$ &nbsp; (the penalty runs over $j$=1, ..., n, so $\theta_0$ is not penalized)<br>
* $\lambda$ : regularization parameter<br>
* If $\lambda$ is chosen to be too large, it may smooth out the function too much and cause underfitting.
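
To make the effect of $\lambda$ concrete, here is a small sketch of the regularized squared-error cost. The data and the name `regularized_cost` are made up for illustration; `theta[0]` is deliberately left out of the penalty.

```python
def regularized_cost(theta, X, y, lam):
    """J = (1/2m) * [sum of squared errors + lam * sum_{j>=1} theta_j^2]."""
    m = len(y)
    sq_err = sum((sum(t * v for t, v in zip(theta, x_i)) - y_i) ** 2
                 for x_i, y_i in zip(X, y))
    penalty = lam * sum(t ** 2 for t in theta[1:])   # theta_0 is skipped
    return (sq_err + penalty) / (2 * m)

# Made-up data that theta = [0, 2] fits exactly (y = 2 * x_1).
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [2.0, 4.0, 6.0]
theta = [0.0, 2.0]

print(regularized_cost(theta, X, y, lam=0.0))  # 0.0
print(regularized_cost(theta, X, y, lam=3.0))  # 2.0
```

Even a perfect fit has a non-zero cost once $\lambda > 0$, which is what pushes the optimizer toward smaller parameters.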



### - Regularized Linear Regression<br>
* Since the modified cost function doesn't penalize $\theta_0$, we will modify our gradient descent function to separate out $\theta_0$ from the rest of the parameters.<br>
* Repeat{$$\theta_0 := \theta_0 - \alpha\frac{1}{m}\sum_{i=1}^{m}(h_\theta(x_i)-y_i)x_{i0}$$
$$\theta_j := \theta_j - \alpha[(\frac{1}{m}\sum_{i=1}^{m}(h_\theta(x_i)-y_i)x_{ij})+\frac{\lambda}{m}\theta_j]$$
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; where ($j$=1, ..., n)
}<br>
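* Equivalently, $\theta_j := \theta_j(1-\alpha\frac{\lambda}{m}) - \alpha\frac{1}{m}\sum_{i=1}^{m}(h_\theta(x_i)-y_i)x_{ij}$ &nbsp; &nbsp; where ($j$=1, ..., n). Since $1-\alpha\frac{\lambda}{m} < 1$, each update first shrinks $\theta_j$ slightly and then applies the usual gradient step.<br>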
* **Normal Equation**<br>
&nbsp; $\theta = (X^TX + \lambda L)^{-1}X^Ty$<br>
&nbsp; where $L = \begin{bmatrix}
0 & 0 & 0 & 0\\
0 & 1 & 0 & 0\\
0 & 0 & \ddots & 0\\
0 & 0 & 0 & 1
\end{bmatrix}$<br>
* L is a matrix with 0 at the top left and 1's down the diagonal, with 0's everywhere else. It should have dimension (n+1)×(n+1).
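
The regularized gradient-descent update for linear regression can be sketched as follows, using the "shrink, then step" form. The function name `regularized_step` and the data are assumptions made for this example.

```python
def regularized_step(theta, X, y, alpha, lam):
    """One simultaneous update of all theta_j; theta_0 is not penalized."""
    m = len(y)
    # errors[i] = h_theta(x_i) - y_i with the linear hypothesis theta^T x
    errors = [sum(t * v for t, v in zip(theta, x_i)) - y_i
              for x_i, y_i in zip(X, y)]
    new_theta = []
    for j, t_j in enumerate(theta):
        grad = sum(e * x_i[j] for e, x_i in zip(errors, X)) / m
        shrink = 1.0 if j == 0 else 1.0 - alpha * lam / m
        new_theta.append(t_j * shrink - alpha * grad)
    return new_theta

# Made-up data that theta = [0, 2] fits exactly, so the data gradient is zero.
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [2.0, 4.0, 6.0]

step = regularized_step([0.0, 2.0], X, y, alpha=0.1, lam=3.0)  # approximately [0.0, 1.8]
```

With a zero data gradient, the only change is the shrink factor $1 - \alpha\frac{\lambda}{m} = 0.9$ applied to $\theta_1$, while $\theta_0$ is left untouched.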

### - Regularized Logistic Regression<br>
* Modified Cost Function<br>
&nbsp; $J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}[y_i\log(h_\theta(x_i))+(1-y_i)\log(1-h_\theta(x_i))]+\frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$ &nbsp; &nbsp; ($j$=1, ..., n)<br>
* Gradient descent (same form as for Linear Regression, but with the logistic $h_\theta(x)$)<br>
&nbsp; Repeat{$$\theta_0 := \theta_0 - \alpha\frac{1}{m}\sum_{i=1}^{m}(h_\theta(x_i)-y_i)x_{i0}$$
$$\theta_j := \theta_j - \alpha[(\frac{1}{m}\sum_{i=1}^{m}(h_\theta(x_i)-y_i)x_{ij})+\frac{\lambda}{m}\theta_j]$$
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; where ($j$=1, ..., n), $h(x) = \frac{1}{1+e^{-\theta^Tx}}$
}<br>
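
A minimal sketch of the regularized logistic cost together with its gradient is given below. Returning both values from one function mirrors the `[jVal, gradient]` pattern used with `fminunc`; the name `cost_and_gradient` and the tiny data set are assumptions chosen so the result can be checked by hand.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cost_and_gradient(theta, X, y, lam):
    """Regularized logistic cost J(theta) and its gradient (theta_0 unpenalized)."""
    m = len(y)
    h = [sigmoid(sum(t * v for t, v in zip(theta, x_i))) for x_i in X]
    data_term = -sum(y_i * math.log(h_i) + (1 - y_i) * math.log(1 - h_i)
                     for h_i, y_i in zip(h, y)) / m
    penalty = lam / (2 * m) * sum(t ** 2 for t in theta[1:])
    grad = []
    for j, t_j in enumerate(theta):
        g = sum((h_i - y_i) * x_i[j] for h_i, y_i, x_i in zip(h, y, X)) / m
        if j >= 1:
            g += lam / m * t_j   # regularization term, skipped for theta_0
        grad.append(g)
    return data_term + penalty, grad

# Contrived data: theta^T x = 0 for both examples, so h = 0.5 everywhere.
X = [[1.0, 2.0], [1.0, 2.0]]
y = [0, 1]
J, grad = cost_and_gradient([-2.0, 1.0], X, y, lam=4.0)
# J = log(2) + 1 (data term + penalty), grad = [0.0, 2.0]
```

A gradient-descent step is then `theta_j - alpha * grad[j]` for every `j`, which matches the update rule in the Repeat block.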