- Two possible classes/categories
- Output (y) can be one of two values
- Only 2 possible output values, e.g., yes/no, 1/0, true/false, positive/negative, presence/absence
- Here, positive and negative don't mean good or bad; instead, the positive class means the presence and the negative class means the absence of something.
- The decision boundary is the dividing line where we are neutral. We classify data points according to which side of the boundary they fall on.
- Adding a single data point (say, far to the right) can change the decision boundary, i.e., our conclusion about how to classify data, e.g., malignant tumor vs. benign tumor.
- The problem is that the decision boundary isn't generalized and keeps changing as you add more examples. A shifting decision boundary can end up being a worse function for the classification problem.
- Our goal is to come up with a generalized decision boundary that doesn't change significantly when new examples are added to the dataset.
- Sometimes, Linear Regression can work for classification, but often it doesn't work very well; a minimal demonstration follows below.
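Below is a small numpy sketch of that failure mode, using a hypothetical 1-D tumor-size dataset (the numbers are illustrative, not from the course). A least-squares line thresholded at 0.5 separates the first five points correctly, but one extra, clearly malignant example far to the right drags the boundary rightward and misclassifies previously correct points:

```python
import numpy as np

# Hypothetical 1-D data: tumor size vs. label (0 = benign, 1 = malignant)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0, 0, 0, 1, 1])

def boundary(x, y):
    """Fit a least-squares line f(x) = w*x + b and return where it crosses 0.5."""
    w, b = np.polyfit(x, y, 1)
    return (0.5 - b) / w

print(boundary(x, y))        # ~3.33: splits the two classes correctly

# Add one more (clearly malignant) example far to the right
x2 = np.append(x, 20.0)
y2 = np.append(y, 1)
print(boundary(x2, y2))      # ~5.83: x = 4 and x = 5 are now misclassified
```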
- Also known as the logistic function.
- The formula for the sigmoid function is as follows:

$$g(z) = \frac{1}{1 + e^{-z}}$$
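A direct numpy translation of the formula above (the function name `sigmoid` is just illustrative):

```python
import numpy as np

def sigmoid(z):
    """g(z) = 1 / (1 + e^{-z}); maps any real z to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))                       # 0.5, the neutral point
print(sigmoid(np.array([-10.0, 10.0])))   # [~0.00005, ~0.99995]; works on vectors too
```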
- In the case of logistic regression, z (the input to the sigmoid function) is the output of a linear regression model.
- In the case of a single example, $z$ is a scalar.
- In the case of multiple examples, $z$ may be a vector consisting of $m$ values, one for each example.
- Sigmoid function >> a non-linear component that adds non-linearity to the output (z) of linear regression and turns it into logistic regression; combined with polynomial features, it becomes capable of drawing non-linear decision boundaries.
- A classification algorithm that outputs the probability of y = 1.
- Works over the Sigmoid function.
- Built in 2 steps (sketched in code below):
    - Calculate z(x), i.e., f(x) for linear regression
    - Calculate g(z) = sigmoid(z)
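A minimal numpy sketch of the 2-step build; the parameters w, b and the feature matrix X are hypothetical values chosen for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, w, b):
    """Logistic regression in 2 steps, for m examples with n features."""
    z = X @ w + b          # step 1: linear regression output, shape (m,)
    return sigmoid(z)      # step 2: squash to probabilities P(y = 1 | x)

w = np.array([1.0, -2.0])
b = 0.5
X = np.array([[1.0, 0.5],
              [2.0, 1.0],
              [0.0, 3.0]])

probs = predict_proba(X, w, b)
print(probs)                        # probabilities of y = 1
print((probs >= 0.5).astype(int))   # class predictions with a 0.5 threshold
```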
- In the case of linear regression, the squared error cost function generates a convex curve, i.e., a curve whose single local minimum is also the global minimum.
- In the case of logistic regression, the squared error cost function generates a non-convex curve, i.e., a curve having multiple local minima, so gradient descent can get stuck in a local minimum and never reach the global minimum.
- That's why the squared error cost function is not suitable for logistic regression. Thus, we use another loss function, given as:

$$\mathrm{loss}(f_{\mathbf{w},b}(\mathbf{x}^{(i)}), y^{(i)}) = -y^{(i)} \log\left(f_{\mathbf{w},b}\left(\mathbf{x}^{(i)}\right)\right) - \left(1 - y^{(i)}\right) \log\left(1 - f_{\mathbf{w},b}\left(\mathbf{x}^{(i)}\right)\right)$$
- The above loss function is also known as the Logistic Loss function.
- The curve of the logistic loss function is well suited to gradient descent! It does not have plateaus, local minima, or discontinuities. Note that it is not necessary to always have a bowl-shaped curve as in the case of squared error. A numeric sketch of this loss follows below.
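A small numpy sketch of the logistic loss formula above; note how it punishes a confident wrong prediction far more heavily than a confident correct one:

```python
import numpy as np

def logistic_loss(f, y):
    """Per-example loss: -y*log(f) - (1-y)*log(1-f), for a prediction f in (0, 1)."""
    return -y * np.log(f) - (1 - y) * np.log(1 - f)

print(logistic_loss(0.99, 1))   # ~0.01: confident and correct -> tiny loss
print(logistic_loss(0.01, 1))   # ~4.61: confident and wrong   -> huge loss
```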
- Loss is a measure of the difference between a single example's prediction and its target value.
- Cost is a measure of the losses over the whole training set (their average); see the sketch below.
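A minimal sketch of the cost J(w,b) as the average logistic loss over all m training examples (function name and data are illustrative):

```python
import numpy as np

def compute_cost(X, y, w, b):
    """J(w,b): average logistic loss over all m training examples."""
    m = X.shape[0]
    f = 1.0 / (1.0 + np.exp(-(X @ w + b)))             # model outputs in (0, 1)
    losses = -y * np.log(f) - (1 - y) * np.log(1 - f)  # per-example logistic loss
    return losses.sum() / m                            # average over the training set

# Tiny hypothetical dataset and parameters
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([0, 0, 0, 1, 1])
print(compute_cost(X, y, np.array([1.0]), -3.0))
```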
- Regularization is used to reduce overfitting.
Overfitting
- High Variance means high variation in the fitted curve as the training data changes.
- Model fits the training set extremely well but fails to generalize to new examples.
- Very complex Decision Boundary.
- f(x) has a High-Order Polynomial equation.
Underfitting
- High Bias means the model's predictions are systematically off because it makes overly strong assumptions about the data (e.g., that it is linear).
- Model does not fit the training set well.
- f(x) has a Linear equation. (A quick comparison of both failure modes is sketched below.)
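A quick numpy illustration of the contrast, assuming roughly linear data: a degree-1 fit keeps some training error but extrapolates sensibly, while a high-order polynomial drives the training error toward zero yet typically gives a wild prediction just outside the training range (data and degrees are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 8)
y = x + rng.normal(0.0, 0.1, size=8)     # roughly linear data plus noise

for degree in (1, 6):                    # Linear f(x) vs. High-Order Polynomial f(x)
    coeffs = np.polyfit(x, y, degree)
    train_mse = np.mean((np.polyval(coeffs, x) - y) ** 2)
    pred = np.polyval(coeffs, 1.2)       # a point just outside the training range
    print(degree, train_mse, pred)
```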
- Our goal is to achieve Generalization. We want to generalize the model, i.e., the model should perform well even on examples not used in training.
- We want a Just Right generalized model.
- Interchangeably used terms:
- High Bias == Underfit
- High Variance == Overfit
- Model Function / Equation: f(x)
- Cost Function: J(w,b)
- How to minimize cost using Gradient Descent or some other algorithm: min J(w,b) (a full sketch follows below)
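A compact sketch of the whole pipeline, minimizing J(w,b) with batch gradient descent; the learning rate, iteration count, and data are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, alpha=0.1, iters=1000):
    """Minimize J(w,b) for logistic regression with batch gradient descent."""
    m, n = X.shape
    w = np.zeros(n)
    b = 0.0
    for _ in range(iters):
        f = sigmoid(X @ w + b)           # predictions, shape (m,)
        err = f - y                      # gradient has the same form as in linear regression
        w -= alpha * (X.T @ err) / m     # dJ/dw
        b -= alpha * err.sum() / m       # dJ/db
    return w, b

# Tiny 1-D example: tumor sizes and labels (hypothetical data)
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([0, 0, 0, 1, 1])
w, b = gradient_descent(X, y)
print(w, b)                              # decision boundary lies where w*x + b = 0
```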