
The Idea


  • Gradient Descent is a generic optimization algorithm capable of finding optimal solutions to a wide range of problems.
  • The general idea of Gradient Descent is to tweak parameters iteratively in order to minimize a cost function.

Example 01

  • Suppose you are lost in the mountains in a dense fog, and you can only feel the slope of the ground below your feet.
  • A good strategy to get to the bottom of the valley quickly is to go downhill in the direction of the steepest slope.
  • This is exactly what Gradient Descent does: it measures the local gradient of the error function with regard to the parameter vector θ, and it goes in the direction of descending gradient.
  • Once the gradient is zero, you have reached a minimum!
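This downhill behavior can be checked numerically on a one-dimensional cost; the quadratic below is an assumed example, not from the text:

```python
def f(theta):
    """An example cost function with its minimum at theta = 3."""
    return (theta - 3) ** 2

def grad(theta):
    """Derivative of f."""
    return 2 * (theta - 3)

theta = 0.0
step = -0.1 * grad(theta)          # move against the gradient, i.e. downhill
print(f(theta + step) < f(theta))  # True: one step downhill lowers the cost
print(grad(3.0))                   # 0.0: the gradient vanishes at the minimum
```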

Steps

  • First, start by filling θ with random values (this is called random initialization).
  • Then you improve it gradually, taking one baby step at a time, each step attempting to decrease the cost function (e.g., the MSE), until the algorithm converges to a minimum.

In this depiction of Gradient Descent, the model parameters are initialized randomly and tweaked repeatedly to minimize the cost function. The learning step size is proportional to the slope of the cost function, so the steps gradually get smaller as the parameters approach the minimum.
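The steps above can be sketched as a minimal batch Gradient Descent loop on synthetic linear data; the data, learning rate, and iteration count here are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(42)
X = np.c_[np.ones(100), rng.uniform(0, 2, 100)]          # bias column + one feature
y = X @ np.array([4.0, 3.0]) + rng.normal(0, 0.5, 100)   # true parameters: [4, 3]

theta = rng.normal(size=2)   # random initialization
eta = 0.1                    # learning rate
for _ in range(1000):        # baby steps until (approximate) convergence
    gradients = (2 / len(X)) * X.T @ (X @ theta - y)     # gradient of the MSE
    theta -= eta * gradients

print(theta)                 # close to the true [4, 3]
```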

Learning rate


  • An important parameter in Gradient Descent is the size of the steps, determined by the learning rate hyperparameter.

Too Small

  • If the learning rate is too small, then the algorithm will have to go through many iterations to converge, which will take a long time.

Too Large

  • If the learning rate is too high, you might jump across the valley and end up on the other side, possibly even higher up than you were before.
  • This might make the algorithm diverge, with larger and larger values, failing to find a good solution.
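Both failure modes are easy to reproduce on a one-dimensional cost; the function, starting point, and learning rates below are assumed for illustration:

```python
def descend(eta, steps=50, theta=10.0):
    """Minimize f(theta) = theta**2 (gradient: 2*theta) starting from theta."""
    for _ in range(steps):
        theta -= eta * 2 * theta
    return theta

print(descend(0.001))  # too small: after 50 steps theta has barely moved from 10
print(descend(0.3))    # reasonable: theta is essentially 0, the minimum
print(descend(1.1))    # too large: each step overshoots further; theta diverges
```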

The Problem: Local Minima and Plateaus


  • Not all cost functions are nice, regular bowls: if the random initialization starts the algorithm in the basin of a local minimum, it will converge to that local minimum, which is not as good as the global minimum.
  • If it starts on a long plateau, it will take a very long time to cross it, and if you stop training too early you will never reach the global minimum.
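The local-minimum trap can be reproduced on a small non-convex cost; the polynomial below is a made-up example whose local minimum sits to the left of its global minimum:

```python
def gd(theta, eta=0.02, steps=500):
    """Gradient descent on f(t) = t**4 - 3*t**2 - t, which has two minima."""
    for _ in range(steps):
        theta -= eta * (4 * theta**3 - 6 * theta - 1)  # f'(theta)
    return theta

print(gd(-2.0))  # starts on the left:  converges to the local minimum (~ -1.13)
print(gd(2.0))   # starts on the right: converges to the global minimum (~ 1.30)
```

Which minimum you end up in depends entirely on the random initialization, which is the point of this section.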

GD in Linear Regression


  • Fortunately, the MSE cost function for a Linear Regression model happens to be a convex function, which means that if you pick any two points on the curve, the line segment joining them never crosses the curve.
  • This implies that there are no local minima, just one global minimum.
  • It is also a continuous function with a slope that never changes abruptly.
  • These two facts have a great consequence: Gradient Descent is guaranteed to get arbitrarily close to the global minimum (if you wait long enough and if the learning rate is not too high).
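Convexity means Gradient Descent and the closed-form least-squares solution should land on the same parameters; here is a small check on synthetic data (the data and hyperparameters are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.c_[np.ones(200), rng.uniform(-1, 1, 200)]        # bias column + one feature
y = X @ np.array([2.0, -1.0]) + rng.normal(0, 0.1, 200)

theta = np.zeros(2)
for _ in range(2000):
    theta -= 0.1 * (2 / len(X)) * X.T @ (X @ theta - y)  # batch GD on the MSE

theta_exact, *_ = np.linalg.lstsq(X, y, rcond=None)  # closed-form global minimum
print(np.allclose(theta, theta_exact, atol=1e-6))    # True: same unique minimum
```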

When using Gradient Descent, you should ensure that all features have a similar scale (e.g., using Scikit-Learn’s StandardScaler class), or else it will take much longer to converge.
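A minimal sketch of that advice, using SGDRegressor as one example of a Gradient-Descent-based Scikit-Learn estimator, on synthetic features with very different scales (all names and numbers below are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = np.c_[rng.uniform(0, 1, 200), rng.uniform(0, 10_000, 200)]  # wildly different scales
y = 3 * X[:, 0] + 0.001 * X[:, 1] + rng.normal(0, 0.1, 200)

# StandardScaler rescales every feature to zero mean and unit variance,
# so Gradient Descent takes comparable step sizes along every axis.
model = make_pipeline(StandardScaler(), SGDRegressor(max_iter=1000, random_state=42))
model.fit(X, y)
print(model.score(X, y))  # R^2 is high once the features share a scale
```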