### Quadratic cost function (MSE)

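$ C\big(w,b\big)\;\equiv\;\frac{1}{2n}\displaystyle\sum_{x}\|y(x)-a\|^{2} $

Here $ n $ is the number of training inputs, $ y(x) $ is the desired output for input $ x $, $ a $ is the output the network actually produces for $ x $, and the sum runs over all training inputs.

$ C\big(w,b\big) $ is non-negative, since every term in the sum is non-negative. Our training algorithm has done a good job if it can find weights and biases so that $ C\big(w,b\big)\;\approx\;0 $. In other words, we want to find a set of weights and biases which make the cost as small as possible. We'll do that using an algorithm known as gradient descent.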
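
As a minimal sketch of the cost defined above, here is how it could be computed with NumPy. The function name `quadratic_cost` and the array layout (one row per training input) are assumptions made for illustration; they are not part of the original notes.

```python
import numpy as np

def quadratic_cost(outputs, targets):
    """Quadratic (MSE) cost: 1/(2n) times the summed squared error.

    outputs: shape (n, d), the network's output for each training input
    targets: shape (n, d), the desired output y(x) for each training input
    """
    outputs = np.asarray(outputs, dtype=float)
    targets = np.asarray(targets, dtype=float)
    n = outputs.shape[0]  # number of training inputs
    return np.sum((targets - outputs) ** 2) / (2 * n)

# Two training inputs with two-dimensional targets.
targets = np.array([[1.0, 0.0], [0.0, 1.0]])

# Outputs that match the targets exactly give zero cost.
perfect = quadratic_cost(targets, targets)

# Any mismatch gives a strictly positive cost.
off = quadratic_cost(np.array([[0.5, 0.5], [0.0, 1.0]]), targets)
```

The cost is zero only when every output matches its target, which is why driving it towards zero is a sensible training goal.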

Let's suppose we're trying to minimize some function, $ C(v) $. This could be any real-valued function of many variables, $ v = v_{1}, v_{2}, \ldots $. Note that I've replaced the $ w $ and $ b $ notation by $ v $ to emphasize that this could be any function; we're not specifically thinking in the neural networks context any more. To minimize $ C(v) $ it helps to imagine $ C $ as a function of just two variables, which we'll call $ v_{1} $ and $ v_{2} $:

![image.png](attachment:image.png)

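What we'd like is to find where $ C $ achieves its global minimum. One way to think about this is to picture $ C $ as a valley and imagine a ball rolling down its slope.  
Let's think about what happens when we move the ball a small amount $ \Delta v_{1} $ in the $ v_{1} $ direction, and a small amount $ \Delta v_{2} $ in the $ v_{2} $ direction. Calculus tells us that $ C $ changes as follows: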

$ \Delta C\;\approx\;\frac{\partial C}{\partial v_{1}}\Delta v_{1}\;+\;\frac{\partial C}{\partial v_{2}}\Delta v_{2} $

We're going to find a way of choosing $ \Delta v_{1} $ and $ \Delta v_{2} $ so as to make $ \Delta C $ negative; i.e., we'll choose them so the ball is rolling down into the valley. To figure out how to make such a choice it helps to define $ \Delta v $ to be the vector of changes in $ v $:

$ \Delta v\;\equiv\;(\Delta v_{1},\;\Delta v_{2})^{T} $  

We'll also define the gradient of $ C $ to be the vector of partial derivatives, $ \big(\frac{\partial C}{\partial v_{1}},\;\frac{\partial C}{\partial v_{2}}\big)^{T} $. We denote the gradient vector by $ \nabla C $, i.e.:

$ \nabla C\;=\; \big(\frac{\partial C}{\partial v_{1}},\;\frac{\partial C}{\partial v_{2}}\big)^{T} $

With these definitions, the expression for $ \Delta C $ can be rewritten as

$ \Delta C\;\approx\;\nabla C\;\cdot\;\Delta v $
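
To see this approximation at work, here is a small numerical check. The function $ C(v) = v_{1}^{2} + 3v_{2}^{2} $, the starting point, and the displacement are all hypothetical choices made purely for illustration.

```python
import numpy as np

def C(v):
    # Hypothetical example function of two variables: C(v) = v1^2 + 3*v2^2
    return v[0] ** 2 + 3.0 * v[1] ** 2

def grad_C(v):
    # Analytic gradient (dC/dv1, dC/dv2) of the example function
    return np.array([2.0 * v[0], 6.0 * v[1]])

v = np.array([1.0, -2.0])
delta_v = np.array([1e-3, 2e-3])  # a small displacement

actual = C(v + delta_v) - C(v)    # true change in C
predicted = grad_C(v) @ delta_v   # first-order estimate: grad C . delta v
error = abs(actual - predicted)   # shrinks quadratically as delta_v shrinks
```

For this example the discrepancy between `actual` and `predicted` is second order in the size of `delta_v`, so the estimate becomes very accurate for small moves.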

This equation helps explain why $ \nabla C $ is called the gradient vector: $ \nabla C $ relates changes in $ v $ to changes in $C$, just as we'd expect something called a gradient to do. But what's really exciting about the equation is that it lets us see how to choose $ \Delta v $ so as to make $ \Delta C $ negative. In particular, suppose we choose

$ \Delta v\;=\;-\eta\nabla C $

where $ \eta $ is a small, positive parameter (known as the learning rate).   
Then the equation $ \Delta C\;\approx\;\nabla C\;\cdot\;\Delta v $ tells us that $ \Delta C\;\approx\;-\eta\nabla C\;\cdot\;\nabla C\;=\;-\eta\|\nabla C\|^{2} $.   
Because $ \|\nabla C\|^{2}\;\geq\;0 $, this guarantees that $ \Delta C\;\leq\;0 $, i.e., $ C $ will always decrease, never increase, if we change $ v $ according to the prescription in the equation above (within, of course, the limits of the approximation $ \Delta C\;\approx\;\nabla C\;\cdot\;\Delta v $).   
This is exactly the property we wanted! And so we'll take Equation $ \Delta v\;=\;-\eta\nabla C $ to define the "law of motion" for the ball in our gradient descent algorithm. That is, we'll use Equation $ \Delta v\;=\;-\eta\nabla C $ to compute a value for $ \Delta v $, then move the ball's position $ v $ by that amount:

$ v\;\to\;v^{'}\;=\;v-\eta\nabla C $ 
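
A single application of this update rule can be sketched as follows. The example function $ C(v) = v_{1}^{2} + 3v_{2}^{2} $ and the values of $ \eta $ are assumptions chosen for illustration. The sketch compares the actual change in $ C $ with the predicted change $ -\eta\|\nabla C\|^{2} $, and also shows that an overly large $ \eta $ can leave the region where the approximation holds.

```python
import numpy as np

def C(v):
    # Hypothetical example function: C(v) = v1^2 + 3*v2^2
    return v[0] ** 2 + 3.0 * v[1] ** 2

def grad_C(v):
    # Analytic gradient of the example function
    return np.array([2.0 * v[0], 6.0 * v[1]])

eta = 0.01                       # small learning rate (assumed value)
v = np.array([1.0, -2.0])
g = grad_C(v)

v_new = v - eta * g              # one step of the update rule
actual_change = C(v_new) - C(v)  # true change in C after the step
predicted_change = -eta * (g @ g)  # first-order prediction: -eta * ||grad C||^2

# With a learning rate that is too large, the linear approximation
# no longer holds and the cost can go up instead of down.
too_big_change = C(v - 0.5 * g) - C(v)
```

With the small learning rate both `actual_change` and `predicted_change` are negative and close to each other, while `too_big_change` is positive for this example.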

Then we'll use this update rule again, to make another move. If we keep doing this, over and over, we'll keep decreasing $ C $ until - we hope - we reach a global minimum.  
  
Summing up, the way the gradient descent algorithm works is to repeatedly compute the gradient $ \nabla C $, and then to move in the opposite direction, "falling down" the slope of the valley. We can visualize it like this:

![image.png](attachment:image.png)
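
The loop summarized above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not a production optimizer: the helper name `gradient_descent`, the example function, the starting point, the learning rate, and the fixed step count are all hypothetical choices.

```python
import numpy as np

def gradient_descent(grad, v0, eta=0.1, steps=100):
    """Repeatedly apply v -> v - eta * grad(v); return every position visited."""
    v = np.asarray(v0, dtype=float)
    path = [v.copy()]
    for _ in range(steps):
        v = v - eta * grad(v)  # move against the gradient
        path.append(v.copy())
    return np.array(path)

def C(v):
    # Hypothetical example function with its minimum at the origin
    return v[0] ** 2 + 3.0 * v[1] ** 2

def grad_C(v):
    # Analytic gradient of the example function
    return np.array([2.0 * v[0], 6.0 * v[1]])

path = gradient_descent(grad_C, v0=[1.0, -2.0], eta=0.1, steps=100)
costs = [C(p) for p in path]  # cost at each visited position
```

For this bowl-shaped example the cost never increases from one step to the next, and the final position is very close to the minimum at the origin. For a general $ C $, the same loop may settle in a local rather than a global minimum.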