## Checking gradient descent for convergence

* To converge means to find parameters $\vec{w}, b$ that minimize the cost function $J(\vec{w},b)$ - i.e. parameters close to its global minimum.
* Remember that at each iteration the parameters $\vec{w}$ and $b$ are simultaneously updated.
* If the algorithm is working properly, then the cost $J$ should decrease on every single iteration.
* If $J$ is leveling off and is no longer decreasing much - this means that gradient descent has more or less converged.
* If this is not the case &rarr; i.e. the cost $J$ increases instead of decreasing, or fluctuates (sometimes going up, sometimes going down) - that usually means $\alpha$ is too large (causing overshooting of the minimum), or there could be a bug in the code.
* The number of iterations that gradient descent takes to converge can vary a lot between different applications (30, or 1000, or even 100000 iterations).
* Decide when to stop training with an automatic convergence test:
* * Let $\epsilon = 10^{-3}$ or some similar very small number.
* * If $J$ decreases by less than $\epsilon$ on one iteration, then you're likely on the flattened part of the curve and can declare convergence.

 ![piai44-2.png](attachment:piai44-2.png)
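
The automatic convergence test above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: it assumes a single-feature linear model with a squared-error cost, and the names `compute_cost`, `compute_gradient`, `gradient_descent`, the toy data, and the default `alpha=0.1` are all made up for the example.

```python
def compute_cost(xs, ys, w, b):
    """Squared-error cost J(w, b) = (1 / 2m) * sum((w*x + b - y)^2)."""
    m = len(xs)
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)


def compute_gradient(xs, ys, w, b):
    """Partial derivatives of J with respect to w and b."""
    m = len(xs)
    errors = [w * x + b - y for x, y in zip(xs, ys)]
    dj_dw = sum(e * x for e, x in zip(errors, xs)) / m
    dj_db = sum(errors) / m
    return dj_dw, dj_db


def gradient_descent(xs, ys, alpha=0.1, epsilon=1e-3, max_iters=100_000):
    """Batch gradient descent that stops once J decreases by less than epsilon."""
    w, b = 0.0, 0.0
    history = [compute_cost(xs, ys, w, b)]
    for _ in range(max_iters):
        dj_dw, dj_db = compute_gradient(xs, ys, w, b)
        # Simultaneous update: both gradients were computed from the old w and b.
        w, b = w - alpha * dj_dw, b - alpha * dj_db
        history.append(compute_cost(xs, ys, w, b))
        decrease = history[-2] - history[-1]
        # Declare convergence only if J went down, and by less than epsilon.
        if 0 <= decrease < epsilon:
            break
    return w, b, history


# Toy data generated from y = 2x + 1 (an assumption for this sketch).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

w, b, history = gradient_descent(xs, ys)
print(f"stopped after {len(history) - 1} iterations, J = {history[-1]:.4f}")
```

Note that a loose threshold such as $\epsilon = 10^{-3}$ stops on the flat part of the curve, so $\vec{w}$ and $b$ may still be some distance from their optimal values; a smaller $\epsilon$ trades more iterations for a closer fit.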

## How to choose an appropriate learning rate

* If $\alpha$ is too small - the algorithm will run very slowly.
* If $\alpha$ is too large - it may not even converge.
* If the cost $J$ sometimes goes up instead of decreasing - the learning rate is too large and the update step may be overshooting the minimum &rarr; try a smaller learning rate $\alpha$.
* How to debug if gradient descent isn't working:
* * Try setting $\alpha$ to a very small number and see if that causes the cost to decrease on every iteration.
* * If $J$ still doesn't decrease on every single iteration even with a very small $\alpha$, but instead sometimes increases, then that usually means there's a bug somewhere in the code.
* * Note that setting $\alpha$ to be really small is meant only for debugging purposes and is not the most efficient choice for actual training, as gradient descent will then take a lot of iterations to converge.
* How to choose the best $\alpha$:
* * Try a range of values for the learning rate, with each $\alpha$ being roughly 3 times bigger than the previous one: starting at 0.001 &rarr; 0.003 &rarr; 0.01 &rarr; 0.03 ...
* * For each choice of $\alpha$, run gradient descent for just a handful of iterations and plot the cost function $J$ as a function of the number of iterations.
* * Pick the value of $\alpha$ that decreases the cost rapidly, but also consistently - this is usually slightly smaller than the largest value that still behaves reasonably.
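
The sweep over learning rates described above might look like the sketch below. It is only an illustration under stated assumptions: the single-feature linear model, the toy data, the helper `run_gd`, and the selection rule (lowest final cost among the rates whose cost fell on every iteration) are all hypothetical stand-ins. In practice you would plot each cost curve and judge it by eye; here the costs are printed to keep the example dependency-free.

```python
def run_gd(xs, ys, alpha, num_iters=20):
    """Run a few iterations of batch gradient descent; return J after each one."""
    m = len(xs)
    w, b = 0.0, 0.0
    costs = []
    for _ in range(num_iters):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        dj_dw = sum(e * x for e, x in zip(errors, xs)) / m
        dj_db = sum(errors) / m
        # Simultaneous update of both parameters.
        w, b = w - alpha * dj_dw, b - alpha * dj_db
        costs.append(sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / (2 * m))
    return costs


# Toy data generated from y = 2x + 1 (an assumption for this sketch).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

# Each candidate is roughly 3 times larger than the previous one.
alphas = [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1.0]

results = {}
for alpha in alphas:
    costs = run_gd(xs, ys, alpha)
    # True only if the cost went down on every single iteration.
    decreasing = all(later < earlier for earlier, later in zip(costs, costs[1:]))
    results[alpha] = (costs[-1], decreasing)
    print(f"alpha={alpha:<6} final J={costs[-1]:.4g}  always decreasing={decreasing}")

# Crude stand-in for reading the plots: among the rates that decreased J
# consistently, take the one that reached the lowest cost.
stable = [a for a in alphas if results[a][1]]
candidate_alpha = min(stable, key=lambda a: results[a][0])
print("candidate learning rate:", candidate_alpha)
```

On this toy problem the two largest rates overshoot and the cost grows, while the smallest rates decrease the cost consistently but slowly, which is the pattern the bullets above describe.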


## Loss vs. learning rate for different batch sizes 

The curve of $J$ becomes smoother as the batch size grows. An additional peak appears only when the batch size becomes very small, such as 8 or 4.
<br>(Source: Carlos Perez)
![image.png](attachment:901df099-f8ee-441b-b3dc-8d63a28079cd.png)