### Gradient Descent 
- Gradient descent is an optimization algorithm. In linear regression it is used to find the best-fit line for a given training data set.
- It is not a cost (loss) function. It is the procedure that finds the parameters minimizing a cost function, usually in far fewer iterations than trying values by hand.
- The loss function quantifies how well a model's predictions match the actual values, and gradient descent iteratively adjusts the model's parameters to reduce this loss.
- Inside gradient descent we use a cost function to measure the error at each step. These notes use Mean Squared Error (MSE).
- Many machine learning algorithms use gradient descent in the background.

#### Types of Gradient Descent

The main types of gradient descent are Batch, Stochastic, and Mini-batch Gradient Descent. They differ in how much of the training data they use for each parameter update.

- Batch Gradient Descent: uses the entire dataset for every update. It calculates the error for every data point in the training set and updates the model's parameters based on the average. This is computationally expensive for large datasets but tends to converge to a minimum more reliably.
- Stochastic Gradient Descent (SGD): uses a single randomly selected data point to compute the error and update the parameters in each iteration. Each update is much cheaper than in batch gradient descent, but the path is more erratic and can fluctuate around the minimum.
- Mini-batch Gradient Descent: uses a small random subset of the training data (a mini-batch) for each update. It is a compromise between the two that balances speed and stability, which makes it a popular choice.
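
The three variants differ only in how many samples feed each update, so one loop with a batch size can illustrate all of them. The following is a minimal sketch under that assumption; the function names `gradients` and `train`, the epoch count, and the learning rate of 0.01 are illustrative choices, not part of the original notes.

```python
import numpy as np

def gradients(x, y, m, b):
    """MSE gradients with respect to m and b on the given samples."""
    error = y - (m * x + b)
    return -2 * np.mean(x * error), -2 * np.mean(error)

def train(x, y, batch_size, epochs=3000, learning_rate=0.01, seed=0):
    """batch_size == len(x): batch GD; 1: SGD; in between: mini-batch."""
    rng = np.random.default_rng(seed)
    m = b = 0.0
    for _ in range(epochs):
        order = rng.permutation(len(x))              # reshuffle every epoch
        for start in range(0, len(x), batch_size):
            idx = order[start:start + batch_size]    # samples for this update
            dm, db = gradients(x[idx], y[idx], m, b)
            m -= learning_rate * dm
            b -= learning_rate * db
    return m, b

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([5.0, 7.0, 9.0, 11.0, 13.0])            # exactly y = 2x + 3

batch_result = train(x, y, batch_size=len(x))        # one update per epoch
sgd_result = train(x, y, batch_size=1)               # one update per sample
mini_result = train(x, y, batch_size=2)              # one update per mini-batch

print("batch:     ", batch_result)
print("stochastic:", sgd_result)
print("mini-batch:", mini_result)
```

On this tiny, noise-free data set all three should end up close to m = 2 and b = 3. The difference in cost and stability only becomes visible on larger, noisier data.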

#### Other important concepts related to gradient descent

- Learning Rate: a crucial hyperparameter that determines the step size during parameter updates. Choosing an appropriate learning rate is essential for effective convergence.
- Momentum: a technique that adds a fraction of the previous update to the current one. It can accelerate convergence and can help the search move past shallow local minima.
- Local Minima: cost functions can have multiple local minima, and gradient descent can sometimes get stuck in one of these rather than finding the global minimum.
- Exploding and Vanishing Gradients: problems that can occur during the training of deep neural networks, where gradients become too large or too small, respectively, hindering learning.
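
To make the momentum idea concrete, here is a minimal sketch of the classical "heavy-ball" form, where a velocity term carries over part of the previous update. The function name `fit`, the momentum coefficient `beta = 0.9`, the learning rate, and the step count are assumptions chosen for illustration.

```python
import numpy as np

def fit(x, y, learning_rate=0.01, beta=0.0, steps=500):
    """Gradient descent on MSE for y = m*x + b; beta > 0 enables momentum."""
    params = np.zeros(2)      # [m, b]
    velocity = np.zeros(2)    # running direction of travel
    for _ in range(steps):
        error = y - (params[0] * x + params[1])
        grad = np.array([-2 * np.mean(x * error), -2 * np.mean(error)])
        # Keep a fraction (beta) of the previous update, then add the new step
        velocity = beta * velocity - learning_rate * grad
        params = params + velocity
    return params

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([5.0, 7.0, 9.0, 11.0, 13.0])   # exactly y = 2x + 3

plain = fit(x, y, beta=0.0)           # ordinary gradient descent
with_momentum = fit(x, y, beta=0.9)   # same learning rate and steps, plus momentum

print("plain:   ", plain)
print("momentum:", with_momentum)
```

With the same learning rate and number of steps, the momentum run should land noticeably closer to m = 2, b = 3 on this data.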

- Instead of trying every possible combination of slope (m) and intercept (b) for the line, gradient descent finds the best-fit line in fewer iterations by moving m and b step by step towards a lower cost.
- For every candidate pair of m and b, we find the cost using Mean Squared Error (MSE).
- Steps:
- Start with m (slope) = 0 and b (y-intercept) = 0, or start with random values of m and b.
- Repeat the steps below n times (epochs). For example, EPOCHS = 100 means 100 iterations. Stop earlier if the values of m and b that give the lowest cost are found.
- Predict the value of y using y = mx + b with the current values of m and b.
- Decrease or increase m and b by a small step scaled by the learning rate (for example 0.01) to get the new slope and intercept.
![image.png](attachment:ea7aa840-65ee-4ec9-9fe2-bc0c74c5306f.png)
- Keep the learning rate small enough. If it is too large, the steps can overshoot and miss the minimum point, as shown in the diagram below. If it is extremely small, convergence becomes very slow.
![image.png](attachment:2acbc51d-99dd-4d4a-8f6b-cee6922181dc.png)
- Keep updating m and b until the error (cost) is at its minimum, or the loss is below a threshold (for example < 0.0001), or the cost stops changing between iterations (new cost = old cost).
![image.png](attachment:a3a50958-cf03-4cd8-8a4a-f4d76d5905c5.png)
- Find the next point by using the slope. At each point, calculate the slope of the cost curve at that point and use it to decide where the next point is.

Slope (m) between two points = change in Y / change in X = Delta Y / Delta X
![image.png](attachment:985fedac-746d-43e2-89a9-1c0984d5170e.png)

Slope (m) at a particular point (a small change in x and y):

- x changes from $x$ to $x + \Delta x$
- y changes from $f(x)$ to $f(x + \Delta x)$

Putting these values into the slope formula:

slope = change in Y / change in X = $\dfrac{f(x + \Delta x) - f(x)}{\Delta x}$

For example, with $f(x) = x^2$ this becomes $\dfrac{(x + \Delta x)^2 - x^2}{\Delta x} = 2x + \Delta x$. As $\Delta x$ shrinks towards 0, the result heads towards $2x$. In other words, the slope at $x$ is $2x$:

- when x = 2, slope = 4
- when x = 5, slope = 10
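
The limit can also be checked numerically. The sketch below assumes f(x) = x², the function whose slope is 2x; the helper name `slope` is illustrative. It evaluates the slope formula for smaller and smaller values of Delta x.

```python
def f(x):
    return x ** 2

def slope(func, x, delta_x):
    """Slope between the points x and x + delta_x."""
    return (func(x + delta_x) - func(x)) / delta_x

# As delta_x shrinks, the slope at x = 2 approaches 4 and at x = 5 approaches 10
for delta_x in (1.0, 0.1, 0.001, 0.000001):
    print(delta_x, slope(f, 2, delta_x), slope(f, 5, delta_x))
```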

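- For a function that depends on two variables, we take the derivative with respect to one variable at a time while treating the other as a constant. This is called a partial derivative. For example:

$f(x, y) = x^3 + y^2$

partial derivative with respect to x: $\dfrac{\partial f}{\partial x} = 3x^2$

partial derivative with respect to y: $\dfrac{\partial f}{\partial y} = 2y$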

![image.png](attachment:2b45bb70-a750-4ba1-962c-8117678efb95.png)
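
As a quick sanity check, the two partial derivatives can be approximated numerically by nudging one variable while holding the other fixed. This is a minimal sketch; the helper names `partial_x` and `partial_y`, the step size `h`, and the test point (2, 3) are illustrative choices.

```python
def f(x, y):
    return x ** 3 + y ** 2

def partial_x(func, x, y, h=1e-6):
    """Central-difference estimate of the derivative with respect to x (y held fixed)."""
    return (func(x + h, y) - func(x - h, y)) / (2 * h)

def partial_y(func, x, y, h=1e-6):
    """Central-difference estimate of the derivative with respect to y (x held fixed)."""
    return (func(x, y + h) - func(x, y - h)) / (2 * h)

# At (x, y) = (2, 3) the formulas give 3 * 2**2 = 12 and 2 * 3 = 6
print(partial_x(f, 2.0, 3.0), partial_y(f, 2.0, 3.0))
```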

Now we apply partial derivatives to the MSE cost function, once with respect to m and once with respect to b. This gives the slopes needed to update m and b.

![image.png](attachment:45b3ead8-95f0-4b55-840f-6975eb260eea.png)
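
Written out, and assuming the same MSE definition that the Python code in this section uses, the cost and its two partial derivatives are as follows. The index $i$ and the count $n$ are notation introduced here for the $n$ training points $(x_i, y_i)$; each derivative follows from the chain rule applied to the squared term.

$$
\begin{aligned}
\text{MSE}(m, b) &= \frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - (m x_i + b)\bigr)^2 \\
\frac{\partial\,\text{MSE}}{\partial m} &= -\frac{2}{n}\sum_{i=1}^{n} x_i\,\bigl(y_i - (m x_i + b)\bigr) \\
\frac{\partial\,\text{MSE}}{\partial b} &= -\frac{2}{n}\sum_{i=1}^{n} \bigl(y_i - (m x_i + b)\bigr)
\end{aligned}
$$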

- Iterate this process n times (epochs). For example, EPOCHS = 100 means 100 iterations. The number of epochs and the learning rate are hyperparameters, and tuning them shows how each one affects the result.
- Step size and learning rate: once we know the current slope and have decided on the learning rate, we can calculate the next value.

![image.png](attachment:e74772fb-f46f-42e0-8254-9d6d1c5f85f6.png)

Example of calculating the next value:

![image.png](attachment:b8a43355-6012-4bde-b8ba-22a7f2fa0191.png)

Since we started with b1, the next value b2 is:

b2 = b1 - learning rate * derivative
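
The same rule applies to both m and b. Below is a minimal sketch of a single update step on the data set used later in these notes, starting from m1 = b1 = 0. The learning rate of 0.01 is an assumed value, and the names `m2` and `b2` simply mirror the formula above.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([5, 7, 9, 11, 13])
n = len(x)

m1 = b1 = 0.0
learning_rate = 0.01   # assumed value for illustration

# Predictions and derivatives at the starting point
y_predicted = m1 * x + b1
m_derivative = -(2 / n) * np.sum(x * (y - y_predicted))   # about -62
b_derivative = -(2 / n) * np.sum(y - y_predicted)         # about -18

# next value = current value - learning rate * derivative
m2 = m1 - learning_rate * m_derivative   # about 0.62
b2 = b1 - learning_rate * b_derivative   # about 0.18

print(m2, b2)
```

Both derivatives are negative at the starting point, so subtracting them moves m and b upwards, towards the true values of 2 and 3.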

#### Implement Gradient Descent in Python

In [17]:
import numpy as np

def gradient_descent(x,y):

    # Start with slope and intercept both at 0
    m_slope_curr = b_intercept_curr = 0

    # Number of iterations (epochs); start here and fine tune it later
    iterations = 10000

    # Learning rate. NOTE: 0.25 is too large for this data set, so each
    # update overshoots the minimum and the cost grows instead of shrinking
    # (see the output below). A smaller value such as 0.01 converges.
    learning_rate = 0.25

    # Number of data points
    n = len(x)
    
    # Repeat the update for the chosen number of iterations
    for i in range(iterations):
        # Predict y with the current slope and intercept
        y_predict = m_slope_curr * x + b_intercept_curr

        # Calculate the cost, Mean Squared Error (MSE)
        cost = 1/n * sum([val**2 for val in (y - y_predict)])

        # Partial derivatives of the cost with respect to m and b
        m_derivate = -(2/n)*sum(x*(y-y_predict))
        b_derivate = -(2/n)*sum(y-y_predict)

        # Next m and b: current value - learning rate * derivative
        m_slope_curr = m_slope_curr - learning_rate * m_derivate
        b_intercept_curr = b_intercept_curr - learning_rate * b_derivate

        # Print the values of m, b, the iteration number and the cost
        print("m_Curr_Slope = {}, b_Curr_Intercept = {}, Curr_Iteration_No = {}, Cost = {} ".format(m_slope_curr,b_intercept_curr,i,cost))

x = np.array([1,2,3,4,5])
y = np.array([5,7,9,11,13])

gradient_descent(x,y)

m_Curr_Slope = 15.5, b_Curr_Intercept = 4.5, Curr_Iteration_No = 0, Cost = 89.0 
m_Curr_Slope = -61.0, b_Curr_Intercept = -16.5, Curr_Iteration_No = 1, Cost = 2128.5 
m_Curr_Slope = 314.75, b_Curr_Intercept = 87.75, Curr_Iteration_No = 2, Cost = 51410.25 
m_Curr_Slope = -1532.5, b_Curr_Intercept = -423.75, Curr_Iteration_No = 3, Cost = 1242154.125 
m_Curr_Slope = 7547.375, b_Curr_Intercept = 2091.375, Curr_Iteration_No = 4, Cost = 30012795.5625 
m_Curr_Slope = -37084.75, b_Curr_Intercept = -10270.875, Curr_Iteration_No = 5, Cost = 725166268.03125 
m_Curr_Slope = 182303.1875, b_Curr_Intercept = 50496.1875, Curr_Iteration_No = 6, Cost = 17521397590.640625 
m_Curr_Slope = -896093.125, b_Curr_Intercept = -248202.1875, Curr_Iteration_No = 7, Cost = 423350322838.3828 
m_Curr_Slope = 4404737.84375, b_Curr_Intercept = 1220043.09375, Curr_Iteration_No = 8, Cost = 10228949769749.098 
m_Curr_Slope = -21651369.4375, b_Curr_Intercept = -5997080.71875, Curr_Iteration_No = 9, Cost = 247150900206250.2

(The run then produces NumPy runtime warnings pointing at the `cost`, `m_derivate` and `m_slope_curr` lines as the values overflow. Later in the run every value is `nan`, for example:)

m_Curr_Slope = nan, b_Curr_Intercept = nan, Curr_Iteration_No = 5481, Cost = nan 
m_Curr_Slope = nan, b_Curr_Intercept = nan, Curr_Iteration_No = 5482, Cost = nan 
...

The cost grows on every iteration instead of shrinking. With learning_rate = 0.25 each step overshoots the minimum, so this run diverges rather than converging to the best-fit line.
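
The run above diverges because the learning rate is too large for this data. As a minimal sketch of a converging variant, the version below assumes a smaller learning rate of 0.05 and stops early once the cost stops changing between iterations. The function name `gradient_descent_converging`, the `tolerance` parameter, and both default values are illustrative assumptions.

```python
import numpy as np

def gradient_descent_converging(x, y, learning_rate=0.05,
                                max_iterations=10000, tolerance=1e-12):
    """Fit y = m*x + b by gradient descent on MSE, stopping when the cost settles."""
    m = b = 0.0
    n = len(x)
    previous_cost = None
    iterations_used = 0
    for i in range(max_iterations):
        y_predict = m * x + b
        cost = np.mean((y - y_predict) ** 2)

        # Stop when the cost no longer changes meaningfully (new cost ~ old cost)
        if previous_cost is not None and abs(previous_cost - cost) < tolerance:
            break
        previous_cost = cost

        # Partial derivatives of the cost with respect to m and b
        m_derivative = -(2 / n) * np.sum(x * (y - y_predict))
        b_derivative = -(2 / n) * np.sum(y - y_predict)

        m = m - learning_rate * m_derivative
        b = b - learning_rate * b_derivative
        iterations_used = i + 1
    return m, b, iterations_used

x = np.array([1, 2, 3, 4, 5])
y = np.array([5, 7, 9, 11, 13])   # exactly y = 2x + 3

m, b, iterations_used = gradient_descent_converging(x, y)
print("m = {}, b = {}, iterations = {}".format(m, b, iterations_used))
```

Because the data lies exactly on y = 2x + 3, the result should be close to m = 2 and b = 3, reached well before the iteration limit.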

In [5]:
# Graphical representation of gradient descent

import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

def gradient_descent(x,y):
    m_curr = b_curr = 0
    rate = 0.01
    n = len(x)
    # linewidth must be a number. Passing the string '5' raises
    # "TypeError: must be real number, not str" when the figure is drawn.
    plt.scatter(x,y,color='red',marker='+',linewidth=5)
    for i in range(10000):
        y_predicted = m_curr * x + b_curr
        # Draw the current line; later lines move towards the best fit
        plt.plot(x,y_predicted,color='green')
        md = -(2/n)*sum(x*(y-y_predicted))   # derivative of the cost w.r.t. m
        bd = -(2/n)*sum(y-y_predicted)       # derivative of the cost w.r.t. b
        m_curr = m_curr - rate * md
        b_curr = b_curr - rate * bd


x = np.array([1,2,3,4,5])
y = np.array([5,7,9,11,13])


gradient_descent(x,y)

#### Gradient Descent vs Cost Functions

- In machine learning, cost functions measure how well a model's predictions match the actual data, while gradient descent is an algorithm used to minimize the cost function by iteratively adjusting the model's parameters. In linear regression, this is how it quickly arrives at the best-fit regression line.
- Essentially, the cost function quantifies the error, and gradient descent finds the best parameter settings to reduce that error.
- Here's a more detailed breakdown:
- Cost Functions:
  - Purpose: cost functions (also known as loss functions) are mathematical expressions that quantify the difference between predicted values from a machine learning model and the actual values in the training data.
  - Evaluation: they output a single value (the "cost" or "loss") that represents the model's error. A higher cost indicates a larger discrepancy between predictions and actual values.
  - Examples: common cost functions include Mean Squared Error (MSE) for regression and Cross-Entropy for classification.
- Gradient Descent:
  - Purpose: gradient descent is an optimization algorithm that aims to find the minimum value of a function, in this case the cost function.
  - Process: it works by iteratively adjusting the model's parameters (e.g., weights in a neural network) in the direction of the negative gradient (steepest descent) of the cost function.
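
The division of labour can be sketched in a few lines: one function only scores a candidate line, the other uses the gradient of that score to propose a better line. The names `mse_cost` and `gradient_descent_step` and the learning rate of 0.01 are assumptions for illustration.

```python
import numpy as np

def mse_cost(x, y, m, b):
    """Cost function: scores one candidate line and changes nothing."""
    return np.mean((y - (m * x + b)) ** 2)

def gradient_descent_step(x, y, m, b, learning_rate):
    """Optimizer: uses the gradient of the MSE to propose better m and b."""
    error = y - (m * x + b)
    m_new = m - learning_rate * (-2 * np.mean(x * error))
    b_new = b - learning_rate * (-2 * np.mean(error))
    return m_new, b_new

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([5.0, 7.0, 9.0, 11.0, 13.0])

cost_before = mse_cost(x, y, 0.0, 0.0)                        # cost of the starting line
m, b = gradient_descent_step(x, y, 0.0, 0.0, learning_rate=0.01)
cost_after = mse_cost(x, y, m, b)                             # cost after one update

print(cost_before, cost_after)
```

A single update should already lower the cost; repeating the step is what drives it towards the minimum.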

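#### Note
- Linear regression is typically fitted with the Ordinary Least Squares (OLS) method, which estimates the parameters directly, so gradient descent is not required for it.
- Logistic regression, neural networks, and support vector machines can be trained with gradient descent or one of its variants. Boosting algorithms like XGBoost use a related idea, gradient boosting, in which each new tree is fitted using the gradient of the loss.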
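
For comparison, a least-squares fit needs no iterations or learning rate. The sketch below solves the same toy problem with NumPy's `np.linalg.lstsq`; the variable names `design_matrix`, `m_ols`, and `b_ols` are illustrative.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([5.0, 7.0, 9.0, 11.0, 13.0])

# One column for the slope (x) and one column of ones for the intercept
design_matrix = np.column_stack([x, np.ones_like(x)])

# Least-squares solution of design_matrix @ [m, b] = y
solution, residuals, rank, singular_values = np.linalg.lstsq(design_matrix, y, rcond=None)
m_ols, b_ols = solution

print(m_ols, b_ols)
```

On this data the solution should match the line that gradient descent approaches, m = 2 and b = 3, up to floating-point rounding.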