***
# <center>***Gradients, Partial Derivatives, and the Chain Rule***
***

The **derivatives** that we have solved so far have been cases where there is only **one independent variable** in the function; that is, the result depended solely on, in our case, x. However, our neural network consists, for example, of neurons, which have multiple inputs. Each input gets multiplied by the corresponding weight (a function of 2 parameters), and the products get summed with the bias (a function of as many parameters as there are inputs, plus one for the bias). To learn the impact of all of the inputs, weights, and biases on the neuron output and, ultimately, on the loss function, we need to calculate the derivative of each operation performed during the forward pass in the neuron and in the whole model. To do that, we will need to use the **chain rule**.

***
### ***The Partial Derivative***
***

The **partial derivative** measures how much impact a single input has on a function’s output. The method for calculating a partial derivative is the same as for the derivatives explained in the derivative file; we simply have to repeat this process for each of the independent inputs.

Each of the function’s inputs has some impact on this function’s output, even if the impact is 0. We need to know these impacts, which means that we have to calculate the derivative with respect to each input separately to learn about each of them. That’s why we call these **partial derivatives** with respect to a given input: we are calculating a part of the derivative, related to a single input. **The partial derivative is a single equation**, and the full multivariate function’s derivative consists of a set of equations called the **gradient**. In other words, the **gradient** is a vector with one entry per input, containing the partial derivative with respect to each of the inputs.

To denote the **partial derivative**, we will be using **Euler’s notation**. It’s very similar to Leibniz’s notation, as we only need to replace the differential operator **d** with **∂**. While the **d** operator might be used to denote the differentiation of a multivariate function, its meaning is a bit different: it can mean the rate of the function’s change in relation to the given input when other inputs might change as well, and it is used mostly in physics. We are interested in the partial derivatives, a situation where we try to find the impact of the given input on the output while treating all of the other inputs as constants. We are interested in the impact of single inputs since our goal, in the model, is to update parameters. The **∂** operator explicitly denotes the **partial derivative**:

$$ f(x, y, z) \rightarrow \frac{∂}{∂x}f(x, y, z), \quad \frac{∂}{∂y}f(x, y, z), \quad \frac{∂}{∂z}f(x, y, z)$$

***
### ***The Partial Derivative of a Sum***
***

Calculating the partial derivative with respect to a given input means calculating it like the regular derivative of one input while treating the other inputs as constants. For example:

$$ f(x, y) = x + y$$

$$ \frac{∂}{∂x}f(x, y) = \frac{∂}{∂x}[x + y] = \frac{∂}{∂x}x + \frac{∂}{∂x}y = 1 + 0 = 1$$

$$ \frac{∂}{∂y}f(x, y) = \frac{∂}{∂y}[x + y] = \frac{∂}{∂y}x + \frac{∂}{∂y}y = 0 + 1 = 1$$
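The result above can be checked numerically. The following is a minimal sketch, not part of the original material: the helper names `partial_x` and `partial_y`, the step size `h`, and the sample point are all assumptions chosen for illustration. Each helper nudges one input while holding the other constant, which is what treating the other input as a constant means in practice.

```python
def f(x, y):
    """Example function from the text: f(x, y) = x + y."""
    return x + y


def partial_x(func, x, y, h=1e-6):
    """Central-difference estimate of the partial derivative w.r.t. x (y held constant)."""
    return (func(x + h, y) - func(x - h, y)) / (2 * h)


def partial_y(func, x, y, h=1e-6):
    """Central-difference estimate of the partial derivative w.r.t. y (x held constant)."""
    return (func(x, y + h) - func(x, y - h)) / (2 * h)


# Arbitrary sample point (an assumption; any point gives the same answer for a sum)
x, y = 3.0, -2.0
df_dx = partial_x(f, x, y)
df_dy = partial_y(f, x, y)
print(df_dx, df_dy)  # both values should be very close to 1
```

Both estimates agree with the analytic partials of 1, up to floating-point error.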

***
### ***The Partial Derivative of Multiplication***
***

$$ f(x, y) = x \cdot y$$

$$ \frac{∂}{∂x}f(x, y) = \frac{∂}{∂x}(x \cdot y) = y \cdot \frac{∂}{∂x}x = y \cdot 1 = y$$

$$ \frac{∂}{∂y}f(x, y) = \frac{∂}{∂y}(x \cdot y) = x \cdot \frac{∂}{∂y}y = x \cdot 1 = x$$
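As a quick sanity check of the two partials above, here is a small sketch that compares them with numerical estimates. The function names `grad_f` and `numeric_partials`, the step size `h`, and the sample values are hypothetical and only serve this illustration.

```python
def f(x, y):
    """Example function from the text: f(x, y) = x * y."""
    return x * y


def grad_f(x, y):
    """Analytic partials derived above: df/dx = y and df/dy = x."""
    return y, x


def numeric_partials(func, x, y, h=1e-6):
    """Central-difference estimates, changing one input at a time."""
    dx = (func(x + h, y) - func(x - h, y)) / (2 * h)
    dy = (func(x, y + h) - func(x, y - h)) / (2 * h)
    return dx, dy


# Arbitrary sample point (an assumption for illustration)
x, y = 2.0, 5.0
analytic = grad_f(x, y)
numeric = numeric_partials(f, x, y)
print(analytic)  # (5.0, 2.0)
print(numeric)   # approximately the same values
```

Note how the partial derivative with respect to one factor is simply the other factor, which is why the values appear swapped.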

***
### ***The Partial Derivative of Max***
***

Derivatives and partial derivatives are not limited to addition and multiplication operations, or constants. We need to derive them for the other functions that we used in the forward pass, one of which is the max() function:

$$f(x, y) = \max(x, y)$$

$$\frac{∂}{∂x}f(x, y) = \begin{cases}
1 & \text{if } x > y \\
0 & \text{if } x < y \\
\text{undefined} & \text{if } x = y
\end{cases}$$

The max function returns the greatest input. We know that the derivative of x with respect to x equals 1, so the derivative of this function with respect to x equals 1 if x is greater than y, since the function will return x. In the other case, where y is greater than x and gets returned instead, the derivative of **max()** with respect to x equals 0: we treat y as a constant, and the derivative of y with respect to x equals 0. When x equals y, the derivative is not defined mathematically; in practice, we can denote the whole result as 1(x > y), which means 1 if the condition is met and 0 otherwise, so the tie is assigned 0.
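The indicator form 1(x > y) translates almost directly into code. The sketch below is an illustration only; the function names `max_partial_x` and `max_partial_y` are hypothetical, and the choice to return 0 when the inputs are equal simply follows the 1(x > y) convention rather than any mathematical definition at that point.

```python
def max_partial_x(x, y):
    """Partial derivative of max(x, y) w.r.t. x, written as the indicator 1(x > y)."""
    return 1.0 if x > y else 0.0


def max_partial_y(x, y):
    """Partial derivative of max(x, y) w.r.t. y, written as the indicator 1(y > x)."""
    return 1.0 if y > x else 0.0


print(max_partial_x(3.0, 1.0))  # 1.0, because max returns x here
print(max_partial_x(1.0, 3.0))  # 0.0, because max returns y and y is a constant w.r.t. x
print(max_partial_x(2.0, 2.0))  # 0.0 by the 1(x > y) convention (the tie case)
```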

***
### ***The Gradient***
***

As we mentioned, the gradient is a vector composed of all of the partial derivatives of a function, calculated with respect to each input variable.

$$f(x, y, z) = 3x^3z - y^2 + 5z + 2yz$$

$$\frac{∂}{∂x}f(x, y, z) = 9x^2z$$
$$\frac{∂}{∂y}f(x, y, z) = -2y + 2z$$
$$\frac{∂}{∂z}f(x, y, z) = 3x^3 + 5 + 2y$$

If we calculate all of the partial derivatives, we can form a gradient of the function. Using different notations, it looks as follows:

$$\nabla f(x, y, z) = \begin{bmatrix}
    \frac{∂}{∂x}f(x, y, z) \\
    \frac{∂}{∂y}f(x, y, z) \\
    \frac{∂}{∂z}f(x, y, z) 
\end{bmatrix} = \begin{bmatrix}
    \frac{∂}{∂x} \\
    \frac{∂}{∂y} \\
    \frac{∂}{∂z} 
\end{bmatrix} f(x, y, z) = \begin{bmatrix}
    9x^2z \\
    -2y + 2z \\
    3x^3 + 5 + 2y
\end{bmatrix}$$

That’s all we have to know about the **gradient**: it’s a vector of all of the possible partial derivatives of the function, and we denote it using the ∇ (nabla) symbol, which looks like an inverted delta symbol.
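To make the gradient of the example function concrete, the following sketch evaluates the three partial derivatives at one point and compares them with a numerical estimate. The helper `numeric_gradient`, the step size `h`, and the point (1, 2, 3) are assumptions made for this illustration and do not come from the original material.

```python
def f(x, y, z):
    """Example function from the text: 3x^3 z - y^2 + 5z + 2yz."""
    return 3 * x**3 * z - y**2 + 5 * z + 2 * y * z


def gradient_f(x, y, z):
    """Analytic gradient: one partial derivative per input, in the order x, y, z."""
    return [9 * x**2 * z, -2 * y + 2 * z, 3 * x**3 + 5 + 2 * y]


def numeric_gradient(func, point, h=1e-6):
    """Central-difference estimate: nudge one input at a time, keep the rest fixed."""
    grad = []
    for i in range(len(point)):
        up = list(point)
        down = list(point)
        up[i] += h
        down[i] -= h
        grad.append((func(*up) - func(*down)) / (2 * h))
    return grad


# Arbitrary evaluation point (an assumption for illustration)
point = (1.0, 2.0, 3.0)
analytic = gradient_f(*point)
numeric = numeric_gradient(f, point)
print(analytic)  # [27.0, 2.0, 12.0]
print(numeric)   # approximately the same three values
```

The gradient has exactly as many entries as the function has inputs, one partial derivative each.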

We will be using **derivatives** of single-parameter functions and **gradients** of multivariate functions to perform **gradient descent** using the **chain rule** or, in other words, to perform the **backward pass**, which is a part of the model training. How exactly we’ll do that is the subject of the next chapter.


***
### ***The Chain Rule***
***

During the **forward pass**, we are passing the data through the neurons, then through the activation function, then through the neurons in the next layer, then through another activation function, and so on. We are calling a function with an input parameter, taking an output, and using that output as an input to another function. For this simple example, let’s take 2 functions, **f** and **g**:

$$z = f(x)$$
$$y = g(z)$$

x is the input data, z is an output of the function f, but also an input for the function g, and y is an output of the function g. We could write the same calculation as: 

$$y = g(f(x))$$
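The two ways of writing the calculation can be sketched in code. The concrete definitions of `f` and `g` below (a linear function and a square) are arbitrary assumptions picked for illustration, as are the sample input and the step size `h`; the original text does not specify them. The last lines estimate numerically how much the final output y reacts to a small change in the input x.

```python
def f(x):
    """Hypothetical first function in the chain (an assumption): f(x) = 3x + 1."""
    return 3 * x + 1


def g(z):
    """Hypothetical second function in the chain (an assumption): g(z) = z^2."""
    return z ** 2


x = 2.0

# Two-step form: z is the output of f and the input of g
z = f(x)
y = g(z)

# Nested form: the same calculation without the intermediate variable
y_nested = g(f(x))
print(y, y_nested)  # 49.0 49.0

# Central-difference estimate of how y changes when x changes
h = 1e-6
slope = (g(f(x + h)) - g(f(x - h))) / (2 * h)
print(slope)  # approximately 42 for these example functions
```

A nonzero slope confirms that x influences y even though g never receives x directly.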

Written as y = g(f(x)), the calculation does not use the intermediate z variable, showing that function g takes the output of function f directly as an input. This does not differ much from the 2 equations above, but it shows an important property of functions chained this way: since x is an input to the function f, and the output of the function f is then an input to the function g, the output of the function g is influenced by x in some way, so there must exist a derivative which can inform us of this influence. The forward pass through our model is a chain of functions similar to these examples. We are passing in samples, and the data flows through all of the layers and activation functions to form an output. Let’s bring back the equation and the code of the example model from chapter 1:

$$L = -\sum_{i=1}^{N} y_i \log \left( \frac{e^{z_i} \prod_{j=1}^{n_i} \left( 1 + \sum_{k=1}^{m_i} \max(0, f_i^k(x_{i,j})) \right) \prod_{j=1}^{n_i} \max(0, f_i^0(x_{i,j})) \sum_{j=1}^{n_i} \sum_{k=1}^{m_i} w_{i,j,k} \cdot x_{i,j,k} + b_i}{\sum_{l=0}^{n_i} e^{z_l} \prod_{j=1}^{n_i} \left( 1 + \sum_{k=1}^{m_i} \max(0, f_l^k(x_{l,j})) \right) \prod_{j=1}^{n_i} \max(0, f_l^0(x_{l,j})) \sum_{j=1}^{n_i} \sum_{k=1}^{m_i} w_{l,j,k} \cdot x_{l,j,k} + b_l} \right)$$