# Gradients

## Partial Derivatives and Gradients

- **Partial Derivatives**: In multivariable calculus, a partial derivative of a function of several variables is its derivative with respect to one of those variables, holding the other variables constant. For a function `f(x, y)`, the partial derivative with respect to `x` is denoted as `∂f/∂x` or `f_x`, and the partial derivative with respect to `y` is denoted as `∂f/∂y` or `f_y`.

    The intuition behind partial derivatives is that they measure the rate at which the function changes with respect to one of its variables, while keeping the other variables constant. This is like slicing the graph of the function with a plane parallel to one of the axes and finding the slope of the resulting curve.

- **Gradient**: The gradient of a function is a vector that contains all of its first order partial derivatives. For a function `f(x, y)`, the gradient is denoted as `∇f` or `grad f`, and is given by `[∂f/∂x, ∂f/∂y]`.

    The intuition behind the gradient is that it points in the direction of the greatest rate of increase of the function, and its magnitude is the rate of increase in that direction. This is useful for optimization problems, where one wants to find the maximum or minimum of a function. The gradient is also used in the gradient descent algorithm, which minimizes a function by iteratively moving in the direction of steepest descent, that is, the direction of the negative gradient.

    In these notes, the row vector of partial derivatives is the gradient: computing each partial derivative and placing the results in a row vector is computing the gradient of the function.
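
The definitions above can be checked numerically. The following is a minimal sketch, not a production implementation: it estimates each partial derivative with a central difference (nudging one variable while holding the others fixed) and collects the results into a gradient. The helper names `partial` and `gradient`, the example function `f`, and the step size `h` are all assumptions chosen for illustration.

```python
def partial(f, point, i, h=1e-6):
    """Central-difference estimate of the partial derivative of f along axis i."""
    up = list(point)
    down = list(point)
    up[i] += h      # nudge only variable i; every other variable is held constant
    down[i] -= h
    return (f(*up) - f(*down)) / (2 * h)


def gradient(f, point, h=1e-6):
    """Collect every first-order partial derivative into one list."""
    return [partial(f, point, i, h) for i in range(len(point))]


def f(x, y):
    # Example function: f_x = 2xy and f_y = x^2 + 3
    return x**2 * y + 3 * y


grad_f = gradient(f, (1.0, 2.0))  # analytic value at (1, 2): [4.0, 4.0]
```

Finite differences are only approximate, so the estimate should be compared against the analytic gradient with a tolerance rather than for exact equality.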

### Computing Gradients

- **Computing the Gradient**: The gradient of a function at a specific point can be computed by taking the partial derivatives of the function with respect to each variable at that point. For a function `f(x, y)`, the gradient at the point `(a, b)` is given by `[∂f/∂x(a, b), ∂f/∂y(a, b)]`.

    The intuition behind computing the gradient is that it gives the direction and rate of fastest increase of the function at a specific point. This is useful for understanding the behavior of the function near that point, and for optimization problems where one wants to find the maximum or minimum of the function.
    
- **Partial Derivatives and Row Vectors**: The partial derivatives of a function can be represented as a row vector, where each element of the vector is a partial derivative of the function with respect to one of its variables. For a function `f(x, y)`, the partial derivatives can be represented as the row vector `[∂f/∂x, ∂f/∂y]`.

    The intuition behind representing partial derivatives as a row vector is that multiplying the row vector by a column vector of small changes in the variables gives the first-order change in the function. This is useful for understanding the behavior of the function near a specific point, and for optimization problems where one wants to find the maximum or minimum of the function.
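
One concrete use of the row-vector form is estimating how much `f` changes for a small move away from `(a, b)`: the row vector of partials times a column vector of small input changes gives the first-order change in `f`. The sketch below assumes a hypothetical example function and a hand-derived gradient; the names `grad_f`, `step`, `predicted`, and `actual` are illustrative only.

```python
def f(x, y):
    # Hypothetical example function: f_x = 2x + y, f_y = x
    return x**2 + x * y


def grad_f(a, b):
    """Analytic row vector of partials [f_x, f_y] evaluated at (a, b)."""
    return [2 * a + b, a]


a, b = 1.0, 2.0
row = grad_f(a, b)        # gradient at the point (1, 2)
step = [0.01, -0.02]      # small change in (x, y), playing the role of the column vector

# Row vector times column vector: first-order estimate of the change in f
predicted = sum(g * d for g, d in zip(row, step))

# Actual change in f for the same step, for comparison
actual = f(a + step[0], b + step[1]) - f(a, b)
```

The two values agree up to terms that are second order in the step size, so the agreement improves as the step shrinks.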

### Basic Rules of Partial Differentiation

It's important to remember that our gradients involve vectors and matrices, and that matrix multiplication is not commutative. This means that the order of multiplication matters.
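
A small example makes the non-commutativity concrete. This is a sketch with a hand-written `matmul` helper and two arbitrary 2 x 2 matrices, both chosen only for illustration.

```python
def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]


A = [[1, 2],
     [3, 4]]
B = [[0, 1],
     [1, 0]]

AB = matmul(A, B)  # multiplying by B on the right swaps the columns of A
BA = matmul(B, A)  # multiplying by B on the left swaps the rows of A
```

Here `AB` and `BA` are different matrices, which is why the factors in a matrix derivative expression cannot be freely reordered.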

1. **Constant Rule**: The partial derivative of a constant is zero. This is because a constant doesn't change, so its rate of change is zero.

    `∂/∂x[c] = 0`

2. **Power Rule**: The partial derivative of `x^n` with respect to `x` is `n*x^(n-1)`. For positive integer `n`, this rule follows from the limit definition of the derivative and the binomial theorem.

    `∂/∂x[x^n] = n*x^(n-1)`

3. **Sum Rule**: The partial derivative of a sum of functions is the sum of their partial derivatives. This is intuitive because the rate of change of a sum of functions at a point is just the sum of their rates of change at that point.

    `∂/∂x[f(x, y) + g(x, y)] = ∂f/∂x + ∂g/∂x`

4. **Product Rule**: The partial derivative of a product of two functions is the first function times the partial derivative of the second, plus the second function times the partial derivative of the first. This rule is less intuitive and usually requires proof via the limit definition of the derivative.

    `∂/∂x[f(x, y) * g(x, y)] = f(x, y) * ∂g/∂x + g(x, y) * ∂f/∂x`

5. **Chain Rule**: The partial derivative of a composition of functions is the derivative of the outer function evaluated at the inner function, times the partial derivative of the inner function. This rule is crucial for differentiating complex functions and comes from the limit definition of the derivative.

    `∂/∂x[f(g(x, y))] = f'(g(x, y)) * ∂g/∂x`
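
The product and chain rules can be sanity-checked against finite differences. The sketch below assumes two hypothetical example functions `f` and `g`, uses `sin` as the outer function for the chain rule, and relies on an illustrative helper `partial_x`; none of these come from the rules themselves.

```python
import math


def partial_x(func, x, y, h=1e-6):
    """Central-difference estimate of the partial derivative with respect to x."""
    return (func(x + h, y) - func(x - h, y)) / (2 * h)


def f(x, y):
    return x**2 * y      # f_x = 2xy


def g(x, y):
    return x * y         # g_x = y


x0, y0 = 1.5, 0.5
f_x = 2 * x0 * y0        # analytic partial of f at (x0, y0)
g_x = y0                 # analytic partial of g at (x0, y0)

# Product rule: d/dx[f * g] = f * g_x + g * f_x
product_numeric = partial_x(lambda x, y: f(x, y) * g(x, y), x0, y0)
product_rule = f(x0, y0) * g_x + g(x0, y0) * f_x

# Chain rule with outer function sin: d/dx[sin(g)] = cos(g) * g_x
chain_numeric = partial_x(lambda x, y: math.sin(g(x, y)), x0, y0)
chain_rule = math.cos(g(x0, y0)) * g_x
```

In both cases the numeric estimate and the rule-based value should match to within the finite-difference error.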

### Relationship Between Gradient and Divergence

- **Gradient**: The gradient operates on a scalar field (a function that assigns a scalar value to each point in space) and produces a vector field. The resulting vector field points in the direction of the greatest rate of increase of the scalar field, and its magnitude is the rate of increase in that direction.

- **Divergence**: The divergence operates on a vector field (a function that assigns a vector to each point in space) and produces a scalar field. The resulting scalar field measures the rate at which "density" exits a given region of space. If the divergence at a point is positive, it means vectors are "diverging" or moving away from that point. If it's negative, vectors are "converging" or moving towards that point.

The divergence is a measure of a vector field's tendency to originate from or converge upon certain points, while the gradient is a measure of how a scalar field changes in different directions.
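
The contrast in input and output types can be sketched in code. The example below is an illustration under assumptions: it uses the standard two-dimensional formula `div F = ∂P/∂x + ∂Q/∂y` for `F = [P, Q]`, central differences with an arbitrary step `h`, and three hypothetical fields (`phi`, `outward`, `inward`).

```python
def gradient(phi, x, y, h=1e-5):
    """Scalar field in, vector out: [d(phi)/dx, d(phi)/dy]."""
    return [(phi(x + h, y) - phi(x - h, y)) / (2 * h),
            (phi(x, y + h) - phi(x, y - h)) / (2 * h)]


def divergence(F, x, y, h=1e-5):
    """Vector field in, scalar out: dP/dx + dQ/dy for F = [P, Q]."""
    dP_dx = (F(x + h, y)[0] - F(x - h, y)[0]) / (2 * h)
    dQ_dy = (F(x, y + h)[1] - F(x, y - h)[1]) / (2 * h)
    return dP_dx + dQ_dy


def phi(x, y):
    return x**2 + y**2      # scalar field


def outward(x, y):
    return [x, y]           # vectors point away from the origin


def inward(x, y):
    return [-x, -y]         # vectors point toward the origin


grad_phi = gradient(phi, 1.0, 2.0)        # analytic: [2.0, 4.0]
div_out = divergence(outward, 1.0, 2.0)   # analytic: 2.0 (positive, diverging)
div_in = divergence(inward, 1.0, 2.0)     # analytic: -2.0 (negative, converging)
```

The gradient call returns a two-component vector, while each divergence call returns a single number whose sign matches the "diverging" or "converging" description above.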

## Gradients of Vector-Valued Functions

- **Vector-Valued Functions (Vector Fields)**: A vector field is a function that assigns a vector to each point in a subset of space. For example, in two dimensions, a vector field `F` might be defined as `F(x, y) = [P(x, y), Q(x, y)]`, where `P` and `Q` are scalar functions that give the components of the vector field.

- **Gradients of Vector Fields**: The gradient of a scalar field is a vector field that points in the direction of the greatest rate of increase of the scalar field, and whose magnitude is the rate of increase in that direction. However, the concept of a gradient doesn't directly apply to a vector field, because a vector field assigns a vector to each point in space, not a scalar.

    Instead, there are several operations that can be applied to vector fields that are somewhat analogous to taking the gradient of a scalar field. These include the divergence, which measures the rate at which "density" exits a given region of space, and the curl, which measures the rate of rotation or circulation of the vectors in the field.

    More generally, each entry of the output column vector is itself a scalar function, so it has its own row of partial derivatives. Stacking these rows gives a matrix of partial derivatives. This is the Jacobian matrix.
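
The row-stacking idea can be sketched with finite differences: perturb one input at a time and record how every output component responds. The helper `jacobian`, the example field `F`, and the step size `h` below are assumptions made for illustration.

```python
def jacobian(F, point, h=1e-6):
    """Finite-difference Jacobian: row i holds the partials of output component i."""
    n_inputs = len(point)
    n_outputs = len(F(*point))
    J = [[0.0] * n_inputs for _ in range(n_outputs)]
    for j in range(n_inputs):
        up, down = list(point), list(point)
        up[j] += h
        down[j] -= h
        F_up, F_down = F(*up), F(*down)
        for i in range(n_outputs):
            # Column j of the Jacobian: sensitivity of each output to input j
            J[i][j] = (F_up[i] - F_down[i]) / (2 * h)
    return J


def F(x, y):
    # Hypothetical vector-valued function with components P and Q
    return [x**2 * y, x + y**3]


J = jacobian(F, (1.0, 2.0))
# Analytic Jacobian at (1, 2): [[2xy, x^2], [1, 3y^2]] = [[4, 1], [1, 12]]
```

Each row of `J` is the gradient of one output component, so a function with two outputs and two inputs yields a 2 x 2 matrix.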

### Jacobian Matrix

- **Jacobian Matrix**: The Jacobian matrix is a matrix of all first-order partial derivatives of a vector-valued function. For a vector-valued function `F(x, y) = [f(x, y), g(x, y)]`, the Jacobian matrix is given by:

    `J = [[∂f/∂x, ∂f/∂y], [∂g/∂x, ∂g/∂y]]`

    The Jacobian matrix is used to represent the rate of change of a vector-valued function with respect to its variables. It plays the role of the gradient for a vector-valued function: where the gradient of a scalar function is a vector of first-order partial derivatives, the Jacobian is a matrix with one such row per output component. The Jacobian matrix gives you information about how small changes in the input can affect the output. In particular, if you think of the function `F` as transforming the input space, then the Jacobian matrix at a point gives you the best linear approximation to that transformation near that point.

    For example, if `F` represents the transformation of a physical object (like a stretch, squeeze, or twist), then the Jacobian matrix at a point tells you how a small patch of material centered at that point would be stretched, squeezed, or twisted.


- **Reparameterization Trick**: The reparameterization trick is a method used in variational inference to allow for the backpropagation of gradients through random nodes in a computational graph.

    The trick involves reparameterizing the random variables in a way that separates the deterministic and stochastic parts. For example, if we have a random variable `z` that is normally distributed with mean `μ` and standard deviation `σ`, we can reparameterize `z` as `z = μ + σ * ε`, where `ε` is a standard normal random variable. This allows us to backpropagate gradients through `μ` and `σ` while keeping the stochasticity of `z`.

- **Use of Jacobian**: The Jacobian comes into play when we want to compute the derivative of the expectation of a function with respect to the parameters of the distribution. The reparameterization trick allows us to express this derivative as an expectation of the derivative of the function, which can then be estimated by sampling.

    Specifically, if `f(z)` is a function of `z`, and `z` is reparameterized as `z = g(ε; θ)`, where `ε` is a random variable and `θ` are the parameters, then the derivative of the expectation of `f(z)` with respect to `θ` can be expressed as an expectation of the product of the Jacobian of `g` with respect to `θ` and the gradient of `f` with respect to `z`.
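
As a minimal sketch of the estimator described above, the example below uses the Gaussian reparameterization `z = μ + σ * ε` with the hypothetical choice `f(z) = z^2`, for which `E[f(z)] = μ^2 + σ^2` and the exact gradients are `2μ` and `2σ`. The function names, sample count, and seed are assumptions for illustration; a real model would typically rely on an automatic differentiation library rather than hand-written derivatives.

```python
import random


def f(z):
    return z**2             # example function; E[f(z)] = mu^2 + sigma^2


def df_dz(z):
    return 2 * z            # gradient of f with respect to z


def reparam_gradients(mu, sigma, n_samples=100_000, seed=0):
    """Monte Carlo estimate of the gradients of E[f(z)] for z = mu + sigma * eps."""
    rng = random.Random(seed)
    grad_mu = 0.0
    grad_sigma = 0.0
    for _ in range(n_samples):
        eps = rng.gauss(0.0, 1.0)    # stochastic part, independent of mu and sigma
        z = mu + sigma * eps         # deterministic in (mu, sigma) once eps is drawn
        # Chain rule per sample: df/dz times dz/d(mu) = 1 and dz/d(sigma) = eps
        grad_mu += df_dz(z) * 1.0
        grad_sigma += df_dz(z) * eps
    return grad_mu / n_samples, grad_sigma / n_samples


grad_mu, grad_sigma = reparam_gradients(mu=1.0, sigma=0.5)
# Analytic gradients for comparison: 2 * mu = 2.0 and 2 * sigma = 1.0
```

Because `z` is a scalar here, the Jacobian of `g` with respect to `θ = (μ, σ)` reduces to the pair `(1, ε)`, and the estimates are noisy sample averages that approach the analytic values as the number of samples grows.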

## Gradients of Matrices

- **Gradients of Matrices**: The gradient of a matrix generalizes the notion of a derivative to functions that output matrices. If a function `F` takes a vector `x` in `R^n` and outputs a matrix `A` in `R^(m x p)`, then the gradient of `F` with respect to `x` is a three-dimensional array (or tensor) with dimensions `m x p x n`.

    Each slice of this tensor along the third dimension is an `m x p` matrix that represents the derivative of `A` with respect to one component of `x`. In other words, if `x_k` is the `k`-th component of `x`, then the `k`-th slice of the gradient tensor is the matrix of partial derivatives `∂A/∂x_k`.

- **Intuition**: The gradient of a matrix tells you how each component of the output matrix changes as you change the components of the input vector. It's a way of capturing the sensitivity of the output to changes in the input.
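
To make the `m x p x n` shape concrete, here is a sketch that builds the tensor by finite differences using nested lists. The map `F` from `R^2` to a 2 x 3 matrix is a hypothetical example, and `matrix_gradient` is an illustrative helper, not a library function.

```python
def F(x):
    # Hypothetical map from R^2 to a 2 x 3 matrix
    x0, x1 = x
    return [[x0, x0 * x1, x1**2],
            [1.0, x1, x0**2]]


def matrix_gradient(F, x, h=1e-6):
    """Return T with T[i][j][k] = d A[i][j] / d x[k], where A = F(x)."""
    A = F(x)
    m, p, n = len(A), len(A[0]), len(x)
    T = [[[0.0] * n for _ in range(p)] for _ in range(m)]
    for k in range(n):
        up, down = list(x), list(x)
        up[k] += h
        down[k] -= h
        A_up, A_down = F(up), F(down)
        for i in range(m):
            for j in range(p):
                T[i][j][k] = (A_up[i][j] - A_down[i][j]) / (2 * h)
    return T


T = matrix_gradient(F, [2.0, 3.0])
shape = (len(T), len(T[0]), len(T[0][0]))   # (m, p, n) = (2, 3, 2)

# Slice along the third dimension: derivative of the whole matrix w.r.t. x[0]
dA_dx0 = [[T[i][j][0] for j in range(3)] for i in range(2)]
# Analytic: [[1, x1, 0], [0, 0, 2*x0]] = [[1, 3, 0], [0, 0, 4]]
```

Fixing the last index picks out one input component and leaves a matrix of the same shape as the output.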

### Gradients of Vectors with Respect to Matrices

If you have a function `f` that maps a matrix `A` in `R^(m x n)` to a vector `b` in `R^p`, the gradient of `f` with respect to `A` is a tensor in `R^(p x m x n)`.

Each slice of this tensor along the first dimension is a matrix that represents the derivative of one component of `b` with respect to the matrix `A`. 

In other words, if `f_i` is the `i`-th component of `f`, then the `i`-th slice of the gradient tensor is the matrix of partial derivatives of `f_i` with respect to `A`.
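
A simple instance is the matrix-vector product `b = A v` with a fixed vector `v`. The sketch below assumes a 2 x 3 matrix (so `p = 2`, `m = 2`, `n = 3`), an arbitrary fixed vector `V`, and an illustrative finite-difference helper.

```python
V = [1.0, 2.0, 3.0]          # fixed vector (an assumption for this example)


def f(A):
    """Map a 2 x 3 matrix A to the vector b = A V in R^2."""
    return [sum(a * v for a, v in zip(row, V)) for row in A]


def vector_wrt_matrix_gradient(f, A, h=1e-6):
    """Return T with T[i][j][k] = d b[i] / d A[j][k], where b = f(A)."""
    p, m, n = len(f(A)), len(A), len(A[0])
    T = [[[0.0] * n for _ in range(m)] for _ in range(p)]
    for j in range(m):
        for k in range(n):
            up = [row[:] for row in A]
            down = [row[:] for row in A]
            up[j][k] += h
            down[j][k] -= h
            b_up, b_down = f(up), f(down)
            for i in range(p):
                T[i][j][k] = (b_up[i] - b_down[i]) / (2 * h)
    return T


A = [[1.0, 0.0, 2.0],
     [0.5, 1.5, -1.0]]
T = vector_wrt_matrix_gradient(f, A)
shape = (len(T), len(T[0]), len(T[0][0]))   # (p, m, n) = (2, 2, 3)

# T[0] is the matrix of partials of b[0] with respect to A:
# row 0 equals V (b[0] uses only row 0 of A) and row 1 is all zeros.
```

The zero rows reflect that each output component of `A v` depends on only one row of `A`.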

### Gradients of Matrices with Respect to Matrices

If you have a function `f` that maps a matrix `A` in `R^(m x n)` to another matrix `B` in `R^(p x q)`, the gradient of `f` with respect to `A` is a four-dimensional tensor in `R^(p x q x m x n)`.

Each slice of this tensor obtained by fixing the first two indices is an `m x n` matrix that represents the derivative of one component of `B` with respect to the matrix `A`.

In other words, if `f_ij` is the `(i, j)`-th component of `f`, then the `(i, j)`-th slice of the gradient tensor is the matrix of partial derivatives of `f_ij` with respect to `A`.

The gradient of a matrix with respect to another matrix tells you how each component of the output matrix changes as you change the components of the input matrix. It's a way of capturing the sensitivity of the output to changes in the input.
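
As an illustration of the four-dimensional case, the sketch below uses the hypothetical map `B = A A^T` on a 2 x 3 matrix, so `p = q = 2`, `m = 2`, and `n = 3`. The helper name and the finite-difference step are assumptions; the point is only to show the shape and one slice.

```python
def f(A):
    """Map an m x n matrix A to the m x m matrix B = A A^T."""
    return [[sum(a * b for a, b in zip(row_i, row_j)) for row_j in A]
            for row_i in A]


def matrix_wrt_matrix_gradient(f, A, h=1e-6):
    """Return T with T[i][j][r][c] = d B[i][j] / d A[r][c], where B = f(A)."""
    B = f(A)
    p, q, m, n = len(B), len(B[0]), len(A), len(A[0])
    T = [[[[0.0] * n for _ in range(m)] for _ in range(q)] for _ in range(p)]
    for r in range(m):
        for c in range(n):
            up = [row[:] for row in A]
            down = [row[:] for row in A]
            up[r][c] += h
            down[r][c] -= h
            B_up, B_down = f(up), f(down)
            for i in range(p):
                for j in range(q):
                    T[i][j][r][c] = (B_up[i][j] - B_down[i][j]) / (2 * h)
    return T


A = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]
T = matrix_wrt_matrix_gradient(f, A)
shape = (len(T), len(T[0]), len(T[0][0]), len(T[0][0][0]))  # (p, q, m, n) = (2, 2, 2, 3)

# T[0][0] holds the partials of B[0][0], the sum of squares of row 0 of A:
# analytically [[2, 4, 6], [0, 0, 0]]
```

Fixing `(i, j)` yields an `m x n` matrix, matching the slice description above.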
