In [1]:
import torch

In [2]:
# Create tensors with requires_grad=True
x = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(3.0, requires_grad=True)

In [3]:
f = x**2 + y**3

In [4]:
f

tensor(31., grad_fn=<AddBackward0>)

In [5]:
f.backward()

In [6]:
x.grad

tensor(4.)

In [7]:
y.grad

tensor(27.)

In [9]:
a = torch.tensor([2.0, 3.0], requires_grad=True)

In [10]:
y = a**2

In [11]:
y

tensor([4., 9.], grad_fn=<PowBackward0>)

In [13]:
y.backward(torch.tensor([1.0, 1.0]))

In [14]:
a.grad

tensor([4., 6.])

The **Jacobian** and the **Jacobian-vector product (JVP)** are closely related, but they serve different purposes in computation and interpretation.

### 1. **Jacobian**:
- The **Jacobian** is a matrix that contains all the first-order partial derivatives of a vector-valued function with respect to its inputs. It describes how each component of the output changes with respect to each input.
- If \( f: \mathbb{R}^n \to \mathbb{R}^m \) is a function, the **Jacobian matrix** \( J_f(\mathbf{x}) \) has the shape \( m \times n \) and is defined as:

\[
J_f(\mathbf{x}) = \begin{pmatrix}
    \frac{\partial y_1}{\partial x_1} & \frac{\partial y_1}{\partial x_2} & \cdots & \frac{\partial y_1}{\partial x_n} \\
    \frac{\partial y_2}{\partial x_1} & \frac{\partial y_2}{\partial x_2} & \cdots & \frac{\partial y_2}{\partial x_n} \\
    \vdots & \vdots & \ddots & \vdots \\
    \frac{\partial y_m}{\partial x_1} & \frac{\partial y_m}{\partial x_2} & \cdots & \frac{\partial y_m}{\partial x_n}
\end{pmatrix}
\]

- The **Jacobian matrix** gives the rate of change of each output with respect to each input.
- **Example**: If you have a function \( f(\mathbf{x}) = [x_1^2, 2x_1x_2] \), its Jacobian matrix will be:

\[
J_f(\mathbf{x}) = \begin{pmatrix}
    2x_1 & 0 \\
    2x_2 & 2x_1
\end{pmatrix}
\]

This provides a full representation of how all inputs affect all outputs.
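
As a quick check of the matrix above, the following sketch evaluates the Jacobian at the arbitrarily chosen point \( \mathbf{x} = (2, 3) \) with `torch.autograd.functional.jacobian` and compares it with the analytic formula. The name `f_example` and the evaluation point are illustrative assumptions, not part of the original notebook.

```python
import torch
from torch.autograd.functional import jacobian

def f_example(x):
    # f(x) = [x1^2, 2*x1*x2]; torch.stack keeps the result connected to x
    return torch.stack([x[0] ** 2, 2 * x[0] * x[1]])

x0 = torch.tensor([2.0, 3.0])
J_auto = jacobian(f_example, x0)

# Analytic Jacobian [[2*x1, 0], [2*x2, 2*x1]] evaluated at x0 = (2, 3)
J_analytic = torch.tensor([[2 * 2.0, 0.0],
                           [2 * 3.0, 2 * 2.0]])

print(J_auto)
print(torch.allclose(J_auto, J_analytic))
```

Row \( i \) of the result holds the partial derivatives of output \( i \), matching the \( m \times n \) layout defined earlier.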

### 2. **Jacobian-vector product (JVP)**:
- The **Jacobian-vector product** computes the product of the Jacobian matrix \( J_f(\mathbf{x}) \) with a given vector \( \mathbf{v} \), without explicitly forming the full Jacobian.
- This operation provides a linear approximation of how the output changes when the input is perturbed along a specific direction, defined by \( \mathbf{v} \).
  
\[
J_f(\mathbf{x}) \cdot \mathbf{v}
\]

- Instead of computing the entire Jacobian matrix, the JVP allows you to compute the effect of perturbations in a specific direction efficiently.
- **Example**: If we have the Jacobian \( J_f(\mathbf{x}) \) from the example above:

\[
J_f(\mathbf{x}) = \begin{pmatrix}
    2x_1 & 0 \\
    2x_2 & 2x_1
\end{pmatrix}
\]

Multiplying it by a vector \( \mathbf{v} = [v_1, v_2] \) gives the **Jacobian-vector product**:

\[
J_f(\mathbf{x}) \cdot \mathbf{v} = \begin{pmatrix}
    2x_1 \cdot v_1 + 0 \cdot v_2 \\
    2x_2 \cdot v_1 + 2x_1 \cdot v_2
\end{pmatrix}
\]

- The result is a vector that tells you how the outputs of the function change when the inputs are perturbed along the direction \( \mathbf{v} \).
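
The product above can be reproduced with `torch.autograd.functional.jvp`, which returns the function value together with \( J_f(\mathbf{x}) \cdot \mathbf{v} \). This is a minimal sketch: the point \( \mathbf{x} = (2, 3) \), the direction \( \mathbf{v} = (1, 0.5) \), and the name `f_example` are assumptions chosen for illustration.

```python
import torch
from torch.autograd.functional import jacobian, jvp

def f_example(x):
    # Same function as in the Jacobian example: f(x) = [x1^2, 2*x1*x2]
    return torch.stack([x[0] ** 2, 2 * x[0] * x[1]])

x0 = torch.tensor([2.0, 3.0])
v = torch.tensor([1.0, 0.5])

# jvp returns (f(x0), J_f(x0) @ v); the Jacobian matrix is never returned
value, jvp_result = jvp(f_example, x0, v)

# Reference: form the full Jacobian and multiply explicitly
reference = jacobian(f_example, x0) @ v

print(jvp_result)
print(torch.allclose(jvp_result, reference))
```

With these values the formula gives \( (2 \cdot 2 \cdot 1,\; 2 \cdot 3 \cdot 1 + 2 \cdot 2 \cdot 0.5) = (4, 8) \), which is what both computations should agree on.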

### Key Differences:
1. **Jacobian**:
   - Computes all partial derivatives of the function.
   - Forms a matrix that fully describes how each output changes with respect to each input.
   - Useful when you need a complete picture of the sensitivity of the output to all inputs.
   - Often too expensive to compute directly for large systems, because all \( m \times n \) partial derivatives must be calculated and stored.

2. **Jacobian-vector product (JVP)**:
   - Instead of computing the full Jacobian matrix, it computes the product of the Jacobian with a vector \( \mathbf{v} \), which gives the first-order change in the outputs when the inputs are perturbed along the direction \( \mathbf{v} \).
   - Useful in optimization or neural network training when you need to know how a specific small change in the input affects the output, without computing all the partial derivatives.
   - More computationally efficient, especially in high-dimensional problems.
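
To make the cost difference concrete, the sketch below assembles a full Jacobian from one JVP per input dimension: each standard basis vector \( \mathbf{e}_j \) recovers column \( j \). A single direction therefore needs one JVP, while the whole matrix needs \( n \) of them. The function `f_example` and the point \( (2, 3) \) are illustrative assumptions.

```python
import torch
from torch.autograd.functional import jacobian, jvp

def f_example(x):
    return torch.stack([x[0] ** 2, 2 * x[0] * x[1]])

x0 = torch.tensor([2.0, 3.0])
n = x0.numel()

# One JVP per standard basis vector e_j yields column j of the Jacobian
columns = []
for j in range(n):
    e_j = torch.zeros(n)
    e_j[j] = 1.0
    _, col = jvp(f_example, x0, e_j)
    columns.append(col)

J_from_jvps = torch.stack(columns, dim=1)
print(J_from_jvps)
print(torch.allclose(J_from_jvps, jacobian(f_example, x0)))
```

For two inputs the loop is trivial, but the number of JVP calls grows linearly with the input dimension.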

### Analogy:
Think of the **Jacobian** as a map that describes how every road (input) in a city (system) affects every destination (output). The **Jacobian-vector product** is like asking: "If I drive along a specific road in a specific direction, how will it affect my arrival at different destinations?" It’s a focused, more efficient way to get information along one direction without needing the full map.

In PyTorch, the `.grad` attribute of a tensor does not hold the full **Jacobian** for non-scalar outputs (vector-valued functions). Instead, after a call to `.backward()`, it holds the **gradient** of a scalar-valued function (typically the loss function) with respect to that tensor. This is different from the Jacobian of a vector-valued function.

### What `.grad` Returns:

1. **For scalar-valued functions** (like the loss in machine learning):
   - If you compute the gradient of a scalar function \( f: \mathbb{R}^n \to \mathbb{R} \) with respect to an input \( \mathbf{x} \in \mathbb{R}^n \), the `.grad` attribute will give you a vector of partial derivatives, which is the gradient.
   - This is a special case where the gradient matches the **Jacobian** up to a transpose: the Jacobian of a scalar-valued function \( f \) is a \( 1 \times n \) row vector of partial derivatives, and the gradient is the same entries arranged as a column vector:
   
   \[
   \nabla f(\mathbf{x}) = \begin{pmatrix} \frac{\partial f}{\partial x_1} & \frac{\partial f}{\partial x_2} & \cdots & \frac{\partial f}{\partial x_n} \end{pmatrix}^T
   \]
   - In this case, `.grad` contains this gradient vector, with one partial derivative per input component.

2. **For non-scalar (vector-valued) functions**:
   - If the function \( f: \mathbb{R}^n \to \mathbb{R}^m \) produces a vector output (i.e., has multiple components in the output), backpropagating from it does not put the full **Jacobian** matrix (which would be an \( m \times n \) matrix) in `.grad`. Calling `.backward()` on a non-scalar tensor without arguments raises a `RuntimeError`.
   - Instead, you either reduce the output to a scalar first (e.g., by summing it or taking a dot product), or pass a weight vector \( \mathbf{v} \in \mathbb{R}^m \) as in `.backward(v)`. In both cases `.grad` holds the **vector-Jacobian product** \( \mathbf{v}^T J_f(\mathbf{x}) \). With \( \mathbf{v} \) set to all ones, this is the gradient of the **sum of the output components**.

### Why `.grad` Does Not Return the Jacobian for Non-Scalar Outputs:
- **Backpropagation** in deep learning is typically designed to compute gradients of a **scalar loss** with respect to the model parameters. When you compute `.backward()`, the goal is to compute how much a small change in each parameter affects the **scalar loss** value. This results in a single gradient vector, not the full Jacobian matrix.
- In practical terms, this means `.grad` gives you the gradients required for optimization but not the full set of partial derivatives (Jacobian) for every output component with respect to every input component.
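
The behavior described above can be checked directly. In this minimal sketch, the names `out` and `v` and the weights \( (1, 10) \) are illustrative assumptions; unequal weights are used so that the weighting is visible in the result.

```python
import torch

a = torch.tensor([2.0, 3.0], requires_grad=True)
out = a ** 2  # vector output; its Jacobian is diag(2*a) = [[4, 0], [0, 6]]

# Without a gradient argument, backward() on a non-scalar tensor raises
try:
    out.backward()
    raised = False
except RuntimeError:
    raised = True
print(raised)

# With a weight vector v, backward(v) accumulates v^T J into a.grad
v = torch.tensor([1.0, 10.0])
out.backward(v)
print(a.grad)
```

Here \( \mathbf{v}^T J = (1 \cdot 4,\; 10 \cdot 6) = (4, 60) \): a weighted combination of the Jacobian's rows, not the matrix itself.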

### Example:
Let’s look at a simple example with a vector-valued function in PyTorch to illustrate this:

```python
import torch

# Define a function with a vector output
x = torch.tensor([2.0, 3.0], requires_grad=True)
y = torch.tensor([1.0, 1.0])

# A vector-valued function
f = torch.stack([x[0]**2, x[1]**2])

# If you attempt to backpropagate, you would need to reduce the output to a scalar first
loss = torch.dot(f, y)  # Example of reducing to scalar by dot product

# Backpropagate to compute gradients
loss.backward()

# Check the gradients
print(x.grad)
```

### Output:
```
tensor([4., 6.])
```

In this case:
- `f = [x[0]^2, x[1]^2]` is a vector-valued function, but we reduced it to a scalar (`loss = x[0]^2 * 1 + x[1]^2 * 1`).
- When we called `loss.backward()`, it computed the **gradient of the scalar loss** with respect to `x`, which is `4.0` (for \( x_0 \)) and `6.0` (for \( x_1 \)), the derivatives of each squared term in the loss.
- The `.grad` attribute contains this gradient vector, which equals the product \( \mathbf{y}^T J_f(\mathbf{x}) \) with \( \mathbf{y} = [1, 1] \), not the full Jacobian of the vector function \( f \).

### How to Get the Full Jacobian in PyTorch:
To compute the full Jacobian for vector-valued functions, PyTorch provides `torch.autograd.functional.jacobian`, which will give you the full matrix of partial derivatives:

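```python
import torch
from torch.autograd.functional import jacobian

# Function that outputs a vector. torch.stack keeps the result in the
# autograd graph; torch.tensor([...]) would copy the values and detach them.
def f(x):
    return torch.stack([x[0]**2, x[1]**2])

# Point at which to evaluate the Jacobian (same values as the example above)
x = torch.tensor([2.0, 3.0])

# Compute the Jacobian
J = jacobian(f, x)

print(J)
```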

### Output:
```
tensor([[4., 0.],
        [0., 6.]])
```

This matrix gives the full Jacobian of the vector-valued function \( f(x) = [x_0^2, x_1^2] \), where each element is the partial derivative \( \frac{\partial f_i}{\partial x_j} \).
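
The same matrix can also be assembled row by row with reverse-mode calls, which shows how the Jacobian relates to what backpropagation computes. This is a sketch under the assumption that one call per output component is acceptable; it uses `torch.autograd.grad` with a one-hot `grad_outputs` vector, and the names `one_hot` and `J_rows` are illustrative.

```python
import torch

x = torch.tensor([2.0, 3.0], requires_grad=True)
f = torch.stack([x[0] ** 2, x[1] ** 2])

rows = []
for i in range(f.numel()):
    one_hot = torch.zeros_like(f)
    one_hot[i] = 1.0
    # The vector-Jacobian product one_hot^T J selects row i of J;
    # retain_graph=True keeps the graph alive for the next iteration
    (row,) = torch.autograd.grad(f, x, grad_outputs=one_hot, retain_graph=True)
    rows.append(row)

J_rows = torch.stack(rows)
print(J_rows)
```

Each backward pass yields one row, so a function with \( m \) outputs needs \( m \) passes to recover the full matrix this way.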

### Summary:
- **`.grad` in PyTorch** holds the gradient of a **scalar-valued** function (like a loss) with respect to its inputs, as computed by `.backward()`, not the full Jacobian.
- For **non-scalar outputs**, you must either reduce the output to a scalar or pass a weight vector to `.backward()`; `.grad` then holds a vector-Jacobian product, not the full Jacobian.
- To compute the full Jacobian for vector-valued functions, you can use `torch.autograd.functional.jacobian`.

In [15]:
# Define a function with a vector output
x = torch.tensor([2.0, 3.0], requires_grad=True)
y = torch.tensor([1.0, 1.0])

In [16]:
# A vector-valued function
f = torch.stack([x[0]**2, x[1]**2])

In [18]:
f

tensor([4., 9.], grad_fn=<StackBackward0>)

In [17]:
# If you attempt to backpropagate, you would need to reduce the output to a scalar first
loss = torch.dot(f, y)  # Example of reducing to scalar by dot product

In [19]:
loss

tensor(13., grad_fn=<DotBackward0>)

In [20]:
# Backpropagate to compute gradients
loss.backward()

In [23]:
x.grad

tensor([4., 6.])