Attributes:

- data (float): the scalar value associated with this node
- grad (float): the gradient of the output of the computational graph w.r.t. this node’s value
- label (str): a label for this node, used for debugging and visualization purposes
- _op (str): a string representation of the operation that produced this node in the computational graph
- _prev (set of Value objects): the set of nodes that contributed to the computation of this node
- _backward (function): a function that computes the gradients of this node w.r.t. its inputs
Methods:

- `__init__(self, data, children=(), op='', label='')`: Initializes a Value object with the given data, children, op, and label
- `__repr__(self)`: Returns a string representation of this Value object
- `__add__(self, other)`: Implements the addition operation between two Value objects
- `__mul__(self, other)`: Implements the multiplication operation between two Value objects
- `item(self)`: Returns the scalar value associated with this Value object
- `tanh(self)`: Applies the hyperbolic tangent function to this Value object and returns a new Value object
Attributes:

- data: The actual data of the tensor. It is computed lazily.
- children: Other tensors that this tensor depends on for computing its value.
- requires_grad: Whether this tensor needs to compute gradients.
Methods:

- realize_data: Computes and returns the actual data for this tensor.
- shape: Returns the shape of this tensor.
- dtype: Returns the data type of this tensor.
The out_grad parameter refers to the gradient of the loss function with respect to the output of the node. Multiplying it by the local gradient gives the gradient of the loss with respect to the input of the node, according to the chain rule of calculus, which is the basis for backpropagation in neural networks.

The chain rule is a fundamental concept in calculus that provides a method to compute the derivative of composite functions. In simple terms, the chain rule states that the derivative of a composite function is the derivative of the outer function multiplied by the derivative of the inner function.

Given a composite function that is the composition of two functions, say, \(f(g(x))\), the chain rule can be stated as follows:

\[\frac{d}{dx} f(g(x)) = f'(g(x)) \cdot g'(x)\]
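To make this concrete, here is a small self-contained sketch (the helper names `mul_by_3` and `square` are illustrative, not part of the library in this post). It builds \(f(x) = (3x)^2\) out of two nodes, each returning its value and a local backward function, then multiplies local gradients together during the backward pass, exactly as out_grad is combined with local gradients above.

```python
# Each "node" returns its output and a backward closure that multiplies
# the incoming gradient (out_grad) by the node's local gradient.

def mul_by_3(x):
    y = 3 * x
    # local gradient dy/dx = 3
    return y, lambda out_grad: out_grad * 3

def square(y):
    z = y ** 2
    # local gradient dz/dy = 2y
    return z, lambda out_grad: out_grad * 2 * y

x = 2.0
y, back1 = mul_by_3(x)   # y = 6.0
z, back2 = square(y)     # z = 36.0

# Backward pass: start from dz/dz = 1 and apply the chain rule node by node.
dz_dy = back2(1.0)       # 1 * 2y = 12.0
dz_dx = back1(dz_dy)     # 12 * 3 = 36.0
print(dz_dx)             # 36.0, matching d/dx (3x)^2 = 18x at x = 2
```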
Explanation for the derivative of the AddScalar operator:

Let’s denote the scalar as c and the tensor it is added to as a. The operation can be described as f(a) = a + c.

The function for the backward pass (i.e., the gradient) is df/da = 1, which means the derivative of f(a) with respect to a is simply 1.
We are given a function \(f(a) = a + c\), where \(a\) is a tensor and \(c\) is a scalar. Our task is to find the derivative of this function with respect to \(a\).

By differentiating the function \(f(a)\) with respect to \(a\), we find:

\[\begin{align*}
\frac{df}{da} &= \frac{d}{da} (a + c) \\
&= 1
\end{align*}\]

Therefore, the gradient of \(f(a)\) with respect to \(a\) is \(1\).

The derivation starts by defining the function f(a) = a + c; differentiating f(a) with respect to a yields 1, so the gradient of f(a) with respect to a is 1, which matches the behavior of the AddScalar operator as implemented in its gradient method.
Example:

    >>> a = Tensor([1, 2, 3])
    >>> op = AddScalar(5)
    >>> result = op.compute(a)
    >>> print(result)
    Tensor([6, 7, 8])
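As a quick sanity check (using NumPy directly rather than the Tensor class above), a central finite difference confirms that the gradient of f(a) = a + c is 1 everywhere:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
c = 5.0
eps = 1e-6

f = lambda t: t + c
# Central difference: (f(a + eps) - f(a - eps)) / (2 * eps)
grad_fd = (f(a + eps) - f(a - eps)) / (2 * eps)
print(grad_fd)  # approximately [1., 1., 1.]
```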
Element Wise Multiplication

Explanation for the derivative of the EWiseMul (element-wise multiplication) operator:

Let’s denote the two input tensors as a and b. The operation can be described as f(a, b) = a * b, where * represents element-wise multiplication.

The function for the backward pass (i.e., the gradient) is df/da = b and df/db = a. This means that the derivative of f(a, b) with respect to a is b, and the derivative with respect to b is a.
We are given a function \(f(a, b) = a \odot b\), where \(a\) and \(b\) are tensors, and \(\odot\) represents element-wise multiplication. Our task is to find the derivatives of this function with respect to \(a\) and \(b\).

By differentiating the function \(f(a, b)\) with respect to \(a\), we find:

\[\begin{align*}
\frac{df}{da} &= \frac{d}{da} (a \odot b) \\
&= b
\end{align*}\]

Therefore, the gradient of \(f(a, b)\) with respect to \(a\) is \(b\).

Similarly, by differentiating the function \(f(a, b)\) with respect to \(b\), we find:

\[\begin{align*}
\frac{df}{db} &= \frac{d}{db} (a \odot b) \\
&= a
\end{align*}\]

Therefore, the gradient of \(f(a, b)\) with respect to \(b\) is \(a\).
Performs element-wise multiplication of two tensors.

Example:

    >>> a = Tensor([1, 2, 3])
    >>> b = Tensor([4, 5, 6])
    >>> op = EWiseMul()
    >>> result = op.compute(a, b)
    >>> print(result)
    Tensor([4, 10, 18])
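The gradients df/da = b and df/db = a can also be checked numerically with plain NumPy (a sketch independent of the library above). Because the operation is element-wise, perturbing every element at once is enough:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])
eps = 1e-6

# Central differences of f(a, b) = a * b with respect to each input.
grad_a_fd = ((a + eps) * b - (a - eps) * b) / (2 * eps)  # expect b
grad_b_fd = (a * (b + eps) - a * (b - eps)) / (2 * eps)  # expect a
print(grad_a_fd)  # approximately [4., 5., 6.]
print(grad_b_fd)  # approximately [1., 2., 3.]
```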
Scalar Multiplication

Let’s denote the scalar as c and the tensor it multiplies as a. The operation can be described as f(a) = a * c.

The function for the backward pass (i.e., the gradient) is df/da = c, which means the derivative of f(a) with respect to a is c.

We are given a function \(f(a) = a \cdot c\), where \(a\) is a tensor and \(c\) is a scalar. Our task is to find the derivative of this function with respect to \(a\).
By differentiating the function \(f(a)\) with respect to \(a\), we find:

\[\begin{align*}
\frac{df}{da} &= \frac{d}{da} (a \cdot c) \\
&= c
\end{align*}\]

Therefore, the gradient of \(f(a)\) with respect to \(a\) is \(c\).

The derivation starts by defining the function f(a) = a * c; differentiating f(a) with respect to a yields c, so the gradient of f(a) with respect to a is c, which matches the behavior of the MulScalar operator as implemented in its gradient method.
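A brief numeric check of df/da = c (NumPy only, not the Tensor class above):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
c = 5.0
eps = 1e-6

# Central difference of f(a) = a * c; the slope should be c everywhere.
grad_fd = ((a + eps) * c - (a - eps) * c) / (2 * eps)
print(grad_fd)  # approximately [5., 5., 5.]
```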
Negate

Explanation for the derivative of the Negate operator:

Let’s denote a as the tensor being negated. The operation can be described as f(a) = -a, and differentiating with respect to a gives df/da = -1.

Therefore, the gradient of \(f(a)\) with respect to \(a\) is \(-1\).
```python
class Negate(TensorOp):
    """
    Negates the given tensor.

    Example:
        >>> a = Tensor([1, -2, 3])
        >>> op = Negate()
        >>> result = op.compute(a)
        >>> print(result)
        Tensor([-1, 2, -3])
    """

    def compute(self, a: NDArray) -> NDArray:
        """
        Computes the negation of a tensor.

        Args:
        - a: The tensor to negate.

        Returns:
        The negation of a.
        """
        return -1 * a

    def gradient(self, out_grad: Tensor, node: Tensor) -> Tuple[Tensor,]:
        """
        Computes the gradient of the negation operation.

        Args:
        - out_grad: The gradient of the output of the operation.
        - node: The node in the computational graph where the operation was performed.

        Returns:
        The gradients with respect to the inputs.
        """
        return (negate(out_grad), )


def negate(a: Tensor) -> Tensor:
    """
    Negates the given tensor.

    Args:
    - a: The tensor to negate.

    Returns:
    The negation of a.

    Example:
        >>> a = Tensor([1, -2, 3])
        >>> result = negate(a)
        >>> print(result)
        Tensor([-1, 2, -3])
    """
    return Negate()(a)
```
Exp

Explanation for the derivative of the Exp operator:

Let’s denote a as the tensor on which the exponential function is applied. The operation can be described as f(a) = exp(a), where exp represents the exponential function.

The function for the backward pass (i.e., the gradient) is df/da = exp(a).

We are given a function \(f(a) = \exp(a)\), where \(a\) is a tensor. Our task is to find the derivative of this function with respect to \(a\).

By differentiating the function \(f(a)\) with respect to \(a\), we find:

\[\begin{align*}
\frac{df}{da} &= \frac{d}{da} \exp(a) \\
&= \exp(a)
\end{align*}\]

Therefore, the gradient of \(f(a)\) with respect to \(a\) is \(\exp(a)\).
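A central finite difference (plain NumPy, independent of the Tensor class) confirms that the derivative of exp is exp itself. This self-derivative is also why the implementation below can cache the forward output and reuse it in the backward pass:

```python
import numpy as np

a = np.array([0.0, 1.0, 2.0])
eps = 1e-6

# Numerical derivative of exp at each element.
grad_fd = (np.exp(a + eps) - np.exp(a - eps)) / (2 * eps)
print(grad_fd)  # approximately exp(a) = [1., 2.71828183, 7.3890561]
```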
```python
class Exp(TensorOp):
    """
    Calculates the exponential of the given tensor.

    Example:
        >>> a = Tensor([1, 2, 3])
        >>> op = Exp()
        >>> result = op.compute(a)
        >>> print(result)
        Tensor([2.71828183, 7.3890561, 20.08553692])
    """

    def compute(self, a: NDArray) -> NDArray:
        """
        Computes the exponential of a tensor.

        Args:
        - a: The tensor.

        Returns:
        The exponential of a.
        """
        self.out = array_api.exp(a)
        return self.out

    def gradient(self, out_grad: Tensor, node: Tensor) -> Tuple[Tensor,]:
        """
        Computes the gradient of the exponential operation.

        Args:
        - out_grad: The gradient of the output of the operation.
        - node: The node in the computational graph where the operation was performed.

        Returns:
        The gradients with respect to the inputs.
        """
        return (out_grad * self.out, )


def exp(a: Tensor) -> Tensor:
    """
    Calculates the exponential of the given tensor.

    Args:
    - a: The tensor.

    Returns:
    The exponential of a.

    Example:
        >>> a = Tensor([1, 2, 3])
        >>> result = exp(a)
        >>> print(result)
        Tensor([2.71828183, 7.3890561, 20.08553692])
    """
    return Exp()(a)
```
ReLU

The derivative of the ReLU (Rectified Linear Unit) operator:

Let’s denote a as the tensor on which the ReLU function is applied. The ReLU function is defined as follows:

\[f(a) = \max(0, a)\]

Differentiating piecewise, the derivative is \(1\) for \(a > 0\) and \(0\) for \(a < 0\); the implementation below adopts the convention that the gradient at \(a = 0\) is \(1\).

Therefore, the gradient of \(f(a)\) with respect to \(a\) is \(1\) if \(a \geq 0\), and \(0\) if \(a < 0\).
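In other words, the ReLU backward pass is just a mask: out_grad passes through where the input was non-negative and is zeroed elsewhere. A minimal NumPy sketch (independent of the Tensor class):

```python
import numpy as np

a = np.array([1.0, -2.0, 3.0])
out_grad = np.array([0.1, 0.2, 0.3])

# Build the 0/1 mask from the sign of the input, then gate out_grad with it.
mask = (a >= 0).astype(a.dtype)   # [1., 0., 1.]
grad = out_grad * mask
print(grad)  # [0.1, 0., 0.3]
```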
```python
class ReLU(TensorOp):
    """
    Applies the ReLU (Rectified Linear Unit) activation function to the given tensor.

    Example:
        >>> a = Tensor([1, -2, 3])
        >>> op = ReLU()
        >>> result = op.compute(a)
        >>> print(result)
        Tensor([1, 0, 3])
    """

    def compute(self, a: NDArray) -> NDArray:
        """
        Computes the ReLU activation function on a tensor.

        Args:
        - a: The tensor.

        Returns:
        The result of applying ReLU to a.
        """
        # Clip from below at 0 (no upper bound).
        self.out = array_api.clip(a, a_min=0, a_max=None)
        return self.out

    def gradient(self, out_grad: Tensor, node: Tensor) -> Tuple[Tensor,]:
        """
        Computes the gradient of the ReLU operation.

        Args:
        - out_grad: The gradient of the output of the operation.
        - node: The node in the computational graph where the operation was performed.

        Returns:
        The gradients with respect to the inputs.
        """
        return (out_grad * Tensor(node.children[0] >= 0), )


def relu(a: Tensor) -> Tensor:
    """
    Applies the ReLU (Rectified Linear Unit) activation function to the given tensor.

    Args:
    - a: The tensor.

    Returns:
    The result of applying ReLU to a.

    Example:
        >>> a = Tensor([1, -2, 3])
        >>> result = relu(a)
        >>> print(result)
        Tensor([1, 0, 3])
    """
    return ReLU()(a)
```
Power Scalar

The derivative of the PowerScalar operator:

Let’s denote the scalar as n and a as the tensor being raised to the power of the scalar. The operation can be described as f(a) = a^n.

The function for the backward pass (i.e., the gradient) is df/da = n * a^(n-1).

We are given a function \(f(a) = a^n\), where \(a\) is a tensor and \(n\) is a scalar. Our task is to find the derivative of this function with respect to \(a\).

By differentiating the function \(f(a)\) with respect to \(a\) using the power rule, we find:

\[\begin{align*}
\frac{df}{da} &= \frac{d}{da} a^n \\
&= n \cdot a^{n-1}
\end{align*}\]

Therefore, the gradient of \(f(a)\) with respect to \(a\) is \(n \cdot a^{n-1}\).
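The power rule can be verified numerically with NumPy (a standalone sketch, here with n = 3):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
n = 3
eps = 1e-6

# Central difference of f(a) = a**n versus the analytic n * a**(n-1).
grad_fd = ((a + eps) ** n - (a - eps) ** n) / (2 * eps)
grad_analytic = n * a ** (n - 1)
print(grad_fd)        # approximately [3., 12., 27.]
print(grad_analytic)  # [3., 12., 27.]
```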
```python
class PowerScalar(TensorOp):
    """
    The PowerScalar operation raises a tensor to an (integer) power.

    Attributes:
        scalar (int): The power to raise the tensor to.

    Example:
        >>> import numpy as np
        >>> tensor = Tensor(np.array([1, 2, 3]))
        >>> pow_scalar = PowerScalar(2)
        >>> result = pow_scalar.compute(tensor.data)
        >>> print(result)
        array([1, 4, 9])
    """

    def __init__(self, scalar: int):
        """
        Constructs the PowerScalar operation.

        Args:
            scalar (int): The power to raise the tensor to.
        """
        self.scalar = scalar

    def compute(self, a: NDArray) -> NDArray:
        """
        Computes the power operation on the input tensor.

        Args:
            a (NDArray): The input tensor.

        Returns:
            NDArray: The resulting tensor after the power operation.
        """
        return array_api.power(a, self.scalar)

    def gradient(self, out_grad: Tensor, node: Tensor) -> Tuple[Tensor, ]:
        """
        Computes the gradient of the power operation.

        Args:
            out_grad (Tensor): The gradient of the output tensor.
            node (Tensor): The node in the computational graph where the operation was performed.

        Returns:
            Tuple[Tensor, ]: The gradient with respect to the input tensor.
        """
        a = node.children[0]
        return (self.scalar * power_scalar(a, self.scalar - 1) * out_grad, )


def power_scalar(a: Tensor, scalar: int) -> Tensor:
    """
    Raises a tensor to a power.

    Args:
        a (Tensor): The input tensor.
        scalar (int): The power to raise the tensor to.

    Returns:
        Tensor: The resulting tensor after the power operation.

    Example:
        >>> import numpy as np
        >>> tensor = Tensor(np.array([1, 2, 3]))
        >>> result = power_scalar(tensor, 2)
        >>> print(result)
        Tensor([1, 4, 9])
    """
    return PowerScalar(scalar)(a)
```
Element Wise Divide

The operation described here is an element-wise division of two tensors, a and b, where the operation can be described as f(a, b) = a / b.

We’ll compute the partial derivatives with respect to a and b:

The partial derivative of f(a, b) with respect to a (df/da) is 1/b.

The partial derivative of f(a, b) with respect to b (df/db) is -a / b^2.

We are given a function \(f(a, b) = \frac{a}{b}\), where \(a\) and \(b\) are tensors. Our task is to find the partial derivatives of this function with respect to \(a\) and \(b\).

Let’s start with \(\frac{\partial f}{\partial a}\). Since \(b\) is treated as a constant when differentiating with respect to \(a\), we have:

\[\frac{\partial f}{\partial a} = \frac{\partial}{\partial a} \left( \frac{a}{b} \right) = \frac{1}{b}\]

For \(\frac{\partial f}{\partial b}\), we use the quotient rule. Given a function of the form \(y = \frac{u}{v}\), where both \(u\) and \(v\) are functions of \(x\), the quotient rule of differentiation states:

\[\frac{dy}{dx} = \frac{v \cdot \frac{du}{dx} - u \cdot \frac{dv}{dx}}{v^2}\]

In our case, we’re looking at the function \(y = \frac{a}{b}\), where \(a\) and \(b\) are tensors, and we want the derivative with respect to \(b\) (instead of \(x\) in the general formula). So we have:

\[\frac{dy}{db} = \frac{b \cdot \frac{da}{db} - a \cdot \frac{db}{db}}{b^2}\]

Since \(a\) does not depend on \(b\), \(\frac{da}{db} = 0\), and since any variable is equal to itself, \(\frac{db}{db} = 1\). So the derivative simplifies to:

\[\frac{dy}{db} = \frac{b \cdot 0 - a \cdot 1}{b^2} = -\frac{a}{b^2}\]

Therefore, the gradient of \(f(a, b)\) with respect to \(a\) is \(\frac{1}{b}\), and the gradient of \(f(a, b)\) with respect to \(b\) is \(-\frac{a}{b^{2}}\).
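Both partials can be checked with central finite differences in plain NumPy (a sketch independent of the library code below):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])
eps = 1e-6

# Numerical partials of f(a, b) = a / b.
grad_a_fd = ((a + eps) / b - (a - eps) / b) / (2 * eps)  # expect 1 / b
grad_b_fd = (a / (b + eps) - a / (b - eps)) / (2 * eps)  # expect -a / b**2
print(grad_a_fd)  # approximately [0.25, 0.2, 0.1667]
print(grad_b_fd)  # approximately [-0.0625, -0.08, -0.0833]
```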
```python
class EWiseDiv(TensorOp):
    """
    The EWiseDiv operation divides two tensors element-wise.

    Example:
        >>> import numpy as np
        >>> a = Tensor(np.array([1, 2, 3]))
        >>> b = Tensor(np.array([4, 5, 6]))
        >>> div = EWiseDiv()
        >>> result = div.compute(a.data, b.data)
        >>> print(result)
        array([0.25, 0.4, 0.5])
    """

    def compute(self, a: NDArray, b: NDArray) -> NDArray:
        """
        Computes the element-wise division of two tensors.

        Args:
            a (NDArray): The dividend tensor.
            b (NDArray): The divisor tensor.

        Returns:
            NDArray: The resulting tensor after element-wise division.
        """
        return a / b

    def gradient(self, out_grad: Tensor, node: Tensor) -> Tuple[Tensor, Tensor]:
        """
        Computes the gradient of the element-wise division operation.

        Args:
            out_grad (Tensor): The gradient of the output tensor.
            node (Tensor): The node in the computational graph where the operation was performed.

        Returns:
            Tuple[Tensor, Tensor]: The gradients with respect to the dividend and divisor tensors.
        """
        a, b = node.inputs
        return divide(out_grad, b), out_grad * negate(divide(a, power_scalar(b, 2)))


def divide(a: Tensor, b: Tensor) -> Tensor:
    """
    Divides two tensors element-wise.

    Args:
        a (Tensor): The dividend tensor.
        b (Tensor): The divisor tensor.

    Returns:
        Tensor: The resulting tensor after element-wise division.

    Example:
        >>> import numpy as np
        >>> a = Tensor(np.array([1, 2, 3]))
        >>> b = Tensor(np.array([4, 5, 6]))
        >>> result = divide(a, b)
        >>> print(result)
        Tensor([0.25, 0.4, 0.5])
    """
    return EWiseDiv()(a, b)
```
Divide Scalar

Let’s denote the scalar as c and a as the tensor being divided by the scalar. The operation can be described as f(a) = a / c.

The function for the backward pass (i.e., the gradient) is df/da = 1/c. This is the derivative of f(a) with respect to a.

We are given a function \(f(a) = \frac{a}{c}\), where \(a\) is a tensor and \(c\) is a scalar. Our task is to find the derivative of this function with respect to \(a\).

Since \(c\) is a constant, we can rewrite \(f(a)\) as \(f(a) = c^{-1} a\), a constant multiple of \(a\).

Now, we can differentiate this with respect to \(a\):