# Tensors in PyTorch: What Changes Compared to the From-Scratch Implementation?

In the previous section, we implemented a Multilayer Perceptron (MLP) **from scratch** using basic Python and NumPy arrays. 
All computations were expressed in terms of scalars, vectors, and matrices, and we explicitly managed:

- the forward pass,
- gradient derivations and the backward pass,
- parameter updates.

PyTorch introduces a new core data type: the **tensor**. While tensors may look similar to NumPy arrays, they add capabilities that are central to modern deep learning systems: **automatic differentiation**, **hardware acceleration**, and a library of optimized deep learning operators.


## 1. What Is a Tensor?

In numerical computing, a **tensor** is a multi-dimensional array. The term emphasizes that we may work with data of arbitrary order (number of axes).

- Scalars are **0D tensors**
- Vectors are **1D tensors**
- Matrices are **2D tensors**
- Higher-dimensional arrays are **3D+ tensors**

Mathematically, one can view a tensor of order $k$ as an element of a tensor product space:
$$
\mathbf{T} \in V_1 \otimes V_2 \otimes \cdots \otimes V_k.
$$

In the MLP context, tensors represent the same objects you used earlier—only the container and execution model change.

| Mathematical object | From-scratch code | PyTorch |
|---|---|---|
| Scalar | `float` | 0D tensor |
| Vector | 1D NumPy array | 1D tensor |
| Matrix | 2D NumPy array | 2D tensor |
| Batch of matrices | 3D NumPy array | 3D tensor |
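
As a quick check of the table above, the following sketch builds one tensor of each order and inspects `ndim` and `shape`. The variable names are illustrative only.

```python
import torch

scalar = torch.tensor(3.14)          # 0D tensor: a single value, shape ()
vector = torch.tensor([1.0, 2.0])    # 1D tensor: shape (2,)
matrix = torch.zeros(4, 3)           # 2D tensor: shape (4, 3)
batch = torch.zeros(8, 4, 3)         # 3D tensor: a batch of 8 matrices

for name, t in [("scalar", scalar), ("vector", vector),
                ("matrix", matrix), ("batch", batch)]:
    print(name, t.ndim, tuple(t.shape))
```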


## 2. Why Not Just Use NumPy Arrays?

NumPy arrays are excellent **numerical containers** and are sufficient for forward computation. However, deep learning workloads require additional *system-level guarantees* and *capabilities* that NumPy does not provide out of the box:

1. **Automatic differentiation (autograd)**
   - Deep networks require gradients such as $\nabla_\theta L(\theta)$ for millions of parameters $\theta$.
   - With NumPy, gradients must be derived and coded manually or via external tools.

2. **Hardware acceleration and device abstraction**
   - Training modern models efficiently depends on GPUs (and sometimes other accelerators).
   - NumPy operations run on CPU. GPU support requires switching libraries (e.g., CuPy) and re-auditing the pipeline.

3. **A differentiable operator ecosystem**
   - Deep learning uses specialized ops (convolutions, normalization, embedding lookups, fused kernels).
   - PyTorch provides these operators **together with** correct gradient rules and optimized kernels.

A useful summary is:
- **NumPy**: array computing (values only)
- **PyTorch tensor**: array computing **plus** gradient tracking **plus** device-aware execution **plus** deep-learning primitives
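
To make the third point concrete, here is a minimal sketch of a built-in operator that ships with its own gradient rule. It uses `torch.nn.functional.conv2d` on random data; the shapes and names are assumptions chosen for illustration, and the `requires_grad` flag it relies on is explained in Section 4.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
images = torch.randn(2, 1, 28, 28)                     # (batch, channels, height, width)
kernels = torch.randn(4, 1, 3, 3, requires_grad=True)  # (out_channels, in_channels, kH, kW)

# Forward pass through an optimized convolution kernel
features = F.conv2d(images, kernels, padding=1)        # shape (2, 4, 28, 28)

# The backward rule comes with the operator; no manual derivation is needed
features.sum().backward()

print(features.shape, kernels.grad.shape)
```

Writing the equivalent convolution and its gradient by hand in NumPy is possible, but it would have to be derived, implemented, and tested separately.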


## 3. Side-by-Side: NumPy Arrays vs PyTorch Tensors (Values)

Consider a linear layer (affine map):
$$
\mathbf{Y} = \mathbf{X}\mathbf{W} + \mathbf{b},
$$
with $\mathbf{X} \in \mathbb{R}^{N \times d}$, $\mathbf{W} \in \mathbb{R}^{d \times m}$, and $\mathbf{b} \in \mathbb{R}^{m}$.

Both NumPy and PyTorch can compute $\mathbf{Y}$ as a forward pass.


In [1]:
# NumPy: forward computation (values only)
import numpy as np

np.random.seed(0)
X = np.random.randn(4, 3)      # N=4, d=3
W = np.random.randn(3, 2)      # d=3, m=2
b = np.random.randn(2,)        # m=2

Y_np = X @ W + b
Y_np


array([[ 3.2955051 , -0.70672864],
       [ 1.38728186,  0.24221777],
       [ 0.8147218 , -0.76782153],
       [ 2.86228391, -1.05442875]])

In [2]:
# PyTorch: forward computation (values only)
import torch

torch.manual_seed(0)
X_t = torch.randn(4, 3)
W_t = torch.randn(3, 2)
b_t = torch.randn(2)

Y_t = X_t @ W_t + b_t
Y_t


tensor([[-0.6639, -0.6620],
        [ 0.5748, -1.5384],
        [-1.7279, -1.2307],
        [-0.0104, -1.9583]])

At this point, the two libraries look similar. The crucial differences appear when we need **gradients**, **devices**, and **training loops**.
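
The two cells above draw from different random generators, so their outputs cannot be compared directly. As a sketch of how one might confirm that both libraries compute the same affine map, the example below shares the same NumPy data with PyTorch through `torch.from_numpy`. The use of `np.random.default_rng` here is a choice made for this example, not something taken from the cells above.

```python
import numpy as np
import torch

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 3))   # N=4, d=3
W = rng.standard_normal((3, 2))   # d=3, m=2
b = rng.standard_normal(2)        # m=2

Y_np = X @ W + b

# torch.from_numpy shares memory with the array and keeps its float64 dtype
X_t, W_t, b_t = (torch.from_numpy(a) for a in (X, W, b))
Y_t = X_t @ W_t + b_t

print(Y_t.dtype)
print(np.allclose(Y_np, Y_t.numpy()))
```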


## 4. Tensors and Automatic Differentiation (Computation Graphs)

In gradient-based learning, we minimize a loss $L(\theta)$ over parameters $\theta$ (weights and biases). Training requires:
$$
\theta \leftarrow \theta - \eta \nabla_\theta L(\theta),
$$
where $\eta$ is the learning rate.

In the from-scratch section, you explicitly coded partial derivatives such as:
$$
\frac{\partial L}{\partial \mathbf{W}}, \quad \frac{\partial L}{\partial \mathbf{b}}.
$$

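PyTorch tensors can **track computation graphs**. If a tensor is created with `requires_grad=True`, PyTorch records every differentiable operation that involves it. Calling `backward()` on the resulting scalar loss then applies the chain rule through that record and stores each gradient in the `.grad` attribute of the corresponding tensor.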
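
The recorded graph can be inspected directly. The sketch below is a small illustration on made-up tensors: results of tracked operations carry a `grad_fn` node, while tensors created by the user are graph leaves with no `grad_fn`.

```python
import torch

torch.manual_seed(0)
X = torch.randn(4, 3)                        # plain data: not tracked
W = torch.randn(3, 2, requires_grad=True)    # parameter: tracked leaf

Y = X @ W          # result of a tracked operation
L = Y.sum()        # scalar loss built on top of Y

print(X.requires_grad, W.requires_grad, Y.requires_grad)
print(W.is_leaf, W.grad_fn)           # leaves have no grad_fn
print(type(Y.grad_fn).__name__)       # node that remembers the matmul
print(type(L.grad_fn).__name__)       # node that remembers the sum
```

The exact class names printed for `grad_fn` depend on the PyTorch version, so they are best treated as debugging aids rather than a stable interface.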

The chain rule in backpropagation has the generic form:
$$
\frac{\partial L}{\partial \mathbf{W}} = \frac{\partial L}{\partial \mathbf{Y}}\,\frac{\partial \mathbf{Y}}{\partial \mathbf{W}}.
$$

For $\mathbf{Y}=\mathbf{X}\mathbf{W}+\mathbf{b}$, this becomes:
$$
\frac{\partial L}{\partial \mathbf{W}} = \mathbf{X}^\top\frac{\partial L}{\partial \mathbf{Y}}, 
\qquad
\frac{\partial L}{\partial \mathbf{b}} = \sum_{i=1}^{N} \frac{\partial L}{\partial \mathbf{Y}_{i,:}}.
$$
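
These two formulas hold for any upstream gradient $\frac{\partial L}{\partial \mathbf{Y}}$, not only for a specific loss. One way to check this numerically, sketched below, is to pass an arbitrary matrix `G` as the upstream gradient via `Y.backward(gradient=G)` and compare autograd's result with the closed-form expressions. The names `G`, `N`, `d`, and `m` are placeholders for this example.

```python
import torch

torch.manual_seed(0)
N, d, m = 4, 3, 2
X = torch.randn(N, d)
W = torch.randn(d, m, requires_grad=True)
b = torch.randn(m, requires_grad=True)

Y = X @ W + b

# G stands in for dL/dY coming from whatever loss sits on top of Y
G = torch.randn(N, m)
Y.backward(gradient=G)

# Closed-form gradients from the formulas above
manual_grad_W = X.T @ G           # X^T dL/dY
manual_grad_b = G.sum(dim=0)      # sum over the batch dimension

print(torch.allclose(W.grad, manual_grad_W, atol=1e-6))
print(torch.allclose(b.grad, manual_grad_b, atol=1e-6))
```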


### Side-by-Side: Manual Gradients (NumPy) vs Autograd (PyTorch)

We will use a simple scalar loss:
$$
L = \sum_{i,j} Y_{ij}.
$$

Then $\frac{\partial L}{\partial Y_{ij}} = 1$ for all entries, so $\frac{\partial L}{\partial \mathbf{Y}}$ is a matrix of ones.


In [3]:
# NumPy: manual gradients for L = sum(Y)
grad_Y = np.ones_like(Y_np)          # dL/dY
grad_W = X.T @ grad_Y                # dL/dW = X^T dL/dY
grad_b = grad_Y.sum(axis=0)          # dL/db = sum over batch

grad_W, grad_b


(array([[5.36563246, 5.36563246],
        [2.26040156, 2.26040156],
        [1.35251476, 1.35251476]]),
 array([4., 4.]))

In [4]:
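# PyTorch: autograd for the same computation.
# These tensors are freshly sampled, so W_t.grad differs from the NumPy result above;
# b_t.grad matches because it depends only on the batch size N=4.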
X_t = torch.randn(4, 3, requires_grad=True)
W_t = torch.randn(3, 2, requires_grad=True)
b_t = torch.randn(2, requires_grad=True)

Y = X_t @ W_t + b_t
L = Y.sum()
L.backward()

W_t.grad, b_t.grad


(tensor([[-0.8637, -0.8637],
         [ 1.3759,  1.3759],
         [ 0.8702,  0.8702]]),
 tensor([4., 4.]))

**Key takeaway:** Autograd does not change the mathematics of backpropagation; it changes *who writes* the gradient code. 
You still conceptually start from the loss and propagate backward—PyTorch simply performs the bookkeeping consistently and efficiently.
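
Autograd supplies the gradients, but the update $\theta \leftarrow \theta - \eta \nabla_\theta L(\theta)$ is still a separate step. The sketch below shows one hand-written gradient-descent step on random data. The mean-squared-output loss and the learning rate `lr` are arbitrary choices made for illustration; a real training loop would typically delegate this step to an optimizer from `torch.optim`.

```python
import torch

torch.manual_seed(0)
X = torch.randn(4, 3)
W = torch.randn(3, 2, requires_grad=True)
b = torch.randn(2, requires_grad=True)
lr = 0.05  # learning rate (eta), chosen arbitrarily for this sketch

# Forward pass and backward pass
loss_before = ((X @ W + b) ** 2).mean()
loss_before.backward()

# Parameter update: done under no_grad so the update itself is not recorded
with torch.no_grad():
    W -= lr * W.grad
    b -= lr * b.grad

# Gradients accumulate across backward() calls, so reset them before the next step
W.grad.zero_()
b.grad.zero_()

loss_after = ((X @ W + b) ** 2).mean()
print(loss_before.item(), loss_after.item())
```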


## 5. Tensor Data Types (`dtype`) and Why They Matter

Every tensor has a `dtype` that controls numerical precision and valid operations. Common choices include:

- `torch.float32` (default for neural network weights and activations)
- `torch.float64` (higher precision; typically slower and rarely needed for standard training)
- `torch.int64` (commonly used for class labels)

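This becomes important in classification. For example, `CrossEntropyLoss` is most commonly called with labels given as integer class indices:
$$
y \in \{0, 1, \dots, C-1\},
$$
rather than one-hot vectors. (Recent PyTorch versions also accept floating-point class probabilities as targets, but integer indices remain the usual choice for datasets such as MNIST.)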

In the MNIST workflow:
- Inputs `x` are floating-point tensors (e.g., `float32`)
- Labels `y` are integer tensors (typically `int64`)
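
As a minimal sketch of this convention, the example below feeds random `float32` scores and `int64` class indices to `nn.CrossEntropyLoss` and compares the result with the same quantity written out by hand. The batch size, number of classes, and label values are made up for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
num_classes = 10
logits = torch.randn(4, num_classes)     # float32 scores, shape (B, C)
labels = torch.tensor([3, 0, 9, 1])      # int64 class indices, shape (B,)

loss = nn.CrossEntropyLoss()(logits, labels)   # averaged over the batch by default

# Same quantity by hand: negative log-softmax evaluated at the true class
log_probs = torch.log_softmax(logits, dim=1)
manual = -log_probs[torch.arange(4), labels].mean()

print(logits.dtype, labels.dtype)
print(torch.allclose(loss, manual))
```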

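NumPy silently promotes types in mixed operations (for example, an integer array plus a `float64` array yields `float64`), which can hide bugs. PyTorch also promotes types in elementwise arithmetic, but it raises an error in several training-critical paths, such as a matrix multiplication between `float32` and `float64` tensors.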


In [5]:
# dtype illustration
x = torch.randn(2, 3)          # float32 by default
y = torch.tensor([1, 0])       # int64 by default for integer literals

x.dtype, y.dtype


(torch.float32, torch.int64)
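
The difference in strictness between the two libraries can be seen in a small experiment. The sketch below assumes only standard NumPy and PyTorch behavior: NumPy promotes mixed integer and float operands without complaint, while a PyTorch matrix multiplication with mismatched floating-point dtypes raises a `RuntimeError` until one operand is cast explicitly.

```python
import numpy as np
import torch

# NumPy: integer + float64 is promoted to float64 silently
a = np.array([1, 2, 3]) + np.array([0.5, 0.5, 0.5])
print(a.dtype)

# PyTorch: matmul with mismatched dtypes is rejected
A = torch.randn(2, 3)                        # float32 by default
B = torch.randn(3, 2, dtype=torch.float64)   # float64 on purpose

raised = False
try:
    A @ B
except RuntimeError:
    raised = True
print("mixed-dtype matmul raised RuntimeError:", raised)

# An explicit cast states the intended precision and resolves the mismatch
C = A.to(torch.float64) @ B
print(C.dtype)
```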

## 6. Tensor Shape and Batching

A major practical difference between educational “from-scratch” code and production deep learning code is **batching**.

For MNIST, a batch of images typically has shape:
$$
(\text{batch}, \text{channels}, \text{height}, \text{width}) = (B, 1, 28, 28).
$$

An MLP expects a matrix of shape $(B, 784)$, so we reshape (flatten) each image:
$$
\mathbf{X} \in \mathbb{R}^{B \times 784}.
$$

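In PyTorch, flattening is often written as:
`x = x.view(x.size(0), -1)`.
Note that `view` requires a compatible (typically contiguous) memory layout; `x.reshape(x.size(0), -1)` and `torch.flatten(x, start_dim=1)` also handle non-contiguous tensors.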


In [6]:
# shape and flattening example
B = 128
x_batch = torch.randn(B, 1, 28, 28)
x_flat = x_batch.view(x_batch.size(0), -1)

x_batch.shape, x_flat.shape


(torch.Size([128, 1, 28, 28]), torch.Size([128, 784]))
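
`view` reinterprets existing memory without copying, so it only works when the tensor's layout allows it. The sketch below illustrates this on a made-up batch: on a contiguous tensor, `view` and `torch.flatten` agree, while on a transposed (non-contiguous) tensor `view` raises a `RuntimeError` and `reshape` falls back to a copy.

```python
import torch

x = torch.randn(8, 1, 28, 28)

# Contiguous input: view and flatten give the same result
flat_view = x.view(x.size(0), -1)
flat_fn = torch.flatten(x, start_dim=1)
print(torch.equal(flat_view, flat_fn))

# Swapping height and width produces a non-contiguous tensor
x_t = x.transpose(2, 3)
print(x_t.is_contiguous())

view_failed = False
try:
    x_t.view(x_t.size(0), -1)
except RuntimeError:
    view_failed = True
print("view failed on non-contiguous input:", view_failed)

# reshape copies when necessary, so it works in both cases
flat_t = x_t.reshape(x_t.size(0), -1)
print(flat_t.shape)
```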

## 7. Device Awareness (CPU vs GPU)

PyTorch tensors are **device-aware**: each tensor lives on a specific device (CPU or GPU). 
The same code can run on a GPU by moving tensors and models to that device:

```python
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
x = x.to(device)
model = model.to(device)
```

NumPy arrays do not have this concept. To use a GPU in a NumPy-like workflow, you must typically switch libraries (and sometimes APIs), which increases complexity and maintenance cost.


In [7]:
# device illustration (on a machine without a CUDA GPU, both tensors report 'cpu')
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
x = torch.randn(3, 4)
x_device = x.to(device)

x.device, x_device.device


(device(type='cpu'), device(type='cpu'))
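
A common pattern is to pick the device once and create both the model and the data on it, so the same script runs unchanged with or without a GPU. The sketch below assumes a single `nn.Linear` layer as a stand-in for the MLP; the layer sizes and batch size are illustrative.

```python
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Linear(784, 10).to(device)        # parameters now live on `device`
x = torch.randn(32, 784, device=device)      # create the batch directly on `device`

logits = model(x)                            # inputs and parameters share a device
print(logits.shape, logits.device.type)
```

Creating tensors with the `device=` argument avoids an extra host-to-device copy compared with allocating on the CPU first and calling `.to(device)` afterwards.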

## 8. Summary: Connecting Both Worlds

- The **mathematics** of the MLP is identical in both approaches.
- The **from-scratch** implementation emphasizes understanding:
  - explicit forward/backward derivations,
  - explicit parameter updates.
- PyTorch tensors emphasize scalability and correctness:
  - automatic differentiation,
  - standardized batching,
  - device-aware execution,
  - and a large library of optimized differentiable operators.

Learning to work with tensors does not replace understanding backpropagation; it *operationalizes* that understanding for real training workloads.
