# NumPy and PyTorch for Deep Learning
**Goal:** Cover the *minimum* NumPy required to read and run deep learning code implemented purely with NumPy (e.g., linear layers, softmax, loss computation, batch reductions, broadcasting), along with the corresponding PyTorch basics.

> This notebook intentionally avoids advanced/rare NumPy features.  
> Focus: **shape**, **indexing**, **broadcasting**, **matmul**, **axis/keepdims**, **numerical stability**, **random init**, **tiny forward+loss demo**. 

## 0) Setup

In [None]:
import numpy as np

# Recommended: deterministic RNG for reproducibility
rng = np.random.default_rng(seed=42)

np.set_printoptions(precision=4, suppress=True)
print("NumPy version:", np.__version__)


## 1) Arrays, dtype, shape (the #1 survival skill)

Deep learning NumPy code is essentially **array programs**.  
Tracking `shape` consistently makes it possible to read such repos.

Common conventions:
- `X.shape == (B, D)` : batch size `B`, feature dim `D`
- `logits.shape == (B, C)` : `C` classes
- sequences often: `(B, S, D)` : sequence length `S`


In [None]:
x = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]], dtype=np.float32)

print("x:\n", x)
print("dtype:", x.dtype)
print("shape:", x.shape)   # (rows, cols)
print("ndim:", x.ndim)


### Mini rule
> **The gradient of a parameter has the same shape as the parameter.**  
Even when backprop is not implemented in this notebook, this rule is useful for debugging later.
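
To make the rule concrete, here is a minimal sketch for a linear layer `Y = X @ W + b`. It assumes an upstream gradient `dY` with the same shape as `Y` (filled with random values purely for illustration) and uses the standard linear-layer gradient formulas. The names `dY`, `dW`, and `db` are illustrative and do not appear elsewhere in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
B, Din, Dout = 4, 6, 3
X = rng.standard_normal((B, Din))
W = rng.standard_normal((Din, Dout))
b = np.zeros(Dout)

Y = X @ W + b                          # forward: (B, Dout)
dY = rng.standard_normal(Y.shape)      # assumed upstream gradient dL/dY

dW = X.T @ dY                          # (Din, B) @ (B, Dout) -> (Din, Dout)
db = dY.sum(axis=0)                    # sum over the batch -> (Dout,)

print(dW.shape == W.shape, db.shape == b.shape)  # True True
```

If a hand-written gradient ever has a different shape from its parameter, a transpose or a batch reduction is probably missing.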


## 2) Indexing & slicing: batch selection, feature selection, and the `np.arange` trick

Most code needs:
- pick a subset of samples (mini-batch)
- pick some feature columns
- pick per-sample correct-class log-probability with `np.arange(B)`


In [None]:
X = np.arange(24).reshape(4, 6)  # 4 samples, 6 features
print("X:\n", X)
print("X shape:", X.shape)

# (1) Select one sample (note: dimension drops!)
i = 2
xi = X[i]              # shape (6,)
print("\nX[i] shape:", xi.shape, "->", xi)

# (2) Keep 2D shape by slicing
xi2d = X[i:i+1]         # shape (1, 6)
print("X[i:i+1] shape:", xi2d.shape)

# (3) Select first n samples
X_small = X[:2]         # shape (2, 6)
print("\nX[:2] shape:", X_small.shape)

# (4) Select specific feature columns
X_feat = X[:, [0, 2, 4]]   # shape (4, 3)
print("X[:, [0,2,4]] shape:", X_feat.shape)
print(X_feat)

# (5) Last column (often labels are last column in datasets)
last_col = X[:, -1]       # shape (4,)
print("\nX[:, -1] shape:", last_col.shape, "->", last_col)
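
One detail that matters when reading such code: basic slicing returns a view that shares memory with the original array, while indexing with a list of integers returns a copy. A small sketch (the variable names `view` and `copy` are illustrative):

```python
import numpy as np

X = np.arange(24).reshape(4, 6)

view = X[:2]            # basic slice -> view into X
copy = X[:, [0, 2, 4]]  # integer-list (fancy) indexing -> new array

print(np.shares_memory(X, view))  # True
print(np.shares_memory(X, copy))  # False

view[0, 0] = -1         # writing through the view changes X as well
print(X[0, 0])          # -1
```

In-place edits to a sliced mini-batch can therefore modify the full dataset, which is a common source of confusing bugs.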


### `np.arange(B)` indexing (super common in cross-entropy)
When `log_probs` has shape `(B, C)` and the label array `y` has shape `(B,)`, the following selects the correct-class value for each sample:
```python
log_probs[np.arange(B), y]
```


In [None]:
B, C = 4, 3
log_probs = rng.standard_normal((B, C))
y = np.array([0, 2, 1, 2])  # labels

picked = log_probs[np.arange(B), y]   # picked[i] = log_probs[i, y[i]], shape (B,)
print("log_probs:\n", log_probs)
print("y:", y)
print("picked shape:", picked.shape)
print("picked (correct class per sample):", picked)
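
As a sanity check on what the paired-index form does, the sketch below compares it against an explicit Python loop and shows a similar-looking expression that behaves differently. It is self-contained and uses its own random values.

```python
import numpy as np

rng = np.random.default_rng(0)
B, C = 4, 3
log_probs = rng.standard_normal((B, C))
y = np.array([0, 2, 1, 2])

picked = log_probs[np.arange(B), y]                              # vectorized
picked_loop = np.array([log_probs[i, y[i]] for i in range(B)])   # explicit loop

print(np.array_equal(picked, picked_loop))  # True

# Pitfall: log_probs[:, y] is NOT the same; it selects whole columns
print(log_probs[:, y].shape)                # (4, 4)
```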


## 3) reshape & transpose: align shapes for math

Two main tools:
- `reshape(...)` : reinterpret the same elements under a new shape (the total element count must not change)
- `transpose(...)` / `.T` : swap axes (very common in attention/sequence code)


In [None]:
X = rng.standard_normal((4, 6))

# reshape: flatten then restore
flat = X.reshape(-1)
X_back = flat.reshape(4, 6)

print("X shape:", X.shape)
print("flat shape:", flat.shape)
print("X_back shape:", X_back.shape)

# transpose: matrix transpose
Xt = X.T
print("X.T shape:", Xt.shape)

# general transpose: (B, S, D) -> (S, B, D)
A = rng.standard_normal((2, 3, 4))   # B=2, S=3, D=4
A_perm = A.transpose(1, 0, 2)
print("A shape:", A.shape, "-> A_perm shape:", A_perm.shape)
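
Because both tools can produce the same output shape, they are easy to mix up. A minimal sketch of the difference, using a tiny array so the element order is visible:

```python
import numpy as np

X = np.arange(6).reshape(2, 3)
# [[0 1 2]
#  [3 4 5]]

Xt = X.T                 # swaps axes: element (i, j) moves to (j, i)
Xr = X.reshape(3, 2)     # keeps the flat order 0..5, just re-chunks it

print(Xt)   # [[0 3] [1 4] [2 5]]
print(Xr)   # [[0 1] [2 3] [4 5]]
print(np.array_equal(Xt, Xr))  # False
```

Use `transpose` when the meaning of the axes changes (e.g. `(B, S, D) -> (S, B, D)`), and `reshape` only to merge or split axes.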


## 4) Broadcasting: why `X @ W + b` works

Broadcasting lets arrays of different shapes interact without manual tiling.

Most common patterns:
- add bias: `(B, D) + (D,) -> (B, D)`
- per-sample scaling: `(B, D) * (B, 1) -> (B, D)`
- masks: `(B, D) * (B, D) -> (B, D)`


In [None]:
B, D = 4, 6
X = rng.standard_normal((B, D))

# Case A: bias add
b = rng.standard_normal((D,))
Y = X + b
print("Case A: X shape", X.shape, "+ b shape", b.shape, "=>", Y.shape)

# Case B: per-sample scaling (IMPORTANT: use (B,1) not (B,))
scale = rng.random((B, 1))
Y2 = X * scale
print("Case B: X shape", X.shape, "* scale shape", scale.shape, "=>", Y2.shape)

# Case C: mask (e.g., ReLU)
mask = (X > 0).astype(np.float32)
relu_out = X * mask
print("Case C: relu_out shape:", relu_out.shape)


### Common beginner bug: `(B,)` vs `(B,1)`
Broadcasting aligns shapes from the **last** axis, so a `(B,)` array is matched against the feature axis `D` of `(B, D)`, not the batch axis. With `scale = rng.random(B)` this raises a `ValueError` when `B != D`; when `B == D` it runs but silently scales per feature instead of per sample.


In [None]:
B, D = 4, 6
X = rng.standard_normal((B, D))

scale_bad = rng.random(B)        # (B,)
scale_good = rng.random((B, 1))  # (B, 1)

# Semantics differ!
# (B, D) * (B,) aligns (B,) with the last axis D: error here because B != D
try:
    Y_bad = X * scale_bad
except ValueError as e:
    print("X * scale_bad failed:", e)

Y_good = X * scale_good          # (B, D) * (B, 1): one factor per sample

print("scale_bad shape:", scale_bad.shape)
print("scale_good shape:", scale_good.shape)
print("Y_good shape:", Y_good.shape)

# Each row of X is multiplied by its own factor
print("\nPer-sample factors:", scale_good[:, 0])
print("Row 0 check:", np.allclose(Y_good[0], X[0] * scale_good[0, 0]))
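
The more dangerous variant of this bug appears when the batch size happens to equal the feature dimension, because then nothing raises an error. A minimal sketch, assuming a square `(3, 3)` input chosen for illustration:

```python
import numpy as np

# Square case: B == D, so a (B,) array broadcasts without any error
X = np.ones((3, 3))
scale = np.array([1.0, 10.0, 100.0])   # intended: one factor per sample (row)

wrong = X * scale            # (3, 3) * (3,): scales COLUMNS (features)
right = X * scale[:, None]   # (3, 3) * (3, 1): scales ROWS (samples)

print("wrong:\n", wrong)   # every row is [1, 10, 100]
print("right:\n", right)   # rows are all 1s, all 10s, all 100s
```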


## 5) Matrix multiplication: `@` as the linear layer

A fully-connected layer is:
```python
Y = X @ W + b
```
where:
- `X` : `(B, Din)`
- `W` : `(Din, Dout)`
- `b` : `(Dout,)`
- `Y` : `(B, Dout)`


In [None]:
B, Din, Dout = 4, 6, 3
X = rng.standard_normal((B, Din))
W = rng.standard_normal((Din, Dout))
b = rng.standard_normal((Dout,))

Y = X @ W + b
print("X shape:", X.shape)
print("W shape:", W.shape)
print("b shape:", b.shape)
print("Y shape:", Y.shape)


### Debug helper for matmul
For `A @ B`, the shape rule is:
- `A.shape[-1] == B.shape[-2]`


In [None]:
def check_matmul(A, B):
    """
    Check that `A @ B` is shape-compatible and print the output shape.

    Args:
        A: Left operand, shape (..., M, K).
        B: Right operand, shape (..., K, N).
    Returns:
        None.
    """
    print("A shape:", A.shape)
    print("B shape:", B.shape)
    assert A.shape[-1] == B.shape[-2], "matmul shape mismatch!"
    # Compute the product to report the true output shape (also valid for batched inputs)
    print("OK: A @ B is valid. Output shape:", (A @ B).shape)

check_matmul(X, W)


## 6) `axis` and `keepdims`: batch reductions, normalization, attention utilities

This is *critical* for reading loss functions, softmax, layernorm, etc.

Key ideas:
- `axis=0` reduces over batch dimension (across samples)
- `axis=1` reduces over feature dimension (per sample)
- `keepdims=True` keeps dimensions so that broadcasting back is safe


In [None]:
X = rng.standard_normal((4, 6))  # (B, D)

mu_feat = X.mean(axis=0)         # (D,)
mu_sample = X.mean(axis=1)       # (B,)

print("X shape:", X.shape)
print("mu_feat (axis=0) shape:", mu_feat.shape)
print("mu_sample (axis=1) shape:", mu_sample.shape)

# keepdims for safe broadcasting
mu_feat_k = X.mean(axis=0, keepdims=True)  # (1, D)
X_centered = X - mu_feat_k                  # broadcasts (1,D)->(B,D)

print("\nmu_feat_k shape:", mu_feat_k.shape)
print("X_centered shape:", X_centered.shape)
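
The same pattern along `axis=1` gives a per-sample normalization in the style of layernorm, and it shows why `keepdims=True` matters there. This is a sketch only: the epsilon value `1e-5` is an assumption, and no learnable scale or shift is included.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 6))              # (B, D)

# Without keepdims: (4, 6) - (4,) cannot broadcast (6 vs 4 on the last axis)
try:
    X - X.mean(axis=1)
except ValueError as e:
    print("fails:", e)

# With keepdims: (4, 6) - (4, 1) broadcasts per sample
mu = X.mean(axis=1, keepdims=True)           # (B, 1)
sigma = X.std(axis=1, keepdims=True)         # (B, 1)
X_norm = (X - mu) / (sigma + 1e-5)           # small eps (assumed) avoids division by zero

print(X_norm.mean(axis=1))   # close to 0 for every sample
```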


## 7) Numerical stability: stable softmax + logsumexp (must-have)

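Naive `np.exp(logits)` overflows to `inf` for large logits (in float64, for inputs above roughly 709), which turns the softmax output into `nan`.

**Stable trick:** subtract the max before `exp`. Softmax is unchanged when the same constant is added to every logit, so the result is identical, but the largest exponent is now 0: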
```python
x_shift = x - x.max(axis=axis, keepdims=True)
```


In [None]:
def softmax(x, axis=-1):
    """
    Compute numerically stable softmax probabilities.

    Args:
        x: Array of logits, e.g. shape (B, C).
        axis: Axis along which the probabilities sum to 1.
    Returns:
        Array of probabilities with the same shape as `x`.
    """
    x_max = np.max(x, axis=axis, keepdims=True)
    x_shift = x - x_max
    exp_x = np.exp(x_shift)
    return exp_x / np.sum(exp_x, axis=axis, keepdims=True)

def logsumexp(x, axis=-1):
    """
    Compute log(sum(exp(x))) along `axis` without overflow.

    Args:
        x: Input array.
        axis: Axis to reduce over.
    Returns:
        Array with `axis` kept as size 1 (keepdims=True).
    """
    x_max = np.max(x, axis=axis, keepdims=True)
    return np.log(np.sum(np.exp(x - x_max), axis=axis, keepdims=True)) + x_max

# Demo stability
logits = rng.standard_normal((4, 10)) * 5
probs = softmax(logits, axis=1)

print("probs shape:", probs.shape)
print("row sums (should be ~1):", probs.sum(axis=1))

# Extreme values test
x = np.array([[1000.0, 1001.0, 999.0]])
print("\nStable logsumexp:", logsumexp(x, axis=1))
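
For contrast, the following self-contained sketch runs a naive softmax and a max-shifted softmax on the same extreme input. The `np.errstate` context only silences the overflow warnings so the `nan` result is easy to see.

```python
import numpy as np

x = np.array([[1000.0, 1001.0, 999.0]])

# Naive softmax: exp(1000) overflows to inf, and inf / inf gives nan
with np.errstate(over="ignore", invalid="ignore"):
    naive = np.exp(x) / np.exp(x).sum(axis=1, keepdims=True)

# Stable softmax: shift by the row max so the largest exponent is exp(0) = 1
shifted = x - x.max(axis=1, keepdims=True)
stable = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)

print("naive :", naive)    # all nan
print("stable:", stable)   # finite, sums to 1
```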


## 8) Random initialization: reproducible weights

Most pure NumPy DL repos initialize weights directly.
The main building blocks are:
- `rng.standard_normal(shape)` for Gaussian
- `rng.uniform(low, high, size=shape)` for uniform
- zeros for bias


In [None]:
Din, Dout = 128, 64

# Small Gaussian init
W = rng.standard_normal((Din, Dout)) * 0.01
b = np.zeros((Dout,), dtype=np.float32)

# Xavier/Glorot uniform (simple version)
limit = np.sqrt(6 / (Din + Dout))
W_xavier = rng.uniform(-limit, limit, size=(Din, Dout))

print("W std:", W.std())
print("W_xavier range approx:", (W_xavier.min(), W_xavier.max()))
print("b shape:", b.shape, "dtype:", b.dtype)
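
The `limit` in the Xavier/Glorot formula can be checked numerically. A uniform distribution on `(-a, a)` has variance `a**2 / 3`, so with `a = sqrt(6 / (Din + Dout))` the weight variance is `2 / (Din + Dout)`. The sketch below compares this target against the empirical variance; the seed and sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
Din, Dout = 128, 64

limit = np.sqrt(6 / (Din + Dout))
W = rng.uniform(-limit, limit, size=(Din, Dout))

# Var of Uniform(-a, a) is a^2 / 3, so here it equals 2 / (Din + Dout)
target_var = 2 / (Din + Dout)
print("empirical var:", W.var())
print("target var   :", target_var)
```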


## 9) Mini forward + cross-entropy loss (no backprop needed)

Even when a repo provides backprop, the following should be readable:
- logits computation: `X @ W + b`
- stable softmax
- cross-entropy indexing trick

**Shapes:**
- `X`: `(B, Din)`
- `W`: `(Din, C)`
- `logits`: `(B, C)`
- `y`: `(B,)`


In [None]:
B, Din, C = 4, 6, 3
X = rng.standard_normal((B, Din))
y = rng.integers(low=0, high=C, size=(B,))

W = rng.standard_normal((Din, C)) * 0.1
b = np.zeros((C,))

logits = X @ W + b
probs = softmax(logits, axis=1)

# cross-entropy loss (stable-ish): -mean(log p(correct))
eps = 1e-12
log_probs = np.log(probs + eps)

loss = -np.mean(log_probs[np.arange(B), y])

print("X shape:", X.shape)
print("logits shape:", logits.shape)
print("probs shape:", probs.shape)
print("y:", y)
print("loss:", loss)
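
An alternative that avoids the `eps` is to compute log-probabilities directly as `logits - logsumexp(logits)`, reusing the max-shift trick from section 7. The sketch below is self-contained and inlines the logsumexp computation; the large scale factor on the logits is only there to stress the numerics.

```python
import numpy as np

rng = np.random.default_rng(0)
B, C = 4, 3
logits = rng.standard_normal((B, C)) * 50.0   # deliberately large logits
y = rng.integers(0, C, size=B)

# log_softmax = logits - logsumexp(logits), computed with the max-shift trick
m = logits.max(axis=1, keepdims=True)
lse = m + np.log(np.exp(logits - m).sum(axis=1, keepdims=True))   # (B, 1)
log_probs = logits - lse                                           # (B, C)

loss = -log_probs[np.arange(B), y].mean()
print("loss:", loss)
```

This form never takes the log of a probability that has underflowed to zero, so no epsilon is needed.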


## 10) Debug checklist

When encountering errors or wrong outputs:
1. Print **all shapes**: `print(X.shape, W.shape, b.shape)`
2. For `A @ B`: check `A.shape[-1] == B.shape[-2]`
3. Broadcasting confusion:
   - per-sample scaling should use `(B,1)` not `(B,)`
   - use `keepdims=True` when intending to broadcast back
4. Numerical issues:
   - softmax: subtract max
   - log: add eps


## 11) High-dimensional arrays: 3D and batch operations

Beyond 2D arrays of shape `(B, D)`, deep learning implementations frequently use **3D tensors**, e.g. sequences of vectors with shape `(B, S, D)` (batch size, sequence length, feature dimension) or batches of matrices. This section introduces the corresponding shape conventions, batch matrix multiplication, multi-axis reduction, and broadcasting in 3D. The aim is to keep new syntax minimal so that code using such tensors can be read and written with confidence.

### 11.1) 3D arrays: shape and indexing

A 3D array has shape `(N0, N1, N2)`. In sequence-based code the convention is often `(B, S, D)`: batch size, sequence length, feature dimension. The same indexing and slicing rules as in 2D apply: a single integer index removes that dimension; a slice preserves it. The examples below illustrate selecting one sample, the last time step for all samples, and a single (sample, time) vector.

In [None]:
B, S, D = 2, 3, 4
seq = rng.standard_normal((B, S, D))
print("seq shape:", seq.shape, "  (B, S, D)")
print("ndim:", seq.ndim)

# Select first sample (all time steps) -> shape (S, D)
first_sample = seq[0]
print("\nseq[0] shape:", first_sample.shape)

# Select last time step for all samples -> shape (B, D)
last_step = seq[:, -1, :]
print("seq[:, -1, :] shape:", last_step.shape)

# Select one (sample, time) vector -> shape (D,)
single = seq[0, 1, :]
print("seq[0, 1, :] shape:", single.shape)

### 11.2) Batch matrix multiplication

In NumPy, the `@` operator treats the **last two dimensions** as the matrix dimensions; any leading dimensions are interpreted as a batch. Concretely:
- `(B, M, K) @ (K, N)` → `(B, M, N)`: each of the `B` matrices of shape `(M, K)` is multiplied by the same matrix of shape `(K, N)`.
- `(B, M, K) @ (B, K, N)` → `(B, M, N)`: for each batch index `b`, the matrix `A[b]` of shape `(M, K)` is multiplied by the matrix `B[b]` of shape `(K, N)`.

**Shape rule:** For `A @ B`, the last two dimensions must satisfy `A.shape[-1] == B.shape[-2]`. Any leading (batch) dimensions follow the usual broadcasting rules: on each side they must match, be 1, or be absent (a single matrix).

In [None]:
B, M, K, N = 2, 3, 4, 5
A_batch = rng.standard_normal((B, M, K))
W = rng.standard_normal((K, N))

# (B, M, K) @ (K, N) -> (B, M, N): shared matrix W applied to each batch element
Y1 = A_batch @ W
print("A_batch shape:", A_batch.shape)
print("W shape:", W.shape)
print("A_batch @ W shape:", Y1.shape)

# (B, M, K) @ (B, K, N) -> (B, M, N): different right-hand matrix per batch element
B_batch = rng.standard_normal((B, K, N))
Y2 = A_batch @ B_batch
print("\nB_batch shape:", B_batch.shape)
print("A_batch @ B_batch shape:", Y2.shape)
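
To confirm what the batched form computes, the sketch below compares it with an explicit loop of ordinary 2D products, one per batch index. The name `Bm` is used for the right-hand batch only to avoid clashing with the batch size `B`.

```python
import numpy as np

rng = np.random.default_rng(0)
B, M, K, N = 2, 3, 4, 5
A = rng.standard_normal((B, M, K))
Bm = rng.standard_normal((B, K, N))

Y = A @ Bm                                           # batched: (B, M, N)
Y_loop = np.stack([A[i] @ Bm[i] for i in range(B)])  # one 2D matmul per batch index

print(Y.shape, np.allclose(Y, Y_loop))  # (2, 3, 5) True
```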

### 11.3) Reduction over multiple axes

Reduction functions such as `sum`, `mean`, and `max` accept a **tuple of axes** `axis=(ax0, ax1, ...)`, so that reduction is performed over several dimensions at once. Using `keepdims=True` yields a result whose shape is suitable for broadcasting when subtracting means or normalizing over multiple dimensions (e.g. over both the sequence and feature axes).

In [None]:
X = rng.standard_normal((2, 3, 4))  # (B, S, D)

# Reduce over axes 1 and 2 -> shape (2,) (one scalar per batch element)
global_per_batch = X.sum(axis=(1, 2))
print("X shape:", X.shape)
print("X.sum(axis=(1,2)) shape:", global_per_batch.shape)

# Mean over the same axes with keepdims -> shape (2, 1, 1), suitable for broadcasting
mean_k = X.mean(axis=(1, 2), keepdims=True)
X_centered = X - mean_k
print("\nmean_k shape:", mean_k.shape)
print("X_centered mean over (1,2):", X_centered.mean(axis=(1, 2)))  # ~0 per batch element

### 11.4) Broadcasting in 3D

The same broadcasting rules as in 2D apply: dimensions are aligned from the **right**, and a dimension of size 1 is broadcast to match the other array. Thus an array of shape `(B, S, D)` is compatible with `(D,)`, `(1, D)`, `(1, 1, D)`, `(B, 1, D)`, and similar shapes where dimensions either match or are 1. Common use cases: adding a shared bias of shape `(D,)` to every position, or applying a per-step or per-feature scale using shapes such as `(1, S, 1)` or `(1, 1, D)`.

In [None]:
B, S, D = 2, 3, 4
seq = rng.standard_normal((B, S, D))

# Bias of shape (D,) is broadcast to every (batch, time, feature) position
bias = rng.standard_normal((D,))
out1 = seq + bias
print("seq shape:", seq.shape, "+ bias shape:", bias.shape, "->", out1.shape)

# Scale of shape (1, S, 1): one scale per time step, broadcast over batch and feature
scale = rng.standard_normal((1, S, 1))
out2 = seq * scale
print("seq shape:", seq.shape, "* scale shape:", scale.shape, "->", out2.shape)
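
The `(B,)` vs `(B,1)` pitfall from section 4 has a 3D counterpart: a per-step vector of shape `(S,)` lines up with the feature axis, not the sequence axis. A minimal sketch with illustrative values:

```python
import numpy as np

B, S, D = 2, 3, 4
seq = np.ones((B, S, D))
per_step = np.array([1.0, 2.0, 3.0])        # shape (S,)

# (B, S, D) * (S,) aligns S with the last axis D -> error because S != D
try:
    seq * per_step
except ValueError as e:
    print("fails:", e)

# Add a trailing axis: (S, 1) lines up with the (S, D) part of (B, S, D)
out = seq * per_step[:, None]
print(out.shape)          # (2, 3, 4)
print(out[0, :, 0])       # [1. 2. 3.]
```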

### 11.5) Example: shared linear layer over a sequence

A standard pattern in sequence models is to apply a single linear layer to every time step: input shape `(B, S, Din)`, weight matrix `(Din, Dout)`, output shape `(B, S, Dout)`. Two equivalent approaches: (1) reshape the input to `(B*S, Din)`, compute `X @ W + b`, then reshape the result to `(B, S, Dout)`; (2) use batch matmul directly: `(B, S, Din) @ (Din, Dout)` yields `(B, S, Dout)` without explicit reshaping. The code below demonstrates both and verifies that the outputs coincide.

In [None]:
B, S, Din, Dout = 2, 3, 4, 5
X_seq = rng.standard_normal((B, S, Din))
W = rng.standard_normal((Din, Dout)) * 0.1
b = np.zeros((Dout,))

# Direct batch matmul: (B, S, Din) @ (Din, Dout) -> (B, S, Dout); bias (Dout,) broadcasts
Y_seq = X_seq @ W + b
print("X_seq shape:", X_seq.shape)
print("W shape:", W.shape)
print("Y_seq shape:", Y_seq.shape)

# Equivalent via reshape: flatten batch and sequence, matmul, reshape to (B, S, Dout)
X_flat = X_seq.reshape(-1, Din)
Y_flat = X_flat @ W + b
Y_seq_alt = Y_flat.reshape(B, S, Dout)
print("\nReshape path Y_seq_alt shape:", Y_seq_alt.shape)
print("Results match:", np.allclose(Y_seq, Y_seq_alt))

---
## Part II: PyTorch basics (correspondence with NumPy above)

The following sections introduce the minimal PyTorch needed to read and run typical deep learning code. The same **shape conventions** (e.g. `(B, D)`, `(B, S, D)`) and **concepts** (linear layer = matmul + bias, cross-entropy, batch operations) apply; PyTorch mainly provides a higher-level API (built-in layers, automatic differentiation) and a tensor type that can run on a GPU. Advanced usage (e.g. custom autograd, distributed training) is not covered; the focus is on basics only.

## 12) PyTorch setup and tensors

PyTorch uses **tensors** instead of NumPy arrays. Tensors have the same notions of `shape`, `dtype`, and indexing/slicing; the main extra idea is that tensors can live on **CPU or GPU** (device) and can record operations for **automatic differentiation** (gradients). For reading code, it is enough to know: create tensors with `torch.tensor(...)` or `torch.randn(...)`, and check `x.shape`, `x.dtype`, `x.device`.

In [None]:
import torch

# Optional: set seed for reproducibility
torch.manual_seed(42)
if torch.cuda.is_available():
    torch.cuda.manual_seed(42)

# Create a tensor (default: float32, CPU)
x = torch.tensor([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0]])
print("x:\n", x)
print("shape:", x.shape)
print("dtype:", x.dtype)
print("device:", x.device)

# Random tensor, same shape convention as NumPy: (B, D)
B, D = 4, 6
X = torch.randn(B, D)
print("\nX shape:", X.shape, "  (B, D)")

### 12.1) NumPy ↔ PyTorch (optional)

When data comes from NumPy (e.g. datasets), convert with `torch.from_numpy(arr)`; to get a NumPy array from a tensor, use `.numpy()` (CPU tensors only; a tensor that tracks gradients needs `.detach()` first). Both conversions share memory with the original object rather than copying. Shapes and indexing are the same, so the NumPy intuition from Part I carries over directly.

In [None]:
# NumPy -> PyTorch (shares memory and keeps the NumPy dtype)
a_np = np.arange(6).reshape(2, 3)
a_pt = torch.from_numpy(a_np)
print("NumPy shape:", a_np.shape, "-> PyTorch shape:", a_pt.shape)

# PyTorch -> NumPy (CPU only)
b_pt = torch.randn(2, 3)
b_np = b_pt.numpy()
print("PyTorch shape:", b_pt.shape, "-> NumPy shape:", b_np.shape)
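
Two consequences of the conversion are easy to miss: the tensor and the array share memory, and the NumPy dtype is carried over. NumPy defaults to float64 while PyTorch layers create float32 parameters by default, so a cast is often needed. A small sketch with illustrative variable names:

```python
import numpy as np
import torch

a_np = np.zeros(3)                 # NumPy default dtype: float64
a_pt = torch.from_numpy(a_np)      # shares memory, dtype stays float64

a_np[0] = 5.0                      # modify the NumPy array in place...
print(a_pt[0].item())              # 5.0: the tensor sees the change

print(a_pt.dtype)                  # torch.float64
a_pt32 = a_pt.float()              # cast to float32 (this makes a copy)
print(a_pt32.dtype)                # torch.float32
```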

## 13) Shape, indexing, and matmul in PyTorch

The same rules as in NumPy apply: basic indexing and slicing work the same way (`X[i]`, `X[:, -1]`, `X[i:i+1]`), and matrix multiplication is `@` or `torch.matmul`. So the mental model from sections 2–5 (batch selection, feature selection, `X @ W + b`) is unchanged; only the type is `torch.Tensor` instead of `np.ndarray`.

In [None]:
X = torch.randn(4, 6)
print("X shape:", X.shape)
print("X[0] shape:", X[0].shape)
print("X[:, [0,2,4]] shape:", X[:, [0, 2, 4]].shape)

# Linear layer "by hand": same as NumPy
B, Din, Dout = 4, 6, 3
X = torch.randn(B, Din)
W = torch.randn(Din, Dout)
b = torch.randn(Dout)
Y = X @ W + b
print("\nX @ W + b shape:", Y.shape)

## 14) Linear layer and loss in PyTorch (high-level API)

Instead of manually defining `W` and `b`, PyTorch provides **modules** that encapsulate parameters and forward logic. `nn.Linear(Din, Dout)` stores weight `(Dout, Din)` and bias `(Dout,)` and computes `Y = X @ W.T + b` (note: the stored weight is `(Dout, Din)` so that the matmul is written as `X @ W.T` in raw form; the module hides this). For classification, **cross-entropy loss** is implemented as `F.cross_entropy(logits, labels)`: it expects `logits` of shape `(B, C)` and integer labels of shape `(B,)`, and does not require softmax to be applied first (it fuses log-softmax and negative log-likelihood for numerical stability).

In [None]:
from torch import nn
from torch.nn import functional as F

B, Din, C = 4, 6, 3
X = torch.randn(B, Din)
y = torch.randint(0, C, (B,))

# Built-in linear layer: replaces manual W, b
linear = nn.Linear(Din, C)
logits = linear(X)
print("logits shape:", logits.shape)

# Cross-entropy: input logits (B, C), labels (B,); no need to apply softmax first
loss = F.cross_entropy(logits, y)
print("loss:", loss.item())
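
To connect the high-level API back to Part I, the sketch below checks both claims numerically: that `nn.Linear` matches `X @ W.T + b` with its stored weight, and that `F.cross_entropy` matches the mean negative log-softmax at the label positions. It is self-contained and uses its own seed and tensors.

```python
import torch
from torch import nn
from torch.nn import functional as F

torch.manual_seed(0)
B, Din, C = 4, 6, 3
X = torch.randn(B, Din)
y = torch.randint(0, C, (B,))
linear = nn.Linear(Din, C)

# nn.Linear stores weight as (C, Din), so the raw formula uses weight.T
logits = linear(X)
logits_manual = X @ linear.weight.T + linear.bias
print(torch.allclose(logits, logits_manual, atol=1e-6))

# cross_entropy = mean over the batch of -log_softmax(logits)[i, y[i]]
log_probs = F.log_softmax(logits, dim=1)
loss_manual = -log_probs[torch.arange(B), y].mean()
print(torch.allclose(F.cross_entropy(logits, y), loss_manual, atol=1e-6))
```

Both checks should print `True`; the indexing `log_probs[torch.arange(B), y]` is the same trick as `np.arange(B)` in section 2.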

## 15) Forward, loss, backward, and one optimizer step

Training typically repeats: forward pass → compute loss → call `loss.backward()` to fill gradients → optimizer step to update parameters. The following shows one such step. Parameters usually live in a module (e.g. `nn.Linear`), and the optimizer is given an iterable of the parameters to update, typically `model.parameters()`. `backward()` accumulates gradients into `param.grad`; `optimizer.step()` applies the update (e.g. SGD); `optimizer.zero_grad()` clears old gradients so the next `backward()` does not add to them. This pattern is the basis of most training loops.

In [None]:
model = nn.Linear(Din, C)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

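# One training step
optimizer.zero_grad()              # clear gradients from any previous step
logits = model(X)                  # forward: (B, Din) -> (B, C)
loss = F.cross_entropy(logits, y)
loss.backward()                    # fill param.grad for weight and bias
optimizer.step()                   # SGD update: param -= lr * param.grad

print("Loss before the step:", loss.item())
with torch.no_grad():
    print("Loss after one step:", F.cross_entropy(model(X), y).item())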
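
What `optimizer.step()` does for plain SGD can be verified by hand. The sketch below assumes SGD without momentum or weight decay (the defaults), computes `old - lr * grad` for each parameter, and compares it with the parameters after the step. It also prints gradient shapes, which match the parameter shapes as stated in the mini rule of section 1.

```python
import torch
from torch import nn
from torch.nn import functional as F

torch.manual_seed(0)
X = torch.randn(4, 6)
y = torch.randint(0, 3, (4,))
model = nn.Linear(6, 3)
lr = 0.01
optimizer = torch.optim.SGD(model.parameters(), lr=lr)

optimizer.zero_grad()
F.cross_entropy(model(X), y).backward()

# Each gradient has the same shape as its parameter
for name, p in model.named_parameters():
    print(name, tuple(p.shape), tuple(p.grad.shape))

# Plain SGD (no momentum, no weight decay): new = old - lr * grad
expected = [p.detach() - lr * p.grad for p in model.parameters()]
optimizer.step()
print(all(torch.allclose(p, e) for p, e in zip(model.parameters(), expected)))
```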

## 16) 3D / sequence in PyTorch (correspondence with §11)

The same shape conventions apply: a sequence batch has shape `(B, S, D)`. Applying a shared linear layer to every time step is done by passing a tensor of shape `(B, S, Din)` into `nn.Linear(Din, Dout)`; PyTorch’s linear layer accepts any number of leading dimensions and applies the same `(Din, Dout)` transformation to the last dimension. So the output is `(B, S, Dout)` without any explicit reshape—this corresponds to the batch matmul `(B, S, Din) @ (Din, Dout)` from section 11.5.

In [None]:
B, S, Din, Dout = 2, 3, 4, 5
X_seq = torch.randn(B, S, Din)
linear = nn.Linear(Din, Dout)
Y_seq = linear(X_seq)
print("X_seq shape:", X_seq.shape)
print("Y_seq shape:", Y_seq.shape)
# Same as NumPy: (B, S, Din) -> (B, S, Dout) in one call
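
As in section 11.5, the direct call can be checked against the explicit reshape path. A self-contained sketch with its own seed:

```python
import torch
from torch import nn

torch.manual_seed(0)
B, S, Din, Dout = 2, 3, 4, 5
X_seq = torch.randn(B, S, Din)
linear = nn.Linear(Din, Dout)

Y_direct = linear(X_seq)                        # (B, S, Dout)
Y_flat = linear(X_seq.reshape(B * S, Din))      # (B*S, Dout)
Y_reshaped = Y_flat.reshape(B, S, Dout)         # back to (B, S, Dout)

print(Y_direct.shape, torch.allclose(Y_direct, Y_reshaped, atol=1e-6))
```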

### Quick reference: NumPy ↔ PyTorch

| Concept | NumPy (Part I) | PyTorch (Part II) |
|--------|-----------------|-------------------|
| Array / tensor | `np.ndarray`, `x.shape` | `torch.Tensor`, `x.shape` |
| Linear layer | `Y = X @ W + b` (manual W, b) | `nn.Linear(Din, Dout)`, `Y = linear(X)` |
| Cross-entropy | Manual softmax + `log_probs[np.arange(B), y]` | `F.cross_entropy(logits, y)` |
| 3D / sequence | `(B, S, Din) @ (Din, Dout)` or reshape | `nn.Linear(Din, Dout)(X_seq)` → `(B, S, Dout)` |
| Gradients / training | Hand-written backprop | `loss.backward()`, `optimizer.step()` |

The same **shape conventions** (`(B, D)`, `(B, S, D)`, etc.) apply in both; PyTorch adds layers, automatic differentiation, and (optionally) GPU execution.