        # PyTorch Autograd: Hands-on Guide
        Short, runnable tour of how gradients flow in PyTorch. Run each cell and watch grads populate.
        

        ## Core ideas to repeat aloud
        - Set `requires_grad=True` on tensors you want gradients for; PyTorch records ops into a graph.
        - Call `.backward()` on a scalar loss to traverse the graph and populate `.grad` on leaves.
        - Gradients accumulate across `.backward()` calls; clear them with `optimizer.zero_grad()` (or `.grad.zero_()` on a bare tensor) before the next backward pass.
        - Skip tracking during evaluation or logging with `torch.no_grad()` or by `.detach()`-ing a tensor.
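The "leaves" in the second bullet are tensors you create directly, as opposed to results of tracked ops. A minimal sketch of the difference, using illustrative names `x`, `h`, and `y` that do not appear elsewhere in this notebook:

```python
import torch

x = torch.tensor(2.0, requires_grad=True)  # leaf: created directly by the user
h = x * 3                                   # non-leaf: produced by a tracked op
h.retain_grad()                             # opt in to keeping h.grad after backward
y = h ** 2                                  # scalar output, y = (3x)^2

y.backward()

print('x.is_leaf:', x.is_leaf, '| x.grad:', x.grad)  # dy/dx = 18x = 36
print('h.is_leaf:', h.is_leaf, '| h.grad:', h.grad)  # dy/dh = 2h = 12
```

Without `retain_grad()`, `h.grad` would stay `None` because autograd only keeps gradients on leaves by default.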
        

In [1]:
import torch
torch.manual_seed(0)
print('torch version:', torch.__version__)
print('device:', 'cuda' if torch.cuda.is_available() else 'cpu')


torch version: 2.9.1
device: cpu


        ## Scalar example: verify the derivative by hand
        Forward: `y = x**2 + 2x + 1`. At `x = 3`, dy/dx = `2x + 2 = 8`.
        

In [None]:
x = torch.tensor(3.0, requires_grad=True)
y = x**2 + 2 * x + 1
print('y value:', y.item())

y.backward()  # populates x.grad
print('dy/dx at x=3:', x.grad.item())

# Next, show students how the gradients of a Linear model's parameters change after one step; this makes it very intuitive.


y value: 16.0
dy/dx at x=3: 8.0
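Besides the hand derivation, the same derivative can be cross-checked numerically. The sketch below compares autograd against a central finite difference; the helper `f`, the step size `eps`, and the use of `float64` are choices made for this illustration only.

```python
import torch

def f(x):
    # Same function as the scalar example above
    return x**2 + 2 * x + 1

# float64 keeps the finite-difference rounding error small
x = torch.tensor(3.0, dtype=torch.float64, requires_grad=True)
f(x).backward()
autograd_grad = x.grad.item()

eps = 1e-6  # illustrative step size
with torch.no_grad():
    # Central difference: (f(x + eps) - f(x - eps)) / (2 * eps)
    numeric_grad = ((f(x + eps) - f(x - eps)) / (2 * eps)).item()

print('autograd:         ', autograd_grad)
print('finite difference:', numeric_grad)
```

Both values should agree with the hand-computed `2x + 2 = 8` up to floating-point error.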


        ## Vector example: linear layer + MSE loss
        Build a tiny linear model `y = Xw + b`, compute mean squared error, and inspect gradient shapes.
        

In [None]:
        w = torch.randn(2, 1, requires_grad=True)
        b = torch.zeros(1, requires_grad=True)

        X = torch.tensor([[1.0, 2.0],
                          [3.0, 4.0]])
        y_true = torch.tensor([[1.0],
                               [2.0]])

        y_pred = X @ w + b  # matrix multiply + bias broadcast
        loss = ((y_pred - y_true) ** 2).mean()
        print('loss:', loss.item())

        loss.backward()
        print('w.grad shape:', w.grad.shape)
        print('b.grad shape:', b.grad.shape)
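The shapes can also be checked against the closed-form MSE gradient. For `loss = mean((Xw + b - y)^2)` over `N` rows, the gradients are `(2/N) * X^T r` for `w` and `(2/N) * sum(r)` for `b`, where `r = Xw + b - y`. The sketch below rebuilds the same tensors so it runs on its own; `residual`, `w_grad_manual`, and `b_grad_manual` are names introduced here for illustration.

```python
import torch

torch.manual_seed(0)
w = torch.randn(2, 1, requires_grad=True)
b = torch.zeros(1, requires_grad=True)
X = torch.tensor([[1.0, 2.0],
                  [3.0, 4.0]])
y_true = torch.tensor([[1.0],
                       [2.0]])

residual = X @ w + b - y_true
loss = (residual ** 2).mean()
loss.backward()

n = X.shape[0]
with torch.no_grad():
    # Closed-form gradients of the mean squared error
    w_grad_manual = (2.0 / n) * X.T @ residual
    b_grad_manual = (2.0 / n) * residual.sum(dim=0)

print('w.grad matches closed form:', torch.allclose(w.grad, w_grad_manual))
print('b.grad matches closed form:', torch.allclose(b.grad, b_grad_manual))
```

Note that each gradient has the same shape as the parameter it belongs to, which is the property to look for when debugging.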
        

        ## Gradient accumulation vs. zeroing
        Calling `.backward()` again **adds** to existing grads. Clear them before each new backward pass: `tensor.grad.zero_()` for bare tensors as in the cell below, or `optimizer.zero_grad()` in a training loop.
        

In [None]:
        loss2 = ((X @ w + b - y_true) ** 2).mean()
        loss2.backward()
        print('grad after second backward (accumulated):', w.grad.flatten())

        # Reset
        w.grad.zero_()
        b.grad.zero_()
        print('grads after zeroing:', w.grad.flatten(), b.grad)
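To see the accumulation in numbers, here is a small self-contained sketch. The tensors and the helper `toy_loss` are made up for this illustration; the point is only that a second backward pass doubles the stored gradient, and zeroing restores the single-pass value.

```python
import torch

w = torch.tensor([1.0, -2.0], requires_grad=True)
x = torch.tensor([0.5, 3.0])

def toy_loss():
    # Rebuild the graph on every call; each backward() consumes one graph
    return (w * x).sum() ** 2

toy_loss().backward()
first = w.grad.clone()

toy_loss().backward()          # no zeroing: the new gradient is added to the old one
accumulated = w.grad.clone()

w.grad.zero_()                 # reset in place
toy_loss().backward()
after_reset = w.grad.clone()

print('first pass:  ', first)
print('accumulated: ', accumulated)
print('after reset: ', after_reset)
```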
        

        ## Turning off tracking: `torch.no_grad()` and `detach()`
        Useful for evaluation, logging, or breaking the graph when you do custom manipulations.
        

In [None]:
        with torch.no_grad():
            preds = X @ w + b
        print('preds without tracking requires_grad?', preds.requires_grad)

        # Detach produces a tensor that shares storage but has no grad history
        detached = (X @ w + b).detach()
        print('detached requires_grad?', detached.requires_grad)
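Two consequences of `detach()` are worth seeing directly: the detached tensor points at the same memory, and autograd treats it as a constant. The sketch below uses illustrative tensors `a`, `b`, and `target` that are not part of the cells above.

```python
import torch

a = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
b = a * 2
d = b.detach()

# Same underlying memory: an in-place edit of d would also change b
shares_memory = d.data_ptr() == b.data_ptr()
print('shares memory:', shares_memory)

# Use .clone() when an independent copy is needed
d_copy = b.detach().clone()
print('copy shares memory:', d_copy.data_ptr() == b.data_ptr())

# detach() as a stop-gradient: target is treated as a constant
target = (a * 2).detach()
loss = ((a - target) ** 2).sum()
loss.backward()
print('a.grad:', a.grad)  # 2 * (a - target) = -2a
```

Had `target` not been detached, the loss would simplify to `sum(a**2)` and the gradient would be `2a` instead.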
        

        ## Mini training loop demo
        Fit a 1D linear regression `y = 3x + 2` with noise to show loss decreasing and gradients flowing.
        

In [None]:
        torch.manual_seed(0)
        x_train = torch.linspace(-2, 2, steps=50).unsqueeze(1)
        y_train = 3 * x_train + 2 + 0.3 * torch.randn_like(x_train)

        model = torch.nn.Linear(1, 1)
        optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
        loss_fn = torch.nn.MSELoss()

        for step in range(1, 41):
            optimizer.zero_grad()
            preds = model(x_train)
            loss = loss_fn(preds, y_train)
            loss.backward()
            optimizer.step()

            if step % 10 == 0:
                print(f'step {step:02d} loss {loss.item():.4f} | weight {model.weight.item():.3f} bias {model.bias.item():.3f}')
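One way to make a single update visible is to inspect a parameter's `.grad` before and after one optimizer step. The sketch below assumes plain SGD without momentum, so the update should be `new = old - lr * grad`; the data is noise-free and the variable names are illustrative.

```python
import torch

torch.manual_seed(0)
x_train = torch.linspace(-2, 2, steps=50).unsqueeze(1)
y_train = 3 * x_train + 2  # noise-free targets for this illustration

model = torch.nn.Linear(1, 1)
lr = 0.1
optimizer = torch.optim.SGD(model.parameters(), lr=lr)
loss_fn = torch.nn.MSELoss()

print('grad before any backward:', model.weight.grad)  # no backward has run yet

loss = loss_fn(model(x_train), y_train)
loss.backward()
weight_before = model.weight.detach().clone()
grad_at_step = model.weight.grad.clone()
print('grad after backward:', grad_at_step)

optimizer.step()
expected = weight_before - lr * grad_at_step  # plain SGD update rule
print('update matches old - lr * grad:', torch.allclose(model.weight.detach(), expected))

optimizer.zero_grad()
# Depending on the PyTorch version, grads are reset to None or to zeros
print('grad after zero_grad:', model.weight.grad)
```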
        

        ## Custom autograd Function (lightweight intro)
        Most ops ship with built-in derivatives, but you can subclass `torch.autograd.Function` and define both `forward` and `backward` for a custom op.
        

In [None]:
        class SquarePlusOne(torch.autograd.Function):
            @staticmethod
            def forward(ctx, input):
                ctx.save_for_backward(input)
                return input * input + 1

            @staticmethod
            def backward(ctx, grad_output):
                (input,) = ctx.saved_tensors
                grad_input = grad_output * 2 * input
                return grad_input

        x_demo = torch.tensor([2.0, -3.0], requires_grad=True)
        y_demo = SquarePlusOne.apply(x_demo).sum()
        y_demo.backward()
        print('x_demo grad (expected 2*x):', x_demo.grad)
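A hand-written `backward` is easy to get wrong, so it can help to compare it against numerical gradients with `torch.autograd.gradcheck`. The sketch below repeats the class so it runs on its own; the input values and tolerances are illustrative, and `float64` inputs are used because the numerical comparison is unreliable in single precision.

```python
import torch

class SquarePlusOne(torch.autograd.Function):
    @staticmethod
    def forward(ctx, input):
        ctx.save_for_backward(input)
        return input * input + 1

    @staticmethod
    def backward(ctx, grad_output):
        (input,) = ctx.saved_tensors
        return grad_output * 2 * input  # d/dx (x^2 + 1) = 2x

# gradcheck compares the analytic backward against finite differences
x_check = torch.tensor([2.0, -3.0, 0.5], dtype=torch.float64, requires_grad=True)
ok = torch.autograd.gradcheck(SquarePlusOne.apply, (x_check,), eps=1e-6, atol=1e-4)
print('gradcheck passed:', ok)
```

By default `gradcheck` raises an error describing the mismatch if the analytic and numerical gradients disagree.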
        

        ## Takeaways to emphasize
        - Call `.backward()` on a scalar loss; a non-scalar tensor needs an explicit `gradient` argument.
        - Clear grads each step unless you are deliberately accumulating over several mini-batches.
        - Use `torch.no_grad()` for eval/logging to save memory and avoid accidental graph building.
        - Inspect `.grad` shapes to debug mismatch issues early.
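The scalar-loss rule in the first takeaway can be demonstrated in a few lines. This sketch uses an illustrative vector `x`: calling `.backward()` on a non-scalar result raises a `RuntimeError`, while passing a `gradient` tensor of the same shape works.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

try:
    (x ** 2).backward()          # non-scalar output, no gradient argument
    raised = False
except RuntimeError as err:
    raised = True
    print('RuntimeError:', err)

# Supplying ones as the upstream gradient is equivalent to calling (x ** 2).sum().backward()
(x ** 2).backward(gradient=torch.ones_like(x))
print('x.grad:', x.grad)  # 2x = [2., 4., 6.]
```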
        