Problem: Implement Custom Loss Function (Huber Loss)
Problem Statement
You are tasked with implementing the Huber Loss as a custom loss function in PyTorch. The Huber loss is a robust loss function used in regression tasks, less sensitive to outliers than Mean Squared Error (MSE). It transitions between L2 loss (squared error) for small errors and L1 loss (absolute error) for large errors, based on a threshold parameter δ.

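The Huber loss is mathematically defined as:

$$
L_\delta(y, \hat{y}) =
\begin{cases}
\frac{1}{2}(y - \hat{y})^2 & \text{if } |y - \hat{y}| \le \delta \\
\delta\,|y - \hat{y}| - \frac{1}{2}\delta^2 & \text{otherwise}
\end{cases}
$$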

where:

 y is the true value,
 ŷ is the predicted value,
 δ (delta in the code) is a threshold parameter that controls the transition between L2 and L1 loss.
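
As a quick check of the piecewise definition, here is a minimal pure-Python sketch for a single pair of values. The helper name huber_scalar is hypothetical and not part of the exercise. It assumes the same form used in the PyTorch code in this notebook: 0.5 * error^2 when the absolute error is at most delta, and delta * |error| - 0.5 * delta^2 otherwise.

```python
def huber_scalar(y_true, y_pred, delta=1.0):
    """Huber loss for one (true, predicted) pair; hypothetical helper for illustration."""
    residual = abs(y_true - y_pred)
    if residual <= delta:
        # Quadratic (L2) regime for small errors
        return 0.5 * residual ** 2
    # Linear (L1) regime for large errors, shifted so both pieces meet at delta
    return delta * residual - 0.5 * delta ** 2


print(huber_scalar(3.0, 2.5))  # small error: 0.5 * 0.5**2 = 0.125
print(huber_scalar(3.0, 6.0))  # large error: 1.0 * 3.0 - 0.5 = 2.5
print(huber_scalar(3.0, 4.0))  # boundary case: both pieces give 0.5
```

With delta = 1.0, an error of 3 costs 2.5 here, whereas half the squared error would cost 4.5. That gap is what makes the loss less sensitive to outliers.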
Requirements
Custom Loss Function:

Implement a class HuberLoss inheriting from torch.nn.Module.
Define the forward method to compute the Huber loss as per the formula.
Usage in a Regression Model:

Integrate the custom loss function into a regression training pipeline.
Use it to compute and optimize the loss during model training.
Constraints
The implementation must handle both scalar and batch inputs for y (true values) and ŷ (predicted values).
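
One way to confirm the scalar-and-batch constraint is to run the same element-wise rule on a 0-dimensional tensor and on a batch, then compare the batch result with PyTorch's built-in version. The sketch below assumes a PyTorch release that provides torch.nn.functional.huber_loss. The helper huber_fn is a hypothetical functional stand-in for the class this exercise asks for.

```python
import torch
import torch.nn.functional as F


def huber_fn(y_pred, y_true, delta=1.0):
    # Hypothetical helper: the same element-wise rule works for any input shape
    residual = torch.abs(y_pred - y_true)
    quadratic = 0.5 * residual ** 2
    linear = delta * residual - 0.5 * delta ** 2
    return torch.where(residual <= delta, quadratic, linear).mean()


# Scalar (0-dimensional) inputs
scalar_loss = huber_fn(torch.tensor(2.5), torch.tensor(3.0))

# Batch inputs of shape [4, 1]
y_true = torch.tensor([[1.0], [2.0], [3.0], [4.0]])
y_pred = torch.tensor([[1.5], [2.0], [6.0], [3.0]])
batch_loss = huber_fn(y_pred, y_true)

# Cross-check against the built-in functional form
reference = F.huber_loss(y_pred, y_true, reduction="mean", delta=1.0)
print(scalar_loss.item(), batch_loss.item(), reference.item())
```

Because torch.abs, torch.where, and mean all operate element-wise or reduce over every element, no shape-specific branching is needed.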

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
torch.manual_seed(42)
X = torch.rand(100, 1) * 10
y = 2 * X + 3 + torch.randn(100, 1)
# Define the Huber Loss
class HuberLoss(nn.Module):
    # Purpose: Define a custom loss function by subclassing nn.Module.
    # Theory: nn.Module allows defining custom operations with forward passes, integrating with PyTorch’s autograd for gradient computation.
    
    def __init__(self, delta=1.0):
        # Purpose: Initialize the Huber Loss with a threshold parameter delta.
        # Theory: delta controls the transition between L2 (squared) and L1 (absolute) loss. A common default is 1.0, balancing robustness and smoothness.
        
        super(HuberLoss, self).__init__()
        # Purpose: Call the parent nn.Module constructor to set up the module.
        # Theory: super() ensures proper initialization, enabling features like parameter registration (though no learnable parameters are used here).
        
        self.delta = delta
        # Purpose: Store the delta parameter as an instance variable.
        # Theory: delta is a hyperparameter, not a learnable parameter, so it’s stored as a Python attribute, not a torch.Parameter.
    
    def forward(self, y_pred, y_true):
        # Purpose: Compute the Huber Loss for predicted and true values.
        # Theory: The forward method defines the loss computation, called when the loss is invoked (e.g., criterion(predictions, y)). It must handle batch inputs.
        
        residual = torch.abs(y_pred - y_true)
        # Purpose: Compute the absolute error |y_pred - y_true| element-wise.
        # Theory: The residual determines whether each error falls in the L2 (<= delta) or L1 (> delta) regime. torch.abs ensures non-negative values.
        
        is_small_error = residual <= self.delta
        # Purpose: Create a boolean tensor identifying errors where |y_pred - y_true| <= delta.
        # Theory: The boolean mask lets torch.where pick the L2 or L1 branch per element. It has the same shape as residual, with True for small errors.
        
        squared_loss = 0.5 * residual ** 2
        # Purpose: Compute the L2 loss term (0.5 * (y_pred - y_true)^2) for all elements.
        # Theory: For small errors, this term is used directly. The factor 0.5 makes the per-element derivative with respect to y_pred equal to (y_pred - y_true), with no extra factor of 2.
        
        linear_loss = self.delta * residual - 0.5 * self.delta ** 2
        # Purpose: Compute the L1 loss term (delta * |y_pred - y_true| - 0.5 * delta^2) for all elements.
        # Theory: For large errors, this term is used. Subtracting 0.5 * delta^2 makes the two pieces meet at |y_pred - y_true| = delta, where their slopes also agree (both equal delta), so the loss is continuously differentiable.
        
        loss = torch.where(is_small_error, squared_loss, linear_loss)
        # Purpose: Select squared_loss for small errors and linear_loss for large errors.
        # Theory: torch.where(condition, x, y) applies x where condition is True, else y. This implements the piecewise Huber Loss definition.
        
        return torch.mean(loss)
        # Purpose: Average the loss over the batch to produce a scalar loss.
        # Theory: Reducing the loss to a scalar (via mean) is standard for optimization, ensuring gradients are computed for the entire batch.

# Define the Linear Regression Model
class LinearRegressionModel(nn.Module):
    # Purpose: Define a simple linear regression model by subclassing nn.Module.
    # Theory: nn.Module provides infrastructure for defining neural network architectures, including parameter management and forward passes.
    
    def __init__(self):
        super(LinearRegressionModel, self).__init__()
        # Purpose: Initialize the parent nn.Module class to set up the model.
        # Theory: super() ensures proper initialization, registering parameters and enabling methods like parameters().
        
        self.linear = nn.Linear(1, 1)
        # Purpose: Create a linear layer mapping 1 input feature to 1 output (y = wx + b).
        # Theory: nn.Linear applies a linear transformation y = wx + b, with learnable parameters w [1, 1] and b [1]. By default both are drawn from U(-1/sqrt(in_features), 1/sqrt(in_features)).
    
    def forward(self, x):
        # Purpose: Define the forward pass, specifying how input x is transformed to output.
        # Theory: The forward method is called when the model is invoked (e.g., model(X)), building the computational graph for autograd.
        
        return self.linear(x)
        # Purpose: Apply the linear transformation to input x.
        # Theory: Computes y = x @ w^T + b, where x is [batch_size, 1], producing output of shape [batch_size, 1].

# Initialize the model, loss function, and optimizer
model = LinearRegressionModel()
# Purpose: Create an instance of the linear regression model.
# Theory: Instantiates the model, initializing w and b randomly. Both are nn.Parameter tensors with requires_grad=True, so autograd tracks them.

criterion = HuberLoss(delta=1.0)
# Purpose: Initialize the Huber Loss with delta = 1.0.
# Theory: delta = 1.0 is a common choice, balancing L2 loss for small errors and L1 loss for outliers. The loss module is used like a function during training.

optimizer = optim.SGD(model.parameters(), lr=0.01)
# Purpose: Initialize Stochastic Gradient Descent (SGD) optimizer with a learning rate of 0.01.
# Theory: SGD updates parameters using θ = θ - η * ∇L, where η is the learning rate. model.parameters() provides w and b from the linear layer. Because the loop below passes all 100 samples at once, each step is full-batch gradient descent.

# Training loop
epochs = 1000
# Purpose: Set the number of training iterations to 1000 epochs.
# Theory: Each epoch processes the entire dataset, updating parameters to minimize loss. With lr=0.01 the logged loss is still decreasing at epoch 1000, so the fit is close to convergence but has not fully reached it.

for epoch in range(epochs):
    # Purpose: Iterate over the dataset for the specified number of epochs.
    # Theory: Training involves repeated forward and backward passes to optimize parameters.
    
    # Forward pass
    predictions = model(X)
    # Purpose: Compute model predictions by passing input X through the model.
    # Theory: Calls the forward method, computing y_pred = Xw + b. X [100, 1] produces predictions [100, 1].
    
    loss = criterion(predictions, y)
    # Purpose: Calculate the Huber Loss between predictions and true targets.
    # Theory: Computes the piecewise loss for each sample, averaging over the batch. Both tensors are [100, 1], ensuring compatibility.
    
    # Backward pass and optimization
    optimizer.zero_grad()
    # Purpose: Reset gradients of all model parameters to zero.
    # Theory: Gradients accumulate by default in PyTorch. Zeroing prevents mixing gradients from previous iterations.
    
    loss.backward()
    # Purpose: Compute gradients of the loss with respect to model parameters (w, b).
    # Theory: Autograd backpropagates through the loss → linear layer, computing ∂L/∂w and ∂L/∂b using the chain rule.
    
    optimizer.step()
    # Purpose: Update model parameters using the computed gradients.
    # Theory: SGD applies θ = θ - η * ∇L for each parameter, minimizing the loss.
    
    # Log progress every 100 epochs
    if (epoch + 1) % 100 == 0:
        # Purpose: Print training progress every 100 epochs.
        # Theory: Monitoring loss helps assess convergence and detect issues like underfitting or learning rate problems.
        
        print(f"Epoch [{epoch + 1}/{epochs}], Loss: {loss.item():.4f}")
        # Purpose: Display the current epoch and loss value.
        # Theory: loss.item() extracts the scalar loss value. Formatting to 4 decimal places improves readability.

# Display learned parameters
[w, b] = model.linear.parameters()
# Purpose: Retrieve the learned weight and bias from the linear layer.
# Theory: model.linear.parameters() yields an iterator of tensors (w [1, 1], b [1]). Unpacking assigns them to w and b.

print(f"Learned weight: {w.item():.4f}, Learned bias: {b.item():.4f}")
# Purpose: Print the learned weight and bias values.
# Theory: w.item() and b.item() convert tensors to scalars. With Huber Loss, parameters should approximate true values (2, 3), as it’s robust to noise.

# Testing on new data
X_test = torch.tensor([[4.0], [7.0]])
# Purpose: Create a test input tensor with values 4.0 and 7.0.
# Theory: Test inputs [2, 1] match the model’s input shape, used to evaluate generalization.

with torch.no_grad():
    # Purpose: Disable gradient tracking for inference to save memory and computation.
    # Theory: Gradient tracking is unnecessary during inference, as no parameters are updated.
    
    predictions = model(X_test)
    # Purpose: Compute predictions for test inputs.
    # Theory: Passes X_test through the model, producing predictions [2, 1].
    
    print(f"Predictions for {X_test.tolist()}: {predictions.tolist()}")
    # Purpose: Print test inputs and their predictions.
    # Theory: .tolist() converts tensors to Python lists for readable output. Predictions should approximate 2 * x + 3 (e.g., 11 for x=4, 17 for x=7).

Epoch [100/1000], Loss: 0.8869
Epoch [200/1000], Loss: 0.7855
Epoch [300/1000], Loss: 0.6945
Epoch [400/1000], Loss: 0.6134
Epoch [500/1000], Loss: 0.5433
Epoch [600/1000], Loss: 0.4861
Epoch [700/1000], Loss: 0.4404
Epoch [800/1000], Loss: 0.4045
Epoch [900/1000], Loss: 0.3767
Epoch [1000/1000], Loss: 0.3551
Learned weight: 2.0713, Learned bias: 2.4650
Predictions for [[4.0], [7.0]]: [[10.750251770019531], [16.964160919189453]]
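
One way to see why the Huber loss is less sensitive to outliers than MSE is to inspect its gradient with respect to the predictions. The sketch below is an illustration under assumed toy values, not part of the training run above. It uses sum rather than mean so the per-element gradients are easy to read, and compares them with the residual clamped to [-delta, delta].

```python
import torch

delta = 1.0
y_true = torch.zeros(4)
# Two large errors (-3, 3) and two small errors (-0.5, 0.5); toy values for illustration
y_pred = torch.tensor([-3.0, -0.5, 0.5, 3.0], requires_grad=True)

residual = torch.abs(y_pred - y_true)
loss = torch.where(
    residual <= delta,
    0.5 * residual ** 2,                    # quadratic branch
    delta * residual - 0.5 * delta ** 2,    # linear branch
).sum()  # sum (not mean) keeps per-element gradients unscaled
loss.backward()

# Signed residual clamped to [-delta, delta]
expected = torch.clamp(y_pred.detach() - y_true, -delta, delta)
print(y_pred.grad)
print(expected)
```

Both tensors come out as [-1.0, -0.5, 0.5, 1.0]: small errors contribute a gradient equal to the error itself, while large errors are capped at ±delta, so a single outlier cannot dominate the parameter update.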
