Problem: Write a Custom Activation Function
Problem Statement
You are tasked with implementing a custom activation function in PyTorch that computes the following operation:

\[ f(x) = \tanh(x) + x \]

Once implemented, this custom activation function will be used in a simple linear regression model.

Requirements
Custom Activation Function:

Implement a class CustomActivationModel inheriting from torch.nn.Module.
Define the forward method to compute the activation function \( \tanh(x) + x \).
Integration with Linear Regression:

Use the custom activation function in a simple linear regression model.
The model should include:
A single linear layer.
The custom activation function applied to the output of the linear layer.
Constraints
The custom activation function should not have any learnable parameters.
Ensure compatibility with PyTorch tensors for forward pass computations.
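
One way to check both constraints is to package the activation as its own parameter-free module. The sketch below is illustrative only: the class name `TanhPlusIdentity` is a hypothetical choice and is not part of the problem statement. It confirms that the module registers no parameters and that the output keeps the input's shape.

```python
import torch
import torch.nn as nn


class TanhPlusIdentity(nn.Module):
    """Hypothetical stand-alone activation: f(x) = tanh(x) + x."""

    def forward(self, x):
        # Element-wise; the output has the same shape as the input.
        return torch.tanh(x) + x


activation = TanhPlusIdentity()
x = torch.linspace(-3.0, 3.0, steps=7).unsqueeze(1)  # shape [7, 1]
out = activation(x)

# No nn.Parameter is registered, so an optimizer has nothing to update here.
num_params = sum(p.numel() for p in activation.parameters())
print(num_params)  # 0
print(out.shape)   # torch.Size([7, 1])
```

A module like this could be placed after `nn.Linear(1, 1)` in place of a helper method; either layout meets the requirements listed above.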

import torch
# Purpose: Import the PyTorch library for tensor operations and neural network functionality.
# Theory: PyTorch provides tensor computations (like NumPy but with GPU support) and autograd for automatic differentiation, essential for building and training neural networks.

import torch.nn as nn
# Purpose: Import neural network modules from PyTorch, including base class nn.Module and layers like nn.Linear.
# Theory: nn.Module is the foundation for defining custom neural network models, managing parameters and forward passes.

import torch.optim as optim
# Purpose: Import optimization algorithms like SGD for updating model parameters during training.
# Theory: Optimizers use gradients computed by autograd to minimize the loss function via methods like gradient descent.

# Set random seed for reproducibility
torch.manual_seed(42)
# Purpose: Fix the random seed to ensure consistent random number generation across runs.
# Theory: Setting a seed ensures reproducibility of random operations (e.g., data generation, weight initialization), critical for debugging and comparing results.

# Generate synthetic data
X = torch.rand(100, 1) * 10
# Purpose: Create a tensor of 100 random input points between 0 and 10, with shape [100, 1].
# Theory: torch.rand generates values from a uniform distribution [0, 1). Scaling by 10 maps to [0, 10). The shape [100, 1] represents 100 samples with 1 feature each.

y = 2 * X + 3 + torch.randn(100, 1)
# Purpose: Generate target values using a linear relationship y = 2x + 3 with added Gaussian noise.
# Theory: The linear component (2 * X + 3) simulates a true linear relationship. torch.randn adds noise from a standard normal distribution (mean 0, std 1), mimicking real-world data imperfections.

# Define the Custom Activation Model
class CustomActivationModel(nn.Module):
    # Purpose: Define a custom neural network model by subclassing nn.Module.
    # Theory: nn.Module provides infrastructure for parameter management, forward pass definition, and autograd integration.
    
    def __init__(self):
        super().__init__()
        # Purpose: Initialize the parent nn.Module class to set up the model.
        # Theory: super() ensures proper initialization of nn.Module, registering parameters and enabling methods like parameters().
        
        self.linear = nn.Linear(1, 1)
        # Purpose: Create a linear layer that maps a single input feature to a single output (z = wx + b).
        # Theory: nn.Linear applies a linear transformation z = wx + b, where w and b are learnable parameters. By default PyTorch draws both from a uniform distribution on [-1/sqrt(in_features), 1/sqrt(in_features)] (a Kaiming-uniform scheme, not Xavier).
    
    def custom_activation(self, x):
        # Purpose: Implement the custom activation function f(x) = tanh(x) + x.
        # Theory: The activation combines the non-linear tanh(x) (bounded in the open interval (-1, 1)) with the linear x, creating a function that grows linearly but is modulated by tanh’s non-linearity near zero.
        
        return torch.tanh(x) + x
        # Purpose: Compute tanh(x) + x element-wise on the input tensor.
        # Theory: torch.tanh applies the hyperbolic tangent function element-wise. tanh(x) and x have the same shape, so the addition is element-wise and needs no broadcasting. The operation is differentiable, enabling autograd to compute gradients.
    
    def forward(self, x):
        # Purpose: Define the forward pass of the model, specifying how input x is transformed to output.
        # Theory: The forward method is called when the model is invoked (e.g., model(X)). It defines the computational graph for autograd.
        
        z = self.linear(x)
        # Purpose: Apply the linear transformation z = wx + b to input x.
        # Theory: For input x of shape [batch_size, 1], nn.Linear computes z = x @ w^T + b, where w is [1, 1] and b is [1], producing z of shape [batch_size, 1].
        
        return self.custom_activation(z)
        # Purpose: Apply the custom activation function to the linear output.
        # Theory: Computes f(z) = tanh(z) + z, introducing non-linearity. The output shape remains [batch_size, 1]. Autograd tracks this operation for backpropagation.

# Initialize model, loss function, and optimizer
model = CustomActivationModel()
# Purpose: Create an instance of the custom model.
# Theory: Instantiates the model, initializing the linear layer’s parameters (w, b) with random values. Both are nn.Parameter tensors with requires_grad=True, so autograd tracks gradients for them.

criterion = nn.MSELoss()
# Purpose: Define the Mean Squared Error (MSE) loss function.
# Theory: MSE loss computes L = (1/n) * sum((y_pred - y)^2), measuring the squared difference between predictions and targets. It’s suitable for regression tasks.

optimizer = optim.SGD(model.parameters(), lr=0.01)
# Purpose: Initialize Stochastic Gradient Descent (SGD) optimizer with a learning rate of 0.01.
# Theory: SGD updates parameters using the rule θ = θ - η * ∇L, where η is the learning rate and ∇L is the gradient. model.parameters() provides w and b from the linear layer. Because every step below uses all 100 samples, this is effectively full-batch gradient descent.

# Training loop
epochs = 1000
# Purpose: Set the number of training iterations (epochs) to 1000.
# Theory: Each epoch processes the entire dataset once, updating parameters to minimize loss. 1000 epochs is typically enough for the loss to level off on this simple dataset.

for epoch in range(epochs):
    # Purpose: Iterate over the dataset for the specified number of epochs.
    # Theory: Training involves repeated forward and backward passes to optimize parameters.
    
    # Forward pass
    predictions = model(X)
    # Purpose: Compute model predictions by passing input X through the model.
    # Theory: Calls the forward method, computing z = Xw + b, then y_pred = tanh(z) + z. X [100, 1] produces predictions [100, 1].
    
    loss = criterion(predictions, y)
    # Purpose: Calculate the MSE loss between predictions and true targets.
    # Theory: Computes L = (1/100) * sum((predictions - y)^2). Both tensors have shape [100, 1], ensuring compatibility.
    
    # Backward pass and optimization
    optimizer.zero_grad()
    # Purpose: Reset gradients of all model parameters to zero.
    # Theory: Gradients accumulate by default in PyTorch. Zeroing prevents mixing gradients from previous iterations, ensuring correct updates.
    
    loss.backward()
    # Purpose: Compute gradients of the loss with respect to model parameters (w, b).
    # Theory: Autograd backpropagates through the computational graph: loss → activation → linear layer. Computes ∂L/∂w and ∂L/∂b using the chain rule.
    
    optimizer.step()
    # Purpose: Update model parameters using the computed gradients.
    # Theory: SGD applies θ = θ - η * ∇L for each parameter (w, b), where η = 0.01. This minimizes the loss.
    
    # Log progress every 100 epochs
    if (epoch + 1) % 100 == 0:
        # Purpose: Print training progress every 100 epochs.
        # Theory: Monitoring the training loss helps assess convergence and detect issues such as divergence or a stalled loss. Detecting overfitting would additionally require a held-out validation set.
        
        print(f"Epoch [{epoch + 1}/{epochs}], Loss: {loss.item():.4f}")
        # Purpose: Display the current epoch and loss value.
        # Theory: loss.item() extracts the scalar loss value from the tensor. Formatting to 4 decimal places improves readability.

# Display learned parameters
[w, b] = model.linear.parameters()
# Purpose: Retrieve the learned weight and bias from the linear layer.
# Theory: model.linear.parameters() yields the layer's tensors in order: weight (w, shape [1, 1]) then bias (b, shape [1]). Unpacking assigns them to w and b.

print(f"Learned weight: {w.item():.4f}, Learned bias: {b.item():.4f}")
# Purpose: Print the learned weight and bias values.
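# Theory: w.item() and b.item() convert single-element tensors to scalars. Due to the non-linear activation, these may not match the true values (2, 3): for large z, tanh(z) ≈ 1, so the model behaves roughly like wx + b + 1 and the bias is likely to settle nearer 2 than 3.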

# Testing on new data
X_test = torch.tensor([[4.0], [7.0]])
# Purpose: Create a test input tensor with two values (4.0, 7.0).
# Theory: Test inputs have shape [2, 1], matching the model’s input requirement. Used to evaluate model generalization.

with torch.no_grad():
    # Purpose: Disable gradient tracking for inference to save memory and computation.
    # Theory: Gradient tracking is unnecessary during inference, as no parameters are updated. torch.no_grad() ensures this.
    
    predictions = model(X_test)
    # Purpose: Compute predictions for test inputs.
    # Theory: Passes X_test through the model (linear → activation), producing predictions of shape [2, 1].
    
    print(f"Predictions for {X_test.tolist()}: {predictions.tolist()}")
    # Purpose: Print test inputs and their predictions.
    # Theory: .tolist() converts tensors to Python lists for readable output. Predictions reflect the non-linear transformation f(z) = tanh(z) + z.
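
The activation f(z) = tanh(z) + z has derivative f'(z) = 2 - tanh^2(z), which lies in (1, 2], so the gradient flowing back through it does not vanish the way it can with plain tanh. The sketch below is separate from the training script above and uses illustrative variable names; it compares the derivative computed by autograd with that closed form.

```python
import torch

# Leaf tensor so that .grad is populated after backward().
z = torch.linspace(-4.0, 4.0, steps=9, requires_grad=True)
f = torch.tanh(z) + z

# Summing gives a scalar whose gradient w.r.t. z is f'(z) element-wise.
f.sum().backward()

# Closed-form derivative: d/dz [tanh(z) + z] = (1 - tanh^2(z)) + 1.
analytic = 2.0 - torch.tanh(z.detach()) ** 2

print(torch.allclose(z.grad, analytic, atol=1e-6))  # True
print(z.grad.min().item(), z.grad.max().item())
```

The printed minimum and maximum should fall inside (1, 2], with the maximum reached at z = 0.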