Problem: Implement a Deep Neural Network
Problem Statement
You are tasked with constructing a Deep Neural Network (DNN) model to solve a regression task using PyTorch. The objective is to predict target values from synthetic data; the data generated below follows a linear relationship, y = x1 + 2*x2, with added Gaussian noise.

Requirements
Implement the DNNModel class that satisfies the following criteria:

Model Definition:
The model should have:
An input layer connected to a hidden layer.
A ReLU activation function for non-linearity.
An output layer with a single unit for regression.
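
The three requirements above map directly onto nn.Sequential. The following is a minimal sketch of that reading, not the DNNModel class itself; the input width of 2, the hidden width of 16, and the helper name make_minimal_dnn are illustrative assumptions.

```python
import torch
import torch.nn as nn


def make_minimal_dnn(in_features: int = 2, hidden: int = 16) -> nn.Sequential:
    """Hypothetical helper: input -> hidden -> ReLU -> single regression output."""
    return nn.Sequential(
        nn.Linear(in_features, hidden),  # input layer connected to a hidden layer
        nn.ReLU(),                       # non-linearity
        nn.Linear(hidden, 1),            # single output unit for regression
    )


sketch = make_minimal_dnn()
batch = torch.rand(5, 2)   # 5 samples, 2 features (assumed input width)
out = sketch(batch)        # one prediction per sample
print(out.shape)
```

Subclassing nn.Module, as the solution does, is the more flexible choice when the forward pass needs custom logic; nn.Sequential is sufficient for a plain stack of layers like this one.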

import torch
import torch.nn as nn
import torch.optim as optim


# Generate synthetic data
torch.manual_seed(42)
# Purpose: Set a random seed for reproducibility of synthetic data.
# Theory: Fixing the seed makes torch.rand and torch.randn return the same values on every run, and it also fixes the model's random weight initialization later on. This matches previous problems (e.g., linear regression, Huber Loss).

X = torch.rand(100, 2) * 10
# Purpose: Create 100 data points with 2 features, values between 0 and 10.
# Theory: torch.rand generates uniform random numbers in [0, 1). Scaling by 10 maps to [0, 10). Shape [100, 2] represents 100 samples with 2 features.

y = (X[:, 0] + X[:, 1] * 2).unsqueeze(1) + torch.randn(100, 1)
# Purpose: Generate target values using y = x1 + 2*x2 + noise, with shape [100, 1].
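# Theory: The linear term (x1 + 2*x2) defines the underlying relationship the model must learn, with Gaussian noise (mean 0, std 1) added via torch.randn. The linear term has shape [100]; unsqueeze(1) makes it [100, 1] so that adding the [100, 1] noise does not broadcast to [100, 100].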

# Define the Deep Neural Network Model
class DNNModel(nn.Module):
    # Purpose: Define a deep neural network for regression by subclassing nn.Module.
    # Theory: nn.Module provides infrastructure for layer management, parameter registration, and autograd integration.
    
    def __init__(self):
        # Purpose: Initialize the DNN with multiple fully connected layers.
        # Theory: Defines the architecture: input (2 features) → hidden layers → output (1 value). Layers include weights and biases.
        
        super().__init__()
        # Purpose: Call the parent nn.Module constructor to set up the module.
        # Theory: Ensures proper initialization, enabling parameter tracking and methods like parameters().
        
        self.fc1 = nn.Linear(2, 16)
        # Purpose: Create the first fully connected layer mapping 2 inputs to 16 hidden units.
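        # Theory: Applies z1 = W1*x + b1, where W1 is [16, 2], b1 is [16]. By default, nn.Linear draws weights and biases from U(-1/sqrt(in_features), 1/sqrt(in_features)), a Kaiming-uniform variant, not Xavier initialization.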
        
        self.fc2 = nn.Linear(16, 8)
        # Purpose: Create the second fully connected layer mapping 16 units to 8 units.
        # Theory: Applies z2 = W2*h1 + b2, where W2 is [8, 16], b2 is [8]. Narrows the representation from 16 to 8 units and, together with ReLU, adds a second non-linear stage.
        
        self.fc3 = nn.Linear(8, 1)
        # Purpose: Create the output layer mapping 8 units to 1 output.
        # Theory: Applies y = W3*h2 + b3, where W3 is [1, 8], b3 is [1]. Produces a single regression value.
        
        self.relu = nn.ReLU()
        # Purpose: Define the ReLU activation function for non-linearity.
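        # Theory: ReLU(x) = max(0, x) zeroes negative activations (giving sparse activations) and mitigates, but does not eliminate, vanishing gradients, since its gradient is 1 for positive inputs. It is applied after each hidden layer; nn.ReLU has no parameters, so one instance can be reused.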
    
    def forward(self, x):
        # Purpose: Define the forward pass, specifying how input x is transformed to output.
        # Theory: As the operations in forward execute, autograd records them in a dynamic computational graph that loss.backward() later traverses. Call the model as model(x) rather than model.forward(x) so that hooks run.
        
        x = self.relu(self.fc1(x))
        # Purpose: Apply first linear layer and ReLU activation.
        # Theory: Computes h1 = ReLU(W1*x + b1). Input x [batch_size, 2] produces h1 [batch_size, 16].
        
        x = self.relu(self.fc2(x))
        # Purpose: Apply second linear layer and ReLU activation.
        # Theory: Computes h2 = ReLU(W2*h1 + b2). Input h1 [batch_size, 16] produces h2 [batch_size, 8].
        
        x = self.fc3(x)
        # Purpose: Apply output layer to produce predictions.
        # Theory: Computes y = W3*h2 + b3. Input h2 [batch_size, 8] produces y [batch_size, 1]. No activation for regression.
        
        return x
        # Purpose: Return the final predictions.
        # Theory: Output tensor [batch_size, 1] represents predicted values for regression.

# Initialize the model, loss function, and optimizer
model = DNNModel()
# Purpose: Create an instance of the DNN model.
# Theory: Instantiates the model; each nn.Linear initializes its weights and biases randomly from U(-1/sqrt(in_features), 1/sqrt(in_features)) by default. The parameters have requires_grad=True, so autograd tracks them.

criterion = nn.MSELoss()
# Purpose: Define the Mean Squared Error (MSE) loss function.
# Theory: MSE computes L = (1/n) * sum((y_pred - y_true)^2), suitable for regression. Measures squared differences between predictions and targets.

optimizer = optim.Adam(model.parameters(), lr=0.01)
# Purpose: Initialize the Adam optimizer with a learning rate of 0.01.
# Theory: Adam scales each parameter's step using exponential moving averages of the gradient (β1=0.9) and of the squared gradient (β2=0.999), which are the PyTorch defaults. model.parameters() provides all weights and biases.

# Training loop
epochs = 1000
# Purpose: Set the number of training iterations to 1000 epochs.
# Theory: Each epoch here is a single full-batch update over all 100 samples. 1000 epochs is typically enough for the loss to plateau on this small dataset, though convergence is not guaranteed.

for epoch in range(epochs):
    # Purpose: Iterate over the dataset for the specified number of epochs.
    # Theory: Training involves repeated forward and backward passes to optimize model parameters.
    
    # Forward pass
    predictions = model(X)
    # Purpose: Compute model predictions by passing input X through the model.
    # Theory: Calls the forward method, computing y_pred = DNN(X). X [100, 2] produces predictions [100, 1].
    
    loss = criterion(predictions, y)
    # Purpose: Calculate the MSE loss between predictions and true targets.
    # Theory: Computes L = (1/100) * sum((predictions - y)^2). Both tensors are [100, 1], ensuring compatibility.
    
    # Backward pass and optimization
    optimizer.zero_grad()
    # Purpose: Reset gradients of all model parameters to zero.
    # Theory: Gradients accumulate by default in PyTorch. Zeroing prevents mixing gradients from previous iterations.
    
    loss.backward()
    # Purpose: Compute gradients of the loss with respect to model parameters.
    # Theory: Autograd backpropagates through the network (output → hidden → input), computing ∂L/∂W and ∂L/∂b for all layers.
    
    optimizer.step()
    # Purpose: Update model parameters using the computed gradients.
    # Theory: Adam applies adaptive updates to weights and biases, minimizing the loss.
    
    # Log progress every 100 epochs
    if (epoch + 1) % 100 == 0:
        # Purpose: Print training progress every 100 epochs.
        # Theory: Monitoring the training loss helps assess convergence and spot issues such as a poorly chosen learning rate. Detecting overfitting would additionally require a held-out validation set.
        
        print(f"Epoch [{epoch + 1}/{epochs}], Loss: {loss.item():.4f}")
        # Purpose: Display the current epoch and loss value.
        # Theory: loss.item() extracts the scalar loss value. Formatting to 4 decimal places improves readability.

# Testing on new data
X_test = torch.tensor([[4.0, 3.0], [7.0, 8.0]])
# Purpose: Create a test input tensor with two samples.
# Theory: Test inputs [2, 2] match the model’s input shape, used to evaluate generalization. Expected outputs are ~10 and ~23 based on y = x1 + 2*x2.

with torch.no_grad():
    # Purpose: Disable gradient tracking for inference to save memory and computation.
    # Theory: Gradient tracking is unnecessary during inference, as no parameters are updated.
    
    predictions = model(X_test)
    # Purpose: Compute predictions for test inputs.
    # Theory: Passes X_test through the model, producing predictions [2, 1].
    
    print(f"Predictions for {X_test.tolist()}: {predictions.tolist()}")
    # Purpose: Print test inputs and their predictions.
    # Theory: .tolist() converts tensors to Python lists for readable output. Predictions should approximate the noise-free values x1 + 2*x2 (about 10 and 23), since the noise is zero-mean and cannot be predicted from the inputs.
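
As a sanity check on the expected test outputs (about 10 and 23), one could compare the DNN against an ordinary least-squares baseline fitted to the same data. The sketch below is an addition for comparison, not part of the original solution; it regenerates the data with the same seed and call order, and the names A, coef, and baseline_pred are illustrative.

```python
import torch

# Recreate the synthetic data exactly as in the solution (same seed, same call order).
torch.manual_seed(42)
X = torch.rand(100, 2) * 10
y = (X[:, 0] + X[:, 1] * 2).unsqueeze(1) + torch.randn(100, 1)

# Design matrix with a leading column of ones so the fit includes an intercept.
A = torch.cat([torch.ones(100, 1), X], dim=1)  # shape [100, 3]

# Ordinary least squares: finds coef minimizing ||A @ coef - y||^2.
coef = torch.linalg.lstsq(A, y).solution  # shape [3, 1]: intercept, w1, w2

# Evaluate the baseline on the same two test points used above.
X_test = torch.tensor([[4.0, 3.0], [7.0, 8.0]])
A_test = torch.cat([torch.ones(2, 1), X_test], dim=1)
baseline_pred = A_test @ coef  # shape [2, 1]

print("coefficients (intercept, w1, w2):", coef.squeeze(1).tolist())
print("baseline predictions:", baseline_pred.squeeze(1).tolist())
```

Because the data is linear in x1 and x2, the fitted slopes should land near 1 and 2, and a well-trained DNN should produce test predictions close to this baseline.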