Single Neuron with Backpropagation
Write a Python function that simulates a single neuron with sigmoid activation, and implements backpropagation to update the neuron's weights and bias. The function should take a list of feature vectors, associated true binary labels, initial weights, initial bias, a learning rate, and the number of epochs. The function should update the weights and bias using gradient descent based on the MSE loss, and return the updated weights, bias, and a list of MSE values for each epoch, each rounded to four decimal places.

Example:
Input:
features = [[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0]], labels = [1, 0, 0], initial_weights = [0.1, -0.2], initial_bias = 0.0, learning_rate = 0.1, epochs = 2
Output:
updated_weights = [0.1036, -0.1425], updated_bias = -0.0167, mse_values = [0.3033, 0.2942]
Reasoning:
The neuron receives feature vectors and computes predictions using the sigmoid activation. Based on the predictions and true labels, the gradients of MSE loss with respect to weights and bias are computed and used to update the model parameters across epochs.



In [2]:
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def train_neuron(features: np.ndarray, labels: np.ndarray, initial_weights: np.ndarray, initial_bias: float, learning_rate: float, epochs: int):
    """
    Train a single neuron using backpropagation with MSE loss.

    :param features: NumPy array of shape (n_samples, n_features)
    :param labels: NumPy array of shape (n_samples,) with binary labels (0 or 1)
    :param initial_weights: NumPy array of initial weights
    :param initial_bias: Initial bias value
    :param learning_rate: Learning rate for gradient descent
    :param epochs: Number of training epochs
    :return: Updated weights, updated bias, list of MSE loss values per epoch (returned unrounded; round to 4 decimals, e.g. with np.round, to match the expected output)
    """

    # Convert inputs to NumPy arrays
    features = np.array(features, dtype=np.float64)
    labels = np.array(labels, dtype=np.float64)
    weights = np.array(initial_weights, dtype=np.float64)
    bias = float(initial_bias)

    mse_values = []

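    for epoch in range(epochs):
        # Forward pass: linear output z = Xw + b, then sigmoid activation
        z = features @ weights + bias
        predicted = sigmoid(z)

        # MSE for this epoch, recorded before the parameters are updated
        error = predicted - labels
        mse = np.mean(error**2)
        mse_values.append(mse)

        # Backward pass: per-sample dL/dz = 2 * (y_hat - y) * y_hat * (1 - y_hat)
        d_grad = 2 * error * predicted * (1 - predicted)
        # Average over the samples to get the gradients for weights and bias
        dw = features.T @ d_grad / len(labels)
        db = np.mean(d_grad)

        # Gradient descent update
        weights -= dw * learning_rate
        bias -= db * learning_rate

    return weights, bias, mse_values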




# Example usage
features = [[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0]]
labels = [1, 0, 0]
initial_weights = [0.1, -0.2]
initial_bias = 0.0
learning_rate = 0.1
epochs = 100

updated_weights, updated_bias, mse_values = train_neuron(features, labels, initial_weights, initial_bias, learning_rate, epochs)
print(updated_weights, updated_bias, mse_values)

[-0.48896563  0.89242284] -0.5058885497167279 [0.3033228034139421, 0.2942232621822798, 0.28558133945119507, 0.2774071764961471, 0.26970056278816107, 0.2624525701137394, 0.2556473456548066, 0.24926389449754993, 0.24327772692027094, 0.2376622928253055, 0.2323901671242655, 0.22743398182494132, 0.22276712214551223, 0.2183642163152234, 0.21420145385885103, 0.21025676732419143, 0.20650990954862233, 0.2029424541748781, 0.19953774225606083, 0.196280793079425, 0.19315819313448826, 0.19015797359753372, 0.18726948381955819, 0.1844832660362877, 0.18179093478295724, 0.17918506320021085, 0.17665907747199558, 0.17420715996351055, 0.17182416116168517, 0.16950552020947304, 0.167247193626997, 0.1650455916953939, 0.16289752191908377, 0.16080013896129058, 0.15875090045296691, 0.1567475280974395, 0.15478797352546103, 0.15287038839328954, 0.1509930982567664, 0.14915457979501365, 0.1473534409969651, 0.14558840396165762, 0.1438582899985562, 0.14216200674695342, 0.14049853706360216, 0.13886692945524248, 0.1372

1. Loss function

You are using Mean Squared Error (MSE) with a sigmoid output:

L = \frac{1}{N} \sum_{i=1}^N ( \hat{y}_i - y_i )^2

where
	•	\hat{y}_i = \sigma(z_i) = prediction,
	•	z_i = w \cdot x_i + b = linear output.
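
As a quick numerical check of the loss definition above, here is a minimal sketch that runs one forward pass on the example data from the problem statement. The helper name `mse_loss` is hypothetical, chosen only for this illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mse_loss(X, y, w, b):
    """Hypothetical helper: MSE of a sigmoid neuron on a batch."""
    z = X @ w + b          # linear output z_i = w . x_i + b
    y_hat = sigmoid(z)     # prediction y_hat_i = sigma(z_i)
    return np.mean((y_hat - y) ** 2)

# Example data from the problem statement
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0]])
y = np.array([1.0, 0.0, 0.0])
w = np.array([0.1, -0.2])
b = 0.0

loss = mse_loss(X, y, w, b)
print(round(float(loss), 4))  # 0.3033, the first MSE value in the example
```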

⸻

2. Gradient chain rule

We want the derivative of the loss w.r.t. the weights.
The chain rule splits it into three factors:

\frac{\partial L}{\partial w} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} \cdot \frac{\partial z}{\partial w}
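
One way to convince yourself that the three-factor product is right is to compare it with a finite-difference estimate. The sketch below does this for a single sample, using the per-sample factors derived in Steps A and B plus the fact that z is linear in w, so its derivative w.r.t. w is x. The names `sample_loss`, `analytic`, and `numeric` are illustrative assumptions, not part of the original solution.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One sample from the example, with the initial parameters
x = np.array([1.0, 2.0])
y = 1.0
w = np.array([0.1, -0.2])
b = 0.0

def sample_loss(w_vec):
    """Hypothetical helper: squared error for this single sample."""
    return (sigmoid(w_vec @ x + b) - y) ** 2

# Chain rule: dL/dw = dL/dy_hat * dy_hat/dz * dz/dw
y_hat = sigmoid(w @ x + b)
dL_dyhat = 2.0 * (y_hat - y)       # Step A
dyhat_dz = y_hat * (1.0 - y_hat)   # Step B
dz_dw = x                          # z = w . x + b is linear in w
analytic = dL_dyhat * dyhat_dz * dz_dw

# Central finite differences, one weight at a time
eps = 1e-6
numeric = np.zeros_like(w)
for j in range(len(w)):
    step = np.zeros_like(w)
    step[j] = eps
    numeric[j] = (sample_loss(w + step) - sample_loss(w - step)) / (2 * eps)

print(analytic)
print(numeric)
```

The two printed vectors should agree to many decimal places; a visible gap would point to a mistake in one of the factors.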

⸻

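Step A. Derivative of loss w.r.t. the prediction

\frac{\partial L}{\partial \hat{y}} = 2(\hat{y} - y)

This is the per-sample derivative; the 1/N from the mean is applied later, when the gradient is averaged over the samples. That’s where 2 * error comes from in the code (error = predicted - labels).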

⸻

Step B. Derivative of sigmoid

\frac{\partial \hat{y}}{\partial z} = \hat{y}(1 - \hat{y})

That’s where predicted * (1 - predicted) comes from in the code.
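
For completeness, the sigmoid derivative used in Step B follows directly from the definition \sigma(z) = 1 / (1 + e^{-z}). A short derivation, written with the same symbols as above:

```latex
\sigma(z) = \left(1 + e^{-z}\right)^{-1}

\frac{d\sigma}{dz}
  = \frac{e^{-z}}{\left(1 + e^{-z}\right)^{2}}
  = \frac{1}{1 + e^{-z}} \cdot \frac{e^{-z}}{1 + e^{-z}}
  = \sigma(z)\,\bigl(1 - \sigma(z)\bigr)

\text{since}\quad 1 - \sigma(z) = \frac{e^{-z}}{1 + e^{-z}}
```

With \hat{y} = \sigma(z), this is exactly \hat{y}(1 - \hat{y}).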

⸻

Step C. Combine them

So the per-sample gradient w.r.t. the linear output z is:

\frac{\partial L}{\partial z} = 2(\hat{y} - y) \cdot \hat{y}(1 - \hat{y})
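
To turn this into parameter gradients, multiply by the last factor of the chain rule and average over the samples. Since z = w \cdot x + b, the derivative of z w.r.t. w is x and w.r.t. b is 1. The sketch below applies that for two epochs on the example data and rounds the results to four decimals. The helper `batch_gradients` and the variable names are assumptions made for this illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def batch_gradients(X, y, w, b):
    """Hypothetical helper: return (dL/dw, dL/db, mse) for a sigmoid neuron."""
    y_hat = sigmoid(X @ w + b)
    dz = 2.0 * (y_hat - y) * y_hat * (1.0 - y_hat)  # per-sample dL/dz (Step C)
    dw = X.T @ dz / len(y)   # dz/dw = x, averaged over the N samples
    db = dz.mean()           # dz/db = 1, averaged over the N samples
    mse = np.mean((y_hat - y) ** 2)
    return dw, db, mse

# Example data and hyperparameters from the problem statement
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0]])
y = np.array([1.0, 0.0, 0.0])
w = np.array([0.1, -0.2])
b = 0.0
lr = 0.1

mse_history = []
for _ in range(2):  # epochs = 2, as in the example
    dw, db, mse = batch_gradients(X, y, w, b)
    mse_history.append(round(float(mse), 4))
    w = w - lr * dw
    b = b - lr * db

rounded_w = np.round(w, 4).tolist()
rounded_b = round(float(b), 4)
print(rounded_w, rounded_b, mse_history)
# [0.1036, -0.1425] -0.0167 [0.3033, 0.2942]
```

With epochs = 2 this matches the expected output given in the problem statement. Rounding is applied only to the reported values, so the updates themselves run at full precision.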