
Neural Network 2D Classifier

A fully manual implementation of a feedforward neural network that learns a 2D binary classification boundary using the Adam optimiser. No deep learning frameworks are used, meaning every forward pass, analytical backpropagation step, and weight update is written from scratch in NumPy.

Adapted from the worked example in Higham & Higham (2019).


Overview

Given a random set of labelled 2D points whose classes are defined by a Voronoi tessellation, the network learns to approximate the class boundary. Training progress is captured as an animation, showing the decision region evolve from noise to a well-fitted boundary.

This is an example output:

[Training animation frames: at step 2 500 (loss ≈ 0.228) the network is still undecided; by step 21 700 (loss ≈ 0.00003) the boundary is tightly fitted.]

Results may vary due to the random generation of Voronoi patches and training samples.


Data Generation

Training points $\mathbf{x}_i \in [0,1]^2$ are drawn uniformly at random. Labels are assigned by a Voronoi rule: three green seeds $G = \{g_1, g_2, g_3\}$ and three blue seeds $B = \{b_1, b_2, b_3\}$ are fixed, and each point is assigned the colour of its nearest seed:

$$y_i = \begin{cases} [0,1]^\top & \text{if } \min_j \|\mathbf{x}_i - g_j\| < \min_j \|\mathbf{x}_i - b_j\| \\ [1,0]^\top & \text{otherwise} \end{cases}$$
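The labelling rule can be sketched in NumPy as follows. The seed positions and the random seed are illustrative; the script draws its own random seeds.

```python
import numpy as np

rng = np.random.default_rng(42)  # seed chosen for illustration

# Hypothetical seed positions; the script draws its own at random.
green_seeds = rng.random((3, 2))  # g_1, g_2, g_3
blue_seeds = rng.random((3, 2))   # b_1, b_2, b_3

def voronoi_labels(points):
    """Label each point [0, 1] if its nearest seed is green, else [1, 0]."""
    # Distance from every point to its nearest seed of each colour
    d_green = np.linalg.norm(points[:, None, :] - green_seeds[None, :, :], axis=2).min(axis=1)
    d_blue = np.linalg.norm(points[:, None, :] - blue_seeds[None, :, :], axis=2).min(axis=1)
    # Broadcast the boolean mask over the two label components
    return np.where((d_green < d_blue)[:, None], [0.0, 1.0], [1.0, 0.0])

X = rng.random((200, 2))   # training points in the unit square
Y = voronoi_labels(X)
```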


Network Architecture

The network $h : \mathbb{R}^2 \to \mathbb{R}^2$ is a three-layer feedforward network with the topology $2 \to 2 \to 8 \to 2$:

$$h(\mathbf{x}) = \sigma\bigl(W_3\,\sigma(W_2\,\sigma(W_1\mathbf{x} + \mathbf{b}_1) + \mathbf{b}_2) + \mathbf{b}_3\bigr)$$

where

| Symbol | Shape | Role |
|---|---|---|
| $W_1$ | $2 \times 2$ | first weight matrix |
| $W_2$ | $8 \times 2$ | second weight matrix |
| $W_3$ | $2 \times 8$ | output weight matrix |
| $\mathbf{b}_1, \mathbf{b}_2, \mathbf{b}_3$ | $2,\ 8,\ 2$ | bias vectors |

The sigmoid activation is applied element-wise at every layer:

$$\sigma(x) = \frac{1}{1 + e^{-x}}, \qquad \sigma'(x) = \sigma(x)\bigl(1 - \sigma(x)\bigr)$$

All weights and biases are concatenated into a single parameter vector $\mathbf{w} \in \mathbb{R}^p$ (where $p = 2{\cdot}2 + 8{\cdot}2 + 2{\cdot}8 + 2 + 8 + 2 = 48$) for uniform treatment during optimisation.
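A minimal sketch of the packing convention and the forward pass. The slicing order (W1, W2, W3, then the biases) is an assumption for illustration, not necessarily the layout used in the repository.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def unpack(w):
    """Split the flat parameter vector (4 + 16 + 16 + 2 + 8 + 2 = 48 entries)
    into the three weight matrices and three bias vectors."""
    W1 = w[0:4].reshape(2, 2)
    W2 = w[4:20].reshape(8, 2)
    W3 = w[20:36].reshape(2, 8)
    b1, b2, b3 = w[36:38], w[38:46], w[46:48]
    return W1, W2, W3, b1, b2, b3

def forward(w, x):
    """Evaluate h(x) = sigma(W3 sigma(W2 sigma(W1 x + b1) + b2) + b3)."""
    W1, W2, W3, b1, b2, b3 = unpack(w)
    a1 = sigmoid(W1 @ x + b1)
    a2 = sigmoid(W2 @ a1 + b2)
    return sigmoid(W3 @ a2 + b3)
```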


Loss Function

The mean squared error over the $n$ training pairs $(\mathbf{x}_i, \mathbf{y}_i)$ is

$$\mathcal{L}(\mathbf{w}) = \frac{1}{2n} \sum_{i=1}^{n} \|h(\mathbf{x}_i;\mathbf{w}) - \mathbf{y}_i\|_2^2$$

The factor $\tfrac{1}{2}$ is a standard convenience that cancels with the exponent when differentiating.
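The loss reduces to a few lines of NumPy; `mse_loss` is a hypothetical helper name, assuming predictions and labels are stacked one sample per row.

```python
import numpy as np

def mse_loss(preds, targets):
    """L = (1/2n) * sum_i ||pred_i - y_i||_2^2, one sample per row."""
    n = preds.shape[0]
    return 0.5 / n * np.sum((preds - targets) ** 2)
```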


Backpropagation

Gradients are computed analytically via backpropagation. Defining the pre-activations $\mathbf{z}^{(k)}$ and post-activations $\mathbf{a}^{(k)}$ for a single sample $\mathbf{x}$:

$$\mathbf{z}^{(1)} = W_1\mathbf{x} + \mathbf{b}_1, \quad \mathbf{a}^{(1)} = \sigma(\mathbf{z}^{(1)})$$

$$\mathbf{z}^{(2)} = W_2\mathbf{a}^{(1)} + \mathbf{b}_2, \quad \mathbf{a}^{(2)} = \sigma(\mathbf{z}^{(2)})$$

$$\mathbf{z}^{(3)} = W_3\mathbf{a}^{(2)} + \mathbf{b}_3, \quad \mathbf{a}^{(3)} = \sigma(\mathbf{z}^{(3)})$$

The error signals (deltas) arise from successive applications of the chain rule through the nested layers. Starting from the output's deviation from the true label $\mathbf{y}$, they are propagated backwards:

$$\boldsymbol{\delta}^{(3)} = \bigl(\mathbf{a}^{(3)} - \mathbf{y}\bigr) \odot \sigma'(\mathbf{z}^{(3)})$$

$$\boldsymbol{\delta}^{(2)} = \bigl(W_3^\top \boldsymbol{\delta}^{(3)}\bigr) \odot \sigma'(\mathbf{z}^{(2)})$$

$$\boldsymbol{\delta}^{(1)} = \bigl(W_2^\top \boldsymbol{\delta}^{(2)}\bigr) \odot \sigma'(\mathbf{z}^{(1)})$$

where $\odot$ is the element-wise (Hadamard) product: given two vectors of the same length, it multiplies corresponding entries rather than computing a dot product or matrix product. It appears here because the chain rule requires multiplying the upstream error signal $\boldsymbol{\delta}$ by the local derivative $\sigma'(\mathbf{z})$ at each neuron independently — a scalar multiplication per neuron, not a linear combination across neurons.
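In NumPy the Hadamard product is just `*` on arrays of the same shape, in contrast to `@`, which contracts across neurons. A tiny illustration with hypothetical values:

```python
import numpy as np

delta = np.array([0.1, -0.2, 0.05])     # upstream error signal (hypothetical values)
sig_prime = np.array([0.25, 0.2, 0.1])  # local derivatives sigma'(z), one per neuron

elementwise = delta * sig_prime  # Hadamard product: one scalar per neuron
dot = delta @ sig_prime          # dot product: a single scalar, NOT what backprop needs here
```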

The gradient contributions for each layer $k \in \{1, 2, 3\}$ (indexing the three weight matrices $W_1, W_2, W_3$) are then:

$$\frac{\partial \mathcal{L}}{\partial W_k} = \frac{1}{n}\sum_{i=1}^n \boldsymbol{\delta}^{(k)}_i \bigl(\mathbf{a}^{(k-1)}_i\bigr)^\top, \qquad \frac{\partial \mathcal{L}}{\partial \mathbf{b}_k} = \frac{1}{n}\sum_{i=1}^n \boldsymbol{\delta}^{(k)}_i$$

with the convention $\mathbf{a}^{(0)} = \mathbf{x}$.
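The per-sample backward pass can be sketched as follows (a sketch under the layer shapes above, not the repository's exact code). It uses the identity $\sigma'(\mathbf{z}) = \mathbf{a}(1 - \mathbf{a})$, so only the cached post-activations are needed:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_single(W1, b1, W2, b2, W3, b3, x, y):
    """Per-sample gradient of (1/2)||h(x) - y||^2.

    Average the returned contributions over all samples to get the batch gradient.
    """
    # Forward pass, caching the post-activations
    a1 = sigmoid(W1 @ x + b1)
    a2 = sigmoid(W2 @ a1 + b2)
    a3 = sigmoid(W3 @ a2 + b3)
    # Deltas: sigma'(z) = a * (1 - a) at each layer
    d3 = (a3 - y) * a3 * (1 - a3)
    d2 = (W3.T @ d3) * a2 * (1 - a2)
    d1 = (W2.T @ d2) * a1 * (1 - a1)
    # Outer products delta^(k) (a^(k-1))^T, with a^(0) = x
    return {"W1": np.outer(d1, x), "b1": d1,
            "W2": np.outer(d2, a1), "b2": d2,
            "W3": np.outer(d3, a2), "b3": d3}
```

A useful sanity check for any hand-written backpropagation is to compare one gradient entry against a central finite difference of the loss.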


Optimiser — Adam (Adaptive Moment Estimation)

The ordinary gradient descent $\mathbf{w} \leftarrow \mathbf{w} - \eta \nabla\mathcal{L}$ uses the same step size for every parameter. Adam maintains per-parameter running estimates of the first and second moments of the gradient, effectively giving each weight its own adaptive learning rate.

Let

$$\mathbf{g}_t = \nabla_\mathbf{w}\mathcal{L}(\mathbf{w}_t)$$

be the gradient at step $t$. Adam maintains:

$$\mathbf{m}_t = \beta_1 \mathbf{m}_{t-1} + (1 - \beta_1)\,\mathbf{g}_t \quad \text{(1st moment / mean)}$$

$$\mathbf{v}_t = \beta_2 \mathbf{v}_{t-1} + (1 - \beta_2)\,\mathbf{g}_t^2 \quad \text{(2nd moment / uncentred variance)}$$

Because $\mathbf{m}_0 = \mathbf{v}_0 = \mathbf{0}$, the estimates are biased towards zero during the early steps. Bias correction removes this effect:

$$\hat{\mathbf{m}}_t = \frac{\mathbf{m}_t}{1 - \beta_1^t}, \qquad \hat{\mathbf{v}}_t = \frac{\mathbf{v}_t}{1 - \beta_2^t}$$

The parameter update is:

$$\mathbf{w}_{t+1} = \mathbf{w}_t - \frac{\eta}{\sqrt{\hat{\mathbf{v}}_t} + \varepsilon}\,\hat{\mathbf{m}}_t$$

Intuition. The denominator $\sqrt{\hat{\mathbf{v}}_t}$ is a running estimate of the gradient's root-mean-square. Dividing by it normalises the effective step: parameters whose gradients are consistently large receive a smaller update (automatic dampening), while parameters with small, uncertain gradients receive a larger relative update. The numerator $\hat{\mathbf{m}}_t$ is a momentum term that smooths the noisy gradient signal, helping the optimiser maintain direction across iterations.
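The three equations above translate directly into a single update function (a sketch; the repository may keep the moment buffers differently):

```python
import numpy as np

def adam_step(w, g, m, v, t, eta=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-indexed iteration count."""
    m = beta1 * m + (1 - beta1) * g          # 1st moment estimate
    v = beta2 * v + (1 - beta2) * g ** 2     # 2nd moment estimate
    m_hat = m / (1 - beta1 ** t)             # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - eta * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```

On a simple quadratic $\|\mathbf{w}\|^2$, repeatedly calling `adam_step` with gradient $2\mathbf{w}$ drives the parameters towards zero, with each coordinate moving at roughly $\eta$ per step while its gradient sign stays consistent.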

Hyperparameters used

| Parameter | Value | Meaning |
|---|---|---|
| $\eta$ | $10^{-3}$ | global learning rate |
| $\beta_1$ | $0.9$ | exponential decay for 1st moment |
| $\beta_2$ | $0.999$ | exponential decay for 2nd moment |
| $\varepsilon$ | $10^{-8}$ | numerical stability constant |
| tol | $10^{-7}$ | convergence threshold on $\|\mathbf{g}_t\|$ |
| iter | $10^5$ | maximum number of iterations |

Usage

The script will:

  1. Display the true Voronoi boundary with the random seeds and training points.
  2. Run Adam training, printing loss and gradient norm every 1 000 steps.
  3. Open an animated plot showing the network's decision region evolving over training.

Reference

Higham, C. F., & Higham, D. J. (2019). Deep learning: An introduction for applied mathematicians. SIAM Review, 61(4), 860–891. https://doi.org/10.1137/18M1165748
