
Neural Network 2D Classifier

A fully manual implementation of a feedforward neural network that learns a 2D binary classification boundary using the Adam optimiser. No deep learning frameworks are used, meaning every forward pass, analytical backpropagation step, and weight update is written from scratch in NumPy.

Adapted from the worked example in Higham & Higham (2019).


Overview

Given a random set of labelled 2D points whose classes are defined by a Voronoi tessellation, the network learns to approximate the class boundary. Training progress is captured as an animation, showing the decision region evolve from noise to a well-fitted boundary.

This is an example output:

[Training animation frames: at step 2 500 (loss ≈ 0.228) the network is still undecided; by step 21 700 (loss ≈ 0.00003) the boundary is tightly fitted.]

Results may vary due to the random generation of Voronoi patches and training samples.


Data Generation

Training points $\mathbf{x}_i \in [0,1]^2$ are drawn uniformly at random. Labels are assigned by a Voronoi rule: three green seeds $G = \{g_1, g_2, g_3\}$ and three blue seeds $B = \{b_1, b_2, b_3\}$ are fixed, and each point is assigned the colour of its nearest seed:

$$y_i = \begin{cases} [0,1]^\top & \text{if } \min_j \|\mathbf{x}_i - g_j\| < \min_j \|\mathbf{x}_i - b_j\| \\ [1,0]^\top & \text{otherwise} \end{cases}$$
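The labelling rule can be sketched in NumPy as follows. The seed positions and the random seed are illustrative; the script draws its own random seeds.

```python
import numpy as np

rng = np.random.default_rng(42)  # seed chosen for illustration

# Hypothetical seed positions; the script draws its own at random.
green_seeds = rng.random((3, 2))  # g_1, g_2, g_3
blue_seeds = rng.random((3, 2))   # b_1, b_2, b_3

def voronoi_labels(points):
    """Label each point [0, 1] if its nearest seed is green, else [1, 0]."""
    # Distance from every point to its nearest seed of each colour
    d_green = np.linalg.norm(points[:, None, :] - green_seeds[None, :, :], axis=2).min(axis=1)
    d_blue = np.linalg.norm(points[:, None, :] - blue_seeds[None, :, :], axis=2).min(axis=1)
    # Broadcast the boolean mask over the two label components
    return np.where((d_green < d_blue)[:, None], [0.0, 1.0], [1.0, 0.0])

X = rng.random((200, 2))   # training points in the unit square
Y = voronoi_labels(X)
```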


Network Architecture

The network $h : \mathbb{R}^2 \to \mathbb{R}^2$ is a three-layer feedforward network with the topology $2 \to 2 \to 8 \to 2$:

$$h(\mathbf{x}) = \sigma\bigl(W_3\,\sigma(W_2\,\sigma(W_1\mathbf{x} + \mathbf{b}_1) + \mathbf{b}_2) + \mathbf{b}_3\bigr)$$

where

| Symbol | Shape | Role |
|---|---|---|
| $W_1$ | $2 \times 2$ | first weight matrix |
| $W_2$ | $8 \times 2$ | second weight matrix |
| $W_3$ | $2 \times 8$ | output weight matrix |
| $\mathbf{b}_1, \mathbf{b}_2, \mathbf{b}_3$ | $2,\ 8,\ 2$ | bias vectors |

The sigmoid activation is applied element-wise at every layer:

$$\sigma(x) = \frac{1}{1 + e^{-x}}, \qquad \sigma'(x) = \sigma(x)\bigl(1 - \sigma(x)\bigr)$$

All weights and biases are concatenated into a single parameter vector $\mathbf{w} \in \mathbb{R}^p$ (where $p = 2{\cdot}2 + 8{\cdot}2 + 2{\cdot}8 + 2 + 8 + 2 = 48$) for uniform treatment during optimisation.
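A minimal sketch of the packing convention and the forward pass. The slicing order (W1, W2, W3, then the biases) is an assumption for illustration, not necessarily the layout used in the repository.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def unpack(w):
    """Split the flat parameter vector (4 + 16 + 16 + 2 + 8 + 2 = 48 entries)
    into the three weight matrices and three bias vectors."""
    W1 = w[0:4].reshape(2, 2)
    W2 = w[4:20].reshape(8, 2)
    W3 = w[20:36].reshape(2, 8)
    b1, b2, b3 = w[36:38], w[38:46], w[46:48]
    return W1, W2, W3, b1, b2, b3

def forward(w, x):
    """Evaluate h(x) = sigma(W3 sigma(W2 sigma(W1 x + b1) + b2) + b3)."""
    W1, W2, W3, b1, b2, b3 = unpack(w)
    a1 = sigmoid(W1 @ x + b1)
    a2 = sigmoid(W2 @ a1 + b2)
    return sigmoid(W3 @ a2 + b3)
```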


Loss Function

The mean squared error over the $n$ training pairs $(\mathbf{x}_i, \mathbf{y}_i)$ is

$$\mathcal{L}(\mathbf{w}) = \frac{1}{2n} \sum_{i=1}^{n} \|h(\mathbf{x}_i;\mathbf{w}) - \mathbf{y}_i\|_2^2$$

The factor $\tfrac{1}{2}$ is a standard convenience that cancels with the exponent when differentiating.
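The loss reduces to a few lines of NumPy; `mse_loss` is a hypothetical helper name, assuming predictions and labels are stacked one sample per row.

```python
import numpy as np

def mse_loss(preds, targets):
    """L = (1/2n) * sum_i ||pred_i - y_i||_2^2, one sample per row."""
    n = preds.shape[0]
    return 0.5 / n * np.sum((preds - targets) ** 2)
```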


Backpropagation

Gradients are computed analytically via backpropagation. Defining the pre-activations $\mathbf{z}^{(k)}$ and post-activations $\mathbf{a}^{(k)}$ for a single sample $\mathbf{x}$:

$$\mathbf{z}^{(1)} = W_1\mathbf{x} + \mathbf{b}_1, \quad \mathbf{a}^{(1)} = \sigma(\mathbf{z}^{(1)})$$

$$\mathbf{z}^{(2)} = W_2\mathbf{a}^{(1)} + \mathbf{b}_2, \quad \mathbf{a}^{(2)} = \sigma(\mathbf{z}^{(2)})$$

$$\mathbf{z}^{(3)} = W_3\mathbf{a}^{(2)} + \mathbf{b}_3, \quad \mathbf{a}^{(3)} = \sigma(\mathbf{z}^{(3)})$$

The error signals (deltas) arise from successive applications of the chain rule through the nested layers. Starting from the output's deviation from the true label $\mathbf{y}$, they are propagated backwards:

$$\boldsymbol{\delta}^{(3)} = \bigl(\mathbf{a}^{(3)} - \mathbf{y}\bigr) \odot \sigma'(\mathbf{z}^{(3)})$$

$$\boldsymbol{\delta}^{(2)} = \bigl(W_3^\top \boldsymbol{\delta}^{(3)}\bigr) \odot \sigma'(\mathbf{z}^{(2)})$$

$$\boldsymbol{\delta}^{(1)} = \bigl(W_2^\top \boldsymbol{\delta}^{(2)}\bigr) \odot \sigma'(\mathbf{z}^{(1)})$$

where $\odot$ is the element-wise (Hadamard) product: given two vectors of the same length, it multiplies corresponding entries rather than computing a dot product or matrix product. It appears here because the chain rule requires multiplying the upstream error signal $\boldsymbol{\delta}$ by the local derivative $\sigma'(\mathbf{z})$ at each neuron independently — a scalar multiplication per neuron, not a linear combination across neurons.
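In NumPy the Hadamard product is just `*` on arrays of the same shape, in contrast to `@`, which contracts across neurons. A tiny illustration with hypothetical values:

```python
import numpy as np

delta = np.array([0.1, -0.2, 0.05])     # upstream error signal (hypothetical values)
sig_prime = np.array([0.25, 0.2, 0.1])  # local derivatives sigma'(z), one per neuron

elementwise = delta * sig_prime  # Hadamard product: one scalar per neuron
dot = delta @ sig_prime          # dot product: a single scalar, NOT what backprop needs here
```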

The gradient contributions for each layer $k \in \{1, 2, 3\}$ (indexing the three weight matrices $W_1, W_2, W_3$) are then:

$$\frac{\partial \mathcal{L}}{\partial W_k} = \frac{1}{n}\sum_{i=1}^n \boldsymbol{\delta}^{(k)}_i \bigl(\mathbf{a}^{(k-1)}_i\bigr)^\top, \qquad \frac{\partial \mathcal{L}}{\partial \mathbf{b}_k} = \frac{1}{n}\sum_{i=1}^n \boldsymbol{\delta}^{(k)}_i$$

with the convention $\mathbf{a}^{(0)} = \mathbf{x}$.
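The per-sample backward pass can be sketched as follows (a sketch under the layer shapes above, not the repository's exact code). It uses the identity $\sigma'(\mathbf{z}) = \mathbf{a}(1 - \mathbf{a})$, so only the cached post-activations are needed:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_single(W1, b1, W2, b2, W3, b3, x, y):
    """Per-sample gradient of (1/2)||h(x) - y||^2.

    Average the returned contributions over all samples to get the batch gradient.
    """
    # Forward pass, caching the post-activations
    a1 = sigmoid(W1 @ x + b1)
    a2 = sigmoid(W2 @ a1 + b2)
    a3 = sigmoid(W3 @ a2 + b3)
    # Deltas: sigma'(z) = a * (1 - a) at each layer
    d3 = (a3 - y) * a3 * (1 - a3)
    d2 = (W3.T @ d3) * a2 * (1 - a2)
    d1 = (W2.T @ d2) * a1 * (1 - a1)
    # Outer products delta^(k) (a^(k-1))^T, with a^(0) = x
    return {"W1": np.outer(d1, x), "b1": d1,
            "W2": np.outer(d2, a1), "b2": d2,
            "W3": np.outer(d3, a2), "b3": d3}
```

A useful sanity check for any hand-written backpropagation is to compare one gradient entry against a central finite difference of the loss.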


Optimiser — Adam (Adaptive Moment Estimation)

The ordinary gradient descent $\mathbf{w} \leftarrow \mathbf{w} - \eta \nabla\mathcal{L}$ uses the same step size for every parameter. Adam maintains per-parameter running estimates of the first and second moments of the gradient, effectively giving each weight its own adaptive learning rate.

Let

$$\mathbf{g}_t = \nabla_\mathbf{w}\mathcal{L}(\mathbf{w}_t)$$

be the gradient at step $t$. Adam maintains:

$$\mathbf{m}_t = \beta_1 \mathbf{m}_{t-1} + (1 - \beta_1)\,\mathbf{g}_t \quad \text{(1st moment / mean)}$$

$$\mathbf{v}_t = \beta_2 \mathbf{v}_{t-1} + (1 - \beta_2)\,\mathbf{g}_t^2 \quad \text{(2nd moment / uncentred variance)}$$

Because $\mathbf{m}_0 = \mathbf{v}_0 = \mathbf{0}$, the estimates are biased towards zero during the early steps. Bias correction removes this effect:

$$\hat{\mathbf{m}}_t = \frac{\mathbf{m}_t}{1 - \beta_1^t}, \qquad \hat{\mathbf{v}}_t = \frac{\mathbf{v}_t}{1 - \beta_2^t}$$

The parameter update is:

$$\mathbf{w}_{t+1} = \mathbf{w}_t - \frac{\eta}{\sqrt{\hat{\mathbf{v}}_t} + \varepsilon}\,\hat{\mathbf{m}}_t$$

Intuition. The denominator $\sqrt{\hat{\mathbf{v}}_t}$ is a running estimate of the gradient's root-mean-square. Dividing by it normalises the effective step: parameters whose gradients are consistently large receive a smaller update (automatic dampening), while parameters with small, uncertain gradients receive a larger relative update. The numerator $\hat{\mathbf{m}}_t$ is a momentum term that smooths the noisy gradient signal, helping the optimiser maintain direction across iterations.
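The three equations above translate directly into a single update function (a sketch; the repository may keep the moment buffers differently):

```python
import numpy as np

def adam_step(w, g, m, v, t, eta=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-indexed iteration count."""
    m = beta1 * m + (1 - beta1) * g          # 1st moment estimate
    v = beta2 * v + (1 - beta2) * g ** 2     # 2nd moment estimate
    m_hat = m / (1 - beta1 ** t)             # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - eta * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```

On a simple quadratic $\|\mathbf{w}\|^2$, repeatedly calling `adam_step` with gradient $2\mathbf{w}$ drives the parameters towards zero, with each coordinate moving at roughly $\eta$ per step while its gradient sign stays consistent.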

Hyperparameters used

| Parameter | Value | Meaning |
|---|---|---|
| $\eta$ | $10^{-3}$ | global learning rate |
| $\beta_1$ | $0.9$ | exponential decay for 1st moment |
| $\beta_2$ | $0.999$ | exponential decay for 2nd moment |
| $\varepsilon$ | $10^{-8}$ | numerical stability constant |
| tol | $10^{-7}$ | convergence threshold on $\|\mathbf{g}_t\|$ |
| iter | $10^5$ | maximum number of iterations |

Usage

The script will:

  1. Display the true Voronoi boundary with the random seeds and training points.
  2. Run Adam training, printing loss and gradient norm every 1 000 steps.
  3. Open an animated plot showing the network's decision region evolving over training.

Reference

Higham, C. F., & Higham, D. J. (2019). Deep learning: An introduction for applied mathematicians. SIAM Review, 61(4), 860–891. https://doi.org/10.1137/18M1165748
