A neural network from first principles to FPGA hardware: train it, quantize it, synthesize it, prove it.
This is an educational repository. In ~200 lines of dependency-free Python and ~100 lines of Verilog-2005, it walks the complete journey: gradient descent on the CPU, fixed-point quantization, a hardware neuron, a synthesizable network, and an exhaustive testbench that proves the silicon-ready design matches the math, on every one of 293 test vectors.
The punchline is worth spoiling: the finished network is 832 LUT4s and zero flip-flops. A neural network that is pure combinational logic: inputs go in, the answer falls out, no clock required.
XOR is the classic "why we need hidden layers" function: no single straight line separates its classes, so a single neuron can't learn it. A hidden layer bends the space; four ReLU neurons are plenty.
| x0 | x1 | y |
|---|---|---|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
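To make "a hidden layer bends the space" concrete, here is a minimal sketch of a two-ReLU network that computes XOR exactly. The weights are hand-picked for illustration and are an assumption of this example; they are not the parameters the repository trains, and `xor_by_hand` is a hypothetical name.

```python
def relu(v):
    """ReLU nonlinearity: pass positive values, clamp the rest to zero."""
    return max(0.0, v)

def xor_by_hand(x0, x1):
    # Hand-picked weights for illustration only; not the trained parameters.
    h0 = relu(x0 + x1)         # grows with the number of active inputs
    h1 = relu(x0 + x1 - 1.0)   # fires only when both inputs are on
    return h0 - 2.0 * h1       # cancel the "both on" case back to zero

# Walk the truth table above.
for x0, x1 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x0, x1, xor_by_hand(x0, x1))
```

Two hidden neurons are enough on paper; the four used here presumably leave gradient descent some slack to find a solution.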
A neuron multiplies each input by a learned weight, sums the products,
adds a learned bias, and applies a nonlinearity (here ReLU:
max(0, v)). In hardware that's multipliers, an adder, a comparator and
a mux. Nothing else. The mystique of "neural" evaporates pleasantly when
you draw the schematic.
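As a floating-point sketch of that description (the function name and argument layout are assumptions of this example, not taken from train.py), each line maps onto one of the hardware pieces:

```python
def neuron(inputs, weights, bias):
    """One ReLU neuron in floating point."""
    # Multipliers: one product per input.
    products = [x * w for x, w in zip(inputs, weights)]
    # Adder: sum the products and add the bias.
    v = sum(products) + bias
    # Comparator + mux: pick v when it is positive, otherwise 0.
    return v if v > 0.0 else 0.0

print(neuron([1.0, 0.0], [0.5, -0.25], 0.25))
print(neuron([1.0, 1.0], [-1.0, -1.0], 0.5))
```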
Pure-Python gradient descent, no frameworks, because at this size nothing is hidden behind a library call. Forward pass, mean-squared error, backpropagation by hand: ~40 lines.
One honest lesson baked in: ReLU networks can die. With an unlucky initialization every hidden pre-activation goes negative, gradients vanish, and the loss parks at 0.25 forever. The script searches seeds until training converges and the quantized network is still correct, which is both reproducible and true to real practice: initialization matters.
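The training loop and the dead-ReLU failure can be sketched as follows. This is not the repository's train.py: the linear output, the initialization range, the learning rate, and the names `init`, `train_step` and `first_good_seed` are all assumptions made for illustration, and the seed search here checks only the float loss, not the quantized network.

```python
import random

X = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
T = [0.0, 1.0, 1.0, 0.0]
H = 4  # hidden ReLU neurons

def init(seed):
    # Assumed initialization: uniform weights in [-1, 1], zero biases.
    rng = random.Random(seed)
    return {
        "w1": [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(H)],
        "b1": [0.0] * H,
        "w2": [rng.uniform(-1, 1) for _ in range(H)],
        "b2": 0.0,
    }

def train_step(p, lr=0.1):
    """One full-batch gradient-descent step; returns the MSE before the update."""
    gw1 = [[0.0, 0.0] for _ in range(H)]
    gb1, gw2, gb2 = [0.0] * H, [0.0] * H, 0.0
    loss, n = 0.0, len(X)
    for x, t in zip(X, T):
        # Forward pass: hidden pre-activations, ReLU, linear output.
        v = [p["w1"][j][0] * x[0] + p["w1"][j][1] * x[1] + p["b1"][j]
             for j in range(H)]
        h = [max(0.0, vj) for vj in v]
        y = sum(p["w2"][j] * h[j] for j in range(H)) + p["b2"]
        err = y - t
        loss += err * err / n
        # Backward pass: chain rule by hand.
        dy = 2.0 * err / n
        gb2 += dy
        for j in range(H):
            gw2[j] += dy * h[j]
            dv = dy * p["w2"][j] if v[j] > 0.0 else 0.0  # ReLU gate
            gb1[j] += dv
            gw1[j][0] += dv * x[0]
            gw1[j][1] += dv * x[1]
    for j in range(H):
        p["w1"][j][0] -= lr * gw1[j][0]
        p["w1"][j][1] -= lr * gw1[j][1]
        p["b1"][j] -= lr * gb1[j]
        p["w2"][j] -= lr * gw2[j]
    p["b2"] -= lr * gb2
    return loss

def first_good_seed(max_seeds=50, steps=5000, tol=1e-3):
    """Try seeds in order; return the first (seed, loss) that converges."""
    for seed in range(max_seeds):
        p = init(seed)
        for _ in range(steps):
            loss = train_step(p)
        if loss < tol:
            return seed, loss
    return None, None

# A deliberately dead network: every hidden pre-activation is negative.
dead = init(seed=0)
dead["b1"] = [-10.0] * H
w1_before = [row[:] for row in dead["w1"]]
for _ in range(500):
    dead_loss = train_step(dead)
print(dead_loss, dead["w1"] == w1_before)
```

With every ReLU gate closed, only the output bias receives a gradient. It settles at 0.5, the loss sits at 0.25, and the hidden weights never move, which is the stall described above.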
The trained float network draws this decision surface (green = "1"), the XOR diagonal band:
Hardware wants fixed point. Every weight, bias and activation becomes signed Q4.4: 8 bits, 4 fraction bits, range -8 to +7.9375, resolution 1/16 (why so few bits works, and when it wouldn't, is the subject of How many bits do you actually need?).
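A minimal sketch of the float-to-Q4.4 conversion, assuming signed two's-complement storage and round-to-nearest before saturation. The rounding mode and the helper names `to_q44` and `from_q44` are assumptions of this example; train.py may round differently.

```python
FRAC_BITS = 4
SCALE = 1 << FRAC_BITS        # 16 steps per unit, so resolution is 1/16
Q_MIN, Q_MAX = -128, 127      # signed 8-bit range: -8.0 to +7.9375

def to_q44(x):
    """Quantize a float to a Q4.4 integer code, saturating at the ends."""
    q = int(round(x * SCALE))  # assumed rounding mode: Python's round()
    return max(Q_MIN, min(Q_MAX, q))

def from_q44(q):
    """Recover the real value a Q4.4 code represents."""
    return q / SCALE

for x in (0.5, -0.3, 9.0, -9.0):
    q = to_q44(x)
    print(x, q, from_q44(q))
```

Values outside the representable range saturate rather than wrap, so a too-large weight becomes the largest code instead of flipping sign.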
The quantized, integer-only model in train.py (fixed_forward) is
bit-exact with the Verilog: same shifts, same saturation, same
threshold. train.py also regenerates two Verilog headers: weights.vh
(the learned parameters) and golden.vh (expected outputs for a 17×17
grid over the input square).
The quantized decision boundary, now a hard yes/no:
neuron.v is one parameterized neuron that follows the golden rule of
fixed-point datapaths: multiply narrow, accumulate wide. Q4.4×Q4.4
products land in an 18-bit accumulator, ReLU clamps, and only then is
the result resized back to Q4.4 with saturation.
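An integer-only sketch of "multiply narrow, accumulate wide", under stated assumptions: the bias is aligned by a left shift of 4, the resize truncates with an arithmetic right shift, and `fixed_neuron` is a hypothetical name. The real shifts and rounding live in neuron.v and fixed_forward, which this does not claim to reproduce.

```python
FRAC_BITS = 4
Q_MAX = 127          # largest Q4.4 code (+7.9375)
ACC_BITS = 18        # accumulator width quoted above

def fixed_neuron(xs, ws, b):
    """ReLU neuron on Q4.4 integer codes; returns a Q4.4 integer code."""
    acc = b << FRAC_BITS              # align the Q4.4 bias with Q8.8 products
    for x, w in zip(xs, ws):
        acc += x * w                  # Q4.4 * Q4.4 gives a Q8.8 product
    # The wide sum must fit the signed accumulator.
    assert -(1 << (ACC_BITS - 1)) <= acc < (1 << (ACC_BITS - 1))
    acc = max(0, acc)                 # ReLU on the wide value
    out = acc >> FRAC_BITS            # drop 4 fraction bits (assumed truncation)
    return min(out, Q_MAX)            # saturate to the Q4.4 maximum

# 1.0 * 0.5 + 1.0 * 0.5 - 0.25 = 0.75, which is code 12 in Q4.4
print(fixed_neuron([16, 16], [8, 8], -4))
```

Clamping and saturating only at the end means intermediate sums can exceed the Q4.4 range without corrupting the result; narrowing after every product would not have that property.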
xor_net.v instantiates four hidden neurons and a thresholded output
dot product:
The testbench checks the four canonical XOR cases and then sweeps the full 17×17 grid, comparing every output against the Python golden model:
TB PASS: xor_net (293 vectors)
Not "looks right on the waveform." Proven equal to the model, at every point we can enumerate. For a network this size, exhaustive verification is cheap, so take it.
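The vector count follows from the sweep itself. The sketch below assumes the 17×17 grid covers the unit square in Q4.4 steps of 1/16 (integer codes 0 through 16 on each axis); that encoding, and the `count_mismatches` helper with its stand-in models, are assumptions of this example rather than the testbench's actual code.

```python
ONE = 16  # the value 1.0 as a Q4.4 integer code

# Four canonical XOR corners, then every grid point in the unit square.
canonical = [(0, 0), (0, ONE), (ONE, 0), (ONE, ONE)]
grid = [(x0, x1) for x0 in range(ONE + 1) for x1 in range(ONE + 1)]
vectors = canonical + grid

def count_mismatches(dut, golden, vectors):
    """Count the vectors on which two models disagree."""
    return sum(1 for x0, x1 in vectors if dut(x0, x1) != golden(x0, x1))

def stand_in(x0, x1):
    # Placeholder model, NOT the trained network: compares thresholded inputs.
    return int((x0 >= 8) != (x1 >= 8))

print(len(canonical), len(grid), len(vectors))
print(count_mismatches(stand_in, stand_in, vectors))
```

Swapping the two stand-ins for the integer model and a simulated netlist gives the same style of comparison the testbench performs.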
python3 py/train.py # train, quantize, emit weights.vh + golden.vh
python3 py/plots.py # regenerate the images from the artifacts
make # lint (Verilator), simulate (Icarus), synth (Yosys)

Requires iverilog, verilator, yosys, all packaged on most distros.
No Python dependencies at all.
- Make the network pipelined: register each layer, trade latency for clock speed. (The zero-FF version is the fun fact; the pipelined version is what a real datapath looks like.)
- Grow it: MNIST-scale inference is the same ideas with more of everything; that's the libfpga library's upcoming neural micro-kit.
- Learn the building blocks interactively: the free course and Verilog playground at libfpga.com, where you can paste neuron.v and watch it work.
- Why FPGAs suit neural networks in the first place: the fabric is already shaped like one.
Follow @libfpga for new modules, examples and releases.
MIT · Copyright (c) 2026 Antonio Roldao, Ph.D.

