This neural network is implemented entirely with pure 'numpy', without complex deep learning architectures, only intuitive mathematical principles.
This method is implemented using mathematical models and principles, so please use it only for small-scale calculations or teaching purposes to save your valuable time.
- Features
- Installation
- Quick Start
- How It Works
- API Documentation
- Examples
- XOR Problem
- Iris Classification
- Performance Notes
- License
- Zero-dependency black magic — Only depends on
numpy, with all gradients computed manually. - Arbitrary depth networks — The number of hidden layers can be freely set via
net_layer. - Numerical gradient verification — Uses the central difference method to ensure the correctness of backpropagation, without needing to derive the chain rule.
- Cross-entropy Softmax — Natively supports multi-class tasks.
- Lightweight and transparent — About 350 lines of code, easy to read and modify; removing error prompts reduces it to about 150 lines.
- Highly flexible — Many values can be overridden, and of course, direct overriding using
self.xxxis also possible. - Error localization — Custom bilingual error messages at critical positions, efficiently and accurately pinpointing errors.
Clone the repository and make sure NumPy is installed:
git clone https://github.com/linuxlemon256/Neural-Network.git
cd Neural-Network
pip install numpyimport numpy as np
from CustomLayerNet import CustomLayerNet
# Generate simulated data: 100 samples, 5 features, 3 classes
X = np.random.randn(100, 5)
y = np.eye(3)[np.random.randint(0, 3, 100)] # One-hot labels
# Create network: input 5 → hidden layer 10 → output 3, total 3 layers, train for 500 epochs
net = CustomLayerNet(input_size=5, hidden_size=10, output_size=3,
net_layer=3, learning_time=500, learning_rate=0.1)
# Train and get final prediction results
output = net.train(X, y)
pred_class = np.argmax(output, axis=1)
print("Predicted classes for the first 5 samples:", pred_class[:5])Each layer performs a linear transformation X @ W + b. All intermediate layers use the Sigmoid activation function, except the last layer which uses Softmax.
For each trainable parameter, compute the central difference approximate gradient of the loss function at that point:
grad ≈ (loss(W + h) - loss(W - h)) / (2h)
Then update the parameters using standard gradient descent:
W = W - learning_rate * grad
For training data, use derivatives to propagate backward, for example:
dL/dy * dy/dx = dL/dx
Similarly, propagate downstream derivatives to upstream functions in this manner. Only propagating through the loss-softmax function can efficiently compute gradients.
Note: Each iteration requires 2 × total number of parameters forward passes, so this implementation is only suitable for small networks and educational purposes.
| Parameter | Type | Default | Description |
|---|---|---|---|
input_size |
int |
- | Number of input features |
hidden_size |
int |
- | Number of neurons used in all hidden layers |
output_size |
int |
- | Number of output classes |
net_layer |
int |
2 |
Total number of network layers (including input and output layers) |
learning_time |
int |
1000 |
Total number of training iterations |
learning_rate |
float |
1 |
Learning rate for gradient descent |
print_every |
int |
100 |
How often to print loss during train() |
method |
dict |
{"activation":"relu","backpropagation":"errorback","train":"ordinary"} |
Neural network configuration dictionary: activation – activation function, options are "sigmoid" or "relu"; backpropagation – backpropagation method, options are "errorback" or "numerical_differentiation"; "train" reserved for future updates |
init_weights |
float or None |
0.1 |
Standard deviation for weight initialization; if set to None, automatically choose Xavier (sigmoid) or He (relu) initialization based on the activation function |
print_output |
bool |
True |
Whether to print loss and other information during training |
| Method | Return Value | Description |
|---|---|---|
forward(x) |
np.ndarray |
Performs forward propagation, returns the softmax probability output (shape (batch_size, output_size)), and saves activations of each layer in self.activations |
backword(x) |
list |
Computes the gradients of the loss with respect to each layer's parameters, returns a list of gradients (same structure as self.fact); automatically converts integer labels to one‑hot |
train(x, t, method=None) |
np.ndarray |
Trains using full-batch gradient descent for learning_time iterations, returns the final softmax output for the entire training set; can temporarily override configuration via the method parameter |
loss(out) |
float |
Computes cross-entropy loss (automatically converts self.t to one‑hot to match out) |
softmax(x) |
np.ndarray |
Computes softmax activation row-wise (for the output layer) |
sigmoid(x) |
np.ndarray |
Computes sigmoid activation element-wise |
accuracy(out, t=None) |
float |
Computes classification accuracy; can pass t as true labels (supports integer or one‑hot), otherwise uses self.t |
update(grad) |
- | Updates all layer weights and biases based on gradient grad and learning rate |
relu(x) |
np.ndarray |
Computes ReLU activation element-wise (max(0, x)) |
affine(x, fact) |
np.ndarray |
Affine transformation: np.dot(x, w) + b |
Note:
backwardautomatically converts integer labels to one-hot during backpropagation and uses Leaky ReLU (negative slope0.1) to avoid neuron death.accuracysupports passing in labelstdirectly without pre-settingself.t.
XOR is a classic linearly inseparable problem that requires at least one hidden layer to learn.
import numpy as np
from CustomLayerNet import CustomLayerNet
X = np.array([[0,0], [0,1], [1,0], [1,1]])
y = np.array([[1,0], [0,1], [0,1], [1,0]]) # two-class one-hot
# Create network: input 2 dimensions, hidden layer with 4 neurons, output 2 dimensions
net = CustomLayerNet(input_size=2, hidden_size=4, output_size=2,
net_layer=3, learning_time=2000, learning_rate=0.1)
# Training
out = net.train(X,y)
predicted_classes = np.argmax(out, axis=1)
print("Predicted class sequence:", predicted_classes)Using the classic Iris dataset for training and evaluation.
from sklearn.datasets import load_iris
from sklearn.preprocessing import OneHotEncoder
from CustomLayerNet import CustomLayerNet
import numpy as np
# Load data
iris = load_iris()
X = iris.data
y = OneHotEncoder(sparse=False).fit_transform(iris.target.reshape(-1,1))
# Create network: input size 4, hidden layer with 8 neurons, output 3 classes
net = CustomLayerNet(input_size=4, hidden_size=8, output_size=3,
net_layer=3, learning_time=800, learning_rate=1)
# Train and get predictions
pred = net.train(X,y)
accuracy = np.mean(np.argmax(pred, axis=1) == iris.target)
print(f"Accuracy on the Iris dataset: {accuracy:.2%}")Since numerical gradients are used, the computational cost of this implementation is proportional to the number of network parameters. Each parameter requires two forward passes per training iteration (central difference method). Approximate performance for networks of different sizes is as follows:
-
Tiny Network (2-4-2) For example, the XOR problem, which has about 22 parameters.
-
Numerical Differentiation:
-
Training can be completed in a few seconds
-
Backpropagation:
-
Can be completed in fractions of a second
-
Small Network (4-8-3) For example, Iris classification, which has about 67 parameters.
-
Numerical Differentiation:
-
Training can be completed within a minute
-
Backpropagation:
-
Can be completed within a few to tens of seconds
-
Medium Network (784-128-10) For example, MNIST handwritten digit recognition, with over 100,000 parameters.
-
Numerical Differentiation:
-
Not tested; each iteration requires hundreds of thousands of forward passes, and a single iteration may take hours.
-
Backpropagation:
-
Takes several to tens of minutes to complete
Note: Numerical differentiation is computationally inefficient and memory-intensive.
Conclusion: This codebase is intended as a teaching tool to clearly demonstrate the internal mechanisms of gradient descent. For tasks of any practical scale, this method is computationally weak for large data processing, and it is strongly recommended to use frameworks based on analytical gradients (backpropagation algorithms), such as PyTorch or TensorFlow.
This project is licensed under the MIT License. You are free to use, modify, and distribute this code. For more details, please refer to the LICENSE file in the project's root directory.