# End-to-end development/deployment toolchain tutorial

This tutorial provides a comprehensive walkthrough of our toolchain, designed to streamline the development and deployment of Spiking Neural Networks (SNNs) onto our (unreleased) custom event-based neuromorphic accelerator hardware. Attendees will gain hands-on experience with model configuration, training, quantization, and hardware export, enabling them to optimize neural networks for real-world applications.

In [None]:
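import os
import zlib

import numpy as np
import torch

from extras import set_random_seed

# Python salts str hashes per process, so hash("...") gives a different seed on
# every run. CRC32 is stable across runs and already fits into 32 bits.
set_random_seed(zlib.crc32(b"NICEWorkshop2025"))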

In [None]:
from ecs_train.config import load_yaml
from ecs_train.training import initialize_experiment
from ecs_train.visualization import network_logger, draw_perf_log
from ecs_train.utils import export_nir
from ecs_common.hardware_config import from_metadata

from extras import check_constraints, quantize_weights

### Training

Our toolchain is an end-to-end framework for developing and deploying SNNs, so training neural networks is an integral part of it. Here we use PyTorch Lightning and Norse under the hood. However, the training framework is interchangeable, as long as it supports exporting to the NIR (Neuromorphic Intermediate Representation) format.

Let us first load our configuration for the N-MNIST dataset. This dataset is commonly used as an easily trainable benchmark dataset, which is why we use it in this time-constrained tutorial setting.

In [None]:
config_path = "config/nmnist_feed_forward.yaml"
config = load_yaml(config_path)
accelerator_config = from_metadata(
    config.hardware_cfg,
    config.model_cfg.network_cfg["quant_cfg"]
)

After loading the configuration, we initialize the experiment. Since we want to train a network from scratch, we don't load any checkpoint. To save time, we will be using only 5% of the dataset batches.

In [None]:
config.trainer_cfg.checkpoint_path = None   # Don't load checkpoint
config.trainer_cfg.num_epochs = 5           # Limit number of epochs

trainer, model, data_module = initialize_experiment(config, limit_batches=0.05)

Now we train the network for 5 epochs (as set in the previous cell) and visualize the training progress.

In [None]:
trainer.fit(model, datamodule=data_module, ckpt_path=None)

network_logger.save_perf_log()
draw_perf_log(network_logger.output_path)

After training finishes, let's test the accuracy of our trained model (again using only 5% of the test data). Note that we first quantize the model according to the fixed-point scheme of our accelerator hardware.

In [None]:
model = quantize_weights(model, accelerator_config.weight_quant_scheme.wl, accelerator_config.weight_quant_scheme.fl)
_ = trainer.test(model, data_module)
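
To give an idea of what quantizing to a fixed-point scheme involves, here is a minimal sketch. It assumes that `wl` is the total word length in bits and `fl` the number of fractional bits, and that values are signed, rounded to the nearest code, and saturated. The helper `quantize_fixed_point` is hypothetical; the actual `quantize_weights` from `extras` may round or clip differently.

```python
def quantize_fixed_point(value, wl, fl):
    """Round `value` to a signed fixed-point number (assumed: wl total bits, fl fractional bits)."""
    scale = 1 << fl                  # one LSB corresponds to 1 / scale
    q_min = -(1 << (wl - 1))         # most negative two's-complement code
    q_max = (1 << (wl - 1)) - 1      # most positive two's-complement code
    q = round(value * scale)         # nearest integer code (exact ties go to even)
    q = max(q_min, min(q_max, q))    # saturate instead of wrapping around
    return q / scale                 # back to a float that lies on the grid


weights = [0.3, -1.26, 100.0, -100.0]
quantized = [quantize_fixed_point(w, wl=8, fl=4) for w in weights]
print(quantized)  # [0.3125, -1.25, 7.9375, -8.0]
```

With 8 bits and 4 fractional bits, the grid spacing is 1/16 and the representable range is [-8, 7.9375], which is why the two large values saturate.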

### Hardware constraints

Hardware in general imposes certain requirements and constraints on what trained networks may look like, and our neuromorphic accelerator is no exception. While the fan-in and fan-out of our neurons are flexible, the total number of neurons placed on a single core is limited. The total number of synapses (i.e., network weights) is also limited by the memory available in a core.

These constraints need to be checked and respected by the training. Optionally, unstructured pruning can be used to tailor networks to match them. However, for the sake of time, this tutorial will not demonstrate this mechanism, but rather show how constraints might be checked and enforced in a future version of the toolchain.

The initial configuration meets all constraints:

In [None]:
(
    fan_in_out_pruning_neccessary,
    global_pruning_necessary,
    num_weights_hidden,
    input_dim,
    hidden_features,
    output_dim
) = check_constraints(config, accelerator_config)

print(f"Global pruning necessary: {global_pruning_necessary}")

If we restrict the number of neurons per core, our input layer with a shape of `CHW = [2, 17, 17]`, and therefore 2 × 17 × 17 = 578 input neurons, is too large to fit.

In [None]:
# Too many input neurons (578 > 512)
config.hardware_cfg["num_neurons_core"] = 512

accelerator_config = from_metadata(
    config.hardware_cfg,
    config.model_cfg.network_cfg["quant_cfg"]
)

try:
    (
        fan_in_out_pruning_neccessary,
        global_pruning_necessary,
        num_weights_hidden,
        input_dim,
        hidden_features,
        output_dim
    ) = check_constraints(config, accelerator_config)
except Exception as e:
    print(f"Exception occurred: {str(e)}")
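
The kind of check performed here can be sketched in a few lines. This is an illustration under stated assumptions, not the implementation of `check_constraints`: we assume that every layer has to fit onto one core, that layers are densely connected, and that the synapse budget applies to the sum of all weights. The function `check_core_budget` and the hidden layer size of 128 are hypothetical.

```python
from math import prod


def check_core_budget(input_shape, hidden_features, output_dim,
                      num_neurons_core, num_synapses_core):
    """Hypothetical budget check; returns True if global pruning would be needed."""
    input_dim = prod(input_shape)  # e.g. 2 * 17 * 17 = 578 input neurons
    layers = [("input", input_dim), ("hidden", hidden_features), ("output", output_dim)]
    for name, size in layers:
        if size > num_neurons_core:
            raise ValueError(
                f"{name} layer has {size} neurons, but a core holds only {num_neurons_core}"
            )
    # Assumed dense connectivity: every neuron feeds every neuron of the next layer.
    num_weights = input_dim * hidden_features + hidden_features * output_dim
    return num_weights > num_synapses_core


try:
    check_core_budget([2, 17, 17], hidden_features=128, output_dim=10,
                      num_neurons_core=512, num_synapses_core=1 << 20)
except ValueError as err:
    print(f"Exception occurred: {err}")
```

Raising an exception for the neuron limit and returning a flag for the synapse limit mirrors the behaviour seen above: a layer that does not fit cannot be repaired by pruning weights, whereas an exceeded synapse budget can.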

When we limit the number of synapses per neuron core, global pruning becomes necessary. By reducing the overall number of weights in an unstructured way, the constraints can be met without losing much accuracy.

In [None]:
# Too many synapses
config.hardware_cfg["num_neurons_core"] = 1024
config.hardware_cfg["num_synapses_core"] = 16384
config.hardware_cfg["num_routes"] = 16384

accelerator_config = from_metadata(
    config.hardware_cfg,
    config.model_cfg.network_cfg["quant_cfg"]
)

(
    fan_in_out_pruning_neccessary,
    global_pruning_necessary,
    num_weights_hidden,
    input_dim,
    hidden_features,
    output_dim
) = check_constraints(config, accelerator_config)

print(f"Global pruning necessary: {global_pruning_necessary}")
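
Although this tutorial does not run the pruning step, the idea of global unstructured pruning can be sketched as follows. We assume a magnitude criterion here (the weights with the smallest absolute values are removed first), which is a common choice but not necessarily the one used in our toolchain. The helper `prune_to_budget` is hypothetical and works on plain nested lists to stay self-contained.

```python
def prune_to_budget(layers, max_synapses):
    """Keep only the `max_synapses` largest-magnitude weights across all layers."""
    # Collect (magnitude, layer, row, column) for every weight in the network.
    entries = [
        (abs(w), li, r, c)
        for li, matrix in enumerate(layers)
        for r, row in enumerate(matrix)
        for c, w in enumerate(row)
    ]
    # Rank all weights together, so layers compete for the shared budget.
    entries.sort(reverse=True)
    keep = {(li, r, c) for _, li, r, c in entries[:max_synapses]}
    return [
        [[w if (li, r, c) in keep else 0.0 for c, w in enumerate(row)]
         for r, row in enumerate(matrix)]
        for li, matrix in enumerate(layers)
    ]


layers = [
    [[0.5, -0.1], [0.05, 0.9]],   # first weight matrix (2 x 2)
    [[-0.7, 0.2]],                # second weight matrix (1 x 2)
]
pruned = prune_to_budget(layers, max_synapses=3)
print(pruned)  # [[[0.5, 0.0], [0.0, 0.9]], [[-0.7, 0.0]]]
```

Because the ranking is global rather than per layer, a layer with many small weights may lose a larger share of its synapses than the others.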

### Pruned networks

Our hardware architecture is fully event-based, which enables it to leverage both temporal and spatial (or structural) sparsity in network architectures to reduce computation time. To show this property, we brought two pre-trained networks each for the N-MNIST and the Spiking Heidelberg Digits (SHD) datasets. In each pair, the first network is unpruned, while the second one has been pruned by 30% (without fine-tuning).
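
Why both kinds of sparsity pay off can be seen by counting operations. The sketch below is a simplified model, not a description of our core: it assumes that an event-driven core performs one weight accumulation per spike and per non-zero outgoing synapse, whereas a dense evaluation touches every synapse at every timestep. Both function names are hypothetical.

```python
def synaptic_updates(spike_trains, weights):
    """Count accumulations of an event-driven evaluation.

    spike_trains[t][i] is 1 if presynaptic neuron i fires at timestep t.
    weights[i][j] is the synapse from neuron i to neuron j (0.0 means pruned).
    """
    fan_out = [sum(1 for w in row if w != 0.0) for row in weights]
    return sum(
        fan_out[i]
        for spikes in spike_trains
        for i, spike in enumerate(spikes)
        if spike
    )


def dense_updates(spike_trains, weights):
    """Count accumulations if every synapse is evaluated at every timestep."""
    return len(spike_trains) * sum(len(row) for row in weights)


spike_trains = [[1, 0, 0], [0, 0, 0], [0, 1, 0], [0, 0, 0]]   # 4 timesteps, 3 inputs
weights = [[0.5, 0.0], [0.2, -0.3], [0.0, 0.1]]               # 3 inputs, 2 outputs
print(synaptic_updates(spike_trains, weights), dense_updates(spike_trains, weights))  # 3 24
```

Fewer spikes (temporal sparsity) shorten the outer sum, and pruned weights (structural sparsity) shrink the fan-out of each spike.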

**N-MNIST:** for the N-MNIST dataset, the accuracy barely diminishes after pruning. This is due to the simplicity of the dataset.

In [None]:
from extras import SuppressOutput

config_path = "config/nmnist_feed_forward.yaml"
config = load_yaml(config_path)
accelerator_config = from_metadata(
    config.hardware_cfg,
    config.model_cfg.network_cfg["quant_cfg"]
)

checkpoints = [
    "checkpoints/nmnist.ckpt",
    "checkpoints/nmnist_30_pruning.ckpt",
]

pruning_levels = [0.0, 0.3]
accuracies = []

for checkpoint in checkpoints:
    with SuppressOutput() as s:
        s.print(f"Testing checkpoint {checkpoint}")
        config.trainer_cfg.checkpoint_path = checkpoint
        trainer, model, data_module = initialize_experiment(config)
        model = quantize_weights(model, accelerator_config.weight_quant_scheme.wl, accelerator_config.weight_quant_scheme.fl)
    
        [test_output] = trainer.test(model, data_module)
        accuracies.append(test_output["test_acc"])

In [None]:
print(f"No pruning:             {100 * accuracies[0]:.2f}% accuracy")
print(f"Pruned (30% sparsity):  {100 * accuracies[1]:.2f}% accuracy")

**Spiking Heidelberg Digits (SHD):** when using the SHD dataset, the accuracy diminishes considerably after pruning. Since we did not fine-tune the networks, the pruned network is not able to solve the task well, losing ~12% of accuracy.

In [None]:
from extras import SuppressOutput

config_path = "config/shd_feed_forward.yaml"
config = load_yaml(config_path)
accelerator_config = from_metadata(
    config.hardware_cfg,
    config.model_cfg.network_cfg["quant_cfg"]
)

checkpoints = [
    "checkpoints/shd.ckpt",
    "checkpoints/shd_30_pruning.ckpt",
]

pruning_levels = [0.0, 0.3]
accuracies = []

for checkpoint in checkpoints:
    with SuppressOutput() as s:
        s.print(f"Testing checkpoint {checkpoint}")
        config.trainer_cfg.checkpoint_path = checkpoint
        trainer, model, data_module = initialize_experiment(config)
        model = quantize_weights(model, accelerator_config.weight_quant_scheme.wl, accelerator_config.weight_quant_scheme.fl)
    
        [test_output] = trainer.test(model, data_module)
        accuracies.append(test_output["test_acc"])

In [None]:
print(f"No pruning:             {100 * accuracies[0]:.2f}% accuracy")
print(f"Pruned (30% sparsity):  {100 * accuracies[1]:.2f}% accuracy")

### Export to NIR

After training and evaluating our SNN in software using Norse, we want to convert it into a portable, standardized format. The Neuromorphic Intermediate Representation (NIR) format is already supported by Norse (and other frameworks), which is why we also opt into using it as the intermediate format for our toolchain.

The NIR export uses one sample from our data to trace the computational graph and then maps the modules in our model to standardized NIR nodes. We modified the export functions of Norse and NIRTorch (the torch plugin of NIR) in custom forks to support our custom quantized neuron implementation, but the generated NIR file still adheres to the standard defined by NIR.

To capture the hardware-specific configuration, such as memory layouts and quantization schemes, we embed that information into the metadata of the NIR file. The next step in the toolchain (the deployment phase) can read that metadata and format the memory files for the accelerator accordingly.

In [None]:
config.trainer_cfg.checkpoint_path = "checkpoints/shd.ckpt"
trainer, model, data_module = initialize_experiment(config)

network_output_dir = os.path.join(config.trainer_cfg.output_path, "networks")
network_output_file = os.path.join(network_output_dir, "network.nir")
os.makedirs(network_output_dir, exist_ok=True)

sample_data = next(iter(data_module.train_dataloader()))[0][0, 0:1, :]

hardware_cfg = config.hardware_cfg
quant_cfg = config.model_cfg.network_cfg["quant_cfg"]
metadata = {"hardware_cfg": hardware_cfg, "quant_cfg": quant_cfg}

export_nir(model.network, metadata, network_output_file, sample_data, dt=config.model_cfg.network_cfg["dt"], broadcast_params=False)

### Deployment

As alluded to previously, this stage operates only on the generated NIR file and its embedded metadata. We parse the NIR graph and the hardware-specific information, simulate the network inference in a bit-accurate fixed-point simulation, and generate the memory files that define the topology, connectivity, and weights of the network.

In [None]:
import nir
from torch.utils.data import random_split

from ecs_deploy.network import Network, generate_input_events
from ecs_common.hardware_config import from_metadata
from ecs_common.quant_options import options_from_config

First we load the NIR graph and parse the metadata.

In [None]:
deploy_output_dir = os.path.join(config.trainer_cfg.output_path, "deploy")
os.makedirs(deploy_output_dir, exist_ok=True)

# Load NIR graph and metadata
nir_graph = nir.read(network_output_file)
metadata = nir_graph.metadata
quant_options = options_from_config(metadata["quant_cfg"])
accelerator_config = from_metadata(metadata["hardware_cfg"], quant_options.weight_format)

Then we generate source-encoded input events from one sample of the dataset. This can also be done using previously exported samples in the `.npy` format. The input events are fed into our simulator, which matches our hardware implementation of a LIF neuron at the bit level.
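
As a rough illustration of event encoding, the sketch below converts dense spike frames into a list of events. It assumes that "source-encoded" means each event carries the timestep and the index of the neuron that emitted it; the layout actually produced by `generate_input_events` may differ. The helper `dense_to_events` is hypothetical.

```python
def dense_to_events(spike_frames):
    """Turn dense frames [timestep][neuron] of 0/1 into (timestep, source_id) events."""
    return [
        (t, source_id)
        for t, frame in enumerate(spike_frames)
        for source_id, spike in enumerate(frame)
        if spike
    ]


frames = [
    [0, 1, 0, 0],   # t = 0: neuron 1 fires
    [0, 0, 0, 0],   # t = 1: silence produces no events at all
    [1, 0, 0, 1],   # t = 2: neurons 0 and 3 fire
]
events = dense_to_events(frames)
print(events)  # [(0, 1), (2, 0), (2, 3)]
```

For multi-dimensional inputs such as `CHW` frames, the neuron index would first be flattened, for example as `(c * H + h) * W + w`.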

In [None]:
# Load input sample and network
test_set = data_module.test_dataloader().dataset
total_samples = len(test_set)
sample, _ = random_split(test_set, [1, total_samples - 1])
sample_loader = torch.utils.data.DataLoader(sample, batch_size=1)
sample_data, target = next(iter(sample_loader))
sample_data = sample_data.permute([1, 0, 2, 3, 4])

input_events = generate_input_events(sample_data.numpy())
network = Network(nir_graph, accelerator_config, quant_options)

# Simulate execution

# The +3 is due to differing latencies between Norse and our simulation.
# Norse does not implement a pipeline, but rather propagates the spikes from
# input to output within 1 timestep, while we need x timesteps (for x layers).
# Also, one extra timestep gets added because of the core internal pipeline.
# Hence: 1 (hidden layer) + 1 (output layer) + 1 = 3 timesteps delay.
output_neuron_states = network.simulate(input_events, len(sample_data) + 3)
output_neuron_states_np = np.array([[[state.to_float() for state in states_ts]] for states_ts in output_neuron_states])

predicted_class = np.argmax(output_neuron_states_np[-1])
print(f"Predicted class: {predicted_class}")
print(f"Actual class:    {target.item()}")
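
To make the notion of a bit-accurate LIF simulation more concrete, here is a minimal integer-only sketch of one neuron update. All details are assumptions for illustration: the leak is implemented as an arithmetic right shift, the membrane potential saturates at the word length, and the neuron resets to zero after a spike. The actual hardware neuron and the `Network.simulate` implementation may differ in each of these points.

```python
def lif_step(v, input_current, threshold, leak_shift, wl=16):
    """One hypothetical integer LIF update; returns (new_potential, spiked)."""
    v_min, v_max = -(1 << (wl - 1)), (1 << (wl - 1)) - 1
    v = v - (v >> leak_shift) + input_current   # leak by a shift, then integrate
    v = max(v_min, min(v_max, v))               # saturate to the word length
    if v >= threshold:
        return 0, True                          # reset to zero on a spike
    return v, False


v = 0
trace = []
for current in [60, 60, 0, 60]:
    v, spiked = lif_step(v, current, threshold=100, leak_shift=2)
    trace.append((v, spiked))
print(trace)  # [(60, False), (0, True), (0, False), (60, False)]
```

Because every operation works on integers, a software model of this kind can reproduce a hardware implementation exactly, which a floating-point simulation generally cannot.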

Finally, we generate the memory file artifacts. We also report the utilization of the accelerator memories.

In [None]:
# Generate memory files
network.generate_mem_files(deploy_output_dir, print_util=True)
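
The file format written by `generate_mem_files` is specific to our accelerator and not shown here. As a generic illustration of what a memory file contains, the sketch below encodes signed fixed-point codes as two's-complement hex words, one word per entry, and computes a utilization figure. The helper `to_hex_words` and the memory depth of 16384 entries are assumptions made for this example.

```python
def to_hex_words(codes, wl):
    """Encode signed integer codes as two's-complement hex strings of `wl` bits."""
    digits = (wl + 3) // 4        # hex digits needed for one word
    mask = (1 << wl) - 1          # keeps the lowest wl bits
    return [format(code & mask, f"0{digits}x") for code in codes]


codes = [5, -20, 127, -128]       # integer weight codes, e.g. from 8-bit quantization
words = to_hex_words(codes, wl=8)
print(words)  # ['05', 'ec', '7f', '80']

memory_depth = 16384              # assumed number of entries in the weight memory
utilization = len(words) / memory_depth
print(f"Utilization: {utilization:.2%}")
```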

### Generate files for FPGA experiments

The previous steps were quite verbose, which is why we provide an automated export function for this tutorial. It exports both the unpruned and pruned checkpoints for the N-MNIST and SHD datasets and 10 test samples (including the expected neuron state outputs) for evaluating on the FPGA later.

The outputs of our bit-accurate simulation do not (yet) exactly match the Norse implementation; a fully aligned Norse neuron implementation is planned for the very near future. We report the maximum error and the MSE, both of which are very low. The relations between the output neurons are preserved, and since the tasks at hand are classification tasks, the deviation does not alter the final network output.
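
The comparison can be sketched as follows. The helper `compare_outputs` is hypothetical and the numbers are made up for illustration; they are not measurements from our networks.

```python
def compare_outputs(reference, simulated):
    """Return (max abs error, MSE, same_prediction) for two output-state vectors."""
    errors = [abs(r - s) for r, s in zip(reference, simulated)]
    max_error = max(errors)
    mse = sum(e * e for e in errors) / len(errors)
    # For classification, only the index of the largest output matters.
    same_prediction = reference.index(max(reference)) == simulated.index(max(simulated))
    return max_error, mse, same_prediction


reference = [0.5, 2.0, 1.0]        # illustrative floating-point outputs
simulated = [0.5625, 1.875, 1.0]   # illustrative fixed-point outputs
max_error, mse, same_prediction = compare_outputs(reference, simulated)
print(f"max error: {max_error}, MSE: {mse:.6f}, same prediction: {same_prediction}")
```

As long as the index of the largest output is identical, small numerical deviations leave the predicted class unchanged.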

Another interesting point is that the weight memory utilization does not decrease by 30% for the pruned network, as one might expect. This is because we employ weight sharing in our deployment toolchain: while the total number of weights is cut by 30%, the number of unique weights is not.

In [None]:
from extras import training_export, deployment_export

print("Exporting N-MNIST")
training_export("nmnist", num_samples=10)
deployment_export("nmnist")

print("\n\nExporting SHD")
training_export("shd", num_samples=10)
deployment_export("shd")
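
The effect of weight sharing on memory utilization can be reproduced with a toy example. The sketch assumes that weight sharing stores each distinct weight value once in a table and lets every synapse reference its entry by index; how our toolchain organizes this in detail is not shown here. The helper `share_weights` and the weight codes are hypothetical.

```python
def share_weights(codes):
    """Split integer weight codes into a table of unique values and per-synapse indices."""
    nonzero = [c for c in codes if c != 0]            # pruned synapses (0) are not stored
    table = sorted(set(nonzero))                      # each distinct value stored once
    index_of = {value: i for i, value in enumerate(table)}
    indices = [index_of[c] for c in nonzero]          # one table index per synapse
    return table, indices


dense = [1, 1, 1, 1, 2, 2, 3, 3, 4, 5]
pruned = [0, 0, 0, 1, 2, 2, 3, 3, 4, 5]               # 30% of the synapses removed

for name, codes in [("dense", dense), ("pruned", pruned)]:
    table, indices = share_weights(codes)
    print(f"{name}: {len(indices)} synapses, {len(table)} unique weights")
# dense: 10 synapses, 5 unique weights
# pruned: 7 synapses, 5 unique weights
```

The number of synapses drops by 30%, but the table of unique weights keeps its size, so a memory that stores the table does not shrink accordingly.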

### Copy to FPGA

Finally, we transfer the files generated in the previous step onto the FPGA. (This will fail if you are not connected to an FPGA with the setup used in the tutorial.)

In [None]:
from extras import transfer_files
transfer_files()