Dataflow Modeling System + RoPE Kernel by jsmonson · Pull Request #60 · microsoft/brainsmith

jsmonson · 2025-09-18T22:12:47Z

Summary

This PR restructures four core systems: component registry, dataflow modeling, design space exploration, and composition-based testing. The work replaces hardcoded BERT-specific assumptions with extensible abstractions while maintaining backward compatibility with existing examples.

Major Components

1. Component Registry (`brainsmith/registry/`)

Decorator-based plugin architecture with automatic discovery:

from brainsmith.registry import kernel, backend, step

@kernel
class LayerNormOp(KernelOp):
    """Automatically discovered and registered"""
    pass

@backend(target_kernel='LayerNormOp', language='hls')
class LayerNorm_hls:
    """Linked to kernel via metadata"""
    pass

@step
def my_transformation(model, config):
    """Automatically discoverable"""
    return model

Features:

Automatic discovery via entry points
Manifest caching (10x faster CLI startup: 500ms → 50ms)
Namespace isolation with priority chain (brainsmith, finn, project, custom)
Type-safe lookup with validation
PEP 562 lazy loading
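
The decorator and priority-chain ideas above can be sketched in a few lines. This is a minimal illustration, not the code in `brainsmith/registry/`: `Registry`, `register`, and `lookup` are hypothetical names, and "first namespace in the chain wins" is an assumption, since the description lists the namespaces without saying which end takes precedence.

```python
from typing import Callable, Dict, Tuple

# Search order for bare names. The namespaces come from the PR description;
# treating the first entry as highest priority is an assumption.
PRIORITY = ("brainsmith", "finn", "project", "custom")


class Registry:
    """Toy component registry keyed by (namespace, name)."""

    def __init__(self) -> None:
        self._items: Dict[Tuple[str, str], type] = {}

    def register(self, namespace: str, name: str) -> Callable[[type], type]:
        """Return a class decorator that records the class and returns it unchanged."""
        if namespace not in PRIORITY:
            raise ValueError(f"unknown namespace: {namespace!r}")

        def decorator(cls: type) -> type:
            self._items[(namespace, name)] = cls
            return cls

        return decorator

    def lookup(self, name: str) -> type:
        """Resolve 'namespace:name' exactly, or a bare name via the priority chain."""
        if ":" in name:
            namespace, short = name.split(":", 1)
            candidates = [(namespace, short)]
        else:
            candidates = [(ns, name) for ns in PRIORITY]
        for key in candidates:
            if key in self._items:
                return self._items[key]
        raise KeyError(f"no component registered as {name!r}")


kernels = Registry()


@kernels.register("finn", "LayerNorm")
class FinnLayerNorm:
    pass


@kernels.register("brainsmith", "LayerNorm")
class BrainsmithLayerNorm:
    pass
```

With this sketch, `kernels.lookup("LayerNorm")` resolves through the chain, while `kernels.lookup("finn:LayerNorm")` pins a specific namespace.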

2. Dataflow Modeling (`brainsmith/dataflow/`)

Schema-driven kernel design with two-phase construction:

Design Principles:

Schemas define structure, not storage (ModelWrapper is single source of truth)
Two-phase construction: Schema → DesignSpace → DesignPoint
Immutable design points prevent mutation bugs during DSE
Unified constraints for ONNX + kernel validation

Two-Phase Construction (enables 2-50x DSE speedup):

Phase 1: Invariant Model (build once)
├─ Resolve tensor shapes from ONNX graph
├─ Compute valid parameter ranges (PE ∈ [1, 64], SIMD ∈ {1,2,4,8})
├─ Validate shape/dtype constraints
└─ Cache for reuse across design points

Phase 2: Configured Model (reconfigure per design point)
├─ Apply parallelization parameters (PE=32, SIMD=8)
├─ Derive stream shapes (input: 256×8, output: 256×32)
├─ Validate parallelization constraints
└─ ~1ms reconfiguration vs ~50ms full rebuild
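
A rough sketch of the build-once, reconfigure-many split described above. All names here (`DesignSpace`, `DesignPoint`, `build_design_space`, `configure`) are hypothetical stand-ins, and "legal values are the divisors of the channel count" is an illustrative assumption rather than the rule the PR implements.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


def divisors(n: int) -> Tuple[int, ...]:
    """All positive divisors of n, in ascending order."""
    return tuple(d for d in range(1, n + 1) if n % d == 0)


@dataclass(frozen=True)
class DesignSpace:
    """Phase 1 result: facts that hold for every design point."""
    in_channels: int
    out_channels: int
    valid: Dict[str, Tuple[int, ...]]


@dataclass(frozen=True)
class DesignPoint:
    """Phase 2 result: one configuration plus values derived from it."""
    simd: int
    pe: int
    input_folds: int
    output_folds: int


def build_design_space(in_channels: int, out_channels: int) -> DesignSpace:
    # Done once per kernel instance: resolve shapes, enumerate legal values.
    valid = {"SIMD": divisors(in_channels), "PE": divisors(out_channels)}
    return DesignSpace(in_channels, out_channels, valid)


def configure(space: DesignSpace, simd: int, pe: int) -> DesignPoint:
    # Done per design point: two membership checks and two divisions.
    for name, value in (("SIMD", simd), ("PE", pe)):
        if value not in space.valid[name]:
            raise ValueError(f"{name}={value} is not in {space.valid[name]}")
    return DesignPoint(simd, pe, space.in_channels // simd, space.out_channels // pe)


space = build_design_space(in_channels=64, out_channels=128)
points = [configure(space, simd, pe) for simd in (4, 8) for pe in (16, 32)]
```

The design space is computed once and shared by every call to `configure`, which is where the per-point saving comes from.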

Navigation API:

# Interface-based (recommended for stream parameters)
point = point.with_input_stream(0, pe=32)
point = point.with_output_stream(0, pe=16)

# Parameter-based (for generic DSE)
point = point.with_parameter("SIMD", 64)
point = point.increase_parameter("PE", factor=2)
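
One way such a navigation API can stay immutable is to return a new object from every call. The sketch below uses a frozen dataclass; `Point` is a hypothetical stand-in for `KernelDesignPoint`, and reading `increase_parameter(..., factor=2)` as "multiply the current value by 2" is an assumption about its semantics.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Point:
    """Immutable configuration; every navigation call returns a new Point."""
    pe: int = 1
    simd: int = 1

    def with_parameter(self, name: str, value: int) -> "Point":
        # dataclasses.replace copies the instance with one field changed and
        # raises TypeError if the field does not exist.
        return replace(self, **{name.lower(): value})

    def increase_parameter(self, name: str, factor: int) -> "Point":
        # Assumed semantics: scale the current value by `factor`.
        current = getattr(self, name.lower())
        return self.with_parameter(name, current * factor)


base = Point(pe=8, simd=4)
wider = base.increase_parameter("PE", factor=2)
```

Because `base` is never modified, a DSE loop can branch from the same starting point many times without defensive copies.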

Components:

KernelSchema: Declarative inputs, outputs, parameters, constraints
KernelDesignSpace: Valid parameter ranges from schema + ONNX context
KernelDesignPoint: Immutable configuration snapshot with navigation API
ConstraintSystem: Unified validation (divisibility, bounds, relationships)
BroadcastSemantics: NumPy-style broadcasting for elementwise operations
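
For readers unfamiliar with the NumPy-style rule that `BroadcastSemantics` refers to, here is a small stand-alone sketch of the shape computation. The function name `broadcast_shape` is illustrative and is not taken from the PR.

```python
from itertools import zip_longest
from typing import Sequence, Tuple


def broadcast_shape(a: Sequence[int], b: Sequence[int]) -> Tuple[int, ...]:
    """Result shape of a NumPy-style elementwise operation on shapes a and b.

    Shapes are aligned from the trailing dimension; a missing dimension is
    treated as 1, and two dimensions are compatible if they are equal or if
    either one is 1.
    """
    result = []
    for x, y in zip_longest(reversed(a), reversed(b), fillvalue=1):
        if x == y or y == 1:
            result.append(x)
        elif x == 1:
            result.append(y)
        else:
            raise ValueError(f"shapes {tuple(a)} and {tuple(b)} are not broadcastable")
    return tuple(reversed(result))
```

For example, adding a per-channel bias of shape `(768,)` to an activation of shape `(1, 128, 768)` keeps the activation's shape, while `(2, 3)` and `(4,)` are rejected.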

3. Design Space Exploration (`brainsmith/dse/`)

Extracted from monolithic core/ into modular architecture:

dse/
├── api.py: High-level explore_design_space() API
├── _parser/: Blueprint → DesignSpace parsing
├── _builder.py: DesignSpace → ExecutionTree construction
├── runner.py: Segment-based execution engine
├── segment.py: Computation reuse abstraction
└── tree.py: Tree traversal and result aggregation

Segment-Based Execution (reuses shared computation):

Design Space: [LayerNorm.PE ∈ {8,16,32}] × [Softmax.PE ∈ {4,8}]

Traditional:              Segment-Based:
6 full runs               5 segments
PE=8,PE=4  ──┐           ┌─ Common prefix (1×)
PE=8,PE=8  ──┤           ├─ LayerNorm.PE=8 (1×)
PE=16,PE=4 ──┤    →      ├─ LayerNorm.PE=16 (1×)
PE=16,PE=8 ──┤           ├─ LayerNorm.PE=32 (1×)
PE=32,PE=4 ──┤           └─ Softmax variants (6×)
PE=32,PE=8 ──┘
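
The saving in the diagram can be checked by counting stage executions. The sketch below is a toy model, not the `dse/` implementation: it treats the flow as three stages (a common prefix, LayerNorm, Softmax) and assumes a stage's result can be reused by every design point that shares the same choices up to that stage.

```python
from itertools import product
from typing import List, Sequence, Tuple

# Each stage has a list of variants; a single-variant stage models the common prefix.
STAGES: List[Tuple[str, Sequence[int]]] = [
    ("common_prefix", [0]),
    ("LayerNorm.PE", [8, 16, 32]),
    ("Softmax.PE", [4, 8]),
]


def traditional_stage_runs(stages: List[Tuple[str, Sequence[int]]]) -> int:
    """Every design point re-runs every stage from scratch."""
    n_points = 1
    for _, variants in stages:
        n_points *= len(variants)
    return n_points * len(stages)


def segment_stage_runs(stages: List[Tuple[str, Sequence[int]]]) -> int:
    """Each distinct prefix of choices runs once and is shared by its descendants."""
    seen = set()
    for choice in product(*(variants for _, variants in stages)):
        for depth in range(1, len(choice) + 1):
            seen.add(choice[:depth])
    return len(seen)


print(traditional_stage_runs(STAGES))  # 18 stage executions (6 points x 3 stages)
print(segment_stage_runs(STAGES))      # 10 stage executions (1 + 3 + 6)
```

The count of 10 matches the per-row multipliers in the diagram: one common prefix, three LayerNorm variants, and six Softmax runs.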

4. Test Framework (`tests/frameworks/`)

Composition-based architecture: implement 3-5 methods, inherit 18+ test cases.

KernelTestBase (abstract foundation)
├─ Pipeline execution, backend simulation, golden validation
└─ Fixture management

KernelTest (single implementation)
├─ Inherits: 6 tests (pipeline, python, cppsim, rtlsim, node creation, inference)
└─ Requires: 3 methods (get_kernel_op, make_test_model, get_num_inputs)

KernelParityTest (reference vs primary)
├─ Inherits: 18 tests (7 parity, 5 HW estimation, 6 golden validation)
└─ Requires: 5 methods (+ infer_kernel_reference, get_backend_variants_reference)

Example (18 tests from ~20 lines):

class TestLayerNormParity(KernelParityTest):
    def make_test_model(self, kernel_test_config):
        return model, ["input_name"]

    def get_kernel_op(self):
        from brainsmith.kernels.layernorm import LayerNormOp
        return LayerNormOp

    def infer_kernel_reference(self, model, target_node):
        model = model.transform(InferLayerNormFinn())
        nodes = model.get_nodes_by_op_type("LayerNorm_Finn")
        return getCustomOp(nodes[0]), model

    def get_backend_variants_reference(self):
        from finn.custom_op.fpgadataflow.hls.layernorm_hls import LayerNorm_hls
        return [LayerNorm_hls]
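
The "implement a few hooks, inherit the tests" pattern can be illustrated without any Brainsmith or FINN dependency. Everything below is a hypothetical miniature (`KernelTestSketch` and its three hooks are invented for this sketch); the real base classes in `tests/frameworks/` have different hooks and far more inherited tests.

```python
import abc
from typing import List, Tuple

Inputs = Tuple[List[int], List[int]]


class KernelTestSketch(abc.ABC):
    """Base class: subclasses supply three hooks and inherit every test_* method."""

    @abc.abstractmethod
    def make_inputs(self) -> Inputs: ...

    @abc.abstractmethod
    def run_kernel(self, inputs: Inputs) -> List[int]: ...

    @abc.abstractmethod
    def golden(self, inputs: Inputs) -> List[int]: ...

    # Inherited tests: written once, reused by every concrete kernel test.
    def test_matches_golden(self) -> None:
        inputs = self.make_inputs()
        assert self.run_kernel(inputs) == self.golden(inputs)

    def test_output_length(self) -> None:
        inputs = self.make_inputs()
        assert len(self.run_kernel(inputs)) == len(inputs[0])

    def test_is_deterministic(self) -> None:
        inputs = self.make_inputs()
        assert self.run_kernel(inputs) == self.run_kernel(inputs)


class TestAddStreamsSketch(KernelTestSketch):
    """Concrete test: only the hooks are implemented here."""

    def make_inputs(self) -> Inputs:
        return [1, 2, 3], [10, 20, 30]

    def run_kernel(self, inputs: Inputs) -> List[int]:
        a, b = inputs
        return [x + y for x, y in zip(a, b)]

    def golden(self, inputs: Inputs) -> List[int]:
        return [11, 22, 33]
```

Under pytest's default collection rules only the `Test*` subclass is collected, so the abstract base never runs on its own.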

Progressive Disclosure Logging:

Default: High-level progress ("Executing cppsim for AddStreams...")
Verbose (-v): Detailed execution logs, shape info, parameter values

5. CLI (`brainsmith/cli/`)

Dual-command architecture with project management:

# brainsmith: Configuration and setup
brainsmith project init ~/my-fpga-project
brainsmith project info
brainsmith registry --verbose
brainsmith setup install-cppsim

# smith: Operational commands (dataflow core creation)
smith model.onnx blueprint.yaml --output-dir ./results

Features:

Project-based workflow with environment management
Registry introspection (view kernels, backends, transforms)
Setup automation (Vivado HLS, board files, dependencies)
Rich formatting (colored output, progress, tables)
Hierarchical config (user defaults → project → env vars → CLI args)
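
The precedence order in the last bullet (user defaults → project → env vars → CLI args, later layers overriding earlier ones) can be sketched with the standard library. `resolve_settings` is a hypothetical helper; the PR's actual settings code is Pydantic-based and is not shown here.

```python
from collections import ChainMap
from typing import Any, Dict, Optional


def _present(layer: Dict[str, Optional[Any]]) -> Dict[str, Any]:
    """Drop unset (None) entries so they do not mask lower-priority layers."""
    return {k: v for k, v in layer.items() if v is not None}


def resolve_settings(
    user_defaults: Dict[str, Any],
    project: Dict[str, Any],
    env_vars: Dict[str, Any],
    cli_args: Dict[str, Any],
) -> Dict[str, Any]:
    # ChainMap searches its maps left to right, so the highest-priority
    # layer (CLI arguments) goes first.
    layers = (cli_args, env_vars, project, user_defaults)
    return dict(ChainMap(*(_present(layer) for layer in layers)))


settings = resolve_settings(
    user_defaults={"build_dir": "~/builds", "log_level": "info"},
    project={"build_dir": "./build"},
    env_vars={},
    cli_args={"log_level": "debug", "build_dir": None},
)
```

Here the project file overrides the user's `build_dir`, and the CLI flag overrides `log_level`; the unset CLI `build_dir` is ignored.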

6. Settings System (`brainsmith/settings/`)

Pydantic-based hierarchical configuration:

# brainsmith.yaml (project root)
build_dir: /path/to/builds
xilinx_path: /tools/Xilinx/Vivado/2024.2
xilinx_version: "2024.2"
logging:
  level: info
  file_level: debug

Features:

Type validation (paths exist, versions match)
Environment variable expansion (${HOME}, ${FINN_ROOT})
Nested configuration sections
Export to shell environment for subprocess compatibility
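
Environment variable expansion over a nested configuration can be sketched with `os.path.expandvars`, which handles the `${NAME}` form used above. `expand_env` is a hypothetical helper and `DEMO_ROOT` is a made-up variable used only for this example.

```python
import os
from typing import Any


def expand_env(value: Any) -> Any:
    """Recursively expand $NAME / ${NAME} references in every string value."""
    if isinstance(value, str):
        # Unknown variables are left unchanged by os.path.expandvars.
        return os.path.expandvars(value)
    if isinstance(value, dict):
        return {key: expand_env(item) for key, item in value.items()}
    if isinstance(value, list):
        return [expand_env(item) for item in value]
    return value


os.environ["DEMO_ROOT"] = "/opt/demo"
raw = {"build_dir": "${DEMO_ROOT}/builds", "logging": {"level": "info"}}
expanded = expand_env(raw)
```

Non-string values and nested sections pass through untouched, so the same function can be applied to a whole parsed YAML document.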

Kernel Implementations

Added

ElementwiseBinary (brainsmith/kernels/elementwise_binary/)
Thresholding (brainsmith/kernels/thresholding/)
Channelwise (brainsmith/kernels/channelwise/)
AddStreams (brainsmith/kernels/addstreams/)
DuplicateStreams (brainsmith/kernels/duplicate_streams/)
RotaryEmbedding (brainsmith/kernels/rotaryembedding/)

Migrated

LayerNorm, Softmax, Crop

Migrated to KernelOp + dataflow modeling
Two-phase construction for DSE optimization
Parity tests against FINN (removed once parity was confirmed, since the legacy implementations are deprecated)

Infrastructure

Internal Utilities (`brainsmith/_internal/`)

lazy_imports.py: PEP 562 lazy module loading
logging.py: Progressive disclosure logging
io/yaml.py: YAML parsing with !include directives
io/dependencies.py: Package dependency resolution
finn/adapter.py: FINN integration layer
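
PEP 562 lets a module define `__getattr__` so that attributes are resolved on first access instead of at import time. The sketch below shows the mechanism in isolation; `make_lazy_module` and the `"module:attribute"` target format are inventions for this example, not the contents of `lazy_imports.py`. In a real package the `__getattr__` function would normally be defined at the top level of the package's `__init__.py`.

```python
import importlib
import types
from typing import Dict


def make_lazy_module(name: str, lazy_attrs: Dict[str, str]) -> types.ModuleType:
    """Create a module whose attributes are imported on first access (PEP 562).

    lazy_attrs maps an attribute name to a "module:attribute" target.
    """
    module = types.ModuleType(name)

    def __getattr__(attr: str):
        try:
            target = lazy_attrs[attr]
        except KeyError:
            raise AttributeError(f"module {name!r} has no attribute {attr!r}") from None
        module_name, _, obj_name = target.partition(":")
        value = getattr(importlib.import_module(module_name), obj_name)
        setattr(module, attr, value)  # cache, so __getattr__ is not called again
        return value

    # A module-level __getattr__ is only consulted when normal lookup fails.
    module.__getattr__ = __getattr__
    return module


lazy = make_lazy_module("lazy_demo", {"dumps": "json:dumps", "sqrt": "math:sqrt"})
```

Nothing is imported when `lazy` is created; `lazy.sqrt` triggers the import of `math` on first use and caches the result on the module.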

Documentation (`docs/`, 60+ files)

MkDocs site structure:

docs/
├── getting-started.md           # Quick start guide
├── api/                         # API reference (CLI, dataflow, DSE, registry)
├── developer-guide/             # In-depth guides (blueprints, kernels, registry)
└── tutorials/                   # Hands-on examples

Breaking Changes

Module Reorganization

# BEFORE (v0.1.0-alpha.1)
from brainsmith.core.design.builder import DesignBuilder
from brainsmith.core.dse.runner import FinnRunner
from brainsmith.custom_op.layernorm import LayerNorm

# AFTER
from brainsmith.dse import build_tree, execute_tree, SegmentRunner
from brainsmith.kernels.layernorm import LayerNormOp

Configuration Format

# BEFORE: Python-based
XILINX_PATH = "/opt/Xilinx"
BUILD_DIR = "./build"

# AFTER: YAML hierarchical
# brainsmith.yaml
xilinx_path: /opt/Xilinx
build:
  output_dir: ./build

CLI Workflow

# BEFORE: Docker wrapper
./run-docker.sh python -m brainsmith.flows.bert

# AFTER: Structured CLI
brainsmith project init .
smith examples/bert/model.onnx examples/bert/blueprint.yaml

Kernel Interface

# BEFORE: Manual HWCustomOp subclassing
class LayerNorm(HWCustomOp):
    def get_nodeattr(self, name):
        return self.onnx_node.attribute[name]

# AFTER: Declarative KernelOp with schema
@kernel
class LayerNormOp(KernelOp):
    @classmethod
    def get_schema_spec(cls):
        return KernelSchema(
            inputs=[...],
            outputs=[...],
            parameters=[...],
            constraints=[...]
        )

Performance Impact

CLI Startup

Before: 500ms (eager imports)
After: 50ms (lazy imports + manifest caching)
10× faster

DSE Execution

Before: Linear execution (N full runs)
After: Segment-based (amortized O(log N) for balanced trees)
2-5× faster for typical design spaces

Test Suite

Before: 120 tests, 45 minutes (monolithic)
After: 300+ tests, 2 minutes fast / 30 minutes full (pytest-xdist)
22× faster development loop

Testing

Coverage

60 test files added (15,871 lines)
100+ test cases for dataflow system
18+ inherited tests per kernel
Parity tests (Brainsmith vs FINN)

Test Categories

pytest -m "not slow" -v        # Fast tests only
pytest -m "cppsim" -v          # C++ simulation (cppsim)
pytest -m "rtlsim" -v          # RTL simulation
pytest -m "parity" -v          # Brainsmith vs FINN
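
Marker-based selection like this works best when the markers are registered, otherwise pytest warns about unknown marks. A minimal `conftest.py` sketch follows; the marker names come from the commands above, but the descriptions and the way this repository actually registers them are assumptions.

```python
# conftest.py (sketch)

# Marker names are taken from the commands above; descriptions are illustrative.
MARKERS = {
    "slow": "long-running tests, deselect with -m 'not slow'",
    "cppsim": "tests that run C++ simulation",
    "rtlsim": "tests that run RTL simulation",
    "parity": "tests comparing Brainsmith kernels against FINN",
}


def pytest_configure(config):
    """pytest hook: register custom markers so -m selections do not warn."""
    for name, description in MARKERS.items():
        config.addinivalue_line("markers", f"{name}: {description}")
```

A test then opts in with `@pytest.mark.slow` (or any of the other names), and the `-m` expressions shown above select or exclude it.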

Validation

BERT End-to-End:

cd examples/bert
./quicktest.sh  # Inference, DSE, synthesis, deployment

Status: All stages passing (verified in CI)

Parity Tests: All migrated kernels (LayerNorm, Softmax, Crop, Thresholding) passed 18 test cases each

Migration Guide

Kernel Developers

# 1. Update kernel class
@kernel
class MyKernelOp(KernelOp):
    @classmethod
    def get_schema_spec(cls):
        return KernelSchema(...)

# 2. Update backend class
@backend(target_kernel='MyKernelOp', language='hls')
class MyKernel_hls:
    pass

# 3. Add tests using KernelTest or KernelParityTest (inherit 6-18 tests)

Pipeline Developers

Update imports: brainsmith.core.dse → brainsmith.dse
Use new API: explore_design_space(model, blueprint)
Convert blueprints from Python to YAML

End Users

Initialize project: brainsmith project init ~/my-fpga-project
Configure environment (edit brainsmith.yaml)
Create blueprint (see examples/bert/blueprint.yaml)
Run DSE: smith model.onnx blueprint.yaml --output-dir ./results

Compatibility

Backward Compatibility

Maintained:

BERT example workflow continues to work
FINN integration layer remains stable
ONNX graph transformations use same FINN primitives

Broken (intentional):

Module paths (forces migration to new architecture)
Configuration format (Python → YAML)


* Initial commit * finn flow: pass absolute path names to finn * Added scripts for roofline analysis * Making the output save in the current directory * release v0.2.0 Enable 4 bits * Bringing up a branch that is just the plugin framework for the BERT ops that have been added * Initial cleanup script. Performs some simplification and does some surgery to remove the Dropout layer. For some reason the IdentityOps are not being removed * Added a simple input arg * Moving to bert_build * Added a transformation to reorder the inputs so that the remove IdentityOP transformation is effective. * Initial cut and laying the groundwork for plugin-based shuffle convert_to_hw operator * Getting stubs up for shuffle op and starting to populate some * Cleanup and some more asserts to check permutation list and shapes match up * Initial helper functions for shuffle work * Adding the input_generator for the cases where the inner dimension is not migrating. * Adding latest version of the onnx model and combining cleanup and bringup scripts into a single build script with multiple steps. 
* Added the infer QuantSoftMax to the pipecleaner build script, renamed the brevitas script * First cut at shuffle specialise layer * Registering Shuffle_hls * Added convert step that is currently skipped * Added a step that attempts to specialise layers on the pipecleaner model * Using fpgapart from the config instead * fixed model * adding some streamlining steps to the build flow which are passing through on the modified input model * Initial commit * finnbrainsmith integration * Added a simple README for now * fixing typoe thanks @auphelia * Initial build shuffle tests up" * populating member functions for getting the dtype and instream/outstream width for HLS generation * Adding the loop_coeffs to the attribute types dict * Needed to give nodes unique names to start generating hardware * Adding a custom HLSBackend where the tcl generation is overridden so that we can include the hlsextension directory * Fixing some portname issues in the generated HLS code * IP successfully building * Added cppsim support, passed suspiciously easily * Added some temporary stop-gaps with a brainsmith_templates so that we can support vector inputs before they appear in finn/dev * Fixing loop bound/coefficient zipping ordering * Reshaping now happening properly and avoiding cppsim segfault * removing IPgen step... for now... * Adding testing from pytorch for the shuffles * cppsim from pytorch to hw is passing * Ramping up testing for all the shuffle types * Removing redundant reshape in testing * First cut at rtlsim support for shuffles * First shuffle RTLSim tests passing * cleaning up the test a little * Cleaning up the InferShuffle transformation * shuffle cppsim codegen cleanup * fixing bug with shape of output when a reshape was present * Needed to increase liveness threshold to get all the rtlsim's to pass' * Bigger bump needed? * [BugFix] Fixed issue with using old Brevitas API for quant_act_scale. 
* Was including the file from the location * Using the plugin's template now * Removing test that doesn't make sense anymore * Removing INT16 for now focusing testing on INT8 for EoY goal * Adding the latest Brevitas bert build script and starting work on the cleanup scripts * Datatype name fix * cppsim integration * Fixing issues with the decapitation step * Added model tail removal custom step * Cleaning up the cleanup script * Removing redundant cleanup step * Adding an endtoend script and updating the README * Ensuring hash's and branches are consistent on the README * Added a minimal initial endtoend test * test fixed * Added a switch to end2end test to attempt IP generation (this is currently failing) * Extended the test to track how many ops have been successfully specialised and what percentage * Have the end2end test export a json dashboard file instead for tracking progress. * refactoring the endtoend test a bit to use fixtures and track progress through the build process * Updated testing to track various bits * RTLSim for QuantSoftMax * Removing prepare_rtlsim stub * QuantSoftMax RTLSim bugfixes (working now) * fix issue of passing datatypes instead of datatype strings * Adding template types to the treereduction operation * cppsim compiling, for the half it required some casting that I was not quite sure about. * ensure that the context array is np.float32 * Getting stuff working with the latest changes * Clean up remove head and add streamlining steps * Add streamlining steps for softmax * add gather to crop * Fixing linker library paths and include directories for 2024.2 compatibility * Cleanup * tracking individual steps now with fixtures dependencies, also added the ability to dump data to the dashboard json file * Refactored testing so that each step in the build flow is a separate pytest fixture. 
If we want to add a test at any point in the build flow we can just pass the step fixture in as an argument and then the cached build at that specific point will be picked up" * Starting to bring in the default steps * Generate a test for each step added automatically * Trying as much of the default flow as possible * removing tests that don't make sense right now * fixing the custom steps * Remove call to default convert_to_hw * Reverting back to old specialise layers * need dataflow partition, comment out for now * Removing duplication of the custom steps for BERT and duplicated scripts * updating endtoend script to include some of the default steps * commenting out the last few steps for now * Add a check at the end to see if hls synth went okay * dashboard json data update * Cleaning up the custom steps * Docstring explanations of the custom_steps required for BERT also cleaned up the flow a bit * bringing up validation testing of some of the steps * Adding python execution model for the shuffle * Added a small function for validation that when a test fails will examine the contexts and show what is the same and what differs * Silly mistake with the shuffle execute, it was not writing the result back into the context but was returning it * Elemwise integration * Adding UINT8 testcase which is the same as the BERT model * Increasing the timeout on softmax tests * Changing paths to match new 2024.2 directory structure * keep things float32 for now * Fixing case issue on SIMD attribute allowed the compilation to go further * boilerplate prepare_rtl sim is okay now, removing overridden version * Input int8, 2024.2 update * FuncLayerNorm bugfix and FLOAT32 testcase * "exec_mode" fix and code cleanup * Merge feature/plugin/layernorm_stf * support multiple lines * Added template parameter to enable/disable the quant stage at the end of the softmax * Adjusting the nodeattr for shuffle so that it is compatible with the set_target_fps transformation * QuantSoftMax 
nodeattr compatibility with set_fps_target transformation * Adding nodeattr so that layernorm is compatible with set_target_fps transformations * simd to SIMD * Non Quant softmax passing cppsim * Validation is having a lot more success with HWSoftMax rather than QuantSoftMax * reintroducing some essential streamlining steps, validation looking a lot better * Endtoend up without fps_target yet * integer cycles to stop issue in set_fifo_depths * Using the v80 part number for the softmax tests * Fix for the issue causing the stitched rtl sim stall * Setting reasonable fps target for initial pipecleaning * Fix for infering the datatypes in the shuffle node thanks @auphelia * Adding some configuration files for the bert end2end flow * Added some expected input and output npy files * Removing start step * Adding correct expected output * Adding an RTLSim node-by-node test to the pytests. Adjusting the configuration for a default build flow. * Adding more rtlsim based testing to the end2end pytests * Saving the context of the node-by-node runs under a different dir name * generate a reference IO each time due to randomly generated weights in brevitas script * Adding a custom step that generates the reference IO for each run for validation * SIMD parameter for shuffles in testing is now properly being set, some tests are now failing cppsim and need fixing * Not every loop coeff should be divided by simd * Fixed the shuffle SIMD issue * Making more command line arguments available for the parameter sweeping for the bert_build demo scripts * Woops left in note * Removing the custom debugging steps from the build flow * Adding an example bash script to sweep over some parameters. 
* Added a simple script to print the results of param sweep * Cleaning up to remove c++17 warning * Tidying up comments / warnings for demos * Using board instead of fpga_part * Making the output look a bit neater * Removing unused validation steps * fix param sweep * Slight tweak to example param sweep script * Adding a makefile and configs for some single layer and three layer configurations. * We have some large fifos in these builds that need to be split. * Updating the Brevitas model as per @nfraser suggestion * Fix circular make dependency * Works using later qonnx changes * New FIFO depth configurations for the three layers, folding configuration might not match the main plugin version though. * Added new preconfigured designs for latest brevitas changes. * Adding license file headers * updating to correct link in setup instructions * Tidying up QuantSoftMax/SoftMax * Cleaning up utils and testing * Cleaning up endtoend pytestingclear * Adding back in the bitwidth option for the parameter sweep with the new model generation * Added a parameter for changing the sequence length * Skipping LN test for now * Changed the artifact naming convention a little * Remove extraneous implementation of QuantizeLayerNormalization * Added a script to generate a config (pre FIFO depth sizing) for a particular folding configuration as we explore the DSE side of the Bert build * Added a makefile recipe for a maximum folding three layer design for passing to RW team * Adjusting number of layers on the design * Manually control the fifo depth stage instead of setting it if a param file is present * Need to come up with better arg naming for parameters, maybe just enforce longargs? 
* Makefile recipies use the generation script for various SIMD/PE configurations rather than prebaking them --------- Co-authored-by: aziz bahri <azizb@amd.com> Co-authored-by: azizb-xlnx <48930381+azizb-xlnx@users.noreply.github.com> Co-authored-by: root <root@TAFK> Co-authored-by: Thomas Keller <thomaskeller@microsoft.com> Co-authored-by: auphelia <jakobapk@web.de> Co-authored-by: Joshua Monson <joshmonson@microsoft.com> Co-authored-by: jsmonson <jsmonson@gmail.com>

* Added extra arguments to reflect latest change in finn/custom/transformer that enables you to override the number of inferences that the fifo depth sizing stage performs. * Fixing the recipes and simplifying

* Improvements to SoftMax hardware efficiency and also adding support for ap_float<W,I> datatypes. * Fixes and compiler integration for new SoftMax * fixing license header


…es on three layer designs (#9) * Adding check to make sure that we don't accidentally set SIMD for shuffleB yet, also updated the config generation so that we do not accidentally set the wrong shuffle in later layers * Cleaning up the build scripts a little thanks @auphelia * Moving the constraining of shuffle parameters and pumpedCompute to temporary custom transformations so that they are more reliable * Removing the temporary check and relying on the custom pass for now until the parallel transpose op comes online * Fixed the return type of the custom transformations

* Added cycle testing to softmax test script Implemented cycle testing code, which compares the layer's rtlsim cycles with its expected cycles (found using QONNX's ModelWrapper.analysis). Copied from https://github.com/Xilinx/finn/blob/00bf8279f2ed20500f3046b395b24c08c8c82325/tests/fpgadataflow/test_fpgadataflow_fmpadding.py * Updated cycles test op type, imported exp_cycles_per_layer - The rtlsim cycles test for the softmax custom op was failing due to the incorrect op type string being used ("FMPadding" instead of "HWSoftmax"). - The FINN method, exp_cycles_per_layer, was not imported, causing the test to fail. * Implemented cycles test for Shuffle custom op - Implemented test to test_fpgadataflow_shuffle.py which compares the Shuffle node's expected cycles with the rtlsim's outputted cycles. - Ran this test, it currently fails. The expected cycles (12288) do not fall within a tolerance of 10 of the rtlsim cycles (23475). * Implemented alternate LayerNorm test script - The existing LayerNorm test is incomplete, and doesn't execute. To bridge the gap in testing, a new test was written based on other custom operations tests. - The new test, test_fpga_dataflow_layernorm_hw_custom_op(), is in the same file as the old test. - The cppsim version of the test currently passes. The rtlsim version fails due to the expected cycles (456) not matching the simulated cycles (63516). Testing was done using the [ifm_dim0-rtlsim-INT9-simd4-hls] configuration. * Removed rtlsim_trace from LayerNorm, updated comments Implemented reviewer suggested changes: - Removed rtlsim_trace attribute from the test's LayerNorm node. - Updated comments: - In construct_onnx_model()'s header comment, changed "Finn" -> "FINN", added info about the LayerNorm's Scale and Bias tensors. - In test_fpga_dataflow_layernorm_hw_custom_op()'s header comment, explained that this test is missing the inferred eltwise operations.

…flow (#14)

…flow (#15) * Removing the accidentally included startstep in the endtoend flow * Restoring the default to 8 for bitwidth

Co-authored-by: Thomas Keller <thomaskeller@microsoft.com>

* Include the reference IO as part of the metadata handover * typo fix


* Created OpTest class for abstracting CustomOp tests - This class helps reduce shared boilerplate code between tests for custom FINN ops. - The OpTest class is designed to be inherited by custom test classes. These custom test classes will inherit pre-written commonly used tests, and helper functions to make writing tests easier. - An example of a test designed using OpTest can be found at the end of `./test/fpgadataflow/test_fpgadataflow_layernorm.py`. - While functional, the class is still a work in progress, and more functionality will be added in alignment with the needs of the engineers who use it. * Applied linting - Applied linting using black's default settings. * Created target_fpga fixture, removed prints, added SIMD ids - Target FPGA, as used by the model_specialise fixture, is now a fixture, which can be overridden by a test class. - Removed print statements in op_test.py that were used for debugging - Added IDs to TestLayerNorms SIMD parameters. Pytest now displays SIMD1, SIMD2, SIMD4, instead of 1, 2, 4. More human-readable! * Implemented reviewer suggestions, new 'target_node' fixture, improved typing - Implemented @STFleming 's suggestions: - The `exec_mode` comparsisons at lines 65 and 68 now use `==` instead of `is`. - The reference to `LayerNorm` in the comment at line 173 has been removed. - `apply_transforms()` no longer uses an `assert`, instead it raises a `RuntimeError`. - Implemented a new fixture, `target_node()`. This fixture returns an integer, specifiying the index in the model of the node we're testing. This means a model can contain nodes/layers other than the the one we want to test. - Improved typing consistency throughout 'op_test.py': `input_tensors()` and `apply_transforms()` were missing parameter type hints.

* Formatting bert_build as a job * Further iteration/brainstorming * Initial FINN docker transplant * Adding deps to git ignore * [Deps] Restructure python github repo installs (#8) Co-authored-by: auphelia <jakobapk@web.de> * Initial docker structuring for BrainSmith * entrypoint path bugfix * [Docker] Enable interactive mode for docker container (#10) * Added model profiling scripts * Hotpatch to remove pyverilator * Normalize line endings in SUPPORT.md * finnbrainsmith --> brainsmith/finnlib paths * Tools folder restructure * Fix gen_bert paths & name in expand_norms * Custom QONNX branch to fix is_finn * Removed old QuantLayerNorm func * Initial job runner structuring * Job structure v0, structure for profiling improvements * Updated readme * Template path fix * Unsued import and formatting cleanup * FP IP import fix * Docker updates for pyxsi * Pyxsi path fix * Onnx path + linting fixes * Removed finnlib, moving up sub folders * Moved run_job to core for consistency * Linting cleanup * Updated README * Added RTL placeholder * Typo & gitignore fixes * Updated finnlib to brainsmith in tests * bert_steps path fix in tests * Fix punctuation in README instructions. 
* Update LICENSE: Brainsmith name fix Co-authored-by: auphelia <56755897+auphelia@users.noreply.github.com> * Update LICENSE: Brainsmith name fix 2 Co-authored-by: auphelia <56755897+auphelia@users.noreply.github.com> * Update README.md - typo fix Co-authored-by: auphelia <56755897+auphelia@users.noreply.github.com> * Brainsmith name fix Co-authored-by: auphelia <56755897+auphelia@users.noreply.github.com> * Update brainsmith/tools/README.md: Brainsmith name fix Co-authored-by: auphelia <56755897+auphelia@users.noreply.github.com> * Update docker/entrypoint.sh: Brainsmith name fix Co-authored-by: auphelia <56755897+auphelia@users.noreply.github.com> * Update docker/entrypoint.sh: Brainsmith name fix Co-authored-by: auphelia <56755897+auphelia@users.noreply.github.com> * Removed exec from fetch_repos * Copyright typo fix --------- Co-authored-by: Thomas Keller <thomaskeller@microsoft.com> Co-authored-by: auphelia <jakobapk@web.de> Co-authored-by: auphelia <56755897+auphelia@users.noreply.github.com>

* add custom onnxscript branch * Add TODO for reconciling onnxscript dependencies --------- Co-authored-by: Joshua Monson <joshmonson@microsoft.com> Co-authored-by: Thomas Keller <tkeller787@gmail.com>

This reverts commit 15fb647.

* Initial attempt at docker build action * Added branch name to action * PR & weekly tests for dev/ci-actions * Added self-hosted runner * Adjusted runs-on label * path fix * Added debug to orient pwd * Added pytest keyword through run-docker.sh * Fixed license path * Updated upload-artifats to v4 * Reorganize bert demo for github action * Updated run-docker CLI args * Added e2e test to actions * Removed build artifacts * Fix ci.yml run-docker statement * Removed "push" trigger * Merge with develop changes and add num workers env variable * Re-added push trigger for testing * Fix merge * Temporarily disabled docker and pytest for e2e validation * Fix BSMITH_BUILD_DIR env variable * Remove push trigger, since PR trigger is sufficient * Remove tesing branches and triggers for PR * Remove auto-gen docs * Delete demos/bert/configs/l1_simd12_pe8.json Removed extraneous config from test --------- Co-authored-by: Ubuntu <azureuser@brainsmith-dev2.woh15gx5mv0exiu0m5xe0hjytg.dx.internal.cloudapp.net>

* add custom onnxscript branch * fix torch error * readd todo --------- Co-authored-by: Joshua Monson <joshmonson@microsoft.com>

* fix formatting with copilot * fix dynamic matmul config when sizing is not divisible by 3 --------- Co-authored-by: Joshua Monson <joshmonson@microsoft.com>

Co-authored-by: Joshua Monson <joshmonson@microsoft.com>

…me (#31) * fix argparse arg that could never be false * update fifosizing arg in hw compiler to match new argument name --------- Co-authored-by: Joshua Monson <joshmonson@microsoft.com>

Co-authored-by: Joshua Monson <joshmonson@microsoft.com>

* Added cleanup steps and job * Made num_default_worker env variable

Co-authored-by: Joshua Monson <joshmonson@microsoft.com>

…r guides

Add complete documentation suite including:
- API reference for all core modules (dataflow, DSE, registry, kernel_op, settings)
- Developer guides covering foundations (dataflow accelerators, kernel concepts), core systems (component registry, DSE, kernel modeling), and reference materials (blueprints, CLI, kernels)
- Getting started guides for installation, configuration, and quickstart
- Technical diagrams and images for visual explanations
- Custom CSS styling for documentation site
- Remove commented-out PR preview deployment code from docs workflow

- Remove VectorVectorActivation (VVAU) kernel implementation and tests
- Delete obsolete ptranspose.sv SystemVerilog file
- Enhance ElementwiseBinary kernel with float type support
- Fix integer division to use truncating semantics matching hardware
- Remove overly restrictive integer-only datatype constraint
- Add comprehensive kernel documentation structure with 11 markdown files
- Enhance test framework with ONNX utilities and improved base classes
- Update MkDocs configuration for new documentation sections
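The integer division fix above concerns rounding direction. As a minimal sketch (the helper name `trunc_div` is hypothetical and not taken from the Brainsmith code), this shows how truncating division, which rounds toward zero as C and HLS integer division do, differs from Python's flooring `//` for negative operands:

```python
def trunc_div(a: int, b: int) -> int:
    """Integer division rounding toward zero (C/HLS style).

    Hypothetical helper for illustration only; Python's // rounds
    toward negative infinity instead.
    """
    q = abs(a) // abs(b)  # magnitude of the quotient
    # The quotient is negative only when the operand signs differ.
    return q if (a >= 0) == (b >= 0) else -q


# Floor and truncation agree for same-sign operands and differ otherwise.
print(7 // 2, trunc_div(7, 2))
print(-7 // 2, trunc_div(-7, 2))
```

The two results only diverge when exactly one operand is negative and the division is inexact, which is why such a mismatch can go unnoticed in tests that use unsigned data only.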

Core dataflow improvements:
- Add lift_scalar_to_rank1() for ONNX scalar normalization
- Add constant_datatype() for fixed output types
- Optimize design_point caching (regenerate from nodeattrs)
- Add execution initialization guard for QONNX compatibility
- Handle rank-0 tensors in template resolution
- Auto-configure FIFO depths from schema interface counts

Kernel enhancements:
- Add ElementwiseBinary kernel with scalar broadcast support
- Refactor Crop to use model shapes directly
- Update Softmax and LayerNorm for new schema patterns
- Improve InferCropFromGather documentation

Test framework v2:
- Add composition-based architecture (v2 base classes)
- Add support modules: data_generation, golden_reference, quant_insertion
- Add datatype annotation system for parameterized tests
- Refactor dual/single kernel tests for cleaner inheritance
- Add ElementwiseBinary v2 test suite with comprehensive coverage

Documentation:
- Add AddStreams HLS backend documentation
- Update test framework README with v2 patterns
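The scalar normalization mentioned above can be pictured at the level of shapes. This is a sketch under assumptions: the function below is a hypothetical stand-in that operates on plain shape tuples, and the actual signature and behavior of `lift_scalar_to_rank1()` are not shown in this PR page.

```python
from typing import Sequence, Tuple


def lift_shape_to_rank1(shape: Sequence[int]) -> Tuple[int, ...]:
    """Map a rank-0 (scalar) shape () to (1,); leave other shapes unchanged.

    Hypothetical illustration of treating ONNX scalars as rank-1 tensors
    so that downstream shape logic never sees an empty shape.
    """
    shape = tuple(shape)
    return (1,) if len(shape) == 0 else shape


print(lift_shape_to_rank1(()))        # scalar shape
print(lift_shape_to_rank1((1, 128)))  # already ranked, passed through
```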

Optimize LayerNorm HLS implementation:
- Remove redundant var static variable from var_stage
- Move variance computation outside loop (division by constant N)
- Replace division by sqrt with multiplication by reciprocal in inv_sqrt_stage
- Fix datatype specification to FLOAT32 constant

Add test framework improvements:
- Add dual_kernel_test_v2.py for FINN vs Brainsmith parity testing
- Enhance kernel_test_base_v2.py and single_kernel_test_v2.py
- Add test_addstreams_parity_poc.py for AddStreams validation
- Remove deprecated golden_reference.py and tensor_mapping.py

Add build_hw_graph step:
- Combine partitioning and specialization phases
- Unify dataflow partition creation and backend specialization
- Update blueprint to use unified build_hw_graph step

Update .gitignore to exclude .ignore files
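The LayerNorm changes above follow a common pattern: hoist the division by the constant N out of the accumulation loop, and compute one reciprocal square root that is then multiplied into every element. The following is a plain-Python sketch of that pattern, not the HLS code from this PR; the function name and the `eps` value are assumptions.

```python
import math
from typing import List, Sequence


def layernorm(xs: Sequence[float], eps: float = 1e-5) -> List[float]:
    """Normalize xs to zero mean and (approximately) unit variance."""
    n = len(xs)
    inv_n = 1.0 / n  # constant factor, computed once
    mean = sum(xs) * inv_n
    # Accumulate squared deviations first, then scale once outside the loop.
    sq_sum = sum((x - mean) ** 2 for x in xs)
    var = sq_sum * inv_n
    # One reciprocal square root, then a multiply per element
    # instead of a divide per element.
    inv_std = 1.0 / math.sqrt(var + eps)
    return [(x - mean) * inv_std for x in xs]


print(layernorm([1.0, 2.0, 3.0, 4.0]))
```

In hardware a divider is typically far more expensive than a multiplier, so trading N divisions for one reciprocal and N multiplications is the usual motivation for this rewrite.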

Reorganize test support utilities into tests/fixtures/ with improved structure:
- Move kernel_test_helpers.py → model_builders.py
- Create model_annotation.py (consolidates datatype_annotation.py and quant_insertion.py)
- Create test_data.py (consolidates data_generation.py)
- Update imports in conftest.py and test files
- Clean up tests/support/ (remove consolidated files)
- Update test framework base classes for new import paths

Update bert_quicktest.yaml target_fps from 1 to 30000.

Update tests/frameworks and tests/kernels for new fixture organization.

- Move blueprints.py and design_spaces.py to tests/fixtures/dse/
- Create tests/fixtures/dse/__init__.py with comprehensive exports
- Update imports in tests/conftest.py and integration tests
- Remove outdated fixture unit tests (test_kernel_test_helpers.py, test_model_annotation_quant.py, test_test_data.py)
- Re-enable apply_parallelization_config step in BERT blueprint

- Add KernelTestConfig unified configuration system with stage-aware parameter handling
- Update all test methods to accept kernel_test_config parameter for tolerances and model creation
- Restructure installation documentation with improved prerequisite formatting
- Add elementwise binary test infrastructure and parity tests
- Deprecate legacy configure_kernel_node/configure_backend_node in favor of stage-aware configure_parameters

- Remove stdout redirect override in FINN adapter to align with upstream logging
- Update FINN dependency to feature/logging-integration-transformer branch
- Migrate test framework files: remove v1 variants, keep v2 as canonical
- Add pytest-cases dependency for improved test parameterization
- Expand pytest markers for better test categorization (single_kernel, dual_kernel, basic_validation)
- Improve installation documentation formatting
- Add git fetch to resolve_ref_to_commit for up-to-date remote refs

…twise_binary tests

Logging system:
- Replace simple logging with orchestrated system supporting quiet/normal/verbose/debug levels
- Add file logging with rotation to output directories
- Integrate FINN logging without handler conflicts (verbose=False means "don't add handlers")
- Support per-tool FINN log levels via configuration

CLI changes:
- Rename --logs to --log-level with new verbosity levels
- Rename 'project show' to 'project info'

Test framework:
- Move elementwise_binary tests from brainsmith/kernels/ to tests/kernels/
- Add shared test base class in kernel package for reuse
- Introduce certification/validation test structure
- Add model caching and certification sweep fixtures

Cleanup:
- Remove 70+ legacy test artifacts and planning docs
- Remove unused prerelease-docs images
- Update documentation and examples to reflect CLI changes
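A verbosity scheme like the one described above can be built with the standard library alone. This is a minimal sketch, not Brainsmith's implementation: the mapping from quiet/normal/verbose/debug to Python log levels, the `run.log` file name, the rotation sizes, and the `configure_logging` function are all assumptions made for illustration.

```python
import logging
import os
from logging.handlers import RotatingFileHandler

# Assumed mapping from the four verbosity names to stdlib levels.
LEVELS = {
    "quiet": logging.ERROR,
    "normal": logging.WARNING,
    "verbose": logging.INFO,
    "debug": logging.DEBUG,
}


def configure_logging(level_name: str, output_dir: str,
                      logger_name: str = "demo") -> logging.Logger:
    """Console output follows the chosen verbosity; the file keeps everything."""
    logger = logging.getLogger(logger_name)
    # Drop handlers from earlier calls so repeated configuration
    # does not duplicate messages.
    for handler in list(logger.handlers):
        logger.removeHandler(handler)
        handler.close()
    logger.setLevel(logging.DEBUG)
    logger.propagate = False

    console = logging.StreamHandler()
    console.setLevel(LEVELS[level_name])
    logger.addHandler(console)

    os.makedirs(output_dir, exist_ok=True)
    file_handler = RotatingFileHandler(
        os.path.join(output_dir, "run.log"),
        maxBytes=1_000_000,  # rotate after about 1 MB (assumed size)
        backupCount=3,       # keep three rotated files (assumed count)
    )
    file_handler.setLevel(logging.DEBUG)
    logger.addHandler(file_handler)
    return logger
```

Setting levels on the handlers rather than on the logger is what lets a quiet console coexist with a complete log file in the output directory.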

…ernel ops documentation

Test Infrastructure Changes:
- Split test base class into separate file (test_elementwise_binary.py)
- Move shared test cases from tests/ to kernel package (test_cases.py)
- Update __init__.py with comprehensive exports for test utilities
- Remove redundant test case variants (mixed dtype/broadcasting duplicates)
- Expand validation cases with DSE dimension coverage (RAM styles, memory modes, PE+RAM combinations)
- Add narrow/binary datatype tests and 3D shape coverage

Documentation:
- Add comprehensive kernel ops guide (7 chapters covering introduction through best practices)
- Simplify settings.md by removing redundant content sections
- Update installation and configuration guides
- Add tutorials index

Build Configuration:
- Update pyproject.toml with mkdocs plugins and navigation structure
- Refine bert quicktest configuration

Test Organization:
- Consolidate Add tests (remove certification module, keep validation only)
- Update validation test to use new test case imports

Phase 1 of KernelParityTest v6.0 migration
- Move _prepare_model_with_annotations to base class
- Move _generate_test_inputs to base class
- Eliminates ~80 lines of duplication
- No functional changes (all tests pass with identical results)

Impact:
- kernel_test_base_v2.py: +85 lines (new helper methods)
- kernel_parity_test.py: -79 lines (removed duplicates)
- single_kernel_test_v2.py: -85 lines (removed duplicates)
- Net: -79 lines removed

Test results: 10 passed, 4 failed (expected), 4 errors (expected)
Identical to baseline - no regressions

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>

Phase 2 of KernelParityTest v6.0 migration - Non-breaking additive changes

New asymmetric design:
- Primary (unqualified): Uses inherited base class methods
- Reference (qualified): Explicit comparison target with "_reference" suffix

Added methods:
- infer_kernel_reference() - Delegates to infer_kernel_a() for compatibility
- get_backend_variants_reference() - Delegates to get_backend_variants_a()
- configure_kernel_reference() - Delegates to configure_kernel_a()
- specialize_to_backend_reference() - NO METHOD SWAPPING! Uses explicit backends

Added fixtures:
- stage2_model (primary, unqualified) - Delegates to stage2_model_b
- stage2_model_reference (reference, qualified) - Uses infer_kernel_reference()
- stage3_model (primary backend) - Uses inherited specialize_to_backend()
- stage3_model_reference (reference backend) - Uses specialize_to_backend_reference()

Benefits:
- Eliminates method swapping anti-pattern (explicit backends)
- Clearer semantics (primary vs reference, not a vs b)
- Non-breaking (delegates to old API for compatibility)
- Prepares for eventual deprecation of kernel_a/kernel_b methods

Impact:
- kernel_parity_test.py: +261 lines (4 methods + 4 fixtures)
- No functional changes (all tests pass identically)
- Old kernel_a/b API still works

Test results: 10 passed, 4 failed (expected), 4 errors (expected)
Identical to baseline - no regressions

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>

- Remove brainsmith/kernels/shuffle/ module (legacy and modern implementations)
- Move project config from .brainsmith/config.yaml to brainsmith.yaml at project root
- Consolidate getting-started docs into single docs/getting-started.md
- Relocate elementwise_binary tests from tests/kernel-migration/ to tests/kernels/
- Remove dual_kernel_test_v2.py (superseded by kernel_parity_test.py)
- Update CLI commands and documentation to reflect new config path
- Allow kernels without backends in DSE parser (log warning instead of error)
- Add "regsitry" typo alias in CLI for convenience

Rename OrderedDimension → OrderedParameter and DSEDimension → ParameterSpec throughout dataflow system for clarity. Maintain backward compatibility aliases for external code.

Changes:
- Rename ordered_dimension.py → ordered_parameter.py
- Update KernelDesignSpace.dimensions → parameters
- Add deprecation aliases (DimensionSpec, OrderedDimension)
- Update all method names (dim_min → param_min, etc.)
- Consolidate test framework (v2 → base, remove deprecated files)
- Add documentation images (BERT DFC, simple MHA)

- Rename DimensionSpec to ParameterSpec in schema definitions
- Rename OrderedDimension to OrderedParameter for DSE navigation
- Rename dse_dimensions to dse_parameters in schema fields
- Update all dimension-related methods to parameter-based naming
- Consolidate _resolve_input_datatype and _resolve_output_datatype into unified _resolve_datatype method
- Remove backward compatibility aliases (Custom, DimensionSpec, OrderedDimension)
- Update test framework to use parameter-based terminology
- Streamline test inheritance hierarchy and remove deprecated test utilities
- Update finn-hlslib dependency to latest commit
- Remove obsolete documentation files from docs/developer-guide and docs/kernels
- Simplify getting-started and index documentation
- Update API reference documentation for dataflow, dse, registry, and settings
- Rename test_ordered_dimension.py to test_ordered_parameter.py
- Rename single_kernel_test.py to kernel_test.py

- Add comprehensive parity test between LegacyCrop and modern Crop
- Test both height and width axis cropping with 8 configurations
- Fix design space initialization in Crop_hls.execute_node()
- Ensure QONNX-created instances properly initialize design space

Test coverage:
- 4 height axis configs (int8/int16/int4, symmetric/asymmetric/large)
- 4 width axis configs (int8/int16/int4, symmetric/asymmetric/large)
- Golden validation, parity checks, HW estimation tests

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>

Core changes:
- Add stream parameter helpers to KernelDesignPoint (reduce duplication)
- Add type inference for ParameterSpec with validation
- Optimize duplicate detection and validation in schemas
- Extract binary operation datatype helper
- Improve KernelOp execution compatibility docs

Crop kernel:
- Remove legacy Crop implementations (LegacyCrop, LegacyCrop_hls, InferCropFromGather)
- Fix Crop_hls width calculation and execution
- Simplify crop kernel exports

Documentation:
- Update README with new project description
- Reorganize docs from prerelease-docs to docs/
- Add experimental kernel ops tutorial series
- Add API docs for CLI, dataflow, DSE, registry
- Add test framework guides (QUICKSTART, KERNEL_PARITY_TEST_GUIDE)

Tests:
- Reorganize test structure (migration folder for legacy tests)
- Move elementwise tests to top level
- Update crop tests to use new implementation

…king

Replace configuration-based source detection with runtime discovery tracking. Introduces _discovered_sources set populated during component discovery and new domain utilities for bidirectional ONNX domain resolution.

Key changes:
- Add _discovered_sources set to track sources from entrypoints and component_sources
- Replace module prefix matching with discovered sources matching in _detect_source()
- Add _domain_utils.py module with derive_domain_from_module() and match_domain_to_source()
- Update get_domain_for_backend() to derive domain from __module__ attribute
- Deprecate source_module_prefixes configuration field with warning
- Add tests for domain derivation and matching utilities
- Remove cli-architecture.md documentation (outdated)
- Update component registry and CLI documentation references

- Delete brainsmith/codegen module (HLSCodeBuilder and tests)
- Replace HLSCodeBuilder with direct string construction in elementwise_binary_hls.py
- Change logging from info to debug level in FINN adapter, DSE, and dataflow builder
- Disable FINN progress display (show_progress: False)

microsoft-github-operations bot and others added 30 commits October 8, 2024 00:07

Initial commit

f5a8dad

CODE_OF_CONDUCT.md committed

1b7ec21

SECURITY.md committed

7a66bac

LICENSE committed

019a370

README.md committed

3c8a7da

SUPPORT.md committed

095e4e9

Created branch, added codeowners

426e89d

BERT builder flow arguments for fifosim n_inferences (#6)

5ba98f9

* Added extra arguments to reflect the latest change in finn/custom/transformer that lets you override the number of inferences performed by the FIFO depth sizing stage.
* Fixing the recipes and simplifying

[SoftMax] New Improved SoftMax (#11)

ccd023b

* Improvements to SoftMax hardware efficiency and also adding support for ap_float<W,I> datatypes.
* Fixes and compiler integration for new SoftMax
* fixing license header

Added a custom step that extracts metadata for the shell integration flow (#14)

fcd7bc3

[TinyBERT] Removing accidentally included start_step in the endtoend flow (#15)

fab2842

* Removing the accidentally included startstep in the endtoend flow
* Restoring the default to 8 for bitwidth

Removing rtlsim_backend after pyverilator deprecation (#16)

d7fb002

Name stylize BrainSmith --> Brainsmith (#17)

0c72dda

Co-authored-by: Thomas Keller <thomaskeller@microsoft.com>

[TinyBERT] Add ref IO to stitched_ip as part of metadata handover (#18)

dbfbe67

* Include the reference IO as part of the metadata handover
* typo fix

Add Custom ONNXSCRIPT repository to BrainSmith (#21)

15fb647

* add custom onnxscript branch
* Add TODO for reconciling onnxscript dependencies
---------
Co-authored-by: Joshua Monson <joshmonson@microsoft.com>
Co-authored-by: Thomas Keller <tkeller787@gmail.com>

Revert "Add Custom ONNXSCRIPT repository to BrainSmith (#21)" (#22)

752bd39

This reverts commit 15fb647.

[CustomOps] Update brainsmith custom ops with changes on finn side (#25)

17fc5ca

Revert onnxscript add Revert (#26)

ff45805

* add custom onnxscript branch
* fix torch error
* readd todo
---------
Co-authored-by: Joshua Monson <joshmonson@microsoft.com>

Fix Dynamic Matmul Initial Config For BERT-Large (#28)

ec39f0d

* fix formatting with copilot
* fix dynamic matmul config when sizing is not divisible by 3
---------
Co-authored-by: Joshua Monson <joshmonson@microsoft.com>

fix argparse arg that could never be false (#30)

fc73217

Co-authored-by: Joshua Monson <joshmonson@microsoft.com>
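The commit above does not say why the flag could never be false. One common cause of that symptom in argparse is `type=bool`, which converts any non-empty string (including "False") to `True`. The sketch below illustrates that pitfall and one standard fix; the flag names `--buggy` and `--fixed` are hypothetical and are not the arguments touched by the commit.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    # Pitfall: bool("False") is True, so this option can never parse as False.
    parser.add_argument("--buggy", type=bool, default=True)
    # One fix (Python 3.9+): generates paired --fixed / --no-fixed switches.
    parser.add_argument("--fixed", action=argparse.BooleanOptionalAction,
                        default=True)
    return parser


args = build_parser().parse_args(["--buggy", "False", "--no-fixed"])
print(args.buggy, args.fixed)
```

On older Python versions, `action="store_true"` or `action="store_false"` with a suitable default achieves the same effect for a single switch.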

Patch Pull Request #30: Update args variable to match new argument name (#31)

4530385

* fix argparse arg that could never be false
* update fifosizing arg in hw compiler to match new argument name
---------
Co-authored-by: Joshua Monson <joshmonson@microsoft.com>

update pytorch to 2.7 (#34)

7a410b2

Co-authored-by: Joshua Monson <joshmonson@microsoft.com>

[Hotfix] Cleanup CI runner artifacts (#33)

30e48ad

* Added cleanup steps and job
* Made num_default_worker env variable

update brevitas commit hash (#36)

40abeed

Co-authored-by: Joshua Monson <joshmonson@microsoft.com>

tafk7 and others added 26 commits November 4, 2025 17:18

Slim docs, move folding transform

c2a95e6

Merge crop parity test and parallelization refactor

5ef9b41

Clean up log/print statements

f798e0a

Freeze brevitas + qonnx commits

c5e2eca

Update CI/CD actions for larger pytest suite

552c346

Update docker python to 3.11

9746e97

Pytest marks update

f868ad0

tafk7 marked this pull request as ready for review November 9, 2025 09:35

tafk7 requested a review from a team as a code owner November 9, 2025 09:36

Correct finn log config assert

a3781b8

tafk7 merged commit c4c924f into develop Nov 9, 2025
4 checks passed

tafk7 merged 148 commits into develop from dev/joshmonson/rope-kernel


6 participants


