openvx-mark

openvx-mark is a vendor-agnostic benchmark suite for OpenVX implementations (1.0 through 1.3+). It measures the performance of individual vision kernels, multi-node pipelines, immediate-mode operations, and the OpenVX graph framework itself across configurable resolutions, producing composite scores, conformance reports, and detailed analytics.

openvx-mark works with any conformant OpenVX implementation — AMD OpenVX (MIVisionX), Intel OpenVX, Khronos Sample Implementation, or any other vendor's runtime. It is designed to answer two complementary questions: how fast are this implementation's kernels? and how much value does this implementation's graph framework add over a kernel-only baseline?

Features

60 standard OpenVX kernels across vision and enhanced vision feature sets
Graph mode and immediate mode benchmarking
Framework benchmarks (opt-in) — measure what only the OpenVX graph runtime can do: graph-vs-immediate dividend, virtual-image fusion, scheduling parallelism on independent branches, async dispatch overhead and concurrency, verify-cost-vs-depth scaling, and per-node VX_NODE_PERFORMANCE attribution. See Framework Benchmarks below.
Multi-resolution testing — VGA, HD, FHD, 4K, 8K, or custom
Composite scoring — geometric mean of megapixels/sec (OpenVX Vision Score) plus a dimensionless OpenVX Framework Score (geomean of framework-dividend metrics; > 1.0 means the graph framework is adding value)
Conformance checking — verifies all available kernels produce valid results
Stability gating — CV% threshold with automatic retries for unstable results
Multi-resolution scaling analysis — measures throughput scaling efficiency across resolutions
Peak vs sustained performance — compares best-case to typical latency
Cross-vendor comparison — C++ (--compare) and Python (scripts/compare_reports.py) generate side-by-side reports including a direction-aware Framework Metrics Comparison section
Reports — JSON, CSV, and Markdown output with glossary

See CHANGELOG.md for release notes; the framework benchmark suite landed in v1.0.

Important

It is recommended that the OpenVX implementation first passes the Khronos OpenVX Conformance Test Suite before running openvx-mark. Benchmarking results are only meaningful when the underlying implementation is conformant — non-conformant implementations may produce incorrect outputs, which will be flagged by openvx-mark's output verification and excluded from composite scores.

Prerequisites

C++17 compiler
CMake 3.10+
An OpenVX implementation with libopenvx and libvxu libraries

Building

Auto-detect OpenVX (recommended)

If your OpenVX implementation is installed in a standard location (/opt/rocm, /usr/local, /usr), CMake will find it automatically:

mkdir build && cd build
cmake ..
cmake --build .

AMD OpenVX (MIVisionX)

mkdir build && cd build
cmake -DROCM_PATH=/opt/rocm ..
cmake --build .

Custom OpenVX installation

Point CMake to your OpenVX headers and libraries:

mkdir build && cd build
cmake -DOPENVX_INCLUDES=/path/to/openvx/include \
      -DOPENVX_LIB_DIR=/path/to/openvx/lib ..
cmake --build .

Khronos Sample Implementation

mkdir build && cd build
cmake -DOPENVX_INCLUDES=/path/to/OpenVX-sample-impl/include \
      -DOPENVX_LIB_DIR=/path/to/OpenVX-sample-impl/build/lib ..
cmake --build .

Usage

./openvx-mark [OPTIONS]

Quick start

# Default run: graph mode, VGA+FHD+4K, 100 iterations
./openvx-mark

# Quick test run
./openvx-mark --resolution VGA --iterations 10 --warmup 3

# Full benchmark with all feature sets
./openvx-mark --all --iterations 200

# Include immediate-mode benchmarks
./openvx-mark --mode both --resolution FHD

CLI Options

Benchmark Selection

Option	Description	Default
`--all`	Run all benchmarks (vision + enhanced_vision)
`--feature-set SET[,SET,...]`	Feature sets: `vision`, `enhanced_vision`, `all`	`vision`
`--category CAT[,CAT,...]`	Filter by category	all
`--kernel NAME[,NAME,...]`	Filter by kernel name	all
`--mode graph\|immediate\|both`	Execution mode	`graph`
`--skip-pipelines`	Skip multi-node pipeline benchmarks

Resolution

Option	Description	Default
`--resolution RES[,RES,...]`	Preset: `VGA`, `HD`, `FHD`, `4K`, `8K`	`VGA,FHD,4K`
`--width W --height H`	Custom resolution

Timing

Option	Description	Default
`--iterations N`	Measurement iterations per benchmark	`100`
`--warmup N`	Warm-up iterations (not measured)	`10`
`--seed N`	PRNG seed for reproducible test data	`42`
`--stability-threshold N`	CV% threshold for stability warnings	`15`
`--max-retries N`	Max retries for unstable benchmarks (2x iterations each retry)	`0`
`--framework-chain-depths N,N,...`	Chain depths swept by `VerifyChain_Box3x3`	`1,4,16,64`

Output

Option	Description	Default
`--output-dir DIR`	Output directory for reports	`./benchmark_results`
`--format json,csv,markdown`	Output formats (comma-separated)	all three
`--verbose`	Verbose output with per-benchmark warnings
`--quiet`	Minimal output (suppress per-benchmark lines)
`--compare file1.json,file2.json`	Compare two or more JSON reports

Benchmarked Kernels

Vision Feature Set (41 kernels)

Category	Kernels
Pixelwise	And, Or, Xor, Not, AbsDiff, Add, Subtract, Multiply
Filters	Box3x3, Gaussian3x3, Median3x3, Erode3x3, Dilate3x3, Sobel3x3, CustomConvolution, NonLinearFilter
Color	ColorConvert (RGB2IYUV, RGB2NV12), ChannelExtract, ChannelCombine, ConvertDepth
Geometric	ScaleImage (Half, Double), WarpAffine, WarpPerspective, Remap
Statistical	Histogram, EqualizeHist, MeanStdDev, MinMaxLoc, IntegralImage
Multi-scale	GaussianPyramid, LaplacianPyramid, HalfScaleGaussian
Feature Detection	CannyEdgeDetector, HarrisCorners, FastCorners, OpticalFlowPyrLK
Misc	Magnitude, Phase, TableLookup, Threshold (Binary, Range), WeightedAverage

Enhanced Vision Feature Set (19 kernels)

Category	Kernels
Pixelwise	Min, Max, Copy
Extraction	MatchTemplate, LBP, NonMaxSuppression, HOGCells, HOGFeatures, HoughLinesP
Tensor	TensorAdd, TensorSub, TensorMul, TensorTranspose, TensorConvertDepth, TensorMatMul, TensorTableLookup
Misc	BilateralFilter, Select, ScalarOperation

Multi-Node Pipelines

Pipeline	Nodes
EdgeDetection	ColorConvert + ChannelExtract + Gaussian3x3 + CannyEdgeDetector
SobelMagnitudePhase	Sobel3x3 + Magnitude + Phase
MorphologyOpen	Erode3x3 + Dilate3x3
MorphologyClose	Dilate3x3 + Erode3x3
DualFilter	Box3x3 + Median3x3
HistogramEqualize	ColorConvert + ChannelExtract + EqualizeHist
HarrisTracker	ColorConvert + ChannelExtract + HarrisCorners
ThresholdedEdge	Sobel3x3 + Magnitude + ConvertDepth + Threshold

Framework Benchmarks (opt-in)

Kernel benchmarks measure how fast a single OpenVX node executes; framework benchmarks measure what only the OpenVX graph runtime can do — verifying a DAG, managing virtual intermediates, fusing/aliasing buffers, scheduling work across targets. They are the metrics that distinguish an OpenVX implementation from a kernel library.

Framework benchmarks are opt-in — they are not in the default run and do not contribute to the OpenVX Vision Score. Enable them with --feature-set framework (only framework benchmarks) or --feature-set everything (kernels + framework). See docs/framework-mark-plan.md for the roadmap.

Benchmark	Chain	What it measures
`GraphDividend_Box3x3_x4`	Box3x3 × 4	Pure framework overhead (same kernel, isolates orchestration cost)
`GraphDividend_MixedFilters`	Gaussian3x3 → Box3x3 → Median3x3 → Erode3x3	Realistic 4-stage filter pipeline
`VerifyChain_Box3x3`	Box3x3 × N (sweeps `--framework-chain-depths`, default 1, 4, 16, 64)	Graph build / verify cost vs N nodes; first-process lazy-alloc tax
`ParallelBranches_Box3x3`	4 independent Box3x3 branches sharing one input	Whether the graph runtime exploits scheduling parallelism on K branches with no data dependency
`Async_Single_Box3x3_x4`	One Box3x3 × 4 chain timed with `vxProcessGraph` and with `vxScheduleGraph`+`vxWaitGraph`	Cost of the async dispatch API on a single graph
`Async_Concurrent_Box3x3_x2`	Two independent Box3x3 × 4 chain graphs	Whether the runtime overlaps independent graphs when scheduled concurrently

Each GraphDividend_* case times the same chain three ways and emits five metrics:

Metric	Unit	Meaning
`sum_immediate_ms`	ms	Sum of N back-to-back `vxu*` immediate-mode calls per chain pass
`graph_real_ms`	ms	One verified graph; intermediates are real (host-visible) buffers
`graph_virtual_ms`	ms	One verified graph; intermediates are `vxCreateVirtualImage` (runtime is free to fuse / alias / tile)
`graph_speedup`	×	`sum_immediate_ms / graph_virtual_ms`. >1 means the graph form beats summed immediate calls — the headline framework dividend
`virtual_dividend`	×	`graph_real_ms / graph_virtual_ms`. >1 means virtual intermediates help (runtime did something useful with the freedom)

When the implementation populates VX_NODE_PERFORMANCE and VX_GRAPH_PERFORMANCE for every node in the virtual-chain run, four additional fusion-attribution metrics are emitted (skipped silently on implementations that don't expose those counters, so the headline metrics above remain comparable):

Metric	Unit	Meaning
`node_count`	count	Number of nodes in the chain
`node_sum_ms`	ms	Sum over all nodes of `vxQueryNode(VX_NODE_PERFORMANCE).avg` — what the runtime says it spent inside individual kernels
`graph_perf_ms`	ms	`vxQueryGraph(VX_GRAPH_PERFORMANCE).avg` — what the runtime says it spent on the whole graph
`fusion_ratio`	×	`node_sum_ms / graph_perf_ms`. ≈ 1.0 = strict back-to-back execution, no fusion. > 1.0 = graph runs faster than the sum of node times — strong evidence the runtime fused, pipelined, or overlapped nodes (this is graph framework value the per-kernel benchmarks cannot see). < 1.0 = per-node accounting under-reports vs. the graph total (e.g. excludes shared setup); rare but possible. ≈ `node_count` = the implementation is reporting whole-graph time on every node (i.e. it doesn't attribute per-node performance) — useful signal even though it doesn't tell you about fusion

fusion_ratio is intentionally not included in the OpenVX Framework Score yet — only graph_dividend benchmarks emit it, so weighting it equally with score metrics that span every framework benchmark would over-index on this one scenario. It also depends on the implementation populating VX_NODE_PERFORMANCE correctly, which not every conformant runtime does. The score gating may be revisited once we have data from more implementations.

VerifyChain_Box3x3 rebuilds a chain of N Box3x3 nodes for each requested depth and reports per-N timings plus three aggregate metrics:

Metric	Unit	Meaning
`n{N}_create_ms`	ms	`vxCreateGraph` + N node creations at depth N
`n{N}_verify_ms`	ms	`vxVerifyGraph` cost at depth N
`n{N}_first_process_ms`	ms	First `vxProcessGraph` call (often pays a one-shot lazy-allocation / kernel-init tax)
`n{N}_steady_process_ms`	ms	Median `vxProcessGraph` cost after warmup
`verify_per_node_ms`	ms/node	Linear-regression slope of verify cost over N — the per-node verify tax
`verify_intercept_ms`	ms	Linear-regression intercept — fixed verify cost independent of chain length
`first_process_overhead_ms`	ms	`first_process_ms - steady_process_ms` at the deepest chain — the cost of the first execution beyond steady state

Use --framework-chain-depths 1,4,16,64,256 to sweep custom depths (defaults to 1,4,16,64).

ParallelBranches_Box3x3 builds one graph with K = 4 independent Box3x3 nodes that share a single input image and write to K independent outputs. The K nodes have no data dependency on each other, so a competent scheduler is free to dispatch them concurrently across cores or targets. The strict-serial baseline is K back-to-back vxuBox3x3 immediate-mode calls, which admit no parallelism.

Metric	Unit	Meaning
`branches`	count	K — number of independent branches (4 in v1)
`serial_immediate_ms`	ms	K back-to-back `vxuBox3x3` calls — strict-serial reference
`parallel_graph_ms`	ms	One graph with K independent Box3x3 nodes — graph runtime is free to parallelize
`parallelism_speedup`	×	`serial_immediate_ms / parallel_graph_ms`. K = perfect parallelism, 1 = none
`parallelism_efficiency`	×	`parallelism_speedup / K`. 1.0 = perfect K-way parallelism, 1/K = none

Interpreting parallelism_efficiency:

≈ 1.0 at FHD or larger — the runtime is exploiting the K-way opportunity well (modulo memory bandwidth).
> 1.0 at small resolutions — graph framework dispatch savings (the same effect measured by graph_dividend) compound with parallelism, since the immediate-mode baseline pays per-call dispatch tax K times.
< 1/K at very large resolutions — memory bandwidth saturates before the cores do; the K branches contend for the same input image and fight for L2/L3.

Async_Single_Box3x3_x4 runs one verified Box3x3 × 4 chain graph and times it both with synchronous vxProcessGraph and with the async pair vxScheduleGraph + vxWaitGraph. The point is to surface the cost of the async dispatch API itself.

Metric	Unit	Meaning
`sync_ms`	ms	Median `vxProcessGraph` time
`async_ms`	ms	Median `vxScheduleGraph` + `vxWaitGraph` time
`async_overhead_ratio`	×	`async_ms / sync_ms`. Lower is better; 1.0 = no tax, > 1 = the async API path is more expensive (typically thread-pool / signaling cost), < 1 = async path actually wins (rare but possible)

Async_Concurrent_Box3x3_x2 builds two independent Box3x3 × 4 chain graphs (no shared data) and times the pair two ways. The async form lets the runtime overlap the two graphs; the sync form does not.

Metric	Unit	Meaning
`graphs`	count	Number of independent graphs (2 in v1)
`sync_sequential_ms`	ms	`vxProcessGraph(g0); vxProcessGraph(g1)` — strict serial
`async_concurrent_ms`	ms	`vxScheduleGraph(g0); vxScheduleGraph(g1); vxWaitGraph(g0); vxWaitGraph(g1)` — runtime is free to overlap
`concurrency_speedup`	×	`sync_sequential_ms / async_concurrent_ms`. >1 = the runtime overlapped graphs, ≈ 1 = it serialized them, < 1 = async overhead exceeded any concurrency gain

concurrency_speedup < 1 at small resolutions is a real and useful signal: it means the implementation's async dispatch overhead exceeds any concurrency gain at that work size. The metric only becomes positive when the per-graph work is large enough to amortize the async path.

Pipelined streaming via the optional vx_khr_pipelining extension is a future enhancement and is intentionally not implemented in this release; the two scenarios above use only standard OpenVX APIs and run on every conformant implementation.

Output

Terminal Summary

=============================================================
  Summary: 156 total | 156 passed | 0 skipped | 0 failed
  OpenVX Vision Score: 1586.05 MP/s (156 benchmarks)
  OpenVX Framework Score: 4.872x (geomean of 18 framework metrics)
  vision Conformance: PASS (41/41)
  vision Top-5 Fastest:
    1. Not                           26835.8 MP/s (graph, FHD)
    2. Threshold_Binary              25550.0 MP/s (graph, VGA)
    3. Threshold_Binary              25037.7 MP/s (graph, FHD)
    4. Threshold_Range               21545.9 MP/s (graph, FHD)
    5. Not                           21533.7 MP/s (graph, VGA)
  vision Top-5 Slowest:
    1. LaplacianPyramid              727.501 ms (graph, 4K)
    2. NonLinearFilter               580.589 ms (graph, 4K)
    3. LaplacianPyramid              225.209 ms (graph, FHD)
    4. FastCorners                   191.288 ms (graph, 4K)
    5. HarrisTracker                 160.251 ms (graph, 4K)
=============================================================

Report Files

File	Description
`benchmark_results.json`	Full results with scores, conformance, scaling analysis, per-result timing stats
`benchmark_results.csv`	Tabular data for spreadsheet analysis
`benchmark_results.md`	Human-readable report with tables, top-10 lists, glossary

Composite Scores

OpenVX Vision Score — Geometric mean of MP/s across all passing graph-mode vision benchmarks
Enhanced Vision Score — Geometric mean when enhanced_vision benchmarks are included
Category Sub-Scores — Per-category geometric mean (pixelwise, filters, color, etc.)
OpenVX Framework Score — Equal-weight geometric mean (dimensionless, ×) of all graph_speedup, virtual_dividend, parallelism_efficiency, and concurrency_speedup values produced by the framework benchmarks. >1.0 means the OpenVX graph framework adds aggregate value over a kernel-only baseline. Lower-is-better metrics (e.g. verify_per_node_ms, async_overhead_ratio) are intentionally excluded so the score has a single monotonic interpretation. Only emitted when framework benchmarks are run (--feature-set framework or --feature-set everything).

Conformance Summary

Checks whether all available kernels in each feature set produced valid graph-mode results. Reports PASS/FAIL with a list of missing kernels.

Cross-Vendor Comparison

C++ (built-in)

Run the benchmark on two different implementations, then compare the JSON reports:

# Run on Vendor A
./openvx-mark --output-dir results_vendor_a

# Run on Vendor B (different machine/implementation)
./openvx-mark --output-dir results_vendor_b

# Compare
./openvx-mark --compare results_vendor_a/benchmark_results.json,results_vendor_b/benchmark_results.json

This generates a comparison.md with a side-by-side table showing median latency, throughput, and % change for each benchmark.

Python

A Python comparison script is also provided for more flexibility:

python3 scripts/compare_reports.py results_vendor_a/benchmark_results.json \
                                    results_vendor_b/benchmark_results.json \
                                    --output comparison

Glossary

Metric	Description
Median (ms)	Median wall-clock execution time across all iterations (50th percentile). More stable than mean for benchmarking.
Mean (ms)	Arithmetic mean of wall-clock execution times.
Min (ms)	Fastest observed execution time (best case).
Max (ms)	Slowest observed execution time (worst case).
StdDev (ms)	Standard deviation of execution times after IQR outlier removal.
P5/P95/P99 (ms)	5th, 95th, and 99th percentile execution times from the raw (pre-outlier-removal) samples.
CV%	Coefficient of Variation — `(stddev / mean) * 100`. Lower values indicate more stable/repeatable results.
MP/s	Megapixels per second — `(width * height) / median_time / 1e6`. Primary throughput metric.
Samples	Number of timing samples after IQR outlier removal.
Outliers	Number of samples removed by the IQR (Interquartile Range) method.
Peak (ms)	Best-case execution time (`min_ns`). Represents peak achievable performance.
Sustained (ms)	Typical execution time (`median_ns`). Represents sustained real-world performance.
Sustained Ratio	`min_ns / median_ns`. Values near 1.0 indicate consistent performance; lower values suggest variance from caching, scheduling, or thermal effects.
Scaling Efficiency	`(MP/s at high res) / (MP/s at low res)`. 1.0 = perfect scaling; values below 1.0 indicate memory or bandwidth bottlenecks at higher resolutions.
Vision Score	Geometric mean of MP/s across all passing graph-mode vision benchmarks. Single-number summary for cross-vendor comparison.
Framework Score	Equal-weight geometric mean (×, dimensionless) of all `graph_speedup`, `virtual_dividend`, `parallelism_efficiency`, and `concurrency_speedup` values produced by framework benchmarks. >1.0 means the OpenVX graph framework adds aggregate value over a kernel-only baseline. Only emitted when framework benchmarks are run.
Stability Warning	Flagged when CV% exceeds the stability threshold (default: 15%). Indicates the result may not be reliable — increase iterations or reduce system load.
Conformance	Whether all available kernels in a feature set produced valid graph-mode results. PASS = all kernels benchmarked successfully.

Project Structure

openvx-mark/
├── CMakeLists.txt              # Build system
├── cmake/
│   └── FindOpenVX.cmake        # Vendor-agnostic OpenVX discovery
├── include/
│   ├── benchmark_config.h      # Configuration and defaults
│   ├── benchmark_context.h     # OpenVX context wrapper
│   ├── benchmark_report.h      # Report generation + analytics
│   ├── benchmark_runner.h      # Benchmark execution engine
│   ├── benchmark_stats.h       # Statistical computation
│   ├── benchmark_timer.h       # High-resolution timing
│   ├── kernel_registry.h       # OpenVX kernel catalog + availability probing
│   ├── resource_tracker.h      # RAII resource management
│   ├── system_info.h           # Host system information
│   └── test_data_generator.h   # Deterministic test data generation
├── scripts/
│   └── compare_reports.py      # Python cross-vendor comparison tool
└── src/
    ├── main.cpp                # CLI entry point
    ├── benchmark_context.cpp
    ├── benchmark_runner.cpp     # Graph/immediate mode execution + stability gating
    ├── benchmark_report.cpp     # JSON/CSV/Markdown generation + analytics
    ├── benchmark_stats.cpp      # Percentiles, IQR outlier removal
    ├── benchmark_timer.cpp
    ├── kernel_registry.cpp      # 60 standard kernel definitions
    ├── system_info.cpp          # Cross-platform system info collection
    ├── test_data_generator.cpp  # Random image/tensor/auxiliary object creation
    └── benchmarks/
        ├── node_pixelwise.cpp   # And, Or, Xor, Not, AbsDiff, Add, Subtract, Multiply, Min, Max, Copy
        ├── node_filters.cpp     # Box3x3, Gaussian3x3, Median3x3, Erode3x3, Dilate3x3, Sobel3x3, CustomConvolution, NonLinearFilter
        ├── node_color.cpp       # ColorConvert, ChannelExtract, ChannelCombine, ConvertDepth
        ├── node_geometric.cpp   # ScaleImage, WarpAffine, WarpPerspective, Remap
        ├── node_statistical.cpp # Histogram, EqualizeHist, MeanStdDev, MinMaxLoc, IntegralImage
        ├── node_multiscale.cpp  # GaussianPyramid, LaplacianPyramid, HalfScaleGaussian
        ├── node_feature.cpp     # CannyEdgeDetector, HarrisCorners, FastCorners, OpticalFlowPyrLK
        ├── node_extraction.cpp  # MatchTemplate, LBP, NonMaxSuppression
        ├── node_tensor.cpp      # TensorAdd, TensorSub, TensorMul, TensorTranspose, TensorConvertDepth, TensorTableLookup
        ├── node_misc.cpp        # Magnitude, Phase, TableLookup, Threshold, WeightedAverage, Select
        ├── immediate_benchmarks.cpp  # vxu* immediate-mode variants
        ├── pipeline_vision.cpp  # EdgeDetection, SobelMagnitudePhase, MorphologyOpen/Close, DualFilter
        └── pipeline_feature.cpp # HistogramEqualize, HarrisTracker, ThresholdedEdge

License

This project is licensed under the MIT License. See LICENSE for details.

Name		Name	Last commit message	Last commit date
Latest commit History 10 Commits
.github/workflows		.github/workflows
cmake		cmake
docs		docs
include		include
opencv-mark		opencv-mark
scripts		scripts
src		src
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
CMakeLists.txt		CMakeLists.txt
LICENSE		LICENSE
README.md		README.md

Folders and files

Latest commit

History

Repository files navigation

openvx-mark

Features

Important

Prerequisites

Building

Auto-detect OpenVX (recommended)

AMD OpenVX (MIVisionX)

Custom OpenVX installation

Khronos Sample Implementation

Usage

Quick start

CLI Options

Benchmark Selection

Resolution

Timing

Output

Benchmarked Kernels

Vision Feature Set (41 kernels)

Enhanced Vision Feature Set (19 kernels)

Multi-Node Pipelines

Framework Benchmarks (opt-in)

Output

Terminal Summary

Report Files

Composite Scores

Conformance Summary

Cross-Vendor Comparison

C++ (built-in)

Python

Glossary

Project Structure

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages