openvx-mark is a vendor-agnostic benchmark suite for OpenVX implementations (1.0 through 1.3+). It measures the performance of individual vision kernels, multi-node pipelines, immediate-mode operations, and the OpenVX graph framework itself across configurable resolutions, producing composite scores, conformance reports, and detailed analytics.
openvx-mark works with any conformant OpenVX implementation: AMD OpenVX (MIVisionX), Intel OpenVX, the Khronos Sample Implementation, or any other vendor's runtime. It is designed to answer two complementary questions: how fast are this implementation's kernels, and how much value does its graph framework add over a kernel-only baseline?
- 60 standard OpenVX kernels across vision and enhanced vision feature sets
- Graph mode and immediate mode benchmarking
- Framework benchmarks (opt-in): measure what only the OpenVX graph runtime can do: graph-vs-immediate dividend, virtual-image fusion, scheduling parallelism on independent branches, async dispatch overhead and concurrency, verify-cost-vs-depth scaling, and per-node `VX_NODE_PERFORMANCE` attribution. See Framework Benchmarks below.
- Multi-resolution testing: VGA, HD, FHD, 4K, 8K, or custom
- Composite scoring: geometric mean of megapixels/sec (OpenVX Vision Score) plus a dimensionless OpenVX Framework Score (geomean of framework-dividend metrics; `> 1.0` means the graph framework is adding value)
- Conformance checking: verifies all available kernels produce valid results
- Stability gating: CV% threshold with automatic retries for unstable results
- Multi-resolution scaling analysis: measures throughput scaling efficiency across resolutions
- Peak vs sustained performance: compares best-case to typical latency
- Cross-vendor comparison: C++ (`--compare`) and Python (`scripts/compare_reports.py`) generate side-by-side reports including a direction-aware Framework Metrics Comparison section
- Reports: JSON, CSV, and Markdown output with glossary
See CHANGELOG.md for release notes; the framework benchmark suite landed in v1.0.
The OpenVX implementation should pass the Khronos OpenVX Conformance Test Suite before you run openvx-mark. Benchmark results are only meaningful when the underlying implementation is conformant: a non-conformant implementation may produce incorrect outputs, which openvx-mark's output verification flags and excludes from the composite scores.
- C++17 compiler
- CMake 3.10+
- An OpenVX implementation with `libopenvx` and `libvxu` libraries
If your OpenVX implementation is installed in a standard location (`/opt/rocm`, `/usr/local`, `/usr`), CMake will find it automatically:

```bash
mkdir build && cd build
cmake ..
cmake --build .
```

To point CMake at a ROCm installation explicitly, set `ROCM_PATH`:

```bash
mkdir build && cd build
cmake -DROCM_PATH=/opt/rocm ..
cmake --build .
```

Point CMake to your OpenVX headers and libraries:

```bash
mkdir build && cd build
cmake -DOPENVX_INCLUDES=/path/to/openvx/include \
      -DOPENVX_LIB_DIR=/path/to/openvx/lib ..
cmake --build .
```

For example, against a source build of the Khronos Sample Implementation:

```bash
mkdir build && cd build
cmake -DOPENVX_INCLUDES=/path/to/OpenVX-sample-impl/include \
      -DOPENVX_LIB_DIR=/path/to/OpenVX-sample-impl/build/lib ..
cmake --build .
```

```bash
./openvx-mark [OPTIONS]
```

```bash
# Default run: graph mode, VGA+FHD+4K, 100 iterations
./openvx-mark
# Quick test run
./openvx-mark --resolution VGA --iterations 10 --warmup 3
# Full benchmark with all feature sets
./openvx-mark --all --iterations 200
# Include immediate-mode benchmarks
./openvx-mark --mode both --resolution FHD
```

| Option | Description | Default |
|---|---|---|
| `--all` | Run all benchmarks (vision + enhanced_vision) | |
| `--feature-set SET[,SET,...]` | Feature sets: vision, enhanced_vision, all, framework, everything | vision |
| `--category CAT[,CAT,...]` | Filter by category | all |
| `--kernel NAME[,NAME,...]` | Filter by kernel name | all |
| `--mode graph\|immediate\|both` | Execution mode | graph |
| `--skip-pipelines` | Skip multi-node pipeline benchmarks | |
Resolution options:

| Option | Description | Default |
|---|---|---|
| `--resolution RES[,RES,...]` | Preset: VGA, HD, FHD, 4K, 8K | VGA,FHD,4K |
| `--width W --height H` | Custom resolution | |
Measurement options:

| Option | Description | Default |
|---|---|---|
| `--iterations N` | Measurement iterations per benchmark | 100 |
| `--warmup N` | Warm-up iterations (not measured) | 10 |
| `--seed N` | PRNG seed for reproducible test data | 42 |
| `--stability-threshold N` | CV% threshold for stability warnings | 15 |
| `--max-retries N` | Max retries for unstable benchmarks (2x iterations each retry) | 0 |
| `--framework-chain-depths N,N,...` | Chain depths swept by `VerifyChain_Box3x3` | 1,4,16,64 |
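The stability gate controlled by `--stability-threshold` and `--max-retries` can be sketched in a few lines of Python. This is an illustration only, not the tool's C++ implementation: the `measure` callback and the function names are hypothetical, the use of the sample standard deviation is an assumption, and "2x iterations each retry" is read here as doubling the iteration count on every retry.

```python
import statistics


def cv_percent(samples):
    """Coefficient of variation in percent: (stddev / mean) * 100."""
    return statistics.stdev(samples) / statistics.fmean(samples) * 100.0


def run_with_stability_gate(measure, iterations=100, threshold=15.0, max_retries=0):
    """Re-run an unstable benchmark with more iterations.

    measure(n) is a hypothetical callback returning n timing samples.
    Returns (samples, retries_used, is_stable).
    """
    samples = measure(iterations)
    retries = 0
    while cv_percent(samples) > threshold and retries < max_retries:
        iterations *= 2          # assumption: iteration count doubles per retry
        retries += 1
        samples = measure(iterations)
    return samples, retries, cv_percent(samples) <= threshold
```

With the default `max_retries=0` the loop body never runs, so an unstable result is only reported as unstable, which matches the option table's description of the threshold as a warning.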
Output options:

| Option | Description | Default |
|---|---|---|
| `--output-dir DIR` | Output directory for reports | ./benchmark_results |
| `--format json,csv,markdown` | Output formats (comma-separated) | all three |
| `--verbose` | Verbose output with per-benchmark warnings | |
| `--quiet` | Minimal output (suppress per-benchmark lines) | |
| `--compare file1.json,file2.json` | Compare two or more JSON reports | |
Vision feature set (41 kernels):

| Category | Kernels |
|---|---|
| Pixelwise | And, Or, Xor, Not, AbsDiff, Add, Subtract, Multiply |
| Filters | Box3x3, Gaussian3x3, Median3x3, Erode3x3, Dilate3x3, Sobel3x3, CustomConvolution, NonLinearFilter |
| Color | ColorConvert (RGB2IYUV, RGB2NV12), ChannelExtract, ChannelCombine, ConvertDepth |
| Geometric | ScaleImage (Half, Double), WarpAffine, WarpPerspective, Remap |
| Statistical | Histogram, EqualizeHist, MeanStdDev, MinMaxLoc, IntegralImage |
| Multi-scale | GaussianPyramid, LaplacianPyramid, HalfScaleGaussian |
| Feature Detection | CannyEdgeDetector, HarrisCorners, FastCorners, OpticalFlowPyrLK |
| Misc | Magnitude, Phase, TableLookup, Threshold (Binary, Range), WeightedAverage |
Enhanced vision feature set (19 kernels):

| Category | Kernels |
|---|---|
| Pixelwise | Min, Max, Copy |
| Extraction | MatchTemplate, LBP, NonMaxSuppression, HOGCells, HOGFeatures, HoughLinesP |
| Tensor | TensorAdd, TensorSub, TensorMul, TensorTranspose, TensorConvertDepth, TensorMatMul, TensorTableLookup |
| Misc | BilateralFilter, Select, ScalarOperation |
Multi-node pipelines:

| Pipeline | Nodes |
|---|---|
| EdgeDetection | ColorConvert + ChannelExtract + Gaussian3x3 + CannyEdgeDetector |
| SobelMagnitudePhase | Sobel3x3 + Magnitude + Phase |
| MorphologyOpen | Erode3x3 + Dilate3x3 |
| MorphologyClose | Dilate3x3 + Erode3x3 |
| DualFilter | Box3x3 + Median3x3 |
| HistogramEqualize | ColorConvert + ChannelExtract + EqualizeHist |
| HarrisTracker | ColorConvert + ChannelExtract + HarrisCorners |
| ThresholdedEdge | Sobel3x3 + Magnitude + ConvertDepth + Threshold |
Kernel benchmarks measure how fast a single OpenVX node executes; framework benchmarks measure what only the OpenVX graph runtime can do — verifying a DAG, managing virtual intermediates, fusing/aliasing buffers, scheduling work across targets. They are the metrics that distinguish an OpenVX implementation from a kernel library.
Framework benchmarks are opt-in — they are not in the default run and do not contribute to the OpenVX Vision Score. Enable them with --feature-set framework (only framework benchmarks) or --feature-set everything (kernels + framework). See docs/framework-mark-plan.md for the roadmap.
| Benchmark | Chain | What it measures |
|---|---|---|
| `GraphDividend_Box3x3_x4` | Box3x3 × 4 | Pure framework overhead (same kernel, isolates orchestration cost) |
| `GraphDividend_MixedFilters` | Gaussian3x3 → Box3x3 → Median3x3 → Erode3x3 | Realistic 4-stage filter pipeline |
| `VerifyChain_Box3x3` | Box3x3 × N (sweeps `--framework-chain-depths`, default 1, 4, 16, 64) | Graph build / verify cost vs N nodes; first-process lazy-alloc tax |
| `ParallelBranches_Box3x3` | 4 independent Box3x3 branches sharing one input | Whether the graph runtime exploits scheduling parallelism on K branches with no data dependency |
| `Async_Single_Box3x3_x4` | One Box3x3 × 4 chain timed with `vxProcessGraph` and with `vxScheduleGraph` + `vxWaitGraph` | Cost of the async dispatch API on a single graph |
| `Async_Concurrent_Box3x3_x2` | Two independent Box3x3 × 4 chain graphs | Whether the runtime overlaps independent graphs when scheduled concurrently |
Each GraphDividend_* case times the same chain three ways and emits five metrics:
| Metric | Unit | Meaning |
|---|---|---|
| `sum_immediate_ms` | ms | Sum of N back-to-back `vxu*` immediate-mode calls per chain pass |
| `graph_real_ms` | ms | One verified graph; intermediates are real (host-visible) buffers |
| `graph_virtual_ms` | ms | One verified graph; intermediates are `vxCreateVirtualImage` (runtime is free to fuse / alias / tile) |
| `graph_speedup` | × | `sum_immediate_ms / graph_virtual_ms`. >1 means the graph form beats summed immediate calls; this is the headline framework dividend |
| `virtual_dividend` | × | `graph_real_ms / graph_virtual_ms`. >1 means virtual intermediates help (runtime did something useful with the freedom) |
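As a worked illustration of the two derived metrics, the Python sketch below applies the ratios from the table to three timings. The function name and the example numbers are made up for illustration; only the formulas come from the table above.

```python
def graph_dividend(sum_immediate_ms, graph_real_ms, graph_virtual_ms):
    """Derive the two ratio metrics from the three measured timings."""
    return {
        # >1: one verified graph beats the summed immediate-mode calls
        "graph_speedup": sum_immediate_ms / graph_virtual_ms,
        # >1: virtual intermediates beat real (host-visible) buffers
        "virtual_dividend": graph_real_ms / graph_virtual_ms,
    }


# Hypothetical timings in milliseconds, not measured results.
example = graph_dividend(sum_immediate_ms=4.0, graph_real_ms=2.5, graph_virtual_ms=2.0)
print(example)
```

Both ratios share `graph_virtual_ms` as the denominator, so a runtime that gains nothing from virtual images reports `virtual_dividend` close to 1 while `graph_speedup` can still exceed 1 from dispatch savings alone.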
When the implementation populates VX_NODE_PERFORMANCE and VX_GRAPH_PERFORMANCE for every node in the virtual-chain run, four additional fusion-attribution metrics are emitted (skipped silently on implementations that don't expose those counters, so the headline metrics above remain comparable):
| Metric | Unit | Meaning |
|---|---|---|
| `node_count` | count | Number of nodes in the chain |
| `node_sum_ms` | ms | Sum over all nodes of `vxQueryNode(VX_NODE_PERFORMANCE).avg`: what the runtime says it spent inside individual kernels |
| `graph_perf_ms` | ms | `vxQueryGraph(VX_GRAPH_PERFORMANCE).avg`: what the runtime says it spent on the whole graph |
| `fusion_ratio` | × | `node_sum_ms / graph_perf_ms`. ≈ 1.0 = strict back-to-back execution, no fusion. > 1.0 = graph runs faster than the sum of node times, which is strong evidence the runtime fused, pipelined, or overlapped nodes (this is graph framework value the per-kernel benchmarks cannot see). < 1.0 = per-node accounting under-reports vs. the graph total (e.g. excludes shared setup); rare but possible. ≈ `node_count` = the implementation is reporting whole-graph time on every node (i.e. it doesn't attribute per-node performance); a useful signal even though it doesn't tell you about fusion |
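The four readings of `fusion_ratio` can be expressed as a small classifier. This is a hedged Python sketch: the function names and the 10% tolerance used for "≈" are assumptions chosen for illustration, not values taken from openvx-mark.

```python
def fusion_ratio(node_sum_ms, graph_perf_ms):
    """node_sum_ms / graph_perf_ms, as defined in the table above."""
    return node_sum_ms / graph_perf_ms


def interpret_fusion_ratio(ratio, node_count, tol=0.10):
    """Map a fusion_ratio to one of the four readings.

    tol is a hypothetical relative tolerance for the two "approximately equal" cases.
    """
    # Checked first: a ratio near node_count is also > 1.0 but means something else.
    if node_count > 1 and abs(ratio - node_count) <= tol * node_count:
        return "whole-graph time reported on every node"
    if abs(ratio - 1.0) <= tol:
        return "strict back-to-back execution, no fusion"
    if ratio > 1.0:
        return "nodes fused, pipelined, or overlapped"
    return "per-node accounting under-reports"
```

The order of the checks matters because the "≈ node_count" case would otherwise be misread as evidence of fusion.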
`fusion_ratio` is intentionally not included in the OpenVX Framework Score yet: only `graph_dividend` benchmarks emit it, so weighting it equally with score metrics that span every framework benchmark would over-index on this one scenario. It also depends on the implementation populating `VX_NODE_PERFORMANCE` correctly, which not every conformant runtime does. The score gating may be revisited once we have data from more implementations.
VerifyChain_Box3x3 rebuilds a chain of N Box3x3 nodes for each requested depth and reports per-N timings plus three aggregate metrics:
| Metric | Unit | Meaning |
|---|---|---|
| `n{N}_create_ms` | ms | `vxCreateGraph` + N node creations at depth N |
| `n{N}_verify_ms` | ms | `vxVerifyGraph` cost at depth N |
| `n{N}_first_process_ms` | ms | First `vxProcessGraph` call (often pays a one-shot lazy-allocation / kernel-init tax) |
| `n{N}_steady_process_ms` | ms | Median `vxProcessGraph` cost after warmup |
| `verify_per_node_ms` | ms/node | Linear-regression slope of verify cost over N: the per-node verify tax |
| `verify_intercept_ms` | ms | Linear-regression intercept: fixed verify cost independent of chain length |
| `first_process_overhead_ms` | ms | `first_process_ms - steady_process_ms` at the deepest chain: the cost of the first execution beyond steady state |
Use --framework-chain-depths 1,4,16,64,256 to sweep custom depths (defaults to 1,4,16,64).
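The slope and intercept metrics can be reproduced from the per-depth verify timings with an ordinary least-squares fit. The sketch below assumes plain (unweighted) least squares, which the source does not spell out; the function name and the sample timings are illustrative only.

```python
def fit_verify_cost(depths, verify_ms):
    """Least-squares line verify_ms = slope * N + intercept.

    Returns (verify_per_node_ms, verify_intercept_ms).
    """
    n = len(depths)
    mean_x = sum(depths) / n
    mean_y = sum(verify_ms) / n
    sxx = sum((x - mean_x) ** 2 for x in depths)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(depths, verify_ms))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical timings that follow 0.5 ms fixed cost + 0.02 ms per node.
depths = [1, 4, 16, 64]
verify_ms = [0.52, 0.58, 0.82, 1.78]
per_node, fixed = fit_verify_cost(depths, verify_ms)
print(f"verify_per_node_ms={per_node:.3f} verify_intercept_ms={fixed:.3f}")
```

At least two distinct depths are needed for the fit to be defined, which the default sweep of four depths satisfies.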
ParallelBranches_Box3x3 builds one graph with K = 4 independent Box3x3 nodes that share a single input image and write to K independent outputs. The K nodes have no data dependency on each other, so a competent scheduler is free to dispatch them concurrently across cores or targets. The strict-serial baseline is K back-to-back vxuBox3x3 immediate-mode calls, which admit no parallelism.
| Metric | Unit | Meaning |
|---|---|---|
| `branches` | count | K: number of independent branches (4 in v1) |
| `serial_immediate_ms` | ms | K back-to-back `vxuBox3x3` calls: strict-serial reference |
| `parallel_graph_ms` | ms | One graph with K independent Box3x3 nodes: graph runtime is free to parallelize |
| `parallelism_speedup` | × | `serial_immediate_ms / parallel_graph_ms`. K = perfect parallelism, 1 = none |
| `parallelism_efficiency` | × | `parallelism_speedup / K`. 1.0 = perfect K-way parallelism, 1/K = none |
Interpreting parallelism_efficiency:
- `≈ 1.0` at FHD or larger: the runtime is exploiting the K-way opportunity well (modulo memory bandwidth).
- `> 1.0` at small resolutions: graph framework dispatch savings (the same effect measured by `graph_dividend`) compound with parallelism, since the immediate-mode baseline pays per-call dispatch tax K times.
- `< 1/K` at very large resolutions: memory bandwidth saturates before the cores do; the K branches contend for the same input image and fight for L2/L3.
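To make the relationship between the two ratios concrete, here is a short Python sketch. The function name and the timings are hypothetical; the formulas are the ones given in the metric table.

```python
def parallelism_metrics(serial_immediate_ms, parallel_graph_ms, branches=4):
    """Return (parallelism_speedup, parallelism_efficiency) for K branches."""
    speedup = serial_immediate_ms / parallel_graph_ms
    efficiency = speedup / branches
    return speedup, efficiency


# Hypothetical example: 4 serial calls take 8.0 ms, the 4-branch graph takes 2.5 ms.
speedup, efficiency = parallelism_metrics(8.0, 2.5, branches=4)
print(f"speedup={speedup:.2f}x efficiency={efficiency:.2f}")
```

In this made-up case the graph reaches 3.2 of a possible 4x, so the efficiency of 0.8 falls between the "none" value of 1/K = 0.25 and the ideal of 1.0.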
Async_Single_Box3x3_x4 runs one verified Box3x3 × 4 chain graph and times it both with synchronous vxProcessGraph and with the async pair vxScheduleGraph + vxWaitGraph. The point is to surface the cost of the async dispatch API itself.
| Metric | Unit | Meaning |
|---|---|---|
| `sync_ms` | ms | Median `vxProcessGraph` time |
| `async_ms` | ms | Median `vxScheduleGraph` + `vxWaitGraph` time |
| `async_overhead_ratio` | × | `async_ms / sync_ms`. Lower is better; 1.0 = no tax, > 1 = the async API path is more expensive (typically thread-pool / signaling cost), < 1 = async path actually wins (rare but possible) |
Async_Concurrent_Box3x3_x2 builds two independent Box3x3 × 4 chain graphs (no shared data) and times the pair two ways. The async form lets the runtime overlap the two graphs; the sync form does not.
| Metric | Unit | Meaning |
|---|---|---|
| `graphs` | count | Number of independent graphs (2 in v1) |
| `sync_sequential_ms` | ms | `vxProcessGraph(g0); vxProcessGraph(g1)`: strict serial |
| `async_concurrent_ms` | ms | `vxScheduleGraph(g0); vxScheduleGraph(g1); vxWaitGraph(g0); vxWaitGraph(g1)`: runtime is free to overlap |
| `concurrency_speedup` | × | `sync_sequential_ms / async_concurrent_ms`. >1 = the runtime overlapped graphs, ≈ 1 = it serialized them, < 1 = async overhead exceeded any concurrency gain |
`concurrency_speedup < 1` at small resolutions is a real and useful signal: it means the implementation's async dispatch overhead exceeds any concurrency gain at that work size. The metric only rises above 1 when the per-graph work is large enough to amortize the async path.
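The two async ratios point in opposite directions (one is lower-is-better, the other higher-is-better), which the following Python sketch makes explicit. The function names and timings are hypothetical; the formulas follow the two metric tables above.

```python
def async_overhead_ratio(sync_ms, async_ms):
    """async_ms / sync_ms: lower is better, 1.0 means the async path is free."""
    return async_ms / sync_ms


def concurrency_speedup(sync_sequential_ms, async_concurrent_ms):
    """sync_sequential_ms / async_concurrent_ms: higher is better."""
    return sync_sequential_ms / async_concurrent_ms


# Hypothetical timings in milliseconds.
overhead = async_overhead_ratio(sync_ms=2.0, async_ms=2.2)
speedup = concurrency_speedup(sync_sequential_ms=4.0, async_concurrent_ms=2.5)
print(f"async_overhead_ratio={overhead:.2f} concurrency_speedup={speedup:.2f}")
```

In this made-up example the async path costs 10% more on a single graph, yet scheduling two graphs concurrently still finishes 1.6x sooner than running them back to back.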
Pipelined streaming via the optional `vx_khr_pipelining` extension is a future enhancement and is intentionally not implemented in this release; the two scenarios above use only standard OpenVX APIs and run on every conformant implementation.
```
=============================================================
Summary: 156 total | 156 passed | 0 skipped | 0 failed
OpenVX Vision Score: 1586.05 MP/s (156 benchmarks)
OpenVX Framework Score: 4.872x (geomean of 18 framework metrics)
vision Conformance: PASS (41/41)
vision Top-5 Fastest:
1. Not 26835.8 MP/s (graph, FHD)
2. Threshold_Binary 25550.0 MP/s (graph, VGA)
3. Threshold_Binary 25037.7 MP/s (graph, FHD)
4. Threshold_Range 21545.9 MP/s (graph, FHD)
5. Not 21533.7 MP/s (graph, VGA)
vision Top-5 Slowest:
1. LaplacianPyramid 727.501 ms (graph, 4K)
2. NonLinearFilter 580.589 ms (graph, 4K)
3. LaplacianPyramid 225.209 ms (graph, FHD)
4. FastCorners 191.288 ms (graph, 4K)
5. HarrisTracker 160.251 ms (graph, 4K)
=============================================================
```
| File | Description |
|---|---|
| `benchmark_results.json` | Full results with scores, conformance, scaling analysis, per-result timing stats |
| `benchmark_results.csv` | Tabular data for spreadsheet analysis |
| `benchmark_results.md` | Human-readable report with tables, top-10 lists, glossary |
- OpenVX Vision Score: Geometric mean of MP/s across all passing graph-mode vision benchmarks
- Enhanced Vision Score: Geometric mean when enhanced_vision benchmarks are included
- Category Sub-Scores: Per-category geometric mean (pixelwise, filters, color, etc.)
- OpenVX Framework Score: Equal-weight geometric mean (dimensionless, ×) of all `graph_speedup`, `virtual_dividend`, `parallelism_efficiency`, and `concurrency_speedup` values produced by the framework benchmarks. >1.0 means the OpenVX graph framework adds aggregate value over a kernel-only baseline. Lower-is-better metrics (e.g. `verify_per_node_ms`, `async_overhead_ratio`) are intentionally excluded so the score has a single monotonic interpretation. Only emitted when framework benchmarks are run (`--feature-set framework` or `--feature-set everything`).
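The scoring rule for the Framework Score can be sketched as a filtered geometric mean. This Python example is an illustration under assumptions: the input is a flat list of `(metric_name, value)` pairs, which is a hypothetical shape rather than the tool's actual report schema, and non-positive values are skipped so the logarithm stays defined.

```python
import math

# Higher-is-better ratios that the text above lists as score inputs.
SCORE_METRICS = {
    "graph_speedup",
    "virtual_dividend",
    "parallelism_efficiency",
    "concurrency_speedup",
}


def framework_score(metrics):
    """Equal-weight geometric mean of the score metrics.

    metrics: iterable of (name, value) pairs (hypothetical input shape).
    Returns None when no score metric is present.
    """
    values = [v for name, v in metrics if name in SCORE_METRICS and v > 0]
    if not values:
        return None
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Made-up values; async_overhead_ratio is lower-is-better and is ignored.
score = framework_score([
    ("graph_speedup", 2.0),
    ("virtual_dividend", 1.0),
    ("parallelism_efficiency", 0.5),
    ("async_overhead_ratio", 1.3),
])
print(score)
```

The same geometric-mean helper applies to the Vision Score if the inputs are per-benchmark MP/s values instead of ratios.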
Checks whether all available kernels in each feature set produced valid graph-mode results. Reports PASS/FAIL with a list of missing kernels.
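The conformance rule amounts to a set difference between the kernels the implementation exposes and the kernels that produced valid graph-mode results. A minimal Python sketch, with a hypothetical function name and made-up kernel lists:

```python
def check_conformance(available_kernels, valid_kernels):
    """Return (status, missing) where missing lists kernels without valid results."""
    missing = sorted(set(available_kernels) - set(valid_kernels))
    status = "PASS" if not missing else "FAIL"
    return status, missing


# Hypothetical feature set with three available kernels, one of which failed.
status, missing = check_conformance(["And", "Box3x3", "Not"], ["And", "Not"])
print(status, missing)
```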
Run the benchmark on two different implementations, then compare the JSON reports:

```bash
# Run on Vendor A
./openvx-mark --output-dir results_vendor_a
# Run on Vendor B (different machine/implementation)
./openvx-mark --output-dir results_vendor_b
# Compare
./openvx-mark --compare results_vendor_a/benchmark_results.json,results_vendor_b/benchmark_results.json
```

This generates a `comparison.md` with a side-by-side table showing median latency, throughput, and % change for each benchmark.

A Python comparison script is also provided for more flexibility:

```bash
python3 scripts/compare_reports.py results_vendor_a/benchmark_results.json \
    results_vendor_b/benchmark_results.json \
    --output comparison
```

| Metric | Description |
|---|---|
| Median (ms) | Median wall-clock execution time across all iterations (50th percentile). More stable than mean for benchmarking. |
| Mean (ms) | Arithmetic mean of wall-clock execution times. |
| Min (ms) | Fastest observed execution time (best case). |
| Max (ms) | Slowest observed execution time (worst case). |
| StdDev (ms) | Standard deviation of execution times after IQR outlier removal. |
| P5/P95/P99 (ms) | 5th, 95th, and 99th percentile execution times from the raw (pre-outlier-removal) samples. |
| CV% | Coefficient of Variation — (stddev / mean) * 100. Lower values indicate more stable/repeatable results. |
| MP/s | Megapixels per second — (width * height) / median_time / 1e6. Primary throughput metric. |
| Samples | Number of timing samples after IQR outlier removal. |
| Outliers | Number of samples removed by the IQR (Interquartile Range) method. |
| Peak (ms) | Best-case execution time (min_ns). Represents peak achievable performance. |
| Sustained (ms) | Typical execution time (median_ns). Represents sustained real-world performance. |
| Sustained Ratio | min_ns / median_ns. Values near 1.0 indicate consistent performance; lower values suggest variance from caching, scheduling, or thermal effects. |
| Scaling Efficiency | (MP/s at high res) / (MP/s at low res). 1.0 = perfect scaling; values below 1.0 indicate memory or bandwidth bottlenecks at higher resolutions. |
| Vision Score | Geometric mean of MP/s across all passing graph-mode vision benchmarks. Single-number summary for cross-vendor comparison. |
| Framework Score | Equal-weight geometric mean (×, dimensionless) of all graph_speedup, virtual_dividend, parallelism_efficiency, and concurrency_speedup values produced by framework benchmarks. >1.0 means the OpenVX graph framework adds aggregate value over a kernel-only baseline. Only emitted when framework benchmarks are run. |
| Stability Warning | Flagged when CV% exceeds the stability threshold (default: 15%). Indicates the result may not be reliable — increase iterations or reduce system load. |
| Conformance | Whether all available kernels in a feature set produced valid graph-mode results. PASS = all kernels benchmarked successfully. |
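Several glossary entries can be tied together in one short Python sketch. It is not the tool's implementation: the 1.5 × IQR fence, the quartile method used by `statistics.quantiles`, and computing the CV% mean from the filtered samples are all assumptions, and the function names are hypothetical.

```python
import statistics


def summarize(samples_ms, width, height):
    """Compute a few glossary metrics from raw per-iteration timings in ms."""
    raw = sorted(samples_ms)
    q1, _, q3 = statistics.quantiles(raw, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # assumption: 1.5 x IQR fences
    kept = [s for s in raw if low <= s <= high]

    median_ms = statistics.median(raw)           # median over all iterations
    stddev_ms = statistics.stdev(kept)           # after outlier removal
    mean_ms = statistics.fmean(kept)             # assumption: filtered mean
    return {
        "median_ms": median_ms,
        "cv_percent": stddev_ms / mean_ms * 100.0,
        # (width * height) / median time in seconds / 1e6
        "mp_per_s": (width * height) / (median_ms / 1000.0) / 1e6,
        "sustained_ratio": raw[0] / median_ms,   # min / median
        "samples": len(kept),
        "outliers": len(raw) - len(kept),
    }


def scaling_efficiency(mp_per_s_low_res, mp_per_s_high_res):
    """(MP/s at high res) / (MP/s at low res); 1.0 means perfect scaling."""
    return mp_per_s_high_res / mp_per_s_low_res
```

For example, an FHD (1920 × 1080) benchmark with a 2 ms median processes about 2.07 megapixels per iteration, which works out to roughly 1037 MP/s.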
```
openvx-mark/
├── CMakeLists.txt                  # Build system
├── cmake/
│   └── FindOpenVX.cmake            # Vendor-agnostic OpenVX discovery
├── include/
│   ├── benchmark_config.h          # Configuration and defaults
│   ├── benchmark_context.h         # OpenVX context wrapper
│   ├── benchmark_report.h          # Report generation + analytics
│   ├── benchmark_runner.h          # Benchmark execution engine
│   ├── benchmark_stats.h           # Statistical computation
│   ├── benchmark_timer.h           # High-resolution timing
│   ├── kernel_registry.h           # OpenVX kernel catalog + availability probing
│   ├── resource_tracker.h          # RAII resource management
│   ├── system_info.h               # Host system information
│   └── test_data_generator.h       # Deterministic test data generation
├── scripts/
│   └── compare_reports.py          # Python cross-vendor comparison tool
└── src/
    ├── main.cpp                    # CLI entry point
    ├── benchmark_context.cpp
    ├── benchmark_runner.cpp        # Graph/immediate mode execution + stability gating
    ├── benchmark_report.cpp        # JSON/CSV/Markdown generation + analytics
    ├── benchmark_stats.cpp         # Percentiles, IQR outlier removal
    ├── benchmark_timer.cpp
    ├── kernel_registry.cpp         # 60 standard kernel definitions
    ├── system_info.cpp             # Cross-platform system info collection
    ├── test_data_generator.cpp     # Random image/tensor/auxiliary object creation
    └── benchmarks/
        ├── node_pixelwise.cpp      # And, Or, Xor, Not, AbsDiff, Add, Subtract, Multiply, Min, Max, Copy
        ├── node_filters.cpp        # Box3x3, Gaussian3x3, Median3x3, Erode3x3, Dilate3x3, Sobel3x3, CustomConvolution, NonLinearFilter
        ├── node_color.cpp          # ColorConvert, ChannelExtract, ChannelCombine, ConvertDepth
        ├── node_geometric.cpp      # ScaleImage, WarpAffine, WarpPerspective, Remap
        ├── node_statistical.cpp    # Histogram, EqualizeHist, MeanStdDev, MinMaxLoc, IntegralImage
        ├── node_multiscale.cpp     # GaussianPyramid, LaplacianPyramid, HalfScaleGaussian
        ├── node_feature.cpp        # CannyEdgeDetector, HarrisCorners, FastCorners, OpticalFlowPyrLK
        ├── node_extraction.cpp     # MatchTemplate, LBP, NonMaxSuppression
        ├── node_tensor.cpp         # TensorAdd, TensorSub, TensorMul, TensorTranspose, TensorConvertDepth, TensorTableLookup
        ├── node_misc.cpp           # Magnitude, Phase, TableLookup, Threshold, WeightedAverage, Select
        ├── immediate_benchmarks.cpp # vxu* immediate-mode variants
        ├── pipeline_vision.cpp     # EdgeDetection, SobelMagnitudePhase, MorphologyOpen/Close, DualFilter
        └── pipeline_feature.cpp    # HistogramEqualize, HarrisTracker, ThresholdedEdge
```
This project is licensed under the MIT License. See LICENSE for details.