Add XeGPU matrix multiplication benchmark #15

@tkarna

Description

This issue outlines the required steps to add an XeGPU matrix multiplication benchmark.

  1. mlir-gen should be a Python module.
    • Move it into the Python source tree as a module, e.g., python/lighthouse/payload_generator/payload_generator.py
    • Add a command-line interface, e.g., python/lighthouse/payload_generator/cli.py
    • Expose it as an executable script, e.g., mlir-gen, in pyproject.toml (a minimal CLI sketch follows this item):
      [project.scripts]
      mlir-gen = "lighthouse.payload_generator.cli:main"
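      A minimal sketch of the CLI shim, assuming argparse; the generate()
      function and its signature are hypothetical placeholders, not a settled
      API:

      # python/lighthouse/payload_generator/cli.py (sketch)
      import argparse

      from lighthouse.payload_generator.payload_generator import generate  # hypothetical

      def main():
          parser = argparse.ArgumentParser(
              prog="mlir-gen", description="Generate payload MLIR.")
          parser.add_argument("--sizes", type=int, nargs=3, metavar=("M", "N", "K"),
                              default=[4096, 4096, 4096], help="matmul problem sizes")
          args = parser.parse_args()
          print(generate(sizes=tuple(args.sizes)))  # emit the MLIR module to stdout

      if __name__ == "__main__":
          main()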
  2. Generic infrastructure to define and execute workloads
    • A Workload object to hold the payload IR and fixed metadata (e.g. problem size).
    • Provides methods for obtaining the payload IR, the schedule IR, and the input arguments for invoking the payload, plus (optionally) correctness verification; a sketch of the interface follows this item.
    • A generic execution_engine wrapper that can execute a Workload, e.g., running it in a timer loop for benchmarking.
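      A minimal sketch of the Workload interface, assuming an abstract base
      class; the method names are illustrative rather than final:

      # python/lighthouse/workload.py (sketch)
      import abc

      class Workload(abc.ABC):
          """Payload IR plus fixed metadata (e.g., problem size)."""

          @abc.abstractmethod
          def payload_ir(self) -> str:
              """Return the payload module as MLIR text."""

          @abc.abstractmethod
          def schedule_ir(self) -> str:
              """Return the transform schedule as MLIR text."""

          @abc.abstractmethod
          def input_args(self):
              """Return the buffers/scalars to pass when invoking the payload."""

          def verify(self, result) -> bool:
              """Optional correctness check; accepts any result by default."""
              return True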
  3. XeGPU matmul Workload and benchmarking tools
    • Currently supports only matmul + elementwise post ops.
    • Location: python/lighthouse/benchmark/xegpu/matmul.py (?)
    • Generates payload, defines lowering schedule, compiles, executes, measures performance.
    • Reuses existing functionality: the workload abstraction, the payload generator, the execution tools, etc.
    • Exposed as another CLI command, e.g., benchmark_xegpu_matmul; a composition sketch follows this item.
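      A rough sketch of how the pieces could compose; XeGPUMatmulWorkload,
      ExecutionEngine, and their signatures are hypothetical names for the
      Workload subclass and the generic wrapper from step 2:

      # python/lighthouse/benchmark/xegpu/matmul.py (sketch)
      def benchmark(sizes, dtypes, wg_tile, sg_tile, repetitions=100):
          # Generate payload + schedule for the requested configuration.
          workload = XeGPUMatmulWorkload(sizes=sizes, dtypes=dtypes,
                                         wg_tile=wg_tile, sg_tile=sg_tile)
          engine = ExecutionEngine(workload)      # generic wrapper from step 2
          elapsed_ms = engine.time(repetitions)   # timer loop, returns milliseconds
          m, n, k = sizes
          gflops = 2.0 * m * n * k / (elapsed_ms * 1e6)  # 2*M*N*K FLOPs per matmul
          print(f"sizes={m},{n},{k} time(ms): {elapsed_ms:.2f} GFLOPS: {gflops:.0f}")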
  4. Installation mechanism for XeGPU support:
    • Compile LLVM with LevelZero runtime and necessary flags.
    • Hook up LLVM Python bindings with Lighthouse.
      • Simply set PYTHONPATH="$LLVM_INSTALL_PATH/python_packages/mlir_core/"
    • Provide an easy-to-use install mechanism.
      • Build instructions in README(?).
      • Or a generic build script, invoked on install(?).
        • mlir-python-bindings Python package is not installed in this case(?).
      • When executing kernels, raise a descriptive error if the Xe GPU device or drivers are missing (see the sketch below).
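      A sketch of the device probe, assuming checking for the Level Zero
      loader library (libze_loader) is an adequate test; the exact probe will
      depend on how the runtime is wired up:

      import ctypes

      def require_xe_gpu():
          """Raise a descriptive error when no Xe GPU runtime is available."""
          try:
              ctypes.CDLL("libze_loader.so")  # Level Zero loader must be installed
          except OSError as err:
              raise RuntimeError(
                  "No Intel Xe GPU runtime found: install the Level Zero loader "
                  "and GPU drivers before running XeGPU benchmarks."
              ) from err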

Example usage of the benchmark command-line tool:

$ benchmark_xegpu_matmul
sizes=4096,4096,4096 dt=f16,f32 wg-tile=256,256 sg-tile=32,32 ... time(ms): 1.76 GFLOPS: 78072
$ benchmark_xegpu_matmul --sizes 4096 2048 1024 --wg-tile-size 256 256 ...
...
$ benchmark_xegpu_matmul --dump-kernel {initial,tiled,vectorized,bufferized,xegpu-wg,...}
<prints payload IR at this stage of lowering>
$ benchmark_xegpu_matmul --dump-kernel xegpu-wg --dump-schedule
<also dumps the transform schedule used to lower to this level>
