Skip to content

v3.4-rc

Pre-release
Pre-release
Compare
Choose a tag to compare
@harrymao2022 harrymao2022 released this 13 Feb 22:08
· 78 commits to rls-v3.4 since this release

Performance Optimizations

  • Intel Architecture Processors:

    • Improved performance for 4th generation Intel Xeon Scalable processors (formerly Sapphire Rapids).
    • Improved performance for the future Intel Xeon Scalable processors (code-named Sierra Forest and Granite Rapids). These optimizations are now included by default on compatible processors.
    • Improved RNN primitive performance with LBR_GRU cell.
    • Improved softmax performance on processors with Intel AVX2 or Intel AVX-512 instruction set support.
    • Improved fp32 inner product performance on processors with Intel AVX2 instruction set support.
    • Improved fp32, fp16, bf16 matmul primitive performance on processors with Intel AVX-512 and Intel AMX instruction set support.
    • Improved int8 matmul performance with transposed A tensor.
    • Improved performance of resampling primitive on processors with Intel AVX2 instruction set support.
    • Improved performance of int8 convolution with post-ops.
    • Optimized batch matmul with binary post-op and broadcast mask 1 and 14.
    • Improved the Scaled Dot Product Attention (SDPA) subgraph performance with Graph API.
    • Improved performance of subgraphs including matmul and add operations and mixed int8 and bfloat16 data types with Graph API.
    • [experimental] Improved performance of reduction, softmax and layernorm operations with experimental Graph Compiler backend.
    • [experimental] Improved performance for llama2 MLP subgraph with experimental Graph Compiler backend.
  • Intel Graphics Products:

    • Introduced initial optimizations for Processor Graphics based on Xe2 architecture.
    • Improved performance for the Intel Data Center GPU Max Series (formerly Ponte Vecchio).
    • Improved performance for Intel Arc graphics (formerly Alchemist and DG2) and the Intel Data Center GPU Flex Series (formerly Arctic Sound).
    • Improved matmul performance for cases relevant to Large Language Models (LLMs) and Transformer-like models.
    • Improved convolution performance for cases relevant to the Stable Diffusion model.
    • Improved RNN primitive performance.
    • Improved pooling forward propagation performance.
    • Improved batched matmul performance for cases with 5 dimensions or more.
  • AArch64-based Processors:

    • Added an option to build oneDNN with macOS Accelerate library to improve performance on Apple silicon.
    • Improved reorder primitive performance with Compute Library for the Arm architecture (ACL).
    • Improved bf16 inner product product primitive performance with ACL.

Functionality

  • Introduced GPT-Q support to improve Large Language Models (LLMs) performance with compressed weights. Optimized implementation is available for Intel Graphics Products and support matmul with int8 wight compression.
  • Introduced fp8 data type support in primitives and Graph API. Optimized implementation is available for Intel Data Center GPU Max Series (formerly Ponte Vecchio).
  • Introduced support for fp16 and bf16 scale and shift arguments for layer normalization. Optimized implementation is available for Intel Graphics Products.
  • [experimental] Introduced unstructured sparsity support for processors with Intel AMX support relying on VCOMPRESS/VPEXPAND instructions.
  • Intel Graphics Products
    • Introduced PReLU post-op support for inner product and matmul primitives.

Usability

  • Added opt-in deterministic mode support. Deterministic mode guarantees that results are bitwise identical between runs in a fixed environment.
  • Introduced accumulation mode control.
  • Extended oneDNN verbose diagnostics with information on dispatching decisions in convolution and matmul implementations.
  • Extended verbose diagnostics for Graph API with information for operation schema check results and pattern matching results.
  • Reduced RNN primitive memory consumption on GPUs.
  • Added examples demonstrating use of oneDNN Graph API in eager mode use cases.
  • Extended tensor constructor in Graph API to support memory allocation and management by the library.
  • Introduced new API and environment variable to manage Graph API constant tensor cache capacity.
  • Improved the efficiency of pattern matching in Graph API by optimizing pattern registration, reducing pattern numbers, and skipping patterns more wisely.
  • Changed default optimization flags for AArch64 builds to -mcpu=generic to improve portability.

Validation

  • Improved benchdnn performance by optimizing bottlenecks in validation code.
  • Introduced --num-streams knob in benchdnn to support benchmarking in multi-stream scenarios.

Breaking Changes

  • Updated minimal supported ACL version to 23.11 (was 23.02.1).

Thanks to these Contributors

This release contains contributions from the project core team as well as Alexander Grund @Flamefire, David Svantesson @davsva01, Fadi Arafeh @fadara01, Hugh Delaney @hdelan, Ilya Lavrenov @ilya-lavrenov, Jacob Kahn @jacobkahn, Nathan John Sircombe @nSircombe, Renato Barros Arantes @renato-arantes, Sergey Shalnov @shssf, Sunita Nadampalli @snadampal, and Svetlozar Georgiev @sgeor255. We would also like to thank everyone who asked questions and reported issues.