
@m4rs-mt released this on Feb 14, 2021

The new stable version offers significant performance improvements to the generated kernel programs and contains critical resource-deallocation fixes (get the NuGet package).

It is strongly recommended to upgrade to this version as soon as possible to avoid resource- and GC-related deallocation issues.

Breaking changes

  • The inheritance hierarchy of the ExchangeBuffer class has been changed to avoid exposing internal memory buffers. If your code relied on ExchangeBufferBase inheriting directly from MemoryBuffer, you have to adapt it to use the intermediate base class MemoryBuffer<T, TIndex> instead (see diff).
  • Properties exposing the internal memory buffers of the high-level MemoryBufferXD classes have been removed to avoid ownership-related issues in which the GC frees buffers that are still in use (see diff).

Why are there breaking changes?

We have decided to remove dangerous properties from several memory buffer classes. Using these properties can lead to program crashes, since the GC could dispose the exposed buffers asynchronously in the background without further notice.

Changes

  • Improved performance of kernel launchers by passing packed argument structures (#358, #372).
  • Graduated several optimizations from O2 to O1 (release mode) to improve performance in release builds using an additional set of stable optimization passes (#344).
  • Graduated O2 optimizations in the Cuda backend to the O1 pipeline to generate vectorized IO operations in release builds (#350).
  • Added support for the managed sizeof IL instruction (#380).
  • Added PrintInformation method to Accelerator instances to print detailed accelerator information (#389).
  • Added enhanced assertions and out-of-bounds checks to all ArrayView accesses on GPU devices. To enable assertion checks, use the flag ContextFlags.EnableAssertions or attach a debugger to your application; use the portable debug information format to get detailed source location information (#375).
  • Added support for printf-like output in kernels for CPU, Cuda and OpenCL accelerators (#342).
  • Added new utility Launch/LaunchAutoGrouped methods to immediately launch kernels using a separate strong-reference cache (#336).
  • Added new AlignTo alignment methods to explicitly align ArrayView instances to a particular alignment in bytes (#316).
  • Added enhanced support for local memory via a new LocalMemory class (#316).
  • Added support for several PopCount, CLZ and CTZ operations (#324).
  • Added new MemSet functions to all memory buffers (#338).
  • Added a new IfConditionalConversion to the O2 pipeline to fold nested and-also and or-else block chains (#328).
  • Added new local memory optimizations to simplify array accesses (#317).
  • Added a simple 64-bit LongGlobalIndex helper to simplify correct index computations using 64-bit integers (#337).
  • Added new CLPlatformVersion and fixed OpenCL 1.2 compatibility issues (#335).
  • Removed support for .NET Core 2.0 (#353).
  • Prevented using SharedMemory in implicitly grouped kernels (#354).
  • Prevented CudaAccelerator and CLAccelerator instances from running on non-native OS .NET versions (#396).
  • Fixed critical GC-related resource deallocation issues (#376, #393).
  • Fixed the length returned for dynamic shared memory buffers (#357).
  • Fixed invalid alignment information in the presence of reinterpret casts (#386).
  • Fixed invalid address computations of fixed array buffers (#361).
  • Fixed invalid PTX calling convention (#362).
  • Fixed edge cases in LoopUnrolling (#373).
  • Fixed invalid printf formats for int64 and uintX types (#391).
  • Fixed invalid DebugArrayView implementations (#345).
  • Fixed invalid initializations of local memory arrays (#287).
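
To illustrate why a 64-bit index helper such as LongGlobalIndex matters, the Python sketch below computes a flattened global index as group_index * group_size + thread_index once with emulated 32-bit signed wraparound and once with full-width integers. The function names and launch dimensions are hypothetical; this is a model of the arithmetic only, not ILGPU's implementation or API.

```python
def to_int32(value: int) -> int:
    """Wrap an arbitrary Python int to a signed 32-bit integer (two's complement)."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value >= (1 << 31) else value


def global_index_32(group_index: int, group_size: int, thread_index: int) -> int:
    """Flattened index computed with 32-bit arithmetic, as an int-based index would be."""
    return to_int32(to_int32(group_index * group_size) + thread_index)


def global_index_64(group_index: int, group_size: int, thread_index: int) -> int:
    """Same computation without 32-bit wraparound (fits easily into 64 bits here)."""
    return group_index * group_size + thread_index


# Hypothetical launch: 2**22 groups of 1024 threads address 2**32 elements.
last_thread = (2**22 - 1, 1024, 1023)
print(global_index_32(*last_thread))  # -1 (wrapped around)
print(global_index_64(*last_thread))  # 4294967295
```

The 32-bit result becomes negative once the product exceeds the signed 32-bit range, which is the class of error a 64-bit index type is meant to avoid.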

Major internal changes:

  • Removed singleton instance of RuntimeSystem to avoid concurrency/reflection-API issues (#393).
  • Updated default optimizations for ILGPU debug builds (#384).
  • Added support for unit tests running on .NET Framework 4.7 (#355).
  • Migrated from FxCop analyzers to .NET analyzers (#352).
  • Redesigned internal address-space inference passes (#364).

Special thanks

Special thanks to @MoFtZ, @Ruberik and @jgiannuzzi for their contributions to this release and to the entire ILGPU community for providing feedback, submitting issues and feature requests.

Pre-release

@m4rs-mt released this on Jan 25, 2021

This new beta version offers important bug fixes, performance improvements to the generated kernel programs, and a set of new features (get the NuGet package).

  • Improved performance of kernel launchers by passing packed argument structures (#358, #372).
  • Added support for managed sizeof IL instruction (#380).
  • Added PrintInformation method to Accelerator instances to print detailed accelerator information (#389).
  • Added enhanced assertions and out-of-bounds checks to all ArrayView accesses on GPU devices. To enable assertion checks, use the flag ContextFlags.EnableAssertions or attach a debugger to your application; use the portable debug information format to get detailed source location information (#375).
  • Removed support for .NET Core 2.0 (#353).
  • Prevented using SharedMemory in implicitly grouped kernels (#354).
  • Prevented CudaAccelerator and CLAccelerator instances from running on non-native OS .NET versions (#396).
  • Fixed critical GC-related resource deallocation issues (#376, #393).
  • Fixed the length returned for dynamic shared memory buffers (#357).
  • Fixed invalid alignment information in the presence of reinterpret casts (#386).
  • Fixed invalid address computations of fixed array buffers (#361).
  • Fixed invalid PTX calling convention (#362).
  • Fixed edge cases in LoopUnrolling (#373).
  • Fixed invalid printf formats for int64 and uintX types (#391).

Major internal changes:

  • Removed singleton instance of RuntimeSystem to avoid concurrency/reflection-API issues (#393).
  • Updated default optimizations for ILGPU debug builds (#384).
  • Added support for unit tests running on .NET Framework 4.7 (#355).
  • Migrated from FxCop analyzers to .NET analyzers (#352).
  • Redesigned internal address-space inference passes (#364).

Special thanks to @MoFtZ, @Ruberik for their contributions to this release and to the entire ILGPU community for providing feedback, submitting issues and feature requests.

Pre-release

@m4rs-mt released this on Dec 10, 2020

This new beta version offers significant performance improvements to the generated kernel programs and a set of new features (get the NuGet package).

  • Graduated several optimizations from O2 to O1 (release mode) to improve performance in release builds using an additional set of stable optimization passes (#344).
  • Graduated O2 optimizations in the Cuda backend to the O1 pipeline to generate vectorized IO operations in release builds (#350).
  • Added support for printf-like output in kernels for CPU, Cuda and OpenCL accelerators (#342).
  • Added new utility Launch/LaunchAutoGrouped methods to immediately launch kernels using a separate strong-reference cache (#336).
  • Added new AlignTo alignment methods to explicitly align ArrayView instances to a particular alignment in bytes (#316).
  • Added enhanced support for local memory via a new LocalMemory class (#316).
  • Added support for several PopCount, CLZ and CTZ operations (#324).
  • Added new MemSet functions to all memory buffers (#338).
  • Added a new IfConditionalConversion to the O2 pipeline to fold nested and-also and or-else block chains (#328).
  • Added new local memory optimizations to simplify array accesses (#317).
  • Added a simple 64-bit LongGlobalIndex helper to simplify correct index computations using 64-bit integers (#337).
  • Added new CLPlatformVersion and fixed OpenCL 1.2 compatibility issues (#335).
  • Fixed invalid DebugArrayView implementations (#345).
  • Fixed invalid initializations of local memory arrays (#287).
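
For readers unfamiliar with the bit operations added in #324, the Python sketch below spells out the usual 32-bit semantics of population count (PopCount), count leading zeros (CLZ) and count trailing zeros (CTZ). It is a reference model only: the helper names are hypothetical and do not show ILGPU's API. Returning the bit width (32) for CLZ and CTZ of zero is an assumed convention that matches common intrinsics such as CUDA's __clz.

```python
WIDTH = 32
MASK = (1 << WIDTH) - 1


def popcount(x: int) -> int:
    """Number of set bits in the 32-bit value x."""
    return bin(x & MASK).count("1")


def clz(x: int) -> int:
    """Number of leading zero bits in the 32-bit value x; 32 when x == 0."""
    x &= MASK
    return WIDTH - x.bit_length()


def ctz(x: int) -> int:
    """Number of trailing zero bits in the 32-bit value x; 32 when x == 0."""
    x &= MASK
    if x == 0:
        return WIDTH
    # x & -x isolates the lowest set bit; its bit position is the trailing-zero count.
    return (x & -x).bit_length() - 1


value = 0b1011_0000
print(popcount(value), clz(value), ctz(value))  # 3 24 4
```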

Special thanks to @MoFtZ and @jgiannuzzi for their contributions to this release and to the entire ILGPU community for providing feedback, submitting issues and feature requests.


@m4rs-mt released this on Nov 22, 2020

The new stable version offers significant performance improvements to the generated kernel programs (get the NuGet package).

  • Added new convenience Launch methods to Accelerator class to launch kernels without pre-loading/compiling them (#319).
  • Changed the default inlining behavior to AggressiveInlining to improve the performance of (usually) performance-critical GPU programs (#294).
  • Significantly improved performance of Cuda programs in many cases using a new control-flow scheduling algorithm that can be enabled via O2 or the flag ContextFlags.EnhancedPTXBackendFeatures (#274, #303).
  • Added support for RTX 30xx cards (#302, #305, #311).
  • Added support for tuple-types in kernel functions (#266).
  • Added support for Span<T> in MemoryBuffer copy operations (#122, #276).
  • Added a new Capability API to enable specific extensions in OpenCL programs and to provide better error messages (#103, #279).
  • Added new arithmetic simplifications to enhance the optimization potential of the ILGPU optimization pipeline (#278, #283).
  • Added support for unrolling of loop nests to improve performance (#281).
  • Added new loop invariant code motion (LICM) code transformation to reduce the code size and enable more aggressive optimizations in O2 mode (#291).
  • Enhanced alignment of local and shared-memory allocations in the PTX backend to emit fast vectorized instructions in many more cases (#304).
  • Improved alignment of padding in fixed-size structures (#315).
  • Fixed invalid Unix OpenCL library names (#327).
  • Fixed calling ambiguous OpenCL 64-bit atomic functions (#321).
  • Fixed invalid unrolling of loops in some cases (#292).
  • Fixed invalid loading of unsigned fields from structures (#314).
  • Fixed invalid handling of FP16 types on unsupported devices (#312).
  • Fixed invalid constant folding of LHS constants in compare operations (#326).
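
Loop-invariant code motion (#291) hoists computations whose operands do not change between iterations out of a loop. The Python sketch below shows the idea at the source level on a hypothetical scale-and-offset loop; ILGPU applies the transformation to its own IR, which this example does not model.

```python
def transform_before(values: list[float], scale: float, bias: float) -> list[float]:
    """Original loop: the factor is recomputed in every iteration."""
    out = []
    for v in values:
        factor = scale * 2.0 + bias  # loop-invariant: depends only on scale and bias
        out.append(v * factor)
    return out


def transform_after(values: list[float], scale: float, bias: float) -> list[float]:
    """After LICM: the invariant factor is computed once, before the loop."""
    factor = scale * 2.0 + bias  # hoisted out of the loop
    out = []
    for v in values:
        out.append(v * factor)
    return out


print(transform_after([1.0, 2.0, 3.0], 1.5, 1.0))  # [4.0, 8.0, 12.0]
```

Hoisting is safe here because the expression has no side effects and cannot fail; a compiler must establish such guarantees before moving code out of a loop.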

Major internal changes:

  • Enhanced unreachable code elimination to be compatible with the latest optimization pipeline (#300).
  • Fixed invalid detection of entry and exit blocks in Loop analysis (#293).
  • Added additional debugging capabilities via new dumper methods (#282).

Special thanks to @MoFtZ for his contributions to this release and to the entire ILGPU community for providing feedback, submitting issues and feature requests.

Pre-release

@m4rs-mt released this on Nov 1, 2020

This new beta version offers significant performance improvements to the generated kernel programs (get the NuGet package).

  • Changed the default inlining behavior to AggressiveInlining to improve the performance of (usually) performance-critical GPU programs (#294).
  • Significantly improved performance of Cuda programs in many cases using a new control-flow scheduling algorithm that can be enabled via O2 or the flag ContextFlags.EnhancedPTXBackendFeatures (#274, #303).
  • Added support for RTX 30xx cards (#302, #305).
  • Added support for tuple-types in kernel functions (#266).
  • Added support for Span<T> in MemoryBuffer copy operations (#122, #276).
  • Added a new Capability API to enable specific extensions in OpenCL programs and to provide better error messages (#103, #279).
  • Added new arithmetic simplifications to enhance the optimization potential of the ILGPU optimization pipeline (#278, #283).
  • Added support for unrolling of loop nests to improve performance (#281).
  • Added new loop invariant code motion (LICM) code transformation to reduce the code size and enable more aggressive optimizations in O2 mode (#291).
  • Enhanced alignment of local and shared-memory allocations in the PTX backend to emit fast vectorized instructions in many more cases (#304).
  • Fixed invalid unrolling of loops in some cases (#292).

Major internal changes:

  • Enhanced unreachable code elimination to be compatible with the latest optimization pipeline (#300).
  • Fixed invalid detection of entry and exit blocks in Loop analysis (#293).
  • Added additional debugging capabilities via new dumper methods (#282).

Special thanks to @MoFtZ for his contributions to this release and to the entire ILGPU community for providing feedback, submitting issues and feature requests.


@m4rs-mt released this on Oct 1, 2020

The new stable version offers significant performance improvements to the generated kernel programs (get the NuGet package).

  • Added initial loop unrolling capabilities for innermost loops (#259).
  • Added new address-space specializer to infer the actual address spaces of memory accesses (#247).
  • Added several code simplification techniques to improve generated kernel programs (#268, #270, #271).
  • Added support for FP16x2 (Half2) types (#273).
  • Added support for non-capturing lambda kernels (#186).
  • Added additional copy operations to ExchangeBuffer (#255).
  • Enhanced generation of vectorized IO instructions in the PTX backend using new alignment rules (#247, #260).
  • Fixed invalid accelerator synchronization in OpenCL (#246).
  • Fixed invalid sign extension of byte and ushort values in the context of method calls (#239).
  • Fixed invalid handling of unsafe array buffers in several cases (#262, #263, #285).
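
Loop unrolling (#259) replicates a loop body so that fewer counter updates and branches are executed per element. The Python sketch below unrolls a hypothetical summation loop by a factor of 4 and handles leftover iterations in a remainder loop; it only illustrates the transformation and is not how ILGPU's unroller is implemented.

```python
def sum_rolled(values: list[int]) -> int:
    """Original innermost loop: one element per iteration."""
    total = 0
    for i in range(len(values)):
        total += values[i]
    return total


def sum_unrolled_by_4(values: list[int]) -> int:
    """Same loop unrolled by 4, with a remainder loop for the last n % 4 elements."""
    total = 0
    n = len(values)
    main_end = n - n % 4  # largest multiple of 4 not exceeding n
    i = 0
    # Main loop: four copies of the body, one counter update per iteration.
    while i < main_end:
        total += values[i]
        total += values[i + 1]
        total += values[i + 2]
        total += values[i + 3]
        i += 4
    # Remainder loop: handles the elements left over after the unrolled part.
    while i < n:
        total += values[i]
        i += 1
    return total


print(sum_unrolled_by_4(list(range(10))))  # 45
```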

Major internal changes:

  • Added new, enhanced loop-analysis classes that provide detailed insights into loops in ILGPU programs (#259).
  • Refactored the internal static-program analysis framework (#247).
  • Updated native DLL-interop API (#249).
  • Fixed code analysis warnings (#248).

Special thanks to @MoFtZ, @Yey007 and @LxBos for their contributions to this release and to the entire ILGPU community for providing feedback, submitting issues and feature requests.

Pre-release

@m4rs-mt released this on Sep 21, 2020

This new beta version offers significant performance improvements to the generated kernel programs (get the NuGet package).

  • Added initial loop unrolling capabilities for innermost loops (#259).
  • Added new address-space specializer to infer the actual address spaces of memory accesses (#247).
  • Added several code simplification techniques to improve generated kernel programs (#268, #270, #271).
  • Added support for FP16x2 (Half2) types (#273).
  • Added support for non-capturing lambda kernels (#186).
  • Added additional copy operations to ExchangeBuffer (#255).
  • Enhanced generation of vectorized IO instructions in the PTX backend using new alignment rules (#247, #260).
  • Fixed invalid accelerator synchronization in OpenCL (#246).
  • Fixed invalid sign extension of byte and ushort values in the context of method calls (#239).
  • Fixed invalid handling of unsafe array buffers in several cases (#262, #263).

Major internal changes:

  • Added new, enhanced loop-analysis classes that provide detailed insights into loops in ILGPU programs (#259).
  • Refactored the internal static-program analysis framework (#247).
  • Updated native DLL-interop API (#249).
  • Fixed code analysis warnings (#248).

Special thanks to @MoFtZ, @Yey007 and @LxBos for their contributions to this release and to the entire ILGPU community for providing feedback, submitting issues and feature requests.


@m4rs-mt released this on Jan 3, 2021

This new stable version offers significant performance and code-quality improvements to the generated kernel programs.

  • Fixed invalid range checks in memory buffer implementations.
  • Fixed invalid 32-bit offsets in memory buffer implementations.
  • Fixed if-conversion transformation generating invalid programs in some cases (#232, #233).
  • Fixed code-analyses issues that could cause invalid analysis results (#220).
  • Added support for 64-bit length buffers and views (#196, #210, #215, #216).
    Note that this feature includes breaking changes that might affect existing code bases. Please refer to the upgrade guide for more information.
  • Added new if-conversion transformation to improve performance (#183).
  • Added support for 16-bit float (Half) types (#180, #208).
  • Added initial support for fixed array buffers (#200).
  • Added support for non-capturing lambda kernels (#79, #136).
  • Added support for multidimensional ExchangeBuffers (#148).
  • Extended ExchangeBuffers to support conversions to Span and Memory instances (#122).
  • Fixed invalid lowering of arrays in divergent control flow (#201).
  • Fixed invalid handling of prefixed IL instructions (#204, #211).
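
If-conversion (#183) replaces a small branch with a branch-free select, which can avoid divergent control flow between GPU threads. The Python sketch below contrasts a branching clamp with a select-based version; the function names are hypothetical, and the mask-based select only models a conditional-select instruction rather than any ILGPU API.

```python
def clamp_branching(x: int, limit: int) -> int:
    """Original control flow: a conditional branch picks the result."""
    if x > limit:
        return limit
    return x


def select(condition: bool, if_true: int, if_false: int) -> int:
    """Branch-free select on integers using an all-ones / all-zeros mask."""
    mask = -int(condition)  # -1 (all bits set) if condition else 0
    return (if_true & mask) | (if_false & ~mask)


def clamp_if_converted(x: int, limit: int) -> int:
    """After if-conversion: evaluate the condition, then select without branching."""
    return select(x > limit, limit, x)


print([clamp_if_converted(v, 5) for v in (3, 5, 9, -2)])  # [3, 5, 5, -2]
```

Both inputs of the select are computed unconditionally, so the transformation pays off only when the branch arms are cheap, which is why it targets small branches.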

Special thanks to @MoFtZ, @Yey007 and @jgiannuzzi for contributing to this release.

Pre-release

@m4rs-mt released this on Jan 3, 2021

  • Added support for 64-bit length buffers and views (#196, #210, #215, #216).
    Note that this feature includes breaking changes that might affect existing code bases. Please refer to the upgrade guide for more information.
  • Added new if-conversion transformation to improve performance (#183).
  • Added support for 16-bit float (Half) types (#180, #208).
  • Added initial support for fixed array buffers (#200).
  • Added support for non-capturing lambda kernels (#79, #136).
  • Added support for multidimensional ExchangeBuffers (#148).
  • Extended ExchangeBuffers to support conversions to Span and Memory instances (#122).
  • Fixed invalid lowering of arrays in divergent control flow (#201).
  • Fixed invalid handling of prefixed IL instructions (#204, #211).

Special thanks to @MoFtZ, @Yey007 and @jgiannuzzi for contributing to this release.


@m4rs-mt released this on Jan 3, 2021

The new stable version offers significant performance and code-quality improvements to the generated kernel programs.

  • Fixed issues related to Trace and Debug asserts (#176).
  • Improved compile-time performance by up to 4X (#110).
  • Reduced memory footprint by up to 3X (#109, #118).
  • Added new optimization level O2 to enable expensive and aggressive optimizations (#70, #110, #111, #121).
  • The NuGet package now ships release builds of the compiler to improve runtime performance (#130).
  • Added new IR verifier that can be enabled via ContextFlags.EnableVerifier (#121).
  • Added generation of vectorized instructions to PTX backend (#111).
  • Fixed critical code-generation issue on Unix platforms (#116).
  • Added dynamic shared memory support for all platforms (#97, #98).
  • Added new KernelInfo objects to kernel loaders in order to query detailed kernel statistics (e.g. amount of local memory in bytes) (#104).