Release v0.8.0

@m4rs-mt released this 03 Jan 02:58

This new stable version significantly improves the performance and code quality of the generated kernel programs.

  • Added support for on-the-fly specialization of kernels using dynamic partial evaluation.
  • Added support for dynamic shared memory (CPU & Cuda backends).
  • Added new KernelConfig structure to specify launch dimensions for explicitly grouped kernels.
  • Added new Index1 structure to avoid name clashes with the new System.Index structure.
  • Added additional tuple conversion methods to Index2 and Index3 types.
  • Added new EntryPointDescription structure to specify an entry point and its index type.
  • Added RuntimeKernelConfig structure to combine static and dynamic information about a particular kernel launch.
  • Added support for linear arrays in local memory.
  • Added support for enum-value interop (#66).
  • Reworked explicitly grouped kernel launchers to use the new KernelConfig structure instead of GroupedIndex types.
  • Simplified static Grid and Group properties.
  • Removed all GroupedIndex types.
  • Updated the whole compilation pipeline to enable more aggressive optimizations.
  • Significantly improved performance of emitted PTX and OpenCL code by enabling more aggressive optimizations and clever code generation (#70).
  • Added support for "unmanaged" C# structures in the scope of buffers and views.
  • Reworked PTX backend to support all API changes and to fix several critical code-generation issues. This also includes emitting PTX instructions that mimic those generated by the Cuda compiler (#68).
  • Reworked OpenCL backend to support all API changes and to fix several critical code-generation issues (#67, #72, #73, #74, #78, #85, #88, #91, #92).
  • Added new debug-information input module to support the latest PDB format updates.
  • Considerably improved error messages using debug information (#86).
  • Reduced memory consumption during the compilation process.
  • Performance improvements of the internal compilation pipeline.
  • Improved performance of kernel launchers.
  • Extended CudaAPI to support page-locked host-memory allocation functions.
  • Extended ExchangeBuffer to use new page-locked memory allocation (if available).
  • Added new IR-rewriter API to perform more advanced IR transformations.
  • Adapted all existing transformations to use the new rewriter API.
  • Reduced memory consumption of all nodes by compressing information.
  • Redesigned several IR nodes to support global program transformations.
  • Reworked implementation of GetSubView in the context of generic and multidimensional array views (#19).
  • Fixed several issues in the scope of address-space inference.
  • Fixed critical code generation issues that could occur when replacing values.

Special thanks to @MoFtZ for contributing to this release.