
Feature request: allow lambdas in kernels when they can be evaluated at compile time #463

Open
lostmsu opened this issue Apr 16, 2021 · 16 comments
Labels
difficulty:advanced A task that requires advanced knowledge difficulty:intermediate A task with intermediate difficulty feature A new feature (or feature request)
Milestone

Comments

@lostmsu
Contributor

lostmsu commented Apr 16, 2021

Rationale

This request is for syntax sugar that simplifies creating C# classes that provide GPGPU capabilities.

Imagine you are implementing an ISqlCalc interface that needs to perform a few operations on arrays using ILGPU.

interface ISqlCalc {
  int[] Neg(int[] a);
  int[] BitwiseComplement(int[] a);
}

class GpuSqlCalc : ISqlCalc {
  // sketch: `accelerator` is assumed to be a field; buffer management is elided
  static void UnaryOpKernel(Index1 i, ArrayView<int> data, Func<int, int> op)
    => data[i] = op(data[i]);

  static Action<int[]> UnaryOp(Func<int, int> op) {
    return accelerator.LoadAutoGroupedStreamKernel<
                Index1,
                ArrayView<int>
                >((i, d) => UnaryOpKernel(i, d, op));
  }

  public int[] Neg(int[] v) => UnaryOp(v => -v);
  public int[] BitwiseComplement(int[] v) => UnaryOp(v => ~v);
}

The point is that it should be possible to inline v => -v. The delegate instance has a MethodInfo pointing to the body, and that method never references this, so it is effectively static.

Workaround

Currently, the best way I have come up with to share something analogous to UnaryOpKernel across all unary ops is generic monomorphization, like this:

interface IUnaryOp<T> { T Apply(T val); }

static void UnaryOpKernel<TOp>(Index1 i, ArrayView<int> data)
  where TOp: struct, // this fails with a class, but really should not in this particular scenario
             IUnaryOp<int>
{
  data[i] = default(TOp).Apply(data[i]);
}

struct Neg : IUnaryOp<int> { public int Apply(int val) => -val; }

accelerator.LoadAutoGroupedStreamKernel<
                Index1,
                ArrayView<int>
                >(UnaryOpKernel<Neg>);

While this works, it is ugly and unnecessarily wordy.

The struct restriction also prevents me from at least doing

class Neg: BaseOp, IUnaryOp<int> {
  ... overrides of BaseOp stuff, that call into UnaryOpKernel<Neg> ...
  
  public int Apply(int val) => -val;
}

This fails with "Class type 'Neg' is not supported", even though `this` is never used and Apply is effectively static.
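For reference, a more complete sketch of this struct-based workaround might look as follows. This targets the pre-1.0 ILGPU API used above; the buffer-management calls are illustrative, not prescriptive:

```csharp
using ILGPU;
using ILGPU.Runtime;

// The op is encoded in a value type rather than a delegate, so the
// kernel compiler sees a concrete, statically resolvable call.
interface IUnaryOp<T> { T Apply(T val); }

struct Neg : IUnaryOp<int> { public int Apply(int val) => -val; }
struct BitwiseComplement : IUnaryOp<int> { public int Apply(int val) => ~val; }

static class UnaryOps
{
    // One generic kernel body shared by every unary op.
    static void UnaryOpKernel<TOp>(Index1 i, ArrayView<int> data)
        where TOp : struct, IUnaryOp<int>
        => data[i] = default(TOp).Apply(data[i]);

    public static int[] Run<TOp>(Accelerator accelerator, int[] values)
        where TOp : struct, IUnaryOp<int>
    {
        var kernel = accelerator.LoadAutoGroupedStreamKernel<
            Index1, ArrayView<int>>(UnaryOpKernel<TOp>);
        using var buffer = accelerator.Allocate<int>(values.Length);
        buffer.CopyFrom(values, 0, 0, values.Length);
        kernel(values.Length, buffer.View);
        accelerator.Synchronize();
        return buffer.GetAsArray();
    }
}

// Usage: var negated = UnaryOps.Run<Neg>(accelerator, new[] { 1, 2, 3 });
```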

@lostmsu
Contributor Author

lostmsu commented Apr 17, 2021

Hm, I started working on this, and I see existing pieces of code that look very relevant: MethodExtensions.IsNotCapturingLambda.

@MoFtZ MethodExtensions.GetParameterOffset seems to be returning the wrong value for a simple class with no fields or properties. What was the reasoning for it to return 0 for lambdas? AFAIK lambdas are implemented as instance methods on a hidden class, so it should return 1.

@m4rs-mt
Owner

m4rs-mt commented Apr 17, 2021

@lostmsu Thank you for your feature request. We have already discussed the feature in our weekly talk-to-dev sessions. We currently believe that we should add support for lambdas via ILGPU's dynamic specialization features. Also, we can translate calls to lambda functions into calls to "opaque" functions annotated with specific attributes. This avoids inlining and modifying these stubs that we generate.

However, adding support for arbitrary lambdas also requires special care in capturing values and returning lambda closures within kernel functions. Moreover, we can add this feature to the v1.1 feature list 🚀

@m4rs-mt m4rs-mt added difficulty:advanced A task that requires advanced knowledge feature A new feature (or feature request) labels Apr 17, 2021
@m4rs-mt m4rs-mt added this to the v1.1 milestone Apr 17, 2021
@m4rs-mt m4rs-mt added the difficulty:intermediate A task with intermediate difficulty label Apr 17, 2021
@lostmsu
Contributor Author

lostmsu commented Apr 17, 2021

@m4rs-mt thanks for the promising response. Is there anyone already working on that feature?

I started my own take at implementing it by replacing the key type in this dictionary:

Dictionary<MethodBase, CompilationStackLocation> detectedMethods,
to a composite of MethodBase + Value?[] array of arguments whose values are known at compile time (in this case a delegate pointing to a known method). This approach does not seem to align with the idea of "dynamic specialization features". Should I pause it?

@MoFtZ
Collaborator

MoFtZ commented Apr 18, 2021

@lostmsu
Thanks for looking into this topic.

Yes, you are correct that lambdas are implemented as instance methods on a hidden class. Originally, ILGPU only supported static methods, which do not have a this pointer. When adding support for non-capturing lambdas, we are removing the this pointer from the lambda and treating it like a static method. This means that arguments are shifted, and the parameter offset is 0, the same as for a static method.

If you find that it is easier to make your changes if the parameter offset is 1, then it is fine to change.
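To illustrate why the naive offset would be 1: on current Roslyn, even a non-capturing lambda is emitted as an instance method on a hidden display class (so the delegate instance can be cached), which can be observed via reflection. A small sketch; the comments describe typical Roslyn behavior, not a language guarantee:

```csharp
using System;

class LambdaShape
{
    static void Main()
    {
        // Non-capturing lambda: there is no environment to close over...
        Func<int, int> neg = v => -v;

        // ...yet Roslyn typically emits it as an instance method on a
        // hidden "<>c" display class and caches a singleton delegate,
        // so Method.IsStatic is false and Target is non-null.
        Console.WriteLine(neg.Method.IsStatic);
        Console.WriteLine(neg.Method.DeclaringType?.Name);
        Console.WriteLine(neg.Target is not null);
    }
}
```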

@MoFtZ
Collaborator

MoFtZ commented Apr 19, 2021

> @m4rs-mt thanks for the promising response. Is there anyone already working on that feature?
>
> I started my own take at implementing it by replacing the key type in this dictionary:
>
> Dictionary<MethodBase, CompilationStackLocation> detectedMethods,
>
> to a composite of MethodBase + Value?[] array of arguments whose values are known at compile time (in this case a delegate pointing to a known method). This approach does not seem to align with the idea of "dynamic specialization features". Should I pause it?

@lostmsu There is no one currently working on this feature, so if you have the time and passion, we would wholeheartedly welcome your contributions.

We have previously discussed how to support lambda functions to provide the requested functionality. In your example, you supplied the lambda function as a method parameter to UnaryOp, which then calls LoadAutoGroupedStreamKernel using a lambda function that captures Func<int, int> op. This is related to, but different from, #415, which uses a static member variable as the technique for supplying the lambda function.

Regarding "dynamic specialization features", I believe @m4rs-mt is referring to a technique similar to SpecializedValue in ILGPU: https://github.com/m4rs-mt/ILGPU/wiki/Dynamically-Specialized-Kernels
The idea is that calling LoadXxxKernel does an initial compilation of the kernel. Then, when actually launching a kernel that uses SpecializedValue, a further compilation phase is performed that will "dynamically specialize" the kernel. With regards to lambda functions, it could be something like having SpecializedFunc (or more generically, SpecializedDelegate) as a kernel parameter.
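Concretely, the SpecializedValue mechanism from the linked wiki page looks roughly like this for plain values; a hypothetical SpecializedFunc/SpecializedDelegate kernel parameter would extend the same launch-time specialization to lambdas:

```csharp
using ILGPU;
using ILGPU.Runtime;

static class SpecializationSketch
{
    // `factor` is a placeholder during the initial compilation; each
    // distinct value passed at launch time triggers a further
    // specialization pass in which it becomes a compile-time constant.
    static void ScaleKernel(
        Index1 index,
        ArrayView<int> data,
        SpecializedValue<int> factor)
    {
        data[index] *= factor; // SpecializedValue<T> converts implicitly to T
    }

    public static void Run(Accelerator accelerator, MemoryBuffer<int> buffer)
    {
        var kernel = accelerator.LoadAutoGroupedStreamKernel<
            Index1, ArrayView<int>, SpecializedValue<int>>(ScaleKernel);

        // Two launches with different values: two specialized kernel bodies.
        kernel(buffer.Length, buffer.View, SpecializedValue.New(2));
        kernel(buffer.Length, buffer.View, SpecializedValue.New(3));
        accelerator.Synchronize();
    }
}
```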

Note that this is still an open-ended discussion. For example, should we support lambdas that are static member variables like #415? Is dynamic specialization the correct approach for how it will be used? Should capturing lambdas be supported, and if so, to what extent? Also note that it is not necessary to solve all these questions now - we can slowly build up functionality while deferring other, more "problematic" features, like capturing lambdas.

@lostmsu
Contributor Author

lostmsu commented Apr 19, 2021

@MoFtZ the problem I see with LoadXxxKernel followed by a launch with a SpecializedValue is that the original kernel would need to support non-specialized lambda values, and I currently do not see how those could be compiled: their usage involves the IL opcode ldftn and eventually boils down to an indirect function call, which AFAIK (I am no expert on GPGPU) is only available on very recent hardware.

That was my reasoning behind the idea to propagate lambda at the initial compile time.

@m4rs-mt
Owner

m4rs-mt commented Apr 19, 2021

@lostmsu I don't think we'll run into any problems with respect to the ldftn opcode when translating it into an IR call to an opaque function. Consequently, we can resolve the call target at kernel launch time by providing a function to the kernel and leaving the specialization work to the ILGPU compiler. However, this generally does not cover all use cases 😄

@lostmsu Regarding your suggestion and implementation: I have experimented with different ways to implement lambdas in the compiler, as they involve handling class types inside kernels. I still believe that mapping these OpCodes to partial function calls + dynamic specialization of the call sites might be the best way to implement them. Anyway, we are always open to PRs that add new features 🤓👍

I was wondering about changing the mapping

> to a composite of MethodBase + Value?[] array of arguments whose values are known at compile time (in this case a delegate pointing to a known method).

to a tuple of a MethodBase and a Value array. Is the value array intended to represent captured variables from the environment of the function? And where do these values come from? Are they created by the IRBuilder from .NET values? If yes, how do we compare them "properly" for equality? I ask about equality checking because primitive constants are instantiated multiple times and are not treated as the same value in the compiler, for efficiency. In other words, the integer constant 1 and another constant 1 will not be the same value in memory.

lostmsu added a commit to losttech/ILGPU that referenced this issue Apr 19, 2021
PC-Crashes pushed a commit to PC-Crashes/ILGPU that referenced this issue May 21, 2021
PC-Crashes pushed a commit to PC-Crashes/ILGPU that referenced this issue May 21, 2021
PC-Crashes pushed a commit to PC-Crashes/ILGPU that referenced this issue May 21, 2021
@lostmsu
Contributor Author

lostmsu commented May 21, 2021

Sorry for the delay here @MoFtZ @m4rs-mt. Have you given this any more thought? Do you have notes?

I checked out the current code that handles SpecializedValue, and as-is it seems tailored to scenarios where the value being specialized is already one of the supported value types (which delegate instances are not). It might be possible to rework it a bit to get identical behavior while disallowing launches of generic kernels that have unspecialized parameters of reference types. Or we could explicitly add a separate GenericValue<T>, which behaves exactly like SpecializedValue<T> but must always be specialized.

@m4rs-mt mentioned dynamic specialization. Can you elaborate on the idea? Is it different from the above?

I have not looked at it, but if ILGPU already has cross-function constant propagation that might be another way to approach the problem.

@MoFtZ
Collaborator

MoFtZ commented May 24, 2021

@lostmsu We have not defined a preferred API, so you are welcome to design it as you see fit.

I believe that "dynamic specialization" is referring to the concept used by SpecializedValue<T>. That is, when the kernel is launched, it will be provided with the delegate as a parameter. This delegate will then be integrated into the final kernel that runs on the GPU.

@lostmsu
Contributor Author

lostmsu commented Jun 21, 2021

@MoFtZ @m4rs-mt is there some architectural description of ILGPU? I find it hard to wrap my head around existing translation phases, values, and IR without one.

@MoFtZ
Collaborator

MoFtZ commented Jun 22, 2021

There is no such documentation at the moment. If you'd like to join us on Discord, we will try to answer any questions you have:
https://discord.com/invite/X6RBCff

At a very high level, ILGPU follows a typical compiler design, with a Frontend that decodes MSIL into an Intermediate Representation (IR):
https://github.com/m4rs-mt/ILGPU/blob/v1.0-beta1/Src/ILGPU/Frontend/DisassemblerDriver.cs
https://github.com/m4rs-mt/ILGPU/blob/v1.0-beta1/Src/ILGPU/Frontend/ILFrontend.cs#L473
https://github.com/m4rs-mt/ILGPU/blob/v1.0-beta1/Src/ILGPU/Frontend/CodeGenerator/Driver.cs

Several optimisation phases are performed on this IR:
https://github.com/m4rs-mt/ILGPU/blob/v1.0-beta1/Src/ILGPU/IR/Transformations/Optimizer.cs

And finally, the IR is transformed using the Backends, to target Cuda or OpenCL:
https://github.com/m4rs-mt/ILGPU/blob/v1.0-beta1/Src/ILGPU/Backends/CodeGeneratorBackend.cs#L72

Additional resources:
https://www.tutorialspoint.com/compiler_design/index.htm
https://en.wikipedia.org/wiki/Static_single_assignment_form

@m4rs-mt m4rs-mt modified the milestones: v1.1, v1.X Mar 4, 2022
@m4rs-mt m4rs-mt modified the milestones: v1.X, vX.X (Future) Jun 9, 2022
@lostmsu
Contributor Author

lostmsu commented Feb 3, 2023

This now might be easier with new C# static abstract interface members. Relevant IL changes: https://github.com/dotnet/runtime/pull/49558/files
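With C# 11 static abstract interface members, the struct workaround from the opening comment sheds most of its boilerplate. A sketch, assuming kernel-side support for the feature (which is what this comment is proposing):

```csharp
using ILGPU;

interface IUnaryOp<T>
{
    static abstract T Apply(T val);
}

struct Neg : IUnaryOp<int>
{
    public static int Apply(int val) => -val;
}

static class Kernels
{
    // No default(TOp) instantiation needed: the call binds directly
    // to the implementing type's static method.
    public static void UnaryOpKernel<TOp>(Index1 i, ArrayView<int> data)
        where TOp : IUnaryOp<int>
        => data[i] = TOp.Apply(data[i]);
}
```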

@MoFtZ
Collaborator

MoFtZ commented Feb 3, 2023

@lostmsu We recently added support for Generic Math, which makes use of Static Abstract Interface members. If you would like to try it out, it is available in a preview release of ILGPU.

@Darelbi

Darelbi commented Mar 27, 2024

I need exactly that; assume I have a dynamic composition of different algorithms (NeuraSharp). Something like this would also be useful:

Declare the interfaces with static methods:

public interface IAlgorithm1
{
    public static abstract void DoAlgorithm(float[] input, float[] output);
}

public interface IFunction1
{
    public static abstract float DoSum(float[] input);
}

And then implement them:

public class MyAlgorithm1<T> : IAlgorithm1 where T : IFunction1
{
    public static void DoAlgorithm(float[] input, float[] output)
    {
        for (int j = 0; j < output.Length; j++)
        {
            output[j] = 2.0f * T.DoSum(input); // call to the static method of the generic type
        }
    }
}

public class NormalSum1 : IFunction1
{
    public static float DoSum(float[] input)
    {
        float sum = 0.0f;
        for (int i = 0; i < input.Length; i++)
            sum += input[i];
        return sum;
    }
}

// load this as a kernel
MyAlgorithm1<NormalSum1>.DoAlgorithm;

Actually, I'm looking at how to automatically generate inlined IL code, but it is a daunting task; if the feature is already there, that would be great...

Just out of curiosity, what kind of syntax exactly is supported in the preview?

@MoFtZ
Collaborator

MoFtZ commented Mar 27, 2024

hi @Darelbi.

This is a long-running thread, so the information is outdated.

Currently, using lambdas within a kernel is still not supported.

On the plus side, Generic Math and Static Abstract Interface Member support (for net7.0 onwards) is no longer in preview and is available in the latest version of ILGPU - currently v1.5.1.

There is also some sample code that might meet your requirements for using interfaces:
https://github.com/m4rs-mt/ILGPU/blob/master/Samples/StaticAbstractInterfaceMembers/Program.cs

@En3Tho

En3Tho commented Apr 17, 2024

Generic math works really well! Here is a small snippet in F# if you're interested.

module ILGpu.GenericKernels

open System
open System.Numerics
open ILGPU
open ILGPU.Runtime
open En3Tho.FSharp.Extensions

// define a set of constraints, INumber + ILGpu default ones
type Number<'TNumber
    when 'TNumber: unmanaged
    and 'TNumber: struct
    and 'TNumber: (new: unit -> 'TNumber)
    and 'TNumber :> ValueType
    and 'TNumber :> INumber<'TNumber>> = 'TNumber

module Kernels =

    // use this constraint for generic parameter in the kernel
    let inline executeSomeNumericOperations<'TNumber when Number<'TNumber>> (index: Index1D) (input: ArrayView<'TNumber>) (output: ArrayView<'TNumber>) (scalar: 'TNumber) =
        if index.X < input.Length.i32 then
            output[index] <- (input[index] * scalar + scalar) / scalar - scalar

let runKernel<'T when Number<'T>> (accelerator: Accelerator) scalar (data: 'T[]) =
    use deviceData = accelerator.Allocate1D(data)
    let kernel = accelerator.LoadAutoGroupedStreamKernel(Kernels.executeSomeNumericOperations<'T>)

    kernel.Invoke(Index1D(deviceData.Length.i32), deviceData.View, deviceData.View, scalar)
    deviceData.CopyToCPU(accelerator.DefaultStream, data)

    data |> Array.iteri ^ fun index element -> Console.WriteLine($"{index} = {element}")

let genericMap() =
    use context = Context.CreateDefault()
    let device = context.Devices |> Seq.find ^ fun x -> x.Name.Contains("GTX 1070")
    use accelerator = device.CreateAccelerator(context)

    // run with ints
    runKernel accelerator 10 [| 0; 1; 2; 3; 4; 5; 6; 7; 8; 9; |]
    // and with floats
    runKernel accelerator 10.1f [| 0.1f; 1.1f; 2.1f; 3.1f; 4.1f; 5.1f; 6.1f; 7.1f; 8.1f; 9.1f; |]


5 participants