Operator Fusion #194

Open
FL33TW00D opened this issue May 9, 2024 · 1 comment
Labels
enhancement New feature or request

Comments

@FL33TW00D
Collaborator

FL33TW00D commented May 9, 2024

Crucial and ties into Code Generation.

[Figure: allocations]

The above graph demonstrates the success of our current inplacing algorithm.

However, we need to take this a step further and go from Inplacing to Inlining.

fn main(...) {
    let x_offset = group_id.x * 64u;
    var dst_offset = (group_id.y * num_groups.x * 64u) + x_offset + local_index;

    //Convert 1D offset into 4D index
    let dst_index = offsetToNdIndex(dst_offset, metadata.dst_stride);

    var src_index = vec4<u32>(0u);
    src_index[metadata.perm[0]] = dst_index[0]; 
    src_index[metadata.perm[1]] = dst_index[1];
    src_index[metadata.perm[2]] = dst_index[2];
    src_index[metadata.perm[3]] = dst_index[3];
    
    //Convert 4D index into 1D offset
    let src_offset = ndIndexToOffset(src_index, metadata.src_offsets, metadata.src_stride);

    Y[dst_offset] = X[src_offset];
}

The above is our current permute shader. Instead of performing subsequent injective operations on the output buffer of permute, we could inline all of the injective operations like so:

fn main(...) {
    //omit
    Y[dst_offset] = cos(exp(gelu(X[src_offset])));
}

In this (contrived) example, the permute and all three elementwise ops collapse into a single node, so the intermediate results never round-trip through a buffer. This is super important.
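
As a rough sketch of how the inlined store could be generated at runtime, the Rust snippet below folds a chain of injective ops into one WGSL expression. `UnaryOp`, `wgsl_name`, and `fused_store` are hypothetical names made up for illustration, not existing APIs in this repo. It also assumes a `gelu` helper is emitted elsewhere in the shader, since WGSL has no built-in `gelu`.

```rust
/// Hypothetical set of injective (elementwise) ops that can be inlined.
#[derive(Clone, Copy, Debug, PartialEq)]
enum UnaryOp {
    Gelu,
    Exp,
    Cos,
}

impl UnaryOp {
    /// Name of the WGSL function to call. `exp` and `cos` are WGSL built-ins;
    /// `gelu` is assumed to be a helper defined elsewhere in the shader.
    fn wgsl_name(self) -> &'static str {
        match self {
            UnaryOp::Gelu => "gelu",
            UnaryOp::Exp => "exp",
            UnaryOp::Cos => "cos",
        }
    }
}

/// Builds the final store statement. `ops` is given in execution order,
/// so the first op ends up innermost in the generated expression.
fn fused_store(ops: &[UnaryOp]) -> String {
    let expr = ops
        .iter()
        .fold(String::from("X[src_offset]"), |acc, op| {
            format!("{}({})", op.wgsl_name(), acc)
        });
    format!("Y[dst_offset] = {};", expr)
}
```

Calling `fused_store(&[UnaryOp::Gelu, UnaryOp::Exp, UnaryOp::Cos])` yields `Y[dst_offset] = cos(exp(gelu(X[src_offset])));`, and an empty op list falls back to the plain permute store.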

FL33TW00D added the enhancement label May 9, 2024
@philpax
Contributor

philpax commented May 9, 2024

Sharing my thoughts from our conversation:

  • you'll want to introduce an IR that keeps track of the size of each tensor and the "type" of each operation
  • you can coalesce operations with the same "type" - for the example you've given, you have elementwise operations of cos / exp / gelu - you can bundle these into a single node
  • for this, runtime code generation will be needed for each IR node, as you will no longer know ahead of time what your final execution environment will look like
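
The first two points could be sketched roughly as follows. Everything here is hypothetical: the `OpKind`, `Node`, `node`, and `coalesce` names are invented for illustration, the graph is assumed to be a simple linear chain, and only runs of elementwise ops with matching output shapes are merged. A real pass would need to handle branching graphs and decide which other "types" are safe to bundle.

```rust
/// Coarse "type" of an operation, used to decide what can be bundled.
#[derive(Clone, Copy, Debug, PartialEq)]
enum OpKind {
    Elementwise,
    Reindex, // e.g. permute
    Reduce,
}

/// One IR node: the op(s) it performs, their kind, and the output tensor shape.
#[derive(Clone, Debug, PartialEq)]
struct Node {
    ops: Vec<&'static str>,
    kind: OpKind,
    out_shape: Vec<usize>,
}

/// Convenience constructor for a single-op node.
fn node(op: &'static str, kind: OpKind, out_shape: &[usize]) -> Node {
    Node {
        ops: vec![op],
        kind,
        out_shape: out_shape.to_vec(),
    }
}

/// Merges runs of adjacent elementwise nodes that share an output shape.
/// Nodes of any other kind are passed through unchanged.
fn coalesce(chain: Vec<Node>) -> Vec<Node> {
    let mut fused: Vec<Node> = Vec::new();
    for n in chain {
        let can_merge = match fused.last() {
            Some(prev) => {
                prev.kind == OpKind::Elementwise
                    && n.kind == OpKind::Elementwise
                    && prev.out_shape == n.out_shape
            }
            None => false,
        };
        if can_merge {
            // Append this node's op(s) to the previous fused node.
            fused.last_mut().unwrap().ops.extend(n.ops);
        } else {
            fused.push(n);
        }
    }
    fused
}
```

For a chain of `permute`, `gelu`, `exp`, `cos`, this leaves two nodes: the permute and one elementwise node carrying all three ops. Each resulting node would then be handed to runtime code generation.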
