Refactor LLaMA 4 kernels to modular nn/ structure #190

@m96-chan

Description

Problem

The current LLaMA 4 CUDA kernel implementation is monolithic:

native/ops/nn/
├── llama4/           # ← All kernels bundled together
└── llama4_kernels.cuh

This violates the modular architecture defined in CLAUDE.md, which separates NN operations by function:

native/ops/nn/
├── activation/   # GELU, SiLU, etc.
├── attention/    # SDPA
├── norm/         # RMSNorm, LayerNorm
└── rope/         # RoPE
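To make the split concrete: each directory above owns one self-contained, model-agnostic operation. As an illustrative sketch only (function and file names are hypothetical, not taken from the repo), a host-side reference for the kind of RMSNorm op that would live under `norm/` might look like:

```cpp
#include <cmath>
#include <vector>

// Hypothetical reference RMSNorm: y_i = x_i * w_i / sqrt(mean(x^2) + eps).
// After the refactor, the CUDA kernel implementing this would sit in
// native/ops/nn/norm/ rather than being bundled into llama4/, so any
// model's bindings can call it.
std::vector<float> rms_norm(const std::vector<float>& x,
                            const std::vector<float>& w,
                            float eps = 1e-6f) {
    float sum_sq = 0.0f;
    for (float v : x) sum_sq += v * v;
    float scale = 1.0f / std::sqrt(sum_sq / x.size() + eps);
    std::vector<float> y(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) y[i] = x[i] * scale * w[i];
    return y;
}
```

A standalone op like this is also straightforward to unit-test against a reference implementation, which addresses the testability point below.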

Impact

  • Common kernels bundled under llama4/ cannot be reused by other models
  • Violates the principle of modular, composable operations
  • Makes testing individual components harder

Proposed Solution

  1. Extract common operations from llama4/ into their respective directories
  2. Keep only LLaMA 4-specific logic (if any) in llama4/
  3. Update bindings to use modular kernel paths
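A rough sketch of step 1, with hypothetical filenames (the actual kernels under llama4/ may be named and grouped differently):

```shell
# Illustrative only: set up the current monolithic layout so the move
# below is runnable; in the real repo these files already exist.
mkdir -p native/ops/nn/llama4
touch native/ops/nn/llama4/rms_norm.cuh native/ops/nn/llama4/rope.cuh

# Create the modular directories from CLAUDE.md and move the shared
# kernels into their functional homes (filenames are hypothetical):
mkdir -p native/ops/nn/activation native/ops/nn/attention \
         native/ops/nn/norm native/ops/nn/rope
mv native/ops/nn/llama4/rms_norm.cuh native/ops/nn/norm/
mv native/ops/nn/llama4/rope.cuh native/ops/nn/rope/
```

After the move, step 3 amounts to pointing the bindings' includes at the new per-op paths instead of the monolithic llama4_kernels.cuh.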

Related

  • Added in commit 5fcf3c3 with note about needing refactor
