[NPU]: NPU-optimized rms_norm kernel by TianHao324 · Pull Request #1099 · linkedin/Liger-Kernel

TianHao324 · 2026-02-12T02:10:08Z

Summary

Because of NPU hardware limits, loading an entire row at once overflows the unified buffer (UB) when n_cols is too large. In that case, the kernel partitions the columns to avoid the overflow; when n_cols is small, it still loads the whole row. For most large models, n_cols (i.e., the hidden size) is not very large, typically ranging from 2^10 to 2^11, and the maximum hidden size in the test is 2^9. Therefore, we benchmark performance over hidden sizes from 2^9 to 2^11. We will consider how to modify the benchmark later.
Grid size is limited to NPU core count to avoid resource overflow
Each program handles multiple rows
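
The work split described above can be sketched in plain Python. This is only an illustration of the idea, not the kernel in this PR: the names `num_cores` and `block_size`, the strided row assignment, and the two-pass chunk loop are assumptions made for the example.

```python
import math


def rms_norm_rows(x, weight, eps=1e-6, num_cores=4, block_size=4):
    """Reference RMSNorm that mimics the work split (illustrative only).

    x is a list of rows, weight is a list of n_cols scale factors.
    num_cores and block_size are hypothetical stand-ins for the NPU core
    count and the largest column block that fits in on-chip memory.
    """
    n_rows, n_cols = len(x), len(weight)
    out = [[0.0] * n_cols for _ in range(n_rows)]
    grid = min(num_cores, n_rows)  # grid size is capped at the core count

    for pid in range(grid):  # each "program" would run on one core
        # Assumption: rows are assigned with a stride of `grid`,
        # so every program handles multiple rows.
        for row in range(pid, n_rows, grid):
            if n_cols <= block_size:
                # Small row: process the whole row in one go.
                sq_sum = sum(v * v for v in x[row])
            else:
                # Large row: accumulate the sum of squares chunk by chunk.
                sq_sum = 0.0
                for start in range(0, n_cols, block_size):
                    chunk = x[row][start:start + block_size]
                    sq_sum += sum(v * v for v in chunk)

            rstd = 1.0 / math.sqrt(sq_sum / n_cols + eps)

            # Second pass over the same chunks: normalize and scale.
            for start in range(0, n_cols, block_size):
                for col in range(start, min(start + block_size, n_cols)):
                    out[row][col] = x[row][col] * rstd * weight[col]
    return out
```

Calling `rms_norm_rows(x, weight, block_size=len(weight))` takes the whole-row path, while a smaller `block_size` takes the chunked path; both should agree up to floating-point rounding.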

Testing Done

Hardware Type: Atlas 800I A2
run make test to ensure correctness
run make checkstyle to ensure code style
run make test-convergence to ensure convergence

TianHao324 · 2026-02-12T02:13:25Z

Benchmark:

**************************************
     BENCHMARKING SPEED for RMS_NORM
**************************************
********** Benchmark Data **********
[
  {
    "kernel_name": "rms_norm",
    "kernel_provider": "liger",
    "metric_name": "speed",
    "metric_unit": "ms",
    "gpu_name": "Ascend910B4",
    "x_name": "H",
    "x_label": "hidden size",
    "x_values": [
      512,
      1024,
      2048
    ],
    "y_values_50": [
      0.2583000063896179,
      0.25293999910354614,
      0.26662999391555786
    ],
    "y_values_20": [
      0.25063201785087585,
      0.24488800764083862,
      0.2616159915924072
    ],
    "y_values_80": [
      0.2695560157299042,
      0.26193201541900635,
      0.2717519998550415
    ],
    "timestamp": "2026-02-11 07:44:49",
    "kernel_operation_mode": "forward",
    "extra_benchmark_config_str": "{\"M\": 2048, \"dtype\": \"torch.bfloat16\", \"eps\": 1e-06}",
    "liger_version": "0.0.0"
  },
  {
    "kernel_name": "rms_norm",
    "kernel_provider": "huggingface",
    "metric_name": "speed",
    "metric_unit": "ms",
    "gpu_name": "Ascend910B4",
    "x_name": "H",
    "x_label": "hidden size",
    "x_values": [
      512,
      1024,
      2048
    ],
    "y_values_50": [
      0.15997999906539917,
      0.1638599932193756,
      0.1911199986934662
    ],
    "y_values_20": [
      0.1535000056028366,
      0.1588200032711029,
      0.18996000289916992
    ],
    "y_values_80": [
      0.16821999847888947,
      0.16962000727653503,
      0.19249999523162842
    ],
    "timestamp": "2026-02-11 07:44:51",
    "kernel_operation_mode": "forward",
    "extra_benchmark_config_str": "{\"M\": 2048, \"dtype\": \"torch.bfloat16\", \"eps\": 1e-06}",
    "liger_version": "0.0.0"
  },
  {
    "kernel_name": "rms_norm",
    "kernel_provider": "liger",
    "metric_name": "speed",
    "metric_unit": "ms",
    "gpu_name": "Ascend910B4",
    "x_name": "H",
    "x_label": "hidden size",
    "x_values": [
      512,
      1024,
      2048
    ],
    "y_values_50": [
      0.8722299933433533,
      0.8751800060272217,
      0.8167200088500977
    ],
    "y_values_20": [
      0.8654080033302307,
      0.8672599792480469,
      0.8060960173606873
    ],
    "y_values_80": [
      0.8836359977722168,
      0.8859999775886536,
      0.8307200074195862
    ],
    "timestamp": "2026-02-11 07:44:53",
    "kernel_operation_mode": "full",
    "extra_benchmark_config_str": "{\"M\": 2048, \"dtype\": \"torch.bfloat16\", \"eps\": 1e-06}",
    "liger_version": "0.0.0"
  },
  {
    "kernel_name": "rms_norm",
    "kernel_provider": "huggingface",
    "metric_name": "speed",
    "metric_unit": "ms",
    "gpu_name": "Ascend910B4",
    "x_name": "H",
    "x_label": "hidden size",
    "x_values": [
      512,
      1024,
      2048
    ],
    "y_values_50": [
      0.7480800151824951,
      0.734470009803772,
      0.7401300072669983
    ],
    "y_values_20": [
      0.7207080125808716,
      0.7244679927825928,
      0.7370799779891968
    ],
    "y_values_80": [
      0.7986599802970886,
      0.7477200031280518,
      0.7451000213623047
    ],
    "timestamp": "2026-02-11 07:44:54",
    "kernel_operation_mode": "full",
    "extra_benchmark_config_str": "{\"M\": 2048, \"dtype\": \"torch.bfloat16\", \"eps\": 1e-06}",
    "liger_version": "0.0.0"
  },
  {
    "kernel_name": "rms_norm",
    "kernel_provider": "liger",
    "metric_name": "speed",
    "metric_unit": "ms",
    "gpu_name": "Ascend910B4",
    "x_name": "H",
    "x_label": "hidden size",
    "x_values": [
      512,
      1024,
      2048
    ],
    "y_values_50": [
      0.3915899991989136,
      0.38552001118659973,
      0.4566799998283386
    ],
    "y_values_20": [
      0.38571199774742126,
      0.38040000200271606,
      0.4496000111103058
    ],
    "y_values_80": [
      0.4015119969844818,
      0.39298000931739807,
      0.4654200077056885
    ],
    "timestamp": "2026-02-11 07:44:56",
    "kernel_operation_mode": "backward",
    "extra_benchmark_config_str": "{\"M\": 2048, \"dtype\": \"torch.bfloat16\", \"eps\": 1e-06}",
    "liger_version": "0.0.0"
  },
  {
    "kernel_name": "rms_norm",
    "kernel_provider": "huggingface",
    "metric_name": "speed",
    "metric_unit": "ms",
    "gpu_name": "Ascend910B4",
    "x_name": "H",
    "x_label": "hidden size",
    "x_values": [
      512,
      1024,
      2048
    ],
    "y_values_50": [
      0.40516000986099243,
      0.4075999855995178,
      0.6054199934005737
    ],
    "y_values_20": [
      0.3917959928512573,
      0.3993520140647888,
      0.6028439998626709
    ],
    "y_values_80": [
      0.4301320016384125,
      0.4168680012226105,
      0.6079919934272766
    ],
    "timestamp": "2026-02-11 07:44:58",
    "kernel_operation_mode": "backward",
    "extra_benchmark_config_str": "{\"M\": 2048, \"dtype\": \"torch.bfloat16\", \"eps\": 1e-06}",
    "liger_version": "0.0.0"
  }
]
**************************************
     BENCHMARKING MEMORY for RMS_NORM
**************************************
********** Benchmark Data **********
[
  {
    "kernel_name": "rms_norm",
    "kernel_provider": "liger",
    "metric_name": "memory",
    "metric_unit": "MB",
    "gpu_name": "Ascend910B4",
    "x_name": "H",
    "x_label": "hidden size",
    "x_values": [
      512,
      1024,
      2048
    ],
    "y_values_50": [
      6.177734375,
      12.341796875,
      24.669921875
    ],
    "y_values_20": [
      6.177734375,
      12.341796875,
      24.669921875
    ],
    "y_values_80": [
      6.177734375,
      12.341796875,
      24.669921875
    ],
    "timestamp": "2026-02-11 07:44:58",
    "kernel_operation_mode": "full",
    "extra_benchmark_config_str": "{\"M\": 2048, \"dtype\": \"torch.bfloat16\", \"eps\": 1e-06}",
    "liger_version": "0.0.0"
  },
  {
    "kernel_name": "rms_norm",
    "kernel_provider": "huggingface",
    "metric_name": "memory",
    "metric_unit": "MB",
    "gpu_name": "Ascend910B4",
    "x_name": "H",
    "x_label": "hidden size",
    "x_values": [
      512,
      1024,
      2048
    ],
    "y_values_50": [
      40.021484375,
      80.02734375,
      160.0390625
    ],
    "y_values_20": [
      40.021484375,
      80.02734375,
      160.0390625
    ],
    "y_values_80": [
      40.021484375,
      80.02734375,
      160.0390625
    ],
    "timestamp": "2026-02-11 07:44:58",
    "kernel_operation_mode": "full",
    "extra_benchmark_config_str": "{\"M\": 2048, \"dtype\": \"torch.bfloat16\", \"eps\": 1e-06}",
    "liger_version": "0.0.0"
  }
]
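
For a quick read of the output above, the medians (`y_values_50`) of the two providers can be compared directly. A minimal sketch, with the speed values copied from the output and rounded to five significant digits; the dictionary layout and the `ratios` helper are illustrative only and are not part of the benchmark script.

```python
# Median (y_values_50) figures copied from the benchmark output above.
H = [512, 1024, 2048]
median_ms = {
    "forward": {
        "liger": [0.25830, 0.25294, 0.26663],
        "huggingface": [0.15998, 0.16386, 0.19112],
    },
    "full": {
        "liger": [0.87223, 0.87518, 0.81672],
        "huggingface": [0.74808, 0.73447, 0.74013],
    },
    "backward": {
        "liger": [0.39159, 0.38552, 0.45668],
        "huggingface": [0.40516, 0.40760, 0.60542],
    },
}
peak_mb = {
    "liger": [6.177734375, 12.341796875, 24.669921875],
    "huggingface": [40.021484375, 80.02734375, 160.0390625],
}


def ratios(baseline, candidate):
    """Element-wise baseline / candidate; values above 1 favour the candidate."""
    return [b / c for b, c in zip(baseline, candidate)]


# huggingface time divided by liger time, per mode and hidden size.
speedup = {
    mode: ratios(v["huggingface"], v["liger"]) for mode, v in median_ms.items()
}
# huggingface peak memory divided by liger peak memory.
memory_saving = ratios(peak_mb["huggingface"], peak_mb["liger"])

for mode, r in speedup.items():
    print(mode, [f"H={h}: {x:.2f}x" for h, x in zip(H, r)])
print("memory", [f"H={h}: {x:.2f}x" for h, x in zip(H, memory_saving)])
```

On these numbers the ratio is below 1 for the forward and full modes and above 1 for the backward mode, while peak memory is roughly 6.5x lower with the liger provider.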

TianHao324 · 2026-02-12T02:14:00Z

@Tcc0403 would you mind taking a look?

Tcc0403

LGTM

[NPU]: NPU-optimized rms_norm kernel

1eb20a4

TianHao324 force-pushed the rms_npu branch from c5fe825 to 1eb20a4 on February 12, 2026 02:11

Tcc0403 approved these changes Feb 12, 2026

Tcc0403 added this pull request to the merge queue Feb 12, 2026

Merged via the queue into linkedin:main with commit ec3a9d1 Feb 12, 2026
3 of 9 checks passed

Tcc0403 merged 1 commit into linkedin:main from TianHao324:rms_npu
