
[NPU]: NPU-optimized rms_norm kernel #1099

Merged
Tcc0403 merged 1 commit into linkedin:main from TianHao324:rms_npu
Feb 12, 2026

Conversation

@TianHao324
Contributor

Summary

  1. Because of NPU hardware limits, loading an entire row when n_cols is large overflows the Unified Buffer (UB). In that case, the kernel partitions the row by columns to avoid the overflow. When n_cols is small, it still loads the entire row at once. For most large models, n_cols (i.e., the hidden size) is not very large, typically ranging from 2^10 to 2^11. The maximum hidden size in the test is 2^9. We therefore measure performance for hidden sizes from 2^9 to 2^11, and will consider how to modify the benchmark later.
  2. The grid size is capped at the NPU core count to avoid resource overflow.
  3. Each program handles multiple rows.
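
The three points above can be sketched in plain Python. This is only an illustrative model of the launch scheme, not the Triton kernel from this PR: the function name `rms_norm_rows`, the default `num_cores=20`, and the `ub_capacity` limit (counted in elements per tile) are assumptions made for the example.

```python
import math


def rms_norm_rows(x, weight, eps=1e-6, num_cores=20, ub_capacity=1024):
    """Pure-Python model of the scheme; num_cores and ub_capacity are hypothetical."""
    n_rows, n_cols = len(x), len(weight)
    grid = min(num_cores, n_rows)      # point 2: grid size capped at the core count
    block = min(n_cols, ub_capacity)   # point 1: whole row if it fits, else column tiles
    out = [[0.0] * n_cols for _ in range(n_rows)]

    for pid in range(grid):                   # one iteration per simulated program
        for row in range(pid, n_rows, grid):  # point 3: each program strides over rows
            # Pass 1: accumulate the sum of squares one column tile at a time.
            sq_sum = 0.0
            for start in range(0, n_cols, block):
                sq_sum += sum(v * v for v in x[row][start:start + block])
            rstd = 1.0 / math.sqrt(sq_sum / n_cols + eps)

            # Pass 2: normalize and apply the weight, again tile by tile.
            for start in range(0, n_cols, block):
                for col in range(start, min(start + block, n_cols)):
                    out[row][col] = x[row][col] * rstd * weight[col]
    return out


if __name__ == "__main__":
    rows = [[3.0, 4.0], [1.0, 1.0], [0.0, 2.0]]
    print(rms_norm_rows(rows, [1.0, 1.0], num_cores=2, ub_capacity=1))
```

When `n_cols` does not exceed `ub_capacity`, the tile covers the full row and both passes run once, which corresponds to the whole-row path. Otherwise the row is visited in two tiled passes, so no single load exceeds the assumed buffer limit.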

Testing Done

  • Hardware Type: Atlas 800I A2
  • run make test to ensure correctness
  • run make checkstyle to ensure code style
  • run make test-convergence to ensure convergence

@TianHao324
Contributor Author

benchmark:

**************************************
     BENCHMARKING SPEED for RMS_NORM
**************************************
********** Benchmark Data **********
[
  {
    "kernel_name": "rms_norm",
    "kernel_provider": "liger",
    "metric_name": "speed",
    "metric_unit": "ms",
    "gpu_name": "Ascend910B4",
    "x_name": "H",
    "x_label": "hidden size",
    "x_values": [
      512,
      1024,
      2048
    ],
    "y_values_50": [
      0.2583000063896179,
      0.25293999910354614,
      0.26662999391555786
    ],
    "y_values_20": [
      0.25063201785087585,
      0.24488800764083862,
      0.2616159915924072
    ],
    "y_values_80": [
      0.2695560157299042,
      0.26193201541900635,
      0.2717519998550415
    ],
    "timestamp": "2026-02-11 07:44:49",
    "kernel_operation_mode": "forward",
    "extra_benchmark_config_str": "{\"M\": 2048, \"dtype\": \"torch.bfloat16\", \"eps\": 1e-06}",
    "liger_version": "0.0.0"
  },
  {
    "kernel_name": "rms_norm",
    "kernel_provider": "huggingface",
    "metric_name": "speed",
    "metric_unit": "ms",
    "gpu_name": "Ascend910B4",
    "x_name": "H",
    "x_label": "hidden size",
    "x_values": [
      512,
      1024,
      2048
    ],
    "y_values_50": [
      0.15997999906539917,
      0.1638599932193756,
      0.1911199986934662
    ],
    "y_values_20": [
      0.1535000056028366,
      0.1588200032711029,
      0.18996000289916992
    ],
    "y_values_80": [
      0.16821999847888947,
      0.16962000727653503,
      0.19249999523162842
    ],
    "timestamp": "2026-02-11 07:44:51",
    "kernel_operation_mode": "forward",
    "extra_benchmark_config_str": "{\"M\": 2048, \"dtype\": \"torch.bfloat16\", \"eps\": 1e-06}",
    "liger_version": "0.0.0"
  },
  {
    "kernel_name": "rms_norm",
    "kernel_provider": "liger",
    "metric_name": "speed",
    "metric_unit": "ms",
    "gpu_name": "Ascend910B4",
    "x_name": "H",
    "x_label": "hidden size",
    "x_values": [
      512,
      1024,
      2048
    ],
    "y_values_50": [
      0.8722299933433533,
      0.8751800060272217,
      0.8167200088500977
    ],
    "y_values_20": [
      0.8654080033302307,
      0.8672599792480469,
      0.8060960173606873
    ],
    "y_values_80": [
      0.8836359977722168,
      0.8859999775886536,
      0.8307200074195862
    ],
    "timestamp": "2026-02-11 07:44:53",
    "kernel_operation_mode": "full",
    "extra_benchmark_config_str": "{\"M\": 2048, \"dtype\": \"torch.bfloat16\", \"eps\": 1e-06}",
    "liger_version": "0.0.0"
  },
  {
    "kernel_name": "rms_norm",
    "kernel_provider": "huggingface",
    "metric_name": "speed",
    "metric_unit": "ms",
    "gpu_name": "Ascend910B4",
    "x_name": "H",
    "x_label": "hidden size",
    "x_values": [
      512,
      1024,
      2048
    ],
    "y_values_50": [
      0.7480800151824951,
      0.734470009803772,
      0.7401300072669983
    ],
    "y_values_20": [
      0.7207080125808716,
      0.7244679927825928,
      0.7370799779891968
    ],
    "y_values_80": [
      0.7986599802970886,
      0.7477200031280518,
      0.7451000213623047
    ],
    "timestamp": "2026-02-11 07:44:54",
    "kernel_operation_mode": "full",
    "extra_benchmark_config_str": "{\"M\": 2048, \"dtype\": \"torch.bfloat16\", \"eps\": 1e-06}",
    "liger_version": "0.0.0"
  },
  {
    "kernel_name": "rms_norm",
    "kernel_provider": "liger",
    "metric_name": "speed",
    "metric_unit": "ms",
    "gpu_name": "Ascend910B4",
    "x_name": "H",
    "x_label": "hidden size",
    "x_values": [
      512,
      1024,
      2048
    ],
    "y_values_50": [
      0.3915899991989136,
      0.38552001118659973,
      0.4566799998283386
    ],
    "y_values_20": [
      0.38571199774742126,
      0.38040000200271606,
      0.4496000111103058
    ],
    "y_values_80": [
      0.4015119969844818,
      0.39298000931739807,
      0.4654200077056885
    ],
    "timestamp": "2026-02-11 07:44:56",
    "kernel_operation_mode": "backward",
    "extra_benchmark_config_str": "{\"M\": 2048, \"dtype\": \"torch.bfloat16\", \"eps\": 1e-06}",
    "liger_version": "0.0.0"
  },
  {
    "kernel_name": "rms_norm",
    "kernel_provider": "huggingface",
    "metric_name": "speed",
    "metric_unit": "ms",
    "gpu_name": "Ascend910B4",
    "x_name": "H",
    "x_label": "hidden size",
    "x_values": [
      512,
      1024,
      2048
    ],
    "y_values_50": [
      0.40516000986099243,
      0.4075999855995178,
      0.6054199934005737
    ],
    "y_values_20": [
      0.3917959928512573,
      0.3993520140647888,
      0.6028439998626709
    ],
    "y_values_80": [
      0.4301320016384125,
      0.4168680012226105,
      0.6079919934272766
    ],
    "timestamp": "2026-02-11 07:44:58",
    "kernel_operation_mode": "backward",
    "extra_benchmark_config_str": "{\"M\": 2048, \"dtype\": \"torch.bfloat16\", \"eps\": 1e-06}",
    "liger_version": "0.0.0"
  }
]
**************************************
     BENCHMARKING MEMORY for RMS_NORM
**************************************
********** Benchmark Data **********
[
  {
    "kernel_name": "rms_norm",
    "kernel_provider": "liger",
    "metric_name": "memory",
    "metric_unit": "MB",
    "gpu_name": "Ascend910B4",
    "x_name": "H",
    "x_label": "hidden size",
    "x_values": [
      512,
      1024,
      2048
    ],
    "y_values_50": [
      6.177734375,
      12.341796875,
      24.669921875
    ],
    "y_values_20": [
      6.177734375,
      12.341796875,
      24.669921875
    ],
    "y_values_80": [
      6.177734375,
      12.341796875,
      24.669921875
    ],
    "timestamp": "2026-02-11 07:44:58",
    "kernel_operation_mode": "full",
    "extra_benchmark_config_str": "{\"M\": 2048, \"dtype\": \"torch.bfloat16\", \"eps\": 1e-06}",
    "liger_version": "0.0.0"
  },
  {
    "kernel_name": "rms_norm",
    "kernel_provider": "huggingface",
    "metric_name": "memory",
    "metric_unit": "MB",
    "gpu_name": "Ascend910B4",
    "x_name": "H",
    "x_label": "hidden size",
    "x_values": [
      512,
      1024,
      2048
    ],
    "y_values_50": [
      40.021484375,
      80.02734375,
      160.0390625
    ],
    "y_values_20": [
      40.021484375,
      80.02734375,
      160.0390625
    ],
    "y_values_80": [
      40.021484375,
      80.02734375,
      160.0390625
    ],
    "timestamp": "2026-02-11 07:44:58",
    "kernel_operation_mode": "full",
    "extra_benchmark_config_str": "{\"M\": 2048, \"dtype\": \"torch.bfloat16\", \"eps\": 1e-06}",
    "liger_version": "0.0.0"
  }
]
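
The medians reported above can be compared directly. The short script below is a reader's sketch, not part of the benchmark suite: it copies the `y_values_50` entries from the output and computes liger/huggingface time ratios per mode, plus the huggingface/liger peak-memory ratio. The helper name `ratios` is made up for the example.

```python
# Median (y_values_50) values copied from the benchmark output above.
H = [512, 1024, 2048]
median_ms = {
    "forward": {
        "liger": [0.2583000063896179, 0.25293999910354614, 0.26662999391555786],
        "huggingface": [0.15997999906539917, 0.1638599932193756, 0.1911199986934662],
    },
    "full": {
        "liger": [0.8722299933433533, 0.8751800060272217, 0.8167200088500977],
        "huggingface": [0.7480800151824951, 0.734470009803772, 0.7401300072669983],
    },
    "backward": {
        "liger": [0.3915899991989136, 0.38552001118659973, 0.4566799998283386],
        "huggingface": [0.40516000986099243, 0.4075999855995178, 0.6054199934005737],
    },
}
peak_mb = {
    "liger": [6.177734375, 12.341796875, 24.669921875],
    "huggingface": [40.021484375, 80.02734375, 160.0390625],
}


def ratios(numerators, denominators):
    """Element-wise ratio of two equally long lists."""
    return [n / d for n, d in zip(numerators, denominators)]


if __name__ == "__main__":
    for mode, runs in median_ms.items():
        r = ratios(runs["liger"], runs["huggingface"])  # > 1 means liger is slower
        print(mode, {h: round(v, 2) for h, v in zip(H, r)})
    mem = ratios(peak_mb["huggingface"], peak_mb["liger"])  # > 1 means liger uses less
    print("memory", {h: round(v, 2) for h, v in zip(H, mem)})
```

On these numbers, the liger median is higher than the huggingface median in forward and full mode, lower in backward mode, and the liger peak memory is roughly 6.5 times smaller at every hidden size.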

@TianHao324
Contributor Author

@Tcc0403 would you mind taking a look?

Collaborator

Tcc0403 left a comment

LGTM

Tcc0403 added this pull request to the merge queue Feb 12, 2026
Merged via the queue into linkedin:main with commit ec3a9d1 Feb 12, 2026
3 of 9 checks passed