LATX, opt: Inline PCLMULQDQ/VPCLMULQDQ#293
Merged
Merged
Conversation
luzeng87
approved these changes
May 14, 2026
Contributor
luzeng87
left a comment
There was a problem hiding this comment.
审查总结
将 PCLMULQDQ/VPCLMULQDQ 从 helper 函数调用替换为 CTZ 循环内联实现。GF(2^128) 多项式乘法,CTZ 跳过零位,迭代次数从 64 次降为 popcount(lhs) 次。GCM 吞吐量提升约 2 倍,收益显著。
问题
1. translate_pclmulqdq 128 位路径缺少寄存器别名保护
256 位路径在 s0 == s1 时会显式创建副本:
if (s0 == s1) {
src1_copy = ra_alloc_ftemp();
la_xvori_b(src1_copy, src1, 0);
}但 128 位路径 (translate_pclmulqdq) 直接传 dest 作为 v 参数。虽然当前实现中读取在写入之前(函数先 vpickve2gr_d 读 v,最后才 vinsgr2vr_d 写 d),运行结果正确,但与 256 位路径不一致。建议添加相同保护逻辑,防止后续重构引入 bug。
2. CTZ 循环体重复 3 次
emit_pclmulqdq_ctz、emit_vpclmulqdq_ctz_128、emit_vpclmulqdq_ctz_256 包含几乎相同的 ~20 行 CTZ 循环。建议抽取公共函数 emit_pclmul_ctz_loop(lhs, rhs, res_lo, res_hi)。
3. 提交标题大小写
按项目规范应为 LATX, opt: Inline PCLMULQDQ/VPCLMULQDQ(Inline 首字母大写)。
小问题
- 变量重命名
a→lhs、b→rhs、resl→res_lo清晰度提升明显 - 旧代码
cal_pclmulqdq有ra_free_temp(ctrlp),新代码去掉了——符合 LATX 依赖 TB 结束时统一释放的惯例,无泄漏
结论
通过。 性能收益真实且显著,算法正确,无阻塞性问题。建议后续迭代时抽取公共 CTZ 循环体并统一别名保护。
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Using simple ctz loop to accel.
Not using openssl's 4bit table-lookup, which have larger latency but higher gcm bandwidth.