[MLIR][Pygments] Refine the pygments MLIR lexer #166406
@llvm/pr-subscribers-mlir
Author: Twice (PragmaTwice)
Changes
Recently, the MLIR website added API documentation for the Python bindings generated via Sphinx (https://mlir.llvm.org/python-bindings/). In llvm/mlir-www#245, I introduced the Pygments lexer from the MLIR repository to enable syntax highlighting for MLIR code blocks in these API docs.
However, since the existing Pygments lexer was fairly minimal, it didn't fully handle all aspects of the MLIR syntax, leading to imperfect highlighting in some cases. In this PR, I used ChatGPT to rewrite the lexer by combining it with the TextMate grammar for MLIR (https://github.com/llvm/llvm-project/blob/main/mlir/utils/textmate/mlir.json). After some manual adjustments, the results look quite good, so I'm submitting this to improve the syntax highlighting experience in the Python bindings API documentation.
Full diff: https://github.com/llvm/llvm-project/pull/166406.diff
1 Files Affected:
diff --git a/mlir/utils/pygments/mlir_lexer.py b/mlir/utils/pygments/mlir_lexer.py
index 179a058e9110c..ebe29e083387c 100644
--- a/mlir/utils/pygments/mlir_lexer.py
+++ b/mlir/utils/pygments/mlir_lexer.py
@@ -2,37 +2,129 @@
# See https://llvm.org/LICENSE.txt for license information.
# SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
-from pygments.lexer import RegexLexer
+from pygments.lexer import RegexLexer, bygroups, include, combined
from pygments.token import *
+import re
class MlirLexer(RegexLexer):
+ """Pygments lexer for MLIR.
+
+ This lexer focuses on accurate tokenization of common MLIR constructs:
+ - SSA values (%%... / %...)
+ - attribute and type aliases (#name =, !name =)
+ - types (builtin and dialect types, parametric types)
+ - attribute dictionaries and nested containers to a reasonable depth
+ - numbers (ints, floats with exponents, hex)
+ - strings with common escapes
+ - line comments (// ...)
+ - block labels (^foo) and operations
+ """
+
name = "MLIR"
aliases = ["mlir"]
filenames = ["*.mlir"]
+ flags = re.MULTILINE
+
tokens = {
"root": [
- (r"%[a-zA-Z0-9_]+", Name.Variable),
- (r"@[a-zA-Z_][a-zA-Z0-9_]+", Name.Function),
- (r"\^[a-zA-Z0-9_]+", Name.Label),
- (r"#[a-zA-Z0-9_]+", Name.Constant),
- (r"![a-zA-Z0-9_]+", Keyword.Type),
- (r"[a-zA-Z_][a-zA-Z0-9_]*\.", Name.Entity),
- (r"memref[^.]", Keyword.Type),
- (r"index", Keyword.Type),
- (r"i[0-9]+", Keyword.Type),
- (r"f[0-9]+", Keyword.Type),
+ # Comments
+ (r"//.*?$", Comment.Single),
+
+ # Attribute alias definition: #name =
+ (r"^\s*(#[_A-Za-z0-9\$\-\.]+)(\b)(\s*=)",
+ bygroups(Name.Constant, Text, Operator)),
+
+ # Type alias definition: !name =
+ (r"^\s*(![_A-Za-z0-9\$\-\.]+)(\b)(\s*=)",
+ bygroups(Keyword.Type, Text, Operator)),
+
+ # SSA values (results, uses) - allow many characters MLIR uses
+ (r"%[%_A-Za-z0-9\.\$:\-]+", Name.Variable),
+
+ # attribute refs, constants and named attributes
+ (r"#[_A-Za-z0-9\$\-\.]+\b", Name.Constant),
+
+ # symbol refs / function-like names
+ (r"@[_A-Za-z][_A-Za-z0-9\$\-\.]*\b", Name.Function),
+
+ # blocks
+ (r"\^[A-Za-z0-9_\$\.\-]+", Name.Label),
+
+ # types by exclamation or builtin names
+ (r"![_A-Za-z0-9\$\-\.]+\b", Keyword.Type),
+ (r"\b(bf16|f16|f32|f64|f80|f128|index|none|(u|s)?i[0-9]+)\b", Keyword.Type),
+
+ # container-like dialect types (tensor<...>, memref<...>, vector<...>)
+ (r"\b(complex|memref|tensor|tuple|vector)\s*(<)", bygroups(Keyword.Type, Punctuation), 'angled-type'),
+
+ # affine constructs
+ (r"\b(affine_map|affine_set)\b", Keyword.Reserved),
+
+ # common builtin operators / functions inside affine_map
+ (r"\b(ceildiv|floordiv|mod|symbol)\b", Name.Builtin),
+
+ # operation definitions with assignment: %... = op.name
+ (r"^\s*(%[\%_A-Za-z0-9\:\,\s]+)\s*(=)\s*([A-Za-z0-9_\.\$\-]+)\b",
+ bygroups(Name.Variable, Operator, Name.Function)),
+
+ # operation name without result
+ (r"^\s*([A-Za-z0-9_\.\$\-]+)\b(?=[^<:])", Name.Function),
+
+ # identifiers / bare words
+ (r"\b[_A-Za-z][_A-Za-z0-9\.-]*\b", Name.Other),
+
+ # numbers: hex, float (with exponent), integer
+ (r"\b0x[0-9A-Fa-f]+\b", Number.Hex),
+ (r"\b([0-9]+(\.[0-9]*)?|\.[0-9]+)([eE][+-]?[0-9]+)?\b", Number.Float),
+ (r"\b[0-9]+\b", Number.Integer),
+
+ # strings
+ (r'"', String.Double, 'string'),
+
+ # punctuation and arrow-like tokens
+ (r"->|>=|<=|\>=|\<=|\->|\=>", Operator),
+ (r"[()\[\]{}<>,.:=]", Punctuation),
+
+ # operators
+ (r"[-+*/%]", Operator),
+ ],
+
+ # string state with common escapes
+ 'string': [
+ (r'\\[ntr"\\]', String.Escape),
+ (r'[^"\\]+', String.Double),
+ (r'"', String.Double, '#pop'),
+ ],
+
+ # angled-type content (simple nested handling)
+ 'angled-type': [
+ # match nested '<' and '>'
+ (r"<", Punctuation, '#push'),
+ (r">", Punctuation, '#pop'),
+
+ # dimensions like 3x or 3x3x... and standalone numbers:
+ # - match numbers that are followed by an 'x' (dimension separator)
+ (r"([0-9]+)(?=(?:[xX]))", Number.Integer),
+ # - match bare numbers (sizes)
(r"[0-9]+", Number.Integer),
- (r"[0-9]*\.[0-9]*", Number.Float),
- (r'"[^"]*"', String.Double),
- (r"affine_map", Keyword.Reserved),
- # TODO: this should be within affine maps only
- (r"\+-\*\/", Operator),
- (r"floordiv", Operator.Word),
- (r"ceildiv", Operator.Word),
- (r"mod", Operator.Word),
- (r"()\[\]<>,{}", Punctuation),
- (r"\/\/.*\n", Comment.Single),
- ]
+ # dynamic dimension '?'
+ (r"\?", Name.Constant),
+
+ # the 'x' dimension separator (treat as punctuation)
+ (r"[xX]", Punctuation),
+
+ # element / builtin types inside angle brackets (no word-boundary)
+ (r"(?:bf16|f16|f32|f64|f80|f128|index|none|(?:[us]?i[0-9]+))",
+ Keyword.Type),
+
+ # also allow nested container-like types to be recognized
+ (r"\b(complex|memref|tensor|tuple|vector)\s*(<)",
+ bygroups(Keyword.Type, Punctuation), 'angled-type'),
+
+ # fall back to root rules for anything else
+ include('root'),
+ ],
+
}
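Two of the removed rules in the diff above give a concrete sense of what "fairly minimal" meant in practice. The following is a minimal sketch using only Python's `re` module; the constant and variable names are illustrative and not part of the PR. It checks that the old operator pattern is a literal character sequence rather than a character class, and that the old symbol pattern needs at least two characters after `@`.

```python
import re

# Patterns copied from the removed (-) and added (+) lines of the diff.
OLD_OPERATOR = re.compile(r"\+-\*\/")   # matches only the literal text "+-*/"
NEW_OPERATOR = re.compile(r"[-+*/%]")   # matches any single operator character
OLD_SYMBOL = re.compile(r"@[a-zA-Z_][a-zA-Z0-9_]+")          # trailing "+" needs 2+ chars
NEW_SYMBOL = re.compile(r"@[_A-Za-z][_A-Za-z0-9\$\-\.]*\b")  # trailing "*" allows 1 char

# A lone "+" is not recognized by the old operator rule.
old_matches_plus = OLD_OPERATOR.match("+") is not None
new_matches_plus = NEW_OPERATOR.match("+") is not None

# A one-letter symbol name such as @f is not recognized by the old symbol rule.
old_matches_short_symbol = OLD_SYMBOL.fullmatch("@f") is not None
new_matches_short_symbol = NEW_SYMBOL.fullmatch("@f") is not None

print(old_matches_plus, new_matches_plus)
print(old_matches_short_symbol, new_matches_short_symbol)
```

This only exercises the regular expressions in isolation; inside Pygments the outcome also depends on rule order.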
✅ With the latest revision this PR passed the Python code formatter.
Since this is auto-generated, how can we evaluate the quality of the highlighting? Maybe you can post before/after pictures for a dozen examples drawn from the tests?
Ahh sure. Here are some screenshots from before and after the change. You can also preview the highlighted code online (the affine dialect, for example): Before: https://mlir.llvm.org/python-bindings/autoapi/mlir/dialects/_affine_ops_gen/index.html
ftynse
left a comment
Looks okay. Please double-check with the parser logic, I caught a bunch of discrepancies just by scanning the code.
mlir/utils/pygments/mlir_lexer.py
Outdated
(r"\b(affine_map|affine_set)\b", Keyword.Reserved),
# common builtin operators / functions inside affine_map
(r"\b(ceildiv|floordiv|mod|symbol)\b", Name.Builtin),
# operation definitions with assignment: %... = op.name
Why are these different from a separate SSA value, an equals sign, and an op without a result? Maybe this was intended to capture the %42:3 form, which is only allowed in a definition (as opposed to a use, which may only use the %42#2 form)?
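To make the two forms concrete, here is a small sketch of how the SSA-value and attribute-reference patterns from the diff as posted behave on them. It uses only Python's `re` module; the constant names are illustrative, and the sketch looks at the regular expressions in isolation rather than at the full lexer.

```python
import re

# Patterns copied from the added (+) lines of the diff.
SSA_VALUE = re.compile(r"%[%_A-Za-z0-9\.\$:\-]+")
ATTR_REF = re.compile(r"#[_A-Za-z0-9\$\-\.]+\b")

definition = "%42:3 = test.op"   # multi-result definition form
use = "%42#2"                    # result-index use form

# ":" is in the SSA character class, so the definition form is one token.
def_match = SSA_VALUE.match(definition)

# "#" is not, so the use form stops at "%42" ...
use_match = SSA_VALUE.match(use)
# ... and the remainder fits the attribute-reference pattern instead.
rest_match = ATTR_REF.match(use, use_match.end())

# Because ":" is allowed, a typed block argument keeps its colon.
typed_match = SSA_VALUE.match("%arg0: i32")

print(def_match.group(), use_match.group(), rest_match.group(), typed_match.group())
```

If these patterns are used unchanged, the `#2` suffix and the trailing colon may be worth checking against the parser's definition of an SSA identifier.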
I think it's for capturing the op name. So basically there are two cases:
- ^(%value) = (op.name) ...
- ^(op.name) ...
Both require the match to start at the beginning of a line, which reduces false positives.
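The two cases can be tried out directly with Python's `re` module. This is a rough sketch: the patterns are copied from the diff and the review context, while `op_name` is a hypothetical helper written for illustration and is not part of the lexer.

```python
import re

# Case 1: "%result = op.name ..." (pattern from the diff).
ASSIGNED_OP = re.compile(
    r"^\s*(%[\%_A-Za-z0-9\:\,\s]+)\s*(=)\s*([A-Za-z0-9_\.\$\-]+)\b",
    re.MULTILINE,
)
# Case 2: "op.name ..." with no result (pattern from the review context).
BARE_OP = re.compile(r"^(\s*)([A-Za-z0-9_\.\$\-]+)\b(?=[^<:])", re.MULTILINE)


def op_name(line):
    """Return the operation name at the start of a line, or None."""
    m = ASSIGNED_OP.match(line)
    if m:
        return m.group(3)
    m = BARE_OP.match(line)
    if m:
        return m.group(2)
    return None


print(op_name("%0 = arith.addi %a, %b : i32"))
print(op_name("  func.return %0 : i32"))
print(op_name("^bb0(%arg0: i32):"))
```

Because both patterns are anchored with `^`, a dotted name that appears in the middle of a line is left to the later, more general rules.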
# operation name without result
(r"^(\s*)([A-Za-z0-9_\.\$\-]+)\b(?=[^<:])", bygroups(Text, Name.Function)),
# identifiers / bare words
(r"\b[_A-Za-z][_A-Za-z0-9\.-]*\b", Name.Other),
I don't know how exactly leading \b is handled, does it require whitespace or some special punctuation?
\b stands for word boundary. e.g. r"abc" can match "xabcx", but r"\babc\b" cannot.
For example, if we add a rule (r"index", XXX), it can match index in something like create_index %a.
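A quick check of the `create_index` case with Python's `re` module (a minimal sketch; the variable names are illustrative):

```python
import re

plain = re.compile(r"index")
bounded = re.compile(r"\bindex\b")

text = "create_index %a"

# Without \b the pattern finds "index" inside the longer identifier.
plain_hit = plain.search(text)

# With \b it does not: "_" counts as a word character, so there is no
# boundary between "_" and "i".
bounded_hit = bounded.search(text)

# A standalone "index" is still found.
standalone_hit = bounded.search("%c0 = arith.constant 0 : index")

print(plain_hit is not None, bounded_hit is not None, standalone_hit is not None)
```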
I know it is a word boundary. I don't know what precisely it means here. Will it still match in memref<index>? In transform.foo.index?
- for memref<index>, yes, it will match, since index is a word separated by < and >.
- for transform.foo.index, if we look at this rule alone, it will match, but the preceding rules might match transform.foo.index as an operation name in advance, so it may not take effect.
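The interaction between `\b` and rule order can be sketched with a tiny first-match-wins scanner built on Python's `re` module. This is an assumption-laden reduction, not the real lexer: `RULES` holds only a small subset of the root rules (the container-type and operation-name rules are deliberately left out), and `scan` is a hypothetical stand-in for how a regex lexer tries rules in order at each position.

```python
import re

# A reduced subset of the root rules from the diff, in the same relative order.
RULES = [
    ("Keyword.Type",
     re.compile(r"\b(bf16|f16|f32|f64|f80|f128|index|none|(u|s)?i[0-9]+)\b")),
    ("Name.Other", re.compile(r"\b[_A-Za-z][_A-Za-z0-9\.-]*\b")),
    ("Punctuation", re.compile(r"[()\[\]{}<>,.:=]")),
    ("Text", re.compile(r"\s+")),
]


def scan(text):
    """Tokenize text by trying each rule in order at the current position."""
    pos, tokens = 0, []
    while pos < len(text):
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            if m:
                tokens.append((name, m.group()))
                pos = m.end()
                break
        else:
            # No rule matched: emit one character as an error token.
            tokens.append(("Error", text[pos]))
            pos += 1
    return tokens


print(scan("memref<index>"))
print(scan("transform.foo.index"))
```

In this reduced setup, `index` inside the angle brackets is reached on its own and gets the type token, while `transform.foo.index` is consumed whole by the bare-word rule because `.` is in its character class, so the type rule never sees the trailing `index`.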
bygroups(Keyword.Type, Punctuation),
"angled-type",
),
# fall back to root rules for anything else
if this is possible, we shouldn't need the special logic above, I think
Yup, but the parsing logic inside and outside the angle brackets is quite different. One of the reasons is stated here: #166406 (comment).
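The linked comment is not reproduced here, so the following is only one plausible illustration of the difference, based on the "(no word-boundary)" note in the diff. Inside a shape such as `4x4xf32`, every character is a word character, so a `\b`-guarded type pattern cannot find the element type. The sketch uses Python's `re` module with both type patterns copied from the diff; the constant names are illustrative.

```python
import re

# Root-state type pattern (with word boundaries).
ROOT_TYPE = re.compile(
    r"\b(bf16|f16|f32|f64|f80|f128|index|none|(u|s)?i[0-9]+)\b"
)
# angled-type-state pattern (no word boundaries).
ANGLED_TYPE = re.compile(
    r"(?:bf16|f16|f32|f64|f80|f128|index|none|(?:[us]?i[0-9]+))"
)

shape = "4x4xf32"  # the content of something like tensor<4x4xf32>

# There is no boundary between "x" and "f", so the root pattern finds nothing.
root_hit = ROOT_TYPE.search(shape)

# The boundary-free pattern finds the element type after the last "x".
angled_hit = ANGLED_TYPE.search(shape)

print(root_hit, angled_hit.group(), angled_hit.span())
```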
Thank you for the review. I'll merge it soon if there are no more comments : )
Recently, the MLIR website added API documentation for the Python bindings generated via Sphinx ([https://mlir.llvm.org/python-bindings/](https://mlir.llvm.org/python-bindings/)). In [https://github.com/llvm/mlir-www/pull/245](https://github.com/llvm/mlir-www/pull/245), I introduced the Pygments lexer from the MLIR repository to enable syntax highlighting for MLIR code blocks in these API docs. However, since the existing Pygments lexer was fairly minimal, it didn’t fully handle all aspects of the MLIR syntax, leading to imperfect highlighting in some cases. In this PR, I used ChatGPT to rewrite the lexer by combining it with the TextMate grammar for MLIR ([https://github.com/llvm/llvm-project/blob/main/mlir/utils/textmate/mlir.json](https://github.com/llvm/llvm-project/blob/main/mlir/utils/textmate/mlir.json)). After some manual adjustments, the results look good—so I’m submitting this to improve the syntax highlighting experience in the Python bindings API documentation.





