Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Github] Improve formating of PR diffs in bot notifications #66118

Merged
merged 11 commits into from
Sep 15, 2023

Conversation

cor3ntin
Copy link
Contributor

@cor3ntin cor3ntin commented Sep 12, 2023

  • This avoid pinging folks on all issue when they got pinged on bugzilla eons ago
  • Avoid formatting bugs when there is html in the issue description
  • Truncate the list of files and the diff independently of each other. This avoids truncating cutting a file line in 2 and to cut in the middle of html markup. This is a fringe case but it does happen when people accidentally push weird branches conflicting on all the files.

Test: This is #66118 @cor3ntin

@MaskRay
Copy link
Member

MaskRay commented Sep 12, 2023

I wonder whether # needs to be escaped as well

@cor3ntin cor3ntin force-pushed the escape_at branch 2 times, most recently from b74a7f4 to 1ab5ca7 Compare September 13, 2023 05:26
@cor3ntin
Copy link
Contributor Author

cor3ntin commented Sep 13, 2023

I wonder whether # needs to be escaped as well

Yes, done!

# and the file list to 20kB each.
STAT_LIMIT = 20 * 1024
DIFF_LIMIT = 20 * 1024

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't in the PR description, can you add it there? (with some context for the "why" of the change)

Copy link
Collaborator

@joker-eph joker-eph left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LG, but I agree with @MaskRay about #, it is likely we don't want to escape it alone, but the sequence of #[1-2]+ maybe?

@cor3ntin
Copy link
Contributor Author

@joker-eph I'll rebase and try to be more selective in what we escape. escaping unconditionally probably does not hurt though

@MaskRay
Copy link
Member

MaskRay commented Sep 14, 2023

Unconditionally escaping # mitigates problems such as a stack trace accidentally references #1/#2/etc. I think for intentional references, people can use regular comments.

@joker-eph
Copy link
Collaborator

Unconditionally escaping # mitigates problems such as a stack trace accidentally references #1/#2/etc. I think for intentional references, people can use regular comments.

You missed my point: I commented that we should match # **followed by at least one number. The case you mentioned would be covered just fine.

@MaskRay
Copy link
Member

MaskRay commented Sep 14, 2023

Unconditionally escaping # mitigates problems such as a stack trace accidentally references #1/#2/etc. I think for intentional references, people can use regular comments.

You missed my point: I commented that we should match # **followed by at least one number. The case you mentioned would be covered just fine.

I'd argue that issue/PR links can be achieved with normal comments, not with bot comments. We should not allow #123 to link an issue in a bot comment, either. Linking to an issue in bot comment due to a commit message is probably not going to be useful. Users can edit descriptions to achieve their goals.

@cor3ntin
Copy link
Contributor Author

I'll rework that further, i found a way to improve escaping of html in code blocks

* This avoid pinging folks on all issue when they got pinged on
bugzilla eaons ago

* Avoid formating bugs when there is html in the issue description
* Escape PR description
* Trunkate the list of files if it's > 20K
* Color the diff on github
This way we can render < and & properly.

This only work if there is a new line before the 3 ticks.
@cor3ntin
Copy link
Contributor Author

Spoiler
+ You need a new line before the ``` 
- `[]<typename>(auto&)`

@cor3ntin cor3ntin changed the title [Github] Escape @ and html in the <details> block [Github] Improve formating of PR diffs in bot notifications Sep 14, 2023
@tstellar
Copy link
Collaborator

Can you test this out by running the script against this PR?

@cor3ntin
Copy link
Contributor Author

@llvm/pr-subscribers-github-workflow

Changes
  • This avoid pinging folks on all issue when they got pinged on bugzilla eons ago
  • Avoid formatting bugs when there is html in the issue description
  • Truncate the list of files and the diff independently of each other. This avoids truncating cutting a file line in 2 and to cut in the middle of html markup. This is a fringe case but it does happen when people accidentally push weird branches conflicting on all the files.

Test: This is #66118 @cor3ntin


Full diff: https://github.com/llvm/llvm-project/pull/66118.diff

1 Files Affected:

  • (modified) llvm/utils/git/github-automation.py (+47-16)
diff --git a/llvm/utils/git/github-automation.py b/llvm/utils/git/github-automation.py
index eac5816b5499f6a..8578a83262717ab 100755
--- a/llvm/utils/git/github-automation.py
+++ b/llvm/utils/git/github-automation.py
@@ -47,6 +47,21 @@ def _get_curent_team(team_name, teams) -> Optional[github.Team.Team]:
     return None
 
 
+def escape_description(str):
+    # https://github.com/github/markup/issues/1168#issuecomment-494946168
+    str = html.escape(str, False)
+    # '@' followed by alphanum is a user name
+    str = re.sub("@(?=\w+)", "@<!-- -->", str)
+    # '#' followed by digits is considered an issue number
+    str = re.sub("#(?=\d+\s)", "#<!-- -->", str)
+    return str
+
+
+def sanitize_markdown_code_block(str):
+    # remove codeblocks terminators
+    return re.sub("^\s*```\s*$", r"` ` `", str)
+
+
 class IssueSubscriber:
     @property
     def team_name(self) -> str:
@@ -67,12 +82,15 @@ def run(self) -> bool:
         if team.slug == "issue-subscribers-good-first-issue":
             comment = "{}\n".format(beginner_comment)
 
-        comment = (
-            f"@llvm/{team.slug}"
-            + "\n\n<details>\n"
-            + f"{self.issue.body}\n"
-            + "</details>"
-        )
+        body = escape_description(self.issue.body)
+
+        comment = f"""
+@llvm/{team.slug}
+
+<details>
+{body}
+</details>
+"""
 
         self.issue.create_comment(comment)
         return True
@@ -113,6 +131,11 @@ def run(self) -> bool:
             print(f"couldn't find team named {self.team_name}")
             return False
 
+        # GitHub limits comments to 65,536 characters, let's limit the diff
+        # and the file list to 20kB each.
+        STAT_LIMIT = 20 * 1024
+        DIFF_LIMIT = 20 * 1024
+
         # Get statistics for each file
         diff_stats = f"{self.pr.changed_files} Files Affected:\n\n"
         for file in self.pr.get_files():
@@ -125,35 +148,43 @@ def run(self) -> bool:
             if file.status == "renamed":
                 print(f"(from {file.previous_filename})")
             diff_stats += "\n"
-        diff_stats += "\n"
+            if len(diff_stats) > STAT_LIMIT:
+                break
 
         # Get the diff
         try:
-            patch = html.escape(requests.get(self.pr.diff_url).text)
+            patch = requests.get(self.pr.diff_url).text
         except:
             patch = ""
-        diff_stats += "\n<pre>\n" + html.escape(patch)
 
-        # GitHub limits comments to 65,536 characters, let's limit the diff to 20kB.
-        DIFF_LIMIT = 20 * 1024
+        patch = sanitize_markdown_code_block(patch)
+
         patch_link = f"Full diff: {self.pr.diff_url}\n"
         if len(patch) > DIFF_LIMIT:
             patch_link = f"\nPatch is {human_readable_size(len(patch))}, truncated to {human_readable_size(DIFF_LIMIT)} below, full version: {self.pr.diff_url}\n"
-            diff_stats = diff_stats[0:DIFF_LIMIT] + "...\n<truncated>\n"
-        diff_stats += "</pre>"
+            patch = patch[0:DIFF_LIMIT] + "...\n[truncated]\n"
         team_mention = "@llvm/{}".format(team.slug)
 
-        body = self.pr.body
+        body = escape_description(self.pr.body)
+        # Note: the comment is in markdown and the code below
+        # is sensible to line break
         comment = f"""
 {self.COMMENT_TAG}
 {team_mention}
-            
+
 <details>
 <summary>Changes</summary>
+
 {body}
---
+---
 {patch_link}
+
 {diff_stats}
+
+```diff
+{patch}
+```
+
 </details>
 """
 


def sanitize_markdown_code_block(str):
# remove codeblocks terminators
return re.sub("^\s*```\s*$", r"` ` `", str)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There could be code blocks that specify syntax in order to enable syntax highlighting, e.g.

```cpp
std::vector v;
```

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we only need to escape the closing ```

```cpp does not close a code block 

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I though you match both opening and closing triple backticks.
But is this going to work correctly for multiple code blocks in a message? Multiple unescaped openings might be interpreted as opening and closing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope - but re-reading the spect and doing more tests, we can use more backticks instead of escaping https://spec.commonmark.org/0.30/#fenced-code-block

That way we do not have to escape ` ``` `
@cor3ntin cor3ntin merged commit 1b18e98 into llvm:main Sep 15, 2023
3 checks passed
{body}
--
---
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change (three dashes) causes the last paragraph of the body to be rendered like a title, which looks awful, e.g.: the "After legalization is done" paragraph in this issue: #66567 (comment)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to add an extra new line before the dashes, I'll do that

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

@cor3ntin cor3ntin deleted the escape_at branch September 18, 2023 14:18
ZijunZhaoCCK pushed a commit to ZijunZhaoCCK/llvm-project that referenced this pull request Sep 19, 2023
* This avoid pinging folks on all issue when they got pinged on bugzilla
eons ago
* Avoid formatting bugs when there is html in the issue description
* Truncate the list of files and the diff independently of each other.
This avoids truncating cutting a file line in 2 and to cut in the middle
of html markup. This is a fringe case but it does happen when people
accidentally push weird branches conflicting on all the files.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants