Message conversion and formatting #3

Closed
13 tasks done
ezio-melotti opened this issue Oct 25, 2020 · 2 comments
ezio-melotti commented Oct 25, 2020

This issue is about converting and formatting the content (text) of the bpo messages (not the issue metadata) before importing them into GitHub.

bpo messages are raw text with no formatting, whereas GitHub issues use Markdown. If messages are imported directly, special characters in the bpo messages might be wrongly interpreted as Markdown formatting, resulting in erroneous rendering.

Possible solutions:

  1. Import messages within code-block markup, to render it literally:
    • quick and easy solution, but the result looks ugly
    • SymPy used this approach (see e.g. this issue)
  2. Import messages as normal text, but escape special characters
    • can this be done reliably?
    • are there already existing tools that can do it?
  3. Detect and convert to Markdown links, code blocks, lists, etc.
    • can this be done reliably?
    • are there already existing tools that can do it?

Edit: I went with option 3. It's not perfect, but it seems to work well enough.

Other considerations:

  • On bpo, links to other issues, messages, PRs, PEPs, etc. are added at rendering time either by Roundup itself or by using regexes (see also the list of special links in the devguide).
    • #XXXX, issueXXXX, and issue XXXX refs should be replaced by bpo-XXXX, and possibly replaced again after the migration
    • msgXXXX and msg XXXX could be converted to markdown links to the corresponding bpo issue.
      • not ideal but should give enough context to locate the message on GH manually
    • fileXXXX and file XXXX are not used frequently and could be ignored
      • however a link to the file can be added in the message that attached it
    • PEPs can be left alone since autolinking can already turn them into links
    • GHXXXX, GH XXXX, PRXXXX, PR XXXX, pull request XXXX, BPOXXXX, BPO XXXX should all be hyphenated or the autolinking won't work
    • Old SVN refs (rXXXXX) link to https://hg.python.org/lookup/rXXXXX but are currently broken
      • These will be left unchanged
    • Files (Lib/somefile.py, Modules/somemodule.c, Doc/somedocfile.rst) can be converted to markdown links
    • Tracebacks are now within code blocks, so files in the tracebacks can't be converted into links
      • Tracebacks have been left unchanged
      • We could list them after the traceback, but probably it's not worth the effort
      • It might be interesting to have an action that does this down the line though
  • The same regexes can be used to convert all these links to Markdown.
  • Issue numbers can also be remapped from the bpo to the GH numbers during the same step.
    • There is no way to know the new GH number in advance
    • The transfer tool can rewrite references like #xxxx during the transfer but only for issues that have been transferred already
    • Converting them to bpo-xxxx prevents the rewrite and lets us use the bpo redirect added in "Add a page that redirects from bpo to GitHub" (#17)
  • We might want to preserve somewhere the original (raw) text.
    • We can leave this on bpo, otherwise we would have to duplicate all messages.
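
The reference rewriting described in the list above could be sketched roughly as follows. This is only an illustration: the pattern names, the helper functions, and the base URL (including the branch name) are assumptions, not the regexes the migration actually used.

```python
import re

# Hypothetical patterns for illustration; the real migration may differ.
# The lookbehind skips matches inside words and URLs.
ISSUE_REF = re.compile(r"(?<![\w/])(?:#|issue ?)(\d+)\b", re.IGNORECASE)
# GH1234, GH 1234, PR1234, PR 1234, BPO1234, BPO 1234 -> hyphenated form.
PREFIXED_REF = re.compile(r"\b(GH|PR|BPO) ?(\d+)\b")
# Repository paths such as Lib/somefile.py or Doc/somedocfile.rst.
FILE_PATH = re.compile(r"\b((?:Lib|Modules|Doc)/[\w./-]*\w)")

# Assumed base URL; the branch name is a guess.
BASE_URL = "https://github.com/python/cpython/blob/main/"


def normalize_refs(text):
    """Turn #N / issueN / issue N into bpo-N and hyphenate GH/PR/BPO refs."""
    text = ISSUE_REF.sub(r"bpo-\1", text)
    return PREFIXED_REF.sub(r"\1-\2", text)


def linkify_files(text, base_url=BASE_URL):
    """Wrap repository paths in Markdown links."""
    return FILE_PATH.sub(
        lambda m: f"[{m.group(1)}]({base_url}{m.group(1)})", text
    )
```

Both helpers would have to skip code blocks and tracebacks, since links are not rendered there.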

TODO:

  • Convert the messages to Markdown
  • Add links to issues/PRs/msg

@ammaraskar

  1. Import messages within code-block markup, to render it literally:

One nice aspect of this is that it matches how bpo issues are currently displayed: monospaced. Without adequate conversion of code blocks, as in (3), code snippets and places where alignment is important would look broken. (3) seems quite hard to implement though, since a lot of the formatting is ad hoc (I know I've personally sometimes kept code blocks on the same level as the text, and occasionally indented them with 2 or 4 spaces). It seems like (1) might be the way to go. If we really want (2), there are mature libraries like turndown.

On bpo, links to other issues, messages, PRs, PEPs, etc. are added at rendering time using regexes.

This is probably less of a concern with autolinks (https://docs.github.com/en/github/administering-a-repository/managing-repository-settings/configuring-autolinks-to-reference-external-resources), which we already have set up for bpo links on the CPython repo thanks to @Mariatta in python/core-workflow#361.

Issue numbers can also be remapped from the bpo to the GH numbers during the same step.

Depending on the resolution to #2, it might be nice to keep the links the same and have the bugs.python.org/issuexxx links redirect to the right GitHub page. But assuming we make the Roundup instances read-only or mirrored, remapping the issues is probably a good idea.

@ezio-melotti

We have:

We want:

  • to keep the plain text as plain, possibly non-monospaced, with markdown characters escaped
  • explicit links to be clickable
  • implicit links to be clickable
  • code, tracebacks, terminal sessions, output to be monospaced, possibly highlighted

It seems that:

  • <pre> or ```...``` can be used to show the code in a monospaced block:
    • the background looks different
    • explicit links like https://bugs.python.org/issue2771 are not clickable
    • Markdown links like [these](https://bugs.python.org/issue2771#msg154050) don't work either
    • and neither do references like #3
  • <samp> can be used to make the text monospaced:
  • All ASCII punctuation can be escaped
    • this will prevent things like __init__ turning into init
    • however this will break links too
  • File links in tracebacks will be lost if we use ```...``` to get syntax highlighting, but can be preserved if we use <samp> (with no highlighting)

If we want to implement option 2 from the first message, we could:

  • wrap each paragraph in <samp>
  • replace leading whitespace with &nbsp;
  • escape all punctuation (except in URLs)

This should make the messages look similar to bpo: monospaced, no fancy markup/highlight, with working links.
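
A minimal sketch of these three steps, assuming a simple `https?://` pattern is good enough to find URLs. The function names are hypothetical, and the URL pattern is deliberately crude: it also swallows punctuation that directly follows a link.

```python
import re
import string

URL = re.compile(r"https?://\S+")  # crude: also swallows trailing punctuation
PUNCT = re.compile("[" + re.escape(string.punctuation) + "]")


def _escape(text):
    # Backslash-escape every ASCII punctuation character.
    return PUNCT.sub(lambda m: "\\" + m.group(), text)


def escape_outside_urls(line):
    """Escape punctuation everywhere except inside URLs."""
    parts, pos = [], 0
    for match in URL.finditer(line):
        parts.append(_escape(line[pos:match.start()]))
        parts.append(match.group())  # keep the URL untouched
        pos = match.end()
    parts.append(_escape(line[pos:]))
    return "".join(parts)


def convert_paragraph(paragraph):
    """Wrap a paragraph in <samp>, preserving leading indentation."""
    lines = []
    for line in paragraph.splitlines():
        stripped = line.lstrip(" ")
        indent = "&nbsp;" * (len(line) - len(stripped))
        lines.append(indent + escape_outside_urls(stripped))
    return "<samp>" + "\n".join(lines) + "</samp>"
```

With this, `__init__` keeps its underscores because each one is escaped, while a URL on the same line is passed through unchanged.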

If we want to implement option 3, we could parse each paragraph independently and:

  • if it seems mostly text, only escape all punctuation (except in URLs)
  • if it looks like code, wrap it within ```...``` to have a monospaced font
    • this will break all links in the block, including ones in tracebacks
  • for code highlight, we could:
    • just use py by default for all blocks
    • use a simple heuristic (e.g. the block contains >>> or keywords like def) to detect Python, and use text otherwise
    • use Pygments to detect the language
      • only Python/C/rst should be allowed, the rest should be text
      • might be a bit slow
  • we might also be able to detect and allow some additional markup, like `...` or *...*, or perhaps even lists, and leave it unescaped
  • making this undoable and/or storing the raw message somewhere shouldn't be an issue, since we can always get the original messages from bpo if we ever need them
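
The per-paragraph approach could look something like the sketch below. The heuristics and function names are assumptions for illustration and would need tuning against real bpo messages. Escaping of plain-text paragraphs is left out here.

```python
import re

# Hypothetical heuristics; tune them against a real sample of bpo messages.
CODE_LINE = re.compile(
    r"^(?:>>> |\.\.\. |\$ |Traceback \(most recent call last\):| {4,}\S)",
    re.MULTILINE,
)
PYTHON_LINE = re.compile(
    r"^\s*(?:>>> |def \w+\(|class \w+|import \w+|from [\w.]+ import )",
    re.MULTILINE,
)


def looks_like_code(paragraph):
    return CODE_LINE.search(paragraph) is not None


def guess_language(paragraph):
    return "py" if PYTHON_LINE.search(paragraph) else "text"


def fence(paragraph):
    # Use a fence longer than any backtick run inside the paragraph.
    longest = max((len(run) for run in re.findall(r"`+", paragraph)), default=0)
    ticks = "`" * max(3, longest + 1)
    return f"{ticks}{guess_language(paragraph)}\n{paragraph}\n{ticks}"


def convert_message(message):
    out = []
    for paragraph in re.split(r"\n\s*\n", message.strip()):
        if looks_like_code(paragraph):
            out.append(fence(paragraph))
        else:
            out.append(paragraph)  # plain text: escaping would be applied here
    return "\n\n".join(out)
```

Because paragraphs are split on blank lines, a code snippet that contains blank lines ends up as several adjacent code blocks. Merging consecutive code paragraphs would be a natural refinement.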

I'll do some tests on a real world sample to see if we can reach a good compromise.
