Allow MathML Core tags in sanitized post content #19806

Closed
4e554c4c opened this issue Nov 5, 2022 · 12 comments
Labels
suggestion Feature suggestion

Comments

@4e554c4c

4e554c4c commented Nov 5, 2022

Pitch

MathML Core is a standard language for describing the structure and content of mathematical expressions in browsers. Unlike TeX-family languages, MathML is not an entire typesetting language reliant on macro processing. Instead, it relies on Unicode and other features of browser engines to display mathematics efficiently.

This specification is well supported in Firefox and Safari, and is currently being shipped in Chrome (it is no longer behind a feature flag in Chrome v109, the current beta).

I propose that, behind an enabled-by-default feature, MathML tags should not be sanitized from the content body of activities. This allows mathematical posts from across the fediverse to retain their mathematical content and to be rendered in browsers that support it. This will render poorly on older versions of Google Chrome, but it will look no worse than mathematical content already does there.

See thread for more info: https://types.pl/@pounce/109286683477125171

Motivation

Previous suggestions to bundle MathJax (#822) were turned down due to the performance loss for all users, when only a few instances/users care about mathematics. However, this will not be the case if the instances which produce mathematics render it to MathML themselves! This puts the majority of the computational effort and JavaScript bloat on instances which care about it, while other instances will be able to simply consume the content in their web browser.

ActivityPub example

For example, a math-based instance could produce the activity

{
  "@context": ["https://www.w3.org/ns/activitystreams", {"@language": "en"}],
  "type": "Note",
  "id": "http://postparty.example/p/2415",
  "content": "How do I solve <math><mrow><msubsup><mo movablelimits=\"false\">∫</mo><mn>0</mn><mn>1</mn></msubsup><msup><mi>x</mi><mn>3</mn></msup></mrow></math>?",
  "source": {
    "content": "How do I solve \\(\\int_0^1 x^3\\)?",
    "mediaType": "text/markdown+math"}
}

The user would compose the "source/content" of the post, containing LaTeX math syntax, which would be rendered to MathML Core. This could then be rendered on any Mastodon instance that does not scrub the math tags.
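As a rough illustration of the producing side, here is a minimal Python sketch that assembles a Note shaped like the one above. The function name `build_note` is hypothetical, and the TeX-to-MathML rendering step is assumed to have already happened elsewhere; nothing here reflects how any existing instance actually builds its activities.

```python
import json


def build_note(note_id: str, source_text: str, rendered_html: str) -> dict:
    """Assemble a Note whose content is rendered MathML and whose source keeps the TeX.

    Hypothetical helper: the renderer that produced rendered_html is not shown.
    """
    return {
        "@context": ["https://www.w3.org/ns/activitystreams", {"@language": "en"}],
        "type": "Note",
        "id": note_id,
        "content": rendered_html,
        "source": {
            "content": source_text,
            "mediaType": "text/markdown+math",
        },
    }


note = build_note(
    "http://postparty.example/p/2415",
    "How do I solve \\(\\int_0^1 x^3\\)?",
    "How do I solve <math><mrow><msup><mi>x</mi><mn>3</mn></msup></mrow></math>?",
)
# ensure_ascii=False keeps characters such as the integral sign readable in the JSON.
serialized = json.dumps(note, ensure_ascii=False, indent=2)
```

A receiving instance would only need to leave the `<math>` subtree in "content" intact; it never has to interpret the TeX in "source".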

This is important, since several math-based instances exist (such as https://mathstodon.xyz and https://types.pl) and produce math-based posts. However, when these posts federate to other instances they cannot be rendered, since other instances do not have MathJax installed. Thus, a more portable version based on open standards is necessary.

@4e554c4c 4e554c4c added the suggestion Feature suggestion label Nov 5, 2022
@4e554c4c
Author

4e554c4c commented Nov 6, 2022

I started work on this, and thinking about Mastodon's philosophy a bit, I think it'd be good to scrub all length-percentage attributes from MathML tags. This means that e.g. height and width attributes should be removed, since they don't convey exact semantic info. Of course this might make some math look less good, but I still think it's what's desired, so that arbitrary instances can't show text on another screen or something.

An exception to this, however, is probably <mfrac linethickness="0">, which is a fraction without a line. This is used in the MathML specification to denote the binomial coefficient and thus definitely has semantic value.
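The rule described here could be sketched roughly as follows. This is a Python illustration using the standard library's ElementTree, not the Ruby sanitizer configuration from the actual patch; the attribute list `LENGTH_ATTRIBUTES` and the function name `scrub_length_attributes` are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Assumed set of length-valued attributes to drop; not taken from the actual patch.
LENGTH_ATTRIBUTES = {
    "width", "height", "depth", "lspace", "rspace",
    "minsize", "maxsize", "linethickness",
}


def scrub_length_attributes(mathml: str) -> str:
    """Drop length-valued attributes, keeping linethickness="0" on <mfrac>."""
    root = ET.fromstring(mathml)
    for element in root.iter():
        # Copy the attribute names first, since we delete while looping.
        for name in list(element.attrib):
            if name not in LENGTH_ATTRIBUTES:
                continue
            keeps_meaning = (
                element.tag == "mfrac"
                and name == "linethickness"
                and element.attrib[name].strip() == "0"
            )
            if not keeps_meaning:
                del element.attrib[name]
    return ET.tostring(root, encoding="unicode")
```

With this sketch, a binomial coefficient written as `<mfrac linethickness="0">` survives unchanged, while `<mspace width="100em"/>` or `<mfrac linethickness="5px">` lose their length attributes.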

@christianp

I was thinking about this yesterday. I'm glad someone else is taking it on! But I don't think this solves the problem of rendering math in native apps.

@4e554c4c
Author

4e554c4c commented Nov 7, 2022

Apps already have to render a subset of HTML (the content body in existing activities). This would just require them to render MathML in addition, if they want to support math.

Really, if we want different apps, instances, etc. to agree on math formatting, it needs to be presented in a standard way. This isn't currently possible because different instances set up math differently (e.g. some use \( and \) to delimit inline math, and others use $). I would argue it wouldn't be good to add this to an app, since apps have to be instance-generic.

But if instances produce MathML, apps and other instances could consume it generically.

@4e554c4c
Author

4e554c4c commented Nov 7, 2022

Now that I think about it, the confusion might be about the scope of this issue, which concerns "sanitization".

There's a difference between fedi software (frontends, apps, and relays) producing content and being able to display/replicate it. This issue only considers the latter, for the "mastodon" project. If you are worried about apps, it would be good to open a similar issue for them. First, though, it would be good for an instance to actually be producing MathML, and I will be working on that after this is complete.

@christianp

Yes, this is definitely an improvement on the current situation.

4e554c4c added a commit to 4e554c4c/mastodon that referenced this issue Nov 10, 2022
See mastodon#19806 for more info.

Test Plan:
----------
```
$ RAILS_ENV=test bundle exec rspec spec/lib/sanitize_config_spec.rb

Randomized with seed 19230
 11/11 |========================================================================================== 100 ===========================================================================================>| Time: 00:00:00

Finished in 0.07389 seconds (files took 1.67 seconds to load)
11 examples, 0 failures

Randomized with seed 19230
Coverage report generated for RSpec to /home/pounce/programming/mastodon/coverage. 1343 / 35156 LOC (3.82%) covered.
```
observed 100% code coverage of lib/sanitize_ext/sanitize_config.rb.

closes mastodon#19806
@nightpool
Member

nightpool commented Nov 12, 2022

The core Mastodon project has never been interested in introducing rich-text formatting into posts, because it complicates the UI and adds many additional concerns when compared to the current plain-text nature of Mastodon posts. (For example, rendering support for MathML would be a huge burden for native mobile apps that do not have access to a browser implementation to rely on). If we ever decided to add rich text formatting, there would be many other lower-hanging fruit to support, such as bold, italics, etc, that are much more likely to have wider, cross-platform support. However, we currently don't believe rich-text formatting matches Mastodon's model well and is better suited for other software / clients to implement.

However, if you wanted to write code for transforming incoming MathML content from other servers into a plain-text equivalent, to preserve semantics, then I believe the current project policy is that we would consider it.

While that may be a better step forward from a compatibility standpoint, I think there are many logistical/practicality challenges to handle there, especially since users rarely author MathML markup manually and instead are more accustomed to having it produced from e.g. pseudo-TeX. I'm also not sure whether MathML has any support for "round-tripping" source text like this, which would probably be necessary to implement this well.

@nightpool nightpool reopened this Nov 12, 2022
@nightpool
Member

(sorry, didn't mean to close, happy to leave this issue open to consolidate discussion on alternatives)

@4e554c4c
Author

Thanks for the response. I've been somewhat expecting this, since MathML can be used to produce rich text even if it is not designed to.
MathML can totally "round trip" text. For example,

<math>
  <semantics>
    <mfrac>
      <mn>1</mn>
      <mn>2</mn>
    </mfrac>
    <annotation encoding="application/x-tex">\frac{1}{2}</annotation>
  </semantics>
</math>

provides the source TeX of the provided MathML.
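To make the round trip concrete, here is a small Python sketch that pulls the TeX annotation back out of such markup using the standard library's ElementTree. The function name `tex_source` is hypothetical, and the sketch assumes the markup is well-formed XML without a namespace prefix.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def tex_source(mathml: str) -> Optional[str]:
    """Return the TeX annotation embedded in a MathML fragment, if there is one."""
    root = ET.fromstring(mathml)
    for annotation in root.iter("annotation"):
        if annotation.get("encoding") == "application/x-tex":
            return (annotation.text or "").strip()
    return None


example = r"""
<math>
  <semantics>
    <mfrac>
      <mn>1</mn>
      <mn>2</mn>
    </mfrac>
    <annotation encoding="application/x-tex">\frac{1}{2}</annotation>
  </semantics>
</math>
"""
print(tex_source(example))  # prints \frac{1}{2}
```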

I have a bit of a philosophical question though:
Why doesn't Mastodon just use the activity "source" field instead? This would show the TeX from my example.
I expect the answer is "to preserve links and show hashtags and emojis accurately", but it is silly that this means every post must be scrubbed.

@4e554c4c
Author

4e554c4c commented Nov 13, 2022

It might be a struggle to render MathML without an annotation, but since we're trying to sanitize chaotic HTML, I think it's safe to assume that "well-behaved" incoming MathML has a first node of <semantics> containing either an annotation with encoding="application/x-tex" or encoding="text/plain", or an annotation-xml node with encoding="text/html" or encoding="application/xhtml+xml", which needs to be scrubbed.
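One way to read this assumption as a plain-text fallback is sketched below in Python. The preference order in `PREFERRED_ENCODINGS` and the function name `plain_text_fallback` are assumptions for illustration, and the last-resort text extraction is lossy (a fraction 1/2 without an annotation would collapse to "12"), so this is a sketch of the idea rather than a proposed implementation.

```python
import xml.etree.ElementTree as ET

# Assumed preference order: TeX first, then plain text.
PREFERRED_ENCODINGS = ("application/x-tex", "text/plain")
ANNOTATION_TAGS = ("annotation", "annotation-xml")


def _text_of(element: ET.Element) -> str:
    """Concatenate the text nodes of an element, ignoring layout whitespace."""
    return "".join(piece.strip() for piece in element.itertext())


def plain_text_fallback(mathml: str) -> str:
    """Pick a plain-text stand-in for a MathML fragment."""
    root = ET.fromstring(mathml)
    first = root[0] if len(root) else None
    if first is None or first.tag != "semantics":
        # Not the "well-behaved" shape: fall back to the raw text content.
        return _text_of(root)
    for encoding in PREFERRED_ENCODINGS:
        for annotation in first.findall("annotation"):
            if annotation.get("encoding") == encoding and annotation.text:
                return annotation.text.strip()
    # No usable annotation: use only the presentation child, so that
    # annotation-xml payloads (HTML/XHTML) are dropped entirely.
    presentation = first[0] if len(first) else None
    if presentation is not None and presentation.tag not in ANNOTATION_TAGS:
        return _text_of(presentation)
    return ""
```

The annotation-xml branches are never read, which is one way of "scrubbing" them: their HTML or XHTML payload simply never reaches the output.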

@nightpool
Member

nightpool commented Nov 20, 2022

@4e554c4c the "source" field is in whatever language the user happens to author it in—the spec makes no guarantees about it even being human readable. It's purely designed for round-tripping content when edited by multiple client applications, it's not appropriate for display.

@alfredr

alfredr commented Nov 24, 2022

MathML is frequently inadequate in practice. Regardless, isn't rendering a client-side concern? It seems like it would be a nice feature for the web client to offer to inline a polyfill for rendering math text (perhaps with an option to disable it).

But as far as the consistency of the rendering goes, is this really in scope beyond its impact on the layout? (Which could be addressed by overflow rules.)

Bigger picture: LaTeX is supported in a lot of places. It's a common Markdown extension that's supported here on GitHub, for example:

$$\oint \vec{E}\cdot d\vec{A}= \frac{q_{\mathrm{enc}}}{\varepsilon_0}$$

It feels like this could dovetail with something like #18958

4e554c4c added a commit to 4e554c4c/mastodon that referenced this issue Sep 17, 2023
See mastodon#19806 for more info.

Test Plan:
----------
```
$ RAILS_ENV=test bundle exec rspec spec/lib/sanitize_config_spec.rb -f d
Randomized with seed 26282

Sanitize::Config
  ::MASTODON_OUTGOING
    keeps a with href and rel tag, not adding to rel or target if url is local
    behaves like common HTML sanitization
      removes a with unsupported scheme in href
      removes a with unparsable href
      keeps math
      keeps ul
      removes a without href and only keeps text content
      removes a without href
      keeps a with href
      keeps a with translate="no"
      removes "translate" attribute with invalid value
      keeps h1
      does not re-interpret HTML when removing unsupported links
      keeps title in abbr
      keeps start and reversed attributes of ol
      keeps a with supported scheme and no host
      correctly sanitizes linethickness

Finished in 0.61166 seconds (files took 4.76 seconds to load)
16 examples, 0 failures

Randomized with seed 26282
```
observed 100% code coverage of lib/sanitize_ext/sanitize_config.rb.

See mastodon#19806, glitch-soc#1432
ionathanch pushed a commit to ralsei/types.pl that referenced this issue Sep 17, 2023
See mastodon#19806 for more info.

@4e554c4c
Author

Closing in favor of #26943


4 participants