Parenthesis and URLs #6121

Closed
1 of 2 tasks
kensanata opened this issue Dec 28, 2017 · 2 comments

kensanata commented Dec 28, 2017

Parentheses are legitimate characters in a URL. RFC 2396: «Data characters that are allowed in a URI ... "-" | "_" | "." | "!" | "~" | "*" | "'" | "(" | ")"». Yet Mastodon considers the following link to end with "background" instead of including the remaining ().html: https://coolguy.website/writing/the-future-will-be-technical/background().html
I know that Markdown uses parentheses to enclose a URL, but cutting links off at a parenthesis is problematic. I think we should include parentheses in the URL and exclude a closing parenthesis only if it is at the end of the status, or before whitespace, or before punctuation that is itself followed by whitespace or the end of the status.
Something like the following:
http://example.com/foo() includes all
http://example.com/foo().html includes all
http://example.com/foo(). excludes period
(see http://example.com/foo()) excludes last parenthesis
(See http://example.com/foo().) excludes last period and parenthesis
(see http://example.com/foo()). excludes last parenthesis and period
(see http://example.com/foo().html). excludes last parenthesis and period
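
One way to read the cases above is as a trimming pass over each candidate match. The following Python is a minimal sketch under that reading, not Mastodon's or twitter-text's actual code. The names `trim_url` and `extract_urls`, the set of trailing punctuation, and the rule that a closing parenthesis is dropped only when it has no matching opening parenthesis inside the URL are all assumptions, chosen to reproduce the listed outcomes.

```python
import re

# Assumption: a candidate URL is a scheme followed by a run of non-whitespace.
URL_CANDIDATE = re.compile(r"https?://\S+")

# Assumption: punctuation that is never kept at the very end of a URL.
TRAILING_PUNCTUATION = ".,;:!?"


def trim_url(candidate):
    """Strip trailing punctuation and unbalanced closing parentheses."""
    url = candidate
    while url:
        last = url[-1]
        if last in TRAILING_PUNCTUATION:
            # "foo()." -> "foo()"
            url = url[:-1]
        elif last == ")" and url.count(")") > url.count("("):
            # "foo())" -> "foo()": the extra ")" belongs to the surrounding text.
            url = url[:-1]
        else:
            break
    return url


def extract_urls(text):
    """Return every trimmed URL found in a status text."""
    return [trim_url(match.group(0)) for match in URL_CANDIDATE.finditer(text)]
```

With this sketch, `extract_urls("(see http://example.com/foo().html).")` returns `["http://example.com/foo().html"]`, and a balanced trailing `()` as in `http://example.com/foo()` is kept.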


  • I searched or browsed the repo’s other issues to ensure this is not a duplicate.
  • This bug happens on a tagged release and not on master (If you're a user, don't worry about this).

kensanata commented

From what I saw elsewhere, I guess this should be an issue for twitter-text. Something is wrong with these lines.

kensanata commented

The source code led me to RFC 3986, which reduced the list of unreserved characters. But the Path section says that a path consists of segments, and that segments consist of pchars, where pchar = unreserved / pct-encoded / sub-delims / ":" / "@", and parentheses are part of sub-delims.
Oh well. Now that I look at the code that generates the regular expression, I feel that maybe this isn't worth it. Uuaaagh. In this particular case, matching failed because there was nothing inside the balanced parentheses.
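
To make the grammar argument concrete, here is a small Python sketch of the RFC 3986 `pchar` rule quoted above, using the RFC's definitions of `unreserved`, `pct-encoded`, and `sub-delims`. It is only an illustration of the grammar, not the regular expression that twitter-text generates. The helper name `is_valid_segment` is hypothetical.

```python
import re

# RFC 3986: unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
UNRESERVED = r"A-Za-z0-9\-._~"
# RFC 3986: sub-delims = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="
SUB_DELIMS = r"!$&'()*+,;="
# RFC 3986: pct-encoded = "%" HEXDIG HEXDIG
PCT_ENCODED = r"%[0-9A-Fa-f]{2}"

# pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
PCHAR = rf"(?:[{UNRESERVED}{SUB_DELIMS}:@]|{PCT_ENCODED})"

# segment = *pchar
SEGMENT = re.compile(rf"{PCHAR}*")


def is_valid_segment(segment):
    """Hypothetical helper: True if the string is a valid RFC 3986 path segment."""
    return SEGMENT.fullmatch(segment) is not None
```

Under this grammar, `is_valid_segment("background().html")` is true, including the empty pair of parentheses, while a string containing a space or a `/` is not a single segment.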
