Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Generic uri parsing and fixing trailing slash issue #2392

Merged
merged 6 commits into from
Mar 22, 2017

Conversation

tbolender
Copy link
Contributor

I wrote a structure for uri parsing to solve #2265 in more elegant way. To supported a new uri type, a class implementing the UriParser interface has to be created. To be applied during HTML generation, this class needs to be "registered" in HtmlConverter together with all matching uri schemes.

To fix #1223, I rewrote the parsing of http uris including support for IPv6 addresses. I created a couple tests, but of course I maybe could have missed something. So do not hesitate to comment if you notice something.

Copy link
Member

@cketti cketti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks very promising. There are a few code style issues. See https://github.com/k9mail/k-9/wiki/CodeStyle

class BitcoinUriParser implements UriParser {
private static final Pattern BITCOIN_URI_PATTERN =
Pattern.compile("bitcoin:[1-9a-km-zA-HJ-NP-Z]{27,34}(\\?[a-zA-Z0-9$\\-_.+!*'(),%:@&=]*)?",
Pattern.CASE_INSENSITIVE);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why case-insensitive?

return startPos;
}

String linkifiedUri = String.format("<a href=\"%1$s\">%1$s</a>", matcher.group());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The following avoids having to parse the format string and creating a temporary string.

String bitcoinUri = matcher.group();
outputBuffer.append("<a href=\"")
        .append(bitcoinUri)
        .append("\">")
        .append(bitcoinUri)
        .append("</a>");

Come to think of it. We also need to encode at least & in the href attribute.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be job of the UriParser or the converter?

SUPPORTED_URIS.put("bitcoin:", new BitcoinUriParser());
SUPPORTED_URIS.put("http:", new HttpUriParser());
SUPPORTED_URIS.put("https:", new HttpUriParser());
SUPPORTED_URIS.put("rtsp:", new HttpUriParser());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needlessly creates three HttpUriParser instances.

@@ -431,23 +393,35 @@ protected static String getQuoteColor(final int level) {
* @param outputBuffer Buffer to append linked text to.
*/
protected static void linkifyText(final String text, final StringBuffer outputBuffer) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please extract this to a separate class to simplify testing.

while (matcher.find(currentPos)) {
int startPos = matcher.start();

// Append all text in between
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We try to avoid comments because they're easily out of date when the code is changed but not the comments. It also encourages writing more readable code.

To make this more readable you could change it to

String textBeforeMatch = text.substring(currentPos, startPos);
outputBuffer.append(textBeforeMatch);

*
* @return Position of first character after @ sign.
*/
private int matchUserInfoIfAvailable(String text, int startPos, int authorityEnd) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not a public API. We don't need JavaDoc for internal methods.

* @param outputBuffer Buffer where linkified variant of uri is written to.
* @return Index where parsed uri ends (first non-uri letter). Should be startPos or smaller if no valid uri was found.
*/
int linkifyUri(final String text, int startPos, final StringBuffer outputBuffer);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's no need for final in interfaces.

@@ -19,7 +19,8 @@
@Config(manifest = Config.NONE)
public class HtmlConverterTest {
// Useful if you want to write stuff to a file for debugging in a browser.
private static final boolean WRITE_TO_FILE = Boolean.parseBoolean(System.getProperty("k9.htmlConverterTest.writeToFile", "false"));
private static final boolean WRITE_TO_FILE =
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please revert the unrelated changes in this file.

@@ -207,4 +216,33 @@ public void testLinkifyBitcoinAndHttpUri() {
"http://example.com/" +
"</a>", outputBuffer.toString());
}

@Test
public void testHttpUris() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please only one test per test method.

@cketti
Copy link
Member

cketti commented Mar 15, 2017

The current code will probably also linkify the http URI in a string like myhttp://example.org. We probably shouldn't do that.

@tbolender
Copy link
Contributor Author

Sorry for the cody style issues, I imported the settings.jar as described. All mentioned points should be fixed.

@tbolender tbolender changed the title Generic uri parsing and fixing trailing trailing slash issue Generic uri parsing and fixing trailing slash issue Mar 16, 2017
Copy link
Contributor

@Valodim Valodim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 good job!

@cketti cketti self-assigned this Mar 17, 2017
@cketti
Copy link
Member

cketti commented Mar 20, 2017

I squashed all commits and cleaned up the code a bit hoping the feature was ready to merge. However, HttpUriParser ports one major assumption from HttpUrl that doesn't hold when scanning for URLs. Namely, it assumes all of the input is one URL. This leads to the code treating everything after "http://" up to the next "/" as authority. Which fails to support URLs where the authority doesn't end in a slash, e.g. the first URL in the following string will not be detected/linkified: "http://uri1.example.org some text http://uri2.example.org/path"

While trying to fix this I noticed that HttpUriParser supports internationalized domain names (IDN) and read up on that topic. My takeaway was "this is complicated". I think as a first step we shouldn't attempt to linkify IRIs (Internationalized Resource Identifiers), but limit ourselves to traditional URIs.

Fixing the detection of valid http URLs surrounded by text is still something that needs to be done.

@tbolender
Copy link
Contributor Author

Thanks for taking the time. I dropped the IDN detection and allowing only simple domain names now. I also changed the parsing to a more greedy approach, now successfully detecting your example.

@@ -18,6 +16,8 @@
class HttpUriParser implements UriParser {
// This string represent character group sub-delim as described in RFC 3986
private static final String SUB_DELIM = "!$&'()*+,;=";
private static final Pattern DOMAIN_PATTERN =
Pattern.compile("\\w([\\w-]*\\w)*(\\.\\w([\\w-]*\\w)*)*(:(\\d{0,5}))?");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately \w also includes the underscore which is not valid in host names.

You can use non-capturing groups by having ?: as first characters inside the parentheses. That'll make it easier to later get to the content you do want to capture.

Example: https://github.com/itiboi/k-9/blob/cf9c3d078e6e7296f16d8d8a29905047eeeec36b/k9mail/src/main/java/com/fsck/k9/message/html/UriLinkifier.java#L17

if (!tryMatchDomainName(text, currentPos, authorityEnd) &&
!tryMatchIpv4Address(text, currentPos, authorityEnd, true) &&
!tryMatchIpv6Address(text, currentPos, authorityEnd)) {
int matchedAuthorityEnd = Math.max(tryMatchDomainName(text, currentPos),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The whole Math.max() business makes this super hard to read. There's also no need to attempt another match if one of the methods was successful. So I suggest to extract all of this to a separate method and then to check the return value after each call to tryMatch*() and return early if a match was found.

int userInfoEnd = text.indexOf('@', startPos);
if (userInfoEnd != -1 && userInfoEnd < authorityEnd) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

authorityEnd is still a useful upper bound that can be used to avoid useless work.

@tbolender
Copy link
Contributor Author

Thanks for the hint with ?: and the underscore, totally forgot about that.

@cketti cketti merged commit 32212a4 into thunderbird:master Mar 22, 2017
@cketti
Copy link
Member

cketti commented Mar 22, 2017

Awesome. Thanks a lot!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Incorrect URL detection (final slash)
3 participants