Generic uri parsing and fixing trailing slash issue #2392

tbolender · 2017-03-15T01:14:29Z

I wrote a structure for URI parsing to solve #2265 in a more elegant way. To support a new URI type, a class implementing the UriParser interface has to be created. To be applied during HTML generation, this class needs to be "registered" in HtmlConverter together with all matching URI schemes.
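A minimal sketch of how such a structure could look. Only the `linkifyUri` signature is taken from UriParser.java in this PR; the `demo:` scheme, `DemoUriParser`, and `UriParserRegistry` are hypothetical names used for illustration, and escaping of the output is left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Signature as shown in UriParser.java in this PR.
interface UriParser {
    int linkifyUri(String text, int startPos, StringBuffer outputBuffer);
}

// Hypothetical parser for an invented "demo:" scheme (illustration only).
final class DemoUriParser implements UriParser {
    @Override
    public int linkifyUri(String text, int startPos, StringBuffer outputBuffer) {
        // Treat everything up to the next whitespace as part of the URI.
        int end = startPos;
        while (end < text.length() && !Character.isWhitespace(text.charAt(end))) {
            end++;
        }
        if (end - startPos <= "demo:".length()) {
            return startPos; // nothing follows the scheme: report "no valid URI"
        }
        String uri = text.substring(startPos, end);
        outputBuffer.append("<a href=\"").append(uri).append("\">")
                .append(uri).append("</a>");
        return end;
    }
}

// Hypothetical stand-in for the registration that the PR does in HtmlConverter.
final class UriParserRegistry {
    static final Map<String, UriParser> SUPPORTED_URIS = new LinkedHashMap<>();

    static {
        SUPPORTED_URIS.put("demo:", new DemoUriParser());
    }
}
```

The caller would look up the parser by the scheme it found in the text and pass the position where the scheme starts.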

To fix #1223, I rewrote the parsing of http URIs, including support for IPv6 addresses. I created a couple of tests, but of course I may have missed something, so do not hesitate to comment if you notice anything.

cketti

This looks very promising. There are a few code style issues. See https://github.com/k9mail/k-9/wiki/CodeStyle

cketti · 2017-03-15T04:43:25Z

k9mail/src/main/java/com/fsck/k9/message/html/BitcoinUriParser.java

+class BitcoinUriParser implements UriParser {
+    private static final Pattern BITCOIN_URI_PATTERN =
+            Pattern.compile("bitcoin:[1-9a-km-zA-HJ-NP-Z]{27,34}(\\?[a-zA-Z0-9$\\-_.+!*'(),%:@&=]*)?",
+                    Pattern.CASE_INSENSITIVE);


Why case-insensitive?
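One way to see why the flag matters: the character class deliberately leaves out `0`, `O`, `I` and `l`, and `Pattern.CASE_INSENSITIVE` lets the excluded letters match again through their other-case counterparts. The sketch below is an illustration, not the code in the PR; `Base58CaseDemo` and its field names are made up, and the scoped `(?i:...)` variant is just one option if only the scheme should be matched case-insensitively.

```java
import java.util.regex.Pattern;

final class Base58CaseDemo {
    private static final String ADDRESS = "[1-9a-km-zA-HJ-NP-Z]{27,34}";

    // No flag: the excluded letters O, I and l stay excluded.
    static final Pattern STRICT = Pattern.compile("bitcoin:" + ADDRESS);

    // Flag on the whole pattern: 'O' now matches via 'o' in m-z, and so on.
    static final Pattern LENIENT =
            Pattern.compile("bitcoin:" + ADDRESS, Pattern.CASE_INSENSITIVE);

    // Flag limited to the scheme; the address part stays case-sensitive.
    static final Pattern SCOPED = Pattern.compile("(?i:bitcoin:)" + ADDRESS);

    // Small helper so the example also runs on Java versions without String.repeat().
    static String repeat(char c, int count) {
        StringBuilder sb = new StringBuilder(count);
        for (int i = 0; i < count; i++) {
            sb.append(c);
        }
        return sb.toString();
    }
}
```

With these patterns, `"bitcoin:"` followed by 27 `O` characters is rejected by `STRICT` and `SCOPED` but accepted by `LENIENT`.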

cketti · 2017-03-15T04:52:31Z

k9mail/src/main/java/com/fsck/k9/message/html/BitcoinUriParser.java

+            return startPos;
+        }
+
+        String linkifiedUri = String.format("<a href=\"%1$s\">%1$s</a>", matcher.group());


The following avoids having to parse the format string and creating a temporary string.

String bitcoinUri = matcher.group();
outputBuffer.append("<a href=\"")
        .append(bitcoinUri)
        .append("\">")
        .append(bitcoinUri)
        .append("</a>");

Come to think of it, we also need to encode at least & in the href attribute.

Should this be the job of the UriParser or the converter?
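A rough sketch of what the encoding could look like, wherever it ends up living. `HrefEscaper`, `escapeAttribute` and `appendLink` are hypothetical names, and the set of escaped characters (`&`, `"`, `<`, `>`) is an assumption about what is needed here, not something decided in this thread.

```java
final class HrefEscaper {
    // Escapes the characters that are unsafe inside a double-quoted
    // HTML attribute value or in element text.
    static String escapeAttribute(String value) {
        StringBuilder escaped = new StringBuilder(value.length());
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            switch (c) {
                case '&':
                    escaped.append("&amp;");
                    break;
                case '"':
                    escaped.append("&quot;");
                    break;
                case '<':
                    escaped.append("&lt;");
                    break;
                case '>':
                    escaped.append("&gt;");
                    break;
                default:
                    escaped.append(c);
            }
        }
        return escaped.toString();
    }

    // Appends a link whose href and link text are both escaped.
    static void appendLink(StringBuffer outputBuffer, String uri) {
        String escaped = escapeAttribute(uri);
        outputBuffer.append("<a href=\"")
                .append(escaped)
                .append("\">")
                .append(escaped)
                .append("</a>");
    }
}
```

Putting this in a shared helper would let every UriParser produce consistently escaped output without each parser reimplementing it.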

cketti · 2017-03-15T04:53:13Z

k9mail/src/main/java/com/fsck/k9/message/html/HtmlConverter.java

+        SUPPORTED_URIS.put("bitcoin:", new BitcoinUriParser());
+        SUPPORTED_URIS.put("http:", new HttpUriParser());
+        SUPPORTED_URIS.put("https:", new HttpUriParser());
+        SUPPORTED_URIS.put("rtsp:", new HttpUriParser());


This needlessly creates three HttpUriParser instances.
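For illustration, one instance could be shared across the three schemes, assuming the parser keeps no per-call state. `HttpUriParserStub` and `SharedParserRegistration` are placeholders so the example is self-contained; the stub does not parse anything.

```java
import java.util.HashMap;
import java.util.Map;

// Signature as shown in UriParser.java in this PR.
interface UriParser {
    int linkifyUri(String text, int startPos, StringBuffer outputBuffer);
}

// Placeholder for the real HttpUriParser; it never matches.
final class HttpUriParserStub implements UriParser {
    @Override
    public int linkifyUri(String text, int startPos, StringBuffer outputBuffer) {
        return startPos;
    }
}

final class SharedParserRegistration {
    static final Map<String, UriParser> SUPPORTED_URIS = new HashMap<>();

    static {
        // One instance registered under all three schemes
        // (safe only if the parser is stateless).
        UriParser httpParser = new HttpUriParserStub();
        SUPPORTED_URIS.put("http:", httpParser);
        SUPPORTED_URIS.put("https:", httpParser);
        SUPPORTED_URIS.put("rtsp:", httpParser);
    }
}
```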

cketti · 2017-03-15T04:57:34Z

k9mail/src/main/java/com/fsck/k9/message/html/HtmlConverter.java

@@ -431,23 +393,35 @@ protected static String getQuoteColor(final int level) {
     * @param outputBuffer Buffer to append linked text to.
     */
    protected static void linkifyText(final String text, final StringBuffer outputBuffer) {


Please extract this to a separate class to simplify testing.

cketti · 2017-03-15T05:00:22Z

k9mail/src/main/java/com/fsck/k9/message/html/HtmlConverter.java

+        while (matcher.find(currentPos)) {
+            int startPos = matcher.start();
+
+            // Append all text in between


We try to avoid comments because they easily go out of date when the code is changed but the comments are not. Avoiding them also encourages writing more readable code.

To make this more readable you could change it to

String textBeforeMatch = text.substring(currentPos, startPos);
outputBuffer.append(textBeforeMatch);

cketti · 2017-03-15T05:04:00Z

k9mail/src/main/java/com/fsck/k9/message/html/HttpUriParser.java

+     *
+     * @return Position of first character after @ sign.
+     */
+    private int matchUserInfoIfAvailable(String text, int startPos, int authorityEnd) {


This is not a public API. We don't need JavaDoc for internal methods.

cketti · 2017-03-15T05:05:31Z

k9mail/src/main/java/com/fsck/k9/message/html/UriParser.java

+     * @param outputBuffer Buffer where linkified variant of uri is written to.
+     * @return Index where parsed uri ends (first non-uri letter). Should be startPos or smaller if no valid uri was found.
+     */
+    int linkifyUri(final String text, int startPos, final StringBuffer outputBuffer);


There's no need for final in interfaces.

cketti · 2017-03-15T05:06:51Z

k9mail/src/test/java/com/fsck/k9/message/html/HtmlConverterTest.java

@@ -19,7 +19,8 @@
 @Config(manifest = Config.NONE)
 public class HtmlConverterTest {
    // Useful if you want to write stuff to a file for debugging in a browser.
-    private static final boolean WRITE_TO_FILE = Boolean.parseBoolean(System.getProperty("k9.htmlConverterTest.writeToFile", "false"));
+    private static final boolean WRITE_TO_FILE =


Please revert the unrelated changes in this file.

cketti · 2017-03-15T05:07:58Z

k9mail/src/test/java/com/fsck/k9/message/html/HtmlConverterTest.java

@@ -207,4 +216,33 @@ public void testLinkifyBitcoinAndHttpUri() {
                "http://example.com/" +
                "</a>", outputBuffer.toString());
    }
+
+    @Test
+    public void testHttpUris() {


Please only one test per test method.

cketti · 2017-03-15T05:15:06Z

The current code will probably also linkify the http URI in a string like myhttp://example.org. We probably shouldn't do that.
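A possible way to skip such matches is a negative lookbehind in front of the scheme. This is only a sketch: `SchemeBoundaryDemo` is a made-up name, and the rule "not preceded by an ASCII letter or digit" is an assumption about which separators should be allowed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SchemeBoundaryDemo {
    // Assumption: a scheme only counts at the start of the text or after a
    // character that is not an ASCII letter or digit.
    private static final Pattern SCHEME = Pattern.compile(
            "(?<![A-Za-z0-9])(https?:|rtsp:|bitcoin:)", Pattern.CASE_INSENSITIVE);

    // Returns the start index of every scheme that passes the boundary check.
    static List<Integer> schemeStarts(String text) {
        List<Integer> starts = new ArrayList<>();
        Matcher matcher = SCHEME.matcher(text);
        while (matcher.find()) {
            starts.add(matcher.start(1));
        }
        return starts;
    }
}
```

With this pattern, `myhttp://example.org` yields no scheme at all, while `see http://example.org` yields one at index 4.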

tbolender · 2017-03-15T11:23:14Z

Sorry for the code style issues. I imported the settings.jar as described. All mentioned points should be fixed.

Valodim

👍 good job!

cketti · 2017-03-20T22:44:08Z

I squashed all commits and cleaned up the code a bit, hoping the feature was ready to merge. However, HttpUriParser ports one major assumption from HttpUrl that doesn't hold when scanning for URLs: it assumes all of the input is one URL. This leads to the code treating everything after "http://" up to the next "/" as the authority, so URLs whose authority isn't followed by a slash are not supported. For example, the first URL in the following string will not be detected/linkified: "http://uri1.example.org some text http://uri2.example.org/path"
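To make the problem concrete, the sketch below contrasts the "up to the next slash" assumption with a scan that stops at the first character that cannot belong to an authority. Both methods are hypothetical simplifications and are not taken from HttpUriParser; the set of allowed characters is a rough reading of RFC 3986 and an assumption on my part.

```java
final class AuthorityEndDemo {
    // Mirrors the assumption described above: everything up to the
    // next '/' (or the end of the input) is treated as authority.
    static int naiveAuthorityEnd(String text, int authorityStart) {
        int slash = text.indexOf('/', authorityStart);
        return slash == -1 ? text.length() : slash;
    }

    // Alternative: stop at the first character that is not allowed
    // in an authority, e.g. whitespace.
    static int scanningAuthorityEnd(String text, int authorityStart) {
        int pos = authorityStart;
        while (pos < text.length() && isAuthorityChar(text.charAt(pos))) {
            pos++;
        }
        return pos;
    }

    private static boolean isAuthorityChar(char c) {
        return (c >= 'a' && c <= 'z')
                || (c >= 'A' && c <= 'Z')
                || (c >= '0' && c <= '9')
                || "-._~%!$&'()*+,;=:@[]".indexOf(c) != -1;
    }
}
```

For the example string, starting right after the first "http://", the naive version returns the slash inside the second "http://", while the scanning version stops at the space after "uri1.example.org".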

While trying to fix this I noticed that HttpUriParser supports internationalized domain names (IDN) and read up on that topic. My takeaway was "this is complicated". I think as a first step we shouldn't attempt to linkify IRIs (Internationalized Resource Identifiers), but limit ourselves to traditional URIs.

Fixing the detection of valid http URLs surrounded by text is still something that needs to be done.

tbolender · 2017-03-21T11:05:15Z

Thanks for taking the time. I dropped the IDN detection and now allow only simple domain names. I also changed the parsing to a greedier approach, which now successfully detects your example.

cketti · 2017-03-21T16:52:14Z

k9mail/src/main/java/com/fsck/k9/message/html/HttpUriParser.java

@@ -18,6 +16,8 @@
 class HttpUriParser implements UriParser {
    // This string represent character group sub-delim as described in RFC 3986
    private static final String SUB_DELIM = "!$&'()*+,;=";
+    private static final Pattern DOMAIN_PATTERN =
+            Pattern.compile("\\w([\\w-]*\\w)*(\\.\\w([\\w-]*\\w)*)*(:(\\d{0,5}))?");


Unfortunately, \w also includes the underscore, which is not valid in host names.

You can use non-capturing groups by putting ?: as the first characters inside the parentheses. That'll make it easier to get to the content you do want to capture later.

Example: https://github.com/itiboi/k-9/blob/cf9c3d078e6e7296f16d8d8a29905047eeeec36b/k9mail/src/main/java/com/fsck/k9/message/html/UriLinkifier.java#L17
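A sketch of both suggestions combined, under the assumption that only ASCII letters, digits and hyphens should be allowed in labels. `DomainPatternDemo`, `LABEL` and `portOf` are illustrative names, and this is not necessarily the pattern that ended up in the PR.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DomainPatternDemo {
    // A label starts and ends with a letter or digit; hyphens are only
    // allowed in between. No underscore, unlike \w.
    private static final String LABEL = "[a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z0-9])?";

    // All structural groups are non-capturing, so group 1 is the port.
    static final Pattern DOMAIN_PATTERN = Pattern.compile(
            LABEL + "(?:\\." + LABEL + ")*(?::(\\d{0,5}))?");

    // Returns the port digits, or null if there is no port or no match.
    static String portOf(String authority) {
        Matcher matcher = DOMAIN_PATTERN.matcher(authority);
        return matcher.matches() ? matcher.group(1) : null;
    }
}
```

Because the label and dot groups are non-capturing, the port is always group 1, regardless of how many labels the host name has.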

cketti · 2017-03-21T16:56:04Z

k9mail/src/main/java/com/fsck/k9/message/html/HttpUriParser.java

-        if (!tryMatchDomainName(text, currentPos, authorityEnd) &&
-                !tryMatchIpv4Address(text, currentPos, authorityEnd, true) &&
-                !tryMatchIpv6Address(text, currentPos, authorityEnd)) {
+        int matchedAuthorityEnd = Math.max(tryMatchDomainName(text, currentPos),


The whole Math.max() business makes this super hard to read. There's also no need to attempt another match if one of the methods was successful. So I suggest extracting all of this to a separate method that checks the return value after each call to tryMatch*() and returns early if a match was found.
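The early-return shape could look roughly like this. It assumes each tryMatch*() method returns the end index of its match, or startPos when nothing matched; the three patterns are simplified placeholders and not the ones used in HttpUriParser.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AuthorityMatcherSketch {
    // Simplified placeholder patterns, for illustration only.
    private static final Pattern DOMAIN =
            Pattern.compile("[a-zA-Z0-9](?:[a-zA-Z0-9.-]*[a-zA-Z0-9])?");
    private static final Pattern IPV4 = Pattern.compile("\\d{1,3}(?:\\.\\d{1,3}){3}");
    private static final Pattern IPV6 = Pattern.compile("\\[[0-9a-fA-F:]+\\]");

    // Tries each matcher in turn and returns as soon as one succeeds.
    static int matchAuthority(String text, int startPos) {
        int end = tryMatchDomainName(text, startPos);
        if (end != startPos) {
            return end;
        }
        end = tryMatchIpv4Address(text, startPos);
        if (end != startPos) {
            return end;
        }
        return tryMatchIpv6Address(text, startPos);
    }

    private static int tryMatchDomainName(String text, int startPos) {
        return tryMatch(DOMAIN, text, startPos);
    }

    private static int tryMatchIpv4Address(String text, int startPos) {
        return tryMatch(IPV4, text, startPos);
    }

    private static int tryMatchIpv6Address(String text, int startPos) {
        return tryMatch(IPV6, text, startPos);
    }

    // Returns the end index of a match anchored at startPos, or startPos if none.
    private static int tryMatch(Pattern pattern, String text, int startPos) {
        Matcher matcher = pattern.matcher(text);
        matcher.region(startPos, text.length());
        return matcher.lookingAt() ? matcher.end() : startPos;
    }
}
```

Note that with these placeholder patterns a dotted IPv4 address is already accepted by the domain pattern, so the order of the calls matters in a real implementation.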

cketti · 2017-03-21T16:57:33Z

k9mail/src/main/java/com/fsck/k9/message/html/HttpUriParser.java

        int userInfoEnd = text.indexOf('@', startPos);
-        if (userInfoEnd != -1 && userInfoEnd < authorityEnd) {


authorityEnd is still a useful upper bound that can be used to avoid useless work.
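One way to read this, sketched with a hypothetical helper: text.indexOf('@', startPos) keeps scanning to the end of the message, whereas a bounded loop stops at authorityEnd and cannot pick up an unrelated @ further along. `UserInfoBoundDemo` and `findUserInfoEnd` are illustrative names.

```java
final class UserInfoBoundDemo {
    // Looks for '@' only inside [startPos, authorityEnd).
    // Returns its index, or -1 if the authority has no user info.
    static int findUserInfoEnd(String text, int startPos, int authorityEnd) {
        for (int pos = startPos; pos < authorityEnd; pos++) {
            if (text.charAt(pos) == '@') {
                return pos;
            }
        }
        return -1;
    }
}
```

For "http://example.org/ contact me@example.com" with the authority ending at index 18, the bounded search returns -1, while the unbounded indexOf call would find the @ of the e-mail address.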

tbolender · 2017-03-21T21:52:30Z

Thanks for the hint with ?: and the underscore, totally forgot about that.

cketti · 2017-03-22T01:27:03Z

Awesome. Thanks a lot!

cketti requested changes Mar 15, 2017

Valodim mentioned this pull request Mar 16, 2017

Incorrect URL detection (final slash) #1223

Closed

tbolender changed the title ~~Generic uri parsing and fixing trailing trailing slash issue~~ Generic uri parsing and fixing trailing slash issue Mar 16, 2017

Valodim approved these changes Mar 16, 2017

tbolender and others added 2 commits March 16, 2017 22:10

Add one pass URI parser/linkifier

0d3d9aa

Clean up URI parsing code and tests

0f9bc48

cketti self-assigned this Mar 17, 2017

cketti added 2 commits March 17, 2017 15:19

Fix bug with advancing the position when linkifying failed

98974a7

Use regexp to skip schema matches not preceded by allowed separator

cf9c3d0

cketti force-pushed the generic-uri-parsing branch from 16f4187 to cf9c3d0 Compare March 20, 2017 22:20

Switched to "classic" domain name detection and added multiple tests.

9d3cc8e

cketti reviewed Mar 21, 2017

Fixed invalid domain character and some restructuring.

b068201

cketti approved these changes Mar 22, 2017

cketti merged commit 32212a4 into thunderbird:master Mar 22, 2017
