Replace linkRegex with xurls library #6261

Merged
merged 4 commits into from Mar 7, 2019

Contributor

mrsdizzie commented Mar 7, 2019

Rather than maintaining a complicated regex to match URLs for autolinking, Gitea can use an existing Go library that handles the matching and requires very little change to Gitea's own code:

https://github.com/mvdan/xurls

After spending a while trying to find the perfect regex for all cases, I found that this library still works better, since it is more flexible than a single regex ever will be.

This will also fix the following issues: #5844 #3095 #3381

This passes all current tests and I've added new ones based on URLs mentioned in those issues above.

Replace linkRegex with xurls library
Rather than maintaining a complicated regex to match URLs for
autolinking, gitea can use this existing go library that takes care of
the matching with very little code change to gitea itself. After
spending a while trying to find the perfect regex for all cases this library
still works better as it is more flexible than a single regex ever will be.

This will also fix the following issues: #5844 #3095 #3381

This passes all our current tests and I've added new ones mentioned in
those issues as well.
codecov-io commented Mar 7, 2019

Codecov Report

❗️ No coverage uploaded for pull request base (master@01bd1fc).
The diff coverage is n/a.

@@            Coverage Diff            @@
##             master    #6261   +/-   ##
=========================================
  Coverage          ?   38.81%           
=========================================
  Files             ?      355           
  Lines             ?    50253           
  Branches          ?        0           
=========================================
  Hits              ?    19504           
  Misses            ?    27920           
  Partials          ?     2829
Impacted Files            Coverage Δ
modules/markup/html.go    88.09% <ø> (ø)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 01bd1fc...805a970.

@GiteaBot GiteaBot added the lgtm/need 2 label Mar 7, 2019

@@ -645,7 +642,7 @@ func emailAddressProcessor(ctx *postProcessCtx, node *html.Node) {
 // linkProcessor creates links for any HTTP or HTTPS URL not captured by
 // markdown.
 func linkProcessor(ctx *postProcessCtx, node *html.Node) {
-	m := linkRegex.FindStringIndex(node.Data)
+	m := xurls.Strict().FindStringIndex(node.Data)

zeripath Mar 7, 2019

Contributor

The result of xurls.Strict() should be cached - it compiles several regexps.
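
The caching point can be illustrated with the standard library alone. This is a minimal sketch, not the code from this PR: `linkPattern` is a deliberately simplified stand-in for the library's expression, and `findLinkUncached` and `findLinkCached` are hypothetical names. It contrasts compiling on every call with compiling once at package level and reusing the result.

```go
package main

import (
	"fmt"
	"regexp"
)

// linkPattern is a simplified stand-in (an assumption for this sketch),
// NOT the expression that xurls builds.
const linkPattern = `https?://[^\s<>"]+`

// cachedLinkRegex is compiled once when the package is initialised.
// A *regexp.Regexp is safe for concurrent use, so it can be shared.
var cachedLinkRegex = regexp.MustCompile(linkPattern)

// findLinkUncached recompiles the pattern on every call, which is the
// cost the review comment warns about.
func findLinkUncached(text string) []int {
	return regexp.MustCompile(linkPattern).FindStringIndex(text)
}

// findLinkCached reuses the package-level compiled expression.
func findLinkCached(text string) []int {
	return cachedLinkRegex.FindStringIndex(text)
}

func main() {
	text := "see https://example.com/path for details"
	if m := findLinkCached(text); m != nil {
		// m holds the start and end byte offsets of the first match.
		fmt.Println(text[m[0]:m[1]])
	}
}
```

Both helpers return the same offsets; only the amount of work per call differs.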

@techknowlogick techknowlogick added this to the 1.8.0 milestone Mar 7, 2019

Member

techknowlogick commented Mar 7, 2019

Tagging this as a bugfix as it solves a bug, so that we can get it in the 1.8.0 release.

Use xurls.StrictMatchingScheme instead of xurls.Strict
This is much faster and we only care about https? links to preserve
existing behavior.
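
The idea of restricting the matcher to a scheme expression can be sketched with the standard library. This is an illustration under stated assumptions, not the xurls implementation: `newSchemeMatcher` is a hypothetical helper, and the URL body pattern is deliberately simplified. It shows a matcher built only for `https?://` ignoring links with other schemes.

```go
package main

import (
	"fmt"
	"regexp"
)

// newSchemeMatcher is a hypothetical helper for this sketch. It builds a
// matcher that only accepts URLs whose scheme matches schemeExp. The body
// pattern is simplified and is not what xurls uses.
func newSchemeMatcher(schemeExp string) (*regexp.Regexp, error) {
	return regexp.Compile(`(?i)(?:` + schemeExp + `)[^\s<>"]+`)
}

func main() {
	re, err := newSchemeMatcher(`https?://`)
	if err != nil {
		panic(err)
	}
	text := "get ftp://old.example.org or http://example.com/a?b=1 today"
	// Only the http link matches; the ftp link is outside the scheme set.
	fmt.Println(re.FindAllString(text, -1))
}
```

A narrower scheme expression gives the regex engine fewer alternatives to try at each position, which is consistent with the speed difference described in the commit message.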

@GiteaBot GiteaBot added lgtm/need 1 and removed lgtm/need 2 labels Mar 7, 2019

@GiteaBot GiteaBot added lgtm/done and removed lgtm/need 1 labels Mar 7, 2019

Contributor Author

mrsdizzie commented Mar 7, 2019

Thanks very much for the feedback! That is exactly right; thanks for catching it.

Also, here is a tiny test program that compares how long this takes versus the current implementation:

https://gist.github.com/mrsdizzie/edfcbf36a5355d1db5f5d7218543a7a4

The results I got were:

Running each test on 10,000 random lines
Starting current linkRegex test:
52.756294ms
Starting modifiedLinkRegex test:
84.711437ms
Starting xurl test:
92.17964ms
Starting xurl with all schemes test:
1.871560519s

modifiedLinkRegex is a new regex that tries to match all of the URLs mentioned in the bugs above. So xurls is roughly on par with any regex we could have added, and it doesn't introduce noticeable slowness in real-world situations with large content.

techknowlogick added some commits Mar 7, 2019

@techknowlogick techknowlogick merged commit f2de5dc into go-gitea:master Mar 7, 2019

2 checks passed

approvals/lgtm this commit looks good
continuous-integration/drone/pr the build was successful

CL-Jeremy added a commit to CL-Jeremy/gitea that referenced this pull request Mar 12, 2019

Squashed commit of the following:
commit 5b33619
Author: Mike L <cl.jeremy@qq.com>
Date:   Tue Mar 12 20:17:26 2019 +0100

    Fix textarea also (to match body)

commit e76c56d
Author: Mike L <cl.jeremy@qq.com>
Date:   Tue Mar 12 19:41:39 2019 +0100

    Revert css temporarily to fix conflict

commit 6846c11
Author: Mike L <cl.jeremy@qq.com>
Date:   Tue Mar 12 19:17:38 2019 +0100

    Remove mistakenly introduced entry from .gitignore

commit 5ed4e51
Author: Mike L <cl.jeremy@qq.com>
Date:   Tue Mar 12 19:15:30 2019 +0100

    Tweak CJK, fix Yu Gothic, more monospace inherits

commit 26a460f
Merge: 125d1dc 7c20560
Author: Mike L <cl.jeremy@qq.com>
Date:   Mon Mar 11 13:51:33 2019 +0100

    Merge branch 'master' of https://github.com/go-gitea/gitea into issue-4173-alternative-fix

commit 125d1dc
Author: Mike L <cl.jeremy@qq.com>
Date:   Mon Mar 11 13:09:26 2019 +0100

    Add Lato for latin extd. & cyrillic, improve CJK

commit 7c20560
Author: GiteaBot <teabot@gitea.io>
Date:   Mon Mar 11 11:37:48 2019 +0000

    [skip ci] Updated translations via Crowdin

commit f9627ed
Author: MysticBoy <mysticboy@live.com>
Date:   Mon Mar 11 19:35:18 2019 +0800

    Update third-party-tools.en-us.md (go-gitea#6301)

    Add  Gitea Extension for Visual Studio

commit 4334fe7
Author: Lunny Xiao <xiaolunwen@gmail.com>
Date:   Mon Mar 11 11:44:58 2019 +0800

    update git vendor to fix wrong release commit id and add migrations (go-gitea#6224)

    * update git vendor to fix wrong release commit id and add migrations

    * fix count

    * fix migration release

    * fix tests

commit 2315019
Author: Jonas Franz <info@jonasfranz.software>
Date:   Mon Mar 11 03:54:59 2019 +0100

    Add support for client basic auth for exchanging access tokens (go-gitea#6293)

    * Add support for client basic auth for exchanging access tokens

    * Improve error messages

    * Fix tests

commit e0eb651
Author: GiteaBot <teabot@gitea.io>
Date:   Sun Mar 10 21:58:54 2019 +0000

    [skip ci] Updated translations via Crowdin

commit dbab98c
Author: zeripath <art27@cantab.net>
Date:   Sun Mar 10 21:56:36 2019 +0000

    Remove util.RemoveAll - should have been removed since go 1.7 (go-gitea#6299)

commit e836b88
Author: GiteaBot <teabot@gitea.io>
Date:   Sat Mar 9 21:18:31 2019 +0000

    [skip ci] Updated translations via Crowdin

commit f5cf9a8
Author: Aidan Fitzgerald <aidan-fitz@users.noreply.github.com>
Date:   Sat Mar 9 16:15:45 2019 -0500

    Copyedit docs (go-gitea#6275)

commit 8fffb06
Author: Jonas Franz <info@jonasfranz.software>
Date:   Sat Mar 9 17:29:58 2019 +0100

    Add regenerate secret feature for oauth2 (go-gitea#6291)

    * Add regenerate secret functionality

    * Fix lint

commit 8211e01
Author: John Olheiser <42128690+jolheiser@users.noreply.github.com>
Date:   Sat Mar 9 05:00:38 2019 -0600

    Add unit types to repo action URL to correctly show 404 when archived (go-gitea#6247)

    Signed-off-by: jolheiser <john.olheiser@gmail.com>

commit f7ffb19
Author: GiteaBot <teabot@gitea.io>
Date:   Fri Mar 8 20:28:33 2019 +0000

    [skip ci] Updated translations via Crowdin

commit 96f1720
Author: techknowlogick <matti@mdranta.net>
Date:   Fri Mar 8 15:25:47 2019 -0500

    Use golang 1.12 to build in dockerfile (go-gitea#6285)

commit 5c69e31
Author: John Olheiser <42128690+jolheiser@users.noreply.github.com>
Date:   Fri Mar 8 12:15:46 2019 -0600

    Add security note to issue template (go-gitea#6281)

commit 062de8e
Author: GiteaBot <teabot@gitea.io>
Date:   Fri Mar 8 17:43:26 2019 +0000

    [skip ci] Updated translations via Crowdin

commit bd4be43
Author: John Olheiser <42128690+jolheiser@users.noreply.github.com>
Date:   Fri Mar 8 11:42:59 2019 -0600

    Third party docs (go-gitea#6282)

commit 489419c
Author: GiteaBot <teabot@gitea.io>
Date:   Fri Mar 8 16:45:46 2019 +0000

    [skip ci] Updated translations via Crowdin

commit e777c6b
Author: Jonas Franz <info@jonasfranz.software>
Date:   Fri Mar 8 17:42:50 2019 +0100

    Integrate OAuth2 Provider (go-gitea#5378)

commit 9d3732d
Author: Antoine GIRARD <sapk@users.noreply.github.com>
Date:   Fri Mar 8 02:54:10 2019 +0100

    [Contrib] Checkout a PR (go-gitea#6021)

commit 9fd8b26
Author: techknowlogick <matti@mdranta.net>
Date:   Thu Mar 7 16:30:25 2019 -0500

    Add robots.txt as reserved username (go-gitea#6272)

    Fix go-gitea#6271

commit f2de5dc
Author: mrsdizzie <joe.mccann@gmail.com>
Date:   Thu Mar 7 15:12:01 2019 -0500

    Replace linkRegex with xurls library (go-gitea#6261)

    * Replace linkRegex with xurls library

    Rather than maintaining a complicated regex to match URLs for
    autolinking, gitea can use this existing go library that takes care of
    the matching with very little code change to gitea itself. After
    spending a while trying to find the perfect regex for all cases this library
    still works better as it is more flexible than a single regex ever will be.

    This will also fix the following issues: go-gitea#5844 go-gitea#3095 go-gitea#3381

    This passes all our current tests and I've added new ones mentioned in
    those issues as well.

    * Use xurls.StrictMatchingScheme instead of xurls.Strict

    This is much faster and we only care about https? links to preserve
    existing behavior.

commit 01bd1fc
Author: GiteaBot <teabot@gitea.io>
Date:   Thu Mar 7 19:16:42 2019 +0000

    [skip ci] Updated translations via Crowdin

commit 020075e
Author: mrsdizzie <joe.mccann@gmail.com>
Date:   Thu Mar 7 14:13:44 2019 -0500

    Remove visitLinksForShortLinks features (go-gitea#6257)

    The visitLinksForShortLinks feature would look inside of an <a> tag and
    run shortLinkProcessorFull on any text, which attempts to create links
    out of potential 'short links' like [[test]] [[link|example]] etc...
    This makes no sense because you can't have nested links within an <a>
    tag. Specifically, the html5 standard says <a> tags can't include
    interactive content if they contain the href attribute:

     http://w3c.github.io/html/single-page.html#the-a-element

    And also defines an <a> element with a href attribute as interactive:

     http://w3c.github.io/html/single-page.html#interactive-content

    Therefore you can't really put a link inside of another link. In
    practice none of this works anyways since browsers won't render it, it
    would probably be broken if they tried, and it is causing a bug
    (go-gitea#4946). No current tests rely on this behavior either.

    This removes the feature and also explicitly excludes the
    current visitNodeForShortLinks from looking in <a> tags.

commit ad86b84
Author: GiteaBot <teabot@gitea.io>
Date:   Wed Mar 6 00:50:36 2019 +0000

    [skip ci] Updated translations via Crowdin

commit 608781f
Author: John Olheiser <42128690+jolheiser@users.noreply.github.com>
Date:   Tue Mar 5 18:48:30 2019 -0600

    Fix fork button (go-gitea#6223)

commit 6460cff
Author: GiteaBot <teabot@gitea.io>
Date:   Tue Mar 5 20:18:01 2019 +0000

    [skip ci] Updated translations via Crowdin

commit f80caa5
Author: Zsombor <gzsombor@users.noreply.github.com>
Date:   Tue Mar 5 21:15:24 2019 +0100

    Fix go-gitea#6234 : Check organization visibility before everything else (go-gitea#6235)

    * Fix go-gitea#6234 : Check organization visibility before everything else

    * Ensure that Owner is available in the Repo

commit b257e04
Author: stevegt <stevegt@t7a.org>
Date:   Tue Mar 5 15:39:41 2019 +0100

    Add ability to sort issues by due date (go-gitea#6206) (go-gitea#6244)

    Signed-off-by: Steve Traugott <stevegt@t7a.org>

commit 4512634
Author: Muhammed TİFTİKÇİ <muhammedtiftikci@outlook.com>
Date:   Tue Mar 5 06:13:51 2019 +0300

    Make organization dropdown scrollable when using mouse wheel (go-gitea#5988)

    * Fix go-gitea#5580

commit f066bd2
Author: zeripath <art27@cantab.net>
Date:   Tue Mar 5 02:52:52 2019 +0000

    Prevent double-close of issues (go-gitea#6233)

commit 1986269
Author: Maurizio Porrato <maurizio.porrato@gmail.com>
Date:   Tue Mar 5 02:34:52 2019 +0000

    Override xorm type mapping for U2F counter (go-gitea#6232)

commit 141c58f
Author: Lanre Adelowo <adelowomailbox@gmail.com>
Date:   Sun Mar 3 23:57:24 2019 +0100

    add isAdmin to user model (go-gitea#6231)

    update vendor and add tests

    fix swagger