Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

avoid Relaxed from matching trailing TLDs without a word break #49

Closed
daviddyang9527 opened this issue May 19, 2021 · 2 comments
Closed

Comments

@daviddyang9527
Copy link

When leveraging rxRelaxed.FindAllString to find all urls in SMS text, we found out that when sentences were connected without a blank after a dot it might mislead the regex and be detected as a legitimate url.
For example: a SMS text like

ups has informed us that your equipment has been delivered.The ups tracking number is 1zraxxxxxxx to track the status of your delivery click here hxxps://abc.com/yz3e4a2p|70994

Reproduction steps

        var test string = "ups has informed us that your equipment has been delivered.The ups tracking number is 1zraxxxxxxx to 
        track the status of your delivery click here hxxps://abc.com/yz3e4a2p|70994"
	rxRelaxed := xurls.Relaxed()
	fmt.Println(rxRelaxed.FindAllString(test, -1))  // [delivered.th https://abc.com/yz3e4a2p]

       	rxStrict := xurls.Strict()
	fmt.Println(rxStrict.FindAllString(test, -1)) // [https://abc.com/yz3e4a2p]

What did you expect to happen?
Relax method works as strict method, only one result comes out

What actually happened?
[delivered.th https://abc.com/yz3e4a2p] for relax method
[hxxps://abc.com/yz3e4a2p] for strict method

Environment

  • xurls: v2.2.0
  • golang: 1.16.4
  • os: macos
@mvdan
Copy link
Owner

mvdan commented May 19, 2021

Indeed, this is a known problem, and I attempted to fix it a few months ago but didn't actually finish it. I'll take another look.

@mvdan mvdan changed the title False positive on text url parsing when sentences are connected without leading space. avoid Relaxed from matching trailing TLDs without a word break May 19, 2021
@mvdan mvdan closed this as completed in cee2e13 May 19, 2021
@mvdan
Copy link
Owner

mvdan commented May 19, 2021

Should be fixed in master now.

gopherbot pushed a commit to golang/tools that referenced this issue Mar 4, 2022
Update to v2.4.0, which brings optimizations and minor improvements.

Upstream fixed the issue with word breaks in v2.3.0,
released back in July 2021. For more, see:
mvdan/xurls#49

In short, ASCII TLDs now require a word break in Relaxed again.
Note that non-ASCII TLDs don't require one, as that just breaks them.
So, in that sense, this is making non-ASCII TLDs work in gopls again.

Plus, we avoid wasting CPU cycles at init time by rebuilding the regexp.

Change-Id: Iddb34827b61594fa1f911143f1d7e2c67a6658c9
Reviewed-on: https://go-review.googlesource.com/c/tools/+/389814
Trust: Daniel Martí <mvdan@mvdan.cc>
Run-TryBot: Daniel Martí <mvdan@mvdan.cc>
Reviewed-by: Robert Findley <rfindley@google.com>
gopls-CI: kokoro <noreply+kokoro@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants