Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Invalid matching with Cyrillic TLDs #32

Closed
bynov opened this issue Oct 22, 2019 · 3 comments
Closed

Invalid matching with Cyrillic TLDs #32

bynov opened this issue Oct 22, 2019 · 3 comments

Comments

@bynov
Copy link

bynov commented Oct 22, 2019

Hi!
It seems like there is a problem with Cyrillic TLDs. Here an example:

echo "test.xyz" | xurls -r
test.xyz
echo "test.xyz test" | xurls -r
test.xyz
echo "test.бел" | xurls -r
test.бел
echo "test.бел test" | xurls -r 
<empty response>

If there are any symbols, even whitespace after cyrillic domain - it's not match anymore.
I tried to solve that issue and found that it can be something in string but I don't sure

webURL := hostName + port + `(/|/` + pathCont + `?|\b|(?m)$)`

In \b part. I tried to use |\b|\B but some tests failed.

Thanks!

@mvdan
Copy link
Owner

mvdan commented Oct 22, 2019

Thanks for reporting this bug. It seems to happen with all non-ASCII TLDs - I was able to reproduce it with chinese and korean. We never had a test specifically for this case where a TLD is directly followed with a space, so I think this never worked.

@bynov
Copy link
Author

bynov commented Oct 22, 2019

Thanks for a reply, @mvdan
So, could we fix it somehow? Maybe you have some ideas regarding this case.
I want to help with that but I’m weak in regexps, especially that heavy regexps :)
Any ideas would be very helpful

@mvdan
Copy link
Owner

mvdan commented Oct 22, 2019

The issue is that \b (the word boundary) isn't unicode-aware, so it thinks chinese and other alphabets are non-words. I don't think there is a good fix here. The \b was used so that foo.comgarbage didn't match foo.com; we might have to lose that feature, or restrict it to ASCII TLDs only.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants