A web link is not processed if it contains Cyrillic #8634

andrecool-68 · 2020-07-29T12:55:32Z

A web link is not processed if it contains Cyrillic.

Notepad++ v7.8.6 (32-bit)
Build time : Apr 21 2020 - 15:17:06
Path : I:\Tools_Servis\TextCode\NPP++\npp.7.8.6.bin\notepad++.exe
Admin mode : ON
Local Conf mode : ON
OS Name : Windows 7 Home Premium (64-bit)
OS Build : 7601.0
Plugins : ComparePlugin.dll DSpellCheck.dll Explorer.dll HexEditor.dll HTMLTag_unicode.dll ImgTag.dll JSMinNPP.dll LocationNavigate.dll mimeTools.dll MultiClipboard.dll NativeLang.dll NppConverter.dll NppExport.dll NppMarkdownPanel.dll NppSnippets.dll PreviewHTML.dll ShtirlitzNppPlugin.dll Tidy2.dll VisualStudioLineCopy.dll WebEdit.dll WindowManager.dll XMLTools.dll _CustomizeToolbar.dll

sasumner · 2020-07-29T13:26:35Z

URL processing has been undergoing change in recent versions of N++.
Can you retest in 7.8.9 and see if it is the same as you described above in that version?

andrecool-68 · 2020-07-29T13:33:46Z

@sasumner
The latest version has exactly the same behavior

I give an example of how to handle web links AkelPad editor

Yaron10 · 2020-07-29T13:44:06Z

NPP uses the following RegEx which needs to be changed:
URL_REG_EXPR "[A-Za-z]+://[A-Za-z0-9_\\-\\+~.:?&@=/%#,;\\{\\}\\(\\)\\[\\]\\|\\*\\!\\\\]+"

BTW, a . at the end is interpreted as part of the URL which, obviously, is wrong.

This is a task for @guy038 the RegEx master. :)

sasumner · 2020-07-29T14:16:22Z

So, not any kind of new problem relating to the change to URL displaying, but rather a long-standing thing...that there are probably other issues already open for.

andrecool-68 · 2020-07-29T15:00:32Z

@sasumner I noticed this error only today

Yaron10 · 2020-07-29T16:21:31Z

So, not any kind of new problem relating to the change to URL displaying, but rather a long-standing thing

Not related to the new hotSpot.

that there are probably other issues already open for.

Possibly. :)

Ekopalypse · 2020-07-29T21:17:00Z

I would vote for making the URL regex changeable via settings dialog.
No matter what regex change one would come up with now, sooner or later someone will come along who wants it different.
Having it changeable will solve it forever. Maybe with a warning in the settings, like if you do stupid things expect it to work stupidly to avoid getting issues like: once I change the url regex to .* npp is slow.

Yaron10 · 2020-07-29T21:45:17Z

No matter what regex change one would come up with now, sooner or later someone will come along who wants it different.

I don't know enough about the issue.
If that is the case (and I assume it is if you say so), your suggestion is very good.

I do know that Visual Studio, Gmail and other editors interpret URLs correctly even if they contain non-Latin characters and/or certain characters (e.g. .) at the end.

Yaron10 · 2020-07-30T19:17:47Z

Replacing
#define URL_REG_EXPR "[A-Za-z]+://[A-Za-z0-9_\\-\\+~.:?&@=/%#,;\\{\\}\\(\\)\\[\\]\\|\\*\\!\\\\]+"
with
#define URL_REG_EXPR "(http|ftp|https)://([\\w-]+(?:(?:\\.[\\w-]+)+))([\\w.,@?^=%&:/~+#-]*[\\w@?^=%&/~+#-])?"
solves the following cases:

https://github.com/notepad-plus-plus/notepad-plus-plus.
(https://github.com/notepad-plus-plus/notepad-plus-plus)
https://www.rnids.rs/национални-домени/регистрација-националних-домена
https://www.morfix.co.il/שלום

Feel free to test, improve and submit a PR.

RegEx source: https://stackoverflow.com/questions/6038061/regular-expression-to-find-urls-within-a-string

Ekopalypse · 2020-07-30T19:30:42Z

@Yaron10 - this would restrict urls to start with http, ftp and https only, but IANA define a lot more. And maybe in two weeks, month, ... even more?

Yaron10 · 2020-07-30T19:34:46Z

@Ekopalypse,

As I wrote: I don't know enough about the issue.
Feel free to improve. :)

sasumner · 2020-07-30T19:47:19Z

but IANA define a lot more

Which may be why we currently have:

#define URL_REG_EXPR "[A-Za-z]+://...

Of course in the IANA link I also see some numbers, -, and probably some other non (?i)a-z

Ekopalypse · 2020-07-30T19:51:27Z

@Yaron10 :-) and we've come full circle - I vote for ... :-)

Yaron10 · 2020-07-30T20:17:09Z

@Ekopalypse,

Well, I vote for it too. :)
Still, I suppose the current RegEx can be modified to solve the . and ) issues.

Uhf7 · 2020-08-01T07:54:03Z

Only a opinion thrown in here, because I observe URL problems for a while know. There are two parts of the problem, completely independent of each other:

Detection of URLs,
Displaying and highlighting detected URLs.

Until now, I did only get involved in the second part of the problem. I did never care about the detection.

But, whenever I read about problems with that, the question occurs to me: Is a Regex expression really the right way to detect URLs? From inside C++ program code? Since you didn't manage to solve this for years, I would say: no. Is it so difficult to write a simple parser for it? I wonder, why no one ever seemed to have the idea to do this. And before you make this Regex configurable, you should consider, that there may be other solutions than a Regex expression for it. See #3353, #5029 and #7791 also. Edit: and #3912.

Uhf7 · 2020-08-01T11:22:24Z

@Yaron10
I tried your new regex expression, but there must be more to it. If I simply replace URL_REG_EXPR, no link is detected anymore. So I assume, you have modified a setting like "enable super special regex expressions", or you use an own regex library, or a plug-in or something? Can you check this?

Yaron10 · 2020-08-01T11:58:21Z

@Uhf7,

You're using your modified SciLexer.dll with which links are not detected.
Using the original SciLexer.dll, links are interpreted as expected.
See #8588 (comment).

This may explain the significant size difference between the original DLL and yours.

Can you build your modified SciLexer.dll with the same settings as the original (and upload it)?
Alternatively, in "enable super special regex expressions" do you mean a setting in building SciLexer.dll?

Is it so difficult to write a simple parser for it?

Searching for a better RegEx the other day, I've come across this idea.
We're looking forward to your next PR.

Uhf7 · 2020-08-01T12:54:24Z

@Yaron10,
you are right, with the original SciLexer.dll from 7.8.9 your regex expression works.

With the Appveyor build, it doesn't. Here, only the original regex expression works. So there is an unnecessary difference between the Appveyor build and the original build, which should be fixed someday.

Thank you, now I can run your regular expression against my parser.

Edit: My own build of SciLexer.dll too doesn't accept the "new" regex syntax, there must be a trick.

Yaron10 · 2020-08-01T13:19:38Z

My own build of SciLexer.dll too doesn't accept the "new" regex syntax, there must be a trick.

Please share if you figure it out.

Thank you.

Ekopalypse · 2020-08-01T13:32:25Z

Ok, did some reasearch and found this community discussion from 2017 which links to this github issue.

And a new regex (?-s)[A-Za-z][A-Za-z0-9+.-]+://[^\s]+?(?=\s|\z) has been proposed.

But then it was closed with comment: requirement could not be fulfilled.
But from my limited regex knowledge I assume it does fulfill the requirements.

I did a test and it looks like it finds what needs to be found.

sasumner · 2020-08-01T14:06:58Z

I'm not one to play regex golf, but the regex @Ekopalypse mentions just above seems a bit, well, confuscated. This would seem to cover it, and be simpler (although I did no testing):

(?i-s)[a-z][a-z0-9+.-]+://\S+?(?=\s|\z)

Normally, it doesn't matter, but if changing something this "important", it should be done so that someone in the future doesn't have to puzzle about why it was made "overcomplicated".

Note also that neither of these 2 regex handle the case of a URL wrapped in parentheses/brackets/etc. correctly; examples:

(http://www.google.com) <--- it will include the ) in the match; obviously won't include the (
[http://www.google.com] <--- it will include the ] in the match; obviously won't include the [
etc., such as { and }

People (including me!) like to put URLs in parentheses; currently I have to remember to do it this way:

( http://www.google.com ) <--- note space before h and after m

Perhaps ) and friends are valid in a URL, and thus nothing can be done about this situation.

Another point: Is there no Unicode possibility before the :// in URLs?

Yaron10 · 2020-08-01T14:16:43Z

The ) and . at the end are important to me too.

sasumner · 2020-08-01T14:19:22Z

The ) and . at the end are important to me too.

That means you would NOT want them included in the URL, correct?

Yaron10 · 2020-08-01T14:25:53Z

That means you would NOT want them included in the URL, correct?

Correct.

Perhaps ) and friends are valid in a URL, and thus nothing can be done about this situation.

I hope that is NOT correct. :)

Ekopalypse · 2020-08-01T14:29:23Z

@sasumner

Is there no Unicode possibility before the :// in URLs?

No, according to https://tools.ietf.org/html/rfc3986#section-3.1, to quote

3.1. Scheme
...
scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
...

and according to https://tools.ietf.org/html/rfc2234#section-6.1

...
ALPHA = %x41-5A / %x61-7A ; A-Z / a-z

...
DIGIT = %x30-39
; 0-9
...

Uhf7 · 2020-08-01T15:59:44Z

@Yaron10, I can now build a SciLexer.dll which accepts your regular expression. And, absolute surprise, it doesn't use the Scintilla-build-in regex functions, but the regex functions from the boost library. This seems to be, because you regex acrobats can never be satisfied with standard regular expressions.

The mechanism is, that Scintilla provides a class for regex operations, which can be left as it is or be overridden with own regex functions. The code to do this is contained in \scintilla\boostregex\BoostRegExSearch.cxx.

To add the boost regex stuff to the project, you have to have the boost library installed on your computer. I downloaded a precompiled version from https://sourceforge.net/projects/boost/. Then, you have to add the source files from the scintilla\boostregex\ directory to your project, drag and drop worked for me. Then follow the error messages of the compiler, until it can build everything. You need:

to add an additional include file directory, where the boost\regex.hpp can be found.
to add an additional library file directory, where the .lib file can be found, which fits to your compiler and to your architecture/debug/release settings. The name of the library needed can be read from the compiler error log.

After you can build it, you have to turn it on by defining the preprocessor symbol SCI_OWNREGEX, and that's it. If you hate deprecation warnings, you can define _SILENCE_ALL_CXX17_DEPRECATION_WARNINGS.

By the way, I'm happy that the standard regular expression used for URL's here comes out with the regex syntax naturally provided by the original SciLexer.dll.

Yaron10 · 2020-08-01T16:22:03Z

@Uhf7,

Thank you for the detailed guide.
Too much work for an amateur like myself. I'll wait for the next NPP release which should include the DLL with boost.

Uhf7 · 2020-08-01T16:41:49Z

@Yaron10 look at this ones (They don't get through your regular expression):

Seen in highlighting and executing http link not ignoring trailing (closing) bracket #5029: . . . URL (e.g., https://en.wikipedia.org/wiki/Saw_(2003_film))?
Seen in Missing the last part in the clickable links #3912: . . . https://duckduckgo.com/?q=windows+delete+"GameBarPresenceWriter.exe"

Yaron10 · 2020-08-01T16:50:40Z

Yes, we're looking forward to your parser. :)

BTW, the second link is not interpreted correctly here (in your post) either.

Ekopalypse · 2020-08-01T19:57:39Z

I did a test to see if there is a performance impact and it seems that this

[[:alpha:]][[:alnum:]+.-]+://\S+

is ~3% faster than the one currently used and has the advantage to find unicode letters as well.
Finding and marking 10.000 urls took, with pythonscript on my machine, 0.45 seconds.
I assume pure C++ can do this much faster and 10.000 urls in a document should be unusual.

Yaron10 · 2020-08-09T12:04:35Z

@Uhf7,
Thank you for fixing the Copy/Paste issue. 👍

Uhf7 · 2020-08-12T16:34:53Z

@Yaron10, @Ekopalypse I revised my parser version to exclude your nice example from being detected completely as URL.

What I changed: I allow quotes only in the query part of the URL. And the query part is scrutinized a little closer and has to be at least similar to the "official" query format (which seems not to be complied with in every case).

If you still want to test, here is my current source file again.
Notepad_plus.zip

Yaron10 · 2020-08-12T18:47:53Z

@Uhf7,

👍
Great! Thank you for your work.

Interestingly, Firefox opens https://*/* in a new window instead of a new tab.
Not expecting an explanation. Just sharing.

Yaron10 · 2020-08-12T22:08:20Z

@Uhf7,

Minor: is MAILTO: a legit INTERNET_SCHEME_MAILTO?

Uhf7 · 2020-08-13T08:32:57Z

Yes, mailto: is a standard URL scheme. As is file: too, for example. But I think it is good to restrict the schemes to a few of the standard schemes, which the installed internet browser and the installed e-mail software can handle.

Yaron10 · 2020-08-13T10:11:12Z

@Uhf7,

Just to make sure there's no misunderstanding here: I meant MAILTO: without a following address.
For example: MAILTO: in INTERNET_SCHEME_MAILTO: (in your code) is underlined.

Thank you.

Uhf7 · 2020-08-13T10:32:28Z

@Yaron10,

Now I see it. Thank you again. I'm going to exclude this too.

Yaron10 · 2020-08-13T10:40:00Z

👍
Thank you.

Uhf7 · 2020-08-13T14:22:54Z

I did the following modifications:

Exclude scheme: alone without parameter string
Forbid underline and digits before URLs. No abc_http://xxx anymore, no 4560http://xxx neither. Although this was allowed in the past and goes through MS VS too, it doesn't feel right. And it is not accepted by Outlook, so at least one software does it like I feel. And I don't know any programming language, where _ or 0..9 are used as delimiters.
If something else should speak against forbidding underline and digits before URLs, I revert this modification.

Notepad_plus.zip

Yaron10 · 2020-08-13T16:30:20Z

@Uhf7,

Thanks for that too. 👍

Uhf7 · 2020-08-15T15:45:12Z

Has anyone new strange text passages for me, which should or should not be detected as URL?

Yaron10 · 2020-08-15T15:51:47Z

I'm done. :)

It took some time but the result is great.

Uhf7 · 2020-08-15T16:04:13Z

@Yaron10,

thank you for testing, I will PR the new detection method for URLs soon.

Yaron10 · 2020-08-23T12:32:06Z

@Uhf7,

A minor issue (not related to the parser, but I'd rather bring it up here).

Comment a link in a C++ file.
-- The text is green and the underline is black.

Paste a link in a C++ file.
-- The part after the // is interpreted as a comment.

Paste https://www.youtube.com/watch?v=VmcftreqQ6E&list=PnQIRE5O5JpiLL&index=1 in an XML file.

On hover:

You're not supposed to have un-commented links in such files. But still... A mini-minor issue. :)

Thank you.

Uhf7 · 2020-08-23T14:39:34Z

Yaron10,

this is, because standard indicators are supposed to complement syntax highlighting, not to replace it. So I find it correct, that it becomes multi-colorized while not hovered.

That only a part of the link is displayed blue while hovered, seems to be another Scintilla problem. I will search it as soon as I can. It was there before my modifications, in the INDIC_TEXTFORE style (The explorer link style is a combination of INDIC_TEXTFORE and INIDIC_PLAIN).

But, damage control: On double click, the whole URL is used, and the full box style is not affected by this. And if the URL is syntactically correct in the document, it should be single color anyway.

I bet, that Notepad++ is really the first application, which is using the standard indicators to highlight URLs, although exactly this is recommended in ScintillaDoc.html (search "URL " in it).

Yaron10 · 2020-08-23T15:51:37Z

this is, because standard indicators are supposed to complement syntax highlighting, not to replace it. So I find it correct, that it becomes multi-colorized while not hovered.

👍

That only a part of the link is displayed blue while hovered, seems to be another Scintilla problem. I will search it as soon as I can.

Thank you. But I'm not sure it should be worth it (mini-minor issue).

You didn't refer to the first case which is slightly more significant: Comment a link in a C++ file.
Should I assume it's not worth the extra work/code?

Uhf7 · 2020-08-23T18:23:39Z

You didn't refer to the first case which ...

I did mean exactly the C++ example too in the 1st part of my answer: That's syntax highlighting, and the part of the line after // is a comment in C++. There are no URLs in the C++ syntax. And the https: before the comment will make its way into the error list while compiling anyway.

And I see, that the MS VS does it not this way, it prioritizes the URL, but this seems not to be worth of imitation to me.

Yaron10 · 2020-08-23T19:16:29Z

I'm afraid we have a misunderstanding here.
My question is if the underline shouldn't be green too.

If you did understand my question correctly - I apologize.

Uhf7 · 2020-08-23T21:41:29Z

My question is if the underline shouldn't be green too.

No need to apologize. It was so clear to me, that the underline should not be green, that it didn't occur to me to mention it :-).

The underline is always in the default text color. Technically, it is not possible to make the underline color follow the syntax highlighting color.

When not hovered, we have two options:

Hidden: The URL is shown in the current syntax highlighting.
Underlined: The URL is shown in the current syntax highlighting, and underlined in default text color.

And: If the underline color would follow the current syntax highlighting color, it wouldn't unite/embrace the link anymore, it would fall apart optically even more in situations shown in your XML example.

Yaron10 · 2020-08-23T22:33:13Z

When not hovered, we have two options:

But then you came up with INDIC_EXPLORERLINK for hover. :)
(So theoretically it can be implemented for "normal-state" as well).

And: If the underline color would follow the current syntax highlighting color, it wouldn't unite/embrace the link anymore, it would fall apart optically even more in situations shown in your XML example.

This is a very good point.
But the two issues are intertwined: if (theoretically) links were excluded from syntax highlighting, the one text color could be applied to the underline too.

Just theoretically.

Thanks again for your work. I appreciate it.

Yaron10 · 2020-08-24T23:45:42Z

@Uhf7,

I've just encountered a slightly more serious issue with INDIC_EXPLORERLINK.

Word-Wrap: ON.
Paste a link in a normal text file.
Make sure the link is wrapped.
Hover.

Should I open a new issue or you'd rather continue the discussion here?

Thank you.

Uhf7 · 2020-08-25T07:37:46Z

@Yaron10,

thank you for testing thoroughly, another interesting effect.

But this is basically the same Scintilla problem as with the hovered, syntax-highlighted links, which become blue only partially. The detection whether the character belongs to a hovered indicator or not doesn't work.

I have a way to solve this filed as https://sourceforge.net/p/scintilla/bugs/2199/. Actually, I wanted to wait for an answer to this, before I create the PR to integrate it in the Notepad++ SciLexer.dll. But I will create it without answer, after an adequate waiting time.

Edit: The 100th comment to this issue, we shouldn't loose the focus here, the whole thing was actually about the cyrillic characters in a link, I have to PR my parser suggestion someday.

Yaron10 · 2020-08-25T17:48:38Z

@Uhf7,

Thank you for the quick fix. 👍

Hopefully, last post here.

Fix inaccurate URL detection and enhance URL detection for non-English character. Fix notepad-plus-plus#3912, fix notepad-plus-plus#3353, fix notepad-plus-plus#4643, fix notepad-plus-plus#5029, fix notepad-plus-plus#6155, fix notepad-plus-plus#7791, fix notepad-plus-plus#8634, close notepad-plus-plus#8921

Uhf7 mentioned this issue Aug 26, 2020

Scintilla fix for hovered indicators #8769

Closed

ArkadiuszMichalski mentioned this issue Sep 16, 2020

hyperlinks don't recognise international characters in URLs #6155

Closed

Uhf7 mentioned this issue Sep 29, 2020

URL detection with parser. #8921

Closed

donho closed this as completed in 401bfcb Oct 4, 2020

Uhf7 mentioned this issue Oct 8, 2020

[Discussion] The new URL parser #8975

Closed

A web link is not processed if it contains Cyrillic #8634

A web link is not processed if it contains Cyrillic #8634

Comments

andrecool-68 commented Jul 29, 2020

sasumner commented Jul 29, 2020

andrecool-68 commented Jul 29, 2020 • edited Loading

Yaron10 commented Jul 29, 2020

sasumner commented Jul 29, 2020 • edited Loading

andrecool-68 commented Jul 29, 2020

Yaron10 commented Jul 29, 2020

Ekopalypse commented Jul 29, 2020

Yaron10 commented Jul 29, 2020

Yaron10 commented Jul 30, 2020

Ekopalypse commented Jul 30, 2020

Yaron10 commented Jul 30, 2020

sasumner commented Jul 30, 2020

Ekopalypse commented Jul 30, 2020

Yaron10 commented Jul 30, 2020

Uhf7 commented Aug 1, 2020 • edited Loading

Uhf7 commented Aug 1, 2020 • edited Loading

Yaron10 commented Aug 1, 2020

Uhf7 commented Aug 1, 2020 • edited Loading

Yaron10 commented Aug 1, 2020

Ekopalypse commented Aug 1, 2020

sasumner commented Aug 1, 2020

Yaron10 commented Aug 1, 2020

sasumner commented Aug 1, 2020

Yaron10 commented Aug 1, 2020

Ekopalypse commented Aug 1, 2020 • edited Loading

Uhf7 commented Aug 1, 2020 • edited Loading

Yaron10 commented Aug 1, 2020

Uhf7 commented Aug 1, 2020

Yaron10 commented Aug 1, 2020

Ekopalypse commented Aug 1, 2020

Yaron10 commented Aug 9, 2020

Uhf7 commented Aug 12, 2020

Yaron10 commented Aug 12, 2020

Yaron10 commented Aug 12, 2020

Uhf7 commented Aug 13, 2020

Yaron10 commented Aug 13, 2020

Uhf7 commented Aug 13, 2020

Yaron10 commented Aug 13, 2020

Uhf7 commented Aug 13, 2020

Yaron10 commented Aug 13, 2020

Uhf7 commented Aug 15, 2020

Yaron10 commented Aug 15, 2020

Uhf7 commented Aug 15, 2020

Yaron10 commented Aug 23, 2020

Uhf7 commented Aug 23, 2020

Yaron10 commented Aug 23, 2020

Uhf7 commented Aug 23, 2020

Yaron10 commented Aug 23, 2020

Uhf7 commented Aug 23, 2020

Yaron10 commented Aug 23, 2020

Yaron10 commented Aug 24, 2020

Uhf7 commented Aug 25, 2020 • edited Loading

Yaron10 commented Aug 25, 2020

andrecool-68 commented Jul 29, 2020 •

edited

Loading

sasumner commented Jul 29, 2020 •

edited

Loading

Uhf7 commented Aug 1, 2020 •

edited

Loading

Uhf7 commented Aug 1, 2020 •

edited

Loading

Uhf7 commented Aug 1, 2020 •

edited

Loading

Ekopalypse commented Aug 1, 2020 •

edited

Loading

Uhf7 commented Aug 1, 2020 •

edited

Loading

Uhf7 commented Aug 25, 2020 •

edited

Loading