Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

A web link is not processed if it contains Cyrillic #8634

Closed
andrecool-68 opened this issue Jul 29, 2020 · 101 comments
Closed

A web link is not processed if it contains Cyrillic #8634

andrecool-68 opened this issue Jul 29, 2020 · 101 comments

Comments

@andrecool-68
Copy link

A web link is not processed if it contains Cyrillic.

Notepad++ v7.8.6 (32-bit)
Build time : Apr 21 2020 - 15:17:06
Path : I:\Tools_Servis\TextCode\NPP++\npp.7.8.6.bin\notepad++.exe
Admin mode : ON
Local Conf mode : ON
OS Name : Windows 7 Home Premium (64-bit)
OS Build : 7601.0
Plugins : ComparePlugin.dll DSpellCheck.dll Explorer.dll HexEditor.dll HTMLTag_unicode.dll ImgTag.dll JSMinNPP.dll LocationNavigate.dll mimeTools.dll MultiClipboard.dll NativeLang.dll NppConverter.dll NppExport.dll NppMarkdownPanel.dll NppSnippets.dll PreviewHTML.dll ShtirlitzNppPlugin.dll Tidy2.dll VisualStudioLineCopy.dll WebEdit.dll WindowManager.dll XMLTools.dll _CustomizeToolbar.dll

1

@sasumner
Copy link
Contributor

URL processing has been undergoing change in recent versions of N++.
Can you retest in 7.8.9 and see if it is the same as you described above in that version?

@andrecool-68
Copy link
Author

andrecool-68 commented Jul 29, 2020

@sasumner
The latest version has exactly the same behavior

Безымянный

I give an example of how to handle web links AkelPad editor

1

@Yaron10
Copy link

Yaron10 commented Jul 29, 2020

NPP uses the following RegEx which needs to be changed:
URL_REG_EXPR "[A-Za-z]+://[A-Za-z0-9_\\-\\+~.:?&@=/%#,;\\{\\}\\(\\)\\[\\]\\|\\*\\!\\\\]+"

BTW, a . at the end is interpreted as part of the URL which, obviously, is wrong.

This is a task for @guy038 the RegEx master. :)

@sasumner
Copy link
Contributor

sasumner commented Jul 29, 2020

So, not any kind of new problem relating to the change to URL displaying, but rather a long-standing thing...that there are probably other issues already open for.

@andrecool-68
Copy link
Author

@sasumner I noticed this error only today

@Yaron10
Copy link

Yaron10 commented Jul 29, 2020

So, not any kind of new problem relating to the change to URL displaying, but rather a long-standing thing

Not related to the new hotSpot.

that there are probably other issues already open for.

Possibly. :)

@Ekopalypse
Copy link
Contributor

I would vote for making the URL regex changeable via settings dialog.
No matter what regex change one would come up with now, sooner or later someone will come along who wants it different.
Having it changeable will solve it forever. Maybe with a warning in the settings, like if you do stupid things expect it to work stupidly to avoid getting issues like: once I change the url regex to .* npp is slow.

@Yaron10
Copy link

Yaron10 commented Jul 29, 2020

No matter what regex change one would come up with now, sooner or later someone will come along who wants it different.

I don't know enough about the issue.
If that is the case (and I assume it is if you say so), your suggestion is very good.

I do know that Visual Studio, Gmail and other editors interpret URLs correctly even if they contain non-Latin characters and/or certain characters (e.g. .) at the end.

@Yaron10
Copy link

Yaron10 commented Jul 30, 2020

Replacing
#define URL_REG_EXPR "[A-Za-z]+://[A-Za-z0-9_\\-\\+~.:?&@=/%#,;\\{\\}\\(\\)\\[\\]\\|\\*\\!\\\\]+"
with
#define URL_REG_EXPR "(http|ftp|https)://([\\w-]+(?:(?:\\.[\\w-]+)+))([\\w.,@?^=%&:/~+#-]*[\\w@?^=%&/~+#-])?"
solves the following cases:

https://github.com/notepad-plus-plus/notepad-plus-plus.
(https://github.com/notepad-plus-plus/notepad-plus-plus)
https://www.rnids.rs/национални-домени/регистрација-националних-домена
https://www.morfix.co.il/שלום

Feel free to test, improve and submit a PR.


RegEx source: https://stackoverflow.com/questions/6038061/regular-expression-to-find-urls-within-a-string

@Ekopalypse
Copy link
Contributor

@Yaron10 - this would restrict urls to start with http, ftp and https only, but IANA define a lot more. And maybe in two weeks, month, ... even more?

@Yaron10
Copy link

Yaron10 commented Jul 30, 2020

@Ekopalypse,

As I wrote: I don't know enough about the issue.
Feel free to improve. :)

@sasumner
Copy link
Contributor

but IANA define a lot more

Which may be why we currently have:

#define URL_REG_EXPR "[A-Za-z]+://...

Of course in the IANA link I also see some numbers, -, and probably some other non (?i)a-z

@Ekopalypse
Copy link
Contributor

@Yaron10 :-) and we've come full circle - I vote for ... :-)

@Yaron10
Copy link

Yaron10 commented Jul 30, 2020

@Ekopalypse,

Well, I vote for it too. :)
Still, I suppose the current RegEx can be modified to solve the . and ) issues.

@Uhf7
Copy link
Contributor

Uhf7 commented Aug 1, 2020

Only a opinion thrown in here, because I observe URL problems for a while know. There are two parts of the problem, completely independent of each other:

  • Detection of URLs,
  • Displaying and highlighting detected URLs.

Until now, I did only get involved in the second part of the problem. I did never care about the detection.

But, whenever I read about problems with that, the question occurs to me: Is a Regex expression really the right way to detect URLs? From inside C++ program code? Since you didn't manage to solve this for years, I would say: no. Is it so difficult to write a simple parser for it? I wonder, why no one ever seemed to have the idea to do this. And before you make this Regex configurable, you should consider, that there may be other solutions than a Regex expression for it. See #3353, #5029 and #7791 also. Edit: and #3912.

@Uhf7
Copy link
Contributor

Uhf7 commented Aug 1, 2020

@Yaron10
I tried your new regex expression, but there must be more to it. If I simply replace URL_REG_EXPR, no link is detected anymore. So I assume, you have modified a setting like "enable super special regex expressions", or you use an own regex library, or a plug-in or something? Can you check this?

@Yaron10
Copy link

Yaron10 commented Aug 1, 2020

@Uhf7,

You're using your modified SciLexer.dll with which links are not detected.
Using the original SciLexer.dll, links are interpreted as expected.
See #8588 (comment).

This may explain the significant size difference between the original DLL and yours.

Can you build your modified SciLexer.dll with the same settings as the original (and upload it)?
Alternatively, in "enable super special regex expressions" do you mean a setting in building SciLexer.dll?


Is it so difficult to write a simple parser for it?

Searching for a better RegEx the other day, I've come across this idea.
We're looking forward to your next PR.

@Uhf7
Copy link
Contributor

Uhf7 commented Aug 1, 2020

@Yaron10,
you are right, with the original SciLexer.dll from 7.8.9 your regex expression works.

With the Appveyor build, it doesn't. Here, only the original regex expression works. So there is an unnecessary difference between the Appveyor build and the original build, which should be fixed someday.

Thank you, now I can run your regular expression against my parser.

Edit: My own build of SciLexer.dll too doesn't accept the "new" regex syntax, there must be a trick.

@Yaron10
Copy link

Yaron10 commented Aug 1, 2020

My own build of SciLexer.dll too doesn't accept the "new" regex syntax, there must be a trick.

Please share if you figure it out.

Thank you.

@Ekopalypse
Copy link
Contributor

Ok, did some reasearch and found this community discussion from 2017 which links to this github issue.

And a new regex (?-s)[A-Za-z][A-Za-z0-9+.-]+://[^\s]+?(?=\s|\z) has been proposed.

But then it was closed with comment: requirement could not be fulfilled.
But from my limited regex knowledge I assume it does fulfill the requirements.

I did a test and it looks like it finds what needs to be found.

image

@sasumner
Copy link
Contributor

sasumner commented Aug 1, 2020

I'm not one to play regex golf, but the regex @Ekopalypse mentions just above seems a bit, well, confuscated. This would seem to cover it, and be simpler (although I did no testing):

(?i-s)[a-z][a-z0-9+.-]+://\S+?(?=\s|\z)

Normally, it doesn't matter, but if changing something this "important", it should be done so that someone in the future doesn't have to puzzle about why it was made "overcomplicated".

Note also that neither of these 2 regex handle the case of a URL wrapped in parentheses/brackets/etc. correctly; examples:

  • (http://www.google.com) <--- it will include the ) in the match; obviously won't include the (

  • [http://www.google.com] <--- it will include the ] in the match; obviously won't include the [

  • etc., such as { and }

People (including me!) like to put URLs in parentheses; currently I have to remember to do it this way:

( http://www.google.com ) <--- note space before h and after m

Perhaps ) and friends are valid in a URL, and thus nothing can be done about this situation.

Another point: Is there no Unicode possibility before the :// in URLs?

@Yaron10
Copy link

Yaron10 commented Aug 1, 2020

The ) and . at the end are important to me too.

@sasumner
Copy link
Contributor

sasumner commented Aug 1, 2020

The ) and . at the end are important to me too.

That means you would NOT want them included in the URL, correct?

@Yaron10
Copy link

Yaron10 commented Aug 1, 2020

That means you would NOT want them included in the URL, correct?

Correct.

Perhaps ) and friends are valid in a URL, and thus nothing can be done about this situation.

I hope that is NOT correct. :)

@Ekopalypse
Copy link
Contributor

Ekopalypse commented Aug 1, 2020

@sasumner

Is there no Unicode possibility before the :// in URLs?

No, according to https://tools.ietf.org/html/rfc3986#section-3.1, to quote

3.1. Scheme
...
scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
...

and according to https://tools.ietf.org/html/rfc2234#section-6.1

...
ALPHA = %x41-5A / %x61-7A ; A-Z / a-z

...
DIGIT = %x30-39
; 0-9
...

@Uhf7
Copy link
Contributor

Uhf7 commented Aug 1, 2020

@Yaron10, I can now build a SciLexer.dll which accepts your regular expression. And, absolute surprise, it doesn't use the Scintilla-build-in regex functions, but the regex functions from the boost library. This seems to be, because you regex acrobats can never be satisfied with standard regular expressions.

The mechanism is, that Scintilla provides a class for regex operations, which can be left as it is or be overridden with own regex functions. The code to do this is contained in \scintilla\boostregex\BoostRegExSearch.cxx.

To add the boost regex stuff to the project, you have to have the boost library installed on your computer. I downloaded a precompiled version from https://sourceforge.net/projects/boost/. Then, you have to add the source files from the scintilla\boostregex\ directory to your project, drag and drop worked for me. Then follow the error messages of the compiler, until it can build everything. You need:

  • to add an additional include file directory, where the boost\regex.hpp can be found.
  • to add an additional library file directory, where the .lib file can be found, which fits to your compiler and to your architecture/debug/release settings. The name of the library needed can be read from the compiler error log.

After you can build it, you have to turn it on by defining the preprocessor symbol SCI_OWNREGEX, and that's it. If you hate deprecation warnings, you can define _SILENCE_ALL_CXX17_DEPRECATION_WARNINGS.

By the way, I'm happy that the standard regular expression used for URL's here comes out with the regex syntax naturally provided by the original SciLexer.dll.

@Yaron10
Copy link

Yaron10 commented Aug 1, 2020

@Uhf7,

Thank you for the detailed guide.
Too much work for an amateur like myself. I'll wait for the next NPP release which should include the DLL with boost.

@Uhf7
Copy link
Contributor

Uhf7 commented Aug 1, 2020

@Yaron10
Copy link

Yaron10 commented Aug 1, 2020

Yes, we're looking forward to your parser. :)

BTW, the second link is not interpreted correctly here (in your post) either.

@Ekopalypse
Copy link
Contributor

I did a test to see if there is a performance impact and it seems that this

[[:alpha:]][[:alnum:]+.-]+://\S+

is ~3% faster than the one currently used and has the advantage to find unicode letters as well.
Finding and marking 10.000 urls took, with pythonscript on my machine, 0.45 seconds.
I assume pure C++ can do this much faster and 10.000 urls in a document should be unusual.

@Yaron10
Copy link

Yaron10 commented Aug 9, 2020

@Uhf7,
Thank you for fixing the Copy/Paste issue. 👍

@Uhf7
Copy link
Contributor

Uhf7 commented Aug 12, 2020

@Yaron10, @Ekopalypse I revised my parser version to exclude your nice example from being detected completely as URL.

What I changed: I allow quotes only in the query part of the URL. And the query part is scrutinized a little closer and has to be at least similar to the "official" query format (which seems not to be complied with in every case).

If you still want to test, here is my current source file again.
Notepad_plus.zip

@Yaron10
Copy link

Yaron10 commented Aug 12, 2020

@Uhf7,

👍
Great! Thank you for your work.


Interestingly, Firefox opens https://*/* in a new window instead of a new tab.
Not expecting an explanation. Just sharing.

@Yaron10
Copy link

Yaron10 commented Aug 12, 2020

@Uhf7,

Minor: is MAILTO: a legit INTERNET_SCHEME_MAILTO?

@Uhf7
Copy link
Contributor

Uhf7 commented Aug 13, 2020

Yes, mailto: is a standard URL scheme. As is file: too, for example. But I think it is good to restrict the schemes to a few of the standard schemes, which the installed internet browser and the installed e-mail software can handle.

@Yaron10
Copy link

Yaron10 commented Aug 13, 2020

@Uhf7,

Just to make sure there's no misunderstanding here: I meant MAILTO: without a following address.
For example: MAILTO: in INTERNET_SCHEME_MAILTO: (in your code) is underlined.

Thank you.

@Uhf7
Copy link
Contributor

Uhf7 commented Aug 13, 2020

@Yaron10,

Now I see it. Thank you again. I'm going to exclude this too.

@Yaron10
Copy link

Yaron10 commented Aug 13, 2020

👍
Thank you.

@Uhf7
Copy link
Contributor

Uhf7 commented Aug 13, 2020

I did the following modifications:

  • Exclude scheme: alone without parameter string
  • Forbid underline and digits before URLs. No abc_http://xxx anymore, no 4560http://xxx neither. Although this was allowed in the past and goes through MS VS too, it doesn't feel right. And it is not accepted by Outlook, so at least one software does it like I feel. And I don't know any programming language, where _ or 0..9 are used as delimiters.
    If something else should speak against forbidding underline and digits before URLs, I revert this modification.

Notepad_plus.zip

@Yaron10
Copy link

Yaron10 commented Aug 13, 2020

@Uhf7,

Thanks for that too. 👍

@Uhf7
Copy link
Contributor

Uhf7 commented Aug 15, 2020

Has anyone new strange text passages for me, which should or should not be detected as URL?

@Yaron10
Copy link

Yaron10 commented Aug 15, 2020

I'm done. :)

It took some time but the result is great.

@Uhf7
Copy link
Contributor

Uhf7 commented Aug 15, 2020

@Yaron10,

thank you for testing, I will PR the new detection method for URLs soon.

@Yaron10
Copy link

Yaron10 commented Aug 23, 2020

@Uhf7,

A minor issue (not related to the parser, but I'd rather bring it up here).

Comment a link in a C++ file.
-- The text is green and the underline is black.
תמונה


Paste a link in a C++ file.
-- The part after the // is interpreted as a comment.
תמונה


Paste https://www.youtube.com/watch?v=VmcftreqQ6E&list=PnQIRE5O5JpiLL&index=1 in an XML file.
תמונה

On hover:
תמונה

You're not supposed to have un-commented links in such files. But still... A mini-minor issue. :)

Thank you.

@Uhf7
Copy link
Contributor

Uhf7 commented Aug 23, 2020

Yaron10,

this is, because standard indicators are supposed to complement syntax highlighting, not to replace it. So I find it correct, that it becomes multi-colorized while not hovered.

That only a part of the link is displayed blue while hovered, seems to be another Scintilla problem. I will search it as soon as I can. It was there before my modifications, in the INDIC_TEXTFORE style (The explorer link style is a combination of INDIC_TEXTFORE and INIDIC_PLAIN).

But, damage control: On double click, the whole URL is used, and the full box style is not affected by this. And if the URL is syntactically correct in the document, it should be single color anyway.

I bet, that Notepad++ is really the first application, which is using the standard indicators to highlight URLs, although exactly this is recommended in ScintillaDoc.html (search "URL " in it).

@Yaron10
Copy link

Yaron10 commented Aug 23, 2020

this is, because standard indicators are supposed to complement syntax highlighting, not to replace it. So I find it correct, that it becomes multi-colorized while not hovered.

👍

That only a part of the link is displayed blue while hovered, seems to be another Scintilla problem. I will search it as soon as I can.

Thank you. But I'm not sure it should be worth it (mini-minor issue).

You didn't refer to the first case which is slightly more significant: Comment a link in a C++ file.
Should I assume it's not worth the extra work/code?

@Uhf7
Copy link
Contributor

Uhf7 commented Aug 23, 2020

You didn't refer to the first case which ...

I did mean exactly the C++ example too in the 1st part of my answer: That's syntax highlighting, and the part of the line after // is a comment in C++. There are no URLs in the C++ syntax. And the https: before the comment will make its way into the error list while compiling anyway.

And I see, that the MS VS does it not this way, it prioritizes the URL, but this seems not to be worth of imitation to me.

@Yaron10
Copy link

Yaron10 commented Aug 23, 2020

I'm afraid we have a misunderstanding here.
My question is if the underline shouldn't be green too.

תמונה

If you did understand my question correctly - I apologize.

@Uhf7
Copy link
Contributor

Uhf7 commented Aug 23, 2020

My question is if the underline shouldn't be green too.

No need to apologize. It was so clear to me, that the underline should not be green, that it didn't occur to me to mention it :-).

The underline is always in the default text color. Technically, it is not possible to make the underline color follow the syntax highlighting color.

When not hovered, we have two options:

  • Hidden: The URL is shown in the current syntax highlighting.
  • Underlined: The URL is shown in the current syntax highlighting, and underlined in default text color.

And: If the underline color would follow the current syntax highlighting color, it wouldn't unite/embrace the link anymore, it would fall apart optically even more in situations shown in your XML example.

@Yaron10
Copy link

Yaron10 commented Aug 23, 2020

When not hovered, we have two options:

But then you came up with INDIC_EXPLORERLINK for hover. :)
(So theoretically it can be implemented for "normal-state" as well).

And: If the underline color would follow the current syntax highlighting color, it wouldn't unite/embrace the link anymore, it would fall apart optically even more in situations shown in your XML example.

This is a very good point.
But the two issues are intertwined: if (theoretically) links were excluded from syntax highlighting, the one text color could be applied to the underline too.

Just theoretically.

Thanks again for your work. I appreciate it.

@Yaron10
Copy link

Yaron10 commented Aug 24, 2020

@Uhf7,

I've just encountered a slightly more serious issue with INDIC_EXPLORERLINK.

Word-Wrap: ON.
Paste a link in a normal text file.
Make sure the link is wrapped.
Hover.

תמונה

Should I open a new issue or you'd rather continue the discussion here?

Thank you.

@Uhf7
Copy link
Contributor

Uhf7 commented Aug 25, 2020

@Yaron10,

thank you for testing thoroughly, another interesting effect.

But this is basically the same Scintilla problem as with the hovered, syntax-highlighted links, which become blue only partially. The detection whether the character belongs to a hovered indicator or not doesn't work.

I have a way to solve this filed as https://sourceforge.net/p/scintilla/bugs/2199/. Actually, I wanted to wait for an answer to this, before I create the PR to integrate it in the Notepad++ SciLexer.dll. But I will create it without answer, after an adequate waiting time.

Edit: The 100th comment to this issue, we shouldn't loose the focus here, the whole thing was actually about the cyrillic characters in a link, I have to PR my parser suggestion someday.

@Yaron10
Copy link

Yaron10 commented Aug 25, 2020

@Uhf7,

Thank you for the quick fix. 👍

Hopefully, last post here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants