Unicode characters inconsistently cannot be displayed in Notepad++ #3747

Closed
charliehoward4dp opened this issue Sep 18, 2017 · 72 comments

@charliehoward4dp

Description of the Issue

In a Notepad++ document that is encoded as UTF-8 (no BOM), many Unicode characters are not displayed; a hollow square appears in their place. If a displayable Unicode character is added to a line containing undisplayable Unicode characters, the undisplayable ones suddenly appear. Removing the "good" one makes the others revert to hollow squares. A simple example:

☆◬⊗⊠⋆⧆⨂

Paste that line into NP++ and you will see all the characters. Remove the leading star ☆ and the others become squares. Restore the star and the others re-appear.

Steps to Reproduce the Issue

  1. Create a UTF-8 (no BOM) text file. (This is the only hard part of the procedure.)
  2. Copy and paste the following string into the UTF-8 (no BOM) Notepad++ document: ☆◬⊗⊠⋆⧆⨂
  3. All of those characters display properly.
  4. Delete the leading star ☆.
  5. The other characters become hollow squares.
  6. Restore the ☆ and the other characters reappear.
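
Step 1 can be done outside Notepad++. The following is a minimal sketch, not part of the original report, that writes the test line to a file as raw UTF-8 bytes without a byte order mark. The helper names (`writeUtf8NoBom`, `readBytes`, `startsWithUtf8Bom`) and the file name used in the usage note are made up for illustration.

```cpp
#include <fstream>
#include <iterator>
#include <string>

// The repro line from the report, spelled as explicit UTF-8 bytes so the
// result does not depend on the encoding of this source file.
const std::string kReproLine =
    "\xE2\x98\x86"   // U+2606 ☆
    "\xE2\x97\xAC"   // U+25EC ◬
    "\xE2\x8A\x97"   // U+2297 ⊗
    "\xE2\x8A\xA0"   // U+22A0 ⊠
    "\xE2\x8B\x86"   // U+22C6 ⋆
    "\xE2\xA7\x86"   // U+29C6 ⧆
    "\xE2\xA8\x82";  // U+2A02 ⨂

// Hypothetical helper: writes the bytes unchanged. Nothing is prepended,
// so the file has no BOM (EF BB BF).
bool writeUtf8NoBom(const std::string& path, const std::string& text) {
    std::ofstream out(path, std::ios::binary);
    if (!out) {
        return false;
    }
    out.write(text.data(), static_cast<std::streamsize>(text.size()));
    return static_cast<bool>(out);
}

// Hypothetical helper: reads a file back as raw bytes.
std::string readBytes(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}

// True if the byte string starts with the UTF-8 byte order mark.
bool startsWithUtf8Bom(const std::string& bytes) {
    return bytes.size() >= 3 && bytes.compare(0, 3, "\xEF\xBB\xBF") == 0;
}
```

Calling `writeUtf8NoBom("repro.txt", kReproLine)` and opening the resulting file in Notepad++ should give a document whose encoding is detected as UTF-8 without BOM, assuming default detection settings.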

Expected Behavior

All of the characters should always appear.

Actual Behavior

They only appear if an always-acceptable Unicode character is on the same line. If an always-acceptable Unicode character is in the document but not on the same line, certain Unicode characters, such as, but not limited to, the ones shown above, will not be displayed properly.

Debug Information

Notepad++ v7.5.1 (32-bit)
Build time : Aug 29 2017 - 02:35:41
Path : C:\Program Files (x86)\Notepad++\notepad++.exe
Admin mode : OFF
Local Conf mode : OFF
OS : Windows 10 (64-bit)
Plugins : ComparePlugin.dll mimeTools.dll NppConverter.dll NppExport.dll NppFTP.dll NppTextFX.dll PluginManager.dll SpellChecker.dll

This occurs with characters from many of the Unicode blocks.

@dennis-thomas-o

dennis-thomas-o commented Nov 5, 2018

I was able to reproduce this as well.

I tested with the Default Style in Style Configurator set to Courier New, Consolas, Arial and Times New Roman. The file was a TXT file and I tested encoding in UTF-8, UTF-8 BOM, UCS-2 BE BOM and UCS-2 LE BOM. All of them showed the same result.

I believe this issue would happen any time you enter a character that is NOT contained in the selected font and then add/remove on the same line a character which IS contained in the selected font.

IMHO, something seems not quite right with the font-substitution routines. This was in a TXT file encoded with

Debug Information

Notepad++ v7.5.9 (64-bit)
Build time : Oct 14 2018 - 15:19:55
Path : C:\Program Files\Notepad++\notepad++.exe
Admin mode : OFF
Local Conf mode : OFF
OS : Windows 10 (64-bit)
Plugins : DSpellCheck.dll mimeTools.dll NppConverter.dll

@geometrian

I believe this issue would happen any time you enter a character that is NOT contained in the selected font

This may be, but it also appears to occur in other circumstances as well. For example, "⎷" (U+23B7 "RADICAL SYMBOL BOTTOM") is present in Consolas, Courier New, DejaVu Sans Mono, and Lucida Console, but if you put it in a new text file, it won't show up with any of those fonts.

@Uhf7
Contributor

Uhf7 commented May 26, 2020

it also appears to occur in other circumstances as well

Cannot confirm this example on my system.

The U+23B7 character is not in my DejaVu Sans Mono. There is U+23AE, followed by U+23CE.

The U+23B7 character is not in my Courier New either. There is U+2321, followed by U+2500.

Same for my Consolas: there is U+2321, followed by U+2460. Same for my Lucida Console, which ends at U+0433.

So this example does not seem to poke a hole in the theory that only characters unavailable in the current font are affected.
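
The reasoning above (a font lists U+23AE and then U+23CE, so U+23B7 falls into the gap) can be sketched as a lookup in a sorted list of covered code points. This is only an illustration: `fontCovers` is a hypothetical helper, and the list contains just the two neighbours reported for DejaVu Sans Mono on one system, not a real font's character map.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical helper: `sortedCodePoints` is an ascending list of the code
// points for which a font provides a glyph.
bool fontCovers(const std::vector<char32_t>& sortedCodePoints, char32_t codePoint) {
    return std::binary_search(sortedCodePoints.begin(), sortedCodePoints.end(), codePoint);
}

// Only the two neighbours reported in the comment above; a real character
// map has many more entries.
const std::vector<char32_t> kReportedNeighbours = {0x23AE, 0x23CE};
```

With this data, `fontCovers(kReportedNeighbours, 0x23B7)` is false, which matches the observation that the character is not in that copy of the font.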

@Ekopalypse
Contributor

Ekopalypse commented May 26, 2020

It might be, somehow, related to SCI_SETTECHNOLOGY configuration.

font_issue

@Ekopalypse
Contributor

Ekopalypse commented May 26, 2020

@ValZapod
but I don't see the issue with the larger autocompletion box.
Maybe this was already fixed with the scintilla version used by npp.

@Uhf7
Contributor

Uhf7 commented May 26, 2020

@Ekopalypse,
the SCI_SETTECHNOLOGY approach looks very promising on my system too.

I added an execute(SCI_SETTECHNOLOGY, <n>); call to ScintillaEditView::init to test it.

Technology 0:

Techno-0

Technologies 1, 2 and 3:

Techno-1

Both screenshots show the same file automatically loaded after start, the only thing I did was moving the cursor to the right bracket of the two marked brackets.

I used Courier New here.

The new technologies seem to size the substituted characters better than technology 0. Edit: But it has nothing to do with "fixed font" anymore; the substitutions seem to have quite variable widths.
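
For readers unfamiliar with the numbers, the technologies 0 to 3 mentioned above correspond to the `SC_TECHNOLOGY_*` values passed with the `SCI_SETTECHNOLOGY` message. The sketch below copies those values from Scintilla.h into local constants so that it compiles on its own; the `sketch` namespace and the two helper functions are illustrative names, not Notepad++ or Scintilla code.

```cpp
#include <string>

namespace sketch {

// Values as defined in Scintilla.h, repeated here under local names so the
// example does not need the Scintilla headers.
constexpr int kSciSetTechnology = 2630;         // SCI_SETTECHNOLOGY
constexpr int kTechnologyDefault = 0;           // SC_TECHNOLOGY_DEFAULT (GDI)
constexpr int kTechnologyDirectWrite = 1;       // SC_TECHNOLOGY_DIRECTWRITE
constexpr int kTechnologyDirectWriteRetain = 2; // SC_TECHNOLOGY_DIRECTWRITERETAIN
constexpr int kTechnologyDirectWriteDC = 3;     // SC_TECHNOLOGY_DIRECTWRITEDC

// Illustrative helper: a readable name for each technology value.
std::string technologyName(int technology) {
    switch (technology) {
    case kTechnologyDefault:           return "GDI";
    case kTechnologyDirectWrite:       return "DirectWrite";
    case kTechnologyDirectWriteRetain: return "DirectWrite (retain frame)";
    case kTechnologyDirectWriteDC:     return "DirectWrite (draw to GDI DC)";
    default:                           return "unknown";
    }
}

// Technologies 1 to 3 all render text through DirectWrite; they differ in
// how the result is brought to the screen.
bool usesDirectWrite(int technology) {
    return technology >= kTechnologyDirectWrite && technology <= kTechnologyDirectWriteDC;
}

}  // namespace sketch

// On Windows, with the real headers, the switch would be sent roughly as
// (not compiled here; scintillaHwnd is a placeholder):
//   ::SendMessage(scintillaHwnd, SCI_SETTECHNOLOGY, SC_TECHNOLOGY_DIRECTWRITE, 0);
```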

@Ekopalypse
Contributor

@ValZapod

You mean brackets? I think you use not Courier New Font? It is bad in it, and good in DejaVu Sans Mono. You can try from #442

No, I mean the screenshot and discussion you linked to
https://trac.wxwidgets.org/attachment/ticket/17804/17804-SetTechnology1.png

There it has been reported that the words in an autocompletion box are bigger using directwrite
instead of default. I'm using RobotoMono font.

@Ekopalypse
Contributor

@ValZapod @Uhf7
I don't seem to be able to get the results you get with the Courier New font.
The ∈ is always displayed, so I assume that something else has an additional effect.

@Ekopalypse
Contributor

Ekopalypse commented May 26, 2020

Downloaded 7.8.6 x64 and did a retest

font_issue2

The "bracket issue" doesn't seem to happen for me. (??)

@Uhf7
Contributor

Uhf7 commented May 26, 2020

@Ekopalypse

I can make the ∈ visible with the Courier New font now too, using technology 0 and some hand-configured font linking, which actually looks a little off:

Techno-0fs

What I did: There is a registry entry

HKLM\SOFTWARE\Microsoft\Windows NT\CurrentVersion\FontLink\SystemLink

Under it, there are many multi-string values, named after fonts. There was no value named Courier New, so I created a multi-string value named Courier New and copied the data of the Lucida Sans Unicode value into it. A shot in the dark. Immediately afterwards, nothing improved, but after I rebooted my system, the missing characters became visible. And: if I move the cursor to the famous right bracket, the small ∈ mutates to a big ∈ instead of an empty frame. So it certainly depends on the font-linking setup too. And if I could set up the font linking in such a way that the font-linked ∈ looks the same as the "normal" ∈ (wherever it comes from), then everything would be fine with technology 0.

@Ekopalypse
Contributor

@Uhf7 Is not set for me.
image

@Uhf7
Contributor

Uhf7 commented May 26, 2020

@Ekopalypse

then certainly another trick exists. Running out of thoughts here. On my system, the 32 bit version works exactly like the 64 bit version.

A major difference is the system itself: I use
OS Name : Windows 10 Pro (64-bit)
OS Version : 1607
OS Build : 14393.576,

you use Windows 7. I'll try it on my Windows 7 system ...

@Uhf7
Contributor

Uhf7 commented May 26, 2020

Ok, under Windows 7, the ∈ works fine, with and without bracket highlighting, but some other characters don't. Using technology 0.

Techno-0-w7

With the 64 bit version of Npp, the same characters are missing.

Using technology 1:

Techno-0-w7-t1

What else could you ask for? Looks perfect to me.

Technology 1, Windows 7, 64 bit version of Npp:

Techno-0-w7-t1-64

A dream, if you ask me. Compared to the current state.

@Ekopalypse
Contributor

@ValZapod

The font never actually changed, you need to press on Enable global font

No, it was started with these settings. I switched to global overwrite to show that no other font is defined. If global override is NOT checked, the default setting takes precedence.
Yes, I still use Windows 7 why not? I don't do any "mission critical transaction" with windows OS anyway.

Cannot we somehow force the rendering that is used when you type in ☆?

Npp has no setting for this yet. What you can do is use one of the scripting language plugins,
like PythonScript, LuaScript ...; even NppExec can be used to set the technology to DirectWrite.

@Uhf7 - hmm :-D what should I say - Windows 10 broke it :-(
Thanks for testing and btw. thank you for your contributing work. Much appreciated.

@Uhf7
Contributor

Uhf7 commented May 26, 2020

@ValZapod

I see no negative effect when removing the ☆, if I use technology 1:

Techno-0-w7-t1-64-2

or the ⇒:

Techno-0-w7-t1-64-3

I certainly start to get messed up with my screenshot names here, that's why I don't post any pictures from re-inserting the ☆ and the ⇒ successfully, but for me, with technology 1 everything works fine. Under Windows 7 and under Windows 10, I believe.

@Ekopalypse
Contributor

My knowledge of fonts, rendering, DirectWrite... is limited to what I have posted.
I can't say for sure if it's a problem in Scintilla or Windows or the font used or ...,
but if using DirectWrite offers a way to either work around or solve this, then I would
vote to add something to the settings that would allow the user to set it.

@Uhf7
Contributor

Uhf7 commented May 26, 2020

@Ekopalypse

using a plugin or NppExec to make the characters display correctly is not exactly what I call a solution of the problem. It's a possible work-around, but wouldn't it be nice when the characters are displayed correctly without additional actions?

And, @ValZapod, how long would it take until we have a new Scintilla version? (Perhaps, the Scintilla developers will say: Use technology 1 or higher! That would be interesting)

I would feel better if Npp itself switched the technology to a working one. Maybe it can be included in the configuration somehow, so that there is a safe fallback if the technology switch doesn't work on some systems.

@Ekopalypse
Contributor

@Uhf7 - 100% correct :-D

@Uhf7
Contributor

Uhf7 commented May 27, 2020

An issue close to this one is #2287. It is the same problem they describe there, existing since 2016, and it is solved there by setting the technology to DirectWrite with the help of a plug-in.

Thank you for that solution, but this is something for insiders. For a new user, or a user who just edits files without caring about development, this solution is very hard to find.

So I would fully support what @jefflomax said in #2287:

Notepad++ should support ligatures out of the box, not thru hacks or adding plugins users neither need nor want.

So I will try to push it to master now, with a PR. If we don't do this now, the next people will arrive in two years and waste their time testing it again and again and again.

@Uhf7
Contributor

Uhf7 commented May 28, 2020

Found an old Windows Vista in my virtual machine park, the following screenshots support the necessity to make the DirectWrite technology feature configurable. That Scintilla can load Direct2D does not mean automatically that this produces better results on old systems.

Vista, Technology 0, Courier New
Techno-0-Vista

Vista, Technology 1, Courier New
Techno-1-Vista-1

Vista, Technology 0, DejaVu Sans Mono
Techno-0-Vista-DejaVu

Vista, Technology 1, DejaVu Sans Mono
Techno-1-Vista-DejaVu

@Uhf7
Contributor

Uhf7 commented May 28, 2020

Maybe. But Unicode itself was already there under Vista. What bugs me more is that technology 1 under this Vista sometimes seems to wreck "normal" characters near the ∈ character.

@Uhf7
Contributor

Uhf7 commented May 28, 2020

@ValZapod

You wrote 2 days ago

https://sourceforge.net/p/scintilla/bugs/1393/ is our bug.

The Unicode character U+25C6 (◆) displays in Npp with and without DirectWrite technology. Even in Windows 7.

So I cannot verify that this is exactly "our" bug. And it was 2012. And he used Windows XP. And I'm sure there are many effects which can lead to empty frames instead of correct characters. I simply don't believe that it's promising to go to them and ask them to fix exactly this issue now.

@Uhf7
Contributor

Uhf7 commented May 28, 2020

Screenshot?

The second screenshot of my Vista screenshots, headlined "Vista, Technology 1, Courier New".
Most "normal" characters in line 3 don't look like Courier New anymore.

@nyamatongwe

nyamatongwe commented Jun 7, 2020 via email

@hpwamr

hpwamr commented Jun 8, 2020

Wow. Paste ⊗⊠⋆⧆⨂ in your notepad3, it will get broken! Nice, I will open an issue there. P.S. Or it is not yours?

Nope, The owner of Notepad3 is "Derick Payne" 😉

@Ekopalypse
Contributor

@ValZapod - seems you misunderstood most of the thread.
This is what I suggested 13 days ago and what @Uhf7 is working on.

@Ekopalypse
Contributor

I don't think so, if you check his PR then you will see that he added it to the preference dialog.

@Ekopalypse
Contributor

Too much noise for my taste, I'm out.

@Uhf7
Contributor

Uhf7 commented Jun 11, 2020

@ValZapod I saw the UI already, but had no real opinion about it, because it doesn't belong to this project, so it does not help me here. My opinion regarding the technology settings in the screenshot: two options too many. The difference in text rendering is between the Windows GDI TextOut function on one side and the DirectWrite equivalent on the other side. The rest is about how to bring the rendering result of DirectWrite to the screen.

@nyamatongwe

nyamatongwe commented Aug 31, 2020 via email

@Uhf7
Contributor

Uhf7 commented Sep 1, 2020

This "fix" is at least a hint where the problem comes from: It comes directly from the Windows GDI text output functions for wide characters. I did some experiments based on this information.

The Windows GDI functions, which are used by Scintilla and which do not work correctly, are:

  • ExtTextOutW
  • GetTextExtentPoint32W
  • GetTextExtentExPointW

The common error of these functions seems to be that they use squares instead of characters for some 'bad' Unicode characters, as long as there is no 'good' Unicode character in the text string.

I have no list of 'good' or 'bad' Unicode characters, this is only a term for it I invented here. But I can name two 'good' Unicode characters: 0x0000 and 0x200B. If one of those two characters is in the text, all other Unicode characters are displayed correctly. The 0x0000 character has been used by @KnIfER for the "fix". Unfortunately, it has a width, when we use it with the Windows GDI functions.

So I went for the 0x200B character (zero width space) in my experiments. A possible fix is to silently append the 0x200B character to all text strings passed to the Windows functions mentioned above. Then they produce the correct character widths and the correct output.

To make this experiment fly without additional text copy operations, I modified the TextWide class in a sneaky way. The VarBuffer is now one character longer than the actual text and this additional character is the Zero width space. tlen remains as it is, to avoid any behavior modifications.

class TextWide : public VarBuffer<wchar_t, stackBufferLength> {
public:
	int tlen;	// Using int instead of size_t as most Win32 APIs take int.
	TextWide(std::string_view text, bool unicodeMode, int codePage=0) :
		// One extra element reserves room for the trailing zero width space.
		VarBuffer<wchar_t, stackBufferLength>(text.length() + 1) {
		if (unicodeMode) {
			tlen = static_cast<int>(UTF16FromUTF8(text, buffer, text.length()));
		} else {
			// Support Asian string display in 9x English
			tlen = ::MultiByteToWideChar(codePage, 0, text.data(), static_cast<int>(text.length()),
				buffer, static_cast<int>(text.length()));
		}
		// Append U+200B after the converted text; tlen deliberately excludes it.
		buffer[tlen] = 0x200b;
	}
};

After modifying the TextWide class this way, I can use tlen+1 as character count for the ExtTextOutW call and for all GetTextExtentPoint32W calls, to smuggle in the 'good' Unicode character.

What remains here is the GetTextExtentExPointW call in SurfaceGDI::MeasureWidths. Here, I had to increase the size of the poses buffer, and I had to set the result parameter fit to the actual length of the text. This can be done without side effects, because the maxWidthMeasure parameter is equal to INT_MAX, so I assume that all characters always fit into this width.

	const TextWide tbuf(text, unicodeMode, codePage);
	TextPositionsI poses(tbuf.tlen + 1);
	if (!::GetTextExtentExPointW(hdc, tbuf.buffer, tbuf.tlen + 1, maxWidthMeasure, &fit, poses.buffer, &sz)) {
		// Failure
		return;
	}
	fit = tbuf.tlen;

This experimental fix runs on my system without assertions in debug mode and displays the correct characters using the Windows GDI functions.
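
The bookkeeping of this experiment (pad the string with U+200B, measure the padded string, then report only the positions of the original characters) can be tried without Windows. The sketch below is a portable stand-in, not Scintilla code: `measurePositions` is a fake measuring function that assumes every UTF-16 unit has the same width and U+200B has width zero, and all function names are invented for this illustration.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

constexpr char16_t kZeroWidthSpace = u'\u200B';

// Returns a copy of `text` with one trailing zero width space, mirroring the
// extra buffer element in the modified TextWide class.
std::u16string padForGdi(std::u16string_view text) {
    std::u16string padded(text);
    padded.push_back(kZeroWidthSpace);
    return padded;
}

// Fake stand-in for GetTextExtentExPointW: returns the running right edge of
// each UTF-16 unit. Assumption: every unit is `unitWidth` wide except U+200B,
// which is 0 wide.
std::vector<int> measurePositions(std::u16string_view text, int unitWidth) {
    std::vector<int> positions;
    positions.reserve(text.size());
    int edge = 0;
    for (char16_t unit : text) {
        if (unit != kZeroWidthSpace) {
            edge += unitWidth;
        }
        positions.push_back(edge);
    }
    return positions;
}

// Measures the padded string but reports only the entries that belong to the
// original text, which is the role of `fit = tbuf.tlen` in the experiment.
std::vector<int> measureOriginal(std::u16string_view text, int unitWidth) {
    const std::u16string padded = padForGdi(text);
    std::vector<int> positions = measurePositions(padded, unitWidth);
    positions.resize(text.size());
    return positions;
}
```

Because the appended character has no width, the last position of the padded string equals the last position of the original text, so callers that only look at the first `tlen` entries see no difference.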

I don't know whether such a solution would be accepted by Scintilla, but perhaps there is someone who wants to try it this way too ...

@AgostinoSturaro

AgostinoSturaro commented Oct 11, 2020

Here's another way to reproduce this, from #3747 originally reported with #813

Open a new Notepad++ file, set the encoding to UTF-8 and paste these symbols (Double Arrow Unicode characters) on the first empty line
⇐⇑⇒⇓⇔⇕⇖⇗⇘⇙
Position the cursor before the last two characters and enter a newline, like this
⇐⇑⇒⇓⇔⇕⇖⇗
⇘⇙
The last two characters should turn into blocks.
notepad unicode character corruptuion

This comment has a nice video showing the issue
#5513 (comment)

@donho
Member

donho commented Oct 12, 2020

Here's the solution:
#5513 (comment)

@sasumner
Contributor

sasumner commented Jun 8, 2021

Actually I just downloaded v8, it is still not fixed?? Why?

Maybe be more specific about what is not fixed.
There's a lot going on in this issue thread to be able to pick out what you mean.

@sasumner
Contributor

sasumner commented Jun 8, 2021

image

@sasumner
Contributor

sasumner commented Jun 8, 2021

This is using DirectWrite?

Actually, DirectWrite enabled / disabled has same effect (shown in previous screenshot).
Tested in fresh portable extract of 8.0. No other changes to a default setup.

@mere-human
Contributor

Could it depend on system/application font?

@AgostinoSturaro

@ValZapod Would you suggest reopening this defect and closing the new ones, or something else?

@Uhf7
Contributor

Uhf7 commented Aug 19, 2021

Pong!

Seems to be a complicated issue. We should try to split up some different problems into different groups:

  1. There is an effect where characters are sometimes displayed and sometimes not, depending on the preceding character. This is what this issue describes. The most promising hack for this effect, in my opinion, is this: Unicode characters inconsistently cannot be displayed in Notepad++ #3747 (comment).
  2. There are characters that Windows displays only if the DirectWrite API is used. Here, the DirectWrite option Display of UNICODE characters is inconsistent #5513 (comment) may help.
  3. Encoding issues, where invalid UTF-8 characters get into the text buffer and hence are displayed incorrectly. Such an issue has been fixed by Fix UTF-16 decoding/encoding for code points above U+0FFFF. #9599. But this does not guarantee that all correct UTF-8 characters can be displayed by Windows.
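
As background for the third group: code points above U+FFFF do not fit into one UTF-16 unit and have to be split into a surrogate pair. The following is a minimal sketch of that standard encoding step. It is not the code from #9599, and `encodeUtf16` is a name chosen for this illustration; returning U+FFFD for invalid input is also just one possible policy.

```cpp
#include <string>

// Encodes one Unicode code point as UTF-16.
// Invalid input (a lone surrogate value or anything above U+10FFFF) is
// replaced by U+FFFD REPLACEMENT CHARACTER in this sketch.
std::u16string encodeUtf16(char32_t codePoint) {
    const bool isSurrogate = codePoint >= 0xD800 && codePoint <= 0xDFFF;
    if (isSurrogate || codePoint > 0x10FFFF) {
        return std::u16string(1, u'\uFFFD');
    }
    if (codePoint <= 0xFFFF) {
        // Basic Multilingual Plane: one UTF-16 unit is enough.
        return std::u16string(1, static_cast<char16_t>(codePoint));
    }
    // Supplementary planes: subtract 0x10000 and split the remaining 20 bits
    // into a high (lead) and a low (trail) surrogate.
    const char32_t offset = codePoint - 0x10000;
    std::u16string units;
    units.push_back(static_cast<char16_t>(0xD800 + (offset >> 10)));
    units.push_back(static_cast<char16_t>(0xDC00 + (offset & 0x3FF)));
    return units;
}
```

All characters in the test line of this issue are below U+FFFF, so each of them takes a single UTF-16 unit; the surrogate path only matters for characters such as emoji.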

@Uhf7
Contributor

Uhf7 commented Aug 19, 2021

Indeed, but if that is a bug in WINAPI, it is not a hack. And also this hack must be applied in scintilla, IMHO.

It is a hack, and it has to be applied in Scintilla. I call it a hack because I don't know exactly why it works. But it is not too bad a hack, because it respects all the requirements the Windows API has at this point, which means we do nothing that is forbidden by the Windows API. Actually, we would expect the Windows API to work without this hack, but it doesn't.

What does it mean by Windows? Not by Linux?

Linux has nothing to do with it here. The problem is that all characters we see displayed by Notepad++ are displayed by Windows API function calls. If the DirectWrite option in Notepad++ is disabled, Notepad++ uses the ancient Windows GDI function ExtTextOutW to display the character(s). If the DirectWrite option in Notepad++ is enabled, Notepad++ uses the brand-new and super-fast DirectWrite text output function to display the character(s). Since these are two different graphics APIs, the results differ.

What are such examples with hack applied?

I cannot try this out at the moment, but as far as I remember from testing, some should work. Please try.

@Uhf7
Contributor

Uhf7 commented Aug 19, 2021

Can we report this bug in windows?

First thought: I feel overwhelmed by this. Fighting the ministry of truth??? No way.

Second thought: The problem is so specific, that no one working at the 1st level of telephone or email support for Microsoft will grasp it. So the only advice I expect from there is something like "Please reboot your computer to see if it's gone" or similar. No hope, unless you know someone inside the system ...


12 participants