Files with only CR are shown as separated lines, but not line-sorted correctly #7735

Open
bersbersbers opened this issue Dec 14, 2019 · 19 comments


@bersbersbers

Description of the Issue

Sort the attached file (Line Operations, Sort). The file consists of three lines (A, C, B), each terminated by only a carriage return (CR). In hex, the file looks like this:

41 0D 43 0D 42 0D

This is what it looks like in Notepad++:
[screenshot]

So the file is displayed as three separate "lines", so line-sorting should be expected to sort them. However, line-sorting does not sort the "lines".

Expected Behavior

Lines are sorted.

Actual Behavior

They are not sorted.

Debug Information

Notepad++ v7.8.1 (64-bit)
Build time : Oct 27 2019 - 22:57:19
Path : C:\Program Files\Notepad++\notepad++.exe
Admin mode : OFF
Local Conf mode : OFF
OS Name : Windows 10 Pro (64-bit)
OS Version : 1909
OS Build : 18363.535
Plugins : DSpellCheck.dll mimeTools.dll NppConverter.dll NppExport.dll

Maybe related: #4736

@xylographe
Contributor

The status bar reports "Unix (LF)" ?

@bersbersbers
Author

@xylographe is that a question? Yes, the status bar reports "Unix (LF)".

The file is now attached by the way:
bug.txt

@sasumner
Contributor

@bersbersbers :

To @xylographe 's point, you are defining a "line" to be some text that ends with a LF (per your "choice" as shown in the status bar), and then you are creating some data which doesn't follow the line definition and are expecting it to be sorted correctly?

Suggest that you unify your example, so that line-endings used in the data match the line-ending type as shown on the status bar, and then see if sorting problems persist...

@bersbersbers
Author

@sasumner I should add that I am not defining anything: "Unix (LF)" was not my own choice; it was Notepad++'s choice, based on a slightly more complete example file that I am uploading now.
The file bug2.txt contains:

D(LF)
A(CR)
C(CR)
B(CR)

A real-world example would be captured terminal output from a Unix process that reports progress:

Download started(LF)
0%(CR)
1%(CR)
2%(CR)

This file is auto-detected as "Unix (LF)". Still, A, C and B are shown as separate lines with individual line numbers. Based on this display, I would expect Notepad++ to sort these lines. However, it does not: the sorted file is A-C-B-D.
bug2.txt
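The A-C-B-D result can be reproduced with a simplified model of a sort that splits only on the document's declared line ending. This is a sketch and not Notepad++'s actual code: `splitOn` and `sortByDeclaredEol` are hypothetical names, and the model assumes the sort recognizes LF alone for a "Unix (LF)" document.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: split text on one fixed delimiter, dropping the delimiter.
std::vector<std::string> splitOn(const std::string& text, const std::string& eol)
{
    assert(!eol.empty());  // an empty delimiter would never advance
    std::vector<std::string> parts;
    std::size_t start = 0;
    while (true)
    {
        const std::size_t end = text.find(eol, start);
        if (end == std::string::npos)
        {
            parts.push_back(text.substr(start));  // remainder after the last delimiter
            break;
        }
        parts.push_back(text.substr(start, end - start));
        start = end + eol.size();
    }
    return parts;
}

// Simplified model (an assumption): sort only the chunks separated by the declared EOL.
std::string sortByDeclaredEol(const std::string& text, const std::string& eol)
{
    std::vector<std::string> lines = splitOn(text, eol);
    std::sort(lines.begin(), lines.end());
    std::string result;
    for (std::size_t i = 0; i < lines.size(); ++i)
    {
        if (i > 0) result += eol;
        result += lines[i];
    }
    return result;
}
```

With bug2.txt's content `"D\nA\rC\rB\r"` and `"\n"` as the declared ending, the model sees only two chunks, `"D"` and `"A\rC\rB\r"`, so A, C and B move as one block in front of D and keep their internal order.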

@sasumner
Contributor

sasumner commented Jun 9, 2020

A similar issue appeared on the Live Support channel:

sjzoppi Jun 08 13:36
Hi all - Anyone else running into problems with the 64-bit version doing line-level sorting?

sasumner Jun 08 13:41
I guess you'd have to be more specific on the type of problems before anyone can comment

sjzoppi Jun 08 13:41
Sorting simply does nothing.
a collection of lines when sorting "lexicographically" (or any other way for that matter) doesn't happen.
It Does happen on the 32-bit version.

sasumner Jun 08 13:47
I guess I couldn't say; it seems to sort fine for me, so I suppose someone else might have to chime in

sjzoppi Jun 08 13:48
Thanks for responding.

sjzoppi Jun 08 14:07
I figured out the problem and it's probably not "technically" a 'bug' ... It has to do with the line-ending setting ... If a 'cut/paste' was done from a Unix(LF) file into a Windows(CR|LF) file, the line endings LOOK 'fine' (NPP recognizes the LF and dutifully numbers the lines). But because they are lacking the requisite "CR" before the "LF", the sort function won't recognize them as actual lines ... So this is a bit of a cognitive difference between what the EDITOR window recognizes as a "line" and what the sort function does.

sasumner Jun 08 14:10
A paste should unify line-endings into the target document's line-ending type.
I'm surprised 32-bit and 64-bit would be different with sorting (with the new info on line-endings)

sjzoppi Jun 08 14:16
Yes - I'll try to reproduce the exact steps ... because I'm doing 6 things at once and don't want to err on the order in which I did them. in the meantime - regardless - it's clear that what appears to be a 'line' to the editor window isn't the same to the sort function. You can force the symptom by putting a bunch of semi-colon delimited data into a Windows(CR|LF) window and, using the replace function / with extended expression mode on, replace ";" with "\n" ... This will cause the editor to break up the "new lines" and number them ... but ONLY those lines with CRLF endings will be recognized as "lines" to the sort function.

sasumner Jun 08 14:24
Yes that's true, it is a way to get mixed line-endings in your file. But you had said you pasted and a regular paste shouldn't cause that situation, so I was wondering about that.

sjzoppi Jun 08 14:26
yes - I cannot 100% vouch for the order in which I'm working because "multitasking" wins over "precision" right now ... but that's the story and it's debatable whether or not "sort" should respect "LF" or "CRLF" exclusively - but now that I've figured it out - I'll leave the philosophical questions to the developers.
And give "props" - I love NPP - and the plugins community also rocks....
So thank you for responding!

sasumner Jun 08 14:27
Notepad++ 's sort routines use the current-file's line-ending type to chop the data up for sorting. That's why it does what it does.

sjzoppi Jun 08 14:28
yes - that was my conclusion... but it's a slightly disconnected experience when the editor window (regardless of the line-ending type) STILL breaks things up into "lines" even if the ACTUAL line-endings are not consistent.

sasumner Jun 08 14:28
Maybe it should first convert any "foreign" line-endings to the current file's line-ending type

sjzoppi Jun 08 14:29
I'm not sure whether or not it should - Or whether it should just WARN you - There are times when I prefer to have intermixed line-endings in the file.
for me - a better experience would just be a warning that there are a mix of line-endings in the file...

sasumner Jun 08 14:31
Yea...it's actually difficult to get into that situation (aside from the regex thing you mentioned).

sjzoppi Jun 08 14:31
I understand - but it's difficult to get into a LOT of situations I seem to find myself in - but manage to find myself there regardless. :-)

@blackboxlogic

I hit this problem. Anything shown as multiple lines should be treated as multiple lines by the line operations. In this case, if LF is the selected line-terminating character, then either A) the entire file should appear on one line because it has no line endings, or B) line operations should treat them as individual lines.
Another option would be to warn the user when they perform a line operation that the results may be unexpected because their file has mismatched line terminating chars, and how to fix it.

@Ekopalypse
Contributor

So it is assumed/expected that something like this is done, correct?

[screenshot]

@sasumner
Contributor

sasumner commented Aug 6, 2020

@Ekopalypse

Seems correct to me (presuming the middle part of your screenshot is the result).
It would be better if your starting point (left part of screenshot) had the B lines in a different order, though. :-)
But...I believe that identical lines but with different endings should sort as LF first, then CR, then CRLF.
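One way to get that tie-break is to compare line content first and fall back to the ending only when the content is identical. The comparator below is a rough sketch of that ordering; the names `endingLength`, `endingRank` and `lineLess` are made up for illustration, and ranking a line with no ending before all others is an extra assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Length of the line ending at the end of a line: 2 for CRLF, 1 for CR or LF, 0 for none.
std::size_t endingLength(const std::string& line)
{
    const std::size_t n = line.size();
    if (n >= 2 && line[n - 2] == '\r' && line[n - 1] == '\n') return 2;
    if (n >= 1 && (line[n - 1] == '\r' || line[n - 1] == '\n')) return 1;
    return 0;
}

// Tie-break rank: no ending (0, an assumption), LF (1), CR (2), CRLF (3).
int endingRank(const std::string& line)
{
    const std::size_t length = endingLength(line);
    if (length == 2) return 3;
    if (length == 1) return line.back() == '\n' ? 1 : 2;
    return 0;
}

// Strict weak ordering: content decides first, the ending only breaks ties.
bool lineLess(const std::string& a, const std::string& b)
{
    const std::size_t aContent = a.size() - endingLength(a);
    const std::size_t bContent = b.size() - endingLength(b);
    const int cmp = a.compare(0, aContent, b, 0, bContent);
    if (cmp != 0) return cmp < 0;
    return endingRank(a) < endingRank(b);
}
```

Passing `lineLess` as the third argument to `std::sort` keeps control characters inside a line (a tab, for example) from being compared against the CR or LF of a shorter line, which a plain comparison of whole lines including their endings would do.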

@bersbersbers
Author

@Ekopalypse I agree. I don't care about the types of line endings, but I find your choice reasonable.

@sasumner
Contributor

sasumner commented Oct 2, 2020

A few more users confused while sorting when line-endings are not 100% consistent in a file:
https://community.notepad-plus-plus.org/topic/16313/sort-lines-lexicographically-did-not-work

@ndemou

ndemou commented Apr 22, 2022

In my opinion the bug here is that notepad++ is not making the fact that a file has mixed line-endings apparent to the user as soon as possible. It is a bug worth tackling because such files are tricky both for notepad++ (as this issue and others show) and for any other program that may consume the output of notepad++.

I certainly do not expect notepad++ to always scan the entire text in order to check for mixed line-endings. That's a high price to pay when opening big files for a mostly rare reward. But notepad++ should let the user notice that he is "in dangerous waters" as soon as code is in a position to detect the issue (e.g. during rendering of text on screen).

As for the UI, the best fix for me is this: only render a line break for the EOL byte sequences that Notepad++ expects as valid (e.g. when Notepad++ displays Windows (CR LF) on the status bar, only [CR][LF] should produce a line break on screen). "Invalid" EOL bytes should instead be rendered with the special block characters that appear when you select View > show symbols > show end of lines ([CR], [LF]).

A second best would be to display a big red warning (e.g. "Mixed line-endings detected") on the status bar (probably in the same place where the EOL type is shown now). Clicking the warning could take you to the Wikipedia article on line endings.
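Detecting the condition behind such a warning is cheap for any buffer that is already in memory. A minimal sketch, assuming a plain byte buffer; `EolCounts` and `countLineEndings` are hypothetical names and not part of Notepad++ or Scintilla:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Tally of each line-ending kind found in a buffer.
struct EolCounts
{
    std::size_t crlf = 0;
    std::size_t cr = 0;
    std::size_t lf = 0;

    // Mixed means at least two different kinds of ending are present.
    bool isMixed() const
    {
        return (crlf > 0) + (cr > 0) + (lf > 0) > 1;
    }
};

// Single pass over the text; a CR directly followed by LF counts once, as CRLF.
EolCounts countLineEndings(const std::string& text)
{
    EolCounts counts;
    for (std::size_t i = 0; i < text.size(); ++i)
    {
        if (text[i] == '\r')
        {
            if (i + 1 < text.size() && text[i + 1] == '\n')
            {
                ++counts.crlf;
                ++i;  // skip the LF that belongs to this CRLF
            }
            else
            {
                ++counts.cr;
            }
        }
        else if (text[i] == '\n')
        {
            ++counts.lf;
        }
    }
    return counts;
}
```

Because the scan is one linear pass, it could be limited to the text currently being rendered or to the range about to be sorted, which avoids the cost of scanning a whole large file on open.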


P.S.
I have stumbled on this issue more than once in the last several months. I guess it happens so much because I happen to work on Linux, Windows and WSL and that I have been using notepad++ more and more in places where I used to use vim.

P.P.S.
Thanks anyway for the great software and feel free to ignore my requests without a patch :-)

@alankilborn
Contributor

> I do not understand where the problem is in adding an option to let the user choose the behavior of Notepad++ when sorting lines in a file with mixed line endings.

I don't think letting the user choose a behavior is needed. What is there to choose between?

The program should either sort lines with mixed line-endings, or disallow it entirely.

@alankilborn
Contributor

alankilborn commented Jun 28, 2022

If the program is allowed to sort lines with mixed endings, perhaps the stringSplit function call from HERE could be removed, and a call to a new function inserted; the new function being something like this:

// Splits input at every CR, LF, or CR+LF, keeping each line ending attached to its line.
std::vector<generic_string> stringSplitOnLineEndingsKeepingDelimiters(const generic_string& input, const generic_string& defaultLineEnding)
{
	std::vector<generic_string> output;
	size_t start = 0U;
	const generic_string crLf = TEXT("\r\n");
	while (true)
	{
		size_t end = input.find_first_of(crLf, start);  // find CR or LF, whichever occurs first
		if (end == generic_string::npos)
		{
			// the last line has content but no line ending: give it the default one
			if (start < input.size())
			{
				output.push_back(input.substr(start) + defaultLineEnding);
			}
			break;
		}
		size_t lineEndingLength = 1;
		// a CR immediately followed by LF is one two-character ending;
		// checking only the next character avoids rescanning the rest of the input
		if (input[end] == TEXT('\r') && end + 1 < input.size() && input[end + 1] == TEXT('\n'))
		{
			lineEndingLength = 2;
		}
		output.push_back(input.substr(start, end + lineEndingLength - start));
		start = end + lineEndingLength;
	}
	return output;
}

My C++ is a little rusty, plus I don't know if this is wanted, so I'm not doing a PR, just making a suggestion.
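For anyone who wants to try the idea outside the Notepad++ code base, here is a standalone sketch of the same splitting approach feeding a sort. It uses `std::string` instead of `generic_string` so that it compiles on its own; `splitKeepingEndings` and `sortMixedLines` are illustrative names, not existing Notepad++ functions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Split at every CR, LF, or CR+LF, keeping each ending attached to its line.
std::vector<std::string> splitKeepingEndings(const std::string& input, const std::string& defaultEol)
{
    assert(!defaultEol.empty());  // needed to terminate an unterminated last line
    std::vector<std::string> lines;
    std::size_t start = 0;
    while (start < input.size())
    {
        const std::size_t end = input.find_first_of("\r\n", start);
        if (end == std::string::npos)
        {
            // last line has content but no ending: give it the default one
            lines.push_back(input.substr(start) + defaultEol);
            break;
        }
        std::size_t eolLength = 1;
        if (input[end] == '\r' && end + 1 < input.size() && input[end + 1] == '\n')
        {
            eolLength = 2;  // CR+LF is a single ending
        }
        lines.push_back(input.substr(start, end + eolLength - start));
        start = end + eolLength;
    }
    return lines;
}

// Sort the lines and glue them back together; every line carries its own ending.
std::string sortMixedLines(const std::string& input, const std::string& defaultEol)
{
    std::vector<std::string> lines = splitKeepingEndings(input, defaultEol);
    std::sort(lines.begin(), lines.end());
    std::string result;
    for (const std::string& line : lines)
    {
        result += line;
    }
    return result;
}
```

Applied to bug2.txt's content `"D\nA\rC\rB\r"`, this yields A, B, C, D with each line keeping its original ending. Note that the default `std::sort` comparison here includes the ending characters, so a real implementation would probably want to compare content and endings separately.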

@alankilborn
Contributor

BTW, there was recent discussion of this on the Community HERE.

@ndemou

ndemou commented Jun 29, 2022

> The program should either sort lines with mixed line-endings, or disallow it entirely.

Or just warn the user and let him clean up the mess. (99% of the time leaving a file with mixed line endings will bite you down the road anyway)

@alankilborn
Contributor

> Or just warn the user and let him clean up the mess.

It appears you said this already back here: #7735 (comment)

I kind of like your idea of showing really obviously the line-endings that don't match what Notepad++ thinks is the line-ending type of the entire file.

Maybe it isn't a "mess"; maybe mixed line-endings are sometimes (OK, rarely, but not never) intentional.

@alankilborn
Contributor

Maybe the simple solution to this issue is to "unify" the line-endings, to the declared line-endings of the file shown in the status bar, BEFORE running the sort. Sure, this changes more of the file content than simply reordering the lines, but then this "issue" would be solved. If "Paste" can unify line-endings (and it currently does when run from the Edit menu; see #9260 ), then why shouldn't the sorting commands do the same?

Also, if the last "line" of the file is included in the sort, and that line has zero length and thus no line-ending, it shouldn't be included in the sort. If the last "line" has content but no line-ending, it should be given a line-ending before sorting.
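A rough sketch of that pre-sort step, assuming the text is available as a plain string and the declared ending is passed in; `unifyLineEndings` is a hypothetical name, not an existing Notepad++ function:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Rewrite every CR, LF, or CR+LF as the declared ending, and terminate a
// final line that has content but no ending.
std::string unifyLineEndings(const std::string& text, const std::string& eol)
{
    assert(!eol.empty());  // the declared ending must be CR, LF, or CR+LF
    std::string result;
    result.reserve(text.size());
    for (std::size_t i = 0; i < text.size(); ++i)
    {
        if (text[i] == '\r')
        {
            if (i + 1 < text.size() && text[i + 1] == '\n')
            {
                ++i;  // consume the LF of a CR+LF pair
            }
            result += eol;
        }
        else if (text[i] == '\n')
        {
            result += eol;
        }
        else
        {
            result += text[i];
        }
    }
    // A last line with content but no ending gets one, so it can be sorted
    // like any other line; an empty last line adds nothing.
    if (!text.empty() && text.back() != '\r' && text.back() != '\n')
    {
        result += eol;
    }
    return result;
}
```

After this step every line ends with the declared ending, so a sort that splits only on that ending sees the same lines the editor displays, and a zero-length last "line" never shows up as an item to sort.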

@sunk818

sunk818 commented Apr 9, 2023

I like the unify line endings idea but want to be prompted before the change is made.
