Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

segfault with read -rd $'\200' in multibyte locales #590

Closed
stephane-chazelas opened this issue Dec 9, 2022 · 2 comments
Closed

segfault with read -rd $'\200' in multibyte locales #590

stephane-chazelas opened this issue Dec 9, 2022 · 2 comments
Labels
bug Something is not working

Comments

@stephane-chazelas
Copy link

% locale charmap
UTF-8
% ksh93u+m
$ read -rd $'\200'
zsh: segmentation fault  ksh93u+m

Same in locales using GB18030, OK in single-byte locales.

0x00005555555defbd in sh_readline (names=0x5555556f8cd0, fd=0, flags=-255, size=0, timeout=0) at src/cmd/ksh93/bltins/read.c:312
312                             sh.ifstable[delim] = S_NL;
(gdb) p delim
$4 = 8388607

In GB18030, If I use a delimiter < 0x80, I find that such a byte value when occurring inside a character is not taken as delimiter, which is fine but the fact that the delimiter is not found is not reflected in the exit status (which may or may not be a separate bug):

$ LC_ALL=zh_CN.gb18030 SHELL=ksh luit
$ echo ╳foobar | read -rd w a
$ echo "$?"
0
$ printf %s "$a" | hexdump -C
00000000  a8 77 66 6f 6f 62 61 72  0a 73 66 33 6d 2e 33 61  |.wfoobar.sf3m.3a|
00000010  69                                                |i|
00000011

(see also the extra garbage at the end).

@McDutchie McDutchie added the bug Something is not working label Dec 9, 2022
@McDutchie
Copy link

Stack trace:

$ arch/*/bin/ksh -c 'read -rd $'\''\200'\'                                                                                         
AddressSanitizer:DEADLYSIGNAL
=================================================================
==66686==ERROR: AddressSanitizer: BUS on unknown address (pc 0x00010430345c bp 0x00016bb12580 sp 0x00016bb12000 T0)
==66686==The signal is caused by a UNKNOWN memory access.
==66686==Hint: this fault was caused by a dereference of a high value address (see register values below).  Disassemble the provided pc to learn which register was used.
    #0 0x10430345c in sh_readline read.c:316
    #1 0x1043027ec in b_read read.c:191
    #2 0x1043de404 in sh_exec xec.c:1270
    #3 0x10431c300 in exfile main.c:607
    #4 0x104319fb8 in sh_main main.c:370
    #5 0x1042ef66c in main pmain.c:42
    #6 0x10498d088 in start+0x204 (dyld:arm64e+0x5088)

==66686==Register values:
 x[0] = 0x0000000106d001f0   x[1] = 0x0000000106d001f3   x[2] = 0x0000000000000006   x[3] = 0x00000001045c86a8  
 x[4] = 0x00000001045c8640   x[5] = 0x0000000000000008   x[6] = 0x0000000000000001   x[7] = 0x0000000000000001  
 x[8] = 0x0000000104dc85e7   x[9] = 0x0000000000000003  x[10] = 0x0000000000000000  x[11] = 0x0000000000000000  
x[12] = 0x0000000000000000  x[13] = 0xffffff8fdf726f3f  x[14] = 0x00000001045bb160  x[15] = 0x0000000107602f28  
x[16] = 0x000000019b305b1c  x[17] = 0x0000000104df85e0  x[18] = 0x0000000000000000  x[19] = 0x0000007000020000  
x[20] = 0x0000000107f03128  x[21] = 0x00000000ffffff01  x[22] = 0x00000001045c7f80  x[23] = 0x000000702d782444  
x[24] = 0x00000000007fffff  x[25] = 0x0000000000020200  x[26] = 0x0000000000000000  x[27] = 0x0000000000000000  
x[28] = 0x0000000107a03680     fp = 0x000000016bb12580     lr = 0x00000001043033e4     sp = 0x000000016bb12000  
AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: BUS read.c:316 in sh_readline
==66686==ABORTING
Abort

@McDutchie
Copy link

McDutchie commented Mar 24, 2023

The root cause of the crash is the mbchar() call here:

case 'd':
if(opt_info.arg && *opt_info.arg!='\n')
{
char *cp = opt_info.arg;
flags &= ((1<<D_FLAG+1)-1);
flags |= (mbchar(cp)<<D_FLAG+1) | (1<<D_FLAG);
}
break;

The rest of the delimiter handling code has no knowledge of multibyte characters at all. The mbchar() call parses a multibyte character, yielding -1 for the $'\200' because that's not a valid codepoint. That -1 value then goes unchecked, wreaking havoc.

The mbchar() change was introduced in 2006-12-22a ksh93s (as per the ksh93-history repo). It's the umpteenth example of a half-assed change they simply seem to have forgotten about.

Until read -d can be fixed to use multi-byte separators properly (which is going to require a redesign of that code, by the looks of it), we need to revert that change first. But even the original code can cause a negative value because opt_info.arg is of type char*. A typecast to unsigned char* fixes that.

diff --git a/src/cmd/ksh93/bltins/read.c b/src/cmd/ksh93/bltins/read.c
index 6cdb317d9..08bf52e0d 100644
--- a/src/cmd/ksh93/bltins/read.c
+++ b/src/cmd/ksh93/bltins/read.c
@@ -101,9 +101,8 @@ int	b_read(int argc,char *argv[], Shbltin_t *context)
 	    case 'd':
 		if(opt_info.arg && *opt_info.arg!='\n')
 		{
-			char *cp = opt_info.arg;
 			flags &= ((1<<D_FLAG+1)-1);
-			flags |= (mbchar(cp)<<D_FLAG+1) | (1<<D_FLAG);
+			flags |= ((*(unsigned char*)opt_info.arg)<<D_FLAG+1) | (1<<D_FLAG);
 		}
 		break;
 	    case 'p':

McDutchie added a commit that referenced this issue Mar 25, 2023
@stephane-chazelas wrote:
> % locale charmap
> UTF-8
> % ksh93u+m
> $ read -rd $'\200'
> zsh: segmentation fault  ksh93u+m
>
> Same in locales using GB18030, OK in single-byte locales.
>
> 0x00005555555defbd in sh_readline (names=0x5555556f8cd0, fd=0,
> flags=-255, size=0, timeout=0) at src/cmd/ksh93/bltins/read.c:312
> 312                             sh.ifstable[delim] = S_NL;
> (gdb) p delim
> $4 = 8388607

So the problem is that delim, the delimiter character, gets a
pathological value. This occurs here in read.c:

101:	    case 'd':
102:		if(opt_info.arg && *opt_info.arg!='\n')
103:		{
104:			char *cp = opt_info.arg;
105:			flags &= ((1<<D_FLAG+1)-1);
106:			flags |= (mbchar(cp)<<D_FLAG+1) | (1<<D_FLAG);
107:		}
108:		break;

The culprit is hte mbchar() macro expansion on line 106. When an
invalid multibyte character is passed, mbchar() yields -1, causing
wreckage with the bit shifts that unify the delimiter with some bit
flags in the 'flags' variable that is passed on to sh_readline().

But multibyte character delimiters have never worked on ksh93, so
that mbchar() call can be removed for now. It was introduced in
2006-12-22a ksh93s (as per the ksh93-history repo), presumably with
a view to properly supporting multibyte delimiters, but so far that
has not happened.

<future>At some point, we'll want to change the design to store the
delimiter in its own variable instead of awkwardly unifying it with
flags to save a couple of bytes. At that point it should not be too
hard to update the code to support multibyte delimiters.</future>

src/cmd/ksh93/bltins/read.c: b_read(): case 'd':
- Instead of using mbchar(), directly read the character, with a
  typecast to unsigned char* to avoid negative values being used
  for the bit shift operations.

Resolves: #590
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something is not working
Projects
None yet
Development

No branches or pull requests

2 participants