escapeshellarg() strips off bytes which are not valid characters (leading to file not found errors) #3052
Comments
Using the top comment in the documentation for inspiration: what are your current locale settings, and have you tried changing them to something UTF-8-capable?
Actually, I'm just going to skip ahead and move this.
Naturally, the docs should mention this.
@damianwadley My locale was already set to UTF-8 and also calling
@ohyeaah 0x80 isn't valid UTF-8, so it was getting stripped off. I'm not sure there's a proper way of handling 0x80 in the same way your example demonstrates. I'd expect that in a real-world situation you'd have an actual character there, not literally just that byte - after all, escapeshellarg() is for unknown inputs, and I can't imagine you'd want your unknown input to be dealing with file names consisting of unknown byte ranges. After some testing, I think your solution should be to (install and) use an appropriate locale for your byte sequence - one of the ISO 8859s, for instance. In other words, use a locale with the same character encoding you used when constructing the filename.
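The point about 0x80 and the locale suggestion can be sketched roughly as follows. This is only an illustration: the helper name is_valid_utf8(), the filename, and the locale names are assumptions, and the locale has to be installed on the system for setlocale() to accept it. The output of escapeshellarg() is printed as hex because it depends on the active locale.

```php
<?php
// Hypothetical helper: PCRE's /u modifier makes preg_match() fail on
// subjects that are not well-formed UTF-8, so it doubles as a validity check.
function is_valid_utf8(string $bytes): bool
{
    return preg_match('//u', $bytes) === 1;
}

$name = "file\x80.txt"; // assumed example: a lone 0x80 byte in the name

var_dump(is_valid_utf8($name));              // a stray continuation byte is not valid UTF-8
var_dump(is_valid_utf8("file\u{20AC}.txt")); // the euro sign encoded as UTF-8 is valid

// Remember the current LC_CTYPE setting ("0" only queries it).
$previous = setlocale(LC_CTYPE, '0');

// Assumed locale names; setlocale() tries each and returns false if none exists.
if (setlocale(LC_CTYPE, ['en_US.ISO-8859-1', 'en_US.iso88591', 'en_US']) !== false) {
    echo bin2hex(escapeshellarg($name)), "\n";
    setlocale(LC_CTYPE, $previous); // restore the original setting
} else {
    echo "No ISO 8859-1 locale installed\n";
}
```

If an `80` byte shows up in the hex dump, the byte survived under the single-byte locale.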
@damianwadley Well, I now understand how escapeshellarg() works and that it depends on the current locale. But the point of my question was a different one: WHY does escapeshellarg() strip off bytes that are not part of the current locale at all? Shouldn't escapeshellarg() also include characters that are NOT part of the current encoding? For example, ls on Linux also includes characters that are not part of the current encoding. Many people use escapeshellarg() on files that come from the internet (like me). They don't know their encoding, nor are they aware that escapeshellarg() strips off characters which are not part of UTF-8 (which is the default encoding for most users).
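For anyone who needs the bytes preserved regardless of locale, one commonly used workaround is to single-quote the argument by hand. The sketch below is not something proposed in this thread, and quote_posix_arg() is a hypothetical helper, not a PHP function. It assumes a POSIX shell (so it is not suitable for Windows) and makes no claim about how the invoked program interprets the bytes.

```php
<?php
// Hypothetical helper, not part of PHP: quotes an argument for a POSIX shell
// without consulting the locale, so bytes >= 0x80 pass through unchanged.
function quote_posix_arg(string $arg): string
{
    // A NUL byte cannot be carried in a command-line argument, so reject it.
    if (strpos($arg, "\0") !== false) {
        throw new InvalidArgumentException('Argument must not contain NUL bytes');
    }
    // Inside single quotes a POSIX shell takes every byte literally except the
    // single quote itself, which is written as: close quote, \', reopen quote.
    return "'" . str_replace("'", "'\\''", $arg) . "'";
}

$name = "file\x80.txt"; // assumed example filename containing a 0x80 byte
echo bin2hex(quote_posix_arg($name)), "\n"; // prints 2766696c65802e74787427
```

The trade-off is the one raised elsewhere in this thread: nothing is validated here, so any byte sequence other than NUL is handed to the shell as-is.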
I can't tell you exactly why it does that. I can tell you that it was added some 15 years ago back in the PHP 5.2 era with a note of "Properly address incomplete multibyte chars inside escapeshellcmd()". So it was done deliberately; by the sound of it, to avoid accidentally passing invalid multibyte sequences.
Because that was the decision at the time? I don't know, and I don't know if Ilia A. or Stefan E. are active here on GitHub (let alone remember the motivation behind the changes) so I don't think we'll be able to ask them. Had it been implemented today, I suspect an invalid string would produce an E_WARNING, but I don't see preserving the invalid bytes as being a viable solution given that:
Not if the requirement is to make sure it produces a valid and safe shell command/argument. For all I know, a byte sequence that violates the LC_CTYPE locale could create a vulnerability. And you know, I wouldn't even be that surprised if it did.
Sure, but all it has to worry about is outputting a filename. That's totally different from having to worry about escaping arbitrary strings for execution in shell commands.
Which is why this behavior needs to be documented, as it does with any other function that depends on the locale. Also, while the default character encoding is often UTF-8, the default server locale may or may not be. Locales are a very different beast. Also also, side note: allowing the user to dictate a filename on a server is highly unsafe. Always generate filenames - and especially extensions - yourself, and if you need to know the original name then store it safely in a database. Because really, there's no reason why the file's name on the server should be controlled by the user, so don't even give them the power to do that at all.
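A minimal sketch of that advice, assuming the extension has already been determined on the server side (for instance from content detection) rather than taken from the upload. The helper generate_storage_name() and the allowed list are hypothetical:

```php
<?php
// Hypothetical helper: builds a server-side filename that ignores the
// client-supplied name and only accepts extensions from a fixed allowlist.
function generate_storage_name(string $extension, array $allowed = ['jpg', 'png', 'pdf']): string
{
    $extension = strtolower($extension);
    if (!in_array($extension, $allowed, true)) {
        throw new InvalidArgumentException('Extension not allowed');
    }
    // 16 random bytes become 32 hex characters, all plain ASCII.
    return bin2hex(random_bytes(16)) . '.' . $extension;
}

$storedName = generate_storage_name('png');
echo $storedName, "\n";
```

Because the generated name is pure ASCII, it passes through escapeshellarg() unchanged under any locale; the original name can live in a database column next to it.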
Description
It seems escapeshellarg() strips off bytes which are >= 128, probably because they are not valid UTF-8 characters. Because it's not possible to provide an encoding to escapeshellarg() and this behaviour is not documented, I think it's a bug. Also, if this is expected behaviour, then which function could be used instead? There is no other function which could do this.
Please note that the code example runs on Linux only because Windows (probably) doesn't have the cat command. Also, I tested this on Linux only. Also note it's not necessary to close the php tag.
Resulted in this output:
But I expected this output instead:
PHP Version
PHP 8.2.7 (cli) (built: Jun 9 2023 19:37:27) (NTS)
Operating System
Debian GNU/Linux 12 (bookworm)