-
Notifications
You must be signed in to change notification settings - Fork 7.9k
handle non-NVT ASCII messages per the RFC's #2689
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This PR depends on PR #2674 being merged first. |
3dd32e4
to
86f146b
Compare
@weltling I saw you did a push for wide-byte handling... Is that related to this issue? Should I proceed on this or do you have something else in mind? I'm fine either way. |
@pprindeville oh, no, that's not related :) It's about Unicode APIs usage on Windows, started with 7.1 for the path handling and continues for other areas. It's indeed UTF-16 based, but only touching relevant parts doing it incrementally. I've just realized it were the case, so did it on the side. The only related commit was dda70e0 about not doing newline splits. The syslog RFC mentions UTF-8 only anyway, so that's 1 byte only. For the event log it could make sense, should see whether it would be integrated into the Windows implementation directly. So for this part, if I understood correctly, the idea was to also add the UTF-8 conversion? In theory it's doably through iconv or mbstring, but those should be available. Simplest were probably mbstring. In that case, one can rely on Zend multibyte API. But if mbstring is not loaded, IMO the basic case should be still processed. Probably in any case, the cleanups for Thanks. |
@weltling Sorry, just to be clear here, what are you saying? That mbstring is an optional part of PHP? And if it's present, then we do the conversion from mbstring's to utf-8 with the BOM prefix... and if it's not present, just allow NVT-ASCII (0x20-0x7e)? And is this a run-time or build-time determination? |
@pprindeville yeah, i was refering to #2674 (comment). The strict RFC compliance were UTF-8, whereby probably it were not to exclude someone would want to write in an arbitrary encoding. I didn't mean just allow NVT ASCII as a fallback, see below. Mbstring is a core extension, however there's no guarantee it is loaded in a particular PHP installation. The patch currently present does the following
Basically it's the remaining part of the first patch, where it seemed to be fine to proceed with. It passes an arbitrary string through and takes care about the known invalid pieces. Now, if there's additional UTF-8 or other encoding handling, it'd need some API and mbstring seems to fit for it. Then also where the @bukka's idea comes from i guess, if i understand correctly. If mbstring is loaded, the handlers in Zend/zend_multibyte.c will be worky. One could then go by the approach like this
Perhaps it could be also only UTF-8 instead of an arbitrary encoding, but i'm really not sure it wouldn't collide with the real life :) Fe what i've seen as described here #2674 (comment) Or perhaps, some special value to match the current default_charset or internal_encodnig would make sense, too. Technically the intention is doable, even with some limitation. Given there were no mbstring, the current patch would suffice at least for the basic handling. Lets see what you say and for the further comments. thanks. |
For reference, Python used to add a BOM for UTF-8 encoded messages, but later dropped the behavior, as it was causing too many issues. See http://bugs.python.org/issue14452 and https://hg.python.org/cpython/rev/af46a001d5ec. |
@weltling I don't know if we do need Otherwise, if @nikic I read through issue 14452 and it seemed to me that for 3.3 they eventually did end up doing proper RFC-5424 support. The "removing BOM insertion" was for 2.7 and 3.2 only. |
@pprindeville I checked the current code and a BOM is not prepended there: https://github.com/python/cpython/blob/master/Lib/logging/handlers.py#L932 The discussion in issue 14452 was just considering different options, what was implemented is a complete removal. |
@nikic It looks like they went with having it done in an external module: https://github.com/dpocock/python-rfc5424-logging-formatter although it's not clear where the That probably happens in the default logger class. |
@pprindeville it was discussed previously, that having ASCII only by default is a BC breach that doesn't seem justified. It still belongs to RFC 5424 as an option, that's true. Otherwise, default_charset empty means UTF-8 nowadays. To summarize, leaving aside the INI naming, etc.
The option 1.2 would include UTF-8 also without BOM. Perhaps doing encoding conversion different from default_charset or internal encoding might be too much of good. It seems that we still need to workout a path for this. It seems to me, that the Python issue was implementation specific, but nevertheless BOM is removed there http://bugs.python.org/msg158447. The involvement of mbstring were IMO good to validate what is written into the syslog, even if we don't really have to care. The two *big* choices at least - ASCII or not, might be a minimum, whereby the current behavior may still make sense to stay default. It would be still good to hear from @bukka what is exactly meant in the INI renaming as from the previous discussion :) Thanks. |
@weltling Just to clarify, the RFC 5424 does not prohibit control character either:
So there should still be an option to completely disable any modification which can have small perf benefit (we don't need to go through each characters and dynamically copy buffer). I didn't really mean using mbstring and doing any conversion which is not a good idea IMHO as it should be an app responsibility (we can't know which encoding is passed in anyway and automatic identification is error prone - read it doesn't work for all cases). What I meant is just to do simple validation and possibly ignore or substitute invalid UTF-8 characters (basically what we do in json). I think adding BOM should be probably also configurable. Btw. I plan to rewrite syslog as part of my fpmi ( just a development fork of fpm - https://github.com/bukka/fpmi/tree/next ) work on logging as current syslog implementation is not ideal for fpm and it's also not easily testable. I really want to add a support for syslog server (IP + port) setup to be able to listen on another port and get output in our tests. Just want to say that it's not good that we don't have any tests for these changes. |
@bukka for the validation, substitution, etc. an API is needed, that's how it leads to mbstring. There might be other encodings, too, with possible consequences. Primarily thinking about DBCS and SBCS based systems and EBCDIC, so my earlier point was as well the default behavior should not change. I've reread #2674 (comment) and it seems you still only talk about UTF-8. Still there are quite a few options and accompanying issues, so we need to work out a list on what should be done and how. I don't mind reassigning this PR to you, as you work on a wider change in this area and would coordinate with @pprindeville in regard to this particular part. You can ping me anytime for tests, etc. Till now all tests was done manually, some automated tests could be done but would have to be run under a root account. Thanks. |
@weltling There is no need for mbstring in UTF-8 validation (see json encoder for example) only for conversion which I actually didn't mean to do. What I thought initially is to just have some options for filtering. It means
Maybe The test will be possible without the root as it can be just another process listening on the selected port and IP which would be configured using |
@weltling @nikic @bukka Trying to decide what the right thing to do is and if it's worth opening another bug or not... Here's the scenario. Let's say that you have:
in your Problem is that you can't do this from the configuration... yet. You have to do:
or whatever to decide where you want to route it. But let's say that PHP is crashing before it executes the first line of code (but after it's parsed the What's the fix? (1) add a What would this fix? Well, if PHP dies early on and calls an exception handler (which ends up calling What do you all think? |
Well, it wasn't that much work to code up so I did: see PR #2701. |
@weltling Well, maybe not anymore. Look at PR #2701. We can now have syslog sent anywhere you want (well, at least to any log facility you want)... so it doesn't have to go into I couldn't find anything obvious but I've sent email to Travis support to ask about that. |
Is there going to be a 7.1.9 release? Or will all these syslog changes first appear in 7.2? |
These are feature additions and it is just for master which means 7.3. Of course this needs a full consensus before it is merged which is not the case atm. |
Fixing lines with embedded newlines is a feature? |
main/php_syslog.c
Outdated
@@ -68,11 +68,19 @@ PHPAPI void php_syslog(int priority, const char *format, ...) /* {{{ */ | |||
break; | |||
} | |||
|
|||
if (c != '\n') | |||
/* check for NVT ASCII only unless test disabled */ | |||
if (((0x20 <= c) && (c <= 0x7e)) || (!PG(syslog_ascii_only) && (c >= 0x80))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unless I miss something in the logic, I think that the (c >= 0x80)
should be dropped as it changes the current behavior which allows control characters by default. The issue is that RFC 5424 - section 6 actually allows any character for MSG-ANY so this should be followed by default IMHO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If syslog_ascii_only
is false, then any upper-half characters from any ISO-8859-x will be printed.
We talked about this quite a bit as I remember, and the consensus was that there wasn't a compelling reason for logging to include non-printing characters, and indeed doing so might present a vulnerability.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I remember that one now. I think we actually ended up without any consensus... :) Will think about this later.
I'm less concerned about this issue not being resolved for 7.3, but I'd definitely like to see PR #2701 be merged. |
I would really prefer to have an ini for more options. As I said initial something like |
Alpha 4 is in 5 days... |
main/main.c
Outdated
@@ -775,6 +798,7 @@ PHP_INI_BEGIN() | |||
#endif | |||
STD_PHP_INI_ENTRY("syslog.facility", "LOG_USER", PHP_INI_SYSTEM, OnSetFacility, syslog_facility, php_core_globals, core_globals) | |||
STD_PHP_INI_ENTRY("syslog.ident", "php", PHP_INI_SYSTEM, OnUpdateString, syslog_ident, php_core_globals, core_globals) | |||
STD_PHP_INI_ENTRY("syslog.filter", "no_ctrl", PHP_INI_ALL, OnSetLogFilter, syslog_filter, php_core_globals, core_globals) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't that be actually no-ctrl
(dash instead of underscore) - https://github.com/php/php-src/pull/2689/files#diff-1a9cfc6173e3a434387996e46086da56R343
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch.
main/php_syslog.c
Outdated
} else if ((c < 0x20) && (PG(syslog_filter) == PHP_SYSLOG_FILTER_NONE)) | ||
smart_string_appendc(&sbuf, c); | ||
else { | ||
smart_string_appendc(&sbuf, '#'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not probably a big issue but I think it might be more useful to show hex value prefixed with \x
- seems a bit more common but I might be wrong - not sure what other syslog clients do..?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rsyslog
seems to use #%03o
for formatting. syslog-ng
uses \\x%02x
. Seems to go either way... I have no preference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah cool. I see that you already converted that to hex which is great! Maybe we could use the same format as syslog-ng in that case (it's just adding extra %
and x
chars) so it's visible that those are the replaced characters.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see that you already converted that to hex which is great! Maybe we could use the same format as syslog-ng in that case (it's just adding extra % and x chars) so it's visible that those are the replaced characters.
Sorry, I meant that \\x%02x
is the literal printf-style format string. So DEL would format as \x7f
(which is also how it would appear as a literal in a C string). What would you like to see it format as?
@pprindeville There is still time as the future freeze was postponed on 2018-07-31 - see https://externals.io/message/102762 . |
All outstanding review questions addressed. |
Not sure about the AppVeyor failure. Seems to be unrelated. |
Yeah, that's a spurious failure. |
Well that's good. For me, anyway. Buys us some breathing room. |
Is this ready to be merged? It's too late for 7.3.0alpha4, but there are two weeks until 7.3.0beta1. |
I think so but it's not up to me. |
@bukka Is this PR ready to be merged into master? |
I think that the patch looks good - just one minor thing: #2689 (comment) . I would like to do a little bit of testing over the weekend and if all ok, I will merge it. @pprindeville please could you also squash it? |
Sorry, I thought I addressed this. What's missing? |
Ah it's cool then. You can just squash it and I will take a look over the weekend. |
Done! |
e266698
to
a416a66
Compare
Also change non-printable characters to hex from octal. Signed-off-by: Philip Prindeville <philipp@redfish-solutions.com>
Comment on behalf of bukka at php.net: Merged in 2010c02 |
I have merged it and also changed the name of the |
@weltling Unless I missed something in the patch, this changes are not for Windows. I don't have any way to test it or implement the Win support and not even sure if it makes any sense on Win. If so feel free to port it or add a comment to ini that this is not for Windows. Cheers. |
Sorry, why wouldn't they work with Windows? For now |
@bukka thanks for the ping. I don't think it is relevant for Windows. It would currently try to convert the message to UTF-16 using current charset settings and fallback to the current ANSI charset otherwise. Even if it would sometimes show some funny chars in the logs, fe if something like @pprindeville it is technically possible to log to another machine by passing something to Thanks. |
Correct handling of non-NVT ASCII is not unambiguously defined by RFC-3164 nor by RFC-5424.
It is recommended that Unicode strings be formatted into UTF-8 shortest notation (compressed format) and prefixed by the BOM.
However, nothing unambiguous is defined for other code sets.
This is particularly a concern for cases where PHP applications are being run on hosted virtual servers, Docker containers, etc. where the owner of the application has no control over where the logging is being collected nor how the logger is configured.
In those cases, the decision has to be made to be (1) as conservative as possible in selecting what's sent (i.e., NVT-ASCII per Postel's Law), or (2) UTF-8 compressed strings with a BOM prefix.
This addresses bug 75077.