Skip to content

handle non-NVT ASCII messages per the RFC's #2689

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

pprindeville
Copy link
Contributor

Correct handling of non-NVT ASCII is not unambiguously defined by RFC-3164 nor by RFC-5424.

It is recommended that Unicode strings be formatted into UTF-8 shortest notation (compressed format) and prefixed by the BOM.

However, nothing unambiguous is defined for other code sets.

This is particularly a concern for cases where PHP applications are being run on hosted virtual servers, Docker containers, etc. where the owner of the application has no control over where the logging is being collected nor how the logger is configured.

In those cases, the decision has to be made to be (1) as conservative as possible in selecting what's sent (i.e., NVT-ASCII per Postel's Law), or (2) UTF-8 compressed strings with a BOM prefix.

This addresses bug 75077.

@pprindeville
Copy link
Contributor Author

This PR depends on PR #2674 being merged first.

@pprindeville
Copy link
Contributor Author

@weltling I saw you did a push for wide-byte handling... Is that related to this issue? Should I proceed on this or do you have something else in mind? I'm fine either way.

@weltling
Copy link
Contributor

@pprindeville oh, no, that's not related :) It's about Unicode APIs usage on Windows, started with 7.1 for the path handling and continues for other areas. It's indeed UTF-16 based, but only touching relevant parts doing it incrementally. I've just realized it were the case, so did it on the side.

The only related commit was dda70e0 about not doing newline splits. The syslog RFC mentions UTF-8 only anyway, so that's 1 byte only. For the event log it could make sense, should see whether it would be integrated into the Windows implementation directly.

So for this part, if I understood correctly, the idea was to also add the UTF-8 conversion? In theory it's doably through iconv or mbstring, but those should be available. Simplest were probably mbstring. In that case, one can rely on Zend multibyte API. But if mbstring is not loaded, IMO the basic case should be still processed.

Probably in any case, the cleanups for c < 0x20 && c != 0x7f and the other handling implemented in this PR should be present. Otherwise, if the mbstring API is available through Zend multibyte, it should be used to comply with the RFC. Still, the approach should be tested with some of the syslog daemons to see it's doing the expected things. And the @bukka's point about the INI option still should be cleared/targeted, after the exact patch approach was discussed.

Thanks.

@pprindeville
Copy link
Contributor Author

In theory it's doably through iconv or mbstring, but those should be available. Simplest were probably mbstring. In that case, one can rely on Zend multibyte API. But if mbstring is not loaded, IMO the basic case should be still processed.

@weltling Sorry, just to be clear here, what are you saying? That mbstring is an optional part of PHP? And if it's present, then we do the conversion from mbstring's to utf-8 with the BOM prefix... and if it's not present, just allow NVT-ASCII (0x20-0x7e)?

And is this a run-time or build-time determination?

@weltling
Copy link
Contributor

weltling commented Aug 19, 2017

@pprindeville yeah, i was refering to #2674 (comment). The strict RFC compliance were UTF-8, whereby probably it were not to exclude someone would want to write in an arbitrary encoding. I didn't mean just allow NVT ASCII as a fallback, see below.

Mbstring is a core extension, however there's no guarantee it is loaded in a particular PHP installation. The patch currently present does the following

  • block the invalid range as per RFC
  • escape the remaining bytes < 0x80 otherwise

Basically it's the remaining part of the first patch, where it seemed to be fine to proceed with. It passes an arbitrary string through and takes care about the known invalid pieces.

Now, if there's additional UTF-8 or other encoding handling, it'd need some API and mbstring seems to fit for it. Then also where the @bukka's idea comes from i guess, if i understand correctly. If mbstring is loaded, the handlers in Zend/zend_multibyte.c will be worky. One could then go by the approach like this

  • syslog.encoding="" - the current patch as is, cleanup invalid range and do escape of upper ISO-* symbols
  • syslog.encoding=ascii - only let NVT ASCII in
  • syslog.encoding=some_encoding - do conversion
  • syslog.encoding=RFC - full RFC complaint handling

Perhaps it could be also only UTF-8 instead of an arbitrary encoding, but i'm really not sure it wouldn't collide with the real life :) Fe what i've seen as described here #2674 (comment) Or perhaps, some special value to match the current default_charset or internal_encodnig would make sense, too. Technically the intention is doable, even with some limitation. Given there were no mbstring, the current patch would suffice at least for the basic handling. Lets see what you say and for the further comments.

thanks.

@nikic
Copy link
Member

nikic commented Aug 19, 2017

For reference, Python used to add a BOM for UTF-8 encoded messages, but later dropped the behavior, as it was causing too many issues. See http://bugs.python.org/issue14452 and https://hg.python.org/cpython/rev/af46a001d5ec.

@pprindeville
Copy link
Contributor Author

pprindeville commented Aug 19, 2017

One could then go by the approach like this [...]

@weltling I don't know if we do need syslog.encoding after all. If default_charset isn't set, then we assume 'ascii' and full NVT-ASCII compliance. Control characters, DEL, and high-bit characters get escaped.

Otherwise, if default_charset is set, then we convert the string to utf8 and prepend the BOM.

@nikic I read through issue 14452 and it seemed to me that for 3.3 they eventually did end up doing proper RFC-5424 support. The "removing BOM insertion" was for 2.7 and 3.2 only.

@nikic
Copy link
Member

nikic commented Aug 19, 2017

@pprindeville I checked the current code and a BOM is not prepended there: https://github.com/python/cpython/blob/master/Lib/logging/handlers.py#L932 The discussion in issue 14452 was just considering different options, what was implemented is a complete removal.

@pprindeville
Copy link
Contributor Author

@nikic It looks like they went with having it done in an external module:

https://github.com/dpocock/python-rfc5424-logging-formatter

although it's not clear where the 1 gets stripped from the message and replaced with the BOM.

That probably happens in the default logger class.

@weltling
Copy link
Contributor

@pprindeville it was discussed previously, that having ASCII only by default is a BC breach that doesn't seem justified. It still belongs to RFC 5424 as an option, that's true. Otherwise, default_charset empty means UTF-8 nowadays. To summarize, leaving aside the INI naming, etc.

  1. anything is passed through, as done now by default
    1.1. RFC 3164 complaint ASCII
    1.2. RFC 3164 complaint any encoding, requires mbstring
  2. RFC 5424 complaint ASCII
  3. RFC 5424 complaint BOM+UTF-8, requires mbstring

The option 1.2 would include UTF-8 also without BOM. Perhaps doing encoding conversion different from default_charset or internal encoding might be too much of good.

It seems that we still need to workout a path for this. It seems to me, that the Python issue was implementation specific, but nevertheless BOM is removed there http://bugs.python.org/msg158447. The involvement of mbstring were IMO good to validate what is written into the syslog, even if we don't really have to care. The two *big* choices at least - ASCII or not, might be a minimum, whereby the current behavior may still make sense to stay default.

It would be still good to hear from @bukka what is exactly meant in the INI renaming as from the previous discussion :)

Thanks.

@bukka
Copy link
Member

bukka commented Aug 22, 2017

@weltling Just to clarify, the RFC 5424 does not prohibit control character either:


   The syslog application SHOULD avoid octet values below 32 (the
   traditional US-ASCII control character range except DEL).  These
   values are legal, but a syslog application MAY modify these
   characters upon reception.  For example, it might change them into an
   escape sequence (e.g., value 0 may be changed to "\0").  A syslog
   application SHOULD NOT modify any other octet values.

So there should still be an option to completely disable any modification which can have small perf benefit (we don't need to go through each characters and dynamically copy buffer).

I didn't really mean using mbstring and doing any conversion which is not a good idea IMHO as it should be an app responsibility (we can't know which encoding is passed in anyway and automatic identification is error prone - read it doesn't work for all cases). What I meant is just to do simple validation and possibly ignore or substitute invalid UTF-8 characters (basically what we do in json). I think adding BOM should be probably also configurable.

Btw. I plan to rewrite syslog as part of my fpmi ( just a development fork of fpm - https://github.com/bukka/fpmi/tree/next ) work on logging as current syslog implementation is not ideal for fpm and it's also not easily testable. I really want to add a support for syslog server (IP + port) setup to be able to listen on another port and get output in our tests. Just want to say that it's not good that we don't have any tests for these changes.

@weltling
Copy link
Contributor

@bukka for the validation, substitution, etc. an API is needed, that's how it leads to mbstring. There might be other encodings, too, with possible consequences. Primarily thinking about DBCS and SBCS based systems and EBCDIC, so my earlier point was as well the default behavior should not change. I've reread #2674 (comment) and it seems you still only talk about UTF-8. Still there are quite a few options and accompanying issues, so we need to work out a list on what should be done and how. I don't mind reassigning this PR to you, as you work on a wider change in this area and would coordinate with @pprindeville in regard to this particular part. You can ping me anytime for tests, etc. Till now all tests was done manually, some automated tests could be done but would have to be run under a root account.

Thanks.

@weltling weltling removed their assignment Aug 22, 2017
@bukka
Copy link
Member

bukka commented Aug 22, 2017

@weltling There is no need for mbstring in UTF-8 validation (see json encoder for example) only for conversion which I actually didn't mean to do. What I thought initially is to just have some options for filtering. It means

  • do nothing (that's what we currently do)
  • do not send control chars
  • ascii NVT only
  • UTF-8 with BOM either with either ignoring or substituting invalid characters
  • UTF-8 without BOM either with either ignoring or substituting invalid characters

Maybe syslog.encoding is not the best name. Better name might be syslog.filter possibly?

The test will be possible without the root as it can be just another process listening on the selected port and IP which would be configured using syslog.server for example. Format could be similar like the one used in nginx - http://nginx.org/en/docs/syslog.html . The process would then just print the the received format in syslog format. Of course it doesn't test particular implementation but it tests the expected client behavior which should be enough for automated tests IMHO.

@pprindeville
Copy link
Contributor Author

@weltling @nikic @bukka Trying to decide what the right thing to do is and if it's worth opening another bug or not...

Here's the scenario.

Let's say that you have:

error_log = "syslog"

in your /etc/php.ini system-wide. Further, assume that you want PHP to use a particular facility for logging...

Problem is that you can't do this from the configuration... yet. You have to do:

openlog('php', 0, LOG_USER);

or whatever to decide where you want to route it. But let's say that PHP is crashing before it executes the first line of code (but after it's parsed the php.ini file). So the logging will go to the default destination, and you're PHP logging will be scattered amongst multiple locations.

What's the fix?

(1) add a syslog.facility config variable, that would get processed the same time that error_log = ... gets processed;
(2) add a call in php_syslog() so that if openlog() hasn't been called, call it explicitly (otherwise, it gets called implicitly inside syslog(), but with default parameters).

What would this fix? Well, if PHP dies early on and calls an exception handler (which ends up calling php_syslog()), then the output would get sent to the correct (singular) place even if a single line of PHP scripting never got parsed (not even auto_prepend_file).

What do you all think?

@pprindeville
Copy link
Contributor Author

Well, it wasn't that much work to code up so I did: see PR #2701.

@pprindeville
Copy link
Contributor Author

Till now all tests was done manually, some automated tests could be done but would have to be run under a root account.

@weltling Well, maybe not anymore.

Look at PR #2701. We can now have syslog sent anywhere you want (well, at least to any log facility you want)... so it doesn't have to go into /var/log/syslog. It could be set up to go into a file that's world readable. Does Travis have hooks for configuring syslog?

I couldn't find anything obvious but I've sent email to Travis support to ask about that.

@pprindeville
Copy link
Contributor Author

Is there going to be a 7.1.9 release? Or will all these syslog changes first appear in 7.2?

@bukka
Copy link
Member

bukka commented Aug 23, 2017

These are feature additions and it is just for master which means 7.3. Of course this needs a full consensus before it is merged which is not the case atm.

@pprindeville
Copy link
Contributor Author

Fixing lines with embedded newlines is a feature?

@@ -68,11 +68,19 @@ PHPAPI void php_syslog(int priority, const char *format, ...) /* {{{ */
break;
}

if (c != '\n')
/* check for NVT ASCII only unless test disabled */
if (((0x20 <= c) && (c <= 0x7e)) || (!PG(syslog_ascii_only) && (c >= 0x80)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless I miss something in the logic, I think that the (c >= 0x80) should be dropped as it changes the current behavior which allows control characters by default. The issue is that RFC 5424 - section 6 actually allows any character for MSG-ANY so this should be followed by default IMHO.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If syslog_ascii_only is false, then any upper-half characters from any ISO-8859-x will be printed.

We talked about this quite a bit as I remember, and the consensus was that there wasn't a compelling reason for logging to include non-printing characters, and indeed doing so might present a vulnerability.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I remember that one now. I think we actually ended up without any consensus... :) Will think about this later.

@pprindeville
Copy link
Contributor Author

I'm less concerned about this issue not being resolved for 7.3, but I'd definitely like to see PR #2701 be merged.

@bukka
Copy link
Member

bukka commented Jul 7, 2018

I would really prefer to have an ini for more options. As I said initial something like syslog.filter would be more convenient. After thinking a bit more about it, I wouldn't mind to disable control characters by default but there should be still a way to explicitly allow them. That syslog.filter could possibly be used for allowing of UTF-8 only in the future. I'm not sure if this can be addressed in time for 7.3 though.

@pprindeville
Copy link
Contributor Author

Alpha 4 is in 5 days...

main/main.c Outdated
@@ -775,6 +798,7 @@ PHP_INI_BEGIN()
#endif
STD_PHP_INI_ENTRY("syslog.facility", "LOG_USER", PHP_INI_SYSTEM, OnSetFacility, syslog_facility, php_core_globals, core_globals)
STD_PHP_INI_ENTRY("syslog.ident", "php", PHP_INI_SYSTEM, OnUpdateString, syslog_ident, php_core_globals, core_globals)
STD_PHP_INI_ENTRY("syslog.filter", "no_ctrl", PHP_INI_ALL, OnSetLogFilter, syslog_filter, php_core_globals, core_globals)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't that be actually no-ctrl (dash instead of underscore) - https://github.com/php/php-src/pull/2689/files#diff-1a9cfc6173e3a434387996e46086da56R343

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch.

} else if ((c < 0x20) && (PG(syslog_filter) == PHP_SYSLOG_FILTER_NONE))
smart_string_appendc(&sbuf, c);
else {
smart_string_appendc(&sbuf, '#');
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not probably a big issue but I think it might be more useful to show hex value prefixed with \x - seems a bit more common but I might be wrong - not sure what other syslog clients do..?

Copy link
Contributor Author

@pprindeville pprindeville Jul 15, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rsyslog seems to use #%03o for formatting. syslog-ng uses \\x%02x. Seems to go either way... I have no preference.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah cool. I see that you already converted that to hex which is great! Maybe we could use the same format as syslog-ng in that case (it's just adding extra % and x chars) so it's visible that those are the replaced characters.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that you already converted that to hex which is great! Maybe we could use the same format as syslog-ng in that case (it's just adding extra % and x chars) so it's visible that those are the replaced characters.

Sorry, I meant that \\x%02x is the literal printf-style format string. So DEL would format as \x7f (which is also how it would appear as a literal in a C string). What would you like to see it format as?

@bukka
Copy link
Member

bukka commented Jul 15, 2018

@pprindeville There is still time as the future freeze was postponed on 2018-07-31 - see https://externals.io/message/102762 .

@pprindeville
Copy link
Contributor Author

All outstanding review questions addressed.

@pprindeville
Copy link
Contributor Author

Not sure about the AppVeyor failure. Seems to be unrelated.

@nikic
Copy link
Member

nikic commented Jul 16, 2018

Yeah, that's a spurious failure.

@pprindeville
Copy link
Contributor Author

There is still time as the future freeze was postponed on 2018-07-31 - see https://externals.io/message/102762 .

Well that's good. For me, anyway. Buys us some breathing room.

@cmb69
Copy link
Member

cmb69 commented Jul 17, 2018

Is this ready to be merged? It's too late for 7.3.0alpha4, but there are two weeks until 7.3.0beta1.

@pprindeville
Copy link
Contributor Author

Is this ready to be merged? It's too late for 7.3.0alpha4, but there are two weeks until 7.3.0beta1.

I think so but it's not up to me.

@cmb69
Copy link
Member

cmb69 commented Jul 18, 2018

@bukka Is this PR ready to be merged into master?

@bukka
Copy link
Member

bukka commented Jul 18, 2018

I think that the patch looks good - just one minor thing: #2689 (comment) .

I would like to do a little bit of testing over the weekend and if all ok, I will merge it.

@pprindeville please could you also squash it?

@pprindeville
Copy link
Contributor Author

I think that the patch looks good - just one minor thing: #2689 (comment) .

Sorry, I thought I addressed this. What's missing?

@bukka
Copy link
Member

bukka commented Jul 18, 2018

Ah it's cool then. You can just squash it and I will take a look over the weekend.

@pprindeville
Copy link
Contributor Author

Done!

@pprindeville pprindeville force-pushed the bug/75077 branch 2 times, most recently from e266698 to a416a66 Compare July 18, 2018 22:06
Also change non-printable characters to hex from octal.

Signed-off-by: Philip Prindeville <philipp@redfish-solutions.com>
@php-pulls
Copy link

Comment on behalf of bukka at php.net:

Merged in 2010c02

@php-pulls php-pulls closed this Jul 22, 2018
@bukka
Copy link
Member

bukka commented Jul 22, 2018

I have merged it and also changed the name of the none filter to all in c855043 . I think it matches the other filters names better (the names specify subset that is not replaced with hex value) and mainly none is an INI keyword so it works just in the double quotes.

@bukka
Copy link
Member

bukka commented Jul 22, 2018

@weltling Unless I missed something in the patch, this changes are not for Windows. I don't have any way to test it or implement the Win support and not even sure if it makes any sense on Win. If so feel free to port it or add a comment to ini that this is not for Windows. Cheers.

@pprindeville
Copy link
Contributor Author

pprindeville commented Jul 23, 2018

@weltling Unless I missed something in the patch, this changes are not for Windows.

Sorry, why wouldn't they work with Windows?

For now win32/wsyslog.c:vsyslog() just calls ReportEventW() and ReportEventA(), but there's nothing stopping anyone from adding code to have it send messages to 127.0.0.1:514 as well... and even ReportEventA() might have constraints about what sort of input it accepts.

@pprindeville pprindeville deleted the bug/75077 branch July 23, 2018 00:10
@weltling
Copy link
Contributor

@bukka thanks for the ping. I don't think it is relevant for Windows. It would currently try to convert the message to UTF-16 using current charset settings and fallback to the current ANSI charset otherwise. Even if it would sometimes show some funny chars in the logs, fe if something like \007 was written, that's the only issue. I haven't heared about any security related issues in this regard.

@pprindeville it is technically possible to log to another machine by passing something to RegisterEventSource, though it's definitely not syslog-ng or alike functionality. It's event log functionality which is completely different. To have a syslog-ng compatible functionality, one would have implement the protocol, etc. Technically it would be possible, of course, but not sure it would be a widely demanded feature.

Thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants