Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simplified istream handing #367 #764

Merged
merged 18 commits into from
Oct 22, 2017

Conversation

pjkundert
Copy link
Contributor

Simplified cached_input_stream_adapter<> to input_stream_adapter, to avoid redundant buffering (already buffered in underlying std::streambuf), and thus to not pre-load (and discard) a large buffer of input from the istream when parsing a smaller JSON record. Handles a leading Byte Order Mark using standard istream putback/unget.

Also, instead of trying to re-read the previous token on error (which involved seeking, which may or may not be available on an istream), simply collects the developing token into a std::string. The space underlying this string is preserved between tokens, so this collection devolves into simple copying, and is therefore quite efficient.

This also opens the door to further simplifications in the future.

Added a simple unit test to confirm operation. Did not test effect on json::parse. I feel that the approach suggested by @nlohmann (std::parse demands that the entire buffer be a single JSON object, but operator>> scans just a single upcoming JSON object) is correct... Perhaps this pull request may help move in that direction.

@pjkundert pjkundert changed the title Develop simplify istream #367 Simplified istream handing #367 Oct 2, 2017
@nlohmann
Copy link
Owner

nlohmann commented Oct 3, 2017

@pjkundert There is a test case failing that breaks all Travis runs, see https://travis-ci.org/nlohmann/json/builds/282450885?utm_source=github_status&utm_medium=notification.

@nlohmann
Copy link
Owner

nlohmann commented Oct 3, 2017

I ran a quick benchmark parsing file benchmarks/files/jeopardy/jeopardy.json

#include "json.hpp"
#include <fstream>

using json = nlohmann::json;

int main()
{
    std::ifstream f("benchmarks/files/jeopardy/jeopardy.json");
    json j;
    f >> j;
}

compiled with clang++ -std=c++11 -O3 -DNDEBUG -flto bench.cpp -o bench.

The develop code runs in 1.08 seconds, whereas this PR's code runs in 1.46 seconds. Did you run any benchmarks?

pjkundert added a commit to pjkundert/json that referenced this pull request Oct 3, 2017
o Use std::streambuf I/O instead of std::istream; does not maintain
  (unused) istream flags.
o Further simplify get/unget handling.
o Restore original handling of NUL in input stream; ignored during
  token_string escaping.
@coveralls
Copy link

Coverage Status

Coverage remained the same at 100.0% when pulling 8d5a5f0 on pjkundert:develop-simplify-istream into 7435d54 on nlohmann:develop.

@pjkundert
Copy link
Contributor Author

Performance now within about 10%, while eliminating all over-buffering and seeking. Since we do not need to maintain the std::istream state flags, I used the underlying streambuf I/O API for performance. The original code was also clearing the state flags before returning, so that has been maintained.

I noticed that the 'chars_read' (used to indicate file position where an error occurs) are incremented even when an EOF is received; this has been maintained (as it caused too many unit tests to fail when fixed). It is not a huge deal, as it will only cause error messages to contain an error location position one beyond the last character in the input.

Runs clean under valgrind, and on all Linux platforms, and on Windows under Visual Studio 2015; 2017 fails for some reason I do not have the tooling to investigate.

@theodelrieu
Copy link
Contributor

The VS2017 problem is fixed upstream. Try to run git rebase nlohmann/develop (assuming you added a git remote nlohmann).

That should suffice!

@nlohmann
Copy link
Owner

nlohmann commented Oct 4, 2017

@theodelrieu is right about MSVC - it would be great if you could rebase so we can let MSVC 2017 check the code.

I do not understand why the performance is worse (about 7% with my naive benchmark).

src/json.hpp Outdated
/// the start position of the current token
std::size_t start_pos = 0;
/// raw input token string (for error messages)
std::vector<char> token_string = std::vector<char>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

redundant initialization, default constructor is fine here.

@theodelrieu
Copy link
Contributor

I believe some profiling would be useful for that matter.

I'm not very experienced in them though.

@nlohmann
Copy link
Owner

nlohmann commented Oct 4, 2017

@pjkundert Do you know whether this PR could also be adjusted to fix #714?

pjkundert added a commit to pjkundert/json that referenced this pull request Oct 4, 2017
o Use std::streambuf I/O instead of std::istream; does not maintain
  (unused) istream flags.
o Further simplify get/unget handling.
o Restore original handling of NUL in input stream; ignored during
  token_string escaping.
@pjkundert
Copy link
Contributor Author

Performance now exceeds that of the develop branch, in at least some benchmarks. Removed some unnecessary attempts to NUL-terminate the yytext string (since yytext may contain NULs anyway). Converted yytext to a simple std::string, so we could return it directly via std::move(), and avoid many redundant std::string creations.

Let me know how this affects #714 ; since I use the underlying std::streambuf I/O, and only look for EOF (and do not set the iostream's failbit), it will almost certainly alter the behaviour you are seeing.

src/json.hpp Outdated
is.setstate(flags);

return result;
int c = is.rdbuf()->sbumpc(); // Avoided for performance: int c = is.get();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a specific reason to not return char instead of int? That's what I assumed from the function name.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

std::streambuf I/O deals in ints, returning either traits_type::eof() (-1, usually) or the buffer item, converted to an int. I'm not certain why; perhaps to not assume that the underlying data can only be 8-bit?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, I was thinking about our interface: do we want to expose int or char for callers of get_character?

I believe we should static _cast between int/char while dealing with the std::streambuf, and only exposing char

src/json.hpp Outdated
is.setstate(flags);

return result;
int c = is.rdbuf()->sbumpc(); // Avoided for performance: int c = is.get();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a specific reason to not return char here? That's what I'd assume from the function name

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we need to return values in the range [-1,255], we must continue to use int.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This implies that there is another issue -- we (presently) allow EOF to be pushed into the token_string, which is a char vector. Must prevent this. Also allows simplifying the get_token_string. Furthermore, a NUL in the token_string should be escaped (not presently handled; they are ignored for the purposes of error messages, which is not correct).

@@ -2640,7 +2582,7 @@ class lexer
const auto x = std::strtoull(yytext.data(), &endptr, 10);

// we checked the number format before
assert(endptr == yytext.data() + yylen);
assert(endptr == yytext.data() + yytext.size());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could yytext.end() be used here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but we're testing the output of a function that was originally passed yytext.data(), so this is probably clearer.

src/json.hpp Outdated
return next_unget ? (next_unget = false, current)
: (current = ia->get_character());
int c = current = ia->get_character();
token_string.push_back(static_cast<char>(c));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you only use current here? By the way I think we should have a convention about data members, either explicitly use this->, or an underscore prefix etc...

But that's not in the PR's scope.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not a fan of neither namings...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not really care about which naming to use, but I get confused each time, is it a local variable? A global one? A symbol in the namespace?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Sorry, I'm ok with only using current - I meant I don't like adding an underscore prefix or this->)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand, it's just hard to know what the symbol is without performing a search in the current file.

src/json.hpp Outdated
}

/// add a character to yytext
void add(int c)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same question about int vs char here.

// yytext cannot be returned as char*, because it may contain a null
// byte (parsed as "\u0000")
return std::string(yytext.data(), yylen);
return std::move( yytext );
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could lead to nasty bugsif called from a lvalue, could you provide lvalue and rvalue overloads?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would have to look up the caller, but maybe we can also return a const reference.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can return a const ref for lvalues and move for rvalues. I remember someone on reddit discussing the rvalue support in the Library, which could be far better.

But that's not an emergency at all

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so; but, on review -- I noticed that the get_string function was (incorrectly) returning a 'const std::string', preventing any move re-use of the returned std::string anyway! Removing the const appears to have resulted in (another) pretty dramatic speed-up. New push forthcoming...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well simply calling get_string twice would become problematic. It adds more boilerplate to add those overloads, but the code will always be correct. We should use std::move(lexer).get_string() to be explicit about this side effect

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This get_string method is being used for keys and values; both of which are ultimately moved into a std::map key, or into a BasicJsonType value; there are no lvalue users.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, appears to improve performance by close to 10% on some benchmarks. Nice.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, calling get_string twice would (surprisingly) result in an empty string. Perhaps renaming it to move_string() would result in less surprising behavior? I'm not sure I understand your std::move(lexer).get_string() recommendation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I get your point. Let's hope nobody will get bitten by that then :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't seen your last answer sorry. You can overload on lvalues and rvalues:

class lexer {
std::string const& get_string() & {
  return _content;
}

std::string get_string() && {
  return std::move(_content);
}
};

The first function can only be called on lvalues, while the latter can only be called on rvalues.
Hence the std::move(lexer).get_string(), which will call the second overload, while lexer.get_string() will call the first one.

I've used this pattern on classes holding huge data, and I would call the rvalue overload at the very end of the processing, while calling the lvalue one during processing.

But a move_string is fine with me too.

src/json.hpp Outdated
std::vector<char> yytext = std::vector<char>(1024, '\0');
/// current index in yytext
std::size_t yylen = 0;
std::string yytext = "";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Redundant initialization here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes; but we're not doing that with any of the other members; I was just trying to be consistent.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it's up to personal preference, not a big deal anyway ;)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are relying on the optimizer to eliminate the call to strlen() here. They should eliminate the call and just do default initialization, but better to not make it have to do that. I would run the benchmarks with and without this and see if there's any difference.

@theodelrieu
Copy link
Contributor

Sorry for the double comment...

@nlohmann
Copy link
Owner

nlohmann commented Oct 4, 2017

I re-ran the simple benchmark with the latest commit: it is still slower: 1.24 seconds vs. 1.14 seconds parsing jeopardy.json...

@pjkundert
Copy link
Contributor Author

I guess it's compiler dependent. On g++ 7.2, its substantially faster on that benchmark.

@nlohmann
Copy link
Owner

nlohmann commented Oct 4, 2017

Also with GCC 8.0.0 20170919 I measure worse runtimes, but the margin became smaller.

@pjkundert
Copy link
Contributor Author

Further performance improvements. Since EOF is the sole -'ve value allowed through the get() interface, we can use less costly comparisons to detect its presence. Further 1-2% performance improvement.

@pjkundert
Copy link
Contributor Author

Just to refocus the debate -- we're bringing the parser into compliance with expected behavior on std::istream; if this requires a small performance penalty, I believe this is acceptable. The present behavior on std::istream >> is extremely surprising and non-compliant...

@pjkundert
Copy link
Contributor Author

The performance improvement in detecting EOF is probably due to the code re-using a "-'ve value" CPU flag left over from a previous move, instead of issuing a new comparison against a specific value. Changing that specific value to const or even constexpr wouldn't help.

@pjkundert
Copy link
Contributor Author

The token_string is used solely in formatting error messages; NUL values can't typically exist in well-formed JSON, so they will typically terminate parsing in an error. We want them to be output as the last character read -- so they must be escaped in the error message, which is now done. As for EOF: no, they cannot appear in token_string, because it is a char vector containing values in the range [-128,127] (or [0,255] if interpreted as unsigned char). We specifically detect -'ve values, and avoid introducing them into token_string.

@gregmarr
Copy link
Contributor

gregmarr commented Oct 4, 2017

There are still more places where it compares against eof(). If it is that big of a savings, then there should probably be a single function that implements this test so that it can be done the same everywhere. Are we sure that eof() is the only possible negative value?

@pjkundert
Copy link
Contributor Author

All remaining usages of std::char_traits::eof() are either A) switch ... caseentries, or B) only performed rarely. I'd rather leave those B) cases explicit, I think. Even in the cases where we widen the comparison to simply -'ve (from specifically == eof()), I left the original comparison there for documentary purpose. I don't really like to do this kind of optimization at all, but in a well-tested, performance-critical tool such as this, it may be worth it.

@pjkundert
Copy link
Contributor Author

Right now, compiler warnings due to -Weffc++ prevent the default-initialization of some members, as this is not possible using ... = <value>; initializers. I would like to convert all of the member initializers from the ... = <value>; form, to initializer-list form: ... { <value> };. However, this may cause difficulty with some not-quite compliant c++11 compilers (eg. g++ < 4.9.1?) What do you all think of this idea?

@gregmarr
Copy link
Contributor

gregmarr commented Oct 4, 2017

Is this not supported by 4.9.0? The README currently lists the minimum g++ version as 4.9.

@nlohmann
Copy link
Owner

nlohmann commented Oct 4, 2017

g++-4.9 (Ubuntu 4.9.4-2ubuntu1~14.04.1) 4.9.4 is successfully used by Travis. I don't know about g++ 4.9.1 though.

@gregmarr
Copy link
Contributor

gregmarr commented Oct 4, 2017

google/googletest#898

GCC's -Weffc++ is IMHO effectively obsolete, and should not be used, unless you are trying to ensure your code meets outdated advice from more than ten years ago. There are numerous bug reports about -Weffc++ in GCC's bugzilla and nobody is interested in fixing them. Among its problems are that the suggestions it makes are based on the first edition of Effective C++ and those items were heavily revised for the second edition. Some advice is simply inappropriate in modern code (C++11 or later) and some warnings are simply poorly implemented in GCC (e.g requiring a mem-initializer for all member variables even if they have default constructors that do the right thing).

Changing modern projects to avoid unhelpful -Weffc++ warnings is a huge mistake.

src/json.hpp Outdated
@@ -1535,17 +1489,18 @@ class input_buffer_adapter : public input_adapter_protocol
{
if (JSON_LIKELY(cursor < limit))
{
return *(cursor++) & 0xFF;
return reinterpret_cast<int>(std::char_traits<char>::to_int_type(*(cursor++)));
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a static_cast should suffice here

@nlohmann
Copy link
Owner

nlohmann commented Oct 6, 2017

G++ still substantially faster, Clang only slightly slower than stock develop branch; push forthcoming.

I'm afraid I could still not reproduce this. I shall try to run some benchmarks in the weekend.

@pjkundert
Copy link
Contributor Author

OK, I simplified the handling of characters a bit more, removing unnecessary casts; the std::char_traits::to_int_type already handles all signed int/unsigned char conversions properly, and returns a signed value that can be handled by normal conversions to the int return type of get_character.

Also, removed the unnecessary changes to member initializers; I left the brace initializers in place, to avoid any warnings should someone decide to use -Weffc++ on their code.

Benchmarks remain improved on g++, essentially the same on clang++.

@theodelrieu
Copy link
Contributor

@pjkundert I think I understand why my git log gets wrecked when visualizing your commits. Could you insert a blank line between the header and the commit body? I have the whole commits on a single line.

Good job on the body details BTW :)

src/json.hpp Outdated
@@ -1397,120 +1397,77 @@ constexpr T static_const<T>::value;
/// abstract input adapter interface
struct input_adapter_protocol
{
virtual int get_character() = 0;
virtual std::string read(std::size_t offset, std::size_t length) = 0;
virtual int get_character() = 0; // returns characters in range [0,255], or eof()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be std::char_traits<char>::int_type instead of int so that we know that there will be no type changes later?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would be an acceptable change, I think. It's typically that type anyway, and being explicit about it keeps things obvious; that all implementation of the adapter class need to be consistent in producing std::char_traits<char>::int_type, too.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A using int_type = ... would make it less verbose then the full thing.

contains the number of bytes in the string.
scanning, bytes are escaped and copied into buffer yytext. Then the function
returns successfully, yytext is *not* null-terminated (as it may contain \0
bytes), and yytext.size() is the number of bytes in the string.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be more accurate to say that it's null terminated but may also contain embedded nulls?
This currently sounds like there isn't guaranteed to be at least one null.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, that's a change in behavior, I see now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I assume that it used to be NUL-terminated, in the mists of history, and was probably being returned as a char*' Since the token can contain NULs (supplied by escaping), any assumptions about NUL and termination are incorrect. So it was removed.

std::vector<char> yytext = std::vector<char>(1024, '\0');
/// current index in yytext
std::size_t yylen = 0;
std::string yytext { };
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Neither of these need the { } at the end.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct. However, since the -Weffc++ flag still exists (and arbitrary client code may still use it), and it is cost and risk free for us to still comply with its constraints, I decided to include the default initializer.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I'd say that there is no cost. There is a cost whenever you do something unnecessary to appease a broken tool. It's not a runtime or build-time cost, more of a cognitive burden. If we intend to keep it there to support the warning, anyone who maintains the code should know why it's there, which means that there should probably be a comment.
Anyway, it's all philosophical. I'm not going to lose any sleep over it.

@@ -5205,7 +5162,7 @@ class binary_reader
@brief get next character from the input

This function provides the interface to the used input adapter. It does
not throw in case the input reached EOF, but returns
not throw in case the input reached EOF, but returns a -'ve valued
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should remove this "negative valued" bit, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed.

@coveralls
Copy link

Coverage Status

Coverage remained the same at 100.0% when pulling 45e1e3d on pjkundert:develop-simplify-istream into 0c0851d on nlohmann:develop.

o We assume the same character int_type as the unerlying std::istream
o There are no assumptions on the value of eof(), other than that it
  will not be a valid unsigned char value.
o To retain performance, we do not allow swapping out the underlying
  std::streambuf during our use of the std::istream for parsing.
@coveralls
Copy link

Coverage Status

Coverage remained the same at 100.0% when pulling 23440eb on pjkundert:develop-simplify-istream into 0c0851d on nlohmann:develop.

o For some unknown reason, the complexity of the benchmark platform
  prevented some C++ compilers from generating optimal code, properly
  reflective of the real performance in actual deployment.
o Added the json_benchmarks_simple target, which performs the same
  suite of tests as json_benchmarks.
o Simplified the benchmark platform, and emit an "Average" TPS
  (Transactions Per Second) value reflective of aggregate parse/output
  performance.
@coveralls
Copy link

Coverage Status

Coverage remained the same at 100.0% when pulling 0b803d0 on pjkundert:develop-simplify-istream into 0c0851d on nlohmann:develop.

@theodelrieu
Copy link
Contributor

theodelrieu commented Oct 11, 2017

Is there anything left required before merging this?

@nlohmann
Copy link
Owner

I'll have another look soon - I want to have another look whether this also fixes #714.

About anything else: are we happy with this approach? I could not measure performance improvement, but I understand that less code is preferable here, and the caching always felt like a hack (though it performed better in my bechmarks, and also with the simple example of parsing jeopardy.json).

@trilogy-service-ro
Copy link

Can one of the admins verify this patch?

@theodelrieu
Copy link
Contributor

I personally have no more issues with this PR.

@coveralls
Copy link

Coverage Status

Coverage remained the same at 100.0% when pulling a8cc7a1 on pjkundert:develop-simplify-istream into 0c0851d on nlohmann:develop.

@theodelrieu
Copy link
Contributor

Can we merge this PR?

@nlohmann nlohmann self-assigned this Oct 22, 2017
@nlohmann nlohmann added this to the Release 3.0.0 milestone Oct 22, 2017
@nlohmann nlohmann merged commit 3094640 into nlohmann:develop Oct 22, 2017
@nlohmann
Copy link
Owner

Thanks so much for all the patience!

nlohmann added a commit that referenced this pull request Oct 22, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants