Simplified istream handling #367 #764

pjkundert · 2017-10-02T21:30:22Z

Simplified cached_input_stream_adapter<> to input_stream_adapter, to avoid redundant buffering (already buffered in underlying std::streambuf), and thus to not pre-load (and discard) a large buffer of input from the istream when parsing a smaller JSON record. Handles a leading Byte Order Mark using standard istream putback/unget.

Also, instead of trying to re-read the previous token on error (which involved seeking, which may or may not be available on an istream), simply collects the developing token into a std::string. The space underlying this string is preserved between tokens, so this collection devolves into simple copying, and is therefore quite efficient.
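The token-collection idea described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the PR's lexer: `token_collector`, its `next()`/`text()` members, and the space-separated token format are all hypothetical. It only shows that the raw characters of the current token can be kept in a `std::string` that is cleared (not reallocated) between tokens, so no seeking on the stream is needed to recover the token text for an error message.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <streambuf>
#include <string>

// Hypothetical mini-scanner: reads space-separated tokens from a
// std::streambuf and keeps the raw characters of the current token.
class token_collector
{
  public:
    explicit token_collector(std::istream& is) : sb(is.rdbuf())
    {
        assert(sb != nullptr);
    }

    // Reads the next token; returns false once the input is exhausted.
    bool next()
    {
        using traits = std::char_traits<char>;
        // clear() drops the contents; common implementations keep the
        // allocated capacity, so later tokens reuse the same storage.
        token.clear();

        traits::int_type c = sb->sbumpc();
        while (!traits::eq_int_type(c, traits::eof()) && traits::to_char_type(c) == ' ')
        {
            c = sb->sbumpc(); // skip separators
        }
        while (!traits::eq_int_type(c, traits::eof()) && traits::to_char_type(c) != ' ')
        {
            token.push_back(traits::to_char_type(c)); // EOF is never stored
            c = sb->sbumpc();
        }
        return !token.empty();
    }

    const std::string& text() const { return token; }

  private:
    std::streambuf* sb;
    std::string token;
};
```

Because the characters are copied as they are read, the same approach works on streams that cannot seek, such as pipes or sockets.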

This also opens the door to further simplifications in the future.

Added a simple unit test to confirm operation. Did not test the effect on json::parse. I feel that the approach suggested by @nlohmann (json::parse demands that the entire buffer be a single JSON object, but operator>> scans just a single upcoming JSON object) is correct. Perhaps this pull request may help move in that direction.

nlohmann · 2017-10-03T08:43:34Z

@pjkundert There is a test case failing that breaks all Travis runs, see https://travis-ci.org/nlohmann/json/builds/282450885?utm_source=github_status&utm_medium=notification.

nlohmann · 2017-10-03T09:07:52Z

I ran a quick benchmark parsing file benchmarks/files/jeopardy/jeopardy.json

#include "json.hpp"
#include <fstream>

using json = nlohmann::json;

int main()
{
    std::ifstream f("benchmarks/files/jeopardy/jeopardy.json");
    json j;
    f >> j;
}

compiled with clang++ -std=c++11 -O3 -DNDEBUG -flto bench.cpp -o bench.

The develop code runs in 1.08 seconds, whereas this PR's code runs in 1.46 seconds. Did you run any benchmarks?

o Use std::streambuf I/O instead of std::istream; does not maintain (unused) istream flags.
o Further simplify get/unget handling.
o Restore original handling of NUL in input stream; ignored during token_string escaping.

coveralls · 2017-10-04T00:06:18Z

Coverage remained the same at 100.0% when pulling 8d5a5f0 on pjkundert:develop-simplify-istream into 7435d54 on nlohmann:develop.

pjkundert · 2017-10-04T14:12:48Z

Performance now within about 10%, while eliminating all over-buffering and seeking. Since we do not need to maintain the std::istream state flags, I used the underlying streambuf I/O API for performance. The original code was also clearing the state flags before returning, so that has been maintained.
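To illustrate the difference in flag handling mentioned above, here is a small sketch (the function names `drain_via_streambuf` and `drain_via_istream` are made up for this example and do not appear in the PR). Reading through `rdbuf()` bypasses the formatted/unformatted input layer of `std::istream`, so the stream's state flags are never touched, whereas `istream::get` sets `eofbit` and `failbit` when it hits the end of input.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <streambuf>
#include <string>

// Reads everything through the underlying std::streambuf.
// The istream's state flags are left untouched.
inline std::string drain_via_streambuf(std::istream& is)
{
    using traits = std::char_traits<char>;
    std::string out;
    std::streambuf* sb = is.rdbuf();
    for (traits::int_type c = sb->sbumpc();
         !traits::eq_int_type(c, traits::eof());
         c = sb->sbumpc())
    {
        out.push_back(traits::to_char_type(c));
    }
    return out;
}

// Reads everything through istream::get; reaching the end of the
// input sets eofbit and failbit on the stream.
inline std::string drain_via_istream(std::istream& is)
{
    std::string out;
    char c;
    while (is.get(c))
    {
        out.push_back(c);
    }
    return out;
}
```

A caller that wants the usual `istream` semantics after parsing would have to set or clear the flags explicitly, which appears to be what the comment above means by maintaining the original flag-clearing behaviour.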

I noticed that 'chars_read' (used to indicate the file position where an error occurs) is incremented even when an EOF is received; this has been maintained (as fixing it caused too many unit tests to fail). It is not a huge deal, as it will only cause error messages to report a position one beyond the last character in the input.

Runs clean under valgrind, and on all Linux platforms, and on Windows under Visual Studio 2015; 2017 fails for some reason I do not have the tooling to investigate.

theodelrieu · 2017-10-04T14:19:56Z

The VS2017 problem is fixed upstream. Try to run git rebase nlohmann/develop (assuming you added a git remote nlohmann).

That should suffice!

nlohmann · 2017-10-04T15:14:15Z

@theodelrieu is right about MSVC - it would be great if you could rebase so we can let MSVC 2017 check the code.

I do not understand why the performance is worse (about 7% with my naive benchmark).

theodelrieu · 2017-10-04T15:20:39Z

src/json.hpp

-    /// the start position of the current token
-    std::size_t start_pos = 0;
+    /// raw input token string (for error messages)
+    std::vector<char> token_string = std::vector<char>();


redundant initialization, default constructor is fine here.

theodelrieu · 2017-10-04T15:23:02Z

I believe some profiling would be useful for that matter.

I'm not very experienced in them though.

nlohmann · 2017-10-04T15:32:57Z

@pjkundert Do you know whether this PR could also be adjusted to fix #714?

pjkundert · 2017-10-04T16:37:23Z

Performance now exceeds that of the develop branch, in at least some benchmarks. Removed some unnecessary attempts to NUL-terminate the yytext string (since yytext may contain NULs anyway). Converted yytext to a simple std::string, so we could return it directly via std::move(), and avoid many redundant std::string creations.

Let me know how this affects #714; since I use the underlying std::streambuf I/O, and only look for EOF (and do not set the iostream's failbit), it will almost certainly alter the behaviour you are seeing.

theodelrieu · 2017-10-04T16:46:27Z

src/json.hpp

-        is.setstate(flags);
-
-        return result;
+        int c = is.rdbuf()->sbumpc(); // Avoided for performance: int c = is.get();


Is there a specific reason to not return char instead of int? That's what I assumed from the function name.

std::streambuf I/O deals in ints, returning either traits_type::eof() (-1, usually) or the buffer item, converted to an int. I'm not certain why; perhaps to not assume that the underlying data can only be 8-bit?

Indeed, I was thinking about our interface: do we want to expose int or char for callers of get_character?

I believe we should static_cast between int/char while dealing with the std::streambuf, and only expose char.

theodelrieu · 2017-10-04T16:50:07Z

src/json.hpp

-        is.setstate(flags);
-
-        return result;
+        int c = is.rdbuf()->sbumpc(); // Avoided for performance: int c = is.get();


Is there a specific reason to not return char here? That's what I'd assume from the function name

Since we need to return values in the range [-1,255], we must continue to use int.

This implies that there is another issue -- we (presently) allow EOF to be pushed into the token_string, which is a char vector. Must prevent this. Also allows simplifying the get_token_string. Furthermore, a NUL in the token_string should be escaped (not presently handled; they are ignored for the purposes of error messages, which is not correct).

theodelrieu · 2017-10-04T16:52:30Z

src/json.hpp

@@ -2640,7 +2582,7 @@ class lexer
            const auto x = std::strtoull(yytext.data(), &endptr, 10);

            // we checked the number format before
-            assert(endptr == yytext.data() + yylen);
+            assert(endptr == yytext.data() + yytext.size());


Could yytext.end() be used here?

Yes, but we're testing the output of a function that was originally passed yytext.data(), so this is probably clearer.
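For readers unfamiliar with the `endptr` idiom in the assertion above, a minimal sketch follows. The helper name `parses_completely` is hypothetical; only `std::strtoull` and the pointer comparison come from the diff. Note that `end()` returns an iterator rather than a `char*`, so comparing it with `endptr` directly is not portable, which is one more reason to stay with `data() + size()`.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Hypothetical helper: returns true if the whole of `text` was consumed
// as a base-10 unsigned integer, storing the result in `value`.
inline bool parses_completely(const std::string& text, unsigned long long& value)
{
    char* endptr = nullptr;
    // std::string::data() is NUL-terminated since C++11, as strtoull requires.
    value = std::strtoull(text.data(), &endptr, 10);
    // strtoull stops at the first character it cannot use; if that is the
    // end of the buffer, every character belonged to the number.
    return !text.empty() && endptr == text.data() + text.size();
}
```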

theodelrieu · 2017-10-04T16:56:10Z

src/json.hpp

-        return next_unget ? (next_unget = false, current)
-               : (current = ia->get_character());
+        int c = current = ia->get_character();
+        token_string.push_back(static_cast<char>(c));


Could you only use current here? By the way I think we should have a convention about data members, either explicitly use this->, or an underscore prefix etc...

But that's not in the PR's scope.

I'm not a fan of either naming...

I do not really care about which naming to use, but I get confused each time, is it a local variable? A global one? A symbol in the namespace?

(Sorry, I'm ok with only using current - I meant I don't like adding an underscore prefix or this->)

I understand, it's just hard to know what the symbol is without performing a search in the current file.

theodelrieu · 2017-10-04T16:57:31Z

src/json.hpp

+    }
+
+    /// add a character to yytext
+    void add(int c)


Same question about int vs char here.

theodelrieu · 2017-10-04T17:00:06Z

src/json.hpp

-        // yytext cannot be returned as char*, because it may contain a null
-        // byte (parsed as "\u0000")
-        return std::string(yytext.data(), yylen);
+        return std::move( yytext );


This could lead to nasty bugs if called on an lvalue; could you provide lvalue and rvalue overloads?

I would have to look up the caller, but maybe we can also return a const reference.

We can return a const ref for lvalues and move for rvalues. I remember someone on reddit discussing the rvalue support in the Library, which could be far better.

But that's not an emergency at all

I don't think so; but, on review -- I noticed that the get_string function was (incorrectly) returning a 'const std::string', preventing any move re-use of the returned std::string anyway! Removing the const appears to have resulted in (another) pretty dramatic speed-up. New push forthcoming...

Well simply calling get_string twice would become problematic. It adds more boilerplate to add those overloads, but the code will always be correct. We should use std::move(lexer).get_string() to be explicit about this side effect

This get_string method is being used for keys and values; both of which are ultimately moved into a std::map key, or into a BasicJsonType value; there are no lvalue users.

Yup, appears to improve performance by close to 10% on some benchmarks. Nice.

Yes, calling get_string twice would (surprisingly) result in an empty string. Perhaps renaming it to move_string() would result in less surprising behavior? I'm not sure I understand your std::move(lexer).get_string() recommendation.

I get your point. Let's hope nobody will get bitten by that then :)

I haven't seen your last answer sorry. You can overload on lvalues and rvalues:

class lexer
{
    std::string const& get_string() & { return _content; }
    std::string get_string() && { return std::move(_content); }
};

The first function can only be called on lvalues, while the latter can only be called on rvalues.
Hence the std::move(lexer).get_string(), which will call the second overload, while lexer.get_string() will call the first one.

I've used this pattern on classes holding huge data, and I would call the rvalue overload at the very end of the processing, while calling the lvalue one during processing.

But a move_string is fine with me too.
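A self-contained version of the ref-qualified overload pattern discussed in this thread might look like the sketch below. The class `token_holder` and its member `content` are placeholders invented for the example; the PR itself did not necessarily adopt this design.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical class illustrating lvalue/rvalue overloads of a getter.
class token_holder
{
  public:
    explicit token_holder(std::string s) : content(std::move(s)) {}

    // Called on lvalues: hands out a reference, the object keeps its data.
    const std::string& get_string() const& { return content; }

    // Called on rvalues only (e.g. std::move(holder).get_string()):
    // the buffer is moved out, leaving the object in a valid but
    // unspecified state.
    std::string get_string() && { return std::move(content); }

  private:
    std::string content;
};
```

With this shape, calling `holder.get_string()` twice is harmless, and the destructive variant has to be requested explicitly at the call site via `std::move(holder).get_string()`.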

theodelrieu · 2017-10-04T17:00:53Z

src/json.hpp

-    std::vector<char> yytext = std::vector<char>(1024, '\0');
-    /// current index in yytext
-    std::size_t yylen = 0;
+    std::string yytext = "";


Redundant initialization here

Yes; but we're not doing that with any of the other members; I was just trying to be consistent.

I guess it's up to personal preference, not a big deal anyway ;)

You are relying on the optimizer to eliminate the call to strlen() here. They should eliminate the call and just do default initialization, but better to not make it have to do that. I would run the benchmarks with and without this and see if there's any difference.

theodelrieu · 2017-10-04T17:03:09Z

Sorry for the double comment...

nlohmann · 2017-10-04T17:49:31Z

I re-ran the simple benchmark with the latest commit: it is still slower: 1.24 seconds vs. 1.14 seconds parsing jeopardy.json...

pjkundert · 2017-10-04T18:01:19Z

I guess it's compiler dependent. On g++ 7.2, it's substantially faster on that benchmark.

nlohmann · 2017-10-04T18:14:50Z

Also with GCC 8.0.0 20170919 I measure worse runtimes, but the margin became smaller.

pjkundert · 2017-10-04T18:40:24Z

Further performance improvements. Since EOF is the sole negative value allowed through the get() interface, we can use less costly comparisons to detect its presence. This yields a further 1-2% performance improvement.

pjkundert · 2017-10-04T19:04:00Z

Just to refocus the debate -- we're bringing the parser into compliance with expected behavior on std::istream; if this requires a small performance penalty, I believe this is acceptable. The present behavior on std::istream >> is extremely surprising and non-compliant...

pjkundert · 2017-10-04T19:28:01Z

The performance improvement in detecting EOF is probably due to the code re-using a "negative value" CPU flag left over from a previous move, instead of issuing a new comparison against a specific value. Changing that specific value to const or even constexpr wouldn't help.

pjkundert · 2017-10-04T19:32:00Z

The token_string is used solely in formatting error messages; NUL values can't typically exist in well-formed JSON, so they will typically terminate parsing in an error. We want them to be output as the last character read, so they must be escaped in the error message, which is now done. As for EOF: no, it cannot appear in token_string, because that is a char vector containing values in the range [-128,127] (or [0,255] if interpreted as unsigned char). We specifically detect negative values, and avoid introducing them into token_string.
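As a rough sketch of what escaping NUL (and other control characters) for an error message could look like: the function name `escape_token` and the `<U+XXXX>` output format below are assumptions made for illustration and may differ from what the library actually emits.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical helper: renders the raw bytes of a token for an error
// message. Control characters, including NUL, are written as <U+XXXX>
// so they stay visible instead of truncating or corrupting the message.
inline std::string escape_token(const std::string& raw)
{
    std::string out;
    for (const char ch : raw)
    {
        const unsigned char u = static_cast<unsigned char>(ch);
        if (u <= 0x1F)
        {
            char buf[9]; // "<U+0000>" is 8 characters plus the terminator
            std::snprintf(buf, sizeof(buf), "<U+%04X>", static_cast<unsigned>(u));
            out += buf;
        }
        else
        {
            out.push_back(ch);
        }
    }
    return out;
}
```

Since the input is a `std::string` with an explicit length, an embedded NUL is just another byte and does not end the loop early.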

gregmarr · 2017-10-04T19:32:21Z

There are still more places where it compares against eof(). If it is that big of a savings, then there should probably be a single function that implements this test so that it can be done the same everywhere. Are we sure that eof() is the only possible negative value?
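The "single function" suggested here could be as small as the following sketch. The name `is_eof` and the `int_type` alias are hypothetical; the point is only that the test lives in one place and makes no assumption about `eof()` being negative or being the only negative value.

```cpp
#include <cassert>
#include <string>

using int_type = std::char_traits<char>::int_type;

// Hypothetical helper: the one place that decides whether a value
// returned by an input adapter means "end of input".
inline bool is_eof(int_type c) noexcept
{
    return std::char_traits<char>::eq_int_type(c, std::char_traits<char>::eof());
}
```

If profiling showed that a sign test is measurably cheaper, the body of this one function could be changed without touching its callers.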

pjkundert · 2017-10-04T19:38:21Z

All remaining usages of std::char_traits::eof() are either A) switch ... case entries, or B) only performed rarely. I'd rather leave those B) cases explicit, I think. Even in the cases where we widen the comparison to simply negative (from specifically == eof()), I left the original comparison there for documentary purposes. I don't really like to do this kind of optimization at all, but in a well-tested, performance-critical tool such as this, it may be worth it.

pjkundert · 2017-10-04T20:06:35Z

Right now, compiler warnings due to -Weffc++ prevent the default-initialization of some members, as this is not possible using ... = <value>; initializers. I would like to convert all of the member initializers from the ... = <value>; form to initializer-list form: ... { <value> };. However, this may cause difficulty with some not-quite-compliant C++11 compilers (e.g. g++ < 4.9.1?). What do you all think of this idea?

gregmarr · 2017-10-04T20:11:09Z

Is this not supported by 4.9.0? The README currently lists the minimum g++ version as 4.9.

nlohmann · 2017-10-04T20:15:01Z

g++-4.9 (Ubuntu 4.9.4-2ubuntu1~14.04.1) 4.9.4 is successfully used by Travis. I don't know about g++ 4.9.1 though.

gregmarr · 2017-10-04T20:17:24Z

google/googletest#898

GCC's -Weffc++ is IMHO effectively obsolete, and should not be used, unless you are trying to ensure your code meets outdated advice from more than ten years ago. There are numerous bug reports about -Weffc++ in GCC's bugzilla and nobody is interested in fixing them. Among its problems are that the suggestions it makes are based on the first edition of Effective C++ and those items were heavily revised for the second edition. Some advice is simply inappropriate in modern code (C++11 or later) and some warnings are simply poorly implemented in GCC (e.g requiring a mem-initializer for all member variables even if they have default constructors that do the right thing).

Changing modern projects to avoid unhelpful -Weffc++ warnings is a huge mistake.

nlohmann · 2017-10-06T08:54:40Z

src/json.hpp

@@ -1535,17 +1489,18 @@ class input_buffer_adapter : public input_adapter_protocol
    {
        if (JSON_LIKELY(cursor < limit))
        {
-            return *(cursor++) & 0xFF;
+            return reinterpret_cast<int>(std::char_traits<char>::to_int_type(*(cursor++)));


This seems to fail on MSVC, see https://ci.appveyor.com/project/nlohmann/json/build/2302/job/n1ti8cx5qqafk43i.

a static_cast should suffice here

nlohmann · 2017-10-06T09:00:05Z

G++ still substantially faster, Clang only slightly slower than stock develop branch; push forthcoming.

I'm afraid I still could not reproduce this. I shall try to run some benchmarks over the weekend.

pjkundert · 2017-10-06T14:57:13Z

OK, I simplified the handling of characters a bit more, removing unnecessary casts; the std::char_traits::to_int_type already handles all signed int/unsigned char conversions properly, and returns a signed value that can be handled by normal conversions to the int return type of get_character.
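The claim about `std::char_traits<char>::to_int_type` can be checked with a short sketch. The two function names below are invented for the comparison; `via_mask` mirrors the `& 0xFF` expression removed in the diff, and `via_traits` mirrors its replacement. On the mainstream standard library implementations both yield the byte's value in [0,255], so no extra cast is needed and no byte can be confused with `eof()`.

```cpp
#include <cassert>
#include <string>

// The expression used before the change: mask the (possibly
// sign-extended) char down to its low eight bits.
inline int via_mask(char c)
{
    return c & 0xFF;
}

// The expression used after the change: let the traits class perform
// the char to int_type conversion.
inline std::char_traits<char>::int_type via_traits(char c)
{
    return std::char_traits<char>::to_int_type(c);
}
```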

Also, removed the unnecessary changes to member initializers; I left the brace initializers in place, to avoid any warnings should someone decide to use -Weffc++ on their code.

Benchmarks remain improved on g++, essentially the same on clang++.

theodelrieu · 2017-10-06T15:23:18Z

@pjkundert I think I understand why my git log gets wrecked when visualizing your commits. Could you insert a blank line between the header and the commit body? I have the whole commits on a single line.

Good job on the body details BTW :)

gregmarr · 2017-10-06T16:53:00Z

src/json.hpp

@@ -1397,120 +1397,77 @@ constexpr T static_const<T>::value;
 /// abstract input adapter interface
 struct input_adapter_protocol
 {
-    virtual int get_character() = 0;
-    virtual std::string read(std::size_t offset, std::size_t length) = 0;
+    virtual int get_character() = 0; // returns characters in range [0,255], or eof()


Should this be std::char_traits<char>::int_type instead of int so that we know that there will be no type changes later?

That would be an acceptable change, I think. It's typically that type anyway, and being explicit about it keeps things obvious; that all implementation of the adapter class need to be consistent in producing std::char_traits<char>::int_type, too.

A using int_type = ... would make it less verbose than the full thing.
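One way the alias suggestion could look in practice is sketched below. The type names `input_adapter_sketch` and `buffer_adapter_sketch` are hypothetical stand-ins for the library's adapter classes, written only to show the alias being shared by the interface and an implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical interface: every adapter returns the same int_type,
// either a byte value in [0,255] or std::char_traits<char>::eof().
struct input_adapter_sketch
{
    using int_type = std::char_traits<char>::int_type;

    virtual int_type get_character() = 0;
    virtual ~input_adapter_sketch() = default;
};

// Hypothetical adapter over a contiguous buffer of bytes.
class buffer_adapter_sketch : public input_adapter_sketch
{
  public:
    buffer_adapter_sketch(const char* b, std::size_t n)
        : cursor(b), limit(b + n)
    {}

    int_type get_character() override
    {
        if (cursor < limit)
        {
            return std::char_traits<char>::to_int_type(*cursor++);
        }
        return std::char_traits<char>::eof();
    }

  private:
    const char* cursor;
    const char* limit;
};
```

Declaring the alias once in the base class keeps the return type consistent across adapters without repeating the full `std::char_traits<char>::int_type` spelling.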

gregmarr · 2017-10-06T16:54:22Z

src/json.hpp

-    contains the number of bytes in the string.
+    scanning, bytes are escaped and copied into buffer yytext. Then the function
+    returns successfully, yytext is *not* null-terminated (as it may contain \0
+    bytes), and yytext.size() is the number of bytes in the string.


Would it be more accurate to say that it's null terminated but may also contain embedded nulls?
This currently sounds like there isn't guaranteed to be at least one null.

Ah, that's a change in behavior, I see now.

Yes, I assume that it used to be NUL-terminated, in the mists of history, and was probably being returned as a char*. Since the token can contain NULs (supplied by escaping), any assumptions about NUL and termination are incorrect. So it was removed.

gregmarr · 2017-10-06T16:56:08Z

src/json.hpp

-    std::vector<char> yytext = std::vector<char>(1024, '\0');
-    /// current index in yytext
-    std::size_t yylen = 0;
+    std::string yytext { };


Neither of these needs the { } at the end.

Correct. However, since the -Weffc++ flag still exists (and arbitrary client code may still use it), and it is cost and risk free for us to still comply with its constraints, I decided to include the default initializer.

I'm not sure I'd say that there is no cost. There is a cost whenever you do something unnecessary to appease a broken tool. It's not a runtime or build-time cost, more of a cognitive burden. If we intend to keep it there to support the warning, anyone who maintains the code should know why it's there, which means that there should probably be a comment.
Anyway, it's all philosophical. I'm not going to lose any sleep over it.

gregmarr · 2017-10-06T16:56:50Z

src/json.hpp

@@ -5205,7 +5162,7 @@ class binary_reader
    @brief get next character from the input

    This function provides the interface to the used input adapter. It does
-    not throw in case the input reached EOF, but returns
+    not throw in case the input reached EOF, but returns a -'ve valued


We should remove this "negative valued" bit, right?

coveralls · 2017-10-06T17:17:17Z

Coverage remained the same at 100.0% when pulling 45e1e3d on pjkundert:develop-simplify-istream into 0c0851d on nlohmann:develop.

o We assume the same character int_type as the underlying std::istream.
o There are no assumptions on the value of eof(), other than that it will not be a valid unsigned char value.
o To retain performance, we do not allow swapping out the underlying std::streambuf during our use of the std::istream for parsing.

coveralls · 2017-10-06T23:31:43Z

Coverage remained the same at 100.0% when pulling 23440eb on pjkundert:develop-simplify-istream into 0c0851d on nlohmann:develop.

o For some unknown reason, the complexity of the benchmark platform prevented some C++ compilers from generating optimal code, properly reflective of the real performance in actual deployment.
o Added the json_benchmarks_simple target, which performs the same suite of tests as json_benchmarks.
o Simplified the benchmark platform, and emit an "Average" TPS (Transactions Per Second) value reflective of aggregate parse/output performance.

coveralls · 2017-10-08T00:13:51Z

Coverage remained the same at 100.0% when pulling 0b803d0 on pjkundert:develop-simplify-istream into 0c0851d on nlohmann:develop.

theodelrieu · 2017-10-11T11:39:26Z

Is there anything left required before merging this?

nlohmann · 2017-10-11T16:05:09Z

I'll have another look soon; I want to check whether this also fixes #714.

About anything else: are we happy with this approach? I could not measure a performance improvement, but I understand that less code is preferable here, and the caching always felt like a hack (though it performed better in my benchmarks, and also with the simple example of parsing jeopardy.json).

trilogy-service-ro · 2017-10-12T06:08:16Z

Can one of the admins verify this patch?

theodelrieu · 2017-10-12T15:12:48Z

I personally have no more issues with this PR.

coveralls · 2017-10-16T19:25:04Z

Coverage remained the same at 100.0% when pulling a8cc7a1 on pjkundert:develop-simplify-istream into 0c0851d on nlohmann:develop.

theodelrieu · 2017-10-20T11:45:49Z

Can we merge this PR?

nlohmann · 2017-10-22T07:12:48Z

Thanks so much for all the patience!

Hopefully, #764 fixed this.

pjkundert changed the title ~~Develop simplify istream #367~~ Simplified istream handing #367 Oct 2, 2017

pjkundert force-pushed the develop-simplify-istream branch from 0174733 to 8d5a5f0 Compare October 3, 2017 21:44

theodelrieu reviewed Oct 4, 2017

pjkundert force-pushed the develop-simplify-istream branch from 8d5a5f0 to 46dbc46 Compare October 4, 2017 16:27

theodelrieu suggested changes Oct 4, 2017

nlohmann reviewed Oct 6, 2017

pjkundert added 2 commits October 6, 2017 07:37

Further simplify character type handling

5e480b5

Revert some unnecessary member initializer changes.

45e1e3d

gregmarr reviewed Oct 6, 2017

Consistently use std::char_traits int_type-->char conversion intrinsics

a8cc7a1

nlohmann approved these changes Oct 18, 2017

nlohmann self-assigned this Oct 22, 2017

nlohmann added this to the Release 3.0.0 milestone Oct 22, 2017

nlohmann added the kind: enhancement/improvement label Oct 22, 2017

Merge branch 'develop' into develop-simplify-istream

ef40673

nlohmann merged commit 3094640 into nlohmann:develop Oct 22, 2017

nlohmann added a commit that referenced this pull request Oct 22, 2017

🚧 checking if #714 is now fixed with MSVC

89650c9

Hopefully, #764 fixed this.
